Can PhayaThaiBERT Improve by Adding Unassimilated Loanwords?

Original title: PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords

Authors: Panyut Sriwirote, Jalinee Thapiang, Vasan Timtong, Attapol T. Rutherford

The article explores improvements to WangchanBERTa, a widely used pretrained Thai language model, addressing its limited ability to understand unassimilated English loanwords in Thai text. The authors expand the model's vocabulary with foreign-word tokens transferred from XLM-R and pretrain the resulting model, named PhayaThaiBERT, on a larger dataset than its predecessor. PhayaThaiBERT outperforms WangchanBERTa across a range of downstream tasks and datasets, showing markedly better handling of unassimilated loanwords and stronger overall language understanding.
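The vocabulary-expansion step described above can be illustrated with a minimal sketch. This is not the authors' code: it uses toy dictionaries to show the general idea of merging tokens from a donor vocabulary (such as XLM-R's) into a base vocabulary (such as WangchanBERTa's), assigning fresh ids to tokens the base model lacks.

```python
def expand_vocab(base_vocab, donor_vocab):
    """Append donor tokens missing from base_vocab, assigning new ids.

    Toy illustration of vocabulary transfer; real tokenizer merging
    must also handle embedding initialization for the new tokens.
    """
    merged = dict(base_vocab)
    next_id = max(merged.values()) + 1
    for token in donor_vocab:
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged

# Hypothetical base (Thai) and donor (multilingual) vocabularies.
base = {"<s>": 0, "ภาษา": 1, "ไทย": 2}
donor = {"<s>": 0, "computer": 1, "model": 2}
expanded = expand_vocab(base, donor)
# Donor-only tokens ("computer", "model") receive ids after the base vocabulary.
```

In practice, the embedding rows for the transferred tokens would also be copied or reinitialized before continued pretraining, which is why the model is then pretrained again on a larger corpus.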

Original article: https://arxiv.org/abs/2311.12475