2-s2.0-85081088390

Effective use of augmentation degree and language model for synonym-based text augmentation on Indonesian text classification

Abdurrahman (a), Purwarianti A. (b)

(a) Prosa Solusi Cerdas, Bandung, Indonesia
(b) Institut Teknologi Bandung, Bandung, Indonesia

Abstract

© 2019 IEEE. Machine learning based text processing relies on a qualified text dataset. Text augmentation research aims to enrich a text dataset in order to achieve higher performance than that obtained with the original dataset. We conducted text augmentation for Indonesian text classification by replacing certain words with their synonyms. The process consists of determining the number of words to be substituted in the sentence and selecting the substitute word from the synonym list. The first process, determining the number of words to be substituted, is done using an augmentation degree. The second process, selecting the best substitute word, is done using a language model. The synonym list is built from a thesaurus. We compared several options for building the language model. The statistical model is built using combinations of n-gram and smoothing, while a simple neural model is built using gram values of 3 and 5. The neural model uses pre-trained word embeddings as input. The 5-gram neural model outperforms the other language model setups by a significant margin in perplexity. Using the best language model, an augmented dataset is generated and applied to two classification tasks of aspect-based sentiment analysis: aspect categorization and sentiment classification. Experiments were done using augmentation degrees from 0.1 to 1. The best augmentation degree yields a 3-4% improvement in the classification model’s performance.

Author keywords

Augmentation, Augmentation degree, Language model, Text, Text classification

Indexed keywords

Augmentation, Augmentation degree, Language model, Text, Text classification

Funding details

Acknowledgment: This work is part of the “Intelligent System to Monitor Gadget Usage in Teenagers using Machine Learning Technique” research and was partially funded by the Ministry of Research and Higher Education of Indonesia.

DOI

https://doi.org/10.1109/ICACSIS47736.2019.8979733
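
The two-step process described in the abstract (choose how many words to replace from an augmentation degree, then pick each substitute with a language model) could be sketched roughly as follows. This is only an illustration, not the authors' implementation: the abstract does not define the augmentation degree precisely, so it is assumed here to be the fraction of replaceable words that get substituted, and an add-one smoothed bigram model stands in for the 5-gram neural model the paper found best. The function names, the toy corpus, and the toy thesaurus are all hypothetical.

```python
import math
import random
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )

def augment(tokens, thesaurus, degree, unigrams, bigrams, rng=random):
    """Substitute a share of the replaceable words with LM-preferred synonyms.

    Assumption: `degree` (0 to 1) is the fraction of words that have a
    thesaurus entry which will be replaced, rounded to the nearest integer.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok in thesaurus]
    n_replace = round(degree * len(candidates))
    result = list(tokens)
    for i in rng.sample(candidates, n_replace):
        # Score the whole sentence with each synonym in place; keep the best one.
        def score(word, i=i):
            return log_prob(result[:i] + [word] + result[i + 1:], unigrams, bigrams)
        result[i] = max(thesaurus[tokens[i]], key=score)
    return result

# Toy data, invented for illustration only.
corpus = [
    ["makanan", "ini", "enak", "sekali"],
    ["makanan", "itu", "lezat", "sekali"],
    ["pelayanan", "sangat", "bagus"],
]
thesaurus = {"enak": ["lezat", "sedap"], "bagus": ["baik", "apik"]}
unigrams, bigrams = train_bigram_lm(corpus)

sentence = ["makanan", "ini", "enak", "sekali"]
print(augment(sentence, thesaurus, 1.0, unigrams, bigrams))
```

Scoring the full sentence for each candidate keeps the sketch short; a real system would more likely score only the local context window around the replaced word, and would swap `log_prob` for whichever language model has the lowest perplexity on held-out text.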