Indonesian-Japanese term extraction from bilingual corpora using machine learning
Nassirudin M.a, Purwarianti A.a
a Department of Informatics Engineering, Bandung Institute of Technology, Bandung, Indonesia
Abstract
© 2015 IEEE. As bilateral relations between Indonesia and Japan strengthen, the need for consistent term usage across both languages becomes important. In this paper, a new method for Indonesian-Japanese term extraction is presented. In general, it proceeds in three steps: (1) n-gram extraction for each language, (2) n-gram cross-pairing between the two languages, and (3) classification. The method is intended to handle term extraction from both parallel corpora and comparable corpora. To use it, a classification model must first be built using machine learning. We consider four types of features: dictionary-based features, cognate-based features, combined features, and statistical features. The first three are linguistic features. Dictionary-based features consider whether a word pair exists in a predefined dictionary, cognate-based features consider morpheme-level similarity, combined features consider dictionary-based and cognate-based features together, and statistical features are used in case the first three fail. The only statistical feature we use is context heterogeneity similarity, which considers the variety of words that can precede or follow a term. As the learning algorithm, we use SVM (Support Vector Machine). In the experiment, we compared several scenarios: only linguistic features, only statistical features, or both combined.
The classification model was built from parallel corpora, since plenty of term pairs can be extracted from them. The training data consisted of 5,000 term pairs. The best result was achieved using only linguistic features and without the preprocessing step. The accuracy was up to 90.98% and the recall 92.14%. A test on comparable corpora was also carried out with 37,392 term pairs, of which 94 were equivalent translations and 37,298 were not. Evaluation on this test set gave an accuracy of 98.63%, but a low recall of 24.47%.

Author keywords
Bilateral relations, Classification models, comparable, Linguistic features, parallel, Pre-processing step, SVM (support vector machine), Term extraction

Indexed keywords
comparable, linguistic, machine learning, parallel, statistic, SVM, term extraction

DOI
https://doi.org/10.1109/ICACSIS.2015.7415180
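
As a rough illustration of steps (1) and (2) named in the abstract, the following minimal sketch extracts n-grams from each language and cross-pairs them into candidate term pairs. It is not the authors' implementation. The function names `extract_ngrams` and `cross_pair`, the `max_n` parameter, and the sample sentences are all hypothetical, and the sketch assumes both sentences are already tokenized (Japanese text in particular would need a word segmenter first).

```python
from itertools import product


def extract_ngrams(tokens, max_n=3):
    """Return every contiguous n-gram (as a tuple) of length 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def cross_pair(src_tokens, tgt_tokens, max_n=3):
    """Pair every source n-gram with every target n-gram, without duplicates."""
    src = set(extract_ngrams(src_tokens, max_n))
    tgt = set(extract_ngrams(tgt_tokens, max_n))
    return set(product(src, tgt))


# Hypothetical, pre-tokenized sentence pair (illustrative data only).
id_tokens = ["kerja", "sama", "ekonomi"]
ja_tokens = ["経済", "協力"]

# Every (Indonesian n-gram, Japanese n-gram) combination is a candidate pair.
candidates = cross_pair(id_tokens, ja_tokens, max_n=2)
```

In step (3), each candidate pair would be converted into a feature vector (dictionary-based, cognate-based, combined, and statistical features) and passed to the trained SVM classifier, which is not shown here. Because cross-pairing produces every combination, most candidates are non-equivalent, which is consistent with the heavily imbalanced comparable-corpora test set described in the abstract.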