2-s2.0-84992126238

Combination of Latent Dirichlet Allocation (LDA) and Term Frequency-Inverse Cluster Frequency (TFxICF) in Indonesian text clustering with labeling

Suadaa L.H. (a), Purwarianti A. (a)

(a) School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung, Indonesia

Abstract

© 2016 IEEE. Because labeled data are limited, clustering offers a way to classify documents for which no prior knowledge is available. This work proposes combining Latent Dirichlet Allocation (LDA), which groups documents by topic, with Term Frequency-Inverse Cluster Frequency (TFxICF), which labels the resulting clusters, so that classification by clustering comes with a description of each cluster. Indonesian text preprocessing consists of abbreviation and acronym extraction, tokenization, stemming, and stopword elimination. Experiments used Indonesian digital library documents, 113 from the digital library of STIS and 60 from the digital library of ITB, to examine the effects of text preprocessing, to compare the LDA cluster results with those of other clustering algorithms, and to compare word and phrase tokens in clustering and labeling. Cluster quality was measured with precision, recall, and F-measure; label quality was determined by similarity to the keywords that appear most frequently in the clusters. The experimental results show that preprocessing can improve cluster quality. LDA produces topic-based document clusters of better quality than K-Means and Lingo. Word-based LDA generates better-quality clusters than phrase-based LDA, whereas labeling with phrase-based TFxICF is more descriptive than labeling with word-based TFxICF. Therefore, word-based LDA is proposed for clustering and phrase-based TFxICF for labeling.

Author keywords

Cluster frequencies, Cluster labeling, Latent Dirichlet allocation, Latent dirichlet allocations, Preprocessing techniques, Text Clustering, Text preprocessing, tfxidf

Indexed keywords

cluster labeling, latent dirichlet allocation, text clustering, tfxidf

DOI

https://doi.org/10.1109/ICoICT.2016.7571885
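
The TFxICF labeling step named in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the version below is an assumption made by analogy with TF-IDF: a term's score in a cluster is its frequency in that cluster multiplied by log(number of clusters / number of clusters containing the term). The function name `tfxicf_labels`, the `top_n` parameter, and the toy Indonesian tokens are hypothetical and are not taken from the paper; the clusters are assumed to be already formed (for example by LDA) and already tokenized.

```python
import math
from collections import Counter


def tfxicf_labels(clusters, top_n=3):
    """Return the top_n highest-scoring terms for each cluster.

    clusters: list of clusters, each a list of documents,
              each document a list of tokens (words or phrases).
    Assumed weighting (not confirmed by the source):
        score(t, c) = tf(t, c) * log(N / cf(t))
    """
    # Term frequency per cluster: count every token occurrence in the cluster.
    tf = [Counter(tok for doc in cluster for tok in doc) for cluster in clusters]
    # Cluster frequency: number of clusters in which each term occurs.
    cf = Counter(term for counts in tf for term in counts)
    n_clusters = len(clusters)

    labels = []
    for counts in tf:
        scores = {t: f * math.log(n_clusters / cf[t]) for t, f in counts.items()}
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels.append(ranked[:top_n])
    return labels


# Hypothetical toy clusters, two documents each.
clusters = [
    [["survei", "penduduk", "data"], ["sensus", "penduduk", "data"]],
    [["jaringan", "komputer", "data"], ["jaringan", "sensor", "data"]],
]
labels = tfxicf_labels(clusters, top_n=1)
print(labels)
```

Under this assumed weighting, a term such as "data" that occurs in every cluster scores zero, so it never becomes a label, while terms concentrated in a single cluster rise to the top. Passing phrase tokens instead of single words gives the phrase-based variant.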