2-s2.0-85011298612

Detecting vandalism on English Wikipedia using LNSMOTE resampling and Cascaded Random Forest classifier

Shulhan M. (a), Widyantoro D.H. (a)

(a) School of Electrical and Informatics Engineering, Institut Teknologi Bandung, Bandung, 40132, Indonesia

Abstract

© 2016 IEEE. Wikipedia.org is an online encyclopedia that anyone can edit. This openness lets articles grow rapidly and be corrected over time, but it also makes them prone to vandalism in the form of invalid information, deletion, advertisements, or meaningless content. This paper proposes a framework for detecting vandalism on English Wikipedia using machine learning: a Cascaded Random Forest (CRF) classifier is trained on the PAN Wikipedia Vandalism Corpus 2010 (PAN-WVC-10) English dataset after it has been resampled using the Local Neighbourhood Synthetic Minority Oversampling Technique (LNSMOTE). These two techniques are then compared with Random Forest (RF) as the classifier and the Synthetic Minority Oversampling Technique (SMOTE) as the resampling method. Tests on the PAN Wikipedia Vandalism Corpus 2011 (PAN-WVC-11) English dataset showed that resampling with LNSMOTE increases the true-positive rate (TPR) more than SMOTE does for both classifiers. CRF on SMOTE with 200 stages and 1 tree gave the best result overall, with a TPR of 0.9904. In terms of training computation time, CRF is 1.6 times faster than RF on the resampled dataset.

Author keywords

Computation time, Machine learning techniques, Neighbourhood, Online encyclopedia, Random forest classifier, Random forests, Synthetic minority over-sampling techniques, True positive rates

DOI

https://doi.org/10.1109/ICAICTA.2016.7803106
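
The resampling step and the TPR metric mentioned in the abstract can be illustrated with a small sketch. This is plain SMOTE (interpolating between a minority sample and one of its nearest minority neighbours), not LNSMOTE, and it is not the authors' implementation. The function names `smote` and `true_positive_rate`, the parameter defaults, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    Hypothetical helper; a simplified version of basic SMOTE.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample (excluding itself).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def true_positive_rate(y_true, y_pred, positive=1):
    """TPR = TP / (TP + FN), computed for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Toy minority class (e.g. feature vectors of vandalism edits).
    vandalism = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(len(smote(vandalism, 10, k=2)))
    print(true_positive_rate([1, 1, 0, 1], [1, 0, 0, 1]))
```

Oversampling in this way is applied only to the training data; the held-out test set is left untouched so that the reported TPR reflects the original class distribution.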