Extracting main content-blocks from blog posts
Akbar S.a,b, Slaughter L.a, Nytro O.a
a Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), Norway
b School of Electrical Engineering and Informatics, Institut Teknologi Bandung (ITB), Indonesia

Abstract
A blog post typically consists of distinct blocks carrying different kinds of information: the main content, a blogger profile, links to blog archives, comments, and even advertisements. Identifying and extracting the main content block of a blog post, or of a web page in general, is therefore an important information extraction step before further processing. This paper describes our approach to extracting the main content block from blog posts with disparate types of blog mark-up. Adapting the Content Structure Tree (CST)-based approach, we propose a new way of calculating the importance of HTML content nodes and a new definition of the attenuation quotient suffered by HTML item/block nodes. Performance improves with this approach because posts published in the same domain tend to share a similar page template, so a general main content marker can be applied to all of them. The approach consists of two steps. In the first step, the modified CST approach detects the primary and secondary markers for a page cluster. In the second step, HTMLFilter extracts the main block of a page based on the detected markers. When HTMLFilter cannot find the main block, the modified CST is used as a fallback. Experiments showed that the approach can extract the main block with an accuracy of more than 94%.

Author keywords
Blog post, Content extractor, Content structure, Information Extraction, Informative block, Main block, Web page

Indexed keywords
Blog post, Content extractor, Informative block, Main block
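
The second step described in the abstract (marker-based filtering with a fallback) can be illustrated with a minimal Python sketch. This is not the authors' HTMLFilter or their modified CST, neither of which is specified here. It assumes, purely for illustration, that a marker is a hypothetical (tag, attribute, value) triple, that markers are tried in order (primary first, then secondary), that the markup is reasonably well formed, and that the fallback is an arbitrary callable standing in for the modified CST.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MarkerFilter(HTMLParser):
    """Collects text inside the first element matching a (tag, attr, value) marker.

    The marker format is an assumption made for this sketch.
    """

    def __init__(self, marker):
        super().__init__()
        self.tag, self.attr, self.value = marker
        self.depth = 0      # nesting depth inside the matched element; 0 = outside
        self.done = False   # True once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif tag == self.tag:
            # Split on whitespace so multi-valued attributes such as class match.
            values = (dict(attrs).get(self.attr) or "").split()
            if self.value in values:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and not self.done and data.strip():
            self.chunks.append(data.strip())


def extract_main_block(html, markers, fallback):
    """Try each marker in order; call the fallback if no marker matches."""
    for marker in markers:
        parser = MarkerFilter(marker)
        parser.feed(html)
        parser.close()
        if parser.chunks:
            return " ".join(parser.chunks)
    # Stand-in for the modified CST, which the abstract names as the alternative.
    return fallback(html)


page = """
<html><body>
  <div id="sidebar"><a href="/archive">Archive</a></div>
  <div class="post-body entry"><p>First paragraph.</p><p>Second<br>line.</p></div>
  <div id="comments"><p>Nice post!</p></div>
</body></html>
"""
# Hypothetical primary and secondary markers for one page cluster.
markers = [("div", "id", "main-post"), ("div", "class", "post-body")]
result = extract_main_block(page, markers, fallback=lambda html: None)
print(result)  # First paragraph. Second line.
```

In this example the primary marker matches nothing, so the secondary marker selects the post body, while the sidebar and comment blocks are ignored. Because pages from the same domain tend to share a template, the same marker list could be reused for every page in the cluster.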