[vc_empty_space][vc_empty_space]
Table Extraction from Web Pages Using Conditional Random Fields to Extract Toponym Related Data
Luthfi Hanifah H.a, Akbar S.a
a School of Electrical Engineering and Informatics, Bandung Institute of Technology, Indonesia
[vc_row][vc_column][vc_row_inner][vc_column_inner][vc_separator css=”.vc_custom_1624529070653{padding-top: 30px !important;padding-bottom: 30px !important;}”][/vc_column_inner][/vc_row_inner][vc_row_inner layout=”boxed”][vc_column_inner width=”3/4″ css=”.vc_custom_1624695412187{border-right-width: 1px !important;border-right-color: #dddddd !important;border-right-style: solid !important;border-radius: 1px !important;}”][vc_empty_space][megatron_heading title=”Abstract” size=”size-sm” text_align=”text-left”][vc_column_text]© Published under licence by IOP Publishing Ltd.Table is one of the ways to visualize information on web pages. The abundant number of web pages that compose the World Wide Web has been the motivation of information extraction and information retrieval research, including the research for table extraction. Besides, there is a need for a system which is designed to specifically handle location-related information. Based on this background, this research is conducted to provide a way to extract location-related data from web tables so that it can be used in the development of Geographic Information Retrieval (GIR) system. The location-related data will be identified by the toponym (location name). In this research, a rule-based approach with gazetteer is used to recognize toponym from web table. Meanwhile, to extract data from a table, a combination of rule-based approach and statistical-based approach is used. On the statistical-based approach, Conditional Random Fields (CRF) model is used to understand the schema of the table. The result of table extraction is presented on JSON format. If a web table contains toponym, a field will be added on the JSON document to store the toponym values. This field can be used to index the table data in accordance to the toponym, which then can be used in the development of GIR system.[/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”Author keywords” size=”size-sm” text_align=”text-left”][vc_column_text]Conditional random field,gazetteer,Geographic information retrievals (GIR),Information retrieval research,Rule-based approach,toponym,Web tables[/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”Indexed keywords” size=”size-sm” text_align=”text-left”][vc_column_text]CRF,gazetteer,GIR,table extraction,toponym[/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”Funding details” size=”size-sm” text_align=”text-left”][vc_column_text][/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”DOI” size=”size-sm” text_align=”text-left”][vc_column_text]https://doi.org/10.1088/1742-6596/801/1/012064[/vc_column_text][/vc_column_inner][vc_column_inner width=”1/4″][vc_column_text]Widget Plumx[/vc_column_text][/vc_column_inner][/vc_row_inner][/vc_column][/vc_row][vc_row][vc_column][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][/vc_column][/vc_row]