[vc_empty_space][vc_empty_space]
Towards Robust Indonesian Speech Recognition with Spontaneous-Speech Adapted Acoustic Models
Hoesen D.a, Satriawan C.H.a, Lestari D.P.a, Khodra M.L.a
a Institut Teknologi Bandung, Bandung, 40115, Indonesia
[vc_row][vc_column][vc_row_inner][vc_column_inner][vc_separator css=”.vc_custom_1624529070653{padding-top: 30px !important;padding-bottom: 30px !important;}”][/vc_column_inner][/vc_row_inner][vc_row_inner layout=”boxed”][vc_column_inner width=”3/4″ css=”.vc_custom_1624695412187{border-right-width: 1px !important;border-right-color: #dddddd !important;border-right-style: solid !important;border-radius: 1px !important;}”][vc_empty_space][megatron_heading title=”Abstract” size=”size-sm” text_align=”text-left”][vc_column_text]© 2016 The Authors.This paper presents our work in building an Indonesian speech recognizer to handle both spontaneous and dictated speech. The recognizer is based on the Gaussian Mixture and Hidden Markov Models (GMM-HMM). The model is first trained on 73 hours of dictated speech and 43.5 minutes of spontaneous speech. The dictated speech is read from prepared transcripts by a diverse group of 244 Indonesian speakers. The spontaneous speech is manually labelled from recordings of an Indonesian parliamentary meeting, and is interspersed with noises and fillers. The resulting triphone model is then adapted only to the spontaneous speech using the Maximum A-posteriori Probability (MAP) method. We evaluate the adapted model using separate dictated and spontaneous evaluation sets. The dictated set consists of speech from 20 speakers totaling 14.5 hours. The spontaneous set is derived from the recording of a regional government meeting, consisting of 1085 utterances totaling 48.5 minutes. Evaluation of a MAP-adapted spontaneous set yields a 2.60% absolute increase in Word Accuracy Rate (WAR) over the un-adapted model, outperforming MMI adaptation. Conversely, MMI adaption of the dictated set outperforms the MAP adaptation by achieving an absolute increase of 1.48% in WAR over the un-adapted model. We also demonstrate that fMLLR speaker adaptation is unsuitable for our task due to limited adaptation data.[/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”Author keywords” size=”size-sm” text_align=”text-left”][vc_column_text]Gaussian mixtures,GMM-HMM,MAP adaptation,Maximum A posteriori probabilities,MMI adaptation,Regional government,Speaker adaptation,Spontaneous speech[/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”Indexed keywords” size=”size-sm” text_align=”text-left”][vc_column_text]GMM-HMM,MAP adaptation,MMI adaptation,Spontaneous speech[/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”Funding details” size=”size-sm” text_align=”text-left”][vc_column_text][/vc_column_text][vc_empty_space][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][vc_empty_space][megatron_heading title=”DOI” size=”size-sm” text_align=”text-left”][vc_column_text]https://doi.org/10.1016/j.procs.2016.04.045[/vc_column_text][/vc_column_inner][vc_column_inner width=”1/4″][vc_column_text]Widget Plumx[/vc_column_text][/vc_column_inner][/vc_row_inner][/vc_column][/vc_row][vc_row][vc_column][vc_separator css=”.vc_custom_1624528584150{padding-top: 25px !important;padding-bottom: 25px !important;}”][/vc_column][/vc_row]