
Scopus EID: 2-s2.0-85020221351


Adaptation of acoustic model for Indonesian using varying ratios of spontaneous speech data

Hoesen D.a, Lestari D.P.a, Khodra M.L.a

a Department of Informatics, Institut Teknologi Bandung, Bandung, Indonesia

Abstract

© 2016 IEEE. This paper presents our work in determining the ratio of spontaneous speech adaptation data that yields the lowest recognition error rate. Ten triphone acoustic models are first built, in a manner similar to 10-fold cross-validation, using a dictated speech corpus. The dictated speech was read from 10 prepared transcripts by a diverse group of 301 Indonesian speakers, and each round of training uses the utterances from one of the prepared transcripts. The resulting triphone models are also evaluated against their corresponding spontaneous speech evaluation sets. The models that yield the lowest, the highest, and the closest-to-mean recognition error are then adapted with their corresponding spontaneous speech adaptation data using maximum a posteriori (MAP) adaptation. The amount of spontaneous speech adaptation data for each model is varied from 10% to 100% in 10% increments, so each of the three triphone models yields 10 adapted models. The adapted models are evaluated against their corresponding spontaneous and dictated speech evaluation sets. The trend in the results shows that around 30-60% of the spontaneous speech adaptation data (roughly 11.5 to 24 hours of speech) gives the lowest recognition error rate.

Author keywords

Acoustic model, Adaptation-data ratio, Cross validation, GMM-HMM, MAP adaptation, Maximum a posteriori probabilities, Recognition error, Spontaneous speech

Indexed keywords

Adaptation-data ratio, GMM-HMM, MAP adaptation, Spontaneous speech

DOI

https://doi.org/10.1109/ICSDA.2016.7918981
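To illustrate the MAP adaptation step the abstract describes, below is a minimal sketch of the standard relevance-factor MAP update of Gaussian means, where the adapted mean interpolates between the prior (speaker-independent) mean and the mean of the adaptation frames. This is a generic textbook formulation, not the authors' exact recipe: the function name, the single-component toy model, the relevance factor `tau = 10.0`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def map_adapt_means(prior_means, frames, responsibilities, tau=10.0):
    """MAP update of GMM component means (relevance-factor form).

    prior_means:      (K, D) means of the unadapted (prior) model
    frames:           (T, D) adaptation feature vectors
    responsibilities: (T, K) posterior of each component for each frame
    tau:              relevance factor; larger values trust the prior more
    """
    occ = responsibilities.sum(axis=0)        # (K,) soft frame counts per component
    weighted = responsibilities.T @ frames    # (K, D) responsibility-weighted frame sums
    # Interpolate prior mean and data mean, weighted by tau vs. soft count.
    return (tau * prior_means + weighted) / (tau + occ)[:, None]

# Toy demo mimicking the paper's varying adaptation-data ratios (10%..100%):
# a single-component "model" with prior mean 0, adapted toward data centered at 3.
rng = np.random.default_rng(0)
prior = np.zeros((1, 2))
frames = rng.normal(loc=3.0, scale=0.5, size=(1000, 2))
gamma = np.ones((1000, 1))  # responsibility 1 everywhere for the single component
for ratio in (0.1, 0.5, 1.0):
    n = int(len(frames) * ratio)
    adapted = map_adapt_means(prior, frames[:n], gamma[:n])
```

With more adaptation data the soft count dominates the relevance factor, pulling the adapted mean further from the prior toward the data mean; the paper's finding is that in practice this benefit saturates at roughly 30-60% of the available spontaneous data.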