Keiichiro Suzuki: Automatic Speech Recognition of Japanese numeral-numeral classifier combinations

The goal of automatic speech recognition (ASR) systems is to achieve a complete, or human-level, mapping of an acoustic signal to a string of words. In dealing with natural speech, ASR systems encounter linguistically significant issues, such as coarticulation and morphophonemic alternations, as well as linguistically insignificant ones, such as speech rate and environmental noise.

This paper presents a study aimed at improving the performance of a Japanese large-vocabulary continuous speech recognizer (LVCSR) by modeling morphophonemic alternation and pronunciation variation (i.e., free variation). In particular, I report the results of performance tests run on numeral-numeral classifier combinations in Japanese (e.g. ni-hon 'two stick-type objects', san-bon 'three stick-type objects'), showing how the accuracy of our LVCSR engine was improved by modeling morphophonemic alternation and pronunciation variation in the target lexical items. On the one hand, these numeral-numeral classifier combinations are a typical subject of phonological and morphological study, displaying linguistically significant, "regular" voicing alternation patterns. On the other hand, the same set of data shows linguistically insignificant pronunciation variation involving voicing. I demonstrate that these two seemingly unrelated phenomena are in fact resolved by the same process of statistical adjustment.
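
As a rough illustration of what a single statistical treatment of both phenomena might look like, the sketch below estimates, for each numeral, the relative frequency of each surface form of the classifier. This is not the method used in the study; it is a minimal sketch under stated assumptions. The function name `estimate_variant_probs` is hypothetical, and the counts are invented toy data rather than results from the experiments.

```python
from collections import Counter, defaultdict

# Toy, invented observations of (numeral, classifier surface form).
# These are NOT data from the study; they only illustrate the idea.
observations = [
    ("ni", "hon"), ("ni", "hon"), ("ni", "hon"),
    ("san", "bon"), ("san", "bon"), ("san", "hon"),
]

def estimate_variant_probs(pairs):
    """Relative frequency of each classifier form, conditioned on the numeral.

    Hypothetical helper: regular alternation (hon vs. bon across numerals) and
    free variation (hon vs. bon after the same numeral) are both expressed as
    per-numeral probabilities over surface forms.
    """
    counts = defaultdict(Counter)
    for numeral, form in pairs:
        counts[numeral][form] += 1
    probs = {}
    for numeral, form_counts in counts.items():
        total = sum(form_counts.values())
        probs[numeral] = {form: n / total for form, n in form_counts.items()}
    return probs

probs = estimate_variant_probs(observations)
for numeral, dist in probs.items():
    print(numeral, dist)
```

Under this view, a categorical alternation appears as a distribution concentrated on one form, and free variation as a distribution spread over several forms, so one estimation step covers both cases.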