Abstract
Speech technologies, particularly Text-to-Speech (TTS) and Speech-to-Text (STT), play an essential role in modern artificial intelligence systems. While these technologies are highly mature for languages such as English and Mandarin, Khmer remains a low-resource language in the speech technology ecosystem. Academic research over the past two decades shows gradual progress in Khmer automatic speech recognition, grapheme-to-phoneme modeling, and speech dataset development. However, limited datasets, pronunciation lexicons, and benchmarking infrastructure still constrain the maturity of Khmer speech technologies.
This review summarizes the current research landscape of Khmer TTS and STT, highlighting key research papers, datasets, enabling technologies, and remaining challenges.
1. Linguistic Challenges for Khmer Speech Technology
Khmer presents several computational challenges that directly affect speech AI systems.
Word Segmentation Difficulty
Khmer writing typically does not mark word boundaries using spaces, making word segmentation a fundamental preprocessing task. This complicates language modeling, tokenization, and speech synthesis pipelines.
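A common baseline for this preprocessing step is dictionary-based longest matching. The sketch below is a minimal illustration, not the method of any paper cited in this review; the function name and the four-word toy lexicon ("I", "love", "language", "Khmer") are assumptions made for the example.

```python
def segment_longest_match(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the longest
    lexicon entry that matches, falling back to a single code point."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: emit one code point as its own token
        # Try candidate substrings from longest to shortest (length >= 2).
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Toy data, assumed for illustration only.
WORDS = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
sentence = "".join(WORDS)  # written without spaces, as in running Khmer text
tokens = segment_longest_match(sentence, set(WORDS))
print(" | ".join(tokens))
```

Greedy matching fails on ambiguous strings and on words missing from the lexicon, which is one reason statistical segmenters are used in practice.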
Complex Orthography
The Khmer script uses:
- consonant stacking
- diacritics above/below characters
- complex vowel combinations
These characteristics increase the difficulty of grapheme-to-phoneme conversion and acoustic modeling.
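To make the stacking concrete, the sketch below prints the Unicode code points of one Khmer word and groups them into orthographic clusters. It is a simplified illustration: the helper names are hypothetical, and the cluster rule (a base character followed by subscript consonants and dependent signs) ignores many edge cases that a real G2P front end would need to handle.

```python
import unicodedata

COENG = "\u17d2"  # KHMER SIGN COENG: marks the following consonant as a subscript


def is_dependent(ch):
    """Dependent vowels and signs (U+17B4..U+17D3) plus U+17DD attach to a base."""
    return "\u17b4" <= ch <= "\u17d3" or ch == "\u17dd"


def split_clusters(text):
    """Group code points into clusters: base + (COENG + consonant)* + signs."""
    clusters = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == COENG and clusters and i + 1 < len(text):
            # COENG and the consonant after it form one stacked subscript.
            clusters[-1] += text[i:i + 2]
            i += 2
        elif is_dependent(ch) and clusters:
            clusters[-1] += ch
            i += 1
        else:
            clusters.append(ch)
            i += 1
    return clusters


word = "\u1781\u17d2\u1798\u17c2\u179a"  # the word "Khmer" written in Khmer script
for ch in word:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(split_clusters(word))
```

The word has five code points but only two visual clusters, so the mapping from characters to sounds is not one-to-one.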
Limited Language Resources
Researchers repeatedly highlight that Khmer is an under-resourced language, meaning that both text corpora and speech corpora remain limited compared with major languages.
2. Khmer Speech-to-Text (Automatic Speech Recognition)
Early Khmer ASR Research
One of the earliest major Khmer ASR studies is:
First Broadcast News Transcription System for Khmer Language (LREC 2008)
https://aclanthology.org/L08-1123/
This research introduced a large-vocabulary continuous speech recognition (LVCSR) system for Khmer broadcast news transcription. The study addressed challenges such as limited language resources and segmentation issues while proposing the use of both word and sub-word modeling units for Khmer speech recognition.
Acoustic and Language Modeling Research
Another important work explores modeling units in Khmer ASR:
Which Units for Acoustic and Language Modeling for Khmer Automatic Speech Recognition?
https://www.isca-archive.org/sltu_2008/seng08_sltu.pdf
The study discusses strategies for building pronunciation dictionaries automatically and explores hybrid modeling approaches combining words and sub-word units for Khmer speech recognition.
Modern Research Direction
More recent research trends explore:
- multilingual speech models
- neural network acoustic modeling
- transformer-based architectures
These methods attempt to overcome Khmer’s limited dataset sizes by leveraging cross-lingual learning and multilingual datasets.
3. Khmer Text-to-Speech Research
Compared with ASR, peer-reviewed Khmer TTS research is more limited, but several foundational works exist.
Building TTS Voices for Low-Resource Languages
A widely cited study is:
A Step-by-Step Process for Building TTS Voices Using Open Source Data and Frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala and Sundanese
https://www.isca-archive.org/sltu_2018/sodimana18_sltu.pdf
This research provides open resources for building TTS systems in several low-resource languages, including Khmer. The released resources include:
- speech recordings
- pronunciation lexicons
- phonology definitions
These resources allow researchers to build basic TTS voices using open frameworks.
Text Normalization for TTS
Another supporting study focuses on text normalization:
Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems
https://www.isca-archive.org/sltu_2018/sodimana18b_sltu.pdf
This research developed normalization grammars to process numbers, abbreviations, and symbols before speech synthesis, an essential step in TTS pipelines.
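As a rough illustration of the first stage of such a pipeline, the sketch below finds numeric tokens written in either Khmer or ASCII digits and rewrites them as placeholders for a later verbalization step. It does not reproduce the grammars from the cited paper; the `<num:...>` placeholder format and the function name are assumptions made for this example.

```python
import re

# Khmer digits occupy U+17E0..U+17E9; map each one to its ASCII counterpart.
KHMER_TO_ASCII = {0x17E0 + d: str(d) for d in range(10)}

# A run of ASCII or Khmer digits, optionally with "," or "." between digit groups.
NUMBER = re.compile(r"[0-9\u17e0-\u17e9]+(?:[.,][0-9\u17e0-\u17e9]+)*")


def tag_numbers(text):
    """Replace each numeric token with a <num:...> placeholder in ASCII digits."""
    def repl(match):
        return "<num:" + match.group(0).translate(KHMER_TO_ASCII) + ">"
    return NUMBER.sub(repl, text)


print(tag_numbers("\u17e2\u17e0\u17e2\u17e5"))  # a year written in Khmer digits
print(tag_numbers("price 1,500"))
```

A full normalizer would then expand each placeholder into spoken-form Khmer words and handle abbreviations and symbols in the same way.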
4. Khmer Speech Datasets
One major limitation of Khmer speech technology is the lack of large publicly available datasets.
A small but useful speech dataset is available through OpenSLR:
OpenSLR Khmer Speech Dataset
https://www.openslr.org/42/
This dataset includes speech recordings used for research on low-resource speech synthesis and recognition.
Public Khmer speech datasets such as this one are often only a few hours long, far smaller than the datasets used to train large speech models for major languages.
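Corpus size is easy to check before training. The following sketch sums the duration of every PCM WAV file under a directory using only the Python standard library. The directory layout is an assumption: it does not describe how any particular OpenSLR release is organized.

```python
import wave
from pathlib import Path


def total_hours(corpus_dir):
    """Return the summed duration, in hours, of all .wav files under corpus_dir."""
    total_seconds = 0.0
    for path in sorted(Path(corpus_dir).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # Duration of one file = number of frames / frames per second.
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 3600.0
```

Calling `total_hours("path/to/corpus")` on a downloaded dataset (the path here is a placeholder) gives a quick figure to compare against the hundreds or thousands of hours typical for major languages.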
5. Supporting Khmer NLP Research
Speech technologies depend heavily on general natural language processing research.
A key contribution in this area is:
Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion
https://dl.acm.org/doi/10.1145/3464378
This work describes Khmer tokenization and POS-tagging datasets and highlights the importance of linguistic resources for building language technologies.
Such infrastructure research helps improve:
- speech recognition accuracy
- language modeling
- pronunciation modeling
- speech synthesis quality
6. Practical Khmer TTS Initiatives
Beyond academic papers, practical development efforts are also emerging.
One example is:
Khmer Text-to-Speech Research – IDRI
https://www.idri.edu.kh/wp-content/uploads/2025/05/Khmer-TTS-1.pdf
This report discusses how Khmer TTS can be used for accessibility, education, and digital services in Cambodia.
7. Maturity Assessment
Based on current literature, Khmer speech technologies can be evaluated as follows:
| Technology | Current Maturity |
|---|---|
| Khmer Speech-to-Text (ASR) | Medium |
| Khmer Text-to-Speech (TTS) | Low–Medium |
| Supporting Khmer NLP | Medium |
STT research is slightly ahead because speech recognition tasks have received more academic attention.
However, both areas remain constrained by:
- limited datasets
- lack of large pronunciation dictionaries
- small research communities
8. Future Research Directions
The literature suggests several key directions for advancing Khmer speech technologies:
Larger Speech Corpora
Competitive models require high-quality speech datasets of hundreds or thousands of hours.
Standardized Pronunciation Dictionaries
Pronunciation resources are critical for both ASR and TTS.
Benchmarking and Evaluation
Public benchmarks and evaluation datasets would allow researchers to compare Khmer speech systems more effectively.
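A shared benchmark also needs a shared metric. ASR output is usually scored by edit distance against a reference transcript: the number of substitutions, deletions, and insertions divided by the reference length. Because Khmer text has no spaces, a word-level score depends on the segmenter used, so a character-level score is one way to avoid that dependency. The sketch below is a minimal, generic implementation; the function names are assumptions for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance divided by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


# Character error rate: compare code points directly.
print(error_rate(list("abcd"), list("abxd")))
# Word error rate: compare token lists produced by some segmenter.
print(error_rate("a b c".split(), "a c".split()))
```

The same function gives a word error rate when both transcripts are segmented with the same tool, which is why a benchmark would need to fix the segmenter along with the test set.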
Cross-Lingual Transfer Learning
Using multilingual speech models may help overcome Khmer’s data limitations.
Conclusion
Research over the past two decades demonstrates that Khmer speech technology is steadily evolving. Early work focused on building fundamental speech recognition systems and linguistic resources. More recent research emphasizes open datasets, multilingual modeling, and speech synthesis frameworks for low-resource languages.
Although Khmer TTS and STT technologies remain less mature than those for widely spoken languages, ongoing research efforts continue to expand the Khmer speech technology ecosystem.
Full Reference Links
Speech Recognition (STT)
First Broadcast News Transcription System for Khmer Language
https://aclanthology.org/L08-1123/
Which Units for Acoustic and Language Modeling for Khmer ASR
https://www.isca-archive.org/sltu_2008/seng08_sltu.pdf
Development of Speech Recognition System Based on CMUSphinx for Khmer Language
https://www.researchgate.net/publication/354435668
Multi-lingual Transformer Training for Khmer Automatic Speech Recognition
https://www.sap.ist.i.kyoto-u.ac.jp/lab/bib/intl/SOK-APSIPA19.pdf
Speech Datasets
Khmer Speech Translation Corpus
https://sap.ist.i.kyoto-u.ac.jp/EN/bib/intl/SOK-COCOSDA21.pdf
OpenSLR Khmer Speech Dataset
https://www.openslr.org/42/
Text-to-Speech Research
A Step-by-Step Process for Building TTS Voices for Low-Resource Languages
https://www.isca-archive.org/sltu_2018/sodimana18_sltu.pdf
Text Normalization for Low-Resource TTS Systems
https://www.isca-archive.org/sltu_2018/sodimana18b_sltu.pdf
Linguistic Resources
Building WFST based Grapheme to Phoneme Conversion for Khmer
https://ksoky.github.io/static/pdf/wfst_g2p.pdf
Applying Linguistic G2P Knowledge on Khmer
https://www.researchgate.net/publication/338354884
Khmer NLP Infrastructure
Khmer Word Segmentation Using Conditional Random Fields
https://att-astrec.nict.go.jp/member/tei/KhNLP2015-SEG.pdf
Towards Tokenization and POS Tagging for Khmer
https://dl.acm.org/doi/10.1145/3464378