Wednesday, March 11, 2026

Khmer Text-to-Speech (TTS) and Speech-to-Text (STT): Academic Literature Review

Abstract

Speech technologies, particularly Text-to-Speech (TTS) and Speech-to-Text (STT), play an essential role in modern artificial intelligence systems. While these technologies have reached high maturity for languages such as English and Mandarin, Khmer remains a low-resource language in the speech technology ecosystem. Academic research over the past two decades shows gradual progress in Khmer automatic speech recognition, grapheme-to-phoneme modeling, and speech dataset development. However, limited datasets, small pronunciation lexicons, and the absence of shared benchmarking infrastructure still constrain the maturity of Khmer speech technologies.

This review summarizes the current research landscape of Khmer TTS and STT, highlighting key research papers, datasets, enabling technologies, and remaining challenges.

1. Linguistic Challenges for Khmer Speech Technology

Khmer presents several computational challenges that directly affect speech AI systems.

Word Segmentation Difficulty

Khmer writing typically does not mark word boundaries using spaces, making word segmentation a fundamental preprocessing task. This complicates language modeling, tokenization, and speech synthesis pipelines.
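
To make the segmentation problem concrete, the following is a minimal sketch of greedy longest-match segmentation against a word list. The four-word lexicon and the function name segment are assumptions chosen for illustration; practical Khmer segmenters usually rely on statistical or neural models rather than a plain dictionary lookup.

```python
def segment(text, lexicon):
    """Greedy left-to-right longest-match segmentation.

    Characters not covered by any lexicon entry are emitted one at a time.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy lexicon (an assumption): "I", "love", "language", "Khmer".
WORDS = ["ខ្ញុំ", "ស្រឡាញ់", "ភាសា", "ខ្មែរ"]
LEXICON = set(WORDS)
sentence = "".join(WORDS)  # written without spaces, as is usual in Khmer
print(segment(sentence, LEXICON))
```

Greedy matching can pick the wrong boundary when one word is a prefix of another, which is one reason segmentation is treated as a research problem in its own right.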

Complex Orthography

The Khmer script uses:

consonant stacking
diacritics above/below characters
complex vowel combinations

These characteristics increase the difficulty of grapheme-to-phoneme conversion and acoustic modeling.
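
As a small illustration of consonant stacking, the sketch below lists the Unicode code points of one Khmer word and finds subscript consonants by looking for the COENG sign (U+17D2). The helper names code_points and stacked_pairs are assumptions made for this example.

```python
import unicodedata

COENG = "\u17d2"  # KHMER SIGN COENG: marks the next consonant as a subscript


def code_points(text):
    """Return (hex code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def stacked_pairs(text):
    """Return (preceding character, subscript consonant) around each COENG."""
    return [
        (text[i - 1], text[i + 1])
        for i, ch in enumerate(text)
        if ch == COENG and 0 < i < len(text) - 1
    ]


word = "\u1781\u17d2\u1798\u17c2\u179a"  # the word for "Khmer": five code points
for cp, name in code_points(word):
    print(cp, name)
print(stacked_pairs(word))
```

The number of stored code points differs from the number of glyph clusters a reader sees, which is part of why character-level processing of Khmer needs care.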

Limited Language Resources

Researchers repeatedly highlight that Khmer is an under-resourced language, meaning that both text corpora and speech corpora remain limited compared with major languages.

2. Khmer Speech-to-Text (Automatic Speech Recognition)

Early Khmer ASR Research

One of the earliest major Khmer ASR studies is:

First Broadcast News Transcription System for Khmer Language (LREC 2008)
https://aclanthology.org/L08-1123/

This research introduced a large-vocabulary continuous speech recognition (LVCSR) system for Khmer broadcast news transcription. The study addressed challenges such as limited language resources and segmentation issues while proposing the use of both word and sub-word modeling units for Khmer speech recognition.

Acoustic and Language Modeling Research

Another important work explores modeling units in Khmer ASR:

Which Units for Acoustic and Language Modeling for Khmer Automatic Speech Recognition?
https://www.isca-archive.org/sltu_2008/seng08_sltu.pdf

The study discusses strategies for building pronunciation dictionaries automatically and explores hybrid modeling approaches combining words and sub-word units for Khmer speech recognition.
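
One way to picture a hybrid inventory of word and sub-word units is sketched below. This is not the method of the cited paper: the frequency threshold, the character-level fallback, the "+" continuation marker, and the placeholder Latin tokens are all assumptions made for illustration. For Khmer, the sub-word units would more plausibly be syllables or character clusters.

```python
from collections import Counter


def build_hybrid_vocab(segmented_corpus, min_count=2):
    """Keep words seen at least min_count times as whole-word units."""
    counts = Counter(w for sentence in segmented_corpus for w in sentence)
    return {w for w, c in counts.items() if c >= min_count}


def to_hybrid_units(sentence, vocab, marker="+"):
    """Map a segmented sentence to word units plus character-level fallbacks.

    Non-final pieces of a split word carry a trailing marker so the original
    word can be re-joined after decoding.
    """
    units = []
    for word in sentence:
        if not word:
            continue  # skip empty tokens
        if word in vocab:
            units.append(word)
        else:
            chars = list(word)
            units.extend(ch + marker for ch in chars[:-1])
            units.append(chars[-1])
    return units


# Placeholder tokens standing in for segmented Khmer words.
corpus = [["news", "today"], ["news", "report"], ["today", "rain"]]
vocab = build_hybrid_vocab(corpus, min_count=2)
print(to_hybrid_units(["news", "rain"], vocab))
```

Frequent words stay intact for the language model, while rare or unseen words are still representable through smaller units.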

Modern Research Direction

More recent research trends explore:

multilingual speech models
neural network acoustic modeling
transformer-based architectures

These methods attempt to overcome Khmer’s limited dataset sizes by leveraging cross-lingual learning and multilingual datasets.

3. Khmer Text-to-Speech Research

Compared with ASR, peer-reviewed Khmer TTS research is more limited, but several foundational works exist.

Building TTS Voices for Low-Resource Languages

A widely cited study is:

A Step-by-Step Process for Building TTS Voices Using Open Source Data and Frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala and Sundanese
https://www.isca-archive.org/sltu_2018/sodimana18_sltu.pdf

This research provides open resources for building TTS systems in several low-resource languages including Khmer. The dataset includes:

speech recordings
pronunciation lexicons
phonology definitions

These resources allow researchers to build basic TTS voices using open frameworks.

Text Normalization for TTS

Another supporting study focuses on text normalization:

Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala and Sundanese Text-to-Speech Systems
https://www.isca-archive.org/sltu_2018/sodimana18b_sltu.pdf

This research developed normalization grammars to process numbers, abbreviations, and symbols before speech synthesis—an essential step in TTS pipelines.
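
The kind of rewriting a normalization step performs can be sketched with a simple number rule. The example below is a deliberately simplified assumption, not the grammar-based approach of the cited paper: it maps Khmer digits to ASCII and then reads every digit run one digit at a time, whereas a real system would verbalize full cardinals, dates, and currency amounts.

```python
import re

# Khmer digits occupy U+17E0 to U+17E9.
KHMER_DIGITS = "".join(chr(0x17E0 + i) for i in range(10))
KHMER_TO_ASCII = str.maketrans(KHMER_DIGITS, "0123456789")

# Khmer words for the digits 0 to 9.
DIGIT_WORDS = ["សូន្យ", "មួយ", "ពីរ", "បី", "បួន",
               "ប្រាំ", "ប្រាំមួយ", "ប្រាំពីរ", "ប្រាំបី", "ប្រាំបួន"]


def spell_digits(match):
    """Replace a run of ASCII digits with space-separated digit words."""
    return " ".join(DIGIT_WORDS[int(d)] for d in match.group())


def normalize_numbers(text):
    """Map Khmer digits to ASCII, then read each digit run digit by digit."""
    text = text.translate(KHMER_TO_ASCII)
    return re.sub(r"[0-9]+", spell_digits, text)


print(normalize_numbers(KHMER_DIGITS[2] + KHMER_DIGITS[0] + KHMER_DIGITS[2] + KHMER_DIGITS[6]))
```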

4. Khmer Speech Datasets

One major limitation of Khmer speech technology is the lack of large publicly available datasets.

A small but useful speech dataset is available through OpenSLR:

OpenSLR Khmer Speech Dataset
https://www.openslr.org/42/

This dataset includes speech recordings used for research on low-resource speech synthesis and recognition.

Publicly available Khmer speech datasets such as this one are often only a few hours long, far smaller than the corpora used to train large speech models for major languages.
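
Corpus size in hours is easy to check for a downloaded dataset. The sketch below sums the durations of all WAV files under a directory using Python's standard wave module; it assumes uncompressed PCM WAV audio, and the function names are illustrative.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def corpus_hours(root):
    """Total duration, in hours, of all .wav files under root."""
    return sum(wav_seconds(p) for p in Path(root).rglob("*.wav")) / 3600.0
```

Calling corpus_hours on the directory where a corpus was unpacked gives a quick figure to compare against the hundreds or thousands of hours typical of high-resource training sets.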

5. Supporting Khmer NLP Research

Speech technologies depend heavily on general natural language processing research.

A key contribution in this area is:

Towards Tokenization and Part-of-Speech Tagging for Khmer: Data and Discussion
https://dl.acm.org/doi/10.1145/3464378

This work describes Khmer tokenization and POS-tagging datasets and highlights the importance of linguistic resources for building language technologies.

Such infrastructure research helps improve:

speech recognition accuracy
language modeling
pronunciation modeling
speech synthesis quality

6. Practical Khmer TTS Initiatives

Beyond academic papers, practical development efforts are also emerging.

One example is:

Khmer Text-to-Speech Research – IDRI
https://www.idri.edu.kh/wp-content/uploads/2025/05/Khmer-TTS-1.pdf

This report discusses how Khmer TTS can be used for accessibility, education, and digital services in Cambodia.

7. Maturity Assessment

Based on current literature, Khmer speech technologies can be evaluated as follows:

Technology	Current Maturity
Khmer Speech-to-Text (ASR)	Medium
Khmer Text-to-Speech (TTS)	Low–Medium
Supporting Khmer NLP	Medium

STT research is slightly ahead because speech recognition tasks have received more academic attention.

However, both areas remain constrained by:

limited datasets
lack of large pronunciation dictionaries
small research communities

8. Future Research Directions

The literature suggests several key directions for advancing Khmer speech technologies:

Larger Speech Corpora

High-quality speech datasets containing hundreds or thousands of hours of transcribed audio are required.

Standardized Pronunciation Dictionaries

Pronunciation resources are critical for both ASR and TTS.

Benchmarking and Evaluation

Public benchmarks and evaluation datasets would allow researchers to compare Khmer speech systems more effectively.
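
A shared benchmark also needs a shared metric. The sketch below computes word error rate (WER) and character error rate (CER) from edit distance. Because Khmer word boundaries depend on the segmenter used, CER may be the more comparable of the two across systems; that choice, and the function names, are assumptions of this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the reference length."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate, ignoring spaces."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))
```

For WER, both the reference and the hypothesis must be segmented with the same tool before scoring, otherwise segmentation differences are counted as recognition errors.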

Cross-Lingual Transfer Learning

Using multilingual speech models may help overcome Khmer’s data limitations.

Conclusion

Research over the past two decades demonstrates that Khmer speech technology is steadily evolving. Early work focused on building fundamental speech recognition systems and linguistic resources. More recent research emphasizes open datasets, multilingual modeling, and speech synthesis frameworks for low-resource languages.

Although Khmer TTS and STT technologies remain less mature than those for widely spoken languages, ongoing research efforts continue to expand the Khmer speech technology ecosystem.

Full Reference Links

Speech Recognition (STT)

First Broadcast News Transcription System for Khmer Language
https://aclanthology.org/L08-1123/
Which Units for Acoustic and Language Modeling for Khmer ASR
https://www.isca-archive.org/sltu_2008/seng08_sltu.pdf
Development of Speech Recognition System Based on CMUSphinx for Khmer Language
https://www.researchgate.net/publication/354435668
Multi-lingual Transformer Training for Khmer Automatic Speech Recognition
https://www.sap.ist.i.kyoto-u.ac.jp/lab/bib/intl/SOK-APSIPA19.pdf

Speech Datasets

Khmer Speech Translation Corpus
https://sap.ist.i.kyoto-u.ac.jp/EN/bib/intl/SOK-COCOSDA21.pdf
OpenSLR Khmer Speech Dataset
https://www.openslr.org/42/

Text-to-Speech Research

A Step-by-Step Process for Building TTS Voices for Low-Resource Languages
https://www.isca-archive.org/sltu_2018/sodimana18_sltu.pdf
Text Normalization for Low-Resource TTS Systems
https://www.isca-archive.org/sltu_2018/sodimana18b_sltu.pdf

Linguistic Resources

Building WFST based Grapheme to Phoneme Conversion for Khmer
https://ksoky.github.io/static/pdf/wfst_g2p.pdf
Applying Linguistic G2P Knowledge on Khmer
https://www.researchgate.net/publication/338354884

Khmer NLP Infrastructure

Khmer Word Segmentation Using Conditional Random Fields
https://att-astrec.nict.go.jp/member/tei/KhNLP2015-SEG.pdf
Towards Tokenization and POS Tagging for Khmer
https://dl.acm.org/doi/10.1145/3464378

State of the Art of Khmer Text-to-Speech (TTS) and Speech-to-Text (STT)

Introduction

Speech technology has become an important field within artificial intelligence, enabling computers to interact with humans through spoken language. Two core technologies drive this interaction:

Text-to-Speech (TTS) – converting written text into spoken audio
Speech-to-Text (STT) – converting spoken language into written text

For global languages such as English, Chinese, and Spanish, these technologies have reached a highly advanced stage. However, Khmer remains a low-resource language, meaning that the amount of available training data, linguistic resources, and technological infrastructure is still limited.

Because of this, the development of Khmer speech technologies is still evolving. Researchers continue to explore methods to improve both Khmer TTS and Khmer STT systems so they can achieve levels of quality and reliability comparable to major languages.

Khmer Language Characteristics and Technical Challenges

One of the main reasons speech technology development is more difficult for Khmer is the linguistic structure of the language.

Lack of Clear Word Boundaries

Unlike many languages that separate words using spaces, Khmer text does not consistently mark word boundaries. This makes it difficult for computational systems to perform tasks such as:

word segmentation
text normalization
language modeling

As a result, many preprocessing steps must be implemented before speech systems can effectively process Khmer text.

Complex Writing System

Khmer script is structurally complex. Characters can include:

consonant clusters
dependent vowels
diacritics positioned above, below, or around the base character

These properties increase the complexity of transforming written text into phonetic representations required for speech synthesis and recognition.

Khmer Text-to-Speech (TTS)

Text-to-Speech technology converts written Khmer text into spoken audio.

In general, a Khmer TTS system involves several processing steps:

Text preprocessing: cleaning and normalizing text input
Word segmentation: identifying individual words in continuous Khmer text
Grapheme-to-phoneme conversion: converting Khmer characters into phonetic units
Speech synthesis: generating the final speech waveform
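
The four steps above can be sketched as a chain of functions. Everything in this example is a placeholder assumption: the segmenter is plain whitespace splitting, the lexicon uses invented tokens and phoneme symbols rather than real Khmer entries, and the vocoder is a stand-in that returns a string instead of audio.

```python
def text_to_phonemes(text, segment, lexicon):
    """Run steps 1 to 3: normalize, segment, then look up pronunciations."""
    cleaned = " ".join(text.split())   # 1. text preprocessing (whitespace only)
    words = segment(cleaned)           # 2. word segmentation
    phonemes = []
    for word in words:                 # 3. grapheme-to-phoneme via lexicon lookup
        if word not in lexicon:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(lexicon[word])
    return phonemes


def synthesize(text, segment, lexicon, vocoder):
    """Step 4: hand the phoneme sequence to a waveform generator."""
    return vocoder(text_to_phonemes(text, segment, lexicon))


# Hypothetical lexicon with placeholder tokens, not real Khmer words.
LEXICON = {"ka": ["k", "a"], "mi": ["m", "i"]}
print(synthesize("  ka   mi ", str.split, LEXICON, vocoder="-".join))
```

Passing the segmenter and vocoder in as arguments keeps each stage replaceable, so a dictionary lookup can later be swapped for a trained grapheme-to-phoneme model without touching the rest of the chain.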

Historically, early Khmer TTS systems relied on rule-based or concatenative approaches, in which recorded speech fragments were joined to generate spoken output.

More recent developments attempt to improve naturalness and intelligibility by applying machine learning methods and speech corpora.

Khmer Speech-to-Text (STT)

Speech-to-Text, also known as automatic speech recognition (ASR), performs the reverse of TTS: it converts spoken Khmer audio into written text.

A Khmer STT system generally involves:

capturing audio input from a microphone or recording
processing acoustic signals
mapping sound patterns to phonemes
generating the corresponding text output

Speech recognition systems require several components:

acoustic models that interpret speech signals
language models that estimate word probabilities
pronunciation dictionaries that map each word to its phoneme sequence
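
Of these components, the language model is the simplest to illustrate. Below is a minimal sketch of a bigram model with add-one smoothing, trained on already segmented sentences. The class name, the placeholder tokens, and the smoothing choice are assumptions for illustration and do not describe any specific Khmer system.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()  # counts of each word used as a history
        self.bigrams = Counter()   # counts of (history, next word) pairs
        for words in sentences:
            padded = ["<s>"] + list(words) + ["</s>"]
            self.unigrams.update(padded[:-1])
            self.bigrams.update(zip(padded, padded[1:]))
        # Words that can be predicted, including the end-of-sentence symbol.
        self.vocab = {w for s in sentences for w in s} | {"</s>"}

    def prob(self, prev, word):
        """P(word | prev) with add-one smoothing over the vocabulary."""
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))

    def log_prob(self, words):
        """Log probability of a whole segmented sentence."""
        padded = ["<s>"] + list(words) + ["</s>"]
        return sum(math.log(self.prob(a, b)) for a, b in zip(padded, padded[1:]))


# Placeholder tokens standing in for segmented Khmer words.
lm = BigramLM([["a", "b"], ["a", "c"]])
print(lm.prob("<s>", "a"), lm.log_prob(["a", "b"]))
```

In a recognizer, scores like these are combined with acoustic scores so that word sequences seen in training text are preferred over acoustically similar but unlikely ones.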

Developing these components for Khmer is difficult because of the limited amount of annotated speech data available.

Research has demonstrated that Khmer speech recognition systems can be built using open-source toolkits such as CMUSphinx, achieving recognition accuracy close to 90% under controlled experimental conditions.

Available Data and Research Resources

One of the biggest challenges for Khmer speech technologies is the lack of large datasets.

Large speech models for major languages are typically trained on thousands of hours of recorded audio. For Khmer, available datasets are still relatively small.

Some datasets do exist, such as speech corpora collected for multilingual research projects and open-source speech resources. These datasets contain recorded audio paired with transcriptions that allow researchers to train TTS and STT models.

Research initiatives and academic institutions in Cambodia are actively working on building these resources to support Khmer AI development.

Current Maturity of Khmer Speech Technology

Compared with high-resource languages, Khmer speech technologies are still developing.

The maturity of Khmer TTS and STT can generally be described as:

Functional but limited in quality
Dependent on relatively small datasets
Under active research and improvement

Current systems can perform speech synthesis and speech recognition, but they often struggle with:

pronunciation variations
background noise
dialect differences
complex linguistic structures

Despite these challenges, progress continues as more datasets and research initiatives emerge.

Future Development

To improve Khmer speech technologies, several areas require continued effort:

Expansion of Speech Datasets

More recorded Khmer speech data is necessary to train accurate models.

Improved Language Processing Tools

Better word segmentation, phoneme dictionaries, and linguistic resources will enhance both TTS and STT systems.

Research Collaboration

Collaboration between universities, technology companies, and government institutions will accelerate progress in Khmer speech technology.

Conclusion

Khmer Text-to-Speech and Speech-to-Text technologies are advancing but remain less mature than those available for widely spoken languages. The main challenges stem from the Khmer language’s structural complexity and the limited availability of speech datasets.

Nevertheless, ongoing research and technological development continue to improve these systems. As more linguistic resources and speech data become available, Khmer speech technologies are expected to become increasingly accurate and widely adopted in areas such as education, accessibility, and digital services.

References

Development of Speech Recognition System Based on CMUSphinx for Khmer Language
https://www.researchgate.net/publication/354435668_Development_of_Speech_Recognition_System_Based_on_CMUSphinx_for_Khmer_Language
OpenSLR Khmer Speech Dataset
https://www.openslr.org/42/
Re-collected via: https://storm.genie.stanford.edu/article/state-of-the-art-of-khmer-tts-and-khmer-stt%2C-provide-academically-summarize-and-detail-on-how-mature-about-them-1552789
