























Abstract:Vietnamese exhibits substantial dialectal phonetic variation across Northern, Central, and Southern regions, where identical lexical items may be realized with markedly different pronunciations. Such variation poses challenges for automatic speech recognition (ASR) and remains difficult to model computationally due to the complex relationship between Vietnamese orthography and phonology. Existing approaches typically address dialect variability at the word level, assuming dialect-invariant mappings between spelling and pronunciation, which limits their ability to capture systematic phonetic differences. We propose a dialect-aware phonetic framework that explicitly models Vietnamese phonological structure and dialectal variation at both the vocabulary and decoding levels. The framework introduces a phonetic vocabulary that decomposes each syllable into structured phonetic components and maps them to dialect-specific IPA representations, together with a phonetic-structure decoder that jointly predicts these components. Experiments on the UIT-ViMD, a only-available dataset for multi-dialect in Vietnamese, show that the proposed approach outperforms various pre-trained baselines, \textbf{especially matches the performance of the strongest pretrained wav2ve2-base-vi-250h} across dialects while \textbf{using substantially fewer parameters and no external pretraining}. Code for experimental reproducibility will be publicly available upon the acceptance of this paper.
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2605.24451 [cs.CL] |
| (or arXiv:2605.24451v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2605.24451 arXiv-issued DOI via DataCite (pending registration) |
From: Nghia Hieu Nguyen [view email]
[v1]
Sat, 23 May 2026 08:00:26 UTC (917 KB)
此内容由惯性聚合(RSS阅读器)自动聚合整理,仅供阅读参考。 原文来自 — 版权归原作者所有。