Dr Rupesh Kumar Mishra
School of Computer Science and Engineering
SR University
Warangal – 506371, Telangana, India
rupeshmishra80@gmail.com
Abstract
Telemedicine has become a core channel for care delivery in multilingual countries where millions of patients consult in regional languages. In such settings, clinicians often rely on voice-to-text (automatic speech recognition, ASR) to document encounters, generate prescriptions, and send instructions. While ASR improves speed and coverage, transcription errors—especially in code-switched speech, dialectal variants, and noisy home environments—can distort clinical intent. This manuscript analyzes how voice-to-text errors propagate into patient safety and service quality risks in regional-language telemedicine. We synthesize prior work on ASR error patterns, code-switching, and clinical documentation, and propose a methodology that combines (i) a simulation framework that injects realistic substitution, deletion, and insertion errors at varying word error rates (WER) across five Indian languages and (ii) statistical modeling to estimate the effect of WER on clinically consequential outcomes (e.g., wrong-dosage instructions). We operationalize semantic error rate (SER) and entity-level F1 for medication names, dosage, route, frequency, and follow-up date extraction as outcome variables linked to WER, noise type, and code-switching intensity. In simulation (10,000 dialogues), each 5-point increase in WER increased the odds of a clinically consequential instruction error by 14% (OR = 1.14; 95% CI: 1.10–1.18). Code-switching and background noise independently elevated risk, while domain-adapted language models and structured confirmation prompts cut risk substantially. We discuss design guidelines—confirmation UX patterns, constrained templates for prescriptions, pronunciation-robust lexicons, and continuous learning from post-visit corrections—to mitigate harm. The paper closes with implementation recommendations for public telemedicine programs and future research directions for low-resource regional languages.
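The abstract's error-injection framework can be sketched minimally. The function below corrupts a token sequence with substitution, deletion, and insertion errors at a target word error rate; the function name, the probability split across error types, and the vocabulary argument are illustrative assumptions, not the paper's actual implementation.

```python
import random

def inject_errors(tokens, target_wer, vocab, seed=None,
                  p_sub=0.6, p_del=0.25, p_ins=0.15):
    """Corrupt a transcript so its expected WER matches target_wer.

    Each token is independently errored with probability target_wer;
    an errored token becomes a substitution (p_sub), a deletion
    (p_del), or keeps its place with a spurious insertion after it
    (p_ins). All parameter names and defaults are illustrative.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < target_wer:
            r = rng.random()
            if r < p_sub:
                out.append(rng.choice(vocab))      # substitution
            elif r < p_sub + p_del:
                continue                           # deletion
            else:
                out.append(tok)
                out.append(rng.choice(vocab))      # insertion
        else:
            out.append(tok)
    return out
```

On the reported effect size: an odds ratio of 1.14 per 5-point WER increase corresponds, under a standard logistic model, to a per-point log-odds slope of roughly ln(1.14)/5 ≈ 0.026.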
Keywords
Telemedicine, Automatic Speech Recognition, Word Error Rate, Regional Languages, Code-Switching, Patient Safety, Clinical NLP, India, Usability, Simulation