Dr Reeta Mishra
IILM University
Knowledge Park II, Greater Noida, Uttar Pradesh 201306
Abstract
Ensuring that laypersons and frontline health workers can understand medical information in their preferred language is central to equitable healthcare in India. While large language models (LLMs) and neural machine translation (NMT) systems can summarize and translate clinical content at scale, their reliability for safety-critical use remains uncertain—especially across India’s diverse linguistic landscape that spans multiple language families (Indo-Aryan, Dravidian, Tibeto-Burman), writing systems (Devanagari, Perso-Arabic, Bengali–Assamese, Gurmukhi, Gujarati, Kannada, Malayalam, Odia, Tamil, Telugu), and widespread code-mixing with English. This manuscript examines the translation accuracy of AI-based medical summaries across 12 major Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and Urdu). We synthesize relevant literature and propose an end-to-end evaluation protocol that compares three generation paradigms: (i) summarize-then-translate pipelines, (ii) translate-then-summarize pipelines, and (iii) direct multilingual summarization that outputs target-language summaries without intermediate translation steps. The protocol couples automatic metrics (BLEU, chrF, BERTScore, COMET) with human evaluation using an MQM-style error taxonomy and a clinical harm lens emphasizing errors in dosage, negation, contraindications, temporality, and named entities (drug, condition, anatomy). To make the study design concrete for practitioners, we present an illustrative analysis based on a curated, de-identified set of 1,200 short medical summaries (patient education leaflets, discharge-note synopses) and four representative model families (a strong open multilingual MT system, an Indic-centric NMT model, a commercial MT API, and a state-of-the-art LLM).
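To make the automatic-metric layer of the protocol concrete, the sketch below implements a simplified sentence-level chrF (Popović, 2015): character n-gram precision and recall averaged over n = 1…6, combined with β = 2 so that recall is weighted twice as heavily as precision. The function and variable names here are illustrative, not drawn from any library; a production evaluation would instead use a maintained implementation such as sacreBLEU, which also handles corpus-level aggregation and whitespace conventions.

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams; chrF conventionally ignores spaces."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF score in [0, 1].

    Averages character n-gram precision and recall over orders
    1..max_n, then combines them with an F-beta score (beta=2
    favors recall, as in Popovic, 2015).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # n-gram order longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / hyp_total)
        recalls.append(overlap / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Because chrF operates on characters rather than whitespace-delimited tokens, it is comparatively robust for morphologically rich, agglutinative Indian languages (e.g., Tamil, Telugu, Kannada), where word-level BLEU penalizes legitimate inflectional variants; this is one reason the protocol reports chrF alongside BLEU rather than BLEU alone.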
Keywords
Medical Summarization, Machine Translation, Indian Languages, Multilingual NLP, Clinical Safety, MQM, COMET, Code-Mixing, Indic NLP, Health Communication
References
- Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, 65–72.
- Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., … Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of ACL, 8440–8451.
- Freitag, M., Grangier, D., Caswell, I., Foster, G., Heafield, K., Koehn, P., … Cherry, C. (2021). Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the ACL (TACL), 9, 1460–1474.
- Goyal, N., Gao, C., Chaudhary, V., Chen, P.-J., Wenzek, G., Ju, D., … Fan, A. (2022). The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the ACL (TACL), 10, 522–538.
- Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., … Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, 160035.
- Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., … Dean, J. (2017). Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the ACL (TACL), 5, 339–351.
- Ramesh, G., Doddapaneni, S., Bheemaraj, A., Jobanputra, M., … Kumar, P. (2022). Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the ACL (TACL), 10, 145–162.
- Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. Proceedings of EMNLP: System Demonstrations, 66–71.
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of ACL, 7871–7880.
- Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Proceedings of the ACL Workshop on Text Summarization Branches Out, 74–81.
- Lommel, A. R., Uszkoreit, H., & Burchardt, A. (2014). Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Proceedings of the Workshop on Automatic and Manual Metrics for MT Evaluation (LREC), 62–67.
- NLLB Team. (2022). No language left behind: Scaling human-centered machine translation. arXiv:2207.04672.
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of ACL, 311–318.
- Popović, M. (2015). chrF: Character n-gram F-score for automatic MT evaluation. Proceedings of WMT, 392–395.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research, 21(140), 1–67.
- Rei, R., Stewart, C., Farinha, A. C., & Lavie, A. (2020). COMET: A neural framework for MT evaluation. Proceedings of EMNLP, 2685–2702.
- Roark, B., Wolf-Sonkin, L., Kirov, C., Mielke, S. J., Johny, C., Demirsahin, I., & Hall, K. (2020). Processing South Asian languages written in the Latin script: The Dakshina dataset. Proceedings of LREC, 2413–2423.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of ACL, 1715–1725.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 5998–6008.
- Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. Proceedings of ICLR.