



























AI: Powered on Indian languages | Photo Credit: Rawf8
India’s ambitions in artificial intelligence are growing rapidly. Governments are investing in AI infrastructure, startups are attracting capital, and research institutions are building increasingly capable language models. Yet one fundamental challenge remains largely overlooked: India’s sovereign AI goal cannot be met without a strong knowledge infrastructure for its own languages.
Today, AI systems perform best when trained on large volumes of high-quality digital content. For English, such content exists in abundance. For most Indian languages, it does not. This is emerging as the single biggest bottleneck in the development of truly inclusive and effective Indian-language AI.
While Hindi enjoys a relatively rich digital footprint, languages such as Tamil, Telugu, Bengali and Marathi have more limited resources, and many others remain severely underrepresented. As a result, AI systems often struggle with accuracy, reasoning, summarisation and translation in these languages. The problem is not primarily one of computing power or model architecture. It is the lack of clean, diverse and digitised text that reflects India’s linguistic and cultural richness.
The implications extend far beyond technology. Large-scale digitisation is essential for modern governance, education, legal systems and cultural preservation. Government records, court judgments, land documents, textbooks, research papers and historical archives all need to become machine-readable if AI is to deliver meaningful public value.
At the heart of this challenge lies Optical Character Recognition (OCR) — the technology that converts scanned documents into searchable and usable text. OCR is often taken for granted in English, but for many Indian scripts it remains a significant hurdle.
Even printed documents present difficulties. Government records are frequently available only as low-quality scanned PDFs. Newspapers and books often use non-standard fonts. Complex page layouts containing multiple columns, tables and scripts reduce accuracy further. For languages such as Tamil, Malayalam and Urdu, OCR performance remains uneven.
The challenge becomes even greater when dealing with handwritten material. Millions of government records, historical archives and institutional documents remain handwritten. Regional variations in handwriting, the absence of large labelled datasets and older writing styles make automated recognition extremely difficult.
India’s vast manuscript heritage presents another frontier. Palm-leaf manuscripts, copper-plate inscriptions and ancient texts contain centuries of knowledge in fields ranging from mathematics and astronomy to medicine and philosophy. Unlocking these resources requires not only OCR but also image restoration, script identification and linguistic expertise. This is as much a national knowledge mission as a technology project.
Fortunately, important foundations already exist.
AI4Bharat at IIT Madras has emerged as one of India’s most significant open-source initiatives, contributing multilingual datasets, evaluation benchmarks and language models for Indian languages. IIIT Hyderabad has undertaken important work in OCR and document analysis. Meanwhile, companies such as Sarvam AI, BharatGPT, Microsoft, Google and Meta are investing in deployment and innovation.
Government initiatives have also made notable contributions. Bhashini has advanced speech and translation technologies. The National Manuscripts Mission has surveyed millions of manuscripts. The National Digital Library has assembled a large collection of digital resources. Several states, including Tamil Nadu, Kerala and Karnataka, have launched valuable digitisation programmes.
Yet these efforts remain fragmented. India still lacks common standards, interoperable datasets, AI-ready pipelines and a coordinated national strategy. The result is duplication of effort and slower progress than the country requires.
What India needs now is a National Knowledge Infrastructure for Indic AI.
First, a National Text Recognition Mission should be launched to accelerate development of OCR systems, handwriting recognition technologies, manuscript digitisation capabilities and next-generation vision-language models tailored to Indian scripts.
Second, a National Corpus Authority should establish standards for metadata, data quality, storage and interoperability while coordinating contributions from governments, universities, libraries and cultural institutions.
Third, India requires a modern licensing framework that balances public access, intellectual property protection and fair compensation for publishers and content creators.
Finally, stronger collaboration between government, academia, industry and civil society is essential. India’s linguistic diversity is unmatched globally. No single institution can solve this challenge alone.
As Nandan Nilekani recently argued, India has already shown through Digital Public Infrastructure such as UPI how open, interoperable public platforms can create transformative national outcomes. AI can follow a similar path.
The real race in AI is not merely about building bigger models or acquiring more GPUs. It is about creating the knowledge foundations on which those models can learn. If India succeeds in digitising, organising and democratising access to its linguistic wealth, it can build AI systems that serve not only English-speaking elites but also the hundreds of millions who communicate in Indian languages every day.
The future of Indian AI will depend on how effectively we unlock India’s knowledge treasure-house and make it accessible to machines and to people.
Seshasayee is Principal Adviser and Ramachandran is President of BIF. Views expressed are personal.
Published on June 18, 2026