Tarento Joins EkStep To Build The Pillar For National Language Translation Mission Via ULCA Platform

Client overview: EkStep Foundation

EkStep Foundation is a Bengaluru-based non-profit building large-scale digital public goods. Co-founded by Nandan Nilekani, Rohini Nilekani and Shankar Maruwada, EkStep brings experience from Aadhaar-scale digital infrastructure into education and Indic language AI.

The challenge: building a shared foundation for Indian language AI

India’s National Language Translation Mission needed a common digital foundation for the country’s 22 scheduled languages, many of which remain low-resource for NLP.

To build reliable machine translation, speech recognition, text-to-speech and OCR capabilities, India needed a single platform to collect datasets, host reference models, attribute contributors and benchmark progress. The platform also required standard API contracts so that datasets, models and systems from different research labs could interoperate.

Why EkStep chose Tarento

Tarento had already partnered with EkStep on Anuvaad, a document translation platform used by judicial bodies for legal translations. This experience in Indic NLP, open architecture and government-grade data handling made Tarento a strong engineering partner for ULCA, the Universal Language Contribution API, under MeitY’s National Language Translation Mission.

What Tarento built

Tarento designed and delivered ULCA as an open, scalable and platform-agnostic data layer for the BHASHINI ecosystem.

The work covered:

  • Architecture and API contracts: A common specification for submitting, describing, attributing, searching, retrieving and benchmarking datasets and models.

  • Contributor and submission flows: Tooling for research labs, MSMEs and individual contributors to publish datasets and models in a standard format.

  • Curation and benchmarking: Pipelines for sanity checks, record-level attribution and benchmark datasets to evaluate models against shared metrics.

  • Open-source foundation: ULCA was published under the MIT licence and became the maintained code base behind the BHASHINI platform.
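To make the curation step more concrete, the sketch below shows one way a sanity check with record-level attribution could look for parallel translation data. It is an illustration only and not ULCA's actual schema or pipeline: the `ParallelRecord` class, its field names, and the `sanity_check` and `curate` functions are all hypothetical, and the specific checks (empty text, identical language codes, missing attribution, exact duplicates) are assumptions about what such a pipeline might reasonably reject.

```python
from dataclasses import dataclass, field


@dataclass
class ParallelRecord:
    """Hypothetical parallel-translation record; field names are assumptions."""
    source_text: str
    target_text: str
    source_lang: str  # language code, e.g. "en"
    target_lang: str  # language code, e.g. "hi"
    contributors: list = field(default_factory=list)  # record-level attribution


def sanity_check(record):
    """Return a list of problems found in one record (empty list means it passes)."""
    problems = []
    if not record.source_text.strip():
        problems.append("empty source text")
    if not record.target_text.strip():
        problems.append("empty target text")
    if record.source_lang == record.target_lang:
        problems.append("source and target language are identical")
    if not record.contributors:
        problems.append("no contributor attribution")
    return problems


def curate(records):
    """Split records into accepted and rejected, also rejecting exact duplicates."""
    seen = set()
    accepted, rejected = [], []
    for rec in records:
        key = (rec.source_lang, rec.target_lang, rec.source_text, rec.target_text)
        problems = sanity_check(rec)
        if key in seen:
            problems.append("duplicate pair")
        if problems:
            rejected.append((rec, problems))
        else:
            seen.add(key)
            accepted.append(rec)
    return accepted, rejected
```

Calling `curate` on a submitted batch returns the records that pass alongside the rejected ones, each paired with the reasons for rejection, so a contributor could be told exactly what to fix before resubmitting.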

Tarento also worked with India’s NLP research community, including teams from IITs, IIITs, IISc, CDAC and AI4Bharat, to bring early datasets and models into the ULCA-compliant format.

What ULCA hosts

ULCA supports datasets and models for machine translation, ASR, TTS, OCR, transliteration, named entity recognition and language identification across Indic languages.

By BHASHINI’s launch milestone, ULCA hosted around 215 million parallel translation pairs across 12 Indic languages, roughly 9,800 hours of ASR audio, hundreds of hours of studio-quality TTS data, around 6 million transliteration entries across 19 languages, and more than 240 models across translation, ASR, TTS, OCR and transliteration.

The catalogue has continued to grow through contributions from the wider BHASHINI ecosystem.

Why ULCA matters

BHASHINI now exposes more than 300 pre-trained AI models through Open Bhashini APIs. It has supported high-visibility public use cases, including the Prime Minister’s real-time Tamil speech translation in December 2023 and the Finance Minister’s 2024 Union Budget address.

ULCA provides the data and model backbone for this ecosystem, enabling startups, researchers and government bodies to build Indic language AI products on a shared open foundation.

Technology stack

Category | Technologies
Programming Languages | Java, Python
Frontend / UI | React
API & Gateway | OpenAPI, Zuul
Databases / Data Stores | MongoDB, Redis
Streaming / Real-time Data / Analytics | Apache Kafka, Apache Druid
DevOps / CI-CD | Jenkins
Cloud Platforms | Microsoft Azure, Amazon Web Services (AWS)
Infrastructure / Delivery | Content Delivery Network (CDN)