Bengaluru-based Sarvam AI Releases First Hindi LLM 'OpenHathi' with GPT-3.5-Like Performance

Bengaluru-based Sarvam AI has announced the release of OpenHathi-Hi-v0.1, the first Hindi LLM in its OpenHathi series of models. Trained under compute and data constraints, this generative AI model can achieve GPT-3.5-like performance on Indic languages on a frugal budget, claims the five-month-old Sarvam AI, which recently raised funds from Lightspeed Ventures, Peak XV and Khosla Ventures.

With the OpenHathi series, Sarvam AI aims to contribute open models and datasets to the ecosystem and so encourage innovation in Indian language AI.

OpenHathi is developed by Sarvam AI in partnership with AI4Bharat, a research lab at IIT Madras which works on developing open-source datasets, tools, models and applications for Indian languages.

AI4Bharat has contributed language resources and cross-lingual benchmarks.

Built on top of Meta AI's Llama2-7B, the model extends Llama2-7B's tokenizer to 48K tokens and undergoes a two-phase training process:

1) Embedding alignment: aligns the randomly initialised Hindi embeddings.

2) Bilingual language modeling: teaches the model to attend cross-lingually across tokens.
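The phrase "randomly initialised Hindi embeddings" can be pictured with a small sketch. It assumes, which the announcement does not state, that the 16K new tokens are appended after the original 32K rows of the embedding table. The toy embedding width of 8, the Gaussian initialiser and the function name `extend_embeddings` are all hypothetical and chosen only for illustration.

```python
import random

ORIGINAL_VOCAB = 32_000   # Llama2 tokenizer vocabulary size
ADDED_VOCAB = 16_000      # new Hindi tokens
EMBED_DIM = 8             # toy width for illustration; the real model is far wider


def extend_embeddings(table, n_new, dim, seed=0):
    """Return a new table: the original rows first, followed by n_new random rows."""
    rng = random.Random(seed)
    # Assumed initialiser: small Gaussian noise. The source only says "randomly initialised".
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_new)]
    return table + new_rows


# Stand-in for the pretrained embedding table (zeros here, real weights in practice).
pretrained = [[0.0] * EMBED_DIM for _ in range(ORIGINAL_VOCAB)]
extended = extend_embeddings(pretrained, ADDED_VOCAB, EMBED_DIM)

print(len(extended))  # 48000
```

The new rows start out unrelated to the pretrained ones, which is why an alignment phase is needed before the model can make use of them.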



The Tokenization of Hindi

Tokenization refers to splitting text into smaller parts for easier machine analysis, helping machines understand human language.

To add Hindi skills to Llama-2, Sarvam AI first reduced the fertility score (the average number of tokens a word is split into) of the tokenizer on Hindi text, which makes both training and inference faster and more efficient. To do this, it trained a SentencePiece tokenizer with a vocabulary size of 16K on a subsample of 100K documents from the Sangraha corpus, created at AI4Bharat. It then merged this with the Llama2 tokenizer to create a new tokenizer with a 48K vocabulary (the original 32K vocabulary plus the added 16K).
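A minimal sketch of fertility and vocabulary merging follows, using toy data. The greedy longest-match tokenizer below is only a stand-in and is not how SentencePiece segments text. The vocabularies are invented, and the helper names `fertility`, `merge_vocab` and `make_vocab_tokenizer` are hypothetical.

```python
def fertility(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(w)) for w in words) / len(words)


def merge_vocab(base, extra):
    """Append tokens from `extra` not already in `base`, keeping base ids stable."""
    seen = set(base)
    merged = list(base)
    for tok in extra:
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged


def make_vocab_tokenizer(vocab):
    """Greedy longest-match tokenizer with a single-character fallback (toy stand-in)."""
    vocab = set(vocab)

    def tokenize(word):
        tokens, i = [], 0
        while i < len(word):
            # Try the longest remaining piece first; fall back to one character.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
        return tokens

    return tokenize


base = ["the", "ing", "a"]            # toy stand-in for the original vocabulary
extra = ["नमस्ते", "भारत", "a"]         # toy stand-in for the new Hindi vocabulary
merged = merge_vocab(base, extra)

sample = "नमस्ते भारत"
before = fertility(make_vocab_tokenizer(base), sample)
after = fertility(make_vocab_tokenizer(merged), sample)
print(len(merged), before, after)
```

With the toy base vocabulary, every Hindi word falls back to single characters, so its fertility is high. Once whole Hindi pieces are in the merged vocabulary, the same text needs fewer tokens per word, which is the effect the article describes.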

For the datasets behind OpenHathi's base model, Sarvam AI partnered with VerSe, the Hindi social media platform 'Koo', and Kissan AI (previously known as KissanGPT), a multilingual AI chatbot built to help Indian farmers.

This open-source base model has been trained with bilingual language modelling and therefore needs fine-tuning on specific tasks before it can be used as an instruction-following model. It has also not been aligned, so it can occasionally generate inappropriate content of the kind seen in its original pretraining data.

The company is inviting people to innovate on top of this latest release in the OpenHathi series by building fine-tuned models for different use cases. Sarvam AI will additionally release enterprise-grade models on its full-stack GenAI platform, which will launch soon. The base model is available on Hugging Face.

The start-up claims that the latest OpenHathi release performs as well as, if not better than, GPT-3.5 on various Hindi tasks while maintaining its English performance. Along with standard NLG tasks, the company says it also evaluated the model on a range of non-academic, real-world tasks.

