IIT Madras develops an easy OCR for 9 Indian languages
Taking a cue from European languages, several of which share the same (Roman letter–based) script, Srinivasa Chakravarthy’s team at IIT Madras has, over the last decade, developed a unified script for nine Indian languages, named the Bharati script. The team has now gone a step further: it has developed a method for reading documents in Bharati script using a multi-lingual optical character recognition (OCR) scheme. The team has also created a finger-spelling method that can be used to generate a sign language for hearing-impaired persons. In collaboration with TCS Mumbai, the researchers have found a way for persons with hearing disability to generate signatures using this finger-spelling technique.
The scripts that have been integrated include Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Telugu, Kannada, Malayalam and Tamil. English and Urdu have not been integrated so far. Dr Chakravarthy says, “Urdu and English alphabet systems have a very different phonetic organisation. But that does not mean a mapping is not possible. It is quite possible and can be done.”
In general, optical character recognition schemes first separate (or segment) the document into text and non-text. The text is then segmented into paragraphs, sentences, words and letters. Each letter has to be recognised and mapped to a character in a standard encoding such as ASCII or Unicode. A letter has various components, such as the basic consonant, consonant modifiers and vowels.
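The idea that one written letter is built from several components can be seen in how Unicode encodes existing Indian scripts. The short Python sketch below uses the standard `unicodedata` module to list the code points behind a single Devanagari syllable. It is only an illustration of the general point: the helper name `components` is made up for this example, and it has nothing to do with the Bharati team's own software.

```python
import unicodedata

def components(text):
    """Return (character, Unicode name) pairs for each code point in text."""
    return [(ch, unicodedata.name(ch)) for ch in text]

# The syllable "ki" in Devanagari: consonant KA followed by vowel sign I.
# On the page it is drawn as one joined shape, but it is stored as two
# separate code points.
syllable = "\u0915\u093f"

for ch, name in components(syllable):
    print(f"U+{ord(ch):04X}  {name}")
```

An OCR system for such a script has to work backwards from the joined shape on the page to this sequence of components.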
Easy to read
The scripts of Indian languages pose a problem for such character recognition because the vowel and consonant-modifier components are attached to the main consonant part. This difficulty is removed in the Bharati script, which can be easily read. “In Bharati characters, these different components are segmentable by design. So OCR works quite accurately. Our OCR engine gives almost 100% accuracy even with mild noise added,” says Dr Chakravarthy.
The ease in design comes about because the Bharati characters are made up of three tiers stacked vertically. The consonant at the root of the letter is placed in the centre and the modifiers are in the top and bottom tiers.
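To see why a three-tier layout makes segmentation simple, consider the toy Python sketch below. It cuts a small character bitmap into top, middle and bottom bands and checks which bands contain ink. Everything here is an assumption made for illustration: the bands are taken to be of equal height, the bitmap is invented and is not a real Bharati character, and the function names `split_tiers` and `has_ink` are hypothetical. The article does not describe how the team's OCR engine actually does this.

```python
def split_tiers(glyph_rows):
    """Split a glyph bitmap into three horizontal bands.

    glyph_rows is a list of equal-length strings in which '#' marks ink.
    Equal-height bands are an assumption of this sketch.
    """
    height = len(glyph_rows)
    if height == 0 or height % 3 != 0:
        raise ValueError("this sketch expects a non-zero height divisible by 3")
    band = height // 3
    return {
        "top": glyph_rows[:band],              # upper modifier tier
        "middle": glyph_rows[band:2 * band],   # consonant tier
        "bottom": glyph_rows[2 * band:],       # lower modifier tier
    }

def has_ink(rows):
    """True if any pixel in the given rows is marked as ink."""
    return any("#" in row for row in rows)

# Invented 6-row bitmap: a mark in the top tier, a shape in the middle
# tier, and an empty bottom tier.
EXAMPLE_GLYPH = [
    "..#..",
    ".....",
    "#####",
    "#...#",
    ".....",
    ".....",
]

tiers = split_tiers(EXAMPLE_GLYPH)
occupied = {name: has_ink(rows) for name, rows in tiers.items()}
print(occupied)
```

Because each component sits in its own band, each band can be passed to a recogniser separately, with no need to untangle strokes that overlap.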
In collaboration with Sunil Kopparapu of Innovation Labs, TCS, Mumbai, the team has developed a universal finger-spelling language for the nine Indian languages. They are working on a system that can help people sign documents using a finger-spelling method, and future plans include developing a new Braille system with the Bharati script.