For several years, we have witnessed a massive digitization of handwritten collections, archives and documents. Digitizing resources and storing them in the cloud contributes to the preservation and long-term archiving of data. But while documents can therefore be accessed remotely, their real accessibility is not guaranteed: without reliable, high-quality character recognition, keyword search, named entity recognition and automatic document classification are impossible, leaving only manual search.
OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) systems are automatic text recognition software that analyze a scanned image to extract its text. An effective pipeline consists of three steps: (1) layout analysis (identifying text regions, finding lines of text, attributing semantic tags to the lines and the regions), (2) simultaneous character and word recognition, and (3) post-processing with adapted language models to ensure the reliability and quality of the prediction.
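As a rough illustration of how these three steps chain together, here is a minimal sketch in Python. Every function is a hypothetical stub standing in for a real model; nothing here reflects an actual OCR/HTR implementation.

```python
# Sketch of the three-stage OCR/HTR pipeline described above.
# All functions are hypothetical stubs standing in for real models.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Region:
    tag: str                          # semantic tag, e.g. "paragraph"
    line_images: List[str] = field(default_factory=list)

def analyze_layout(page_image: str) -> List[Region]:
    # (1) Layout analysis: detect regions, find their text lines,
    # attribute semantic tags. Stubbed here with fixed output.
    return [Region(tag="paragraph", line_images=["line-1", "line-2"])]

def recognize_line(line_image: str) -> str:
    # (2) Simultaneous character/word recognition on one line image.
    return f"raw text of {line_image}"

def postprocess(text: str) -> str:
    # (3) A language model rescores/corrects the raw prediction.
    return text.replace("raw", "corrected")

def run_pipeline(page_image: str) -> List[str]:
    return [
        postprocess(recognize_line(line))
        for region in analyze_layout(page_image)
        for line in region.line_images
    ]

print(run_pipeline("page.png"))
```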
There are several open source and proprietary solutions for extracting text from documents, especially printed ones. Tesseract is the best-known free, open-source solution, available for most languages, while ABBYY is the market leader for print recognition.
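For reference, here is a minimal example of using Tesseract from Python through the pytesseract wrapper. It assumes the Tesseract binary and the pytesseract package are installed, and that "page.png" is a sample scanned page (all three are assumptions, not resources provided by this article).

```python
# Minimal Tesseract usage via the pytesseract wrapper.
# Assumes tesseract and pytesseract are installed, and that
# "page.png" is a sample scan (hypothetical file name).
from PIL import Image
import pytesseract

page = Image.open("page.png")

# Full-page recognition; `lang` selects the traineddata model,
# e.g. "eng" for English or "hye" for Armenian (if installed).
text = pytesseract.image_to_string(page, lang="eng")
print(text)

# Word-level boxes and confidences, useful for inspecting layout output.
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
print(list(zip(data["text"], data["conf"]))[:10])
```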
Each document has its own specificities: a particular layout, varying states of conservation, specific handwriting, press formats, etc. There is not always an OCR or HTR suited to your needs, nor a one-size-fits-all software solution. Moreover, text recognition for languages with non-Latin scripts, such as Armenian, Syriac, Arabic, Chinese or Georgian, is still in its infancy: the proposed architectures are rarely adapted to their specificities (ligatures, abbreviations, text direction, etc.). These languages belong to the family of so-called digitally under-resourced languages.
Today it is possible to train neural networks to analyze a very specific layout or to process a very particular set of documents. To be efficient and robust, however, these networks need to be trained on large datasets. It is therefore necessary to annotate, often manually, documents similar to those we wish to recognize (what is called the creation of "ground truth").
Manually annotating documents, choosing a neural architecture suited to your needs, and monitoring and evaluating the training of a neural network to obtain a relevant model are costly and time-consuming activities that often require investment and machine-learning experience. This is ill-suited to the massive and rapid processing of documents, especially when the results are not satisfactory.
Since 2014, Calfa has built expertise in tailor-made text recognition for oriental languages (printed and handwritten) and in the processing of under-resourced languages. Notably, we have developed versatile layout analysis and robust text recognition models for oriental languages. Calfa also supports research in document analysis: we are releasing our annotation platform, Calfa Vision, to support your annotation projects and help you quickly create quality ground truth, compatible with most modern neural architectures.
Calfa Vision is a free, web-based assisted annotation tool that includes several models for the automatic understanding of printed or handwritten documents. These models can be fine-tuned in real time to match your needs very quickly. The basic steps consist in annotating documents (1) at the region and line level, and (2) at the transcription level.
In summary, Calfa Vision:
- is a free, web-based assisted annotation platform;
- includes built-in layout analysis and text recognition models (OCR/HTR) for printed and handwritten documents;
- lets you fine-tune these models in real time on your own documents;
- produces quality ground truth compatible with most modern neural architectures.
We also provide our partners and customers with an integrated OCR/HTR to speed up transcription work, in the languages already processed by Calfa (other languages on request). This OCR/HTR is itself fine-tunable, so that it precisely matches your script or the state of preservation of your documents, and only a very small dataset is needed to quickly obtain a competitive model. Contact us to assess your OCR/HTR needs.
Calfa will be present at ICDAR 2021, with a presentation of Calfa Vision at the main conference on September 9th.
2021/06 - Artificial Intelligence and khaṭṭ maghribī: Results of a Hackathon (Inalco, Bulac) (in French)
2021/01 - Digital Perspectives for Corpus Processing of Texts Written in Armenian (Oxford Centre for Byzantine Research) (in English)
2020/12 - HTR/OCR for Non-Latin Scripts: Approaches and Best Practices (GIS MOMM - CNRS) (in French)