Optical Character Recognition (OCR) is an essential step in converting documents into usable, editable, and searchable data. This technology is now widely used by the general public, industries, and heritage institutions, such as digital libraries.
For Armenian, there are several available solutions, the most well-known being:
Tesseract | Abbyy | Calfa OCR |
---|---|---|
Package | Software | Service |
Free | Licence | Prepaid offer |
Print and Handwritten | | |
Common fonts, clean documents | Common documents | Complex fonts and documents |
We trained a new version of Tesseract’s Armenian model, which we’re releasing as open source. This choice is driven by our desire to support the digitization of Armenian, offering a lightweight and easy-to-implement solution for everyday needs.
We improved it using data that better represents 20th-century Armenian print production, with a particular focus on damaged documents and old fonts—areas where Tesseract typically struggles. The model supports texts in Classical, Western, and Eastern Armenian. Here are a few recognition examples:
Example 1: Blurry Armenian newspaper (CER reduced by about 20 percentage points)
Metric | Tesseract default | Tesseract Calfa |
---|---|---|
Character Error Rate (CER, %) | 28.95 | 8.61 |
Word Error Rate (WER, %) | 95.96 | 52.22 |
Example 2: Noisy Armenian book from the 20th century (CER reduced by about 28 percentage points)
Metric | Tesseract default | Tesseract Calfa |
---|---|---|
Character Error Rate (CER, %) | 36.64 | 8.11 |
Word Error Rate (WER, %) | 101.22 | 44.38 |
Example 3: Binarized Armenian book (CER reduced by about 8 percentage points)
Metric | Tesseract default | Tesseract Calfa |
---|---|---|
Character Error Rate (CER, %) | 11.75 | 3.99 |
Word Error Rate (WER, %) | 50.07 | 21.51 |
Example 4: Historical Armenian font (CER reduced by about 35 percentage points)
Metric | Tesseract default | Tesseract Calfa |
---|---|---|
Character Error Rate (CER, %) | 48.93 | 13.47 |
Word Error Rate (WER, %) | 134.90 | 67.63 |
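For readers who want to reproduce this kind of measurement on their own pages, here is a minimal sketch of how CER and WER are commonly computed: the edit (Levenshtein) distance between the reference transcription and the OCR output, divided by the length of the reference. This assumes the standard definition; the exact normalisation used for the tables above is not stated, and the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance as a percentage of the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def cer(reference, hypothesis):
    """Character Error Rate, in percent."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word Error Rate, in percent (whitespace tokenisation, an assumption)."""
    return error_rate(reference.split(), hypothesis.split())
```

Because insertions count as errors, a rate can exceed 100% when the OCR output contains many spurious tokens, which is consistent with WER values above 100 for the default model.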
✅ What this model does:
❌ What this model doesn't do:
This model can be used locally by installing Tesseract-OCR, or integrated into a service using the pytesseract API. It’s also available through our processing tool at ocr.calfa.fr.
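As an illustration of the pytesseract route, here is a minimal sketch. It assumes that Tesseract-OCR, `pytesseract`, and Pillow are installed, and that the downloaded `.traineddata` file sits in a local folder under the language code `hye` (Tesseract's standard code for Armenian). The released file may use a different name, in which case pass that name as `lang`. The helper names, the folder path, and the choice of page segmentation mode 6 are assumptions made for the example, not part of the release.

```python
def build_config(tessdata_dir, psm=6):
    """Build the Tesseract option string.

    --tessdata-dir points Tesseract at the folder holding the model file;
    --psm selects the page segmentation mode (6 = single uniform text block).
    """
    return f'--tessdata-dir "{tessdata_dir}" --psm {psm}'


def ocr_page(image_path, tessdata_dir, lang="hye"):
    """Run OCR on one image and return the recognised text.

    `lang` must match the model file name without the .traineddata
    extension ("hye" is assumed here).
    """
    # Imported lazily so build_config stays usable without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, lang=lang, config=build_config(tessdata_dir)
        )
```

A call such as `ocr_page("page.png", "/path/to/tessdata")` would then return the page text as a string; both paths are placeholders.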
OCR technology is now more accessible than ever, including within generative AI models. This model is extremely lightweight (3 MB), with inference times under one second for standard pages. It offers a fast, efficient, and cost-effective alternative for a wide range of OCR projects.
For more specialized needs, feel free to contact us for a personalized assessment of costs and feasibility.