Calfa releases a digital version of Adjarian’s etymological dictionary


Among the major works of the prominent Armenian linguist Hrachia Adjarian (1876–1953), the Etymological Dictionary of Armenian (Հայերէն արմատական բառարան, Hayerēn armatakan baṙaran) is undoubtedly the most renowned. Published between 1926 and 1935, this dictionary contains over 11,000 entries, enriched with examples from Armenian literature and manuscripts, as well as dialectal variants and information on word roots in Hebrew, Syriac, Middle Persian, Georgian, Greek, Latin, and many other languages.

This work is indispensable for Armenian studies. Its integration into Calfa’s lexical databases is part of our long-term goals for the Armenian–French and Armenian–English online dictionaries, which we have been maintaining and expanding since 2014. In 2024, we made our lexical databases available as open source via our Git repository.

To learn more about Calfa's open-access lexical databases, click here.

Technical pipeline

The dictionary has a simple yet variable structure. Each entry includes a definition, dialectal equivalents, etymological sources, indications of loanwords, and derivations from other entries. The layout of the dictionary reflects these different layers of information. We developed a specialized analysis model capable of automatically detecting and storing this data in a database, without any human intervention.
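As an illustration of the layers listed above, here is a minimal sketch of how one entry could be represented before being stored. The class name, field names, and the to_row helper are hypothetical and do not reflect the actual schema of Calfa's databases; the sample values are placeholders, not data extracted from the dictionary.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class DictionaryEntry:
    """Hypothetical container for one entry; not Calfa's actual schema."""
    headword: str
    definition: str
    # The optional layers default to empty lists, since the structure
    # of entries is described as variable.
    dialectal_equivalents: list[str] = field(default_factory=list)
    etymological_sources: list[str] = field(default_factory=list)
    loanword_notes: list[str] = field(default_factory=list)
    derived_from: list[str] = field(default_factory=list)

    def to_row(self) -> dict:
        """Flatten the entry into a plain dict, e.g. for a database insert."""
        return asdict(self)


# Placeholder values for illustration only.
entry = DictionaryEntry(headword="հայր", definition="father")
row = entry.to_row()
```

Keeping every layer as a separate field means an entry with no dialectal paragraph or no loanword note is still a valid record.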

To do this, we implemented a fine-tuning strategy based on one of our generic layout analysis models, which we adapted using samples from Adjarian’s dictionary. This enabled us to extract 11,300 entries, including 2,600 paragraphs of dialectal equivalents.

Adjarian Dictionary - layout analysis
Example of detecting and analyzing the dictionary’s structure

The dictionary uses more than twenty alphabets, including Armenian, Arabic, Greek, Hebrew, Georgian, Middle Persian, Cyrillic, Syriac, and Tamil. Some are even combined within a single word. We focused recognition on Armenian, Greek, Georgian, Latin, Cyrillic, and specialized linguistic characters. The other alphabets are detected but not transcribed; each such word is replaced with a placeholder of the form [alphabet word]. This editorial choice rests on three reasons: (i) reducing model preparation time, (ii) ensuring high performance for the selected scripts, and (iii) the rarity, at this stage, of the excluded scripts, whose words are systematically translated into Armenian. The specialized multilingual recognition model we trained first identifies the script type, then applies the most probable transcription. Average recognition accuracy is 98.78%, ranging from 94.3% for Georgian to 99.2% for Armenian. The remaining scripts will be handled gradually through the dictionary’s collaborative correction process.

Adjarian dictionary - scripts
Script detection during the text recognition process
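The placeholder convention described above can be sketched on plain text. This is not the recognition model itself, which works on page images; it is a simplified illustration that assumes the script of a character can be read from the first word of its Unicode character name, and that the placeholder is spelled "[<script> word]". The function names and the TRANSCRIBED set are hypothetical.

```python
import unicodedata

# Scripts that are transcribed, per the list above; everything else is masked.
TRANSCRIBED = {"ARMENIAN", "GREEK", "GEORGIAN", "LATIN", "CYRILLIC"}


def scripts_of(word: str) -> set[str]:
    """Return the scripts of the letters in `word`.

    The script is approximated by the first word of the Unicode character
    name (e.g. "ARMENIAN SMALL LETTER AYB" gives "ARMENIAN").
    Non-letters such as punctuation, digits, and combining marks are ignored.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def mask_untranscribed(text: str) -> str:
    """Replace each word containing a non-transcribed script with a placeholder."""
    out = []
    for word in text.split():
        other = scripts_of(word) - TRANSCRIBED
        if other:
            # A mixed-script word is masked as a whole in this sketch.
            out.append(f"[{sorted(other)[0].lower()} word]")
        else:
            out.append(word)
    return " ".join(out)


masked = mask_untranscribed("հայր كتاب pater")
```

Masking whole words keeps the surrounding transcription readable while leaving a visible marker where a later correction pass can fill in the original script.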

The dictionary is now available and searchable through our open-access databases on GitHub. It is a working document that we will strengthen and clean up in the coming months, released now so that the scientific community can already benefit from it; contributions are welcome via GitHub issues or email. It will be integrated into dictionary.calfa.fr by 2026. The initial development of the dictionary in 2014 and 2016 was supported by the Calouste Gulbenkian Foundation. This development is carried out in partnership with ANR DALiH (ANR-21-CE38-0006), as part of a joint effort to develop a digital corpus for the Armenian language.

Calfa Team