The catalogue of manuscripts of the Mekhitarists of Venice (also known as Մայր ցուցակ հայերէն ձեռագրաց մատենադարանին Մխիթարեանց ի Վենետիկ) is a fundamental work for the Armenian studies. It lists more than 2000 manuscripts from the collection of the Mekhitarist monastery from the San Lazaro Island in Venice, one of the libraries with the most Armenian books and manuscripts.
Each manuscript is described in detail: title, date and site of copy, name of the copyist, description of the content, number of pages, dimensions, etc. The characteristic of this catalogue lies in its thematic structure. The works are categorized by them such as bibles, chronicles, hymnals, etc., that may complicate the search when the category’s theme is unknown. Today, close to a third of the manuscripts from San Lazaro have not yet been added to this catalogue.
So far, the catalogue, that holds 1,3 million words over 6,250 pages divided into 8 volumes, was available under PDF format, digitized and made accessible thanks to the Fundamental Scientific Library of NASRA. The technical challenge met by Calfa, in partnership with the Mekhitarist Fathers, consisted in converting this PDF into a structured database, ready to be published online, and a search engine.
One of the major obstacles was to work with old scans of poor quality. Despite some blurred areas and a pixelized text, due to prior compressions of the PDF file, the OCRization (automatic recognition of the text) was efficiently achieved thanks to our OCR models specialized in printed Armenian, capable to deal with these difficulties. Our models have been adjusted to the structure of the catalogue in order to identify and retrieve information at first glance like the title of the manuscript, its date of copy, the essential data and the original description written by the Fathers.
The other main task consisted to structure text in such a way to create the database automatically. It required to identify the semantic fields within the pages through detection of the information from the bibliographical notice (title, date of copy, etc.) as well as the named entities (name of the copyist, of the sponsor, of the bookbinder, of the illuminator, site of copy, etc.) in order to categorize them and assign them automatically in the database. This step was made possible through the development of a hybrid analysis model for Classical and Modern Armenian specialized for this catalogue.
This method was a success with a character recognition rate of 99.2% for the old and damaged PDFs in printed Armenian. A first set of human proofreading was undertaken for the key metadata (title, dates and quantitative information like pages or dimensions). The few residual errors, often caused by the frequent confusion between some letters, will be gradually corrected over time.
The catalogue is now available online on the Mekhitarists website with a search engine capable of finding a semantic field or a word directly in plain text. This functionality improves largely the search by overcoming the limitation of the initial thematic classification.
This project is a good example of the use of AI for the automatization of structured data retrieval and enhancement, and the method is now proven and ready for the processing of Armenian catalogues.
The raw data of the project have also been published under open access on GitHub.
A project carried out between 2022 and 2024 with the help and support of the congregation of the Mekhitarist Fathers of Venice.