Automatic Text Processing and Digital Humanities for Ethiopian Language and Culture

20th International Conference of Ethiopian Studies (ICES20) – “Regional and Global Ethiopia – Interconnections and Identities”

Mekelle town, Tigray, Ethiopia - 1 to 5 October 2018

Panel ID: 0801: Automatic Text Processing and Digital Humanities for Ethiopian Language and Culture

Thu 4 October Room08 [IPHC Hall 3rd floor]

Convenors: Cristina Vertan (HLCES), Solomon Teferra Abate (AAU)

The developments during the last decade in processing natural language open new perspectives for preservation of cultural heritage, extraction of information from large amounts of data as well as access to multilingual content.

Although included in the set of so-called less-resourced languages, the languages of Ethiopia (Amharic, Tigrinya, Oromo and Ge’ez) are slowly gaining available resources and tools (see https://www.researchgate.net/project/Development-of-Ethiopian-Languages-Resources-NaturalLanguage-Applications-and-Speech-Processing-Tools).
Morphological analyzers, speech recognition systems, electronic dictionaries and PoS-taggers are already available. The project TraCES (https://www.traces.uni-hamburg.de/) is currently building the first digital tools for Ge’ez and will make a major contribution to the diachronic analysis of Ge’ez. The project Beta maṣāḥǝft is creating the largest database of descriptions of manuscripts, which will be searchable through Semantic Web and computational linguistics technologies.
However, until now no action has been taken to bring together the efforts and resources devoted to digital tools for Ethiopian languages. The aim of this panel is to bring together researchers working in computational linguistics and digital humanities, as well as potential users, in order to:
- Identify existing technologies and resources
- Identify gaps and missing building blocks for the automatic processing of Ethiopian languages
- Identify possibilities for adapting tools across the languages of Ethiopia
We foresee two sections, one on computational linguistics tools for the languages of Ethiopia and a second one devoted to Digital Humanities projects and activities.

*****************

Programme

Session 1: Digital Humanities

09:00–09:10 – Opening remarks

09:10–09:40 – Addressing Ethiopic layout requirements at the World Wide Web Consortium (Daniel Yacob)

09:40–10:10 – Keeping Ethiopia digitally in Sync: the role of a coordinating agency for progress on digital support for Ethiopian languages (Isabelle A. Zaugg)

10:10–10:40 – Electronic publication of Ethiopian manuscript archives. What’s involved? (Anaïs Wion)

Coffee break

11:00–11:20 – Beta maṣāḥǝft: manuscripts of Ethiopia and Eritrea (Pietro Liuzzo)

11:20–11:50 – Digital Lexicon linguae Aethiopicae (Pietro Liuzzo)

11:50–12:20 – Place Names in the Chronicle of King Gälawdewos (1540-1559): a prototype geo-annotated text of the Ethiopic tradition (Solomon Gebreyes and Pietro Liuzzo)

12:20–12:45 – GeTa – A multi-level annotation tool for classical Ethiopic texts (Cristina Vertan)

12:45–13:00 – Discussions

Lunch break

Session 2: Computational Linguistics

14:00–14:30 – Opportunities and challenges in the digitization of the Yaredic corpus (Daniel Yacob)

14:30–15:00 – Web Corpora for four major Ethiopian languages (Derib Ado, Feda Negesse, Shimelis Mazengia, Girma Mengistu, Ahmed Yusuf Hirad, Janne Bondi Johannessen)

15:00–15:30 – Tigrinya Orthography: Materials on written Tigrinya language standardization (Yaroslav Gutgarts)

15:30–16:00 – Colligation of phrases and lexical items in Afan-Oromo (Zinawork Assefa)

Coffee break

16:30–17:00 – Computational linguistics resources for Ethiopic (Asmelash Teka Hadgu)

17:00–17:30 – Somali Corpus: A framework for linguistic annotation (Jama Musse Jama)

*****************

List of abstracts

****************

WEB CORPORA FOR FOUR MAJOR ETHIOPIAN LANGUAGES [Abstract ID: 0801-06]
DERIB Ado, Addis Ababa University, Ethiopia
FEDA Negesse, Addis Ababa University, Ethiopia
SHIMELIS Mazengia, Addis Ababa University
GIRMA Mengistu, Addis Ababa University
AHMED Yusuf Hirad, Jigjiga University
Janne Bondi JOHANNESSEN, University of Oslo

This paper describes web text corpora for the four major languages of Ethiopia: Amharic (17,000,000 words), Oromo (4,000,000 words), Somali (72,000,000 words), and Tigrinya (2,000,000 words). The development of the corpora was made possible through a joint venture of two projects: Linguistic Capacity Building, tools for the inclusive development of Ethiopia, a joint project between four Norwegian and Ethiopian universities; and the Czech-Norwegian HaBiT project. The technical development of the corpora, including harvesting the web texts for the four languages, was fully undertaken by the Centre for Natural Language Processing at Masaryk University, Czech Republic, whereas the linguistic work on the corpora, which included revision of 350-450 seed bigrams for language detection, quality checking and evaluation of the corpora, was done by the Department of Linguistics at AAU. The corpora are presented in the HaBiT system (Kala et al. 2017). The search system for the four corpora offers simple and advanced concordances down to character level, provides frequency per million for search items, generates word lists, and allows advanced search using regular expressions, word sketches, and thesauruses. The Amharic corpus was POS-tagged using the tagset developed by Demeke & Getachew (2006). The main challenges in developing the corpora were lengthy quotations in other languages, such as Ge’ez in Amharic, and a lack of balance between domains, as the online content in the four languages is skewed towards religion and politics. The raw text of all four web-text corpora is available for download and will also be available in the Glossa corpus management system (Johannessen et al. 2008).
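
Because the four corpora differ greatly in size, raw hit counts are not directly comparable across them, which is why a per-million figure is useful. The following is a minimal sketch of that normalization, not the search system's own code: the corpus sizes are the ones reported in the abstract, while the function name `per_million` and the raw count of 3,400 hits are assumptions chosen purely for illustration.

```python
# Corpus sizes in words, as reported in the abstract above.
CORPUS_SIZES = {
    "Amharic": 17_000_000,
    "Oromo": 4_000_000,
    "Somali": 72_000_000,
    "Tigrinya": 2_000_000,
}


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw hit count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical raw count (not a real search result), used to show that the
# same number of hits means very different relative frequencies.
hypothetical_hits = 3_400
for language, size in CORPUS_SIZES.items():
    rate = per_million(hypothetical_hits, size)
    print(f"{language}: {rate:.1f} per million")
```

Under these assumptions the same 3,400 hits correspond to 200 occurrences per million words in the Amharic corpus but 1,700 per million in the much smaller Tigrinya corpus.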

ADDRESSING ETHIOPIC LAYOUT REQUIREMENTS AT THE WORLD WIDE WEB CONSORTIUM [Abstract ID: 0801-11]
DANIEL Yacob, Ge'ez Frontier Foundation

Founded in 1994, the World Wide Web Consortium (W3C) is responsible for the standards that make the World Wide Web possible. Standards from the W3C specify how the text in digital documents must be presented to readers in accordance with their cultural conventions. In 2015 the W3C launched a task force to address the layout, formatting and other presentation requirements of Ethiopic literature (https://www.w3.org/TR/elreq/). This effort to develop a recommendation that software companies can apply to Ethiopian and Eritrean publishing faces a significant challenge stemming from the absence of a similar specification for printed literature. While some recommendations may be found, none are comprehensive enough to cover all the aspects of layout and formatting needed, and none include the minutiae that may seem intuitive to a human author but must be expressed in a formalization that a logic processing system can apply. Thus the task force finds itself in the unexpected position of producing the first comprehensive specification for Ethiopic publishing. As society’s adoption of and reliance on desktop publishing grows, and as digital devices and e-readers proliferate, a good-quality specification is ever more essential, lest Ethiopic publishing be left to presentation rules devised for external literature. The paper will present the current status of the effort, stakeholder involvement, and the standardization process. While reviewing every detail of the draft recommendation is beyond the scope of the paper, examples are drawn from the developing work to illustrate the difficulties in layout specification and the importance of involving all parties with a stake in Ethiopic literature past and present.

BETA MAṢĀḤƎFT: MANUSCRIPTS OF ETHIOPIA AND ERITREA [Abstract ID: 0801-02]
Pietro LIUZZO, Universität Hamburg, Germany

The project "Beta maṣāḥǝft: Manuscripts of Ethiopia and Eritrea" (BM) aims at creating a portal to data related to the living manuscript tradition of the Ethiopian and Eritrean Highlands. This means encoding and semantically relating descriptions of manuscripts, editions of literary works, and records about ancient and modern places as well as ancient and modern persons. The presentation will describe the current workflow and the website and give an overview of the main challenges encountered so far. The treatment of images, ancient places, texts and catalogue descriptions will be presented together with some of the presentation choices made to date. Special attention will be given to the choices made to make this resource accessible, reusable and open to contributions from the largest possible community of interested stakeholders, in the hope of connecting the project not only to the TraCES, PAThs, IslHornAfr and Syriaca.org projects but also to many other resources for the production and publication of codicological and philological data. Beta maṣāḥǝft already supports the publication of IIIF manifests extracted from the TEI encoding of the manuscripts, and it supports a complex philological critical edition view on the website. Comparison, relation and visualization tools for the study of manuscripts are also supported and will be presented in this paper.

COLLIGATION OF PHRASES AND LEXICAL ITEMS IN AFAN-OROMO [Abstract ID: 0801-05]
ZINAWORK Assefa, Linguistics PhD Student at Addis Ababa University, Ethiopia

The study examined the phrasal structure of colligation in Afan-Oromo. The corpus was drawn from social, economic and political domains. The Natural Language Toolkit (NLTK) and Python 3.4.1 were used to analyze the relationships between colligating phrases or lexical items at the syntactic level. In particular, the study focused on the relationship between a lexical item and its grammatical context, between a lexical item and the particular syntactic function in which it can be used, and between a lexical item and the position in a phrase, clause, sentence, text or discourse in which it can be used. The most frequent colligation phrases were extracted from the corpus using the chi-square test. The findings show that different word forms of the same lexeme often have noticeably different distributional patterns, that different inflectional forms of adjectives colligate with adjectival and verbal phrases, and that different inflected adjectives prefer different syntactic positions.
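
As an illustration of how a chi-square test can score the association between two adjacent items, the sketch below computes the Pearson chi-square statistic from the 2x2 contingency table of a bigram. It is a pure-Python sketch and not the study's NLTK code; the function name `bigram_chi_square` is an assumption, and the token list consists of placeholder symbols, not Afan-Oromo data.

```python
from collections import Counter


def bigram_chi_square(tokens, w1, w2):
    """Pearson chi-square for the 2x2 contingency table of bigram (w1, w2)."""
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    counts = Counter(bigrams)

    o11 = counts[(w1, w2)]  # w1 followed by w2
    o12 = sum(c for (a, b), c in counts.items() if a == w1 and b != w2)
    o21 = sum(c for (a, b), c in counts.items() if a != w1 and b == w2)
    o22 = n - o11 - o12 - o21  # neither w1 first nor w2 second

    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denominator == 0:
        return 0.0  # a marginal is empty, so the statistic is undefined
    return n * (o11 * o22 - o12 * o21) ** 2 / denominator


# Placeholder tokens standing in for a tokenized corpus (hypothetical data).
tokens = "a b c a b d a b c d".split()
print(bigram_chi_square(tokens, "a", "b"))
print(bigram_chi_square(tokens, "b", "c"))
```

In this toy sequence "a" is always followed by "b" and "b" is always preceded by "a", so that pair receives a higher score than the pair "b", "c", where "b" is sometimes followed by another item. With real data, very low counts make the chi-square approximation unreliable, so a minimum frequency threshold is usually applied first.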

COMPUTATIONAL LINGUISTIC RESOURCES FOR ETHIOPIC [Abstract ID: 0801-08]
ASMELASH Teka Hadgu, L3S Research Center

There is a scarcity of publicly available linguistic resources for computational linguistics research on Ethiopic. In this paper, an attempt is made to bridge this gap by building computational linguistic resources for Ethiopic from the Web. The study has gathered a large-scale linguistic corpus by scraping heterogeneous web pages for Bibles, news media articles and blog posts, as well as popular social media sites such as Twitter for social feeds. Preliminary experiments were performed on two tasks: (i) language identification for Amharic, Tigrinya and Ge'ez and (ii) learning word embeddings for Amharic and Tigrinya. A state-of-the-art result was achieved on the language identification task. The contributions of this work are threefold: a raw corpus for Ethiopic-based languages, a language identification tool for these languages, and pre-trained word vectors for Amharic and Tigrinya. The study makes these computational tools and resources available to the research community.

DIGITAL LEXICON LINGUAE AETHIOPICAE [Abstract ID: 0801-01]
Pietro LIUZZO, Universität Hamburg, Germany

Alessandro Bausi, Andreas Ellwardt and many other contributors have been working for years on a digitized version of the Lexicon Linguae Aethiopicae by Augustus Dillmann. The lexicon was also available as a navigable website hosting PDFs of the pages of the 1865 edition, maintained by Ran Ha-Cohen (http://www.tau.ac.il/~hacohen/Lexicon.html). We have now produced a fully editable version of the lexicon in TEI, available at http://betamasaheft.eu/Dillmann . This digital edition has several features which make it an easily usable resource both for human end users and for other applications. In this presentation I would like to present this resource, with its current capabilities and its potential for further development. Firstly, a demo will be given of the website features for end users. Secondly, the advantages of the TEI-encoded digital edition of the dictionary that powers the application will be presented, and the automatic enrichment already implemented from the Beta maṣāḥǝft corpus of texts and other external resources will be showcased. Thirdly, the importance of the encoding for supporting the search and navigation functionality will be described, with examples of automatic, assisted and hand encoding. The main aim of the paper is to set this digital edition of the Lexicon Linguae Aethiopicae in the largest possible interconnection with other digital resources, for which it can constitute a rich base of data to be analysed and not only a website for end users.

ELECTRONIC PUBLICATION OF ETHIOPIAN MANUSCRIPT ARCHIVES: WHAT’S INVOLVED? [Abstract ID: 0801-13]
Anaïs WION, CNRS, IMAF

The publication of Ethiopian manuscript archives (EMA) is a collaborative project carried out by historians and philologists working on manuscript documents produced by the Ethiopian Christian kingdom between the tenth and the twentieth centuries. Ethiopian manuscript “archives” is a general term encompassing administrative, juridical and historical texts, which were produced by the Ethiopian political and religious authorities to proclaim their laws, rules and traditions. The term “archives” is to be thought of in a very wide sense and also as standing in juxtaposition to religious and literary texts. The producers of these documents were the royal and, to a lesser degree, religious administrations. Private acts, often in Amharic, were issued comparatively late, from the beginning of the eighteenth century. Several thousand documents of diverse character constitute a coherent corpus of primary sources that has so far been largely under-exploited. Establishing ways of publishing and analyzing these documents is thus part of an approach that is innovative in so far as it draws on digital technologies, and classical in its place within the tradition of diplomatics. The encoding of texts and the structuring of the data and the metadata adhere to the broadly accepted standards of XML-TEI. The electronic publication of these documents has a number of objectives. First of all, publishing the documents in transcription, in translation and, where possible, as images will make them accessible. Then the digital environment will allow the manipulation and analysis of the texts by multiplying research tools. The construction of thematic indexes, including the technical terms belonging to the specific vocabulary of the charters, the ability to search complete texts, and multi-entry search engines all offer points of entry into the documents and the means of navigating within them.
The encoding of the texts also allows us to bring out the structural elements of the legal documents and to construct tools for analyzing diplomatic discourse. The data processing tools allow the construction of ontological relationships which might, for example, advance prosopographical studies. Yet another advantage of electronic publication is the possibility of continuing to publish documents online at any time. Nevertheless, the use of digital tools does not answer all questions; it also raises problems and confronts scholars with new choices, which this presentation would also like to discuss.

KEEPING ETHIOPIA DIGITALLY IN SYNC: THE ROLE OF A COORDINATING AGENCY FOR PROGRESS ON DIGITAL SUPPORTS FOR ETHIOPIAN LANGUAGES [Abstract ID: 0801-07]
Isabelle A. ZAUGG, Postdoctoral Research Scholar at Columbia University's Institute for Comparative Literature and Society

This presentation argues for the vital need for a coordinating organization to help synchronize, organize, and communicate the important advances in Ethiopian language computing that are ongoing and crucially needed. This proposal is based on insights gained from my PhD dissertation entitled “Digitizing Ethiopic: Coding for Linguistic Continuity in the Face of Digital Extinction.” My research documented the important contributions of many individuals to the digital supports that Ethiopian languages currently enjoy. Through this research I identified a number of gaps that a coordinating organization could bridge to spur progress further. For example, much of the work on digital supports for Ethiopian languages has been done by individuals around the globe motivated by a sense of cultural responsibility, patriotism, or pride. While this is inspiring, individuals have unfortunately at times duplicated the efforts of others due to a lack of awareness of parallel efforts. A coordinating organization could not only help prevent duplication of effort, but could also help connect people with common interests to speed their work through collaboration. Furthermore, it could potentially play a role in helping to recognize and reward efforts that have historically often gone unpaid and underappreciated. A coordinating organization could also act as a liaison between Ethiopian universities, the Ethiopian IT sector, and the international IT companies looking to better support Ethiopian languages but lacking the linguistic knowledge in-house. A coordinating organization could also make recommendations to government agencies about IT policy and standards. It could also play the important role of informing and educating the public about the digital language supports that exist, since widespread lack of awareness has slowed the adoption of technologies like Amharic “voice to text” that have the potential to ease communication and spur the economy.
Finally, a coordinating organization could help bring together linguists and the IT sector to ensure that digital vitality does not come at the expense of linguistic and cultural degradation. I will present an evidence-based lecture on this topic and then lead a discussion about how such a coordinating organization could be established.

OPPORTUNITIES AND CHALLENGES IN THE DIGITIZATION OF THE YAREDIC CORPUS [Abstract ID: 0801-12]
DANIEL Yacob, Ge'ez Frontier Foundation

The Zēma chant practice of the Ethiopian Orthodox Church has been both a vocal and a calligraphic tradition since its inception in the 6th century by Ethiopia's most celebrated saint, Saint Yared. The Yaredic corpus represents its own class of literature, not only because of the nature of its content but also because of the complex system of internal referencing that it relies upon for comprehension. The system of referencing plays an important role in defining a document structure and layout that is not used by any other class of Ethiopic or chant literature. The complexities of its written expression pose particular challenges to electronic typesetting systems, desktop publishing software and standards-based e-media layout engines. The paper will present a meronymic perspective on Zēma document structure and propose markup syntax to support both the referencing and the layout requirements of the Yaredic class of literature. The paper goes further to address the most challenging aspect of presenting Zēma digitally: the multiple levels of interlinear chant notation. A three-dimensional model is developed and tested that allows separate Zēma chant notation levels to be bound to the hymn text and presented correctly under existing e-media standards. The paper concludes with an assessment of the prospects for supporting Yaredic documents in standards-driven software and reviews entirely new possibilities for presenting Zēma in interactive digital media.

PLACE NAMES IN THE CHRONICLE OF KING GÄLAWDEWOS (1540–1559): A PROTOTYPE GEO-ANNOTATED TEXT OF THE ETHIOPIC TRADITION [Abstract ID: 0801-03]
SOLOMON Gebreyes, University of Hamburg, Germany
Pietro LIUZZO, University of Hamburg, Germany

Although Ethiopia has a remarkable ancient and medieval history, well documented in classical Gǝʿǝz texts, its historical geography has never been fully documented by collecting toponyms from historical texts. In fact, the abundant royal chronicles and local hagiographies, as well as numerous travelers' accounts, contain a considerable amount of information about local places that represent the historical geography of the country in various periods. To fill this gap, the study aims to document and analyse place names in the chronicle of King Gälawdewos (1540–1559), focusing on a prototype geo-annotated text of the Ethiopic tradition.

SOMALI CORPUS: A FRAMEWORK FOR LINGUISTIC ANNOTATION [Abstract ID: 0801-09]
JAMA Musse Jama, Institute of Research, Heritage Preservation and Development, Redsea Cultural Foundation, Hargeysa, Somaliland

The development of IT resources for languages focuses mainly on well-described languages with long-standing written traditions and large numbers of speakers. One of the main challenges for languages with more recent written traditions is the lack of enough data for successful statistical approaches. This descriptive paper aims to present the state of the art in the construction of the Redsea Cultural Foundation’s Somali Corpus (RCF-SC), in collaboration with the Oriental University of Naples (Italy), and the development of a series of computer programs with which to analyze the corpus data for various purposes. The core of RCF-SC is unique in Somali-speaking countries and aims to be, for Somali, a resource equivalent in quality to the British National Corpus. The first edition of the corpus, containing 5 million words tagged and grammatically annotated, is online at www.somalicorpus.com.

TIGRINYA ORTHOGRAPHY: MATERIALS ON WRITTEN TIGRINYA LANGUAGE STANDARDIZATION [Abstract ID: 0801-04]
Yaroslav GUTGARTS, International Committee of the Red Cross, Addis Ababa, Ethiopia

Tigrinya, being the largest language of the State of Eritrea, is also the most important of the three working languages of the country. Moreover, Tigrinya is the official language of the Tigray Regional State of the Federal Democratic Republic of Ethiopia, where it is spoken by even more people. It is currently impossible to give a precise or fully reliable total number of speakers of the language, but it can be estimated at approximately 10 million. Although Tigrinya is one of the oldest and most widely spoken languages in the Horn of Africa, its written tradition is relatively young. At present Tigrinya orthography is at an intermediate stage of the standardization process; it is in transition. The unrestricted orthography, often conditioned by rich dialectal variation, regularities of colloquial language, Amharic influence and other factors, goes along with unrestricted orthoepy. Some attempts at standardization were already made by the Tigray Language Council, founded in 1944 by Edward Ullendorff in Eritrea. Following the example of the T.L.C. many years later, the foundation of an Academy of Amharic language was projected by Haile Selassie, and the Academy was founded in 1972. In 1979 the Academy was renamed the Academy of Ethiopian Languages. The Academy has been responsible for the standardization and (socio-)linguistic description of the Ethiopian languages. Very recently the Tigray Languages Academy was founded. In spite of various activities in this field, the outcome is rather modest. As is well known, orthographic standardization is the only tool enabling the adequate compilation and use of dictionaries (whether electronic or paper) and grammars. A fixed orthography must be codified in normative dictionaries and grammars. Whether these dictionaries and grammars are created by private individuals or by state institutions, they become standard if they are treated as authorities for correcting language.
A fixed written form and subsequent codification make the standard variety more stable than purely spoken varieties. This variety becomes the norm for writing, is used in broadcasting and for official purposes, and is the form taught to non-native learners. In the case of Tigrinya this is a goal of paramount importance, but it is yet to be achieved. This research focuses on the orthographic features of the Tigrinya language both in Tigray (Ethiopia) and in Eritrea. It has been carried out by means of analysis, comparison and systematization of the relevant fragments of modern (and, to some extent, old) written Tigrinya texts. Written literature, the press and legal documents served as sources for data collection. The work consists of the following parts or chapters: introduction, order transitions, various transitions, interchangeability, other inconsistencies, diachronic changes. The case in question is solely orthographic (i.e. written) standardization; dialectal standardization is not under consideration. The primary aim of the acquired data is to provide adequate means for the further standardization of written Tigrinya, which will enable, above all, the creation of morphological analyzers that will help to properly develop and use electronic, digital and online dictionaries and translators, including automatic translators and scan-based translators. All of these tools are already available for the largest and most developed languages of the world. Successful standardization of written Tigrinya will eventually favor the development of many fields of the social life of its speakers: education, literature, mass media, business and many others. The author of this research is a lexicographer dealing with Tigrinya, and all the materials included in the research were obtained during long-term lexicographic practice.