Endangered Language Documentation Project

May 2018 - Present

Precedent
According to National Geographic, a language dies every two weeks. The decimation of languages is one result of an increasingly globalized world: increased connectivity also brings the loss of cultural records. I grew up speaking both Thai and English and frequently asked my parents why some words existed in one language but not the other. I came to learn that direct translation between two languages is seldom perfect, and in studying linguistics in college, I realized how special and essential language is as a fabric that ties together the changing values, concepts and expressions of all communities.

In May 2018, I joined Professor Khalil Iskarous' Endangered Language Documentation project -- an effort to build a computational model that efficiently learns the rules of any language given minimal speech and/or text data. Such a model would greatly speed up language documentation relative to current methods, and hopefully ensure that our greatest universal records of humanity's diversity and history aren't lost forever.

Part 1: Data Collection
I spent June 2018 in the northern Italian town of Bolzano. Other undergraduates and I interviewed native speakers of Ladin, an endangered language spoken by about 30,000 people in the Fassa Valley, an hour's bus ride from Bolzano. Ladin varies dialectally even from village to village, with obvious pronunciation differences between residents of the northern and southern parts of the valley.

We first had our native speakers read aloud Ladin folk tales that we ourselves had learned and translated, in order to build a corpus of speech data with existing transcriptions. Then, we interviewed native speakers directly to elicit diverse vocabulary across a wide range of topics, from sports to local attire to music. Finally, we recorded conversations between two native speakers on various topics.

Part 2: Data Preparation
Between us, we collected several hours' worth of Ladin data, then began translating it into Italian and English and transcribing it into IPA. Translation was challenging, especially given the intrusion of Italian words into the Ladin: native speakers of an endangered language commonly absorb the dominant language around them and begin to replace less common words with its vocabulary. Our native speakers were also young, and so used slang terms that typically had no direct translations. Our IPA transcriptions also varied greatly between speakers due to dialectal differences, which raises the question: what is the most standard Ladin?

Part 3: Data Computation and Analysis
With the data transcribed, we began building a chart parser in Python to determine the parts of speech of words not in our lexicon from the context of known words. The method isolates the unknown word and its neighbors to narrow down the possible parts of speech, then keeps expanding the comparison to further neighboring words until the unknown word is assigned a part of speech. As the process continues and the lexicon of known parts of speech grows, labeling becomes easier. So far, we've run the parser on a small lexicon with a small number of unknowns. We're continuing to extend it to handle more ambiguous semantics and syntactic challenges, especially head movement, which moves verbs elsewhere in a phrase and alters the rest of the phrase structure.
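To make the idea concrete, here is a minimal sketch of inferring an unknown word's part of speech with a chart (CYK-style) recognizer. It is not the project's actual parser: the three-rule grammar, the English placeholder words, the nonsense word "blick", and the names GRAMMAR, LEXICON, OPEN_TAGS, parses and guess_tags are all assumptions made for illustration. It also simplifies the method described above by testing each candidate tag against a parse of the whole sentence, rather than widening the context step by step.

```python
from itertools import product

# Toy binary grammar (hypothetical): (left child, right child) -> parent.
GRAMMAR = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
# Toy lexicon of known words (English placeholders, not Ladin data).
LEXICON = {"the": {"Det"}, "dog": {"N"}, "sees": {"V"}}
# Candidate tags to try for any word missing from the lexicon.
OPEN_TAGS = {"Det", "N", "V"}


def parses(tag_sets):
    """CYK recognition: True if some choice of tags yields a full S."""
    n = len(tag_sets)
    # Width-1 cells hold the possible tags of each word.
    chart = {(i, i + 1): set(tags) for i, tags in enumerate(tag_sets)}
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            cell = set()
            # Try every split point and combine the two sub-spans.
            for mid in range(start + 1, end):
                pairs = product(chart[(start, mid)], chart[(mid, end)])
                for left, right in pairs:
                    parent = GRAMMAR.get((left, right))
                    if parent:
                        cell.add(parent)
            chart[(start, end)] = cell
    return "S" in chart[(0, n)]


def guess_tags(words, lexicon=LEXICON):
    """Map each unknown word's position to the tags that allow a parse."""
    tag_sets = [lexicon.get(word, OPEN_TAGS) for word in words]
    guesses = {}
    for i, word in enumerate(words):
        if word in lexicon:
            continue
        guesses[i] = {
            tag
            for tag in OPEN_TAGS
            if parses(tag_sets[:i] + [{tag}] + tag_sets[i + 1:])
        }
    return guesses


print(guess_tags("the dog sees the blick".split()))  # {4: {'N'}}
```

When exactly one tag survives for a word, that word and tag could be added to the lexicon, which is one way the labeling could get easier as the lexicon grows.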

As of summer 2019, I'm working on the morphological side of this project. A partner and I are attempting to apply John Goldsmith and Eric Rosen's geometrical morphology model (2018) to Ladin morphology.





