Nikolaus P. Himmelmann
Department of Linguistics - General Linguistics, University of Cologne
"Lexicon Discovery in Austronesian using Unsupervised Word Segmentation with Pitman-Yor Language Models - And What it Tells us About the Structure of Words"
In this paper, we report on experiments applying an unsupervised word segmentation algorithm based on a nested Pitman-Yor language model to two Austronesian languages, Wooi and Waima’a. We obtained a lexicon precision of 69.2% for Wooi and 67.5% for Waima’a when single-letter words and words found fewer than three times were discarded. A comparison with an English word segmentation task showed comparable performance. This supports the view that the assumptions underlying the Pitman-Yor language model, namely the universality of Zipf’s law and the power of n-gram structures, also hold for languages such as Wooi and Waima’a, whose structures differ significantly from those of Standard European languages. Importantly, from a linguistic point of view, the areas where the algorithm runs into segmentation problems are not random but correspond to linguistically well-known problems of word segmentation. The presentation will focus on problems relating to clitic elements, which by definition have an inherently ambiguous word status.
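As an illustration of the evaluation measure mentioned above, the following minimal sketch computes a lexicon precision after discarding single-letter words and words found fewer than three times. It assumes the common definition of lexicon precision as the proportion of discovered word types that also occur in a reference lexicon; the abstract does not spell out the definition, and the function name, parameter names, and toy data are hypothetical, not taken from the experiments.

```python
from collections import Counter


def lexicon_precision(segmented_tokens, reference_lexicon,
                      min_count=3, min_length=2):
    """Share of discovered word types that occur in the reference lexicon.

    Assumed definition (not stated in the abstract): a word type counts
    as discovered only if it has at least `min_length` characters and
    was found at least `min_count` times in the segmenter output.
    """
    counts = Counter(segmented_tokens)
    discovered = {
        word for word, count in counts.items()
        if count >= min_count and len(word) >= min_length
    }
    if not discovered:
        return 0.0
    correct = discovered & set(reference_lexicon)
    return len(correct) / len(discovered)


# Toy segmenter output (hypothetical, English for readability).
tokens = ["the"] * 3 + ["dog"] * 3 + ["a"] * 5 + ["thed"] * 3 + ["og"] * 2
gold = {"the", "dog", "a"}

precision = lexicon_precision(tokens, gold)
print(f"lexicon precision: {precision:.3f}")
```

In this toy run, "a" is dropped as a single-letter word and "og" as too infrequent, so three word types remain, two of which are in the reference lexicon.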
(Talk together with Reinhold Häb-Umbach)
Research Interests: Language typology and language universals, grammaticalization, prosody and grammar, language documentation and description (linguistic fieldwork)