Department of Communications Engineering, University of Paderborn, Germany
"Lexicon Discovery in Austronesian using Unsupervised Word Segmentation with Pitman-Yor Language Models - And What it Tells us About the Structure of Words"
In this paper, we report on experiments applying an unsupervised word segmentation algorithm based on a nested Pitman-Yor language model to two Austronesian languages, Wooi and Waima’a. We obtained a lexicon precision of 69.2% for Wooi and 67.5% for Waima’a when single-letter words and words found fewer than three times were discarded. A comparison with an English word segmentation task showed comparable performance, verifying that the assumptions underlying the Pitman-Yor language model, namely the universality of Zipf’s law and the power of n-gram structures, also hold for languages such as Wooi and Waima’a, whose structures differ significantly from those of Standard European languages. Importantly, from a linguistic point of view, the areas where the algorithm runs into segmentation problems are not random but correspond to linguistically well-known problems of word segmentation. The presentation will focus on problems relating to clitic elements, which by definition have an inherently ambiguous word status.
(Talk together with Nikolaus P. Himmelmann)
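As an illustration of the lexicon precision figure quoted in the abstract, the following is a minimal sketch of how such a score could be computed after discarding single-letter words and words found fewer than three times. The function name `lexicon_precision`, its parameters, and the toy tokens are assumptions made for this example only; they are not taken from the talk, and the actual evaluation protocol may differ.

```python
from collections import Counter


def lexicon_precision(segmented_tokens, reference_lexicon,
                      min_count=3, min_length=2):
    """Fraction of discovered word types that appear in a reference lexicon.

    Hypothetical helper: word types seen fewer than `min_count` times or
    shorter than `min_length` characters are discarded before scoring,
    mirroring the filtering described in the abstract.
    """
    counts = Counter(segmented_tokens)
    discovered = {
        word for word, count in counts.items()
        if count >= min_count and len(word) >= min_length
    }
    if not discovered:
        return 0.0
    correct = discovered & set(reference_lexicon)
    return len(correct) / len(discovered)


# Toy data, invented for illustration (not Wooi or Waima'a material).
tokens = ["ma"] * 3 + ["ri"] * 4 + ["a"] * 10 + ["tok"] * 2 + ["wai"] * 3
reference = {"ma", "wai", "tok"}

precision = lexicon_precision(tokens, reference)
print(f"lexicon precision: {precision:.3f}")
```

In this toy run, "a" is dropped as a single-letter word and "tok" because it occurs only twice, so the score is computed over the three remaining types, two of which are in the reference lexicon.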
Research interests: statistical signal processing, pattern recognition, and machine learning, predominantly with applications in speech processing. Examples of recent work include the development of blind speech separation and acoustic beamforming algorithms, investigations into robust automatic speech recognition, and the use of unsupervised learning techniques to build speech processing systems that do not require labeled training data.