Forschungszentrum Deutscher Sprachatlas, University of Marburg
"Three of a Kind? Multialignment of Aminoacids, Sounds and Words"
In recent years, various linguists have realized that the multialignment of amino acids, one of the central methodological advancements of molecular biology, is highly similar to the establishment of sound correspondences in historical-comparative linguistics. Consequently, the methods developed in bioinformatics to produce multialignments can very profitably be used in historical linguistics as well. A further area in which multialignment can be of use in linguistics is morphosyntactic language comparison. Specifically, by using multialigned massively parallel texts,* it is possible to induce functional-typological parameters. Within linguistics, the levels of sound and morphosyntax are known to have a crucially different status ("duality of patterning"), which makes these three different kinds of multialignment an interesting triad for reflecting on the further generalization of the method of multialignment.
First, these three kinds of strings (consisting of amino acids, sounds or words) have a rather different information structure. The central observation is that there are just twenty standard amino acids (and only four nucleotides), while each language has dozens of sounds and hundreds to thousands of words. This trivial observation has various implications for the establishment of a multialignment. For example, the estimation of substitution models is much more difficult with a higher number of entities, the more so because in linguistics the entities have a strongly skewed ("Zipfian") distribution. Further, a single amino acid has less self-information than a single sound, which in turn has less self-information than a single word. This is mirrored by the reliance on ordering, which is strongest with amino acids and weakest with words. In contrast, the higher self-information makes it easier to estimate substitution models from data.
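The relation between inventory size and self-information can be made concrete with a small calculation. The following Python sketch is only an illustration: the inventory sizes and the Zipf exponent are assumptions chosen for the example, not figures from the talk. It compares the self-information of a symbol under a uniform distribution with the mean and the rarest-item self-information under a Zipfian distribution.

```python
import math


def self_information(p):
    """Self-information (surprisal) in bits of an event with probability p."""
    return -math.log2(p)


def zipf_distribution(n, s=1.0):
    """Probabilities proportional to 1 / rank**s for ranks 1..n (s is assumed)."""
    weights = [1.0 / rank ** s for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Expected self-information (Shannon entropy) in bits."""
    return sum(p * self_information(p) for p in probs if p > 0)


# Hypothetical inventory sizes, chosen only for illustration.
inventories = {"nucleotides": 4, "amino acids": 20, "sounds": 40, "words": 5000}

for name, size in inventories.items():
    uniform_bits = self_information(1.0 / size)
    zipf = zipf_distribution(size)
    print(f"{name:12s} uniform: {uniform_bits:5.2f} bits  "
          f"zipfian mean: {entropy(zipf):5.2f} bits  "
          f"rarest item: {self_information(zipf[-1]):5.2f} bits")
```

Under these assumptions, larger inventories carry more information per symbol, while a skewed distribution lowers the average and makes rare items far more informative than frequent ones. Rare items are also the ones for which substitution counts are sparsest.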
There is a second crucial difference between these three kinds of strings, namely their cross-taxa comparability. Amino acids are (almost) identical across all life forms, which simplifies comparison substantially. Words obviously have different forms across languages (except for the few homologous words, i.e. cognates and loanwords), which makes comparison much more difficult. Sounds represent an intermediate situation, as sounds differ between languages (like words, i.e. the "phonemic" approach), but they can be mapped to a universal structure (like amino acids, i.e. the "phonetic" approach).
As a generalization over these three different, but similar, kinds of multialignment, I will propose to consider multialignment as constrained partitioning, in which each cluster of the partitioning is a column of the multialignment. The constraints can formulate different preferences on the clustering purity.
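One possible reading of this proposal can be sketched in Python. The sketch is an assumption-laden illustration, not the method presented in the talk: the function names, the two constraints (at most one element per string in a cluster, and no crossing order within a string) and the simple purity measure are all hypothetical choices made for the example.

```python
from collections import Counter

GAP = "-"


def columns_as_partition(rows):
    """Turn gapped rows of equal length into clusters of (row, position) pairs."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    positions = [0] * len(rows)  # next ungapped position in each row
    clusters = []
    for col in range(len(rows[0])):
        cluster = []
        for i, row in enumerate(rows):
            if row[col] != GAP:
                cluster.append((i, positions[i]))
                positions[i] += 1
        if cluster:  # skip columns that consist of gaps only
            clusters.append(cluster)
    return clusters


def respects_constraints(clusters):
    """Check an ordered list of clusters: one element per row per cluster,
    and positions within each row strictly increasing (no crossing)."""
    last_seen = {}
    for cluster in clusters:
        row_ids = [i for i, _ in cluster]
        if len(row_ids) != len(set(row_ids)):
            return False
        for i, pos in cluster:
            if pos <= last_seen.get(i, -1):
                return False
            last_seen[i] = pos
    return True


def purity(cluster, sequences):
    """Share of the most frequent symbol among the members of one cluster."""
    symbols = [sequences[i][pos] for i, pos in cluster]
    return Counter(symbols).most_common(1)[0][1] / len(symbols)


# Toy strings, invented for illustration only.
rows = ["ab-d", "abcd", "a-cd"]
sequences = [r.replace(GAP, "") for r in rows]
clusters = columns_as_partition(rows)
print(clusters)
print(respects_constraints(clusters))
print([purity(c, sequences) for c in clusters])
```

In this view, gaps are not separate objects: they are simply the strings that contribute no element to a given cluster. Different kinds of multialignment would then differ in which constraints are enforced and in how strongly impure clusters are penalized.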
* Massively parallel texts are the same content expressed in very many different languages, either collected through translation or through non-linguistic experimental stimuli.
Research Interest: Michael Cysouw’s research focuses on quantitative approaches to investigating worldwide linguistic diversity. He approaches the variation between languages from both a diachronic and a functional synchronic perspective. His main goal is to combine the in-depth knowledge of the philological linguistic tradition with modern computational possibilities. The major challenge is to develop the philological linguistic tradition into a discipline that can profitably work with the large (and growing) amount of available data.