Turkish Morphology

A two-level morphological analyzer for Turkish.

This is not an official Google product.

Components

This implementation is composed of three layers:

The first level of the morphological analysis is implemented by the morphophonemic model, which takes a Turkish word and transforms it into an intermediate representation. Its output is every possible hypothesis of a word stem annotated with morphophonemic irregularities, followed by the meta-morphemes that correspond to the suffixes realized in the surface form.

Input: affında
Output: af"+SH+NDA

Lexicon entries and morphotactic FST definitions are composed and compiled into a single FST, which acts as the second level of the morphological analysis: the morphotactic model. The morphotactic model takes the intermediate tape as input and transforms it into all human-readable morphological analyses that can be generated from the first-level hypotheses.

Input: af"+SH+NDA
Output: (af[NN]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]

See the Interpreting Human-Readable Morphological Analysis section for a description of these human-readable morphological analyses.
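The way the two levels chain together can be sketched in a few lines of Python. This is a toy illustration only: the real analyzer uses compiled FSTs, whereas the sketch below uses hypothetical lookup tables (`MORPHOPHONEMICS`, `MORPHOTACTICS`) and a hypothetical function name (`analyze_toy`), holding nothing but the single example above.

```python
# Toy stand-ins for the two FST levels (assumption: plain dictionaries,
# populated only with the example from the text).
MORPHOPHONEMICS = {
    # surface form -> intermediate hypotheses
    "affında": ['af"+SH+NDA'],
}
MORPHOTACTICS = {
    # intermediate hypothesis -> human-readable analyses
    'af"+SH+NDA': [
        "(af[NN]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])"
        "+[Proper=False]",
    ],
}


def analyze_toy(word):
    """Chains the two levels: surface form -> intermediate -> analyses."""
    analyses = []
    for hypothesis in MORPHOPHONEMICS.get(word, []):
        # Every first-level hypothesis is fed to the second level.
        analyses.extend(MORPHOTACTICS.get(hypothesis, []))
    return analyses


for analysis in analyze_toy("affında"):
    print(analysis)
```

A word with no first-level hypothesis yields an empty list, mirroring the empty result the analyzer gives for strings it does not accept.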

How to Parse Words

To morphologically parse a word, run the command below from the project root directory.

bazel run -c opt scripts:print_analyses -- --word=[WORD_TO_PARSE]

This parses the input word with the two-level morphological analyzer and outputs a set of human-readable morphological analyses, for example:

bazel run -c opt scripts:print_analyses -- --word=geldiğinde
> Morphological analyses for the word 'geldiğinde':
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3pl])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([NOMP]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc]+[Copula=PresCop]+[PersonNumber=V3sg])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+Hn[Possessive=P2sg]+NDA[Case=Loc])+[Proper=True]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]
> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=True]

If the input string is not accepted as a Turkish word, the morphological analyzer outputs an empty result.

bazel run -c opt scripts:print_analyses -- --word=foo
> 'foo' is not accepted as a Turkish word
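If you want to consume the command-line output from a script, a minimal Python sketch could look like the following. It assumes the output has the shape shown in the samples above (an optional leading ">" and one analysis per line, each starting with an opening parenthesis); `run_print_analyses` and `extract_analyses` are hypothetical helper names, not part of this repository.

```python
import subprocess


def run_print_analyses(word):
    """Runs the documented Bazel command and returns its standard output."""
    command = ["bazel", "run", "-c", "opt", "scripts:print_analyses",
               "--", f"--word={word}"]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def extract_analyses(output):
    """Returns the analysis strings found in print_analyses output text."""
    analyses = []
    for line in output.splitlines():
        line = line.strip()
        # Drop the leading ">" shown in the sample output, if present.
        if line.startswith(">"):
            line = line[1:].strip()
        # Assumption: analysis lines always start with "(", as in the samples.
        if line.startswith("("):
            analyses.append(line)
    return analyses


sample = (
    "> Morphological analyses for the word 'geldiğinde':\n"
    "> (gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]"
    "+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]\n"
)
print(extract_analyses(sample))
```

With this shape, an unaccepted word such as foo produces an empty list, because its message line does not start with a parenthesis.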

Interpreting Human-Readable Morphological Analysis

An example human-readable morphological analysis is as follows:

Input Word (evlerindekilerin = those that belong to the ones in their homes):

bazel run -c opt scripts:print_analyses -- --word=evlerindekilerin

Sample Output Morphological Analysis String:

(ev[NN]+[PersonNumber=A3sg]+lArH[Possessive=P3pl]+NDA[Case=Loc])([PRF]-ki[Derivation=Pron]+lAr[PersonNumber=A3pl]+[Possessive=Pnon]+NHn[Case=Gen])+[Proper=False]

Human-readable morphological analyses can be decomposed into parts:
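Judging from the samples in this document, each parenthesized group starts with an optional root and a part-of-speech tag in brackets, followed by affixes of the form meta-morpheme[Category=Value]. A "-" appears to introduce a derivation and a "+" an inflection, and the string ends with a [Proper=...] feature. The sketch below splits an analysis along those lines. It is an approximation built on these assumptions, not the repository's own parser; `split_analysis` and the dictionary keys are hypothetical names.

```python
import re

_GROUP = re.compile(r"\(([^()]*)\)")                 # one parenthesized group
_HEAD = re.compile(r"([^\[]*)\[([^\]=]+)\]")         # root and POS tag
_AFFIX = re.compile(r"([+-])([^\[\]]*)\[(\w+)=(\w+)\]")
_PROPER = re.compile(r"\+\[Proper=(True|False)\]$")


def split_analysis(analysis):
    """Splits a human-readable analysis into groups, affixes and Proper."""
    groups = []
    for body in _GROUP.findall(analysis):
        head = _HEAD.match(body)
        if head is None:
            raise ValueError(f"group has no part-of-speech tag: {body!r}")
        affixes = [
            {
                # Assumption: "-" marks a derivation, "+" an inflection.
                "kind": "derivation" if separator == "-" else "inflection",
                "meta_morpheme": meta_morpheme,
                "category": category,
                "value": value,
            }
            for separator, meta_morpheme, category, value
            in _AFFIX.findall(body[head.end():])
        ]
        groups.append(
            {"root": head.group(1), "pos": head.group(2), "affixes": affixes})
    proper = _PROPER.search(analysis)
    return {
        "groups": groups,
        "proper": (proper.group(1) == "True") if proper else None,
    }


# One of the analyses listed earlier for the word 'geldiğinde'.
parts = split_analysis(
    "(gel[VB]+[Polarity=Pos])([VN]-DHk[Derivation=PastNom]"
    "+[PersonNumber=A3sg]+SH[Possessive=P3sg]+NDA[Case=Loc])+[Proper=False]")
for group in parts["groups"]:
    print(group["root"], group["pos"], len(group["affixes"]))
```

An affix with an empty meta-morpheme, such as +[PersonNumber=A3sg], is kept with an empty string, since it carries a feature that has no surface realization.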

Python API

We also provide a Python API that can be used to morphologically analyze Turkish words, generate Turkish word forms from morphological analyses, parse human-readable morphological analyses into protobuf messages, validate their structural well-formedness, and generate human-readable analyses from them. You can see some example use cases in //examples.

If you are using Bazel, you can depend on this repository as an external dependency of your project by adding the following to your WORKSPACE file:

git_repository(
  name = "google_research_turkish_morphology",
  remote = "https://github.com/google-research/turkish-morphology.git",
  tag = "{version-tag}",
)

Then, you can use @google_research_turkish_morphology//turkish_morphology:analyze (or other modules of the API) as a dependency of your relevant py_library or py_binary BUILD targets.

The API is also available on PyPI. To install the latest release from PyPI, run:

python3 -m pip install turkish-morphology

To install from source, run the commands below from the project root directory (preferably within a Python virtual environment):

bazel build //...
bazel-bin/setup install

Requirements

To build and run the morphological analyzer, install Bazel version 5.0.0 and Python 3.9. All other dependencies are imported and built by Bazel according to the WORKSPACE setup during the first invocation of the morphological analyzer. If you are installing from PyPI, you need pip.

Citing

If you use or discuss the code, data or tools from this repository in your work, please cite:

Öztürel, A., Kayadelen, T. & Demirşahin, I. (2019, September). A syntactically expressive morphological analyzer for Turkish. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing (pp. 65-75).

@inproceedings{ozturel-etal-2019-syntactically,
    title = "A Syntactically Expressive Morphological Analyzer for Turkish",
    author = "\"{O}zt\"{u}rel, Adnan and Kayadelen, Tolga and Demir\c{s}ahin,
        I\c{s}{\i}n",
    booktitle = "Proceedings of the 14th International Conference on Finite-State
        Methods and Natural Language Processing",
    month = "23--25" # sep,
    year = "2019",
    address = "Dresden, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-3110",
    pages = "65--75",
}

License

Unless otherwise noted, all original files are licensed under the Apache License, Version 2.0.