Using Morphosemantic Information in Construction of a Pilot Lexical Semantic Resource for Turkish

Morphological units carry a vast amount of semantic information in languages with rich inflectional and derivational morphology. In this paper we show how morphosemantic information available for morphologically rich languages can be used to reduce manual effort in creating semantic resources like PropBank and VerbNet, and to increase the performance of word sense disambiguation, semantic role labeling and related tasks. We test the consistency of these features in a pilot study for Turkish and show that 1) case markers are related to semantic roles and 2) morphemes that change the valency of the verb follow a predictable pattern.


Introduction
In recent years, a considerable amount of research has been performed on extracting semantic information from sentences. Revealing such information is usually achieved by identifying the complements (arguments) of a predicate and assigning meaningful labels to them. Each label represents the argument's relation to its predicate and is referred to as a semantic role; the task itself is called semantic role labeling (SRL). There exist some comprehensive semantically interpreted corpora such as FrameNet and PropBank. These corpora, annotated with semantic roles, help researchers to specify SRL as a task and are also used as training and test data for supervised machine learning methods (Giuglea and Moschitti, 2006). These resources differ in the type of semantic roles they use and the type of additional information they provide.
FrameNet (FN) is a semantic network built around the theory of semantic frames. A semantic frame describes a type of event, relation, or entity together with its participants, which are called frame elements (FEs). All predicates in the same semantic frame share one set of FEs. A sample sentence annotated with FrameNet, VerbNet and PropBank conventions, respectively, is given in Ex. 1. The predicate "buy" belongs to the "Commerce_buy" frame, and more generally to the "Commercial_transaction" frame of FrameNet, which contains "Buyer" and "Goods" as core frame elements and "Seller" as a non-core frame element, as in Ex. 1. FN also provides connections between semantic frames such as inheritance, hierarchy and causativity. For example, the frame "Commerce_buy" is connected to the "Importing" and "Shopping" frames with the "used by" relation. In contrast to FN, VerbNet (VN) is a hierarchical verb lexicon that contains categories of verbs based on Levin's verb classification (Schuler, 2006). The predicate "buy" is contained in the "get-13.5.1" class of VN, along with the verbs "pick", "reserve" and "book". Members of the same verb class share the same set of semantic roles, referred to as thematic roles. In addition to thematic roles, each verb class is defined with its different possible syntaxes. One possible syntax for the class "get-13.5.1" is given in the second line of Ex. 1. Unlike FrameNet and VerbNet, PropBank (PB) (Palmer et al., 2005) does not make use of a reference ontology like semantic frames or verb classes. Instead, semantic roles are numbered from Arg0 to Arg5 for the core arguments.
There is no VerbNet, PropBank or similar semantically interpretable resource for Turkish (except for WordNet). Moreover, the only available morphologically and syntactically annotated treebank corpus, the METU-Sabanci Dependency Treebank (Eryigit et al., 2011; Atalay et al., 2003), has only about 5600 sentences, which presumably gives low coverage of Turkish verbs. VerbNet defines possible syntaxes for each class of verbs. However, due to free word order and an extensive case marking system, syntactic information is already encoded with case markers in Turkish. Thus the structure of VerbNet does not fit the Turkish language well. PropBank simplifies semantic roles, but defines neither relations between verbs nor all possible syntaxes for each verb. Moreover, only Arg0 and Arg1 are associated with a specific semantic content, which reduces the consistency among labeled arguments. Due to the lack of a large-scale treebank corpus, building a high-coverage PropBank is currently not possible for Turkish. FrameNet defines richer relations between verbs, but its frame elements are extremely fine-grained, and building such a comprehensive resource requires a great amount of manual work for which human resources are not currently available for Turkish.
In this paper, we discuss how the semantic information supplied by morphemes, referred to as morphosemantics, can be included in the construction of semantic resources for languages with fewer resources and rich morphologies, like Turkish. We try to show that we can decrease the manual effort of building such banks and increase the consistency and connectivity of the resource by exploiting the derivational morphology of verbs; eliminate mapping costs by associating syntactic information with semantic roles; and increase the performance of SRL and word sense disambiguation by directly using the morphosemantic information supplied by inflectional morphology. We then perform a pilot study to build a lexical semantic resource that contains syntactic information as well as semantic information defined by semantic roles in both the VerbNet and PropBank fashion, by exploiting morphological properties of the Turkish language.

Related Work
In the studies by Agirre et al. (2006) and Aldezabal et al. (2010), the authors discuss the suitability of the PropBank model for Basque verbs. In addition to semantic role information, the case markers that are related to these roles are also included in the verb frames. It is stated that including case markers in the Basque PropBank as a morphosemantic feature is useful for automatic tagging of semantic roles for Basque, which has 11 case markers. Hawwari et al. (2013) present a pilot study for building the Arabic Morphological Pattern Net, which aims at representing a direct relationship between morphological patterns and semantic roles for Arabic. The authors experiment with 10 different patterns and 2100 verb frames and analyze the structure and behavior of these Arabic verbs. They state that the results encourage a more comprehensive study. The SRL system for Arabic (Diab et al., 2008) and the light-verb detection system for Hungarian (Vincze et al., 2013) also benefited from the relation between case markers and semantic roles.
Furthermore, there are studies on exploiting morphosemantics in WordNets for different languages. Fellbaum et al. (2007) manually inspect WordNet's verb-noun pairs to find one-to-one mappings between an affix and a semantic role for English. For example, nouns derived from verbs with the suffixes -er and -or, as in invent-inventor, usually denote the agent of the event. However, it is stated that only two thirds of the pairs with this pattern could be classified as agents of the events. More patterns are examined, and the regularity of these patterns is shown to be low for English. In another work, the authors propose a methodology for exploiting morphosemantic information in languages where the morphemes are more regular. They perform a case study on Turkish and propose application areas both monolingually and multilingually, such as globally enriching WordNets and automatically detecting errors in WordNets. In a similar work (Mititelu, 2012), morphosemantic information is added to the Romanian WordNet, and the proposed application areas are examined and shown to be feasible.
Previous studies on the Basque PropBank focus on the building process rather than on analyzing the regularity of case markers and the relation between semantic roles and case markers. Furthermore, the study on the Arabic Morphological Pattern Net aims to build a separate dataset and map it to other resources such as Arabic VerbNet, WordNet and PropBank. WordNet has rich cross-language morphosemantic links; however, it does not list all arguments of predicates, so its structure is not convenient for NLP tasks like semantic role labeling. These studies make use of either case markers or the derivational morphology of verbs, not both. Moreover, some of them require extra mapping resources and some are difficult to utilize for the semantic interpretation of sentences. Most importantly, none of these studies investigates the Turkish language. To the best of our knowledge, this is the first attempt to build such a lexical semantic resource for Turkish and to perform experiments on data to expose the relationship between semantic roles and the morphemes known as case markers and valency changers in Turkish.

Morphosemantic Features
In morphologically rich languages, the meaning of a word is strongly determined by the morphemes that are attached to it. Some of these morphemes always add a predefined meaning, while the contribution of others varies depending on the language. However, only regular features can be used for NLP tasks that require automatic semantic interpretation. Here, we identify two multilingual morphosemantic features, case markers and verb valency changing morphemes, and analyze the regularity and usability of these features for Turkish.

Declension and Case Marking
Declension is a term used to express the inflection of nouns, pronouns, adjectives and articles for gender, number and case. It occurs in many languages such as Arabic, Basque, Sanskrit, Finnish, Hungarian, Latin, Russian and Turkish. The statistics compiled by Iggesen (2013), given in Table 1, show that there are 86 languages with at least 5 case markings.

Number of cases      Number of languages
2 cases              23 languages
3 cases              9 languages
4 cases              9 languages
5-7 cases            39 languages
8-9 cases            23 languages
10 or more cases     24 languages
Table 1: Case marking across languages

An exemplary morphological analysis for the Turkish word evlerinde "in his houses" is given in Ex. 2. In this analysis, ev is inflected with the ler morpheme for plurality, i for third person singular possessive and (n)de for locative (LOC) case.

Ex. 2   ev (-ler) (-i) (-nde)
        ev +Noun +Pl +P3s +LOC

Even though the languages differ, the same case markers are used to express similar meanings, with some variation. To exemplify this statement, sentences with similar meanings and the same case markers are given in Table 2 for Turkish and Hungarian, both of which have rich case marking systems.

[Table 2: Turkish and Hungarian sentences with similar meanings and the same case markers. English glosses: "The hunter saw the rabbit.", "Jack went to school.", "I live in Ankara.", "I came from my mother."]

The relation between semantic roles and case markers can assist researchers in several ways. Among others, it
• can supply prior information for disambiguating word senses,
• can be used in language generation as follows: once the predicate and the sense are determined, the arguments can directly be inflected with the case markers associated with their roles.

Valency Changing Morphemes
The valency of a verb can be defined as the verb's ability to govern a particular number of arguments of a particular type. "In Turkish, verb stems govern relatively stable valency patterns or prototypical argument frames" as stated by Haig (1998). Consider the root verb giy "to wear". One can derive new verbs from this root, such as giy-in "to get dressed", giy-dir "to dress someone" and giy-il "to be worn". The derived verbs are referred to as verb stems and these special suffixes are referred to as valency changing morphemes. Some advantages of valency changing morphemes are:
• They exist for many languages.
• They are regular, easy to model and morphological analyzers available for such languages can analyze the valency of the verb stem.
• They are directly related to the number and type of the arguments, which are important for SRL related tasks.
By modeling the semantic role transformation from verb root to verb stem, we can automatically identify the argument configuration of a new verb stem, given the correct morphological analysis. Framing only the verb roots is then sufficient to obtain the frames of all verb stems derived from those roots. This speeds up the process of building a semantic resource, automates part of it and reduces human error. In this section we present a pilot study for some of the valencies available in Turkish. For the sake of simplicity, argument labeling in the PropBank fashion is used instead of thematic roles.

Reflexive
As the name suggests, in reflexive verbs the action defined by the verb has its effect directly on the person or thing that performs the action (Hengirmen, 2002). The reflexive suffix triggers the suppression of one of the arguments. The observed argument shift is given in Fig. 1, and some interesting reflexive Turkish verbs are given in Table 3, such as besle "to feed" and besle-n "to eat, to feed oneself".

Reciprocal
Reciprocal verbs express actions done by more than one subject. The action may be done together or against each other. Reciprocal verbs may have a plural agent, or two or more singular co-agents conjoined, where one of them is marked with COM case as shown in Fig. 2. In both cases, the suppression of one of the arguments of the root verb is triggered. We have observed that the suppressed argument may be in different roles (patient, theme, stimulus, experiencer, co-patient), but usually appears as Arg1 and rarely as Arg2. In Table 4, a small list of reciprocal verbs is given. Some semantic links are easy to see, whereas the link between döv "to beat" and döv-üş "to fight" is not that explicit.

Root               Stem
küs (to offend)    küs-üş (to get cross with each other)
öde (to pay)       öde-ş (to get even with each other)
öp (to kiss)       öp-üş (to kiss each other)
sev (to love)      sev-iş (to make love with each other)
döv (to beat)      döv-üş (to fight with each other)
tanı (to know)     tanı-ş (to get to know each other)

Causative
The causative is the most common valence-changing category in Bybee's (1985) world-wide sample of 50 languages. Contrary to the other morphemes, the causative morpheme introduces a new argument, called the causer, to the valence pattern. In most languages, only intransitive verbs are causativized (Haspelmath and Bardey, 1991). In this case, as shown in Fig. 3, the causee becomes the patient of the causation event. In other words, the central argument of the root verb (Arg0 if it exists, otherwise Arg1) is marked with ACC case and becomes an internal argument (usually Arg1) of the new causative verb. Some languages can form causatives from transitive verbs too; however, the role and the marking of the causee may differ across languages. In languages where the causee becomes an indirect object, like Turkish and Georgian, the central argument (Arg0) of the root verb receives the DAT case marker in the derived verb stem and serves as an indirect object (usually Arg2), while Arg1 again serves as Arg1. This pattern for transitive verbs is also given in Fig. 3. Some implicit relations exist in Table 5, such as öl "to die" and öl-dür "to kill" (to cause someone to die). The intransitive verb gül "to laugh" and the transitive verb giy "to wear" are causativized as follows: [Kız] A0 gül-üyor.
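The causative pattern described above can be sketched as a small frame transformation. This is a minimal illustration rather than the implementation used in the study: the dictionary representation of a frame, the function name causativize, and the NOM marking of the newly introduced causer are assumptions made for the example, and root frames are assumed to contain only Arg0 and/or Arg1.

```python
from typing import Dict

# Hypothetical representation: a frame maps a PropBank-style label
# to the case marker carried by that argument.
Frame = Dict[str, str]


def causativize(root_frame: Frame) -> Frame:
    """Derive the frame of a causative stem from the frame of its root.

    Assumption: the root frame only contains Arg0 and/or Arg1.
    """
    if "Arg0" not in root_frame and "Arg1" not in root_frame:
        raise ValueError("root frame needs at least Arg0 or Arg1")

    # The causer is the newly introduced argument (NOM is an assumption).
    stem_frame: Frame = {"Arg0": "NOM"}

    if "Arg0" in root_frame and "Arg1" in root_frame:
        # Transitive root: the causee (old Arg0) becomes a DAT-marked
        # indirect object (Arg2), while Arg1 stays Arg1.
        stem_frame["Arg2"] = "DAT"
        stem_frame["Arg1"] = root_frame["Arg1"]
    else:
        # Intransitive root: the central argument (Arg0 if present,
        # otherwise Arg1) becomes an ACC-marked Arg1.
        stem_frame["Arg1"] = "ACC"
    return stem_frame


# gül "to laugh" (intransitive) and giy "to wear" (transitive);
# the ACC marking of giy's Arg1 is assumed for illustration.
gul_dur = causativize({"Arg0": "NOM"})
giy_dir = causativize({"Arg0": "NOM", "Arg1": "ACC"})
```

Keeping the rule as a pure function of the root frame mirrors the idea that only root verbs need to be framed by hand; the reflexive and reciprocal patterns could be written in the same style once the suppressed argument is specified.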

Application Areas
Semantic Role Labeling (SRL)
The semantic role labeling task is to identify the predicates and their arguments in a sentence, and then to assign the correct semantic roles to the identified arguments. In Table 6, English sentences with different syntactic realizations and their translations into Turkish are given, along with thematic roles annotated in the VN convention. In the second column, all words written in bold represent the arguments in destination roles. English sentences do not exhibit a common syntax for the destination role; different prepositions such as into, at and onto precede the argument. In the Turkish sentences, however, these arguments are always marked with dative case. Similarly, in the last column of Table 6, source and initial location roles are emphasized. Again, it is hard to find a distinguishing feature that reveals these roles in the English sentences. There may be different prepositions such as out of or from, or no preposition at all, before the argument in one of these roles, whereas in the Turkish sentences these arguments are naturally marked with ablative case.

Lang    Destination                                        Source
#1.En   [She]Ag loaded [boxes]Th into [the wagon]Dest.     [He]Ag backed out of [the trip]Sou.
#1.Tr   [Kutuları]Th [vagon-a]Dest-DAT yükledi.

A subtask of automatic semantic role labeling is determining which features to extract from semantically annotated corpora. In recent studies, the argument's position relative to the predicate (before, after) and the voice of the sentence (passive, active) were tested as features for automatic SRL (Wu, 2013). However, many candidate features exist, and finding the best ones requires feature engineering and, again, extra time. These toy examples suggest that there may be a correlation between case markers and semantic roles. If that is the case, the SRL task can be reduced to a predicate and argument identification task, since the labeling can be done automatically or semi-automatically by using case markers as features.

Word Sense Disambiguation
The task of finding the meaning of a word in a given context is called word sense disambiguation. In Table 7, three senses of the Turkish verb lemma ayır and their arguments with case markers are given. The arguments are marked with ACC and DAT in the first sense, with ABL and NOM in the second, and with ACC and ABL in the third. The second and the third senses are similar. The action of reserving is usually performed on an indefinite object, which usually appears in NOM form, whereas separating is applied to a definite object, which is usually marked with ACC case. After the arguments are identified, one can easily detect the sense of the verb ayır by looking at the case markings of its arguments.
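The ayır example can be illustrated with a simple lookup from the set of observed case markers to a sense. This is only a sketch under stated assumptions: the sense identifiers (ayır.01 to ayır.03), the table name AYIR_SENSES and the function disambiguate are hypothetical, and the input is assumed to be the case markers of the already identified arguments as listed in Table 7.

```python
from typing import Dict, FrozenSet, Iterable, Optional

# Case-marker signatures of the three senses of ayır described in the text.
# The sense identifiers are hypothetical labels chosen for this sketch.
AYIR_SENSES: Dict[FrozenSet[str], str] = {
    frozenset({"ACC", "DAT"}): "ayır.01",
    frozenset({"ABL", "NOM"}): "ayır.02",  # the "reserve" sense
    frozenset({"ACC", "ABL"}): "ayır.03",  # the "separate" sense
}


def disambiguate(case_markers: Iterable[str]) -> Optional[str]:
    """Return the sense whose case-marker signature matches exactly.

    Returns None when the observed markers match no known signature.
    """
    return AYIR_SENSES.get(frozenset(case_markers))


sense = disambiguate(["ABL", "NOM"])
```

An exact-match lookup is the simplest possible use of this information; in practice the case markers would more plausibly serve as prior evidence combined with other features, since arguments can be dropped or marked irregularly.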

Methodology
We have performed a feasibility study for using morphosemantic features in building a lexical semantic resource for Turkish. As discussed in Section 3.2, we assume we can automatically frame a verb (e.g. sakla-n, reflexive) that is derived with a regular valency changing morpheme (e.g. -n), if the argument configuration of the root verb (e.g. sakla) is known. Hence, we have only framed root verbs. We have framed 233 root verbs and 452 verb senses. We have calculated the total number of valence changing morphemes as 425. This means that 425 verbs can be automatically framed by applying the valency patterns to the 233 root verbs. In this analysis we have only considered one sense of each verb, since there may be cases where a valency changing morpheme cannot be applied to another sense of the verb.
As discussed before, Turkish is not a resource-rich language in terms of computational resources. The Turkish Language Association (TDK) is a trustworthy source for lexical datasets and dictionaries. To run this pilot study, we have used the list of Turkish root verbs provided by TDK and the TNC corpus (footnote 4). The interface built for searching the TNC corpus makes it possible to see all sentences that contain the verb the user is searching for (Aksan and Aksan, 2012). The senses of the verbs and the case marking of their arguments are decided by manually investigating the sentences that appear in the search results of the TNC corpus. Then, the arguments of the predicates are labeled with VerbNet thematic roles and PropBank argument numbers by checking the English equivalent of the Turkish verb sense. This process is repeated for all verb senses.
For framing purposes, we have adapted an already available open source software, Cornerstone (Choi et al., 2010) (footnote 5). To supply the case marking information of an argument, a drop-down menu containing the six possible case markers in Turkish is added, as shown in Fig. 4a. Finally, another drop-down menu that contains all possible suffixes that a Turkish verb can have is added, as shown in Fig. 4b. Theoretically, the number of possible derivations may be infinite for some Turkish verbs, due to the rich generative property of the language. In practice, however, the average number of inflectional groups in a word is less than two. TDK provides a lexicon (footnote 6) of widely used verb stems derived from root verbs by a valency changing morpheme. To avoid framing a nonexistent verb, we have used the simple interface shown in Fig. 4b to enter only the stems given by TDK. An example with the Turkish verb bin "to ride" is given in Fig. 4b. The first line defines that one can generate the stem bin-il "to be ridden by someone" from the root bin by using the suffix l. Similarly, the second line illustrates a two-layer derivational morphology, which can be interpreted as producing two verbs: bin-dir "cause someone to ride something" and bin-dir-il "to be caused by someone to ride something".
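The way one entry with a layered suffix chain yields several stems can be sketched as follows. This is an illustrative assumption about how such entries might be expanded, not a description of the modified Cornerstone interface: the function name expand_stems is hypothetical, suffixes are given in their surface form, and no vowel harmony or other morphophonological processing is modeled.

```python
from typing import List


def expand_stems(root: str, suffix_chain: List[str]) -> List[str]:
    """List every stem produced along a chain of valency changing suffixes.

    Each prefix of the chain yields one stem, so a two-layer chain
    produces two verbs. Suffixes are attached as given (surface forms).
    """
    stems: List[str] = []
    current = root
    for suffix in suffix_chain:
        current = f"{current}-{suffix}"
        stems.append(current)
    return stems


# The two entries described for bin "to ride".
passive_stems = expand_stems("bin", ["il"])
causative_stems = expand_stems("bin", ["dir", "il"])
```

Restricting the input to chains attested in a lexicon, as done with the TDK list in the study, avoids generating stems that do not exist in the language.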

Experiments and Results
In Table 8, the number of co-occurrences of each thematic role with each case marker is given. Since only Arg0 and Arg1 have a fixed semantic interpretation in PropBank, we have used VerbNet thematic roles in our analysis. Some roles appear to be highly related to a case marker, while others appear arbitrary. The results can be interpreted in two ways: 1) If the semantic roles are known and case marker information is needed, Agent will be marked with NOM, Destination with DAT, Source with ABL and Recipient with DAT case with a probability of more than 0.98; furthermore, Patient and Theme can be restricted to NOM or ACC cases. 2) If the case markers are known and semantic role information is needed, only restrictions and prior probabilities can be provided. The highest probabilities occur with the COM-Instrument, LOC-Location, DAT-Destination, ACC-Theme and NOM-Agent pairs.

Table 9: Results of Argument Transformation

We have applied our proposed argument transformation to verbs with different valencies, and compared the argument configurations of the roots and stems. In Table 9, rows represent the valency changes applied to the verb root; the Intransitive column contains the number of intransitive verbs that the pattern is applied to, and the Transitive column the number of transitive verbs. The #Hold column shows the number of root verbs for which the proposed patterns hold, and #!Hold shows the number of times the pattern cannot be observed. The reflexive pattern can only be applied to transitive verbs, while the others can be applied to both. Experiments are done for reflexive, reciprocal and causative forms. Our preliminary results on a small set of root verbs show that the proposed argument transformation can be seen as a regular transformation.

4 The TNC corpus is a balanced and representative corpus of contemporary Turkish with 50 million words.
5 Cornerstone is also used for building the English, Chinese and Hindi/Urdu PropBanks.
6 This lexicon is not computationally available.
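The two readings of the co-occurrence counts, from role to case marker and from case marker to role, correspond to two conditional distributions that can be estimated from the same counts. The sketch below shows one way to compute them; the function name conditional_probs is hypothetical and the toy observations are invented for illustration only, they are not the counts of Table 8.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Probs = Dict[str, Dict[str, float]]


def conditional_probs(pairs: List[Tuple[str, str]]) -> Tuple[Probs, Probs]:
    """Estimate P(case | role) and P(role | case) from (role, case) pairs."""
    pair_counts = Counter(pairs)
    role_totals: Counter = Counter()
    case_totals: Counter = Counter()
    for (role, case), n in pair_counts.items():
        role_totals[role] += n
        case_totals[case] += n

    p_case_given_role: Probs = defaultdict(dict)
    p_role_given_case: Probs = defaultdict(dict)
    for (role, case), n in pair_counts.items():
        p_case_given_role[role][case] = n / role_totals[role]
        p_role_given_case[case][role] = n / case_totals[case]
    return dict(p_case_given_role), dict(p_role_given_case)


# Invented toy observations (NOT the data reported in Table 8).
toy_pairs = (
    [("Agent", "NOM")] * 9
    + [("Theme", "NOM")] * 3
    + [("Theme", "ACC")] * 6
    + [("Destination", "DAT")] * 4
    + [("Recipient", "DAT")] * 2
)
case_given_role, role_given_case = conditional_probs(toy_pairs)
```

In the toy data a role such as Agent determines its case marker, while a case marker such as DAT only narrows the candidate roles, which is the same asymmetry as in the two interpretations given above.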

Conclusion and Future Work
In this paper, we presented a pilot study for building a Turkish lexical semantic resource for 452 verb senses by making use of two morphosemantic features that appear to be useful for challenging NLP tasks. Our experimental results on 814 arguments showed that the first feature, case markers, is not arbitrarily linked with semantic roles. This leads us to the conclusion that case markers can be a distinguishing feature for SRL, word sense disambiguation and language generation tasks. We ran some experiments for the second feature, valency changing morphemes, and observed that the transformation of the argument structure from root to stem follows a specific pattern; hence the proposed transformation seems to be regular and predictable. The results suggest that the argument configuration of the root verb may be enough to label any verb stem derived with valency changing morphemes. This gives us the ability to build a semantic resource in a shorter time and with less human error, as well as to provide direct relationships such as "causativity", "reflexivity" and "reciprocity" between verbs, except for some problematic cases explained in Sect. 5. To conclude, this study encourages us to continue using morphosemantic features and to increase the size of this resource.