Ritiz Tambi


2020

pdf bib
Search Query Language Identification Using Weak Labeling
Ritiz Tambi | Ajinkya Kale | Tracy Holloway King
Proceedings of the Twelfth Language Resources and Evaluation Conference

Language identification is a well-known task for natural language documents. In this paper we explore search query language identification which is usually the first task before any other query understanding. Without loss of generalization, we run our experiments on the Adobe Stock search engine. Even though the domain is relatively generic because Adobe Stock queries cover a broad range of objects and concepts, out-of-the-box language identifiers do not perform well due to the extremely short text found in queries. Unlike other well-studied supervised approaches for this task, we examine a practical approach for the cold start problem for automatically getting large-scale query-language pairs for training. We describe the process of creating weak-labeled training data and then human-annotated evaluation data for the search query language identification task. The effectiveness of this technique is demonstrated by training a gradient boosting model for language classification given a query. We out-perform the open domain text model baselines by a large margin.