2024
MERA: A Comprehensive LLM Evaluation in Russian
Alena Fenogenova | Artem Chervyakov | Nikita Martynov | Anastasia Kozlova | Maria Tikhonova | Albina Akhmetgareeva | Anton Emelyanov | Denis Shevelev | Pavel Lebedev | Leonid Sinev | Ulyana Isaeva | Katerina Kolomeytseva | Daniil Moskovskiy | Elizaveta Goncharova | Nikita Savushkin | Polina Mikhailova | Anastasia Minaeva | Denis Dimitrov | Alexander Panchenko | Sergey Markov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Over the past few years, one of the most notable advancements in AI research has been in foundation models (FMs), headlined by the rise of language models (LMs). However, despite researchers’ attention and the rapid growth in LM application, the capabilities, limitations, and associated risks still need to be better understood. To address these issues, we introduce a new instruction benchmark, MERA, oriented towards the FMs’ performance on the Russian language. The benchmark encompasses 21 evaluation tasks for generative models covering 10 skills and is supplied with private answer scoring to prevent data leakage. The paper introduces a methodology to evaluate FMs and LMs in fixed zero- and few-shot instruction settings that can be extended to other modalities. We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system. We evaluate open LMs as baselines and find they are still far behind the human level. We publicly release MERA to guide forthcoming research, anticipate groundbreaking model features, standardize the evaluation procedure, and address potential ethical concerns and drawbacks.
Transformer Attention vs Human Attention in Anaphora Resolution
Anastasia Kozlova | Albina Akhmetgareeva | Aigul Khanova | Semen Kudriavtsev | Alena Fenogenova
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Motivated by human cognitive processes, the attention mechanism within the transformer architecture has been developed to assist neural networks in allocating focus to specific aspects within input data. Despite claims regarding the interpretability achieved by attention mechanisms, the extent of correlation and similarity between machine and human attention remains a subject requiring further investigation. In this paper, we conduct a quantitative analysis of human attention compared to neural attention mechanisms in the context of the anaphora resolution task. We collect an eye-tracking dataset based on the Winograd schema challenge task for the Russian language. Leveraging this dataset, we conduct an extensive analysis of the correlations between human and machine attention maps across various transformer architectures and network layers of pre-trained and fine-tuned models. Our aim is to investigate whether insights from human attention mechanisms can be used to enhance the performance of neural networks in tasks such as anaphora resolution. The results reveal distinctions in anaphora resolution processing, offering promising prospects for improving the performance of neural networks and understanding the cognitive nuances of human perception.
2022
TAPE: Assessing Few-shot Russian Language Understanding
Ekaterina Taktasheva | Alena Fenogenova | Denis Shevelev | Nadezhda Katricheva | Maria Tikhonova | Albina Akhmetgareeva | Oleg Zinkevich | Anastasiia Bashmakova | Svetlana Iordanskaia | Valentina Kurenshchikova | Alena Spiridonova | Ekaterina Artemova | Tatiana Shavrina | Vladislav Mikhailov
Findings of the Association for Computational Linguistics: EMNLP 2022
Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this line of research, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE’s design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. The detailed analysis of testing the autoregressive baselines indicates that simple spelling-based perturbations affect the performance the most, while paraphrasing the input has a more negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (https://tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.