Ankit Srivastava

Also published as: Ankit K. Srivastava, Ankit Kumar, Ankit Kumar Srivastava


2021

NARNIA at NLP4IF-2021: Identification of Misinformation in COVID-19 Tweets Using BERTweet
Ankit Kumar | Naman Jhunjhunwala | Raksha Agarwal | Niladri Chatterjee
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

The spread of COVID-19 has been accompanied by widespread misinformation on social media. In particular, the Twitterverse has seen a huge increase in the dissemination of distorted facts and figures. The present work aims at identifying tweets regarding COVID-19 that contain harmful and false information. We have experimented with a number of Deep Learning-based models, including different word embeddings such as GloVe and ELMo, among others. The BERTweet model achieved the best overall F1-score of 0.881 and secured the third rank on the above task.

2020

Noisy Text Data: Achilles’ Heel of BERT
Ankit Kumar | Piyush Makhija | Anuj Gupta
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Owing to the phenomenal success of BERT on various NLP tasks and benchmark datasets, industry practitioners are actively experimenting with fine-tuning BERT to build NLP applications for solving industry use cases. For most datasets that practitioners use to build industrial NLP applications, it is hard to guarantee the absence of noise in the data. While BERT has performed exceedingly well at transferring the learnings from one use case to another, it remains unclear how BERT performs when fine-tuned on noisy text. In this work, we explore the sensitivity of BERT to noise in the data. We work with the most commonly occurring noise (spelling mistakes, typos) and show that it results in significant degradation in the performance of BERT. We present experimental results to show that BERT’s performance on fundamental NLP tasks like sentiment analysis and textual similarity drops significantly in the presence of (simulated) noise on benchmark datasets, viz. IMDB Movie Review, STS-B, and SST-2. Further, we identify shortcomings in the existing BERT pipeline that are responsible for this drop in performance. Our findings suggest that practitioners need to be wary of the presence of noise in their datasets while fine-tuning BERT to solve industry use cases.

hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Text Normalization
Piyush Makhija | Ankit Kumar | Anuj Gupta
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

We present hinglishNorm - a human-annotated corpus of Hindi-English code-mixed sentences for the text normalization task. Each sentence in the corpus is aligned to its corresponding human-annotated normalized form. To the best of our knowledge, there is no publicly available corpus of Hindi-English code-mixed sentences for the text normalization task. Our work is the first attempt in this direction. The corpus contains 13494 segments annotated for text normalization. Further, we present baseline normalization results on this corpus. We obtain a Word Error Rate (WER) of 15.55, a BiLingual Evaluation Understudy (BLEU) score of 71.2, and a Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.50.

2017

Event Detection and Semantic Storytelling: Generating a Travelogue from a large Collection of Personal Letters
Georg Rehm | Julian Moreno Schneider | Peter Bourgonje | Ankit Srivastava | Jan Nehring | Armin Berger | Luca König | Sören Räuchle | Jens Gerth
Proceedings of the Events and Stories in the News Workshop

We present an approach to identifying a specific class of events, movement action events (MAEs), in a data set that consists of ca. 2,800 personal letters exchanged by the German architect Erich Mendelsohn and his wife, Luise. A backend system uses these and other semantic analysis results as input for an authoring environment that digital curators can use to produce new pieces of digital content. In our example case, the human expert will receive recommendations from the system with the goal of putting together a travelogue, i.e., a description of the trips and journeys undertaken by the couple. We describe the components and architecture and also apply the system to news data.

Semantic Storytelling, Cross-lingual Event Detection and other Semantic Services for a Newsroom Content Curation Dashboard
Julian Moreno-Schneider | Ankit Srivastava | Peter Bourgonje | David Wabnitz | Georg Rehm
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

We present a prototypical content curation dashboard, to be used in the newsroom, and several of its underlying semantic content analysis components (such as named entity recognition, entity linking, summarisation and temporal expression analysis). The idea is to enable journalists (a) to process incoming content (agency reports, Twitter feeds, reports, blog posts, social media etc.) and (b) to create new articles more easily and more efficiently. The prototype system also allows the automatic annotation of events in incoming content, both to support journalists in identifying important, relevant or meaningful events and to adapt the content currently in production accordingly in a semi-automatic way. One of our long-term goals is to support journalists in building up entire storylines with automatic means. In the present prototype they are generated in a backend service using clustering methods that operate on the extracted events.

Multi-source Neural Automatic Post-Editing: FBK’s participation in the WMT 2017 APE shared task
Rajen Chatterjee | M. Amin Farajian | Matteo Negri | Marco Turchi | Ankit Srivastava | Santanu Pal
Proceedings of the Second Conference on Machine Translation

DFKI-DKT at SemEval-2017 Task 8: Rumour Detection and Classification using Cascading Heuristics
Ankit Srivastava | Georg Rehm | Julian Moreno Schneider
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We describe our submissions for SemEval-2017 Task 8, Determining Rumour Veracity and Support for Rumours. The Digital Curation Technologies (DKT) team at the German Research Center for Artificial Intelligence (DFKI) participated in two subtasks: Subtask A (determining the stance of a message) and Subtask B (determining veracity of a message, closed variant). In both cases, our implementation consisted of a Multivariate Logistic Regression (Maximum Entropy) classifier coupled with hand-written patterns and rules (heuristics) applied in a post-process cascading fashion. We provide a detailed analysis of the system performance and report on variants of our systems that were not part of the official submission.

2016

DFKI’s system for WMT16 IT-domain task, including analysis of systematic errors
Eleftherios Avramidis | Aljoscha Burchardt | Vivien Macketanz | Ankit Srivastava
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

A Recurrent Neural Network Architecture for De-identifying Clinical Records
Shweta | Ankit Kumar | Asif Ekbal | Sriparna Saha | Pushpak Bhattacharyya
Proceedings of the 13th International Conference on Natural Language Processing

How to configure statistical machine translation with linked open data resources
Ankit Srivastava | Felix Sasaki | Peter Bourgonje | Julian Moreno-Schneider | Jan Nehring | Georg Rehm
Proceedings of Translating and the Computer 38

2015

An empirical study of segment prioritization for incrementally retrained post-editing-based SMT
Jinhua Du | Ankit Srivastava | Andy Way | Alfredo Maldonado-Guerra | David Lewis
Proceedings of Machine Translation Summit XV: Papers

2011

Oracle-based Training for Phrase-based Statistical Machine Translation
Ankit Srivastava | Yanjun Ma | Andy Way
Proceedings of the 15th Annual Conference of the European Association for Machine Translation

2010

MATREX: The DCU MT System for WMT 2010
Sergio Penkale | Rejwanul Haque | Sandipan Dandapat | Pratyush Banerjee | Ankit K. Srivastava | Jinhua Du | Pavel Pecina | Sudip Kumar Naskar | Mikel L. Forcada | Andy Way
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

English-Hindi Transliteration Using Context-Informed PB-SMT: the DCU System for NEWS 2009
Rejwanul Haque | Sandipan Dandapat | Ankit Kumar Srivastava | Sudip Kumar Naskar | Andy Way
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

Using Percolated Dependencies for Phrase Extraction in SMT
Ankit Srivastava | Andy Way
Proceedings of Machine Translation Summit XII: Posters