Shomir Wilson


2024

Documenting the Unwritten Curriculum of Student Research
Shomir Wilson
Proceedings of the Sixth Workshop on Teaching NLP

Graduate and undergraduate student researchers in natural language processing (NLP) often need mentoring to learn the norms of research. While methodological and technical knowledge are essential, there is also a “hidden curriculum” of experiential knowledge about topics like work strategies, common obstacles, collaboration, conferences, and scholarly writing. As a professor, I have written a set of guides that cover typically unwritten customs and procedures for academic research. I share them with advisees to help them understand research norms and to help us focus on their specific questions and interests. This paper describes these guides, which are freely accessible on the web (https://shomir.net/advice), and I provide recommendations to faculty who are interested in creating similar materials for their advisees.

Automated Detection and Analysis of Data Practices Using A Real-World Corpus
Mukund Srinath | Pranav Narayanan Venkit | Maria Badillo | Florian Schaub | C. Giles | Shomir Wilson
Findings of the Association for Computational Linguistics: ACL 2024

Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.

Creation and Analysis of an International Corpus of Privacy Laws
Sonu Gupta | Geetika Gopi | Harish Balaji | Ellen Poplavska | Nora O’Toole | Siddhant Arora | Thomas Norton | Norman Sadeh | Shomir Wilson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The landscape of privacy laws and regulations around the world is complex and ever-changing. National and super-national laws, agreements, decrees, and other government-issued rules form a patchwork that companies must follow to operate internationally. To examine the status and evolution of this patchwork, we introduce the Privacy Law Corpus of 1,043 privacy laws, regulations, and guidelines, covering 183 jurisdictions. This corpus enables a large-scale quantitative and qualitative examination of legal focus on privacy. We examine the temporal distribution of when privacy laws were created and illustrate the dramatic increase in privacy legislation over the past 50 years, although a finer-grained examination reveals that the rate of increase varies depending on the personal data types that privacy laws address. Our exploration also demonstrates that most privacy laws each address relatively few personal data types. Additionally, topic modeling results show the prevalence of common themes in privacy laws, such as finance, healthcare, and telecommunications. Finally, we release the corpus to the research community to promote further study.

Sociodemographic Bias in Language Models: A Survey and Forward Path
Vipul Gupta | Pranav Narayanan Venkit | Shomir Wilson | Rebecca Passonneau
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Sociodemographic bias in language models (LMs) has the potential for harm when deployed in real-world settings. This paper presents a comprehensive survey of the past decade of research on sociodemographic bias in LMs, organized into a typology that facilitates examining the different aims: types of bias, quantifying bias, and debiasing techniques. We track the evolution of the latter two questions, then identify current trends and their limitations, as well as emerging techniques. To guide future research towards more effective and reliable solutions, and to help authors situate their work within this broad landscape, we conclude with a checklist of open questions.

2023

Nationality Bias in Text Generation
Pranav Narayanan Venkit | Sanjana Gautam | Ruchi Panchanadikar | Ting-Hao Huang | Shomir Wilson
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Little attention has been paid to analyzing nationality bias in language models, even though nationality is widely used as a factor in increasing the performance of social NLP models. This paper examines how a text generation model, GPT-2, accentuates pre-existing societal biases about country-based demonyms. We generate stories using GPT-2 for various nationalities and use sensitivity analysis to explore how the number of internet users and the country’s economic status impact the sentiment of the stories. To reduce the propagation of biases through large language models (LLMs), we explore the debiasing method of adversarial triggering. Our results show that GPT-2 demonstrates significant bias against countries with fewer internet users, and that adversarial triggering effectively reduces this bias.

Automated Ableism: An Exploration of Explicit Disability Biases in Sentiment and Toxicity Analysis Models
Pranav Narayanan Venkit | Mukund Srinath | Shomir Wilson
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

We analyze sentiment analysis and toxicity detection models to detect the presence of explicit bias against people with disabilities (PWD). We employ the bias identification framework of Perturbation Sensitivity Analysis to examine conversations related to PWD on social media platforms, specifically Twitter and Reddit, in order to gain insight into how disability bias is disseminated in real-world social settings. We then create the Bias Identification Test in Sentiment (BITS) corpus to quantify explicit disability bias in any sentiment analysis and toxicity detection models. Our study utilizes BITS to uncover significant biases in four open AIaaS (AI as a Service) sentiment analysis tools, namely TextBlob, VADER, Google Cloud Natural Language API, and DistilBERT, and in two toxicity detection models, namely two versions of Toxic-BERT. Our findings indicate that all of these models exhibit statistically significant explicit bias against PWD.

The Sentiment Problem: A Critical Survey towards Deconstructing Sentiment Analysis
Pranav Venkit | Mukund Srinath | Sanjana Gautam | Saranya Venkatraman | Vipul Gupta | Rebecca Passonneau | Shomir Wilson
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We conduct an inquiry into the sociotechnical aspects of sentiment analysis (SA) by critically examining 189 peer-reviewed papers on their applications, models, and datasets. Our investigation stems from the recognition that SA has become an integral component of diverse sociotechnical systems, exerting influence on both social and technical users. By delving into sociological and technological literature on sentiment, we unveil distinct conceptualizations of this term in domains such as finance, government, and medicine. Our study exposes a lack of explicit definitions and frameworks for characterizing sentiment, resulting in potential challenges and biases. To tackle this issue, we propose an ethics sheet encompassing critical inquiries to guide practitioners in ensuring equitable utilization of SA. Our findings underscore the significance of adopting an interdisciplinary approach to defining sentiment in SA and offer a pragmatic solution for its implementation.

2022

STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents
Nan Zhang | Shomir Wilson | Prasenjit Mitra
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Formal documents often are organized into sections of text, each with a title, and extracting this structure remains an under-explored aspect of natural language processing. This iterative title-text structure is valuable data for building models for headline generation and section title generation, but there is no corpus that contains web documents annotated with titles and prose texts. Therefore, we propose the first title-text dataset on web documents that incorporates a wide variety of domains to facilitate downstream training. We also introduce STAPI (Section Title And Prose text Identifier), a two-step system for labeling section titles and prose text in HTML documents. To filter out unrelated content like document footers, its first step involves a filter that reads HTML documents and proposes a set of textual candidates. In the second step, a typographic classifier takes the candidates from the filter and categorizes each one into one of the three pre-defined classes (title, prose text, and miscellany). We show that STAPI significantly outperforms two baseline models in terms of title-text identification. We release our dataset along with a web application to facilitate supervised and semi-supervised training in this domain.

A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus
Siddhant Arora | Henry Hosseini | Christine Utz | Vinayshekhar Bannihatti Kumar | Tristan Dhellemmes | Abhilasha Ravichander | Peter Story | Jasmine Mangat | Rex Chen | Martin Degeling | Thomas Norton | Thomas Hupperich | Shomir Wilson | Norman Sadeh
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Over the past decade, researchers have started to explore the use of NLP to develop tools aimed at helping the public, vendors, and regulators analyze disclosures made in privacy policies. With the introduction of new privacy regulations, the language of privacy policies is also evolving, and disclosures made by the same organization are not always the same in different languages, especially when used to communicate with users who fall under different jurisdictions. This work explores the use of language technologies to capture and analyze these differences at scale. We introduce an annotation scheme designed to capture the nuances of two new landmark privacy regulations, namely the EU’s GDPR and California’s CCPA/CPRA. We then introduce the first bilingual corpus of mobile app privacy policies consisting of 64 privacy policies in English (292K words) and 91 privacy policies in German (478K words), respectively with manual annotations for 8K and 19K fine-grained data practices. The annotations are used to develop computational methods that can automatically extract “disclosures” from privacy policies. Analysis of a subset of 59 “semi-parallel” policies reveals differences that can be attributed to different regulatory regimes, suggesting that systematic analysis of policies using automated language technologies is indeed a worthwhile endeavor.

A Study of Implicit Bias in Pretrained Language Models against People with Disabilities
Pranav Narayanan Venkit | Mukund Srinath | Shomir Wilson
Proceedings of the 29th International Conference on Computational Linguistics

Pretrained language models (PLMs) have been shown to exhibit sociodemographic biases, such as against gender and race, raising concerns of downstream biases in language technologies. However, PLMs’ biases against people with disabilities (PWDs) have received little attention, in spite of their potential to cause similar harms. Using perturbation sensitivity analysis, we test an assortment of popular word embedding-based and transformer-based PLMs and show significant biases against PWDs in all of them. The results demonstrate how models trained on large corpora widely favor ableist language.

2021

Breaking Down Walls of Text: How Can NLP Benefit Consumer Privacy?
Abhilasha Ravichander | Alan W Black | Thomas Norton | Shomir Wilson | Norman Sadeh
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Privacy plays a crucial role in preserving democratic ideals and personal autonomy. The dominant legal approach to privacy in many jurisdictions is the “Notice and Choice” paradigm, where privacy policies are the primary instrument used to convey information to users. However, privacy policies are long and complex documents that are difficult for users to read and comprehend. We discuss how language technologies can play an important role in addressing this information gap, reporting on initial progress towards helping three specific categories of stakeholders take advantage of digital privacy policies: consumers, enterprises, and regulators. Our goal is to provide a roadmap for the development and use of language technologies to empower users to reclaim control over their privacy, limit privacy harms, and rally research efforts from the community towards addressing an issue with large social impact. We highlight many remaining opportunities to develop language technologies that are more precise or nuanced in the way in which they use the text of privacy policies.

Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies
Mukund Srinath | Shomir Wilson | C Lee Giles
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state-of-the-art results on the data practice classification and question answering tasks.

2019

Question Answering for Privacy Policies: Combining Computational and Legal Perspectives
Abhilasha Ravichander | Alan W Black | Shomir Wilson | Thomas Norton | Norman Sadeh
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Privacy policies are long and complex documents that are difficult for users to read and understand. Yet, they have legal effects on how user data can be collected, managed and used. Ideally, we would like to empower users to inform themselves about the issues that matter to them, and enable them to selectively explore these issues. We present PrivacyQA, a corpus consisting of 1750 questions about the privacy policies of mobile applications, and over 3500 expert annotations of relevant answers. We observe that a strong neural baseline underperforms human performance by almost 0.3 F1 on PrivacyQA, suggesting considerable room for improvement for future systems. Further, we use this dataset to categorically identify challenges to question answerability, with domain-general implications for any question answering system. PrivacyQA offers a challenging corpus for question answering, with genuine real-world utility.

2018

Supervised and Unsupervised Methods for Robust Separation of Section Titles and Prose Text in Web Documents
Abhijith Athreya Mysore Gopinath | Shomir Wilson | Norman Sadeh
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The text in many web documents is organized into a hierarchy of section titles and corresponding prose content, a structure which provides potentially exploitable information on discourse structure and topicality. However, this organization is generally discarded during text collection, and collecting it is not straightforward: the same visual organization can be implemented in a myriad of different ways in the underlying HTML. To remedy this, we present a flexible system for automatically extracting the hierarchical section titles and prose organization of web documents irrespective of differences in HTML representation. This system uses features from syntax, semantics, discourse and markup to build two models which classify HTML text into section titles and prose text. When tested on three different domains of web text, our domain-independent system achieves an overall precision of 0.82 and a recall of 0.98. The domain-dependent variation produces very high precision (0.99) at the expense of recall (0.75). These results exhibit a robust level of accuracy suitable for enhancing question answering, information extraction, and summarization.

2017

Identifying the Provision of Choices in Privacy Policy Text
Kanthashree Mysore Sathyendra | Shomir Wilson | Florian Schaub | Sebastian Zimmeck | Norman Sadeh
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Websites’ and mobile apps’ privacy policies, written in natural language, tend to be long and difficult to understand. Information privacy revolves around the fundamental principle of notice and choice, namely the idea that users should be able to make informed decisions about what information about them can be collected and how it can be used. Internet users want control over their privacy, but their choices are often hidden in long and convoluted privacy policy texts. Moreover, little (if any) prior work has been done to detect the provision of choices in text. We address this challenge of enabling user choice by automatically identifying and extracting pertinent choice language in privacy policies. In particular, we present a two-stage architecture of classification models to identify opt-out choices in privacy policy text, labelling common varieties of choices with a mean F1 score of 0.735. Our techniques enable the creation of systems to help Internet users learn about their choices, thereby effectuating notice and choice and improving Internet privacy.

2016

The Creation and Analysis of a Website Privacy Policy Corpus
Shomir Wilson | Florian Schaub | Aswarth Abhilash Dara | Frederick Liu | Sushain Cherivirala | Pedro Giovanni Leon | Mads Schaarup Andersen | Sebastian Zimmeck | Kanthashree Mysore Sathyendra | N. Cameron Russell | Thomas B. Norton | Eduard Hovy | Joel Reidenberg | Norman Sadeh
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This Table is Different: A WordNet-Based Approach to Identifying References to Document Entities
Shomir Wilson | Alan Black | Jon Oberlander
Proceedings of the 8th Global WordNet Conference (GWC)

Writing intended to inform frequently contains references to document entities (DEs), a mixed class that includes orthographically structured items (e.g., illustrations, sections, lists) and discourse entities (arguments, suggestions, points). Such references are vital to the interpretation of documents, but they often eschew identifiers such as “Figure 1” for inexplicit phrases like “in this figure” or “from these premises”. We examine inexplicit references to DEs, termed DE references, and recast the problem of their automatic detection into the determination of relevant word senses. We then show the feasibility of machine learning for the detection of DE-relevant word senses, using a corpus of human-labeled synsets from WordNet. We test cross-domain performance by gathering lemmas and synsets from three corpora: website privacy policies, Wikipedia articles, and Wikibooks textbooks. Identifying DE references will enable language technologies to use the information encoded by them, permitting the automatic generation of finely-tuned descriptions of DEs and the presentation of richly-structured information to readers.

2014

Determiner-Established Deixis to Communicative Artifacts in Pedagogical Text
Shomir Wilson | Jon Oberlander
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Toward Automatic Processing of English Metalanguage
Shomir Wilson
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

The Creation of a Corpus of English Metalanguage
Shomir Wilson
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2010

Distinguishing Use and Mention in Natural Language
Shomir Wilson
Proceedings of the NAACL HLT 2010 Student Research Workshop