Proceedings of the Natural Legal Language Processing Workshop 2024

Nikolaos Aletras, Ilias Chalkidis, Leslie Barrett, Cătălina Goanță, Daniel Preoțiuc-Pietro, Gerasimos Spanakis (Editors)


Anthology ID:
2024.nllp-1
Month:
November
Year:
2024
Address:
Miami, FL, USA
Venue:
NLLP
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2024.nllp-1
DOI:
10.18653/v1/2024.nllp-1
PDF:
https://aclanthology.org/2024.nllp-1.pdf

pdf bib
Proceedings of the Natural Legal Language Processing Workshop 2024
Nikolaos Aletras | Ilias Chalkidis | Leslie Barrett | Cătălina Goanță | Daniel Preoțiuc-Pietro | Gerasimos Spanakis

pdf bib
LeGen: Complex Information Extraction from Legal sentences using Generative Models
Chaitra C R | Sankalp Kulkarni | Sai Rama Akash Varma Sagi | Shashank Pandey | Rohit Yalavarthy | Dipanjan Chakraborty | Prajna Devi Upadhyay

Constructing legal knowledge graphs from unstructured legal texts is a complex challenge due to the intricate nature of legal language. While open information extraction (OIE) techniques can convert text into triples of the form (subject, relation, object), they often fall short of capturing the nuanced relationships within lengthy legal sentences, necessitating more sophisticated approaches known as complex information extraction. This paper proposes LeGen – an end-to-end approach leveraging pre-trained large language models (GPT-4o, T5, BART) to perform complex information extraction from legal sentences. LeGen learns and represents the discourse structure of legal sentences, capturing both their complexity and semantics. It minimizes the error propagation typical of multi-step pipelines and achieves up to a 32.2% gain on the Indian Legal benchmark. Additionally, it demonstrates competitive performance on open information extraction benchmarks. A promising application of the resulting legal knowledge graphs is in developing question-answering systems for government schemes, tailored to the Next Billion Users who struggle with the complexity of legal language. Our code and data are available at https://github.com/prajnaupadhyay/LegalIE
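For readers unfamiliar with generative triple extraction, the sketch below shows the general shape of prompting a model for (subject, relation, object) triples and parsing the reply. The `generate` stub, prompt wording, and example output are illustrative assumptions, not the authors' LeGen pipeline.

```python
# Hedged sketch: prompting a generative model for (subject, relation, object)
# triples from a legal sentence. `generate` is a hypothetical stand-in for any
# LLM call (GPT-4o, T5, BART, ...); it is NOT the authors' LeGen implementation.
from typing import List, Tuple

def generate(prompt: str) -> str:
    # Placeholder: plug in your own model call here.
    return "(the court, may terminate, the guardianship of a natural person)"

def extract_triples(sentence: str) -> List[Tuple[str, ...]]:
    prompt = (
        "Extract all (subject, relation, object) triples from the legal "
        f"sentence below. One triple per line, in parentheses.\n\n{sentence}"
    )
    triples = []
    for line in generate(prompt).splitlines():
        parts = [p.strip() for p in line.strip().strip("()").split(",")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples

print(extract_triples("The court may terminate the guardianship of a natural person."))
```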

pdf bib
Summarizing Long Regulatory Documents with a Multi-Step Pipeline
Mika Sie | Ruby Beek | Michiel Bots | Sjaak Brinkkemper | Albert Gatt

Long regulatory texts are challenging to summarize due to their length and complexity. To address this, we propose a multi-step extractive-abstractive architecture for handling lengthy regulatory documents more effectively. In this paper, we show that the effectiveness of this two-step architecture varies significantly depending on the model used. Specifically, the two-step architecture improves the performance of decoder-only models. For abstractive encoder-decoder models with short context lengths, the benefit of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens their performance. This research also highlights the challenges of evaluating generated texts, as evidenced by the differing results of human and automated evaluations. Most notably, human evaluations favoured language models pretrained on legal text, while automated metrics ranked general-purpose language models higher. The results underscore the importance of selecting the appropriate summarization strategy based on model architecture and context length.

pdf bib
Enhancing Legal Expertise in Large Language Models through Composite Model Integration: The Development and Evaluation of Law-Neo
Zhihao Liu | Yanzhen Zhu | Mengyuan Lu

Although large language models (LLMs) like ChatGPT have demonstrated considerable capabilities in general domains, they often lack proficiency in specialized fields. Enhancing a model’s performance in a specific domain, such as law, while keeping costs low has been a significant challenge. Existing methods, such as fine-tuning or building mixture-of-experts (MoE) models, often struggle to balance model parameters, training costs, and domain-specific performance. Inspired by work on augmenting language models through composition, we developed Law-Neo, a novel model designed to enhance legal LLMs. Law-Neo significantly improves legal domain expertise at minimal training cost while retaining the logical capabilities of a large-scale anchor model. Law-Neo outperformed other models in comprehensive experiments on multiple legal task benchmarks, demonstrating the effectiveness of this approach.

pdf bib
uOttawa at LegalLens-2024: Transformer-based Classification Experiments
Nima Meghdadi | Diana Inkpen

This paper presents the methods used for LegalLens-2024, which focused on detecting legal violations within unstructured textual data and associating these violations with potentially affected individuals. The shared task included two subtasks: A) Legal Named Entity Recognition (L-NER) and B) Legal Natural Language Inference (L-NLI). For subtask A, we utilized the spaCy library, while for subtask B, we employed a combined model incorporating RoBERTa and CNN. Our results were 86.3% in the L-NER subtask and 88.25% in the L-NLI subtask. Overall, our paper demonstrates the effectiveness of transformer models in addressing complex tasks in the legal domain.
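For orientation, applying a spaCy pipeline for entity extraction looks like the short sketch below; the generic `en_core_web_sm` model and the sample sentence are stand-ins, since the custom legal NER pipeline and the RoBERTa+CNN NLI model described above are not public in this listing.

```python
# Minimal spaCy NER sketch (generic model as a stand-in for the custom
# legal NER pipeline described in the paper).
import spacy

nlp = spacy.load("en_core_web_sm")  # a custom legal model would be loaded here
doc = nlp("Acme Corp. was fined $2 million by the FTC in 2023.")
print([(ent.text, ent.label_) for ent in doc.ents])
```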

pdf bib
Quebec Automobile Insurance Question-Answering With Retrieval-Augmented Generation
David Beauchemin | Richard Khoury | Zachary Gagnon

Large Language Models (LLMs) perform outstandingly in various downstream tasks, and the use of the Retrieval-Augmented Generation (RAG) architecture has been shown to improve performance for legal question answering (Nuruzzaman and Hussain, 2020; Louis et al., 2024). However, there have been few applications to insurance question answering, which involves a specific type of legal document. This paper introduces two corpora: the Quebec Automobile Insurance Expertise Reference Corpus and a set of 82 Expert Answers to Layperson Automobile Insurance Questions. Our study leverages both corpora to automatically and manually assess GPT-4o, a state-of-the-art (SOTA) LLM, on answering Quebec automobile insurance questions. Our results demonstrate that, on average, using our expertise reference corpus generates better responses on both automatic and manual evaluation metrics. However, they also highlight that LLM-based QA remains too unreliable for mass utilization in critical areas. Indeed, our results show that between 5% and 13% of the answered questions include a false statement that could lead to customer misunderstanding.
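As background, a bare-bones retrieval-augmented prompting loop of the kind evaluated here might look like the sketch below; the embedding model, the toy reference passages, and the `answer_with_llm` stub are assumptions for illustration, not the authors' system.

```python
# Hedged RAG sketch: retrieve reference passages by embedding similarity,
# then place the top hits in the prompt. Not the authors' pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder
corpus = [
    "Bodily injury is compensated by the public plan regardless of fault.",
    "Property damage claims are settled under the direct compensation agreement.",
]
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(question: str, k: int = 1):
    q = encoder.encode([question], normalize_embeddings=True)[0]
    scores = corpus_emb @ q
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def answer_with_llm(prompt: str) -> str:
    return "..."  # placeholder for a GPT-4o (or other LLM) call

question = "Who pays if I am injured in a car accident in Quebec?"
context = "\n".join(retrieve(question))
print(answer_with_llm(f"Answer using only this context:\n{context}\n\nQ: {question}"))
```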

pdf bib
Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models
Shubham Kumar Nigam | Aniket Deroy | Subhankar Maity | Arnab Bhattacharya

This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.

pdf bib
The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal
Huiyuan Xie | Felix Steffek | Joana De Faria | Christine Carter | Jonathan Rutherford

This paper explores the intersection of technological innovation and access to justice by developing a benchmark for predicting case outcomes in the UK Employment Tribunal (UKET). To address the challenge of extensive manual annotation, the study employs a large language model (LLM) for automatic annotation, resulting in the creation of the CLC-UKET dataset. The dataset consists of approximately 19,000 UKET cases and their metadata. Comprehensive legal annotations cover facts, claims, precedent references, statutory references, case outcomes, reasons and jurisdiction codes. Facilitated by the CLC-UKET data, we examine a multi-class case outcome prediction task in the UKET. Human predictions are collected to establish a performance reference for model comparison. Empirical results from baseline models indicate that finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET prediction task. The performance of zero-shot LLMs can be enhanced by integrating task-related information into few-shot examples. We hope that the CLC-UKET dataset, along with human annotations and empirical findings, can serve as a valuable benchmark for employment-related dispute resolution.

pdf bib
Information Extraction for Planning Court Cases
Drish Mali | Rubash Mali | Claire Barale

Legal documents are often long and unstructured, making them challenging and time-consuming to comprehend. An automatic system that can identify relevant entities and labels within legal documents would significantly reduce legal research time. We developed a system that streamlines legal case analysis for planning courts by extracting key information from XML files using Named Entity Recognition (NER) and multi-label classification models, converting the documents into structured form. This research contributes three novel datasets for Planning Court cases: a NER dataset, a multi-label dataset fully annotated by humans, and newly re-annotated multi-label datasets partially annotated using LLMs. We experimented with various general-purpose and legal domain-specific models with different maximum sequence lengths. We found that incorporating paragraph position information improved the performance of models on the multi-label classification task. Our research highlights the importance of domain-specific models, with LegalRoBERTa and LexLM demonstrating the best performance.

pdf bib
Automated Anonymization of Parole Hearing Transcripts
Abed Itani | Wassiliki Siskou | Annette Hautli-Janisz

Responsible natural language processing is increasingly concerned with preventing the violations of personal rights that language technology can entail (CITATION). In this paper we illustrate the case of parole hearings in California, the verbatim transcripts of which are made available to the general public upon a request sent to the California Board of Parole Hearings. The parole hearing setting is highly sensitive: inmates face a board of legal representatives who discuss highly personal matters not only about the inmates themselves but also about victims and their relatives, such as spouses and children. Participants have no choice in contributing to the data collection process, since the disclosure of the transcripts is mandated by law. As researchers interested in understanding and modeling the communication in these hierarchy-driven settings, we face an ethical dilemma: publishing the raw data as is for the community would compromise the privacy of all individuals affected, but manually cleaning the data requires substantial effort. In this paper we present an automated anonymization process which reliably removes and pseudonymizes sensitive data in verbatim transcripts while preserving the structure and content of the data. Our results show that the process exhibits little to no leakage of sensitive information when applied to more than 300 hearing transcripts.

pdf bib
Towards an Automated Pointwise Evaluation Metric for Generated Long-Form Legal Summaries
Shao Min Tan | Quentin Grail | Lee Quartey

Long-form abstractive summarization is a task of particular importance in the legal domain. Automated evaluation metrics are important for the development of text generation models, but existing research on the evaluation of generated summaries has focused mainly on short summaries. We introduce an automated evaluation methodology for generated long-form legal summaries, which involves breaking each summary into individual points, comparing the points in a human-written and a machine-generated summary, and calculating recall and precision scores for the latter. The method is designed to be particularly suited to the complexities of legal text and is fully interpretable. We also create and release a small meta-dataset for the benchmarking of evaluation methods, focusing on long-form legal summarization. Our evaluation metric corresponds more closely with human evaluation than existing metrics, which were not developed for legal data.
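To make the point-level precision/recall idea concrete, here is a deliberately simplified sketch: points are matched by a similarity function (a naive token-overlap stand-in here) and precision/recall are computed for the generated summary. The matching function, threshold, and example points are assumptions; the paper's metric is more sophisticated.

```python
# Simplified sketch of point-level precision/recall between a reference and a
# generated summary. The similarity function is a naive stand-in for the
# paper's point comparison step.
def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    clean = lambda s: set(s.lower().replace(".", "").replace(",", "").split())
    ta, tb = clean(a), clean(b)
    return len(ta & tb) / max(len(ta | tb), 1) >= threshold

def point_scores(reference_points, generated_points):
    matched_gen = sum(any(similar(g, r) for r in reference_points) for g in generated_points)
    matched_ref = sum(any(similar(r, g) for g in generated_points) for r in reference_points)
    precision = matched_gen / max(len(generated_points), 1)
    recall = matched_ref / max(len(reference_points), 1)
    return precision, recall

ref = ["The court held the clause unenforceable.", "Damages were limited to direct losses."]
gen = ["The clause was held unenforceable by the court."]
print(point_scores(ref, gen))  # high precision, partial recall
```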

pdf bib
Enhancing Contract Negotiations with LLM-Based Legal Document Comparison
Savinay Narendra | Kaushal Shetty | Adwait Ratnaparkhi

We present a large language model (LLM) based approach for comparing legal contracts with their corresponding template documents. Legal professionals use commonly observed deviations between templates and contracts to help with contract negotiations, and also to refine the template documents. Our comparison approach, based on the well-studied natural language inference (NLI) task, first splits a template into key concepts and then uses LLMs to decide whether the concepts are entailed by the contract document. We also repeat this procedure in the opposite direction: contract clauses are tested for entailment against the template clause to see if they contain additional information. The non-entailed concepts are labelled, organized and filtered by frequency, and placed into a clause library, which is used to suggest changes to the template documents. We first show that our LLM-based approach outperforms all previous work on a publicly available dataset designed for NLI in the legal domain. We then apply it to a private real-world legal dataset, achieving an accuracy of 96.46%. Our approach is the first in the literature to produce a natural language comparison between legal contracts and their template documents.
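As an illustration of the bidirectional entailment check described above, the sketch below scores both directions with an off-the-shelf MNLI classifier; this classifier, the example clauses, and the scoring are stand-ins for the authors' LLM-based NLI step, not their implementation.

```python
# Hedged sketch: bidirectional entailment check with an off-the-shelf MNLI
# model, as a stand-in for the LLM-based NLI step described in the paper.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    label_id = {v.lower(): k for k, v in nli.config.id2label.items()}["entailment"]
    return float(probs[label_id])

template = "The supplier shall indemnify the customer against third-party claims."
contract = "Supplier will hold Customer harmless from any claim brought by a third party."
print(entailment_prob(contract, template))  # is the template concept covered by the contract?
print(entailment_prob(template, contract))  # reverse direction: extra content in the contract?
```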

pdf bib
Attributed Question Answering for Preconditions in the Dutch Law
Felicia Redelaar | Romy Van Drie | Suzan Verberne | Maaike De Boer

In this paper, we address the problem of answering questions about preconditions in the law, e.g. “When can the court terminate the guardianship of a natural person?”. When answering legal questions, it is important to attribute the relevant part of the law; we therefore generate not only answers but also references to law articles. We implement a retrieval augmented generation (RAG) pipeline for long-form answers based on Dutch law, using several state-of-the-art retrievers and generators. To evaluate our pipeline, we create a dataset containing legal QA pairs with attributions. Our experiments show promising results on our extended versions of the automatic evaluation metrics from the Automatic LLMs’ Citation Evaluation (ALCE) Framework and the G-EVAL Framework. Our findings indicate that RAG has significant potential in complex, citation-heavy domains like law, as it helps laymen understand legal preconditions and rights by generating high-quality answers with accurate attributions.

pdf bib
Algorithm for Automatic Legislative Text Consolidation
Matias Etcheverry | Thibaud Real-del-Sarte | Pauline Chavallard

This study introduces a method for automating the consolidation process in a legal context, a time-consuming task traditionally performed by legal professionals. We present a generative approach that processes legislative texts to automatically apply amendments. Our method employs a lightweight quantized generative model, fine-tuned with LoRA, to generate accurate and reliable amended texts. To the authors’ knowledge, this is the first time generative models have been applied to legislative text consolidation. Our dataset is publicly available on HuggingFace. Experimental results demonstrate a significant improvement in efficiency, offering faster updates to legal documents. A fully automated legislative text consolidation pipeline can be run in a few hours, with a success rate of more than 63% on a difficult bill.
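A typical recipe for the kind of setup the abstract describes (a quantized causal LM with LoRA adapters) is sketched below with Hugging Face `transformers` and `peft`; the base model, rank, target modules, and training pairing are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch of a 4-bit quantized causal LM wrapped with LoRA adapters,
# in the spirit of the setup described above. Base model and hyperparameters
# are assumptions; requires a GPU and the bitsandbytes library.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base = "mistralai/Mistral-7B-v0.1"  # any small open model; assumption
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training would then pair "article + amendment" inputs with consolidated-text
# targets, e.g. via transformers.Trainer or an SFT trainer.
```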

pdf bib
Measuring the Groundedness of Legal Question-Answering Systems
Dietrich Trautmann | Natalia Ostapuk | Quentin Grail | Adrian Pol | Guglielmo Bonifazi | Shang Gao | Martin Gajek

In high-stakes domains like legal question-answering, the accuracy and trustworthiness of generative AI systems are of paramount importance. This work presents a comprehensive benchmark of various methods to assess the groundedness of AI-generated responses, aiming to significantly enhance their reliability. Our experiments include similarity-based metrics and natural language inference models to evaluate whether responses are well-founded in the given contexts. We also explore different prompting strategies for large language models to improve the detection of ungrounded responses. We validated the effectiveness of these methods using a newly created grounding classification corpus, designed specifically for legal queries and corresponding responses from retrieval-augmented prompting, focusing on their alignment with source material. Our results indicate potential in groundedness classification of generated responses, with the best method achieving a macro-F1 score of 0.8. Additionally, we evaluated the methods in terms of their latency to determine their suitability for real-world applications, as this step typically follows the generation process. This capability is essential for processes that may trigger additional manual verification or automated response regeneration. In summary, this study demonstrates the potential of various detection methods to improve the trustworthiness of generative AI in legal settings.

pdf bib
Transductive Legal Judgment Prediction Combining BERT Embeddings with Delaunay-Based GNNs
Hugo Attali | Nadi Tomeh

This paper presents a novel approach to legal judgment prediction by combining BERT embeddings with a Delaunay-based Graph Neural Network (GNN). Unlike inductive methods that classify legal documents independently, our transductive approach models the entire document set as a graph, capturing both contextual and relational information. This method significantly improves classification accuracy by enabling effective label propagation across connected documents. Evaluated on the Swiss-Judgment-Prediction (SJP) dataset, our model outperforms established baselines, including larger models with cross-lingual training and data augmentation techniques, while maintaining efficiency with minimal computational overhead.
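To illustrate the graph-construction idea (not the authors' exact procedure), the sketch below reduces document embeddings to 2-D and takes Delaunay triangulation edges as the document graph; the PCA step and the random stand-in embeddings are assumptions, since Delaunay triangulation is only practical in low dimensions, and the GNN itself is not shown.

```python
# Hedged sketch: build a document graph from embeddings via Delaunay
# triangulation. The PCA-to-2D step is an assumption; the GNN is omitted.
import numpy as np
from scipy.spatial import Delaunay
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(50, 768))   # stand-in for BERT embeddings

points = PCA(n_components=2).fit_transform(doc_embeddings)
tri = Delaunay(points)

edges = set()
for simplex in tri.simplices:                  # each simplex is a triangle
    for i in range(3):
        a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
        edges.add((int(a), int(b)))
print(f"{len(edges)} undirected edges for the GNN adjacency")
```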

pdf bib
Cross Examine: An Ensemble-based approach to leverage Large Language Models for Legal Text Analytics
Saurav Chowdhury | Lipika Dey | Suyog Joshi

Legal documents are complex in nature, describing the course of argumentative reasoning followed to settle a case. Churning through large volumes of legal documents is a daily requirement for a large number of professionals who need access to the information embedded in them. Natural language processing methods that help with document summarization around key information components, insight extraction and question answering therefore play a crucial role in legal text processing. Most existing document analysis systems use supervised machine learning; they require large volumes of annotated training data for every new application and are expensive to build. In this paper we propose a legal text analytics pipeline using Large Language Models (LLMs) that can work with little or no training data. For document summarization, we propose an iterative pipeline using retrieval augmented generation to ensure that the generated text remains contextually relevant. For question answering, we propose a novel ontology-driven ensemble approach, similar to cross-examination, that exploits questioning and verification principles. A knowledge graph, created with the extracted information, stores the key entities and relationships reflecting the repository content structure. A new dataset is created from Indian court documents related to bail applications for cases filed under the Protection of Children from Sexual Offences (POCSO) Act, 2012, an Indian law to protect children from sexual abuse and offences. Analysis of the insights extracted from the answers reveals patterns of crime and the social conditions leading to those crimes, which are important inputs for social scientists as well as the legal system.

pdf bib
LLMs to the Rescue: Explaining DSA Statements of Reason with Platform’s Terms of Services
Marco Aspromonte | Andrea Ferraris | Federico Galli | Giuseppe Contissa

The Digital Services Act (DSA) requires online platforms in the EU to provide “statements of reason” (SoRs) when restricting user content, but their effectiveness in ensuring transparency is still debated due to vague and complex terms of service (ToS). This paper explores the use of NLP techniques, specifically multi-agent systems based on large language models (LLMs), to clarify SoRs by linking them to relevant ToS sections. Analysing SoRs from platforms like Booking.com, Reddit, and LinkedIn, our findings show that LLMs can enhance the interpretability of content moderation decisions, improving user understanding and engagement with DSA requirements.

pdf bib
BLT: Can Large Language Models Handle Basic Legal Text?
Andrew Blair-Stanek | Nils Holzenberger | Benjamin Van Durme

We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs’ poor performance on this benchmark casts into doubt their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs’ reliability as-is for basic legal tasks.

pdf bib
Multi-Property Multi-Label Documents Metadata Recommendation based on Encoder Embeddings
Nasredine Cheniki | Vidas Daudaravicius | Abdelfettah Feliachi | Didier Hardy | Marc Wilhelm Küster

The task of document classification, particularly multi-label classification, presents a significant challenge due to the complexity of assigning multiple relevant labels to each document. This complexity is further amplified in multi-property multi-label classification tasks, where documents must be categorized across various sets of labels. In this research, we introduce an innovative encoder embedding-driven approach to multi-property multi-label document classification that leverages semantic-text similarity and the reuse of pre-existing annotated data to enhance the efficiency and accuracy of the document annotation process. Our method requires only a single model for text similarity, eliminating the need for multiple property-specific classifiers and thereby reducing computational demands and simplifying deployment. We evaluate our approach through a prototype deployed for daily operations, which demonstrates superior performance over existing classification systems. Our contributions include improved accuracy without additional training, increased efficiency, and demonstrated effectiveness in practical applications. The results of our study indicate the potential of our approach to be applied across various domains requiring multi-property multi-label document classification, offering a scalable and adaptable solution for metadata annotation tasks.
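The core idea, reusing existing annotations through a single embedding-similarity model instead of training one classifier per property, can be sketched as below; the encoder, the toy annotated records, the neighbourhood size, and the voting rule are illustrative assumptions rather than the authors' deployed system.

```python
# Hedged sketch: recommend labels for a new document by copying labels from
# its nearest annotated neighbours in embedding space. One similarity model
# serves every label property; all details here are illustrative assumptions.
from collections import Counter
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

annotated = [
    ("Regulation on data protection and the privacy of citizens",
     {"subject": ["privacy"], "type": ["regulation"]}),
    ("Directive amending rules on consumer credit agreements",
     {"subject": ["consumer protection"], "type": ["directive"]}),
]
emb = encoder.encode([text for text, _ in annotated], normalize_embeddings=True)

def recommend(text: str, prop: str, k: int = 2, min_votes: int = 1):
    q = encoder.encode([text], normalize_embeddings=True)[0]
    nearest = np.argsort(-(emb @ q))[:k]
    votes = Counter(label for i in nearest for label in annotated[i][1].get(prop, []))
    return [label for label, n in votes.items() if n >= min_votes]

print(recommend("Proposal for a regulation on personal data transfers", "subject"))
```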

pdf bib
Comparative Study of Explainability Methods for Legal Outcome Prediction
Ieva Staliunaite | Josef Valvoda | Ken Satoh

This paper investigates explainability in Natural Legal Language Processing (NLLP). We study the task of legal outcome prediction of the European Court of Human Rights cases in a ternary classification setup, where a language model is fine-tuned to predict whether an article has been claimed and violated (positive outcome), claimed but not violated (negative outcome) or not claimed at all (null outcome). Specifically, we experiment with three popular NLP explainability methods. Correlating the attribution scores of input-level methods (Integrated Gradients and Contrastive Explanations) with rationales from court rulings, we show that the correlations are very weak, with absolute values of Spearman and Kendall correlation coefficients ranging between 0.003 and 0.094. Furthermore, we use a concept-level interpretability method (Concept Erasure) with human expert annotations of legal reasoning, to show that obscuring legal concepts from the model representation has an insignificant effect on model performance (at most a decline of 0.26 F1). Therefore, our results indicate that automated legal outcome prediction models are not reliably grounded in legal reasoning.

pdf bib
Bonafide at LegalLens 2024 Shared Task: Using Lightweight DeBERTa Based Encoder For Legal Violation Detection and Resolution
Shikha Bordia

In this work, we present two systems, Named Entity Recognition (NER) and Natural Language Inference (NLI), for detecting legal violations within unstructured textual data and for associating these violations with potentially affected individuals, respectively. Both systems are lightweight DeBERTa-based encoders that outperform the LLM baselines. The proposed NER system achieved an F1 score of 60.01% on Subtask A of the LegalLens challenge, which focuses on identifying violations. The proposed NLI system achieved an F1 score of 84.73% on Subtask B of the LegalLens challenge, which focuses on resolving these violations by matching them with pre-existing legal complaints of class action cases. Our NER system ranked sixth and our NLI system ranked fifth on the LegalLens leaderboard. We release the trained models and inference scripts.

pdf bib
LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights
Odysseas Chlapanis | Dimitris Galanis | Ion Androutsopoulos

We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more clearly, compared to LegalBench, (c) even the best model (GPT-4o) obtains 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.

pdf bib
Gaps or Hallucinations? Scrutinizing Machine-Generated Legal Analysis for Fine-grained Text Evaluations
Abe Hou | William Jurayj | Nils Holzenberger | Andrew Blair-Stanek | Benjamin Van Durme

Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps – as opposed to hallucinations in a strict erroneous sense – to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.

pdf bib
Classify First, and Then Extract: Prompt Chaining Technique for Information Extraction
Alice Kwak | Clayton Morrison | Derek Bambauer | Mihai Surdeanu

This work presents a new task-aware prompt design and example retrieval approach for information extraction (IE) using a prompt chaining technique. Our approach divides IE tasks into two steps: (1) text classification to understand what information (e.g., entity or event types) is contained in the underlying text and (2) information extraction for the identified types. Initially, we use a large language model (LLM) in a few-shot setting to classify the contained information. The classification output is then used to select the relevant prompt and retrieve the examples relevant to the input text. Finally, we ask an LLM to perform the information extraction with the generated prompt. By evaluating our approach on legal IE tasks with two different LLMs, we demonstrate that the prompt chaining technique improves the LLM’s overall performance in a few-shot setting compared to a baseline in which examples from all possible classes are included in the prompt. Our approach can be used in a low-resource setting as it does not require a large amount of training data. Also, it can be easily adapted to many different IE tasks by simply adjusting the prompts. Lastly, it provides a cost benefit by reducing the number of tokens in the prompt.
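A skeletal version of such a classify-then-extract prompt chain might look like the sketch below; `call_llm`, the class inventory, and the per-class prompts are placeholders for illustration, not the authors' prompts or retrieval step.

```python
# Hedged sketch of a classify-then-extract prompt chain. `call_llm`, the
# class inventory, and the prompts are placeholders, not the authors' setup.
def call_llm(prompt: str) -> str:
    return "lease_termination"  # plug in any few-shot LLM call here

EXTRACTION_PROMPTS = {
    "lease_termination": "Extract the notice period, effective date and parties:\n{text}",
    "payment_terms": "Extract the amount due, due date and payee:\n{text}",
}

def classify_then_extract(text: str) -> str:
    # Step 1: classify which information type the text contains.
    label = call_llm(
        f"Which of {list(EXTRACTION_PROMPTS)} does this text describe?\n{text}"
    ).strip()
    # Step 2: extract with the prompt (and examples) selected for that class only.
    return call_llm(EXTRACTION_PROMPTS[label].format(text=text))

print(classify_then_extract("The tenant may terminate the lease with 60 days written notice."))
```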

pdf bib
Augmenting Legal Decision Support Systems with LLM-based NLI for Analyzing Social Media Evidence
Ram Mohan Rao Kadiyala | Siddartha Pullakhandam | Kanwal Mehreen | Subhasya Tippareddy | Ashay Srivastava

This paper presents our system description and error analysis for our entry to the NLLP 2024 shared task on Legal Natural Language Inference (L-NLI). The task required classifying the relationship between an online media review and a legal complaint as entailed, contradicted, or neutral, indicating any association between the review and the complaint. Our system emerged as the winning submission, outperforming other entries by a substantial margin and demonstrating the effectiveness of our approach to legal text analysis. We provide a detailed analysis of the strengths and limitations of each model and approach tested, along with a thorough error analysis and suggestions for future improvements. This paper aims to contribute to the growing field of legal NLP by offering insights into advanced techniques for natural language inference in legal contexts, making them accessible to both experts and newcomers in the field.

pdf bib
Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights
Maksym Taranukhin | Sahithya Ravi | Gabor Lukacs | Evangelos Milios | Vered Shwartz

The Canadian air travel sector has seen a significant increase in flight delays, cancellations, and other issues concerning passenger rights. Recognizing this demand, we present a chatbot to assist passengers and educate them about their rights. Our system breaks a complex user input into simple queries which are used to retrieve information from a collection of documents detailing air travel regulations. The most relevant passages from these documents are presented along with links to the original documents and the generated queries, enabling users to dissect and leverage the information for their unique circumstances. The system successfully overcomes two predominant challenges: understanding complex user inputs, and delivering accurate answers, free of hallucinations, that passengers can rely on for making informed decisions. A user study comparing the chatbot to a Google search demonstrated the chatbot’s usefulness and ease of use. Beyond the primary goal of providing accurate and timely information to air passengers regarding their rights, we hope that this system will also enable further research exploring the tradeoff between the user-friendly conversational interface of chatbots and the accuracy of retrieval systems.

pdf bib
Enhancing Legal Violation Identification with LLMs and Deep Learning Techniques: Achievements in the LegalLens 2024 Competition
Nguyen Tan Minh | Duy Ngoc Mai | Le Xuan Bach | Nguyen Huu Dung | Pham Cong Minh | Ha Thanh Nguyen | Thi Hai Yen Vuong

LegalLens is a competition organized to encourage advancements in automatically detecting legal violations. This paper presents our solutions for two tasks: Legal Named Entity Recognition (L-NER) and Legal Natural Language Inference (L-NLI). Our approach involves fine-tuning BERT-based models, designing methods based on data characteristics, and a novel prompting template for data augmentation using LLMs. As a result, we secured first place in L-NER and third place in L-NLI among thirty-six participants. We also perform error analysis to provide valuable insights and pave the way for future enhancements in legal NLP. Our implementation is available at https://github.com/lxbach10012004/legal-lens/tree/main

pdf bib
LegalLens 2024 Shared Task: Masala-chai Submission
Khalid Rajan | Royal Sequiera

In this paper, we present the masala-chai team’s participation in the LegalLens 2024 shared task and detail our approach to predicting legal entities and performing natural language inference (NLI) in the legal domain. We experimented with various transformer-based models, including BERT, RoBERTa, Llama 3.1, and GPT-4o. Our results show that state-of-the-art models like GPT-4o underperformed in NER and NLI tasks, even when using advanced techniques such as bootstrapping and prompt optimization. The best performance in NER (accuracy: 0.806, F1 macro: 0.701) was achieved with a fine-tuned RoBERTa model, while the highest NLI results (accuracy: 0.825, F1 macro: 0.833) came from a fine-tuned Llama 3.1 8B model. Notably, RoBERTa, despite having significantly fewer parameters than Llama 3.1 8B, delivered comparable results. We discuss key findings and insights from our experiments and provide our results and code for reproducibility and further analysis at https://github.com/rosequ/masala-chai

pdf bib
Semantists at LegalLens-2024: Data-efficient Training of LLM’s for Legal Violation Identification
Kanagasabai Rajaraman | Hariram Veeramani

In this paper, we describe our system for the LegalLens-2024 Shared Task on automatically identifying legal violations from unstructured text sources. We participate in Subtask B, called Legal Natural Language Inference (L-NLI), which aims to predict the relationship between a given premise summarizing a class action complaint and a hypothesis from an online media text, indicating any association between the review and the complaint. This task is challenging as it provides only limited labelled data. In our work, we adopt LLM-based methods and explore various data-efficient learning approaches to maximize performance. In the end, our best model employed an ensemble of LLMs fine-tuned on the task-specific data, achieved a Macro F1 score of 78.5% on the test data, and ranked 2nd among all team submissions.

pdf bib
LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text
Ben Hagag | Gil Gil Semo | Dor Bernsohn | Liav Harpaz | Pashootan Vaezipoor | Rohit Saha | Kyryl Truskovskyi | Gerasimos Spanakis

This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more marginal improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.

pdf bib
DeBERTa Beats Behemoths: A Comparative Analysis of Fine-Tuning, Prompting, and PEFT Approaches on LegalLensNER
Hanh Thi Hong Tran | Nishan Chatterjee | Senja Pollak | Antoine Doucet

This paper summarizes the participation of our team (Flawless Lawgic) in the legal named entity recognition (L-NER) task at LegalLens 2024: Detecting Legal Violations. Given potentially unstructured texts (e.g., online media texts), we aim to identify legal violations by extracting legal entities such as “violation”, “violation by”, “violation on”, and “law”. This system-description paper discusses our approaches to the task, empirically comparing the performance of fine-tuned models from the Transformers family (e.g., RoBERTa and DeBERTa) against open-source LLMs (e.g., Llama, Mistral) under different tuning settings (e.g., LoRA, Supervised Fine-Tuning (SFT), and prompting strategies). Our best results, with a weighted F1 of 0.705 on the test set, show an increase of 30 percentage points in F1 over the baseline and rank 2nd on the leaderboard, only 0.4 percentage points below the top solution. Our solutions are available at github.com/honghanhh/lner.

pdf bib
LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English
Santosh T.y.s.s | Cornelius Weiss | Matthias Grabmair

In the evolving NLP landscape, benchmarks serve as yardsticks for gauging progress. However, existing legal NLP benchmarks focus only on predictive tasks, overlooking generative tasks. This work curates LexSumm, a benchmark designed for evaluating legal summarization tasks in English. It comprises eight English legal summarization datasets from diverse jurisdictions, such as the US, UK, EU and India. Additionally, we release LexT5, a legal-oriented sequence-to-sequence model, addressing the limitation of existing BERT-style encoder-only models in the legal domain. We assess its capabilities through zero-shot probing on LegalLAMA and fine-tuning on LexSumm. Our analysis reveals abstraction and faithfulness errors even in summaries generated by zero-shot LLMs, indicating opportunities for further improvement. The LexSumm benchmark and the LexT5 model are available at https://github.com/TUMLegalTech/LexSumm-LexT5.

pdf bib
Towards Supporting Legal Argumentation with NLP: Is More Data Really All You Need?
Santosh T.y.s.s | Kevin Ashley | Katie Atkinson | Matthias Grabmair

Modeling legal reasoning and argumentation justifying decisions in cases has always been central to AI & Law, yet contemporary developments in legal NLP have increasingly focused on statistically classifying legal conclusions from text. While conceptually “simpler”, these approaches often fall short in providing usable justifications connecting to appropriate legal concepts. This paper reviews both traditional symbolic works in AI & Law and recent advances in legal NLP, and distills possibilities for integrating expert-informed knowledge to strike a balance between scalability and explanation in symbolic vs. data-driven approaches. We identify open challenges and discuss the potential of modern NLP models and methods that integrate conceptual legal knowledge.