2024
pdf
bib
abs
A Privacy-Preserving Corpus for Occupational Health in Spanish: Evaluation for NER and Classification Tasks
Claudio Aracena
|
Luis Miranda
|
Thomas Vakili
|
Fabián Villena
|
Tamara Quiroga
|
Fredy Núñez-Torres
|
Victor Rocco
|
Jocelyn Dunstan
Proceedings of the 6th Clinical Natural Language Processing Workshop
Annotated corpora are essential to reliable natural language processing. While they are expensive to create, they are essential for building and evaluating systems. This study introduces a new corpus of 2,869 medical and admission reports collected by an occupational insurance and health provider. The corpus has been carefully annotated for personally identifiable information (PII) and is shared, masking this information. Two annotators adhered to annotation guidelines during the annotation process, and a referee later resolved annotation conflicts in a consolidation process to build a gold standard subcorpus. The inter-annotator agreement values, measured in F1, range between 0.86 and 0.93 depending on the selected subcorpus. The value of the corpus is demonstrated by evaluating its use for NER of PII and a classification task. The evaluations find that fine-tuned models and GPT-3.5 reach F1 of 0.911 and 0.720 in NER of PII, respectively. In the case of the insurance coverage classification task, using the original or de-identified corpus results in similar performance. The annotated data are released in de-identified form.
pdf
bib
abs
When Is a Name Sensitive? Eponyms in Clinical Text and Implications for De-Identification
Thomas Vakili
|
Tyr Hullmann
|
Aron Henriksson
|
Hercules Dalianis
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)
Clinical data, in the form of electronic health records, are rich resources that can be tapped using natural language processing. At the same time, they contain very sensitive information that must be protected. One strategy is to remove or obscure data using automatic de-identification. However, the detection of sensitive data can yield false positives. This is especially true for tokens that are similar in form to sensitive entities, such as eponyms. These names tend to refer to medical procedures or diagnoses rather than specific persons. Previous research has shown that automatic de-identification systems often misclassify eponyms as names, leading to a loss of valuable medical information. In this study, we estimate the prevalence of eponyms in a real Swedish clinical corpus. Furthermore, we demonstrate that modern transformer-based de-identification systems are more accurate in distinguishing between names and eponyms than previous approaches.
2023
pdf
bib
abs
Using Membership Inference Attacks to Evaluate Privacy-Preserving Language Modeling Fails for Pseudonymizing Data
Thomas Vakili
|
Hercules Dalianis
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Large pre-trained language models dominate the current state-of-the-art for many natural language processing applications, including the field of clinical NLP. Several studies have found that these can be susceptible to privacy attacks that are unacceptable in the clinical domain where personally identifiable information (PII) must not be exposed. However, there is no consensus regarding how to quantify the privacy risks of different models. One prominent suggestion is to quantify these risks using membership inference attacks. In this study, we show that a state-of-the-art membership inference attack on a clinical BERT model fails to detect the privacy benefits from pseudonymizing data. This suggests that such attacks may be inadequate for evaluating token-level privacy preservation of PIIs.
2022
pdf
bib
abs
Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data
Thomas Vakili
|
Anastasios Lamproudis
|
Aron Henriksson
|
Hercules Dalianis
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Automatic de-identification is a cost-effective and straightforward way of removing large amounts of personally identifiable information from large and sensitive corpora. However, these systems also introduce errors into datasets due to their imperfect precision. These corruptions of the data may negatively impact the utility of the de-identified dataset. This paper de-identifies a very large clinical corpus in Swedish either by removing entire sentences containing sensitive data or by replacing sensitive words with realistic surrogates. These two datasets are used to perform domain adaptation of a general Swedish BERT model. The impact of the de-identification techniques is assessed by training and evaluating the models using six clinical downstream tasks. The results are then compared to a similar BERT model domain-adapted using an unaltered version of the clinical corpus. The results show that using an automatically de-identified corpus for domain adaptation does not negatively impact downstream performance. We argue that automatic de-identification is an efficient way of reducing the privacy risks of domain-adapted models and that the models created in this paper should be safe to distribute to other academic researchers.
pdf
bib
abs
Evaluating Pre-Trained Language Models for Focused Terminology Extraction from Swedish Medical Records
Oskar Jerdhaf
|
Marina Santini
|
Peter Lundberg
|
Tomas Bjerner
|
Yosef Al-Abasse
|
Arne Jonsson
|
Thomas Vakili
Proceedings of the Workshop on Terminology in the 21st century: many faces, many places
In the experiments briefly presented in this abstract, we compare the performance of a generalist Swedish pre-trained language model with a domain-specific Swedish pre-trained model on the downstream task of focussed terminology extraction of implant terms, which are terms that indicate the presence of implants in the body of patients. The fine-tuning is identical for both models. For the search strategy we rely on KD-Tree that we feed with two different lists of term seeds, one with noise and one without noise. Results shows that the use of a domain-specific pre-trained language model has a positive impact on focussed terminology extraction only when using term seeds without noise.
pdf
bib
abs
Utility Preservation of Clinical Text After De-Identification
Thomas Vakili
|
Hercules Dalianis
Proceedings of the 21st Workshop on Biomedical Language Processing
Electronic health records contain valuable information about symptoms, diagnosis, treatment and outcomes of the treatments of individual patients. However, the records may also contain information that can reveal the identity of the patients. Removing these identifiers - the Protected Health Information (PHI) - can protect the identity of the patient. Automatic de-identification is a process which employs machine learning techniques to detect and remove PHI. However, automatic techniques are imperfect in their precision and introduce noise into the data. This study examines the impact of this noise on the utility of Swedish de-identified clinical data by using human evaluators and by training and testing BERT models. Our results indicate that de-identification does not harm the utility for clinical NLP and that human evaluators are less sensitive to noise from de-identification than expected.
pdf
bib
abs
Cross-Clinic De-Identification of Swedish Electronic Health Records: Nuances and Caveats
Olle Bridal
|
Thomas Vakili
|
Marina Santini
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
Privacy preservation of sensitive information is one of the main concerns in clinical text mining. Due to the inherent privacy risks of handling clinical data, the clinical corpora used to create the clinical Named Entity Recognition (NER) models underlying clinical de-identification systems cannot be shared. This situation implies that clinical NER models are trained and tested on data originating from the same institution since it is rarely possible to evaluate them on data belonging to a different organization. These restrictions on sharing make it very difficult to assess whether a clinical NER model has overfitted the data or if it has learned any undetected biases. This paper presents the results of the first-ever cross-institution evaluation of a Swedish de-identification system on Swedish clinical data. Alongside the encouraging results, we discuss differences and similarities across EHR naming conventions and NER tagsets.