The Effects of De-identified Tokens on the Performance of Clinical Large Language Models

Session Number

CMPS 03

Advisor(s)

Paul Landes, Argonne National Laboratory

Discipline

Computer Science

Start Date

17-4-2024 11:05 AM

End Date

17-4-2024 11:20 AM

Abstract

Clinical large language models (LLMs) are essential to the biomedical industry: they can analyze and interpret physician notes, anonymize patient health information, and supplement diagnoses across medical practices. Most widely used clinical LLMs, however, are trained on medical text in which protected health information has been masked---corpora unlike the data the models encounter in the field. Despite this mismatch, little to no research has examined whether masked corpora affect the performance of clinical LLMs.

In our study, we trained models on commonly used anonymized corpora (MIMIC-III and MIMIC-IV) in which the masked tokens were replaced with pseudo-tokens (i.e., artificially generated names, dates, and locations), and compared their performance to that of the same models trained on the unaltered corpora. Performance was measured by evaluation scores (F1, precision, and recall) on de-identification, inference, question-answering, and summarization tasks. Our results show that LLMs trained on the pseudo-corpora significantly outperform those trained on the original corpora in de-identification and inference tasks. These findings suggest that pseudo-generated data could provide a new framework for building training corpora and creating stronger clinical LLMs.
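The pseudo-token substitution described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the surrogate pools, the placeholder regex, and the `pseudonymize` function are hypothetical, loosely modeled on MIMIC-style `[**...**]` de-identification markers.

```python
import random
import re

# Hypothetical pools of surrogate values; the study's actual
# pseudo-token generator is not described in this abstract.
FAKE_NAMES = ["Maria Lopez", "James Carter", "Wei Zhang"]
FAKE_DATES = ["2019-03-14", "2020-11-02", "2018-07-29"]
FAKE_LOCATIONS = ["Springfield General Hospital", "Lakeside Clinic"]

# MIMIC notes mark removed PHI with placeholders such as
# [**First Name**], [**Hospital**], or [**2145-6-1**].
MASK_PATTERN = re.compile(r"\[\*\*(.*?)\*\*\]")

def pseudonymize(text: str, rng: random.Random) -> str:
    """Replace each masked PHI placeholder with a plausible surrogate."""
    def substitute(match: re.Match) -> str:
        label = match.group(1).lower()
        if "name" in label:
            return rng.choice(FAKE_NAMES)
        if "hospital" in label or "location" in label:
            return rng.choice(FAKE_LOCATIONS)
        # Treat date-like (or unrecognized) placeholders as dates here.
        return rng.choice(FAKE_DATES)
    return MASK_PATTERN.sub(substitute, text)

note = "Pt [**First Name**] admitted to [**Hospital**] on [**2145-6-1**]."
print(pseudonymize(note, random.Random(0)))
```

The resulting pseudo-corpus keeps the surface form of real clinical text (actual-looking names, dates, and locations) rather than opaque mask tokens, which is the property the study's comparison isolates.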

