The Effects of De-identified Tokens on the Performance of Clinical Large Language Models
Session Number
CMPS 03
Advisor(s)
Paul Landes, Argonne National Laboratory
Discipline
Computer Science
Start Date
17-4-2024 11:05 AM
End Date
17-4-2024 11:20 AM
Abstract
Clinical large language models (LLMs) are essential to the biomedical industry: they can analyze and interpret physician notes, anonymize patient health information, and supplement diagnoses in various medical practices. Most widely used clinical LLMs, however, are trained on medical text in which protected health information is masked---corpora unlike the data the models encounter in the field. Despite this, little research has been conducted to determine whether masked corpora affect the performance of clinical LLMs.
In our study, we trained models on commonly used anonymized corpora (MIMIC-III and MIMIC-IV) with the masked tokens replaced by pseudo-tokens (i.e., artificially generated names, dates, and locations) and compared their performance to that of the same models trained on the unaltered corpora. Performance was measured by evaluation scores (F1, precision, and recall) on de-identification, inference, question-answering, and summarization tasks. Our data show that the LLMs trained on our pseudo-corpora significantly outperform those trained on the original corpora in de-identification and inference-based tasks. These results suggest that pseudo-generated data could provide a new framework for building training corpora and creating stronger clinical LLMs.
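The pseudo-token substitution described in the abstract can be sketched as follows. This is a minimal illustration only, assuming MIMIC-style `[**...**]` de-identification masks; the category heuristic and the small surrogate pools (`PSEUDO`) are hypothetical stand-ins for the study's actual surrogate-generation procedure.

```python
import random
import re

# Hypothetical surrogate pools; a real pipeline would draw from much
# larger, distribution-matched lists of names, dates, and locations.
PSEUDO = {
    "name": ["John Smith", "Maria Garcia", "Wei Chen"],
    "date": ["2019-03-14", "2020-11-02", "2018-07-30"],
    "location": ["Springfield", "Riverside", "Lakeside"],
}

# MIMIC-style masks look like [**Known lastname**] or [**2150-1-1**].
MASK = re.compile(r"\[\*\*(.*?)\*\*\]")

def classify(label: str) -> str:
    """Map a mask label to a coarse PHI category (assumed heuristic)."""
    if re.search(r"\d{4}-\d{1,2}-\d{1,2}", label):
        return "date"
    if "name" in label.lower():
        return "name"
    return "location"

def pseudonymize(text: str, seed: int = 0) -> str:
    """Replace every masked token with a randomly chosen pseudo-token."""
    rng = random.Random(seed)  # seeded for reproducible corpora
    return MASK.sub(lambda m: rng.choice(PSEUDO[classify(m.group(1))]), text)
```

For example, `pseudonymize("Pt [**Known lastname**] seen on [**2150-1-1**].")` yields a note containing a surrogate name and date in place of the masks, so the training text resembles the unmasked notes a deployed model would actually see.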