Towards a Scientific Language Processing Model

Session Number

Project ID: CMPS 04

Advisor(s)

Dr. Tanwi Mallick; Argonne National Laboratory

Discipline

Computer Science

Start Date

19-4-2023 9:35 AM

End Date

19-4-2023 9:50 AM

Abstract

Natural Language Processing is an effective tool for analyzing large volumes of text. However, most scientific articles contain sophisticated language that can be difficult to understand quickly. To expedite this, I tuned a model that can quickly classify datasets of abstracts about scientific topics into specific subcategories. Using the arXiv corpus of over 2.2 million abstracts, I created a dataset of climate change articles, on which I ran pretrained Hugging Face models. Using observational and quantitative measures (ROUGE, cosine similarity, etc.), I tuned the parameters of various keyword extraction models and analyzed the keyword frequency of the dataset. Then, using the BERTopic model with various embedding techniques (SentenceTransformers, spaCy, etc.), I classified the dataset into clusters that could be analyzed individually. I used abstractive and extractive summarization models on each cluster to concisely describe the general progress of particular climate change topics. Using dynamic topic modeling, I then plotted the prevalence of different topics over time, which provided insight into the interest in climate change topics over the past decade. This weakly supervised algorithm allows analysts and researchers to quickly derive general conclusions about specific scientific topics and visualize their relevance in the scientific community over time.
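
The abstract names ROUGE and cosine similarity as quantitative measures but does not say how they were computed. As a rough illustration only, the sketch below shows simplified bag-of-words versions of both in plain Python. The function names (`tokenize`, `cosine_similarity`, `rouge1_recall`) and the whitespace tokenizer are assumptions made for this example, not the project's actual code; in practice embeddings and a full ROUGE implementation would likely be used instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation (a simplifying assumption)."""
    tokens = (w.strip(".,;:()").lower() for w in text.split())
    return [t for t in tokens if t]


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate (clipped counts)."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(cand[t], count) for t, count in ref.items())
    return overlap / total
```

Scores like these could be used to compare extracted keywords or generated summaries against a reference text while adjusting model parameters, which is one plausible reading of the tuning step described above.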
