When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification

dc.contributor.author: Li, Xuedong
dc.contributor.author: Yuan, Walter
dc.contributor.author: Peng, Dezhong
dc.contributor.author: Mei, Qiaozhu
dc.contributor.author: Wang, Yue
dc.date.accessioned: 2022-08-10T18:15:12Z
dc.date.available: 2022-08-10T18:15:12Z
dc.date.issued: 2022-04-05
dc.identifier.citation: BMC Medical Informatics and Decision Making. 2022 Apr 05;21(Suppl 9):377
dc.identifier.uri: https://doi.org/10.1186/s12911-022-01829-2
dc.identifier.uri: https://hdl.handle.net/2027.42/173609
dc.description.abstract: Background: Natural language processing (NLP) tasks in the health domain often deal with limited amounts of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks have often used training data at a scale that is unlikely to be available in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP. Method: We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the number of training examples per disease from small to large and measured the classification performance in macro-averaged $$F_{1}$$ score. Results: On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features. Conclusion: As long as the number of training documents is not extremely small, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features should be considered.
dc.title: When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification
dc.type: Journal Article
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/173609/1/12911_2022_Article_1829.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/5340
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dc.date.updated: 2022-08-10T18:15:11Z
dc.owningcollname: Interdisciplinary and Peer-Reviewed
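
The abstract above describes a learning-curve protocol: train a classifier on an increasing number of labeled examples per class and report macro-averaged F1 on a held-out test set, so that rare classes count as much as common ones. The following is a minimal sketch of that protocol in Python, assuming scikit-learn and the public 20 Newsgroups corpus as stand-ins; it does not reproduce the paper's disease data sets, BERT fine-tuning, or baselines, and the sample sizes and the helper name subsample_per_class are illustrative.

# Sketch of the learning-curve protocol: subsample n examples per class,
# train a TF-IDF bag-of-words + logistic regression baseline, and record
# the macro-averaged F1 score on a fixed test set.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train = fetch_20newsgroups(subset="train")   # stand-in corpus, not the paper's data
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)
y_train, y_test = np.asarray(train.target), np.asarray(test.target)

def subsample_per_class(y, n_per_class, rng):
    """Return indices of at most n_per_class training examples for each class."""
    idx = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        take = min(n_per_class, len(members))
        idx.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(idx)

rng = np.random.default_rng(0)
for n_per_class in (1, 2, 5, 10, 50, 100):   # illustrative training-set sizes
    idx = subsample_per_class(y_train, n_per_class, rng)
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
    print(f"{n_per_class:>4} examples/class -> macro-F1 = {macro_f1:.3f}")

Macro-averaging gives each class equal weight in the final score, which is why the rare diseases the abstract mentions (fewer than 2 observations in 10,000) influence the learning curve as strongly as common ones.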


