When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification

dc.contributor.author: Li, Xuedong
dc.contributor.author: Yuan, Walter
dc.contributor.author: Peng, Dezhong
dc.contributor.author: Mei, Qiaozhu
dc.contributor.author: Wang, Yue
dc.date.accessioned: 2022-08-10T18:15:12Z
dc.date.available: 2022-08-10T18:15:12Z
dc.date.issued: 2022-04-05
dc.identifier.citation: BMC Medical Informatics and Decision Making. 2022 Apr 05;21(Suppl 9):377
dc.identifier.uri: https://doi.org/10.1186/s12911-022-01829-2
dc.identifier.uri: https://hdl.handle.net/2027.42/173609
dc.description.abstract: Background: Natural language processing (NLP) tasks in the health domain often deal with limited amounts of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks have often used training data at a scale that is unlikely to be available in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP. Method: We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we incremented the number of training examples per disease from small to large and measured the classification performance in macro-averaged $$F_{1}$$ score. Results: On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features. Conclusion: As long as the number of training documents is not extremely small, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features should be considered.
dc.title: When BERT meets Bilbo: a learning curve analysis of pretrained language model on disease classification
dc.type: Journal Article
dc.description.bitstreamurl: http://deepblue.lib.umich.edu/bitstream/2027.42/173609/1/12911_2022_Article_1829.pdf
dc.identifier.doi: https://dx.doi.org/10.7302/5340
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dc.date.updated: 2022-08-10T18:15:11Z
dc.owningcollname: Interdisciplinary and Peer-Reviewed
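
The abstract above describes a learning-curve protocol: train a classifier on an increasing number of labeled examples per class and report macro-averaged F1 on a held-out test set, so that rare classes count as much as common ones. The following is a minimal sketch of that protocol in Python, assuming scikit-learn and the public 20 Newsgroups corpus as stand-ins; it does not reproduce the paper's disease data sets, BERT fine-tuning, or baselines, and the sample sizes and the helper name subsample_per_class are illustrative.

# Sketch of the learning-curve protocol: subsample n examples per class,
# train a TF-IDF bag-of-words + logistic regression baseline, and record
# the macro-averaged F1 score on a fixed test set.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

train = fetch_20newsgroups(subset="train")   # stand-in corpus, not the paper's data
test = fetch_20newsgroups(subset="test")

vectorizer = TfidfVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)
y_train, y_test = np.asarray(train.target), np.asarray(test.target)

def subsample_per_class(y, n_per_class, rng):
    """Return indices of at most n_per_class training examples for each class."""
    idx = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        take = min(n_per_class, len(members))
        idx.extend(rng.choice(members, size=take, replace=False))
    return np.asarray(idx)

rng = np.random.default_rng(0)
for n_per_class in (1, 2, 5, 10, 50, 100):   # illustrative training-set sizes
    idx = subsample_per_class(y_train, n_per_class, rng)
    clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
    print(f"{n_per_class:>4} examples/class -> macro-F1 = {macro_f1:.3f}")

Macro-averaging gives each class equal weight in the final score, which is why the rare diseases the abstract mentions (fewer than 2 observations in 10,000) influence the learning curve as strongly as common ones.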


