medRxiv : the preprint server for health sciences
Authors: Crowder R, Thangakunam B, Andama A, Christopher DJ, Dalay V, Dube-Nwamba W, Kik SV, Nguyen DV, Nhung NV, Phillips PP, Ruhwald M, Theron G, Worodria W, Yu C, Nahid P, Cattamanchi A, Gupta-Wright A, Denkinger CM, R2D2 TB Network
medRxiv : the preprint server for health sciences
Authors: Bagala I, Namuganga JF, Nayebare P, Cuu G, Katairo T, Nabende I, Gonahasa S, Nassali M, Tukwasibwe S, Dorsey G, Nankabirwa J, Kitaka SB, Kiguli S, Greenhouse B, Ssewanyana I, Kamya MR, Briggs J
Liver transplantation : official publication of the American Association for the Study of Liver Diseases and the International Liver Transplantation Society
Authors: Wang M, Shui AM, Ruck J, Huang CY, Verna EC, King EA, Ladner DP, Ganger D, Kappus M, Rahimi R, Tevar AD, Duarte-Rojo A, Lai JC
Journal of the American Medical Informatics Association : JAMIA
Authors: Apathy NC, Biro J, Holmgren AJ
AIDS (London, England)
Authors: Martinson T, Montoya R, Moreira C, Kuncze K, Sassaman K, Heise MJ, Glidden DV, Amico KR, Arnold EA, Buchbinder SP, Ewart LD, Carrico A, Wang G, Okochi H, Scott HM, Gandhi M, Spinelli MA
Volume 31, Issue 10 | Journal of the American Medical Informatics Association : JAMIA
Authors: Sushil M, Zack T, Mandair D, Zheng Z, Wali A, Yu YN, Quan Y, Lituiev D, Butte AJ
OBJECTIVE
Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations.
MATERIALS AND METHODS
We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, to compare the zero-shot classification capability of four LLMs (GPT-4, GPT-3.5, Starling, and ClinicalCamel) with the task-specific supervised classification performance of three models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model.
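As a rough illustration of the zero-shot setup described above (not the authors' actual prompts, label schema, or tooling, none of which are given here), a minimal sketch of asking an LLM to assign one of a few hypothetical labels to a pathology report might look like this, assuming the openai Python client (v1+) and an available GPT-4 endpoint:

```python
# Minimal zero-shot classification sketch; category names, prompt wording,
# and model identifier are illustrative assumptions, not the study protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CATEGORIES = ["positive", "negative", "not reported"]  # hypothetical label set

def classify_report(report_text: str, task: str) -> str:
    """Ask the model to pick exactly one label for a single labeling task."""
    prompt = (
        "You are labeling a breast cancer pathology report.\n"
        f"Task: {task}\n"
        f"Choose exactly one label from {CATEGORIES} and reply with the label only.\n\n"
        f"Report:\n{report_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output is preferable for classification
    )
    return response.choices[0].message.content.strip()

print(classify_report("Invasive ductal carcinoma, ER positive ...", "estrogen receptor status"))
```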
RESULTS
Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with an advantage on tasks with high label imbalance. The other LLMs performed poorly. Frequent GPT-4 error categories included incorrect inferences from multiple samples and from history, as well as complex task design; several LSTM-Att errors were related to poor generalization to the test set.
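For reference, the macro F1-score reported above averages per-class F1-scores with equal weight per class, which is why it is sensitive to performance on rare labels. A minimal sketch with scikit-learn, using made-up labels and predictions rather than the study's data:

```python
# Illustrative macro F1 computation; the labels and predictions are invented.
from sklearn.metrics import f1_score

y_true = ["positive", "negative", "negative", "not reported", "positive"]
y_pred = ["positive", "negative", "positive", "not reported", "positive"]

# average="macro" gives each class equal weight regardless of frequency,
# so poor performance on a rare class pulls the score down noticeably.
print(f1_score(y_true, y_pred, average="macro"))
```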
DISCUSSION
On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, where the use of LLMs is prohibitive, simpler models trained on large annotated datasets can provide comparable results.
CONCLUSIONS
GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.
Biomedical physics & engineering express
Authors: Lee HY, Lee G, Ferguson D, Hsu SH, Hu YH, Huynh E, Sudhyadhom A, Williams CL, Cagney DN, Fitzgerald KJ, Kann BH, Kozono D, Leeman JE, Mak RH, Han Z
JAMA
Authors: Marcus GM, Curfman G, Bibbins-Domingo K
JAMA
Authors: Ni M, Dadon Z, Ormerod JOM, Saenen J, Hoeksema WF, Antiperovitch P, Tadros R, Christiansen MK, Steinberg C, Arnaud M, Tian S, Sun B, Estillore JP, Wang R, Khan HR, Roston TM, Mazzanti A, Giudicessi JR, Siontis KC, Alak A, Acosta JG, Divakara Menon SM, Tan NS, van der Werf C, Nazer B, Vivekanantham H, Pandya T, Cunningham J, Gula LJ, Wong JA, Amit G, Scheinman MM, Krahn AD, Ackerman MJ, Priori SG, Gollob MH, Healey JS, Sacher F, Nof E, Glikson M, Wilde AAM, Watkins H, Jensen HK, Postema PG, Belhassen B, Chen SRW, Roberts JD
Journal of pediatric gastroenterology and nutrition
Authors: Shifman HP, Hatchett J, Pai RA, Safer R, Gomel R, Vyas M, Li M, Lai JC, Wadhwani SI