Deep Learning–based Assessment of Oncologic Outcomes from Natural Language Processing of Structured Radiology Reports

Abstract

"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered that could affect the content.

Purpose
To train a deep natural language processing (NLP) model on data-mined structured oncology reports (SOR) for rapid tumor response category (TRC) classification of free-text oncology reports (FTOR), and to compare its performance with that of human readers and conventional NLP algorithms.

Materials and Methods
In this retrospective study, the databases of three independent radiology departments were queried for SOR and FTOR from March 2018 to August 2021. An automated data mining and curation pipeline was developed to extract Response Evaluation Criteria in Solid Tumors (RECIST)-related TRCs from SOR for ground truth definition. A deep NLP model based on bidirectional encoder representations from transformers (BERT) and three feature-rich reference algorithms were trained on SOR to predict TRCs in FTOR. The models' F1 scores were compared against those of radiologists, medical students, and radiology technologist students. Lexical and semantic analyses were conducted to investigate human and model performance on FTOR.

Results
Oncologic findings and TRCs were accurately mined from 9653 of 12833 (75.2%) queried SOR, yielding oncologic reports from 10455 patients (mean age, 60 years ± 14 [SD]; 5303 women) who met the inclusion criteria. On the 802 FTOR in the test set, BERT achieved better TRC classification results (F1, 0.70; 95% CI: 0.68, 0.73) than the best-performing reference algorithm, a linear support vector classifier (F1, 0.63; 95% CI: 0.61, 0.66), and the technologist students (F1, 0.65; 95% CI: 0.63, 0.67); performed comparably to the medical students (F1, 0.73; 95% CI: 0.72, 0.75); but was inferior to the radiologists (F1, 0.79; 95% CI: 0.78, 0.81). Lexical complexity and semantic ambiguities in FTOR affected both human and model performance, with maximum F1 score drops of 0.17 and 0.19, respectively.

Conclusion
The developed deep NLP model reached the performance level of medical students, but not radiologists, in curating oncologic outcomes from radiology FTOR.

©RSNA, 2022
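The abstract names BERT as the deep NLP model used to map free-text reports to TRCs. A minimal classification sketch, assuming a HuggingFace checkpoint and the four standard RECIST response categories, could look like the following; the authors' actual checkpoint, label set, and fine-tuning setup are not given in the abstract.

```python
# Minimal sketch of BERT-based TRC classification. The checkpoint and
# label set are assumptions; the classification head below is untrained
# and would first need fine-tuning on SOR-derived labels, as the paper
# describes.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TRC_LABELS = ["complete response", "partial response",
              "stable disease", "progressive disease"]  # assumed label set

CHECKPOINT = "bert-base-cased"  # placeholder; the paper's checkpoint is not stated
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=len(TRC_LABELS))
model.eval()

def classify_report(report_text: str) -> str:
    """Predict a tumor response category for one free-text report."""
    inputs = tokenizer(report_text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return TRC_LABELS[int(logits.argmax(dim=-1))]
```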
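The best-performing reference algorithm reported in the Results is a linear support vector classifier. A sketch of one such feature-rich baseline is given below; the TF-IDF word and bigram featurization is an assumption, since the abstract does not describe the authors' feature engineering.

```python
# Sketch of a feature-rich reference classifier in the spirit of the
# linear support vector classifier named in the abstract. The TF-IDF
# featurization is an assumption; the paper's features are not stated.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word and bigram features
    LinearSVC(C=1.0),
)
# baseline.fit(train_reports, train_trcs)       # hypothetical training data
# predictions = baseline.predict(test_reports)  # hypothetical test reports
```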
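All results are reported as F1 scores with 95% CIs on a fixed test set. One common way to obtain such an interval is a percentile bootstrap over the test cases, sketched below; whether the authors used this resampling scheme, or macro averaging, is an assumption on my part.

```python
# Sketch of macro-averaged F1 with a percentile bootstrap 95% CI.
# The averaging and resampling scheme are assumptions; the abstract
# only reports F1 scores with 95% CIs.
import numpy as np
from sklearn.metrics import f1_score

def f1_with_bootstrap_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Return (F1, lower, upper): macro F1 with a 95% bootstrap CI."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = f1_score(y_true, y_pred, average="macro")
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lower, upper = np.percentile(scores, [2.5, 97.5])
    return point, lower, upper

# Usage: f1, lo, hi = f1_with_bootstrap_ci(true_trcs, predicted_trcs)
```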

Publication
Radiology: Artificial Intelligence
Klaus Maier-Hein
Head of Medical Image Computing
Jens Kleesiek
Professor of Translational Image-guided Oncology