Abstract
Despite advances in the application of deep neural networks to various kinds of medical data, extracting information from unstructured textual sources remains a challenging task. Using a dataset of de-identified clinical letters gathered at a memory clinic, we evaluate recurrent neural networks (RNNs) on their ability to predict patients’ diagnoses of ‘Dementia’, ‘Mild Cognitive Impairment’ or ‘Non-impaired’. This classification framework can also have applications in the automatic identification of patients as candidates for clinical trials.
After showing that models trained on state-of-the-art sentence-level embeddings outperform both word-level models and a recent benchmark model that fine-tunes a pre-trained general-domain language model, we probe sentence embedding models in order to reveal interpretable insights into the types of sentence-level representations the RNNs build. Specifically, we take a measure of sentence importance with respect to a given class and identify clusters of sentences in the embedding space that correlate strongly with importance scores for each class. Extracting the most frequent phrases within each group of sentence representations shows that the model is sensitive to sentences that cluster around semantic concepts such as a patient’s level of geriatric depression and how independent the patient is in their daily activities.
In addition to showing which sentences in a document are most informative about the patient’s condition, our method can identify the types of sentences that can lead the model to make incorrect diagnoses.
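As a rough illustration of the probing step described above, the sketch below correlates cluster membership with per-sentence importance scores for one class. It is a minimal example under stated assumptions, not the authors' implementation: the abstract does not specify the importance measure, the clustering algorithm, or the correlation statistic, so the function names (`pearson`, `cluster_importance_correlation`), the choice of Pearson correlation on a binary membership indicator, and the toy values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cluster_importance_correlation(cluster_ids, importance):
    """For each cluster, correlate a 0/1 membership indicator with importance scores.

    cluster_ids: cluster label of each sentence (assumed to come from any
                 clustering of the sentence embeddings).
    importance:  importance score of each sentence for one class (assumed
                 to be computed elsewhere).
    """
    result = {}
    for cluster in sorted(set(cluster_ids)):
        membership = [1.0 if cid == cluster else 0.0 for cid in cluster_ids]
        result[cluster] = pearson(membership, importance)
    return result


# Hypothetical toy data: six sentences in three clusters.
cluster_ids = [0, 0, 1, 1, 2, 2]
importance = [0.9, 0.8, 0.1, 0.2, 0.5, 0.4]

corr = cluster_importance_correlation(cluster_ids, importance)
top_cluster = max(corr, key=corr.get)
print(corr, top_cluster)
```

In a setting like the one described, the clusters with the strongest correlation for a class would be the ones whose most frequent phrases are then inspected.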
Original language | English |
---|---|
Number of pages | 1 |
Publication status | Published - 02 Sept 2019 |
Event | International Conference of the Royal Statistical Society (RSS 2019), Belfast, United Kingdom. Duration: 02 Sept 2019 → 05 Sept 2019. https://events.rss.org.uk/rss/frontend/reg/thome.csp?pageID=83705&ef_sel_menu=1647&eventID=270 |
Conference
Conference | International Conference of the Royal Statistical Society (RSS 2019) |
---|---|
Abbreviated title | RSS 2019 |
Country/Territory | United Kingdom |
City | Belfast |
Period | 02/09/2019 → 05/09/2019 |