A natural language processing method to identify social determinants of health in electronic health records

August 15, 2019
Free for AMIA members; $50 for non-members
Cosmin Adrian Bejan, PhD

Identifying social determinants of health in electronic health records (EHRs) could provide important insight into health and disease outcomes. We developed a methodology to capture 2 rare and severe social determinants of health, homelessness and adverse childhood experiences (ACEs), from a large EHR repository.

We first constructed lexicons to capture homelessness and ACE phenotypic profiles. We employed word2vec and lexical associations to mine homelessness-related words. Next, using relevance feedback, we refined the 2 profiles with iterative searches over 100 million notes from the Vanderbilt EHR. Seven assessors manually reviewed the top-ranked results of 2544 patient visits relevant for homelessness and 1000 patients relevant for ACE.

word2vec yielded better performance (area under the precision-recall curve [AUPRC] of 0.94) than lexical associations (AUPRC = 0.83) for extracting homelessness-related words. A comparative study of searches for the 2 phenotypes revealed higher performance for homelessness (AUPRC = 0.95) than for ACE (AUPRC = 0.79). A temporal analysis of the homeless population showed that the majority experienced chronic homelessness. Most ACE patients suffered sexual (70%) and/or physical (50.6%) abuse, with the top-ranked abuser keywords being “father” (21.8%) and “mother” (15.4%). The most prevalent associated conditions for homeless patients were lack of housing (62.8%) and tobacco use disorder (61.5%), while for ACE patients they were mental disorders (36.6%–47.6%).

We provide an efficient solution for mining homelessness and ACE information from EHRs, which can facilitate large clinical and genetic studies of these social determinants of health.
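To make the lexicon-mining and evaluation steps more concrete, here is a minimal Python sketch. It is not the study's implementation. The 2-dimensional vectors stand in for trained word2vec embeddings, and the seed words, candidate words, and assessor labels are all hypothetical. Ranking candidates by cosine similarity to the centroid of the seed vectors is an assumption made for illustration, and average precision is used as a common estimate of the area under the precision-recall curve.

```python
import math

# Hypothetical toy embeddings; a real system would load trained word2vec vectors.
SEED_VECTORS = {"homeless": [1.0, 0.0], "shelter": [0.9, 0.1]}
CANDIDATE_VECTORS = {
    "unhoused": [0.95, 0.05],
    "hotel": [0.8, 0.2],
    "eviction": [0.7, 0.3],
    "fracture": [0.2, 0.8],
    "aspirin": [0.0, 1.0],
}
# Hypothetical manual judgments: is the word related to homelessness?
ASSESSOR_LABELS = {
    "unhoused": True,
    "hotel": False,
    "eviction": True,
    "fracture": False,
    "aspirin": False,
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(seed_vectors, candidate_vectors):
    """Rank candidate words by similarity to the centroid of the seed vectors."""
    dim = len(seed_vectors[0])
    centroid = [sum(v[i] for v in seed_vectors) / len(seed_vectors) for i in range(dim)]
    scored = [(word, cosine(centroid, vec)) for word, vec in candidate_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def average_precision(ranked_labels):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


ranked = rank_candidates(list(SEED_VECTORS.values()), CANDIDATE_VECTORS)
labels_in_rank_order = [ASSESSOR_LABELS[word] for word, _ in ranked]
print([word for word, _ in ranked])
print(f"average precision = {average_precision(labels_in_rank_order):.3f}")  # 0.833
```

In this toy ranking the irrelevant word "hotel" lands between the two relevant words, so the average precision is (1/1 + 2/3) / 2 ≈ 0.833 rather than 1.0. The same measure applies to ranked patient visits once assessors have labeled the top results.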

Learning Objectives

After participating in this activity, the learner should be better able to:

  • Engage in information extraction for social determinants of health 
  • Understand deep learning in clinical NLP

Speaker Information

Cosmin Adrian Bejan, PhD
Assistant Professor of Biomedical Informatics
School of Medicine at Vanderbilt University

Cosmin Adrian Bejan is an Assistant Professor of Biomedical Informatics in the School of Medicine at Vanderbilt University. His research lies at the intersection of biomedical informatics, natural language processing, and machine learning. Currently, he is developing text mining technologies for processing narrative patient reports to identify illness phenotypes and to facilitate clinical and translational studies of large cohorts of critically ill patients. One methodology he recently devised for this purpose is based on statistical hypothesis testing to extract the most relevant clinical information corresponding to the phenotype of interest. For this study, he also developed a state-of-the-art assertion classifier for assigning assertion values to the concepts associated with a specific phenotype.

Dr. Bejan received his B.S. and M.S. in computer science from the University of Iasi, Romania and his Ph.D. in computer science from the University of Texas at Dallas. He completed postdoctoral studies at the Institute for Creative Technologies of the University of Southern California, the Department of Biomedical Informatics and Medical Education at the University of Washington, and the Department of Biomedical Informatics at Vanderbilt University.