Bio


Selen Bozkurt is a postdoctoral scholar in the Department of Biomedical Data Science and the Center for Biomedical Informatics Research at Stanford University. Her research focuses on health informatics using electronic health records, machine learning, and natural language processing. She has also worked as a biostatistician on several projects. She has been a member of the RSNA Radiology Reporting Committee since 2009. Her PhD dissertation, "A Real Time Decision Support System for Mammography Interpretations," developed an automated system for deep information extraction from mammography reports and an approach for real-time decision support driven by analysis of dictated radiology reports.

Professional Education


  • PhD, Akdeniz University, Faculty of Medicine, Biostatistics and Medical Informatics
  • Visiting PhD Student, Stanford University, Biomedical Informatics
  • MSc, Akdeniz University, Faculty of Medicine, Biostatistics and Medical Informatics
  • BSc, Dokuz Eylul University, Statistics

All Publications


  • Expanding a radiology lexicon using contextual patterns in radiology reports. Journal of the American Medical Informatics Association : JAMIA Percha, B., Zhang, Y., Bozkurt, S., Rubin, D., Altman, R. B., Langlotz, C. P. 2018

    Abstract

    Distributional semantics algorithms, which learn vector space representations of words and phrases from large corpora, identify related terms based on contextual usage patterns. We hypothesize that distributional semantics can speed up lexicon expansion in a clinical domain, radiology, by unearthing synonyms from the corpus. We apply word2vec, a distributional semantics software package, to the text of radiology notes to identify synonyms for RadLex, a structured lexicon of radiology terms. We stratify performance by term category, term frequency, number of tokens in the term, vector magnitude, and the context window used in vector building. Ranking candidates based on distributional similarity to a target term results in high curation efficiency: on a ranked list of 775 249 terms, >50% of synonyms occurred within the first 25 terms. Synonyms are easier to find if the target term is a phrase rather than a single word, if it occurs at least 100× in the corpus, and if its vector magnitude is between 4 and 5. Some RadLex categories, such as anatomical substances, are easier to identify synonyms for than others. The unstructured text of clinical notes contains a wealth of information about human diseases and treatment patterns. However, searching and retrieving information from clinical notes often suffer due to variations in how similar concepts are described in the text. Biomedical lexicons address this challenge, but are expensive to produce and maintain. Distributional semantics algorithms can assist lexicon curation, saving researchers time and money.

    View details for DOI 10.1093/jamia/ocx152

    View details for PubMedID 29329435
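The ranking step described in the abstract above (ordering candidate terms by distributional similarity to a target term) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes cosine similarity as the similarity measure, and the three-dimensional vectors, the terms, and the function names `cosine` and `rank_candidates` are all hypothetical stand-ins for vectors that word2vec would learn from a real corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target, vectors, top_n=25):
    """Return up to top_n (term, similarity) pairs, most similar to the target first."""
    target_vec = vectors[target]
    scored = [
        (term, cosine(target_vec, vec))
        for term, vec in vectors.items()
        if term != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy vectors; real embeddings would be learned from radiology notes.
toy_vectors = {
    "kidney": [0.9, 0.1, 0.0],
    "renal": [0.8, 0.2, 0.1],
    "liver": [0.1, 0.9, 0.0],
    "fracture": [0.0, 0.1, 0.9],
}

ranked = rank_candidates("kidney", toy_vectors, top_n=3)
for term, score in ranked:
    print(f"{term}\t{score:.3f}")
```

A curator would then review only the top of such a list, which is where the abstract reports that most synonyms were found.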

  • Can Statistical Machine Learning Algorithms Help for Classification of Obstructive Sleep Apnea Severity to Optimal Utilization of Polysomnography Resources? Methods of information in medicine Bozkurt, S., Bostanci, A., Turhan, M. 2017; 56 (4)

    Abstract

    The goal of this study is to evaluate the results of machine learning methods for the classification of OSA severity of patients with suspected sleep-disordered breathing as normal, mild, moderate and severe based on non-polysomnographic variables: 1) clinical data, 2) symptoms and 3) physical examination. In order to produce classification models for OSA severity, five different machine learning methods (Bayesian network, Decision Tree, Random Forest, Neural Networks and Logistic Regression) were trained while relevant variables and their relationships were derived empirically from observed data. Each model was trained and evaluated using 10-fold cross-validation. To evaluate the classification performance of all methods, true positive rate (TPR), false positive rate (FPR), Positive Predictive Value (PPV), F measure and Area Under Receiver Operating Characteristics curve (ROC-AUC) were used. Results of 10-fold cross-validated tests with different variable settings promisingly indicated that the OSA severity of suspected OSA patients can be classified, using non-polysomnographic features, with 0.71 as the highest true positive rate and 0.15 as the lowest false positive rate. Moreover, the test results of different variable settings revealed that the accuracy of the classification models was significantly improved when physical examination variables were added to the model. Study results showed that machine learning methods can be used to estimate the probabilities of no, mild, moderate, and severe obstructive sleep apnea, and such approaches may improve accurate initial OSA screening and help refer only the suspected moderate or severe OSA patients to sleep laboratories for the expensive tests.

    View details for DOI 10.3414/ME16-01-0084

    View details for PubMedID 28590499

  • Usability Study of RSNA Radiology Reporting Template Library. Studies in health technology and informatics Hong, Y., Zhu, Y., Bozkurt, S., Zhang, J., Kahn, C. E. 2017; 245: 1325

    Abstract

    This study provides insights that could help to improve the Radiological Society of North America (RSNA) Reporting Template Digital Library, based on a usability evaluation. The results show that most users have been satisfied with the website. The general comments for the library are positive, although the participants suggested quite a few areas to improve. About 40% are returning visitors, which means people often come back to the website.

    View details for PubMedID 29295406

  • Estimation of cardiovascular disease from polysomnographic parameters in sleep-disordered breathing EUROPEAN ARCHIVES OF OTO-RHINO-LARYNGOLOGY Turhan, M., Bostanci, A., Bozkurt, S. 2016; 273 (12): 4585-4593

    Abstract

    We aimed to illustrate the causal relationships between cardiovascular diseases (CVDs) and various polysomnographic variables, and to develop a CVD estimation model from these variables in a population referred for assessment of possible sleep-disordered breathing (SDB). Clinical and polysomnographic data of 1162 consecutive patients with suspected SDB whose comorbidity status was known were reviewed retrospectively. Variable selection was performed in two steps using univariate analysis and tenfold cross validation information gain analysis. The resulting set of variables with an average merit value (m) of >0.005 was considered to be causal factors contributing to the CVDs, and used in Bayesian network models for providing estimations. Of the 1162 patients, 234 had CVDs (20.1 %). In total, 28 parameters were evaluated for variable selection. Of those, 19 were found to be associated with CVDs. Age was the most effective attribute in estimating CVD (m = 0.051), followed by total sleep time with oxygen saturation <90 % (m = 0.021). Some other important variables were apnea-hypopnea index during non-rapid eye movement (m = 0.018), lowest oxygen saturation (m = 0.018), body mass index (m = 0.016), total apnea duration (m = 0.014), mean apnea duration (m = 0.014), longest apnea duration (m = 0.013), and severity of SDB (m = 0.012). The modeling process resulted in a final model, with 76.9 % sensitivity, 96.2 % specificity, and 92.6 % negative predictive value, consisting of all selected variables. The study provides evidence that the estimation of CVDs from polysomnographic parameters is possible with high predictive performance using Bayesian network analysis.

    View details for DOI 10.1007/s00405-016-4176-1

    View details for Web of Science ID 000387700400066

    View details for PubMedID 27363409

  • Using automatically extracted information from mammography reports for decision-support. Journal of biomedical informatics Bozkurt, S., Gimenez, F., Burnside, E. S., Gulkesen, K. H., Rubin, D. L. 2016; 62: 224-231

    Abstract

    To evaluate a system we developed that connects natural language processing (NLP) for information extraction from narrative text mammography reports with a Bayesian network for decision-support about breast cancer diagnosis. The ultimate goal of this system is to provide decision support as part of the workflow of producing the radiology report. We built a system that uses an NLP information extraction system (which extracts BI-RADS descriptors and clinical information from mammography reports) to provide the necessary inputs to a Bayesian network (BN) decision support system (DSS) that estimates lesion malignancy from BI-RADS descriptors. We used this integrated system to predict diagnosis of breast cancer from radiology text reports and evaluated it with a reference standard of 300 mammography reports. We collected two different outputs from the DSS: (1) the probability of malignancy and (2) the BI-RADS final assessment category. Since NLP may produce imperfect inputs to the DSS, we compared the difference between using perfect ("reference standard") structured inputs to the DSS ("RS-DSS") vs NLP-derived inputs ("NLP-DSS") on the output of the DSS using the concordance correlation coefficient. We measured the classification accuracy of the BI-RADS final assessment category when using NLP-DSS, compared with the ground truth category established by the radiologist. The NLP-DSS and RS-DSS had closely matched probabilities, with a mean paired difference of 0.004±0.025. The concordance correlation of these paired measures was 0.95. The accuracy of the NLP-DSS to predict the correct BI-RADS final assessment category was 97.58%. The accuracy of the information extracted from mammography reports using the NLP system was sufficient to provide accurate DSS results. We believe our system could ultimately reduce the variation in practice in mammography related to assessment of malignant lesions and improve management decisions.

    View details for DOI 10.1016/j.jbi.2016.07.001

    View details for PubMedID 27388877

  • Automatic abstraction of imaging observations with their characteristics from mammography reports. Journal of the American Medical Informatics Association Bozkurt, S., Lipson, J. A., Senol, U., Rubin, D. L., Bulu, H. 2015; 22 (e1): e81-92

    Abstract

    Radiology reports are usually narrative, unstructured text, a format which hinders the ability to input report contents into decision support systems. In addition, reports often describe multiple lesions, and it is challenging to automatically extract information on each lesion and its relationships to characteristics, anatomic locations, and other information that describes it. The goal of our work is to develop natural language processing (NLP) methods to recognize each lesion in free-text mammography reports and to extract its corresponding relationships, producing a complete information frame for each lesion. We built an NLP information extraction pipeline in the General Architecture for Text Engineering (GATE) NLP toolkit. Sequential processing modules are executed, producing an output information frame required for a mammography decision support system. Each lesion described in the report is identified by linking it with its anatomic location in the breast. In order to evaluate our system, we selected 300 mammography reports from a hospital report database. The gold standard contained 797 lesions, and our system detected 815 lesions (780 true positives, 35 false positives, and 17 false negatives). The precision of detecting all the imaging observations with their modifiers was 94.9, recall was 90.9, and the F measure was 92.8. Our NLP system extracts each imaging observation and its characteristics from mammography reports. Although our application focuses on the domain of mammography, we believe our approach can generalize to other domains and may narrow the gap between unstructured clinical report text and structured information extraction needed for data mining and decision support.

    View details for DOI 10.1136/amiajnl-2014-003009

    View details for PubMedID 25352567
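The evaluation measures named in the abstract above follow from standard definitions, sketched here for reference. The function names are hypothetical, and the balanced F measure (harmonic mean of precision and recall) is an assumption, since the abstract does not state which F variant was used. The lesion-level counts (780, 35, 17) serve only as sample input: the reported precision of 94.9 and recall of 90.9 refer to imaging observations with their modifiers, a different unit of evaluation, so the two sets of numbers are not expected to agree.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and balanced F measure from raw counts (fractions in [0, 1])."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f_measure(precision, recall):
    """Harmonic mean of precision and recall, in whatever scale they are given."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Lesion-detection counts quoted in the abstract, used purely as sample input.
p, r, f = precision_recall_f1(tp=780, fp=35, fn=17)
print(f"lesion level: precision={p:.3f} recall={r:.3f} F={f:.3f}")

# Harmonic mean of the reported observation-level precision and recall (percent scale).
print(f"observation level: F={f_measure(94.9, 90.9):.1f}")
```

The harmonic mean of the reported 94.9 and 90.9 comes out close to the reported F measure of 92.8; the small difference is what one would expect if the published precision and recall were rounded before reporting.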

  • Automated detection of ambiguity in BI-RADS assessment categories in mammography reports. Studies in health technology and informatics Bozkurt, S., Rubin, D. 2014; 197: 35-39

    Abstract

    An unsolved challenge in biomedical natural language processing (NLP) is detecting ambiguities in the reports that can help physicians to improve report clarity. Our goal was to develop NLP methods to tackle the challenges of identifying ambiguous descriptions of the laterality of BI-RADS Final Assessment Categories in mammography radiology reports. We developed a text processing system that uses a BI-RADS ontology we built as a knowledge source for automatic annotation of the entities in mammography reports relevant to this problem. We used the GATE NLP toolkit and developed customized processing resources for report segmentation, named entity recognition, and detection of mismatches between BI-RADS Final Assessment Categories and mammogram laterality. Our system detected 55 mismatched cases in 190 reports and the accuracy rate was 81%. We conclude that such NLP techniques can detect ambiguities in mammography reports and may reduce discrepancy and variability in reporting.

    View details for PubMedID 24743074

  • Annotation for Information Extraction from Mammography Reports INFORMATICS, MANAGEMENT AND TECHNOLOGY IN HEALTHCARE Bozkurt, S., Gulkesen, K. H., Rubin, D. 2013; 190: 183-185

    Abstract

    Inter- and intra-observer variability in mammographic interpretation is a challenging problem, and decision support systems (DSS) may be helpful to reduce variation in practice. Since radiology reports are created as unstructured text reports, natural language processing (NLP) techniques are needed to extract structured information from reports in order to provide the inputs to DSS. Before creating NLP systems, producing a high-quality annotated data set is essential. The goal of this project is to develop an annotation schema to guide the information extraction tasks needed from free-text mammography reports.

    View details for DOI 10.3233/978-1-61499-276-9-183

    View details for Web of Science ID 000341032900053

    View details for PubMedID 23823416

  • An Open-Standards Grammar for Outline-Style Radiology Report Templates JOURNAL OF DIGITAL IMAGING Bozkurt, S., Kahn, C. E. 2012; 25 (3): 359-364

    Abstract

    Structured reporting uses consistent ordering of results and standardized terminology to improve the quality and reduce the complexity of radiology reports. We sought to define a generalized approach for radiology reporting that produces flexible outline-style reports, accommodates structured information and named reporting elements, allows reporting terms to be linked to controlled vocabularies, uses existing informatics standards, and allows structured report data to be extracted readily. We applied the Regular Language for XML-Next Generation (RELAX NG) schema language to create templates for 110 reporting templates created as part of the Radiological Society of North America reporting initiative. We evaluated how well this approach addressed the project's goals. The RELAX NG schema language expressed the cardinality and hierarchical relationships of reporting concepts, and allowed reporting elements to be mapped to terms in controlled medical vocabularies, such as RadLex®, Systematized Nomenclature of Medicine Clinical Terms®, and Logical Observation Identifiers Names and Codes®. The approach provided extensibility and accommodated the addition of new features. Overall, the approach has proven to be useful and will form the basis for a supplement to the Digital Imaging and Communication in Medicine Standard.

    View details for DOI 10.1007/s10278-012-9456-8

    View details for Web of Science ID 000304109700007

    View details for PubMedID 22258732

    View details for PubMedCentralID PMC3348985