Lichy Han, Robert Ball, Carol A Pamer, Russ B Altman, Scott Proestel; Development of an automated assessment tool for MedWatch reports in the FDA adverse event reporting system, Journal of the American Medical Informatics Association, Volume 24, Issue 5, 1 September 2017, Pages 913–920, https://doi.org/10.1093/jamia/ocx022
© 2018 Oxford University Press
Abstract
Objective: As the US Food and Drug Administration (FDA) receives over a million adverse event reports associated with medication use every year, a system is needed to aid FDA safety evaluators in identifying reports most likely to demonstrate causal relationships to the suspect medications. We combined text mining with machine learning to construct and evaluate such a system to identify medication-related adverse event reports.
Methods: FDA safety evaluators assessed 326 reports for medication-related causality. We engineered features from these reports and constructed random forest, L1 regularized logistic regression, and support vector machine models. We evaluated model accuracy and further assessed utility by generating report rankings that represented a prioritized report review process.
Results: Our random forest model showed the best performance in report ranking and accuracy, with an area under the receiver operating characteristic curve of 0.66. The generated report ordering assigns reports with a higher probability of medication-related causality a higher rank and is significantly correlated to a perfect report ordering, with a Kendall’s tau of 0.24 (P = .002).
Conclusion: Our models produced prioritized report orderings that enable FDA safety evaluators to focus on reports that are more likely to contain valuable medication-related adverse event information. Applying our models to all FDA adverse event reports has the potential to streamline the manual review process and greatly reduce reviewer workload.
BACKGROUND AND SIGNIFICANCE
The US Food and Drug Administration (FDA) receives more than 4000 medication safety reports every day, and the number of reports received each year has been increasing exponentially over the last decade. These reports are stored in a database known as the FDA Adverse Event Reporting System (FAERS), which has collected over 11 million reports since its inception in 1969.1 In the United States, reporting these adverse events, medication errors, and product quality issues by health care professionals and consumers via the MedWatch program is voluntary, but it is mandatory for drug manufacturers.2 The FDA uses these reports to detect safety issues that may not have been identified during pre-market clinical trials used as the basis for medication approval. Among the reasons for not detecting safety issues during pre-market evaluation are that the adverse event may be extremely rare or may be occurring in a population of individuals who were not previously studied.3,4
While the volume of reports is significant and increasing yearly, many of the reports do not provide sufficient information to determine whether the suspect drug caused the reported adverse event, or the information provided does not reasonably suggest that the adverse event was caused by the suspect medication. Therefore, developing a tool that could assist FDA safety evaluators by identifying reports that are most likely to be useful would be highly valuable and could help to ensure that safety issues do not go undetected.
There has been considerable interest in the use of natural language processing (NLP) and machine learning to enable workers to focus on subject matter that is most likely to be useful.5,6 Free-text narratives often contain details that allow causal inference, so extracting these details is expected to be an important aspect of this work. In the current project, we explore the possibility of using NLP and machine learning to assess the likelihood of drug causality for safety reports submitted to the FDA. In this first phase, we demonstrate the feasibility of this approach, and its potential benefits to the adverse event review process, in a subset of adjudicated FAERS reports. If successful, the system could be applied by FDA safety evaluators to the entire corpus of over 11 million reports, enabling them to focus their review efforts on reports that are most likely to indicate the emergence of a new safety concern.
METHODS
Data and gold standard
A “gold standard” for drug causality assessment was created by deidentifying and redacting 326 case reports from FAERS. The redacted fields include the patient identification number; date of birth; reporter name, organization, and location; and dates in the narrative. Safety reports were selected from FAERS by convenience sampling of cases received by the FDA between November 1997 and March 2015. Only cases received after November 1997 were eligible, as only those reports contain an electronically retrievable and machine-readable narrative section. The Stanford University Institutional Review Board approved this study (IRB-34866). Access to the data used in this study for research purposes can be requested through the FDA Technology Transfer Program at techtransfer@fda.hhs.gov.
Every case report was assessed for causality by 3 FDA safety evaluators using a modified version of the World Health Organization–Uppsala Monitoring Centre (WHO-UMC) criteria for drug causality assessment (Table 1).
Causality Term | Assessment Criteria |
---|---|
Certain | |
Probable/likely | |
Possible | |
Unlikely | |
Unassessable (a) | Cannot be judged because information is insufficient or contradictory |
Medication Error (a) | |
Product Quality Issue (a) | |
(a) Category modified from the WHO-UMC Causality Assessment Scale. The unmodified version of this scale is located at http://who-umc.org/Graphics/26649.pdf.7
After adjudication, case reports were randomly divided by the FDA investigators into a training and a test set, consisting of 60% and 40% of the data, respectively. We chose these proportions in order to capture a more comprehensive and representative set of the causality categories in the test set for evaluation. We then formulated our problem as a binary classification task by aggregating the causality categories into 2 groups: (1) Certain, Probable, Possible and (2) Unlikely, Unassessable. All features and models were built using the training data and assessed using the test data. All analyses were performed using R 3.3.0 (R Development Core Team, Vienna, Austria).
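The binary grouping and the 60/40 split described above can be sketched as follows. This is a minimal Python illustration, not the authors' code (the study used R); the function names, the fixed seed, and the use of `None` for excluded categories are assumptions made for the example.

```python
import random

# Grouping described in the text: positives are Certain/Probable/Possible,
# negatives are Unlikely/Unassessable.
POSITIVE = {"Certain", "Probable", "Possible"}
NEGATIVE = {"Unlikely", "Unassessable"}

def to_binary(category):
    """Map a causality category to 1, 0, or None (outside the binary task)."""
    if category in POSITIVE:
        return 1
    if category in NEGATIVE:
        return 0
    return None  # e.g. Medication Error or Product Quality Issue

def split_reports(report_ids, train_fraction=0.6, seed=0):
    """Randomly split report IDs into training and test lists.

    The seed is arbitrary; the original random split is not reproducible
    from the text.
    """
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]
```

Splitting before any feature engineering, as the text describes, keeps test reports from influencing features or model parameters.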
Feature engineering
Adverse event reports consist of structured data and an unstructured narrative. The structured data consists of fields such as age, adverse event outcomes (death, hospitalization, other), timeline of report entry by the FDA, adverse event reporters and their qualifications, terms mapped to the Medical Dictionary for Regulatory Activities (MedDRA®), and drug suspect products. The unstructured narrative is a free-text entry that varies in length from one word or sentence to multiple paragraphs. These narratives typically detail the chronological course of the adverse event, occasionally including medications and laboratory results. Protected health information was redacted from both the structured data and unstructured narratives prior to review by the Stanford investigators.
Structured data
We processed the structured data by re-encoding categorical variables, generating new features, and summarizing existing data. We expanded each categorical variable by transforming the possible values into multiple binary features. Age was binned by decade, and medications were binned by the first level of the Anatomical Therapeutic Chemical (ATC) Classification System.8 For the report types, direct reports are sent to the FDA by the general public, whereas 15-day (expedited) and nonexpedited reports are sent by pharmaceutical companies. The expedited 15-day time frame is for adverse events that are serious and unexpected. We calculated the number of days to complete the report as the data entry completion date minus the initial FDA received date. The data entry completion date is updated whenever a follow-up report is added to the case report. The total number of outcomes and reporters was calculated and included as additional features. MedDRA terms were represented by the number of preferred terms (PTs), high-level terms, high-level group terms, and system organ classes associated with each report.9
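A rough sketch of this structured-data encoding is given below. It is illustrative Python rather than the R code used in the study, and every field name in the `report` dictionary (`report_type`, `age`, `fda_received`, `entry_completed`, `outcomes`, `reporters`) is hypothetical.

```python
from datetime import date

# Report types named in the text.
REPORT_TYPES = ["15-day", "non-expedited", "direct"]

def age_decade_bin(age):
    """Bin an age in years by decade, e.g. 54 -> '50-59'."""
    if age is None:
        return "unknown"
    low = (int(age) // 10) * 10
    return f"{low}-{low + 9}"

def one_hot(value, levels, prefix):
    """Expand one categorical value into one binary feature per level."""
    return {f"{prefix}={level}": int(value == level) for level in levels}

def encode_structured(report):
    """Build a flat feature dict from one report (hypothetical field names)."""
    features = one_hot(report["report_type"], REPORT_TYPES, "report_type")
    # The age bin is itself categorical and would be one-hot encoded
    # the same way before modeling.
    features["age_bin"] = age_decade_bin(report.get("age"))
    # Data entry completion date minus initial FDA received date.
    features["days_to_complete"] = (
        report["entry_completed"] - report["fda_received"]
    ).days
    features["n_outcomes"] = len(report["outcomes"])
    features["n_reporters"] = len(report["reporters"])
    return features

example = {
    "report_type": "direct",
    "age": 54,
    "fda_received": date(2015, 1, 1),
    "entry_completed": date(2015, 1, 31),
    "outcomes": ["hospitalization"],
    "reporters": ["consumer"],
}
encoded = encode_structured(example)
```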
Unstructured narrative
The unstructured narratives were tokenized, lemmatized, and part-of-speech tagged using Stanford CoreNLP.10 Tokenization refers to splitting the narrative into individual words, lemmatization refers to converting each word into its base form, and part-of-speech tagging refers to assigning a part of speech (eg, noun, verb) to each word. We incorporated the number of sentences and average number of words, nouns, verbs, adjectives, and adverbs per sentence as nonsemantic features. We additionally counted the number of redacted terms in each narrative, which typically corresponded to dates detailing the time course of the adverse event.
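The nonsemantic features can be computed from tagger output along the following lines. This Python sketch assumes the narrative has already been split into sentences of `(token, tag)` pairs with Penn Treebank part-of-speech tags, as an English tagger such as the one in Stanford CoreNLP would emit; the input layout and function name are assumptions, and the tagging step itself is not shown.

```python
# Penn Treebank tag prefixes: NN* nouns, VB* verbs, JJ* adjectives, RB* adverbs.
TAG_PREFIXES = {"nouns": "NN", "verbs": "VB", "adjectives": "JJ", "adverbs": "RB"}

def narrative_features(tagged_sentences):
    """Sentence count and per-sentence averages of words and parts of speech.

    tagged_sentences: list of sentences, each a list of (token, tag) pairs.
    """
    n_sentences = len(tagged_sentences)
    counts = {name: 0 for name in TAG_PREFIXES}
    n_words = 0
    for sentence in tagged_sentences:
        for token, tag in sentence:
            if not any(ch.isalnum() for ch in token):
                continue  # skip punctuation-only tokens
            n_words += 1
            for name, prefix in TAG_PREFIXES.items():
                if tag.startswith(prefix):
                    counts[name] += 1
    divisor = n_sentences or 1  # avoid division by zero for empty narratives
    features = {"n_sentences": n_sentences,
                "words_per_sentence": n_words / divisor}
    for name, total in counts.items():
        features[f"{name}_per_sentence"] = total / divisor
    return features

tagged = [
    [("Patient", "NN"), ("developed", "VBD"), ("severe", "JJ"),
     ("rash", "NN"), (".", ".")],
    [("Drug", "NN"), ("was", "VBD"), ("stopped", "VBN"),
     ("immediately", "RB"), (".", ".")],
]
features = narrative_features(tagged)
```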
Expert opinion derived features
From the structured data, we compiled all MedDRA preferred terms that were present in at least 5 reports. We surveyed 5 adjudicators, who assessed each of these preferred terms and indicated if the presence of the term in a report would tend to increase the likelihood of a medication-related adverse event. Any preferred term that had at least 3 affirmative votes was included as an additional binary variable. We refer to this binary feature as the “presence of curated PTs.” The included terms were “drug interaction,” “acute kidney injury,” “seizure,” “drug hypersensitivity,” “drug ineffective,” “rhabdomyolysis,” “product quality issue,” and “toxicity to various agents.”
From the unstructured narratives, 5 adjudicators compiled a list of 10 words and phrases that they believed were more likely to be associated with a medication-related adverse event. These words and phrases were “toxicity,” “induced,” “rechallenge,” “probable,” “no alternative etiology,” “reaction,” “adverse reaction,” and “drug level.” Alternative spellings of these chosen phrases were included as well. The presence of any of these words and phrases in the narrative was included as a binary feature, titled “presence of curated terms.”
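The two expert-derived binary features can be sketched as below. The vote threshold and the listed terms come from the text; the data layout, function names, and whole-word matching are assumptions, and the alternative spellings mentioned above are not handled in this simplified version.

```python
import re

# Words and phrases listed in the text.
CURATED_TERMS = [
    "toxicity", "induced", "rechallenge", "probable",
    "no alternative etiology", "reaction", "adverse reaction", "drug level",
]

def select_curated_pts(votes, min_votes=3):
    """Keep preferred terms with at least min_votes affirmative votes.

    votes: dict mapping a preferred term to a list of booleans,
    one per adjudicator.
    """
    return {pt for pt, ballots in votes.items() if sum(ballots) >= min_votes}

def has_curated_pt(report_pts, curated_pts):
    """1 if any of the report's preferred terms is in the curated set."""
    return int(any(pt.lower() in curated_pts for pt in report_pts))

def has_curated_term(narrative, terms=CURATED_TERMS):
    """1 if any curated word or phrase occurs as a whole word in the narrative."""
    text = narrative.lower()
    return int(any(
        re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms
    ))

votes = {
    "seizure": [True, True, True, False, False],
    "nausea": [True, False, False, False, True],
}
curated = select_curated_pts(votes)
```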
Model construction and evaluation
We constructed models to differentiate reports with an assessment of Certain, Probable, or Possible from reports with an assessment of Unlikely or Unassessable. We built models using L1 regularized logistic regression, random forest, and support vector machines (SVMs) using the glmnet,11 randomForest,12 and caret13 packages in R, respectively. Model parameters were chosen by 10-fold cross-validation14,15 on the training set; no reports from the test set were used in model parameter selection or model construction. The penalty parameter lambda for the L1 regularized logistic regression model was chosen as the largest value of lambda that was within 1 standard error of the minimum error.15 The values of lambda searched ranged from 0.0001 to 0.15, as determined by the glmnet package. A grid search was used to determine the optimal values of C and gamma for our SVM model. We searched over 10 values for each parameter, with C ranging from 0.01 to 5 and gamma ranging from 0.001 to 2. Our model predictions were evaluated by calculating the sensitivity, positive predictive value, accuracy, and area under the receiver operating characteristic curve (AUROC). To assess the relative importance of our features, we extracted features retained in the L1 regularized logistic regression model and used the mean decrease in the Gini index in the random forest model.
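The lambda selection rule and the evaluation metrics named above can be written out explicitly. The following Python sketch is illustrative only and stands in for what the R packages compute; the function names and the toy cross-validation numbers are assumptions.

```python
def one_se_lambda(lambdas, cv_mean_error, cv_std_error):
    """Largest lambda whose mean CV error is within 1 SE of the minimum error."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean_error[i])
    threshold = cv_mean_error[best] + cv_std_error[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean_error) if err <= threshold]
    return max(eligible)

def binary_metrics(y_true, y_pred):
    """Sensitivity, positive predictive value, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "ppv": tp / (tp + fp) if tp + fp else float("nan"),
        "accuracy": (tp + tn) / len(pairs),
    }

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half; this equals the area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy cross-validation curve (made-up numbers, for illustration only).
chosen = one_se_lambda([0.001, 0.01, 0.05, 0.1],
                       [0.42, 0.38, 0.39, 0.45],
                       [0.02, 0.02, 0.02, 0.02])
```

Choosing the largest lambda within 1 standard error favors a sparser model than the error-minimizing lambda, at a small cost in cross-validated error.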
We further evaluated the utility of our models by using their predictions to create ranked lists of reports corresponding to the order in which they would be reviewed by manual adjudication. Specifically, for all 3 classifiers, we ordered the reports by their probability of having an assessment from Certain to Possible, where a greater probability would correspond to a report being adjudicated sooner and having an earlier ranking. For comparison, we made a ranked list of reports based on the date of data entry completion by the FDA, which we used to represent manual review of reports on a first come, first reviewed basis. We also assessed ranking the reports in reverse date order to simulate a scenario where the newest reports would be reviewed first. Additionally, we simulated arbitrary orderings by generating 10 000 random report rankings. We compared each of these report rankings to a perfectly ranked list using Spearman’s rho correlation coefficient, Kendall’s tau correlation test, and average precision (AP). We calculated AP using 2 cutoffs: (1) the entire test set (N = 125) and (2) the number of reports with an assessment of Certain, Probable, or Possible (N = 64).
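The ranking comparison can be illustrated with a small pure-Python sketch. The text does not specify how ties in the perfect ordering were handled or how AP was normalized at a cutoff, so the choices here (Kendall's tau-b, and AP averaged over the positives found within the cutoff) are assumptions, as are the function names.

```python
import math
import random
import statistics

def kendall_tau_b(ranked_grades):
    """Kendall's tau-b between review order and gold-standard grade.

    ranked_grades lists each report's grade in review order, where a
    higher grade means a stronger causality assessment. A perfect
    ordering is non-increasing.
    """
    n = len(ranked_grades)
    concordant = discordant = tied = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ranked_grades[i] > ranked_grades[j]:
                concordant += 1
            elif ranked_grades[i] < ranked_grades[j]:
                discordant += 1
            else:
                tied += 1
    total = n * (n - 1) // 2
    denom = math.sqrt(total * (total - tied))  # positions themselves have no ties
    return (concordant - discordant) / denom if denom else 0.0

def average_precision(ranked_labels, k=None):
    """Mean precision at each positive found within the top k reports."""
    top = ranked_labels if k is None else ranked_labels[:k]
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(top, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def random_baseline(labels, n_orderings=10000, seed=0):
    """Mean and SD of tau-b over random orderings (seed is arbitrary)."""
    rng = random.Random(seed)
    taus = []
    for _ in range(n_orderings):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        taus.append(kendall_tau_b(shuffled))
    return statistics.mean(taus), statistics.stdev(taus)
```

With binary labels, a model ranking would be passed as the list of true labels sorted by descending predicted probability.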
We performed an error analysis of our model results by categorizing the false positive (FP) and false negative (FN) errors based on our features. We considered reports with an assessment of Certain, Probable, or Possible as positives and those with an assessment of Unlikely or Unassessable as negatives. We compared the distribution of each feature for the accurately classified reports to the reports classified as FN or FP. For categorical variables, we used the chi-squared test, and for continuous variables, we used Student’s t-test.
RESULTS
Data
Of the 326 reports, approximately half had an assessment of at least Possible (Figure 1), and no reports in the test set had an assessment of Certain. There was a total of 25 reports with an assessment of Medication Error or Product Quality Issue, 6 from the test set and 19 from the training set. Focusing on reports with an assessment of Certain to Unassessable resulted in a total of 301 reports, for a training set of 176 (107 Certain, Probable, Possible; 69 Unlikely, Unassessable) and test set of 125 (64 Certain, Probable, Possible; 61 Unlikely, Unassessable).
When assessing adjudicator agreement, we found that at least 2 of the 3 adjudicators tended to agree on any given report. At the granular level, using 7 classes, all 3 adjudicators agreed on 37.4% of the reports, and 92.6% of reports had at least 2 adjudicators in agreement. When we grouped the reports (Certain, Probable, Possible, vs Unlikely, Unassessable), all 3 adjudicators agreed on 61% of the reports.
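Agreement rates of this kind can be computed as in the sketch below. It is illustrative Python with made-up example assessments; the input layout (one tuple of 3 category labels per report) and the function names are assumptions.

```python
from collections import Counter

POSITIVE = {"Certain", "Probable", "Possible"}

def agreement_rates(assessments):
    """Fraction of reports where all 3, and at least 2, adjudicators agree.

    assessments: list of 3-tuples of category labels, one tuple per report.
    """
    n = len(assessments)
    all_three = sum(1 for a in assessments if len(set(a)) == 1)
    at_least_two = sum(
        1 for a in assessments if Counter(a).most_common(1)[0][1] >= 2
    )
    return all_three / n, at_least_two / n

def group(assessments):
    """Collapse categories into the 2 groups used for classification."""
    return [
        tuple("positive" if c in POSITIVE else "negative" for c in a)
        for a in assessments
    ]

# Made-up assessments for 4 reports, for illustration only.
example = [
    ("Possible", "Possible", "Possible"),
    ("Possible", "Probable", "Possible"),
    ("Certain", "Unlikely", "Unassessable"),
    ("Probable", "Possible", "Unlikely"),
]
granular = agreement_rates(example)
grouped = agreement_rates(group(example))
```

Grouping can only raise agreement, since adjudicators who differ between, say, Probable and Possible fall into the same group.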
Features
Summary statistics for features constructed from structured and unstructured data fields for reports with assessments between Certain and Unassessable are shown in Table 2.
Feature | Training Set | Test Set |
---|---|---|
Structured Data | | |
No. of Reports | 176 | 125 |
Age | 50.5 ± 23.0 | 54.6 ± 17.9 |
Sex: Male, n (%) | 75 (42.6) | 55 (44.0) |
Days to Complete Report | 91.5 ± 305.6 | 55.9 ± 92.4 |
Report type: 15-day, n (%) | 134 (76.1) | 86 (68.9) |
Report type: Non-expedited, n (%) | 32 (18.2) | 28 (22.4) |
Report type: Direct, n (%) | 10 (5.7) | 11 (8.8) |
Reporter: Health professional, n (%) | 37 (21.0) | 22 (17.6) |
Reporter: Consumer, n (%) | 10 (5.7) | 10 (8.0) |
Reporter: Foreign, n (%) | 20 (11.4) | 8 (6.4) |
Reporter: Other, n (%) | 120 (68.2) | 85 (68.0) |
MedDRA: # PTs | 3.9 ± 5.2 | 3.1 ± 2.7 |
MedDRA: # HLTs | 3.2 ± 2.7 | 2.9 ± 2.4 |
MedDRA: # HLGTs | 2.9 ± 2.4 | 2.8 ± 2.1 |
MedDRA: # SOCs | 2.5 ± 1.7 | 2.3 ± 1.7 |
MedDRA: Presence of curated PTs, n (%) | 18 (10.2) | 12 (9.6) |
Outcome: Death, n (%) | 17 (9.7) | 12 (9.6) |
Outcome: Hospitalization, n (%) | 71 (40.3) | 40 (32.0) |
Outcome: Other, n (%) | 76 (43.2) | 62 (49.6) |
Outcome: Serious outcome, n (%) | 140 (79.5) | 98 (78.4) |
Number of Drug Suspects | 2.01 ± 1.9 | 2.06 ± 2.4 |
Drug suspect ATC class, first level: J (antiinfectives for systemic use), n (%) | 30 (17.0) | 9 (7.2) |
Drug suspect ATC class, first level: R (respiratory system), n (%) | 12 (6.8) | 10 (8.0) |
Drug suspect ATC class, first level: A (alimentary tract and metabolism), n (%) | 43 (24.4) | 33 (26.4) |
Drug suspect ATC class, first level: N (nervous system), n (%) | 29 (16.4) | 30 (24.0) |
Drug suspect ATC class, first level: C (cardiovascular system), n (%) | 10 (5.7) | 12 (9.6) |
Drug suspect ATC class, first level: L (antineoplastic and immunomodulating agents), n (%) | 31 (17.6) | 19 (15.2) |
Unstructured Narrative | | |
No. of Sentences | 23.1 ± 24.2 | 19.6 ± 22.3 |
# Words per sentence | 18.1 ± 7.4 | 17.8 ± 10.1 |
Parts of speech: # Nouns per sentence | 6.6 ± 3.2 | 6.3 ± 4.2 |
Parts of speech: # Verbs per sentence | 2.6 ± 1.0 | 2.7 ± 1.4 |
Parts of speech: # Adjectives per sentence | 1.8 ± 1.0 | 1.7 ± 1.3 |
Parts of speech: # Adverbs per sentence | 0.5 ± 0.3 | 0.5 ± 0.4 |
Number redacted | 1.2 ± 3.0 | 0.7 ± 1.9 |
Presence of curated terms, n (%) | 50 (28.4) | 36 (28.8) |
Binary features are reported using the number and percentage of reports, and numerical features are reported using mean ± standard deviation.
Model evaluation
Results from our models on the test set of reports are shown in Table 3.
Metric | By Date | Reverse Date | Random Order (N = 10,000) | Random Forest | L1 Regularized Logistic Regression | SVM |
---|---|---|---|---|---|---|
Sensitivity | – | – | – | 0.66 | 0.69 | 0.50 |
PPV | – | – | – | 0.64 | 0.58 | 0.60 |
Accuracy | – | – | – | 0.63 | 0.58 | 0.58 |
AUROC | – | – | – | 0.66 | 0.66 | 0.58 |
Spearman’s ρ | −0.09 | 0.11 | Mean (SD): −0.002 (0.09) | 0.28 | −0.04 | 0.01 |
Kendall’s tau | −0.07 | 0.09 | Mean (SD): −0.002 (0.08) | 0.24 | −0.04 | −0.03 |
P-value | .31 | .23 | Mean (SD): 0.50 (0.27) | 0.002 | 0.62 | 0.73 |
AP (N = 125) | 0.46 | 0.57 | Mean (SD): 0.51 (0.04) | 0.65 | 0.47 | 0.53 |
AP (N = 64) | 0.44 | 0.59 | Mean (SD): 0.51 (0.07) | 0.74 | 0.44 | 0.55 |
Random report orderings are summarized using the mean and standard deviation (SD) of the correlation metrics.
For each metric, the bolded value indicates the best performing classifier.
Using the report rankings from the random forest model resulted in a shift of reports with lower assessments being ranked earlier, whereas ordering reports by date resulted in these reports being more concentrated at the end of the ordering (Figures 3A and B). As expected, the random orderings tended to produce uniform assessment distributions, with correlation values around 0 (Figures 4A and B). Our random forest model produced a more significantly correlated report ordering than all but 20 of the 10 000 random ordering simulations and outperformed both date heuristics (Figures 4A and B). Using our random forest model, in order to review at least 80% of the reports with an assessment of Certain, Probable, or Possible in the test set, evaluators would need to review 86 out of the 125 reports, as compared to 100 out of 125 if we reviewed by date or an average of 99 out of 125 if we reviewed in random order. This is 13 to 14 fewer reports than the comparison orderings, a potential reduction in workload of roughly 10% of the test set at a sensitivity threshold of 80%. Further examination at the individual assessment level showed a shift of reports with an assessment of Probable or Possible in the test set to earlier in the report order, and reports with an assessment of Unlikely or Unassessable to later (Figure 5). Notably, the magnitude of the change in report order appeared to be more accentuated for reports with a more extreme assessment of Probable or Unassessable.
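The workload figure quoted above is the number of reports that must be read, in ranked order, before a target share of the positive reports has been seen. A minimal Python sketch of that count follows; the function name and the toy ordering are hypothetical, and the example does not reproduce the study's data.

```python
import math

def reports_needed(ranked_labels, target_sensitivity=0.8):
    """Number of reports to review, in order, to reach the target sensitivity.

    ranked_labels: 0/1 gold-standard labels listed in review order.
    """
    total_positive = sum(ranked_labels)
    if total_positive == 0:
        return 0
    required = math.ceil(target_sensitivity * total_positive)
    found = 0
    for n_reviewed, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= required:
            return n_reviewed
    return len(ranked_labels)

# Toy ordering with 5 positives among 10 reports (made-up data).
good_order = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
poor_order = list(reversed(good_order))
```

Comparing this count between a model ranking and a date or random ordering gives the difference in reports reviewed at a fixed sensitivity.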
When examining the FP errors, we found that there were no statistically significant differences among the features. Out of all the features, the largest difference between FP errors and accurately classified reports was in the number of direct type reports. We found that none of the FP errors were direct reports, whereas in the training set, a direct report type tended to be associated with the positive class.
For the FN errors, we found that these reports had significantly fewer manually curated PTs from MedDRA (P = .007) and were significantly more likely to have death as an outcome (P = .03). None of the FN reports, despite being evaluated as having a higher likelihood of medication-related causality, contained any of the manually curated PTs. Additionally, the FN errors consisted of 2 out of a total of 16 Probable reports and 11 out of 48 Possible reports. Thus, the classifier tended to misclassify reports with a lower causality assessment as negative.
Feature importance
We extracted important features from our random forest and L1 regularized logistic regression models (Table 4).
Random Forest | L1 Regularized Logistic Regression | ||||||
---|---|---|---|---|---|---|---|
Feature | Mean Decrease in Gini Index | Certain, Probable, Possible | Unlikely, Unassessable | Feature | Coefficient | Certain, Probable, Possible | Unlikely, Unassessable |
Age | 6.312 | ✓ | Presence of curated PTs | 4.54e-3 | ✓ | ||
No. of sentences | 6.120 | ✓ | |||||
# words per sentence | 5.707 | ✓ | Outcome: death | −0.112 | ✓ | ||
Days to complete report | 4.719 | ✓ | Days to complete report | 5.61e-6 | ✓ | ||
MedDRA: # PTs | 3.702 | ✓ | Presence of curated terms | 0.043 | ✓ | ||
Number of drug suspects | 3.053 | ✓ | No. of drug suspects | 0.011 | ✓ | ||
Reporter: consumer | 2.717 | ✓ | Reporter: consumer | −0.327 | ✓ |
A check indicates that a higher value of the feature is associated with a higher likelihood of an assessment from the respective column.
DISCUSSION
A key component of FDA regulatory activities is pharmacovigilance, which partly relies on post-marketing surveillance and spontaneous reporting systems, including the FDA Adverse Event Reporting System. Over the last decade, the number of adverse event reports has increased exponentially, resulting in a substantial workload for reviewers. Delays in detecting drug adverse events can have costly and detrimental effects on public health, and thus a system to identify reports most likely to contain information demonstrating causal drug events would be highly beneficial. Researchers have investigated such approaches using the US Vaccine Adverse Event Reporting System, in which extracted text features16,17 were used with multiple classification algorithms to create an effective report classification model.18–20
The success of text classification in the Vaccine Adverse Event Reporting System and previous computational discoveries of new medication-related adverse events in FAERS21–27 have generated significant interest in developing a classification system for FAERS. To accomplish this, we built models to classify and rank adverse event reports based on the likelihood of medication-related causality. In addition, we showed the potential utility of our models to assist manual adjudicators by shifting reports with a higher probability of medication-related causality to a higher priority in rank order.
For the first phase of this study, we chose to focus on reports with assessments of Certain to Unassessable, as they constituted over 90% of our corpus. Though the modified WHO-UMC causality categories additionally include Medication Error and Product Quality Issue, reports in these categories were heterogeneous and few in number. Furthermore, Medication Error and Product Quality Issue are MedDRA preferred terms that appear as part of the MedDRA structured data fields. These structured terms can aid in detecting reports with these issues, whereas a system for differentiating reports with assessments from Certain to Unassessable does not yet exist.
We engineered features based on structured data fields and extracted nonsemantic features from unstructured narratives. We chose to focus on nonsemantic features for this initial corpus of reports because the narratives vary widely, ranging from one word to multiple paragraphs. Given this variability, we found that patterns of words and dependencies were frequently unique to individual reports and would not generalize to the entirety of the adverse event report database. Instead, we relied on expert knowledge to curate terms that are considered to raise the probability of a report having medication-related causality. We recognize that the lack of semantic features is a feature engineering limitation, mostly due to the size of our current corpus. However, we believe that the unstructured narratives contain additional valuable information, and with the entirety of FAERS, we plan to fully leverage semantic features to improve the discrimination power of our models. Examples of such features include entity recognition using the Unified Medical Language System,28 quantification of common words and phrases using term frequency, and relationship extraction between words at the sentence level.
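As an illustration of the two kinds of narrative features discussed above, the sketch below computes simple nonsemantic counts (number of sentences, mean words per sentence) and a raw term-frequency table. The sentence splitter, the tokenizer, and the example narrative are simplifying assumptions made for illustration; they are not the feature extraction pipeline used in this study.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text):
    """Lowercase word tokens made of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def nonsemantic_features(narrative):
    """Length-based features that do not depend on what the words mean."""
    sentences = split_sentences(narrative)
    words_per_sentence = [len(tokenize(s)) for s in sentences]
    n_sentences = len(sentences)
    mean_words = sum(words_per_sentence) / n_sentences if n_sentences else 0.0
    return {"n_sentences": n_sentences, "mean_words_per_sentence": mean_words}


def term_frequency(narrative):
    """Raw count of each token in the narrative."""
    return Counter(tokenize(narrative))


# Hypothetical narrative, invented for this example.
narrative = (
    "Patient developed rash after the first dose. "
    "Rash resolved when the drug was stopped."
)
features = nonsemantic_features(narrative)
tf = term_frequency(narrative)
print(features)
print(tf.most_common(3))
```

The length-based features remain defined even for one-word narratives, whereas term counts become informative only when the same terms recur across many reports, which is one way to read the corpus-size limitation noted above.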
With our engineered features, we were able to see modest separation between high- and low-probability reports using 3 different types of machine learning models. The superior results of the L1 regularized logistic regression and random forest models over the SVM model suggest that data-driven feature selection is beneficial in avoiding overfitting and improving performance. When we assessed the reports misclassified by our model, we found that the FP and FN errors had features generally opposite to the predictive trends present in the training data. We attribute these errors to mild overfitting, which is due, in part, to the limitations imposed by a small corpus.
Upon closer inspection of the features selected by our models, we found that the L1 regularized logistic regression model retained only 6 features, many of which were also considered highly important by the random forest model. The majority of these features appear to have an intuitive interpretation, such as a consumer reporter increasing the likelihood of a report being Unlikely or Unassessable. As the “Days to Complete Report” feature reflects the addition of follow-up reports, a longer report completion time appears to increase the likelihood of a Certain, Probable, or Possible assessment. Interestingly, these features range across multiple feature categories, including outcomes, MedDRA terms, drug suspects, and reporter qualifications. The feature selection suggests that each of these categories likely contains correlated variables, with each category providing additional value for classification.
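To make the coefficient signs concrete: in logistic regression, a coefficient β corresponds to a multiplicative change of exp(β) in the odds of the positive class per unit increase in the feature. The sketch below applies this to the coefficients reported in the table. It assumes that the positive class is Certain, Probable, or Possible, which is consistent with the interpretation of the consumer reporter and report completion features given above, and it does not account for how each feature was scaled.

```python
import math

# Coefficients as reported for the L1 regularized logistic regression model.
coefficients = {
    "Presence of curated PTs": 4.54e-3,
    "Outcome: death": -0.112,
    "Days to complete report": 5.61e-6,
    "Presence of curated terms": 0.043,
    "No. of drug suspects": 0.011,
    "Reporter: consumer": -0.327,
}


def odds_ratio(beta):
    """Multiplicative change in the odds per unit increase in the feature."""
    return math.exp(beta)


odds_ratios = {name: odds_ratio(beta) for name, beta in coefficients.items()}

# Assumption: positive class = Certain/Probable/Possible.
for name, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of the positive class)")
```

Under these assumptions, the consumer reporter indicator has the largest effect per unit, while the very small coefficient for days to complete report reflects a feature measured on a much larger numeric scale.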
In addition to report classification, we further assessed utility based on reordering reports to expedite the review process for manual adjudicators. We see high-probability reports shifting earlier in rank order at the individual assessment report level, indicating that our model can improve upon the current practice of reviewing reports by date or in random order. We acknowledge that our work is limited by the number of reports and that additional features could be extracted from the unstructured narratives. These limitations stem from the time-intensive nature of adjudication and curation, which also serves as the motivation for this work. Despite these limitations, we believe that the benefits seen in this work can be magnified upon expansion of our model to the millions of reports in FAERS. Furthermore, this evaluation simulates a more realistic application, which can be easily integrated into the existing workflow. Its implementation has the potential to increase efficiency and improve detection of safety issues by prioritizing reports with a higher probability of medication-related causality.
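The prioritization idea can be sketched as follows: sort reports by predicted probability, then compare the model scores against the gold-standard labels with Kendall's tau. The function below is a plain tau-b, which accounts for the many ties that binary labels produce; whether the study used this exact variant is an assumption, and the report IDs, scores, and labels are invented toy values.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length sequences (O(n^2) pair scan)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both sequences: contributes to no term
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x_only) * (untied + ties_y_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical reports: (report ID, predicted probability, gold-standard label),
# where 1 = Certain/Probable/Possible and 0 = Unlikely/Unassessable.
reports = [
    ("R1", 0.91, 1), ("R2", 0.15, 0), ("R3", 0.62, 1),
    ("R4", 0.40, 0), ("R5", 0.55, 0), ("R6", 0.30, 1),
]

# Review queue: highest predicted probability first.
ranked = sorted(reports, key=lambda r: r[1], reverse=True)
ranked_ids = [report_id for report_id, _, _ in ranked]

scores = [score for _, score, _ in reports]
labels = [label for _, _, label in reports]
tau = kendall_tau_b(scores, labels)
print(ranked_ids, round(tau, 3))
```

A tau near 1 would mean that nearly every medication-related report is scored above every unrelated one, and a tau near 0 would correspond to an ordering no better than reviewing reports at random.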
CONCLUSION
This study serves as the foundation for construction of a system to detect adverse event reports with a high probability of indicating medication-related causality. We explored and constructed several models, which we used to demonstrate feasibility and applicability in aiding manual reviewers via report prioritization. Expansion of these models to the entirety of the FDA adverse event reporting system could substantially improve the manual review process and have potential downstream benefits for pharmacovigilance.
DISCLAIMER
The opinions or assertions presented herein are the private views and opinions of the authors and are not to be construed as conveying either an official endorsement or criticism by the US Department of Health and Human Services, the Public Health Service, or the Food and Drug Administration.
FUNDING
This work was supported by the FDA, which supports the University of California, San Francisco–Stanford Center of Excellence in Regulatory Sciences and Innovation, grant no. U01FD004979, and by the National Institutes of Health under award no. T32 GM007365, which supports the Stanford Medical Scientist Training Program. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Department of Health and Human Services, the National Institutes of Health, or the Food and Drug Administration.
COMPETING INTERESTS
The authors have no competing interests to declare.
CONTRIBUTORS
SP, RB, and CAP designed the study, acquired the data and gold standard, assisted with interpretation of the results, and critically revised the manuscript. LH analyzed the data, built the models, and drafted the manuscript. RBA critically revised the manuscript and made substantial contributions to interpreting the data and the results.
ACKNOWLEDGMENTS
The authors would like to thank reviewers from the FDA Center for Drug Evaluation and Research, Office of Surveillance and Epidemiology, Division of Pharmacovigilance and Division of Medication Errors and Prevention for assigning causality terms to the FAERS reports used in this study.