Sharing sensitive data

Although it may not be possible in all cases, it is a good idea to obtain informed consent from your study participants for the publication of their anonymized data.


Modifying sensitive data for public release

Sensitive data that contain potentially identifying information, whether human subject data or other types of sensitive data, will likely need to be modified before being shared with the public. These modifications protect participant confidentiality, the location of endangered wildlife, or other sensitive details. However, they may alter the data to the point where reproducing the results or conducting further research is no longer possible. Consider retaining multiple versions of the data: one that is suitable for public release, and one that is suitable for further research but available only on a highly restricted basis.

For protected health information (PHI), the HIPAA Privacy Rule provides two methods for de-identification: the expert determination method and the safe harbor method. See the resources listed below for documentation on these methods from the US Department of Health and Human Services, as well as information on how to satisfy them.
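As a rough illustration of the kind of generalization the safe harbor method calls for, the Python sketch below truncates ZIP codes to three digits, reduces dates to the year, and groups ages above 89 into one category. It covers only a few of the identifiers the method addresses, and the set of low-population ZIP prefixes is a placeholder assumption. Consult the HHS guidance for the authoritative requirements before relying on anything like this.

```python
from datetime import date

# Placeholder assumption: 3-digit ZIP prefixes whose area holds 20,000 or
# fewer people. The real list must be derived from current Census data.
EXAMPLE_LOW_POPULATION_ZIP3 = {"036", "059"}


def generalize_zip(zip_code: str) -> str:
    """Keep only the first three digits; use "000" for low-population areas."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in EXAMPLE_LOW_POPULATION_ZIP3 else prefix


def generalize_date(d: date) -> int:
    """Reduce a full date to its year."""
    return d.year


def generalize_age(age: int) -> str:
    """Group ages above 89 into a single "90+" category."""
    return "90+" if age > 89 else str(age)
```

Note that for people older than 89, the year of a date can itself reveal age, so a fuller implementation would need to suppress it in those cases.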

Types of identifying information

Identifying information is classified as one of two types: direct and indirect.

Direct identifiers

These data point directly to an individual and are typically removed from data sets before sharing with the public.

These may include:

name
initials
mailing address
phone number
email address
unique identifying numbers, like Social Security numbers or driver's license numbers
vehicle identifiers
medical device identifiers
web or IP addresses
biometric data
photographs of the person
audio recordings
names of relatives
dates specific to an individual, like date of birth, marriage, etc.
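Removing direct identifiers from tabular data can be as simple as dropping the relevant columns before export. The following Python sketch assumes each record is a dictionary. The column names are hypothetical and would need to match your own data set.

```python
# Hypothetical column names; adjust these to match your own data set.
DIRECT_IDENTIFIER_COLUMNS = {
    "name", "initials", "mailing_address", "phone",
    "email", "ssn", "ip_address", "date_of_birth",
}


def drop_direct_identifiers(records, identifier_columns=DIRECT_IDENTIFIER_COLUMNS):
    """Return copies of the records without the direct identifier columns."""
    return [
        {key: value for key, value in record.items()
         if key not in identifier_columns}
        for record in records
    ]


records = [
    {"name": "Jane Doe", "email": "jane@example.org", "age": 34, "score": 7.5},
    {"name": "John Roe", "email": "john@example.org", "age": 51, "score": 6.0},
]

# The original records are left untouched for the restricted version.
public_records = drop_direct_identifiers(records)
```

Dropping columns does not address identifiers embedded in free text, photographs, or recordings. Those need separate review.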

Indirect identifiers

These may seem harmless on their own but can point to an individual when combined with other data. It has been recommended (see the BMJ article listed under Resources) that data sets containing three or more indirect identifiers be reviewed by an independent researcher or ethics committee to evaluate identification risk. Any indirect identifiers not needed for the analysis should be removed. It may be reasonable to supply some of these types of data in aggregated form (for example, ranges of annual income instead of exact figures).
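The aggregation of annual income into ranges can be sketched in Python as follows. The bin width and the open-ended top category are arbitrary assumptions chosen for illustration, and the function assumes non-negative incomes. Appropriate ranges depend on the population being studied.

```python
def income_range(income, width=25_000, top=100_000):
    """Map an exact annual income to a labeled range.

    Incomes at or above `top` share one open-ended category so that
    unusually high earners do not stand out.
    """
    income = int(income)
    if income >= top:
        return f"{top:,}+"
    lower = (income // width) * width
    return f"{lower:,}-{lower + width - 1:,}"


exact_incomes = [38_500, 99_999, 250_000]
binned = [income_range(value) for value in exact_incomes]
# binned == ["25,000-49,999", "75,000-99,999", "100,000+"]
```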

Indirect identifiers may include:

place of medical treatment or doctor's name
gender
rare disease or treatment
sensitive data like illicit drug use or other "risky behaviors"
place of birth
socioeconomic data, like workplace, occupation, annual income, education, etc.
general geographic indicators, like postal code of residence
household and family composition
ethnicity
birth year or age
verbatim responses or transcripts
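The recommendation to review data sets with three or more indirect identifiers can be turned into a simple pre-release check. This Python sketch uses hypothetical column names as stand-ins for the categories listed above. It only counts matching columns and is no substitute for the independent review itself.

```python
# Hypothetical column names standing in for the indirect identifiers above.
INDIRECT_IDENTIFIER_COLUMNS = {
    "treatment_site", "gender", "rare_disease", "place_of_birth",
    "occupation", "annual_income", "postal_code", "household_size",
    "ethnicity", "birth_year", "free_text_response",
}

# Threshold taken from the recommendation described in the text.
REVIEW_THRESHOLD = 3


def indirect_identifiers_present(columns):
    """Return the sorted indirect identifier columns found in a data set."""
    return sorted(set(columns) & INDIRECT_IDENTIFIER_COLUMNS)


def needs_independent_review(columns, threshold=REVIEW_THRESHOLD):
    """True when the data set holds at least `threshold` indirect identifiers."""
    return len(indirect_identifiers_present(columns)) >= threshold


columns = ["gender", "birth_year", "postal_code", "score"]
found = indirect_identifiers_present(columns)
flagged = needs_independent_review(columns)
```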

Resources

"Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule," US Department of Health and Human Services, Office for Civil Rights.

Hrynaszkiewicz, I, Norton, ML, Vickers, AJ and Altman, DG. "Preparing raw clinical data for publication: guidance for journal editors, authors, and peer reviewers." BMJ 2010;340:c181.

"Preparing Data for Sharing" from the Inter-University Consortium for Political and Social Research (ICPSR). (2012). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (5th ed.). Ann Arbor, MI.
