Can algorithms help protect participant privacy?

While genetic data sharing has more recently highlighted the privacy problem (a de-identified genome is like a fingerprint, and certain biomarkers are shared within families), the issue of re-identification from de-identified data exists with practically any data type. A 2006 publication provides an up-to-date picture of the threat to privacy posed by the disclosure of simple demographic information.
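
To make the demographic re-identification risk concrete, here is a minimal sketch that measures how many records in a dataset are unique on a handful of quasi-identifiers. The records, the choice of fields (ZIP code, birth date, gender), and the function name `unique_fraction` are illustrative assumptions, not data or methods from the publication mentioned above.

```python
from collections import Counter

# Hypothetical toy records: (ZIP code, birth date, gender). Not real data.
records = [
    ("02139", "1980-04-12", "F"),
    ("02139", "1980-04-12", "F"),
    ("02139", "1975-09-30", "M"),
    ("94305", "1990-01-05", "F"),
    ("94305", "1962-11-23", "M"),
]


def unique_fraction(rows):
    """Return the fraction of rows whose quasi-identifier combination occurs exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


# Rows that are unique on these fields can be singled out by anyone who
# knows the same attributes from another source (for example, a public registry).
print(unique_fraction(records))
```

The higher this fraction, the less protection removing names and identification numbers provides on its own.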

The Global Alliance for Genomics and Health (GA4GH) has provided a forum to discuss the ramifications of privacy breaches in data sharing. The last two iDASH Privacy & Security Workshops have seen dozens of teams from around the world work on hard problems in mathematics and computation to enable privacy-protecting analytics.

  1. 2017 challenge 1: De-duplication for Global Alliance for Genomics and Health (GA4GH): Participating teams were given hashed patient attributes (identification number, first and last name, gender, etc.) to develop efficient secure multiparty (>=3) patient linkage protocols that scale well to real-world applications (e.g., thousands of centers and millions of records in total).

  2. 2017 challenge 2: Software Guard Extensions (SGX) based whole-genome variant search: Given a database of whole-genome sequence VCFs (labeled as case/control), participating teams used SGX to identify the top-K most significant SNPs.

  3. 2017 challenge 3: Homomorphic encryption (HME) based logistic regression model learning: Participating teams developed homomorphic encryption algorithms for training a logistic regression model.

  4. 2016 challenge 1: Practical Protection of Genomic Data Sharing through Beacon Services (privacy-preserving output release): Given a sample Beacon database, participating teams were asked to develop solutions to mitigate the Bustamante attack. The winning solution was from Vanderbilt University.

  5. 2016 challenge 2: Privacy-Preserving Search of Similar Cancer Patients across Organizations (secure multiparty computing): The task was to find the top-k most similar patients in a database on a panel of genes, with similarity measured by the edit distance between a query sequence and the sequences in the database.

  6. 2016 challenge 3: Testing for Genetic Diseases on Encrypted Genomes (secure outsourcing): Participating teams had to calculate the probability of genetic diseases by matching a set of biomarkers to encrypted genomes stored in a commercial cloud service. The winning solution was from Microsoft Research.
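
As a point of reference for the similar-patient search challenge, the computation that a secure protocol has to reproduce can be sketched in the clear. The snippet below is a plaintext baseline only: it offers no privacy protection and is not any team's solution. The sequences, patient identifiers, and function names are hypothetical.

```python
import heapq


def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def top_k_similar(query, database, k):
    """Return the k (distance, patient_id) pairs with the smallest edit distance."""
    scored = ((edit_distance(query, seq), pid) for pid, seq in database.items())
    return heapq.nsmallest(k, scored)


# Hypothetical toy database mapping patient ids to short gene-panel sequences.
database = {
    "p1": "ACGTACGT",
    "p2": "ACGTTCGT",
    "p3": "TTTTACGA",
    "p4": "ACGAACGT",
}

print(top_k_similar("ACGTACGT", database, k=2))
```

In the challenge setting, the query and the database belong to different organizations, so the difficulty lies in producing the same ranking without either side revealing its sequences.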

The nuances of these algorithms have indeed turned out to be non-trivial in terms of privacy risk, practicality, and analytical accuracy. However, these algorithms will continue to improve and become available for general research use. In combination with secure computing infrastructure, they will enable trusted insight sharing across existing data silos.