Research hopes to ensure privacy of anonymized data

2/12/2015

Katie Carr, CSL

The Internet is now a goldmine of personal information about individuals, collected both voluntarily, through methods such as frequent flier/shopper incentives, social media or Amazon reviews, and involuntarily, such as the U.S. Census or medical records. While much of the information is anonymous, the ability to search information and compare it to other data sets makes today’s data especially susceptible to privacy violations.

CSL Professor Pramod Viswanath was recently awarded a $500,000, three-year NSF grant to conduct research on statistical data privacy, in hopes of developing a way to keep using the vital information in these databases without it becoming a privacy concern.

Pramod Viswanath

“When the U.S. government does a census, they have to release that information,” Viswanath said. “They only release statistics, such as on average how many people earn over $100,000 or how many are college educated. The idea is that averages don’t reveal very much, and that was true historically, but not anymore.”

Anonymizing user information is a classic technique used by researchers, but it is susceptible to correlation attacks, in which someone correlates the anonymized database with another, perhaps publicly available, database that still contains identities, so that a user’s privacy is divulged anyway. Getting anonymization right is important because, in some cases, there is no way to stop this information from being available, and some of the data has great value: analyzing it could lead to progress in highly important fields such as medicine and education.

In today’s information age, it’s possible to search anonymized health statistics along with time stamps that show who went to the hospital on which day and discover information about a patient, such as the governor at the time, as one researcher in Massachusetts did. Additionally, researchers were able to identify people by comparing Netflix records with time-stamped reviews on IMDb.
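To make the idea of a correlation attack concrete, here is a minimal sketch in Python. It is not the method used in either of the cases mentioned above; the records, the names, and the choice of matching fields (ZIP code, birth year, sex) are all hypothetical and chosen only for illustration. The sketch joins an "anonymized" table to a public one on the fields they share and reports a match only when exactly one public record fits.

```python
from collections import defaultdict

# Hypothetical toy data: every value below is invented for illustration.
ANONYMIZED_VISITS = [
    {"zip": "00001", "birth_year": 1950, "sex": "M", "diagnosis": "condition A"},
    {"zip": "00002", "birth_year": 1982, "sex": "F", "diagnosis": "condition B"},
    {"zip": "00002", "birth_year": 1982, "sex": "F", "diagnosis": "condition C"},
]
PUBLIC_ROLL = [
    {"name": "Pat Example", "zip": "00001", "birth_year": 1950, "sex": "M"},
    {"name": "Sam Sample", "zip": "00002", "birth_year": 1982, "sex": "F"},
    {"name": "Lee Placeholder", "zip": "00002", "birth_year": 1982, "sex": "F"},
]
# Assumed set of fields that appear in both tables.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(anonymized, public, keys):
    """Return (name, diagnosis) pairs for rows that match exactly one person."""
    # Index the public table by the shared fields.
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for row in anonymized:
        candidates = index.get(tuple(row[k] for k in keys), [])
        # A unique candidate means the "anonymous" row is re-identified.
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches


reidentified = link_records(ANONYMIZED_VISITS, PUBLIC_ROLL, QUASI_IDENTIFIERS)
```

In this toy data only the first visit is re-identified, because the other two visits each fit two different people. The more fields the two tables share, the more often a combination becomes unique to one person.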

“This is a major issue today,” Viswanath said. “Historically, no one would really know about you or they would forget. If they didn’t happen to see you at a certain normal time, they wouldn’t even notice, but on the Internet, nothing goes away.”

Viswanath, along with ECE graduate student Peter Kairouz, will study this anonymized data and hopes to create algorithms that add a small amount of noise to the data, focusing specifically on the release of genomic data and smart meter data. Due to privacy concerns, data in these two areas is currently not widely available. The hope is that the noise will make it impossible to identify the participants but still leave the data intact.

“There’s a tension between doing a good job of hiding the personal information and revealing some really useful information,” Viswanath said. “You can hide a lot by putting a lot of noise into a data set, but then the data is useless.”
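The tension Viswanath describes can be illustrated with a short Python sketch. The article does not say which algorithms the project will use; the example below substitutes a standard textbook technique, the Laplace mechanism from differential privacy, purely as an assumption. The function names, the privacy parameter `epsilon`, and the count of 1,000 are hypothetical. A smaller `epsilon` means more noise and stronger privacy, and the sketch measures how far the released count drifts from the true one.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    # Assumption: one person changes the count by at most 1 (sensitivity 1),
    # so the Laplace mechanism uses a noise scale of 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials=2000, seed=0):
    # Average distance between the released count and the true count.
    rng = random.Random(seed)
    total = sum(
        abs(noisy_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return total / trials


# Hypothetical statistic: 1,000 people in a data set share some attribute.
error_strong_privacy = mean_abs_error(1000, epsilon=0.1)  # lots of noise
error_weak_privacy = mean_abs_error(1000, epsilon=10.0)   # little noise
```

With the stronger privacy setting the released count is typically off by around ten, and with the weaker setting by around a tenth, which is the trade-off in miniature: more noise hides individuals better but makes the statistic less useful.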

The focus of Viswanath’s project is understanding that tension and then creating and correctly applying algorithms that handle it. By randomizing the data sets just enough, Viswanath could develop a way to keep information private while also releasing a database that’s as true to the original as possible.

“Right now we’re just hoping people are benevolent, but there’s no reason why they necessarily are,” Viswanath said.