Milenkovic develops solutions for compressing big data
With the 2016 presidential election campaigns well underway, voters are continually being polled on their views of the candidates. Questions such as “Which candidate seems the most and least trustworthy?” or “What is the single most important problem facing the nation today?” will be asked and analyzed. But sorting and scrutinizing those answers isn’t as easy as it seems.

Olgica Milenkovic is working to create a solution that will help researchers easily and quickly compare, organize, and compress this data, which often comes from crowdvoting systems and recommender systems like Netflix. Milenkovic received a $500,000 NSF grant to collaborate with the University of Minnesota on this problem.
“There is massive data out there telling you things about voting preferences, food preferences, movie preferences, and usually it is in the form of ratings,” Milenkovic said. “But everyone has a different scale for ratings. For example, my score of five means I really like something, but someone else may never give a five and when he gives a three, that means he really liked something. Ordinal data removes scale and just says what people prefer with respect to multiple options.”
Compression of these data formats involves all sorts of tasks such as ordinal data clustering and ordinal data aggregation.
“You’re looking at how similar or dissimilar votes are and it’s not always easy to say how they compare because you can’t put a numerical value on them,” she said. “This makes it hard to measure distances between the rankings.”
In order to compress similar data together, researchers need to know how similar or dissimilar the data actually is. Milenkovic is working to develop better ways, both objective and subjective, to measure these distances between rankings.
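One classical, objective distance between rankings (used here as a familiar illustration, not necessarily the measure Milenkovic's group developed) is the Kendall tau distance: it ignores rating scales entirely and simply counts how many pairs of items two rankings order differently. A minimal sketch:

```python
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """Count the pairs of items that the two rankings order differently."""
    # Map each item to its position in each ranking.
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    disagreements = 0
    for x, y in combinations(rank_a, 2):
        # A pair is discordant when the two rankings disagree on its order.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0:
            disagreements += 1
    return disagreements

# Two voters rank the same four candidates; only relative order matters.
print(kendall_tau_distance(["A", "B", "C", "D"], ["B", "A", "C", "D"]))  # 1
```

Identical rankings give distance 0, a complete reversal of four items gives 6; rankings with many discordant pairs are "far apart" and would land in different clusters.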
“Many ideas have been proposed, but most don’t work out well practically,” she said. “We’ve created a solution that has theoretical grounding and we’re specifically focusing on solving the problems surrounding recommendation system data, aggregation of rankings in metasearch engines, and clustering of genomes reversal datasets.”
Milenkovic, whose recent research deals with determining smart ways to process, analyze, and store massive data sets, is also studying how to efficiently compress biological data from humans and animals in a parallelized fashion. Along with fellow Illinois professors Deming Chen, Wen-mei Hwu and Jian Ma, she received a $1.35 million NIH grant to create highly efficient algorithms to compress and decompress biological data in minimal time. The team is using state-of-the-art methods from information theory and machine learning to develop their solution.
“People are realizing this is important because of the recent exponential growth in genomic data,” Milenkovic said. “Most of the growth is attributed to low-cost sequencing, which has led to many initiatives in medical research that create extremely large data sets.”
For example, ENCODE, a project funded by the National Human Genome Research Institute, seeks to identify all functional elements in the human genome sequence and has generated massive amounts of data. According to Milenkovic, if the project tries to store the data without compressing it first, it will be impossible to easily and quickly access the data.
“If a clinic wants to send a patient’s genome to another lab, there’s no way to send it via email or Dropbox because the data is so large,” Milenkovic said. “You have to physically mail it on a hard drive. This makes it very difficult to work with, and people end up spending a lot of time downloading, uploading, and trying to compress the data.”
The Illinois team has developed algorithms that use multiple methods to break the data into smaller sets that can be more easily handled. The challenge is then to combine the results from all the different subsets in an efficient, high-quality manner.
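The divide-and-recombine idea can be sketched in a few lines. This toy version uses the general-purpose zlib codec as a stand-in for the team's specialized genomic compressors: the input is split into fixed-size blocks, each block is compressed independently (so the work parallelizes), and decompression stitches the blocks back together.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def parallel_compress(data: bytes, block_size: int = 1 << 20) -> list:
    """Split the input into fixed-size blocks and compress each independently."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor() as pool:  # zlib releases the GIL, so threads overlap
        return list(pool.map(zlib.compress, blocks))

def parallel_decompress(compressed_blocks: list) -> bytes:
    """Decompress every block and stitch the results back together."""
    return b"".join(zlib.decompress(b) for b in compressed_blocks)

# Round-trip a toy "genomic" sequence.
data = b"ACGTACGGTTAC" * 50000
assert parallel_decompress(parallel_compress(data)) == data
```

The trade-off the article alludes to is visible even here: smaller blocks mean more parallelism but a worse compression ratio, since each block is compressed without knowledge of the others.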
Data deduplication is another way to handle large amounts of data, and Milenkovic is working with UCLA and the Illinois Institute of Technology on a $500,000 NSF grant to create viable compression solutions using deduplication.
“People are paying to store data in places such as Dropbox, but often people are storing the same data or similar data,” Milenkovic said. “So why would Dropbox store so many copies for everyone if they could just have one copy and link it to all the users that have that data?”
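The "one copy, many links" idea is typically implemented with content-addressed storage: each block of data is keyed by a cryptographic hash of its contents, so identical blocks uploaded by different users are stored exactly once. A minimal sketch (the class and method names here are illustrative, not from any real Dropbox API):

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: each unique block is kept once, keyed by its SHA-256 hash."""

    def __init__(self):
        self.blocks = {}       # hash -> block contents (one copy, ever)
        self.user_files = {}   # (user, filename) -> list of block hashes

    def put(self, user, filename, data, block_size=4096):
        hashes = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)  # no-op if the block already exists
            hashes.append(h)
        self.user_files[(user, filename)] = hashes

    def get(self, user, filename):
        return b"".join(self.blocks[h] for h in self.user_files[(user, filename)])
```

When a second user uploads the same file, `put` adds no new blocks at all; only a list of hashes pointing at the existing copies is recorded.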
While that sounds like an easy solution, it becomes problematic if the lone copy of the data is lost. Milenkovic suggests using distributed storage codes, where parts of the data are stored in one place and the other parts in a separate place, in such a way as to make the system easy to synchronize.
“These codes allow you to store data in a distributed fashion, so part of a song, for example, is on one disk and part is on another,” she said. “This way, if a disk fails, I won’t lose the whole song. The codes add protection, so there is more than one copy of the data, but one has to have smart ideas in order to simultaneously remove excessive redundancy via deduplication. It’s a balance between introducing redundancy to remove errors and still removing as much data as possible.”
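The simplest distributed storage code illustrates the balance she describes. This sketch (a single-parity erasure code, far simpler than the codes used in real systems) splits the data across two "disks" and stores an XOR parity block on a third, so any one of the three can fail without losing the data, at the cost of only 50% added redundancy rather than a full second copy:

```python
def encode(data: bytes):
    """Split data into two halves and add an XOR parity block."""
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\0")  # pad so the halves line up
    parity = bytes(x ^ y for x, y in zip(a, b))
    return a, b, parity, len(data)

def recover(a, b, parity, length):
    """Rebuild the original even if any one block was lost (passed as None)."""
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return (a + b)[:length]
```

Losing disk `a` is repaired by XOR-ing `b` with the parity, and vice versa; losing the parity disk costs nothing, since `a` and `b` are intact. The redundancy is there for error recovery, not as a wasteful duplicate, which is exactly the balance the quote describes.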
Milenkovic was the first to propose deduplication with synchronization and to develop algorithms that strike this balance between redundancy and compression. She suggests a novel technique in which programs identify the changes between old and new files and save only those changes, rather than uploading a completely new version of the data. All of this is accomplished in the coded domain, where data is seen only in a compressed, encrypted, or otherwise transformed form.
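The save-only-the-changes idea can be sketched as a simplified, rsync-style delta over aligned fixed-size blocks. To be clear, this toy operates on plaintext; the point of Milenkovic's work is to do the equivalent directly in the coded domain, which this sketch does not attempt:

```python
import hashlib

BLOCK = 4096

def _hashes(data: bytes):
    """Hash each fixed-size block of the file."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def make_delta(old: bytes, new: bytes) -> dict:
    """Record only the blocks of `new` that differ from the same-position block of `old`."""
    old_hashes = _hashes(old)
    changed = {}
    for i in range((len(new) + BLOCK - 1) // BLOCK):
        block = new[i * BLOCK:(i + 1) * BLOCK]
        if i >= len(old_hashes) or old_hashes[i] != hashlib.sha256(block).hexdigest():
            changed[i] = block
    return {"length": len(new), "changed": changed}

def apply_delta(old: bytes, delta: dict) -> bytes:
    """Rebuild the new file from the old copy plus only the changed blocks."""
    n_blocks = (delta["length"] + BLOCK - 1) // BLOCK
    out = [delta["changed"].get(i, old[i * BLOCK:(i + 1) * BLOCK])
           for i in range(n_blocks)]
    return b"".join(out)[:delta["length"]]
```

Instead of re-uploading a multi-gigabyte file, the sender transmits only the `changed` blocks, and the receiver reconstructs the new version from its existing old copy.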
“If we do not find timely solutions for data compression, many labs and scientific institutions will be forced to discard invaluable data after a short storage time, as the cost of keeping it will be too high,” Milenkovic said. “The costs of storing data in the cloud will increase, and access and retrieval will become painfully slow. If we don’t discover a solution quickly, a lot of data that could be used for future studies will be irrecoverably lost. This is especially a problem with biological data, as no two patients’ genetic signatures are ever quite the same. One needs to know all possible variations.”