# CSL professor looks to simplify data collection

These days, data can be used to help us determine everything from the fastest route to a destination to whether our step count and remaining calorie budget justify that slice of cake. But just how much data is needed to make an accurate decision when dealing with complex networks? It’s a question researchers have pondered for years. CSL and ECE Professor Bruce Hajek is trying to find an answer that would save time and money during data collection by limiting the amount of information that has to be gathered and also fill in the holes when data is scarce.

“In a given situation, it’s not clear whether or not you have enough data so we’re looking at performance limits,” said Hajek, the Leonard C. and Mary Lou Hoeft Endowed Chair in Engineering. “We have models and we’re trying to find the limits of how much data is needed to do a certain reconstruction or discovery task.”

In the project, “Learning in Networks: Performance Limits and Algorithms,” recently funded by the National Science Foundation, Hajek and his team will use models to test algorithms’ ability to discern latent (i.e. hidden) structure from data. One approach is to generate datasets with underlying ground truth structures, or data that is known to be accurate, and then give the algorithms a limited portion of the data to see if the algorithms can correctly fill in the blanks and reach the accurate result. Hajek’s goal is to determine the algorithms that can correctly recover the structure with the least amount of data.

One of the applications of this research that most interests Hajek is understanding genomic interactions.

“In the cells of a soybean plant, there’s approximately 64,000 genes, and some turn on and off other genes,” said Hajek. “We can see the activity level, we can measure the activity level of different genes at different times. What we, and a lot of others, want to do is reverse engineer the plant, to find which genes are regulating other genes. But how much data do you need to accurately recover the network?”

Various data types will be involved in the research: time processing, sequencing, matching graphs, and data classification. Time process looks at the growth of a network over time. Hajek explained this with a paper citation example. When an author writes a paper (node), the paper cites previously written papers (nodes). These nodes are then connected to each other through the citations; as more citations are added, the network grows. Hajek’s group will apply algorithms to modeled data to see if they can recover latent structures in the network, such as relatively dense clusters of papers.

In the genomic field, sequencing can help researchers piece together many short DNA subsequences into a long DNA sequence, akin to a jigsaw puzzle. Some technologies are able to indicate how close the pieces are to each other. An algorithm could be developed to put them in the proper chain sequence.

The third data type, matching graphs, involves looking at two snapshots of data at different times and seeking to match up the node in the second snapshot that corresponds to a given node in the first snapshot.

“We call that graph alignment or graph matching,” said Hajek. “If I see two graphs a week apart, I want to look at them and say this node on this graph is the same as that node over there.”

This particular form of data has applications that reach from social media research to self-driving cars to bird tracking. Hajek and his team will use a model that makes two graphs, one of which is a distortion of the other. The nodes are not labeled and the algorithm is tested to see if it can use the distorted graph to produce the original graph.

The fourth type of data organization also involves graphs. To classify the type of the data, the algorithms will need to decipher what type of graph is being used and what type of model the graph was built on.

If optimal algorithms can be found, Hajek will have made an intellectual break through.

“If you figure out the limits and the algorithm is right up against the limit, it’s an optimal algorithm,” said Hajek. “You can try out algorithms and empirically say which is better but we’re trying to have a principled way to determine the performance of the algorithm.”