CSL researcher applies digital communication processes to DNA sequencing

4/20/2021 Allie Arp, CSL

The raw data produced in a single human genome sequencing experiment can be on the order of 100 GB. CSL's Ilan Shomorony hopes to improve the efficiency and accuracy of genomic data processing.


Big data has found a place in every research area, including human genomics. The raw data produced in a single human genome sequencing experiment can be on the order of 100 GB. Large-scale genetic studies sequence the DNA of thousands of individuals, generating a massive amount of data that requires proper storage and accurate processing.

“The ability to obtain inexpensive genomic data is relatively new,” said Ilan Shomorony, CSL assistant professor. “There are DNA sequencing machines that can sequence an entire human genome for less than $1,000. This is causing a revolution in the biological sciences, as it provides a new lens through which to understand the biology of all living beings.”

Shomorony is the principal investigator on a project called “Genomic Data Science: From Informational Limits to Efficient Algorithms,” which aims to improve the efficiency and accuracy of genomic data processing. The project focuses on three data science tasks: aligning sequence data, reconstructing sequences from erroneous fragments, and clustering sequence data. When a person is analyzing quantitative survey data, comparing yes/no answers is simple. However, standard data science techniques cannot be applied directly when the data takes the form of DNA sequence fragments. Shomorony and his team must develop new analysis techniques before they can begin to process such data.

Another large part of the research will involve reconstructing the genome sequence from “noisy” data, or data with a lot of additional, meaningless information. The team will also cluster the data, grouping together sequences that are similar to one another. Once the group has aligned, reconstructed, and clustered the data, it will be ready for biological analysis by domain experts.
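To give a rough sense of what grouping similar sequences can look like in practice, here is a minimal sketch in Python. It is an illustration only and not the project's method: the function names `edit_distance` and `greedy_cluster`, the threshold `max_dist`, and the short example reads are all assumptions made for this example. It measures how many single-letter changes separate two fragments and places each fragment in the first group whose representative is close enough.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-letter
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def greedy_cluster(reads, max_dist):
    """Assign each read to the first cluster whose first member
    (its representative) is within max_dist edits; otherwise
    start a new cluster. Hypothetical helper for illustration."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if edit_distance(read, cluster[0]) <= max_dist:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


# Made-up example fragments, not real sequencing data.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGAACGT"]
clusters = greedy_cluster(reads, max_dist=2)
for group in clusters:
    print(group)
```

A simple approach like this compares every fragment against every group representative, which quickly becomes impractical at the scale of real sequencing experiments. That gap is part of why more efficient algorithms are an active research topic.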

“It’s like taking multiple jigsaw puzzles, dumping all the pieces together, and trying to assemble them,” Shomorony explained. “If all your jigsaw puzzles are different, you may be able to assemble them completely, but if you have similar jigsaw puzzles you may not be able to fully determine which pieces go with which puzzle. You may assemble some pieces, maybe even large chunks, but there may be parts you’ll never be able to assemble. The question we are trying to study is the fundamental limits of metagenomics, so given how similar the jigsaw puzzles, or genomes, are, how much can we reconstruct?”

The mixture used for this analysis is typically collected from a person’s mouth or gut.

“The main application I discuss in the proposal is called metagenomics, specifically for microbiome analysis,” said Shomorony, an electrical and computer engineering assistant professor. “A microbiome is a community of microbes, such as the microbes living in the human gut. Metagenomics is the process of extracting the genetic material from all the microbes in such a community and sequencing it.”

Now that his work has received funding, Shomorony would like to find the right graduate students to help him conduct the research. Even though the applications of his work are more closely aligned with medicine and biology, Shomorony is looking for students with an engineering background. His own research background includes information theory and digital communications, and he found that he could apply some of the same insights from digital communications to genomics and DNA sequencing.

“At the end of the day [with digital communication and DNA sequencing] you are observing some noisy version of the truth,” said Shomorony. “The difference is that in digital communications you have someone sitting at the transmitter side and specifying the message and in genomics it’s nature creating the messages through DNA. Instead of asking about the capacity of the communication channel we’re asking about the capacity of a genomic sequence.”


This story was published April 20, 2021.