New tool improves accuracy, customization in DNA sequencing capabilities
DNA analysis may open the door to scientists unlocking the mysteries of disease, among other breakthroughs. This analysis often includes identifying variants in DNA sequences – known as variant calling – which has a wide range of applications in bioinformatics and genomics, from discovering a person’s susceptibility to cancer to strategizing crop-breeding methods.
Researchers at the University of Illinois Urbana-Champaign and Mayo Clinic have created a new variant caller that is more accurate and more customizable than the current state of the art, which may help analysis pipelines support greater discovery in the future. In the paper, “HELLO: improved neural network architectures and methodologies for small variant calling,” published in BMC Bioinformatics (part of Springer Nature), the team advocates for a hybrid strategy and the use of a deep neural network (DNN) architecture that is specifically tailored to sequencing data. DNNs are a type of machine learning that enables computers to recognize underlying relationships in a set of data, mimicking the inference abilities of the human brain.
“We can train our model in a much shorter time and with a higher chance to succeed than current models,” said Deming Chen, the Abel Bliss Professor of Engineering at UIUC. “We start with a suitable level of domain knowledge, thanks to our collaboration with Mayo, which provides important guidance for our machine learning model. The HELLO caller doesn’t need a lot of parameters to learn well.”
One of the best tools currently on the market, Google’s DeepVariant, uses a DNN to analyze sequencing data that are formatted as images, an approach that is expensive and time-consuming. The HELLO variant caller instead uses DNN architectures and variant inference functions that account for the underlying nature of sequencing data, without converting the sequences to images. HELLO also supports hybrid variant calling, in which sequencing data from multiple platforms are combined, resulting in even higher accuracy.
To create the HELLO variant caller, the researchers used publicly available data, along with subject matter expertise from Eric Klee, a researcher in Mayo’s Center for Individualized Medicine and the Department of Laboratory Medicine and Pathology. When stacked up against other variant callers, the HELLO tool outperformed competitors in accuracy. For example, HELLO made up to 65% and 40% fewer erroneous variant calls than DeepVariant when targeting indel-type and substitution-type variants, respectively, the two most numerous types of variants in the human genome. HELLO’s DNN models are also between 7 and 14 times smaller than the DeepVariant DNN model.
“When you want to take a high-level approach, which is what we did, the key is to apply the right level of domain assumptions,” said Anand Ramachandran, the paper’s lead author and a PhD candidate in electrical and computer engineering at UIUC. “As a result of appropriate application of these assumptions, we have a smaller model that is actually more accurate.”
An added benefit of its simplicity, says collaborator Steven Lumetta, is that the HELLO tool is cost-efficient, making it feasible for Mayo or other health care organizations to incorporate the tool into patient care.
“As the result of our collaboration with Mayo, we were better able to understand the tradeoffs between the time and cost of operating a tool versus what a hospital can afford to pay,” said Lumetta, an associate professor of electrical and computer engineering at UIUC. “We believe we have a tool that is viable in the marketplace.”
The Center for Computational Biotechnology and Genomic Medicine (CCBGM), part of the National Science Foundation’s Industry-University Cooperative Research Centers program, supported this research.
“One fundamental advantage of HELLO is that it is future-proof,” said Chen. “Its new hybrid methodology can flexibly incorporate different types of sequencing data, which mitigates the weaknesses of any single data type. This is essential for improving the overall quality of variant calling, in terms of its accuracy, cost, and reliability.”