Forget ones and zeros. The new code to break is the human genome, which has been transcribed into monstrous strings of A’s, G’s, C’s and T’s. Computer science researchers at UCSB are providing biologists with the tools to unlock its secrets.
Bioinformatics, a new field of research, applies computing technology to biological research. Ming Li, a UCSB computer science professor, has formed a bioinformatics company with his graduate students that stands to give the field a dynamic leap forward.
A rough draft of the human genome was completed in June 2000, much earlier than anyone had expected, but researchers, geneticists and biologists now face a new task: making sense of the massive text strings.
The map of the human genome is nothing more than a large text file containing 3 billion characters, but breaking its code could mean the discovery of new cures for genetic diseases as well as a better understanding of our own evolution.
Even though the current map of the human genome is only a rough draft and is still being improved, researchers have already poured large amounts of resources into understanding the secrets locked in the seemingly simple sequences of letters.
Currently, researchers are trying to compare small sequences of text against the entire genome from which they came. Other research compares the entire genome of one species against that of other species.
Instead of trying to find exact matches, researchers are hoping to find approximate matches. Approximate matches are more valuable because they indicate that completely different species have something in common. This type of partial match is called a homology. The logic is that if organisms or species are descended from common ancestors, they will have similar sequences in their genomes.
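To make the idea of an approximate match concrete, here is a minimal sketch in Python. It is only an illustration, not the method used by BLAST or PatternHunter: it slides a short query along a longer sequence and reports every position where the two differ in at most a chosen number of letters. The function names, the toy sequences and the mismatch limit are all made up for this example, and real tools also handle inserted and deleted letters, which this sketch ignores.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_matches(query, genome, max_mismatches):
    """Return (position, mismatches) for each window of `genome` that
    differs from `query` in at most `max_mismatches` letters.

    Hypothetical helper for illustration; substitutions only.
    """
    hits = []
    m = len(query)
    for start in range(len(genome) - m + 1):
        d = hamming(query, genome[start:start + m])
        if d <= max_mismatches:
            hits.append((start, d))
    return hits


# Toy data (invented): the query occurs exactly at position 2 and
# with one changed letter at position 14.
genome = "TTGACCGTAAGCTTGATCGTAAGG"
query = "GACCGTAAG"

print(approximate_matches(query, genome, 0))  # exact matches only
print(approximate_matches(query, genome, 1))  # allow one mismatch
```

An exact search finds only the copy at position 2, while allowing a single mismatch also finds the slightly altered copy at position 14. This brute-force scan checks every position, which is far too slow for a 3-billion-letter genome; it is meant only to show what "approximate" means.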
Comparing a small string of text against strings of a billion or more characters is impossible to do by hand, and comparing the genomes of two separate species is even more impractical. Researchers rely on computers to find homologies, but even that requires supercomputers to process the workload, and results can still take months to come back.
The computer program used by most researchers, called BLAST, is more than 10 years old, and it becomes overwhelmed as the size of its input increases.
Recognizing the need for faster results, Li and his team of graduate students and postdoctoral researchers set out to provide a solution.
The result of their work is a program called PatternHunter, which is capable of running on a personal computer. Many experts assumed the program would run more slowly than BLAST because it runs on a personal computer and was written in the Java programming language, which is sometimes unwieldy. However, PatternHunter’s performance, speed and quality rival those of BLAST, which runs on supercomputers.
PatternHunter significantly outperforms BLAST because it is programmed to find approximate matches despite minor inconsistencies between the genomes it is comparing.
Off the Drawing Board
The new program is a leap forward from BLAST in speed, graphing capabilities and measurement quality.
“PatternHunter is more sensitive [than BLAST],” Li said.
It catches alignments, or matches, that would have been missed by BLAST. To biologists, those alignments can be crucial finds. This sensitivity is also a product of PatternHunter’s ability to approximate when searching for alignments.
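The article does not say how PatternHunter achieves this sensitivity, so the following Python sketch is a generic illustration rather than a description of the program. Many search tools first look for short "seed" matches between two sequences and only then examine the surrounding region. The sketch compares two hypothetical seed patterns of equal weight: "1111" requires four consecutive matching letters, while "11011" requires four matching letters but ignores the one in the middle. The function names, seed patterns and toy sequences are assumptions made for this example.

```python
from collections import defaultdict


def masked_key(s, start, seed):
    """Letters of `s` under the '1' positions of `seed`, starting at `start`."""
    return "".join(s[start + k] for k, bit in enumerate(seed) if bit == "1")


def seed_hits(a, b, seed):
    """Return (i, j) pairs where a[i:] and b[j:] agree at every '1'
    position of `seed`; '0' positions are free to differ.

    Hypothetical helper for illustration only.
    """
    span = len(seed)
    index = defaultdict(list)
    for i in range(len(a) - span + 1):
        index[masked_key(a, i, seed)].append(i)
    hits = []
    for j in range(len(b) - span + 1):
        for i in index.get(masked_key(b, j, seed), []):
            hits.append((i, j))
    return hits


# Toy data (invented): b is a copy of a with letters changed at
# positions 2, 5 and 8, so no four consecutive letters still agree.
a = "GATTACAGGCT"
b = "GACTAGAGTCT"

print(seed_hits(a, b, "1111"))   # consecutive seed
print(seed_hits(a, b, "11011"))  # seed that skips one position
```

On this toy pair the consecutive seed finds nothing, while the seed that tolerates a differing letter reports hits at positions 0, 3 and 6, so a related region that the stricter search would skip is still detected. Which seed patterns work best on real genomes is a separate question that this sketch does not address.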
Another key advantage the program provides is its ability to compare the complete genomes of different species.
“PatternHunter is the only program that is able to compare two genomes of this size,” Li said.
Comparisons between the human genome and the mouse genome have already been made using PatternHunter.
The lower hardware requirements for the program make it more economically viable for researchers. No longer will they rely on expensive supercomputers to do their processing.
“In the past, there were many jobs they could not do but now they can do it. They can also use home computers to do it,” Li said.
PatternHunter also saves researchers time because it was written in Java, so the same program runs on any computer that has a Java runtime installed.
Origins of PatternHunter
Li received his Ph.D. from Cornell University and then became a professor at the University of Waterloo in Ontario, Canada. There, he met Dr. Bin Ma, who was a postdoctoral fellow under Li.
“Bin Ma had some initial ideas [about PatternHunter] and eventually it was further developed by John Tromp,” Li said.
Ma wrote the very first version of the program in the C++ programming language in July of 2000 at the University of Waterloo. Afterward, Ma began to refine the program.
“The first release was already better than BLAST and probably several times faster,” Li said.
From there, Li and Ma decided to form a startup called Bioinformatics Solutions, Inc. They contacted another of Li’s postdoctoral fellows, John Tromp. Tromp wrote the second version of PatternHunter and convinced Li that it should be written in Java.
“I didn’t like the way the source code looked in C++,” Tromp said. Java allowed for cleaner code.
Li cites Tromp’s optimization of the code as one of the factors that made PatternHunter perform so well.
“John is one of the best programmers in the world; I’m not exaggerating,” Li said. “The new PatternHunter is probably a hundred times faster than the previous version.”
“John’s implementation was so masterfully programmed, he avoided a lot of pitfalls someone else would have run into,” computer science graduate student Larry Miller said.
Miller is responsible for implementing the user interface that interprets the results of Tromp’s algorithms and presents them in a meaningful way to researchers.
“We wanted to make a visualization scheme that shows this set of alignments [matches],” Miller said. “And just by looking at the geometry and some of the other features, biologists can really make use of that information. You need something besides 3 billion letters to look at.”
The Next Step
“PatternHunter currently only does DNA sequence comparisons. It doesn’t do protein [comparison] yet,” Li said. “We are developing the protein-to-protein [comparison] for PatternHunter.”
Li expects to be finished with PatternHunter within a couple of months. Although it is not yet fully functional, it is already being used at major genome research institutes, such as the Whitehead Institute at the Massachusetts Institute of Technology.
“The hope is that we are doing everything that BLAST does and providing a better program for the bioinformatics community,” Li said.