Solving the DNA Puzzle--MTU Research Magazine 1997

Solving the DNA puzzle

by Marcia Goodrich

Pinpointing the needle in the haystack isn't the half of it. Suppose the haystack is the size of the Sears Tower and the needles look exactly like the hay. And suppose, for the sake of argument, that every straw and every needle has to line up just so, and it's your job to figure out which one goes where.

You'd probably want to ask Xiaoqiu Huang for some help. For your job wouldn't be too far removed from the Human Genome Project, where genes are the needles and nucleotides are the hay.

When they are finished finding the needles and ordering the straw, the scientists working on the Human Genome Project will have deciphered the blueprint of the human species.

One phase of the project involves mapping the location of all the nucleotides, the four chemicals that form the basis of DNA, the bunched and twisted strands that make up the twenty-three pairs of chromosomes in a human cell.

The mass of data is enormous: the human genetic code contains 3 billion nucleotides. Finding harmony in the apparent chaos seems the next best thing to impossible, but that's what Huang's software does.

To cope with the sheer mass of data, scientists employ a back-door, piecemeal approach.

"It's called a shotgun sequencing strategy," explains Huang, an associate professor of computer science at Michigan Tech, who took a leave in fall 1996 to work at The Institute for Genomic Research. "There are too many nucleotides on the entire DNA molecule to read, so you make many copies of the DNA molecule and cut them into many pieces."

Then biotechnical devices known as sequencing machines figure out the order of the nucleotides on each of the many fragments. Huang's CAP2 software (the updated version of his Contig Assembly Program) works by comparing the fragments, identifying the overlapping areas, and determining how the fragments should fit back together.
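The compare, overlap, and merge idea can be illustrated with a toy example. The sketch below is not CAP2's algorithm; it is a minimal greedy merger that assumes error-free fragments, all read from the same strand, with no repeats. The function names and the minimum overlap of three letters are chosen purely for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps; return the separate pieces
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags


contigs = greedy_assemble(["ATGCGT", "CGTTAC", "TACGGA"])
print(contigs)
```

Real fragment data breaks every one of these assumptions, which is where the false overlaps, sequencing errors, and repetitive regions that Huang describes come in.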

"You sometimes have false overlaps, sequencing errors, and lots of repetitive regions," he said. "The program has to be robust."

Huang's program is robust enough to have been licensed to pharmaceutical megacorp Glaxo Wellcome, the Institute for Genomic Research in Rockville, Maryland, and the Medical Research Council Centre at Cambridge University. In addition, it is provided to researchers by Applied Biosystems Inc., the major supplier of instrumentation for the biotechnology market.

The Foster City, California, company sells DNA sequencing machines and was looking for someone who could develop the powerful software needed to perform the fragment assembly and pull all the data together.

"We gave Xiaoqiu a grant to support his research for about a year, and the result was CAP2," Burcham said. "At a competition where a number of the world's best algorithmists got together with their different sequence assembly programs, CAP2 was the best."

Huang's programs appeal to Applied Biosystems because he keeps his focus broad. "His odd mix of pragmatism and theory fits in well with us," Burcham said. "The problem we have with DNA is that it's not a known system. And it looks random, but it isn't. The emphasis is on real data, and Xiaoqiu pulled it off: he bridged the computer side with the biology side."

"CAP2 solves real problems for real people, who'll use it every day for hours," Burcham added. "We're looking forward to his continued new developments."

Staff scientist Jinghui Zhang first encountered Huang's work when she adapted it to analyze the E. coli genome at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland. At the time, this common bacterium had the best-understood genome of all organisms.

"Only Xiaoqiu's software can deal with these long sequences [of nucleotides], and his software became the best choice," Zhang said. "I tried some others, but the quality was not as good. That's why I stick to Xiaoqiu's."

A second aspect of the Human Genome Project involves pinpointing the location of genes, which, like all DNA, are made up of nucleotides. But in chromosomes, the genes are buried in a tremendous heap of genetic junk: 97 percent of the nucleotides in DNA appear to do absolutely nothing.

Huang is refining two other sets of software programs that can tell the difference. They help biologists sift through the heaps of sequenced DNA segments generated by researchers, finding the 3 percent of the DNA that actually does something: it makes proteins, the basis of human tissue.

To winnow out the genes, Huang's software again employs a back-door strategy. It compares researchers' newly sequenced DNA fragments with other genetic material: proteins and man-made molecules called cDNA and ESTs. These materials contain clues that can reveal genes' hiding places on DNA.

Until this point, researchers know the composition of the DNA fragment, but they have no idea whether it has any genes on it, or even where they might be. Only Nature, Huang notes, seems to have no trouble telling genes apart from bench-warming nucleotides.

His NAP (Nucleotide-Amino Acid Alignment Program) and GAP (Global Alignment Program) take a little more time than Nature, but they can help researchers make that distinction.

The programs first search the genome molecule database to find a match with the new DNA fragment. Then, the software lines the two molecules up, revealing where the genes might be hidden on the new DNA. For example, in the case of a match with a protein, his software lines up the protein's amino acids with the very genes on the DNA that are used to produce those same amino acids.
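To give a feel for what "lining up" two molecules means, here is a minimal sketch of textbook global alignment by dynamic programming (the Needleman-Wunsch method). It is not Huang's GAP or NAP code, and the scoring values (match +1, mismatch -1, gap -1) are assumptions picked only to keep the example small.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Align two sequences end to end; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def pair(i, j):
        # Score for placing a[i-1] opposite b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + pair(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Walk back from the corner to recover one best alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = global_align("GATTACA", "GATACA")
print(top)
print(bottom)
print(total)
```

A dash in either output line marks a position where one sequence has a letter and the other has none. Aligning DNA against a protein, as NAP does, adds the further step of reading the DNA three letters at a time.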

At NCBI's GenBank, the Human Genome Project's official deposit site, Wojtek Makalowski, a GenBank fellow, has been using GAP to find genes on DNA.

GAP's big advantage is its precision, Makalowski says: "You don't need a lot of human intervention, and most alignment programs require additional work. On top of that, the program is relatively fast and doesn't require a lot of memory."

GAP isn't for amateurs. No WordPerfect, it's an academic program geared for specialized use. "But the bottom line is, it's a very good program, one of the best," Makalowski says.

GAP has an added bonus: because it can check any new DNA fragments against DNA that's already in the GenBank database, it shows if the new fragment has been sequenced before. It helps weed out duplicate sequences, a real problem in a project where so many people are working to analyze so much information.

Also at the National Center for Biotechnology Information, Zhang has been using Huang's software in other genetic research, to detect subtle relationships between organisms related as obliquely as yeast and Homo sapiens.

She looks for frame shifts, differences in DNA that show where evolutionary change has taken place at the biochemical level. "There are many human and yeast proteins that share similarity," she said. "But you need a very good program to find them."

Huang is pleased that his work is gaining notice outside the digital world of computer science.

"I'm very happy that my program's being used in the big labs," he said. "When you are isolated in your field, you don't know what programs will be important. Now, working with bioscientists, my work is having an impact on the Human Genome Project.

"It's a lifetime opportunity."

DNA/protein alignment

DNA art courtesy of Paul Thiessen, http://ludwig.scs.uiuc.edu/~paul/

Color = a perfect match between a DNA codon and the protein's amino acid

Xiaoqiu Huang's NAP software matches up a new DNA sequence, represented by the top line, with a known protein sequence on the bottom line. It's not a match made in heaven, but it does give researchers important clues as to the nature of this DNA fragment.

Each DNA letter represents a nucleotide, while each protein letter corresponds to an amino acid. Three nucleotides, called a codon, code for one amino acid; the colons in the middle row show where the codons code exactly for an amino acid.

This DNA sequence codes for a protein that is similar, but not identical, to the given protein. If this were a perfect match, the codon ATG would line up with the amino acid M, not the first V shown in the protein sequence. But, since this DNA is so close to a known protein sequence, researchers can conclude that it does code for a protein and carries important genetic information.

(Editor's note: Different codons can code for the same amino acid. Thus, TTC and TTT both are matched with F. The dash (-) indicates that there is a shift in the reading frame. A reading frame is one of the three ways of reading a DNA sequence as a series of codons.)
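The codon and reading-frame ideas in the caption can be tried out in a few lines. This is a minimal sketch using the standard genetic code, with an asterisk standing for a stop codon. The function name and the sample sequence are made up for illustration, and the code assumes the input contains only the letters A, C, G, and T.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, frame: int = 0) -> str:
    """Read `dna` three letters at a time, starting at offset `frame` (0, 1, or 2)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(frame, len(dna) - 2, 3))


sample = "ATGTTCTTT"  # made-up sequence: ATG, TTC, TTT
for frame in range(3):
    print(frame, translate(sample, frame))
```

Frame 0 reads the sample as M, F, F, showing both TTC and TTT matched with F. Starting one or two letters later gives a completely different series of amino acids, which is why a shift in the reading frame matters.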