CLASSIFICATION NEURAL NETWORKS FOR GENOME RESEARCH
The long-term objective is to develop the computer technology needed to accomplish the objectives of the Human Genome Project and to apply that technology to the analysis and management of sequencing data. Currently, a database search for sequence similarities is the most direct computational approach to analyzing genomic information. However, such searches are becoming increasingly time-consuming because of the accelerating growth of sequencing data. The goal of the proposed research is to further develop and enhance a software tool for rapid classification of unknown sequences and to make it available to the genome community. The research will build on a promising pilot system designed and developed by the principal investigator. The specific aims are (1) to enhance the tool for rapid identification of PIR superfamilies and ProSite patterns, (2) to develop a pilot DNA/RNA classification system, (3) to distribute the tool, and (4) to aid the organization of the PIR protein database and the RDP ribosomal RNA database. In contrast to other search methods, whose search time grows linearly with the number of entries in the database, the search time of the proposed tool grows with the number of families, which is likely to remain low. The tool would automate family assignment, which is especially important for managing the influx of new data in a timely manner.
The proposed research applies neural network technology to the database search and organization problem. The major design principles are an encoding schema that extracts sequence information and a modular architecture that scales up backpropagation networks. The encoding algorithm is a hashing function similar to the k-tuple method. A pilot system has been implemented on a Cray supercomputer to classify electron transfer proteins and enzymes. The system achieves about 90% accuracy and runs about 50 times faster than other search methods. If the database continues to grow at the current rate, the tool may be 1000 times faster than other methods within a decade. In the proposed research, the sensitivity of the tool would be improved and a full-scale system would be developed. The automated software tool would be portable at the source code, user interface, and hardware levels. The system would be updated in accordance with database releases and distributed to the research community via anonymous FTP. The tool would be used to classify PIR sequences according to superfamilies and to classify ribosomal RNA sequences according to phylogenetic relations.
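The encoding step described above can be illustrated with a minimal sketch. It is not the pilot system's actual encoding, which is not specified here; it only assumes that a k-tuple style hash maps each overlapping window of k residues to one slot of a fixed-length vector. The function name `ktuple_encode`, the default k of 2, the 20-letter amino acid alphabet, and the normalization by window count are all hypothetical choices made for illustration.

```python
from itertools import product

# Assumption: the 20 standard amino acid one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ktuple_encode(sequence, k=2, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of normalized k-tuple frequencies.

    Hypothetical illustration of a k-tuple style encoding; the vector has
    len(alphabet) ** k slots regardless of the sequence length.
    """
    # Assign each possible k-tuple a slot, e.g. "AA" -> 0, "AC" -> 1, ...
    index = {"".join(t): i for i, t in enumerate(product(alphabet, repeat=k))}
    vector = [0.0] * len(index)
    windows = 0
    # Slide a window of width k over the sequence and count each tuple.
    for start in range(len(sequence) - k + 1):
        slot = index.get(sequence[start:start + k])
        if slot is None:
            continue  # skip windows containing characters outside the alphabet
        vector[slot] += 1.0
        windows += 1
    # Normalize so sequences of different lengths are comparable.
    if windows:
        vector = [count / windows for count in vector]
    return vector


if __name__ == "__main__":
    protein_vector = ktuple_encode("MKVLAAGIVGLLLA", k=2)
    rna_vector = ktuple_encode("ACGUACGU", k=3, alphabet="ACGU")
    print(len(protein_vector), len(rna_vector))
```

In this sketch the vector length depends only on the alphabet size and k, not on the sequence length, so sequences of any length produce inputs of the same size for a network with a fixed number of input units.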