Bioinformatics

National Center for Biotechnology Information

Renny Lee

It is well acknowledged that scientific information is being generated at an exponentially increasing rate. One recent molecular biology endeavor is of particular public interest: The Human Genome Project (HGP) sequenced and mapped the complete human genome. Though the HGP was completed successfully, the work of the HGP is far from over. The structure, function, and molecular mechanisms of all the genetic elements comprising the human genome have yet to be discovered. Bioinformatics is one approach being used in this area. Bioinformatics can be defined as the application of computing tools to the solving of biological problems. The Internet provides an accessible and efficient platform capable of housing bioinformatics.
Many scientists today refer to the next wave in bioinformatics as systems biology, an approach to tackle new and complex biological questions. Systems biology involves the integration of genomics, proteomics, and bioinformatics to create a whole system view of a biological entity.
A plethora of bioinformatic tools exist on the Internet, but one particularly good source of information, tools, and resources can be easily accessed at the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov/). The NCBI website is currently the paramount bioinformatics resource made available to researchers and the public. The NCBI offers many services of interest to scientists and students alike. However, even the NCBI's resources are not exhaustive.

This article provides a brief overview of the NCBI and the various resources made available for scientific research and public education. The NCBI is a very general resource for bioinformatic tools and there are more powerful and specialized tools available elsewhere on the Internet. The importance of the NCBI is that it is an accessible and comprehensive source of molecular biology information.

History of the NCBI

The National Center for Biotechnology Information (NCBI) is a multi-disciplinary research group that serves as a resource for molecular biology information. It was formed in 1988 as a complement to the activities of the National Institutes of Health (NIH) and the National Library of Medicine (NLM). Its facilities are located in Bethesda, Maryland, USA. Initially, NCBI was created to aid in understanding the molecular mechanisms that affect human health and disease, with the following goals: to create and maintain public databases, to develop software to analyze genomic data, and to conduct research in computational biology. In time, and through widespread use of the Internet, NCBI became increasingly aware of the role of pure biological research. Molecular biology became as prominent as biomedical research. This was evident as various specialized databases were being created by the NCBI. No longer were human health and disease the primary area of focus. NCBI began offering services as well:
-developing new methods to deal with the volume and complexity of data
-researching methods that can analyze the structure and function of macromolecules
-creating computerized systems for storing and analyzing data about molecular biology
-providing access to analysis and computing tools (which facilitate the use of databases and software) to researchers and the public

In the process of database development, NCBI formed database standards such as database nomenclature that are also used by other non-NCBI databases. One NCBI database is GenBank, the nucleic acid sequence database that contains sequence information from more than 100 000 different organisms. GenBank is probably the most popular database in use. To many, its name is synonymous with the NCBI.

GenBank as the model database

One of NCBI's roles is to maintain publicly available databases. But what exactly are databases, and why are they important for molecular biology? Basically, a database is a large and organized body of data. But one of the key criteria for a biological database is persistent data. In other words, the information encoded and represented by the data may change but the type of data is more resistant to change. This inflexibility of data is a reflection of what comprises macromolecules and how scientists have chosen to symbolize nature. For instance, the sequence of nucleic acids can be symbolized by letters representing nucleotides and a protein sequence can be represented by 20 letters symbolizing the amino acids. These strings of letter symbols constitute a staggering amount of information, but for computerized systems they can easily be organized and manipulated in an optimal way. A model sequence database is GenBank.

GenBank, a database containing all known nucleic acid sequences, is one of the members of the "Triple Entente" of sequence databases; the other two are the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ). As of August 2003, GenBank contained 27.2 million different sequences. There are over 130 complete microbial genomes available as well as over a dozen eukaryotic genomes (including the human genome). Approximately 26% of sequences in the database are of human origin (1).

Searching for a sequence in GenBank is referred to as "making a query". The information that is returned is called the "record" (entry) for the query. The record for each sequence in GenBank contains a brief description of the sequence, the scientific name and taxonomy of the source organism from which the sequence was derived, bibliographic references, and a list of "features". Features include the coding sequence regions of the nucleic acid and other sites of biological importance (such as transcription motifs, repeat regions, mutation sites, and areas of modification). In addition, the protein sequences of the translated nucleic acid coding regions are included. Each GenBank record is assigned an "accession number", which is a stable and unique identifier of the record that doesn't change with time. In addition, a "GenInfo (gi) number" is assigned to each sequence, as is the "version of the accession number"; these numbers do change. For example, when the sequence is updated for CUT1-Receptor (Accession number: AB123456, Version: AB123456.1, gi number: 123456789), the version and gi numbers change while the accession number stays the same. This facilitates archiving of data and prevents inconsistencies of sequence information in the literature.
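The relationship between an accession number and its version can be sketched in a few lines of Python. This is only an illustration, assuming the "accession.version" form shown in the example above (AB123456.1). The function names split_accession_version and same_record are hypothetical, and the sketch does not validate the many accession formats that exist in practice.

```python
def split_accession_version(identifier):
    """Split an identifier such as 'AB123456.1' into (accession, version).

    The accession is the stable part; the version is returned as an int,
    or None when the identifier carries no '.N' suffix.
    """
    accession, dot, version = identifier.partition(".")
    if not dot:
        return accession, None
    return accession, int(version)


def same_record(id_a, id_b):
    """Two identifiers point to the same record if their accessions match,
    whatever their versions are."""
    return split_accession_version(id_a)[0] == split_accession_version(id_b)[0]


if __name__ == "__main__":
    print(split_accession_version("AB123456.1"))
    print(same_record("AB123456.1", "AB123456.2"))
```

Comparing only the accession part is what lets an older citation of a sequence still be traced to the current record after an update.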

GenBank's entries are generally divided in two ways: by taxonomic division (the main areas are bacteria, viruses, rodents, and humans) and by the experimental method used to generate the sequence information. For example, roughly 70% of all sequences in GenBank are ESTs (Expressed Sequence Tags), which are generated by reverse transcribing mRNAs into complementary DNAs (cDNAs). ESTs therefore represent segments of DNA which code for an mRNA. Other common sequence categories include Sequence-Tagged Sites (STSs), used to derive physical maps in genome construction, and Genome Survey Sequences (GSSs).

NCBI offers online software to help researchers submit sequence data into GenBank. Individual researchers may submit a single sequence. Larger submissions often come from sequencing centers, which may submit many sequences or entire genomes. Submission and publication are coordinated: journals that publish sequence data usually require GenBank submission as a condition for publication, and submission to GenBank in turn rests on the author's or researcher's stated intent to publish the sequence. The online submission tool is called BankIt. This tool requires the author to enter the sequence, edit it, and add any biological annotations such as coding regions. BankIt is a tool for small submissions, so genome centers use the submission tool Sequin instead. Sequin allows for the submission of longer sequences and has a more organized method of sequence submission.

Once a sequence has been added to the database, what preparations are necessary before analysis of the data can begin? The answer is found in database retrieval tools.

Retrieving GenBank data and data from other NCBI databases

The primary database retrieval system at NCBI is Entrez, which links together several databases including GenBank. The central database in Entrez is the nucleotide database GenBank, which links to the following databases: PubMed, Protein Sequence, Genomes, Taxonomy, Structure, Population, Online Mendelian Inheritance in Man (OMIM), Books, and 3D Domains. Connections between entries in a database are called neighbours, and connections between entries of different databases are called hardlinks. For example, a sequence retrieved from GenBank can hardlink to a literature citation in PubMed for the particular sequence. PubMed is the NCBI literature citation database, which contains over 12 million journal abstracts. Once a sequence is found in GenBank, or once any data is found in any of the various databases, a list of topic-related journal abstracts can be retrieved from PubMed using hardlinks. Unfortunately, full-text electronic journals cannot be accessed through any of NCBI's databases free of charge. Fortunately, university libraries (such as the UBC library) do provide this service for free to their users.
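Entrez can also be queried from a program rather than through the web pages. This article does not cover that route, so the following is only a sketch. It assumes NCBI's E-utilities "efetch" endpoint and its db, id, rettype, and retmode parameters; the helper name efetch_url is hypothetical. It builds a retrieval URL for the placeholder accession AB123456.1 and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (assumed here; check NCBI's own
# documentation and usage guidelines before sending real requests).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_url(accession, db="nuccore", rettype="fasta", retmode="text"):
    """Build an efetch URL asking for one record in a given format.

    Only the URL string is constructed; no network request is made.
    """
    query = urlencode(
        {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}efetch.fcgi?{query}"


if __name__ == "__main__":
    # AB123456.1 is the illustrative accession from the text, not a real query.
    print(efetch_url("AB123456.1"))
```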

Other database retrieval systems offered by NCBI include LocusLink and the Taxonomy Browser. LocusLink offers descriptive information about genes and is based on curated data. The Taxonomy Browser offers information on lineage of organisms that have corresponding sequences in GenBank. Taxonomic and phylogenetic trees can also be viewed through the Taxonomy Browser.

Once data is retrieved by Entrez, it must be formatted correctly before NCBI's data analysis software can be applied. Sequence data from GenBank is usually converted to the FASTA format, a plain-text form that can be read by data-analytic software tools.
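A FASTA record is simply a header line beginning with ">" followed by one or more lines of sequence. The sketch below shows one way to write and read that layout in Python. The function names to_fasta and parse_fasta, the 70-character default line width, and the example identifier are assumptions made for illustration; dedicated libraries handle many more edge cases.

```python
def to_fasta(identifier, description, sequence, width=70):
    """Format one sequence as FASTA text: a '>' header, then wrapped lines."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


def parse_fasta(text):
    """Read FASTA text into a dict mapping identifier -> full sequence.

    The identifier is the first word after '>' on each header line.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    text = to_fasta("AB123456.1", "hypothetical example", "ACGT" * 20, width=60)
    print(text)
    print(parse_fasta(text))
```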

NCBI's data-analytic software tools

The ultimate goal of bioinformatics is to draw conclusions about data. Analytic software tools allow for the conducting of scientific experiments, the rejection of hypotheses, and the drawing of conclusions concerning molecular biology. Although not a substitute for the workbench, bioinformatics acts as a useful complement to laboratory-generated data. Many data-analytic tools exist at NCBI and at other places on the web. Due to the overwhelming number of techniques available for analyzing data, and to the relative newness of much analytic software, conditions for use of any tool may be confusing. Mistakes due to unfamiliarity are quite common. Other tools have gained widespread use simply by being easy to use. One such tool is the Basic Local Alignment Search Tool (BLAST), which is most commonly used to analyze nucleic acid sequences from GenBank.

BLAST is a software tool that aligns two sequences in order to decide whether homology exists between the two sequences. The sequences can either be two nucleotide sequences or two protein sequences. Homology indicates that the sequences being studied came from a common ancestral sequence. Homology between sequences is also indicative of (but not sufficient to prove) similar function at the molecular level. Misunderstanding about the meaning of the term can be illustrated by statements like, "these two sequences are 66% homologous" and "homology exists to this degree". Homology is not a matter of percentage or degree; it is all or nothing. Homology either exists between sequences or it doesn't. So how does BLAST infer homology? Basically, BLAST is based on the notion of percent-similarity between sequences, together with statistical models of how likely a given nucleotide sequence match is to arise by chance. If two nucleotide sequences show a degree of similarity greater than the statistical model would expect by chance, they are classified as homologous sequences. Different statistical models exist for protein sequences. NCBI offers a variety of BLAST-based tools for analyzing different data types. Besides using BLAST to infer homology between two sequences, it is possible to BLAST a query sequence against the human genome or the mouse genome to look for homologous sequences.
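The distinction between similarity (a percentage) and homology (a yes-or-no inference) is easier to see with a small calculation. The sketch below computes percent identity for two sequences that are assumed to be already aligned, with "-" marking gaps. It is not how BLAST works internally, since BLAST finds the local alignments itself and attaches statistical significance to them; the function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned columns in which both sequences carry the same letter.

    Both inputs must already be aligned and therefore equal in length.
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    print(percent_identity("ACGTAC", "ACGTTC"))
    print(percent_identity("AC-T", "ACGT"))
```

A figure such as 75% describes similarity only. Whether it supports homology depends on how surprising that similarity is under the statistical model, which is why short sequences with high identity may still be unrelated.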

Other NCBI data-analytic tools include Electronic-PCR, which locates Sequence-Tagged Sites, and BLAST-Link (BLink), which shows protein BLAST alignments for every protein sequence found in Entrez. Many more tools can be accessed through NCBI's website. Some of these data-analytic tools are also databases. A non-exhaustive list of tools includes: ORF Finder (for open reading frames), RefSeq, UniGene, SNP Database (for single-nucleotide polymorphisms), Human Genome Sequencing, Human MapViewer (to view the draft of the human genome project), Gene Expression Omnibus, Online Mendelian Inheritance in Man (OMIM), which catalogues human genetic diseases, the Molecular Modeling Database (MMDB), which is a 3D protein structure database, and the Conserved Domain Database (CDD).
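To give a feel for what an open reading frame search involves, the sketch below scans the three forward reading frames of a DNA sequence for a start codon (ATG) followed in-frame by a stop codon. It is a simplified illustration, not NCBI's implementation: it assumes the standard stop codons TAA, TAG, and TGA, ignores the reverse strand, and the function name find_orfs and the min_codons threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return forward-strand ORFs as (start, end, frame) tuples.

    start is the index of the 'A' in ATG, end is exclusive and includes
    the stop codon, and frame is 0, 1, or 2. ORFs shorter than
    min_codons (counting start and stop codons) are skipped.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATAGGG"
    for start, end, frame in find_orfs(dna):
        print(frame, dna[start:end])
```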

Databases and public education

One Entrez database serves as a potential source for public education in molecular biology: the Books database. Not only do the web-based books supplement and clarify topics, they also serve as a highly credible resource for science reporters and journalists. The news is often the only mode of scientific information transfer between the researcher and the public. In addition, university students may find some required course textbooks in the database. For instance, the contents of Lodish's Molecular Cell Biology (UBC's Biology 350), Alberts' Essential Cell Biology (UBC's Biology 441), Gilbert's Developmental Biology (UBC's Biology 331), Modern Genetic Analysis (UBC's Biology 334 & 335), and Janeway's Immunobiology (UBC's Microbiology 301) are fully available.

In addition, NCBI provides "Science Primers" on areas that form the theoretical foundations of NCBI itself, with tutorials on topics such as bioinformatics, ESTs, microarray technology, STSs, and molecular modeling. Lastly, NCBI offers tutorials on how to use its various databases and data-analytic software tools.

Conclusions

Given its contribution to mapping the human genome, NCBI's services are undeniably important. NCBI offers a comprehensive array of databases and software tools to analyze information. The advantage of having NCBI is that it offers a sizable quantity of accessible information to the public. NCBI continues the scientific tradition of making scientific knowledge free for all, which is an uncommon phenomenon in today's world of biotech companies and their closely guarded patents. Bioinformatics, as a discipline, continues to grow at an exponential rate. The NCBI currently combats the problem of redundancy of information by establishing non-redundant databases to limit search times and increase the ease of making a query. The NCBI website currently handles its services efficiently, despite the overwhelming number of services present. To continue this efficiency, NCBI must be aware of and receptive to new ways of assimilating data into an organized form.

Glossary

1. Curated data = the information supplied is based on the consensus and opinions of a number of researchers.
2. BLAST a query sequence = To input a sequence under study into the database and compare it to the entire collection of sequences in the GenBank database in order to search for homologous sequences.

References

1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: Update. Nucleic Acids Research, 2004, vol 32, Database Issue: D23-D26.

Recommended Resources for Further Information

1. The NCBI Website http://www.ncbi.nlm.nih.gov/
There is a never-ending series of links. The most useful place to start is probably the SiteMap. The best place to visualize the databases and software tools is the website itself. Experimenting and playing with NCBI's services is the best way to learn about how they work.

2. A printed resource is the book by Baxevanis and Ouellette entitled Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd edition.
This book is very theoretical and may soon be out of date.
It contains colour plates of many different databases (some of which are NCBI databases).

3. Journals
A good journal for information on bioinformatics databases is Nucleic Acids Research.
This journal publishes an issue devoted entirely to databases at the beginning of each year.

Related Articles

-The Human Genome Project: a historical perspective
-Genome Projects: the ins and outs of sequencing
-What is Bioinformatics?: article based on an interview with Francis Ouellette, director of the UBC Bioinformatics Centre

Related Resources

-Genome Warrior: New Yorker article on Craig Venter from Celera and the race to sequence the human genome
-NCBI tutorials: links to online tutorials for using BLAST and tips for teaching bioinformatics to students