National Center for Biotechnology Information
Renny Lee
*an updated version of this article can be found here
It is well acknowledged that scientific information is being generated
at an exponentially increasing rate. One recent molecular biology
endeavor is of particular public interest: The Human Genome Project
(HGP) sequenced and mapped the complete human genome. Though the
HGP was completed successfully, the work of the HGP is far from
over. The structure, function, and molecular mechanisms of all the
genetic elements comprising the human genome have yet to be discovered.
Bioinformatics is one approach being used in this area. Bioinformatics
can be defined as the application of computing tools to the solving
of biological problems. The Internet provides an accessible and
efficient platform capable of housing bioinformatics.
Many scientists today refer to the next wave in bioinformatics as
systems biology, an approach to tackle new and complex biological
questions. Systems biology
involves the integration of genomics, proteomics, and bioinformatics
to create a whole system view of a biological entity.
A plethora of bioinformatic tools exist on the Internet, but one particularly
good source of information, tools, and resources can be easily accessed
at the National Center for Biotechnology Information (NCBI) website
(http://www.ncbi.nlm.nih.gov/).
The NCBI website is currently the paramount bioinformatics resource
made available to researchers and the public. The NCBI offers many
services of interest to scientists and students alike. However,
even the NCBI's resources are not exhaustive.
This article provides a brief overview of the NCBI and the various
resources made available for scientific research and public education.
The NCBI is a very general resource for bioinformatic tools and
there are more powerful and specialized tools available elsewhere
on the Internet. The importance of the NCBI is that it is an accessible
and comprehensive source of molecular biology information.
History of the NCBI
The National Center for Biotechnology Information (NCBI) is a multi-disciplinary
research group that serves as a resource for molecular biology information.
It was formed in 1988 as a complement to the activities of the National
Institutes of Health (NIH) and the National Library of Medicine
(NLM). Its facilities are located in Bethesda, Maryland, USA. Initially,
NCBI's creation was intended to aid in understanding the molecular
mechanisms that affect human health and disease with the following
goals: to create and maintain public databases, develop software
to analyze genomic data, and to conduct research in computational
biology. In time, and through widespread use of the Internet, NCBI
became increasingly aware of the role of pure biological research.
Molecular biology became as prominent as biomedical research. This
was evident as various specialized databases were being created
by the NCBI. No longer were human health and disease the primary
area of focus. NCBI began offering services as well:
-developing new methods to deal with the volume and complexity of
data
-researching methods that can analyze the structure and
function of macromolecules
-creating computerized systems for storing and analyzing data about
molecular biology
-providing access to analysis and computing tools (which facilitate
the use of databases and software) to researchers and the public
In the process of database development, NCBI formed database standards
such as database nomenclature that are also used by other non-NCBI
databases. One NCBI database is GenBank, the nucleic acid sequence
database that contains sequence information from more than 100 000
different organisms. GenBank is probably the most popular database
in use. To many, its name is synonymous with the NCBI.
GenBank as the model database
One of NCBI's roles is to maintain publicly available databases. But
what exactly are databases, and why are they important for molecular
biology? Basically, a database is a large and organized body of
data. But one of the key criteria for a biological database is persistent
data. In other words, the information encoded and represented by
the data may change but the type of data is more resistant to change.
This inflexibility of data is a reflection of what comprises macromolecules
and how scientists have chosen to symbolize nature. For instance,
the sequence of nucleic acids can be symbolized by letters representing
nucleotides and a protein sequence can be represented by 20 letters
symbolizing the amino acids. These strings of letter symbols constitute
a staggering amount of information, but for computerized systems
they can easily be organized and manipulated in an optimal way.
A model sequence database is GenBank.
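The idea that sequences are just strings over a small, fixed alphabet can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the alphabets are assumed to contain unambiguous symbols only (four nucleotides, 20 amino acids).

```python
# Illustrative alphabets (assumption: unambiguous symbols only).
DNA_LETTERS = set("ACGT")
AMINO_ACID_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def classify_sequence(sequence: str) -> str:
    """Guess whether a string of letter symbols is nucleotide or protein."""
    letters = set(sequence.upper())
    if not letters:
        return "unknown"
    # Check the smaller alphabet first: A, C, G and T are also valid
    # amino acid letters, so a short peptide could be misread as DNA.
    if letters <= DNA_LETTERS:
        return "nucleotide"
    if letters <= AMINO_ACID_LETTERS:
        return "protein"
    return "unknown"


def letter_counts(sequence: str) -> dict:
    """Count how often each symbol occurs in the sequence."""
    counts = {}
    for letter in sequence.upper():
        counts[letter] = counts.get(letter, 0) + 1
    return counts
```

Because the type of data never changes, the same two small functions work for any sequence, whatever organism it came from.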
GenBank, a database containing all publicly available nucleic acid sequences, is one of
the members of the "Triple Entente" of sequence databases;
the other two are the European Molecular Biology Laboratory (EMBL)
and the DNA Data Bank of Japan (DDBJ). As of August 2003, GenBank
contained 27.2 million different sequences. There are over 130 complete
microbial genomes available as well as over a dozen eukaryotic genomes
(including the human genome). Approximately 26% of sequences in
the database are of human origin (1).
Searching for a sequence in GenBank is referred to as "making a query".
The information that springs up is called the "record"
(entry) for the query. The record for each sequence in GenBank contains
a brief description of the sequence, the scientific name and taxonomy
of the source organism from which the sequence was derived, bibliographic
references, and a list of "features". Features include
the coding sequence regions of the nucleic acid and other sites
of biological importance (such as transcription motifs, repeat regions,
mutation sites, and areas of modification). In addition, the protein
sequences of the translated nucleic acid coding regions are included.
Each GenBank record is assigned an "accession number"
which is a stable and unique identifier of the record that doesn't
change with time. In addition, a "GenInfo (gi) number"
is assigned to each sequence as is the "version of the accession
number"; these numbers do change. For example, when the sequence
is updated for CUT1-Receptor (Accession number: AB123456, Version:
AB123456.1, gi number: 123456789), the version and gi numbers change.
This facilitates archiving of data and prevents inconsistencies
of sequence information in the literature.
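The relationship between the stable accession number and its changing version can be sketched as follows. This is a minimal illustration that assumes identifiers are written in the "accession.version" form used in the example above (AB123456.1); the function names are hypothetical and not part of any NCBI tool.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str  # stable part, e.g. "AB123456"
    version: int    # changes whenever the sequence is updated


def parse_accession_version(identifier: str) -> VersionedAccession:
    """Split an identifier such as 'AB123456.1' into accession and version."""
    accession, separator, version = identifier.strip().rpartition(".")
    if not separator or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return VersionedAccession(accession, int(version))


def same_record(first: str, second: str) -> bool:
    """True when both identifiers point at the same GenBank record."""
    return (parse_accession_version(first).accession
            == parse_accession_version(second).accession)


def is_newer(first: str, second: str) -> bool:
    """True when 'first' is a later version of the same record as 'second'."""
    a = parse_accession_version(first)
    b = parse_accession_version(second)
    return a.accession == b.accession and a.version > b.version
```

A paper that cites AB123456.1 therefore still points at exactly the sequence its authors used, even after AB123456.2 has been released.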
GenBank's entries are generally divided according to taxonomic
division (the main areas are bacteria, viruses, rodents, and humans)
and according to the experimental method used to generate the sequence
information. For example, roughly 70% of all sequences in GenBank
are ESTs (Expressed Sequence Tags), which are generated by reverse
transcribing mRNAs into complementary DNAs (cDNAs). ESTs thus represent
segments of DNA that are transcribed into mRNA. Other common experimental methods
for sequence generation include Sequence-Tagged Sites (STS) used
to derive physical maps in genome construction, and Genome Survey
Sequence (GSS).
NCBI offers online software to help researchers submit sequence
data into GenBank. Individual researchers may submit a single sequence.
Larger submissions often come from sequencing centers, which may
submit many sequences or entire genomes. The link between submitting
sequence data to GenBank and publication is also a coordinated effort;
journals that publish sequence data usually require GenBank submission
as a condition for publication. And submission to GenBank also rests
on assertions of intent to publish the sequence on the part of the
author or researcher. The online submission tool is called BankIt.
This tool requires the author to enter the sequence, edit it, and
add any biological annotations such as coding regions. BankIt is
a tool for small submissions, so genome centers use the submission
tool Sequin instead. Sequin allows for the submission of longer
sequences and has a more organized method of sequence submission.
Once a sequence has been added to the database, what preparations are
necessary before analysis of the data can begin? The answer is found
in database retrieval tools.
Retrieving GenBank data and data from other NCBI databases
The primary database retrieval system at NCBI is Entrez, which
links together several databases including GenBank. The central
database in Entrez is the nucleotide database GenBank, which links
to the following databases: PubMed, Protein Sequence, Genomes, Taxonomy,
Structure, Population, Online Mendelian Inheritance in Man (OMIM),
Books, and 3D Domains. Connections between entries in a database
are called neighbours, and connections between entries of different
databases are called hardlinks. For example, a sequence retrieved
from GenBank can hardlink to a literature citation in PubMed for
the particular sequence. PubMed is the NCBI literature citation
database which contains over 12 million journal abstracts.
Once a sequence is found in GenBank, or once any data is found in
any of the various databases, a list of topic-related journal abstracts
can be conjured up in PubMed using hardlinks. Unfortunately, full-text
electronic-journals cannot be accessed through any of NCBI's databases
free of charge. Fortunately, university libraries (such as the UBC
library) do provide this service for free.
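The distinction between neighbours and hardlinks can be modelled with a very small data structure. The sketch below is only a toy: it does not call Entrez, and the class, the function names, and every identifier in the usage note are hypothetical placeholders.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    database: str    # e.g. "GenBank" or "PubMed"
    identifier: str  # placeholder record identifier


def link_kind(source: Entry, target: Entry) -> str:
    """Neighbours stay inside one database; hardlinks cross databases."""
    return "neighbour" if source.database == target.database else "hardlink"


def linked_entries(entry: Entry,
                   links: List[Tuple[Entry, Entry]],
                   kind: str) -> List[Entry]:
    """Return the entries connected to 'entry' by links of the given kind."""
    found = []
    for first, second in links:
        if entry not in (first, second):
            continue
        other = second if first == entry else first
        if link_kind(entry, other) == kind:
            found.append(other)
    return found
```

For instance, with a made-up sequence `Entry("GenBank", "AB123456")` linked to a made-up citation `Entry("PubMed", "citation-1")`, calling `linked_entries(sequence, links, "hardlink")` returns the citation, which is the step Entrez performs when it pulls up abstracts for a sequence.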
Other database retrieval systems offered by NCBI include LocusLink
and the Taxonomy Browser. LocusLink offers descriptive information
about genes and is based on curated data.
The Taxonomy Browser offers information on lineage of organisms
that have corresponding sequences in GenBank. Taxonomic and phylogenetic
trees can also be viewed through the Taxonomy Browser.
Once data is retrieved by Entrez it must be formatted correctly
before NCBI's data analysis software can be applied. The FASTA format
is usually applied to sequence data from GenBank to transform the
data into a form that can be read by data-analytic software tools.
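A FASTA file is plain text: each record starts with a header line beginning with ">", followed by one or more lines of sequence. A minimal reader might look like the sketch below; the function name is hypothetical, and the record in the usage note is a made-up placeholder rather than real GenBank data.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[header] += line
    return records
```

Given the two-record text `">seq1 example\nATGC\nGGTA\n>seq2\nTTAA\n"`, calling `parse_fasta(text.splitlines())` yields one entry per header, with the wrapped lines of the first record joined into a single string.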
NCBI's data-analytic software tools
The ultimate goal of bioinformatics is to draw conclusions about
data. Analytic software tools allow for the conducting of scientific
experiments, the rejection of hypotheses, and the drawing of conclusions
concerning molecular biology. Although not a substitute for the
workbench, bioinformatics acts as a useful complement to laboratory-generated
data. Many data-analytic tools exist at NCBI and at other places
on the web. Due to the overwhelming number of techniques available
for analyzing data, and to the relative newness of much analytic
software, conditions for use of any tool may be confusing. The occurrence
of mistakes due to unfamiliarity is quite common. Other tools have
gained widespread use simply by being easy to use. One such tool
is the Basic Local Alignment Search Tool (BLAST), which is most
commonly used to analyze nucleic acid sequences from GenBank.
BLAST is a software tool that aligns two sequences in order to
decide whether homology exists between the two sequences. The sequences
can either be two nucleotide sequences or two protein sequences.
Homology indicates that the sequences being studied came from a
common ancestral sequence. Homology between sequences is also indicative
of (but not sufficient to prove) similar function at the molecular
level. Misunderstanding about the meaning of the term can be illustrated
by statements like, "these two sequences are 66% homologous" and
"homology exists to this degree". Homology is not based on percentage
or degree; its existence is an extreme. Homology either exists between
sequences or it doesn't. So how does BLAST infer homology? Basically,
BLAST measures the similarity between sequences and then uses
statistical models to estimate how likely that degree of similarity
is to arise by chance. If two nucleotide sequences
are more similar than the statistical model predicts for unrelated
sequences, they are inferred to be homologous. Different statistical
models exist for protein sequences. NCBI offers a variety of BLAST-based
tools for analyzing different data types. Besides using BLAST to
infer homology between two sequences, it is possible to BLAST
a query sequence against the human genome or the mouse genome
to look for homologous sequences.
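The difference between similarity (a measured percentage) and homology (a yes-or-no inference) can be made concrete with a small calculation. The sketch below is not BLAST's algorithm, which relies on heuristics and statistical scoring; it only computes percent identity for two sequences that are assumed to be already aligned, with "-" marking gaps. The function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of matching positions, ignoring positions with a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # a gap is neither a match nor a mismatch here
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared
```

Two sequences can correctly be described as "75% identical"; whether they are homologous is a separate conclusion that depends on how improbable that similarity would be for unrelated sequences.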
Other NCBI data-analytic tools include Electronic-PCR, which locates
Sequence-Tagged Sites, and BLAST-Link (Blink), which shows protein
BLAST alignments for every protein sequence found in Entrez. Many
more tools can be accessed through NCBI's website. Some of these
data-analytic tools are also databases. A non-exhaustive list of
tools includes: ORF Finder (for open reading frames), RefSeq, UniGene,
SNP Database (for single-nucleotide polymorphisms), Human Genome
Sequencing, Human MapViewer (to view the draft of the human genome
project), Gene Expression Omnibus, Online Mendelian Inheritance
in Man (OMIM) (catalogues human genetic diseases), the Molecular
Modeling Database (MMDB) which is a 3D protein structure database,
and the Conserved Domain Database (CDD).
Databases and public education
One Entrez database serves as a potential source for public education
in molecular biology: it is the BOOKS Database. Not only do the
web-based books supplement and clarify topics, they also serve as
a highly credible resource for science reporters and journalists.
The news is often the only mode of scientific information transfer
between the researcher and the public. In addition, university students
may find some required course textbooks in the database. For instance,
the contents of Lodish's Molecular Cell Biology (UBC's Biology 350),
Alberts' Essential Cell Biology (UBC's Biology 441), Gilbert's
Developmental Biology (UBC's Biology 331), Modern Genetic Analysis
(UBC's Biology 334 and 335), and Janeway's Immunobiology (UBC's
Microbiology 301) are fully available.
In addition, NCBI provides "Science Primers" on areas that form
the theoretical foundations of NCBI itself, with tutorials on topics
such as bioinformatics, ESTs, microarray technology, STSs, and molecular
modeling. Lastly, NCBI offers tutorials on how to use its various
databases and data-analytic software tools.
Conclusions
With input in mapping the human genome, NCBI's services are undeniably
important. NCBI offers a comprehensive array of databases and software
tools to analyze information. The advantage of having NCBI is that
it offers a sizable quantity of accessible information to the public.
NCBI continues the scientific tradition of making scientific knowledge
free for all, which is an uncommon phenomenon in today's world of
biotech companies and their closely guarded patents. Bioinformatics,
as a discipline, continues to grow at an exponential rate. The NCBI
currently combats the problem of redundancy of information by establishing
non-redundant databases to limit search-times and increase the ease
of making a query. The NCBI website currently handles its services
efficiently, despite the overwhelming number of services present.
To continue this efficiency, NCBI must be aware of and receptive
to new ways of assimilating data into an organized form.
Glossary
1. Curated data = the information supplied is based on the consensus
and opinions of a number of researchers.
2. BLAST a query sequence = To input a sequence under study into
the database and compare it to the entire collection of sequences
in the GenBank database in order to search for homologous sequences.
References
1. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL.
GenBank: Update. Nucleic Acids Research, 2004, vol 32, Database
Issue: D23-D26.
Recommended Resources for Further Information
1. The NCBI Website http://www.ncbi.nlm.nih.gov/
There is a never-ending series of links. The most useful place to
start is probably the SiteMap. The best place to visualize the databases
and software tools is the website itself. Experimenting and playing
with NCBI's services is the best way to learn about how they work.
2. A printed resource is the book by Baxevanis and Ouellette entitled
Bioinformatics: A Practical Guide to the Analysis of Genes and
Proteins, 2nd edition.
This book is very theoretical and may soon be out of date.
It contains colourplates of many different databases (some of which
are NCBI databases).
3. Journals
A good journal for information on bioinformatics databases is
Nucleic Acids Research.
This journal publishes an issue devoted entirely to databases at the
beginning of each year.