Users must help to keep public databases correct

With the continued growth of the public DNA sequence databases, and the recent addition of the 11,000,000,000th nucleotide to GenBank (used here to mean the combined DDBJ, EMBL and GenBank collection), it is timely to assess how we use these databases.

GenBank is the archive of all publicly available DNA, RNA and protein sequences. Upon publication, a new sequence and its annotations appear in it. Investigators use GenBank in many ways, most commonly for similarity searches such as BLAST; to retrieve records; and for sequence analysis, multiple sequence alignment or pattern-finding searches. Errors sometimes occur in GenBank, ranging from the trivial (incorrect postal codes), to the misleading (30 nucleotides of vector left on the ends of a record), to the mission-critical (a full-length mRNA with no coding sequence (CDS) annotated on it). Also very common are incomplete references, which prevent researchers from linking a GenBank record to the publication that first refers to it.

Over the years some people have chosen to report these errors, but in most cases the errors are left uncorrected. An error that has been discovered but not corrected is one of the worst possible failings in GenBank, so if you discover an error, report it to the database and it should be rectified, although a follow-up is advised to make sure this gets done.

If you are a submitter, look at the record you submitted a few years ago: is it still correct? Was the citation ever updated? Take pride in the sequences that carry your name! Our ability to interpret genomes depends on all of these records being as accurate as possible. This is a task for all users of the databases.

Francis Ouellette

Bioinformatics Core Facility, Centre for Molecular Medicine and Therapeutics, University of British Columbia, 950 West 28th Avenue, Vancouver, British Columbia V5Z 4H4, Canada


