Joel L. Sussman1,2, Enrique E. Abola1, Nancy O. Manning1, Jaime Prilusky3
The challenge facing the new 3DB is to keep abreast of the increasing flow of data, to maintain the archive as error-free as possible, and to organize and present this information in ways that facilitate data retrieval, knowledge exploration, and hypothesis testing without interrupting current services. The PDB has introduced substantial enhancements to both data management and archive access in the past two years, and is well on the way to converting to a more powerful system that combines the advantages of object-oriented and relational database systems. 3DB will transform PDB from a Data Bank serving solely as a data repository into a highly sophisticated knowledge-based system for archiving and accessing structural information. The process will be evolutionary, insulating users from drastic changes and providing both a high degree of compatibility with existing software and a consistent user interface for casual browsers.
Development is underway for 3DB to operate as a direct-deposition archive, providing mechanisms for depositors to submit data over the Internet with minimal staff intervention. Data archived in 3DB are managed using a relational database management system (RDBMS), SYBASE [4]. The new database (3DBase) is being developed with a view towards being a member of a federation of biological databases. Collaborative international centers are also being established to assist in data deposition, archiving, and distribution activities.
2. Resource Status, July 1996
Rapid developments in preparation of crystals of macromolecules and in experimental
techniques for structure analysis and refinement have led to a revolution in Structural Biology. These
factors have contributed significantly to an enormous increase in the number of laboratories
performing structural studies of macromolecules to atomic resolution and the
number of such studies per lab. Advances include: 1)
recombinant DNA techniques that permit almost any protein or nucleic acid to be produced
in large amounts; 2) better X-ray detectors; 3) real-time interactive computer
graphics systems, together with more automated methods for structure determination and
refinement; 4) synchrotron radiation, allowing the use of extremely tiny crystals, Multiple
Wavelength Anomalous Dispersion (MAD) phasing, and time-resolved studies via Laue
techniques; 5) NMR methods permitting structure determination of macromolecules in
solution; and 6) electron microscopy (EM) techniques, for obtaining high-resolution structures.
Fig 1. PDB Coordinate Entries Available
PDB Archive Contents | |
---|---|
4652 | released atomic coordinate entries |
576 | structure factor files |
169 | NMR restraint files |
Molecule type | |
4207 | proteins, peptides, and viruses |
108 | protein/nucleic acid complexes |
325 | nucleic acids |
12 | carbohydrates |
Experimental Technique | |
3865 | diffraction and other |
651 | NMR |
136 | theoretical modeling |
Fig 2. FTP data access
PDBBrowse incorporates a number of features that make it easy to access information found in PDB entries. Multiple search strings covering various fields, corresponding to PDB record types such as compound, header, author, biological source, or heterogen data, are supported. These searches support the boolean operators 'and', 'or', and 'not'. Selected entries can be retrieved automatically, and the molecular structures can be displayed using the public-domain X-based molecular viewer RasMol (or a similar viewer). Entries also include hypertext links to SWISS-PROT [10], BMRB [11], the Enzyme Commission Database, and the Entrez Reference Database [13]. Internet access to the archives has become the primary mode of retrieving entries from the PDB. However, we continue to receive a considerable number of orders for our CD-ROM product, and we anticipate that this will continue for a variety of reasons. For example, network performance remains poor in a number of locations, and these disks, released quarterly, provide local access to the contents of the archive. Some of these network access difficulties are easily overcome by installing a copy of the PDB FTP and WWW servers using mirroring software. With this software all files in the PDB are stored locally and any changes are automatically reflected on a daily basis.
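The boolean field searches described above can be illustrated with a small sketch. This is not how PDBBrowse is implemented; the record layout, the field names, the helper functions, and the two `demo` entries are all assumptions made for illustration (only the 1ACE compound and source come from this paper).

```python
# Hypothetical sketch of boolean searches over fields of PDB-like records.
# Field names and the "demo" entries are made up; this is not PDBBrowse code.

def contains(field, text):
    """Predicate: does `field` of an entry contain `text` (case-insensitive)?"""
    def predicate(entry):
        return text.lower() in entry.get(field, "").lower()
    return predicate

def and_(*preds):
    """True only if every sub-predicate matches."""
    return lambda entry: all(p(entry) for p in preds)

def or_(*preds):
    """True if at least one sub-predicate matches."""
    return lambda entry: any(p(entry) for p in preds)

def not_(pred):
    """Negate a predicate."""
    return lambda entry: not pred(entry)

def search(entries, pred):
    """Return the identifiers of all entries that satisfy the predicate."""
    return [e["id"] for e in entries if pred(e)]

ENTRIES = [
    {"id": "1ACE", "compound": "acetylcholinesterase", "source": "Torpedo californica"},
    {"id": "demo1", "compound": "lysozyme", "source": "made-up source A"},
    {"id": "demo2", "compound": "carboxylesterase", "source": "made-up source B"},
]

# compound contains "esterase" AND NOT source contains "torpedo"
esterases_not_torpedo = search(
    ENTRIES,
    and_(contains("compound", "esterase"), not_(contains("source", "torpedo"))),
)

# compound contains "lysozyme" OR source contains "torpedo"
lysozyme_or_torpedo = search(
    ENTRIES,
    or_(contains("compound", "lysozyme"), contains("source", "torpedo")),
)
```

Composing predicates as functions keeps each field test independent, so the same building blocks cover any nesting of the three operators.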
3. The 3-Dimensional Database of Biomacromolecules Structures (3DB)
Implementing plans to convert PDB to 3DB entails changes to every aspect of current
operations. A new data submission and archival system is being implemented which
attempts to balance the need for full automation with the need to maintain very high levels
of data accuracy and reliability. The new system relies on an RDBMS for data management
and archival. An overview of the relationships between 3DBase and depositors, users,
third party software developers, and other databases is shown in Fig. 3. The following
sections give a summary of the development work.
Fig 3. 3DBase relationships.
This development effort attempts to address the needs of the diverse user community served by the PDB. The schema supports queries by those interested in answering crystallographic questions as well as molecular biology questions. The system is being designed with the idea that it will shortly be federated with other biological databases. Our expectation is that through federation, complex queries may be submitted to our database, for which answers that traverse several databases may be easily returned. Interoperability is addressed through the use of schema sharing with other OPM-based databases and support for a variety of data interchange formats in the query results. In addition to providing users with a powerful environment to do complex ad-hoc queries, 3DBase will also facilitate managing the growing archive, which is expected to contain over 30,000 structural reports by the year 2000.
This work is being done as a collaboration among the following groups:
In 3DBase, literature citation data are being loaded into the CitDB database of references that was developed by GDB [15]. A pointer to the appropriate entry in CitDB is loaded in the oExperiment object of 3DBase. This is an example of the strategy that we are following in linking to external databases. CitDB will be managed as a federation run by a number of database centers that include GDB and PDB. There are several advantages to this scenario. By sharing the schema and management of the citation database, access to information stored in each of the databases via the bibliographic citation becomes straightforward. Duplication of effort is also minimized. Today it is still common for several public databases to build and maintain their own bibliographic databases. This will no longer be economically feasible with the expected rapid growth in database size.
3.3 Building Semantic Links to External Data Sources
Links to contents of sequence databases are provided in 3DBase via the oPrimarySeq
and oSeqAdv classes. These classes form another set of objects that link 3DBase objects
to external databases. Representing, building, and maintaining these links will be one of
3DB's primary tasks in the coming years. There are several issues that must be addressed
for this effort to succeed. Data representation issues are foremost. Each database uses
a different data model to represent and store information. Semantic contents are rarely the
same; for example, the primary sequence data stored in sequence databases such as
SWISS-PROT or PIR [16, 17] are presented using a view which differs significantly from
that used by PDB.
In general PIR and SWISS-PROT entries contain information on the naturally-occurring wild-type molecules. Each entry normally contains the sequence of one gene product and some entries include the complete precursor sequence. Annotation is provided to describe residue modifications. In both databases, the residue names used are limited to the 20 standard amino acids.
In contrast, PDB entries contain multichain molecules with sequences that may be wild type, variant, or synthetic. Sequences may also have been modified through protein engineering experiments. A number of PDB entries report structures of domains cleaved from larger molecules.
The oPrimarySeq object class was designed to account for these differences by providing explicit correlations between contiguous segments of sequences as given in PDB ATOM records and PIR or SWISS-PROT entries. Several cases are easily represented using this class. Molecules containing heteropolymers will be linked to different sequence database entries. In some cases, such as those PDB entries containing immunoglobulin Fab fragments, each PDB chain may be linked to several different SWISS-PROT entries.
This facility is needed because these databases represent sequences for the various immunoglobulin domains as separate entries. oPrimarySeq should also be able to represent molecules engineered by altering the gene (fusing genes, altering sequences, creating chimeras, or circularly permuting sequences). In addition, it will be possible to link segments of the structure to entries in motif databases (e.g. PROSITE [18], BLOCKS [19]).
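The idea of correlating contiguous segments can be sketched as a small data structure. The class and function names below are hypothetical and do not reproduce the actual oPrimarySeq definition. The residue ranges are the 1ACE values from the report in Appendix B, read here on the assumption that the `3DB_*` numbers refer to the structure and the other pair to the SWISS-PROT entry; the chain label `"A"` is a placeholder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeqSegment:
    """One contiguous stretch of a chain aligned to a sequence database entry.

    Hypothetical illustration only; not the 3DBase oPrimarySeq schema.
    """
    chain_id: str
    pdb_start: int   # first residue number in the structure
    pdb_end: int     # last residue number in the structure
    db_id: str       # identifier of the sequence database entry
    db_start: int    # first residue number in the database entry
    db_end: int      # last residue number in the database entry

    def __post_init__(self):
        # A contiguous, gap-free segment must have equal length on both sides.
        if self.pdb_end - self.pdb_start != self.db_end - self.db_start:
            raise ValueError("segment lengths differ between structure and database")

    def to_db_number(self, pdb_resnum):
        """Translate a structure residue number into database numbering."""
        if not self.pdb_start <= pdb_resnum <= self.pdb_end:
            raise ValueError("residue lies outside this segment")
        return self.db_start + (pdb_resnum - self.pdb_start)

def map_residue(segments, chain_id, pdb_resnum):
    """Return (db_id, db_resnum) for a residue, or None if no segment covers it."""
    for seg in segments:
        if seg.chain_id == chain_id and seg.pdb_start <= pdb_resnum <= seg.pdb_end:
            return seg.db_id, seg.to_db_number(pdb_resnum)
    return None

# 1ACE values from Appendix B; the chain label "A" is an assumed placeholder.
SEGMENTS = [SeqSegment("A", 4, 534, "ACES_TORCA", 25, 555)]
```

A chain built from several database entries, as in the Fab fragment case, would simply carry several segments with different `db_id` values.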
Initial building of these links is straightforward and requires analysis of a few entries coming out of a FASTA or BLAST search against the sequence databases. What may be problematic in the long run is updating these links as new experimental evidence is encountered, leading to a correction in either database. Both PIR and SWISS-PROT have similar problems as they build pointers to PDB entries. To help obviate these difficulties we have agreed to establish a closer interaction between the databases. We are setting up a protocol that will broadcast to each database the changes that occur and that could in turn affect specific entries.
4. Data Deposition
3DB will operate as a direct-deposition archive, providing mechanisms that will allow
depositors to load data with minimal staff intervention. This strategy is essential if 3DB is to
meet present projections of exponential growth in depositions against a fixed staff size. This
is particularly challenging, due to the complexity of the data being handled, the need for a
common viewpoint of the entry description, and the community requirement that these data
be accessible immediately upon receipt.
With direct deposition, there will be a concomitant need to increase the power of data validation procedures. These procedures must reflect current models for identifying errors and must be as complete as possible. Quality control issues assume a more central and difficult role in direct deposition strategies. Distributed data must be of the highest quality; otherwise users will lose their trust in the archived data and will have to revalidate data received from 3DB before using them, clearly an unproductive scenario.
4.1 Current Data Deposition Procedures
Since its inception in 1971, the method followed by the PDB for entering and distributing
information paralleled the review and edit mode used by scientific journals. Currently, the
author submits information which is converted into a PDB entry and run against PDB
validation programs by a PDB processor. The entry and the output of the validation suite
are then evaluated by a PDB scientific staff member, who completes the annotations and
returns the entry to the author for comment and approval. Table 2 summarizes checks
included in our current data validation suite. Corrections from the author are incorporated
into the entry, which is reanalyzed and validated before being archived and released.
Class | What is checked |
---|---|
stereochemistry | bond distances and angles; Ramachandran plot (dihedral angles); planarity of groups; chirality |
bonded/non-bonded interactions | crystal packing, unspecified inter- and intraresidue links |
crystallographic information | Matthews coefficient, Z-value, cell transformation matrices |
noncrystallographic transformation | validity of noncrystallographic symmetry |
primary sequence data | discrepancies with sequence databases |
secondary structure | generated automatically or visually checked |
heterogen groups | identification, geometry and nomenclature |
miscellaneous checks | solvent molecules outside the hydration sphere, syntax checks, internal data consistency checks |
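
As an illustration of one of the crystallographic checks listed above, the Matthews coefficient is the unit cell volume divided by the total molecular mass in the cell, V_M = V / (Z x M). The sketch below computes it from cell parameters. The function names, the example cell, and the default acceptance range (roughly the 1.7 to 3.5 cubic angstroms per dalton commonly quoted for protein crystals) are assumptions, not the thresholds used by the PDB validation suite.

```python
import math

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume in cubic angstroms; edges in angstroms, angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)

def matthews_coefficient(volume, z, mol_weight):
    """V_M = V / (Z * M), with Z molecules of mass M (daltons) per unit cell."""
    return volume / (z * mol_weight)

def solvent_fraction(vm):
    """Approximate solvent fraction for proteins, 1 - 1.23 / V_M."""
    return 1 - 1.23 / vm

def is_plausible(vm, low=1.7, high=3.5):
    """Flag values outside an assumed typical range for protein crystals."""
    return low <= vm <= high

# Made-up orthorhombic example: 50 x 60 x 70 angstrom cell, Z = 4, M = 25000 Da.
volume = cell_volume(50, 60, 70, 90, 90, 90)
vm = matthews_coefficient(volume, z=4, mol_weight=25000)
```

A value far outside the expected range usually points to a wrong Z or a wrong molecular weight rather than to an unusual crystal, which is why this makes a useful consistency check.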
The current deposition load of ~100 entries a month is handled by about ten staff members, who annotate and validate entries. The process is a production line in which checking is repeated at various steps to ensure that errors and inconsistencies in data representation are minimized. Prior to June 1994, a significant number of depositions required administrative staff to key in information provided in a deposition form. Introduction of the current Electronic Deposition Form and a new parsing program has greatly reduced, though not yet completely eliminated, hand entry of information into entries.
Today, most of the processing time is spent resolving data representation issues and ensuring that outliers are identified and annotated. The most troublesome areas consistently are those involving handling of heterogens, resolving crystal packing issues, representing molecules with non-crystallographic symmetry, and resolving conflicts between the submitted amino acid sequence and that found in the sequence databases. Publications and other references are sometimes consulted to verify factual information such as crystal data, biological details, reference information, etc. Processing programs, although much improved from those used in 1991, still allow errors to pass undetected through the system, requiring a visual check of all entries. We continually improve these programs and acquire software from collaborators to address deficiencies that both we and our users have identified. In addition, we now have formed a quality control group that will be looking into our operations to identify sources of errors and to recommend steps to improve data quality.
4.2 Development of Automatic Deposition and Validation
3DB must overcome many challenges for direct deposition to work. In a recent workshop
held to assess the needs of 3DB users, crystallographers and NMR spectroscopists were
unanimous in their desire to have a system that did not require additional work on their part
when depositing data. On the other hand, consumers (which included these same
depositors) were vocal in their desire for entries to contain more information than what is
currently available within the PDB. We are striving to develop a suite of deposition and
validation programs that accommodates these somewhat conflicting desires while ensuring
that the archives maintain the highest standard of accuracy. A schematic of the automatic
deposition process is depicted in Fig. 4.
Fig 4. 3DBase automatic validation.
In addition to the deposition form that is filled out using AutoDep, authors are requested to submit the coordinate data entry and other experimental data files for processing and archiving. AutoDep provides facilities that help simplify this process. An FTP script is provided that uploads author-specified local files to the PDB server site.
The completed form is then converted automatically into a file in PDB format and, along with the coordinate data, is submitted to a set of validation programs for checking and further annotation. These programs are designed to check 1) the quality, consistency, and completeness of the experimental data, 2) possible violations of physical or stereochemical constraints (e.g., no two atoms in the same place, appropriate bond angles, etc.), 3) compliance with our data dictionary (syntax checks), and 4) in the near future, the correspondence of the experimental data to the derived structure. Development of the validation suite will evolve with advice from the community and encompass programs currently in use, written both within and outside PDB.
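One of the physical constraints mentioned above, that no two atoms occupy the same place, can be sketched as a pairwise distance test. This is an illustrative assumption, not the PDB validation code: the function name, the tuple layout, and the 0.5 angstrom threshold are invented for the example, and a production check would use a spatial grid rather than comparing every pair.

```python
import math
from itertools import combinations

def find_overlapping_atoms(atoms, min_distance=0.5):
    """Return (label1, label2, distance) for atom pairs closer than min_distance.

    atoms: list of (label, x, y, z) tuples with coordinates in angstroms.
    The threshold is an assumed value chosen only for illustration.
    """
    overlaps = []
    for first, second in combinations(atoms, 2):
        distance = math.dist(first[1:], second[1:])
        if distance < min_distance:
            overlaps.append((first[0], second[0], round(distance, 3)))
    return overlaps

# Made-up coordinates: the third atom nearly coincides with the second.
ATOMS = [
    ("N", 0.0, 0.0, 0.0),
    ("CA", 1.46, 0.0, 0.0),
    ("CA_dup", 1.46, 0.0, 0.1),
]
overlaps = find_overlapping_atoms(ATOMS)
```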
The validation software automatically generates and includes in the entry measures of data quality and consistency, as well as annotations giving details of apparent inconsistencies and outliers from normal values. This output is returned to the depositor for review. Entries whose data quality and consistency meet appropriate standards may then be sent by the depositor directly for automatic entry into the database. Entries that do not pass the quality and consistency checks may be revised by the depositor to correct for inadvertent errors, or alternatively, more experimental work may be needed to resolve problems uncovered.
Apparent inconsistencies or outliers may remain in a submitted entry, provided these are explained by the depositor in an annotation. In the most interesting cases, unusual features are a valid and important part of the structure. However, all such entries will be reviewed for possible errors by 3DB staff, who may discuss any important issues with the depositor. 3DB staff will then forward acceptable entries to the database.
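The review path described in the two paragraphs above can be summarized as a small decision function. This is only a sketch of the stated policy; the function name, the return labels, and the data shapes are assumptions.

```python
def route_entry(outliers, explanations):
    """Decide the next step for a validated entry.

    outliers: list of identifiers for inconsistencies flagged by validation.
    explanations: dict mapping an outlier identifier to the depositor's annotation.
    Labels and data shapes are hypothetical, chosen only to mirror the text.
    """
    if not outliers:
        # Entries meeting the standards may go straight into the database.
        return "automatic entry"
    unexplained = [o for o in outliers if not explanations.get(o, "").strip()]
    if unexplained:
        # The depositor revises the entry or annotates the remaining outliers.
        return "return to depositor"
    # Every outlier is explained, so staff review it before forwarding.
    return "staff review"
```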
To make automatic deposition as easy as possible, we are working with developers of software commonly used by our depositors. By modifying these programs to produce compliant data files and performing validation and consistency checks before submission, it may be possible to bypass most of the tedious steps in deposition. We are already working with Dr. A. Brünger to use procedures available through X-PLOR [27] to replace part of the validation suite for structures produced by X-ray crystallography and NMR. Diagnostic output will be included automatically as annotations in the entry. A limited version of X-PLOR will be available from BNL to all depositors for validation purposes only.
Validation of coordinate data against experimental X-ray crystallographic data requires access to structure factor data, which are requested by PDB, the International Union of Crystallography (IUCr) and some journals but are not always supplied by the depositor. We are working toward building consensus in the community that structure factor data are a necessary component of deposits of structures derived by X-ray crystallography. Statistics such as number of F's and R-values vs. sin(theta)/lambda, etc. will be calculated and included in the 3DB entry as annotation for the experiment.
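The statistics mentioned above can be sketched as follows, using the conventional crystallographic R-value, the sum of ||F_obs| - |F_calc|| divided by the sum of |F_obs|, computed in shells of sin(theta)/lambda. The function names, the bin width, and the sample reflections are assumptions, and the calculated amplitudes are assumed to be already on the same scale as the observed ones.

```python
from collections import defaultdict

def r_value(pairs):
    """Conventional R-value for a list of (f_obs, f_calc) amplitude pairs."""
    denominator = sum(abs(f_obs) for f_obs, _ in pairs)
    if denominator == 0:
        raise ValueError("no observed amplitude in this set")
    numerator = sum(abs(abs(f_obs) - abs(f_calc)) for f_obs, f_calc in pairs)
    return numerator / denominator

def r_by_resolution(reflections, bin_width=0.05):
    """Group reflections into shells of sin(theta)/lambda.

    reflections: iterable of (s, f_obs, f_calc) with s = sin(theta)/lambda.
    Returns {shell_index: (number_of_reflections, r_value)}.
    """
    shells = defaultdict(list)
    for s, f_obs, f_calc in reflections:
        shells[int(s / bin_width)].append((f_obs, f_calc))
    return {
        index: (len(pairs), r_value(pairs))
        for index, pairs in sorted(shells.items())
    }

# Made-up reflections for illustration only.
REFLECTIONS = [
    (0.02, 100.0, 90.0),
    (0.04, 50.0, 60.0),
    (0.07, 80.0, 80.0),
    (0.12, 40.0, 30.0),
]
shell_stats = r_by_resolution(REFLECTIONS)
```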
In order to make it easier for depositors to submit structure factors (as well as to exchange these data between laboratories), the PDB, in close collaboration with a number of macromolecular crystallographers, has developed a standard interchange format for these data. This standard is in CIF (Crystallographic Information File) [29, 30] and was chosen both for simplicity of design and for being clearly self-defining, i.e. the file contains sufficient information to be read and understood by either a program or a person. Details of this format are available through the PDB WWW server.
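What "self-defining" means in practice can be shown with a minimal reader for a CIF `loop_`: the column names precede the data, so a program needs no external description of the layout. The sketch below is deliberately simplified (one row per line, no quoted strings or multi-line values). The item names follow mmCIF `refln` naming and are assumed here; the exact items in the PDB structure factor format should be taken from the format description itself, and the reflection values are made up.

```python
def parse_cif_loop(text):
    """Parse the first loop_ of a CIF fragment into a list of dicts.

    Simplified sketch: whitespace-separated values, one row per line,
    no quoted strings or multi-line text fields.
    """
    names, rows = [], []
    in_loop = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if line == "loop_":
            if in_loop:
                break  # a second loop starts; stop after the first one
            in_loop = True
            continue
        if not in_loop:
            continue  # skip data_ block headers and items outside the loop
        if line.startswith("_"):
            if rows:
                break  # a new item name after data rows ends the loop
            names.append(line)
        else:
            values = line.split()
            if len(values) != len(names):
                raise ValueError("row does not match the declared columns")
            rows.append(dict(zip(names, values)))
    return rows

SAMPLE = """\
data_example_sf
loop_
_refln.index_h
_refln.index_k
_refln.index_l
_refln.F_meas_au
_refln.F_meas_sigma_au
0 0 2 120.5 3.1
0 1 1 85.0 2.4
"""
reflections = parse_cif_loop(SAMPLE)
```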
A consensus is still developing in the NMR community as to what types of experimental data should be deposited and what kinds of validation and consistency checks should be performed. Structural data produced by other methods may also have special features that should be archived or checked, for example the sequence alignment used for modelling studies. Requirements for the types of data to be deposited and proper ways of checking the validity and consistency of the data will be developed in cooperation with the experimental community for each type of structure data archived by the 3DB.
4.3 AutoDep's most important features
Fig 5. Accessing 3DBase.
For those familiar with (or willing to learn about) the OPM protocol, access to the object
layer will be provided using a high level OPM-based query language. As part of the 3DB
open database policy, direct access to the underlying RDBMS will be allowed and actively
supported. These queries are not parsed by the 3DB-QA module, so better response time
can be expected. This provides third party developers with the opportunity to either
incorporate SQL clients in their products or to learn more of the OPM protocol and thereby
gain access to all of the benefits that the Object model affords (e.g., active external links,
programs, etc.).
As depicted in figure 5, the output generator will return query results using a variety of
data interchange formats. PDB will continue to support its current format in the foreseeable
future. We also plan to extend this format to allow us to represent objects being stored in
3DBase. In addition, a "raw format" is being provided which returns an attribute/value pair.
This form is easily parsed and is more compact than the PDB format.
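To show why attribute/value pairs are easy to parse, here is a minimal reader. It assumes one `*attribute: value` pair per line, as the 1ACE report in Appendix B suggests, and that an attribute may repeat; the function name is hypothetical and the actual layout of the raw format may differ.

```python
from collections import defaultdict

def parse_raw_report(text):
    """Collect '*attribute: value' lines into {attribute: [values, ...]}.

    Assumes one pair per line; repeated attributes keep all their values.
    """
    record = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("*"):
            continue  # ignore headings and blank lines
        attribute, _, value = line[1:].partition(":")
        record[attribute.strip()].append(value.strip())
    return dict(record)

SAMPLE = """\
*Export_Object: 1ACE
*Macromol_name: Acetylcholinesterase
*Macromol_name: Ache
*EC_number: 3.1.1.7
"""
report = parse_raw_report(SAMPLE)
```

Keeping every value in a list avoids silently dropping synonyms such as the second `Macromol_name`.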
6. Appendix A: Description of primary database objects
The following describe two primary object classes found in 3DBase.
For a more complete and detailed view, you may use our schema browser at URL
http://pdb.pdb.bnl.gov/opmbrowser.html.
An OPM schema consists of definitions of object classes, each described by a set of attributes.
Datatypes assigned to attributes can be primitive types such as numbers or character
strings, or they can be other classes defined in the schema. In addition to object
classes, OPM provides controlled value classes, which restrict the values possible for
objects in the class. Attributes can be either single or multiple valued. The latter can
be specified as an ordered list (list-of) or as an unordered set (set-of).
Object classes are grouped into a hierarchy of subclass and superclass relationships
called an ISA hierarchy. A class is said to inherit all the attributes of its superclass. In
the description below, object classes have been assigned names which start with the
small letter "o", attributes with the small letter "a", and controlled value classes with the
prefix "cv".
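For readers more used to programming languages than to schema notation, the concepts above map roughly onto the following sketch. It is an analogy, not OPM syntax: a controlled value class becomes an enumeration, list-of and set-of become list and set fields, and the ISA relationship becomes inheritance. Only three of the oExperiment attributes described in this appendix are shown, and the lower-case class names mirror the schema rather than Python naming conventions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set

class cvExperimentTypes(Enum):
    """Controlled value class: only these values are permitted."""
    ELECTRON_DIFFRACTION = "ELECTRON DIFFRACTION"
    FIBER_DIFFRACTION = "FIBER DIFFRACTION"
    FLUORESCENCE_TRANSFER = "FLUORESCENCE TRANSFER"
    NEUTRON_DIFFRACTION = "NEUTRON DIFFRACTION"
    NMR = "NMR"
    THEORETICAL_MODEL = "THEORETICAL MODEL"
    X_RAY_DIFFRACTION = "X-RAY DIFFRACTION"

@dataclass
class o3DBExportObj:
    """Superclass; its own attributes are not described in this appendix."""

@dataclass
class oExperiment(o3DBExportObj):
    """oExperiment isa o3DBExportObj (subset of attributes, for illustration)."""
    aTitle: List[str]                                   # list-of varchar(255), required
    aExpdta: cvExperimentTypes                          # controlled value, required
    aKeywords: Set[str] = field(default_factory=set)    # set-of varchar(80), optional

    def __post_init__(self):
        if not self.aTitle:
            raise ValueError("aTitle is required")
        if any(len(title) > 255 for title in self.aTitle):
            raise ValueError("aTitle values are limited to 255 characters")
        # Raises ValueError for anything outside the controlled vocabulary.
        self.aExpdta = cvExperimentTypes(self.aExpdta)
```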
oExperiment isa o3DBExportObj

*aTitle: list-of varchar(255) required
Description: Contains a title for the experiment or analysis that is represented in the object. It should identify an entry in the PDB in the same way that a title identifies a paper.

*aDepositor: list-of oPerson required
Description: An ordered list of names with address information.

*aKeywords: set-of varchar(80) optional
Description: Set of keywords relevant to the entry. These provide a simple means of categorizing the experiment or the molecules studied.

*aExpdta: cvExperimentTypes required
Description: Identifies the experimental technique used in the study. This normally refers to the type of radiation and sample, but can also include the spectroscopic or modeling technique. Permitted values include: ELECTRON DIFFRACTION, FIBER DIFFRACTION, FLUORESCENCE TRANSFER, NEUTRON DIFFRACTION, NMR, THEORETICAL MODEL, X-RAY DIFFRACTION.

*aReference: list-of oExternalReference optional
Description: Publications related to the study. These citations are chosen by the depositor.

*aRemarks: set-of oAnnotations optional
Description: General comments regarding the experiment or the molecules studied.

oMacroMolecule isa o3DBExportObj

*a3DBID: char(4) required
Description: Contains the PDB identification code. This is a four character field of which the first character must be an integer greater than zero. This identifier is unique within PDB and is assigned randomly to an entry.

*aMolName: set-of varchar(80) optional
Description: Set of molecule names. Each molecule may be assigned more than one name, allowing for the use of synonyms and aliases.

*aSrcDescr: oSource optional
Description: Specifies the biological and/or chemical source of each biological molecule in the entry. Sources are described by both the common name and scientific names, genus and species. Strain and/or cell-line for immortalized cells are given when they help in uniquely identifying the biological entity studied.

*aMolType: cvMacroMoleculeType optional
Description: Molecule type. Valid values are: protein, dna, rna, polysaccharide, other - must annotate.

*aBioMol: varchar(255) optional
Description: Information on accessing the structure of the complete biological molecule. Currently this contains the filename for a biomol entry found in the PDB ftp server. This attribute will be replaced by a new object class with attributes that provide the transformation matrices, descriptive text, and if available the filename for the coordinate set.

*aCoordinates: set-of oChain optional
Description: Atomic coordinate values stored as individual chains.

*aMolSequence: list-of oPrimarySeq optional
Description: SEQRES records contain the amino or nucleic acid sequence of residues in each chain of the macromolecule.

*aDomain: varchar(100) optional
Description: Specifies a domain or region of the molecule.

*aEngineered: cvFlagDict optional

*aEnzyme: set-of oExternalReference optional
Description: The Enzyme Commission number associated with the molecule.

*aMutation: varchar(255) optional
Description: Describes the mutations present.

*aFormula: varchar(80) optional

*aMolWeight: float optional

*aMolID: cvLocalID required
Description: Integer to uniquely identify each instance of a coordinate set for a molecule. For example, each occurrence of lysozyme in the database will be identified by a unique number.

*aAnnotate (aSummLine, aExtDB): set-of (varchar(255), oExternalReference) optional
Description: Annotations describing the molecule. This is presented as a table of text and pointer to an external database.

7. Appendix B: 3DBase Report in different formats
3DBase - raw format
For the oMacroMolecule object of the entry 1ACE:
*Export_Object: 1ACE
*Macromol_name: Acetylcholinesterase
*Macromol_name: Ache
*EC_number: 3.1.1.7
*3DB_init_res_num: 4
*3DB_term_res_num: 534
*Init_res_num: 25
*Term_res_num: 555
*Database_ID_code: ACES_TORCA
*Domain_desc: No
*Engineered: No
*Source_sci_name: Torpedo californica
*Source_common_name: Pacific electric eel

Data in BoulderIO format:
Export_Object = 1ACE
Macromol_name = Acetylcholinesterase
Macromol_name = Ache
EC_number = 3.1.1.7
Chain = {
  3DB_init_res_num = 4
  3DB_term_res_num = 534
  Init_res_num = 25
  Term_res_num = 555
  Database_ID_code = ACES_TORCA
}
Domain_desc = No
Engineered = No
Source_sci_name = Torpedo californica
Source_common_name = Pacific electric eel
As a security measure, if a specific amount of time has passed between transactions, or the request comes from a different computer, you will be asked to re-enter the password. The number of times you should have to do this is minimal and it is there to try to protect against an unmonitored workstation being used to view your work. While these measures are NOT fool-proof, they are significantly better than the system in use today. PDB will do everything in its power to maintain the confidentiality of your deposition. You must however do your part as well.
Joel L. Sussman Protein Data Bank Biology Department, Bldg. 463 Brookhaven National Laboratory Upton NY 11973-5000 USA Phone: (516)-344-6355 Fax: (516)-344-2176 and Dept. of Structural Biology Weizmann Institute of Science Rehovot 76100 Israel Phone: 972-8-934-2638 Fax: 972-8-934-4159 E-mail: csjoel@weizmann.weizmann.ac.il URL: http://www.weizmann.ac.il/~jsgrp/joel.html
Enrique E. Abola Protein Data Bank Biology Department, Bldg. 463 Brookhaven National Laboratory Upton NY 11973-5000 USA Phone: (516)-344-6354 Fax: (516)-344-5751 E-mail: abola1@bnl.gov
Nancy Oeder Manning Protein Data Bank Biology Department, Bldg. 463 Brookhaven National Laboratory Upton NY 11973-5000 USA Phone: (516)-344-5744 Fax: (516)-344-5751 E-mail: oeder@bnl.gov
Dr Jaime Prilusky Head Bioinformatics Unit Dep. of Biological Services Weizmann Institute of Science 76100 Rehovot - Israel Phone: 972-8-9343456 Fax: 972-8-9344113 E-mail: lsprilus@weizmann.weizmann.ac.il URL: http://bioinformatics.weizmann.ac.il/jaime_prilusky.html