Statistics in the DAQ-Score Database


The DAQ-Score Database is updated monthly. The current database entries are based on the data in the Protein Data Bank as of 2024/03/27. The entries were constructed by applying the following filtering criteria, starting from all PDB entries (217,705 PDB IDs).

Then, all versioned models of those PDB IDs were collected from the PDB Versioned Archive. To exclude models that do not fit the map well, we used the sum of the cross-correlation coefficient and the overlap between the experimental map segmented by the model and the simulated map of the model. If the sum was less than 1.50, the pair of versioned PDB model and associated EMDB map was excluded from the DAQ-Score Database (13,549 PDB IDs remained).
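The exclusion rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record type `ModelMapPair`, its field names, and the helper functions are hypothetical and are not part of the DAQ-Score pipeline. Only the 1.50 cutoff on the sum comes from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List

FIT_SUM_CUTOFF = 1.50  # cutoff on (cross-correlation + overlap) stated in the text


@dataclass(frozen=True)
class ModelMapPair:
    """Hypothetical record for one versioned PDB model and its EMDB map."""
    pdb_id: str
    version: str
    emdb_id: str
    cc: float       # cross-correlation coefficient
    overlap: float  # overlap between segmented experimental map and simulated map


def is_well_fitted(pair: ModelMapPair, cutoff: float = FIT_SUM_CUTOFF) -> bool:
    """A pair is kept unless the sum of its two fit measures is below the cutoff."""
    return pair.cc + pair.overlap >= cutoff


def filter_pairs(pairs: Iterable[ModelMapPair]) -> List[ModelMapPair]:
    """Return only the model/map pairs that pass the fit filter."""
    return [p for p in pairs if is_well_fitted(p)]
```

With this sketch, a pair whose two values sum to exactly 1.50 is kept, since the text excludes only sums below 1.50.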


Cross-Correlation Coefficient and Overlap

Chains whose sequence (_entity_poly.pdbx_seq_one_letter_code_can) consists only of unknown amino acids ("X") are currently not included in the database.
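A check of this kind could look like the following sketch. The function name `is_all_unknown` is hypothetical. Stripping whitespace (mmCIF sequence values are often wrapped across lines) and treating an empty sequence as "not all unknown" are assumptions made for the example.

```python
def is_all_unknown(seq: str) -> bool:
    """Return True if the one-letter sequence contains only 'X' residues.

    Whitespace is removed first because the value may be wrapped across
    several lines in the source file (an assumption of this sketch).
    """
    residues = "".join(seq.split()).upper()
    return len(residues) > 0 and set(residues) == {"X"}
```

Chains for which this returns True would be skipped. A chain with a mix of known residues and "X" would still be processed.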

The DAQ-Score Database includes the analysis of 4,485 chains from the PDBNR1Å dataset used in the DAQ-Score paper, along with additional chains analyzed more recently. The total number of proteins analyzed is listed above.

The PDBNR1Å dataset was constructed as follows. First, we clustered protein chains in the PDB with a 90% sequence identity cutoff. Then, within each cluster, we computed the Cα RMSD between pairs of structure models and removed one structure of a pair if their Cα RMSD was less than 1.0 Å. Thus, chain models generally have less than 90% sequence identity to each other, but if two models with more than 90% identity differed structurally by more than 1.0 Å in Cα RMSD, both were kept. Finally, we applied the following four criteria and kept protein models that satisfy all of the conditions:

  1. are at least 200 residues long;
  2. were constructed from cryo-EM maps and have at least 50% volume overlap with the maps;
  3. have a cross-correlation coefficient of at least 0.5 with the EM map;
  4. do not have 25% or more sequence identity to any protein in the 183 training/validation set of Emap2sec+.
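The redundancy removal within a cluster and the four criteria can be sketched as follows. This is a minimal illustration, not the code used to build PDBNR1Å: the `ChainModel` fields, the `ca_rmsd` callback, and the greedy keep-first order are assumptions (the text does not say which structure of a similar pair is removed). Sequence clustering and RMSD computation are assumed to be done elsewhere.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ChainModel:
    """Hypothetical per-chain record; values are assumed to be precomputed."""
    chain_id: str
    n_residues: int
    from_cryo_em: bool
    volume_overlap: float            # fraction of model volume overlapping the map (0-1)
    cc: float                        # cross-correlation coefficient with the EM map
    max_identity_to_training: float  # highest identity to Emap2sec+ set (0-1)


def remove_structural_duplicates(
    cluster: Sequence[ChainModel],
    ca_rmsd: Callable[[ChainModel, ChainModel], float],
    cutoff: float = 1.0,
) -> List[ChainModel]:
    """Greedy pass over one 90%-identity cluster.

    A chain is kept only if its C-alpha RMSD to every chain kept so far
    is at least `cutoff` (in angstroms). Keeping the first chain seen is
    an arbitrary choice made for this sketch.
    """
    kept: List[ChainModel] = []
    for chain in cluster:
        if all(ca_rmsd(chain, k) >= cutoff for k in kept):
            kept.append(chain)
    return kept


def passes_criteria(c: ChainModel) -> bool:
    """Apply the four listed conditions; all must hold."""
    return (
        c.n_residues >= 200
        and c.from_cryo_em
        and c.volume_overlap >= 0.50
        and c.cc >= 0.5
        and c.max_identity_to_training < 0.25
    )
```

In use, `remove_structural_duplicates` would be called once per sequence cluster, and `passes_criteria` would then be applied to the surviving chains.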

-->