  • Oral presentation
  • Open access

Dataset overlap density analysis

The need to compare compound datasets arises in various scenarios, such as mergers, library extension programs, gap analysis, combinatorial library design, or estimation of QSAR model applicability domains. Whereas it is relatively easy to find identical compounds in two datasets, quantifying their overlap is not straightforward. The approaches described so far include pairwise nearest-neighbor comparisons, clustering with mixed-cluster statistics, and binning of, e.g., rule-of-five property space distributions. The BCUT methodology creates a binned N-dimensional space and allows the number of mixed cells to be assessed. ChemGPS creates a PCA reference projection based on drug-like and satellite molecules in property space to classify new compounds.

But is it possible and also plausible to quantify the overlap of two datasets in a single interpretable number?

PCA projection models with the World Drug Index (WDI) as drug-like reference space were created based on MACCS, ECFP4, E-state or Lipinski-like physicochemical descriptors. Compounds from the commercial vendor i-research library, ZINC, ChEMBL and a current screening subset from PubChem were projected onto the WDI maps.

The dataset overlap density (DOD) index is calculated as the summation over the occupancies of all N-dimensional "volume" elements occupied by both datasets, divided by the summation over all such elements populated by at least one dataset. The index thus provides a single-number measure of the overlap of two sets.
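The calculation can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the equal-width cubic cells, the `origin` and `width` parameters, and the choice to count the compounds of both datasets as a cell's occupancy are all assumptions made here to give the verbal definition a concrete form.

```python
import numpy as np
from collections import Counter

def cell_counts(scores, origin=0.0, width=1.0):
    """Count compounds per grid cell; a cell is keyed by its integer index tuple."""
    idx = np.floor((np.asarray(scores, dtype=float) - origin) / width).astype(int)
    return Counter(tuple(int(i) for i in row) for row in idx)

def dod_index(scores_a, scores_b, origin=0.0, width=1.0):
    """Occupancy in cells shared by both sets over occupancy in all populated cells.

    Assumption: a cell's occupancy is the number of compounds from either set.
    """
    ca = cell_counts(scores_a, origin, width)
    cb = cell_counts(scores_b, origin, width)
    shared = ca.keys() & cb.keys()      # cells populated by both datasets
    populated = ca.keys() | cb.keys()   # cells populated by at least one dataset
    numerator = sum(ca[c] + cb[c] for c in shared)
    denominator = sum(ca[c] + cb[c] for c in populated)
    return numerator / denominator if denominator else 0.0

# Toy principal component scores in two dimensions (made-up values).
set_a = [[0.1, 0.1], [0.2, 0.3], [1.5, 0.5]]
set_b = [[0.4, 0.4], [2.5, 2.5]]
dod = dod_index(set_a, set_b)
```

Under these assumptions the index is 0 for sets that share no cell and 1 for sets that populate exactly the same cells, which is what makes it readable as a single number.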

It is shown that the number of principal components needed to describe at least 75% of the information content varies greatly with the descriptor, and that a projection in 2 dimensions is not adequate. Such N-dimensional projections are extremely sparse (about 10^43 elements for WDI with the MACCS descriptor) and crowded only in small regions of the spanned N-dimensional space.
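How many components a given descriptor needs can be estimated as sketched below. The sketch assumes that "information content" means cumulative explained variance of a mean-centered descriptor matrix; the function name and the toy matrix are hypothetical.

```python
import numpy as np

def n_components_for_variance(descriptors, threshold=0.75):
    """Smallest number of principal components whose cumulative explained
    variance reaches `threshold` (assumed meaning of 'information content')."""
    X = np.asarray(descriptors, dtype=float)
    Xc = X - X.mean(axis=0)                          # center each descriptor column
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variance = singular_values ** 2                  # proportional to PC variances
    cumulative = np.cumsum(variance) / variance.sum()
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))                   # guard against rounding at 1.0

# Toy descriptor matrix: one dominant direction and two minor ones.
toy = [[3, 0, 0], [-3, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
n75 = n_components_for_variance(toy, 0.75)
```

With b bins per axis, an N-component projection spans b^N cells, so every additional component multiplies the number of volume elements while the number of compounds stays fixed. This is why such grids are almost entirely empty.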

The approach is applicable to any descriptor. It can be extended to a DOD vector based on different descriptor types, each of which describes different characteristics of the encoded molecules. The graining of the box elements can easily be adjusted as needed for a particular application. The approach also allows local gaps or overlaps to be quantified. Proprietary datasets can be compared using only the first N principal component values, without ever seeing the underlying descriptors.
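Local gaps and the effect of the graining can be explored with a small Python sketch that works on principal component scores alone. The helper names, the equal-width cells anchored at zero, and the toy scores are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def occupancy(scores, width):
    """Count compounds per grid cell of edge length `width`."""
    idx = np.floor(np.asarray(scores, dtype=float) / width).astype(int)
    return Counter(tuple(int(i) for i in row) for row in idx)

def local_gaps(reference, library, width=1.0):
    """Cells populated by the reference set but empty in the library,
    sorted with the most crowded reference cells first."""
    ref = occupancy(reference, width)
    lib = occupancy(library, width)
    gaps = [(cell, n) for cell, n in ref.items() if cell not in lib]
    return sorted(gaps, key=lambda item: -item[1])

# Made-up PC scores; only these values are exchanged, not the descriptors.
reference_scores = [[0.2, 0.2], [0.7, 0.1], [3.1, 3.4], [5.5, 0.5]]
library_scores = [[0.5, 0.5]]
fine = local_gaps(reference_scores, library_scores, width=1.0)
coarse = local_gaps(reference_scores, library_scores, width=10.0)
```

A finer grid reports more and smaller gaps, while a coarse grid merges them away, so the cell width should be chosen to match the resolution that matters for the application. Repeating the comparison per descriptor type would give one entry of a DOD vector per descriptor.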

Author information

Corresponding author

Correspondence to Andreas H Göller.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Göller, A.H. Dataset overlap density analysis. J Cheminform 5 (Suppl 1), O14 (2013).
