Volume 13, Issue 6 p. 564-571
Full Paper

Hit Dexter: A Machine-Learning Model for the Prediction of Frequent Hitters

Conrad Stork

Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

Johannes Wagner

Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

Nils-Ole Friedrich

Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

Christina de Bruyn Kops

Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

Martin Šícho

Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

National Infrastructure for Chemical Biology, Laboratory of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology Prague, 166 28 Prague 6, Czech Republic

Prof. Dr. Johannes Kirchmair

Corresponding Author

Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

First published: 29 December 2017

Graphical Abstract

Hit Dexter: False-positive assay signals triggered by badly behaving compounds continue to pose a major challenge to experimental screening. A free web service, called Hit Dexter, is able to identify such compounds with high accuracy, enabling chemists to make better-informed decisions on their hit compounds.

Abstract

False-positive assay readouts caused by badly behaving compounds, including frequent hitters, pan-assay interference compounds (PAINS), aggregators, and others, continue to pose a major challenge to experimental screening. There are only a few in silico methods that allow the prediction of such problematic compounds. We report the development of Hit Dexter, two extremely randomized trees classifiers for the prediction of compounds likely to trigger positive assay readouts either by true promiscuity or by assay interference. The models were trained on a well-prepared dataset extracted from the PubChem Bioassay database, consisting of approximately 311 000 compounds tested for activity on at least 50 proteins. Hit Dexter reached MCC and AUC values of up to 0.67 and 0.96 on an independent test set, respectively. The models are expected to be of high value, in particular to medicinal chemists and biochemists who can use Hit Dexter to identify compounds for which extra caution should be exercised with positive assay readouts. Hit Dexter is available as a free web service at http://hitdexter.zbh.uni-hamburg.de.

Introduction

Biochemical assays are of considerable importance for early drug discovery, and modern high-throughput screening technologies allow the testing of over one hundred thousand compounds within one day.1, 2 However, high rates of false-positive readouts caused by various types of assay interference remain a major issue. A substantial number of false hits continue to appear as valid active compounds in the peer-reviewed literature.3 As a consequence, efforts to characterize badly behaving compounds4, 5 (frequent hitters, pan-assay interference compounds, aggregators and others) and develop good practice guidelines on how to identify assay artifacts and reject such hits6, 7 have recently been gaining traction.

False-positive results can be related to the chemical reactivity of a compound.8 In particular, electrophiles can bind covalently (and non-discriminately) to various proteins, thereby changing the function of the bio-macromolecule that is measured by the assay.9 A wide range of in silico approaches for the identification of reactive compounds are available. Reactive compounds can be identified using models based on sets of rules, quantum chemical methods, and other linear and nonlinear modeling techniques.10

Besides chemical reactivity, false-positive readouts in biochemical assays may be related to a variety of other effects and processes, such as redox cycling, interference with assay spectroscopy, membrane disruption, decomposition in buffers and metal complexation.3 Baell et al.11 have devised a set of 480 substructures from high-throughput screening data that encode the molecular substructures of pan-assay interference compounds (PAINS). These substructures can be encoded as SMARTS patterns to use as a filter for flagging compounds that are likely PAINS. However, their applicability domain is narrow and they also match (potentially) benign moieties.11, 12 A recent study showed that the patterns match a substantial number of compounds that do not show any assay activity (i.e., “Dark Chemical Matter”).13, 14 They should therefore be used as indicators rather than hard filters.

Colloidal aggregators are a further and possibly the most abundant type of compounds that may cause false-positive signals in biochemical assays.15 These compounds are related to the formation of micelles at specific concentrations and generally not covered by the SMARTS patterns discussed above. An in silico approach for flagging likely colloidal aggregators based on molecular similarity with over 12 600 known aggregators (taking calculated logP values into account) is available.16

Most of the available computational approaches are limited to the identification of a specific type of badly behaving compounds.6 An exception is Badapple,17 which assigns a promiscuity score to compounds based on their molecular scaffolds. Badapple is derived from more than 430 000 compounds measured in 822 different assays. While this model can in principle provide valuable indications of compound promiscuity, the reduction of molecular structures to scaffolds limits its capacity to account for subtle differences in compound structure.

In this work we explored machine learning approaches to develop classifiers for the prediction of frequent hitters (also referred to as promiscuous molecules) based on a large, curated dataset extracted from the PubChem Bioassay database.18, 19 The best performing models resulting from this work are available as a web service at http://hitdexter.zbh.uni-hamburg.de.

Results and Discussion

Compilation of datasets for model development

Bioactivity data on 468 260 small molecules measured in 2266 confirmatory dose–response assays were retrieved from the PubChem Bioassay database. The dataset was subjected to a multi-step data preparation process (Figure 1) resulting in 427 657 unique compounds with activity data on a total of 653 unique proteins.

Figure 1. Overview of the data preparation pipeline. The numbers of ligands that survived each filtering step are reported in boxes, and the numbers of removed compounds are shown on the right. After the treatment of multi-component compounds (salts), molecules with a molecular weight below 250 or above 900 Da and molecules consisting of elements other than those commonly observed in drug-like molecules were removed. A duplicate filter was applied to the remaining compounds, followed by quality checks to discard, e.g., contradicting bioassay data. See the Experimental Section for details.

From this dataset, two subsets were extracted for model development, consisting of 391 552 and 311 491 compounds that have been tested for activity on at least 20 and 50 different proteins, respectively (Figure 2). These two cutoff values were found to produce datasets covering a broad range of biological activities and a large chemical space. The latter was analyzed by principal component analysis (PCA) as reported in Figure 3. We refer to these datasets as the PC20 and the PC50 datasets, where PC is used as an abbreviation for “protein count”.

Figure 2. Number of compounds in the curated dataset that have bioactivity data reported for at least the given number of proteins (PC). The numbers of compounds relevant for the PC20 and PC50 datasets are indicated by red circles.

Figure 3. Scatter plot of the second against the first principal component, based on 45 molecular descriptors (Table S1), showing that the PC50 training set covers the total (i.e., processed PubChem Bioassay) dataset well.

The diversity of the PC20 and PC50 datasets was tested with a clustering approach. For each of the two datasets, 20 subsets were compiled, each consisting of 50 000 randomly selected compounds. These subsets were clustered with the Butina (unsupervised non-hierarchical) clustering algorithm20 based on Morgan2 fingerprints21, 22 and a Tanimoto similarity threshold of 0.75. Among all subsets of both datasets, the largest clusters contained only 32 (PC20) and 15 (PC50) molecules, respectively. The lowest number of clusters for all subsets was 44 649 (PC20) and 44 514 (PC50), respectively. Based on these results we deemed the datasets sufficiently diverse for modeling.
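The clustering step can be illustrated with a minimal plain-Python sketch. It is a simplified Butina-style procedure, not the RDKit implementation used in the study: fingerprints are represented as sets of on-bit indices (an assumption made for brevity), neighbor counts are recomputed after each cluster is removed, and the function names `tanimoto` and `butina_clusters` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def butina_clusters(fingerprints, threshold=0.75):
    """Simplified Butina-style clustering (illustrative sketch only).

    Two molecules are neighbors if their Tanimoto similarity is at least
    `threshold`. The molecule with the most unassigned neighbors becomes a
    cluster centroid; it and its neighbors are removed, and the step repeats.
    """
    n = len(fingerprints)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Ties are broken in favor of the lowest index.
        centroid = max(sorted(unassigned),
                       key=lambda i: len(neighbors[i] & unassigned))
        members = [centroid] + sorted(neighbors[centroid] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


# Toy fingerprints (hypothetical bit indices).
toy = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 5}, {10, 11, 12}]
clusters = butina_clusters(toy, threshold=0.75)
print(len(clusters), max(len(c) for c in clusters))
```

A large number of clusters relative to the number of molecules, together with a small largest cluster, is the pattern the authors interpret as high diversity.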

The hit rates of the individual compounds in biochemical assays were quantified with the active-to-tested ratio (ATR), which is calculated as in Equation 1:
$$\mathrm{ATR} = \frac{A}{T} \tag{1}$$

where A is the number of proteins for which a compound was measured as active and T is the total number of proteins on which a compound has been tested. The ATR is low for most compounds in the dataset, but a significant number of frequent hitters are also present (Figure 4).

Figure 4. ATR distributions among compounds of the PC20 and PC50 datasets.

Compounds were assigned one of three different promiscuity labels according to the ATR thresholds reported in Table 1: a “non-promiscuous” (NP) label for any compounds with an ATR<ATRmean, a “promiscuous” (P) label for any compounds with ATR>ATRmean+1σ, and a “highly promiscuous” (HP) label for any compounds with ATR>ATRmean+3σ. Note that, according to this definition, highly promiscuous compounds are a subset of promiscuous compounds.
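The labeling scheme can be sketched in a few lines of Python. The function names are hypothetical, and the example values are back-calculated from the rounded PC50 thresholds reported in Table 1 (mean 0.015, mean+1σ 0.043, hence σ of roughly 0.028), so they are approximate.

```python
def active_to_tested_ratio(n_active, n_tested):
    """ATR = A / T, with A active proteins out of T tested proteins."""
    if n_tested <= 0:
        raise ValueError("a compound must have been tested on at least one protein")
    return n_active / n_tested


def promiscuity_labels(atr, atr_mean, atr_sigma):
    """Return the set of labels assigned to a compound.

    NP: ATR < mean; P: ATR > mean + 1 sigma; HP: ATR > mean + 3 sigma.
    HP compounds also carry the P label. An empty set means the compound
    falls between the mean and mean + 1 sigma and receives no label.
    """
    if atr < atr_mean:
        return {"NP"}
    labels = set()
    if atr > atr_mean + atr_sigma:
        labels.add("P")
    if atr > atr_mean + 3 * atr_sigma:
        labels.add("HP")
    return labels


# Approximate PC50 values derived from the rounded thresholds in Table 1.
PC50_MEAN, PC50_SIGMA = 0.015, 0.028

for active, tested in [(0, 50), (1, 50), (3, 50), (10, 50)]:
    atr = active_to_tested_ratio(active, tested)
    print(active, tested, round(atr, 3), sorted(promiscuity_labels(atr, PC50_MEAN, PC50_SIGMA)))
```

Returning a set rather than a single label keeps the "HP is a subset of P" relationship explicit.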

Table 1. Composition of the datasets used for model training and validation.

| Assigned promiscuity class | Dataset | Number of compounds in PC20 | Number of compounds in PC50 | Threshold definition[a] | Threshold value PC20[b] | Threshold value PC50[b] |
|---|---|---|---|---|---|---|
| Non-promiscuous (NP) | Total | 273 781 | 226 710 | ATR<ATRmean | 0.017 | 0.015 |
| | Training set | 246 331 | 203 992 | | | |
| | Test set 1[c] | 27 450 | 22 718 | | | |
| | Test set 2[d] | 16 872 | 14 611 | | | |
| | Test set 3[e] | 6569 | 5863 | | | |
| Promiscuous (P) | Total | 35 438 | 29 112 | ATR>ATRmean+1σ | 0.049 | 0.043 |
| | Training set | 31 915 | 26 201 | | | |
| | Test set 1[c] | 3523 | 2911 | | | |
| | Test set 2[d] | 2303 | 2060 | | | |
| | Test set 3[e] | 1090 | 965 | | | |
| Highly promiscuous (HP): a subset of compounds labeled P | Total | 7371 | 5527 | ATR>ATRmean+3σ | 0.112 | 0.100 |
| | Training set | 6653 | 4970 | | | |
| | Test set 1[c] | 718 | 557 | | | |
| | Test set 2[d] | 496 | 409 | | | |
| | Test set 3[e] | 283 | 203 | | | |

  • [a] Compounds with ATRs between ATRmean and the given standard deviation were not assigned a promiscuity label and were effectively removed from the datasets. [b] ATR threshold values calculated for the individual datasets according to the ATR threshold definition. [c] Independent test set obtained by random split of the curated dataset prior to model development. [d] Subset of the independent test set consisting only of molecules showing a Morgan2 fingerprint-based maximum Tanimoto coefficient of 0.8 to any compounds in the training data. [e] Subset of the independent test set consisting only of molecules showing a Morgan2 fingerprint-based maximum Tanimoto coefficient of 0.7 to any compounds in the training data.

Prior to any modeling experiments, the datasets were each split into a training set and an independent test set using a 9:1 ratio (Table 1). This resulted in a training set with up to 246 331 instances for each of the promiscuity classes. The test sets consisted of up to 27 450 instances for each promiscuity class.

Analysis of the physicochemical properties of non-promiscuous and promiscuous molecules

The ability of a compound to trigger a positive signal in biochemical assays can be related to its physicochemical properties or to the presence of specific chemical patterns. We computed characteristic physicochemical properties to probe whether a link to compound promiscuity can be established.

As shown in Figure 5, the molecular weight distribution of NP, P and HP compounds is similar. However, P (and also HP) compounds tend to be more lipophilic than NP compounds (Table 2). Their calculated logP is on average one log unit higher than that of NP compounds. This is consistent with the general observation that nonspecific compound binding is correlated with hydrophobicity. In addition, a higher proportion of aromatic atoms as well as a lower proportion of rotatable bonds were found among P (and also HP) compounds. Both of these properties are related to planarity and flexibility, which themselves are known to be associated with a compound's ability to form colloidal aggregates. Whether these differences in physicochemical properties are sufficient to yield accurate classifiers will be explored in the subsequent sections.

Figure 5. Density plots of the logP, molecular weight, ratio of aromatic atoms to heavy atoms and ratio of rotatable bonds among all bonds for the PC50 dataset. HP compounds in red; P compounds in blue; NP compounds in black.

Table 2. Physicochemical properties and their correlations with the ATR for the PC50 dataset.

| Property | NP[a,d] | P[b,d] | HP[c,d] | Correlation with ATR |
|---|---|---|---|---|
| Ratio of aromatic atoms to heavy atoms | 0.44±0.16 | 0.52±0.18 | 0.50±0.18 | 0.13 |
| Ratio of rotatable bonds among all bonds | 0.14±0.04 | 0.12±0.04 | 0.11±0.05 | −0.13 |
| logP | 2.62±1.33 | 3.46±1.47 | 3.46±1.60 | 0.20 |
| Molecular weight | 371.19±7.92 | 364.45±77.49 | 366.74±78.19 | −0.01 |

  • [a] Non-promiscuous compounds. [b] Promiscuous compounds. [c] Highly promiscuous compounds. [d] Data are the average±standard deviation.

Model development

Two different types of binary classification models were developed: one to discriminate promiscuous from non-promiscuous compounds (P-NP) and one to discriminate highly promiscuous from non-promiscuous compounds (HP-NP). In combination with the two datasets, PC20 and PC50, this gave rise to a total of four different models.

Identification of the most suitable machine learning algorithm and descriptor sets

In initial experiments we explored the performance of random forest classifiers23 (RFCs) and extremely randomized tree classifiers24 (ETCs) trained on 1) all 206 2D physicochemical property descriptors implemented in MOE,25 2) MACCS key fingerprints (166 bits), and 3) Morgan2 fingerprints (1024 bits), both implemented in RDKit.26

All models were trained with scikit-learn27 and evaluated by 10-fold cross-validation. Default values were used for the hyperparameters, except for the number of estimators, which was increased to 50, and the class weights, which were set to “balanced”.
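A minimal sketch of this setup with scikit-learn is given below. It runs on random bit vectors that merely stand in for fingerprints, so the scores it prints say nothing about the published models. The use of a stratified, shuffled split and the fixed random seeds are assumptions; the text only specifies 10-fold cross-validation, 50 estimators and balanced class weights.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 300 "molecules" with 64 random bits each.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 64))
# Toy minority class (about 25 %): both of the first two bits are set.
y = X[:, 0] & X[:, 1]

# Settings named in the text: 50 estimators, balanced class weights,
# all other hyperparameters left at their defaults.
clf = ExtraTreesClassifier(n_estimators=50, class_weight="balanced", random_state=0)

# Assumption: stratified, shuffled folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring=make_scorer(matthews_corrcoef))

print(f"MCC over {len(scores)} folds: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Swapping `ExtraTreesClassifier` for `RandomForestClassifier` with the same arguments gives the corresponding random forest baseline.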

The Matthews correlation coefficient (MCC) was used as the primary measure of model performance. The MCC is a balanced measure of prediction quality which not only takes true positives (TP) and false positives (FP) into account, but also true negatives (TN) and false negatives (FN). It is calculated according to Equation 2.
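$$\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}} \tag{2}$$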

The area under the receiver operating characteristic curve (AUC) served as an additional measure of how well the model was able to rank the compounds for promiscuity according to the probabilities given by the machine learning algorithms.
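Both metrics can be computed directly from their standard definitions, as in the plain-Python sketch below. The function names are hypothetical. The AUC is computed through its rank interpretation, i.e. the probability that a randomly chosen positive is scored above a randomly chosen negative, which is suitable for illustration but too slow for datasets of the size used here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(mcc(tp=50, tn=50, fp=0, fn=0))                     # perfect prediction
print(mcc(tp=25, tn=25, fp=25, fn=25))                   # no better than chance
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))       # 3 of 4 pairs ranked correctly
```

Unlike accuracy, the MCC stays near zero for a classifier that simply predicts the majority class, which matters for datasets as imbalanced as these.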

Models derived from the combination of the extremely randomized tree algorithm with Morgan2 fingerprints consistently obtained the best performance for all combinations of promiscuity thresholds and datasets. The models’ MCC and AUC values ranged up to 0.61 and 0.94, respectively (Table 3). The random forest classifier in combination with Morgan2 fingerprints obtained comparable results (MCC of up to 0.56 and AUC of up to 0.94). Models based on molecular fingerprints clearly outperformed those based on physicochemical property descriptors. This result was expected because assay interference is often linked to specific molecular substructures, and Morgan2 fingerprints are the most suitable (among those tested) to capture these substructures. Differences in performance with respect to promiscuity thresholds and datasets were small. As a result of these experiments, the combination of the ETC with Morgan2 fingerprints was identified as the most suitable starting point for further optimization of the models.

Table 3. Performance of models derived from different combinations of machine learning algorithms and descriptor sets during 10-fold cross-validation.[a]

Descriptor sets: MOE, MOE physicochemical property descriptors; MACCS, MACCS fingerprints; Morgan2, Morgan2 fingerprints.

| Model | Dataset | MOE/ETC MCC | MOE/ETC AUC | MOE/RFC MCC | MOE/RFC AUC | MACCS/ETC MCC | MACCS/ETC AUC | MACCS/RFC MCC | MACCS/RFC AUC | Morgan2/ETC MCC | Morgan2/ETC AUC | Morgan2/RFC MCC | Morgan2/RFC AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P-NP | PC20 | 0.47±0.3×10^-4 | 0.89 | 0.47±0.4×10^-4 | 0.89 | 0.53±0.3×10^-4 | 0.87 | 0.52±0.3×10^-4 | 0.89 | 0.58±0.9×10^-4 | 0.91 | 0.55±0.5×10^-4 | 0.91 |
| HP-NP | PC20 | 0.44±4.1×10^-4 | 0.93 | 0.43±3.5×10^-4 | 0.92 | 0.56±1.6×10^-4 | 0.92 | 0.53±1.1×10^-4 | 0.93 | 0.61±1.5×10^-4 | 0.94 | 0.56±1.9×10^-4 | 0.94 |
| P-NP | PC50 | 0.46±1.2×10^-4 | 0.89 | 0.46±1.1×10^-4 | 0.89 | 0.52±1.0×10^-4 | 0.87 | 0.51±1.1×10^-4 | 0.89 | 0.57±0.8×10^-4 | 0.91 | 0.54±1.2×10^-4 | 0.91 |
| HP-NP | PC50 | 0.41±4.0×10^-4 | 0.92 | 0.40±1.7×10^-4 | 0.92 | 0.56±1.9×10^-4 | 0.92 | 0.52±1.5×10^-4 | 0.92 | 0.61±2.4×10^-4 | 0.94 | 0.55±2.3×10^-4 | 0.94 |

  • [a] ETC, extra tree classifier; RFC, random forest classifier; P-NP, discrimination of promiscuous from non-promiscuous compounds; HP-NP, discrimination of highly promiscuous from non-promiscuous compounds. MCC (with standard deviations) and AUC values averaged over all folds of the cross-validation.

Optimization of model hyperparameters

The number of estimators and the maximum fraction of features considered per split were optimized using a grid search with 10-fold cross-validation (Table 4). Model performance was evaluated based on the average MCC obtained over all folds.

Table 4. Hyperparameters optimized by grid search.

| Parameter | Tested values[a] |
|---|---|
| Number of estimators (n_estimators)[b] | 10,[c] 50, 100*, 150, 200, 250, 300, 400, 500, 600 |
| Maximum fraction of features considered per split (max_features)[b] | “sqrt”,[c] 0.2*, 0.4, 0.6, 0.8, None[d] |

  • [a] Values marked with an asterisk (bold in the original table) were used for final model development. [b] Parameter name in the scikit-learn implementation. [c] Default value. [d] All features are used.
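To give a sense of the size of this search, the grid can be enumerated as in the sketch below. It assumes that the two parameters were varied jointly (a full Cartesian grid), which the text does not state explicitly; the variable names are hypothetical.

```python
from itertools import product

# Tested values as listed in Table 4.
n_estimators_grid = [10, 50, 100, 150, 200, 250, 300, 400, 500, 600]
max_features_grid = ["sqrt", 0.2, 0.4, 0.6, 0.8, None]

# Assumption: every combination of the two parameters is evaluated.
grid = [
    {"n_estimators": n, "max_features": f}
    for n, f in product(n_estimators_grid, max_features_grid)
]

n_folds = 10
total_fits = len(grid) * n_folds

print(f"{len(grid)} parameter combinations, {total_fits} model fits per dataset and threshold")
```

A dictionary of this shape can be passed to scikit-learn's `GridSearchCV` as `param_grid` (with the two lists as values) to run the same search.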

For all combinations of datasets and promiscuity thresholds, minor performance improvements corresponding to increasing numbers of estimators were observed (Tables S2–5). For example, the MCC values of models with 100 estimators were up to 0.04 higher than those of models with only 10 estimators. Marginal, if any, improvements in performance beyond 100 estimators did not justify the additional computational cost. The effect of the maximum fraction of features considered per split (max_features) on model performance was small (up to around 0.01 in MCC and AUC). The best models were achieved with max_features set to 0.2 for all combinations of datasets and promiscuity thresholds.

Overall, the best-performing classifier that emerged from the grid search was able to distinguish HP from NP compounds with an MCC and AUC of 0.62 and 0.95 for the PC20 dataset, and 0.61 and 0.95 for the PC50 dataset, respectively (n_estimators=100, max_features=0.2, for both datasets; Figure 6). The P-NP classifiers performed slightly worse than the HP-NP classifiers, obtaining MCC and AUC values of 0.60 and 0.91 for the PC20 dataset and 0.58 and 0.91 for the PC50 dataset, respectively (Figure 6). The observed differences in model performance were expected, as the ATR margin between the HP and NP classes (3σ) is broader than the margin between the P and NP classes (1σ). Because the performance of models derived from the PC20 and PC50 datasets was comparable, further discussion will focus on models derived from the latter.

Figure 6. Mean ROC curves obtained during 10-fold CV for the best-performing, optimized models (i.e., the ETC derived with Morgan2 fingerprints, n_estimators=100, max_features=0.2).

Model evaluation on independent test sets

The final models were trained with the above-mentioned, optimized hyperparameters (n_estimators 100; max_features 0.2) on the complete PC50 training set balanced with the synthetic minority over-sampling technique (SMOTE) algorithm.28 Performance data on the models derived from the PC20 dataset are provided in Figure S1.
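The core idea of SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below illustrates a single such step in plain Python. It is not the implementation used in the study (which is not specified here), the function name is hypothetical, and the neighbor search uses Euclidean distance as an assumption. Note that interpolating between binary fingerprints yields fractional feature values.

```python
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic sample from a list of minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = rng or random.Random()

    base = rng.choice(minority)
    # Rank the other minority samples by squared Euclidean distance to the base.
    others = [m for m in minority if m is not base]
    others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
    neighbor = rng.choice(others[:k])

    # Place the synthetic point at a random position on the connecting segment.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]


toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote_sample(toy_minority, k=2, rng=random.Random(0))
print(synthetic)
```

Repeating this step until both classes have the same number of samples yields a balanced training set.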

The HP-NP model was able to predict compound promiscuity for the independent test set 1 with MCC and AUC values of 0.67 and 0.96, respectively (Figure 7). Consistent with the trends observed in the cross-validation, slightly lower values were obtained with the P-NP model (MCC 0.61; AUC 0.92). The MCC and AUC values for the independent test sets were slightly better (up to 0.06 MCC and 0.01 AUC) than those for cross-validation on the training set. The increase in performance is likely a result of the over-sampling approach and the fact that more data were available and used for training than during the cross-validation approach.

Figure 7. ROC curves obtained with the final models for the three test sets extracted from the PC50 dataset prior to model development.

To explore the robustness of the models, two subsets of the independent test set were generated consisting only of molecules showing a Morgan2 fingerprint-based maximum Tanimoto coefficient of 0.8 (test set 2) and 0.7 (test set 3) to any compounds in the training data (Table 1). As expected, the MCC and AUC values obtained for the test sets 2 and 3 were lower than for test set 1 (Figure 7). The HP-NP classifier obtained an MCC and AUC of 0.63 and 0.96, respectively, for test set 2. Both values were approximately 0.05 lower for the P-NP classifier on the same dataset. For test set 3, MCC and AUC values of 0.54 and 0.95, respectively, were obtained for the HP-NP classifier. The respective values for the P-NP classifier were again around 0.05 lower than those of the HP-NP classifier.
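The construction of such similarity-limited test sets can be sketched as follows. Fingerprints are again represented as sets of on-bit indices, the function names are hypothetical, and the sketch assumes that a molecule is kept when its highest similarity to any training compound does not exceed the cutoff.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity_to_training(query, training):
    """Highest Tanimoto similarity between a query and any training fingerprint."""
    return max((tanimoto(query, t) for t in training), default=0.0)


def dissimilar_subset(test, training, cutoff):
    """Keep test molecules whose nearest training neighbor is at most `cutoff` similar."""
    return [fp for fp in test if max_similarity_to_training(fp, training) <= cutoff]


# Toy data (hypothetical bit indices): similarities to the single training
# fingerprint are 1.0, 0.75 and 0.33, respectively.
training_fps = [{1, 2, 3, 4}]
test_fps = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 9, 10}]

test_set_2 = dissimilar_subset(test_fps, training_fps, cutoff=0.8)
test_set_3 = dissimilar_subset(test_fps, training_fps, cutoff=0.7)
print(len(test_set_2), len(test_set_3))
```

Lowering the cutoff removes close analogs of training compounds, so the remaining molecules probe how well a model generalizes to less familiar chemistry.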

In addition, the HP-NP and P-NP classifiers were also tested on the Dark Chemical Matter (DCM) dataset,13 which consists exclusively of compounds that have been tested in a minimum of 100 different assays and have not shown any activity. Prior to testing, any compounds present in the PC50 training set (341 compounds in total) and any compounds outside the applicability domain of the models (13 672 compounds that did not pass the filters applied for molecular weight and element types; see the Experimental Section for details) were removed from the DCM dataset. This resulted in a test set of 125 339 compounds, of which 99.9 % and 98.4 % were correctly classified as not promiscuous by the HP-NP and the P-NP models, respectively (Figure 8).

Figure 8. Likelihood of promiscuity predicted for over 125 000 compounds of a preprocessed subset of the DCM dataset. The figure shows that the HP-NP and P-NP models correctly classify the vast majority of compounds as not promiscuous. They obtained an overall accuracy of 99.9 % and 98.4 %, respectively.

Hit Dexter web service

A web service called “Hit Dexter” is accessible free of charge via http://hitdexter.zbh.uni-hamburg.de. The web service offers an easy and quick way to make predictions for individual molecules and sets of molecules with the best-performing classifier (i.e., the ETC derived from the SMOTE-balanced PC50 dataset, Morgan2 fingerprints, n_estimators=100, max_features=0.2). Users upload molecular structures as SMILES or a list of SMILES and initiate the calculations. After a few seconds the user is presented a tabular overview of results, including the molecule name and the calculated probabilities of a compound to be a frequent hitter (Figure 9). The results and a log file can be downloaded for further use. There is an option to also retrieve the five nearest neighbors of query molecules present in the training set, which will give users a better estimate of how reliable the predictions are for their particular compounds of interest.

Figure 9. Screenshot of the Hit Dexter result page.

Conclusions

Assay interference continues to present a significant challenge in early drug discovery. Current computational approaches attempting to identify frequent hitters, including reactive compounds, PAINS and aggregators, are clearly limited in their applicability. In this work we report on the development of Hit Dexter, a machine learning approach for the identification of compounds likely to trigger positive assay signals. The classification models included in Hit Dexter reached MCC and AUC values of up to 0.67 and 0.96 on an independent test set, respectively. A free, public web service includes functionality to retrieve the five nearest neighbors present in the training data for each query molecule, in order to support users in estimating the reliability of the predictions for their particular compounds of interest. Importantly, besides reactive compounds, PAINS and aggregators, Hit Dexter also identifies compounds with particular pharmacophores that allow them to bind to multiple proteins.

We believe that Hit Dexter will help scientists to flag compounds that have an increased likelihood of triggering positive signals in biochemical assays. Compounds flagged by Hit Dexter should not be regarded as being of lower value for drug discovery but rather as having more uncertainty regarding their activity. In fact, frequent hitters may even be desirable, for example, in the context of polypharmacology and drug repurposing,29 provided they are true promiscuous binders.

The purpose of Hit Dexter is to raise awareness and motivate further investigations of the flagged compounds in orthogonal assays. In particular, we also hope that these models will contribute to the effort to decrease the number of false hits in the scientific literature.

Experimental Section

Activity data for chemical substances (substance type=“chemical”) measured with 2266 confirmatory dose–response assays (screening stage=“confirmatory, dose–response”) for single protein targets (target=“single” and target type=“Protein Targets”) were downloaded from the PubChem Bioassay database.18, 19, 30 The SMILES notations for all 468 260 compounds in this dataset were retrieved via the PubChem Identifier Exchange Service.31 Compounds consisting of multiple components (salts) were split and the components sorted by decreasing number of heavy atoms (“size”). If the second-largest component was significantly smaller than the largest one (i.e., number of heavy atoms less than 70 % of the largest component), the largest component was defined as the active component and all others were discarded. If this was not the case, the compounds were removed from the dataset (as no clear assignment of the main components could be made) unless the two largest components were identical, in which case one of these was preserved and all others discarded (salt filter in Figure 1). Compounds with the same unique SMILES were treated as one compound (duplicate filter in Figure 1).
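The decision logic of the salt filter can be expressed compactly. In the sketch below each component is given as a pair of SMILES string and heavy-atom count; the function name is hypothetical, and the comparison of the two largest components by SMILES string assumes that the SMILES are canonical.

```python
def select_main_component(components):
    """Return the SMILES of the main component, or None if the compound is rejected.

    `components` is a list of (smiles, heavy_atom_count) pairs.
    """
    if not components:
        return None
    ranked = sorted(components, key=lambda c: c[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]

    largest, second = ranked[0], ranked[1]
    # Second-largest component has fewer than 70 % of the heavy atoms: clear main component.
    if second[1] < 0.7 * largest[1]:
        return largest[0]
    # Two identical largest components: keep one of them.
    if second[0] == largest[0]:
        return largest[0]
    # No clear assignment of the main component is possible.
    return None


print(select_main_component([("CC(=O)Oc1ccccc1C(=O)O", 13), ("[Na+]", 1)]))
print(select_main_component([("c1ccccc1O", 7), ("c1ccccc1O", 7), ("[Ca+2]", 1)]))
print(select_main_component([("c1ccccc1O", 7), ("c1ccccc1N", 7)]))
```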

Compounds with a molecular weight below 250 or above 900 Da (molecular weight filter in Figure 1), as well as compounds consisting of any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br and I (element filter in Figure 1) were removed from the dataset. The InChIs were retrieved for all remaining compounds via the PubChem Identifier Exchange Service.31 The 240 compounds for which the InChI could neither be retrieved via the PubChem Identifier Exchange Service nor the PubChem PUG REST interface32 were also discarded.
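These two filters also define the applicability domain of the models. A minimal sketch, assuming the molecular weight and the set of element symbols have already been computed with a cheminformatics toolkit (the function and constant names are hypothetical):

```python
# Elements permitted by the element filter described in the text.
ALLOWED_ELEMENTS = frozenset(
    {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}
)
MIN_MW, MAX_MW = 250.0, 900.0  # Da


def passes_filters(mol_weight, elements):
    """True if a molecule passes the molecular weight and element filters."""
    return MIN_MW <= mol_weight <= MAX_MW and set(elements) <= ALLOWED_ELEMENTS


print(passes_filters(371.2, {"C", "H", "N", "O", "Cl"}))   # within range, allowed elements
print(passes_filters(180.2, {"C", "H", "O"}))              # too light
print(passes_filters(420.0, {"C", "H", "N", "Pt"}))        # contains a disallowed element
```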

All downloaded bioactivity records in the PubChem Bioassay database have one of the following four activity values (activity outcomes): “Includes Probe”, “Active”, “Inactive” or “Unspecified/Inconclusive”. Any assays not having at least one “Active” and one “Inactive” record were removed from the dataset. Any compounds (i.e., all instances having the same InChI after application of the salt filter) with contradicting activity values for one and the same assay were discarded (quality check in Figure 1). Following this step, any compounds reported by at least one assay as active on a particular protein were labeled active on that protein. This procedure resulted in a total of 405 399 compounds with assigned bioactivities.
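The aggregation of assay records into per-protein activity labels can be sketched as follows. The record layout and function name are hypothetical. The sketch assumes that only "Active" and "Inactive" outcomes enter the labeling and that a contradiction means both outcomes being reported for the same compound in the same assay; the text does not specify how the other two outcome types were handled.

```python
from collections import defaultdict


def aggregate_activity(records):
    """Map (compound, protein) to True (active) or False (inactive).

    `records` is an iterable of (compound, assay, protein, outcome) tuples.
    """
    records = list(records)

    # 1. Keep only assays with at least one "Active" and one "Inactive" record.
    outcomes_by_assay = defaultdict(set)
    for _, assay, _, outcome in records:
        outcomes_by_assay[assay].add(outcome)
    valid_assays = {a for a, o in outcomes_by_assay.items() if {"Active", "Inactive"} <= o}
    kept = [r for r in records if r[1] in valid_assays and r[3] in ("Active", "Inactive")]

    # 2. Discard compounds with contradicting outcomes within a single assay.
    outcomes_by_pair = defaultdict(set)
    for compound, assay, _, outcome in kept:
        outcomes_by_pair[(compound, assay)].add(outcome)
    contradictory = {c for (c, _), o in outcomes_by_pair.items() if len(o) > 1}

    # 3. A compound is active on a protein if any assay reports it as active.
    activity = {}
    for compound, _, protein, outcome in kept:
        if compound in contradictory:
            continue
        key = (compound, protein)
        activity[key] = activity.get(key, False) or outcome == "Active"
    return activity


toy_records = [
    ("c1", "a1", "p1", "Active"),
    ("c2", "a1", "p1", "Inactive"),
    ("c1", "a2", "p1", "Inactive"),
    ("c3", "a2", "p1", "Active"),
    ("c4", "a1", "p1", "Active"),
    ("c4", "a1", "p1", "Inactive"),   # contradiction within assay a1
    ("c5", "a3", "p2", "Inactive"),   # assay a3 has no "Active" record
]
print(aggregate_activity(toy_records))
```

Counting the `True` entries and all entries per compound in the resulting dictionary gives the A and T values needed for the ATR.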

All PubChem Bioassays are linked to a “gene identifier” (GI), a unique identifier for genes in the NCBI Protein database.33 This identifier was retrieved for the individual assays via the PubChem PUG REST32 interface to link assays to proteins. A total of 712 unique GIs were retrieved. Using these GIs, the protein sequences were retrieved in FASTA file format from the NCBI Protein database. The protein sequences were checked for sequence identity with cd-hit34 (structure equality=100 %), resulting in 653 unique proteins.

All calculations were performed on Linux workstations running openSUSE 42.2 and equipped with Intel i5 processors (3.2 GHz) and 16 GB of main memory.

Acknowledgements

Rainer Fährrolfes, Florian Flachsenberg, and Gerd Embruch are thanked for technical support and discussions. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation; grant KI 2085/1-1) and by the Ministry of Education of the Czech Republic (grants NPU I-LO1220 and LM2015063). M.S. was supported by the Erasmus+ Programme of the European Commission.

Conflict of interest

The authors declare no conflict of interest.