Automated Training Set Selection used for Developing a 3D QSAR Pharmacophore Model in CATALYST™ for p38 MAP Kinase inhibitors
Dr. Shikha Varma describes an automated method for selecting a training set for generating a Catalyst hypothesis from a large collection of p38 MAP Kinase inhibitors.
Abstract
The objective is to develop an automated method for selecting, from a large collection of compounds, a training set that can be used for Catalyst hypothesis generation. Functionalities within Cerius 2 and Catalyst are combined into an automated training set selection protocol. This protocol selects the most diverse set of compounds on the basis of their multi-conformer 3D fingerprints, multi-conformer shape indices, and activity. For this study, SAR data were collected for p38 MAP Kinase inhibitors [2-7]. p38 MAP Kinase belongs to a family of MAP kinases and has been implicated in cytokine signaling. Its inhibitors are potentially useful for the treatment of arthritis and osteoporosis. Development of more potent p38 inhibitors is a goal of many pharmaceutical companies, creating a highly competitive and rapidly growing field.
Introduction
Training set selection from a given set of SAR data is the first step in deriving a predictive QSAR model. The quality of the resulting model depends heavily on the molecules used to derive it; therefore, these compounds must be selected very carefully.
Guidelines for 3D QSAR Model Generation in Catalyst
3D QSAR (HypoGen) model generation within Catalyst calls for the following guidelines when selecting molecules for hypothesis generation:
- At least 16 compounds to assure statistical significance of the pharmacophore model
- Activity range of the compounds should span at least 4 orders of magnitude
- Each order of magnitude should be represented by at least three compounds
- The most active and inactive compounds should be included
- Two compounds with similar structure must differ in activity by an order of magnitude to be included; otherwise pick only the more active of the two
- Two compounds with similar activities must be structurally distinct in order to be included; otherwise pick only the more active of the two
- No redundant information should be included
These guidelines have been discussed in detail previously [1]. Applying them can be a daunting task when there is a large amount of SAR data (hundreds of compounds) and when the compounds share a similar chemotype or are structurally very homologous.
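The numeric guidelines above (compound count, activity span, and coverage of each order of magnitude) lend themselves to a quick automated check. The following is a minimal sketch, not part of Catalyst: the function name `check_training_set`, its parameters, and the choice to treat an "order of magnitude" as a decade bin of log10(activity) are assumptions made for illustration. The structural-similarity guidelines are not covered.

```python
import math
from collections import Counter


def check_training_set(activities_nm, min_compounds=16, min_span=4.0, min_per_order=3):
    """Check the numeric selection guidelines for a list of activities in nM."""
    logs = [math.log10(a) for a in activities_nm]
    span = max(logs) - min(logs)
    # Assumption: one "order of magnitude" = one decade bin, floor(log10(activity)).
    per_decade = Counter(math.floor(v) for v in logs)
    decades = range(math.floor(min(logs)), math.floor(max(logs)) + 1)
    # Decades that hold fewer compounds than the guideline asks for.
    sparse = [d for d in decades if per_decade[d] < min_per_order]
    return {
        "enough_compounds": len(activities_nm) >= min_compounds,
        "span_orders": span,
        "enough_span": span >= min_span,
        "sparse_decades": sparse,
    }


# Hypothetical activities (nM), for illustration only.
example = [0.2, 0.5, 0.9, 2, 5, 9, 20, 50, 90, 200, 500, 900, 2000, 5000, 9000, 20000]
report = check_training_set(example)
print(report)
```

A non-empty `sparse_decades` list points at the activity ranges where more compounds would need to be added before the candidate set meets the guidelines.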
p38 MAP Kinase Inhibitors
The majority of the p38 inhibitors come from the pyridyl-imidazole family, and numerous crystal structures of the kinase complexed with inhibitors have been solved and published (PDB codes 1A9U, 1BL6, 1BL7, and 1BMK, to mention a few). Recently, another class of p38 MAP kinase inhibitors was identified that contains a pyrazolyl-urea-like substructure. For this study, a total of 311 compounds were selected, with activities ranging from 0.11 nM to 114,000 nM. The average molecular weight of the compounds is 361.94, with an average of 47.6 conformers per compound.
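As a quick check against the guideline that activities should span at least 4 orders of magnitude, the reported activity range works out as follows (values rounded):

```latex
\log_{10}\!\left(\frac{114000\ \text{nM}}{0.11\ \text{nM}}\right)
  = \log_{10}\!\left(1.04 \times 10^{6}\right)
  \approx 6.0
```

The collected compounds therefore cover roughly six orders of magnitude in activity.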
To calculate the 3D fingerprints, we first create, from a multi-conformer Catalyst .bdb file, a feature file containing the coordinates of all the features present in each conformer. This is done within the Cerius 2 interface (3DKeys) using the catFeatures program in Catalyst. The features HBA, HBD, RING AROMATIC, HYDROPHOBIC, NEG CHARGE, NEG IONIZABLE, POS CHARGE, and POS IONIZABLE are predefined by Catalyst and are surface accessible.
Stepwise progression of file types:
p38kinase.sd --> p38kinase.bdb --> p38kinase.fea --> p38kinase.4pf (or p38kinase.3pf)
The fingerprint data, which are highly correlated and high-dimensional, are then treated with multidimensional scaling (MDS). The MDS coordinates, together with shape descriptors and activity data, are then analyzed by principal component analysis (PCA). Finally, Cerius 2 diversity tools are used to select the N most diverse compounds (Figure 1).

Figure 1. A graphical scheme of the training set selection methodology.
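Multidimensional scaling takes a matrix of pairwise dissimilarities between compounds as its input. The article does not say which dissimilarity measure is applied to the fingerprints, so the sketch below assumes a Tanimoto distance between sets of fingerprint keys purely for illustration; the function names are hypothetical and this is not the Cerius 2 implementation.

```python
def tanimoto_distance(keys_a, keys_b):
    """Return 1 - (shared keys / total distinct keys) for two key sets."""
    union = len(keys_a | keys_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(keys_a & keys_b) / union


def distance_matrix(fingerprints):
    """Symmetric matrix of pairwise distances, usable as MDS input."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = tanimoto_distance(fingerprints[i], fingerprints[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical fingerprints: each compound is a set of key identifiers.
fps = [{1, 2, 3}, {2, 3, 4}, {7}]
dm = distance_matrix(fps)
print(dm)
```

Any MDS routine that accepts a precomputed dissimilarity matrix could then embed the compounds in a small number of coordinates.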
A 3D fingerprint for a compound is defined as the collection of all possible three-feature combinations (3-point fingerprint) or four-feature combinations (4-point fingerprint) in three dimensions, taken over all conformers. Each multiplet is characterized by a set of feature types and the corresponding inter-feature distances (Figure 2).
Figure 2. A 4-point 3D fingerprint representation of a p38 MAP Kinase Inhibitor.
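The definition above can be sketched in a few lines. This is an illustrative reconstruction, not the 3DKeys code: the input layout (a list of `(feature_type, (x, y, z))` tuples per conformer), the fixed-width distance binning, and the names `multiplet_keys` and `fingerprint` are all assumptions.

```python
import math
from itertools import combinations


def multiplet_keys(features, n_points=3, bin_width=1.0):
    """Return the set of n-point keys for one conformer.

    features: list of (feature_type, (x, y, z)) tuples.
    Assumption: distances are binned with a fixed width (same units as coordinates).
    """
    keys = set()
    for combo in combinations(features, n_points):
        pairs = []
        for (type_a, xyz_a), (type_b, xyz_b) in combinations(combo, 2):
            distance_bin = int(math.dist(xyz_a, xyz_b) // bin_width)
            # Order the two type labels so the key ignores input order.
            pairs.append((min(type_a, type_b), max(type_a, type_b), distance_bin))
        keys.add(tuple(sorted(pairs)))
    return keys


def fingerprint(conformers, n_points=3, bin_width=1.0):
    """Union of multiplet keys over all conformers of one compound."""
    keys = set()
    for features in conformers:
        keys |= multiplet_keys(features, n_points, bin_width)
    return keys


# Hypothetical conformer with four features.
conf = [
    ("HBA", (0.0, 0.0, 0.0)),
    ("HBD", (3.0, 0.0, 0.0)),
    ("HYDROPHOBIC", (0.0, 4.0, 0.0)),
    ("RING AROMATIC", (0.0, 0.0, 5.5)),
]
three_point = multiplet_keys(conf, n_points=3)
four_point = multiplet_keys(conf, n_points=4)
print(len(three_point), len(four_point))
```

Each key pairs the feature types with their binned separations, so two conformers that place the same features at similar distances contribute the same key to the compound's fingerprint.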
Shape descriptors are calculated for all the multi-conformer compounds using the Catalyst functionality, catShape. The shape descriptors consist of volume descriptors (mean and median) and x, y, and z components of principal axes (min., max., mean, range, median) (Figure 3).

Figure 3. (a) A p38 MAP Kinase Inhibitor, (b) shape of a conformer, (c) an ensemble of the multi-conformer shapes.
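The way per-conformer shape values collapse into one descriptor row per compound can be sketched as follows. The input layout is hypothetical (the catShape output format is not described here), and `summarize_shape` is an illustrative name; only the list of statistics follows the text above.

```python
from statistics import mean, median


def summarize_shape(conformer_shapes):
    """Collapse per-conformer shape values into per-compound descriptors.

    conformer_shapes: list of dicts with keys 'volume', 'x', 'y', 'z'
    (hypothetical layout; x, y, z are principal-axis components).
    """
    volumes = [c["volume"] for c in conformer_shapes]
    descriptors = {"volume_mean": mean(volumes), "volume_median": median(volumes)}
    for axis in ("x", "y", "z"):
        values = [c[axis] for c in conformer_shapes]
        descriptors[axis + "_min"] = min(values)
        descriptors[axis + "_max"] = max(values)
        descriptors[axis + "_mean"] = mean(values)
        descriptors[axis + "_range"] = max(values) - min(values)
        descriptors[axis + "_median"] = median(values)
    return descriptors


# Hypothetical values for three conformers of one compound.
shapes = [
    {"volume": 300.0, "x": 10.0, "y": 5.0, "z": 3.0},
    {"volume": 310.0, "x": 12.0, "y": 6.0, "z": 3.0},
    {"volume": 320.0, "x": 14.0, "y": 7.0, "z": 4.0},
]
row = summarize_shape(shapes)
print(row)
```

With two volume statistics and five statistics for each of the three axes, this yields 17 shape descriptors per compound.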
A principal component analysis is then performed on the descriptors (MDS coordinates derived from the 3D fingerprints, shape, and activity) to reduce the high-dimensional descriptor data to principal components. To visualize the compounds in three dimensions, the first three principal components are plotted. The last step is to select a diverse set of compounds to be used as a training set for developing a HypoGen model in Catalyst (Figure 4).

Figure 4. PCA plot of p38 MAP Kinase inhibitors with the 30 most diverse molecules selected (red).
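The article does not state which algorithm the Cerius 2 diversity tools use. As an illustration of how N diverse compounds could be picked from principal component coordinates, the sketch below uses a greedy MaxMin selection, a common diversity heuristic swapped in here as an assumption. The function name and the choice of the first point as seed are also assumptions.

```python
import math


def maxmin_select(points, n_select, seed_index=0):
    """Greedy MaxMin: repeatedly add the point farthest from the current selection."""
    selected = [seed_index]
    # Distance from every point to its nearest already-selected point.
    nearest = [math.dist(p, points[seed_index]) for p in points]
    while len(selected) < min(n_select, len(points)):
        next_index = max(range(len(points)), key=lambda i: nearest[i])
        if nearest[next_index] == 0.0:
            break  # only duplicates of selected points remain
        selected.append(next_index)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[next_index]))
    return selected


# Hypothetical coordinates on the first three principal components.
pcs = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (10.0, 0.0, 0.0),
    (0.0, 10.0, 0.0),
    (0.5, 0.5, 0.0),
]
picked = maxmin_select(pcs, n_select=3)
print(picked)
```

The result depends on the seed compound; starting from the most active compound would be one way to tie the selection to the guideline that the most active compound be included.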
References:
1. http://www.accelrys.com/products/catalyst/catalystproducts/cathypo.html
2. Gallagher, T. F. et al., Bioorg. & Med. Chem. Lett. 5(11), 1171-1176 (1995)
3. Boehm, J. C. et al., J. Med. Chem. 39(20), 3929-3937 (1996)
4. Henry, J. R. et al., J. Med. Chem. 41(22), 4196-4198 (1998)
5. Wang, Z. et al., Structure 6, 1117-1128 (1998)
6. Liverton, N. J. et al., J. Med. Chem. 42(12), 2180-2190 (1999)
7. Underwood, D. C. et al., J. Phar. and Expt. Therap. 293(1), (2000)