Enabling Matched Molecular Pairs Analysis for Target Activity Prediction on Small Datasets
With the very fresh 2.0 release of the Discngine Chemistry Collection for Pipeline Pilot, nifty new functionality is available to Pipeline Pilot users. Find out more about the new release of the Chemistry Collection.
This post describes one of these new features, fuzzy context specific matched molecular pairs (fcsMMPs), explains why you should adopt them, and gives a few examples of what you can do with them. Note that this post focuses on target activity prediction (fairly unusual for MMP analysis), but everything described here applies to more classical property prediction too.
Everything that is shown here can be done with this Pipeline Pilot Component Collection. Even the data is included.
If you do not feel 100% familiar with the concepts behind matched molecular pairs analysis, please read through the introductory part below. If you know what a transformation, a common core and a context are and have done matched pair effect analysis in the past, you can go directly to the fcsMMP-specific part of this post.
Introduction
Matched molecular pairs analysis (MMPA) is a commonly used approach to analyze structure-property data. It allows you to extract frequently observed transformations that alter (or do not alter) a given property. One of the main advantages of MMPA is that results are generally easy to interpret, not only for computational chemists but also, and more importantly, for medicinal chemists.
What is a Matched Molecular Pair or MMP?
An MMP is formed by two compounds that share a common structural part and differ in another part. The common structural part is usually called the common core, whereas the variable part is called the fragment. Thus a pair defines a transformation of that variable part from fragment 1 to fragment 2.
In the image on the right, the fragment that is replaced is marked in red, while the common core is colored in gray. Below you can find another example of a matched pair. Note that the fragment replacement from a phenyl to a pyridyl group is also observed in the following matched pair.
Note the change in the common core this time. In a dataset of n compounds this phenyl to pyridyl transformation can be observed several times. However, the common core can differ between matched pairs, as shown in the example above.
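To make the vocabulary concrete, here is a minimal Python sketch of how a matched pair could be represented. The class name, field names and the core and fragment labels are all hypothetical and only serve as an illustration; this is not how the Chemistry Collection stores its pairs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedPair:
    """Toy matched pair; all fields are opaque, made-up labels."""
    core: str        # common core shared by both compounds
    fragment_1: str  # variable part in the first compound
    fragment_2: str  # variable part in the second compound

    @property
    def transformation(self):
        # A transformation is the replacement of fragment 1 by fragment 2.
        return (self.fragment_1, self.fragment_2)


pairs = [
    MatchedPair("core_A", "phenyl", "pyridyl"),
    MatchedPair("core_B", "phenyl", "pyridyl"),
    MatchedPair("core_A", "phenyl", "cyclohexyl"),
]

# The same transformation can be observed on several distinct common cores.
cores_by_transformation = defaultdict(set)
for pair in pairs:
    cores_by_transformation[pair.transformation].add(pair.core)
```

Here the phenyl to pyridyl transformation is observed on two different cores, which is exactly the situation described above.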
How are MMPs used?
Now that we have defined what an MMP is, how can these compound pairs be useful for medicinal chemistry? The idea behind the analysis of such pairs is rather simple. Given a pair of compounds with a particular transformation (let's stick to our phenyl to pyridyl transformation), people postulate that the same transformation on another pair of molecules might have the same effect on molecular properties, even if the common core is different, provided the effect has been observed a sufficient number of times. For example, all such pairs would affect the solubility of the compound in the same way. For more information on how MMPs are currently used, you can read for instance this Leach paper that appeared in 2006 in J Med Chem.
What is a Matched Molecular Pair property/activity effect?
Let us suppose we are trying to analyse a database of compounds with annotated physicochemical properties such as solubility, lipophilicity, etc.
We previously saw how a matched pair of two molecules is defined by transforming a phenyl into a pyridyl ring. This matched pair has been observed on two compounds; let us call them A & B. Now we find the same transformation but with another common core, so the same transformation is also described by compounds C & D. By analyzing the whole database of compounds we will end up with hundreds or even thousands of matched pairs transforming the phenyl into a pyridyl ring. What is the effect of this replacement on a physicochemical property for each of these pairs? When analyzing all matched pairs with the same transformation, one can derive a pie chart like the one on the right. This chart shows that in 40% of cases the transformation improves the property of the molecule (let's say solubility, just as an example). However, in 30% of cases the property remains the same, and in the remaining 30% it even deteriorates.
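The pie chart boils down to counting how often the property goes up, stays the same or goes down over all pairs sharing a transformation. The sketch below shows one way to compute these fractions. The function name, the tolerance of 0.3 units used to call a value "unchanged" and the property differences are assumptions made for illustration; the data are invented so that they reproduce the 40/30/30 split mentioned above.

```python
from collections import Counter


def effect_distribution(deltas, tolerance=0.3):
    """Fractions of pairs whose property improves, stays unchanged or deteriorates.

    deltas: property(compound 2) - property(compound 1) for each matched pair.
    tolerance: hypothetical threshold below which a change counts as "unchanged".
    """
    if not deltas:
        raise ValueError("at least one matched pair is required")
    counts = Counter()
    for delta in deltas:
        if delta > tolerance:
            counts["improved"] += 1
        elif delta < -tolerance:
            counts["deteriorated"] += 1
        else:
            counts["unchanged"] += 1
    return {
        label: counts[label] / len(deltas)
        for label in ("improved", "unchanged", "deteriorated")
    }


# Invented solubility differences for ten phenyl to pyridyl matched pairs.
deltas = [0.8, 0.5, 1.1, 0.6, 0.1, -0.2, 0.0, -0.7, -0.9, -0.4]
distribution = effect_distribution(deltas)
```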
In some cases, MMP effect analysis yields clear results showing for instance 90% improvement on a property, or 90% of unchanged property values, but the real-world scenario looks more like what you can see on the right side. Can you draw clear conclusions and decide which compound to make next with such a result?
In 2010 George Papadatos published a well-known paper together with the Willett group and people from the Stevenage site of GSK in the UK. In this paper they show that taking into account the context of a matched pair is important for extracting meaningful activity effects for transformation rules. Unsurprisingly, they saw that this was true for compound activity against hERG. However, they also showed that the same holds true for the analysis of solubility and lipophilicity of compounds. These results clearly show that the context of the transformation is important for the MMP principle to hold true.
But what is the Context of a Matched Molecular Pair?
The context of the matched molecular pair is the part of the common core directly attached to the fragment that is replaced.
The image above shows 3 different common core contexts that are frequently used in MMP analysis. Generally, people use up to four bonds from the attachment point of the fragment to represent the context of a transformation.
This context is then used to limit the extraction of activity effects to a subset of the transformations observed in a dataset, namely to all transformations occurring on the same context. Thus, rather than having one pie chart representing the effects of one transformation for a given property on all matched pairs, you end up with one pie chart per transformation & context. As you can imagine, this can substantially reduce the amount of data behind one activity effect analysis.
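The effect of the context on class sizes can be illustrated with a small sketch. The six pairs below and their context labels are made up, and the helper name is hypothetical; the point is only to show how grouping by transformation & context splits one large class into several smaller ones.

```python
from collections import defaultdict

# Six invented matched pairs, all showing the same transformation.
# "context_1" and "context_3" are made-up labels for the context at 1 and 3 bonds.
pairs = [
    {"transformation": "phenyl>>pyridyl", "context_1": "aromatic C", "context_3": "imidazole"},
    {"transformation": "phenyl>>pyridyl", "context_1": "aromatic C", "context_3": "imidazole"},
    {"transformation": "phenyl>>pyridyl", "context_1": "aromatic C", "context_3": "thiazole"},
    {"transformation": "phenyl>>pyridyl", "context_1": "amide N", "context_3": "benzamide"},
    {"transformation": "phenyl>>pyridyl", "context_1": "amide N", "context_3": "acetamide"},
    {"transformation": "phenyl>>pyridyl", "context_1": "aromatic C", "context_3": "oxazole"},
]


def class_sizes(pairs, *keys):
    """Return the size of each effect class, largest first, when grouping by keys."""
    classes = defaultdict(int)
    for pair in pairs:
        classes[tuple(pair[key] for key in keys)] += 1
    return sorted(classes.values(), reverse=True)


no_context = class_sizes(pairs, "transformation")
one_bond = class_sizes(pairs, "transformation", "context_1")
three_bonds = class_sizes(pairs, "transformation", "context_3")
```

In this toy dataset the single class of six pairs becomes two classes (four and two pairs) with a one-bond context, and five classes with a three-bond context, four of which contain a single pair.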
So the bigger the context, the more precise the MMP effect?
That's the general idea. However, as described previously, when considering a bigger context you might end up with a very small subset of all phenyl to pyridyl transformations. These subsets can become so small that the monitored effect cannot be shown to be statistically significant because of the small sample size. Thus, you usually need lots of experimental data to use MMP effect analysis successfully.
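To see why small subsets are a problem, consider a simple two-sided exact sign test on the direction of the property change. This test is chosen here purely for illustration and is an assumption; the post does not state which statistical test the Pipeline Pilot component uses. The function name is hypothetical as well.

```python
from math import comb


def sign_test_p_value(n_up, n_down):
    """Two-sided exact sign test: is one direction of change over-represented?

    n_up / n_down: number of pairs where the property increases / decreases.
    Under the null hypothesis both directions are equally likely.
    """
    n = n_up + n_down
    if n == 0:
        raise ValueError("at least one matched pair is required")
    k = max(n_up, n_down)
    # Probability of observing a split at least this one-sided, in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_three_pairs = sign_test_p_value(3, 0)    # 0.25
p_twelve_pairs = sign_test_p_value(12, 0)  # about 0.0005
```

Even if all three pairs of a small subset improve the property, this test cannot call the effect significant (p = 0.25), whereas twelve pairs pointing in the same direction give a p-value of about 0.0005.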
Ok, so I can also predict effects of transformations on activity versus a therapeutic target?
People tend to say so and even apply MMP analysis to off-targets. hERG is a popular example that we can find fairly often in the literature. However, imagine you are optimizing a chemical series against your favorite kinase. Will the addition of a trifluoro substituent on a phenyl ring improve the activity against your kinase in the same way as it would if something completely different were attached to it? Probably not, unless you properly consider the rest of the molecule. If the common core is the same or very similar between matched pairs showing the same transformation, then, and only then, could MMP analysis potentially be used for target activity analysis / prediction.
In other words, if you want to perform target activity analysis using matched molecular pairs, you are at the edge of what MMPA can currently provide, as you have to fully consider the context but also need lots of data.
Introduction to fuzzy context specific Matched Molecular Pairs (fcsMMP)
Here we introduce a concept that has already been presented at the German Conference on Chemoinformatics in 2013 and used in a recent paper by people from Boehringer Ingelheim. On the right side you can see again the example of a matched pair of compounds already used in the introduction. Here again, the fragment that is transformed is colored in red, the common core in gray.
However, beneath the common core you can now also see a set of nodes (circles) and edges (lines). These nodes and edges represent the pharmacophore graph of the common core.
The pharmacophore graph is an enhanced reduced graph representation of the original molecule. Well-defined fragments of the molecule (for instance an aromatic cycle or a particular functional group) are reduced to one node with a given pharmacophoric property. For instance, the dimethyl benzene group at the bottom of the molecule is reduced to one single node. The pharmacophoric features of this node are described by its aromaticity (black outline) and hydrophobicity (white fill color). Here we decided not to take terminal groups into account, so the methyl groups are also absorbed into the single aromatic node. In the end, this results in a fuzzy representation of this dimethyl benzene group. This inherent fuzziness can now be used to identify matched pairs using the pharmacophore graph version of the common core. In the example above, the common cores do not match exactly when using the molecular representation (note the missing methyl group on the benzene at the bottom of the molecule). If the pharmacophore graph is used, then both common cores match, and thus a fuzzy matched pair can be identified.
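The idea can be sketched with a toy reduction. Everything below is a deliberately simplified assumption: the lookup table is written by hand, the core is modelled as a simple chain of named fragments rather than a real graph with edges, and the function name is hypothetical. It is not the reduction implemented in the Chemistry Collection.

```python
# Hand-written toy lookup: fragment name -> (node type, pharmacophoric feature).
NODE_FEATURES = {
    "benzene": ("aromatic", "hydrophobic"),
    "imidazole": ("aromatic", "acceptor"),
}

# Terminal groups that are absorbed into the node they are attached to.
TERMINAL_GROUPS = {"methyl"}


def reduce_core(core):
    """Reduce a core, given as a chain of (fragment, substituents), to a fuzzy key."""
    nodes = []
    for fragment, substituents in core:
        # Terminal groups are dropped; any other substituent stays part of the key.
        kept = tuple(sorted(s for s in substituents if s not in TERMINAL_GROUPS))
        nodes.append((NODE_FEATURES[fragment], kept))
    return tuple(nodes)


core_1 = [("benzene", ["methyl", "methyl"]), ("imidazole", [])]  # dimethyl benzene
core_2 = [("benzene", ["methyl"]), ("imidazole", [])]            # one methyl missing
core_3 = [("benzene", ["methyl"]), ("benzene", [])]              # different features

same_fuzzy_core = reduce_core(core_1) == reduce_core(core_2)
different_fuzzy_core = reduce_core(core_1) != reduce_core(core_3)
```

The first two cores differ as molecules but collapse onto the same key, so pairs built on them can be treated as fuzzy matched pairs, while the third core keeps a distinct key.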
What about the context of the fragment?
On the right the same molecule as above is shown. Marked in green is the context of the fragment at 1, 2 and 3 bonds from the fragment using the pharmacophore graph representation.
Thus, including only one pharmacophore graph node into the context already adds the whole imidazole ring to the context.
Now, imagine using this context in order to create activity effect classes. Rather than relying on the imidazole representation of the context, we can rely on the pharmacophore graph representation. The context is here described as an aromatic ring containing a hydrogen bond acceptor. This fuzzy representation includes the imidazole ring, but would also accept other scaffolds that are very similar, like a pyridyl ring, a pyrrole or a pyridine for instance.
Ok, but why is this interesting?
As we have tried to demonstrate in the introduction, at some point you are confronted with the analysis of your matched pairs. All matched pairs are grouped into activity classes, where a transformation attached to a given context yields a certain effect. To obtain statistically significant and more precise effects from such an analysis, the context should be taken into account, and by context we mean a bit more than the typical representation of up to four bonds.
Furthermore, in order to perform MMP analysis for target-oriented activity prediction, one has to include the full context. However, including the full context in its atomic form may simply result in a single or very few matched pairs, especially for small datasets. So, in theory, there is no way to use classical MMP analysis in such a setting.
On the other hand, we can use a pharmacophore graph representation of the full common core to create activity effect classes. The pharmacophore graph is fuzzy enough to accept various distinct common cores that nevertheless share a common topology and pharmacophoric features.
But doesn't this add too much noise into your analysis?
This is a legitimate question. A lot of people would indeed think so, and in fact we are breaking the strict restrictions of the matched pair definition itself. As is also true for classical MMP analysis, the most important thing is to perform proper statistical testing of whether or not an observed activity effect is significant. If we can demonstrate that a given transformation significantly improves a property or activity, and if this transformation occurs on a topologically & pharmacophorically similar common core, we can at least suggest that the transformation occurring at the fragment position is responsible for the majority of the change in activity.
I definitely recommend reading the excellent paper by Christian Kramer on Significance and the Impact of Experimental Uncertainty in Matched Molecular Pairs Analysis.
We decided to integrate proper statistical testing and automated p-value calculations into our Pipeline Pilot Component calculating the activity classes for continuous and categorical data.
Application example & proof of concept
Next I'll show a few examples of how this concept can be applied to target activity prediction. You can also apply it to more classical property predictions. I focused on activity prediction here, as common MMP analysis settings do not allow you to extract meaningful transformation rules in that setting. Note that all analyses are target-specific and done on rather small datasets, a real challenge for currently available methodologies.
The following results were obtained using the 2.0 release of the Chemistry Collection with the following settings:
reduce common core to pharmacophore graph (topology & pharmacophoric features)
represent the context as full pharmacophore graph
keep full molecular structure on fragments
retain all activity classes with more than ten matched pairs for effect analysis
allow up to four cuts during molecular fragmentation
The fcsMMP integration in Pipeline Pilot furthermore allows you to calculate p-values assessing whether an observed shift in activity upon a transformation is statistically significant. Applying very strict p-value filtering, all extracted activity effects can be classified into 3 categories:
Activity cliffs: a change in activity is observed (p-value <= 0.001)
Cliff candidates: a change in activity is probably observed (0.001 < p-value <= 0.01)
Bioisosteric replacements: it is likely that the transformation does not alter the activity (p-value > 0.1)
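The three categories translate directly into a small classification function. The sketch below uses the thresholds listed above; the function name and the "unclassified" label for p-values between 0.01 and 0.1 (a range that the three categories do not cover) are assumptions added for illustration.

```python
def classify_effect(p_value):
    """Assign an activity effect class from a p-value, using the thresholds above."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= 0.001:
        return "activity cliff"
    if p_value <= 0.01:
        return "cliff candidate"
    if p_value > 0.1:
        return "bioisosteric replacement"
    # 0.01 < p-value <= 0.1 falls into none of the three categories;
    # the label below is a placeholder chosen for this sketch.
    return "unclassified"


labels = [classify_effect(p) for p in (0.0005, 0.005, 0.05, 0.5)]
```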
Epoxide hydratase Example
Above you can see an excerpt of the result you would get when analyzing activity data (pIC50) on epoxide hydratase extracted from ChEMBL (453 molecules). As an example, one transformation attached to a fuzzy context specific core is shown on the right, together with the activity histogram of this transformation over all fuzzy context specific matched pairs. The effect pie chart is shown as well (green: increase in activity). This is a clear example of an activity cliff where the transformation results in an increase in activity. On several common cores that all share globally the same topology and pharmacophoric features, the same transformation is likely to result in an increased activity.
Comparison to classical MMP analysis on public data
In order to highlight the power of the approach, we tried to use classical MMPs to derive activity cliffs and bioisosteric replacement rules and compared the outcome with that of fcsMMPs using the previously described settings. For classical MMP analysis we considered the following settings:
retain all activity classes with more than 5 or 10 matched pairs for effect analysis
allow up to four cuts during molecular fragmentation
minimum core size: 10 atoms
maximum fragment size: 10 atoms
The following results highlight how many activity cliffs, cliff candidates and bioisosteric replacement rules can be identified within the same dataset using both the fcsMMP approach (histograms on the left) and the classical MMP approach (histograms on the right).
The results obtained on epoxide hydratase are fairly impressive. Retaining only activity classes with 10 or more matched pairs, fcsMMP is able to identify 41 activity cliffs, 60 cliff candidates and 46 bioisosteric replacements, and this while considering the full common core as context. Classical MMP analysis, on the other hand, provides us with only 3 putative bioisosteric replacements and 1 cliff candidate, even when using the least precise context possible (1 bond).
When lowering the number of matched pairs required to form an activity effect class to only 5, classical MMP analysis is still not able to provide the user with sufficient information. fcsMMP, in contrast, again yields 41 activity cliffs, increases the number of cliff candidates to 104 and more than triples the number of bioisosteric replacements.
In light of these first results, it appears that the most precise fcsMMP approach increases the resolution of MMP data by a factor of at least 15 compared with the most permissive classical MMP approach (context of only 1 bond).
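As a quick sanity check, the figures quoted for epoxide hydratase with classes of 10 or more matched pairs can be compared directly. Reading "resolution" as the total number of effect classes is an assumption of this sketch, and so is the value of 0 classical activity cliffs, which is only implied by the text.

```python
# Figures quoted in the post for epoxide hydratase (classes of 10 or more pairs).
fcs_mmp = {"activity cliffs": 41, "cliff candidates": 60, "bioisosteric replacements": 46}
# Classical MMP, 1-bond context; 0 activity cliffs is implied, not stated, in the text.
classical_mmp = {"activity cliffs": 0, "cliff candidates": 1, "bioisosteric replacements": 3}

fcs_total = sum(fcs_mmp.values())              # 147 effect classes
classical_total = sum(classical_mmp.values())  # 4 effect classes
ratio = fcs_total / classical_total            # 36.75
```

For this dataset and threshold the ratio is well above the factor of 15 mentioned above.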
Results obtained for VEGFR2 show similar trends. When looking for activity effect classes with more than 10 matched pairs, no class is found at all using the classical MMP approach, while fcsMMP with the full context identifies 16 activity cliffs, 21 cliff candidates and 26 bioisosteric replacements.
CCR2 results show the same trend again. Interestingly, the dataset includes 365 distinct molecules, only slightly fewer than the previous datasets (epoxide hydratase and VEGFR2), and yet significantly fewer activity effect classes are found. This clearly indicates that the analysis depends on the chemical diversity of the dataset itself.
This dependency becomes even clearer with the example of GPR24 shown above. However, the method is still able to derive a significant number of activity effect classes, while classical MMP analysis is not.
As you may have noticed, the dataset size of the previous examples gradually decreased from 453 molecules to 344 molecules. What is the minimum amount of data required to still derive meaningful information? Above you can see an example on a dataset containing 62 molecules with pIC50 data against liver glycogen phosphorylase. Even with a dataset of this size, some information (little though it may be) can be extracted. Here again, the number of activity effect classes depends on the relative similarity of the ligands in the dataset. Thus I would definitely recommend using several hundred compounds if you can.
Another interesting observation is that the number of analysable activity effect classes obtained with fcsMMP varies only slightly between considering the context with only one bond and using the full context as pharmacophore graph.
Conclusion
The results shown here, as well as results that we haven't included, clearly indicate that using fcsMMP allows you to extract more informative rules from less data. It is also very important to highlight that this information is actually far more precise than the information obtained using classical approaches, where the context is either not considered at all or only considered up to n bonds. If you are not convinced about this point, look at the following example (only one of several):
Here we can clearly see that classical MMP merges a large variety of different scaffolds into the same activity class. This often results in noisy activity effects, as shown in the Papadatos paper, where no clear effect can be determined. Using fcsMMP and considering the full common core as context, the scaffolds of the cores in the same activity class actually share the same topology and pharmacophoric features.
We do hope that the use of fuzzy contexts will find widespread application in the analysis of property and activity data. Do not hesitate to contact us if you want to use the method, or if you have questions regarding the method, implementation or the results presented in this rather exhaustive blog post. Note that the data and protocols used to produce these results are shipped together with the Chemistry Collection 2.0.