In the following article, we describe Iktos's innovative model that is implemented in Makya predictors: the continuous classification model.
Models
Once molecules in a dataset have been transformed into a molecular representation, a mathematical model tries to detect patterns in the features that could explain the variations in the target property. A number of statistical models exist (linear models, decision trees, ensemble models, neural networks, etc.) and can leverage the features in different ways.
Since Makya generally uses relatively small datasets (< 5000 molecules), the type of model does not significantly impact the quality of the trained predictor. In Makya we mainly use linear models because they are simple and efficient, are less prone to unexpected behavior, and tend to generalize better.
Regression and classification
Depending on the input data, two prediction algorithm classes are traditionally considered:
- Regression models: for continuous data, regression models predict the value of the target objective – e.g. for some logD measurements, the model would predict a logD on the same scale.
- Classification models: for categorical data, e.g. the toxicity of a compound (toxic = 1, non-toxic = 0), a classification model discriminates between the classes and outputs a likelihood score for the compound being toxic. This score usually lies between 0 and 1 and reflects the confidence of the model in its prediction. For instance, a value of 0.2 would mean that the model is fairly confident that the compound is non-toxic, 0.8 that it is toxic, and at 0.5 the model would be uncertain.
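To make the reading of a classification score concrete, here is a minimal sketch of a linear classifier whose output is squashed into the 0 to 1 range with a logistic function. The weights, bias, feature values, and the `margin` used to call a score "uncertain" are all made up for illustration; they are assumptions, not values or functions taken from Makya.

```python
import math


def linear_classifier_score(features, weights, bias):
    """Score in (0, 1): a linear combination passed through a logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def interpret(score, margin=0.2):
    """Qualitative reading of a score; the margin is an illustrative assumption."""
    if score <= 0.5 - margin:
        return "likely non-toxic"
    if score >= 0.5 + margin:
        return "likely toxic"
    return "uncertain"


# Hypothetical model parameters and molecular features (illustration only).
weights = [1.5, -2.0, 0.5]
bias = -0.25
compound = [1.0, 0.0, 1.0]

score = linear_classifier_score(compound, weights, bias)
print(round(score, 3), interpret(score))
```

With these helper names, `interpret(0.2)`, `interpret(0.5)` and `interpret(0.8)` reproduce the three cases described above.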
Makya works exclusively with classification models, which predict classes (active/inactive) instead of continuous values (e.g. pIC50 or ADMET properties). While this choice has pros and cons (see below), classification models were found to provide better results overall.
Pros
- Output in [0, 1] for any target → easier to use in multi-objective settings, since all targets are on the same scale.
- Robust to anomalies in the distribution of the initial dataset (e.g. those caused by the sensitivity threshold of a biological assay). An example distribution is shown below.
Cons
- Loss of information. The notion of order present in the initial continuous data is lost when separating compounds between active and inactive categories.
- Output not straightforward to interpret.
- Threshold effect around the TPP value: with a pIC50 threshold of 8, a compound with 7.9 would be inactive and another one with 8.1 active, while their activities are actually quite similar.
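The loss of ordering and the threshold effect can both be seen in a few lines. The sketch below binarizes pIC50 values at the threshold of 8 used in the example above; the function name `binarize` and the choice to count a value exactly equal to the threshold as active are assumptions made for illustration.

```python
def binarize(pic50, threshold=8.0):
    """Hard label: 1 (active) if pIC50 >= threshold, else 0 (inactive)."""
    return 1 if pic50 >= threshold else 0


measurements = [6.0, 7.9, 8.1, 9.5]
labels = [binarize(value) for value in measurements]

for value, label in zip(measurements, labels):
    print(value, "active" if label else "inactive")
```

After binarization, 6.0 and 7.9 are indistinguishable (both inactive), while 7.9 and 8.1 fall into opposite classes even though they differ by only 0.2 log units.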
To address the last two points, we implemented an improved classification strategy, which we term Continuous Classification.
The continuous classification model
To overcome the limitations of classification models while retaining their advantages, Iktos has developed a model that behaves as a classification model while still leveraging the continuous information, as a regression model would. The crucial advantage of this model is that it behaves as a classification model when the value is far from the threshold (e.g. for clearly inactive compounds), but as a regression model locally around the threshold, which is often the area of greatest interest. This ensures that the information carried by data points close to the threshold is not simply discarded. As an example, see below an illustration of how this model behaves compared to a plain classification model on two biological targets (PI3K and mTOR).
Let us compare points around the threshold of 7 in the PI3K example (top row). Consider two compounds whose activities are 6.9 (inactive) and 7.1 (active). These two compounds have very similar activities. Despite that, a classical classification model could give them scores close to 0 (for the inactive compound) and 1 (for the active one). This is illustrated by the sigmoid curve of predictions vs true values, which spreads the values around the threshold along the y-axis. In contrast, Iktos's continuous classification model would score all threshold-adjacent compounds within a narrower range, from 0.3 to 0.7. As a result, our two compounds would end up with closer predicted scores, which reflects how close their true activity values are.
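The article does not give the formulation of the continuous classification model, so the following is only a sketch of one possible way to obtain the described behavior, not Iktos's actual method. It replaces the hard 0/1 label with a hypothetical "soft" target that is linear in the activity near the threshold and saturates at 0 or 1 far from it. The function names and the `half_width` parameter are assumptions introduced for illustration.

```python
def hard_label(activity, threshold=7.0):
    """Plain classification target: 0 below the threshold, 1 at or above it."""
    return 1.0 if activity >= threshold else 0.0


def soft_label(activity, threshold=7.0, half_width=1.0):
    """Hypothetical soft target (an assumption, not the Makya formulation).

    Linear in the activity within `half_width` of the threshold
    (regression-like), clipped to 0 or 1 beyond it (classification-like).
    """
    raw = 0.5 + (activity - threshold) / (2.0 * half_width)
    return min(1.0, max(0.0, raw))


for activity in [5.0, 6.9, 7.1, 9.0]:
    print(activity, hard_label(activity), round(soft_label(activity), 2))
```

Under these assumptions, the compounds at 6.9 and 7.1 receive soft targets of about 0.45 and 0.55 instead of 0 and 1, while the clearly inactive compound at 5.0 and the clearly active one at 9.0 keep targets of 0 and 1.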