The test performance
Estimating the performance of a model requires setting aside part of the initial dataset: this is the train/test split (called the outer split). The model is fit on the train set and its performance is computed on the test set. There are multiple ways of splitting the data, and the choice can dramatically impact the measured performance. Makya uses a random split, because its models are then applied to generated molecules close to the train set. Other possible splits include a time split (most recent compounds in the test set, the rest in the train set) or a scaffold split; these splits are not implemented in Makya.
The outer split differs depending on the number of datapoints available:
- If there are 75 or fewer available values for a target, the outer split is a leave-one-out: each molecule is held out in turn, the parameters are optimized on the rest of the dataset, and the prediction for the held-out molecule is computed. Finally, the predictions are aggregated and used to compute the performance.
- For in-between cases, i.e. targets with 76 - 150 available values, the outer splitter is a leave-one-group-out. It works like leave-one-out, but each held-out group contains 10% of the dataset.
- For regular targets with 151 - 1500 available values, the splitter is a stratified 4-fold.
- For targets with more than 1500 available values, the splitter is a plain stratified random split.
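The selection rule above can be sketched as a simple threshold lookup. This is a hypothetical illustration: the function name `choose_outer_splitter` and the returned labels are assumptions, not Makya's actual API.

```python
# Hypothetical sketch of the outer-splitter selection rule described above.
# The function name and the returned labels are illustrative, not Makya's API.

def choose_outer_splitter(n_values: int) -> str:
    """Map the number of available values to an outer split strategy."""
    if n_values <= 75:
        return "leave-one-out"
    if n_values <= 150:
        return "leave-one-group-out"   # groups of 10% of the dataset
    if n_values <= 1500:
        return "stratified 4-fold"
    return "stratified random split"
```

For example, a target with 800 measured values would fall into the stratified 4-fold regime.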
Note that the same seed is always used for a given dataset, so it is split into inner and outer sets (and K-folds) in exactly the same way whenever the same predictor is built again. As a result, the user should get highly similar, if not identical, metrics every time.
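The determinism point can be made concrete with a minimal seeded split, assuming a fixed-seed shuffle (illustrative only; Makya's actual splitting code is not shown here):

```python
import random

# Minimal sketch of a seeded split: the same seed always produces the same
# partition, which is why repeated builds yield near-identical metrics.
# (Illustrative only, not Makya's API.)

def seeded_split(n_samples: int, seed: int, test_fraction: float = 0.25):
    """Shuffle indices deterministically and cut off a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * (1 - test_fraction))
    return indices[:cut], indices[cut:]
```

Calling `seeded_split(100, seed=7)` twice returns the exact same train and test indices.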
For all cases with multiple folds, the out-of-fold predictions and values are concatenated and the metrics are computed on the full vector.
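This aggregation can be sketched as follows, using mean absolute error as a stand-in metric (the function names are illustrative, not Makya's API):

```python
# Sketch of the metric aggregation: out-of-fold predictions are concatenated
# into one vector and the metric is computed once on it (illustrative code).

def metric_over_folds(folds, metric):
    """folds: list of (y_true, y_pred) pairs, one per outer fold."""
    y_true = [y for true, _ in folds for y in true]
    y_pred = [y for _, pred in folds for y in pred]
    return metric(y_true, y_pred)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Computing one metric on the concatenated vector weights every datapoint equally, instead of averaging per-fold metrics computed on folds of possibly unequal sizes.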
The validation performance
Aside from the outer split used for model performance estimation, another split is made for parameter optimization: the inner split, which divides the data into train and validation sets. It is performed after the outer split, on its train_outer sets. There are thus three layers of splits: the train, validation and test sets (the train_outer set of the outer split corresponds to train_inner + val_inner). If the outer split yields multiple train/test sets (e.g. a 4-fold), an independent inner split is applied to each train set.
The optimization algorithm (Bayesian optimization in Makya) then picks parameters, the model is fit with these parameters on the train_inner sets, and the model quality metric is computed on the val_inner sets. The optimization algorithm maximizes the quality metric on the val_inner sets. Once the optimization is finished, the best parameters are used to fit a model on train_inner + val_inner, whose performance is computed on the test set.
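The inner loop can be sketched as below. A plain exhaustive search over candidates stands in for Makya's Bayesian optimization, and `fit` and `score` are illustrative callables, not Makya's API:

```python
# Sketch of the inner optimization loop: each candidate is fit on every
# train_inner set, scored on the matching val_inner set, and the candidate
# with the best mean validation score wins. Exhaustive search stands in for
# Makya's Bayesian optimization; `fit`/`score` are illustrative callables.

def best_parameters(candidates, inner_splits, fit, score):
    """Return the candidate with the highest mean validation score."""
    def mean_val_score(params):
        scores = [score(fit(params, train), val) for train, val in inner_splits]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_val_score)
```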
Similarly to the outer split, the inner split depends on the number of datapoints available:
- If there are 1500 or fewer available values for a target, a stratified 5-times-repeated 2-fold is used as the inner split.
- For targets with 1501 - 5000 available values, a stratified 4-fold split is used as the inner split.
- For targets with more than 5000 available values, a stratified 2-fold split is used as the inner split.
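As with the outer split, the selection rule reduces to a threshold lookup; the sketch below is a hypothetical illustration (labels and function name are assumptions, not Makya's API):

```python
# Hypothetical sketch of the inner-splitter selection rule described above.
# The function name and labels are illustrative, not Makya's API.

def choose_inner_splitter(n_values: int) -> str:
    """Map the number of available values to an inner split strategy."""
    if n_values <= 1500:
        return "stratified 5x repeated 2-fold"
    if n_values <= 5000:
        return "stratified 4-fold"
    return "stratified 2-fold"
```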
In Makya, the validation performance is the average performance on all the validation sets.
Final optimization
The procedure described previously yields an unbiased performance estimate by leaving a test set out. However, the "best" parameters it produces are sub-optimal, because they were selected without ever using the test set data. That is why a final optimization is run on the whole dataset, without an outer split. The resulting model is the final one used in Makya.
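This final step can be sketched in two lines, assuming `optimize` and `fit` callables like those in the inner-loop sketch (both are illustrative, not Makya's API):

```python
# Sketch of the final step: the optimization is re-run on the full dataset
# (inner split only, no outer split) and the winning parameters are used to
# fit the model that is actually deployed. `optimize` and `fit` are
# illustrative callables, not Makya's API.

def build_final_model(dataset, optimize, fit):
    best_params = optimize(dataset)  # no test set is held out here
    return fit(best_params, dataset)
```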
Figure: Illustration of the outer and inner splits for performance estimation.