Introduction
The Fine Tuning generator is designed to find optimal molecules that satisfy many constraints simultaneously (models, descriptors, substructures, similarity...). It is not chemistry-based and therefore does not work under the constraint of generating feasible molecules.
The Fine Tuning generator generates molecules close to the chemical space because the goal is not to find something different or novel, but to find a molecule that achieves a consensus across the constraints. It is most often used with QSAR models, which by design require staying close to the training dataset because of their applicability domain constraints.
Use the Fine Tuning generator when the search is complex and fragment-based generation is not applicable.
NOTE:
- Experiment with the chemical space: changing it guides the Fine Tuning generator in a different direction, as described in this article.
- Do not hesitate to add your own ideas as starting points.
Example Use-Case
Set-up
In theory, every set-up option is optional, but you are encouraged to use your best judgement from a chemistry perspective to obtain meaningful results within the applicability domain of the project.
Chemical Space
Here, select one or more datasets or specify SMILES to guide the generation. This is strongly recommended: without a chemical space, the generator is not focused on your series and generates diverse molecules from a very broad chemical space.
Chemical constraints
Products: In this panel, add constraints on the final generated molecules. Input the SMARTS for required and/or forbidden substructures, or select a list of forbidden PAINS/Tox.
Rewards
Scoring APIs: Plug in external scores or models that can be accessed to guide this generation (see Scoring APIs).
QSAR: Select among trained QSAR Models to guide the generation around a defined target product profile (see QSAR). (Be careful to select Models whose applicability domain covers the chemical space you are exploring!)
Additional Scores
Post-Processors: Select any scores that will be calculated during the generation and available in post-processing (see Scorers).
Advanced
The Fine Tuning generator is a sequence-based generator, i.e. it generates SMILES strings. It is designed to reproduce the distribution of molecules in a training dataset. It has been pre-trained on large public databases (such as ChEMBL) to learn the SMILES syntax. Without any further training, it generates diverse molecules from a very broad chemical space, which is of little interest when working on specific series or chemotypes. This is why the chemical space is so crucial: the pre-trained generator is fine-tuned on the chemical space to learn its distribution and generate compounds within it. This initialization is a preliminary step before optimization: first the global generator is focused on the chemical space, then it starts optimizing the given predicted properties.
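To make "sequence-based" concrete, the sketch below splits a SMILES string into the kind of tokens a sequence model could read one at a time. The regular expression and the `tokenize` function are illustrative assumptions covering common SMILES syntax; they are not Makya's actual tokenizer, which is not described here.

```python
import re

# Illustrative token pattern (an assumption, not Makya's tokenizer).
# Order matters: bracket atoms and two-letter halogens must be tried
# before single-letter atoms, otherwise "Cl" would split into "C" + "l".
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"              # two-letter organic-subset atoms
    r"|[BCNOSPFI]"         # one-letter aliphatic atoms
    r"|[bcnosp]"           # aromatic atoms
    r"|%\d{2}|\d"          # ring-closure labels
    r"|[=#$:/\\\-+().~*]"  # bonds, branches, disconnections, wildcard
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens.

    Raises ValueError if some characters are not covered by the pattern,
    so that unsupported input is not silently truncated.
    """
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin, read as a sequence of tokens rather than as a graph.
    print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))
```

A generator of this kind emits such tokens one after another, which is why it first has to learn the SMILES syntax before it can be focused on a specific chemical space.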
During the initialization step, not one but several generators (called agents) each focus on a subset of the input chemical space. If the chemical space is diverse (e.g. with multiple chemical series), each agent focuses on a different part of it, so that the full potential of the chemical space is exploited. Keep this in mind when designing a chemical space: compounds that are of no interest should not be included, otherwise resources will be spent training agents on them. A typical case is an initial dataset that contains historical data from discarded series. This data should be removed from the chemical space, or else Makya will generate within these series. The predictors, however, should still be trained on the whole dataset.
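The data-preparation advice above can be sketched as follows. The record fields (`smiles`, `series`), the series names, and the helpers `build_chemical_space` and `group_by_series` are hypothetical. Grouping by series label only illustrates how a diverse space splits into subsets; how Makya actually assigns subsets to agents is not specified here.

```python
from collections import defaultdict


def build_chemical_space(records, discarded_series):
    """Keep only the records whose series has not been discarded."""
    return [r for r in records if r["series"] not in discarded_series]


def group_by_series(records):
    """Group SMILES by series label (illustrative stand-in for agent subsets)."""
    groups = defaultdict(list)
    for record in records:
        groups[record["series"]].append(record["smiles"])
    return dict(groups)


# Hypothetical project data: series "B" is historical and was abandoned.
dataset = [
    {"smiles": "c1ccccc1O", "series": "A"},
    {"smiles": "c1ccccc1N", "series": "A"},
    {"smiles": "C1CCNCC1", "series": "B"},
    {"smiles": "c1ccncc1C", "series": "C"},
]

# Predictors are trained on everything, including the discarded series.
predictor_training_set = dataset

# The chemical space given to the generator excludes the discarded series.
chemical_space = build_chemical_space(dataset, discarded_series={"B"})

# Each remaining series forms one subset of the chemical space.
subsets = group_by_series(chemical_space)
```

Keeping the two collections separate means the QSAR models still learn from the discarded series, while no generation effort is spent inside them.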