Any Makya Dataset can be selected as a Chemical Space in a generator set-up. The Tanimoto similarity to this Chemical Space then acts as a reward that guides the generation. Simply put, the Chemical Space tells the generator what sort of molecules the user wants to see.
In this article, we will review how molecules in the chemical space affect the outcome of generators in Makya.
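To make the similarity reward concrete, here is a minimal sketch of the Tanimoto coefficient, T(A, B) = |A ∩ B| / |A ∪ B|, computed on sets of "on" fingerprint bits. The bit indices are made-up toy values, and the helper names (`tanimoto`, `similarity_to_space`) are hypothetical. Taking the maximum over the reference molecules is only one plausible way to score a candidate against a multi-molecule chemical space; it is an assumption, not a description of Makya's internals.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return len(a & b) / union


def similarity_to_space(candidate, space):
    """Score a candidate against a chemical space (ASSUMPTION: best match wins)."""
    return max(tanimoto(candidate, reference) for reference in space)


# Toy fingerprints (hypothetical bit indices, not real molecules).
chemical_space = [{1, 4, 7, 9, 12}, {2, 3, 5, 8}]
close_candidate = {1, 4, 7, 9, 15}   # shares 4 of 6 distinct bits with the first reference
far_candidate = {20, 21, 22}         # shares no bits with any reference

print(round(similarity_to_space(close_candidate, chemical_space), 3))  # 0.667
print(similarity_to_space(far_candidate, chemical_space))              # 0.0
```

A candidate that shares most of its bits with a reference molecule scores close to 1, which is why generated molecules tend to stay structurally near the chosen chemical space.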
Set-up
As an example, we will run two instances of the Fine Tuning (FT) generator, each with one of the following molecules as its chemical space:
The molecules are labelled Chemical Space-1 (CS1) and Chemical Space-2 (CS2) respectively.
They differ only in the shaded areas: CS1 has a piperazine ring while CS2 has a pyrazolopyridine ring. The rest of the structure is identical.
In FT generator #1, we will use CS1 as the chemical space and in FT generator #2 we will use CS2.
In both instances, we apply the following substructure constraint, which the generated molecules must match. It allows the generators to modify the amide nitrogen (because of its single explicit hydrogen) and the aromatic ring attached to it, while keeping the rest of the molecule the same:
SMARTS: [#6]-[#6]-c1ncnc(c1C#Cc1ccc(-[#7])nc1)-*:1:*:*:*(:*:*:1)-[#6](-[#7])=O
Results
At the end of their runs, FT generators #1 and #2 produced 9625 and 5269 molecules respectively.
FT generator #1
As expected, FT generator #1 produces many molecules very similar to CS1, with the six-membered piperazine ring present in the top molecules, as shown below.
Notice that all these molecules differ from each other and from CS1 by only "fine" modifications, such as methyl or amino groups being added or removed:
However, a simple substructure match shows that only 2436 of the 9625 molecules (25.3%) contain the piperazine ring.
The rest contain closely related five-, six- and seven-membered nitrogen heterocycles such as pyrrolidine, piperidine and azepane derivatives:
FT generator #2
Similarly, FT generator #2 designs molecules related to CS2's pyrazolopyridine ring:
But what does this chemical space look like compared with the molecules from FT generator #1?
Morgan fingerprints (2048 bits) were computed for all output molecules from both generators, and three different dimensionality-reduction algorithms were used to produce a lower-dimensional representation.
The resulting 2D coordinates were plotted as scatter plots.
The following visualisation illustrates how the two generators explored chemical space relative to each other:
The legend entries "1" and "2" represent molecules explored around CS1 and CS2 respectively.
Notice that in the t-SNE and UMAP plots there is very little overlap between the two sets of molecules, indicating that the outputs of the two generators diverge significantly, each staying close to its own reference molecule (CS1 or CS2).
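The projection step described above can be illustrated with a small, dependency-free sketch. It assumes PCA as the reduction method and uses tiny made-up 8-bit vectors in place of the 2048-bit Morgan fingerprints. The function names are hypothetical, and the power-iteration approach is chosen only to keep the example self-contained. In practice a library implementation of PCA, t-SNE or UMAP would be used instead.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-column mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    return [[r[j] - means[j] for j in range(dim)] for r in rows]


def leading_axis(rows, iterations=200, seed=0):
    """Leading eigenvector of X^T X for centred rows X, found by power iteration."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        scores = [dot(r, v) for r in rows]                                      # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(dim)]   # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:  # degenerate data with no variance left
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows):
    """Project rows onto their first two principal axes."""
    x = center(rows)
    pc1 = leading_axis(x, seed=0)
    s1 = [dot(r, pc1) for r in x]
    # Deflation: remove the PC1 component so the next power iteration finds PC2.
    residual = [[a - s * p for a, p in zip(r, pc1)] for r, s in zip(x, s1)]
    pc2 = leading_axis(residual, seed=1)
    s2 = [dot(r, pc2) for r in residual]
    return list(zip(s1, s2))


# Toy "fingerprints" (hypothetical): set 1 mostly uses the first bits, set 2 the middle bits.
set_1 = [[1, 1, 1, 0, 0, 0, 1, 1], [1, 1, 0, 0, 0, 0, 1, 1], [1, 1, 1, 0, 0, 0, 1, 0]]
set_2 = [[0, 0, 0, 1, 1, 1, 1, 1], [0, 0, 1, 1, 1, 0, 1, 1], [0, 0, 0, 1, 1, 1, 1, 0]]

coords = pca_2d(set_1 + set_2)
pc1_set_1 = [c[0] for c in coords[:3]]
pc1_set_2 = [c[0] for c in coords[3:]]
for label, (x, y) in zip("111222", coords):
    print(label, round(x, 2), round(y, 2))
```

In this toy case the two sets land on opposite sides of the first principal axis, which is the same kind of separation the scatter plots show for the two generators' outputs. The sign of each axis is arbitrary, so only the relative positions are meaningful.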
A visualisation based on RDKit descriptors using PCA shows a similar distinction between the two chemical spaces:
Observations and conclusions
- The chemical space chosen by the user strongly impacts the outcome of generation. Adjusting the chemical space (in particular, building one from a chemist's own ideas) can steer the generation in the desired direction.
- The above scenario describes a Fine-Tuning generator, but similar results can be expected using the Fragment Growing or Fragment Linking generators.
NOTE: When a Makya predictor (or QSAR model) is selected to guide generation, the generated molecules are optimized with respect to the predicted scores, not directly for Tanimoto similarity to the training dataset. To also reward similarity to the training dataset, select it (or a subset of it) as a chemical space.