Problem: How can you fine-tune and optimize an existing chemical series with promising activity, ADME, or biological endpoints using Makya?
The following guide walks through this process step-by-step. You can download the dataset referenced here.
Dataset Overview
The dataset used in this guide is based on a PI3K-mTOR chemical series. It includes:
- SMILES strings representing chemical structures
- Multiple objective columns, such as:
  - pKi IC50 PI3K
  - pKi IC50 mTOR
  - Water Solubility
  - Caco-2 Permeability
  - CYP1A2 inhibitor
  - CYP3A4 inhibitor
  - Total Clearance
Step 1: Upload the project data
Chemical Space
We will first upload our full dataset, which will be used as a reward (so that Tanimoto similarity to these molecules informs the generation process), and as a training set to build a QSAR model.
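As a rough illustration of the similarity measure mentioned above, the sketch below computes the Tanimoto coefficient between fingerprints represented as sets of on-bit indices. The guide does not say which fingerprint Makya uses or how similarity to the chemical space is aggregated; the set representation, the function names, and taking the maximum over the reference set are all assumptions made for illustration.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def max_similarity_to_space(candidate: set, chemical_space: list) -> float:
    """Highest Tanimoto similarity between a candidate and any reference fingerprint.

    Using the maximum is an assumption; the tool may aggregate differently.
    """
    return max((tanimoto(candidate, ref) for ref in chemical_space), default=0.0)


# Toy fingerprints: hypothetical bit indices, not derived from real molecules.
reference_space = [{1, 4, 7, 9}, {2, 4, 8}]
candidate = {1, 4, 7, 10}

print(max_similarity_to_space(candidate, reference_space))
```

The candidate shares three of five distinct bits with the first reference fingerprint, so that pair dominates the score.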
Go to the "Datasets" tab and upload the previously downloaded dataset as a CSV file. The following image shows a screenshot of the example PI3K-mTOR dataset in MS Excel. This dataset is available in the default project of your Makya installation. It contains one column of SMILES and further columns with values for objectives such as pKi IC50 Pi3K, pKi IC50 mTor, Water Solubility, and Caco2 permeability.
To upload the dataset to Makya, click New Dataset, then either drop your file or click Click to select one and navigate to the location of the CSV file. Once uploaded, the dataset appears with the SMILES replaced by 2D chemical structures. Click Start Cleaning; once cleaning is done, name the dataset Data_pi3k_mtor (or any other name of your choice) and save it.
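Before uploading, it can help to confirm that the CSV has the columns you expect. The following is a minimal sketch using only the Python standard library; the exact column spellings, the `check_dataset` helper, and the inline sample data are assumptions for illustration and are not part of Makya.

```python
import csv
import io


def check_dataset(handle, expected_columns, smiles_column="SMILES"):
    """Return (missing column names, number of rows with an empty SMILES cell)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in expected_columns if col not in header]
    empty_smiles = 0
    if smiles_column in header:
        empty_smiles = sum(
            1 for row in reader if not (row[smiles_column] or "").strip()
        )
    return missing, empty_smiles


# Hypothetical sample; the SMILES are placeholders, not PI3K-mTOR compounds.
sample_csv = """SMILES,pKi IC50 Pi3K,pKi IC50 mTor
CCO,6.1,7.0
,7.2,8.1
"""
expected = ["SMILES", "pKi IC50 Pi3K", "pKi IC50 mTor", "Water Solubility"]

missing, empty_smiles = check_dataset(io.StringIO(sample_csv), expected)
print(missing, empty_smiles)
```

For a file on disk, pass `open(path, newline="")` instead of the `io.StringIO` object.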
Fine Tuning Space
Prepare a subset of the larger dataset (the Fine Tuning dataset must contain 100 molecules or fewer) and upload it in the Datasets tab, following the same procedure as for the larger dataset. This smaller subset will serve as the fine-tuning space during molecular generation.
A good subset is a selection of potent, interesting molecules that a chemist would naturally use as a starting point for brainstorming.
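One simple way to draft such a subset is to rank the molecules by a potency column and keep at most 100 of them. The sketch below does this with the standard library; ranking on a single column is a simplification of a chemist's judgment, and the column name, helper name, and sample rows are assumptions for illustration.

```python
import csv
import io

MAX_FINE_TUNING_SIZE = 100  # size limit stated in this guide


def select_fine_tuning_subset(rows, potency_column, max_size=MAX_FINE_TUNING_SIZE):
    """Return the most potent rows (highest value first), capped at max_size."""
    ranked = sorted(rows, key=lambda row: float(row[potency_column]), reverse=True)
    return ranked[:max_size]


# Hypothetical sample; the SMILES are placeholders, not PI3K-mTOR compounds.
sample_csv = """SMILES,pKi IC50 Pi3K
CCO,6.1
CCN,8.3
CCC,7.4
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))

subset = select_fine_tuning_subset(rows, "pKi IC50 Pi3K", max_size=2)
print([row["SMILES"] for row in subset])
```

The resulting rows can be written back out with `csv.DictWriter` and reviewed by hand before upload.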
Step 2: Train the QSAR model
Now that the data is uploaded, you can build predictive models that can be used to guide the generation process. To do so, go to the QSAR tab and click New QSAR. On the following page, select Data_pi3k_mtor, the larger dataset that you uploaded in the previous step.
Seven objectives will appear on the screen. Use the sliders next to each objective to define the thresholds for the five objectives listed below. Make sure that the numbers of molecules falling inside and outside each threshold (indicated by Molecules In and Molecules Out) stay roughly balanced. As an alternative to the sliders, you can enter the desired value directly in the min or max threshold box for each objective. For this example use case, set the following thresholds.
| Objective | Threshold |
| --- | --- |
| pKi IC50 Pi3K | 7 (min) |
| pKi IC50 mTor | 8.5 (min) |
| Water Solubility | -4.0 (max) |
| Caco2 permeability | 0.9 (max) |
| Total Clearance | 0.5 (min) |
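The balance between Molecules In and Molecules Out can also be estimated offline before choosing a threshold. The sketch below counts both groups for a min or max threshold; treating the threshold itself as inclusive is an assumption (the guide does not say how Makya handles values equal to the threshold), and the values shown are made up.

```python
def count_in_out(values, threshold, kind):
    """Count values inside and outside a threshold.

    kind="min": a value is "in" when value >= threshold.
    kind="max": a value is "in" when value <= threshold.
    Inclusive comparison is an assumption, not documented behaviour.
    """
    if kind not in ("min", "max"):
        raise ValueError("kind must be 'min' or 'max'")
    if kind == "min":
        n_in = sum(1 for v in values if v >= threshold)
    else:
        n_in = sum(1 for v in values if v <= threshold)
    return n_in, len(values) - n_in


# Made-up pKi values for illustration only.
pki_pi3k = [6.2, 6.9, 7.0, 7.4, 8.1, 8.8]

molecules_in, molecules_out = count_in_out(pki_pi3k, 7.0, "min")
print(molecules_in, molecules_out)
```

If one group is much larger than the other, moving the threshold toward the median of the column is a reasonable first adjustment.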
After setting the thresholds, click Train QSAR and name the model. If you go back to the QSAR tab, the newly created predictor should appear on this page.
Once the QSAR model is trained, clicking on the See Results button will display the ROC curves and associated metrics for each model trained on each objective. These can be used to assess the performance or quality of the trained models.
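For readers unfamiliar with the metric behind a ROC curve, the sketch below computes the area under the curve (AUC) as the fraction of positive/negative pairs in which the positive molecule receives the higher score, with ties counted as half. This is a generic textbook formulation, not a description of how Makya computes its metrics; the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented example: 1 = molecule inside the threshold, 0 = outside.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(roc_auc(labels, scores))
```

Here 8 of the 9 positive/negative pairs are ranked correctly. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.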
NOTE: If any of the models is deemed unsatisfactory, either (1) improve the quality and quantity of the provided data or (2) rerun the QSAR process with relaxed thresholds.
Clicking the Test button lets you input an individual SMILES or a file of SMILES to be evaluated by the model.
Step 3: Create the generator
Finally, to generate molecules guided by the QSAR model, go to the "Generators" tab of the project page and click New Generator. Name the generator and set "Generation Engine" to Fine Tuning (beta).
On the following page specify the various options to set up this generator.
- Fine Tuning Space: Check the box next to the smaller dataset that you uploaded in Step 1. This dataset is used to fine-tune the generator, which learns from the retrosynthetic analysis of this subset of molecules to generate new molecules in the same synthesis space.
- Chemical Space: Check the box next to Data_pi3k_mtor, the larger dataset that was created and uploaded in Step 1. With this selection, Tanimoto similarity to this chemical space will be used to guide the generation.
- Products: In the "Substructures" tab, click the Add button under "Molecules must match at least one of these substructures (matching set)", enter the following SMARTS string, and click Save:
[#6]-[#6]-c1ncnc(c1C#Cc1ccc(-[#7])nc1)-*:1:*:*:*(:*:*:1)-[#6](-[#7])=O
- QSAR: Check the box next to Lead_optimization_predictor (the QSAR model trained in Step 2). This should also check the boxes for all the individual predictor models for the objectives trained in Step 2.
Go back to the "Generators" tab, and the newly set up generator will show up. Click on the Run button for this generator.
Step 4: Explore the generated molecules
While the generation is still running, you can visualize the generated molecules that match the specified blueprint by clicking the eye icon. On the page that appears, go to Parallel Coordinates and set the Confidence metric to >0.7 by holding down the mouse button on this metric and dragging it down to 0.7, as shown in the image below.
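The same confidence cut can be expressed outside the interface if the generated molecules are available as records. The sketch below is purely illustrative: the record layout, the `confidence` field name, and the `filter_by_confidence` helper are hypothetical and do not describe a Makya export format.

```python
def filter_by_confidence(molecules, min_confidence=0.7):
    """Keep records whose confidence is strictly greater than min_confidence."""
    return [m for m in molecules if m["confidence"] > min_confidence]


# Hypothetical records; SMILES are placeholders, not generated compounds.
generated = [
    {"smiles": "CCO", "confidence": 0.91},
    {"smiles": "CCN", "confidence": 0.70},
    {"smiles": "CCC", "confidence": 0.42},
]

kept = filter_by_confidence(generated)
print([m["smiles"] for m in kept])
```

The comparison is strict to mirror the ">0.7" setting, so a molecule at exactly 0.70 is excluded.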