Datasets can be uploaded into Makya in order to:
- build QSAR models,
- define a chemical space that will be used to guide generation,
- be rescored using (2D or 3D) scores available in Makya.
How to create a new Dataset
A dataset is created by clicking on the New Data Set button, importing a CSV or SDF file directly from the user’s computer, and following the subsequent instructions.
NOTE: a CSV file uploaded as a Makya dataset must comply with the following requirements:
- One column must contain the SMILES of each molecule and be named “smiles” or “SMILES” (lower or upper case)
- All columns should have a name
If there are errors in the file, Makya will fail to upload the dataset and alert the user.
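The two CSV requirements above can be checked locally before uploading. The following is only a minimal sketch using the Python standard library; the function name `check_csv_for_upload` is hypothetical and is not part of Makya, and the check is assumed to accept exactly the two spellings “smiles” and “SMILES”. Makya's own validation may be stricter or report errors differently.

```python
import csv
import io


def check_csv_for_upload(text):
    """Return a list of problems found in the CSV header (empty if none).

    Hypothetical helper, not a Makya function: it only mirrors the two
    documented requirements (a SMILES column and named columns).
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    # Requirement: all columns should have a name.
    for index, name in enumerate(header, start=1):
        if not name.strip():
            problems.append(f"column {index} has no name")
    # Requirement: one column is named "smiles" or "SMILES".
    # Assumption: mixed-case spellings such as "Smiles" are not accepted.
    if not any(name.strip() in ("smiles", "SMILES") for name in header):
        problems.append('no column named "smiles" or "SMILES"')
    return problems
```

For example, `check_csv_for_upload(open("my_data.csv").read())` returns an empty list when the header satisfies both requirements (the file name here is a placeholder).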
For each column of the uploaded dataset, the user can decide if it should be imported or cleaned and can change the data type (the types are SMILES, Number, String/Data). Columns of type Number (int or float) are treated as target columns, on which predictor models can be trained (see the Predictors section for details).
Makya then automatically cleans the data file according to the following steps:
- Molecules are standardized (removal of salts, conversion of radicals...)
- Target columns are cleaned and parsed:
  - Cells containing non-numerical values or placeholders (such as “N/A” or “–”) become blank
  - Values that contain symbols such as “<”, “>”, “=” are converted to their numeric values (e.g. “>7” becomes “7”)
- Chirality is removed:
  - Chiral molecules are converted to their achiral version
  - Duplicate rows (based on the achiral InChIKey) are merged
For more information about why we remove chirality, see the corresponding page of the documentation.
The user can review the different steps and outcomes of the cleaning process.
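To make the target-column rules above concrete, the sketch below approximates them in plain Python. It is an illustration under stated assumptions, not Makya's implementation: the function name `clean_target_value`, the exact placeholder set, and the choice to return `None` for a blank cell are all hypothetical.

```python
import re

# Assumed placeholder spellings; the documentation only names "N/A" and "–".
PLACEHOLDERS = {"", "n/a", "na", "-", "–"}

# Optional run of "<", ">", "=" or spaces, then a plain or scientific number.
_NUMBER = re.compile(r"^[<>=\s]*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$")


def clean_target_value(cell):
    """Return the numeric value of a target cell, or None for a blank.

    Hypothetical helper, not a Makya function.
    """
    text = cell.strip()
    # Placeholders become blank (listed explicitly to mirror the rule).
    if text.lower() in PLACEHOLDERS:
        return None
    match = _NUMBER.match(text)
    # Any other non-numerical content also becomes blank.
    if match is None:
        return None
    # Qualifier symbols are dropped and only the number is kept.
    return float(match.group(1))
```

Note that dropping the qualifier discards information: a censored measurement such as ">7" is afterwards indistinguishable from an exact value of 7, which may be worth keeping in mind when training predictors on such columns.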
After naming the dataset and clicking on the Save button, the cleaned dataset is uploaded and appears as a panel on the Datasets page. When the dataset is opened, an interactive parallel coordinates plot allows the user to explore their data.