How to create a new Dataset

To create a dataset, click on the New Data Set button and follow the instructions to import a CSV or SDF file directly.

NOTE: a CSV file uploaded as a Makya dataset must comply with the following requirements:

One column must contain the SMILES of each molecule and be named “smiles” or “SMILES” (all lower case or all upper case)

All columns must have a name

If there are errors in the file, Makya will fail to upload the dataset and alert the user.
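
The two requirements above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; it is not Makya's own validation, and the function name check_csv_header is hypothetical.

```python
import csv
import io


def check_csv_header(csv_text):
    """Return a list of problems found in the header row (empty list = OK).

    Hypothetical pre-upload check; Makya's own validation may differ.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    # Requirement: all columns have a name.
    if any(not name.strip() for name in header):
        problems.append("at least one column has no name")
    # Requirement: one column is named "smiles" or "SMILES".
    if not any(name.strip() in ("smiles", "SMILES") for name in header):
        problems.append('no column named "smiles" or "SMILES"')
    return problems


if __name__ == "__main__":
    print(check_csv_header("SMILES,pIC50\nCCO,6.2\n"))
    print(check_csv_header("mol,,pIC50\nCCO,x,6.2\n"))
```

An empty list means the header passes both checks; the upload may still fail for other reasons, such as invalid SMILES strings.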

For each column of the uploaded dataset, the user can decide whether it should be imported or cleaned, and can change its data type (SMILES, Number, or String/Data). Columns of Number type (int or float) are treated as target columns, on which predictor models can be trained (see the Predictors section for details).

Makya then automatically cleans the data file:

Molecules are standardized (removal of salts, conversion of radicals...)
Target columns are cleaned and parsed:
- Cells containing non-numerical values or placeholders (such as “N/A” or “–”) become blank
- Values that contain symbols such as “<”, “>”, “=” are converted to their numeric values (e.g. “>7” becomes “7”)
Chirality is removed:
- Chiral molecules are converted to their achiral version
- Duplicate rows (based on the achiral InChIKey) are merged
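
To illustrate the target-column rules listed above, here is a minimal sketch of how a single cell might be parsed. It is an assumption about the behaviour, not Makya's implementation: the function name clean_target_cell is hypothetical, a blank cell is represented as None, and only leading “<”, “>” and “=” symbols are stripped.

```python
import math


def clean_target_cell(raw):
    """Parse one target-column cell into a float, or None for a blank cell.

    Hypothetical sketch of the rules described in the text.
    """
    # Drop surrounding whitespace, then any leading comparison symbols.
    text = raw.strip().lstrip("<>=").strip()
    try:
        value = float(text)
    except ValueError:
        # Non-numerical values and placeholders become blank.
        return None
    # Treat "nan" and "inf" as non-numerical too (an assumption).
    return value if math.isfinite(value) else None


if __name__ == "__main__":
    for cell in [">7", "<=0.5", "6.2", "N/A", "–", ""]:
        print(repr(cell), "->", clean_target_cell(cell))
```

Note that stripping the symbol discards the inequality: “>7” and “7” end up as the same value, as in the example given in the list.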

For more information about why we remove chirality, see the corresponding page of the documentation.

The user can review the different steps and outcomes of the cleaning process.

After the user names the dataset and clicks on the Save button, the cleaned dataset is uploaded and appears as a panel on the Datasets page. When the dataset is opened, an interactive parallel coordinates plot allows the user to explore the data.