notes

* Clone the repo
$ git clone https://github.com/JDACS4C-IMPROVE/GraphDRP

* Create env
$ conda_env_py37.sh

* Generate required datasets:
choice:
    0: create mixed test dataset
    1: create saliency map dataset
    2: create blind drug dataset
    3: create blind cell dataset
$ python preprocess.py --choice 0

* Train model
$ python training.py --model 0 --train_batch 1024 --val_batch 1024 --test_batch 1024 --lr 0.0001 --num_epoch 300 --log_interval 20 --cuda_name "cuda:0" --set drug

* Saliency
$ python saliency_map.py --model 0 --num_feature 10 --processed_data_file "data/processed/GDSC_bortezomib.pt" --model_file "model_GINConvNet_GDSC.model" --cuda_name "cuda:0"


--------
Run
--------
# Preprocess data (create datasets)
# choice: 0: create mixed test dataset, 1: create saliency map dataset, 2: create blind drug dataset, 3: create blind cell dataset
python preprocess.py --choice 0
python preprocess.py --choice 1
python preprocess.py --choice 2
python preprocess.py --choice 3

# Training mixed test experiment
python training.py --model 0 --train_batch 1024 --val_batch 1024 --test_batch 1024 --lr 0.0001 --num_epoch 300 --log_interval 20 --cuda_name "cuda:0"
python training.py --model 1 --train_batch 1024 --val_batch 1024 --test_batch 1024 --lr 0.0001 --num_epoch 300 --log_interval 20 --cuda_name "cuda:0"
python training.py --model 2 --train_batch 1024 --val_batch 1024 --test_batch 1024 --lr 0.0001 --num_epoch 300 --log_interval 20 --cuda_name "cuda:0"
python training.py --model 3 --train_batch 1024 --val_batch 1024 --test_batch 1024 --lr 0.0001 --num_epoch 300 --log_interval 20 --cuda_name "cuda:0"
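
The four training commands above differ only in --model. A minimal Python sketch that builds them in a loop, assuming the flags shown above are the only ones needed. build_training_cmd is a hypothetical helper and is not part of the repo:

```python
import shlex

def build_training_cmd(model, batch=1024, lr=0.0001, num_epoch=300,
                       log_interval=20, cuda_name="cuda:0"):
    """Return the argv list for one training.py run (hypothetical helper)."""
    return [
        "python", "training.py",
        "--model", str(model),
        "--train_batch", str(batch),
        "--val_batch", str(batch),
        "--test_batch", str(batch),
        "--lr", str(lr),
        "--num_epoch", str(num_epoch),
        "--log_interval", str(log_interval),
        "--cuda_name", cuda_name,
    ]

# One command per model index used in these notes (0 to 3).
commands = [build_training_cmd(m) for m in range(4)]
for cmd in commands:
    # Print a shell-safe version of each command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually launch the runs, each list could be passed to subprocess.run(cmd, check=True) from the repo root.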


------------------
Data preprocessing
------------------
preprocess.py was adapted from tCNNS (https://github.com/Lowpassfilter/tCNNS-Project/blob/master/data/preprocess.py).

* PANCANCER_Genetic_feature.csv
    https://www.cancerrxgene.org/downloads/genetic_features
    col "genetic_feature" contains either a mutation (suffixed with "_mut") or a CNA (prefixed with "cna_")
* PANCANCER_IC.csv
    https://www.cancerrxgene.org/downloads/drug_data
    Click on Download to get response data
    GDSC1 and GDSC2 provide different files
* Cell_list.csv
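
The naming rule for the "genetic_feature" column of PANCANCER_Genetic_feature.csv ("_mut" suffix for mutations, "cna_" prefix for CNAs) can be sketched in Python as below. classify_genetic_feature is a hypothetical helper, and the example values are made up for illustration, not copied from the real file:

```python
def classify_genetic_feature(name):
    """Label a genetic_feature value as 'mutation', 'cna', or 'unknown'."""
    if name.endswith("_mut"):
        return "mutation"
    if name.startswith("cna_"):
        return "cna"
    # Anything that matches neither rule is flagged for manual inspection.
    return "unknown"

# Illustrative values only (assumed, not taken from the GDSC file).
examples = ["GENE1_mut", "cna_region1", "something_else"]
for value in examples:
    print(value, "->", classify_genetic_feature(value))
```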

* Druglist.csv --> 265 drugs
    CSV file downloaded from https://www.cancerrxgene.org/downloads/drug_data (click on CSV, not Download)
    GDSC1 and GDSC2 provide different files (the paper uses only one of them; which one?)
* drug_smiles.csv --> 223 drugs
    Generated by func preprocess.py/download_smiles() to contain drugs from Druglist and their SMILES
* pychem_cid.csv and unknow_drug_by_pychem.csv
    Generated by func preprocess.py/write_drug_cid()
    pychem_cid: molecules retrieved from PubChem
    unknow_drug_by_pychem: molecules not found in PubChem
* small_molecule.csv
    Downloaded from http://lincs.hms.harvard.edu/db/sm/
    This dataset was downloaded in order to find molecules that are present in Druglist but were not retrieved from PubChem
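
The split of Druglist into molecules found in PubChem (pychem_cid) and molecules not found (unknow_drug_by_pychem) can be sketched as below. This is not the actual write_drug_cid() code: split_by_lookup is a hypothetical helper, and the lookup is a toy dictionary standing in for a real PubChem query, with made-up names and CIDs.

```python
import csv
import io

def split_by_lookup(drug_names, lookup):
    """Split names into (found, missing) using lookup(name) -> CID or None."""
    found, missing = [], []
    for name in drug_names:
        cid = lookup(name)
        if cid is None:
            missing.append(name)
        else:
            found.append((name, cid))
    return found, missing

# Toy stand-in for a PubChem query; names and CIDs are made up.
toy_cids = {"drug_a": 111, "drug_b": 222}
found, missing = split_by_lookup(["drug_a", "drug_x", "drug_b"], toy_cids.get)

# Write the found rows as CSV text (in memory here; a real run would use files).
buffer = io.StringIO()
csv.writer(buffer).writerows(found)
print(buffer.getvalue())
print("not found:", missing)
```

The names left in the "not found" list are the ones that would then be searched for in small_molecule.csv.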


--------
Issues
--------
Didn't find create_data.py.
That becomes a problem when using saliency_map.py.


---------------
Learning curves
---------------
Created lc_prep.py from preprocess.py and modified it as needed to generate data for learning curves.
python lc_prep.py
./lc_batch.sh


---------------
CANDLE
---------------


---------------
CSG
---------------
* Prepare raw data using either cross_study_gen or IMP_data
* ./frm_preprocess.sh