Added some scripts to build SPN cohort on Orfeo
albertocasagrande committed Aug 28, 2024
1 parent 4484562 commit 4833288
Showing 5 changed files with 1,249 additions and 0 deletions.
82 changes: 82 additions & 0 deletions building_cohorts/README.md
@@ -0,0 +1,82 @@
# How to Build the Cohort from the Phylogenetic Forests

For every SPN, the cohort contains 13 sets of reads: 12 of them are produced by
simulating the tumour sequencing at coverages of 50X, 100X, 150X, and 200X with
purities of 0.3, 0.6, and 0.9; the last one is a simulated 30X sequencing of
the normal sample. All the cells in the normal sample share the same mutations,
i.e., the germline mutations.
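
As an illustration of the combinations (this is only a sketch, not part of the
pipeline scripts), the 12 tumour read sets are all the coverage/purity pairs:

```python
from itertools import product

# The 12 tumour read sets: every (coverage, purity) combination.
tumour_sets = list(product([50, 100, 150, 200], [0.3, 0.6, 0.9]))

# The 13th set is the 30X sequencing of the normal sample.
print(len(tumour_sets))  # 12
```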

For efficiency purposes, the tumour reads are not directly produced by rRACES.
Instead, the different coverages and purities are obtained by mixing two 200X
cohorts with purity 1 and purity 0, called "tumour" and "mixing_normal",
respectively. Unlike the cohort's normal sample, "mixing_normal" simulates the
reads of a sample whose cells contain all the tumour pre-neoplastic mutations.

## Splitting "tumour" and "mixing_normal" into Lots

The `rRACES` functions `simulate_seq()` and `simulate_normal_seq()` can easily
produce the cohorts "tumour" and "mixing_normal". However, to speed up their
generation, we split each of them into 200 lots. Each lot simulates a 1X
sequencing of the tumour at purity 1 (for "tumour") or purity 0 (for
"mixing_normal") and uses a different random seed so as to guarantee that a
different set of reads is produced.

The lot generation is distributed over a computational cluster managed by
SLURM. The cluster consists of 8 nodes, each equipped with an AMD EPYC 7H12
64-core processor, 512GB of RAM, and 352GB of local scratch.

## Producing a Set of Reads with a Specific Coverage and Purity

The number of reads $R$ to be simulated to reach a specific coverage $c$ is
$R = c\frac{G}{s}$, where $G$ and $s$ are the reference genome size and the
read length, respectively.
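
For instance, assuming a human-sized reference of about $3.1 \times 10^9$ bp
and 100-bp reads (illustrative values only, not taken from the scripts below),
the read counts work out as follows:

```python
# Illustrative values: neither G nor s is fixed by the scripts in this directory.
G = 3.1e9   # reference genome size (bp)
s = 100     # read length (bp)
c = 200     # target coverage

R = c * G / s            # total number of reads needed for coverage c
reads_per_lot = R / 200  # each of the 200 lots is a 1X sequencing

print(f"{R:.2e} reads in total, {reads_per_lot:.2e} per lot")
# 6.20e+09 reads in total, 3.10e+07 per lot
```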

Some of the $R$ reads come from "tumour" and some from "mixing_normal",
according to a probability that depends on the quantities of tumour and normal
DNA in the sample ($T$ and $N$, respectively). In particular, the probability
for a read to come from a "mixing_normal" cell is $p_N=\frac{N}{T+N}$.

The `rRACES` method `PhylogeneticForest$get_samples_info()` returns the number
of tumour cells $t$ and the quantity of tumour DNA in the sample expressed as
the number of normal cells $e$ whose DNA equals it, i.e., $T = e D$, where $D$
is the DNA content of a single normal cell. Thus,
$p_N = \frac{N}{T+N} = \frac{n D}{e D + n D} = \frac{n}{e+n}$, where $n$ is the
number of normal cells in the sample.

If $\pi$ is the target purity, then $n = t\frac{1-\pi}{\pi}$ is the number of
"mixing_normal" cells that should be added to the tumour sample to reach purity
$\pi$. It follows that, when the purity is $\pi$, the probability for a read to
come from a "mixing_normal" cell is
$p_N=\frac{t(1-\pi)}{e\pi+t(1-\pi)}$.

Since the reads are uniformly distributed over the 200 lots, we can sample the
binomial distribution $\mathcal{B}(R/200, p_N)$ 200 times and decide, lot by
lot, how many reads should be extracted from "mixing_normal" and, consequently,
how many from "tumour".

## Generating the Tumour Cohort

This directory contains 3 Python scripts that build the cohort of an SPN by
using its phylogenetic forest file on Orfeo.

To generate the cohort, copy all the Python scripts to a directory in your home
on Orfeo, log in on Orfeo, and execute the following command.

```{sh}
nohup python3 build_cohorts.py <SPN> <PHYLO_FOREST> <OUTPUT_DIR> -p EPYC -A cdslab &
```

where `<SPN>` is the name of the SPN, `<PHYLO_FOREST>` is the absolute path of
the phylogenetic forest, and `<OUTPUT_DIR>` is the absolute path of the output
directory.

For faster execution, both `<PHYLO_FOREST>` and `<OUTPUT_DIR>` should be
located in one of the `/fast` subdirectories.

The Python scripts produce the "tumour", "mixing_normal", and normal sample
reads. They also generate the FASTQ files and the corresponding `sarek` input
files.
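
Each `sarek` file is a CSV whose header is written by `build_sarek_files()`;
the row below is only a hypothetical example of a normal sample entry (paths
and sample names depend on your run):

```
patient,sex,status,sample,lane,fastq_1,fastq_2
SPN01,XX,0,normal_sample,L001,/fast/.../normal.R1.fastq.gz,/fast/.../normal.R2.fastq.gz
```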

> [!CAUTION]
> The content of the `/fast` subdirectories is deleted every 30 days
295 changes: 295 additions & 0 deletions building_cohorts/build_cohort.py
@@ -0,0 +1,295 @@
#!/usr/bin/python3

import os
import sys
import glob
import time
import argparse
import subprocess
import datetime
import fnmatch

tumour_coverage = 200          # coverage of the "tumour" cohort
mixing_normal_coverage = 200   # coverage of the "mixing_normal" cohort
normal_coverage = 30           # coverage of the normal sample
num_of_lots = 200              # number of lots per cohort
BAM_parallel_jobs = 40         # maximum parallel jobs for BAM generation
diluition_parallel_jobs = 12   # maximum parallel jobs for the dilution step

fastq_shell_script="""#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=6:00:00
#SBATCH --mem=20GB
module load R
module load samtools
echo "Rscript generate_normal_fastq.R ${BAM_FILE} ${OUTPUT_DIR}"
Rscript generate_normal_fastq.R ${BAM_FILE} ${OUTPUT_DIR}
"""

fastq_R_script = """rm(list = ls())
library(rRACES)

generate_fastq <- function(orig_file, fastq_dir) {
  base_orig_file <- tools::file_path_sans_ext(basename(orig_file))
  file_prefix <- file.path(fastq_dir, base_orig_file)

  R1 <- paste0(file_prefix, ".R1.fastq.gz")
  R2 <- paste0(file_prefix, ".R2.fastq.gz")
  unpaired <- paste0(file_prefix, ".unpaired.fastq.gz")
  singleton <- paste0(file_prefix, ".singleton.fastq.gz")

  cmd <- paste("samtools fastq -@ 20 -c 9 -N -1", R1, "-2", R2, "-0", unpaired,
               "-s", singleton, orig_file)

  invisible(system(cmd, intern = TRUE))
}

args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 2) {
  stop(paste("Syntax error: generate_normal_fastq <BAM_file>",
             "<output_dir>"),
       call. = FALSE)
}

bam_file <- args[1]
output_dir <- args[2]

if (!file.exists(output_dir)) {
  dir.create(output_dir)
}

log_dir <- file.path(output_dir, "log")
if (!file.exists(log_dir)) {
  dir.create(log_dir)
}

base_orig_file <- tools::file_path_sans_ext(basename(bam_file))
generate_fastq(bam_file, output_dir)

done_file <- file.path(log_dir, paste0(base_orig_file, ".done"))
invisible(file.create(done_file))
"""


def build_reads(SPN, phylogenetic_forest, output_dir, account, partition,
                num_of_lots, coverage, parallel_jobs, options=[]):
    if not os.path.exists(output_dir):
        os.mkdir(output_dir)

    cmd = (['python3', './deploy_rRACES_seq.py', '-p', f'{partition}',
            '-A', f'{account}', '-l', f'{num_of_lots}', '-c', f'{coverage}',
            '-j', f'{parallel_jobs}', SPN, phylogenetic_forest, output_dir]
           + options)

    subprocess.run(cmd)

    return output_dir


def diluite_reads(SPN, phylogenetic_forest, output_dir, tumour_dir, mixing_normal_dir,
                  account, partition):

    diluitions_dir = os.path.join(output_dir, "diluitions")
    if not os.path.exists(diluitions_dir):
        os.mkdir(diluitions_dir)

    fastq_dir = os.path.join(output_dir, "fastq")
    if not os.path.exists(fastq_dir):
        os.mkdir(fastq_dir)

    cmd = ['python3', './deploy_diluite_cohort.py', '-p', f'{partition}',
           '-A', f'{account}', '-j', f'{diluition_parallel_jobs}', SPN,
           phylogenetic_forest, diluitions_dir, fastq_dir,
           os.path.join(tumour_dir, "BAM"), os.path.join(mixing_normal_dir, "BAM")]

    subprocess.run(cmd)

    return (diluitions_dir, fastq_dir)


def get_done(directory):
    return glob.glob(os.path.join(directory, "*.done"))


def build_fastq(bam_dir, fastq_dir, account, partition):
    with open('generate_normal_fastq.R', 'w') as outstream:
        outstream.write(fastq_R_script)

    with open('generate_normal_fastq.sh', 'w') as outstream:
        outstream.write(fastq_shell_script)

    if not os.path.exists(fastq_dir):
        os.mkdir(fastq_dir)

    fastq_log_dir = os.path.join(fastq_dir, "log")
    if not os.path.exists(fastq_log_dir):
        os.mkdir(fastq_log_dir)

    bam_files = glob.glob(os.path.join(bam_dir, "*.bam"))

    # Submit one SLURM job per BAM file to convert it into FASTQ files
    for bam_file in bam_files:
        basename = os.path.splitext(os.path.basename(bam_file))[0]
        cmd = ['sbatch', f'--account={account}',
               f'--partition={partition}',
               (f'--export=BAM_FILE={bam_file},'
                + f'OUTPUT_DIR={fastq_dir}'),
               f'--output={fastq_log_dir}/{basename}.log',
               './generate_normal_fastq.sh']

        subprocess.run(cmd)

    # Wait until every submitted job has created its ".done" file
    while len(bam_files) != len(get_done(fastq_log_dir)):
        time.sleep(60)

    return fastq_dir

def get_fastq_filenames(bam_dir, fastq_dir):

    normal_fastq_filenames = []
    for bam_file in sorted(glob.glob(os.path.join(bam_dir, "*.bam"))):
        basename = os.path.splitext(os.path.basename(bam_file))[0]
        common_name = os.path.join(fastq_dir, basename)

        normal_fastq_filenames.append((f'{common_name}.R1.fastq.gz',
                                       f'{common_name}.R2.fastq.gz'))

    return normal_fastq_filenames

def get_cohort_in(directory):
    cohorts = set()
    for cohort in fnmatch.filter(os.listdir(directory), "*X_*p"):
        if os.path.isdir(os.path.join(directory, cohort)):
            cohorts.add(cohort)

    return cohorts

def build_sarek_files(output_dir, fastq_dir, diluitions_dir, normal_dir):
    sarek_dir = os.path.join(output_dir, "sarek")

    if not os.path.exists(sarek_dir):
        os.mkdir(sarek_dir)

    normal_fastq_filenames = get_fastq_filenames(os.path.join(normal_dir, "BAM"),
                                                 os.path.join(fastq_dir, "normal"))

    for cohort in get_cohort_in(diluitions_dir):
        with open(os.path.join(sarek_dir, cohort + ".csv"), 'w') as outstream:
            outstream.write('patient,sex,status,sample,'
                            + 'lane,fastq_1,fastq_2\n')

            # Collect the tumour rows from the per-lot sarek CSV files of this cohort
            for root, dirnames, filenames in os.walk(os.path.join(diluitions_dir,
                                                                  cohort)):
                for filename in fnmatch.filter(filenames, "sarek_*.csv"):
                    with open(os.path.join(root, filename)) as instream:
                        line_num = 0
                        for line in instream:
                            if line_num > 0:
                                outstream.write(f'{line.strip()}\n')
                            if line_num == 1:
                                (SPN, gender) = line.strip().split(',')[:2]

                            line_num += 1

            # Append one row per normal sample FASTQ pair
            line = 1
            for fastq_filenames in normal_fastq_filenames:
                outstream.write(f'{SPN},{gender},0,normal_sample,'
                                + f'L{str(line).zfill(3)},{fastq_filenames[0]},'
                                + f'{fastq_filenames[1]}\n')
                line += 1

# The coverage is also used as the number of lots: each lot simulates a 1X sequencing.

def build_tumour_reads(SPN, phylogenetic_forest, output_dir, account, partition):
    return build_reads(SPN, phylogenetic_forest, os.path.join(output_dir, "tumour"),
                       account, partition, tumour_coverage, tumour_coverage,
                       min(BAM_parallel_jobs, tumour_coverage))


def build_mixing_normal_reads(SPN, phylogenetic_forest, output_dir, account, partition):
    return build_reads(SPN, phylogenetic_forest, os.path.join(output_dir, "mixing_normal"),
                       account, partition, mixing_normal_coverage, mixing_normal_coverage,
                       min(BAM_parallel_jobs, mixing_normal_coverage), options=['-w'])


def build_normal_reads(SPN, phylogenetic_forest, output_dir, account, partition):
    return build_reads(SPN, phylogenetic_forest, os.path.join(output_dir, "normal"),
                       account, partition, normal_coverage, normal_coverage,
                       min(BAM_parallel_jobs, normal_coverage), options=['-n'])


if __name__ == '__main__':
    parser = argparse.ArgumentParser(prog=sys.argv[0],
                                     description='Generates the cohorts of a tumour')
    parser.add_argument('SPN', type=str, help='The SPN name (e.g., SPN01)')
    parser.add_argument('phylogenetic_forest', type=str,
                        help=('The phylogenetic forest absolute path '
                              + 'for the cluster nodes'))
    parser.add_argument('output_dir', type=str,
                        help=('The output directory absolute path '
                              + 'for the cluster nodes'))
    parser.add_argument('-p', '--partition', type=str, required=True,
                        help="The cluster partition")
    parser.add_argument('-A', '--account', type=str,
                        help="The cluster account")

    args = parser.parse_args()

    if args.account is None:
        # Default to the current user name as the cluster account
        process = subprocess.Popen(['whoami'], stdout=subprocess.PIPE)
        account = process.communicate()[0].decode().strip()
    else:
        account = args.account

    output_dir = os.path.abspath(args.output_dir)

    if not os.path.exists(output_dir):
        os.mkdir(output_dir)

    execution_track = []

    execution_track.append(f"Starting at {datetime.datetime.now()}\n")
    phylogenetic_forest = os.path.abspath(args.phylogenetic_forest)

    tumour_dir = build_tumour_reads(args.SPN, phylogenetic_forest,
                                    output_dir, account, args.partition)
    execution_track.append(f"Tumour BAMs built at {datetime.datetime.now()}\n")

    mixing_normal_dir = build_mixing_normal_reads(args.SPN, phylogenetic_forest,
                                                  output_dir, account, args.partition)
    execution_track.append(f"Mixing normal BAMs built at {datetime.datetime.now()}\n")

    normal_dir = build_normal_reads(args.SPN, phylogenetic_forest,
                                    output_dir, account, args.partition)
    execution_track.append(f"Normal BAMs built at {datetime.datetime.now()}\n")

    (diluitions_dir, fastq_dir) = diluite_reads(args.SPN, phylogenetic_forest, output_dir,
                                                tumour_dir, mixing_normal_dir, account,
                                                args.partition)
    execution_track.append(f"Cohorts built at {datetime.datetime.now()}\n")

    build_fastq(os.path.join(normal_dir, "BAM"),
                os.path.join(fastq_dir, "normal"), account, args.partition)
    execution_track.append("Normal FASTQs built at "
                           + f"{datetime.datetime.now()}\n")

    build_sarek_files(output_dir, fastq_dir, diluitions_dir, normal_dir)
    execution_track.append("Sarek files built at "
                           + f"{datetime.datetime.now()}\n")

    with open(os.path.join(output_dir, "cohorts.done"), 'w') as outstream:
        for line in execution_track:
            outstream.write(line)