# How to Build the Cohort from the Phylogenetic Forests

For every SPN, the cohort contains 13 sets of reads: 12 of them are produced
by simulating the tumour sequencing at coverages of 50X, 100X, 150X, and 200X
and purities of 0.3, 0.6, and 0.9; the last one is a 30X simulated sequencing
of a normal sample. The cells in the normal sample share the same mutations,
i.e., the germline mutations.
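
As a sanity check on the count, the 12 tumour read sets are the Cartesian
product of the four coverages and the three purities; a minimal sketch:

```{python}
# The 13 read sets of one SPN cohort: 12 tumour configurations plus the
# 30X normal sample
coverages = [50, 100, 150, 200]
purities = [0.3, 0.6, 0.9]

tumour_sets = [(c, p) for c in coverages for p in purities]
all_sets = tumour_sets + [(30, "normal")]
print(len(all_sets))  # 13
```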

For efficiency reasons, the tumour reads are not produced directly by rRACES.
Instead, the different coverages and purities are obtained by mixing two 200X
cohorts with purities 1 and 0, respectively. These two cohorts are called
"tumour" and "mixing_normal". Unlike the cohort's normal sample,
"mixing_normal" simulates the reads of a sample whose cells contain all the
tumour pre-neoplastic mutations.

## Splitting "tumour" and "mixing_normal" into Lots

The `rRACES` functions `simulate_seq()` and `simulate_normal_seq()` can easily
produce the cohorts "tumour" and "mixing_normal". However, to speed up their
generation, we split each of them into 200 lots. Each lot simulates a 1X
sequencing of the tumour at purity 1 (for "tumour") or purity 0 (for
"mixing_normal") and uses a different random seed, so as to guarantee that
every lot produces a different set of reads.

The lot generation is distributed over a computational cluster managed by
SLURM. The cluster consists of 8 nodes, each equipped with an AMD EPYC 7H12
64-core processor, 512GB of RAM, and 352GB of local scratch.
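
Conceptually, the per-lot submission looks like the sketch below. The actual
submission is handled by `deploy_rRACES_seq.py`; the script name `lot_seq.sh`
and the exported variables are hypothetical names used only for illustration.

```{python}
# Hypothetical sketch: one SLURM job per lot, each with its own random seed.
# 'lot_seq.sh', 'LOT_ID', and 'SEED' are illustrative names.
import subprocess

num_of_lots = 200
for lot in range(num_of_lots):
    subprocess.run(['sbatch', '--account=cdslab', '--partition=EPYC',
                    f'--export=LOT_ID={lot},SEED={lot}',  # a distinct seed per lot
                    './lot_seq.sh'])
```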

## Producing a Set of Reads with a Specific Coverage and Purity

The number of reads $R$ to be simulated to reach a specific coverage $c$ is
$R = c\frac{G}{s}$, where $G$ and $s$ are the reference genome size and the
read size, respectively.
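
For example, a back-of-the-envelope computation with illustrative values (a
roughly human-sized genome and 100-base reads; neither figure comes from the
scripts):

```{python}
# R = c * G / s with illustrative values
c = 200    # target coverage
G = 3.1e9  # reference genome size in bases (roughly the human genome)
s = 100    # read size in bases

R = c * G / s
print(f"{R:.1e} reads")  # 6.2e+09 reads
```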

Some of the $R$ reads come from "tumour" and some from "mixing_normal",
according to a probability that depends on the quantities of tumour and normal
DNA in the sample ($T$ and $N$, respectively). In particular, the probability
for one of the reads to come from a "mixing_normal" cell is
$p_N=\frac{N}{T+N}$.

The `rRACES` method `PhylogeneticForest$get_samples_info()` returns the number
of tumour cells $t$ and the quantity of tumour DNA in the sample expressed in
units of the DNA content $D$ of a normal cell, i.e., the number $e$ of normal
cells whose overall DNA equals the sample tumour DNA. Thus,
$p_N = \frac{N}{T+N} = \frac{n D}{e D + n D} = \frac{n}{e+n}$, where $n$ is
the number of normal cells in the sample.

If $\pi$ is the target purity, then $n = t\frac{1-\pi}{\pi}$ is the number of
"mixing_normal" cells that should be added to the tumour sample to reach
purity $\pi$. It follows that, when the purity is $\pi$, the probability for
one of the reads to come from a "mixing_normal" cell is
$p_N=\frac{t(1-\pi)}{e\pi+t(1-\pi)}$.
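
For example, with made-up values for $t$, $e$, and $\pi$:

```{python}
# p_N = t(1 - pi) / (e*pi + t(1 - pi)); t and e are made-up values
t = 9000   # number of tumour cells in the sample
e = 1500   # normal-cell equivalent of the sample tumour DNA
pi = 0.6   # target purity

n = t * (1 - pi) / pi  # "mixing_normal" cells to add: 6000.0
p_N = t * (1 - pi) / (e * pi + t * (1 - pi))
print(round(p_N, 2))   # 0.8
```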

Since the $R$ reads are uniformly distributed over the 200 lots, we can sample
the binomial distribution $\mathcal{B}(R/200, p_N)$ (i.e., $R/200$ trials with
success probability $p_N$) once per lot and decide, lot by lot, how many reads
should be extracted from "mixing_normal" and, consequently, how many from
"tumour".

## Generating the Tumour Cohort

This directory contains 3 Python scripts that build the cohort of an SPN by
using its phylogenetic forest file on Orfeo.

To generate the cohort, log in to Orfeo, copy all the Python scripts to a
directory in your home, and execute the following command

```{sh}
nohup python3 build_cohorts.py <SPN> <PHYLO_FOREST> <OUTPUT_DIR> -p EPYC -A cdslab &
```

where `<SPN>` is the name of the SPN, `<PHYLO_FOREST>` is the absolute path of
the phylogenetic forest, and `<OUTPUT_DIR>` is the absolute path of the output
directory.

For faster execution, both `<PHYLO_FOREST>` and `<OUTPUT_DIR>` should be
located in one of the `/fast` subdirectories.

The Python scripts produce the "tumour", "mixing_normal", and normal-sample
reads. Moreover, they also generate the FASTQ files and the corresponding
`sarek` input files.
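
For reference, each cohort's `sarek` sample sheet follows the layout below.
The header is the one written by the scripts; the two rows are purely
illustrative (shortened paths, made-up sample names).

```
patient,sex,status,sample,lane,fastq_1,fastq_2
SPN01,XX,1,tumour_sample,L001,/fast/.../tumour.L001.R1.fastq.gz,/fast/.../tumour.L001.R2.fastq.gz
SPN01,XX,0,normal_sample,L001,/fast/.../normal.L001.R1.fastq.gz,/fast/.../normal.L001.R2.fastq.gz
```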

> [!CAUTION]
> The content of the `/fast` subdirectories is deleted every 30 days.

### `build_cohorts.py`

```python
#!/usr/bin/python3

import os
import sys
import glob
import time
import argparse
import subprocess
import datetime
import fnmatch

tumour_coverage = 200
mixing_normal_coverage = 200
normal_coverage = 30
num_of_lots = 200
BAM_parallel_jobs = 40
diluition_parallel_jobs = 12

fastq_shell_script = """#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=6:00:00
#SBATCH --mem=20GB
module load R
module load samtools
echo "Rscript generate_normal_fastq.R ${BAM_FILE} ${OUTPUT_DIR}"
Rscript generate_normal_fastq.R ${BAM_FILE} ${OUTPUT_DIR}
"""

fastq_R_script = """rm(list = ls())
library(rRACES)
generate_fastq <- function(orig_file, fastq_dir) {
    base_orig_file <- tools::file_path_sans_ext(basename(orig_file))
    file_prefix <- file.path(fastq_dir, base_orig_file)
    R1 <- paste0(file_prefix, ".R1.fastq.gz")
    R2 <- paste0(file_prefix, ".R2.fastq.gz")
    unpaired <- paste0(file_prefix, ".unpaired.fastq.gz")
    singleton <- paste0(file_prefix, ".singleton.fastq.gz")
    cmd <- paste("samtools fastq -@ 20 -c 9 -N -1", R1, "-2", R2, "-0", unpaired,
                 "-s", singleton, orig_file)
    invisible(system(cmd, intern = TRUE))
}
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 2) {
    stop(paste("Syntax error: generate_normal_fastq <BAM_file>",
               "<output_dir>"),
         call. = FALSE)
}
bam_file <- args[1]
output_dir <- args[2]
if (!file.exists(output_dir)) {
    dir.create(output_dir)
}
log_dir <- file.path(output_dir, "log")
if (!file.exists(log_dir)) {
    dir.create(log_dir)
}
base_orig_file <- tools::file_path_sans_ext(basename(bam_file))
generate_fastq(bam_file, output_dir)
done_file <- file.path(log_dir, paste0(base_orig_file, ".done"))
invisible(file.create(done_file))
"""


def build_reads(SPN, phylogenetic_forest, output_dir, account, partition,
                num_of_lots, coverage, parallel_jobs, options=None):
    # Avoid a mutable default argument
    if options is None:
        options = []

    if not os.path.exists(output_dir):
        os.mkdir(output_dir)

    cmd = (['python3', './deploy_rRACES_seq.py', '-p', f'{partition}',
            '-A', f'{account}', '-l', f'{num_of_lots}', '-c', f'{coverage}',
            '-j', f'{parallel_jobs}', SPN, phylogenetic_forest, output_dir]
           + options)

    subprocess.run(cmd)

    return output_dir


def diluite_reads(SPN, phylogenetic_forest, output_dir, tumour_dir,
                  mixing_normal_dir, account, partition):
    diluitions_dir = os.path.join(output_dir, "diluitions")
    if not os.path.exists(diluitions_dir):
        os.mkdir(diluitions_dir)

    fastq_dir = os.path.join(output_dir, "fastq")
    if not os.path.exists(fastq_dir):
        os.mkdir(fastq_dir)

    cmd = ['python3', './deploy_diluite_cohort.py', '-p', f'{partition}',
           '-A', f'{account}', '-j', f'{diluition_parallel_jobs}', SPN,
           phylogenetic_forest, diluitions_dir, fastq_dir,
           os.path.join(tumour_dir, "BAM"), os.path.join(mixing_normal_dir, "BAM")]

    subprocess.run(cmd)

    return (diluitions_dir, fastq_dir)


def get_done(directory):
    return glob.glob(os.path.join(directory, "*.done"))


def build_fastq(bam_dir, fastq_dir, account, partition):
    with open('generate_normal_fastq.R', 'w') as outstream:
        outstream.write(fastq_R_script)

    with open('generate_normal_fastq.sh', 'w') as outstream:
        outstream.write(fastq_shell_script)

    if not os.path.exists(fastq_dir):
        os.mkdir(fastq_dir)

    fastq_log_dir = os.path.join(fastq_dir, "log")
    if not os.path.exists(fastq_log_dir):
        os.mkdir(fastq_log_dir)

    bam_files = glob.glob(os.path.join(bam_dir, "*.bam"))

    for bam_file in bam_files:
        basename = os.path.splitext(os.path.basename(bam_file))[0]
        cmd = ['sbatch', f'--account={account}',
               f'--partition={partition}',
               (f'--export=BAM_FILE={bam_file},'
                + f'OUTPUT_DIR={fastq_dir}'),
               f'--output={fastq_log_dir}/{basename}.log',
               './generate_normal_fastq.sh']

        subprocess.run(cmd)

    # Wait until every submitted job has left its ".done" file in the log dir
    while len(bam_files) != len(get_done(fastq_log_dir)):
        time.sleep(60)

    return fastq_dir


def get_fastq_filenames(bam_dir, fastq_dir):
    normal_fastq_filenames = []
    for bam_file in sorted(glob.glob(os.path.join(bam_dir, "*.bam"))):
        basename = os.path.splitext(os.path.basename(bam_file))[0]
        common_name = os.path.join(fastq_dir, basename)

        normal_fastq_filenames.append((f'{common_name}.R1.fastq.gz',
                                       f'{common_name}.R2.fastq.gz'))

    return normal_fastq_filenames


def get_cohort_in(directory):
    # Cohort directories match the pattern "<coverage>X_<purity>p"
    cohorts = set()
    for cohort in fnmatch.filter(os.listdir(directory), "*X_*p"):
        if os.path.isdir(os.path.join(directory, cohort)):
            cohorts.add(cohort)

    return cohorts


def build_sarek_files(output_dir, fastq_dir, diluitions_dir, normal_dir):
    sarek_dir = os.path.join(output_dir, "sarek")

    if not os.path.exists(sarek_dir):
        os.mkdir(sarek_dir)

    normal_fastq_filenames = get_fastq_filenames(os.path.join(normal_dir, "BAM"),
                                                 os.path.join(fastq_dir, "normal"))

    for cohort in get_cohort_in(diluitions_dir):
        with open(os.path.join(sarek_dir, cohort + ".csv"), 'w') as outstream:
            outstream.write('patient,sex,status,sample,'
                            + 'lane,fastq_1,fastq_2\n')

            # Collect the per-lot sarek lines of this cohort only
            cohort_dir = os.path.join(diluitions_dir, cohort)
            for root, dirnames, filenames in os.walk(cohort_dir):
                for filename in fnmatch.filter(filenames, "sarek_*.csv"):
                    with open(os.path.join(root, filename)) as instream:
                        line_num = 0
                        for line in instream:
                            if line_num > 0:
                                outstream.write(f'{line.strip()}\n')
                            if line_num == 1:
                                # The patient name and sex come from the first
                                # data line
                                (SPN, gender) = line.strip().split(',')[:2]

                            line_num += 1

            # Append one lane per normal FASTQ pair to the cohort file
            line = 1
            for fastq_filenames in normal_fastq_filenames:
                outstream.write(f'{SPN},{gender},0,normal_sample,'
                                + f'L{str(line).zfill(3)},{fastq_filenames[0]},'
                                + f'{fastq_filenames[1]}\n')
                line += 1


def build_tumour_reads(SPN, phylogenetic_forest, output_dir, account, partition):
    # The number of lots equals the coverage, so every lot is a 1X sequencing
    return build_reads(SPN, phylogenetic_forest, os.path.join(output_dir, "tumour"),
                       account, partition, tumour_coverage, tumour_coverage,
                       min(BAM_parallel_jobs, tumour_coverage))


def build_mixing_normal_reads(SPN, phylogenetic_forest, output_dir, account, partition):
    return build_reads(SPN, phylogenetic_forest, os.path.join(output_dir, "mixing_normal"),
                       account, partition, mixing_normal_coverage, mixing_normal_coverage,
                       min(BAM_parallel_jobs, mixing_normal_coverage), options=['-w'])


def build_normal_reads(SPN, phylogenetic_forest, output_dir, account, partition):
    return build_reads(SPN, phylogenetic_forest, os.path.join(output_dir, "normal"),
                       account, partition, normal_coverage, normal_coverage,
                       min(BAM_parallel_jobs, normal_coverage), options=['-n'])


if __name__ == '__main__':
    parser = argparse.ArgumentParser(prog=sys.argv[0],
                                     description='Generates the cohorts of a tumour')
    parser.add_argument('SPN', type=str, help='The SPN name (e.g., SPN01)')
    parser.add_argument('phylogenetic_forest', type=str,
                        help=('The phylogenetic forest absolute path '
                              + 'for the cluster nodes'))
    parser.add_argument('output_dir', type=str,
                        help=('The output directory absolute path '
                              + 'for the cluster nodes'))
    parser.add_argument('-p', '--partition', type=str, required=True,
                        help="The cluster partition")
    parser.add_argument('-A', '--account', type=str,
                        help="The cluster account")

    args = parser.parse_args()

    if args.account is None:
        # Default to the current user name
        account = subprocess.check_output(['whoami'], text=True).strip()
    else:
        account = args.account

    output_dir = os.path.abspath(args.output_dir)

    if not os.path.exists(output_dir):
        os.mkdir(output_dir)

    execution_track = []

    execution_track.append(f"Starting at {datetime.datetime.now()}\n")
    phylogenetic_forest = os.path.abspath(args.phylogenetic_forest)

    tumour_dir = build_tumour_reads(args.SPN, phylogenetic_forest,
                                    output_dir, account, args.partition)
    execution_track.append(f"Tumour BAMs built at {datetime.datetime.now()}\n")

    mixing_normal_dir = build_mixing_normal_reads(args.SPN, phylogenetic_forest,
                                                  output_dir, account, args.partition)
    execution_track.append(f"Mixing normal BAMs built at {datetime.datetime.now()}\n")

    normal_dir = build_normal_reads(args.SPN, phylogenetic_forest,
                                    output_dir, account, args.partition)
    execution_track.append(f"Normal BAMs built at {datetime.datetime.now()}\n")

    (diluitions_dir, fastq_dir) = diluite_reads(args.SPN, phylogenetic_forest,
                                                output_dir, tumour_dir,
                                                mixing_normal_dir, account,
                                                args.partition)
    execution_track.append(f"Cohorts built at {datetime.datetime.now()}\n")

    build_fastq(os.path.join(normal_dir, "BAM"),
                os.path.join(fastq_dir, "normal"), account, args.partition)
    execution_track.append("Normal FASTQs built at "
                           + f"{datetime.datetime.now()}\n")

    build_sarek_files(output_dir, fastq_dir, diluitions_dir, normal_dir)
    execution_track.append("Sarek files built at "
                           + f"{datetime.datetime.now()}\n")

    with open(os.path.join(output_dir, "cohorts.done"), 'w') as outstream:
        for line in execution_track:
            outstream.write(line)
```