diff --git a/docs/source/chemistry-command.rst b/docs/source/chemistry-command.rst index 07e851c..f7dfe6d 100644 --- a/docs/source/chemistry-command.rst +++ b/docs/source/chemistry-command.rst @@ -43,7 +43,7 @@ The ``refresh`` sub-command takes no *required* arguments; it's usage is shown b Usage: simpleaf chemistry refresh [OPTIONS] Options: - -f, --force overwrite an existing matched chemistry even if the version isn't newer + -f, --force overwrite an existing matched chemistry even if the version is not newer -d, --dry-run report what would happen with a refresh without actually performing one on the actual chemistry registry -h, --help Print help @@ -79,9 +79,9 @@ Every chemistry added to the registry has three mandatory properties: ``name``, - ``name``: A unique name (within the existing registry) of the chemistry. It must be a valid UTF-8 identifier. If the name is already registered, the existing definition will be updated if a higher ``--version`` is provided (see below for details). Otherwise, simpleaf will complain and fail. -- ``geometry``: The geometry specification must be provided as a quoted string, and must follow the `Sequence Fragment Geometry Description Language `_ as used in the `quant command `. +- ``geometry``: The geometry specification must be provided as a quoted string, and must follow the `Sequence Fragment Geometry Description Language `_ as used in the `quant command `_. - ``expected-ori``: The expected orientation of the chemistry. It must be one of the following: fw (forward), rc (reverse complement), or both (both orientations). It describes the expected orientation relative to the first (most upstream) mappable biological sequence. -Imagine we have reads from 10x Chromium 5' protocols with read1s and read2s both of 150 base pairs. With this specification, a read1, which is in the forward orientation, contains, from 5' to 3', a cell barcode, a UMI, a fixed fragment, and a fragment representing the 5' end of the cDNA. A read2, which is in the reverse complementary orientation, contains the second (downstream) cDNA fragment relative to its read1. You can find a detailed explanation of the 10x Chromium 5' protocol from Single Cell Genomics Library Structure _. +Imagine we have reads from 10x Chromium 5' protocols with read1s and read2s both of 150 base pairs. With this specification, a read1, which is in the forward orientation, contains, from 5' to 3', a cell barcode, a UMI, a fixed fragment, and a fragment representing the 5' end of the cDNA. A read2, which is in the reverse complementary orientation, contains the second (downstream) cDNA fragment relative to its read1. You can find a detailed explanation of the 10x Chromium 5' protocol from `Single Cell Genomics Library Structure `_. If we map the biological sequence in read1s and read2s as paired-end reads (currently only supported when using the default mapper -- piscem), as biological read1s are the first mappable sequences, the expected orientation for this chemistry should be ``fw``, the orientation of read1s. However, if we only map read2s, the expected orientation should be ``rc``, because read2s are the first mappable sequences and are in the reverse complementary orientation. In addition to the required fields, there are 3 optional fields, as described below. A permit list file must be a TSV file without a header, and the first column must contain the sequence of permitted cell barcodes, i.e., the whitelist of cell barcodes. 
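The chemistry properties described above (a unique ``name``, a quoted ``geometry`` string, and an ``expected-ori`` of ``fw``, ``rc``, or ``both``) together with the headerless permit-list TSV map onto a small data model. A minimal Rust sketch follows; ``ChemistryEntry``, ``permit_list_barcodes``, and the sample barcodes are hypothetical illustrations, not simpleaf's internal API, and the geometry string reuses the 10x Chromium v3 example from the quant documentation.

// Illustrative sketch only; ChemistryEntry and permit_list_barcodes are hypothetical
// names, not simpleaf's internal types.
#[derive(Debug, Clone, Copy)]
enum ExpectedOri {
    Fw,   // forward
    Rc,   // reverse complement
    Both, // either orientation
}

impl ExpectedOri {
    fn parse(s: &str) -> Option<Self> {
        match s {
            "fw" => Some(ExpectedOri::Fw),
            "rc" => Some(ExpectedOri::Rc),
            "both" => Some(ExpectedOri::Both),
            _ => None,
        }
    }
}

#[derive(Debug)]
struct ChemistryEntry {
    name: String,     // must be unique within the registry
    geometry: String, // quoted fragment-geometry string, e.g. "1{b[16]u[12]x:}2{r:}"
    expected_ori: ExpectedOri,
}

/// Collect the first (barcode) column of a headerless permit-list TSV.
fn permit_list_barcodes(tsv: &str) -> Vec<&str> {
    tsv.lines()
        .filter_map(|l| l.split('\t').next())
        .filter(|bc| !bc.is_empty())
        .collect()
}

fn main() {
    let chem = ChemistryEntry {
        name: "my-custom-chem".to_string(),
        geometry: "1{b[16]u[12]x:}2{r:}".to_string(),
        // fw when the first (most upstream) mappable biological read is in the
        // forward orientation
        expected_ori: ExpectedOri::parse("fw").unwrap(),
    };
    println!("{:?}", chem);

    let barcodes = permit_list_barcodes("AAACCCAAGAAACACT\nAAACCCAAGAAACCAT\n");
    assert_eq!(barcodes.len(), 2);
}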
diff --git a/docs/source/index-command.rst b/docs/source/index-command.rst index db5035f..c17bf44 100644 --- a/docs/source/index-command.rst +++ b/docs/source/index-command.rst @@ -3,39 +3,78 @@ The ``index`` command has two forms of input; either it will take a reference genome FASTA and GTF as input, from which it can build a spliced+intronic (splici) reference or a spliced+unspliced (spliceu) reference using `roers `_ (which is used as a library directly from ``simpleaf``, and so need not be installed independently), or it will take a single reference sequence file (i.e. FASTA file) as input (direct-ref mode). -In expanded reference mode, after the expanded reference is constructed, the resulting reference will be indexed with ``piscem build`` or ``salmon index`` command (depending on the mapper you choose to use), and a copy of the 3-column transcript-to-gene file will be placed in the index directory for subsequent use. The output directory will contain both a ``ref`` and ``index`` subdirectoy, with the first containing the splici reference that was extracted from the provided genome and GTF, and the latter containing the index built on this reference. +In expanded reference mode, after the expanded reference is constructed, the resulting reference will be indexed with ``piscem build`` or ``salmon index`` command (depending on the mapper you choose to use), and a copy of the 3-column transcript-to-gene file will be placed in the index directory for subsequent use. The output directory will contain both a ``ref`` and ``index`` subdirectory, with the first containing the splici reference that was extracted from the provided genome and GTF, and the latter containing the index built on this reference. -In direct-ref mode, the provided fasta file (passed in with ``--refseq``) will be provided to ``piscem build`` or ``salmon index`` directly. The output diretory will contain an ``index`` subdirectory that contains the index built on this reference. +In direct-ref mode, if ``--refseq`` is passed, the provided FASTA file will be provided to ``piscem build`` or ``salmon index`` directly. If ``probe_csv`` or ``feature_csv`` is passed, a FASTA file will be created accordingly and provided to ``piscem build`` or ``salmon index``. The output directory will contain an ``index`` subdirectory that contains the index built on this reference. + +- ``probe_csv``: A CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of `10x Probe Set Reference CSV `_, containing four mandatory columns: `gene_id`, `probe_seq`, `probe_id`, and `included` (must be ``TRUE`` or ``FALSE``), and an optional column: `region` (must be ``spliced`` or ``unspliced``). When parsing the file, ``simpleaf`` will only use the rows where the `included` column is ``TRUE``. For each row, ``simpleaf`` first builds a FASTA record where the identifier is set as `probe_id`, and the sequence is set as `probe_seq`. Then, it will build a t2g file where the first column is `probe_id` and the second column is `gene_id`. If the `region` column exists, the t2g file will include the region information, so as to trigger the USA mode in ``simpleaf quant`` to generate spliced and unspliced count separately. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set. +- ``feature_csv``: A CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of `10x Feature Reference CSV `_. 
Currently, only three columns are used: `id`, `name`, and `sequence`. When parsing the file, ``simpleaf`` first builds a FASTA file using the `id` and `sequence` columns. Then, it will build a t2g file where the transcript is set as `id` and the gene is set as `name`. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set. The relevant options (which you can obtain by running ``simpleaf index -h``) are: .. code-block:: console - + build the (expanded) reference index - - Usage: simpleaf index [OPTIONS] --output <--fasta |--ref-seq > - + + Usage: simpleaf index [OPTIONS] --output <--fasta |--ref-seq |--probe-csv |--feature-csv > + Options: - -o, --output path to output directory (will be created if it doesn't exist) - -t, --threads number of threads to use when running [default: 16] - -k, --kmer-length the value of k to be used to construct the index [default: 31] - --keep-duplicates keep duplicated identical sequences when constructing the index - -p, --sparse if this flag is passed, build the sparse rather than dense index for mapping - -h, --help Print help information - -V, --version Print version information - + -o, --output Path to output directory (will be created if it doesn't exist) + -t, --threads Number of threads to use when running [default: 16] + -k, --kmer-length The value of k to be used to construct the index [default: 31] + --gff3-format Denotes that the input annotation is a GFF3 (instead of GTF) + file + --keep-duplicates Keep duplicated identical sequences when constructing the index + --overwrite Overwrite existing files if the output directory is already + populated + -h, --help Print help + -V, --version Print version + Expanded Reference Options: - --ref-type specify whether an expanded reference, spliced+intronic (or splici) or spliced+unspliced (or spliceu), should be built [default: spliced+intronic] - -f, --fasta reference genome to be used for the expanded reference construction - -g, --gtf reference GTF file to be used for the expanded reference construction - -r, --rlen the target read length the splici index will be built for - --dedup deduplicate identical sequences in roers when building an expanded reference reference - --spliced path to FASTA file with extra spliced sequence to add to the index - --unspliced path to FASTA file with extra unspliced sequence to add to the index - + --ref-type Specify whether an expanded reference, spliced+intronic (or splici) + or spliced+unspliced (or spliceu), should be built [default: + spliced+intronic] + -f, --fasta Path to a reference genome to be used for the expanded reference + construction + -g, --gtf Path to a reference GTF/GFF3 file to be used for the expanded + reference construction + -r, --rlen The Read length used in roers to add flanking lengths to intronic + sequences + --dedup Deduplicate identical sequences in roers when building the expanded + reference + --spliced Path to FASTA file with extra spliced sequence to add to the index + --unspliced Path to a FASTA file with extra unspliced sequence to add to the + index + Direct Reference Options: - --ref-seq target sequences (provide target sequences directly; avoid expanded reference construction) - + --feature-csv A CSV file containing feature barcode sequences to use for + direct reference indexing. The file must follow the format of + 10x Feature Reference CSV. 
Currently, only three columns are + used: id, name, and sequence + --probe-csv A CSV file containing probe sequences to use for direct + reference indexing. The file must follow the format of 10x Probe + Set Reference v2 CSV, containing four mandatory columns: + gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and + an optional column: region (spliced or unspliced) + --ref-seq A FASTA file containing reference sequences to directly build + index on, and avoid expanded reference construction + Piscem Index Options: - --use-piscem use piscem instead of salmon for indexing and mapping - -m, --minimizer-length the value of m to be used to construct the piscem index (must be < k) [default: 19] + -m, --minimizer-length + Minimizer length to be used to construct the piscem index (must be < k) [default: 19] + --decoy-paths + Paths to decoy sequence FASTA files used to insert poison k-mer information into the + index (only if using piscem >= 0.7) + --seed + The seed value to use in SSHash index construction (try changing this in the rare event + index build fails) [default: 1] + --work-dir + The working directory where temporary files should be placed [default: ./workdir.noindex] + --use-piscem + Use piscem instead of salmon for indexing and mapping (default) + + Alternative salmon-alevin Index Options: + -p, --sparse If this flag is passed, build the sparse rather than dense index for mapping + --no-piscem Don't use the default piscem mapper, instead, use salmon-alevin + + diff --git a/docs/source/quant-command.rst b/docs/source/quant-command.rst index 1523a06..2bb78e7 100644 --- a/docs/source/quant-command.rst +++ b/docs/source/quant-command.rst @@ -30,7 +30,7 @@ The custom format is as follows; you must specify the content of read 1 and read In particular, this is how one would specify the 10x Chromium v3 geometry using the custom syntax. The format string says that the read pair should be interpreted as read 1 ``1{...}`` followed by read 2 ``2{...}``. The syntax inside the ``{}`` says how the read should be interpreted. Here ``b[16]u[12]x:`` means that the first 16 bases constitute the barcode, the next 12 constitute the UMI, and anything that comes after that (if it exists) until the end of read 1 should be discarded (``x``). For read 2, we have ``2{r:}``, meaning that we should interpret read 2, in it's full length, as biological sequence. -It is possible to have pieces of geometry repeated, in which case they will be extracted and concatenated together. For example, ``1{b[16]u[12]b[4]x:}`` would mean that we should obtain the barcode by extracting bases 1-16 (1-based indexing) and 29-32 and concatenating them togehter to obtain the full barcode. A +It is possible to have pieces of geometry repeated, in which case they will be extracted and concatenated together. For example, ``1{b[16]u[12]b[4]x:}`` would mean that we should obtain the barcode by extracting bases 1-16 (1-based indexing) and 29-32 and concatenating them together to obtain the full barcode. A .. note:: @@ -46,35 +46,78 @@ The relevant options (which you can obtain by running ``simpleaf quant -h``) are .. 
code-block:: bash - quantify a sample - - Usage: simpleaf quant [OPTIONS] --chemistry --output --resolution <--knee|--unfiltered-pl []|--forced-cells |--expect-cells > <--index |--map-dir > - - Options: - -c, --chemistry chemistry - -o, --output output directory - -t, --threads number of threads to use when running [default: 16] - -h, --help Print help information - -V, --version Print version information - - Mapping Options: - -i, --index path to index - -1, --reads1 comma-separated list of paths to read 1 files - -2, --reads2 comma-separated list of paths to read 2 files - -s, --use-selective-alignment use selective-alignment for mapping (instead of pseudoalignment with structural constraints) - --use-piscem use piscem for mapping (requires that index points to the piscem index) - --map-dir path to a mapped output directory containing a RAD file to skip mapping - - Permit List Generation Options: - -k, --knee use knee filtering mode - -u, --unfiltered-pl [] use unfiltered permit list - -f, --forced-cells use forced number of cells - -x, --explicit-pl use a filtered, explicit permit list - -e, --expect-cells use expected number of cells - -d, --expected-ori The expected direction/orientation of alignments in the chemistry being processed. If not provided, will default to `fw` for 10xv2/10xv3, otherwise `both` [possible - values: fw, rc, both] - --min-reads minimum read count threshold for a cell to be retained/processed; only used with --unfiltered-pl [default: 10] - - UMI Resolution Options: - -m, --t2g-map transcript to gene map - -r, --resolution resolution mode [possible values: cr-like, cr-like-em, parsimony, parsimony-em, parsimony-gene, parsimony-gene-em] + quantify a sample + + Usage: simpleaf quant [OPTIONS] --chemistry --output --resolution <--expect-cells |--explicit-pl |--forced-cells |--knee|--unfiltered-pl []> <--index |--map-dir > + + Options: + -c, --chemistry The name of a registered chemistry or a quoted string representing a + custom geometry specification + -o, --output Path to the output directory + -t, --threads Number of threads to use when running [default: 16] + -h, --help Print help + -V, --version Print version + + Mapping Options: + -i, --index Path to a folder containing the index files + -1, --reads1 Comma-separated list of paths to read 1 files. The order must + match the read 2 files + -2, --reads2 Comma-separated list of paths to read 2 files. The order must + match the read 1 files + --no-piscem Don't use the default piscem mapper, instead, use salmon-alevin + --use-piscem Use piscem for mapping (requires that index points to the piscem + index) + -s, --use-selective-alignment Use selective-alignment for mapping (only if using salmon alevin + as the underlying mapper) + --map-dir Path to a mapped output directory containing a RAD file to skip + mapping + + Piscem Mapping Options: + --struct-constraints + If piscem >= 0.7.0, enable structural constraints + --ignore-ambig-hits + Skip checking of the equivalence classes of k-mers that were too ambiguous to be + otherwise considered (passing this flag can speed up mapping slightly, but may reduce + specificity) + --no-poison + Do not consider poison k-mers, even if the underlying index contains them. 
In this case, + the mapping results will be identical to those obtained as if no poison table was added + to the index + --skipping-strategy + The skipping strategy to use for k-mer collection [default: permissive] [possible values: + permissive, strict] + --max-ec-card + Determines the maximum cardinality equivalence class (number of (txp, orientation status) + pairs) to examine (cannot be used with --ignore-ambig-hits) [default: 4096] + --max-hit-occ + In the first pass, consider only collected and matched k-mers of a read having <= + --max-hit-occ hits [default: 256] + --max-hit-occ-recover + If all collected and matched k-mers of a read have > --max-hit-occ hits, then make a + second pass and consider k-mers having <= --max-hit-occ-recover hits [default: 1024] + --max-read-occ + Threshold for discarding reads with too many mappings [default: 2500] + + Permit List Generation Options: + -k, --knee + Use knee filtering mode + -u, --unfiltered-pl [] + Use unfiltered permit list + -f, --forced-cells + Use forced number of cells + -x, --explicit-pl + Use a filtered, explicit permit list + -e, --expect-cells + Use expected number of cells + -d, --expected-ori + The expected direction/orientation of alignments in the chemistry being processed. If not + provided, will default to `fw` for 10xv2/10xv3, otherwise `both` [possible values: fw, + rc, both] + --min-reads + Minimum read count threshold for a cell to be retained/processed; only use with + --unfiltered-pl [default: 10] + + UMI Resolution Options: + -m, --t2g-map Path to a transcript to gene map file + -r, --resolution UMI resolution mode [possible values: cr-like, cr-like-em, + parsimony, parsimony-em, parsimony-gene, parsimony-gene-em] \ No newline at end of file diff --git a/src/simpleaf_commands.rs b/src/simpleaf_commands.rs index 3bf3b08..f70ca6a 100644 --- a/src/simpleaf_commands.rs +++ b/src/simpleaf_commands.rs @@ -200,7 +200,7 @@ pub struct MapQuantOpts { conflicts_with = "use_piscem")] pub max_hit_occ_recover: u32, - /// Threshold for discarding reads with too many mappings + /// Threshold for discarding reads with too many mappings #[arg(long, default_value_t = DefaultParams::MAX_READ_OCC, help_heading = "Piscem Mapping Options", @@ -258,7 +258,7 @@ pub struct MapQuantOpts { #[command(group( ArgGroup::new("reftype") .required(true) - .args(["fasta", "ref_seq"]) + .args(["fasta", "ref_seq", "probe_csv", "feature_csv"]) ))] pub struct IndexOpts { /// Specify whether an expanded reference, spliced+intronic (or splici) or spliced+unspliced (or spliceu), should be built @@ -270,7 +270,7 @@ pub struct IndexOpts { requires_ifs([ (ArgPredicate::IsPresent, "gtf") ]), - conflicts_with = "ref_seq")] + conflicts_with_all = ["ref_seq", "feature_csv", "probe_csv"])] pub fasta: Option, /// Path to a reference GTF/GFF3 file to be used for the expanded reference construction @@ -280,7 +280,7 @@ pub struct IndexOpts { help_heading = "Expanded Reference Options", display_order = 3, requires = "fasta", - conflicts_with = "ref_seq" + conflicts_with_all = ["ref_seq", "feature_csv", "probe_csv"] )] pub gtf: Option, @@ -289,7 +289,7 @@ pub struct IndexOpts { long, display_order = 4, requires = "fasta", - conflicts_with = "ref_seq" + conflicts_with_all = ["ref_seq", "feature_csv", "probe_csv"] )] pub gff3_format: bool, @@ -300,7 +300,7 @@ pub struct IndexOpts { help_heading = "Expanded Reference Options", display_order = 5, requires = "fasta", - conflicts_with = "ref_seq", + conflicts_with_all = ["ref_seq", "feature_csv", "probe_csv"], default_value_t 
= 91, hide_default_value = true )] @@ -312,22 +312,22 @@ pub struct IndexOpts { help_heading = "Expanded Reference Options", display_order = 6, requires = "fasta", - conflicts_with = "ref_seq" + conflicts_with_all = ["ref_seq", "feature_csv", "probe_csv"] )] pub dedup: bool, - /// Reference sequences to directly build index on, and avoid expanded reference construction + /// Path to a FASTA file containing reference sequences to directly build index on, and avoid expanded reference construction #[arg(long, alias = "refseq", help_heading = "Direct Reference Options", display_order = 7, - conflicts_with_all = ["dedup", "unspliced", "spliced", "rlen", "gtf", "fasta"])] + conflicts_with_all = ["dedup", "unspliced", "spliced", "rlen", "gtf", "fasta", "feature_csv", "probe_csv"])] pub ref_seq: Option, - /// Path to FASTA file with extra spliced sequence to add to the index + /// Path to a FASTA file with extra spliced sequence to add to the index #[arg( long, help_heading = "Expanded Reference Options", display_order = 8, requires = "fasta", - conflicts_with = "ref_seq" + conflicts_with_all = ["ref_seq", "feature_csv", "probe_csv"] )] pub spliced: Option, @@ -337,7 +337,7 @@ pub struct IndexOpts { help_heading = "Expanded Reference Options", display_order = 9, requires = "fasta", - conflicts_with = "ref_seq" + conflicts_with_all = ["ref_seq", "feature_csv", "probe_csv"] )] pub unspliced: Option, @@ -438,6 +438,16 @@ pub struct IndexOpts { display_order = 2 )] pub sparse: bool, + + /// Path to a CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of 10x Probe Set Reference v2 CSV, containing four mandatory columns: gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and an optional column: region (spliced or unspliced). + #[arg(long, help_heading = "Direct Reference Options", display_order = 7, + conflicts_with_all = ["dedup", "unspliced", "spliced", "rlen", "gtf", "fasta", "ref_seq", "feature_csv"])] + pub probe_csv: Option, + + /// Path to a CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of 10x Feature Reference CSV. Currently, only three columns are used: id, name, and sequence. 
+ #[arg(long, help_heading = "Direct Reference Options", display_order = 7, + conflicts_with_all = ["dedup", "unspliced", "spliced", "rlen", "gtf", "fasta", "ref_seq", "probe_csv"])] + pub feature_csv: Option, } /// Remove chemistries from the local chemistry registry @@ -547,7 +557,7 @@ pub enum ChemistryCommand { } #[derive(Args, Clone, Debug)] -#[command(arg_required_else_help = true)] +#[command(arg_required_else_help = false)] pub struct SetPathOpts { /// path to salmon to use #[arg(short, long)] diff --git a/src/simpleaf_commands/indexing.rs b/src/simpleaf_commands/indexing.rs index 6027c27..a589c14 100644 --- a/src/simpleaf_commands/indexing.rs +++ b/src/simpleaf_commands/indexing.rs @@ -1,11 +1,15 @@ +use crate::utils::af_utils::create_dir_if_absent; use crate::utils::prog_utils; use crate::utils::prog_utils::{CommandVerbosityLevel, ReqProgs}; -use anyhow::{bail, Context}; -use cmd_lib::run_fun; +use anyhow::{anyhow, bail, Context}; use roers; +use serde::Deserialize; use serde_json::json; use serde_json::Value; +use std::collections::HashSet; +use std::fs::File; +use std::io::{BufWriter, Write}; use std::path::{Path, PathBuf}; use std::time::Instant; use tracing::{error, info, warn}; @@ -29,6 +33,199 @@ fn validate_index_type_opts(opts: &IndexOpts) -> anyhow::Result<()> { Ok(()) } +enum CsvReader { + Probe(csv::Reader), + Feature(csv::Reader), +} + +impl CsvReader { + fn has_region(&mut self) -> anyhow::Result { + let rdr = match self { + CsvReader::Probe(rdr) => rdr, + CsvReader::Feature(rdr) => rdr, + }; + + let headers = rdr.headers()?.iter().collect::>(); + Ok(headers.contains(&"region")) + } +} + +trait CsvRow<'de> { + fn ref_id(&self) -> &str; + fn sequence(&self) -> &str; + fn seq_id(&self) -> &str; + fn included(&self) -> bool; + fn region(&self) -> Option<&ProbeRegion>; +} + +#[derive(Deserialize, Debug)] +#[allow(clippy::upper_case_acronyms)] +enum Included { + TRUE, + FALSE, +} + +#[allow(dead_code)] +impl Included { + fn from_str(s: &str) -> anyhow::Result { + match s { + "TRUE" => Ok(Included::TRUE), + "FALSE" => Ok(Included::FALSE), + _ => Err(anyhow!("Invalid include value: {}", s)), + } + } +} + +impl From for bool { + fn from(f: Included) -> bool { + match f { + Included::TRUE => true, + Included::FALSE => false, + } + } +} + +#[derive(Deserialize, Debug)] +struct ProbeRow { + gene_id: String, + probe_seq: String, + probe_id: String, + included: Option, + region: Option, +} + +impl CsvRow<'_> for ProbeRow { + fn ref_id(&self) -> &str { + &self.gene_id + } + + fn sequence(&self) -> &str { + &self.probe_seq + } + + fn seq_id(&self) -> &str { + &self.probe_id + } + + fn included(&self) -> bool { + !matches!(self.included, Some(Included::FALSE)) + } + + fn region(&self) -> Option<&ProbeRegion> { + self.region.as_ref() + } +} + +#[derive(Deserialize, Debug)] +#[allow(dead_code)] +struct FeatureRow { + id: String, + name: String, + sequence: String, + read: Option, + pattern: Option, + feature_type: Option, + mhc_allele: Option, +} + +impl CsvRow<'_> for FeatureRow { + fn ref_id(&self) -> &str { + &self.name + } + + fn sequence(&self) -> &str { + &self.sequence + } + + fn seq_id(&self) -> &str { + &self.id + } + + fn included(&self) -> bool { + true + } + + fn region(&self) -> Option<&ProbeRegion> { + None + } +} + +#[derive(serde::Deserialize, Debug)] +#[allow(non_camel_case_types)] +enum ProbeRegion { + spliced, + unspliced, +} + +#[allow(dead_code)] +impl ProbeRegion { + fn from_str(s: &str) -> anyhow::Result { + match s { + "spliced" => Ok(ProbeRegion::spliced), + 
"unspliced" => Ok(ProbeRegion::unspliced), + _ => Err(anyhow!("Invalid probe region: {}", s)), + } + } + + fn as_str(&self) -> &str { + match self { + ProbeRegion::spliced => "S", + ProbeRegion::unspliced => "U", + } + } +} + +// implement display +impl std::fmt::Display for ProbeRegion { + fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { + write!(f, "{}", self.as_str()) + } +} + +#[allow(clippy::too_many_arguments)] +fn parse_csv_record( + ref_id: &str, + seq_id: &str, + sequence: &str, + include: bool, + region: Option<&ProbeRegion>, + has_region: bool, + seq_id_hs: &mut HashSet, + ref_seq_writer: &mut BufWriter, + // id_to_name_writer: &mut BufWriter, + t2g_writer: &mut BufWriter, +) -> anyhow::Result<()> { + if !include { + return Ok(()); + } + + // check if probe id is duplicated + if seq_id_hs.contains(seq_id) { + bail!("Found duplicated sequence id {}; cannot proceed", seq_id); + } else { + seq_id_hs.insert(seq_id.to_string()); + } + + // check if region is expected + if has_region != region.is_some() { + bail!("Found invalid sequence, {}, with missing region.", seq_id); + } + + // insert into t2g + if let Some(r) = region { + writeln!(t2g_writer, "{}\t{}\t{}", seq_id, ref_id, r.as_str())?; + } else { + writeln!(t2g_writer, "{}\t{}", seq_id, ref_id)?; + }; + + // insert into gene id to name + // writeln!(id_to_name_writer, "{}\t{}", ref_id, ref_id)?; + + // insert into ref seq + writeln!(ref_seq_writer, ">{}\n{}", seq_id, sequence)?; + Ok(()) +} + pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Result<()> { validate_index_type_opts(&opts)?; let mut threads = opts.threads; @@ -57,7 +254,7 @@ pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Resu } }); - run_fun!(mkdir -p $output)?; + create_dir_if_absent(&output)?; // wow, the compiler is smart enough to // figure out that this one need not be @@ -66,9 +263,12 @@ pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Resu let reference_sequence; // these may or may not be set, so must be // mutable. - let mut splici_t2g = None; + let mut t2g = None; + let mut _gene_id_to_name = None; let mut roers_duration = None; let mut roers_aug_ref_opt = None; + let outref = output.join("ref"); + let min_seq_len: Option; // if we are generating a splici reference if let (Some(fasta), Some(gtf)) = (opts.fasta, opts.gtf) { @@ -87,8 +287,7 @@ pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Resu ReferenceType::SplicedUnspliced => Some(vec![roers::AugType::GeneBody]), }; - let outref = output.join("ref"); - run_fun!(mkdir -p $outref)?; + create_dir_if_absent(&outref)?; let roers_opts = roers::AugRefOpts { // The path to a genome fasta file. 
@@ -113,23 +312,16 @@ pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Resu let ref_file = outref.join("roers_ref.fa"); let t2g_file = outref.join("t2g_3col.tsv"); + let gene_id_to_name_file = outref.join("gene_id_to_name.tsv"); index_info["t2g_file"] = json!(&t2g_file); + index_info["gene_id_to_name"] = json!(&gene_id_to_name_file); index_info["args"]["fasta"] = json!(&fasta); index_info["args"]["gtf"] = json!(>f); index_info["args"]["spliced"] = json!(&opts.spliced); index_info["args"]["unspliced"] = json!(&opts.unspliced); index_info["args"]["dedup"] = json!(opts.dedup); - std::fs::write( - &info_file, - serde_json::to_string_pretty(&index_info).unwrap(), - ) - .with_context(|| format!("could not write {}", info_file.display()))?; - - // set the splici_t2g option - splici_t2g = Some(t2g_file); - prog_utils::check_files_exist(&input_files)?; info!("preparing to make reference with roers"); @@ -138,24 +330,117 @@ pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Resu roers::make_ref(roers_opts)?; roers_duration = Some(roers_start.elapsed()); + min_seq_len = None; reference_sequence = Some(ref_file); + // set the splici_t2g option + t2g = Some(t2g_file); + _gene_id_to_name = Some(gene_id_to_name_file); + } else if let Some(ref_seq) = &opts.ref_seq { + // if we have a ref-seq fasta file + min_seq_len = None; + index_info["args"]["ref-seq"] = json!(ref_seq); + reference_sequence = Some(ref_seq.clone()); } else { - // we are running on a set of references directly + // now, we have to have a probe csv or feature csv + create_dir_if_absent(&outref)?; + + let mut csv_reader = if let Some(probe_csv) = &opts.probe_csv { + index_info["args"]["probe-csv"] = json!(probe_csv); + let rdr = csv::ReaderBuilder::new() + .comment(Some(b'#')) + .from_path(probe_csv)?; + CsvReader::Probe(rdr) + } else if let Some(feature_csv) = &opts.feature_csv { + index_info["args"]["feature-csv"] = json!(feature_csv); + + let rdr = csv::ReaderBuilder::new() + .has_headers(true) + .comment(Some(b'#')) + .from_path(feature_csv)?; + + CsvReader::Feature(rdr) + } else { + bail!("No reference sequence provided. It should not happen, please report this issue on GitHub."); + }; - // in this path (due to the argument parser requiring - // either --fasta or --ref-seq, ref-seq should be safe to - // unwrap). 
- index_info["args"]["ref-seq"] = json!(opts.ref_seq.clone().unwrap()); + // determine the format of t2g file + let has_region = csv_reader.has_region()?; - std::fs::write( - &info_file, - serde_json::to_string_pretty(&index_info).unwrap(), - ) - .with_context(|| format!("could not write {}", info_file.display()))?; + // we process the csv file and record the t2g and gene id to name vector + let mut seq_id_hs: HashSet = HashSet::new(); + + // define file names + let ref_seq_path = outref.join("ref.fa"); + // let id_to_name_path = outref.join("gene_id_to_name.tsv"); + let t2g_path = if has_region { + outref.join("t2g_3col.tsv") + } else { + outref.join("t2g.tsv") + }; + + // define buffer writers + let mut ref_seq_writer = BufWriter::new(File::create(&ref_seq_path)?); + // let mut id_to_name_writer = BufWriter::new(File::create(&id_to_name_path)?); + let mut t2g_writer = BufWriter::new(File::create(&t2g_path)?); + let mut msl = u32::MAX; + + match csv_reader { + CsvReader::Feature(mut rdr) => { + // process the csv file + for row in rdr.deserialize() { + let record: FeatureRow = row?; + msl = msl.min(record.sequence().len() as u32); + + parse_csv_record( + record.ref_id(), + record.seq_id(), + record.sequence(), + record.included(), + record.region(), + has_region, + &mut seq_id_hs, + &mut ref_seq_writer, + // &mut id_to_name_writer, + &mut t2g_writer, + )?; + } + } + CsvReader::Probe(mut rdr) => { + // process the csv file + for row in rdr.deserialize() { + let record: ProbeRow = row?; + + parse_csv_record( + record.ref_id(), + record.seq_id(), + record.sequence(), + record.included(), + record.region(), + has_region, + &mut seq_id_hs, + &mut ref_seq_writer, + // &mut id_to_name_writer, + &mut t2g_writer, + )?; + } + } + } + + index_info["t2g_file"] = json!(&t2g_path); + // index_info["gene_id_to_name"] = json!(&id_to_name_path); - reference_sequence = opts.ref_seq; + min_seq_len = Some(msl); + reference_sequence = Some(ref_seq_path); + t2g = Some(t2g_path); + // _gene_id_to_name = Some(id_to_name_path); } + std::fs::write( + &info_file, + serde_json::to_string_pretty(&index_info).unwrap(), + ) + .with_context(|| format!("could not write {}", info_file.display()))?; + let ref_seq = reference_sequence.with_context(|| "Reference sequence should either be generated from --fasta with reftype spliced+intronic / spliced+unspliced or set with --ref-seq", )?; @@ -163,6 +448,28 @@ pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Resu let input_files = vec![ref_seq.clone()]; prog_utils::check_files_exist(&input_files)?; + let kmer_length: u32; + let minimizer_length: u32; + if let Some(msl) = min_seq_len { + if msl < 10 { + bail!("The reference sequences are too short for indexing. 
Please provide sequences with a minimum length of at least 10 bases."); + } + // only if the value is not the default value + if (msl / 2) < opts.kmer_length && opts.kmer_length == 31 { + kmer_length = msl / 2; + // https://github.com/COMBINE-lab/protocol-estuary/blob/2ecc65f1891ebfafff2a4a17460550e4dd1f4bb6/utils/simpleaf_workflow_utils.libsonnet#L232 + minimizer_length = (kmer_length as f32 / 1.8).ceil() as u32 + 1; + + info!("Using kmer_length = {} and minimizer_length = {} because the default values are too big for the reference sequences.", kmer_length, minimizer_length); + } else { + kmer_length = opts.kmer_length; + minimizer_length = opts.minimizer_length; + }; + } else { + kmer_length = opts.kmer_length; + minimizer_length = opts.minimizer_length; + } + let output_index_dir = output.join("index"); let index_duration; let index_cmd_string: String; @@ -182,15 +489,15 @@ pub fn build_ref_and_index(af_home_path: &Path, opts: IndexOpts) -> anyhow::Resu let mut piscem_index_cmd = std::process::Command::new(format!("{}", piscem_prog_info.exe_path.display())); - run_fun!(mkdir -p $output_index_dir)?; + create_dir_if_absent(&output_index_dir)?; let output_index_stem = output_index_dir.join("piscem_idx"); piscem_index_cmd .arg("build") .arg("-k") - .arg(opts.kmer_length.to_string()) + .arg(kmer_length.to_string()) .arg("-m") - .arg(opts.minimizer_length.to_string()) + .arg(minimizer_length.to_string()) .arg("-o") .arg(&output_index_stem) .arg("-s") @@ -264,7 +571,7 @@ simpleaf"#, // copy over the t2g file to the index let mut t2g_out_path: Option = None; - if let Some(t2g_file) = splici_t2g { + if let Some(t2g_file) = t2g { let index_t2g_path = output_index_dir.join("t2g_3col.tsv"); t2g_out_path = Some(PathBuf::from("t2g_3col.tsv")); std::fs::copy(t2g_file, index_t2g_path)?; @@ -276,8 +583,8 @@ simpleaf"#, "index_type" : "piscem", "t2g_file" : t2g_out_path, "piscem_index_parameters" : { - "k" : opts.kmer_length, - "m" : opts.minimizer_length, + "k" : kmer_length, + "m" : minimizer_length, "overwrite" : opts.overwrite, "threads" : threads, "ref" : ref_seq @@ -301,7 +608,7 @@ simpleaf"#, salmon_index_cmd .arg("index") .arg("-k") - .arg(opts.kmer_length.to_string()) + .arg(kmer_length.to_string()) .arg("-i") .arg(&output_index_dir) .arg("-t") @@ -356,7 +663,7 @@ simpleaf"#, // copy over the t2g file to the index let mut t2g_out_path: Option = None; - if let Some(t2g_file) = splici_t2g { + if let Some(t2g_file) = t2g { let index_t2g_path = output_index_dir.join("t2g_3col.tsv"); t2g_out_path = Some(PathBuf::from("t2g_3col.tsv")); std::fs::copy(t2g_file, index_t2g_path)?; @@ -364,7 +671,8 @@ simpleaf"#, let index_json_file = output_index_dir.join("simpleaf_index.json"); let index_json = json!({ - "cmd" : index_cmd_string, "index_type" : "salmon", + "cmd" : index_cmd_string, + "index_type" : "salmon", "t2g_file" : t2g_out_path, "salmon_index_parameters" : { "k" : opts.kmer_length, diff --git a/src/simpleaf_commands/quant.rs b/src/simpleaf_commands/quant.rs index c2b1d87..c232c25 100644 --- a/src/simpleaf_commands/quant.rs +++ b/src/simpleaf_commands/quant.rs @@ -176,6 +176,8 @@ impl CBListInfo { writeln!(final_barcodes_bw, "{}", l)?; } + // we remove the intermediate cb_list file we created + std::fs::remove_file(&self.final_file)?; Ok(()) } } diff --git a/src/utils/af_utils.rs b/src/utils/af_utils.rs index a04e628..c973471 100644 --- a/src/utils/af_utils.rs +++ b/src/utils/af_utils.rs @@ -378,7 +378,7 @@ pub fn create_dir_if_absent>(odir: T) -> Result<()> { "The directory {} doesn't yet exist; 
attempting to create it.", pdir.display() ); - std::fs::create_dir(pdir) + std::fs::create_dir_all(pdir) .with_context(|| format!("Couldn't create the directory at {}", pdir.display()))?; } Ok(()) diff --git a/src/utils/workflow_utils.rs b/src/utils/workflow_utils.rs index e7abc14..cbf4a71 100644 --- a/src/utils/workflow_utils.rs +++ b/src/utils/workflow_utils.rs @@ -5,7 +5,7 @@ use anyhow::{anyhow, bail, Context}; use chrono::{DateTime, Local}; use clap::Parser; -use cmd_lib::run_cmd; +// use cmd_lib::run_cmd; use serde_json::{json, Map, Value}; use std::boxed::Box; use std::collections::HashMap; @@ -20,6 +20,7 @@ use crate::utils::prog_utils; use crate::utils::prog_utils::CommandVerbosityLevel; use crate::{Cli, Commands}; +use super::af_utils::create_dir_if_absent; use super::jrsonnet_main::ParseAction; use super::prog_utils::shell; @@ -1288,9 +1289,7 @@ pub fn get_protocol_estuary>( Ok(protocol_estuary) } else { // make pe - if !pe_dir.exists() { - run_cmd!(mkdir -p $pe_dir)?; - } + create_dir_if_absent(&pe_dir)?; prog_utils::download_to_file(dl_url, &pe_zip_file)?;
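The k-mer selection logic added earlier in ``build_ref_and_index`` (the ``min_seq_len`` handling) is worth a worked illustration: when the direct reference consists of short probe or feature-barcode sequences and the user left ``-k`` at its default of 31, ``k`` is shrunk to half the shortest sequence length and ``m`` is derived from ``k`` with the ratio used in protocol-estuary's workflow utilities; sequences shorter than 10 bases are rejected outright. The sketch below factors that inline logic into a hypothetical standalone helper (``choose_k_and_m`` is not a function in this PR). For example, a 15-nt feature-barcode reference yields ``k = 7`` and ``m = 5``, while a 100-nt reference keeps the defaults.

// Hypothetical standalone refactor of the inline k/m selection in build_ref_and_index.
fn choose_k_and_m(min_seq_len: u32, default_k: u32, default_m: u32) -> Result<(u32, u32), String> {
    if min_seq_len < 10 {
        return Err("reference sequences must be at least 10 bases long".to_string());
    }
    // shrink only when the user kept the default k (31) and the sequences are short
    if (min_seq_len / 2) < default_k && default_k == 31 {
        let k = min_seq_len / 2;
        // same k-to-m ratio used in protocol-estuary's simpleaf_workflow_utils
        let m = (k as f32 / 1.8).ceil() as u32 + 1;
        Ok((k, m))
    } else {
        Ok((default_k, default_m))
    }
}

fn main() {
    // a 15-nt feature-barcode reference: k shrinks to 7, m to 5
    println!("{:?}", choose_k_and_m(15, 31, 19));
    // a 100-nt reference keeps the defaults (31, 19)
    println!("{:?}", choose_k_and_m(100, 31, 19));
}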