Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make simpleaf index take probe and feature CSV files #175

Merged
merged 7 commits into from
Jan 6, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions docs/source/chemistry-command.rst
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ The ``refresh`` sub-command takes no *required* arguments; it's usage is shown b
Usage: simpleaf chemistry refresh [OPTIONS]

Options:
-f, --force overwrite an existing matched chemistry even if the version isn't newer
-f, --force overwrite an existing matched chemistry even if the version is not newer
-d, --dry-run report what would happen with a refresh without actually performing one on the actual chemistry registry
-h, --help Print help

Expand Down Expand Up @@ -79,9 +79,9 @@ Every chemistry added to the registry has three mandatory properties: ``name``,


- ``name``: A unique name (within the existing registry) of the chemistry. It must be a valid UTF-8 identifier. If the name is already registered, the existing definition will be updated if a higher ``--version`` is provided (see below for details). Otherwise, simpleaf will complain and fail.
- ``geometry``: The geometry specification must be provided as a quoted string, and must follow the `Sequence Fragment Geometry Description Language <https://hackmd.io/@PI7Og0l1ReeBZu_pjQGUQQ/rJMgmvr13>`_ as used in the `quant command <https://simpleaf.readthedocs.io/en/latest/quant-command.html#a-note-on-the-chemistry-flag>`.
- ``geometry``: The geometry specification must be provided as a quoted string, and must follow the `Sequence Fragment Geometry Description Language <https://hackmd.io/@PI7Og0l1ReeBZu_pjQGUQQ/rJMgmvr13>`_ as used in the `quant command <https://simpleaf.readthedocs.io/en/latest/quant-command.html#a-note-on-the-chemistry-flag>`_.
- ``expected-ori``: The expected orientation of the chemistry. It must be one of the following: fw (forward), rc (reverse complement), or both (both orientations). It describes the expected orientation relative to the first (most upstream) mappable biological sequence.
Imagine we have reads from 10x Chromium 5' protocols with read1s and read2s both of 150 base pairs. With this specification, a read1, which is in the forward orientation, contains, from 5' to 3', a cell barcode, a UMI, a fixed fragment, and a fragment representing the 5' end of the cDNA. A read2, which is in the reverse complementary orientation, contains the second (downstream) cDNA fragment relative to its read1. You can find a detailed explanation of the 10x Chromium 5' protocol from Single Cell Genomics Library Structure <https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium5.html>_.
Imagine we have reads from 10x Chromium 5' protocols with read1s and read2s both of 150 base pairs. With this specification, a read1, which is in the forward orientation, contains, from 5' to 3', a cell barcode, a UMI, a fixed fragment, and a fragment representing the 5' end of the cDNA. A read2, which is in the reverse complementary orientation, contains the second (downstream) cDNA fragment relative to its read1. You can find a detailed explanation of the 10x Chromium 5' protocol from `Single Cell Genomics Library Structure <https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium5.html>`_.
If we map the biological sequence in read1s and read2s as paired-end reads (currently only supported when using the default mapper -- piscem), as biological read1s are the first mappable sequences, the expected orientation for this chemistry should be ``fw``, the orientation of read1s. However, if we only map read2s, the expected orientation should be ``rc``, because read2s are the first mappable sequences and are in the reverse complementary orientation.

In addition to the required fields, there are 3 optional fields, as described below. A permit list file must be a TSV file without a header, and the first column must contain the sequence of permitted cell barcodes, i.e., the whitelist of cell barcodes.
Expand Down
91 changes: 65 additions & 26 deletions docs/source/index-command.rst
Original file line number Diff line number Diff line change
Expand Up @@ -3,39 +3,78 @@

The ``index`` command has two forms of input; either it will take a reference genome FASTA and GTF as input, from which it can build a spliced+intronic (splici) reference or a spliced+unspliced (spliceu) reference using `roers <https://github.com/COMBINE-lab/roers>`_ (which is used as a library directly from ``simpleaf``, and so need not be installed independently), or it will take a single reference sequence file (i.e. FASTA file) as input (direct-ref mode).

In expanded reference mode, after the expanded reference is constructed, the resulting reference will be indexed with ``piscem build`` or ``salmon index`` command (depending on the mapper you choose to use), and a copy of the 3-column transcript-to-gene file will be placed in the index directory for subsequent use. The output directory will contain both a ``ref`` and ``index`` subdirectoy, with the first containing the splici reference that was extracted from the provided genome and GTF, and the latter containing the index built on this reference.
In expanded reference mode, after the expanded reference is constructed, the resulting reference will be indexed with ``piscem build`` or ``salmon index`` command (depending on the mapper you choose to use), and a copy of the 3-column transcript-to-gene file will be placed in the index directory for subsequent use. The output directory will contain both a ``ref`` and ``index`` subdirectory, with the first containing the splici reference that was extracted from the provided genome and GTF, and the latter containing the index built on this reference.

In direct-ref mode, the provided fasta file (passed in with ``--refseq``) will be provided to ``piscem build`` or ``salmon index`` directly. The output diretory will contain an ``index`` subdirectory that contains the index built on this reference.
In direct-ref mode, if ``--refseq`` is passed, the provided FASTA file will be provided to ``piscem build`` or ``salmon index`` directly. If ``probe_csv`` or ``feature_csv`` is passed, a FASTA file will be created accordingly and provided to ``piscem build`` or ``salmon index``. The output directory will contain an ``index`` subdirectory that contains the index built on this reference.

- ``probe_csv``: A CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of `10x Probe Set Reference CSV <https://www.10xgenomics.com/support/cytassist-spatial-gene-expression/documentation/steps/probe-sets/visium-ffpe-probe-sets-files#:~:text=probe%20set%20downloads-,Probe%20set%20reference%20CSV%20file,-This%20CSV%20file>`_, containing four mandatory columns: `gene_id`, `probe_seq`, `probe_id`, and `included` (must be ``TRUE`` or ``FALSE``), and an optional column: `region` (must be ``spliced`` or ``unspliced``). When parsing the file, ``simpleaf`` will only use the rows where the `included` column is ``TRUE``. For each row, ``simpleaf`` first builds a FASTA record where the identifier is set as `probe_id`, and the sequence is set as `probe_seq`. Then, it will build a t2g file where the first column is `probe_id` and the second column is `gene_id`. If the `region` column exists, the t2g file will include the region information, so as to trigger the USA mode in ``simpleaf quant`` to generate spliced and unspliced count separately. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set.
- ``feature_csv``: A CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of `10x Feature Reference CSV <https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/inputs/cr-feature-ref-csv#columns>`_. Currently, only three columns are used: `id`, `name`, and `sequence`. When parsing the file, ``simpleaf`` first builds a FASTA file using the `id` and `sequence` columns. Then, it will build a t2g file where the transcript is set as `id` and the gene is set as `name`. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set.

The relevant options (which you can obtain by running ``simpleaf index -h``) are:

.. code-block:: console

build the (expanded) reference index
Usage: simpleaf index [OPTIONS] --output <OUTPUT> <--fasta <FASTA>|--ref-seq <REF_SEQ>>

Usage: simpleaf index [OPTIONS] --output <OUTPUT> <--fasta <FASTA>|--ref-seq <REF_SEQ>|--probe-csv <PROBE_CSV>|--feature-csv <FEATURE_CSV>>

Options:
-o, --output <OUTPUT> path to output directory (will be created if it doesn't exist)
-t, --threads <THREADS> number of threads to use when running [default: 16]
-k, --kmer-length <KMER_LENGTH> the value of k to be used to construct the index [default: 31]
--keep-duplicates keep duplicated identical sequences when constructing the index
-p, --sparse if this flag is passed, build the sparse rather than dense index for mapping
-h, --help Print help information
-V, --version Print version information

-o, --output <OUTPUT> Path to output directory (will be created if it doesn't exist)
-t, --threads <THREADS> Number of threads to use when running [default: 16]
-k, --kmer-length <KMER_LENGTH> The value of k to be used to construct the index [default: 31]
--gff3-format Denotes that the input annotation is a GFF3 (instead of GTF)
file
--keep-duplicates Keep duplicated identical sequences when constructing the index
--overwrite Overwrite existing files if the output directory is already
populated
-h, --help Print help
-V, --version Print version

Expanded Reference Options:
--ref-type <REF_TYPE> specify whether an expanded reference, spliced+intronic (or splici) or spliced+unspliced (or spliceu), should be built [default: spliced+intronic]
-f, --fasta <FASTA> reference genome to be used for the expanded reference construction
-g, --gtf <GTF> reference GTF file to be used for the expanded reference construction
-r, --rlen <RLEN> the target read length the splici index will be built for
--dedup deduplicate identical sequences in roers when building an expanded reference reference
--spliced <SPLICED> path to FASTA file with extra spliced sequence to add to the index
--unspliced <UNSPLICED> path to FASTA file with extra unspliced sequence to add to the index

--ref-type <REF_TYPE> Specify whether an expanded reference, spliced+intronic (or splici)
or spliced+unspliced (or spliceu), should be built [default:
spliced+intronic]
-f, --fasta <FASTA> Path to a reference genome to be used for the expanded reference
construction
-g, --gtf <GTF> Path to a reference GTF/GFF3 file to be used for the expanded
reference construction
-r, --rlen <RLEN> The Read length used in roers to add flanking lengths to intronic
sequences
--dedup Deduplicate identical sequences in roers when building the expanded
reference
--spliced <SPLICED> Path to FASTA file with extra spliced sequence to add to the index
--unspliced <UNSPLICED> Path to a FASTA file with extra unspliced sequence to add to the
index

Direct Reference Options:
--ref-seq <REF_SEQ> target sequences (provide target sequences directly; avoid expanded reference construction)

--feature-csv <FEATURE_CSV> A CSV file containing feature barcode sequences to use for
direct reference indexing. The file must follow the format of
10x Feature Reference CSV. Currently, only three columns are
used: id, name, and sequence
--probe-csv <PROBE_CSV> A CSV file containing probe sequences to use for direct
reference indexing. The file must follow the format of 10x Probe
Set Reference v2 CSV, containing four mandatory columns:
gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and
an optional column: region (spliced or unspliced)
--ref-seq <REF_SEQ> A FASTA file containing reference sequences to directly build
index on, and avoid expanded reference construction

Piscem Index Options:
--use-piscem use piscem instead of salmon for indexing and mapping
-m, --minimizer-length <MINIMIZER_LENGTH> the value of m to be used to construct the piscem index (must be < k) [default: 19]
-m, --minimizer-length <MINIMIZER_LENGTH>
Minimizer length to be used to construct the piscem index (must be < k) [default: 19]
--decoy-paths <DECOY_PATHS>
Paths to decoy sequence FASTA files used to insert poison k-mer information into the
index (only if using piscem >= 0.7)
--seed <HASH_SEED>
The seed value to use in SSHash index construction (try changing this in the rare event
index build fails) [default: 1]
--work-dir <WORK_DIR>
The working directory where temporary files should be placed [default: ./workdir.noindex]
--use-piscem
Use piscem instead of salmon for indexing and mapping (default)

Alternative salmon-alevin Index Options:
-p, --sparse If this flag is passed, build the sparse rather than dense index for mapping
--no-piscem Don't use the default piscem mapper, instead, use salmon-alevin


Loading