COMBINE-lab · rob-p · Jan 6, 2025 · Jan 5, 2025 · Jan 5, 2025 · Jan 5, 2025
diff --git a/docs/source/chemistry-command.rst b/docs/source/chemistry-command.rst
@@ -43,7 +43,7 @@ The ``refresh`` sub-command takes no *required* arguments; it's usage is shown b
   Usage: simpleaf chemistry refresh [OPTIONS]
 
   Options:
-    -f, --force    overwrite an existing matched chemistry even if the version isn't newer
+    -f, --force    overwrite an existing matched chemistry even if the version is not newer
     -d, --dry-run  report what would happen with a refresh without actually performing one on the actual chemistry registry
     -h, --help     Print help
 
@@ -79,9 +79,9 @@ Every chemistry added to the registry has three mandatory properties: ``name``,
 
 
 - ``name``: A unique name (within the existing registry) of the chemistry. It must be a valid UTF-8 identifier. If the name is already registered, the existing definition will be updated if a higher ``--version`` is provided (see below for details). Otherwise, simpleaf will complain and fail.
-- ``geometry``: The geometry specification must be provided as a quoted string, and must follow the `Sequence Fragment Geometry Description Language <https://hackmd.io/@PI7Og0l1ReeBZu_pjQGUQQ/rJMgmvr13>`_ as used in the `quant command <https://simpleaf.readthedocs.io/en/latest/quant-command.html#a-note-on-the-chemistry-flag>`. 
+- ``geometry``: The geometry specification must be provided as a quoted string, and must follow the `Sequence Fragment Geometry Description Language <https://hackmd.io/@PI7Og0l1ReeBZu_pjQGUQQ/rJMgmvr13>`_ as used in the `quant command <https://simpleaf.readthedocs.io/en/latest/quant-command.html#a-note-on-the-chemistry-flag>`_. 
 - ``expected-ori``: The expected orientation of the chemistry. It must be one of the following: fw (forward), rc (reverse complement), or both (both orientations). It describes the expected orientation relative to the first (most upstream) mappable biological sequence.
-Imagine we have reads from 10x Chromium 5' protocols with read1s and read2s both of 150 base pairs. With this specification, a read1, which is in the forward orientation, contains, from 5' to 3', a cell barcode, a UMI, a fixed fragment, and a fragment representing the 5' end of the cDNA. A read2, which is in the reverse complementary orientation, contains the second (downstream) cDNA fragment relative to its read1. You can find a detailed explanation of the 10x Chromium 5' protocol from Single Cell Genomics Library Structure <https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium5.html>_.
+Imagine we have reads from 10x Chromium 5' protocols with read1s and read2s both of 150 base pairs. With this specification, a read1, which is in the forward orientation, contains, from 5' to 3', a cell barcode, a UMI, a fixed fragment, and a fragment representing the 5' end of the cDNA. A read2, which is in the reverse complementary orientation, contains the second (downstream) cDNA fragment relative to its read1. You can find a detailed explanation of the 10x Chromium 5' protocol from `Single Cell Genomics Library Structure <https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium5.html>`_.
 If we map the biological sequence in read1s and read2s as paired-end reads (currently only supported when using the default mapper -- piscem), as biological read1s are the first mappable sequences, the expected orientation for this chemistry should be ``fw``, the orientation of read1s. However, if we only map read2s, the expected orientation should be ``rc``, because read2s are the first mappable sequences and are in the reverse complementary orientation.
 
 In addition to the required fields, there are 3 optional fields, as described below. A permit list file must be a TSV file without a header, and the first column must contain the sequence of permitted cell barcodes, i.e., the whitelist of cell barcodes.

diff --git a/docs/source/index-command.rst b/docs/source/index-command.rst
@@ -3,39 +3,78 @@
 
 The ``index`` command has two forms of input; either it will take a reference genome FASTA and GTF as input, from which it can build a spliced+intronic (splici) reference or a spliced+unspliced (spliceu) reference using `roers <https://github.com/COMBINE-lab/roers>`_  (which is used as a library directly from ``simpleaf``, and so need not be installed independently), or it will take a single reference sequence file (i.e. FASTA file) as input (direct-ref mode).  
 
-In expanded reference mode, after the expanded reference is constructed, the resulting reference will be indexed with ``piscem build`` or ``salmon index`` command (depending on the mapper you choose to use), and a copy of the 3-column transcript-to-gene file will be placed in the index directory for subsequent use. The output directory will contain both a ``ref`` and ``index`` subdirectoy, with the first containing the splici reference that was extracted from the provided genome and GTF, and the latter containing the index built on this reference. 
+In expanded reference mode, after the expanded reference is constructed, the resulting reference will be indexed with ``piscem build`` or ``salmon index`` command (depending on the mapper you choose to use), and a copy of the 3-column transcript-to-gene file will be placed in the index directory for subsequent use. The output directory will contain both a ``ref`` and ``index`` subdirectory, with the first containing the splici reference that was extracted from the provided genome and GTF, and the latter containing the index built on this reference. 
 
-In direct-ref mode, the provided fasta file (passed in with ``--refseq``) will be provided to ``piscem build`` or ``salmon index`` directly.  The output diretory will contain an ``index`` subdirectory that contains the index built on this reference.
+In direct-ref mode, if ``--refseq`` is passed, the provided FASTA file will be provided to ``piscem build`` or ``salmon index`` directly.  If ``probe_csv`` or ``feature_csv`` is passed, a FASTA file will be created accordingly and provided to ``piscem build`` or ``salmon index``. The output directory will contain an ``index`` subdirectory that contains the index built on this reference.
+
+- ``probe_csv``: A CSV file containing probe sequences to use for direct reference indexing. The file must follow the format of `10x Probe Set Reference CSV <https://www.10xgenomics.com/support/cytassist-spatial-gene-expression/documentation/steps/probe-sets/visium-ffpe-probe-sets-files#:~:text=probe%20set%20downloads-,Probe%20set%20reference%20CSV%20file,-This%20CSV%20file>`_, containing four mandatory columns: `gene_id`, `probe_seq`, `probe_id`, and `included` (must be ``TRUE`` or ``FALSE``), and an optional column: `region` (must be ``spliced`` or ``unspliced``). When parsing the file, ``simpleaf`` will only use the rows where the `included` column is ``TRUE``. For each row, ``simpleaf`` first builds a FASTA record where the identifier is set as `probe_id`, and the sequence is set as `probe_seq`. Then, it will build a t2g file where the first column is `probe_id` and the second column is `gene_id`. If the `region` column exists, the t2g file will include the region information, so as to trigger the USA mode in ``simpleaf quant`` to generate spliced and unspliced count separately. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set.
+- ``feature_csv``: A CSV file containing feature barcode sequences to use for direct reference indexing. The file must follow the format of `10x Feature Reference CSV <https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/inputs/cr-feature-ref-csv#columns>`_. Currently, only three columns are used: `id`, `name`, and `sequence`. When parsing the file, ``simpleaf`` first builds a FASTA file using the `id` and `sequence` columns. Then, it will build a t2g file where the transcript is set as `id` and the gene is set as `name`. The t2g file will be identified by ``simpleaf quant`` automatically if ``--t2g-map`` is not set.
 
 The relevant options (which you can obtain by running ``simpleaf index -h``) are:
 
 .. code-block:: console
-  
+
     build the (expanded) reference index
-  
-    Usage: simpleaf index [OPTIONS] --output <OUTPUT> <--fasta <FASTA>|--ref-seq <REF_SEQ>>
-    
+
+    Usage: simpleaf index [OPTIONS] --output <OUTPUT> <--fasta <FASTA>|--ref-seq <REF_SEQ>|--probe-csv <PROBE_CSV>|--feature-csv <FEATURE_CSV>>
+
     Options:
-      -o, --output <OUTPUT>            path to output directory (will be created if it doesn't exist)
-      -t, --threads <THREADS>          number of threads to use when running [default: 16]
-      -k, --kmer-length <KMER_LENGTH>  the value of k to be used to construct the index [default: 31]
-          --keep-duplicates            keep duplicated identical sequences when constructing the index
-      -p, --sparse                     if this flag is passed, build the sparse rather than dense index for mapping
-      -h, --help                       Print help information
-      -V, --version                    Print version information
-
+    -o, --output <OUTPUT>            Path to output directory (will be created if it doesn't exist)
+    -t, --threads <THREADS>          Number of threads to use when running [default: 16]
+    -k, --kmer-length <KMER_LENGTH>  The value of k to be used to construct the index [default: 31]
+        --gff3-format                Denotes that the input annotation is a GFF3 (instead of GTF)
+                                    file
+        --keep-duplicates            Keep duplicated identical sequences when constructing the index
+        --overwrite                  Overwrite existing files if the output directory is already
+                                    populated
+    -h, --help                       Print help
+    -V, --version                    Print version
+
     Expanded Reference Options:
-          --ref-type <REF_TYPE>    specify whether an expanded reference, spliced+intronic (or splici) or spliced+unspliced (or spliceu), should be built [default: spliced+intronic]
-      -f, --fasta <FASTA>          reference genome to be used for the expanded reference construction
-      -g, --gtf <GTF>              reference GTF file to be used for the expanded reference construction
-      -r, --rlen <RLEN>            the target read length the splici index will be built for
-          --dedup                  deduplicate identical sequences in roers when building an expanded reference  reference
-          --spliced <SPLICED>      path to FASTA file with extra spliced sequence to add to the index
-          --unspliced <UNSPLICED>  path to FASTA file with extra unspliced sequence to add to the index
-
+        --ref-type <REF_TYPE>    Specify whether an expanded reference, spliced+intronic (or splici)
+                                or spliced+unspliced (or spliceu), should be built [default:
+                                spliced+intronic]
+    -f, --fasta <FASTA>          Path to a reference genome to be used for the expanded reference
+                                construction
+    -g, --gtf <GTF>              Path to a reference GTF/GFF3 file to be used for the expanded
+                                reference construction
+    -r, --rlen <RLEN>            The Read length used in roers to add flanking lengths to intronic
+                                sequences
+        --dedup                  Deduplicate identical sequences in roers when building the expanded
+                                reference
+        --spliced <SPLICED>      Path to FASTA file with extra spliced sequence to add to the index
+        --unspliced <UNSPLICED>  Path to a FASTA file with extra unspliced sequence to add to the
+                                index
+
     Direct Reference Options:
-          --ref-seq <REF_SEQ>  target sequences (provide target sequences directly; avoid expanded reference construction)
-
+        --feature-csv <FEATURE_CSV>  A CSV file containing feature barcode sequences to use for
+                                    direct reference indexing. The file must follow the format of
+                                    10x Feature Reference CSV. Currently, only three columns are
+                                    used: id, name, and sequence
+        --probe-csv <PROBE_CSV>      A CSV file containing probe sequences to use for direct
+                                    reference indexing. The file must follow the format of 10x Probe
+                                    Set Reference v2 CSV, containing four mandatory columns:
+                                    gene_id, probe_seq, probe_id, and included (TRUE or FALSE), and
+                                    an optional column: region (spliced or unspliced)
+        --ref-seq <REF_SEQ>          A FASTA file containing reference sequences to directly build
+                                    index on, and avoid expanded reference construction
+
     Piscem Index Options:
-          --use-piscem                           use piscem instead of salmon for indexing and mapping
-      -m, --minimizer-length <MINIMIZER_LENGTH>  the value of m to be used to construct the piscem index (must be < k) [default: 19]
+    -m, --minimizer-length <MINIMIZER_LENGTH>
+            Minimizer length to be used to construct the piscem index (must be < k) [default: 19]
+        --decoy-paths <DECOY_PATHS>
+            Paths to decoy sequence FASTA files used to insert poison k-mer information into the
+            index (only if using piscem >= 0.7)
+        --seed <HASH_SEED>
+            The seed value to use in SSHash index construction (try changing this in the rare event
+            index build fails) [default: 1]
+        --work-dir <WORK_DIR>
+            The working directory where temporary files should be placed [default: ./workdir.noindex]
+        --use-piscem
+            Use piscem instead of salmon for indexing and mapping (default)
+
+    Alternative salmon-alevin Index Options:
+    -p, --sparse     If this flag is passed, build the sparse rather than dense index for mapping
+        --no-piscem  Don't use the default piscem mapper, instead, use salmon-alevin
+
+