atac-seq schema #1032

jahilton · 2024-10-07T15:45:03Z

Not ready to merge. PR for review purposes only.
Once the standards and their wording have solidified, we can update the PR to merge them into a draft of the full schema. It currently includes only the fragment file standards, but I will add standards for a genome track data product.

brianraymor · 2024-10-08T16:32:38Z

schema/atac_schema.md

+## scATAC-seq assay types
+
+<i>paired assay</i>: any descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> that is also a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>
+


Is paired effectively only "10x multiome"? I'm trying to understand the rationale for writing the general definition. For example, are there pending NTR(s) for other assays that would meet the requirement?

No specific NTR issues open, but this is future-proofing. And we would quickly go to EFO with suggestions to group more under "single-cell RNA sequencing" to ensure future NTRs get links appropriately.

Is there a possibility that an unexpected term might be added, causing validation issues, à la "One of These Things (Is Not Like the Others)"? I guess we could simply include it in the review of EFO updates. Just thinking out loud.

We wouldn't accept a new assay without reviewing it in EFO or adding it there ourselves. So the edge case would be a term we have reviewed and accepted that later has its ontology links updated so that it suddenly fits the criteria. So, possible? Yes, though it seems unlikely.

brianraymor · 2024-10-08T16:36:39Z

schema/atac_schema.md

+<i>paired assay</i>: any descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> that is also a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>
+
+<i>unpaired assay</i>: <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> or its descendants that is not a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>
+


Is there an expectation that there would be support for scATAC-seq (cell index) and scATAC-seq (Microfluidics)?
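For reference, the paired/unpaired definitions quoted in this hunk could be checked roughly as follows. This is only a sketch: the `ancestors` mapping and the `EX:` term IDs are hypothetical stand-ins for a real EFO lookup, and `classify_assay` is an illustrative name, not an existing validator function.

```python
from typing import Dict, Set

SCATAC = "EFO:0010891"  # scATAC-seq
SCRNA = "EFO:0008913"   # single-cell RNA sequencing


def classify_assay(term_id: str, ancestors: Dict[str, Set[str]]) -> str:
    """Return 'paired', 'unpaired', or 'other' for an assay term.

    `ancestors` maps a term ID to the set of all its ancestor IDs
    (the term itself excluded). It is an assumed input, not a real API.
    """
    anc = ancestors.get(term_id, set())
    # paired: a descendant of scATAC-seq that is also a descendant of scRNA-seq
    if SCATAC in anc and SCRNA in anc:
        return "paired"
    # unpaired: scATAC-seq or a descendant, and not a descendant of scRNA-seq
    if (term_id == SCATAC or SCATAC in anc) and SCRNA not in anc:
        return "unpaired"
    return "other"


# Hypothetical toy ontology; the EX: IDs are placeholders, not EFO terms.
toy_ancestors = {
    SCATAC: {"EX:assay"},
    "EX:multiome_like": {SCATAC, SCRNA, "EX:assay"},
    "EX:scatac_child": {SCATAC, "EX:assay"},
    "EX:rna_only": {SCRNA, "EX:assay"},
}
```

With this toy mapping, `EX:multiome_like` is classified as paired, while `EFO:0010891` itself and `EX:scatac_child` are unpaired. A term whose ontology links change later would simply move between categories, which is the edge case discussed in the thread above.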


brianraymor · 2024-10-08T23:43:09Z

schema/atac_schema.md

+
+CELLxGENE Discover MUST generate a <a href="https://www.htslib.org/doc/tabix.html">tabix</a> index of the fragment intervals from the fragment file. The file name MUST be the name of the corresponding fragment file appended with `.tbi`.
+
+## Chromosome Tables


Is there code that we can check into single-cell-curation for re-creating the tables in the future?

Yes. I will share that.
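As a side note on the naming rule in the quoted diff (index name is the fragment file name with `.tbi` appended), a minimal sketch of that rule follows. The function names and the example file name are illustrative assumptions only.

```python
def tabix_index_name(fragment_file_name: str) -> str:
    """Append `.tbi` to the full fragment file name (extension is kept)."""
    return fragment_file_name + ".tbi"


def is_matching_index(fragment_file_name: str, index_file_name: str) -> bool:
    """True if the index name follows the append rule for this fragment file."""
    return index_file_name == tabix_index_name(fragment_file_name)
```

For a hypothetical fragment file named `example-fragment.tsv.gz`, the expected index name is `example-fragment.tsv.gz.tbi`; a name such as `example-fragment.tbi` would not match because the rule appends rather than replaces the extension.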

brianraymor · 2024-10-08T23:50:47Z

schema/atac_schema.md

+If the <code>obs['assay_ontology_term_id']</code> values are all <i>unpaired assays</i> then a fragment file MUST be attached to the Dataset.
+
+## Fragment File
+


The tabix documentation states:

> The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.

The 10X documentation only mentions:

> The data is block-gzipped to allow indexing and to save disk space.

So "This MUST be a gzipped tab-separated values (TSV) file." is not strict enough?

> This MUST be a tab-separated values (TSV) file position-sorted and compressed by bgzip.

I'd recommend that we test to confirm the requirement.

Have we looked into rolling our own vs tabix?

I don't know what "rolling our own" looks like, so no, I haven't looked into it.
Common analysis software consumes both the TSV and the index together, so it must be interoperable with those tools.
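One way to test the bgzip question is to inspect the first block header, since BGZF (the format bgzip writes) is gzip with an extra `BC` subfield while plain gzip usually has no extra field. The sketch below uses only the standard library; `is_bgzf` and `file_is_bgzf` are illustrative names, and the check looks at the first block only, so it says nothing about position sorting.

```python
import gzip
import struct


def is_bgzf(header: bytes) -> bool:
    """Check whether bytes from the start of a file look like a BGZF block."""
    if len(header) < 12:
        return False
    # gzip magic bytes and deflate compression method
    if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
        return False
    # FEXTRA flag (bit 2) must be set for BGZF
    if not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack_from("<H", header, 10)
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False


def file_is_bgzf(path: str) -> bool:
    """Read the start of a file and apply the header check."""
    with open(path, "rb") as fh:
        return is_bgzf(fh.read(1024))


# Sample inputs: ordinary gzip output, and the standard empty BGZF EOF block.
plain_gzip = gzip.compress(b"chr1\t100\t200\tAAACGAAAG-1\t1\n")
bgzf_eof_block = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)
```

Here `is_bgzf(plain_gzip)` is false and `is_bgzf(bgzf_eof_block)` is true, which illustrates why "gzipped" alone is looser than "compressed by bgzip".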

Bento007 · 2024-10-21T21:32:23Z

schema/atac_schema.md

+
+If the <code>obs['assay_ontology_term_id']</code> values are all <i>paired assays</i> then a fragment file MAY be attached to the Dataset.
+
+If the <code>obs['assay_ontology_term_id']</code> values are all <i>unpaired assays</i> then a fragment file MUST be attached to the Dataset.


This requirement will complicate validation in the ingest pipeline. We will have to wait for the fragment file to be present before knowing if the anndata is valid. If more requirements like this are expected in the future, then we can pay the engineering cost to simplify changes like this in the future.

It would seem easier to simply block publication of the collection if a fragment file is not attached to a dataset? It's an invalid collection.

> It would seem easier to simply block publication of the collection if a fragment file is not attached to a dataset?

💯 We need to block in this case. I'm thinking about how the backend will process this when multiple files are required before being a valid dataset. There will be additional complexity added to support this and potential future cases.

> when multiple files are required before being a valid dataset.

I'm suggesting that the dataset is valid, but the collection is invalid when there's a missing/required fragment file.

Upload and validate dataset - set collection as invalid due to missing required fragment file.

Upload and validate fragment file - set collection as valid if fragment validation passes.

Another approach is to always upload+validate the dataset and fragment file together. The two files are tarred or gzipped together.
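The "dataset valid, collection invalid" flow suggested above could look something like the sketch below. All names (`DatasetState`, `collection_publishable`, the `assay_kinds` labels) are hypothetical and not part of the backend. It also assumes that mixed or non-ATAC datasets do not require a fragment file, which the quoted schema text does not address.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass
class DatasetState:
    assay_kinds: FrozenSet[str]            # e.g. {"paired"} or {"unpaired"}
    anndata_valid: bool                    # result of dataset validation
    fragment_valid: Optional[bool] = None  # None means no fragment file attached


def fragment_required(ds: DatasetState) -> bool:
    # MUST only when every assay is unpaired; all-paired is MAY.
    # Assumption: any other mix is treated as "not required".
    return ds.assay_kinds == {"unpaired"}


def dataset_blocks_publication(ds: DatasetState) -> bool:
    if not ds.anndata_valid:
        return True
    if ds.fragment_valid is None:
        # The dataset itself is valid; it only blocks if a fragment is required.
        return fragment_required(ds)
    # An attached fragment file must pass validation, required or optional.
    return not ds.fragment_valid


def collection_publishable(datasets: Iterable[DatasetState]) -> bool:
    return not any(dataset_blocks_publication(ds) for ds in datasets)
```

Under this sketch, uploading the anndata first leaves an all-unpaired dataset valid but the collection unpublishable, and a later successful fragment validation flips `collection_publishable` to true without revalidating the anndata.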

Bento007 · 2024-10-22T23:21:34Z

schema/atac_schema.md

+
+## Fragment File
+
+This MUST be a gzipped tab-separated values (TSV) file.


So the ingest format is going to be this tsv.gz, and we also want to make this same format available to the users to download?

Sorry for being redundant. Linking my similar comment from a different issue for posterity: #1013 (comment)

Bento007 · 2024-10-30T23:21:28Z

schema/drafts/atac_schema.md

+
+### <a href="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz">human (GRCh38.p14)</a>
+
+<table>


Is this the expected order in which chromosomes should be sorted? When doing an alphabetical sort I get something very different.

['chrY', 'chrX', 'chrM', 'chr9', 'chr8', 'chr7', 'chr6', 'chr5', 'chr4', 'chr3', 'chr22', 'chr21', 'chr20', 'chr2', 'chr19', 'chr18', 'chr17', 'chr16', 'chr15', 'chr14', 'chr13', 'chr12', 'chr11', 'chr10', 'chr1', 'KI270750.1', 'KI270744.1', 'KI270734.1', 'KI270733.1', 'KI270731.1', 'KI270728.1', 'KI270727.1', 'KI270726.1', 'KI270721.1', 'KI270713.1', 'KI270711.1', 'KI270442.1', 'GL000225.1', 'GL000220.1', 'GL000219.1', 'GL000218.1', 'GL000216.2', 'GL000213.1', 'GL000205.2', 'GL000195.1', 'GL000194.1', 'GL000009.2']

NVM, this depends on the ordering that tabix expects.
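To illustrate the difference noted here, a plain string sort orders names by character code, so `chr10` lands next to `chr1` and the `KI`/`GL` scaffolds group apart from the `chr` names. The sketch below contrasts that with sorting by an explicit table order. `TABLE_ORDER` is a short hypothetical excerpt for illustration, not the order from the schema's chromosome table.

```python
from typing import Dict, List

# Hypothetical excerpt of a reference order; not copied from the schema table.
TABLE_ORDER: List[str] = ["chr1", "chr2", "chr10", "chrX", "GL000009.2", "KI270442.1"]


def sort_by_table(names: List[str], table_order: List[str]) -> List[str]:
    """Sort names by their position in an explicit reference order."""
    rank: Dict[str, int] = {name: i for i, name in enumerate(table_order)}
    unknown = [n for n in names if n not in rank]
    if unknown:
        raise ValueError(f"names not in reference table: {unknown}")
    return sorted(names, key=rank.__getitem__)


# Descending string sort, the same kind of ordering as the list quoted above.
descending = sorted(TABLE_ORDER, reverse=True)
# descending == ['chrX', 'chr2', 'chr10', 'chr1', 'KI270442.1', 'GL000009.2']

restored = sort_by_table(descending, TABLE_ORDER)
```

`restored` equals `TABLE_ORDER` again, whereas no plain ascending or descending string sort would place `chr2` before `chr10`.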

jahilton mentioned this pull request Oct 7, 2024

Add requirements for 10X multiome #1013

Open

jahilton requested review from brianraymor and BAevermann October 7, 2024 16:35

brianraymor reviewed Oct 8, 2024

brianraymor reviewed Oct 8, 2024

Bento007 reviewed Oct 21, 2024

Bento007 mentioned this pull request Oct 22, 2024

Define the cellxgene schema for 10x multiome chanzuckerberg/single-cell#714

Closed

Bento007 reviewed Oct 22, 2024

jahilton added 4 commits October 23, 2024 16:52

initial draft of fragments file

d6b90c1

move atac schema into drafts

fc3a9ac

use asset language

6cb2849

add Genome Tracks notes, consistent fragmentS file term

01bb4c1

jahilton force-pushed the jason/atac-schema branch from 18d177f to 01bb4c1 Compare October 24, 2024 17:30

block ontology id fields from peak_grouping

2458bad

Bento007 reviewed Oct 30, 2024

jahilton added 4 commits October 31, 2024 16:29

split fragments from tracks

427951a

Merge branch 'main' into jason/atac-schema

8ea9cac

specify naming convention for fragments file and index

2dfd906

Merge branch 'jason/atac-schema' of https://github.com/chanzuckerberg…

4077578

…/single-cell-curation into jason/atac-schema

jahilton enabled auto-merge (squash) November 4, 2024 23:32

Merge branch 'main' into jason/atac-schema

23b6b4b

jahilton merged commit daa3008 into main Nov 4, 2024
7 of 8 checks passed

jahilton deleted the jason/atac-schema branch November 4, 2024 23:34
