atac-seq schema #1032
schema/atac_schema.md
Outdated
## scATAC-seq assay types

<i>paired assay</i>: any descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> that is also a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>
Is "paired" effectively only "10x multiome"? I'm trying to understand the rationale for writing the general definition. For example, are there pending NTRs for other assays that would meet the requirement?
No specific NTR issues are open, but this is future-proofing. We would also quickly go to EFO with suggestions to group more terms under "single-cell RNA sequencing" so that future NTRs get linked appropriately.
Is there a possibility that an unexpected term might be added, causing validation issues, à la "One of These Things (Is Not Like the Others)"? I guess we could simply include it in the review of EFO updates. Just thinking out loud.
We wouldn't accept a new assay without reviewing it in EFO or adding it there ourselves. So the edge case would be a term that we have reviewed and accepted, whose ontology links are later updated so that it suddenly fits the criteria. Possible? Yes, though it seems unlikely.
schema/atac_schema.md
Outdated
<i>paired assay</i>: any descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> that is also a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>

<i>unpaired assay</i>: <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> or its descendants that is not a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>
Is there an expectation that there would be support for scATAC-seq (cell index) and scATAC-seq (Microfluidics)?
Yes.
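For readers following this thread, the paired/unpaired definitions quoted above can be sketched as a small classifier. This is only an illustration, not the validator's code: the `parents` mapping (term to direct parent terms) and the `EX:` identifiers in the usage note are hypothetical stand-ins for a real EFO lookup, and the function names are invented here.

```python
SCATAC = "EFO:0010891"  # scATAC-seq, as quoted in the schema text
SCRNA = "EFO:0008913"   # single-cell RNA sequencing, as quoted in the schema text


def ancestors(term, parents):
    """Collect every ancestor of `term` from a {term: [direct parents]} mapping."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def classify_assay(term, parents):
    """Return "paired", "unpaired", or None when neither definition applies."""
    anc = ancestors(term, parents)
    # paired: a descendant of scATAC-seq that is also a descendant of scRNA-seq
    if SCATAC in anc and SCRNA in anc:
        return "paired"
    # unpaired: scATAC-seq itself or a descendant, and not a descendant of scRNA-seq
    if (term == SCATAC or SCATAC in anc) and SCRNA not in anc:
        return "unpaired"
    return None
```

With a toy mapping such as `{"EX:multiome": [SCATAC, SCRNA], "EX:cell_index": [SCATAC]}`, the first term classifies as paired and the second as unpaired. Because the result depends on whatever links the ontology holds at the time, a later change to a term's parents would change its classification, which is the edge case discussed earlier in this conversation.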
schema/atac_schema.md
Outdated
CELLxGENE Discover MUST generate a <a href="https://www.htslib.org/doc/tabix.html">tabix</a> index of the fragment intervals from the fragment file. The file name MUST be the name of the corresponding fragment file appended with `.tbi`.

## Chromosome Tables
Is there code that we can check into single-cell-curation for re-creating the tables in the future?
Yes. I will share that.
schema/atac_schema.md
Outdated
If the <code>obs['assay_ontology_term_id']</code> values are all <i>unpaired assays</i> then a fragment file MUST be attached to the Dataset.

## Fragment File
The tabix documentation states:
> The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.

The 10X documentation only mentions:
> The data is block-gzipped to allow indexing and to save disk space.
So
> This MUST be a gzipped tab-separated values (TSV) file.

is not strict enough?
> This MUST be a tab-separated values (TSV) file position-sorted and compressed by bgzip.
I'd recommend that we test to confirm the requirement.
Have we looked into rolling our own vs tabix?
I don't know what "rolling our own" looks like, so no, we haven't looked into it.
The common analysis software consumes both the TSV and the index together, so it must be interoperable with those.
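One way to test the bgzip and sort-order question raised in this thread is sketched below. It rests on two assumptions that the schema text does not state: that "compressed by bgzip" can be detected from the BGZF extra subfield (`BC`, length 2) in the first gzip block header, and that "position sorted" means each chromosome appears as one contiguous run with non-decreasing start coordinates. Both function names are hypothetical.

```python
def looks_like_bgzf(header: bytes) -> bool:
    """Heuristic: does the first block header carry the BGZF 'BC' extra subfield?"""
    # gzip magic 1f 8b, deflate method 08, and the FEXTRA flag bit (0x04) set
    if len(header) < 18 or header[:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
        return False
    xlen = int.from_bytes(header[10:12], "little")
    extra = header[12:12 + xlen]
    pos = 0
    # Walk the extra subfields: two id bytes, a 2-byte length, then the payload.
    while pos + 4 <= len(extra):
        slen = int.from_bytes(extra[pos + 2:pos + 4], "little")
        if extra[pos] == 0x42 and extra[pos + 1] == 0x43 and slen == 2:
            return True
        pos += 4 + slen
    return False


def is_position_sorted(rows) -> bool:
    """Check (chromosome, start) pairs under the assumed meaning of 'position sorted'."""
    finished = set()          # chromosomes whose run has already ended
    current, last_start = None, -1
    for chrom, start in rows:
        if chrom != current:
            if chrom in finished:
                return False  # chromosome reappears after another one started
            if current is not None:
                finished.add(current)
            current, last_start = chrom, start
        elif start < last_start:
            return False      # start coordinate went backwards within a run
        else:
            last_start = start
    return True
```

A plain `gzip` file normally has no extra field, so `looks_like_bgzf` returns `False` for it, while a file written by bgzip should pass. Neither check replaces running tabix itself on the candidate file, which remains the more direct way to confirm the requirement.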
schema/atac_schema.md
Outdated
If the <code>obs['assay_ontology_term_id']</code> values are all <i>paired assays</i> then a fragment file MAY be attached to the Dataset.

If the <code>obs['assay_ontology_term_id']</code> values are all <i>unpaired assays</i> then a fragment file MUST be attached to the Dataset.
This requirement will complicate validation in the ingest pipeline. We will have to wait for the fragment file to be present before knowing whether the anndata is valid. If more requirements like this are expected in the future, then we can pay the engineering cost now to simplify changes like this later.
It would seem easier to simply block publication of the collection if a fragment file is not attached to a dataset? It's an invalid collection.
> It would seem easier to simply block publication of the collection if a fragment file is not attached to a dataset?

💯 We need to block in this case. I'm thinking about how the backend will process this when multiple files are required before a dataset is valid. There will be additional complexity added to support this and potential future cases.
> when multiple files are required before being a valid dataset.

I'm suggesting that the dataset is valid, but the collection is invalid when a required fragment file is missing.
- Upload and validate dataset - set collection as invalid due to the missing required fragment file.
- Upload and validate fragment file - set collection as valid if fragment validation passes.

Another approach is to always upload and validate the dataset and fragment file together. The two files are tarred or gzipped together.
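The first proposal above (the dataset stays valid while the collection is blocked until a required fragment file validates) could look roughly like the sketch below. The function names and the `"paired"`/`"unpaired"` labels are hypothetical, and the schema text quoted in this thread does not say what applies when a dataset mixes paired and unpaired assays, so that case is returned as unspecified.

```python
def fragment_file_requirement(assay_kinds):
    """Map the kinds of assays found in obs['assay_ontology_term_id'] to a requirement."""
    kinds = set(assay_kinds)
    if kinds == {"unpaired"}:
        return "MUST"   # all unpaired: a fragment file MUST be attached
    if kinds == {"paired"}:
        return "MAY"    # all paired: a fragment file MAY be attached
    return None         # mixed or other assays: not covered by the quoted rules


def collection_is_publishable(dataset_valid, requirement, fragment_valid=None):
    """fragment_valid is None when no fragment file has been uploaded yet."""
    if not dataset_valid:
        return False
    if fragment_valid is None:
        # Dataset alone is fine unless a fragment file is required.
        return requirement != "MUST"
    # A fragment file was uploaded, so it has to pass its own validation.
    return fragment_valid
```

Under this sketch, the dataset and the fragment file are validated independently, and only the publish check needs to know about both. That keeps anndata validation from waiting on the fragment upload, which was the concern raised earlier in the thread.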
schema/atac_schema.md
Outdated
## Fragment File

This MUST be a gzipped tab-separated values (TSV) file.
So the ingest format is going to be this tsv.gz, and we also want to make this same format available for users to download?
Sorry for being redundant. Linking my similar comment from a different issue for posterity: #1013 (comment)
18d177f to 01bb4c1
schema/drafts/atac_schema.md
Outdated
### <a href="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz">human (GRCh38.p14)</a>

<table>
Is this the expected order that chromosomes should be sorted by? When doing an alphabetical sort I get something very different.
['chrY',
'chrX',
'chrM',
'chr9',
'chr8',
'chr7',
'chr6',
'chr5',
'chr4',
'chr3',
'chr22',
'chr21',
'chr20',
'chr2',
'chr19',
'chr18',
'chr17',
'chr16',
'chr15',
'chr14',
'chr13',
'chr12',
'chr11',
'chr10',
'chr1',
'KI270750.1',
'KI270744.1',
'KI270734.1',
'KI270733.1',
'KI270731.1',
'KI270728.1',
'KI270727.1',
'KI270726.1',
'KI270721.1',
'KI270713.1',
'KI270711.1',
'KI270442.1',
'GL000225.1',
'GL000220.1',
'GL000219.1',
'GL000218.1',
'GL000216.2',
'GL000213.1',
'GL000205.2',
'GL000195.1',
'GL000194.1',
'GL000009.2']
Never mind, this depends on the ordering that tabix expects.
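As a small illustration of why a plain string sort gives a surprising chromosome order, the sketch below compares Python's default lexicographic sort with a "natural" sort that treats digit runs as numbers. Neither ordering is claimed here to be the one tabix or the schema tables use; `natural_key` is a hypothetical helper for this example only.

```python
import re


def natural_key(name):
    """Split a name into text and digit runs so that "chr10" sorts after "chr2"."""
    # re.split with a capture group alternates text, digits, text, ... so the
    # resulting lists compare str with str and int with int position by position.
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


names = ["chr10", "chr2", "chr1", "chrX"]
lexicographic = sorted(names)                 # compares character by character
natural = sorted(names, key=natural_key)      # compares digit runs numerically
```

Here `lexicographic` places `chr10` before `chr2` because `"1"` sorts before `"2"` as a character, while `natural` places `chr2` first. The reversed listing in the comment above shows the same effect, with `chr22` through `chr20` falling between `chr3` and `chr2`.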
Not ready to merge. PR for review purposes only.
Once the standards and how they are written have solidified, we can update the PR to merge them into a draft of the full schema. It currently includes only the fragment file standards, but I will add standards for a genome track data product.