atac-seq schema #1032

Merged
merged 10 commits into from
Nov 4, 2024

Conversation

Collaborator

@jahilton commented Oct 7, 2024

Not ready to merge. PR for review purposes only.
Once the standards and how they are written have solidified, we can update the PR to merge them into a draft of the full schema. It currently includes only the fragment file standards, but I will add standards for a genome track data product.

## scATAC-seq assay types

<i>paired assay</i>: any descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> that is also a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>

Contributor

Is paired effectively only "10x multiome"? I'm trying to understand the rationale for writing the general definition. For example, are there pending NTR(s) for other assays that would meet the requirement?

Collaborator Author

No specific NTR issues open, but this is future-proofing. And we would quickly go to EFO with suggestions to group more under "single-cell RNA sequencing" to ensure future NTRs get links appropriately.

Contributor

Is there a possibility that an unexpected term might be added, causing validation issues, à la "One of These Things (Is Not Like the Others)"? I guess we could simply include it in the review of EFO updates. Just thinking out loud.

Collaborator Author

We wouldn't accept a new assay without reviewing it in EFO or adding it there ourselves. So the edge case would be a term that we have reviewed and accepted later having its ontology links updated so that it suddenly fits the criteria. So possible? Yes, though it seems unlikely.

<i>paired assay</i>: any descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> that is also a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>

<i>unpaired assay</i>: <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010891"><code>"EFO:0010891"</code></a> for <i>scATAC-seq</i> or its descendants that is not a descendant of <a href="https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0008913"><code>"EFO:0008913"</code></a> for <i>single-cell RNA sequencing</i>

Contributor

Is there an expectation that there would be support for scATAC-seq (cell index) and scATAC-seq (Microfluidics)?

Collaborator Author

Yes.
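
For readers following the paired/unpaired definitions discussed in this thread, here is a minimal sketch of the classification logic. The two EFO IDs are the ones quoted in the diff; the ancestor table, the placeholder term IDs, and the function name `classify_assay` are hypothetical and exist only for illustration. A real validator would look up ancestors in EFO rather than in a hand-written dictionary.

```python
SCATAC = "EFO:0010891"  # scATAC-seq
SCRNA = "EFO:0008913"   # single-cell RNA sequencing

# Hypothetical ancestor sets for illustration only. The EXAMPLE_* IDs are
# placeholders, not real EFO terms; real ancestors would come from EFO.
TOY_ANCESTORS = {
    SCATAC: set(),
    "EFO:EXAMPLE_MULTIOME": {SCATAC, SCRNA},
    "EFO:EXAMPLE_ATAC_ONLY": {SCATAC},
    "EFO:EXAMPLE_RNA_ONLY": {SCRNA},
}


def classify_assay(term_id, ancestors=TOY_ANCESTORS):
    """Return 'paired', 'unpaired', or None if the term is not scATAC-seq."""
    ancestor_set = ancestors.get(term_id, set())
    is_atac_descendant = SCATAC in ancestor_set
    is_rna_descendant = SCRNA in ancestor_set
    # paired: a descendant of scATAC-seq that is also a descendant of scRNA-seq
    if is_atac_descendant and is_rna_descendant:
        return "paired"
    # unpaired: scATAC-seq itself or a descendant, and not under scRNA-seq
    if (term_id == SCATAC or is_atac_descendant) and not is_rna_descendant:
        return "unpaired"
    return None
```

Under this sketch, the edge case raised above (a term whose ontology links change later) would show up as a term moving from `'unpaired'` to `'paired'` when its ancestor set gains `EFO:0008913`.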

schema/atac_schema.md (outdated, resolved)

CELLxGENE Discover MUST generate a <a href="https://www.htslib.org/doc/tabix.html">tabix</a> index of the fragment intervals from the fragment file. The file name MUST be the name of the corresponding fragment file appended with `.tbi`.
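
The naming rule in the line above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the command built by `tabix_command` uses the `bed` preset of the tabix CLI as an assumption about how the index might be generated, not a statement of what CELLxGENE Discover actually runs.

```python
from pathlib import Path


def tabix_index_name(fragment_path):
    """Return the fragment file path with `.tbi` appended to its name."""
    path = Path(fragment_path)
    return path.with_name(path.name + ".tbi")


def tabix_command(fragment_path):
    """Build (but do not run) a tabix invocation for a BED-like file.

    Assumption: the fragment file is indexed with the `bed` preset.
    """
    return ["tabix", "-p", "bed", str(fragment_path)]
```

For example, `tabix_index_name("fragments.tsv.gz")` yields `fragments.tsv.gz.tbi`, which is also the name tabix writes by default next to its input.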

## Chromosome Tables
Contributor

Is there code that we can check into single-cell-curation for re-creating the tables in the future?

Collaborator Author

Yes. I will share that.

If the <code>obs['assay_ontology_term_id']</code> values are all <i>unpaired assays</i> then a fragment file MUST be attached to the Dataset.

## Fragment File

Contributor

The tabix documentation states:

The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.

The 10X documentation only mentions:

The data is block-gzipped to allow indexing and to save disk space.

Collaborator Author

So "This MUST be a gzipped tab-separated values (TSV) file." is not strict enough?
Proposed: "This MUST be a tab-separated values (TSV) file, position-sorted and compressed by bgzip."

Contributor

I'd recommend that we test to confirm the requirement.
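
One way such a test could distinguish plain gzip from bgzip output is to inspect the first block header. The sketch below assumes the BGZF layout described in the SAM specification: a gzip member with the FEXTRA flag set and an extra subfield tagged `BC` of length 2. The function name `is_bgzf` is hypothetical, and the check looks only at the first block, so it is a quick screen rather than full validation.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (bgzip) block header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes, deflate method, and the FEXTRA flag (0x04)
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        if not header[3] & 0x04:
            return False
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = fh.read(xlen)
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (66, 67, 2):
            return True
        pos += 4 + slen
    return False
```

A file written by Python's `gzip.compress` has no extra field, so it would return `False` here even though it is valid gzip.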

Contributor

Have we looked into rolling our own vs tabix?

Collaborator Author

I don't know what "rolling our own" looks like, so no, I haven't looked into it.
Common analysis software consumes both the TSV and the index together, so it must be interoperable with those tools.


If the <code>obs['assay_ontology_term_id']</code> values are all <i>paired assays</i> then a fragment file MAY be attached to the Dataset.

If the <code>obs['assay_ontology_term_id']</code> values are all <i>unpaired assays</i> then a fragment file MUST be attached to the Dataset.
Contributor

This requirement will complicate validation in the ingest pipeline. We will have to wait for the fragment file to be present before knowing whether the anndata is valid. If more requirements like this are expected in the future, then we can pay the engineering cost now to simplify changes like this later.

Contributor

@brianraymor Oct 22, 2024

It would seem easier to simply block publication of the collection if a fragment file is not attached to a dataset? It's an invalid collection.

Contributor

@Bento007 Oct 22, 2024

It would seem easier to simply block publication of the collection if a fragment file is not attached to a dataset?

💯 We need to block in this case. I'm thinking about how the backend will process this when multiple files are required before a dataset is valid. There will be additional complexity added to support this and potential future cases.

Contributor

when multiple files are required before being a valid dataset.

I'm suggesting that the dataset is valid, but the collection is invalid when there's a missing/required fragment file.

  • Upload and validate dataset - set collection as invalid due to missing required fragment file.
  • Upload and validate fragment file - set collection as valid if fragment validation passes.

Another approach is to always upload+validate the dataset and fragment file together. The two files are tarred or gzipped together.
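
The two-step proposal above (the dataset stays valid, the collection is invalid until a required fragment file passes validation) can be sketched as follows. Everything here is hypothetical: the function names, the dictionary keys, and the `'paired'`/`'unpaired'` labels are illustrative and not taken from the ingest pipeline. The sketch also deliberately returns `None` for assay mixes that the two rules quoted in this thread do not cover.

```python
def fragment_requirement(assay_kinds):
    """Map the set of assay kinds in a dataset to 'MUST', 'MAY', or None."""
    kinds = set(assay_kinds)
    if kinds == {"unpaired"}:
        return "MUST"
    if kinds == {"paired"}:
        return "MAY"
    return None  # not covered by the two rules quoted in this thread


def collection_is_valid(datasets):
    """Return True if no dataset blocks publication of the collection.

    Each dataset is a dict with the hypothetical keys 'assay_kinds',
    'has_fragment', and 'fragment_valid'.
    """
    for ds in datasets:
        requirement = fragment_requirement(ds["assay_kinds"])
        # A required fragment file that is missing invalidates the collection.
        if requirement == "MUST" and not ds["has_fragment"]:
            return False
        # An attached fragment file must itself pass validation.
        if ds["has_fragment"] and not ds["fragment_valid"]:
            return False
    return True
```

With this shape, the anndata validation result never depends on the fragment file. Only the collection-level check does, which is the separation the comment above argues for.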


## Fragment File

This MUST be a gzipped tab-separated values (TSV) file.
Contributor

@Bento007 Oct 22, 2024

So the ingest format is going to be this tsv.gz, and we also want to make this same format available to the users to download?

Sorry for being redundant. Linking my similar comment from a different issue for posterity: #1013 (comment)


### <a href="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz">human (GRCh38.p14)</a>

<table>
Contributor

Is this the expected order that chromosomes should be sorted in? When doing an alphabetical sort I get something very different.

['chrY',
 'chrX',
 'chrM',
 'chr9',
 'chr8',
 'chr7',
 'chr6',
 'chr5',
 'chr4',
 'chr3',
 'chr22',
 'chr21',
 'chr20',
 'chr2',
 'chr19',
 'chr18',
 'chr17',
 'chr16',
 'chr15',
 'chr14',
 'chr13',
 'chr12',
 'chr11',
 'chr10',
 'chr1',
 'KI270750.1',
 'KI270744.1',
 'KI270734.1',
 'KI270733.1',
 'KI270731.1',
 'KI270728.1',
 'KI270727.1',
 'KI270726.1',
 'KI270721.1',
 'KI270713.1',
 'KI270711.1',
 'KI270442.1',
 'GL000225.1',
 'GL000220.1',
 'GL000219.1',
 'GL000218.1',
 'GL000216.2',
 'GL000213.1',
 'GL000205.2',
 'GL000195.1',
 'GL000194.1',
 'GL000009.2']

Contributor

NVM, this depends on the ordering that tabix expects.
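
As a rough illustration of what a "position sorted" check could look like, the sketch below verifies that each chromosome appears in one contiguous block and that start coordinates are non-decreasing within a block. That reading of the tabix requirement is an assumption, as are the function name `is_position_sorted` and the column layout (chromosome in column 1, start in column 2). The sketch does not enforce any particular order between chromosomes.

```python
import gzip


def is_position_sorted(path):
    """Check contiguous chromosome blocks with non-decreasing starts."""
    seen = set()
    current, last_start = None, -1
    # gzip.open reads multi-member gzip streams, which includes bgzip output
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment/header lines
            fields = line.rstrip("\n").split("\t")
            chrom, start = fields[0], int(fields[1])
            if chrom != current:
                if chrom in seen:
                    return False  # chromosome reappears after another block
                seen.add(chrom)
                current, last_start = chrom, start
            elif start < last_start:
                return False  # starts go backwards within a chromosome
            else:
                last_start = start
    return True
```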

@jahilton jahilton enabled auto-merge (squash) November 4, 2024 23:32
@jahilton jahilton merged commit daa3008 into main Nov 4, 2024
7 of 8 checks passed
@jahilton jahilton deleted the jason/atac-schema branch November 4, 2024 23:34
3 participants