From d6b90c1c979296b4c7444787f9da7f1af54232f3 Mon Sep 17 00:00:00 2001
From: jahilton
Date: Mon, 7 Oct 2024 08:43:09 -0700
Subject: [PATCH 1/7] initial draft of fragments file

---
 schema/atac_schema.md | 473 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 473 insertions(+)
 create mode 100644 schema/atac_schema.md

diff --git a/schema/atac_schema.md b/schema/atac_schema.md
new file mode 100644
index 000000000..fc930c1b2
--- /dev/null
+++ b/schema/atac_schema.md
@@ -0,0 +1,473 @@
+## scATAC-seq assay types
+
+paired assay: any descendant of "EFO:0010891" for scATAC-seq that is also a descendant of "EFO:0008913" for single-cell RNA sequencing
+
+unpaired assay: "EFO:0010891" for scATAC-seq or its descendants that is not a descendant of "EFO:0008913" for single-cell RNA sequencing
+
+## Fragment File Dataset Criteria
+
+A Dataset MUST meet each of the following criteria in order to be eligible for an attached Fragment File:
+* the obs['assay_ontology_term_id'] values MUST all be paired assays or MUST all be unpaired assays
+* the obs['is_primary_data'] values MUST be all `True`
+* the var['feature_reference'] values MUST include one of "NCBITaxon:9606" for Homo sapiens or "NCBITaxon:10090" for Mus musculus, but not both. The value that is present will determine the appropriate Chromosome Table for standards.
+
+If the obs['assay_ontology_term_id'] values are all paired assays then a fragment file MAY be attached to the Dataset.
+
+If the obs['assay_ontology_term_id'] values are all unpaired assays then a fragment file MUST be attached to the Dataset.
+
+## Fragment File
+
+This MUST be a gzipped tab-separated values (TSV) file.
+
+The curator MUST annotate the following header-less columns. Additional columns and header lines beginning with `#` MUST NOT be included.
+
+### first column
+
+* Annotator: Curator MUST annotate.
+* Value: str. This MUST be the reference genome chromosome the fragment is located on. The value MUST be one of the values from the Chromosome column in the appropriate Chromosome Table.
+
+### second column
+
+* Annotator: Curator MUST annotate.
+* Value: int. This MUST be the 0-based start coordinate of the fragment.
+
+### third column
+
+* Annotator: Curator MUST annotate.
+* Value: int. This MUST be the 0-based end coordinate of the fragment. The end position is exclusive, so it represents the position immediately following the fragment interval. The value MUST be greater than the start coordinate specified in the second column and less than or equal to the Length of the Chromosome specified in the first column, as specified in the appropriate Chromosome Table.
+
+### fourth column
+
+* Annotator: Curator MUST annotate.
+* Value: str. This MUST be the cell identifier. The value MUST be found in the obs index of the associated Dataset. Every obs index value of the associated Dataset MUST appear at least once in this column of the fragment file.
+
+### fifth column
+
+* Annotator: Curator MUST annotate.
+* Value: int. This MUST be the total number of read pairs associated with this fragment. The value MUST be 1 or greater.
+
+## Fragment File index
+
+CELLxGENE Discover MUST generate a tabix index of the fragment intervals from the fragment file. The file name MUST be the name of the corresponding fragment file appended with `.tbi`.
+
+## Chromosome Tables
+
+As determined by the reference assembly used by the gene annotation versions pinned for this version of the schema. Only chromosomes or scaffolds that have at least one gene feature present are included.
+
+### human (GRCh38.p14)
+
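+| Chromosome | Length |
+|---|---|
+| chr1 | 248956422 |
+| chr2 | 242193529 |
+| chr3 | 198295559 |
+| chr4 | 190214555 |
+| chr5 | 181538259 |
+| chr6 | 170805979 |
+| chr7 | 159345973 |
+| chr8 | 145138636 |
+| chr9 | 138394717 |
+| chr10 | 133797422 |
+| chr11 | 135086622 |
+| chr12 | 133275309 |
+| chr13 | 114364328 |
+| chr14 | 107043718 |
+| chr15 | 101991189 |
+| chr16 | 90338345 |
+| chr17 | 83257441 |
+| chr18 | 80373285 |
+| chr19 | 58617616 |
+| chr20 | 64444167 |
+| chr21 | 46709983 |
+| chr22 | 50818468 |
+| chrX | 156040895 |
+| chrY | 57227415 |
+| chrM | 16569 |
+| GL000009.2 | 201709 |
+| GL000194.1 | 191469 |
+| GL000195.1 | 182896 |
+| GL000205.2 | 185591 |
+| GL000213.1 | 164239 |
+| GL000216.2 | 176608 |
+| GL000218.1 | 161147 |
+| GL000219.1 | 179198 |
+| GL000220.1 | 161802 |
+| GL000225.1 | 211173 |
+| KI270442.1 | 392061 |
+| KI270711.1 | 42210 |
+| KI270713.1 | 40745 |
+| KI270721.1 | 100316 |
+| KI270726.1 | 43739 |
+| KI270727.1 | 448248 |
+| KI270728.1 | 1872759 |
+| KI270731.1 | 150754 |
+| KI270733.1 | 179772 |
+| KI270734.1 | 165050 |
+| KI270744.1 | 168472 |
+| KI270750.1 | 148850 |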
+
+### mouse (GRCm39)
+
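+| Chromosome | Length |
+|---|---|
+| chr1 | 195154279 |
+| chr2 | 181755017 |
+| chr3 | 159745316 |
+| chr4 | 156860686 |
+| chr5 | 151758149 |
+| chr6 | 149588044 |
+| chr7 | 144995196 |
+| chr8 | 130127694 |
+| chr9 | 124359700 |
+| chr10 | 130530862 |
+| chr11 | 121973369 |
+| chr12 | 120092757 |
+| chr13 | 120883175 |
+| chr14 | 125139656 |
+| chr15 | 104073951 |
+| chr16 | 98008968 |
+| chr17 | 95294699 |
+| chr18 | 90720763 |
+| chr19 | 61420004 |
+| chrX | 169476592 |
+| chrY | 91455967 |
+| chrM | 16299 |
+| GL456210.1 | 169725 |
+| GL456211.1 | 241735 |
+| GL456212.1 | 153618 |
+| GL456219.1 | 175968 |
+| GL456221.1 | 206961 |
+| GL456239.1 | 40056 |
+| GL456354.1 | 195993 |
+| GL456372.1 | 28664 |
+| GL456381.1 | 25871 |
+| GL456385.1 | 35240 |
+| JH584295.1 | 1976 |
+| JH584296.1 | 199368 |
+| JH584297.1 | 205776 |
+| JH584298.1 | 184189 |
+| JH584299.1 | 953012 |
+| JH584303.1 | 158099 |
+| JH584304.1 | 114452 |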
\ No newline at end of file
From fc3a9acca82e90fcd13d515d7303bb8a8bb312e7 Mon Sep 17 00:00:00 2001
From: jahilton
Date: Wed, 23 Oct 2024 16:53:41 -0700
Subject: [PATCH 2/7] move atac schema into drafts

---
 schema/{ => drafts}/atac_schema.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename schema/{ => drafts}/atac_schema.md (100%)

diff --git a/schema/atac_schema.md b/schema/drafts/atac_schema.md
similarity index 100%
rename from schema/atac_schema.md
rename to schema/drafts/atac_schema.md
From 6cb2849e5b50eb391d55980e6a115af870183b17 Mon Sep 17 00:00:00 2001
From: jahilton
Date: Wed, 23 Oct 2024 16:54:31 -0700
Subject: [PATCH 3/7] use asset language

---
 schema/drafts/atac_schema.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/schema/drafts/atac_schema.md b/schema/drafts/atac_schema.md
index fc930c1b2..4aac5f00e 100644
--- a/schema/drafts/atac_schema.md
+++ b/schema/drafts/atac_schema.md
@@ -6,14 +6,14 @@
 
 ## Fragment File Dataset Criteria
 
-A Dataset MUST meet each of the following criteria in order to be eligible for an attached Fragment File:
+A Dataset MUST meet each of the following criteria in order to be eligible for an Fragment File asset:
 * the obs['assay_ontology_term_id'] values MUST all be paired assays or MUST all be unpaired assays
 * the obs['is_primary_data'] values MUST be all `True`
 * the var['feature_reference'] values MUST include one of "NCBITaxon:9606" for Homo sapiens or "NCBITaxon:10090" for Mus musculus, but not both. The value that is present will determine the appropriate Chromosome Table for standards.
 
-If the obs['assay_ontology_term_id'] values are all paired assays then a fragment file MAY be attached to the Dataset.
+If the obs['assay_ontology_term_id'] values are all paired assays then the Dataset MAY have a fragment file asset.
 
-If the obs['assay_ontology_term_id'] values are all unpaired assays then a fragment file MUST be attached to the Dataset.
+If the obs['assay_ontology_term_id'] values are all unpaired assays then the Dataset MUST have a fragment file asset.
 
 ## Fragment File
 
From 01bb4c11b16f881fd03f12d6a54df638b0e29df2 Mon Sep 17 00:00:00 2001
From: jahilton
Date: Thu, 24 Oct 2024 10:25:56 -0700
Subject: [PATCH 4/7] add Genome Tracks notes, consistent fragmentS file term

---
 schema/drafts/atac_schema.md | 45 ++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 9 deletions(-)

diff --git a/schema/drafts/atac_schema.md b/schema/drafts/atac_schema.md
index 4aac5f00e..40a7a450b 100644
--- a/schema/drafts/atac_schema.md
+++ b/schema/drafts/atac_schema.md
@@ -4,18 +4,20 @@
 
 unpaired assay: "EFO:0010891" for scATAC-seq or its descendants that is not a descendant of "EFO:0008913" for single-cell RNA sequencing
 
-## Fragment File Dataset Criteria
+## scATAC-seq asset Dataset Criteria
 
-A Dataset MUST meet each of the following criteria in order to be eligible for an Fragment File asset:
+A Dataset MUST meet each of the following criteria in order to be eligible for scATAC-seq assets:
 * the obs['assay_ontology_term_id'] values MUST all be paired assays or MUST all be unpaired assays
-* the obs['is_primary_data'] values MUST be all `True`
+* the obs['is_primary_data'] values MUST be all True
 * the var['feature_reference'] values MUST include one of "NCBITaxon:9606" for Homo sapiens or "NCBITaxon:10090" for Mus musculus, but not both. The value that is present will determine the appropriate Chromosome Table for standards.
 
-If the obs['assay_ontology_term_id'] values are all paired assays then the Dataset MAY have a fragment file asset.
+If the obs['assay_ontology_term_id'] values are all paired assays then the Dataset MAY have a fragments file asset.
 
-If the obs['assay_ontology_term_id'] values are all unpaired assays then the Dataset MUST have a fragment file asset.
-## Fragment File +If a Dataset has a fragments file asset, it MAY have genome track assets. Otherwise, it MUST NOT have genome track assets. + +## scATAC-seq Asset: Fragments File This MUST be a gzipped tab-separated values (TSV) file. @@ -75,7 +77,7 @@ The curator MUST annotate the following header-less columns. Additional columns Value - str. This MUST be the cell identifier. The value MUST be found in the obs index of the associated Dataset. Every obs index value of the associated Dataset MUST appear at least once in this column of the fragment file. + str. This MUST be the cell identifier. The value MUST be found in the obs index of the associated Dataset. Every obs index value of the associated Dataset MUST appear at least once in this column of the fragments file. @@ -96,9 +98,34 @@ The curator MUST annotate the following header-less columns. Additional columns
-## Fragment File index
+## scATAC-seq Asset: Fragments File index
+
+For every fragments file asset, CELLxGENE Discover MUST generate a tabix index of the fragment intervals from the fragments file. The file name MUST be the name of the corresponding fragments file appended with `.tbi`.
+
+## `uns` (Dataset Metadata)
+
+* Key: peak_grouping
+* Annotation: Curator MAY annotate if the Dataset has a fragments file asset; otherwise, this key MUST NOT be present.
+* Value: str. The value MUST match a key in obs. If annotated, genome track assets MUST be submitted.
+ +## scATAC-seq Asset: Genome Track + +If uns['peak_grouping'] is annotated, there MUST be exactly one genome track asset submitted for each unique value in the obs column specified as determined by anndata.obs.{peak_grouping_column}.unique(). Otherwise, this MUST NOT be submitted. -CELLxGENE Discover MUST generate a tabix index of the fragment intervals from the fragment file. The file name MUST be the name of the corresponding fragment file appended with `.tbi`. +Asset file specifications TBD based on the visualization solution. Accepting .bigWig format is a requirement. ## Chromosome Tables From 2458bad3f2b0af789320e15399834575079c85cb Mon Sep 17 00:00:00 2001 From: jahilton Date: Thu, 24 Oct 2024 12:16:21 -0700 Subject: [PATCH 5/7] block ontology id fields from peak_grouping --- schema/drafts/atac_schema.md | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/schema/drafts/atac_schema.md b/schema/drafts/atac_schema.md index 40a7a450b..d6adc5e1d 100644 --- a/schema/drafts/atac_schema.md +++ b/schema/drafts/atac_schema.md @@ -116,7 +116,18 @@ For every fragments file asset, CELLxGENE Discover MUST generate a tabix index of the fragment intervals from the fragments file. The file name MUST be the name of the corresponding fragments file appended with `.tbi`. +## scATAC-seq Asset: Fragments File (processed) -## `uns` (Dataset Metadata) +From every fragments file asset, CELLxGENE Discover MUST generate a tab-separated values (TSV) file position-sorted and compressed by bgzip. - - - - - - - - - - - - - -
-* Key: peak_grouping
-* Annotation: Curator MAY annotate if the Dataset has a fragments file asset; otherwise, this key MUST NOT be present.
-* Value: str. The value MUST match a key in obs. If annotated, genome track assets MUST be submitted. The following columns MUST NOT be specified:
-  * assay_ontology_term_id
-  * cell_type_ontology_term_id
-  * development_stage_ontology_term_id
-  * disease_ontology_term_id
-  * organism_ontology_term_id
-  * self_reported_ethnicity_ontology_term_id
-  * sex_ontology_term_id
-  * tissue_ontology_term_id
-
-  Instead specify the corresponding Discover column such as cell_type.
-
-## scATAC-seq Asset: Genome Track
-
-If uns['peak_grouping'] is annotated, there MUST be exactly one genome track asset submitted for each unique value in the obs column specified as determined by anndata.obs.{peak_grouping_column}.unique(). Otherwise, this MUST NOT be submitted.
+## scATAC-seq Asset: Fragments File index
 
-Asset file specifications TBD based on the visualization solution. Accepting .bigWig format is a requirement.
+From every fragments file (processed) asset, CELLxGENE Discover MUST generate a tabix index of the fragment intervals from the fragments file. The file name MUST be the name of the corresponding fragments file appended with `.tbi`.
 
 ## Chromosome Tables
 
diff --git a/schema/drafts/genome_track.md b/schema/drafts/genome_track.md
new file mode 100644
index 000000000..dc18a6e6d
--- /dev/null
+++ b/schema/drafts/genome_track.md
@@ -0,0 +1,41 @@
+## scATAC-seq asset Dataset Criteria
+
+See [fragments file schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/drafts/fragments_file.md) for criteria a Dataset MUST meet in order to be eligible for scATAC-seq assets.
+
+If a Dataset has a fragments file asset, it MAY have genome track assets. Otherwise, it MUST NOT have genome track assets.
+
+## `uns` (Dataset Metadata)
+
+* Key: peak_grouping
+* Annotation: Curator MAY annotate if the Dataset has a fragments file asset; otherwise, this key MUST NOT be present.
+* Value: str. The value MUST match a key in obs. If annotated, genome track assets MUST be submitted. The following columns MUST NOT be specified:
+  * assay_ontology_term_id
+  * cell_type_ontology_term_id
+  * development_stage_ontology_term_id
+  * disease_ontology_term_id
+  * organism_ontology_term_id
+  * self_reported_ethnicity_ontology_term_id
+  * sex_ontology_term_id
+  * tissue_ontology_term_id
+
+  Instead specify the corresponding Discover column such as cell_type.
+
+## scATAC-seq Asset: Genome Track
+
+If uns['peak_grouping'] is annotated, there MUST be exactly one genome track asset submitted for each unique value in the obs column specified as determined by anndata.obs.{peak_grouping_column}.unique(). Otherwise, this MUST NOT be submitted.
+
+Asset file specifications TBD based on the visualization solution. Accepting .bigWig format is a requirement.
From 2dfd9061c414395d3d486166f978a09644328388 Mon Sep 17 00:00:00 2001
From: jahilton
Date: Thu, 31 Oct 2024 16:38:12 -0700
Subject: [PATCH 7/7] specify naming convention for fragments file and index

---
 schema/drafts/fragments_file.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/schema/drafts/fragments_file.md b/schema/drafts/fragments_file.md
index 813900432..db4cc8cc7 100644
--- a/schema/drafts/fragments_file.md
+++ b/schema/drafts/fragments_file.md
@@ -99,11 +99,11 @@ The curator MUST annotate the following header-less columns. Additional columns
 
 ## scATAC-seq Asset: Fragments File (processed)
 
-From every fragments file asset, CELLxGENE Discover MUST generate a tab-separated values (TSV) file position-sorted and compressed by bgzip.
+From every fragments file asset, CELLxGENE Discover MUST generate {dataset_version_id}-fragments.tsv.gz, a tab-separated values (TSV) file position-sorted and compressed by bgzip.
 
 ## scATAC-seq Asset: Fragments File index
 
-From every fragments file (processed) asset, CELLxGENE Discover MUST generate a tabix index of the fragment intervals from the fragments file. The file name MUST be the name of the corresponding fragments file appended with `.tbi`.
+From every fragments file (processed) asset, CELLxGENE Discover MUST generate {dataset_version_id}-fragments.tsv.gz.tbi, a tabix index of the fragment intervals from the fragments file.
 
 ## Chromosome Tables
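
Reviewer note: the five column rules for the fragments file can be sketched as a small checker. This is a minimal sketch under stated assumptions, not CELLxGENE's validator. The names `check_row`, `check_fragments`, and `HUMAN_CHROM_LENGTHS` are hypothetical, the chromosome dictionary holds only two rows of the human Chromosome Table, and rejecting a negative start is an inference from "0-based" rather than a rule the schema states.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical excerpt of the human Chromosome Table; a real check would load every row.
HUMAN_CHROM_LENGTHS: Dict[str, int] = {"chr1": 248956422, "chrM": 16569}


def check_row(fields: List[str], chrom_lengths: Dict[str, int], cell_ids: Set[str]) -> List[str]:
    """Return the rule violations for one split fragments-file line (empty if valid)."""
    if len(fields) != 5:
        return [f"expected 5 columns, found {len(fields)}"]
    chrom, start_text, end_text, cell_id, count_text = fields
    try:
        start, end, count = int(start_text), int(end_text), int(count_text)
    except ValueError:
        return ["columns 2, 3 and 5 MUST be integers"]
    errors = []
    if chrom not in chrom_lengths:
        errors.append(f"chromosome {chrom!r} is not in the Chromosome Table")
    elif end > chrom_lengths[chrom]:
        errors.append("end MUST be less than or equal to the chromosome Length")
    if start < 0:  # assumption: implied by 0-based coordinates, not stated in the schema
        errors.append("start MUST NOT be negative")
    if end <= start:
        errors.append("end MUST be greater than start")
    if cell_id not in cell_ids:
        errors.append(f"cell identifier {cell_id!r} is not in the obs index")
    if count < 1:
        errors.append("read pair count MUST be 1 or greater")
    return errors


def check_fragments(lines: Iterable[str], chrom_lengths: Dict[str, int], cell_ids: Set[str]) -> List[str]:
    """Check every line, then confirm each obs index value appears in column 4."""
    errors: List[str] = []
    seen: Set[str] = set()
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            errors.append(f"line {number}: header lines beginning with '#' MUST NOT be included")
            continue
        fields = line.rstrip("\n").split("\t")
        errors.extend(f"line {number}: {e}" for e in check_row(fields, chrom_lengths, cell_ids))
        if len(fields) == 5:
            seen.add(fields[3])
    errors.extend(f"obs index value {c!r} never appears in column 4" for c in sorted(cell_ids - seen))
    return errors
```

In practice the lines would presumably come from `gzip.open(path, "rt")` and the cell identifiers from `set(adata.obs.index)`; both are assumptions about the caller, not part of the schema text.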