diff --git a/schema/drafts/fragments_file.md b/schema/drafts/fragments_file.md
new file mode 100644
index 000000000..db4cc8cc7
--- /dev/null
+++ b/schema/drafts/fragments_file.md
@@ -0,0 +1,478 @@
+## scATAC-seq assay types
+
+paired assay: any descendant of "EFO:0010891" for scATAC-seq that is also a descendant of "EFO:0008913" for single-cell RNA sequencing
+
+unpaired assay: "EFO:0010891" for scATAC-seq or its descendants that is not a descendant of "EFO:0008913" for single-cell RNA sequencing
+
+## scATAC-seq asset Dataset Criteria
+
+A Dataset MUST meet each of the following criteria in order to be eligible for scATAC-seq assets:
+* the obs['assay_ontology_term_id'] values MUST all be paired assays or MUST all be unpaired assays
+* the obs['is_primary_data'] values MUST be all True
+* the var['feature_reference'] values MUST include one of "NCBITaxon:9606" for Homo sapiens or "NCBITaxon:10090" for Mus musculus, but not both. The value that is present determines the appropriate Chromosome Table.
+
+If the obs['assay_ontology_term_id'] values are all paired assays then the Dataset MAY have a fragments file asset.
+
+If the obs['assay_ontology_term_id'] values are all unpaired assays then the Dataset MUST have a fragments file asset.
+
+
+## scATAC-seq Asset: Fragments File (submitted)
+
+This MUST be a gzipped tab-separated values (TSV) file.
+
+The curator MUST annotate the following header-less columns. Additional columns and header lines beginning with `#` MUST NOT be included.
+
+### first column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>str. This MUST be the reference genome chromosome the fragment is located on. The value MUST be one of the values from the Chromosome column in the appropriate Chromosome Table.
+      </td>
+    </tr>
+</tbody></table>
+
+### second column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>int. This MUST be the 0-based start coordinate of the fragment.
+      </td>
+    </tr>
+</tbody></table>
+
+### third column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>int. This MUST be the 0-based end coordinate of the fragment. The end position is exclusive, so it represents the position immediately following the fragment interval. The value MUST be greater than the start coordinate specified in the second column and less than or equal to the Length of the Chromosome specified in the first column, as given in the appropriate Chromosome Table.
+      </td>
+    </tr>
+</tbody></table>
+
+### fourth column
+
+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>str. This MUST be the cell identifier. The value MUST be found in the obs index of the associated Dataset. Every obs index value of the associated Dataset MUST appear at least once in this column of the fragments file.
+      </td>
+    </tr>
+</tbody></table>
+
+### fifth column
+
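+<table><tbody>
+    <tr>
+      <th>Annotator</th>
+      <td>Curator MUST annotate.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>int. This MUST be the total number of read pairs associated with this fragment. The value MUST be 1 or greater.
+      </td>
+    </tr>
+</tbody></table>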
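
The five column rules above can be sketched as a per-line check. This is a minimal illustration under stated assumptions, not the validator CELLxGENE Discover uses: the function names, the `CHROMOSOME_LENGTHS` excerpt, and the error messages are hypothetical, and the non-negative start check is inferred from "0-based" rather than stated in the text.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical two-row excerpt of the human Chromosome Table (name -> Length).
CHROMOSOME_LENGTHS: Dict[str, int] = {"chr1": 248956422, "chrM": 16569}


def validate_fragment_line(
    line: str, chrom_lengths: Dict[str, int], obs_index: Set[str]
) -> List[str]:
    """Return the rule violations found in one line of a fragments file."""
    if line.startswith("#"):
        return ["header lines beginning with '#' MUST NOT be included"]
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return [f"expected exactly 5 columns, found {len(fields)}"]
    chrom, start_s, end_s, cell_id, count_s = fields
    errors: List[str] = []
    if chrom not in chrom_lengths:
        errors.append(f"column 1: {chrom!r} is not in the Chromosome Table")
    try:
        start, end, count = int(start_s), int(end_s), int(count_s)
    except ValueError:
        errors.append("columns 2, 3 and 5 MUST be integers")
        return errors
    if start < 0:  # inferred from "0-based"; not spelled out in the text
        errors.append("column 2: start MUST NOT be negative")
    if end <= start:
        errors.append("column 3: end MUST be greater than start")
    if chrom in chrom_lengths and end > chrom_lengths[chrom]:
        errors.append("column 3: end MUST NOT exceed the chromosome Length")
    if cell_id not in obs_index:
        errors.append(f"column 4: {cell_id!r} is not in the obs index")
    if count < 1:
        errors.append("column 5: read pair count MUST be 1 or greater")
    return errors


def cells_without_fragments(lines: Iterable[str], obs_index: Set[str]) -> Set[str]:
    """Return obs index values that never appear in the fourth column."""
    seen = {ln.rstrip("\n").split("\t")[3] for ln in lines if ln.count("\t") >= 3}
    return set(obs_index) - seen
```

An empty list from `validate_fragment_line` means the line passed every check; a non-empty set from `cells_without_fragments` means the rule that every obs index value appears at least once is violated.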
+
+## scATAC-seq Asset: Fragments File (processed)
+
+From every fragments file asset, CELLxGENE Discover MUST generate {dataset_version_id}-fragments.tsv.gz, a tab-separated values (TSV) file position-sorted and compressed by bgzip.
+
+## scATAC-seq Asset: Fragments File index
+
+From every fragments file (processed) asset, CELLxGENE Discover MUST generate {dataset_version_id}-fragments.tsv.gz.tbi, a tabix index of the fragment intervals from the fragments file.
+
+## Chromosome Tables
+
+The Chromosome Tables are determined by the reference assembly used by the gene annotation versions pinned for this version of the schema. Only chromosomes or scaffolds that have at least one gene feature present are included.
+
+### human (GRCh38.p14)
+
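+<table>
+<thead>
+  <tr><th>Chromosome</th><th>Length</th></tr>
+</thead>
+<tbody>
+  <tr><td>chr1</td><td>248956422</td></tr>
+  <tr><td>chr2</td><td>242193529</td></tr>
+  <tr><td>chr3</td><td>198295559</td></tr>
+  <tr><td>chr4</td><td>190214555</td></tr>
+  <tr><td>chr5</td><td>181538259</td></tr>
+  <tr><td>chr6</td><td>170805979</td></tr>
+  <tr><td>chr7</td><td>159345973</td></tr>
+  <tr><td>chr8</td><td>145138636</td></tr>
+  <tr><td>chr9</td><td>138394717</td></tr>
+  <tr><td>chr10</td><td>133797422</td></tr>
+  <tr><td>chr11</td><td>135086622</td></tr>
+  <tr><td>chr12</td><td>133275309</td></tr>
+  <tr><td>chr13</td><td>114364328</td></tr>
+  <tr><td>chr14</td><td>107043718</td></tr>
+  <tr><td>chr15</td><td>101991189</td></tr>
+  <tr><td>chr16</td><td>90338345</td></tr>
+  <tr><td>chr17</td><td>83257441</td></tr>
+  <tr><td>chr18</td><td>80373285</td></tr>
+  <tr><td>chr19</td><td>58617616</td></tr>
+  <tr><td>chr20</td><td>64444167</td></tr>
+  <tr><td>chr21</td><td>46709983</td></tr>
+  <tr><td>chr22</td><td>50818468</td></tr>
+  <tr><td>chrX</td><td>156040895</td></tr>
+  <tr><td>chrY</td><td>57227415</td></tr>
+  <tr><td>chrM</td><td>16569</td></tr>
+  <tr><td>GL000009.2</td><td>201709</td></tr>
+  <tr><td>GL000194.1</td><td>191469</td></tr>
+  <tr><td>GL000195.1</td><td>182896</td></tr>
+  <tr><td>GL000205.2</td><td>185591</td></tr>
+  <tr><td>GL000213.1</td><td>164239</td></tr>
+  <tr><td>GL000216.2</td><td>176608</td></tr>
+  <tr><td>GL000218.1</td><td>161147</td></tr>
+  <tr><td>GL000219.1</td><td>179198</td></tr>
+  <tr><td>GL000220.1</td><td>161802</td></tr>
+  <tr><td>GL000225.1</td><td>211173</td></tr>
+  <tr><td>KI270442.1</td><td>392061</td></tr>
+  <tr><td>KI270711.1</td><td>42210</td></tr>
+  <tr><td>KI270713.1</td><td>40745</td></tr>
+  <tr><td>KI270721.1</td><td>100316</td></tr>
+  <tr><td>KI270726.1</td><td>43739</td></tr>
+  <tr><td>KI270727.1</td><td>448248</td></tr>
+  <tr><td>KI270728.1</td><td>1872759</td></tr>
+  <tr><td>KI270731.1</td><td>150754</td></tr>
+  <tr><td>KI270733.1</td><td>179772</td></tr>
+  <tr><td>KI270734.1</td><td>165050</td></tr>
+  <tr><td>KI270744.1</td><td>168472</td></tr>
+  <tr><td>KI270750.1</td><td>148850</td></tr>
+</tbody>
+</table>
+
+### mouse (GRCm39)
+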
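+<table>
+<thead>
+  <tr><th>Chromosome</th><th>Length</th></tr>
+</thead>
+<tbody>
+  <tr><td>chr1</td><td>195154279</td></tr>
+  <tr><td>chr2</td><td>181755017</td></tr>
+  <tr><td>chr3</td><td>159745316</td></tr>
+  <tr><td>chr4</td><td>156860686</td></tr>
+  <tr><td>chr5</td><td>151758149</td></tr>
+  <tr><td>chr6</td><td>149588044</td></tr>
+  <tr><td>chr7</td><td>144995196</td></tr>
+  <tr><td>chr8</td><td>130127694</td></tr>
+  <tr><td>chr9</td><td>124359700</td></tr>
+  <tr><td>chr10</td><td>130530862</td></tr>
+  <tr><td>chr11</td><td>121973369</td></tr>
+  <tr><td>chr12</td><td>120092757</td></tr>
+  <tr><td>chr13</td><td>120883175</td></tr>
+  <tr><td>chr14</td><td>125139656</td></tr>
+  <tr><td>chr15</td><td>104073951</td></tr>
+  <tr><td>chr16</td><td>98008968</td></tr>
+  <tr><td>chr17</td><td>95294699</td></tr>
+  <tr><td>chr18</td><td>90720763</td></tr>
+  <tr><td>chr19</td><td>61420004</td></tr>
+  <tr><td>chrX</td><td>169476592</td></tr>
+  <tr><td>chrY</td><td>91455967</td></tr>
+  <tr><td>chrM</td><td>16299</td></tr>
+  <tr><td>GL456210.1</td><td>169725</td></tr>
+  <tr><td>GL456211.1</td><td>241735</td></tr>
+  <tr><td>GL456212.1</td><td>153618</td></tr>
+  <tr><td>GL456219.1</td><td>175968</td></tr>
+  <tr><td>GL456221.1</td><td>206961</td></tr>
+  <tr><td>GL456239.1</td><td>40056</td></tr>
+  <tr><td>GL456354.1</td><td>195993</td></tr>
+  <tr><td>GL456372.1</td><td>28664</td></tr>
+  <tr><td>GL456381.1</td><td>25871</td></tr>
+  <tr><td>GL456385.1</td><td>35240</td></tr>
+  <tr><td>JH584295.1</td><td>1976</td></tr>
+  <tr><td>JH584296.1</td><td>199368</td></tr>
+  <tr><td>JH584297.1</td><td>205776</td></tr>
+  <tr><td>JH584298.1</td><td>184189</td></tr>
+  <tr><td>JH584299.1</td><td>953012</td></tr>
+  <tr><td>JH584303.1</td><td>158099</td></tr>
+  <tr><td>JH584304.1</td><td>114452</td></tr>
+</tbody>
+</table>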
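
The Dataset criteria at the top of this file, together with the choice between the two Chromosome Tables, can be sketched as follows. This is a hedged illustration only: `check_scatac_eligibility` and its parameters are hypothetical names, and the two descendant sets are assumed to be supplied by the caller, since resolving ontology descendants needs the EFO ontology itself.

```python
from typing import Dict, Iterable, Optional, Set

ATAC_ROOT = "EFO:0010891"  # scATAC-seq, as named in the assay type definitions

# Organism term -> heading of the Chromosome Table that applies.
CHROMOSOME_TABLE_BY_ORGANISM: Dict[str, str] = {
    "NCBITaxon:9606": "human (GRCh38.p14)",
    "NCBITaxon:10090": "mouse (GRCm39)",
}


def check_scatac_eligibility(
    assay_terms: Iterable[str],
    is_primary_data: Iterable[bool],
    feature_references: Iterable[str],
    atac_descendants: Set[str],  # assumed: descendants of EFO:0010891
    rna_descendants: Set[str],   # assumed: descendants of EFO:0008913
) -> Optional[Dict[str, str]]:
    """Return the fragments file requirement and Chromosome Table, or None if ineligible."""

    def is_paired(term: str) -> bool:
        return term in atac_descendants and term in rna_descendants

    def is_unpaired(term: str) -> bool:
        in_atac = term == ATAC_ROOT or term in atac_descendants
        return in_atac and term not in rna_descendants

    assays = set(assay_terms)
    all_paired = bool(assays) and all(is_paired(t) for t in assays)
    all_unpaired = bool(assays) and all(is_unpaired(t) for t in assays)
    # Exactly one of the two supported organisms must be present.
    organisms = set(feature_references) & set(CHROMOSOME_TABLE_BY_ORGANISM)

    if not (all_paired or all_unpaired):
        return None
    if not all(is_primary_data):
        return None
    if len(organisms) != 1:
        return None
    return {
        "fragments_file": "MAY" if all_paired else "MUST",
        "chromosome_table": CHROMOSOME_TABLE_BY_ORGANISM[organisms.pop()],
    }
```

Returning `None` for an ineligible Dataset keeps the sketch short; a real validator would more likely report which criterion failed.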
\ No newline at end of file
diff --git a/schema/drafts/genome_track.md b/schema/drafts/genome_track.md
new file mode 100644
index 000000000..dc18a6e6d
--- /dev/null
+++ b/schema/drafts/genome_track.md
@@ -0,0 +1,41 @@
+## scATAC-seq asset Dataset Criteria
+
+See [fragments file schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/drafts/fragments_file.md) for criteria a Dataset MUST meet in order to be eligible for scATAC-seq assets.
+
+If a Dataset has a fragments file asset, it MAY have genome track assets. Otherwise, it MUST NOT have genome track assets.
+
+## `uns` (Dataset Metadata)
+
+<table><tbody>
+    <tr>
+      <th>Key</th>
+      <td>peak_grouping</td>
+    </tr>
+    <tr>
+      <th>Annotation</th>
+      <td>Curator MAY annotate if the Dataset has a fragments file asset; otherwise, this key MUST NOT be present.</td>
+    </tr>
+    <tr>
+      <th>Value</th>
+      <td>
+        str. The value MUST match a key in obs. If annotated, genome track assets MUST be submitted. The following columns MUST NOT be specified:
+        <ul>
+          <li>assay_ontology_term_id</li>
+          <li>cell_type_ontology_term_id</li>
+          <li>development_stage_ontology_term_id</li>
+          <li>disease_ontology_term_id</li>
+          <li>organism_ontology_term_id</li>
+          <li>self_reported_ethnicity_ontology_term_id</li>
+          <li>sex_ontology_term_id</li>
+          <li>tissue_ontology_term_id</li>
+        </ul>
+        Instead specify the corresponding Discover column such as cell_type.
+      </td>
+    </tr>
+</tbody></table>
+
+## scATAC-seq Asset: Genome Track
+
+If uns['peak_grouping'] is annotated, there MUST be exactly one genome track asset submitted for each unique value in the specified obs column, as determined by anndata.obs.{peak_grouping_column}.unique(). Otherwise, genome track assets MUST NOT be submitted.
+
+Asset file specifications are TBD, pending the choice of visualization solution. Accepting the .bigWig format is a requirement.
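
The peak_grouping rules and the one-track-per-unique-value requirement can be sketched as below. This is an illustrative assumption, not the Discover implementation: `expected_genome_tracks` is a hypothetical helper, and obs is modelled as a plain dict of column name to values rather than a real AnnData object.

```python
from typing import Dict, List, Sequence

# Columns that MUST NOT be used as peak_grouping, per the table above.
FORBIDDEN_PEAK_GROUPING_COLUMNS = {
    "assay_ontology_term_id",
    "cell_type_ontology_term_id",
    "development_stage_ontology_term_id",
    "disease_ontology_term_id",
    "organism_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
    "tissue_ontology_term_id",
}


def expected_genome_tracks(
    obs: Dict[str, Sequence[str]], uns: Dict[str, str], has_fragments_file: bool
) -> List[str]:
    """Return the group values that each require exactly one genome track asset."""
    if "peak_grouping" not in uns:
        return []  # no peak_grouping: genome tracks MUST NOT be submitted
    if not has_fragments_file:
        raise ValueError("peak_grouping MUST NOT be present without a fragments file")
    column = uns["peak_grouping"]
    if column in FORBIDDEN_PEAK_GROUPING_COLUMNS:
        raise ValueError(f"{column!r} MUST NOT be specified; use the Discover column")
    if column not in obs:
        raise ValueError(f"{column!r} does not match a key in obs")
    # dict.fromkeys de-duplicates in first-seen order, like pandas' Series.unique().
    return list(dict.fromkeys(obs[column]))
```

The length of the returned list is the number of genome track assets a submission would need under this reading of the rule.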