diff --git a/.github/ISSUE_TEMPLATE/add-species.md b/.github/ISSUE_TEMPLATE/add-species.md
new file mode 100644
index 000000000..31504d143
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/add-species.md
@@ -0,0 +1,228 @@
+---
+name: Add species
+about: Editor's template for adding new species
+title: Draft
+labels: drafting, multispecies discovery, schema
+assignees: brianraymor
+
+---
+
+## Pending Issues
+
+1. Waiting on sscrdv to be submitted to OLS for use in references
+1. [FAANG](http://www.faang.org/) is the Functional Annotation of ANimal Genomes project. _We are working to understand the genotype to phenotype link in domesticated animals._ Per their [Ontology Improver](https://data.faang.org/ontology?sortTerm=key&sortDirection=asc), *Dv terms are not referenced. Both UBERON and CL are in use. Their [schema](https://github.com/FAANG/dcc-metadata/blob/9e7c1b5304fc57a724d197384e83243562bebbf4/json_schema/type/samples/faang_samples_specimen.metadata_rules.json#L154):
+
+```
+"name": "developmental stage",
+"description": "Ontology for Developmental stage, UBERON is preferred to EFO.",
+```
+
+
+## Design
+
+This draft design reflects additions to corresponding sections in [schema 5.2.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md). Reviewers are expected to be familiar with the CELLxGENE schema.
+
+**Editorial Notes** that are inlined in the design below will not be surfaced in the schema.
+
+---
+
+### Required Ontologies
+
+
+| Ontology | OBO Prefix | Release | Download |
+|:--|:--|:--|:--|
+| [Unavailable](https://github.com/OBOFoundry/OBOFoundry.github.io/tree/master/ontology) | SscrDv | [Releases](https://github.com/obophenotype/developmental-stage-ontologies/releases) | TBD |
+|||||
+
+
+#### Editorial Notes
+
+This ontology is under active development. CELLxGENE pins ontology releases in each version of the schema. A specific release of the ontology above must be selected in the future.
+
+
+---
+
+### Required Gene Annotations
+
+| Organism | Source | Required version | Download |
+|:--|:--|:--|:--|
+| "NCBITaxon:9823" for Sus scrofa domesticus | [ENSEMBL (Sus scrofa domesticus)] | Sscrofa11.1 (GCA_000003025.6) | [Sus_scrofa.Sscrofa11.1.113.gtf] |
+
+
+[ENSEMBL (Sus scrofa domesticus)]: https://useast.ensembl.org/Sus_scrofa/Info/Index
+[Sus_scrofa.Sscrofa11.1.113.gtf]: https://ftp.ensembl.org/pub/release-113/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.113.gtf.gz
+
+#### Editorial Notes
+
+
+---
+
+## `obs` (Cell Metadata)
+
+### cell_type_ontology_term_id
+
+No schema changes are required.
+
+#### Editorial Notes
+
+---
+
+### development_stage_ontology_term_id
+
+| Key | development_stage_ontology_term_id |
+|:--|:--|
+| Annotator | Curator MUST annotate. |
+| Value | categorical with str categories. If unavailable, this MUST be "unknown". If organism_ontolology_term_id is "NCBITaxon:9823" for Sus scrofa domesticus, this MUST be the most accurate descendant of SscrDv:0000000 for life cycle stage. |
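
The rule in the table above can be illustrated with a minimal sketch. It uses a small hand-written parent map instead of a real SscrDv release, since none is pinned yet. The names `PARENTS`, `is_descendant`, and `stage_is_valid_for_pig` are hypothetical, and the parent links are assumptions made for illustration, not the ontology's actual structure. Treating the root term itself as invalid (strict descendants only) is also an assumption.

```python
from functools import lru_cache

ROOT = "SscrDv:0000000"  # life cycle stage root named in the rule above

# Hypothetical child -> parent links; a real check would load a pinned SscrDv release.
PARENTS = {
    "SscrDv:0000012": ROOT,              # 3-week-old stage (pig), parent assumed
    "SscrDv:0000018": "SscrDv:0000012",  # 21-day-old stage (pig), parent assumed
    "SscrDv:0000065": ROOT,              # 1-year-old stage (pig), parent assumed
}


@lru_cache(maxsize=None)
def is_descendant(term: str, ancestor: str) -> bool:
    """Return True if `term` is a strict descendant of `ancestor` in PARENTS."""
    parent = PARENTS.get(term)
    while parent is not None:
        if parent == ancestor:
            return True
        parent = PARENTS.get(parent)
    return False


def stage_is_valid_for_pig(stage: str) -> bool:
    """Accept "unknown" or any strict descendant of the SscrDv root."""
    return stage == "unknown" or is_descendant(stage, ROOT)
```

Caching the lookup keeps repeated checks cheap when the same term appears in many `obs` rows, which is the common case for categorical columns.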
+
+#### Editorial Notes
+
+This may be outdated, but [potential recommendations](https://github.com/obophenotype/developmental-stage-ontologies/blob/master/external/bgee/report.md#sus-scrofa):
+
+```
+UBERON:0000104 life cycle
+UBERON:0000068 embryo stage
+UBERON:0000106 zygote stage
+UBERON:0000107 cleavage stage
+UBERON:0007232 2 cell stage
+UBERON:0007233 4 cell stage
+UBERON:0007236 8 cell stage
+UBERON:0000108 blastula stage
+UBERON:0000109 gastrula stage
+UBERON:0000110 neurula stage
+UBERON:0000111 organogenesis stage
+SscrDv:0000081 ridge limb stage (pig)
+SscrDv:0000082 bud limb stage (pig)
+SscrDv:0000083 paddle limb stage (pig)
+UBERON:0007220 late embryonic stage
+UBERON:0000092 post-embryonic stage
+UBERON:0000066 fully formed stage
+UBERON:0000112 sexually immature stage
+UBERON:0018685 nursing stage
+UBERON:0007221 neonate stage
+SscrDv:0000072 0-day-old stage (pig)
+SscrDv:0000073 1-day-old stage (pig)
+SscrDv:0000074 2-day-old stage (pig)
+SscrDv:0000075 3-day-old stage (pig)
+SscrDv:0000076 4-day-old stage (pig)
+SscrDv:0000077 5-day-old stage (pig)
+SscrDv:0000078 6-day-old stage (pig)
+UBERON:0034920 infant stage
+SscrDv:0000010 1-week-old stage (pig)
+SscrDv:0000011 2-week-old stage (pig)
+SscrDv:0000012 3-week-old stage (pig)
+SscrDv:0000018 21-day-old stage (pig)
+SscrDv:0000019 22-day-old stage (pig)
+SscrDv:0000020 23-day-old stage (pig)
+SscrDv:0000021 24-day-old stage (pig)
+SscrDv:0000022 25-day-old stage (pig)
+SscrDv:0000023 26-day-old stage (pig)
+SscrDv:0000024 27-day-old stage (pig)
+SscrDv:0000013 4-week-old stage (pig)
+SscrDv:0000025 28-day-old stage (pig)
+SscrDv:0000026 29-day-old stage (pig)
+SscrDv:0000027 30-day-old stage (pig)
+SscrDv:0000028 31-day-old stage (pig)
+SscrDv:0000029 32-day-old stage (pig)
+SscrDv:0000030 33-day-old stage (pig)
+SscrDv:0000031 34-day-old stage (pig)
+SscrDv:0000014 5-week-old stage (pig)
+SscrDv:0000032 35-day-old stage (pig)
+SscrDv:0000033 36-day-old stage (pig)
+SscrDv:0000034 37-day-old stage (pig)
+SscrDv:0000035 38-day-old stage (pig)
+SscrDv:0000036 39-day-old stage (pig)
+SscrDv:0000037 40-day-old stage (pig)
+SscrDv:0000038 41-day-old stage (pig)
+SscrDv:0000015 6-week-old stage (pig)
+SscrDv:0000016 7-week-old stage (pig)
+UBERON:0034919 juvenile stage
+SscrDv:0000039 2-month-old stage (pig)
+SscrDv:0000017 8-week-old stage (pig)
+SscrDv:0000040 9-week-old stage (pig)
+SscrDv:0000041 10-week-old stage (pig)
+SscrDv:0000042 11-week-old stage (pig)
+SscrDv:0000043 3-month-old stage (pig)
+SscrDv:0000044 12-week-old stage (pig)
+SscrDv:0000045 13-week-old stage (pig)
+SscrDv:0000046 14-week-old stage (pig)
+SscrDv:0000047 15-week-old stage (pig)
+SscrDv:0000048 4-month-old stage (pig)
+SscrDv:0000049 16-week-old stage (pig)
+SscrDv:0000050 17-week-old stage (pig)
+SscrDv:0000051 18-week-old stage (pig)
+SscrDv:0000052 19-week-old stage (pig)
+SscrDv:0000053 20-week-old stage (pig)
+SscrDv:0000054 5-month-old stage (pig)
+SscrDv:0000055 21-week-old stage (pig)
+SscrDv:0000056 22-week-old stage (pig)
+SscrDv:0000057 23-week-old stage (pig)
+SscrDv:0000058 24-week-old stage (pig)
+SscrDv:0000059 6-month-old stage (pig)
+SscrDv:0000060 7-month-old stage (pig)
+SscrDv:0000061 8-month-old stage (pig)
+SscrDv:0000062 9-month-old stage (pig)
+SscrDv:0000063 10-month-old stage (pig)
+UBERON:0000113 post-juvenile
+UBERON:0018241 prime adult stage
+SscrDv:0000064 11-month-old stage (pig)
+SscrDv:0000065 1-year-old stage (pig)
+SscrDv:0000066 2-year-old stage (pig)
+SscrDv:0000067 3-year-old stage (pig)
+SscrDv:0000068 4-year-old stage (pig)
+SscrDv:0000069 5-year-old stage (pig)
+SscrDv:0000070 6-year-old stage (pig)
+SscrDv:0000071 7-year-old stage (pig)
+UBERON:0007222 late adult stage
+```
+
+---
+
+### disease_ontology_term_id
+
+No schema changes are required.
+ +#### Editorial Notes + +--- + +### organism_ontolology_term_id + +organism_ontolology_term_id is "NCBITaxon:9823" for Sus scrofa domesticus + +--- + +### sex_ontology_term_id + +No schema changes are required. + +#### Editorial Notes + +--- + +### tissue_ontology_term_id + +No schema changes are required. + + +#### Editorial Notes + +--- + +## Reference + + +[BGEE](https://www.bgee.org/species/9823) diff --git a/.github/ISSUE_TEMPLATE/tech-issue.md b/.github/ISSUE_TEMPLATE/tech-issue.md index 4965be8c6..7b75dbb8d 100644 --- a/.github/ISSUE_TEMPLATE/tech-issue.md +++ b/.github/ISSUE_TEMPLATE/tech-issue.md @@ -1,9 +1,11 @@ --- name: Tech Issue -about: Engineering-specific technical work that is not product-specific. Engineering team "owns" these issues. -title: "" +about: Engineering-specific technical work that is not product-specific. Engineering + team "owns" these issues. +title: '' labels: tech -assignees: "" +assignees: '' + --- ## Motivation diff --git a/.github/workflows/push_tests.yml b/.github/workflows/push_tests.yml index cd5577d4b..86de15a39 100644 --- a/.github/workflows/push_tests.yml +++ b/.github/workflows/push_tests.yml @@ -57,8 +57,9 @@ jobs: uses: actions/upload-artifact@v4 with: name: coverage-cli - path: ./.coverage* + path: .coverage* retention-days: 3 + include-hidden-files: true unit-tests-migration-assistant: runs-on: ubuntu-latest @@ -88,8 +89,9 @@ jobs: uses: actions/upload-artifact@v4 with: name: coverage-migration-assisstant - path: ./.coverage* + path: .coverage* retention-days: 3 + include-hidden-files: true unit-test-ontology-dry-run: runs-on: ubuntu-latest @@ -119,8 +121,9 @@ jobs: uses: actions/upload-artifact@v4 with: name: coverage-ontology-dry-run - path: ./.coverage* + path: .coverage* retention-days: 3 + include-hidden-files: true unit-test-genes-dry-run: runs-on: ubuntu-latest @@ -150,8 +153,9 @@ jobs: uses: actions/upload-artifact@v4 with: name: coverage-genes-dry-run - path: ./.coverage* + path: .coverage* retention-days: 
3 + include-hidden-files: true submit-codecoverage: needs: @@ -184,6 +188,7 @@ jobs: - name: Upload coverage to Codecov uses: codecov/codecov-action@v4 with: + token: ${{ secrets.CODECOV_TOKEN }} env_vars: OS,PYTHON files: ./coverage.xml flags: unittests diff --git a/cellxgene_schema_cli/cellxgene_schema/cli.py b/cellxgene_schema_cli/cellxgene_schema/cli.py index 33fce82b6..1254a7ba1 100644 --- a/cellxgene_schema_cli/cellxgene_schema/cli.py +++ b/cellxgene_schema_cli/cellxgene_schema/cli.py @@ -1,7 +1,10 @@ +import logging import sys import click +logger = logging.getLogger("cellxgene_schema") + @click.group( name="schema", @@ -9,11 +12,13 @@ short_help="Apply and validate the cellxgene data integration schema to an h5ad file.", context_settings=dict(max_content_width=85, help_option_names=["-h", "--help"]), ) -def schema_cli(): - pass +@click.option("-v", "--verbose", help="When present will set logging level to debug", is_flag=True) +def schema_cli(verbose): + logging.basicConfig(level=logging.ERROR) + logger.setLevel(logging.DEBUG if verbose else logging.INFO) -@click.command( +@schema_cli.command( name="validate", short_help="Check that an h5ad follows the cellxgene data integration schema.", help="Check that an h5ad follows the cellxgene data integration schema. 
If validation fails this command will " @@ -31,27 +36,25 @@ def schema_cli(): type=click.Path(exists=False, dir_okay=False, writable=True), ) @click.option("-i", "--ignore-labels", help="Ignore ontology labels when validating", is_flag=True) -@click.option("-v", "--verbose", help="When present will set logging level to debug", is_flag=True) -def schema_validate(h5ad_file, add_labels_file, ignore_labels, verbose): +def schema_validate(h5ad_file, add_labels_file, ignore_labels): # Imports are very slow so we defer loading until Click arg validation has passed - - print("Loading dependencies") + logger.info("Loading dependencies") try: import anndata # noqa: F401 except ImportError: raise click.ClickException("[cellxgene] cellxgene-schema requires anndata") from None - print("Loading validator modules") + logger.info("Loading validator modules") from .validate import validate - is_valid, _, _ = validate(h5ad_file, add_labels_file, ignore_labels=ignore_labels, verbose=verbose) + is_valid, _, _ = validate(h5ad_file, add_labels_file, ignore_labels=ignore_labels) if is_valid: sys.exit(0) else: sys.exit(1) -@click.command( +@schema_cli.command( name="remove-labels", short_help="Create a copy of an h5ad without portal-added labels", help="Create a copy of an h5ad without portal-added labels.", @@ -61,24 +64,24 @@ def schema_validate(h5ad_file, add_labels_file, ignore_labels, verbose): def remove_labels(input_file, output_file): from .remove_labels import AnnDataLabelRemover - print("Loading dependencies") + logger.info("Loading dependencies") try: import anndata # noqa: F401 except ImportError: raise click.ClickException("[cellxgene] cellxgene-schema requires anndata") from None - print(f"Loading h5ad from {input_file}") + logger.info(f"Loading h5ad from {input_file}") adata = anndata.read_h5ad(input_file) anndata_label_remover = AnnDataLabelRemover(adata) if not anndata_label_remover.schema_def: return - print("Removing labels") + logger.info("Removing labels") 
anndata_label_remover.remove_labels() - print(f"Labels have been removed. Writing to {output_file}") + logger.info(f"Labels have been removed. Writing to {output_file}") anndata_label_remover.adata.write(output_file, compression="gzip") -@click.command( +@schema_cli.command( name="migrate", short_help="Convert an h5ad to the latest schema version.", help="Convert an h5ad from the previous to latest minor schema version. No validation will be " @@ -94,9 +97,5 @@ def migrate(input_file, output_file, collection_id, dataset_id): migrate(input_file, output_file, collection_id, dataset_id) -schema_cli.add_command(schema_validate) -schema_cli.add_command(migrate) -schema_cli.add_command(remove_labels) - if __name__ == "__main__": schema_cli() diff --git a/cellxgene_schema_cli/cellxgene_schema/schema_definitions/schema_definition.yaml b/cellxgene_schema_cli/cellxgene_schema/schema_definitions/schema_definition.yaml index 28a3fad54..28a153d73 100644 --- a/cellxgene_schema_cli/cellxgene_schema/schema_definitions/schema_definition.yaml +++ b/cellxgene_schema_cli/cellxgene_schema/schema_definitions/schema_definition.yaml @@ -186,7 +186,12 @@ components: type: curie dependencies: - # If tissue_type is tissue OR organoid - rule: "tissue_type == 'tissue' | tissue_type == 'organoid'" + rule: + column: tissue_type + match_exact: + terms: + - tissue + - organoid error_message_suffix: >- When 'tissue_type' is 'tissue' or 'organoid', 'tissue_ontology_term_id' MUST be a descendant term id of 'UBERON:0001062' (anatomical entity). 
@@ -199,7 +204,11 @@ components: UBERON: - UBERON:0001062 - # If tissue_type is cell culture - rule: "tissue_type == 'cell culture'" + rule: + column: tissue_type + match_exact: + terms: + - cell culture error_message_suffix: >- When 'tissue_type' is 'cell culture', 'tissue_ontology_term_id' MUST be either a CL term (excluding 'CL:0000255' (eukaryotic cell), 'CL:0000257' (Eumycetozoan cell), @@ -222,7 +231,11 @@ components: type: curie dependencies: - # If organism is Human - rule: "organism_ontology_term_id == 'NCBITaxon:9606'" + rule: + column: organism_ontology_term_id + match_exact: + terms: + - NCBITaxon:9606 error_message_suffix: >- When 'organism_ontology_term_id' is 'NCBITaxon:9606' (Homo sapiens), self_reported_ethnicity_ontology_term_id MUST be formatted as one @@ -285,7 +298,11 @@ components: type: curie dependencies: - # If organism is Human - rule: "organism_ontology_term_id == 'NCBITaxon:9606'" + rule: + column: organism_ontology_term_id + match_exact: + terms: + - NCBITaxon:9606 error_message_suffix: >- When 'organism_ontology_term_id' is 'NCBITaxon:9606' (Homo sapiens), 'development_stage_ontology_term_id' MUST be the most accurate descendant of 'HsapDv:0000001' or unknown. @@ -300,7 +317,11 @@ components: exceptions: - unknown - # If organism is Mouse - rule: "organism_ontology_term_id == 'NCBITaxon:10090'" + rule: + column: organism_ontology_term_id + match_exact: + terms: + - NCBITaxon:10090 error_message_suffix: >- When 'organism_ontology_term_id' is 'NCBITaxon:10090' (Mus musculus), 'development_stage_ontology_term_id' MUST be the most accurate descendant of 'MmusDv:0000001' or unknown. @@ -353,227 +374,70 @@ components: selected the most appropriate value for the assay(s) between 'cell', 'nucleus', and 'na'. Please contact cellxgene@chanzuckerberg.com during submission so that the assay(s) can be added to the schema definition document. 
dependencies: - - # If assay_ontology_term_id is EFO:0030080 or its descendants, 'suspension_type' MUST be 'cell' or 'nucleus' - complex_rule: - match_ancestors: - column: assay_ontology_term_id + - # 'suspension_type' MUST be 'cell' or 'nucleus' + rule: + column: assay_ontology_term_id + match_ancestors_inclusive: ancestors: - EFO: - - EFO:0030080 - inclusive: True + - EFO:0030080 + - EFO:0010184 + match_exact: + terms: + - EFO:0010010 + - EFO:0008722 + - EFO:0010550 + - EFO:0008780 + - EFO:0700010 + - EFO:0700011 + - EFO:0009919 + - EFO:0030060 + - EFO:0022490 + - EFO:0030028 type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0030080 or its descendants enum: - "cell" - "nucleus" - - # If assay_ontology_term_id is EFO:0007045 or its descendants, 'suspension_type' MUST be 'nucleus' - complex_rule: - match_ancestors: - column: assay_ontology_term_id + - # 'suspension_type' MUST be 'nucleus' + rule: + column: assay_ontology_term_id + match_ancestors_inclusive: ancestors: - EFO: - - EFO:0007045 - inclusive: True - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0007045 or its descendants - enum: - - "nucleus" - - # If assay_ontology_term_id is EFO:0010184 or its descendants, 'suspension_type' MUST be 'cell' or 'nucleus' - complex_rule: - match_ancestors: - column: assay_ontology_term_id - ancestors: - EFO: - - EFO:0010184 - inclusive: True + - EFO:0007045 + - EFO:0002761 + match_exact: + terms: + - EFO:0008720 + - EFO:0030026 type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0010184 or its descendants enum: - - "cell" - "nucleus" - - # If assay_ontology_term_id is EFO:0008994 or its descendants, 'suspension_type' MUST be 'na' - complex_rule: - match_ancestors: - column: assay_ontology_term_id + - #'suspension_type' MUST be 'cell' + rule: + column: assay_ontology_term_id + match_ancestors_inclusive: ancestors: - EFO: - - EFO:0008994 - inclusive: True - type: 
categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008994 or its descendants - enum: - - "na" - - # If assay_ontology_term_id is EFO:0008919 or its descendants, 'suspension_type' MUST be 'cell' - complex_rule: - match_ancestors: - column: assay_ontology_term_id - ancestors: - EFO: - - EFO:0008919 - inclusive: True + - EFO:0008919 + match_exact: + terms: + - EFO:0030002 + - EFO:0008853 + - EFO:0008796 + - EFO:0700003 + - EFO:0700004 + - EFO:0008953 type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008919 or its descendants enum: - "cell" - - # If assay_ontology_term_id is EFO:0002761 or its descendants, 'suspension_type' MUST be 'nucleus' - complex_rule: - match_ancestors: - column: assay_ontology_term_id + - # 'suspension_type' MUST be 'na' + rule: + column: assay_ontology_term_id + match_ancestors_inclusive: ancestors: - EFO: - - EFO:0002761 - inclusive: True - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0002761 or its descendants - enum: - - "nucleus" - - # If assay_ontology_term_id is EFO:0010010, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0010010'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0010010 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0008720, 'suspension_type' MUST be 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0008720'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008720 - enum: - - "nucleus" - - # If assay_ontology_term_id is EFO:0008722, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0008722'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008722 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0030002, 'suspension_type' MUST be 'cell' - rule: "assay_ontology_term_id == 'EFO:0030002'" - 
type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0030002 - enum: - - "cell" - - # If assay_ontology_term_id is EFO:0008853, 'suspension_type' MUST be 'cell' - rule: "assay_ontology_term_id == 'EFO:0008853'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008853 - enum: - - "cell" - - # If assay_ontology_term_id is EFO:0030026, 'suspension_type' MUST be 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0030026'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0030026 - enum: - - "nucleus" - - # If assay_ontology_term_id is EFO:0010550, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0010550'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0010550 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0008796, 'suspension_type' MUST be 'cell' - rule: "assay_ontology_term_id == 'EFO:0008796'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008796 - enum: - - "cell" - - # If assay_ontology_term_id is EFO:0700003, 'suspension_type' MUST be 'cell' - rule: "assay_ontology_term_id == 'EFO:0700003'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0700003 - enum: - - "cell" - - # If assay_ontology_term_id is EFO:0700004, 'suspension_type' MUST be 'cell' - rule: "assay_ontology_term_id == 'EFO:0700004'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0700004 - enum: - - "cell" - - # If assay_ontology_term_id is EFO:0008780, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0008780'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008780 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0008953, 'suspension_type' MUST be 'cell' - rule: "assay_ontology_term_id == 
'EFO:0008953'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0008953 - enum: - - "cell" - - # If assay_ontology_term_id is EFO:0700010, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0700010'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0700010 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0700011, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0700011'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0700011 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0009919, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0009919'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0009919 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0030060, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0030060'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0030060 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0022490, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0022490'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0022490 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0030028, 'suspension_type' MUST be 'cell' or 'nucleus' - rule: "assay_ontology_term_id == 'EFO:0030028'" - type: categorical - error_message_suffix: >- - when 'assay_ontology_term_id' is EFO:0030028 - enum: - - "cell" - - "nucleus" - - # If assay_ontology_term_id is EFO:0008992, 'suspension_type' MUST be 'na' - rule: "assay_ontology_term_id == 'EFO:0008992'" + - EFO:0008994 + match_exact: + terms: + - EFO:0008992 type: categorical - error_message_suffix: >- - when 
'assay_ontology_term_id' is EFO:0008992 enum: - "na" tissue_type: @@ -582,3 +446,15 @@ components: - "cell culture" - "organoid" - "tissue" + genetic_ancestry_African: + type: genetic_ancestry_value + genetic_ancestry_East_Asian: + type: genetic_ancestry_value + genetic_ancestry_European: + type: genetic_ancestry_value + genetic_ancestry_Indigenous_American: + type: genetic_ancestry_value + genetic_ancestry_Oceanian: + type: genetic_ancestry_value + genetic_ancestry_South_Asian: + type: genetic_ancestry_value diff --git a/cellxgene_schema_cli/cellxgene_schema/utils.py b/cellxgene_schema_cli/cellxgene_schema/utils.py index fb8f58f45..e2b558f7a 100644 --- a/cellxgene_schema_cli/cellxgene_schema/utils.py +++ b/cellxgene_schema_cli/cellxgene_schema/utils.py @@ -2,10 +2,12 @@ import os import sys from base64 import b85encode +from functools import lru_cache from typing import Dict, List, Union import anndata as ad import numpy as np +from cellxgene_ontology_guide.ontology_parser import OntologyParser from scipy import sparse from xxhash import xxh3_64_intdigest @@ -151,3 +153,15 @@ def get_hash_digest_column(dataframe): .astype(np.uint64) .apply(lambda v: b85encode(v.to_bytes(8, "big")).decode("ascii")) ) + + +@lru_cache() +def is_ontological_descendant_of(onto: OntologyParser, term: str, target: str, include_self: bool = True) -> bool: + """ + Determines if :term is an ontological descendant of :target and whether to include :term==:target. + + This function is cached and is safe to call many times. 
+ + #TODO:[EM] needs testing + """ + return term in set(onto.get_term_descendants(target, include_self)) diff --git a/cellxgene_schema_cli/cellxgene_schema/validate.py b/cellxgene_schema_cli/cellxgene_schema/validate.py index 5cc36abc3..25630556f 100644 --- a/cellxgene_schema_cli/cellxgene_schema/validate.py +++ b/cellxgene_schema_cli/cellxgene_schema/validate.py @@ -4,7 +4,7 @@ import os import re from datetime import datetime -from typing import Dict, List, Mapping, Optional, Union +from typing import Dict, List, Mapping, Optional, Tuple, Union import anndata import matplotlib.colors as mcolors @@ -13,23 +13,38 @@ import scipy from anndata.abc import CSCDataset, CSRDataset from cellxgene_ontology_guide.ontology_parser import OntologyParser -from pandas.errors import UndefinedVariableError from scipy import sparse from . import gencode, schema -from .utils import SPARSE_MATRIX_TYPES, get_matrix_format, getattr_anndata, read_h5ad +from .utils import SPARSE_MATRIX_TYPES, get_matrix_format, getattr_anndata, is_ontological_descendant_of, read_h5ad logger = logging.getLogger(__name__) -ONTOLOGY_PARSER = OntologyParser(schema_version=f"v{schema.get_current_schema_version()}") +ONTOLOGY_PARSER = OntologyParser(schema_version="v5.3.0") ASSAY_VISIUM = "EFO:0010961" +ASSAY_VISIUM_11M = "EFO:0022860" ASSAY_SLIDE_SEQV2 = "EFO:0030062" VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE = 4992 +VISIUM_11MM_AND_IS_SINGLE_TRUE_MATRIX_SIZE = 14336 +VISIUM_TISSUE_POSITION_MAX_ROW = 77 +VISIUM_TISSUE_POSITION_MAX_COL = 127 +VISIUM_11MM_TISSUE_POSITION_MAX_ROW = 127 +VISIUM_11MM_TISSUE_POSITION_MAX_COL = 223 SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE = 2000 +SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM = 4000 -ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE = "obs['assay_ontology_term_id'] 'EFO:0010961' (Visium Spatial Gene Expression) and uns['spatial']['is_single'] is True" +CONDITION_IS_VISIUM = "a descendant of 'EFO:0010961' (Visium Spatial Gene Expression)" +CONDITION_IS_VISIUM_11M = 
f"'{ASSAY_VISIUM_11M} (Visium CytAssist Spatial Gene Expression, 11mm)" +CONDITION_IS_SEQV2 = f"'{ASSAY_SLIDE_SEQV2}' (Slide-seqV2)" + +ERROR_SUFFIX_SPATIAL = f"obs['assay_ontology_term_id'] is either {CONDITION_IS_VISIUM} or {CONDITION_IS_SEQV2}" +ERROR_SUFFIX_VISIUM = f"obs['assay_ontology_term_id'] is {CONDITION_IS_VISIUM}" +ERROR_SUFFIX_VISIUM_11M = f"obs['assay_ontology_term_id'] is {CONDITION_IS_VISIUM_11M}" + +ERROR_SUFFIX_IS_SINGLE = "uns['spatial']['is_single'] is True" +ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE = f"{ERROR_SUFFIX_VISIUM} and {ERROR_SUFFIX_IS_SINGLE}" ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_FORBIDDEN = f"is only allowed for {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}" ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_REQUIRED = f"is required for {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}" ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_IN_TISSUE_0 = f"{ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE} and in_tissue is 0" @@ -42,13 +57,16 @@ def __init__(self, ignore_labels=False): self.schema_def = dict() self.schema_version: str = None self.ignore_labels = ignore_labels - self.visium_and_is_single_true_matrix_size = VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE + self._visium_and_is_single_true_matrix_size = None + self._hires_max_dimension_size = None + self._visium_error_suffix = None + self._visium_tissue_position_max = None # Values will be instances of gencode.GeneChecker, # keys will be one of gencode.SupportedOrganisms self.gene_checkers = dict() - def reset(self): + def reset(self, hi_res_size: Optional[int] = None, true_mat_size: Optional[int] = None): self.errors = [] self.warnings = [] self.is_valid = False @@ -57,6 +75,8 @@ def reset(self): self.is_spatial = None self.is_visium = None self.is_visium_and_is_single_true = None + self._hires_max_dimension_size = hi_res_size + self._visium_and_is_single_true_matrix_size = true_mat_size # Matrix (e.g., X, raw.X, ...) 
number non-zero cache self.number_non_zero = dict() @@ -70,6 +90,64 @@ def adata(self, adata: anndata.AnnData): self.reset() self._adata = adata + @property + def visium_and_is_single_true_matrix_size(self) -> Optional[int]: + """ + Returns the required matrix size based on assay type, if applicable, else returns None. + """ + if self._visium_and_is_single_true_matrix_size is None: + # Visium 11M's raw matrix size is distinct from other visium assays + if bool( + self.adata.obs["assay_ontology_term_id"] + .apply(lambda t: is_ontological_descendant_of(ONTOLOGY_PARSER, t, ASSAY_VISIUM_11M, True)) + .astype(bool) + .any() + ): + self._visium_error_suffix = f"{ERROR_SUFFIX_VISIUM_11M} and {ERROR_SUFFIX_IS_SINGLE}" + self._visium_and_is_single_true_matrix_size = VISIUM_11MM_AND_IS_SINGLE_TRUE_MATRIX_SIZE + elif self._is_visium_including_descendants(): + self._visium_error_suffix = f"{ERROR_SUFFIX_VISIUM} and {ERROR_SUFFIX_IS_SINGLE}" + self._visium_and_is_single_true_matrix_size = VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE + return self._visium_and_is_single_true_matrix_size + + @property + def hires_max_dimension_size(self) -> Optional[int]: + """ + Returns the restricted hires image dimension based on assay type, if applicable, else returns None. 
+ """ + if self._hires_max_dimension_size is None: + # Visium 11M's max dimension size is distinct from other visium assays + if bool( + self.adata.obs["assay_ontology_term_id"] + .apply(lambda t: is_ontological_descendant_of(ONTOLOGY_PARSER, t, ASSAY_VISIUM_11M, True)) + .astype(bool) + .any() + ): + self._visium_error_suffix = ERROR_SUFFIX_VISIUM_11M + self._hires_max_dimension_size = SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM + elif self._is_visium_including_descendants(): + self._visium_error_suffix = ERROR_SUFFIX_VISIUM + self._hires_max_dimension_size = SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE + return self._hires_max_dimension_size + + @property + def tissue_position_maxes(self) -> Tuple[int, int]: + if self._visium_tissue_position_max is None and self._is_visium_and_is_single_true: + # visium 11 has different requirements than other visium + if ( + self.adata.obs["assay_ontology_term_id"] + .apply(lambda t: is_ontological_descendant_of(ONTOLOGY_PARSER, t, ASSAY_VISIUM_11M, True)) + .astype(bool) + .any() + ): + self._visium_tissue_position_max = ( + VISIUM_11MM_TISSUE_POSITION_MAX_ROW, + VISIUM_11MM_TISSUE_POSITION_MAX_COL, + ) + else: + self._visium_tissue_position_max = (VISIUM_TISSUE_POSITION_MAX_ROW, VISIUM_TISSUE_POSITION_MAX_COL) + return self._visium_tissue_position_max + def _is_single(self) -> bool | None: """ Determine value of uns.spatial.is_single. None if non-spatial. 
@@ -95,9 +173,11 @@ def _is_supported_spatial_assay(self) -> bool: """ if self.is_spatial is None: try: - self.is_spatial = False - if self.adata.obs.assay_ontology_term_id.isin([ASSAY_VISIUM, ASSAY_SLIDE_SEQV2]).any(): - self.is_spatial = True + _spatial = ( + self._is_visium_including_descendants() + or self.adata.obs.assay_ontology_term_id.isin([ASSAY_SLIDE_SEQV2]).astype(bool).any() + ) + self.is_spatial = bool(_spatial) except AttributeError: # specific error reporting will occur downstream in the validation self.is_spatial = False @@ -211,7 +291,7 @@ def _validate_curie_ancestors( is_valid_term_id = ONTOLOGY_PARSER.is_valid_term_id(term_id) is_valid_ancestor_id = ONTOLOGY_PARSER.is_valid_term_id(ancestor) if is_valid_term_id & is_valid_ancestor_id: - is_descendant = ancestor in ONTOLOGY_PARSER.get_term_ancestors(term_id) + is_descendant = ancestor in ONTOLOGY_PARSER.get_term_ancestors(term_id, inclusive) checks.append(is_descendant) if True not in checks: @@ -407,6 +487,110 @@ def _count_matrix_nonzero(self, matrix_name: str, matrix: Union[np.ndarray, spar self.number_non_zero[matrix_name] = nnz return nnz + def _validate_genetic_ancestry(self): + """ + Performs row-based validation of the genetic_ancestry_X fields. 
This ensures that a valid row must be: + - all float('nan') if organism is not homo sapiens or info is unavailable + - sum to 1.0 + + Additionally, verifies that all rows with the same donor_id must have the same genetic ancestry values + """ + ancestry_columns = [ + "genetic_ancestry_African", + "genetic_ancestry_East_Asian", + "genetic_ancestry_European", + "genetic_ancestry_Indigenous_American", + "genetic_ancestry_Oceanian", + "genetic_ancestry_South_Asian", + ] + + organism_column = "organism_ontology_term_id" + donor_id_column = "donor_id" + + # Skip any additional validation if the genetic ancestry or organism columns are not present + # An error for missing columns will be raised at a different point + required_columns = ancestry_columns + [organism_column, donor_id_column] + for column in required_columns: + if column not in self.adata.obs.columns: + return + + donor_id_to_ancestry_values = dict() + + def is_valid_row(row): + ancestry_values = row[ancestry_columns] + + # If ancestry values are different for the same donor id, then this row is invalid + donor_id = row[donor_id_column] + if donor_id in donor_id_to_ancestry_values: + if not donor_id_to_ancestry_values[donor_id].equals(ancestry_values): + return False + else: + donor_id_to_ancestry_values[donor_id] = ancestry_values + + # All values are NaN. 
This is always valid, regardless of organism + if ancestry_values.isna().all(): + return True + + # If any values are NaN, and we didn't return in the earlier all NaN check, then + # this is invalid + if ancestry_values.isna().any(): + return False + + # If organism is not homo sapiens, and we didn't return in the earlier all NaN check, + # then this row is invalid + if row[organism_column] != "NCBITaxon:9606": + return False + + # The sum of genetic ancestry values should be approximately 1.0 + if ( + ancestry_values.apply(lambda x: isinstance(x, (float, int))).all() + and abs(ancestry_values.sum() - 1.0) <= 1e-6 + ): + return True + + return False + + invalid_rows = ~self.adata.obs.apply(is_valid_row, axis=1) + + if invalid_rows.any(): + donor_ids = self.adata.obs[donor_id_column].tolist() + unique_donor_ids = list(set(donor_ids)) + self.errors.append( + f"obs rows with donor ids {unique_donor_ids} have invalid genetic_ancestry_* values. All " + f"observations with the same donor_id must contain the same genetic_ancestry_* values. If " + f"organism_ontology_term_id is NOT 'NCBITaxon:9606' for Homo sapiens, then all genetic " + f"ancestry values MUST be float('nan'). 
If organism_ontology_term_id is 'NCBITaxon:9606' " + f"for Homo sapiens, then the value MUST be a float('nan') if unavailable; otherwise, the " + f"sum of all genetic_ancestry_* fields must be equal to 1.0" + ) + + def _validate_individual_genetic_ancestry_value(self, column: pd.Series, column_name: str): + """ + The following fields are valid for genetic_ancestry_value columns: + - float values between 0 and 1 + - float('nan') + """ + if column.dtype != float: + self.errors.append(f"Column '{column_name}' in obs must be float, not '{column.dtype.name}'.") + return + + def is_individual_value_valid(value): + if isinstance(value, (float, int)) and 0 <= value <= 1: + return True + # Ensures only float('nan') or numpy.nan is valid, None is invalid + if isinstance(value, float) and pd.isna(value): + return True + return False + + # Identify invalid values + invalid_values = column[~column.map(is_individual_value_valid)] + + if not invalid_values.empty: + self.errors.append( + f"Column '{column_name}' in obs contains invalid values: {invalid_values.to_list()}. " + f"Valid values are floats between 0 and 1 or float('nan')." + ) + def _validate_column_feature_is_filtered(self, column: pd.Series, column_name: str, df_name: str): """ Validates the "is_feature_filtered" in adata.var. This column must be bool, and for genes that are set to @@ -445,7 +629,9 @@ def _validate_column_feature_is_filtered(self, column: pd.Series, column_name: s f"these features must be 0." ) - def _validate_column(self, column: pd.Series, column_name: str, df_name: str, column_def: dict): + def _validate_column( + self, column: pd.Series, column_name: str, df_name: str, column_def: dict, default_error_message_suffix=None + ): """ Given a schema definition and the column of a dataframe, verify that the column satisfies the schema. 
If there are any errors, it adds them to self.errors @@ -455,6 +641,7 @@ def _validate_column(self, column: pd.Series, column_name: str, df_name: str, co :param str df_name: Name of the dataframe :param dict column_def: schema definition for this specific column, e.g. schema_def["obs"]["columns"]["cell_type_ontology_term_id"] + :param str default_error_message_suffix: default error message suffix to be added to errors found here :rtype None """ @@ -496,6 +683,9 @@ def _validate_column(self, column: pd.Series, column_name: str, df_name: str, co if column_def.get("type") == "feature_is_filtered": self._validate_column_feature_is_filtered(column, column_name, df_name) + if column_def.get("type") == "genetic_ancestry_value": + self._validate_individual_genetic_ancestry_value(column, column_name) + if "enum" in column_def: bad_enums = [v for v in column.drop_duplicates() if v not in column_def["enum"]] if bad_enums: @@ -520,10 +710,11 @@ def _validate_column(self, column: pd.Series, column_name: str, df_name: str, co self._validate_curie_str(term_str, column_name, column_def["curie_constraints"]) # Add error suffix to errors found here - if "error_message_suffix" in column_def: + error_message_suffix = column_def.get("error_message_suffix", default_error_message_suffix) + if error_message_suffix: error_total_count = len(self.errors) for i in range(error_original_count, error_total_count): - self.errors[i] = self.errors[i] + " " + column_def["error_message_suffix"] + self.errors[i] = self.errors[i] + " " + error_message_suffix def _validate_column_dependencies( self, df: pd.DataFrame, df_name: str, column_name: str, dependencies: List[dict] @@ -543,73 +734,38 @@ def _validate_column_dependencies( """ all_rules = [] - for dependency_def in dependencies: - if "complex_rule" in dependency_def: - if "match_ancestors" in dependency_def["complex_rule"]: - query_fn, args = self._generate_match_ancestors_query_fn( - dependency_def["complex_rule"]["match_ancestors"] - ) - 
term_id, ontologies, ancestors, ancestor_inclusive = args - query_exp = f"@query_fn({term_id}, {ontologies}, {ancestors}, {ancestor_inclusive})" - elif "rule" in dependency_def: - query_exp = dependency_def["rule"] - else: - continue - + terms_to_match = set() + column_to_match = dependency_def["rule"]["column"] + if "match_ancestors_inclusive" in dependency_def["rule"]: + ancestors = dependency_def["rule"]["match_ancestors_inclusive"]["ancestors"] + for ancestor in ancestors: + terms_to_match.update(ONTOLOGY_PARSER.get_term_descendants(ancestor, include_self=True)) + if "match_exact" in dependency_def["rule"]: + terms_to_match.update(dependency_def["rule"]["match_exact"]["terms"]) try: - column = getattr(df.query(query_exp, engine="python"), column_name) - except UndefinedVariableError: + match_query = df[column_to_match].isin(terms_to_match) + match_df = df[match_query] + column = getattr(match_df, column_name) + error_message_suffix = dependency_def.get("error_message_suffix", None) + if not error_message_suffix: + matched_values = list(getattr(match_df, column_to_match).unique()) + error_message_suffix = f"when '{column_to_match}' is in {matched_values}" + except KeyError: self.errors.append( f"Checking values with dependencies failed for adata.{df_name}['{column_name}'], " f"this is likely due to missing dependent column in adata.{df_name}." 
) return pd.Series(dtype=np.float64) - all_rules.append(query_exp) - - self._validate_column(column, column_name, df_name, dependency_def) + all_rules.append(match_query) + self._validate_column(column, column_name, df_name, dependency_def, error_message_suffix) - # Set column with the data that's left - all_rules = " | ".join(all_rules) - column = getattr(df.query("not (" + all_rules + " )", engine="python"), column_name) + # Return column of data that was not matched by any of the rules + column = getattr(df[~np.logical_or.reduce(all_rules)], column_name) return column - def _generate_match_ancestors_query_fn(self, rule_def: Dict): - """ - Generates vectorized function and args to query a pandas dataframe. Function will determine whether values from - a specified column is a descendant term to a group of specified ancestors, returning a Bool. - :param rule_def: defines arguments to pass into vectorized ancestor match validation function - :return: Tuple(function, Tuple(str, List[str], List[str])) - """ - validate_curie_ancestors_vectorized = np.vectorize(self._validate_curie_ancestors) - ancestor_map = rule_def["ancestors"] - inclusive = rule_def["inclusive"] - - # hack: pandas dataframe query doesn't support Dict inputs - ontology_keys = [] - ancestor_list = [] - for key, val in ancestor_map.items(): - ontology_keys.append(key) - ancestor_list.append(val) - - def is_ancestor_match( - term_id: str, - ontologies: List[str], - ancestors: List[str], - ancestor_inclusive: bool, - ) -> bool: - allowed_ancestors = dict(zip(ontologies, ancestors)) - return validate_curie_ancestors_vectorized(term_id, allowed_ancestors, inclusive=ancestor_inclusive) - - return is_ancestor_match, ( - rule_def["column"], - ontology_keys, - ancestor_list, - inclusive, - ) - def _validate_list(self, list_name: str, current_list: List[str], element_type: str): """ Validates the elements of a list based on the type definition. 
Adds errors to self.errors if any @@ -944,6 +1100,7 @@ def _validate_obsm(self): issue_list = self.errors regex_pattern = r"^[a-zA-Z][a-zA-Z0-9_.-]*$" + key_is_spatial = key.lower() == "spatial" unknown_key = False # an unknown key does not match 'spatial' or 'X_{suffix}' if key.startswith("X_"): @@ -954,7 +1111,7 @@ def _validate_obsm(self): self.errors.append( f"Suffix for embedding key in 'adata.obsm' {key} does not match the regex pattern {regex_pattern}." ) - elif key.lower() != "spatial": + elif not key_is_spatial: if not re.match(regex_pattern, key): self.errors.append( f"Embedding key in 'adata.obsm' {key} does not match the regex pattern {regex_pattern}." @@ -1002,7 +1159,11 @@ def _validate_obsm(self): # Check for inf/NaN values only if the dtype is numeric if np.isinf(value).any(): issue_list.append(f"adata.obsm['{key}'] contains positive infinity or negative infinity values.") - if np.all(np.isnan(value)): + + # spatial embeddings can't have any NaN; other embeddings can't be all NaNs + if key_is_spatial and np.any(np.isnan(value)): + issue_list.append("adata.obsm['spatial'] contains at least one NaN value.") + elif np.all(np.isnan(value)): issue_list.append(f"adata.obsm['{key}'] contains all NaN values.") if self._is_supported_spatial_assay() is False and obsm_with_x_prefix == 0: @@ -1107,7 +1268,7 @@ def _has_valid_raw(self, force: bool = False) -> bool: if is_visium_and_is_single_true and x.shape[0] != self.visium_and_is_single_true_matrix_size: self._raw_layer_exists = False self.errors.append( - f"When {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}, the raw matrix must be the " + f"When {self._visium_error_suffix}, the raw matrix must be the " f"unfiltered feature-barcode matrix 'raw_feature_bc_matrix'. It must have exactly " f"{self.visium_and_is_single_true_matrix_size} rows. Raw matrix row count is " f"{x.shape[0]}." @@ -1461,10 +1622,7 @@ def _validate_spatial_assay_ontology_term_id(self): # Validate assay ontology term ids are identical. 
term_count = obs["assay_ontology_term_id"].nunique() if term_count > 1: - self.errors.append( - "When obs['assay_ontology_term_id'] is either 'EFO:0010961' (Visium Spatial Gene Expression) or " - "'EFO:0030062' (Slide-seqV2), all observations must contain the same value." - ) + self.errors.append(f"When {ERROR_SUFFIX_SPATIAL}" ", all observations must contain the same value.") def _validate_spatial_cell_type_ontology_term_id(self): """ @@ -1472,18 +1630,27 @@ def _validate_spatial_cell_type_ontology_term_id(self): :rtype none """ - # Exit if: - # - not Visium and is_single is True as no further checks are necessary - # - in_tissue is not specified as checks are dependent on this value - if not self._is_visium_and_is_single_true() or "in_tissue" not in self.adata.obs: + self._is_visium_including_descendants() + self._is_single() + self._is_visium_and_is_single_true() + + # skip checks if not a valid spatial assay with a corresponding "in_tissue" column + if not self.is_visium_and_is_single_true: + # not a valid spatial assay + return + elif self.is_visium_and_is_single_true and "in_tissue" not in self.adata.obs.columns: + # valid spatial assay, but missing "in_tissue" column return - # Validate cell type: must be "unknown" if Visium and is_single is True and in_tissue is 0. 
- if ( - (self.adata.obs["assay_ontology_term_id"] == ASSAY_VISIUM) - & (self.adata.obs["in_tissue"] == 0) - & (self.adata.obs["cell_type_ontology_term_id"] != "unknown") - ).any(): + # Validate all out of tissue (in_tissue==0) spatial spots have unknown cell ontology term + is_spatial = ( + self.adata.obs["assay_ontology_term_id"] + .apply(lambda assay: is_ontological_descendant_of(ONTOLOGY_PARSER, assay, ASSAY_VISIUM, True)) + .astype(bool) + ) + is_not_tissue = self.adata.obs["in_tissue"] == 0 + is_not_unknown = self.adata.obs["cell_type_ontology_term_id"] != "unknown" + if (is_spatial & is_not_tissue & is_not_unknown).any(): self.errors.append( f"obs['cell_type_ontology_term_id'] must be 'unknown' when {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_IN_TISSUE_0}." ) @@ -1495,11 +1662,21 @@ def _validate_spatial_tissue_position(self, tissue_position_name: str, min: int, :rtype none """ + # check for visium status and then is visium and single + # techdebt: the following lines are order dependent. Violates idempotence. + self._is_visium_including_descendants() + self._is_single() + self._is_visium_and_is_single_true() + # Tissue position is foribidden if assay is not Visium and is_single is True. 
if tissue_position_name in self.adata.obs and ( - not self._is_visium_and_is_single_true() + not (self.is_visium_and_is_single_true) or ( - ~(self.adata.obs["assay_ontology_term_id"] == ASSAY_VISIUM) + ~( + self.adata.obs["assay_ontology_term_id"] + .apply(lambda t: is_ontological_descendant_of(ONTOLOGY_PARSER, t, ASSAY_VISIUM, True)) + .astype(bool) + ) & (self.adata.obs[tissue_position_name].notnull()) ).any() ): @@ -1516,7 +1693,11 @@ def _validate_spatial_tissue_position(self, tissue_position_name: str, min: int, if ( tissue_position_name not in self.adata.obs or ( - (self.adata.obs["assay_ontology_term_id"] == ASSAY_VISIUM) + ( + self.adata.obs["assay_ontology_term_id"] + .apply(lambda t: is_ontological_descendant_of(ONTOLOGY_PARSER, t, ASSAY_VISIUM, True)) + .astype(bool) + ) & (self.adata.obs[tissue_position_name].isnull()) ).any() ): @@ -1546,8 +1727,8 @@ def _validate_spatial_tissue_positions(self): :rtype none """ - self._validate_spatial_tissue_position("array_col", 0, 127) - self._validate_spatial_tissue_position("array_row", 0, 77) + self._validate_spatial_tissue_position("array_col", 0, self.tissue_position_maxes[1]) + self._validate_spatial_tissue_position("array_row", 0, self.tissue_position_maxes[0]) self._validate_spatial_tissue_position("in_tissue", 0, 1) def _check_spatial_uns(self): @@ -1573,10 +1754,7 @@ def _check_spatial_uns(self): uns_spatial = self.adata.uns.get("spatial") is_supported_spatial_assay = self._is_supported_spatial_assay() if uns_spatial is not None and not is_supported_spatial_assay: - self.errors.append( - "uns['spatial'] is only allowed for obs['assay_ontology_term_id'] values " - "'EFO:0010961' (Visium Spatial Gene Expression) and 'EFO:0030062' (Slide-seqV2)." - ) + self.errors.append(f"uns['spatial'] is only allowed when {ERROR_SUFFIX_SPATIAL}") return # Exit if we aren't dealing with a supported spatial assay as no further checks are necessary. 
@@ -1585,10 +1763,7 @@ def _check_spatial_uns(self): # spatial is required for supported spatial assays. if not isinstance(uns_spatial, dict): - self.errors.append( - "A dict in uns['spatial'] is required for obs['assay_ontology_term_id'] values " - "'EFO:0010961' (Visium Spatial Gene Expression) and 'EFO:0030062' (Slide-seqV2)." - ) + self.errors.append("A dict in uns['spatial'] is required when " f"{ERROR_SUFFIX_SPATIAL}.") return # is_single is required. @@ -1667,7 +1842,8 @@ def _check_spatial_uns(self): self.errors.append("uns['spatial'][library_id]['images'] must contain the key 'hires'.") # hires is specified: proceed with validation of hires. else: - self._validate_spatial_image_shape("hires", uns_images["hires"], SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE) + _max_size = self.hires_max_dimension_size + self._validate_spatial_image_shape("hires", uns_images["hires"], _max_size) # fullres is optional. uns_fullres = uns_images.get("fullres") @@ -1760,6 +1936,29 @@ def _is_visium(self) -> bool: self.is_visium = assay_ontology_term_id is not None and (assay_ontology_term_id == ASSAY_VISIUM).any() return self.is_visium + def _is_visium_including_descendants(self) -> bool: + """ + Determine if the assay_ontology_term_id is Visium (inclusive descendant of EFO:0010961). + Returns True if ANY assay_ontology_term_id is a Visium descendant + + :return True if assay_ontology_term_id is Visium, False otherwise. 
+ :rtype bool + """ + _assay_key = "assay_ontology_term_id" + + # only compute if not already stored + if self.is_visium is None and _assay_key in self.adata.obs.columns: + # check if any assay_ontology_term_ids are descendants of VISIUM + self.is_visium = bool( + self.adata.obs[_assay_key] + .astype("string") + .apply(lambda assay: is_ontological_descendant_of(ONTOLOGY_PARSER, assay, ASSAY_VISIUM, True)) + .astype(bool) + .any() + ) + + return self.is_visium + def _validate_spatial_image_shape(self, image_name: str, image: np.ndarray, max_dimension: int = None): """ Validate the spatial image is of shape (,,3 or 4) and has a max dimension, if specified. A spatial image @@ -1827,6 +2026,9 @@ def _deep_check(self): # Checks spatial self._check_spatial() + # Validate genetic ancestry + self._validate_genetic_ancestry() + # Checks each component for component_name, component_def in self.schema_def["components"].items(): logger.debug(f"Validating component: {component_name}") @@ -1873,8 +2075,6 @@ def validate_adata(self, h5ad_path: Union[str, bytes, os.PathLike] = None) -> bo :rtype bool """ logger.info("Starting validation...") - # Re-start errors in case a new h5ad is being validated - self.reset() if h5ad_path: logger.debug("Reading the h5ad file...") @@ -1882,6 +2082,8 @@ def validate_adata(self, h5ad_path: Union[str, bytes, os.PathLike] = None) -> bo self.h5ad_path = h5ad_path self._validate_encoding_version() logger.debug("Successfully read the h5ad file") + # Re-start errors in case a new h5ad is being validated + self.reset() # Fetches schema def for latest major schema version self._set_schema_def() @@ -1912,7 +2114,7 @@ def validate( add_labels_file: str = None, ignore_labels: bool = False, verbose: bool = False, -) -> (bool, list): +) -> (bool, list, bool): from .write_labels import AnnDataLabelAppender """ @@ -1921,16 +2123,13 @@ def validate( :param Union[str, bytes, os.PathLike] h5ad_path: Path to h5ad file to validate :param str add_labels_file: Path to 
new h5ad file with ontology/gene labels added - :return (True, []) if successful validation, (False, [list_of_errors]) otherwise + :return (True, [], False) if successful validation, (False, [list_of_errors], False) otherwise; + last bool is for seurat convertability which is deprecated / unused :rtype tuple """ # Perform validation start = datetime.now() - if verbose: - logging.basicConfig(level=logging.DEBUG) - else: - logging.basicConfig(level=logging.INFO, format="%(message)s") validator = Validator( ignore_labels=ignore_labels, ) @@ -1939,7 +2138,7 @@ def validate( # Stop if validation was unsuccessful if not validator.is_valid: - return False, validator.errors + return False, validator.errors, False if add_labels_file: label_start = datetime.now() @@ -1950,6 +2149,6 @@ def validate( f"{writer.was_writing_successful}" ) - return (validator.is_valid and writer.was_writing_successful, validator.errors + writer.errors) + return (validator.is_valid and writer.was_writing_successful, validator.errors + writer.errors, False) - return True, validator.errors + return True, validator.errors, False diff --git a/cellxgene_schema_cli/requirements.txt b/cellxgene_schema_cli/requirements.txt index e7328b7af..3128fb834 100644 --- a/cellxgene_schema_cli/requirements.txt +++ b/cellxgene_schema_cli/requirements.txt @@ -1,7 +1,8 @@ anndata>=0.8.0,<0.11 cellxgene-ontology-guide==1.3.0 # update before a schema migration click<9 -numpy<2 +Cython<4 +numpy<3 pandas>2,<3 scipy<1.15 # broken in 1.15.0 see https://github.com/chanzuckerberg/single-cell-curation/issues/1165 semver<4 diff --git a/cellxgene_schema_cli/tests/fixtures/examples_validate.py b/cellxgene_schema_cli/tests/fixtures/examples_validate.py index 470c165ce..accbecfcd 100644 --- a/cellxgene_schema_cli/tests/fixtures/examples_validate.py +++ b/cellxgene_schema_cli/tests/fixtures/examples_validate.py @@ -48,6 +48,12 @@ "HsapDv:0000003", "donor_1", "nucleus", + float("nan"), + float("nan"), + float("nan"), + float("nan"), + 
float("nan"), + float("nan"), ], [ "CL:0000192", @@ -62,6 +68,12 @@ "MmusDv:0000003", "donor_2", "na", + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], ], index=["X", "Y"], @@ -78,6 +90,12 @@ "development_stage_ontology_term_id", "donor_id", "suspension_type", + "genetic_ancestry_African", + "genetic_ancestry_East_Asian", + "genetic_ancestry_European", + "genetic_ancestry_Indigenous_American", + "genetic_ancestry_Oceanian", + "genetic_ancestry_South_Asian", ], ) @@ -144,6 +162,12 @@ "donor_1", "na", 0, + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], [ 2, @@ -161,6 +185,12 @@ "donor_2", "na", 1, + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], ], index=["X", "Y"], @@ -180,6 +210,12 @@ "donor_id", "suspension_type", "in_tissue", + "genetic_ancestry_African", + "genetic_ancestry_East_Asian", + "genetic_ancestry_European", + "genetic_ancestry_Indigenous_American", + "genetic_ancestry_Oceanian", + "genetic_ancestry_South_Asian", ], ) @@ -203,6 +239,12 @@ "HsapDv:0000003", "donor_1", "na", + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], [ "CL:0000192", @@ -217,6 +259,12 @@ "MmusDv:0000003", "donor_2", "na", + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], ], index=["X", "Y"], @@ -233,6 +281,12 @@ "development_stage_ontology_term_id", "donor_id", "suspension_type", + "genetic_ancestry_African", + "genetic_ancestry_East_Asian", + "genetic_ancestry_European", + "genetic_ancestry_Indigenous_American", + "genetic_ancestry_Oceanian", + "genetic_ancestry_South_Asian", ], ) @@ -255,6 +309,12 @@ "HsapDv:0000003", "donor_1", "na", + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], [ "CL:0000192", @@ -269,6 +329,12 @@ "MmusDv:0000003", "donor_2", "na", + float("nan"), + float("nan"), + 
float("nan"), + float("nan"), + float("nan"), + float("nan"), ], ], index=["X", "Y"], @@ -285,6 +351,12 @@ "development_stage_ontology_term_id", "donor_id", "suspension_type", + "genetic_ancestry_African", + "genetic_ancestry_East_Asian", + "genetic_ancestry_European", + "genetic_ancestry_Indigenous_American", + "genetic_ancestry_Oceanian", + "genetic_ancestry_South_Asian", ], ) @@ -493,6 +565,12 @@ "tissue:1", "sre:1", "development_stage:1", + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], [ "cell_type:1", @@ -503,6 +581,12 @@ "tissue:1", "sre:1", "development_stage:1", + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), + float("nan"), ], ], index=["X", "Y"], @@ -515,6 +599,12 @@ "tissue_ontology_term_id", "self_reported_ethnicity_ontology_term_id", "development_stage_ontology_term_id", + "genetic_ancestry_African", + "genetic_ancestry_East_Asian", + "genetic_ancestry_European", + "genetic_ancestry_Indigenous_American", + "genetic_ancestry_Oceanian", + "genetic_ancestry_South_Asian", ], ) diff --git a/cellxgene_schema_cli/tests/fixtures/h5ads/example_valid.h5ad b/cellxgene_schema_cli/tests/fixtures/h5ads/example_valid.h5ad index ec5f0aee2..a1b121bdf 100644 Binary files a/cellxgene_schema_cli/tests/fixtures/h5ads/example_valid.h5ad and b/cellxgene_schema_cli/tests/fixtures/h5ads/example_valid.h5ad differ diff --git a/cellxgene_schema_cli/tests/test_schema_compliance.py b/cellxgene_schema_cli/tests/test_schema_compliance.py index 0d3d5d4bc..f78ad6dac 100644 --- a/cellxgene_schema_cli/tests/test_schema_compliance.py +++ b/cellxgene_schema_cli/tests/test_schema_compliance.py @@ -4,6 +4,7 @@ import tempfile import unittest +from copy import deepcopy import anndata import fixtures.examples_validate as examples @@ -14,11 +15,20 @@ from cellxgene_schema.schema import get_schema_definition from cellxgene_schema.utils import getattr_anndata from cellxgene_schema.validate import ( + 
ASSAY_VISIUM_11M, + ERROR_SUFFIX_IS_SINGLE, + ERROR_SUFFIX_SPATIAL, + ERROR_SUFFIX_VISIUM, + ERROR_SUFFIX_VISIUM_11M, ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE, + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE, + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM, + VISIUM_11MM_AND_IS_SINGLE_TRUE_MATRIX_SIZE, VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE, Validator, ) from cellxgene_schema.write_labels import AnnDataLabelAppender +from fixtures.examples_validate import visium_library_id schema_def = get_schema_definition() @@ -76,7 +86,8 @@ def validator_with_spatial_and_is_single_false(validator) -> Validator: @pytest.fixture def validator_with_visium_assay(validator) -> Validator: validator.adata = examples.adata_visium.copy() - validator.visium_and_is_single_true_matrix_size = 2 + validator.reset(None, None) + return validator @@ -197,6 +208,7 @@ def test_raw_values__invalid_spatial(self, validator_with_visium_assay, invalid_ validator = validator_with_visium_assay validator.adata.raw.X[0, 1] = invalid_value + validator.reset(None, 2) validator.validate_adata() assert validator.errors == [ "ERROR: All non-zero values in raw matrix must be positive integers of type numpy.float32.", @@ -237,7 +249,8 @@ def test_raw_values__contains_zero_row_in_tissue_1(self, validator_with_visium_a Raw Matrix contains a row with all zeros and in_tissue is 1, but no values are in_tissue 0. """ - validator = validator_with_visium_assay + validator: Validator = validator_with_visium_assay + validator.reset(None, 2) validator.adata.obs["in_tissue"] = 1 validator.adata.X[0] = numpy.zeros(validator.adata.var.shape[0], dtype=numpy.float32) validator.adata.raw.X[0] = numpy.zeros(validator.adata.var.shape[0], dtype=numpy.float32) @@ -252,9 +265,10 @@ def test_raw_values__contains_zero_row_in_tissue_1_mixed_in_tissue_values(self, Raw Matrix contains a row with all zeros and in_tissue is 1, and there are also values with in_tissue 0. 
""" - validator = validator_with_visium_assay + validator: Validator = validator_with_visium_assay validator.adata.X[1] = numpy.zeros(validator.adata.var.shape[0], dtype=numpy.float32) validator.adata.raw.X[1] = numpy.zeros(validator.adata.var.shape[0], dtype=numpy.float32) + validator.reset(None, 2) validator.validate_adata() assert validator.errors == [ "ERROR: Each observation with obs['in_tissue'] == 1 must have at least one " @@ -276,6 +290,7 @@ def test_raw_values__contains_all_zero_rows_in_tissue_0(self, validator_with_vis ) validator.adata.raw = validator.adata.copy() validator.adata.raw.var.drop("feature_is_filtered", axis=1, inplace=True) + validator.reset(None, 2) validator.validate_adata() assert validator.errors == [ "ERROR: If obs['in_tissue'] contains at least one value 0, then there must be at least " @@ -294,42 +309,103 @@ def test_raw_values__contains_some_zero_rows_in_tissue_0(self, validator_with_vi validator.adata.obs["cell_type_ontology_term_id"] = "unknown" validator.adata.X[0] = numpy.zeros(validator.adata.var.shape[0], dtype=numpy.float32) validator.adata.raw.X[0] = numpy.zeros(validator.adata.var.shape[0], dtype=numpy.float32) + validator.reset(None, 2) validator.validate_adata() assert validator.errors == [] - def test_raw_values__invalid_visium_and_is_single_true_row_length(self, validator_with_visium_assay): + @pytest.mark.parametrize( + "assay_ontology_term_id, req_matrix_size, image_size", + [ + ("EFO:0022858", VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE), + ( + "EFO:0022860", + VISIUM_11MM_AND_IS_SINGLE_TRUE_MATRIX_SIZE, + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM, + ), + ], + ) + def test_raw_values__invalid_visium_and_is_single_true_row_length( + self, validator_with_visium_assay, assay_ontology_term_id, req_matrix_size, image_size + ): """ Dataset is visium and uns['is_single'] is True, but raw.X is the wrong length. 
""" - validator = validator_with_visium_assay - validator.visium_and_is_single_true_matrix_size = VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE + validator: Validator = validator_with_visium_assay + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + + # hires image size must be present in order to validate the raw. + validator.adata.uns["spatial"][visium_library_id]["images"]["hires"] = numpy.zeros( + (1, image_size, 3), dtype=numpy.uint8 + ) validator.validate_adata() - assert validator.errors == [ - f"ERROR: When {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}, the raw matrix must be the " - "unfiltered feature-barcode matrix 'raw_feature_bc_matrix'. It must have exactly " - f"{validator.visium_and_is_single_true_matrix_size} rows. Raw matrix row count is 2.", - "ERROR: Raw data may be missing: data in 'raw.X' does not meet schema requirements.", - ] + if assay_ontology_term_id == ASSAY_VISIUM_11M: + _errors = [ + f"ERROR: When {ERROR_SUFFIX_VISIUM_11M} and {ERROR_SUFFIX_IS_SINGLE}, the raw matrix must be the " + "unfiltered feature-barcode matrix 'raw_feature_bc_matrix'. It must have exactly " + f"{validator.visium_and_is_single_true_matrix_size} rows. Raw matrix row count is 2.", + "ERROR: Raw data may be missing: data in 'raw.X' does not meet schema requirements.", + ] + else: + _errors = [ + f"ERROR: When {ERROR_SUFFIX_VISIUM} and {ERROR_SUFFIX_IS_SINGLE}, the raw matrix must be the " + "unfiltered feature-barcode matrix 'raw_feature_bc_matrix'. It must have exactly " + f"{validator.visium_and_is_single_true_matrix_size} rows. 
Raw matrix row count is 2.", + "ERROR: Raw data may be missing: data in 'raw.X' does not meet schema requirements.", + ] - def test_raw_values__multiple_invalid_in_tissue_errors(self, validator_with_visium_assay): + assert validator.errors == _errors + + @pytest.mark.parametrize( + "assay_ontology_term_id, req_matrix_size, image_size", + [ + ("EFO:0022858", VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE), + ( + "EFO:0022860", + VISIUM_11MM_AND_IS_SINGLE_TRUE_MATRIX_SIZE, + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM, + ), + ], + ) + def test_raw_values__multiple_invalid_in_tissue_errors( + self, validator_with_visium_assay, assay_ontology_term_id, req_matrix_size, image_size + ): """ Dataset is visium and uns['is_single'] is True, in_tissue has both 0 and 1 values and there are issues validating rows of both in the matrix. """ validator = validator_with_visium_assay - validator.visium_and_is_single_true_matrix_size = VISIUM_AND_IS_SINGLE_TRUE_MATRIX_SIZE + + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + # hires image size must be present in order to validate the raw. + validator._visium_and_is_single_true_matrix_size = None + validator._hires_max_dimension_size = image_size + validator.adata.uns["spatial"][visium_library_id]["images"]["hires"] = numpy.zeros( + (1, image_size, 3), dtype=numpy.uint8 + ) validator.adata.X = numpy.zeros( [validator.adata.obs.shape[0], validator.adata.var.shape[0]], dtype=numpy.float32 ) validator.adata.raw = validator.adata.copy() validator.adata.raw.var.drop("feature_is_filtered", axis=1, inplace=True) validator.validate_adata() - assert validator.errors == [ - f"ERROR: When {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}, the raw matrix must be the " - "unfiltered feature-barcode matrix 'raw_feature_bc_matrix'. It must have exactly " - f"{validator.visium_and_is_single_true_matrix_size} rows. 
Raw matrix row count is 2.", + if assay_ontology_term_id == ASSAY_VISIUM_11M: + assert ( + validator.errors[0] + == f"ERROR: When {ERROR_SUFFIX_VISIUM_11M} and {ERROR_SUFFIX_IS_SINGLE}, the raw matrix must be the " + "unfiltered feature-barcode matrix 'raw_feature_bc_matrix'. It must have exactly " + f"{validator.visium_and_is_single_true_matrix_size} rows. Raw matrix row count is 2." + ) + else: + assert ( + validator.errors[0] + == f"ERROR: When {ERROR_SUFFIX_VISIUM} and {ERROR_SUFFIX_IS_SINGLE}, the raw matrix must be the " + "unfiltered feature-barcode matrix 'raw_feature_bc_matrix'. It must have exactly " + f"{validator.visium_and_is_single_true_matrix_size} rows. Raw matrix row count is 2." + ) + + assert validator.errors[1:] == [ "ERROR: If obs['in_tissue'] contains at least one value 0, then there must be at least " "one row with obs['in_tissue'] == 0 that has a non-zero value in the raw matrix.", "ERROR: Each observation with obs['in_tissue'] == 1 must have at least one " @@ -477,6 +553,28 @@ def test_column_presence_assay(self, validator_with_adata): "to missing dependent column in adata.obs.", ] + @pytest.mark.parametrize( + "assay_ontology_term_id, is_descendant", + [("EFO:0010961", True), ("EFO:0022858", True), ("EFO:0030029", False), ("EFO:0002697", False)], + ) + def test_column_presence_in_tissue(self, validator_with_visium_assay, assay_ontology_term_id, is_descendant): + """ + Spatial assays that are descendants of visium must have a valid "in_tissue" column. 
+ """ + validator: Validator = validator_with_visium_assay + + # reset and test + validator.reset() + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + validator.adata.obs["assay_ontology_term_id"] = validator.adata.obs["assay_ontology_term_id"].astype("category") + validator._validate_spatial_tissue_position("in_tissue", 0, 1) + if is_descendant: + assert validator.errors == [] + else: + assert validator.errors == [ + f"obs['in_tissue'] is only allowed for {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}." + ] + @pytest.mark.parametrize("reserved_column", schema_def["components"]["obs"]["reserved_columns"]) def test_obs_reserved_columns_presence(self, validator_with_adata, reserved_column): """ @@ -539,6 +637,47 @@ def test_assay_ontology_term_id(self, validator_with_adata, assay_ontology_term_ ] assert validator.errors == [self.get_format_error_message(error_message_suffix, error)] + def test_assay_ontology_term_id__as_categorical(self, validator_with_visium_assay): + """ + Formally, assay_ontology_term_id is expected to be a categorical variable of type string. However, it should work for categorical dtypes as well. 
+ """ + validator: Validator = validator_with_visium_assay + + # check encoding as string + validator.reset(None, 2) + validator._check_spatial() + validator._validate_raw() + assert validator.errors == [] + + # force encoding as 'categorical' + validator.reset(None, 2) + validator.adata.obs["assay_ontology_term_id"] = validator.adata.obs["assay_ontology_term_id"].astype("category") + validator._check_spatial() + validator._validate_raw() + assert validator.errors == [] + + @pytest.mark.parametrize( + "assay_ontology_term_id, all_same", + [("EFO:0010961", True), ("EFO:0030062", True), ("EFO:0022860", True), ("EFO:0008995", False)], + ) + def test_assay_ontology_term_id__all_same(self, validator_with_visium_assay, assay_ontology_term_id, all_same): + """ + Spatial assays (descendants of Visium Spatial Gene Expression, or Slide-SeqV2) require all values in the column to be identical. + """ + validator: Validator = validator_with_visium_assay + + # mix values (with otherwise allowed values) + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + validator.adata.obs["assay_ontology_term_id"].iloc[0] = "EFO:0010183" + + # check that mixed values are rejected for spatial assays only + validator._check_spatial_obs() + EXPECTED_ERROR = f"When {ERROR_SUFFIX_SPATIAL}, all observations must contain the same value." 
+ if all_same: + assert EXPECTED_ERROR in validator.errors + else: + assert EXPECTED_ERROR not in validator.errors + def test_cell_type_ontology_term_id_invalid_term(self, validator_with_adata): validator = validator_with_adata validator.adata.obs.loc[validator.adata.obs.index[0], "cell_type_ontology_term_id"] = "EFO:0000001" @@ -1345,13 +1484,15 @@ def test_suspension_type(self, validator, assay, suspension_types): if "na" in suspension_types: invalid_suspension_type = "nucleus" if "nucleus" not in suspension_types else "cell" obs = validator.adata.obs - obs.loc[obs.index[1], "suspension_type"] = invalid_suspension_type - obs.loc[obs.index[1], "assay_ontology_term_id"] = assay + obs["suspension_type"] = invalid_suspension_type + obs["assay_ontology_term_id"] = assay + obs["suspension_type"] = obs["suspension_type"].astype("category") + obs["assay_ontology_term_id"] = obs["assay_ontology_term_id"].astype("category") validator.validate_adata() assert validator.errors == [ f"ERROR: Column 'suspension_type' in dataframe 'obs' contains invalid values " f"'['{invalid_suspension_type}']'. 
Values must be one of {suspension_types} when " - f"'assay_ontology_term_id' is {assay}" + f"'assay_ontology_term_id' is in ['{assay}']" ] @pytest.mark.parametrize( @@ -1378,13 +1519,15 @@ def test_suspension_type_ancestors_inclusive(self, validator_with_adata, assay, if "na" in suspension_types: invalid_suspension_type = "nucleus" if "nucleus" not in suspension_types else "cell" obs["suspension_type"] = obs["suspension_type"].cat.remove_unused_categories() - obs.loc[obs.index[1], "assay_ontology_term_id"] = assay - obs.loc[obs.index[1], "suspension_type"] = invalid_suspension_type + obs["suspension_type"] = invalid_suspension_type + obs["assay_ontology_term_id"] = assay + obs["suspension_type"] = obs["suspension_type"].astype("category") + obs["assay_ontology_term_id"] = obs["assay_ontology_term_id"].astype("category") validator.validate_adata() assert validator.errors == [ f"ERROR: Column 'suspension_type' in dataframe 'obs' contains invalid values " f"'['{invalid_suspension_type}']'. Values must be one of {suspension_types} when " - f"'assay_ontology_term_id' is {assay} or its descendants" + f"'assay_ontology_term_id' is in ['{assay}']" ] def test_suspension_type_with_descendant_term_id_failure(self, validator_with_adata): @@ -1394,14 +1537,15 @@ def test_suspension_type_with_descendant_term_id_failure(self, validator_with_ad """ validator = validator_with_adata obs = validator.adata.obs - obs.loc[obs.index[0], "assay_ontology_term_id"] = "EFO:0022615" # descendant of EFO:0008994 - obs.loc[obs.index[0], "suspension_type"] = "nucleus" - + obs["suspension_type"] = "nucleus" + obs["assay_ontology_term_id"] = "EFO:0022615" # descendant of EFO:0008994 + obs["suspension_type"] = obs["suspension_type"].astype("category") + obs["assay_ontology_term_id"] = obs["assay_ontology_term_id"].astype("category") validator.validate_adata() assert validator.errors == [ "ERROR: Column 'suspension_type' in dataframe 'obs' contains invalid values " "'['nucleus']'. 
Values must be one of ['na'] when " - "'assay_ontology_term_id' is EFO:0008994 or its descendants" + "'assay_ontology_term_id' is in ['EFO:0022615']" ] def test_suspension_type_with_descendant_term_id_success(self, validator_with_adata): @@ -1488,6 +1632,127 @@ def test_nan_values_must_be_rejected(self, validator_with_adata): in validator.errors ) + @pytest.mark.parametrize( + "genetic_ancestry_African, genetic_ancestry_East_Asian, genetic_ancestry_European, " + "genetic_ancestry_Indigenous_American, genetic_ancestry_Oceanian, genetic_ancestry_South_Asian", + [ + (0.0, 0.0, 0.0, 0.0, 0.0, 1.0), + (0.5, 0.5, 0.0, 0.0, 0.0, 0.0), + (0.0, 0.25, 0.25, 0.25, 0.25, 0.0), + (float("nan"), float("nan"), float("nan"), float("nan"), float("nan"), float("nan")), + (numpy.nan, numpy.nan, numpy.nan, numpy.nan, numpy.nan, numpy.nan), + ], + ) + def test_genetic_ancestry__OK( + self, + validator_with_adata, + genetic_ancestry_African, + genetic_ancestry_East_Asian, + genetic_ancestry_European, + genetic_ancestry_Indigenous_American, + genetic_ancestry_Oceanian, + genetic_ancestry_South_Asian, + ): + """ + genetic_ancestry_X fields must all be floats between 0 and 1 and sum to 1 + OR they can all be NaN + """ + validator = validator_with_adata + # Second organism in adata is not homo sapiens + validator.adata.obs["genetic_ancestry_African"] = [genetic_ancestry_African, float("nan")] + validator.adata.obs["genetic_ancestry_East_Asian"] = [genetic_ancestry_East_Asian, float("nan")] + validator.adata.obs["genetic_ancestry_European"] = [genetic_ancestry_European, float("nan")] + validator.adata.obs["genetic_ancestry_Indigenous_American"] = [ + genetic_ancestry_Indigenous_American, + float("nan"), + ] + validator.adata.obs["genetic_ancestry_Oceanian"] = [genetic_ancestry_Oceanian, float("nan")] + validator.adata.obs["genetic_ancestry_South_Asian"] = [genetic_ancestry_South_Asian, float("nan")] + validator.validate_adata() + assert validator.errors == [] + + @pytest.mark.parametrize( + 
"genetic_ancestry_African, genetic_ancestry_East_Asian, genetic_ancestry_European, " + "genetic_ancestry_Indigenous_American, genetic_ancestry_Oceanian, genetic_ancestry_South_Asian", + [ + # Non-float value of "random string" + (0.0, 0.0, 0.0, 1.0, 0.0, "random string"), + # Non-float value of True + (0.0, 0.0, 0.0, 1.0, 0.0, True), + # Non-float value of None + (0.0, 0.0, 0.0, 1.0, 0.0, None), + # Non-float value of numpy True + (0.0, 0.0, 0.0, 1.0, 0.0, numpy.True_), + # Non-float value of numpy NaN + (0.0, numpy.nan, 0.0, 1.0, 0.0, 0.0), + # One value is > 1 + (0.0, 0.0, 1.1, 0.0, 0.0, 0.0), + # One value is < 0.0 + (0.0, 0.0, -0.25, 1.0, 0.25, 0.0), + # Sum is > 1.0 + (0.0, 0.1, 1.0, 0.0, 0.0, 0.0), + # Sum is < 1.0 + (0.0, 0.25, 0.25, 0.25, 0.0, 0.0), + # Only all NaN is valid + (float("nan"), 0.0, 0.0, 0.0, 0.0, 0.0), + # Only all NaN is valid + (numpy.nan, 0.0, 0.0, 0.0, 0.0, 0.0), + ], + ) + def test_genetic_ancestry__invalid( + self, + validator_with_adata, + genetic_ancestry_African, + genetic_ancestry_East_Asian, + genetic_ancestry_European, + genetic_ancestry_Indigenous_American, + genetic_ancestry_Oceanian, + genetic_ancestry_South_Asian, + ): + validator = validator_with_adata + # Second organism in adata is not homo sapiens + validator.adata.obs["genetic_ancestry_African"] = [genetic_ancestry_African, float("nan")] + validator.adata.obs["genetic_ancestry_East_Asian"] = [genetic_ancestry_East_Asian, float("nan")] + validator.adata.obs["genetic_ancestry_European"] = [genetic_ancestry_European, float("nan")] + validator.adata.obs["genetic_ancestry_Indigenous_American"] = [ + genetic_ancestry_Indigenous_American, + float("nan"), + ] + validator.adata.obs["genetic_ancestry_Oceanian"] = [genetic_ancestry_Oceanian, float("nan")] + validator.adata.obs["genetic_ancestry_South_Asian"] = [genetic_ancestry_South_Asian, float("nan")] + validator.validate_adata() + assert len(validator.errors) > 0 + + def test_genetic_ancestry_same_donor_id(self, 
validator_with_adata): + """ + genetic_ancestry_X fields must be the same when the donor id is the same + """ + validator = validator_with_adata + original_donor_id_column = validator.adata.obs["donor_id"].copy() + + # Second row should have identical donor id + genetic ancestry values, so this should pass validation + validator.adata.obs.iloc[1] = validator.adata.obs.iloc[0].values + + validator.validate_adata() + assert validator.errors == [] + + # Update the genetic ancestry values to be different. This should now fail validation + validator.adata.obs["genetic_ancestry_African"] = [1.0, 0.0] + validator.adata.obs["genetic_ancestry_East_Asian"] = [0.0, 1.0] + validator.adata.obs["genetic_ancestry_European"] = [0.0, 0.0] + validator.adata.obs["genetic_ancestry_Indigenous_American"] = [0.0, 0.0] + validator.adata.obs["genetic_ancestry_Oceanian"] = [0.0, 0.0] + validator.adata.obs["genetic_ancestry_South_Asian"] = [0.0, 0.0] + validator.reset(None, 2) + validator.validate_adata() + assert len(validator.errors) == 1 + + # Change the donor id back to two different donor id's. Now, this should pass validation + validator.adata.obs["donor_id"] = original_donor_id_column + validator.reset(None, 2) + validator.validate_adata() + assert validator.errors == [] + class TestVar: """ @@ -1567,6 +1832,7 @@ def test_feature_is_filtered(self, validator_with_adata): X[i, 0] = 0 X[0, 0] = 1 + validator.reset(None, 2) validator.validate_adata() assert validator.errors == [ "ERROR: Some features are 'True' in 'feature_is_filtered' of dataframe 'var', " @@ -1576,6 +1842,7 @@ def test_feature_is_filtered(self, validator_with_adata): # Test that feature_is_filtered is a bool and not a string var["feature_is_filtered"] = "string" + validator.reset(None, 2) validator.validate_adata() assert validator.errors == [ "ERROR: Column 'feature_is_filtered' in dataframe 'var' must be boolean, not 'object'." 
@@ -1652,11 +1919,16 @@ def test_should_warn_for_low_gene_count(self, validator_with_adata): Raise a warning if there are too few genes """ validator = validator_with_adata + # NOTE:[EM] changing the schema def here is stateful and results in unpredictable test results. + # Reset after mutating. + _old_schema = deepcopy(validator.schema_def.copy()) + validator.schema_def["components"]["var"]["warn_if_less_than_rows"] = 100 validator.validate_adata() assert validator.warnings == [ "WARNING: Dataframe 'var' only has 4 rows. Features SHOULD NOT be filtered from expression matrix." ] + validator.schema_def = _old_schema @pytest.mark.parametrize( "df,column", @@ -2141,23 +2413,38 @@ def test_obsm_values_str(self, validator_with_visium_assay, key): @pytest.mark.parametrize("key", ["X_umap", "spatial"]) def test_obsm_values_nan(self, validator_with_visium_assay, key): """ - values in obsm cannot all be NaN + test obsm NaN restrictions for different embedding types. + feature embeddings: X_* cannot be all NaN + spatial embeddings: 'spatial' cannot have any NaNs """ validator = validator_with_visium_assay obsm = validator.adata.obsm - # It's okay if only one value is NaN + + # Check embedding has any NaN obsm[key][0:100, 1] = numpy.nan + validator.reset(None, 2) validator.validate_adata() - assert validator.errors == [] - # It's not okay if all values are NaN + if key != "spatial": + assert validator.errors == [] + else: + assert validator.errors == ["ERROR: adata.obs['spatial] contains at least one NaN value."] + + # Check embedding has all NaNs all_nan = numpy.full(obsm[key].shape, numpy.nan) obsm[key] = all_nan + validator.reset(None, 2) validator.validate_adata() - assert validator.errors == [f"ERROR: adata.obsm['{key}'] contains all NaN values."] + if key != "spatial": + assert validator.errors == [f"ERROR: adata.obsm['{key}'] contains all NaN values."] + else: + assert validator.errors == ["ERROR: adata.obs['spatial] contains at least one NaN value."] def 
test_obsm_values_no_X_embedding__non_spatial_dataset(self, validator_with_adata): - validator = validator_with_adata + """ + X_{suffix} embeddings MUST exist for non-spatial datasets + """ + validator: Validator = validator_with_adata validator.adata.obsm["harmony"] = validator.adata.obsm["X_umap"] validator.adata.uns["default_embedding"] = "harmony" del validator.adata.obsm["X_umap"] @@ -2167,19 +2454,32 @@ def test_obsm_values_no_X_embedding__non_spatial_dataset(self, validator_with_ad ] assert validator.is_spatial is False assert validator.warnings == [ - "WARNING: Dataframe 'var' only has 4 rows. Features SHOULD NOT be filtered from expression matrix.", "WARNING: Embedding key in 'adata.obsm' harmony is not 'spatial' nor does it start with 'X_'. " "Thus, it will not be available in Explorer", "WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.", ] - def test_obsm_values_no_X_embedding__visium_dataset(self, validator_with_visium_assay): - validator = validator_with_visium_assay + @pytest.mark.parametrize("assay_ontology_term_id", ["EFO:0010961", "EFO:0030062", "EFO:0022860"]) + def test_obsm_values_no_X_embedding__visium_dataset(self, validator_with_visium_assay, assay_ontology_term_id): + """ + X_{suffix} embeddings MAY exist for spatial datasets + """ + validator: Validator = validator_with_visium_assay validator.adata.uns["default_embedding"] = "spatial" - del validator.adata.obsm["X_umap"] - validator.validate_adata() + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + + # may have X_{suffix} embedding + validator._validate_obsm() + assert validator.is_spatial is True assert validator.errors == [] + validator.reset() + + # may also have no X_{suffix} embedding + del validator.adata.obsm["X_umap"] + validator._validate_obsm() assert validator.is_spatial is True + assert validator.errors == [] + validator.reset() def test_obsm_values_no_X_embedding__slide_seq_v2_dataset(self, 
validator_with_slide_seq_v2_assay): validator = validator_with_slide_seq_v2_assay @@ -2217,7 +2517,6 @@ def test_obsm_values_warn_start_with_X(self, validator_with_adata): validator.adata.obsm["harmony"] = pd.DataFrame(validator.adata.obsm["X_umap"], index=validator.adata.obs_names) validator.validate_adata() assert validator.warnings == [ - "WARNING: Dataframe 'var' only has 4 rows. Features SHOULD NOT be filtered from expression matrix.", "WARNING: Embedding key in 'adata.obsm' harmony is not 'spatial' nor does it start with 'X_'. " "Thus, it will not be available in Explorer", "WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.", @@ -2251,7 +2550,6 @@ def test_obsm_values_key_start_with_number(self, validator_with_adata): "'pandas.core.frame.DataFrame'>').", ] assert validator.warnings == [ - "WARNING: Dataframe 'var' only has 4 rows. Features SHOULD NOT be filtered from expression matrix.", "WARNING: Embedding key in 'adata.obsm' 3D is not 'spatial' nor does it start with 'X_'. " "Thus, it will not be available in Explorer", "WARNING: Validation of raw layer was not performed due to current errors, try again after fixing current errors.", @@ -2282,6 +2580,7 @@ def test_obsm_key_name_whitespace(self, validator_with_adata): del obsm["X_ umap"] obsm["u m a p"] = obsm["X_umap"] + validator.reset(None, 2) validator.validate_adata() assert validator.errors == [ "ERROR: Embedding key in 'adata.obsm' u m a p does not match the regex pattern ^[a-zA-Z][a-zA-Z0-9_.-]*$." 
diff --git a/cellxgene_schema_cli/tests/test_validate.py b/cellxgene_schema_cli/tests/test_validate.py index 819cde432..cd7652bfd 100644 --- a/cellxgene_schema_cli/tests/test_validate.py +++ b/cellxgene_schema_cli/tests/test_validate.py @@ -1,5 +1,6 @@ import hashlib import os +import re import tempfile from typing import Union from unittest import mock @@ -15,6 +16,8 @@ ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_FORBIDDEN, ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_IN_TISSUE_0, ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_REQUIRED, + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE, + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM, Validator, validate, ) @@ -295,7 +298,7 @@ def test__validate_with_h5ad_valid_and_labels(self): with tempfile.TemporaryDirectory() as temp_dir: labels_path = "/".join([temp_dir, "labels.h5ad"]) - success, errors = validate(h5ad_valid, labels_path) + success, errors, _ = validate(h5ad_valid, labels_path) import anndata as ad @@ -310,7 +313,7 @@ def test__validate_with_h5ad_valid_and_labels(self): assert original_hash != expected_hash, "Writing labels did not change the dataset from the original." 
def test__validate_with_h5ad_valid_and_without_labels(self): - success, errors = validate(h5ad_valid) + success, errors, _ = validate(h5ad_valid) assert success assert not errors @@ -319,25 +322,50 @@ def test__validate_with_h5ad_invalid_and_with_labels(self): with tempfile.TemporaryDirectory() as temp_dir: labels_path = "/".join([temp_dir, "labels.h5ad"]) - success, errors = validate(h5ad_invalid, labels_path) + success, errors, _ = validate(h5ad_invalid, labels_path) assert not success assert errors assert not os.path.exists(labels_path) def test__validate_with_h5ad_invalid_and_without_labels(self): - success, errors = validate(h5ad_invalid) + success, errors, _ = validate(h5ad_invalid) assert not success assert errors class TestCheckSpatial: + @pytest.mark.parametrize( + "assay_ontology_term_id, expected_is_visium", + [ + # Parent term for Visium Spatial Gene Expression. This term and all its descendants are Visium + ("EFO:0010961", True), + # Visium Spatial Gene Expression V1 + ("EFO:0022857", True), + # Visium CytAssist Spatial Gene Expression V2 + ("EFO:0022858", True), + # Visium CytAssist Spatial Gene Expression, 11mm + ("EFO:0022860", True), + # Visium CytAssist Spatial Gene Expression, 6.5mm + ("EFO:0022859", True), + # Random other EFO term + ("EFO:0003740", False), + ], + ) + def test__is_visium_descendant(self, assay_ontology_term_id, expected_is_visium): + validator: Validator = Validator() + validator._set_schema_def() + validator.adata = adata_visium.copy() + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + + assert validator._is_visium_including_descendants() == expected_is_visium + def test__validate_spatial_visium_ok(self): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() - validator.visium_and_is_single_true_matrix_size = 2 + validator._visium_and_is_single_true_matrix_size = 2 # Confirm spatial is valid. 
validator.validate_adata() assert not validator.errors @@ -357,7 +385,7 @@ def test__validate_spatial_visium_dense_matrix_ok(self): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() - validator.visium_and_is_single_true_matrix_size = 2 + validator._visium_and_is_single_true_matrix_size = 2 validator.adata.X = validator.adata.X.toarray() validator.adata.raw = validator.adata.copy() validator.adata.raw.var.drop("feature_is_filtered", axis=1, inplace=True) @@ -398,10 +426,9 @@ def test__validate_spatial_type_error(self, spatial): # Confirm key type dict is required. validator.validate_adata() - assert validator.errors assert ( - "A dict in uns['spatial'] is required for obs['assay_ontology_term_id'] values 'EFO:0010961' (Visium Spatial Gene Expression) and 'EFO:0030062' (Slide-seqV2)." - in validator.errors[0] + validator.errors[0] + == "ERROR: A dict in uns['spatial'] is required when obs['assay_ontology_term_id'] is either a descendant of 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2)." ) def test__validate_spatial_is_single_false_ok(self): @@ -423,25 +450,42 @@ def test__validate_spatial_forbidden_if_not_visium_or_slide_seqv2(self): # Confirm spatial is not allowed for 10x 3' v2. validator._check_spatial_uns() - assert len(validator.errors) == 1 - assert ( - "uns['spatial'] is only allowed for obs['assay_ontology_term_id'] values " - "'EFO:0010961' (Visium Spatial Gene Expression) and 'EFO:0030062' (Slide-seqV2)." 
in validator.errors[0] - ) + assert validator.errors == [ + "uns['spatial'] is only allowed when obs['assay_ontology_term_id'] is either " + "a descendant of 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2)" + ] - def test__validate_spatial_required_if_visium(self): + @pytest.mark.parametrize( + "assay_ontology_term_id, is_descendant", + [("EFO:0010961", True), ("EFO:0022858", True), ("EFO:0030029", False), ("EFO:0002697", False)], + ) + def test__validate_spatial_required_if_visium(self, assay_ontology_term_id, is_descendant): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() - validator.adata.uns = good_uns.copy() + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id - # Confirm spatial is required for Visium. - validator._check_spatial_uns() - assert len(validator.errors) == 1 - assert ( - "A dict in uns['spatial'] is required for obs['assay_ontology_term_id'] values " - "'EFO:0010961' (Visium Spatial Gene Expression) and 'EFO:0030062' (Slide-seqV2)." in validator.errors[0] - ) + if is_descendant: + # check pass if 'spatial' included + validator.adata.uns = good_uns_with_visium_spatial.copy() + validator._check_spatial_uns() + assert len(validator.errors) == 0 + validator.reset() + + # check fail if 'spatial' not included + validator.adata.uns = good_uns.copy() + validator._check_spatial_uns() + assert validator.errors == [ + "A dict in uns['spatial'] is required when obs['assay_ontology_term_id'] is " + "either a descendant of 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2)." 
+ ] + validator.reset() + else: + # check fail if 'spatial' included + validator.adata.uns = good_uns_with_visium_spatial.copy() + validator._check_spatial_uns() + assert len(validator.errors) == 1 + validator.reset() def test__validate_spatial_required_if_slide_seqV2(self): validator: Validator = Validator() @@ -451,11 +495,9 @@ def test__validate_spatial_required_if_slide_seqV2(self): # Confirm spatial is required for Slide-seqV2. validator._check_spatial_uns() - assert len(validator.errors) == 1 - assert ( - "A dict in uns['spatial'] is required for obs['assay_ontology_term_id'] values " - "'EFO:0010961' (Visium Spatial Gene Expression) and 'EFO:0030062' (Slide-seqV2)." in validator.errors[0] - ) + assert validator.errors == [ + "A dict in uns['spatial'] is required when obs['assay_ontology_term_id'] is either a descendant of 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2)." + ] def test__validate_spatial_allowed_keys_error(self): validator: Validator = Validator() @@ -471,16 +513,26 @@ def test__validate_spatial_allowed_keys_error(self): "More than two top-level keys detected:" in validator.errors[0] ) - def test__validate_is_single_required_visium_error(self): + @pytest.mark.parametrize( + "assay_ontology_term_id, is_descendant", + [("EFO:0010961", True), ("EFO:0022858", True), ("EFO:0030029", False), ("EFO:0002697", False)], + ) + def test__validate_is_single_required_visium_error(self, assay_ontology_term_id, is_descendant): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id validator.adata.uns["spatial"].pop("is_single") - - # Confirm is_single is identified as required. validator._check_spatial_uns() - assert validator.errors - assert "uns['spatial'] must contain the key 'is_single'." 
in validator.errors[0] + + if is_descendant: + # if spatial, MUST specify `is_single` + assert "uns['spatial'] must contain the key 'is_single'." in validator.errors[0] + else: + # if not spatial, MUST NOT specify `is_single` + assert validator.errors == [ + "uns['spatial'] is only allowed when obs['assay_ontology_term_id'] is either a descendant of 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2)" + ] def test__validate_is_single_required_slide_seqV2_error(self): validator: Validator = Validator() @@ -535,19 +587,36 @@ def test__validate_library_id_forbidden_if_visium_or_is_single_false(self): assert len(validator.errors) == 1 assert f"uns['spatial'][library_id] {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_FORBIDDEN}." in validator.errors[0] - def test__validate_library_id_required_if_visium(self): + @pytest.mark.parametrize( + "assay_ontology_term_id, is_descendant", + [("EFO:0010961", True), ("EFO:0022858", True), ("EFO:0030029", False), ("EFO:0002697", False)], + ) + def test__validate_library_id_required_if_visium(self, assay_ontology_term_id, is_descendant): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() - validator.adata.uns["spatial"].pop(visium_library_id) - # Confirm library_id is identified as required. - validator._check_spatial_uns() - assert validator.errors - assert ( - f"uns['spatial'] must contain at least one key representing the library_id when {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}." 
- in validator.errors[0] - ) + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + if is_descendant: + # if spatial, `library_id` must exist + validator._check_spatial_uns() + assert len(validator.errors) == 0 + validator.reset() + + # if spatial, but missing from `uns` + validator.adata.uns["spatial"].pop(visium_library_id) + validator._check_spatial_uns() + assert validator.errors == [ + f"uns['spatial'] must contain at least one key representing the library_id when {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE}." + ] + else: + # if not spatial, MUST NOT define `library_id` + validator.adata.uns["spatial"][visium_library_id] = {"images": []} + validator._check_spatial_uns() + # Report the most general top level error + assert validator.errors == [ + "uns['spatial'] is only allowed when obs['assay_ontology_term_id'] is either a descendant of 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2)" + ] @pytest.mark.parametrize("library_id", [None, "invalid", 1, 1.0, True]) def test__validate_library_id_type_error(self, library_id): @@ -585,7 +654,11 @@ def test__validate_images_required_error(self): assert validator.errors assert "uns['spatial'][library_id] must contain the key 'images'." 
in validator.errors[0] - def test__validate_images_allowed_keys_error(self): + @pytest.mark.parametrize( + "assay_ontology_term_id, is_descendant", + [("EFO:0010961", True), ("EFO:0022858", True), ("EFO:0030029", False), ("EFO:0002697", False)], + ) + def test__validate_images_allowed_keys_error(self, assay_ontology_term_id, is_descendant): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() @@ -705,34 +778,84 @@ def test__validate_images_image_is_shape_error(self, image_name): "for example) or 4 (RGBA color model for example) for its last dimension" in validator.errors[0] ) - def test__validate_images_hires_max_dimension_greater_than_error(self): + @pytest.mark.parametrize( + "assay_ontology_term_id, hi_res_size, image_max", + [ + ("EFO:0022858", 2001, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE), + ("EFO:0022860", 4001, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM), + ], + ) + def test__validate_images_hires_max_dimension_greater_than_error( + self, assay_ontology_term_id, hi_res_size, image_max + ): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() - validator.adata.uns["spatial"][visium_library_id]["images"]["hires"] = np.zeros((1, 2001, 3), dtype=np.uint8) + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + validator.adata.uns["spatial"][visium_library_id]["images"]["hires"] = np.zeros( + (1, hi_res_size, 3), dtype=np.uint8 + ) # Confirm hires is identified as invalid. validator._check_spatial_uns() - assert validator.errors - assert ( - "The largest dimension of uns['spatial'][library_id]['images']['hires'] must be 2000 pixels" - in validator.errors[0] - ) + assert validator.errors == [ + f"The largest dimension of uns['spatial'][library_id]['images']['hires'] must be {image_max} pixels, it has a largest dimension of {hi_res_size} pixels." 
+ ] - def test__validate_images_hires_max_dimension_less_than_error(self): + @pytest.mark.parametrize( + "assay_ontology_term_id, hi_res_size, size_requirement", + [ + ("EFO:0022858", SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE), + ("EFO:0022858", SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE), + ("EFO:0022860", SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM), + ( + "EFO:0022860", + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM, + SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM, + ), + ], + ) + def test__validate_images_hires_max_dimension(self, assay_ontology_term_id, hi_res_size, size_requirement): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() - validator.adata.uns["spatial"][visium_library_id]["images"]["hires"] = np.zeros((1, 1999, 3), dtype=np.uint8) + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + validator.adata.uns["spatial"][visium_library_id]["images"]["hires"] = np.zeros( + (1, hi_res_size, 3), dtype=np.uint8 + ) # Confirm hires is identified as invalid. + validator.reset() validator._check_spatial_uns() - assert validator.errors - assert ( - "The largest dimension of uns['spatial'][library_id]['images']['hires'] must be 2000 pixels" - in validator.errors[0] + if hi_res_size == size_requirement: + assert validator.errors == [] + else: + assert validator.errors == [ + f"The largest dimension of uns['spatial'][library_id]['images']['hires'] must be {size_requirement} pixels, it has a largest dimension of {hi_res_size} pixels." 
+ ] + + @pytest.mark.parametrize( + "assay_ontology_term_id, hi_res_size, image_max", + [ + ("EFO:0022858", 1999, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE), + ("EFO:0022860", 3999, SPATIAL_HIRES_IMAGE_MAX_DIMENSION_SIZE_VISIUM_11MM), + ], + ) + def test__validate_images_hires_max_dimension_less_than_error(self, assay_ontology_term_id, hi_res_size, image_max): + validator: Validator = Validator() + validator._set_schema_def() + validator.adata = adata_visium.copy() + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id + validator.adata.uns["spatial"][visium_library_id]["images"]["hires"] = np.zeros( + (1, hi_res_size, 3), dtype=np.uint8 ) + # Confirm hires is identified as invalid. + validator._check_spatial_uns() + assert validator.errors == [ + f"The largest dimension of uns['spatial'][library_id]['images']['hires'] must be {image_max} pixels, it has a largest dimension of {hi_res_size} pixels." + ] + def test__validate_scalefactors_required_error(self): validator: Validator = Validator() validator._set_schema_def() @@ -836,8 +959,8 @@ def test__validate_assay_type_ontology_term_id_not_unique_error(self): validator._validate_spatial_assay_ontology_term_id() assert validator.errors assert ( - "When obs['assay_ontology_term_id'] is either 'EFO:0010961' (Visium Spatial Gene Expression) or " - "'EFO:0030062' (Slide-seqV2), all observations must contain the same value." + "When obs['assay_ontology_term_id'] is either a descendant" + " of 'EFO:0010961' (Visium Spatial Gene Expression) or 'EFO:0030062' (Slide-seqV2), all observations must contain the same value." 
) in validator.errors[0] def test__validate_assay_type_ontology_term_id_not_unique_ok(self, valid_adata): @@ -889,21 +1012,32 @@ def test__validate_tissue_position_required(self, tissue_position_name): validator.adata = adata_visium.copy() validator.adata.obs.pop(tissue_position_name) + # check visium + validator.adata.obs["assay_ontology_term_id"] = "EFO:0010961" validator._check_spatial_obs() assert validator.errors assert ( f"obs['{tissue_position_name}'] {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_REQUIRED}." in validator.errors[0] ) + validator.reset() - @pytest.mark.parametrize("assay_ontology_term_id", ["EFO:0010961", "EFO:0030062"]) + # check visium descendant + validator.adata.obs["assay_ontology_term_id"] = "EFO:0022860" + validator._check_spatial_obs() + assert validator.errors + assert ( + f"obs['{tissue_position_name}'] {ERROR_SUFFIX_VISIUM_AND_IS_SINGLE_TRUE_REQUIRED}." in validator.errors[0] + ) + validator.reset() + + @pytest.mark.parametrize("assay_ontology_term_id", ["EFO:0010961", "EFO:0030062", "EFO:0022860"]) def test__validate_tissue_position_not_required(self, assay_ontology_term_id): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_slide_seqv2.copy() validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id - validator.adata.uns["spatial"]["is_single"] = False + validator.adata.uns["spatial"]["is_single"] = False # setting to false removes the requirement validator.adata.obs["is_primary_data"] = False - validator._check_spatial_obs() assert not validator.errors @@ -919,72 +1053,102 @@ def test__validate_tissue_position_int_error(self, tissue_position_name): assert validator.errors assert f"obs['{tissue_position_name}'] must be of int type" in validator.errors[0] - @pytest.mark.parametrize( - "tissue_position_name, min, error_message_token", - [ - ("array_col", 0, "between 0 and 127"), - ("array_row", 0, "between 0 and 77"), - ("in_tissue", 0, "0 or 1"), - ], - ) - def 
test__validate_tissue_position_int_min_error(self, tissue_position_name, min, error_message_token): + @pytest.mark.parametrize("assay_ontology_term_id", ["EFO:0010961", "EFO:0022860", "EFO:0022859"]) + @pytest.mark.parametrize("tissue_position_name, min", [("array_col", 0), ("array_row", 0), ("in_tissue", 0)]) + def test__validate_tissue_position_int_min_error(self, assay_ontology_term_id, tissue_position_name, min): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id validator.adata.obs[tissue_position_name] = min - 1 # Confirm tissue_position is identified as invalid. validator._check_spatial_obs() - assert validator.errors - assert f"obs['{tissue_position_name}'] must be {error_message_token}" in validator.errors[0] + assert ( + re.match(rf"^obs\['{tissue_position_name}'\] must be (between )?{min} (and|or) [0-9]+", validator.errors[0]) + is not None + ) @pytest.mark.parametrize( - "tissue_position_name, max, error_message_token", + "assay_ontology_term_id, tissue_position_name, tissue_position_max", [ - ("array_col", 127, "between 0 and 127"), - ("array_row", 77, "between 0 and 77"), - ("in_tissue", 1, "0 or 1"), + ("EFO:0010961", "array_col", 127), + ("EFO:0010961", "array_row", 77), + ("EFO:0022860", "array_col", 223), + ("EFO:0022860", "array_row", 127), + ("EFO:0022859", "array_col", 127), + ("EFO:0022859", "array_row", 77), + ("EFO:0022859", "in_tissue", 1), ], ) - def test__validate_tissue_position_int_max_error(self, tissue_position_name, max, error_message_token): + def test__validate_tissue_position_int_max_error( + self, assay_ontology_term_id, tissue_position_name, tissue_position_max + ): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() - validator.adata.obs[tissue_position_name] = max + 1 + validator.adata.obs["assay_ontology_term_id"] = assay_ontology_term_id +
validator.adata.obs[tissue_position_name] = tissue_position_max + 1 # Confirm tissue_position is identified as invalid. validator._check_spatial_obs() - assert validator.errors - assert f"obs['{tissue_position_name}'] must be {error_message_token}" in validator.errors[0] + assert ( + re.match( + rf"^obs\['{tissue_position_name}'\] must be (between )?[0-9]+ (and|or) {tissue_position_max}", + validator.errors[0], + ) + is not None + ) @pytest.mark.parametrize( - "cell_type_ontology_term_id, in_tissue", - [("unknown", 0), (["unknown", "CL:0000066"], [0, 1]), ("CL:0000066", 1)], + "cell_type_ontology_term_id, in_tissue, assay_ontology_term_id", + [ + # MUST be unknown when in_tissue = 0 and assay_ontology_term_id = Visium Spatial Gene Expression + ("unknown", 0, "EFO:0010961"), + # MUST be unknown when in_tissue = 0 and assay_ontology_term_id = Visium CytAssist Spatial Gene Expression, 11mm + ("unknown", 0, "EFO:0022860"), + # MUST be unknown when in_tissue = 0 and assay_ontology_term_id = Visium Spatial Gene Expression V1 + # valid CL term is ok when in_tissue = 1 and assay_ontology_term_id = Visium CytAssist Spatial Gene Expression, 11mm + (["unknown", "CL:0000066"], [0, 1], ["EFO:0022857", "EFO:0022860"]), + # normal CL term for in_tissue = 1 and assay_ontology_term_id = 10x 3' v2 + ("CL:0000066", 1, "EFO:0009899"), + ], ) - def test__validate_cell_type_ontology_term_id_ok(self, cell_type_ontology_term_id, in_tissue): + def test__validate_cell_type_ontology_term_id_ok( + self, cell_type_ontology_term_id, in_tissue, assay_ontology_term_id + ): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() validator.adata.obs.cell_type_ontology_term_id = cell_type_ontology_term_id validator.adata.obs.in_tissue = in_tissue + validator.adata.obs.assay_ontology_term_id = assay_ontology_term_id # Confirm cell type is valid.
validator._validate_spatial_cell_type_ontology_term_id() assert not validator.errors @pytest.mark.parametrize( - "cell_type_ontology_term_id, in_tissue", + "cell_type_ontology_term_id, in_tissue, assay_ontology_term_id", [ - ("CL:0000066", 0), - (["CL:0000066", "unknown"], [0, 1]), + # MUST be unknown when in_tissue = 0 and assay_ontology_term_id = Visium Spatial Gene Expression + ("CL:0000066", 0, "EFO:0010961"), + (["CL:0000066", "unknown"], [0, 1], ["EFO:0010961", "EFO:0010961"]), + # MUST be unknown when in_tissue = 0 and assay_ontology_term_id = Visium CytAssist Spatial Gene Expression, 11mm + ("CL:0000066", 0, "EFO:0022860"), + # MUST be unknown when in_tissue = 0 and assay_ontology_term_id = Visium Spatial Gene Expression V1 + ("CL:0000066", 0, "EFO:0022857"), ], ) - def test__validate_cell_type_ontology_term_id_error(self, cell_type_ontology_term_id, in_tissue): + def test__validate_cell_type_ontology_term_id_error( + self, cell_type_ontology_term_id, in_tissue, assay_ontology_term_id + ): validator: Validator = Validator() validator._set_schema_def() validator.adata = adata_visium.copy() validator.adata.obs.cell_type_ontology_term_id = cell_type_ontology_term_id validator.adata.obs.in_tissue = in_tissue + validator.adata.obs.assay_ontology_term_id = assay_ontology_term_id # Confirm errors. validator._validate_spatial_cell_type_ontology_term_id() @@ -994,6 +1158,18 @@ def test__validate_cell_type_ontology_term_id_error(self, cell_type_ontology_ter in validator.errors[0] ) + def test__validate_embeddings_non_nans(self): + validator: Validator = Validator() + validator._set_schema_def() + validator.adata = adata_visium.copy() + validator._visium_and_is_single_true_matrix_size = 2 + + # invalidate spatial embeddings with NaN value + validator.adata.obsm["spatial"][0, 1] = np.nan + # Confirm the NaN in spatial is identified as invalid.
+ validator.validate_adata() + assert validator.errors == ["ERROR: adata.obs['spatial] contains at least one NaN value."] + class TestValidatorValidateDataFrame: @pytest.mark.parametrize("_type", [np.int64, np.int32, int, np.float64, np.float32, float, str]) diff --git a/codecov.yaml b/codecov.yaml new file mode 100644 index 000000000..9dbaf5d10 --- /dev/null +++ b/codecov.yaml @@ -0,0 +1,27 @@ +comment: + layout: "header, diff, components" + +component_management: + default_rules: + statuses: + - type: project + target: auto + branches: + - "!main" + individual_components: + - component_id: module_cellxgene_schema_cli + name: cellxgene_schema_cli + paths: + - cellxgene_schema_cli/** + - component_id: module_migration_assistant + name: migration_assistant + paths: + - scripts/migration_assistant/** + - component_id: module_schema_bump_dry_run_genes + name: schema_bump_dry_run_genes + paths: + - scripts/schema_bump_dry_run_genes/** + - component_id: module_schema_bump_dry_run_ontologies + name: schema_bump_dry_run_ontologies + paths: + - scripts/schema_bump_dry_run_ontologies/** diff --git a/pyproject.toml b/pyproject.toml index 1dd902b4d..01f9c970c 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -41,3 +41,8 @@ show_error_codes = true ignore_missing_imports = true warn_unreachable = true warn_unused_configs = true + +[tool.pytest.ini_options] +pythonpath = [ + "cellxgene_schema_cli" +] \ No newline at end of file diff --git a/schema/drafts/5.2.1-experimental.md b/schema/drafts/5.2.1-experimental.md index 4275dfe6b..0190c8843 100644 --- a/schema/drafts/5.2.1-experimental.md +++ b/schema/drafts/5.2.1-experimental.md @@ -8,7 +8,7 @@ Version: 5.2.1-experimental The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED" "MAY", and "OPTIONAL" in this document are to be interpreted as described in [BCP 14](https://tools.ietf.org/html/bcp14), [RFC2119](https://www.rfc-editor.org/rfc/rfc2119.txt), and 
[RFC8174](https://www.rfc-editor.org/rfc/rfc8174.txt) when, and only when, they appear in all capitals, as shown here. -This draft is limited to **additions** or **modifications** to [schema 5.2.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md). If a 5.2.0 reference does not appear in this document, then no schema change is required. The following **temporary** constraints for *Danio rerio* and *Drosophila melanogaster* are specified: +This draft is limited to **additions** or **modifications** to [schema 5.2.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.2.0/schema.md). If a 5.2.0 reference does not appear in this document, then no schema change is required. The following **temporary** constraints are specified: * The `organism_ontology_term_id` MUST be the same for all observations. * The `tissue_type` MUST be `'tissue'` for all observations. @@ -24,6 +24,8 @@ The following ontology dependencies are *pinned* for this version of the schema. | Ontology | OBO Prefix | Release | Download | |:--|:--|:--|:--| +| [C. elegans Development Ontology] | WBls | [2024-09-26 Wormbase WS295](https://github.com/obophenotype/c-elegans-development-ontology/blob/vWS295) | [wbls.owl] | +| [C. elegans Gross Anatomy Ontology] | WBbt | [2024-09-24 Wormbase WS295](https://github.com/obophenotype/c-elegans-gross-anatomy-ontology/blob/v2024-09-24) | [wbbt.owl] | | [Cell Ontology] | CL | [2024-08-16] | [cl.owl]| | [Drosophila Anatomy Ontology] | FBbt | [2024-08-08](https://github.com/FlyBase/drosophila-anatomy-developmental-ontology/releases/tag/v2024-08-08) | [fbbt.owl] | | [Drosophila Development Ontology] | FBdv | [2024-08-07](https://github.com/FlyBase/drosophila-developmental-ontology/releases/tag/v2024-08-07) | [fbdv.owl] | @@ -38,6 +40,11 @@ The following ontology dependencies are *pinned* for this version of the schema. | [Zebrafish Anatomy Ontology] | ZFA
ZFS | [2022-12-09] | [zfa.owl] | | | | | | +[C. elegans Development Ontology]: https://obofoundry.org/ontology/wbls.html +[wbls.owl]: https://github.com/obophenotype/c-elegans-development-ontology/blob/vWS295/wbls.owl +[C. elegans Gross Anatomy Ontology]: https://obofoundry.org/ontology/wbbt.html + +[wbbt.owl]: https://github.com/obophenotype/c-elegans-gross-anatomy-ontology/blob/v2024-09-24/wbbt.owl [Cell Ontology]: http://obofoundry.org/ontology/cl.html [2024-08-16]: https://github.com/obophenotype/cell-ontology/releases/tag/v2024-08-16 [cl.owl]: https://github.com/obophenotype/cell-ontology/releases/download/v2024-08-16/cl.owl @@ -97,8 +104,9 @@ The following gene annotation dependencies are *pinned* for this version of the | NCBITaxon:9606
for Homo sapiens | [GENCODE (Human)] | Human reference GRCh38.p14
(GENCODE v44/Ensembl 110) | [gencode.v44.primary_assembly.annotation.gtf] | | NCBITaxon:10090
for Mus musculus | [GENCODE (Mouse)] | Mouse reference GRCm39
(GENCODE vM33/Ensembl 110) | [gencode.vM33.primary_assembly.annotation.gtf] | | NCBITaxon:2697049
for SARS-CoV-2 | [ENSEMBL (COVID-19)] | SARS-CoV-2 reference (ENSEMBL assembly: ASM985889v3) | [Sars\_cov\_2.ASM985889v3.101.gtf] | -| NCBITaxon:7955
for Danio rerio | [ENSEMBL (Zebrafish)] | GRCz11.112 (Ensembl 112) | [Danio_rerio.GRCz11.112.gtf] | -| "NCBITaxon:7227"
for Drosophila melanogaster| [ENSEMBL (Fruit fly)] | BDGP6.46 (Ensembl 112) | [Drosophila_melanogaster.BDGP6.46.112.gtf] | +| "NCBITaxon:6239"
for Caenorhabditis elegans | [ENSEMBL (Caenorhabditis elegans)] | WBcel235 (GCA_000002985.3)
Ensembl 113 | [Caenorhabditis_elegans.WBcel235.113.gtf] | +| NCBITaxon:7955
for Danio rerio | [ENSEMBL (Zebrafish)] | GRCz11 (GCA_000002035.4)
Ensembl 113 | [Danio_rerio.GRCz11.113.gtf] | +| "NCBITaxon:7227"
for Drosophila melanogaster| [ENSEMBL (Fruit fly)] | BDGP6.46 (GCA_000001215.4)
Ensembl 113 | [Drosophila_melanogaster.BDGP6.46.113.gtf] | | | [ThermoFisher ERCC Spike-Ins] | ThermoFisher ERCC RNA Spike-In Control Mixes (Cat # 4456740, 4456739) | [cms_095047.txt] | [RNA Spike-In Control Mixes]: https://www.thermofisher.com/document-connect/document-connect.html?url=https%3A%2F%2Fassets.thermofisher.com%2FTFS-Assets%2FLSG%2Fmanuals%2Fcms_086340.pdf&title=VXNlciBHdWlkZTogRVJDQyBSTkEgU3Bpa2UtSW4gQ29udHJvbCBNaXhlcyAoRW5nbGlzaCAp @@ -112,11 +120,14 @@ The following gene annotation dependencies are *pinned* for this version of the [ENSEMBL (COVID-19)]: https://covid-19.ensembl.org/index.html [Sars\_cov\_2.ASM985889v3.101.gtf]: https://ftp.ensemblgenomes.org/pub/viruses/gtf/sars_cov_2/Sars_cov_2.ASM985889v3.101.gtf.gz +[ENSEMBL (Caenorhabditis elegans)]: https://useast.ensembl.org/Caenorhabditis_elegans/Info/Index +[Caenorhabditis_elegans.WBcel235.113.gtf]: https://ftp.ensembl.org/pub/release-113/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.113.gtf.gz + [ENSEMBL (Zebrafish)]: https://useast.ensembl.org/Danio_rerio/Info/Index -[Danio_rerio.GRCz11.112.gtf]: https://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz +[Danio_rerio.GRCz11.113.gtf]: https://ftp.ensembl.org/pub/release-113/gtf/danio_rerio/Danio_rerio.GRCz11.113.gtf.gz [ENSEMBL (Fruit fly)]: https://www.ensembl.org/Drosophila_melanogaster/Info/Index -[Drosophila_melanogaster.BDGP6.46.112.gtf]: https://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz +[Drosophila_melanogaster.BDGP6.46.113.gtf]: https://ftp.ensembl.org/pub/release-113/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.113.gtf.gz [ThermoFisher ERCC Spike-Ins]: https://www.thermofisher.com/order/catalog/product/4456740#/4456740 [cms_095047.txt]: https://assets.thermofisher.com/TFS-Assets/LSG/manuals/cms_095047.txt @@ -128,27 +139,57 @@ The following gene annotation dependencies are *pinned* for this version of the ### 
development_stage_ontology_term_id - - - - - - - - - - - + + + + + + + + + + +
Key | development_stage_ontology_term_id
Annotator | Curator MUST annotate.
Value - categorical with str categories. If unavailable, this MUST be "unknown".

- If organism_ontolology_term_id is "NCBITaxon:7955" for Danio rerio, then this MUST be the most accurate descendant of ZFS:0100000 for zebrafish stage and MUST NOT be ZFS:0000000 for Unknown.

If organism_ontolology_term_id is "NCBITaxon:7227" for Drosophila melanogaster, then this MUST be the most accurate FBdv term. -

Otherwise, for all other organisms this MUST be the most accurate descendant of UBERON:0000105 for life cycle stage, excluding UBERON:0000071 for death stage. -
Key | development_stage_ontology_term_id
Annotator | Curator MUST annotate.
Value + categorical with str categories. If unavailable, this MUST be "unknown".

+ + + + + + + + + + + + + + + + + + + + + +
For organism_ontology_term_id | Value
+ "NCBITaxon:6239"
for Caenorhabditis elegans +
+ MUST be the most accurate descendant of WBls:0000075
for worm life stage +
+ "NCBITaxon:7955"
for Danio rerio +
+ MUST be the most accurate descendant of ZFS:0100000
for zebrafish stage and MUST NOT be ZFS:0000000 for Unknown +
+ "NCBITaxon:7227"
for Drosophila melanogaster +
+ MUST be the most accurate FBdv term +
+
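The per-organism rules in the table above can be sketched as a lookup plus a descendant check. This is only an illustration under stated assumptions, not the validator's implementation: the names `development_stage_ok`, `is_descendant`, and `TOY_PARENTS` are hypothetical, and the parent edges use placeholder term IDs. Only the organism IDs and root terms come from the table. Whether a term is the "most accurate" one remains a curation judgment that code like this cannot decide.

```python
# Organism IDs and root terms are copied from the table above.
DEVELOPMENT_STAGE_ROOTS = {
    "NCBITaxon:6239": "WBls:0000075",  # Caenorhabditis elegans: worm life stage
    "NCBITaxon:7955": "ZFS:0100000",  # Danio rerio: zebrafish stage
}
FORBIDDEN_TERMS = {"ZFS:0000000"}  # "Unknown" MUST NOT be used for zebrafish
FBDV_PREFIX = "FBdv:"  # Drosophila melanogaster: any FBdv term

# Hypothetical child -> parents edges with placeholder IDs. A real check would
# load these from the pinned ontology release instead.
TOY_PARENTS = {
    "WBls:toy_stage": ["WBls:0000075"],
    "ZFS:toy_stage": ["ZFS:0100000"],
    "ZFS:0000000": ["ZFS:0100000"],  # assumed edge, implied by the MUST NOT rule
}


def is_descendant(term, root, parents):
    """Return True if root is reachable from term via parent links (a term is not its own descendant)."""
    seen, stack = set(), list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current == root:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return False


def development_stage_ok(organism, term, parents):
    """Check one development_stage_ontology_term_id value for the organisms in the table."""
    if term == "unknown":  # allowed when the stage is unavailable
        return True
    if organism == "NCBITaxon:7227":
        return term.startswith(FBDV_PREFIX)
    root = DEVELOPMENT_STAGE_ROOTS.get(organism)
    if root is None:
        raise ValueError(f"organism not covered by this sketch: {organism}")
    return term not in FORBIDDEN_TERMS and is_descendant(term, root, parents)
```

With these toy edges, `development_stage_ok("NCBITaxon:7955", "ZFS:0000000", TOY_PARENTS)` is rejected even though the term sits under the root, which mirrors the explicit exclusion in the table.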

---- - ### organism_cell_type_ontology_term_id @@ -163,7 +204,15 @@ The following gene annotation dependencies are *pinned* for this version of the
Value - categorical with str categories.

+ categorical with str categories. This MUST be "unknown" when: +
    +
  • + no appropriate term can be found (e.g. the cell type is unknown) +
  • +
  • + assay_ontology_term_id is "EFO:0010961" for Visium Spatial Gene Expression, uns['spatial']['is_single'] is True, and the corresponding value of in_tissue is 0 +
  • +
@@ -172,40 +221,27 @@ The following gene annotation dependencies are *pinned* for this version of the + - + -
For organism_ontology_term_id
- "NCBITaxon:7955"
for Danio rerio + "NCBITaxon:6239"
for Caenorhabditis elegans
- MUST be either the most accurate descendant of ZFA:0009000 for cell
or "unknown" when: -
    -
  • - no appropriate term can be found (e.g. the cell type is unknown) -
  • -
  • - assay_ontology_term_id is "EFO:0010961" for
    Visium Spatial Gene Expression, uns['spatial']['is_single'] is True,
    and the corresponding value of in_tissue is 0 -
  • -
+ MUST be the most accurate descendant of WBbt:0004017 for Cell
- "NCBITaxon:7227"
for Drosophila melanogaster + "NCBITaxon:7955"
for Danio rerio
MUST be either the most accurate descendant of FBbt:00007002 for cell
or "unknown" when: -
    -
  • - no appropriate term can be found (e.g. the cell type is unknown) -
  • -
  • - assay_ontology_term_id is "EFO:0010961" for
    Visium Spatial Gene Expression, uns['spatial']['is_single'] is True,
    and the corresponding value of in_tissue is 0 -
  • -
+
+ MUST be the most accurate descendant of ZFA:0009000 for cell
- All other values of
organism_ontology_term_id + "NCBITaxon:7227"
for Drosophila melanogaster +
MUST be the most accurate descendant of FBbt:00007002 for cell MUST be "na"
@@ -230,7 +266,12 @@ The following gene annotation dependencies are *pinned* for this version of the
Value - categorical with str categories. This MUST be a descendant of NCBITaxon:33208 for Metazoa.

If organism_ontology_term_id is "NCBITaxon:7955" for Danio rerio or "NCBITaxon:7227" for Drosophila melanogaster, then all observations MUST contain the same value. + categorical with str categories. This MUST be a descendant of NCBITaxon:33208 for Metazoa.

All observations MUST contain the same value when the organism_ontology_term_id is: +
@@ -261,6 +302,14 @@ The following gene annotation dependencies are *pinned* for this version of the + + + "NCBITaxon:6239"
for Caenorhabditis elegans + + + MUST be the most accurate descendant of WBbt:0005766 for Anatomy + + "NCBITaxon:7955"
for Danio rerio @@ -277,12 +326,6 @@ The following gene annotation dependencies are *pinned* for this version of the MUST be the most accurate descendant of FBbt:10000000 for
anatomical entity and MUST NOT be FBbt:00007002
for cell or any of its descendants. - - - All other values of
organism_ontology_term_id - - MUST be "na" - @@ -292,6 +335,27 @@ The following gene annotation dependencies are *pinned* for this version of the --- +### sex_ontology_term_id + + + + + + + + + + + + + + +
Key | sex_ontology_term_id
Annotator | Curator MUST annotate.
Value | categorical with str categories. If unavailable, this MUST be "unknown".

If organism_ontology_term_id is "NCBITaxon:6239" for Caenorhabditis elegans, this MUST be PATO:0000384 for male or PATO:0001340 for hermaphrodite.

Otherwise, this MUST be a descendant of PATO:0001894 for phenotypic sex. +
+
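The *Caenorhabditis elegans* rule for `sex_ontology_term_id` above might be checked along these lines. This is a sketch, not the validator's code: the function name `c_elegans_sex_errors` and the message wording are hypothetical, and it assumes that `"unknown"` remains acceptable for *C. elegans* when the sex is unavailable. The general rule for other organisms (descendant of PATO:0001894) needs the ontology itself and is left out.

```python
C_ELEGANS = "NCBITaxon:6239"

# The two terms the draft allows for C. elegans.
C_ELEGANS_SEX_TERMS = {
    "PATO:0000384": "male",
    "PATO:0001340": "hermaphrodite",
}


def c_elegans_sex_errors(sex_terms):
    """Return one message per value that is not allowed for C. elegans."""
    errors = []
    for index, term in enumerate(sex_terms):
        # Assumption: "unknown" stays valid when the sex is unavailable.
        if term != "unknown" and term not in C_ELEGANS_SEX_TERMS:
            errors.append(f"row {index}: {term!r} is not allowed for {C_ELEGANS}")
    return errors
```

Returning a list of messages rather than a boolean follows the style of the tests in this change, which compare `validator.errors` against expected strings.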
+ +--- + ### tissue_type @@ -306,12 +370,18 @@ The following gene annotation dependencies are *pinned* for this version of the
Value - categorical with str categories.

If organism_ontology_term_id is "NCBITaxon:7955" for Danio rerio or "NCBITaxon:7227" for Drosophila melanogaster, then the value MUST be "tissue".

Otherwise, the value MUST be "tissue", "organoid", or "cell culture". + categorical with str categories.

The value MUST be "tissue" when the organism_ontology_term_id is: + Otherwise, the value MUST be "tissue", "organoid", or "cell culture".
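A minimal sketch of the `tissue_type` rule follows. The organism list is not visible at this point in the diff, so the set below is taken from the changelog entry later in this draft (*Caenorhabditis elegans*, *Danio rerio*, *Drosophila melanogaster*). The function name `tissue_type_ok` is hypothetical.

```python
# Organisms for which the draft restricts tissue_type to "tissue",
# as listed in the changelog of this draft.
TISSUE_ONLY_ORGANISMS = {
    "NCBITaxon:6239",  # Caenorhabditis elegans
    "NCBITaxon:7955",  # Danio rerio
    "NCBITaxon:7227",  # Drosophila melanogaster
}
ALLOWED_TISSUE_TYPES = {"tissue", "organoid", "cell culture"}


def tissue_type_ok(organism, tissue_type):
    """Apply the restricted rule for the listed organisms, the general rule otherwise."""
    if organism in TISSUE_ONLY_ORGANISMS:
        return tissue_type == "tissue"
    return tissue_type in ALLOWED_TISSUE_TYPES
```

For example, `tissue_type_ok("NCBITaxon:7955", "organoid")` is `False`, while the same value passes for `"NCBITaxon:9606"`.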

+ --- ## var and raw.var (Gene Metadata) @@ -355,6 +425,12 @@ The following gene annotation dependencies are *pinned* for this version of the "NCBITaxon:2697049" + + Caenorhabditis elegans + + "NCBITaxon:6239" + + Danio rerio @@ -388,18 +464,34 @@ The following gene annotation dependencies are *pinned* for this version of the * General Requirements * Updated requirements for supported organisms * Required Ontologies + * Added C. elegans Development Ontology (WBls) release 2024-09-26 Wormbase WS295 + * Added C. elegans Gross Anatomy Ontology (WBbt) release 2024-09-24 Wormbase WS295 * Added Drosophila Anatomy Ontology (FBbt) release 2024-08-08 * Added Drosophila Development Ontology (FBdv) release 2024-08-07 * Added Zebrafish Anatomy Ontology (ZFA+ZFS) release 2022-12-09 * Required Gene Annotations * Refactored table to include NCBI Taxon for supported organisms - * Added *Danio rerio* Reference GRCz11.112 (Ensembl 112) - * Added *Drosophila melanogaster* Reference BDGP6.46 (Ensembl 112) + * Added *Caenorhabditis elegans* WBcel235 (GCA_000002985.3) Ensembl 113 + * Added *Danio rerio* GRCz11 (GCA_000002035.4) Ensembl 113 + * Added *Drosophila melanogaster* BDGP6.46 (GCA_000001215.4) Ensembl 113 * obs (Cell metadata) - * Updated `development_stage_ontology_term_id` for *Danio rerio* and *Drosophila melanogaster* + * Updated `development_stage_ontology_term_id` to include: + * *Caenorhabditis elegans* + * *Danio rerio* + * *Drosophila melanogaster* * Added `organism_cell_type_ontology_term_id` - * Updated `organism_ontology_term_id` for *Danio rerio* and *Drosophila melanogaster* to require all observations to contain the same value + * Updated `organism_ontology_term_id` to require all observations to contain the same value for: + * *Caenorhabditis elegans* + * *Danio rerio* + * *Drosophila melanogaster* * Added `organism_tissue_ontology_term_id` - * Updated `tissue_type` to require `"tissue"` for *Danio rerio* and *Drosophila melanogaster* + * Updated
`sex_ontology_term_id` for *Caenorhabditis elegans* + * Updated `tissue_type` to require `"tissue"` for: + * *Caenorhabditis elegans* + * *Danio rerio* + * *Drosophila melanogaster* * var and raw.var (Gene Metadata) - * Updated `feature_reference` for *Danio rerio* and *Drosophila melanogaster* \ No newline at end of file + * Updated `feature_reference` to include: + * *Caenorhabditis elegans* + * *Danio rerio* + * *Drosophila melanogaster* \ No newline at end of file diff --git a/schema/drafts/5.3.0.md b/schema/drafts/5.3.0.md index a65f74a8e..70de7b798 100644 --- a/schema/drafts/5.3.0.md +++ b/schema/drafts/5.3.0.md @@ -163,9 +163,8 @@ The types below are python3 types. Note that a python3 `str` is a sequence of Un ## `X` (Matrix Layers) -The data stored in the `X` data matrix is the data that is viewable in CELLxGENE Explorer. CELLxGENE does not impose any additional constraints on the `X` data matrix. +The data stored in the `AnnData.X` data matrix is the data that is viewable in CELLxGENE Explorer. For `AnnData.X`, `AnnData.raw.X`, and all layers, if a data matrix contains 50% or more values that are zeros, it MUST be encoded as a [`scipy.sparse.csr_matrix`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) with zero values encoded as implicit zeros. -In any layer, if a matrix has 50% or more values that are zeros, it is STRONGLY RECOMMENDED that the matrix be encoded as a [`scipy.sparse.csr_matrix`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) with zero values encoded as implicit zeros. CELLxGENE's matrix layer requirements are tailored to optimize data reuse. Because each assay has different characteristics, the requirements differ by assay type. In general, CELLxGENE requires submission of "raw" data suitable for computational reuse when a standard raw matrix format exists for an assay. 
It is STRONGLY RECOMMENDED to also include a "normalized" matrix with processed values ready for data analysis and suitable for visualization in CELLxGENE Explorer. So that CELLxGENE's data can be provided in download formats suitable for both R and Python, the schema imposes the following requirements: @@ -583,10 +582,9 @@ If organism_ontology_term_id is "NCBITaxon:9606" for Value - str or float. All observations with the same donor_id MUST contain the same value.

+ float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT - "NCBITaxon:9606" for Homo sapiens, then the - value MUST be "na".

If + "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0010" for African expressed as a float greater than or equal to 0.0 and less than or equal to 1.0 @@ -610,10 +608,9 @@ If organism_ontology_term_id is "NCBITaxon:9606" for Value - str or float. All observations with the same donor_id MUST contain the same value.

- If organism_ontolology_term_id is NOT - "NCBITaxon:9606" for Homo sapiens, then the - value MUST be "na".

If + float. All observations with the same donor_id MUST contain the same value.

+ If organism_ontology_term_id is NOT + "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0009" for East Asian expressed as a float greater than or equal to 0.0 and less than or equal to 1.0 @@ -637,10 +634,9 @@ If organism_ontology_term_id is "NCBITaxon:9606" for Value - str or float. All observations with the same donor_id MUST contain the same value.

+ float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT - "NCBITaxon:9606" for Homo sapiens, then the - value MUST be "na".

If + "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0005" for European expressed as a float greater than or equal to 0.0 and less than or equal to 1.0 @@ -664,10 +660,9 @@ If organism_ontology_term_id is "NCBITaxon:9606" for Value - str or float. All observations with the same donor_id MUST contain the same value.

+ float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT - "NCBITaxon:9606" for Homo sapiens, then the - value MUST be "na".

If + "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0013" for Indigenous American expressed as a float greater than or equal to 0.0 and less than or equal to 1.0 @@ -691,10 +686,9 @@ If organism_ontology_term_id is "NCBITaxon:9606" for Value - str or float. All observations with the same donor_id MUST contain the same value.

+ float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT - "NCBITaxon:9606" for Homo sapiens, then the - value MUST be "na".

If + "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0017" for Oceanian expressed as a float greater than or equal to 0.0 and less than or equal to 1.0 @@ -718,10 +712,9 @@ If organism_ontology_term_id is "NCBITaxon:9606" for Value - str or float. All observations with the same donor_id MUST contain the same value.

+ float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT - "NCBITaxon:9606" for Homo sapiens, then the - value MUST be "na".

If + "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan").

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be a float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0006" for South Asian expressed as a float greater than or equal to 0.0 and less than or equal to 1.0 @@ -1017,7 +1010,12 @@ If organism_ontology_term_id is "NCBITaxon:9606" for Value - categorical with str categories. This MUST be a descendant of PATO:0001894 for phenotypic sex or "unknown" if unavailable. + categorical with str categories. This MUST be "unknown" if unavailable; otherwise, this MUST be one of:

+ @@ -2067,7 +2065,7 @@ When a dataset is uploaded, CELLxGENE Discover MUST automatically add the `schem * Updated _Visium Spatial Gene Expression_ table row to _Descendants of Visium Spatial Gene Expression_ * Added matrix requirements for _Visium CytAssist Spatial Gene Expression, 11mm_. * obs (Cell metadata) - * Updated the requirements for `array_col`: + * Updated the requirements for `array_col`: * MUST be annotated if the `assay_ontology_term_id` is a descendant of _Visium Spatial Gene Expression_ * Added ranges for _Visium CytAssist Spatial Gene Expression, 6.5mm_ and _Visium CytAssist Spatial Gene Expression, 11mm_ * Updated the requirements for `array_row`: @@ -2082,6 +2080,7 @@ When a dataset is uploaded, CELLxGENE Discover MUST automatically add the `schem * Added genetic_ancestry_Oceanian * Added genetic_ancestry_South_Asian * Updated the requirements for `in_tissue` to include descendants of _Visium Spatial Gene Expression_. + * Updated the requirements for `sex_ontology_term_id` to limit values to female, hermaphrodite, male, or `"unknown"` * obsm (Embeddings) * Updated the requirements for `spatial` to include descendants of _Visium Spatial Gene Expression_ and to prohibit 'Not a Number' values. * Updated the requirements for `X_{suffix}` to include descendants of _Visium Spatial Gene Expression_. @@ -2097,6 +2096,8 @@ When a dataset is uploaded, CELLxGENE Discover MUST automatically add the `schem * Updated the requirements for spatial[library_id]['scalefactors'] to include descendants of _Visium Spatial Gene Expression_. * Updated the requirements for spatial[library_id]['scalefactors']['spot_diameter_fullres'] to include descendants of _Visium Spatial Gene Expression_. * Updated the requirements for spatial[library_id]['scalefactors']['tissue_hires_scalef'] to include descendants of _Visium Spatial Gene Expression_. +* X (Matrix Layers) + * Updated the STRONGLY RECOMMENDED requirement to a MUST. 
A matrix with 50% or more values that are zeros MUST be encoded as `scipy.sparse.csr_matrix`. ### schema v5.2.0