cellxgene curation tools

This repository contains documents and code used by cellxgene's curation team. Issues/suggestions pertaining to datasets and how they interact with cellxgene should be created here.

For information/issues about cellxgene and its portal please refer to:

Installation

The primary curation tool is the cellxgene-schema CLI. It enables curators to perform schema validation for datasets to be hosted on the cellxgene Data Portal.

It requires Python >= 3.8. It is available through pip:

pip install cellxgene-schema

It can also be installed from source by cloning this repository and running:

make install 

And you can run the tests with:

make unit-test

Usage

The CLI validates an AnnData file (*.h5ad) to ensure that it meets the schema requirements.

Datasets can be validated with the following command:

cellxgene-schema validate input.h5ad

If the validation succeeds, the command returns a zero exit code; otherwise, it returns a non-zero exit code and prints validation failure messages.
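Because the result is reported through the exit code, the command is easy to call from a script. The following is a minimal sketch, not part of the package: the helper names `build_validate_command` and `is_valid` and the `runner` parameter are hypothetical, and it assumes `cellxgene-schema` is on your PATH.

```python
import subprocess


def build_validate_command(h5ad_path):
    """Return the argument list for `cellxgene-schema validate <file>`."""
    return ["cellxgene-schema", "validate", str(h5ad_path)]


def is_valid(h5ad_path, runner=subprocess.run):
    """Run the validator and report success based on its exit code.

    `runner` is a hypothetical injection point (it defaults to
    subprocess.run) so the function can be exercised without the CLI.
    """
    result = runner(build_validate_command(h5ad_path))
    # A zero exit code means the file passed validation;
    # any non-zero exit code means validation failed.
    return result.returncode == 0
```

Calling `is_valid("input.h5ad")` runs the CLI once; the validation failure messages, if any, are still printed by the CLI itself.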


This experimental validator also offers the option to annotate the required columns cell_type_ontology_term_id and tissue_ontology_term_id in zebrafish, fruit fly, or C. elegans AnnData files BEFORE running the validation command above.

This relies on your AnnData file having the appropriate species-specific ontology terms (e.g. ZFA, FBbt, WBbt) labeled in organism_cell_type_ontology_term_id and organism_tissue_ontology_term_id, respectively.

cellxgene-schema map-species output.h5ad input.h5ad
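Before running the command above, it can help to confirm that both source columns are present. The sketch below is an illustration only: `missing_source_columns` is a hypothetical helper, and it assumes the two columns are stored as cell-level metadata that you can list as plain column names.

```python
# Columns that the species mapping step reads from, as named in this README.
REQUIRED_SOURCE_COLUMNS = (
    "organism_cell_type_ontology_term_id",
    "organism_tissue_ontology_term_id",
)


def missing_source_columns(column_names):
    """Return the required source columns absent from `column_names`."""
    present = set(column_names)
    return [name for name in REQUIRED_SOURCE_COLUMNS if name not in present]
```

If you load the file with the anndata package, something like `missing_source_columns(anndata.read_h5ad("input.h5ad").obs.columns)` should return an empty list, assuming the columns live in `obs`.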

The command will find the closest CL (for cell_type) or UBERON (for tissue) mapping for the given term, offering either an exact match for the given term or a match from the closest possible ancestor term. This is based on the CL and UBERON SSSOM mappings.

If multiple closest ancestors at the same distance have a match, the command will NOT annotate those rows; instead, it logs the candidate ancestor matches so that you can curate them manually.
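The closest-ancestor rule described above can be pictured as a level-by-level walk up the ontology. This is a simplified sketch of that idea and not the tool's actual implementation: the function names, the `parents` and `mapping` dictionaries, and every term ID in the usage example are made up for illustration.

```python
def closest_mapped_terms(term, parents, mapping):
    """Find the nearest term(s) that have a mapping, walking upward.

    parents: dict of term -> list of direct parent terms (hypothetical input)
    mapping: dict of source term -> target term (hypothetical input)
    Returns a sorted list of (matched_term, target) pairs that share the
    smallest distance from `term`; distance 0 is an exact match.
    """
    frontier = [term]
    seen = {term}
    while frontier:
        # Collect every term at the current distance that has a mapping.
        hits = sorted((t, mapping[t]) for t in frontier if t in mapping)
        if hits:
            return hits
        # No match at this distance: move one level up to the parents.
        next_frontier = []
        for t in frontier:
            for parent in parents.get(t, []):
                if parent not in seen:
                    seen.add(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
    return []


def annotate(term, parents, mapping):
    """Return the target term only when the closest match is unique."""
    hits = closest_mapped_terms(term, parents, mapping)
    if len(hits) == 1:
        return hits[0][1]
    # Zero matches or a tie between ancestors: leave for manual curation.
    return None


# Made-up toy ontology: "X:leaf" has two parents, both with a mapping.
toy_parents = {
    "X:leaf": ["X:p1", "X:p2"],
    "X:solo": ["X:p1"],
    "X:p1": ["X:root"],
    "X:p2": ["X:root"],
}
toy_mapping = {"X:p1": "T:1", "X:p2": "T:2", "X:root": "T:0"}
```

With this toy data, `annotate("X:solo", toy_parents, toy_mapping)` resolves through its single parent, while `annotate("X:leaf", toy_parents, toy_mapping)` returns `None` because two ancestors at the same distance both match.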


The data portal runs the following in the backend:

cellxgene-schema validate --add-labels output.h5ad input.h5ad

This execution validates the dataset as above AND adds the human-readable labels for the ontology and gene IDs as defined in the schema. If the validation is successful, a new AnnData file (output.h5ad) is written to disk with the labels appended.

This option SHOULD NOT be used by data contributors.

Contributing

Please read our contributing guidelines and make sure to adhere to the Contributor Covenant code of conduct.

Reporting Security Issues

Please read our security reporting policy.