Add function for listing available h5ads for download #1251

ivirshup · 2024-07-24T23:41:38Z

Description

Right now users can download one of the source h5ads using cellxgene_census.download_source_h5ad. However, not all cellxgene datasets are available here, which can be confusing, as in this issue:

KeyError on download_source_h5ad with Valid Dataset ID in cellxgene_census #1100

Internally we validate the IDs using a table, so I think we should expose that to users.

On a higher level, if we are providing the ability for users to access a dataset by ID, we should probably give them the ability to see which IDs are available. This is how we do embeddings (cellxgene_census.experimental.get_all_available_embeddings) and census versions (cellxgene_census.get_census_version_directory).

Impact

Some users mainly grab h5ads from census. I would like to make their lives a little easier.

Alternatives you've considered

Users can look up a dataset in the explorer, then use this api to download the dataset programmatically. However, as shown in the issue referenced above, this doesn't always work since not all datasets exist in census.

The user can access a list of datasets by accessing the census and finding all the unique dataset ids. AFAICT this is the main user-visible way to accomplish this task.
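A minimal sketch of that workaround, assuming the datasets table has already been loaded as a pandas DataFrame (or anything else with a `dataset_id` column). The helper names `available_dataset_ids` and `check_downloadable` are hypothetical and not part of cellxgene_census:

```python
def available_dataset_ids(datasets):
    """Return the sorted unique dataset IDs from a table with a 'dataset_id' column.

    `datasets` may be a pandas DataFrame or any mapping of column name to values.
    """
    return sorted(set(datasets["dataset_id"]))


def check_downloadable(dataset_id, datasets):
    """Raise a descriptive error if `dataset_id` is not listed in the table.

    Hypothetical helper: intended to run before a download call so the user
    sees a clear message instead of a bare KeyError.
    """
    if dataset_id not in set(datasets["dataset_id"]):
        raise ValueError(
            f"Dataset {dataset_id!r} is not in this census; "
            "it may exist in the explorer but not be available for download here."
        )
    return dataset_id
```

With a real census, the table would come from census["census_info"]["datasets"], and a successful check would be followed by cellxgene_census.download_source_h5ad(dataset_id, to_path=...).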

Ideal behavior

A function that returns a pandas DataFrame with all datasets available in the public bucket along with a small amount of metadata, like dataset species. Likely, it should have most of the info available in census["census_info"]["datasets"].


pablo-gar · 2024-07-26T14:08:48Z

All datasets in census["census_info"]["datasets"] are the "downloadable" datasets. If you find that other metadata may be valuable, the option I'd prefer would be to add it to the data frame itself, and then provide a wrapper get_datasets() for census["census_info"]["datasets"].read().concat().to_pandas()

ivirshup · 2024-07-26T17:50:53Z

Yeah, I basically want to expose that. Higher visibility for that, and maybe a convenience function since

census["census_info"]["datasets"].read().concat().to_pandas()

is pretty long and probably not super intuitive to a new user. I do think you should also be able to know what species a dataset is from, which isn't included in the census_info/datasets table.
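For illustration, the convenience wrapper discussed here could be as small as the sketch below. The name get_datasets comes from this thread, but the signature (taking an already open census handle) is an assumption, not an existing cellxgene_census function:

```python
def get_datasets(census):
    """Return the census_info/datasets table as a pandas DataFrame.

    `census` is assumed to be an open Census handle, for example the object
    yielded by `cellxgene_census.open_soma()`. This only wraps the long
    expression quoted in the thread; it adds no extra metadata such as species.
    """
    return census["census_info"]["datasets"].read().concat().to_pandas()
```

A caller would then write something like `with cellxgene_census.open_soma() as census: datasets = get_datasets(census)` rather than spelling out the full read/concat/to_pandas chain.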

pablo-gar · 2024-07-26T18:35:02Z

Be aware that there are some edge cases for species: for some datasets, the original source dataset contains cells from both organisms. When ingested into Census they follow these rules, which leads to inclusion of cells from only one organism.

https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#multi-species-data-constraints
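One way to think about a species column is as a mapping from dataset ID to the organisms whose Census experiments contain cells from that dataset. The sketch below assumes the per-organism dataset IDs have already been collected (for instance from the dataset_id column of each organism's obs table); the function name `organisms_by_dataset` is hypothetical. Because of the ingestion rules linked above, such a mapping would describe the cells present in Census, not necessarily every species in the source h5ad.

```python
from collections import defaultdict


def organisms_by_dataset(dataset_ids_by_organism):
    """Invert {organism: dataset IDs} into {dataset ID: sorted organisms}.

    The input is an assumption for illustration: a mapping from organism name
    to an iterable of dataset IDs observed for that organism in Census.
    """
    organisms = defaultdict(set)
    for organism, dataset_ids in dataset_ids_by_organism.items():
        for dataset_id in dataset_ids:
            organisms[dataset_id].add(organism)
    # Sort for stable output; a dataset could in principle list several organisms.
    return {dataset_id: sorted(names) for dataset_id, names in organisms.items()}
```

The resulting dictionary could be joined onto the datasets table by dataset ID to provide the species information requested earlier in the thread.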

ivirshup added the user request label Jul 24, 2024

pablo-gar added the python api and r api labels Jul 26, 2024

cathystoli added the Priority backlog items label Aug 27, 2024
