Add function for listing available h5ads for download #1251

Open
ivirshup opened this issue Jul 24, 2024 · 3 comments

@ivirshup
Collaborator

Description

Right now users can download one of the source h5ads using cellxgene_census.download_source_h5ad. However, not all cellxgene datasets are available here, which can be confusing.

Internally we validate the IDs using a table, so I think we should expose that to users.

On a higher level, if we are providing the ability for users to access a dataset by ID, we should probably give them the ability to see which IDs are available. This is how we do embeddings (cellxgene_census.experimental.get_all_available_embeddings) and census versions (cellxgene_census.get_census_version_directory).

Impact

Some users mainly grab h5ads from census. I would like to make their lives a little easier.

Alternatives you've considered

Users can look up a dataset in the explorer, then use this API to download the dataset programmatically. However, as shown in the issue referenced above, this doesn't always work, since not all datasets exist in census.

The user can get a list of datasets by opening the census and finding all the unique dataset IDs. AFAICT this is the main user-visible way to accomplish this task.
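
For reference, that workaround could look roughly like the sketch below. The helper name `list_dataset_ids` is hypothetical (it is not part of cellxgene_census), and the default organism keys are an assumption about which experiments exist under census["census_data"]. It reads only the dataset_id column, but that still means pulling one value per cell, so it can be slow.

```python
def list_dataset_ids(census, organisms=("homo_sapiens", "mus_musculus")):
    """Return the sorted unique dataset IDs found in the obs tables.

    Hypothetical helper, not part of cellxgene_census. `census` is an open
    Census handle, e.g. the object returned by cellxgene_census.open_soma().
    """
    ids = set()
    for organism in organisms:
        obs = census["census_data"][organism].obs
        # Read a single column; concat() yields one Arrow table.
        table = obs.read(column_names=["dataset_id"]).concat()
        ids.update(table.column("dataset_id").unique().to_pylist())
    return sorted(ids)
```

Typical use would be something like `with cellxgene_census.open_soma(census_version="stable") as census: ids = list_dataset_ids(census)`.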

Ideal behavior

A function that returns a pandas DataFrame with all datasets available in the public bucket, along with a small amount of metadata such as dataset species. It should likely include most of the info available in census["census_info"]["datasets"].

@pablo-gar
Contributor

pablo-gar commented Jul 26, 2024

All datasets in census["census_info"]["datasets"] are the "downloadable" datasets. If you find that other metadata may be valuable, the option I'd prefer would be to add it to the data frame itself, and then provide a wrapper get_datasets() for census["census_info"]["datasets"].read().concat().to_pandas()
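
A minimal sketch of what such a wrapper might look like. The name get_datasets comes from the suggestion above; the signature (taking an already-open census handle) is an assumption, and `main` is only an illustration of how it might be called.

```python
def get_datasets(census):
    """Return the census_info/datasets table as a pandas DataFrame.

    Sketch only: the signature is an assumption, not an agreed API.
    """
    return census["census_info"]["datasets"].read().concat().to_pandas()


def main():
    # Imported lazily so the helper above has no hard dependency.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version="stable") as census:
        datasets = get_datasets(census)
        print(datasets.head())
```

Taking the handle as an argument keeps the wrapper usable with whichever census version the caller has already opened.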

pablo-gar added the "python api" and "r api" labels Jul 26, 2024
@ivirshup
Collaborator Author

Yeah, I basically want to expose that. Higher visibility for that, and maybe a convenience function since

census["census_info"]["datasets"].read().concat().to_pandas()

is pretty long and probably not super intuitive to a new user. I do think you should also be able to know what species a dataset is from, which isn't included in the census_info/datasets table.
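
One possible way to derive that, sketched under assumptions: the helper name `dataset_organisms` is hypothetical, and the default organism keys are assumed to match the experiments under census["census_data"]. It records which Census experiment(s) contain cells from each dataset, which is not necessarily every organism present in the original source h5ad.

```python
def dataset_organisms(census, organisms=("homo_sapiens", "mus_musculus")):
    """Map each dataset_id to the Census experiments that contain its cells.

    Hypothetical helper, not part of cellxgene_census.
    """
    mapping = {}
    for organism in organisms:
        obs = census["census_data"][organism].obs
        # Only the dataset_id column is needed to know where a dataset appears.
        table = obs.read(column_names=["dataset_id"]).concat()
        for dataset_id in table.column("dataset_id").unique().to_pylist():
            mapping.setdefault(dataset_id, []).append(organism)
    return mapping
```

The result could then be attached to the datasets table with something like `datasets["organism"] = datasets["dataset_id"].map(mapping)`.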

@pablo-gar
Contributor

pablo-gar commented Jul 26, 2024

Be aware that there are some edge cases for species. Some datasets have an original source dataset that contains cells from both organisms. When ingested into Census they follow these rules, which leads to the inclusion of cells from only one organism.

https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#multi-species-data-constraints
