All datasets in census["census_info"]["datasets"] are the "downloadable" datasets. If you find that other metadata would be valuable, the option I'd prefer is to add it to the data frame itself, and then provide a wrapper get_datasets() for census["census_info"]["datasets"].read().concat().to_pandas().
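A minimal sketch of what such a wrapper could look like. The name get_datasets() comes from the comment above; it is not an existing cellxgene_census function, and print_available_datasets() is a hypothetical usage helper added only for illustration.

```python
def get_datasets(census):
    """Return the Census datasets table as a pandas DataFrame.

    `census` is an open Census handle, for example the object returned by
    cellxgene_census.open_soma(). This is a proposed wrapper, not part of
    the library.
    """
    return census["census_info"]["datasets"].read().concat().to_pandas()


def print_available_datasets(census_version="stable"):
    """Hypothetical usage: open the Census and show the first few datasets."""
    # Imported lazily so get_datasets() itself has no import-time dependency.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        print(get_datasets(census).head())
```

Calling print_available_datasets() needs network access to the public Census bucket.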
census["census_info"]["datasets"].read().concat().to_pandas() is pretty long and probably not super intuitive to a new user. I do think you should also be able to know what species a dataset is from, which isn't included in the census_info/datasets table.
Be aware that there are some edge cases for species. Some original source datasets contain cells from both organisms; when ingested into Census they follow these rules, which leads to the inclusion of cells from only one organism.
Description
Right now users can download one of the source h5ads using cellxgene_census.download_source_h5ad. However, not all cellxgene datasets are available here, which can be confusing.
Internally we validate the IDs using a table, so I think we should expose that to users.
At a higher level, if we give users the ability to access a dataset by ID, we should probably also let them see which IDs are available. This is how we handle embeddings (cellxgene_census.experimental.get_all_available_embeddings) and census versions (cellxgene_census.get_census_version_directory).
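As a rough illustration of the idea, a caller could check an ID against the datasets table before attempting the download. Both function names below are hypothetical, and the sketch assumes the table has a dataset_id column and that the same census_version is used for the lookup and the download.

```python
def is_downloadable(datasets, dataset_id):
    """Return True if `dataset_id` appears in the Census datasets table.

    `datasets` is expected to be the DataFrame read from
    census["census_info"]["datasets"].
    """
    return dataset_id in set(datasets["dataset_id"])


def download_if_listed(dataset_id, to_path, census_version="stable"):
    """Hypothetical helper: download a source h5ad only if Census lists it."""
    # Imported lazily so is_downloadable() can be used without the package.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        datasets = census["census_info"]["datasets"].read().concat().to_pandas()

    if not is_downloadable(datasets, dataset_id):
        raise ValueError(
            f"{dataset_id!r} is not in Census version {census_version!r}"
        )

    cellxgene_census.download_source_h5ad(
        dataset_id, to_path=to_path, census_version=census_version
    )
```

This only makes the failure mode explicit. It does not replace a dedicated listing function.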
Impact
Some users mainly grab h5ads from census. I would like to make their lives a little easier.
Alternatives you've considered
Users can look up a dataset in the explorer, then use this API to download the dataset programmatically. However, as shown in the issue referenced above, this doesn't always work, since not all datasets exist in census.
The user can get a list of datasets by opening the census and finding all the unique dataset IDs. AFAICT this is the main user-visible way to accomplish this task.
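A sketch of that workaround, assuming the usual Census layout in which each organism lives under census["census_data"] and its obs data frame carries a dataset_id column. The function name unique_dataset_ids is made up for this example.

```python
def unique_dataset_ids(census, organism):
    """Return the sorted unique dataset IDs found in one organism's obs.

    `census` is an open Census handle and `organism` is a key such as
    "homo_sapiens" or "mus_musculus".
    """
    obs = census["census_data"][organism].obs
    # Read only the dataset_id column; this still scans every cell.
    obs_df = obs.read(column_names=["dataset_id"]).concat().to_pandas()
    return sorted(obs_df["dataset_id"].unique())
```

Because this reads one value per cell, it can be slow on the full Census. It does tell you which organism each dataset's cells were ingested under, which the census_info/datasets table does not.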
Ideal behavior
A function that returns a pandas table with all datasets available in the public bucket along with a small amount of metadata, like dataset species. Likely, it should have most of the info available in census["census_info"]["datasets"].