Intelligent gene name scraping #46

kieranrcampbell · 2022-05-18T14:37:37Z

Currently cytosel assumes that rownames(sce) are gene symbols (note: the user doesn't necessarily upload a SingleCellExperiment; it could be a Seurat or AnnData object).

However, the rownames could be Ensembl/Entrez IDs. The symbols might be in a column of rowData(sce) (the per-gene metadata), but they might not be there either.

Can we write a function called parse_rownames that:

Identifies whether rownames(sce) are symbols
If not, looks for a column of rowData(sce) that might hold symbols it could use
If nothing is there, works out what rownames(sce) are: if they're Ensembl/Entrez IDs, converts them to symbols (using annotables); otherwise throws an error (dialog)
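
A minimal sketch of the "work out what the rownames are" step, written in Python purely for illustration (it is not the R function proposed here). The regular expressions are assumptions: human Ensembl gene IDs are taken to be `ENSG` plus 11 digits with an optional version suffix, and Entrez IDs are taken to be plain integers. The name `guess_id_type` and the `min_fraction` cutoff are hypothetical.

```python
import re

# Assumed patterns (not taken from the thread):
# human Ensembl gene IDs: "ENSG" + 11 digits, optional ".version" suffix
ENSEMBL_HUMAN = re.compile(r"^ENSG\d{11}(\.\d+)?$")
# Entrez gene IDs: plain integers
ENTREZ = re.compile(r"^\d+$")


def guess_id_type(names, min_fraction=0.5):
    """Return 'ensembl', 'entrez', or 'other' for a list of row names.

    A type is reported only if more than `min_fraction` of the names
    match its pattern. 'other' covers symbols and anything unrecognised,
    so it still has to be checked against a symbol table afterwards.
    """
    if not names:
        return "other"
    for label, pattern in (("ensembl", ENSEMBL_HUMAN), ("entrez", ENTREZ)):
        hits = sum(1 for name in names if pattern.match(name))
        if hits / len(names) > min_fraction:
            return label
    return "other"
```

A result of `'other'` does not prove the names are symbols; it only says they do not look like Ensembl or Entrez IDs.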

Two notes:

Will need to check if it's human, if not -> error (only human supported currently)
When mapping from Ensembl -> symbol, multiple Ensembl IDs can correspond to the same symbol, so there will be non-unique genes. I'd suggest handling this by summing counts and taking the mean of log counts. There should be a function in scuttle that does this.
Check out this R package I wrote for working out gene/organism format https://github.com/camlab-bioml/inferorg
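
To make the duplicate-symbol note concrete, here is a small Python sketch of summing count rows whose IDs map to the same symbol. It is an illustration only, not the scuttle function mentioned above. The function name `collapse_to_symbols`, the list-of-lists matrix layout, and the choice to drop rows with no mapping are all assumptions.

```python
def collapse_to_symbols(counts, row_ids, id_to_symbol):
    """Sum count rows whose IDs map to the same symbol.

    counts:       list of rows, each a list of per-cell counts
    row_ids:      one identifier per row of `counts`
    id_to_symbol: dict mapping identifier -> gene symbol

    Rows whose identifier has no symbol are dropped (an assumption;
    the thread does not say what should happen to them).
    """
    summed = {}
    for row_id, row in zip(row_ids, counts):
        symbol = id_to_symbol.get(row_id)
        if symbol is None:
            continue  # unmapped identifier
        if symbol in summed:
            # element-wise sum with the rows already seen for this symbol
            summed[symbol] = [a + b for a, b in zip(summed[symbol], row)]
        else:
            summed[symbol] = list(row)
    return summed
```

For example, with placeholder IDs `id1` and `id2` both mapped to `GENE_A`, rows `[1, 2]` and `[3, 4]` collapse to a single `GENE_A` row `[4, 6]`.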


Michael-Geuenich · 2022-05-27T19:41:16Z

Thinking about the parse_rownames function:

maybe parse_gene_names is a better name?
when it comes to identifying whether rownames(sce) are symbols, I was thinking of checking whether all the genes in the sce are present in annotables. This raises the question of what to do when only a subset is present: should we throw a warning specifying which are missing?

Also, I remember having an issue with annotables (a lot of histone genes suddenly disappeared between versions). I will file an issue on their GitHub to make sure this keeps working. Otherwise it might be best to save a version of the grch38 data frame from annotables locally so that we don't depend on any changes they make to their data.

kieranrcampbell · 2022-05-27T19:47:21Z

yes. generally feel free to rename anything if it makes more sense
i think just be sensible about it (remember don't have to justify any of this to reviewers). maybe something like:

if > half the genes are found, keep going, no error
if < half but > 100 genes are found, throw warning saying only a small number of genes could be matched (but i wouldn't list them, because who wants to see a list of 1000s of genes)
if < 100 genes are found, throw error saying likely mismatch
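
The three rules above could be sketched as follows (Python, for illustration only). The rules are checked in the order given, so the "more than half" rule wins when both it and the "fewer than 100" rule apply. The thread does not say what happens at exactly half or exactly 100 genes; this sketch assumes both boundaries fall into the more lenient branch. The name `match_status` is hypothetical.

```python
def match_status(n_found, n_total, min_genes=100):
    """Classify a gene-name match as 'ok', 'warn', or 'error'.

    n_found: number of row names matched to a known symbol
    n_total: number of row names in the dataset

    Boundary handling is an assumption: exactly half counts as 'ok'
    and exactly `min_genes` counts as 'warn'.
    """
    if n_total > 0 and n_found * 2 >= n_total:
        return "ok"      # at least half matched: keep going
    if n_found >= min_genes:
        return "warn"    # under half, but enough genes to proceed
    return "error"       # likely an identifier mismatch
```

The caller would then turn `'warn'` into a warning that reports only the number of matched genes, and `'error'` into the error dialog.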

kieranrcampbell · 2022-06-01T17:56:27Z

This package should have functions that will help with this

Michael-Geuenich · 2022-07-16T22:53:37Z

Ok, this is implemented and I've done some basic testing locally (though more is probably needed).

One note: I needed the grch38 annotation from annotables, which I didn't manage to import because it is a data frame, so I've just saved it and committed it. This might be better anyway, because then we won't be reliant on any updates they make to their package, but I'm happy to change it if you have a better idea.

kieranrcampbell · 2022-07-18T13:05:02Z

Yes like the idea of saving locally if it can reduce dependencies (also reduces potential for things to break later)

Michael-Geuenich · 2022-08-02T22:04:40Z

Potential issue: the parse_gene_names function uses the sumCountsAcrossFeatures function from scuttle, which takes an sce object and returns a matrix. This means I have to recreate the sce, and the recreated sce currently contains only one assay (logcounts). If this is an issue, we will need to loop through all assays and run sumCountsAcrossFeatures on each one.
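
A rough Python sketch of the "loop through all assays" option, for illustration only; it does not call scuttle. Assays are modelled as a dict of name -> list-of-lists matrix, and each assay gets its own reducer so that counts can be summed while log counts are averaged, as suggested earlier in the thread. The names `aggregate_rows` and `aggregate_all_assays` are hypothetical.

```python
from statistics import fmean


def aggregate_rows(matrix, groups, reducer):
    """Combine rows that share a group label, column by column."""
    grouped = {}
    for label, row in zip(groups, matrix):
        grouped.setdefault(label, []).append(row)
    # zip(*rows) walks the columns of the rows collected for one label
    return {
        label: [reducer(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def aggregate_all_assays(assays, groups, reducers):
    """Apply aggregate_rows to every assay with that assay's reducer."""
    return {
        name: aggregate_rows(matrix, groups, reducers[name])
        for name, matrix in assays.items()
    }


# Toy data: the first two rows map to the same (placeholder) symbol.
assays = {
    "counts": [[1, 2], [3, 4], [5, 6]],
    "logcounts": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
}
groups = ["GENE_A", "GENE_A", "GENE_B"]
result = aggregate_all_assays(assays, groups, {"counts": sum, "logcounts": fmean})
```

Because every assay is aggregated with the same `groups` vector, the rows of all resulting assays line up, which is what rebuilding a multi-assay object would need.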

kieranrcampbell added the enhancement New feature or request label May 18, 2022

kieranrcampbell assigned Michael-Geuenich May 18, 2022

kieranrcampbell mentioned this issue Jun 1, 2022

Debug 'out of the box' failure on HCA dataset #51

Closed
