Intelligent gene name scraping #46

Open
kieranrcampbell opened this issue May 18, 2022 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@kieranrcampbell
Member

Currently cytosel assumes that rownames(sce) are gene symbols (note: user doesn't necessarily upload SingleCellExperiment, could be Seurat or AnnData).

However, the rownames could be Ensembl/Entrez IDs. Maybe the symbols are in rowData(sce) (gene-level metadata lives there, not in colData), but maybe they're not.

Can we write a function called parse_rownames that:

  1. Identifies if rownames(sce) are symbols
  2. If not, looks for a column of rowData(sce) that might be symbols it could use
  3. If nothing is there, work out what rownames(sce) are - if they're ensembl/entrez, convert them to symbol (use annotables). Otherwise throw an error (dialog)
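A rough sketch of the "work out what the rownames are" step, written in Python purely to illustrate the logic (the real function would be R, and inferorg may already cover this). The function name `guess_id_type`, the `known_symbols` set and the 0.5 cutoff are all assumptions for illustration. The regexes rely on human Ensembl gene IDs looking like `ENSG` plus 11 digits (optionally with a version suffix) and Entrez IDs being plain integers.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optional ".version" suffix.
ENSEMBL_HUMAN = re.compile(r"^ENSG\d{11}(\.\d+)?$")
# Entrez gene IDs are plain integers.
ENTREZ = re.compile(r"^\d+$")


def guess_id_type(names, known_symbols, min_fraction=0.5):
    """Guess whether row names are symbols, Ensembl IDs or Entrez IDs.

    known_symbols is a set of reference gene symbols (hypothetical input,
    e.g. taken from a locally saved annotation table). min_fraction is an
    assumed cutoff, not something fixed in this thread.
    """
    names = list(names)
    if not names:
        raise ValueError("no row names supplied")
    n = len(names)
    fractions = {
        "symbol": sum(x in known_symbols for x in names) / n,
        "ensembl": sum(bool(ENSEMBL_HUMAN.match(x)) for x in names) / n,
        "entrez": sum(bool(ENTREZ.match(x)) for x in names) / n,
    }
    best = max(fractions, key=fractions.get)
    # Anything without a clear majority is reported as unknown so the
    # caller can raise the error dialog.
    return best if fractions[best] > min_fraction else "unknown"
```

A caller would convert when the result is "ensembl" or "entrez", carry on unchanged for "symbol", and show the error dialog for "unknown".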

Three notes:

  1. Will need to check if it's human, if not -> error (only human supported currently)
  2. When mapping from Ensembl -> symbol, multiple Ensembl IDs can correspond to the same symbol, so there will be non-unique genes. I'd suggest handling this by summing counts and taking the mean of log counts. There should be a function in scuttle that does this
  3. Check out this R package I wrote for working out gene/organism format https://github.com/camlab-bioml/inferorg
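To make the duplicate handling in note 2 concrete, here is a small pure-Python sketch (illustration only, not the scuttle call). Matrices are lists of rows, one row per feature. The helper names `sum_by_symbol` and `mean_by_symbol` are made up for this example.

```python
from collections import Counter


def sum_by_symbol(matrix, symbols):
    """Sum rows that share a symbol (intended for raw counts)."""
    if len(matrix) != len(symbols):
        raise ValueError("need exactly one symbol per row")
    grouped = {}
    for row, sym in zip(matrix, symbols):
        if sym in grouped:
            grouped[sym] = [a + b for a, b in zip(grouped[sym], row)]
        else:
            grouped[sym] = list(row)
    return grouped


def mean_by_symbol(matrix, symbols):
    """Average rows that share a symbol (intended for log counts)."""
    sums = sum_by_symbol(matrix, symbols)
    n_rows = Counter(symbols)
    return {s: [v / n_rows[s] for v in row] for s, row in sums.items()}


counts = [[1, 0, 2], [3, 1, 0], [5, 5, 5]]
symbols = ["A", "A", "B"]  # two IDs mapped to symbol "A"
summed = sum_by_symbol(counts, symbols)
# summed == {"A": [4, 1, 2], "B": [5, 5, 5]}
```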
@Michael-Geuenich
Contributor

Thinking about the parse_rownames function:

  1. maybe parse_gene_names is a better name?
  2. when it comes to identifying if rownames(sce) are symbols, I was thinking of checking whether all the genes specified in the sce are present in annotables. This raises the question of what to do when only a subset are present: should we throw a warning specifying which are missing?

Also, I remember having an issue with annotables (a lot of histone genes suddenly disappeared between versions). I will file an issue on their GitHub just to be sure this keeps working. Otherwise it might be best to save a version of the grch38 dataframe from annotables locally so that we don't depend on any changes they make to their data.

@kieranrcampbell
Member Author

  1. yes. generally feel free to rename anything if it makes more sense
  2. i think just be sensible about it (remember don't have to justify any of this to reviewers). maybe something like:
  • if > half the genes are found, keep going, no error
  • if < half but > 100 genes are found, throw warning saying only a small number of genes could be matched (but i wouldn't list, because who wants to see a list of 1000s of genes)
  • if < 100 genes, throw error saying likely mismatch
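The three rules above could look something like this (Python used only to show the logic; `check_symbol_matches` is a made-up name). The bullets don't say what happens at exactly half or exactly 100 genes, so this sketch assumes exactly half falls through to the warning check and exactly 100 counts as an error. The rules are applied in the order listed, so a small panel where most genes match still passes.

```python
import warnings


def check_symbol_matches(n_found, n_total):
    """Return "ok" or "warn", or raise if the match looks like a mismatch."""
    if n_found * 2 > n_total:
        # More than half of the genes were found: keep going silently.
        return "ok"
    if n_found > 100:
        # Fewer than half, but still a usable number. Don't list them.
        warnings.warn(
            f"Only {n_found} of {n_total} genes could be matched to symbols."
        )
        return "warn"
    raise ValueError(
        f"Only {n_found} of {n_total} genes matched; likely an identifier mismatch."
    )
```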

@kieranrcampbell
Member Author

This package should have functions that will help with this

@Michael-Geuenich
Contributor

Ok, this is implemented and I've done some basic testing locally (though more probably needed).

One note: I needed the grch38 annotables annotation, which I didn't manage to import because it is a dataframe, so I've just saved it and committed it. This might be better anyway because then we will not be reliant on any updates they make to their package, but happy to change if you have a better idea.

@kieranrcampbell
Member Author

Yes like the idea of saving locally if it can reduce dependencies (also reduces potential for things to break later)

@Michael-Geuenich
Contributor

Potential issue: the parse_gene_names function uses the sumCountsAcrossFeatures function from scuttle, which takes an sce object and returns a matrix. This means I have to recreate the sce, and the recreated sce currently only contains one assay (logcounts). If this is an issue we will need to loop through all assays and run sumCountsAcrossFeatures for each one.
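If looping over assays turns out to be needed, the shape of it might be roughly the following. This is a plain-Python stand-in, not the scuttle API: assays are a dict of name -> list of rows, and `aggregate_assays` and the `mean_assays` argument are hypothetical. It assumes log-scale assays should be averaged rather than summed, as suggested earlier in the thread.

```python
def aggregate_assays(assays, ids, mean_assays=("logcounts",)):
    """Collapse duplicate row ids in every assay.

    Rows sharing an id are summed, except in assays named in mean_assays,
    where they are averaged. Returns (unique_ids, aggregated_assays) with
    ids kept in order of first appearance.
    """
    # Map each id to the row indices that carry it.
    groups = {}
    for i, gene in enumerate(ids):
        groups.setdefault(gene, []).append(i)

    out = {}
    for name, matrix in assays.items():
        if len(matrix) != len(ids):
            raise ValueError(f"assay {name!r} has the wrong number of rows")
        rows = []
        for idx in groups.values():
            # Column-wise sum over the rows in this group.
            combined = [sum(col) for col in zip(*(matrix[i] for i in idx))]
            if name in mean_assays:
                combined = [v / len(idx) for v in combined]
            rows.append(combined)
        out[name] = rows
    return list(groups), out
```

All assays then stay aligned on the same collapsed gene list, so the sce can be rebuilt with every assay rather than just logcounts.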
