
Speed up large sample/panel analysis #118

Open
kieranrcampbell opened this issue Mar 7, 2023 · 4 comments
@kieranrcampbell
Member

Using the lymph node Tabula Sapiens dataset (all cells) plus the uploaded list of ~100 CD markers, it takes ~3-4 minutes to produce the plots/analysis after clicking "run analysis", which I think is too long. Can profvis be run on this example to see what's taking so long (a rough sketch follows the options below)? We then have some options:

If the scoring is the bottleneck:

  • reduce the number of folds
  • choose a different algorithm

If UMAP is the bottleneck:

  • move it to a background computation

Other?
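Rough sketch of how the profvis run might look, assuming a hypothetical run_analysis() wrapper around the server-side pipeline (the real entry point in the app will differ):

library(profvis)

p <- profvis({
  run_analysis(sce = lymph_node_sce, panel = cd_panel)  # hypothetical call and objects
})

# save the interactive flame graph so it can be attached to this issue
htmlwidgets::saveWidget(p, "cytosel_profvis_lymph_node_cd.html")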

@matt-sd-watson
Collaborator

Ahead of the profvis run, I already know two of the more time-consuming processes here:

  • scran's findMarkers
  • the nnet scoring with k = 10 folds

For the first one, findMarkers scales roughly linearly with the number of categories: 27 cell types runs noticeably slower than 6 cell types (e.g. the Heart dataset). Parallelizing these steps is likely not viable because shinyapps never guarantees a certain number of cores on any server instance. I have tried guessing the number of available cores on a given instance, but any attempt to use more than 1 core blows up the memory on the server and is guaranteed to crash. With more than one user, I am certain that trying to parallelize the marker finding would crash everything.
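One way to make the single-core constraint explicit rather than guessed: findMarkers accepts a BPPARAM argument, so it can be pinned to SerialParam. A minimal sketch, assuming a SingleCellExperiment `sce` with a cell_type label column (names are illustrative):

library(scran)
library(BiocParallel)

# explicitly single-core so shinyapps instances with unknown core counts
# never over-commit memory
markers <- findMarkers(
  sce,
  groups  = sce$cell_type,   # assumed label column
  BPPARAM = SerialParam()
)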

For the second one, I also notice that "non-optimal" panels score much more slowly than cytosel-recommended panels; compare the speed of running the Lymph Node set without an uploaded list vs. with the CD set. We may want to try a different scorer, but this won't be a trivial replacement and will likely require a significant code rewrite for compatibility.

The real crux here is R's single-threaded nature. Combined with deployment on a server where instances are shared across users, we are limited in how much "parallel" computing we can do without 1. crashing everything completely, or 2. significantly slowing down shared instances. If we really need parallel computing and multiple threads, we will probably have to move away from shinyapps.

@kieranrcampbell
Member Author

Ok, thanks for the insights, this is really helpful. For findMarkers, I'd be surprised if we can speed it up much more, since Aaron Lun tends to write highly optimized code.

Two questions:

  1. What set of genes is put into findMarkers? If we can narrow this down, it may help.
  2. When is findMarkers re-run? If we can limit this, it may speed up subsequent runs.

For the panel scoring, I've been really happy that the scores approximately correspond to how good the panel is for each cell type, so I don't want to break this too much. Could you push a version that uses k = 5 rather than 10? This should speed things up.
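For reference, a minimal sketch of what the k = 5 change could look like if the scorer is a plain k-fold cross-validated nnet::multinom accuracy (the function and argument names here are illustrative, not the actual cytosel code):

library(nnet)

score_panel <- function(expr, labels, k = 5) {   # k = 5 instead of 10
  df <- data.frame(label = labels, expr, check.names = FALSE)
  folds <- sample(rep_len(seq_len(k), nrow(df)))
  acc <- vapply(seq_len(k), function(i) {
    fit <- multinom(label ~ ., data = df[folds != i, ],
                    trace = FALSE, MaxNWts = 100000)
    pred <- predict(fit, newdata = df[folds == i, ])
    mean(pred == df$label[folds == i])           # per-fold accuracy
  }, numeric(1))
  mean(acc)
}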

Looking at other solutions:

> system.time({nnet::multinom(y ~ ., data = df, trace = FALSE, MaxNWts = 100000)})
   user  system elapsed 
 23.014   0.132  23.235 
> system.time({Rfast::multinom.reg(y, x)})
## gave up because it took so long

not so Rfast...will keep looking

@matt-sd-watson matt-sd-watson self-assigned this Mar 8, 2023
@matt-sd-watson matt-sd-watson added the enhancement New feature or request label Mar 8, 2023
@matt-sd-watson
Collaborator

cytosel_profvis_lymph_node_cd_1_200_all_genes.zip

Uploaded a zip of the profvis profile for Lymph Node with 100+ CD markers. As discussed, restricting the genes passed to marker finding to protein-coding genes only can drastically speed up that step.
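A sketch of that filter, assuming the SingleCellExperiment's rowData carries a gene_biotype annotation (the actual column name in the cytosel objects may differ):

library(SingleCellExperiment)

# keep only protein-coding features before marker finding
keep <- which(rowData(sce)$gene_biotype == "protein_coding")
sce_coding <- sce[keep, ]

# marker finding now runs over roughly the ~20k protein-coding genes
# instead of the full feature space
markers <- scran::findMarkers(sce_coding, groups = sce_coding$cell_type)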

@matt-sd-watson
Collaborator

Analysis speed could be improved significantly by switching from scran's findMarkers to scoreMarkers, as detailed in #49.
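For comparison, the two calls side by side (column names follow the scran documentation; how cytosel consumes the output would need adapting):

library(scran)

# current: pairwise tests, one DataFrame of p-values/log-FCs per group
fm <- findMarkers(sce, groups = sce$cell_type)

# proposed: effect-size summaries (Cohen's d, AUC, logFC) without pairwise p-values
sm <- scoreMarkers(sce, groups = sce$cell_type)
head(sm[[1]][order(sm[[1]]$mean.AUC, decreasing = TRUE), ])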
