
Speed up large sample/panel analysis #118

Open
kieranrcampbell opened this issue Mar 7, 2023 · 4 comments
@kieranrcampbell
Member

Using the lymph node Tabula Sapiens dataset (all cells) plus the uploaded list of ~100 CD markers, it takes ~3-4 minutes to produce the plots/analysis after clicking "run analysis", which I think is too long. Can profvis be run on this example to see what's taking so long (a rough sketch follows the options below)? We then have some options:

If the scoring is the bottleneck:

  • reduce the number of folds
  • choose a different algorithm

If UMAP is the bottleneck:

  • move it to a background computation

Other?
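Rough sketch of how the profvis run might look, assuming a hypothetical run_analysis() wrapper around the server-side pipeline (the real entry point in the app will differ):

library(profvis)

p <- profvis({
  run_analysis(sce = lymph_node_sce, panel = cd_panel)  # hypothetical call and objects
})

# save the interactive flame graph so it can be attached to this issue
htmlwidgets::saveWidget(p, "cytosel_profvis_lymph_node_cd.html")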

@matt-sd-watson
Collaborator

Ahead of the profvis run, I already know two of the more time-consuming processes here:

  • scran's findMarkers
  • the nnet scoring with k = 10 folds

For the first one, findMarkers scales roughly linearly with the number of categories: 27 cell types runs noticeably slower than 6 cell types (e.g. the Heart dataset). Parallelizing these steps is likely not viable because shinyapps never guarantees a certain number of cores on any server instance. I have tried guessing the number of available cores on a given instance, but any attempt to use more than 1 core blows up the memory on the server and is guaranteed to crash. With more than one user, I am certain that trying to parallelize the marker finding would crash everything.
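One way to make the single-core constraint explicit rather than guessed: findMarkers accepts a BPPARAM argument, so it can be pinned to SerialParam. A minimal sketch, assuming a SingleCellExperiment `sce` with a cell_type label column (names are illustrative):

library(scran)
library(BiocParallel)

# explicitly single-core so shinyapps instances with unknown core counts
# never over-commit memory
markers <- findMarkers(
  sce,
  groups  = sce$cell_type,   # assumed label column
  BPPARAM = SerialParam()
)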

For the second one, I also notice that "non-optimal" panels score much more slowly than cytosel-recommended panels; compare the speed of running the Lymph Node set without an uploaded list vs. with the CD set. We may want to try a different scorer, but this won't be a trivial replacement and will likely require a significant code rewrite for compatibility.

The real crux here is R's single-threaded nature. Combined with deployment on a server where instances are shared across users, we are limited in how much "parallel" computing we can do without 1. crashing everything completely, or 2. significantly slowing down shared instances. If we really need parallel computing and multiple threads, we will probably have to move away from shinyapps.

@kieranrcampbell
Member Author

Ok, thanks for the insights, this is really helpful. For findMarkers, I'd be surprised if we can speed it up much more, since Aaron Lun tends to write highly optimized code.

Two questions:

  1. What set of genes is put into findMarkers? If we can narrow this down, it may help.
  2. When is findMarkers re-run? If we can limit this, it may speed up subsequent runs.

For the panel scoring, I've been really happy that the scores approximately correspond to how good the panel is for each cell type, so I don't want to break this too much. Could you push a version that uses k = 5 rather than 10? This should speed things up.
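For reference, a minimal sketch of what the k = 5 change could look like if the scorer is a plain k-fold cross-validated nnet::multinom accuracy (the function and argument names here are illustrative, not the actual cytosel code):

library(nnet)

score_panel <- function(expr, labels, k = 5) {   # k = 5 instead of 10
  df <- data.frame(label = labels, expr, check.names = FALSE)
  folds <- sample(rep_len(seq_len(k), nrow(df)))
  acc <- vapply(seq_len(k), function(i) {
    fit <- multinom(label ~ ., data = df[folds != i, ],
                    trace = FALSE, MaxNWts = 100000)
    pred <- predict(fit, newdata = df[folds == i, ])
    mean(pred == df$label[folds == i])           # per-fold accuracy
  }, numeric(1))
  mean(acc)
}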

Looking at other solutions:

> system.time({nnet::multinom(y ~ ., data = df, trace = FALSE, MaxNWts = 100000)})
   user  system elapsed 
 23.014   0.132  23.235 
> system.time({Rfast::multinom.reg(y, x)})
## gave up because it took so long

not so Rfast...will keep looking

@matt-sd-watson matt-sd-watson self-assigned this Mar 8, 2023
@matt-sd-watson matt-sd-watson added the enhancement New feature or request label Mar 8, 2023
@matt-sd-watson
Collaborator

cytosel_profvis_lymph_node_cd_1_200_all_genes.zip

Uploaded a zip of the profvis profile for Lymph Node with 100+ CD markers. As discussed, restricting the genes passed to marker finding to protein-coding genes only can drastically speed up that step.
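A sketch of that filter, assuming the SingleCellExperiment's rowData carries a gene_biotype annotation (the actual column name in the cytosel objects may differ):

library(SingleCellExperiment)

# keep only protein-coding features before marker finding
keep <- which(rowData(sce)$gene_biotype == "protein_coding")
sce_coding <- sce[keep, ]

# marker finding now runs over roughly the ~20k protein-coding genes
# instead of the full feature space
markers <- scran::findMarkers(sce_coding, groups = sce_coding$cell_type)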

@matt-sd-watson
Collaborator

Analysis speed could be improved significantly by switching from scran's findMarkers to scoreMarkers, as detailed in #49.
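For comparison, the two calls side by side (column names follow the scran documentation; how cytosel consumes the output would need adapting):

library(scran)

# current: pairwise tests, one DataFrame of p-values/log-FCs per group
fm <- findMarkers(sce, groups = sce$cell_type)

# proposed: effect-size summaries (Cohen's d, AUC, logFC) without pairwise p-values
sm <- scoreMarkers(sce, groups = sce$cell_type)
head(sm[[1]][order(sm[[1]]$mean.AUC, decreasing = TRUE), ])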
