-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
glPca order of magnitude faster #150
base: master
Are you sure you want to change the base?
Conversation
merge upstream
Sounds great, many thanks for this. It would be great to integrate this to Another reform would be to use .Call to avoid copying the data, but maybe Would you like to PR the integration above? No worries if not, but I'll In any case, thanks again for a welcome tweak! On 14 Aug 2016 21:51, "Libor Mořkovský" [email protected] wrote:
|
It was kind of a 'weekend project', I don't really have time for a proper refactoring. It would probably make sense to drop some parts of the code - eg. the parallel R implementation - to keep the code simple. If the C code will survive, the NA handling needs some work - probably replace the NA list by a bitmap which can be accessed by one pointer addition and masking or do the NA handling beforehand. Maybe I'd find some time to do bits of it if we agree on particular approach. |
Hi there, |
The failure is due to a lack of documentation for the new function. |
@thibautjombart how relevant is this PR 8 years later? I'm cleaning up my PR list, and would like to get rid of this. Is that change (and the whole package) still relevant and some bits of my low-effort help would be worth it, or noone cares any more and I'll just close the PR? Excuse my silly questions, I've left the field a few years ago.. |
I've been the maintainer of this since the pandemic and I do think it's still worthwhile incorporating, especially given these values. If it's alright with you, I can merge it into a separate branch and then work on it from there. I will also add you as an author to the description as "Libor Mořkovský". |
@zkamvar sure, do anything you need to get this moving. tag me, should you need any assistance! |
For 'wide' data sets (~100 individuals, <100k SNPs) - especially with >100k NAs,
glPca
gets really slow - minutes to hours. The reason is twofold - it either uses cross product implementation with the 'hot loop' in R, or a C implementation with very suboptimal handling of NAs (linear search for NA positions insnpbin.c::snpbin_isna
).This PR introduces a new function
glPcaFast
, which cuts the processing time to seconds. It tries to be as compatible as possible with the original. Most of the code is a copy of the original function, only the cross product loop was replaced.It prepares the data as a full
matrix
, scaled and centered, with NAs replaced. Memory size of the matrix grows linearly with the size of thegenlight
object (eg for 160 MB vcf, 6 MB genlight it's ~60 MB).glPcaFast
then usesbase::tcrossprod
to calculate the result in an efficient way. It's trading memory usage for speed and code simplicity, because 100 MB RAM is close to nothing nowadays.Ideally this would be tested more thoroughly, merged with the original
glPca
and used as a default unless the data set is really huge. But I didn't want to break some other code possibly depending on some particular behavior of the originalglPca
and would prefer to leave it on the author.