glPca order of magnitude faster #150

Status: Open — wants to merge 2 commits into base: master

Conversation

@libor-m (Contributor) commented Aug 14, 2016

For 'wide' data sets (~100 individuals, <100k SNPs), and especially with >100k NAs, glPca gets really slow: minutes to hours. The reason is twofold: it either uses a cross-product implementation with the 'hot loop' in R, or a C implementation with very suboptimal NA handling (a linear search for NA positions in snpbin.c::snpbin_isna).

This PR introduces a new function, glPcaFast, which cuts the processing time to seconds. It tries to be as compatible as possible with the original: most of the code is copied from the original function, and only the cross-product loop was replaced.
glPcaFast prepares the data as a full matrix, scaled and centered, with NAs replaced. The memory footprint of this matrix grows linearly with the size of the genlight object (e.g. for a 160 MB VCF and a 6 MB genlight object it is ~60 MB). The function then uses base::tcrossprod to compute the cross product efficiently. It trades memory usage for speed and code simplicity, since 100 MB of RAM is close to nothing nowadays.
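The approach described above can be sketched as follows. This is an illustrative numpy analogue of the R idea (center/scale, impute NAs, then take the small n-by-n cross product), not the PR's actual code; the matrix sizes and variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: 100 individuals x 1000 SNPs, values 0/1/2,
# with ~10% missing entries (a stand-in for a genlight object).
X = rng.integers(0, 3, size=(100, 1000)).astype(float)
X[rng.random(X.shape) < 0.10] = np.nan

# Center each SNP on its mean; after centering, replacing NAs
# with 0 is equivalent to mean imputation.
mu = np.nanmean(X, axis=0)
Xc = np.where(np.isnan(X), 0.0, X - mu)

# The expensive step: the n x n cross product. In R this is
# tcrossprod(Xc), i.e. Xc %*% t(Xc), handled by optimized BLAS.
allProd = Xc @ Xc.T / X.shape[0]

# Eigendecomposition of the small symmetric n x n matrix yields
# the principal components; eigh returns eigenvalues ascending,
# so reverse for the usual PCA ordering.
eigval, eigvec = np.linalg.eigh(allProd)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
```

The key point is that the loop over individuals disappears: the whole cross product is one dense matrix multiplication, which is why the R-level 'hot loop' and the per-element NA lookups go away.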

Ideally this would be tested more thoroughly, merged with the original glPca, and used as the default unless the data set is really huge. But I didn't want to break other code that may depend on some particular behavior of the original glPca, so I would prefer to leave that to the author.

@thibautjombart (Owner) commented Aug 14, 2016
Sounds great, many thanks for this. It would be great to integrate this into the main function and add an argument to glPca specifying what is optimized (RAM or speed, defaulting to the new version).

Another improvement would be to use .Call to avoid copying the data, but maybe for another time.

Would you like to PR the integration above? No worries if not, but I'll probably only get to work on it in a couple of weeks.

In any case, thanks again for a welcome tweak!


@libor-m (Contributor, Author) commented Aug 15, 2016

It was kind of a 'weekend project'; I don't really have time for a proper refactoring. It would probably make sense to drop some parts of the code, e.g. the parallel R implementation, to keep the code simple. If the C code is to survive, the NA handling needs some work: probably replace the NA list with a bitmap that can be queried with one pointer addition and a mask, or handle the NAs beforehand.
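The bitmap idea mentioned above can be illustrated as follows. This is a hypothetical Python sketch only; the real fix would be C code inside snpbin.c, and the function names here are invented for illustration:

```python
def build_na_bitmap(na_positions, n_loci):
    """Pack NA positions into a byte array, one bit per locus."""
    bitmap = bytearray((n_loci + 7) // 8)
    for pos in na_positions:
        bitmap[pos >> 3] |= 1 << (pos & 7)
    return bitmap

def is_na(bitmap, pos):
    """O(1) lookup: one index computation plus a bit mask,
    instead of the linear scan over the NA list that
    snpbin_isna currently performs."""
    return bool(bitmap[pos >> 3] & (1 << (pos & 7)))

bm = build_na_bitmap([0, 5, 130], n_loci=200)
# is_na(bm, 5) -> True; is_na(bm, 6) -> False
```

The trade-off is one bit of memory per locus in exchange for constant-time NA tests, which removes the quadratic behavior when there are many NAs.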

Maybe I'd find some time to do bits of it if we agree on a particular approach.

@thibautjombart (Owner) commented

Hi there,
here's a poke about this. I am keen to merge it, but Travis fails. Any lead on this?

@zkamvar (Collaborator) commented Apr 13, 2017

The failure is due to a lack of documentation for the new function.

@libor-m (Contributor, Author) commented Sep 25, 2024

@thibautjombart how relevant is this PR 8 years later? I'm cleaning up my PR list and would like to get rid of this.

Is the change (and the whole package) still relevant, so that some bits of my low-effort help would be worth it, or does no one care any more and I should just close the PR?

Excuse the silly questions, I left the field a few years ago.

@zkamvar (Collaborator) commented Sep 27, 2024

I've been the maintainer of this package since the pandemic, and I do think it's still worth incorporating, especially given these performance gains.

If it's alright with you, I can merge it into a separate branch and then work on it from there. I will also add you as an author in the DESCRIPTION as "Libor Mořkovský".

@libor-m (Contributor, Author) commented Sep 30, 2024

@zkamvar sure, do anything you need to get this moving. Tag me should you need any assistance!
