glPca order of magnitude faster #150

Status: Open — wants to merge 2 commits into base: master

Conversation

@libor-m (Contributor) commented Aug 14, 2016

For 'wide' data sets (~100 individuals, <100k SNPs), and especially with >100k NAs, glPca gets really slow: minutes to hours. The reason is twofold: it either uses a cross-product implementation with the 'hot loop' in R, or a C implementation with very suboptimal NA handling (a linear search for NA positions in snpbin.c::snpbin_isna).

This PR introduces a new function, glPcaFast, which cuts the processing time to seconds. It tries to be as compatible as possible with the original: most of the code is copied from the original function, and only the cross-product loop was replaced.
glPcaFast prepares the data as a full matrix, scaled and centered, with NAs replaced. The memory footprint of this matrix grows linearly with the size of the genlight object (e.g. for a 160 MB VCF and a 6 MB genlight object it is ~60 MB). The function then uses base::tcrossprod to compute the cross product efficiently. It trades memory usage for speed and code simplicity, since 100 MB of RAM is close to nothing nowadays.
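The approach described above can be sketched as follows. This is an illustrative numpy analogue of the R idea (center/scale, impute NAs, then take the small n-by-n cross product), not the PR's actual code; the matrix sizes and variable names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy genotype matrix: 100 individuals x 1000 SNPs, values 0/1/2,
# with ~10% missing entries (a stand-in for a genlight object).
X = rng.integers(0, 3, size=(100, 1000)).astype(float)
X[rng.random(X.shape) < 0.10] = np.nan

# Center each SNP on its mean; after centering, replacing NAs
# with 0 is equivalent to mean imputation.
mu = np.nanmean(X, axis=0)
Xc = np.where(np.isnan(X), 0.0, X - mu)

# The expensive step: the n x n cross product. In R this is
# tcrossprod(Xc), i.e. Xc %*% t(Xc), handled by optimized BLAS.
allProd = Xc @ Xc.T / X.shape[0]

# Eigendecomposition of the small symmetric n x n matrix yields
# the principal components; eigh returns eigenvalues ascending,
# so reverse for the usual PCA ordering.
eigval, eigvec = np.linalg.eigh(allProd)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
```

The key point is that the loop over individuals disappears: the whole cross product is one dense matrix multiplication, which is why the R-level 'hot loop' and the per-element NA lookups go away.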

Ideally this would be tested more thoroughly, merged with the original glPca, and used as the default unless the data set is really huge. But I didn't want to break other code that may depend on some particular behavior of the original glPca, so I would prefer to leave that to the author.

@thibautjombart (Owner) commented Aug 14, 2016
Sounds great, many thanks for this. It would be great to integrate this into the main function and add an argument to glPca specifying what is optimized (RAM or speed, defaulting to the new version).

Another improvement would be to use .Call to avoid copying the data, but maybe for another time.

Would you like to PR the integration above? No worries if not, but I'll probably only get to work on it in a couple of weeks.

In any case, thanks again for a welcome tweak!


@libor-m (Contributor, Author) commented Aug 15, 2016

It was kind of a 'weekend project'; I don't really have time for a proper refactoring. It would probably make sense to drop some parts of the code, e.g. the parallel R implementation, to keep the code simple. If the C code is to survive, the NA handling needs some work: probably replace the NA list with a bitmap that can be queried with one pointer addition and a mask, or handle the NAs beforehand.
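The bitmap idea mentioned above can be illustrated as follows. This is a hypothetical Python sketch only; the real fix would be C code inside snpbin.c, and the function names here are invented for illustration:

```python
def build_na_bitmap(na_positions, n_loci):
    """Pack NA positions into a byte array, one bit per locus."""
    bitmap = bytearray((n_loci + 7) // 8)
    for pos in na_positions:
        bitmap[pos >> 3] |= 1 << (pos & 7)
    return bitmap

def is_na(bitmap, pos):
    """O(1) lookup: one index computation plus a bit mask,
    instead of the linear scan over the NA list that
    snpbin_isna currently performs."""
    return bool(bitmap[pos >> 3] & (1 << (pos & 7)))

bm = build_na_bitmap([0, 5, 130], n_loci=200)
# is_na(bm, 5) -> True; is_na(bm, 6) -> False
```

The trade-off is one bit of memory per locus in exchange for constant-time NA tests, which removes the quadratic behavior when there are many NAs.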

Maybe I'd find some time to do bits of it if we agree on a particular approach.

@thibautjombart (Owner) commented

Hi there,
here's a poke about this. I am keen to merge it, but Travis fails. Any lead on this?

@zkamvar (Collaborator) commented Apr 13, 2017

The failure is due to a lack of documentation for the new function.

@libor-m (Contributor, Author) commented Sep 25, 2024

@thibautjombart how relevant is this PR 8 years later? I'm cleaning up my PR list and would like to get rid of this.

Is the change (and the whole package) still relevant, so that some bits of my low-effort help would be worth it, or does no one care any more and I should just close the PR?

Excuse the silly questions, I left the field a few years ago.

@zkamvar (Collaborator) commented Sep 27, 2024

I've been the maintainer of this package since the pandemic, and I do think it's still worth incorporating, especially given these performance gains.

If it's alright with you, I can merge it into a separate branch and then work on it from there. I will also add you as an author in the DESCRIPTION as "Libor Mořkovský".

@libor-m (Contributor, Author) commented Sep 30, 2024

@zkamvar sure, do anything you need to get this moving. Tag me should you need any assistance!
