
Does Not Run on Large Datasets when using python 3.7 #6

Open
MattScicluna opened this issue Apr 12, 2022 · 7 comments · May be fixed by #7


MattScicluna commented Apr 12, 2022

Hi,
I am not able to use Multiscale PHATE with Python 3.7.

Reproducible example:

virtualenv --no-download phate_test_env # change path to env as needed
source phate_test_env/bin/activate # load env
pip install git+https://github.com/KrishnaswamyLab/Multiscale_PHATE
python

import numpy as np
from multiscale_phate import multiscale_phate
data = np.random.uniform(size=(5000000,200)) # make sure its big
msphate_obj = multiscale_phate.Multiscale_PHATE(n_pca=None, n_jobs=48, knn=200)
msphate_obj.fit(data)

stderr is:

  File "[...]/lib/python3.7/site-packages/multiscale_phate/multiscale_phate.py", line 158, in fit
    self.hash = utils.hash_object(X)
  File "[...]/lib/python3.7/site-packages/multiscale_phate/utils.py", line 18, in hash_object
    return hash(pickle.dumps(X))
OverflowError: cannot serialize a bytes object larger than 4 GiB

edited to make data shape more realistic
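
For context, the array in the repro holds 5,000,000 × 200 float64 values, i.e. 8 × 10⁹ bytes (~7.5 GiB) of raw data, so `pickle.dumps` at Python 3.7's default protocol (3) hits its 4 GiB per-object limit. A hedged sketch of one possible workaround (not necessarily what the linked PR does): stream the pickle into an incremental hash instead of materializing the whole bytes object, and use protocol 4, which lifts the 4 GiB limit. The `hash_object` name mirrors the function in the traceback; the streaming writer class is my own illustration.

```python
import hashlib
import pickle


class _HashWriter:
    """Minimal file-like object that feeds written bytes into a streaming hash."""

    def __init__(self):
        self.hash = hashlib.sha256()

    def write(self, data):
        self.hash.update(data)


def hash_object(X):
    # Stream the pickle directly into the hash so we never hold a
    # multi-gigabyte bytes object in memory; protocol 4 also removes the
    # 4 GiB per-object limit that protocol 3 has.
    writer = _HashWriter()
    pickle.dump(X, writer, protocol=4)
    return writer.hash.hexdigest()
```

`pickle.dump` accepts any object with a `write` method, so the pickled stream is consumed chunk by chunk rather than buffered whole.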

scottgigante added a commit that referenced this issue Apr 12, 2022
@scottgigante scottgigante linked a pull request Apr 12, 2022 that will close this issue

scottgigante commented Apr 12, 2022

@MattScicluna could you please install from branch scottgigante/bugfix/pickle and see if this solves your issue?

@MattScicluna (Author) commented:

Thanks for the prompt response. I will try this and get back to you.

@MattScicluna (Author) commented:

Hi Scott,
Thanks, this does solve the OverflowError.

@MattScicluna (Author) commented:

@scottgigante I tried running the code on a real dataset of comparable size. Unfortunately, there is now a different bug.
During the Calculating partitions... step, I get the following error:

--------------------------------------------------------------------------------
LokyProcess-[...] failed with traceback: 
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "[...]/lib/python3.7/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
    prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5

I checked that the code works in Python 3.8.
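
Background on this error: pickle protocol 5 (PEP 574, out-of-band buffers) was added in Python 3.8, so `pickle.HIGHEST_PROTOCOL` is 4 on 3.7 and a 3.7 worker's stdlib `pickle.load` cannot read protocol-5 payloads. A minimal defensive sketch (illustrative, not the library's actual code; the `safe_dumps` name is my own) clamps the protocol to what the running interpreter supports:

```python
import pickle


def safe_dumps(obj, preferred_protocol=5):
    # Protocol 5 exists only on Python >= 3.8; clamp to this
    # interpreter's maximum so the payload stays readable by the
    # same interpreter version on the other side.
    protocol = min(preferred_protocol, pickle.HIGHEST_PROTOCOL)
    return pickle.dumps(obj, protocol=protocol)
```

This only helps when the writing side is under your control; here the protocol-5 payload comes from inside loky, which is what the workaround discussed below addresses.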

@scottgigante (Collaborator) commented:

Interesting. So protocol 5 is supported by Python 3.7 but not by joblib. I've added a change to the same branch; let me know if that helps.

@MattScicluna (Author) commented:

Hi Scott,
It gave me the same error.
I am wondering if loky is defaulting to cloudpickle (which could be using protocol 5)?
I added a couple of lines to compress.py.

At the beginning of compress.py:
from joblib.externals.loky import set_loky_pickler

I put this just before the while np.max(cluster_counts) > ...: loop:
set_loky_pickler('pickle')

Those lines seemed to fix the bug, but I am not 100% sure of the side effects...
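
A self-contained sketch of that workaround (assuming joblib with its vendored loky backend is installed; the toy `Parallel` call is illustrative, not the library's actual loop): forcing loky to use the stdlib pickler keeps worker payloads at a protocol the Python 3.7 child interpreter can decode.

```python
from joblib import Parallel, delayed
from joblib.externals.loky import set_loky_pickler

# Force the stdlib pickler for inter-process transport instead of
# cloudpickle, which may emit protocol-5 frames a 3.7 worker cannot read.
# Trade-off: stdlib pickle cannot serialize lambdas or interactively
# defined functions, so dispatched callables must be importable by name.
set_loky_pickler("pickle")

# joblib defaults to the loky backend, so this exercises the setting above.
results = Parallel(n_jobs=2)(delayed(abs)(i) for i in [-2, -1, 0, 1])
```

The trade-off in the comments above is the main side effect to watch for: anything dispatched to workers must now be picklable by the stdlib, which is true for module-level functions but not for closures or lambdas.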

@scottgigante (Collaborator) commented:

@MattScicluna I did some reading and it looks like that's a good solution. I've put it on that same branch; let me know.
