
Performance drop when using Scikit-learn #58

Open
VNDRN opened this issue Apr 20, 2020 · 3 comments
Labels
question Further information is requested

Comments

@VNDRN

VNDRN commented Apr 20, 2020

When training Scikit-learn models, I noticed that the code finishes significantly faster in a normal notebook cell. When running the same function in a JupyterDash app, it takes up to 20% longer to finish.

I tested it with random forest regression and support vector regression models using multiprocessing. Can JupyterDash not use multiprocessing? Or does a regular code cell in a notebook have more resources than code run within a JupyterDash app?

@GibbsConsulting
Owner

Are you running the code inside a callback?

There is nothing special about JupyterDash, although it does add a listener on a port for the Dash callbacks; this runs on the (default) asyncio event loop. I don't know whether Scikit-learn changes or modifies this at all.

Are you in a position to share an example?

@VNDRN
Author

VNDRN commented Apr 26, 2020

The function that runs the training of the model is declared outside of the app. It is, however, called from within a callback. I will try to provide an example, but I have to redact some info due to an NDA with my workplace.

from datetime import datetime
import multiprocessing

from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def forestTrainer(amount):
    t1 = datetime.now()
    model = RandomForestRegressor(n_estimators=amount)

    def trainPeriod(i):
        # train_data/train_target etc. are redacted; fit expects
        # features and targets, and mean_absolute_error expects
        # true values and predictions
        model.fit(train_data, train_target)
        test_predict = model.predict(test_data)
        return mean_absolute_error(test_target, test_predict)

    num_cores = multiprocessing.cpu_count()
    scores = Parallel(n_jobs=num_cores)(delayed(trainPeriod)(i) for i in range(1, len(data)))
    t2 = datetime.now() - t1
    return "The average MAE over all experiments is {}, time is {}".format(
        round(float(sum(scores)) / len(scores), 2), t2)

When used in JupyterDash, I call the function like this

def update_output(n_clicks):
    if n_clicks is None:
        raise PreventUpdate
    return forestTrainer(amount)

@GibbsConsulting
Owner

The multiprocessing module uses separate processes to do the calculation. I don't know to what extent this is affected by how you call it.

As a test, are you also able to measure the relative calculation times if you don't use the multiprocessing Parallel feature?
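A minimal sketch of one way to do that comparison, using only the standard library: `train_once` below is a hypothetical CPU-bound stand-in for a single fit/predict cycle (the real training call is redacted above), and `timeit.repeat` runs it sequentially several times so the best-of-N timing can be compared against the Parallel version and against a plain notebook cell.

```python
import timeit

def train_once():
    # Hypothetical stand-in for one model.fit/model.predict cycle;
    # swap in the real training call when measuring for real.
    total = 0
    for i in range(100_000):
        total += i * i
    return total

# Run the same workload several times, sequentially, and keep the
# best-of-N timing to reduce scheduler noise.
timings = timeit.repeat(train_once, number=1, repeat=5)
best = min(timings)
print(f"best sequential time: {best:.6f}s over {len(timings)} runs")
```

Running the same snippet once in a plain cell and once from inside the callback should show whether the slowdown is tied to Parallel or to the callback context itself.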

@GibbsConsulting GibbsConsulting added the question Further information is requested label May 8, 2020
2 participants