Insane memory usage #32
Comments
Hi @BEFH...
Hey @BEFH, sounds like you have a lot of data on your hands. Nice! Are you attempting to use one of our trained models or are you training your own?
@weekend37 hi,
@weekend37, I'm training my own. It's microarray, so I need to filter the reference to get good overlap. Doing this seems to help, along with splitting the target dataset into multiple sets of samples. I just want to be sure doing this is not hurting the quality of the reference or how good the accuracy estimates are.
Hi @BEFH, would you be able to share with us the config.yaml file that you are using (if you are using a custom config)? If not, please let us know which default config you are using. We can explore some of the config options that lower your memory requirements.
For me, I ended up pruning my data from 1.5 million markers to 300k…
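The reference filtering mentioned in the comments above can be sketched roughly as follows. This is only an illustration, not part of gnomix: the helper names `read_sites`, `shared_sites`, and `filter_to_sites` are hypothetical, and the code assumes plain-text, tab-separated VCF lines that are matched on chromosome, position, REF, and ALT. On real files a dedicated tool such as bcftools would normally do this job.

```python
def site_key(line):
    """Return (chrom, pos, ref, alt) for one tab-separated VCF record line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def read_sites(vcf_lines):
    """Collect site keys from VCF text lines, skipping headers and blank lines."""
    return {
        site_key(line)
        for line in vcf_lines
        if line.strip() and not line.startswith("#")
    }


def shared_sites(reference_lines, target_lines):
    """Sites present in both the reference and the target (hypothetical helper)."""
    return read_sites(reference_lines) & read_sites(target_lines)


def filter_to_sites(vcf_lines, keep):
    """Yield header lines unchanged and only those records whose site is in `keep`."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
        elif line.strip() and site_key(line) in keep:
            yield line
```

Filtering both the reference and the target to `shared_sites(...)` before training keeps the marker set consistent between the two; whether this affects accuracy estimates is the open question in this thread.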
I have just had a gnomix run die after attempting to use more than 1.4 TB of memory. Yes, terabytes. These are unimputed GSA microarray data phased using Eagle. I am fitting the model myself using the suggested microarray configuration, and for now, I am only calculating local ancestry on chromosome 17. Based on reference overlap even after generating the local model, it looks like I will either need to filter the reference before model generation or impute the data.
I suspect the issue is partially sample size. I have 31,705 samples in that cohort. I am also running it on a GDA cohort (10,859 samples) and another cohort of 13 samples, and it did not die on the smallest cohort. I have a couple of questions on how to optimize this:
Firstly, it appears that the model generation only uses the reference dataset and not the sample to which it will be applied. I wrote a script to compare the models generated with different datasets and they appear to be identical. Is that the case? I ran without calibration; does the same hold with calibration enabled?
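The comparison script itself is not shown in the thread. One minimal way to run such a check, assuming each trained model is saved as a single file on disk, is to compare file digests. The helper names `file_sha256` and `same_bytes` below are hypothetical and use only the Python standard library.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b):
    """True when the two files are byte-for-byte identical."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Matching digests mean the saved files are identical. Differing digests do not prove the models differ, since serialization can embed details unrelated to the fitted parameters, so a mismatch would need a closer look at the model contents.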
Secondly, is there any problem with first generating the models, then applying them to all of the different datasets? Do you have a recommendation for a minimal dataset to use so that model generation runs as fast as possible?
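If a model is trained once and then applied to each dataset, the large cohort can also be processed in smaller groups of samples to cap peak memory. The sketch below shows only the batching step. The function name `batch_samples` and the batch size of 5,000 are illustrative assumptions, and it presumes that the result for one sample does not depend on which other samples share its batch, which the thread does not confirm.

```python
def batch_samples(sample_ids, batch_size):
    """Split a list of sample IDs into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        sample_ids[start:start + batch_size]
        for start in range(0, len(sample_ids), batch_size)
    ]


# Example with the cohort size from this issue; the IDs are placeholders.
cohort = [f"sample_{i}" for i in range(31705)]
batches = batch_samples(cohort, 5000)
print(len(batches), len(batches[-1]))
```

Each batch of IDs could then be used to subset the target file before running inference with the already trained model.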