v290 reprocessing killed #783

Open
clairem789 opened this issue Oct 3, 2024 · 20 comments

Comments

@clairem789
Collaborator

After about 6-12 h of processing, I found the terminal showing "Killed: 9" and everything terminated.
According to Google, this means the application received a signal...
Not sure what to do or when this will show up again. Any clue? Maybe a memory leak?
I'll start it again with an eye on the activity panel to check memory.
thanks

@clairem789
Collaborator Author

Possibly related to multiprocessing options once more, but I do not find those options anymore. Have there been changes on this topic from v289?
thanks

@njcuk9999
Owner

Yeah, these Killed: 9 can be anything... from a command like killall -9 python or kill -9 {pid}, or from a memory issue, or many others - it's something "outside" python, hence no python error: python is just killed.
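To illustrate why no Python error appears, here is a minimal sketch (not APERO code) that starts a child Python process and sends it signal 9 (SIGKILL) on a POSIX system. The helper name `run_and_kill` is hypothetical. The child gets no chance to raise an exception or print a traceback; the parent only sees a negative return code.

```python
import signal
import subprocess
import sys


def run_and_kill():
    """Start a sleeping child process, SIGKILL it, and return its exit code."""
    # The child would sleep for a minute if left alone.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # Same signal as `kill -9 {pid}`; it cannot be caught by the child.
    child.send_signal(signal.SIGKILL)
    child.wait()
    # On POSIX a process ended by a signal reports minus the signal number.
    return child.returncode


returncode = run_and_kill()
print("child return code:", returncode)
```

A return code of minus the signal number only says that something sent the signal; it does not say whether it was a user command or the operating system reacting to memory pressure.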

As for the multiprocessing the options are still there (they were never there by default I believe)

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
# 	when parallelising recipes
#	dtype=string default=process
#	options = pool, process
REPROCESS_MP_TYPE = process

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
# 	when validating recipes
#	dtype=string default=process
#	options = linear, pool, process, pathos
REPROCESS_MP_TYPE_VAL = process

You'll have to check your old setup but I think the REPROCESS_MP_TYPE_VAL on your machine had to be set to linear?

@clairem789
Collaborator Author

My past notes are not precise enough unfortunately. I thought that it worked with the default options in the last version but I may be wrong. I restarted with the pool option (for VAL). If it fails again I'll try linear.

@larnoldgithub
Collaborator

I'm also moving to 290 and was wondering if REPROCESS_MP_TYPE_VAL should be set to 'linear', as it's 'process' by default. 'process' seems to work for the mini data set, but for the full set of data? I'll set the kw to linear.

@larnoldgithub
Collaborator

I had set it to linear since the beginning with the 288, as it failed with 'process'. What does 'pool' do?

@clairem789
Collaborator Author

clairem789 commented Oct 4, 2024

Status here: I tried all three options, and had a memory leak with all of them, even with:
REPROCESS_MP_TYPE_VAL.value = 'linear'
(in apero-drs/apero/core/instruments/spirou/default_constants.py)
Unfortunately I don't know what to try next.

@larnoldgithub
Collaborator

larnoldgithub commented Oct 4, 2024

I launched apero_precheck after a fresh installation of the 290 and I can see the memory usage linearly increasing while APERO is updating the index db.
[Screenshot 2024-10-03 at 19 41 05]

@clairem789
Collaborator Author

@luc, this is with any of the options for REPROCESS_MP_TYPE_VAL.value?

@larnoldgithub
Collaborator

larnoldgithub commented Oct 4, 2024

linear
But the memory increase above is before anything with the processing: it's during the index db update with apero_precheck.
Is it an issue with MySQL? Claire, when did your crash occur: during preprocessing?

@clairem789
Collaborator Author

My crashes come at the validation process like you. It was the case before (v284) with the "process" option but I thought it had been fixed at the 288 version.

@njcuk9999
Owner

> I had set it to linear since the beginning with the 288, as it failed with 'process'. What does 'pool' do?

process and pool are just two different ways of multiprocessing: https://stackoverflow.com/questions/18176178/python-multiprocessing-process-or-pool-for-what-i-am-doing
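As a rough illustration of the three modes (a generic sketch with Python's `multiprocessing`, not how APERO implements them), the same work can be run linearly, with one `Process` per job, or through a `Pool` with a fixed number of workers. All names here (`reduce_one`, `run_linear`, `run_process`, `run_pool`) are hypothetical. The "fork" start method is chosen only so the sketch runs without a `__main__` guard on Linux/macOS; it is not available on Windows.

```python
import multiprocessing as mp

# Assumption: "fork" start method (POSIX only), see the note above.
CTX = mp.get_context("fork")


def reduce_one(night):
    """Hypothetical unit of work, e.g. one recipe run on one night."""
    return night, len(night)


def _worker(night, queue):
    # Each process pushes its own result onto a shared queue.
    queue.put(reduce_one(night))


def run_linear(nights):
    """'linear': one job after another in the current process."""
    return [reduce_one(night) for night in nights]


def run_process(nights):
    """'process': start one Process per job and collect the results."""
    queue = CTX.Queue()
    procs = [CTX.Process(target=_worker, args=(n, queue)) for n in nights]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    results = [queue.get() for _ in procs]
    for proc in procs:
        proc.join()
    return sorted(results)


def run_pool(nights, cores=2):
    """'pool': a fixed set of workers shares the list of jobs."""
    with CTX.Pool(processes=cores) as pool:
        return pool.map(reduce_one, nights)


nights = ["2024-10-01", "2024-10-02", "2024-10-03"]
print(run_linear(nights))
```

All three return the same results; they differ in how many processes exist at once and therefore in speed and peak memory, which is why the linear mode is the slower but more conservative choice.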

> Possibly related to multiprocessing options once more, but I do not find those options anymore. Have there been changes on this topic from v289?

There have been no changes related to this, but it's a complex web (one which I'm definitely simplifying for v0.8)

> My crashes come at the validation process like you. It was the case before (v284) with the "process" option but I thought it had been fixed at the 288 version.

This was never "fixed" as I still never got to the bottom of what was causing it - the "linear" option used to fix it for you so that seemed "enough" to have that (slower) option.

@larnoldgithub and @clairem789 can you both try with v0.7.289 and v0.7.288 and verify that the problem comes from v0.7.290? I'll have to go through all changes line-by-line to see what changed that could possibly affect it.

@clairem789
Collaborator Author

So I guess that the issue is not seen on UdM machines...?
I could try to do this test with 288, doing:
git checkout v0.7.288-stable-test
replacing process by linear in the default_constants for VAL
processing the complete run, and checking the memory usage during the first hours.
Will keep you informed!

@njcuk9999
Owner

njcuk9999 commented Oct 4, 2024

You never should be changing the default_constants!!

Please use the user_config.ini and user_constants.ini files....

You can read the default_constants and default_config to look for constants to change, but all changes should be added to the user_config.ini or user_constants.ini file (in your setup directory). The values in user_xxx.ini will always overwrite default_xxx.py, and you won't be able to change branch if you modify the python files - so please don't do that!

i.e. add the following to user_constants.ini

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
# 	when parallelising recipes
#	dtype=string default=process
#	options = pool, process
REPROCESS_MP_TYPE = process

# Define whether to use multiprocess "pool" or "process" or use "linear" mode
# 	when validating recipes
#	dtype=string default=process
#	options = linear, pool, process, pathos
REPROCESS_MP_TYPE_VAL = linear

@njcuk9999
Owner

> So I guess that the issue is not seen on UdM machines...?

I haven't seen any such issues with NIRPS or SPIRou - though both machines have 300+GB of RAM.

NIRPS is only doing daily processing and I haven't done a large run for either NIRPS or SPIRou.

@clairem789
Collaborator Author

OK, sorry for my mistake in changing the wrong file. I'm doing this 2-3 times a year, not enough to remember all small details. It would be nice if it could explain it all, actually!
The NW machine also has >300 GB but that's not enough.
So at UdeM you haven't run the 290 version on the whole Spirou data set then?

@njcuk9999
Owner

Not a full reduction - the last runs have been done with v0.7.290 (though this is not as recommended as doing a full re-run) but again processing a single run may not show this issue as badly as redoing everything.

@larnoldgithub
Collaborator

> OK, sorry for my mistake in changing the wrong file. I'm doing this 2-3 times a year, not enough to remember all small details. It would be nice if it could explain it all, actually! The NW machine also has >300 GB but that's not enough. So at UdeM you haven't run the 290 version on the whole Spirou data set then?

I made the same mistake in the past... You should have somewhere a folder like .../config/myprofile/ where myprofile is the name of your 'installation' of apero, like offline290. In this folder you have a bunch of files:
database.yaml install.sh install.yaml offline290.bash.setup offline290.sh.setup offline290.zsh.setup user_config.ini user_config.ini.org user_constants.ini user_constants.ini.org

The *.org files are copies of the originals that I made with cp, just in case.

The way apero works is that it first reads the default and then updates the values with the user values, then starts the processing.
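The "defaults first, then user values" behaviour described above can be sketched as a plain dictionary update. This is only an illustration of the idea, not APERO's actual loading code: `parse_user_ini`, `load_constants`, `DEFAULTS` and `USER_INI` are hypothetical names, and the parser assumes simple `KEY = value` lines with `#` comments, as in the user_constants.ini excerpt shown earlier in this thread.

```python
def parse_user_ini(text):
    """Parse 'KEY = value' lines, skipping blanks and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def load_constants(defaults, user_text):
    """Start from a copy of the defaults, then let user values override."""
    merged = dict(defaults)
    merged.update(parse_user_ini(user_text))
    return merged


# Hypothetical defaults, standing in for what the default files define.
DEFAULTS = {
    "REPROCESS_MP_TYPE": "process",
    "REPROCESS_MP_TYPE_VAL": "process",
}

# Hypothetical content of a user_constants.ini file.
USER_INI = """
# when validating recipes
REPROCESS_MP_TYPE_VAL = linear
"""

constants = load_constants(DEFAULTS, USER_INI)
print(constants["REPROCESS_MP_TYPE_VAL"])
```

Because the defaults are copied and never edited, the version-controlled files stay untouched and switching branches remains possible.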

@larnoldgithub
Collaborator

larnoldgithub commented Oct 4, 2024

I did run the 288 with 'process' months ago, and it crashed. I set it to 'linear' and it has been very stable, regarding the PROC at least. I didn't see any memory leak.

For my apero_precheck last night with the 290, the memory usage increased linearly during the db update, then came back to the 'background level' of the machine. @njcuk9999 do you think this is expected behavior? apero_precheck ended with no error.
[Screenshot 2024-10-04 at 07 53 20]

I'll not be able to make a test with the 289 before next week.

@clairem789
Collaborator Author

So to close this issue, it was due to my mistake of modifying the REPROCESS_MP_TYPE_VAL option value in the wrong file (the default file instead of the user files). Incidentally, I confirm that the linear option for this parameter is the one that works on the NewWorlds machine.
Sorry for this!

@njcuk9999
Owner

Thanks for clearing this up, I'd rather this than trying to figure out what caused it to break in newer versions and not older ones!
