Add reset command from previous branch #724

Merged · 15 commits merged into develop on Oct 31, 2023
Conversation

@aquan9 (Collaborator) commented on Sep 19, 2023:

Make a beeflow reset command with a warning message. The command just finds and removes the .beeflow directory.

This should hopefully resolve #708

This is a continuation of PR #712
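
For reference, a minimal sketch of the behavior described above, assuming a Typer-based CLI (as the beeflow client uses); the command name, options, and the hardcoded ~/.beeflow path are illustrative only, since the real code resolves the workdir from the configuration.

# Illustrative sketch only -- not the actual beeflow implementation.
from pathlib import Path
import shutil

import typer

app = typer.Typer()


@app.command()
def reset(archive: bool = typer.Option(False, '--archive', '-a',
                                       help='Back up the workdir before deleting it')):
    """Warn the user, then find and remove the BEE working directory."""
    # Hypothetical location; the real code resolves this from the configuration.
    workdir = Path.home() / '.beeflow'
    if not workdir.is_dir():
        typer.echo('Nothing to reset: no working directory found.')
        raise typer.Exit()
    typer.secho(f'This will delete {workdir} and all workflow state.', fg='red')
    if not typer.confirm('Continue?'):
        raise typer.Exit()
    if archive:
        shutil.copytree(workdir, workdir.parent / (workdir.name + '.backup'))
    shutil.rmtree(workdir)
    typer.echo('Reset complete.')


if __name__ == '__main__':
    app()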

@aquan9 added the WIP (Work in progress) label on Sep 19, 2023
@aquan9 mentioned this pull request on Sep 19, 2023
@aquan9 removed the WIP (Work in progress) label on Sep 19, 2023
@aquan9 requested a review from pagrubel on Sep 19, 2023
@pagrubel (Collaborator) left a review comment:

See beeflow/wf_manager/resources/wf_utils.py
You can use get_bee_workdir to find the path

Review threads (outdated, resolved): beeflow/client/core.py (6), docs/sphinx/commands.rst (2)
@pagrubel (Collaborator) commented:

@aquan9 As I was reviewing I found some minor places where .beeflow was still used and will commit fixes for them. However, I'm still testing. I believe I found an error when someone has a workflow running. I'll post it soon.

@pagrubel (Collaborator) commented:

This is an error that occurred when a reset was done while workflows were still running. I'm thinking we should check for running workflows using beeflow list and advise the user to either let them finish or cancel them via beeflow cancel <wf_id>.

@pagrubel (Collaborator) commented:

Oops, I forgot to post the error:

Waiting for components to cleanly stop.
Traceback (most recent call last):

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/bin/beeflow", line 6, in <module>
    sys.exit(main())

  File "/vast/home/pagrubel/BEE/BEE/beeflow/client/bee_client.py", line 554, in main
    app()

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 289, in __call__

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 280, in __call__

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1157, in __call__

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1078, in main

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1688, in invoke

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1688, in invoke

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 1434, in invoke

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/click/core.py", line 783, in invoke

  File "/vast/home/pagrubel/.cache/pypoetry/virtualenvs/hpc-beeflow-YDRVf3zF-py3.9/lib/python3.9/site-packages/typer/main.py", line 607, in wrapper

  File "/vast/home/pagrubel/BEE/BEE/beeflow/client/core.py", line 428, in reset
    shutil.rmtree(directory_to_delete)

  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 732, in rmtree
    _rmtree_safe_fd(fd, path, onerror)

  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)

  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)

  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 665, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)

  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 671, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())

  File "/projects/opt/centos8/x86_64/miniconda3/py39_4.12.0/lib/python3.9/shutil.py", line 669, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)

OSError: [Errno 39] Directory not empty: 'x86_64-linux-gnu'

@pagrubel (Collaborator) commented on Sep 27, 2023:

So if I had a workflow running when I did the beeflow core reset, it left a neo4j process running:

ps aux | grep pagrubel | grep -v grep | grep -E 'bee|slurmrest|neo4j'
pagrubel 3228289 6.9 1.0 46490656 2892124 ? Sl 13:41 0:29 /usr/local/openjdk-8/bin/java -cp /var/lib/neo4j/plugins:/var/lib/neo4j/conf:/var/lib/neo4j/lib/*:/var/lib/neo4j/plugins/* -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+TrustFinalNonStaticFields -XX:+DisableExplicitGC -Djdk.tls.ephemeralDHKeySize=2048 -Djdk.tls.rejectClientInitiatedRenegotiation=true -Dunsupported.dbms.udc.source=tarball -Dfile.encoding=UTF-8 org.neo4j.server.CommunityEntryPoint --home-dir=/var/lib/neo4j --config-dir=/var/lib/neo4j/conf

And more processes were left if more than one workflow was running. I think we should check for running workflows and inform the user that they will be cancelled if they continue with the reset; then we will need to kill the GDB instances for that user.

@aquan9 (Collaborator, Author) commented on Oct 3, 2023:

I'm wondering if the changes to fix this need to happen at the level of the "quit" call, because as it stands, the beeflow stop command should have the same problem.

Both beeflow stop and beeflow reset call:

resp = cli_connection.send(paths.beeflow_socket(), {'type': 'quit'})
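
For what it's worth, a hedged sketch of how reset could reuse that same shutdown path and only delete the workdir once the components have actually stopped; only cli_connection.send() and paths.beeflow_socket() come from the line quoted above, while the import path, the stop() helper, and the socket-polling heuristic are assumptions.

# Hypothetical sketch: make reset wait for the daemon to release the workdir
# before deleting it. Only cli_connection.send() and paths.beeflow_socket()
# appear in the quoted code; everything else here is illustrative.
import shutil
import time
from pathlib import Path

from beeflow.common import cli_connection, paths  # assumed module locations


def stop():
    """Ask the beeflow daemon to quit (the call shared by stop and reset)."""
    return cli_connection.send(paths.beeflow_socket(), {'type': 'quit'})


def reset(workdir, timeout=30):
    """Stop the daemon, then delete the workdir once the socket is gone."""
    stop()
    socket = Path(paths.beeflow_socket())
    deadline = time.time() + timeout
    while socket.exists() and time.time() < deadline:
        time.sleep(1)  # give components time to cleanly stop
    shutil.rmtree(workdir)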

@pagrubel (Collaborator) commented:

Discussion during the Oct 10 meeting:

Orphaned neo4j processes keep a file in ~/.beeflow, so beeflow core stop works but beeflow core reset fails, since reset deletes ~/.beeflow.
~/.beeflow/workflows/<wf_id> is bind mounted into neo4j in /tmp, so as long as an instance is running ~/.beeflow can't be deleted.

The pid for each neo4j instance is in the wf_manager database, so we could kill those.

We also need to evaluate beeflow cancel <wf_id>, which leaves orphaned neo4j instances around.

We still need to look at using a different database system, but we should fix this now.

For now, should we search for any running workflows and, if there are any, print a message telling the user they need to either wait for them to finish or cancel them?
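
A minimal sketch of the guard proposed here, assuming the list of running workflows can be obtained from the same query that backs beeflow list; the helper below is hypothetical and only illustrates the user-facing message.

# Hypothetical guard: refuse to reset while workflows are still running.
import sys


def guard_reset(running_workflows):
    """Abort the reset if any workflows are still running.

    running_workflows would come from whatever query backs `beeflow list`;
    that lookup is not shown here.
    """
    if running_workflows:
        print('The following workflows are still running:')
        for wf_id in running_workflows:
            print(f'  {wf_id}')
        print('Either wait for them to finish or cancel them with '
              '`beeflow cancel <wf_id>` before running `beeflow core reset`.')
        sys.exit(1)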

@pagrubel (Collaborator) commented:

1.) I get this error if -a is used and <bee_workdir>.backup already exists:
error.txt
If the -a/--archive flag is set, check for that directory before doing anything else, give a warning, and exit.

2.) Maybe we should only archive the archives directory and the logs. I get this error when I try to archive (when the above doesn't apply); I think it has to do with some of the active sockets and processes. I'm thinking we should only copy <bee_workdir>/archives and the logs, and maybe the db files. Would that help?
error-archive.txt

If I don't care to keep anything, everything works fine.

@pagrubel (Collaborator) commented:

@aquan9 I think if you just copy the logs and archives, the -a option will work. You may want to ask whether the user wants to copy the container_archive directory if it exists, since the user can change its location in the configuration file and the files can be quite large.
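
A sketch of what that selective archive might look like, assuming bee_workdir comes from the configuration lookup and that the relevant subdirectories are logs, archives, and (optionally) container_archive as discussed above; the function name and backup location are illustrative.

# Illustrative sketch of the selective -a/--archive behaviour discussed above.
import shutil
import sys
from pathlib import Path


def archive_workdir(bee_workdir: Path, include_containers: bool = False):
    """Copy only logs and archives (optionally container_archive) to a backup."""
    backup = bee_workdir.parent / (bee_workdir.name + '.backup')
    if backup.exists():
        print(f'{backup} already exists; move or remove it before using -a.')
        sys.exit(1)
    subdirs = ['logs', 'archives']
    if include_containers:
        # Note: container_archive is configurable and may live outside
        # bee_workdir; this sketch only handles the default location.
        subdirs.append('container_archive')
    for name in subdirs:
        src = bee_workdir / name
        if src.is_dir():
            shutil.copytree(src, backup / name)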

@pagrubel self-requested a review on October 30, 2023
@pagrubel requested a review from jtronge on October 30, 2023
@pagrubel (Collaborator) commented:

@jtronge Since I made the last changes, would you please review them?

@jtronge (Collaborator) commented on Oct 31, 2023:

This seems to work for me. If I submitted a workflow with the --no-start option, I ended up with the OSError: [Errno 39] Directory not empty: 'x86_64-linux-gnu' error when calling reset, but maybe this is expected for that case.

@pagrubel merged commit 1bd5980 into develop on Oct 31, 2023 (4 checks passed)
@pagrubel deleted the reset-beeflow2 branch on October 31, 2023
Successfully merging this pull request may close these issues: Add Beeflow Reset Command
3 participants