Add --nodown Option to Exclude Nodes That Are Down #449
Add in an option that is similar to the -v option in pdsh. The idea is to skip all hosts that are currently down for one reason or another. The groups.conf file would have something like this in it:

    [genders]
    map: nodeattr -n $GROUP
    all: nodeattr -n -A
    list: nodeattr -l
    down: whatsup -n -d || /bin/true

Example usage without -D:

    > clush -a -b /path/to/command.sh
    host8: mcmd: connect failed: No route to host
    host12: mcmd: connect failed: No route to host
    clush: host[8,12] (2): exited with exit code 1
    ---------------
    host[1-7,9-11] (10)
    ---------------
    Hello World

Example usage with -D:

    > clush -D -a -b /path/to/command.sh
    ---------------
    host[1-7,9-11] (10)
    ---------------
    Hello World
Looks like I broke a test. Will fix. Thanks.
Use named parameters instead of relying on position when initializing UpcallGroupSource class.
Hello. I would like to start a conversation about this pull request. I would like to give clush the ability, as described here, to skip nodes that are "down." On large clusters, there may be several nodes that are down regularly, and we would prefer not to send commands to them at all. This pull request is a start at that. Would love your feedback! Thanks!
Thanks for this PR. Few questions to understand this better
My experience running supercomputers indicates that the source giving the groups of nodes is often totally different from the one giving the status of nodes, i.e.:
where you usually combine them with:
I'm looking toward naming the option
On the whole I agree with @degremont: it would be nice to be able to use a group from another source instead of hardcoding a down mapping in each source.
Maybe add a config option that automatically 'ands' with a group specified in config unless explicitly disabled? That is more generic and would suit your usage while avoiding having to redefine the down attribute everywhere.
I agree the '&states:up' idiom is a bit cumbersome so having some shortcut would definitely be nice though.
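To make the "automatically 'ands' with a group" idea concrete, here is a minimal sketch using plain Python sets as a stand-in for node sets. The function name `apply_up_filter` and the `enabled` switch are hypothetical and exist only for illustration; they are not part of clush or of this PR.

```python
def apply_up_filter(targets, up_nodes, enabled=True):
    """Return targets restricted to up_nodes, unless filtering is disabled.

    Hypothetical helper: plain sets stand in for real node sets.
    """
    if not enabled:
        return set(targets)
    # Equivalent of appending '&<up group>' to the target selection
    return set(targets) & set(up_nodes)


# Twelve hosts, two of which are down (mirrors the example in the description)
all_nodes = {"host%d" % i for i in range(1, 13)}
up_nodes = all_nodes - {"host8", "host12"}

filtered = apply_up_filter(all_nodes, up_nodes)
unfiltered = apply_up_filter(all_nodes, up_nodes, enabled=False)
```

With the filter on, the two down hosts drop out of the target set; with it off, the command would still be sent everywhere, which keeps the "visibly see the connection error" use case available.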
@@ -224,12 +227,12 @@ def _upcall_cache(self, upcall, cache, key, **args):
             raise GroupSourceNoUpcall(upcall, self)
 
         # Purge expired data from cache
-        if key in cache and cache[key][1] < time.time():
+        if key in cache and cache[key]:
That doesn't look right, you're purging the whole cache instead of a time-based expiration?
Agree. Needs to be fixed. Was seeing a problem where cache[key][1] was not defined and was blowing up. Did this temporarily and forgot to go back and fix. Thanks!
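For reference, a time-based purge that also tolerates a missing key could look roughly like the sketch below. It assumes cache entries are `(value, expiry_timestamp)` tuples, as the `cache[key][1]` index in the hunk suggests; `purge_if_expired` is a hypothetical standalone helper, not the method from the PR, and it does not try to diagnose why `cache[key][1]` was undefined.

```python
import time


def purge_if_expired(cache, key, now=None):
    """Delete cache[key] only when its stored expiry time has passed.

    Assumes entries are (value, expiry_timestamp) tuples.
    Returns True if an entry was purged, False otherwise.
    """
    now = time.time() if now is None else now
    entry = cache.get(key)
    if entry is not None and entry[1] < now:
        del cache[key]
        return True
    return False


cache = {
    "down": (["host8", "host12"], 100.0),  # expires at t=100
    "all": (["host1", "host2"], 500.0),    # expires at t=500
}
purged_down = purge_if_expired(cache, "down", now=200.0)
purged_all = purge_if_expired(cache, "all", now=200.0)
```

Only the entry whose expiry lies in the past is removed; the unexpired entry stays cached.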
             self.logger.debug("PURGE EXPIRED (%d)'%s'", cache[key][1], key)
             del cache[key]
 
         # Fetch the data if unknown of just purged
-        if key not in cache:
+        if key not in cache or not cache[key]:
That one could make sense but could use its own commit at least, possibly its own PR.
You're basically saying not to cache empty groups and I'm not sure I agree e.g. if your "down" set is currently empty then it won't be cached and getting the list of down nodes is possibly expensive.
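The cost described above can be shown with a small sketch. It is a simplified stand-in, not the PR's code: the cache stores the raw result instead of whatever structure the real cache holds, and `make_lookup`, `fetch_down` and `skip_empty` are hypothetical names.

```python
def make_lookup(fetch, skip_empty):
    """Build a cached lookup; skip_empty mimics the 'or not cache[key]' test."""
    cache = {}
    calls = {"count": 0}

    def lookup(key):
        missing = key not in cache or (skip_empty and not cache[key])
        if missing:
            calls["count"] += 1  # one (possibly expensive) upcall
            cache[key] = fetch(key)
        return cache[key]

    return lookup, calls


def fetch_down(_key):
    # Stand-in for an expensive upcall; currently no nodes are down
    return []


strict, strict_calls = make_lookup(fetch_down, skip_empty=False)
loose, loose_calls = make_lookup(fetch_down, skip_empty=True)
for _ in range(3):
    strict("down")
    loose("down")
```

When the "down" set is empty, the variant that treats an empty result as missing calls the upcall on every lookup, while the plain membership test calls it once.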
whatsup is a tool that calculates what hosts are up/down in a cluster
Not sure there is a need. The "down" is optional. I am not a slurm expert, but you might do it like:
I think there may be times where we want to run the command everywhere and visibly see that we got an error connecting to that host. Can't think of a great reason at the moment, but probably just tired.
Totally fine with changing the name/polarity of the option if you prefer. I was just mapping this to the -v option in pdsh (and was trying to give it a similar feel). From pdsh manpage:
I feel like we should go toward something like
Then find a proper way to declare the upcall, maybe in
And add to clush
In my thinking, running on To be discussed...
Thanks @hawartens for the info! And I like @degremont's idea very much. It would only require modification of clush, and we don't need to add a new group upcall. What do you think @hawartens, would that fit your needs?
A couple of points:
With that in mind I don't think adding a new option to clush would be worth it; I'm not a fan of having too many options. A source has the drawback of a heavier syntax if you're going to list many nodesets or nodes, but there might be some syntax trick to make @up:node[0-1000] or @up:group1,group2 work, like escaping the comma? I would need to play a bit more with that, but it doesn't look impossible to me. We definitely should ship an example of such a source if you come up with one, though, so others can also benefit more readily.
@thiell I am good with any idea we have here that ends up making it very simple to avoid nodes that are down as the
Not entirely sure I agree here. I prefer your initial assessment
I think @thiell and I agreed on proceeding with a new configuration entry in
Don't hesitate to force-push your branch after doing these changes
Sorry I have not gotten to this yet, been focused on other things. Will get back to it soon.
@degremont Finally looking at this again. Just want to make sure I really understand what you are asking for here. It sounds to me like instead of adding the
And other sites can define down_nodes as something else (or leave it unset by default). Not 100% sure, but I believe this also means that I still need to add resolver code for that into
Actually, the idea is indeed to add a new entry in
which could be anything else, like:
or any nodeset syntax
That way, this will handle the nodeset resolution for you and you will have nothing to do. You just need in the code something like (kind of):
@degremont Okay. Sounds good. Thanks for the clarification.