fix ctrl-loss-tmo flag support #30
base: main
Conversation
📝 Walkthrough

The pull request introduces a new command-line flag, `ctrl-loss-tmo`.

Possibly related PRs
Suggested reviewers
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (8)
cmd/connect-all.go (2)

58-59: Consider adding a shorthand flag option for consistency.

Most other flags in this command have shorthand options (e.g., `-a`, `-s`, `-q`). Consider adding a shorthand for `ctrl-loss-tmo` to maintain consistency with the existing pattern. Here's a suggested change:

```diff
-cmd.Flags().IntP("ctrl-loss-tmo", "", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")
+cmd.Flags().IntP("ctrl-loss-tmo", "l", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")
```
Line range hint 75-75: Fix undefined variable usage.

There seems to be an undefined `kato` variable being used in the `DiscoverRequest`. Use `viper.GetInt("connect-all.kato")` for consistency. Here's the suggested fix:

```diff
-    Kato: kato,
+    Kato: viper.GetInt("connect-all.kato"),
```

pkg/nvmeclient/nvme_client.go (2)
505-507: Add parameter documentation for better maintainability.

Add documentation for the new `ctrlLossTMO` parameter to clarify its purpose and expected values.

```diff
 func ConnectAll(discoveryRequest *hostapi.DiscoverRequest, maxIOQueues int, kato int,
-	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	// ctrlLossTMO: Controller loss timeout in seconds. -1 disables the timeout.
```
535-539: Consider adding validation for extreme timeout values.

While the implementation is correct, consider validating the `ctrlLossTMO` parameter to prevent potential issues with extreme values. The `ToOptions()` method already checks for `>= -1`, but an upper bound check may also be worthwhile.

```diff
 func ConnectAllNVMEDevices(logPageEntries []*hostapi.NvmeDiscPageEntry, hostnqn string,
 	transport string, maxIOQueues int, kato int, ctrlLossTMO int,
 ) []*CtrlIdentifier {
+	// Validate ctrlLossTMO bounds
+	if ctrlLossTMO > 3600 { // Example: limit to 1 hour
+		logrus.Warnf("ctrlLossTMO value %d exceeds maximum allowed (3600), capping to 3600", ctrlLossTMO)
+		ctrlLossTMO = 3600
+	}
```

Also applies to: 552-552
pkg/clientconfig/cache.go (4)

51-53: Maintain consistent field naming conventions.

The field naming is inconsistent between structs: `TKey.ctrlLossTMO` uses camelCase while `Connection.CtrlLossTMO` uses PascalCase. Additionally, the comment could be clearer.

```diff
 type TKey struct {
 	transport string
 	Ip        string
 	port      int
-	ctrlLossTMO int
+	CtrlLossTMO int // Controller loss timeout in seconds (-1 to disable)
 }
 type Connection struct {
 	// ...
-	CtrlLossTMO int // seconds
+	CtrlLossTMO int // Controller loss timeout in seconds (-1 to disable)
 }
```

Also applies to: 65-65
491-493: Improve code readability with a more concise initialization.

The `TKey` initialization is more readable when each field sits on its own line.

```diff
-	key := TKey{transport: newEntry.Transport, Ip: newEntry.Traddr,
-		port: newEntry.Trsvcid, Nqn: newEntry.Subsysnqn,
-		hostnqn: newEntry.Hostnqn, ctrlLossTMO: newEntry.CtrlLossTMO}
+	key := TKey{
+		transport:   newEntry.Transport,
+		Ip:          newEntry.Traddr,
+		port:        newEntry.Trsvcid,
+		Nqn:         newEntry.Subsysnqn,
+		hostnqn:     newEntry.Hostnqn,
+		ctrlLossTMO: newEntry.CtrlLossTMO,
+	}
```
507-507: Enhance the error message for better debugging.

The error message could be more descriptive about the mismatch.

```diff
-	err := fmt.Errorf("Entry %+v not cached, though '%s' is in cache", newEntry, conn)
+	err := fmt.Errorf("Entry %+v not found in cache entries, but connection '%s' exists in cache connections", newEntry, conn)
```
The `getEntryFromReferral` function needs to be updated to include the `CtrlLossTMO` field.

The `Entry` struct includes a `CtrlLossTMO` field, but when creating new entries from referrals in `getEntryFromReferral`, this field is never set. This could lead to inconsistent timeout behavior.

- In `pkg/clientconfig/cache.go`, update `getEntryFromReferral` to include the `CtrlLossTMO` field in the returned `Entry`.
- The timeout value should be propagated from the referral context or use a default value consistent with the rest of the codebase.
🔗 Analysis chain

Line range hint 513-590: Verify referral handling with the new timeout field.

The referral handling code in `HandleReferrals` and related functions should be reviewed to ensure proper handling of the `CtrlLossTMO` field when processing referrals.

🏁 Scripts executed

The following scripts were executed for the analysis:

```bash
#!/bin/bash
# Check if CtrlLossTMO is properly handled in referral-related code
echo "Checking referral handling code for CtrlLossTMO usage..."

# Search for referral-to-entry conversion points
rg -A 5 "getEntryFromReferral|addConnectionsFromReferrals"

# Look for potential timeout-related patterns in referral handling
ast-grep --pattern 'func getEntryFromReferral($_, $_) {
  $$$
  return &Entry{
    $$$
  }
}'
```

Length of output: 2343

```bash
#!/bin/bash
# Check the Entry struct definition and its usage
echo "Checking Entry struct definition..."
rg -A 10 "type Entry struct" pkg/clientconfig/

# Check if CtrlLossTMO is used in the Entry struct
echo -e "\nChecking CtrlLossTMO usage in Entry..."
rg "CtrlLossTMO" pkg/clientconfig/

# Examine the Referral struct and NvmeDiscPageEntry to understand the timeout field
echo -e "\nChecking Referral and NvmeDiscPageEntry definitions..."
rg -A 5 "type.*NvmeDiscPageEntry.*struct"
```

Length of output: 2095
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)

- cmd/connect-all.go (3 hunks)
- pkg/clientconfig/cache.go (4 hunks)
- pkg/clientconfig/conf_parser.go (3 hunks)
- pkg/nvmeclient/nvme_client.go (4 hunks)
- service/service.go (2 hunks)
🔇 Additional comments (9)
cmd/connect-all.go (2)
22-24: LGTM! Clean import organization.
The imports are well-organized and necessary for the new functionality.
79-83: LGTM! Let's verify the ConnectAll interface.
The implementation looks good and properly integrates the new timeout parameter. Let's verify the corresponding changes in the nvmeclient package.
✅ Verification successful
The ConnectAll interface matches the usage perfectly
The verification shows that the ConnectAll
function in pkg/nvmeclient/nvme_client.go
correctly accepts the ctrlLossTMO
parameter, and there are no other usages of ConnectAll
that need updating. The implementation is consistent across the codebase.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the ConnectAll function signature in nvmeclient package

# Check the ConnectAll function definition
ast-grep --pattern 'func ConnectAll($_, $_, $_, $_) $_ {
  $$$
}'

# Look for any other usages of ConnectAll that might need updating
rg -A 2 'ConnectAll\('
```
Length of output: 1252
pkg/clientconfig/conf_parser.go (1)
165-176
: Let's improve the ctrl-loss-tmo flag handling
While the basic parsing is implemented correctly, we should consider a few enhancements:
- Let's add validation for the maximum allowed value
- We should set a default value when none is provided
- It would be helpful to document the flag's meaning and valid range in a comment
Here's a suggested enhancement:
```diff
+const (
+	// Maximum controller loss timeout in seconds
+	// -1: disabled, 0: no timeout, >0: timeout in seconds
+	MaxCtrlLossTMO = 3600 // 1 hour
+)
 case "-l", "--ctrl-loss-tmo":
 	i++
 	value := strings.TrimSpace(s[i])
 	ctrlLossTMO, err := strconv.ParseInt(value, 10, 32)
 	if err != nil {
 		return nil, &ParserError{
 			Msg:     fmt.Sprintf("bad controller loss timeout value"),
 			Details: fmt.Sprintf("%s is not a valid int", s[i]),
 			Err:     err,
 		}
 	}
+	if ctrlLossTMO > MaxCtrlLossTMO {
+		return nil, &ParserError{
+			Msg:     fmt.Sprintf("controller loss timeout value too large"),
+			Details: fmt.Sprintf("maximum allowed value is %d", MaxCtrlLossTMO),
+			Err:     nil,
+		}
+	}
 	e.CtrlLossTMO = int(ctrlLossTMO)
+} else {
+	e.CtrlLossTMO = -1 // Set default to disabled
+}
```

Let's verify the usage of CtrlLossTMO across the codebase:
service/service.go (2)
59-59: Formatting change noted.
This is a minor formatting adjustment that doesn't affect functionality.
258-260: Implementation aligns with PR objectives.
The addition of conn.CtrlLossTMO
parameter to the ConnectAllNVMEDevices
call successfully implements the ctrl-loss-tmo flag support. Let's verify the parameter usage across the codebase to ensure consistent implementation.
Let's run this script to verify the parameter usage:
✅ Verification successful
CtrlLossTMO parameter is consistently implemented across the codebase.

The verification shows that the `CtrlLossTMO` parameter is properly integrated:

- Defined in the `Connection` struct with clear documentation (seconds)
- Correctly initialized through the configuration chain: command line via the `connect.ctrl-loss-tmo` flag
- Properly passed through connection initialization
- Consistently used in `ConnectAllNVMEDevices` calls

The implementation is complete and consistent with no issues found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the consistent usage of CtrlLossTMO across the codebase
# Check for CtrlLossTMO field definition in Connection struct
echo "Checking Connection struct definition..."
rg -p "type\s+Connection\s+struct" -A 20
# Check for CtrlLossTMO initialization
echo "Checking CtrlLossTMO initialization..."
rg "CtrlLossTMO\s*[:=]"
# Check for ConnectAllNVMEDevices usage to ensure all calls include the new parameter
echo "Checking ConnectAllNVMEDevices usage..."
rg -p "ConnectAllNVMEDevices\s*\(" -A 3
Length of output: 1958
pkg/nvmeclient/nvme_client.go (2)
31-32: LGTM!
The imports are properly organized.
512-513: LGTM!
The parameter is correctly passed through to the underlying function.
pkg/clientconfig/cache.go (2)
68-68: LGTM: Function signature update is clean and well-integrated.
The addition of the ctrlLossTMO
parameter to newConnection
aligns well with the new functionality.
70-73: LGTM: Connection initialization is properly handled.
The CtrlLossTMO
field is correctly initialized along with other connection properties.
Force-pushed from 06493fa to 188de4b
@yogev-lb seems legit, but:
- does this change the default behavior, or do you only get a different behavior if you set `ctrl-loss-tmo` to something other than -1?
- how did you test this? Here are the `discovery-client` tests we have in the CI; at least 279 is a must:
-> /home/muli/lightbits/src/systests/racktests/1923_etcd_mTLS_cross_server_connectivity_new_installation.py [CI: sanity]
-> /home/muli/lightbits/src/systests/racktests/2301_physical_alma_8_installation_sanity.py [CI: weekly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/239_alma_8_vm_installation_test.py [CI: sanity minisanity]
-> /home/muli/lightbits/src/systests/racktests/279_duros_discovery_client.py [CI: weekly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/280_monitoring_installation_test.py [CI: weekly monitoring-stack]
-> /home/muli/lightbits/src/systests/racktests/529_discovery_vm_installation_add_node.py [CI: weekly]
-> /home/muli/lightbits/src/systests/racktests/609_vm_installation_pull.py [CI: nightly (unstable)]
-> /home/muli/lightbits/src/systests/racktests/625_vm_installation_add_new_server_to_cluster.py [CI: weekly]
- @elada-lb does the PR check work already? can we try it on this PR?
btw @yogev-lb gentle reminder, this is a public repo
@muliby-lb to answer your Q:
Yea we should be good to go with running it

@elada-lb thanks can you point me at the documentation on how to run it?

PR check running here: https://github.com/LightBitsLabs/discovery-client/actions/runs/12006948311 (thanks @elada-lb)

@elada-lb doesn't look like it ran much... Can you take a look?
Thanks @elada-lb, I see it now. Looking at the workflow file, I see
which unless I am mistaken means it ran on the master branch and not on this PR branch.
@muliby-lb @ronen-lb The CI passed for the DC according to this run: https://github.com/lightbitslabs/lbcitests/actions/runs/12006959193; waiting for approval...

@yogev-lb I ran 239 a couple of times manually and it fails both times because the
@elada-lb this strengthens my suspicion that the PR check (which shows 239 as passed) actually ran on master and not with @yogev-lb's branch. Please check.

I'm pretty sure that's the case, we're currently looking into this.

After merging #31 and a few other adjustments in lbcitests, you should be able to run the PR checks with supporting side branches as expected. Please try this out and let me know if there are other issues.
Force-pushed from 278ca61 to 364fe8d
@elada-lb failed in https://github.com/LightBitsLabs/lbcitests/actions/runs/12044433121/job/33581436912, can you please take a look?

@muliby-lb Why'd you pass

@elada-lb quite possibly I misunderstood how it's meant to work. Where do I specify which branch/PR to run the PR check on?

Oh I see. Well, you don't need to pass the branch name, since you select the branch to run the workflow on, like so:
Force-pushed from 364fe8d to 7bf33b3

Force-pushed from 7bf33b3 to 9c9be35
Quality Gate passed
@yogev-lb fixing the tests so that they will pass is OK, but I'm still not convinced your change makes sense in the field. Will we start getting support calls because the dsc is down? I realize it can't work if the modules aren't loaded, but I'm concerned that it will go down and stay down even after the module is loaded. Unless systemd will restart it after a few mins?
```diff
@@ -55,7 +55,7 @@ LABEL maintainers="Lightbits Labs" \
 RUN apk update && \
     apk add --no-cache \
-    util-linux
+    util-linux curl
```
@yogev-lb who runs the liveness probe, and how?
Force-pushed from 55aa786 to 8011d33

Force-pushed from 8011d33 to c5a449a

Force-pushed from c5a449a to 8011d33

Force-pushed from 8011d33 to f45b96f
Until now we didn't support this flag in the DC configuration: it was added only on the connect command line but not as a service. We needed to update the file parser, update the cache, and pass this information to the ConnectAll command from the service. We differentiate between cache entries coming from the user and those coming from referrals. The reason is that a user may provide a non-default ctrl-loss-tmo for a server, and when a referral notifies us of a newly added server we will not have this value. When we add the referral entry we go over all the user entries, and if one of them has a value set we use it and apply it to the referral entry. Sadly the DS cannot provide us this info about the new server we connected to. issue: LBM1-35562
There are still OSes out there that do not load nvme-tcp by default. This causes the DC to malfunction when it tries to connect to the io-controller. Today we only see it in the logs while the server runs successfully, which makes it tricky to locate. This change fails fast on service load and reports what, in high probability, needs to be done: load nvme-tcp.
Today when we fail on an application error we print the help message, which is not relevant for these errors and is confusing.
If we want to run liveness probes on the discovery-client we need to run them from the container. When a service exposes an endpoint, like this one exposing HTTP, it is good practice to expose a healthz endpoint that can be invoked by k8s or docker-compose to indicate container status.
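A minimal sketch of such a healthz endpoint using Go's standard `net/http`; the port, path, and response body are conventional choices, not taken from the repo:

```go
package main

import (
	"fmt"
	"net/http"
)

// healthz reports liveness; split out as a plain function so the
// handler stays trivial to test.
func healthz() (int, string) {
	return http.StatusOK, "ok"
}

func healthzHandler(w http.ResponseWriter, r *http.Request) {
	code, body := healthz()
	w.WriteHeader(code)
	fmt.Fprintln(w, body)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthzHandler)
	srv := &http.Server{Addr: ":8080", Handler: mux}
	// In a real service this would block serving requests:
	//   log.Fatal(srv.ListenAndServe())
	_ = srv
	fmt.Println("healthz endpoint registered at /healthz")
}
```

A docker-compose healthcheck could then probe it with the curl the Dockerfile change adds, e.g. `test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]`.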
If the interval is 0 we will panic with: `panic: non-positive interval for NewTicker`. Set the default to 5s to prevent such issues. Signed-off-by: alon <[email protected]>
We noticed that there may be a race between writing and closing the file and shutting down the host abruptly. Signed-off-by: alon <[email protected]>
Force-pushed from f45b96f to 8d61b4e
We do not need to depend on any external repo's version; this is an independent repo that needs to control its own version. Signed-off-by: alon <[email protected]>
Force-pushed from 8d61b4e to 2f09aae
Quality Gate passed
Until now we didn't support this flag in the DC configuration: it was added only on the connect command line but not as a service. We needed to update the file parser, update the cache, and pass this information to the ConnectAll command from the service.
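For illustration, a connection entry in the DC configuration file could then carry the flag alongside the usual nvme connect-style options. The option spelling follows the parser's `-l`/`--ctrl-loss-tmo` handling described in the review above; the addresses and NQNs below are made up:

```
-t tcp -a 192.168.1.10 -s 8009 -q nqn.2014-08.org.nvmexpress:uuid:host-1 -n nqn.2016-01.com.lightbitslabs:uuid:subsys-1 --ctrl-loss-tmo 60
```

With no `-l`/`--ctrl-loss-tmo` given, the timeout stays at the disabled default (-1).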
This change was detected when working on DMS.
We noticed in negative tests that when we kill the node of the volume we are using (vol become unavailable) we have
for ever (until volume becomes available again)
This is not a good behavior for us cause we want to fail fast (
1m
) and decide if we want to proceed or not. (we as opposed to other customers don't really care about the data (we can always retry the clone) and we don't support data resume yet so we can't retry when available.It supersized me that this feature is not working - since i remember that we told many customers (including @yant-lb in the LAB) that it does.
see: