
fix ctrl-loss-tmo flag support #30

Open · wants to merge 9 commits into base: main
Conversation

@yogev-lb (Contributor) commented Nov 21, 2024

Until now we didn't support this flag in the DC configuration. It was added only on the connect command line, but not as a service. We needed to update the file parser, update the cache, and pass this information to the ConnectAll command from the service.

This change was detected while working on DMS.
We noticed in negative tests that when we kill the node of the volume we are using (the volume becomes unavailable), we hang forever (until the volume becomes available again).

This is not good behavior for us, because we want to fail fast (1m) and decide whether to proceed. We, as opposed to other customers, don't really care about the data (we can always retry the clone), and we don't support data resume yet, so we can't retry when the volume becomes available.

It surprised me that this feature is not working, since I remember that we told many customers (including @yant-lb in the LAB) that it does.

see:

@coderabbitai bot commented Nov 21, 2024

📝 Walkthrough

The pull request introduces a new command-line flag, ctrl-loss-tmo, to the connect-all command in the cmd/connect-all.go file, allowing users to specify a controller loss timeout period. The default value is set to -1, indicating that the timeout is disabled. This flag is integrated into the command's functionality and is passed to the nvmeclient.ConnectAll function as an additional argument.
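As a rough illustration of the flag's semantics (not the repository's actual code; the option name mirrors the kernel's nvme-fabrics `ctrl_loss_tmo` parameter, and the helper below is hypothetical), the "-1 disables the timeout" convention could be encoded like this:

```go
package main

import "fmt"

// ctrlLossTMOOption renders a controller-loss-timeout value as an
// nvme-fabrics style option string. By the convention described above,
// -1 means the timeout is disabled (retry forever) and values below -1
// are rejected as invalid.
func ctrlLossTMOOption(tmo int) (string, error) {
	if tmo < -1 {
		return "", fmt.Errorf("invalid ctrl-loss-tmo %d: must be >= -1", tmo)
	}
	return fmt.Sprintf("ctrl_loss_tmo=%d", tmo), nil
}

func main() {
	opt, err := ctrlLossTMOOption(60) // give up after 60s of controller loss
	if err != nil {
		panic(err)
	}
	fmt.Println(opt)
}
```

With a value of 60, a lost controller is abandoned after a minute instead of being retried indefinitely, which is exactly the fail-fast behavior the PR description asks for.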

In the pkg/clientconfig/cache.go file, modifications include the addition of the ctrlLossTMO field to the TKey and Connection structs, along with updates to the newConnection function to accept this parameter. The addEntry method is also adjusted to propagate the CtrlLossTMO value when creating new connections.
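A simplified sketch of the struct and constructor changes described above (the field sets are illustrative, not the repository's full definitions):

```go
package main

import "fmt"

// TKey identifies a connection in the cache; including ctrlLossTMO in the
// key means entries that differ only in timeout are treated as distinct.
type TKey struct {
	transport   string
	ip          string
	port        int
	nqn         string
	hostnqn     string
	ctrlLossTMO int // seconds; -1 disables the timeout
}

// Connection carries the timeout so it can later be handed to
// ConnectAllNVMEDevices when the service (re)connects.
type Connection struct {
	Key         TKey
	CtrlLossTMO int // seconds; -1 disables the timeout
}

// newConnection mirrors the described change: the timeout is accepted as
// a parameter and propagated onto the new Connection.
func newConnection(key TKey, ctrlLossTMO int) *Connection {
	return &Connection{Key: key, CtrlLossTMO: ctrlLossTMO}
}

func main() {
	key := TKey{transport: "tcp", ip: "10.0.0.1", port: 8009, ctrlLossTMO: 60}
	conn := newConnection(key, key.ctrlLossTMO)
	fmt.Println(conn.CtrlLossTMO)
}
```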

The pkg/clientconfig/conf_parser.go file sees the addition of a CtrlLossTMO field in the Entry struct, with corresponding updates to the parsing logic to handle the new command-line argument.

Finally, the nvme_client.go file updates the ConnectAll and ConnectAllNVMEDevices functions to incorporate the new timeout parameter, while the service/service.go file adjusts the Start method to pass this parameter during device connections.

Suggested reviewers

  • anton-lb: Familiar with the connection management logic and related changes.
  • arturm-lb: Experienced in the command-line interface and configuration aspects of the project.


@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (8)
cmd/connect-all.go (2)

58-59: Consider adding a shorthand flag option for consistency.

Most other flags in this command have shorthand options (e.g., -a, -s, -q). Let's consider adding a shorthand for ctrl-loss-tmo to maintain consistency with the existing pattern.

Here's a suggested change:

-cmd.Flags().IntP("ctrl-loss-tmo", "", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")
+cmd.Flags().IntP("ctrl-loss-tmo", "l", -1, "controller loss timeout period (in seconds). Timeout is disabled by default (-1)")

Line range hint 75-75: Fix undefined variable usage.

There seems to be an undefined kato variable being used in the DiscoverRequest. We should use viper.GetInt("connect-all.kato") for consistency.

Here's the suggested fix:

-		Kato:      kato,
+		Kato:      viper.GetInt("connect-all.kato"),
pkg/nvmeclient/nvme_client.go (2)

505-507: Add parameter documentation for better maintainability.

Let's add documentation for the new ctrlLossTMO parameter to clarify its purpose and expected values.

 func ConnectAll(discoveryRequest *hostapi.DiscoverRequest,
 	maxIOQueues int, kato int,
-	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	ctrlLossTMO int) ([]*CtrlIdentifier, error) {
+	// ctrlLossTMO: Controller loss timeout in seconds. -1 disables the timeout.

535-539: Consider adding validation for extreme timeout values.

While the implementation is correct, let's consider adding validation for the ctrlLossTMO parameter to prevent potential issues with extreme values. The ToOptions() method already checks for >= -1, but we might want to add an upper bound check.

 func ConnectAllNVMEDevices(logPageEntries []*hostapi.NvmeDiscPageEntry,
 	hostnqn string, transport string,
 	maxIOQueues int, kato int,
 	ctrlLossTMO int,
 ) []*CtrlIdentifier {
+	// Validate ctrlLossTMO bounds
+	if ctrlLossTMO > 3600 { // Example: limit to 1 hour
+		logrus.Warnf("ctrlLossTMO value %d exceeds maximum allowed (3600), capping to 3600", ctrlLossTMO)
+		ctrlLossTMO = 3600
+	}

Also applies to: 552-552

pkg/clientconfig/cache.go (4)

51-53: Let's maintain consistent field naming conventions.

The field naming is inconsistent between structs:

  • TKey.ctrlLossTMO uses camelCase
  • Connection.CtrlLossTMO uses PascalCase

Additionally, let's enhance the comment for better clarity.

 type TKey struct {
     transport string
     Ip        string
     port      int
-    ctrlLossTMO int
+    CtrlLossTMO int // Controller loss timeout in seconds (-1 to disable)
 }

 type Connection struct {
     // ...
-    CtrlLossTMO  int // seconds
+    CtrlLossTMO  int // Controller loss timeout in seconds (-1 to disable)
 }

Also applies to: 65-65


491-493: Let's improve code readability with a more concise initialization.

The TKey initialization can be more readable on a single line.

-    key := TKey{transport: newEntry.Transport, Ip: newEntry.Traddr,
-        port: newEntry.Trsvcid, Nqn: newEntry.Subsysnqn,
-        hostnqn: newEntry.Hostnqn, ctrlLossTMO: newEntry.CtrlLossTMO}
+    key := TKey{
+        transport:    newEntry.Transport,
+        Ip:          newEntry.Traddr,
+        port:        newEntry.Trsvcid,
+        Nqn:         newEntry.Subsysnqn,
+        hostnqn:     newEntry.Hostnqn,
+        ctrlLossTMO: newEntry.CtrlLossTMO,
+    }

507-507: Let's enhance the error message for better debugging.

The error message could be more descriptive about the mismatch.

-    err := fmt.Errorf("Entry %+v not cached, though '%s' is in cache", newEntry, conn)
+    err := fmt.Errorf("Entry %+v not found in cache entries, but connection '%s' exists in cache connections", newEntry, conn)

We need to update the getEntryFromReferral function to include the CtrlLossTMO field

The Entry struct includes a CtrlLossTMO field, but when creating new entries from referrals in getEntryFromReferral, we're not setting this field. This could lead to inconsistent timeout behavior.

  • In pkg/clientconfig/cache.go, we need to update getEntryFromReferral to include the CtrlLossTMO field in the returned Entry.
  • The timeout value should be propagated from the referral context or use a default value consistent with the rest of the codebase.
🔗 Analysis chain

Line range hint 513-590: Let's verify referral handling with the new timeout field.

The referral handling code in HandleReferrals and related functions should be reviewed to ensure proper handling of the CtrlLossTMO field when processing referrals.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if CtrlLossTMO is properly handled in referral-related code
echo "Checking referral handling code for CtrlLossTMO usage..."

# Search for referral-to-entry conversion points
rg -A 5 "getEntryFromReferral|addConnectionsFromReferrals" 

# Look for potential timeout-related patterns in referral handling
ast-grep --pattern 'func getEntryFromReferral($_, $_) {
  $$$
  return &Entry{
    $$$
  }
}'

Length of output: 2343


Script:

#!/bin/bash
# Let's check the Entry struct definition and its usage
echo "Checking Entry struct definition..."
rg -A 10 "type Entry struct" pkg/clientconfig/

# Let's also check if CtrlLossTMO is used in the Entry struct
echo -e "\nChecking CtrlLossTMO usage in Entry..."
rg "CtrlLossTMO" pkg/clientconfig/

# Let's examine the Referral struct and NvmeDiscPageEntry to understand the timeout field
echo -e "\nChecking Referral and NvmeDiscPageEntry definitions..."
rg -A 5 "type.*NvmeDiscPageEntry.*struct" 

Length of output: 2095

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4c9f6bf and f34f4b6.

📒 Files selected for processing (5)
  • cmd/connect-all.go (3 hunks)
  • pkg/clientconfig/cache.go (4 hunks)
  • pkg/clientconfig/conf_parser.go (3 hunks)
  • pkg/nvmeclient/nvme_client.go (4 hunks)
  • service/service.go (2 hunks)
🔇 Additional comments (9)
cmd/connect-all.go (2)

22-24: LGTM! Clean import organization.

The imports are well-organized and necessary for the new functionality.


79-83: LGTM! Let's verify the ConnectAll interface.

The implementation looks good and properly integrates the new timeout parameter. Let's verify the corresponding changes in the nvmeclient package.

✅ Verification successful

The ConnectAll interface matches the usage perfectly

The verification shows that the ConnectAll function in pkg/nvmeclient/nvme_client.go correctly accepts the ctrlLossTMO parameter, and there are no other usages of ConnectAll that need updating. The implementation is consistent across the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the ConnectAll function signature in nvmeclient package

# Check the ConnectAll function definition
ast-grep --pattern 'func ConnectAll($_, $_, $_, $_) $_ {
  $$$
}'

# Look for any other usages of ConnectAll that might need updating
rg -A 2 'ConnectAll\('

Length of output: 1252

pkg/clientconfig/conf_parser.go (1)

165-176: Let's improve the ctrl-loss-tmo flag handling

While the basic parsing is implemented correctly, we should consider a few enhancements:

  1. Let's add validation for the maximum allowed value
  2. We should set a default value when none is provided
  3. It would be helpful to document the flag's meaning and valid range in a comment

Here's a suggested enhancement:

+const (
+    // Maximum controller loss timeout in seconds
+    // -1: disabled, 0: no timeout, >0: timeout in seconds
+    MaxCtrlLossTMO = 3600 // 1 hour
+)

 case "-l", "--ctrl-loss-tmo":
     i++
     value := strings.TrimSpace(s[i])
     ctrlLossTMO, err := strconv.ParseInt(value, 10, 32)
     if err != nil {
         return nil, &ParserError{
             Msg:     fmt.Sprintf("bad controller loss timeout value"),
             Details: fmt.Sprintf("%s is not a valid int", s[i]),
             Err:     err,
         }
     }
+    if ctrlLossTMO > MaxCtrlLossTMO {
+        return nil, &ParserError{
+            Msg:     fmt.Sprintf("controller loss timeout value too large"),
+            Details: fmt.Sprintf("maximum allowed value is %d", MaxCtrlLossTMO),
+            Err:     nil,
+        }
+    }
     e.CtrlLossTMO = int(ctrlLossTMO)
+} else {
+    e.CtrlLossTMO = -1 // Set default to disabled

Let's verify the usage of CtrlLossTMO across the codebase:

service/service.go (2)

59-59: Formatting change noted

This is a minor formatting adjustment that doesn't affect functionality.


258-260: Implementation aligns with PR objectives

The addition of conn.CtrlLossTMO parameter to the ConnectAllNVMEDevices call successfully implements the ctrl-loss-tmo flag support. Let's verify the parameter usage across the codebase to ensure consistent implementation.

Let's run this script to verify the parameter usage:

✅ Verification successful

CtrlLossTMO parameter is consistently implemented across the codebase

The verification shows that the CtrlLossTMO parameter is properly integrated:

  • Defined in the Connection struct with clear documentation (seconds)
  • Correctly initialized through the configuration chain:
    • Command line via connect.ctrl-loss-tmo flag
    • Properly passed through connection initialization
    • Consistently used in ConnectAllNVMEDevices calls

The implementation is complete and consistent with no issues found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the consistent usage of CtrlLossTMO across the codebase

# Check for CtrlLossTMO field definition in Connection struct
echo "Checking Connection struct definition..."
rg -p "type\s+Connection\s+struct" -A 20

# Check for CtrlLossTMO initialization
echo "Checking CtrlLossTMO initialization..."
rg "CtrlLossTMO\s*[:=]"

# Check for ConnectAllNVMEDevices usage to ensure all calls include the new parameter
echo "Checking ConnectAllNVMEDevices usage..."
rg -p "ConnectAllNVMEDevices\s*\(" -A 3

Length of output: 1958

pkg/nvmeclient/nvme_client.go (2)

31-32: LGTM!

The imports are properly organized.


512-513: LGTM!

The parameter is correctly passed through to the underlying function.

pkg/clientconfig/cache.go (2)

68-68: LGTM: Function signature update is clean and well-integrated.

The addition of the ctrlLossTMO parameter to newConnection aligns well with the new functionality.


70-73: LGTM: Connection initialization is properly handled.

The CtrlLossTMO field is correctly initialized along with other connection properties.

pkg/clientconfig/conf_parser.go (outdated; resolved)
@yogev-lb yogev-lb requested a review from ronen-lb November 21, 2024 12:41
@yogev-lb yogev-lb force-pushed the yogev/fix-ctrl-loss-tmo branch 2 times, most recently from 06493fa to 188de4b Compare November 21, 2024 13:19
@muliby-lb (Contributor) left a comment

@yogev-lb seems legit, but:

  1. does this change the default behavior or do you only get a different behavior if you set ctrl-loss-tmo to something other than -1?
  2. how did you test this? here are the discovery-client tests we have in the CI, at least 279 is a must:
    -> /home/muli/lightbits/src/systests/racktests/1923_etcd_mTLS_cross_server_connectivity_new_installation.py [CI: sanity]
    -> /home/muli/lightbits/src/systests/racktests/2301_physical_alma_8_installation_sanity.py [CI: weekly (unstable)]
    -> /home/muli/lightbits/src/systests/racktests/239_alma_8_vm_installation_test.py [CI: sanity minisanity]
    -> /home/muli/lightbits/src/systests/racktests/279_duros_discovery_client.py [CI: weekly (unstable)]
    -> /home/muli/lightbits/src/systests/racktests/280_monitoring_installation_test.py [CI: weekly monitoring-stack]
    -> /home/muli/lightbits/src/systests/racktests/529_discovery_vm_installation_add_node.py [CI: weekly]
    -> /home/muli/lightbits/src/systests/racktests/609_vm_installation_pull.py [CI: nightly (unstable)]
    -> /home/muli/lightbits/src/systests/racktests/625_vm_installation_add_new_server_to_cluster.py [CI: weekly]
  3. @elada-lb does the PR check work already? Can we try it on this PR?

pkg/clientconfig/conf_parser.go (outdated; resolved)
@muliby-lb (Contributor):

BTW @yogev-lb, gentle reminder: this is a public repo.

@muliby-lb (Contributor):

CC @arturm-lb @anton-lb

cmd/connect-all.go (resolved)
application/app.go (resolved)
cmd/serve.go (resolved)
@yogev-lb (Contributor, Author) commented Nov 24, 2024

@muliby-lb to answer your questions:

  1. That is correct: the default was and still is -1, which means endless wait.
  2. I verified it with the DMS CI, which does about 600 volume attachments. Also, @alon-lb added a negative scenario to verify that this fix is indeed working as expected. (@alon-lb please provide a link to the PR here)
  3. I counted on the CI to run basic DC functionality, but I will run a few (like 279) and report.

@elada-lb (Collaborator):

> @yogev-lb seems legit, but: [...] 3. @elada-lb does the PR check work already? can we try it on this PR?

Yea we should be good to go with running it

@muliby-lb (Contributor):

Yea we should be good to go with running it

@elada-lb thanks can you point me at the documentation on how to run it?

@elada-lb (Collaborator) commented Nov 24, 2024

Yea we should be good to go with running it

@elada-lb thanks can you point me at the documentation on how to run it?

You can find it here (note this is still a WIP)

@muliby-lb (Contributor):

PR check running here: https://github.com/LightBitsLabs/discovery-client/actions/runs/12006948311 (thanks @elada-lb)

@elada-lb doesn't look like it ran much... Can you take a look?

@elada-lb (Collaborator):

PR check running here: https://github.com/LightBitsLabs/discovery-client/actions/runs/12006948311 (thanks @elada-lb)

@elada-lb doesn't look like it ran much... Can you take a look?
You can see in https://github.com/LightBitsLabs/discovery-client/actions/runs/12006948311/job/33466540385#step:3:28 it prints out this URL, which leads to the internal link to the CI run

@muliby-lb (Contributor):

You can see in https://github.com/LightBitsLabs/discovery-client/actions/runs/12006948311/job/33466540385#step:3:28 it prints out this URL, which leads to the internal link to the CI run

Thanks @elada-lb, I see it now. Looking at the workflow file, I see

      manifest_branch:
        description: "Manifest branch to run the PR checks with"
        default: "master"
        required: true
        type: string

which unless I am mistaken means it ran on the master branch and not on this PR branch, yogev/fix-ctrl-loss-tmo?

@yogev-lb (Contributor, Author):

@muliby-lb @ronen-lb The CI passed for the DC according to this run:

https://github.com/lightbitslabs/lbcitests/actions/runs/12006959193

waiting for approval...

@muliby-lb (Contributor) commented Nov 26, 2024

@yogev-lb (1) I am not sure the PR check actually ran on the right branch; waiting for @elada-lb to confirm. (2) Are all of the comments you received on this PR addressed? I usually just resolve those that have been handled ("I disagree because <reason>" is also a legit way to handle a comment).

@muliby-lb (Contributor) commented Nov 26, 2024

@yogev-lb I ran 239 a couple of times manually and it failed both times because the dsc exits when it can't find /dev/nvme-fabrics. Please consider whether this is indeed the right behavior: given that the dsc is a service, maybe we had better revert to the old behavior and wait until the module is loaded?

@elada-lb this strengthens my suspicion that the PR check (which shows 239 as passed) actually ran on master and not with @yogev-lb branch. Please check.

@elada-lb (Collaborator):

@yogev-lb I ran 239 a couple of times manually and it fails both times because the dsc exits since it can't find /dev/nvme-fabrics. Please consider whether this is indeed the right behavior, given that the dsc is a service, maybe we better revert to the old behavior and wait until the module will be loaded?

@elada-lb this strengthens my suspicion that the PR check (which shows 239 as passed) actually ran on master and not with @yogev-lb branch. Please check.

I'm pretty sure that's the case, we're currently looking into this.

@elada-lb (Collaborator):

@yogev-lb I ran 239 a couple of times manually and it fails both times because the dsc exits since it can't find /dev/nvme-fabrics. Please consider whether this is indeed the right behavior, given that the dsc is a service, maybe we better revert to the old behavior and wait until the module will be loaded?
@elada-lb this strengthens my suspicion that the PR check (which shows 239 as passed) actually ran on master and not with @yogev-lb branch. Please check.

I'm pretty sure that's the case, we're currently looking into this.

After merging #31 and a few other adjustments in lbcitests, you should be able to run the PR checks with supporting side branches as expected. Please try this out and let me know if there are other issues.

@yogev-lb yogev-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from 278ca61 to 364fe8d Compare November 26, 2024 19:36
@muliby-lb (Contributor):

After merging #31 and a few other adjustments in lbcitests, you should be able to run the PR checks with supporting side branches as expected. Please try this out and let me know if there are other issues.

Thanks @elada-lb, I'm giving it another spin now.

@muliby-lb (Contributor):

After merging #31 and a few other adjustments in lbcitests, you should be able to run the PR checks with supporting side branches as expected. Please try this out and let me know if there are other issues.

Thanks @elada-lb, I'm giving it another spin now.

@elada-lb failed in https://github.com/LightBitsLabs/lbcitests/actions/runs/12044433121/job/33581436912, can you please take a look?

@elada-lb (Collaborator) commented Nov 27, 2024

After merging #31 and a few other adjustments in lbcitests, you should be able to run the PR checks with supporting side branches as expected. Please try this out and let me know if there are other issues.

Thanks @elada-lb, I'm giving it another spin now.

@elada-lb failed in https://github.com/LightBitsLabs/lbcitests/actions/runs/12044433121/job/33581436912, can you please take a look?

@muliby-lb Why'd you pass manifest_branch=yogev/fix-ctrl-loss-tmo and not just the default master? If I understand correctly, this isn't a branch that exists in the manifest repository, so it fails.

@muliby-lb (Contributor):

@muliby-lb Why'd you pass manifest_branch=yogev/fix-ctrl-loss-tmo and not just the default master? If I understand correctly, this isn't a branch that exists in the manifest repository, so it fails.

@elada-lb quite possibly I misunderstood how it's meant to work. Where do I specify which branch/PR to run the PR check on?

@elada-lb (Collaborator):

@muliby-lb Why'd you pass manifest_branch=yogev/fix-ctrl-loss-tmo and not just the default master? If I understand correctly, this isn't a branch that exists in the manifest repository, so it fails.

@elada-lb quite possibly I misunderstood how it's meant to work. Where do I specify which branch/PR to run the PR check on?

Oh I see. Well, you don't need to pass the branch name, since you select the branch to run the workflow on, like so:
(screenshot: branch selector in the workflow dispatch dialog)

@elada-lb elada-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from 364fe8d to 7bf33b3 Compare November 27, 2024 09:27
@yogev-lb yogev-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from 7bf33b3 to 9c9be35 Compare November 27, 2024 10:22
@muliby-lb (Contributor):

@elada-lb @yogev-lb the good news is that it looks like the PR check ran to completion, the bad news is that both tests failed :-)

@muliby-lb (Contributor):

@yogev-lb fixing the tests so that they will pass is OK, but I'm still not convinced your change makes sense in the field. Will we start getting support calls because the dsc is down? I realize it can't work if the modules aren't loaded, but I'm concerned that it will go down and stay down even after the module is loaded. Unless systemd restarts it after a few minutes?

@@ -55,7 +55,7 @@ LABEL maintainers="Lightbits Labs" \

 RUN apk update && \
     apk add --no-cache \
-        util-linux
+        util-linux curl
Review comment from a Contributor:

@yogev-lb who runs the liveness probe, and how?

@alon-lb alon-lb force-pushed the yogev/fix-ctrl-loss-tmo branch 2 times, most recently from 55aa786 to 8011d33 Compare January 8, 2025 08:19
@yogev-lb yogev-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from 8011d33 to c5a449a Compare January 16, 2025 16:39
@alon-lb alon-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from c5a449a to 8011d33 Compare January 19, 2025 07:32
@yogev-lb yogev-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from 8011d33 to f45b96f Compare January 21, 2025 12:21
yogev-lb and others added 7 commits January 21, 2025 15:26
Until now we didn't support this flag in the DC configuration.
It was added only on the connect command line, but not as a service.
We needed to update the file parser, update the cache, and pass this
information to the ConnectAll command from the service.

We differentiate between cache entries coming from the user vs. from referrals.

The reason we do that is that a user may provide ctrl-loss-tmo for a server
that is not the default, and when a referral notifies us of a newly added
server we will not have this value.

When we add the referral entry we go over all the user entries, and if one
of them is not nil we use that value and apply it to the referral entry.
Sadly the DS will not be able to provide us this info about the new server
we connected to.

issue: LBM1-35562
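The user-to-referral propagation described in this commit message can be sketched like so (the `Entry` fields and helper name here are illustrative stand-ins, not the repository's actual definitions):

```go
package main

import "fmt"

// Entry is a simplified stand-in for the parser's Entry struct.
type Entry struct {
	Traddr      string
	CtrlLossTMO int // -1 means "not set / disabled"
}

// applyUserCtrlLossTMO copies a user-provided ctrl-loss-tmo onto a
// referral-derived entry, since the discovery service cannot report this
// value for a newly announced server. The first user entry with a set
// (non -1) timeout wins.
func applyUserCtrlLossTMO(referral *Entry, userEntries []Entry) {
	for _, u := range userEntries {
		if u.CtrlLossTMO != -1 {
			referral.CtrlLossTMO = u.CtrlLossTMO
			return
		}
	}
}

func main() {
	ref := Entry{Traddr: "10.0.0.2", CtrlLossTMO: -1} // from a referral
	users := []Entry{{Traddr: "10.0.0.1", CtrlLossTMO: 60}}
	applyUserCtrlLossTMO(&ref, users)
	fmt.Println(ref.CtrlLossTMO)
}
```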
There are still OSes out there that do not load nvme-tcp by default.
This causes the DC to not function properly when it tries to connect
to the io-controller.

Today we only see it in the logs, but the server runs successfully,
so it is tricky to locate.

This change will fail fast on service load and report what, in high
probability, needs to be done: load nvme-tcp. Today when we fail on an
application error we see the help message, which is not relevant for
these errors and is confusing.
If we want to run liveness probes on the discovery-client, we need to
run them from the container.

When a service exposes an endpoint, like this one exposing http, it is
good practice to expose a healthz endpoint that can be invoked by k8s
or docker-compose and indicate container status.
If the interval is 0 we will panic on:
`panic: non-positive interval for NewTicker`
Set the default to 5s to prevent such issues.

Signed-off-by: alon <[email protected]>
We noticed that there may be a race between writing and closing the file
and shutting down the host abruptly.

Signed-off-by: alon <[email protected]>
@yogev-lb yogev-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from f45b96f to 8d61b4e Compare January 21, 2025 15:26
yogev-lb and others added 2 commits January 27, 2025 05:53
We do not need to depend on any external repo's version; this is an
independent repo that needs to control its own version.

Signed-off-by: alon <[email protected]>
@yogev-lb yogev-lb force-pushed the yogev/fix-ctrl-loss-tmo branch from 8d61b4e to 2f09aae Compare January 27, 2025 05:53