Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yogev/enhance debugging #1

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Yogev/enhance debugging #1

wants to merge 6 commits into from

Conversation

yogev-lb
Copy link
Collaborator

@yogev-lb yogev-lb commented May 2, 2022

  • Added some minor changes to the package in order to be able to invoke additional API calls that were not supported.
  • Added a String() method that makes it easier to debug the IF_SEND/IF_RECV calls (not clean but eases the debug)
  • Added IsGlobal API for exposing the isGlobal Range property.
  • Fix polling mechanism since some of the devices operations timeout on the default tmo set.

@yogev-lb yogev-lb requested review from adir-lb, lev-lb and sagi-lb May 2, 2022 06:24
if len(resp) > 0 {
cont = false
}
case <-timeoutC:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yogev-lb just to understand -
the old code did have timeout (100 iterations X 10ms interval = 1s timeout). what you are saying is he default session timeout wasn't taken into account?
can you explain the purpose of this timeout (better add such explanation to the commit message)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adir-lb that is exectily what im saying
some operations - like revert for intel drives took around 5s
this caused a timeout.
the spec talks about a default timeout of 30s i think so i set it to this tmo value by default.

@@ -27,6 +27,7 @@ var (
ErrPropertiesCallFailed = errors.New("the properties call returned non-zero")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yogev-lb can you elaborate in commit message why you had to move the definitions?

@@ -209,6 +419,7 @@ func (m *MethodCall) Execute(c CommunicationIntf, proto drive.SecurityProtocol,
return nil, ErrReceivedUnexpectedResponse
}

// log.Printf("exec cmd: %s", m.String(b))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove this comment.

}
_, ok4 := param.([]byte)
if ok4 {
//log.Printf("got []byte on params[%d]", i)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't forget to remove the code in comment.

pkg/core/method.go Outdated Show resolved Hide resolved
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch from 0048f9a to 802df72 Compare May 8, 2022 08:10
@yogev-lb
Copy link
Collaborator Author

yogev-lb commented May 8, 2022

after looking today at upstream repo i see that there are many tasks that are started to be implemeneted that we implemented in our library:

for example they do a big refactor in this PR:

Also they add get-msid and set-sid to the library which i implemented in the wrapping sed library here:

I may wait with the upstream PR after the open-source-firmware#88 is merged
to see what are the leftovers to merge.

@ChriMarMe
Copy link

Hello there, I wrapped up the refactor PR in upstream repo. TakeOwnership will take more time and rework of sessions as well.

A review would be helpful to get the changes in asap, so you can continue with your PR. (I also would be happy to see a PR of these changes in the upstream repo ;)

@ChriMarMe
Copy link

Hello, me again.

open-source-firmware#94

This implements take ownership and SetMBREnable.

It would be awesome if you can provide me with some feedback. :)

@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch 4 times, most recently from b680502 to 5eb192f Compare May 31, 2022 12:40
@@ -237,6 +237,7 @@ func NewControlSession(d drive.DriveIntf, d0 *Level0Discovery, opts ...ControlSe
// Technically we should be able to advertise 0 here and the disk should pick
// for us, but that results in small values being picked in practice.
rhp.MaxComPacketSize = 1024 * 1024 // 1 MiB for good measure
//rhp.MaxComPacketSize = 1024 * 32 // LBM1-20784
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yogev-lb : er... that line looks thoroughly commented out. is that by design, given the commit message?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that required? Until now the only usecase for MaxComPacketSize is when transmitting PBA image to MBR, but that doesnt effect recieving capabilities, which is handled by HostProperties ( at least that is the way I understand it)

Also check out "takeOwnership" branch in upstream repo.

Maybe it is benefitial for all of us if we can work on upstream together.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChriMarMe : please see the commit message that introduces this change. we're getting sporadic errors on at least one SED model when using such huge buffers. so far, dropping it to, IIRC, pretty much anything below 1MB seemed to fix the issue for that particular model, but we're not taking any chances.

it might be argued that the problem we're seeing is peculiar to a certain server model (PCIe topology with switches) and a certain SED model, and so this might be addressed with a model-specific "quirk", but i personally am not very fond of those. we might also be able to do a bit better than hard-coding a value (any value, really, 32kB is no more or less magical than 1MB) while keeping it generic. this is our current brute-force hack. "fix it twice", and all that... ;)

Copy link

@ChriMarMe ChriMarMe Jun 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry. I overlooked the commit message.

Do you think it could be benefitial if we create an optional command parameter which forces certain properties, e.g. MaxComPacketSize to a given value? Thinking of something like -comsize or -compacksize

edit:
-hcomsize/hcompacksize (for HostProperties)
-tcomsize/tcompacksize (for TPerProperties)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChriMarMe : manually passed opts seem sub-optimal at this point. they certainly should work, but their usability is questionable: we probably shouldn't assume that the caller of the method will know the model of the device at all times ahead of time, and have the full list of the affected models - this is essentially the codifying of "quirks" i referred to above, but pushed up the stack.

what i had in mind is something a little more automatic, like, perhaps, detecting that the TPer returns apparently bogus negotiated properties (the returned host half), and only in these cases doing on the host side what the TPer should have done and explicitly re-negotiating properties with values that would work for both sides.

so far we've been able to observe what looks like buggy properties negotiation on just one SED model from just one vendor, and possibly affected by the PCIe connectivity topology at that. we'll want to have a word with the vendor first and sleep on it for a bit. then we can revisit this and propose a more sensible fix worth upstreaming.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChriMarMe :

I read it the way that the TPer uses the Values supplied in HostProperties parameter. So responsibility to get the values correct is on the hosts side.

i don't think this is a viable interpretation. TPers (SEDs) have limited HW resources and plenty of features are optional, so i don't think the host can dictate to TPer session params outside the scope of TPer's capabilities. besides, if this was so - why even bother with reporting back the TPer properties separately?

What I understood from your former quote was: TPer has some fixed values for HostProperties which needs to be reported to the host.

i don't think that's it either. my interpretation is that the TPer has a set of capabilities (inc. max buffer sizes) that it supports. it reports those in the response SMUID.Properties.Properties (see 5.2.2.1.2.1). they are "fixed" insofar as they're limited by the SED's HW & FW, and they don't change depending on the host that's sitting on the other side of the wire or the values that host might have specified, but they are not HostProperties.

the response SMUID.Properties.HostProperties (5.2.2.1.2.2), on the other hand, is what the TPer proclaims it'll use for a given set of sessions (q.v ComID). these are not fixed, and host SW passing in different HostProperties in the request might have different corresponding values returned, at TPer's discretion (within the bounds of the capabilities of both sides, i would imagine). this corresponds to the following two passages that me and you quoted, respectively:

this portion of the method result SHALL include the communications limitations and capabilities that the TPer SHALL use for messages sent from the TPer to the Host.

the TPer SHALL respond with ALL host properties it supports, with their current values, in the HostProperties parameter of its method response

in the meantime we were able to reproduce the issue we're seeing with the SED model in question with plenty of other SEDs of the same model, on several different server chassis of different models by different vendors with a few different Linux kernel versions. so it seems to be a vendor specific issue or a SED model specific one.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I got it. Thank you for taking the time to explain it to me.

Great work! 👍

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do keep in mind that i'm not one of the spec co-authors, so it's just my interpretation. empirically, it does appear to match the interpretation of most of the NVMe SED vendors we were able to lay our hands on so far - but not this one particular vendor, it seems. :)

i'll keep this thread posted once we hear back from their team.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ChriMarMe : after the 1st phase of root-causing the issue the vendor's engineering team seems to agree with my take on the situation (see above) and apparently a fix is planned for a future FW release. we'll see how that one pans out.

unfortunately, that will take a while and won't address drives already deployed [with legacy FW]. therefore i think i'd like to take you up on the idea of adding an option to cap MaxComPacketSize. we're only interested in the library side of things, not the binaries. at the very least we'll need to add an option to core.NewControlSession() (ControlSessionOpt), something like WithMaxComPacketSize(uint), but i also suspect an InitializeOpt will be required for locking.Initialize(). AFAICT this should be backwards compatible with the existing library users, and should allow us to address the immediate problem - albeit in a somewhat heavy-handed fashion.

would such a change be acceptable for you? if so - we can whip up a corresponding separate PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds reasonable to introduce that. There is a real world problem and you need it fixed. Also your proposed solution follows a pattern already used and it is optional for any user.

@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch 3 times, most recently from 3b22df3 to 48a8600 Compare June 14, 2022 09:19
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch from 48a8600 to 7ba869f Compare June 16, 2022 08:18
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch 3 times, most recently from 5a4840d to c62b435 Compare June 28, 2022 12:20
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch 3 times, most recently from add36fb to 79d5bc2 Compare July 7, 2022 12:14
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch 5 times, most recently from 9aee3b2 to ccd8c31 Compare August 1, 2022 12:46
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch from ccd8c31 to 14a2c6e Compare August 3, 2022 14:39
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch 5 times, most recently from f810132 to 883c0bd Compare August 21, 2022 11:48
@yogev-lb yogev-lb force-pushed the yogev/enhance-debugging branch from 883c0bd to 46f9344 Compare August 23, 2022 04:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants