Strange calls during network boot #649
Comments
I do observe something similar with Dell Latitude 3590, OptiPlex 3040 and others. Checking with
The last packet seems incorrectly parsed by It feels like some kind of bad memory access: "Onboard" is one of the EFI boot options on my end, and probably the same holds true for "USB" in @raddirad's case. The strange character is |
I've made some progress trying to understand the changes between 15.6. and 15.8. Adding: return EFI_SUCCESS; right here: Line 415 in 14d6339 (i.e. after the special case handling several devices), things work again with my affected systems. Of course, that's not a real solution, but it highlights how the bad loader name appears. So it seems that the Since the garbage does not start with |
maybe @julian-klode or @vathpela could take a look at this? Thanks in advance |
That looks like a I have absolutely no idea why this is being passed in the load options, but "is this a fully formed boot variable" is a thing that certainly could be tested for and ignored. |
Okay, to be fair, it is a slightly weird boot variable. The structure is like this pseudocode:
So that's: |
So in summary: 1) I have no idea why there's a boot variable hanging out here, 2) I have no idea why the device path list in the boot variable has this weird vendor device path, but 3) it is basically a reasonably well formed boot variable, and we could probably test for that, but I'd rather know why Dell is doing this, because it doesn't really seem like they should be. |
Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version. |
Also, I know it affects at least the Dell Optiplex 5040, 7040, 3060 and 5060 and Latitude 5400. For me, it's any Dell desktop or laptop I've needed to PXE install to so far. |
Have you tried going into the UEFI settings and in the 'boot sequence' section, unchecking the 'onboard nic' and 'usb' choices? You can still use f12 to choose a single-boot target of usb or network boot, but if you are permanently netbooting systems that won't work, obviously. |
I've not tried that path yet.
Part of our use case is to also utilize the "wake on lan + PXE" BIOS option
to auto install from an offline condition at a *remote* location. I'll need
to check if that is still possible. Sadly the older Optiplex (older than
##60) are not WMI capable and will physically need to have options changed.
The newer ones I can change BIOS via the SYS file system
(/sys/devices/virtual/firmware-attributes/dell-wmi-sysman/) from the
command line.
I will not be at a location to check/test until next Wednesday, however.
PJ
|
We are installing the OS via PXE/UEFI Netboot, and thus disabling boot choices is not an option. In addition the shim doesn't even try to load the grub via PXE/UEFI Netboot and hangs at the error described in OP. |
Yes, if you are only installing the OS, you can press f12 to do a one-time boot to PXE, even when PXE is not in the 'boot sequence' list. That is how I have been able to work around this bug to install the OS via network boot. |
In our case we are loading the shim via PXE and this bug happens before the shim chainloads the grub via PXE. |
Yes, that is how this bug is occurring. I would still suggest you try the workaround. It isn't a great solution, but it seems to work, and still allows you to interactively network boot for OS install. |
Ok, now I get it. Yeah, for me personally this is doable, but I can't tell our customers to do these things if they have a lot of affected devices. |
Indeed, thanks for the proposed workaround, in fact in our case we reinstall nodes without user interaction (i.e. by triggering a PXE boot remotely, by adding it to the boot order temporarily, then rebooting), so this does not help with the many distributed desktop machines we operate. |
This is the commit that introduces this issue. If I revert it, I can boot my dell (that I finally got hands-on with) with the Onboard devices still in the boot sequence. So, I'm guessing this is getting confused by the weird Dell boot entries, and screwing up the load path for grubx64.efi |
Possible fix: Lines 1262 to 1263 in 0287c6b
Add EFI_TFTP_ERROR here:
    if (!use_fb && (efi_status == EFI_INVALID_PARAMETER ||
                    efi_status == EFI_NOT_FOUND ||
                    efi_status == EFI_TFTP_ERROR)) {
In my testing, this gets it booting over the network again. |
Any guess as to how long a change like this may take to make it into an updated release package? |
maybe @vathpela @jsetje or @julian-klode could say more on if this might get upstream |
Thank you for getting my attention. Just testing for the extra error is probably reasonable, but I'm also curious why we get a variable that looks like that. Since I exposed this, I'll certainly help get a fix in. |
All of your help is much appreciated! Thank you for helping to resolve this
issue.
PJ
|
I started asking around to see if I could find a system to test this with, which made me wonder about a 2-in-1 with a built in NIC. So I looked at the original report again. I bet that this all has something to do with how the docking station brings the NIC in. |
It definitely isn't specific to docking stations. I have one dell on-hand with the issue, a Latitude 5300 (built-in NIC, no external NIC). But I also have one other dell, one HP, and one MS surface that do not show this issue, all 3 of those using USB NICs only. It's worth noting that USB boot technically has the same basic issue, but the fallback code kicks in on USB boot, because the error handling I pointed out handles the error when it's on a filesystem, it just doesn't handle it when it's on TFTP. I also wonder if HTTP(s) boot would be another error code that would need to be added there, but my only on-hand device with HTTP(s) boot support is the dell that doesn't demonstrate this issue. If you pay attention on USB boot, you can see the same error, followed by the message here: Line 1265 in 0287c6b
This is what led me to try adding EFI_TFTP_ERROR to that statement. |
Hmm, yeah, forced a (similar?) error by renaming grubx64.efi on my http boot server and booting my other dell. I'm guessing because 0x23 (35?) is relatively new, and I'm using fedora's shipping version of shim 15.8 which was probably compiled with an earlier version of gnu_efi. So I'd suggest adding I certainly wouldn't object to fixing the parsing of the weird values (if there is an actual issue, and it isn't just Dell and Lenovo (and maybe others) doing something that breaks the standard) but harmonizing the fallback behavior between local filesystems and network boot makes sense to me. |
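To make the fallback discussion above concrete, here is a minimal Python sketch of the status check, not shim's actual code. The numeric codes are the ones I understand the UEFI specification to assign (error codes have the high bit set; 23 for TFTP, 35 = 0x23 for HTTP). The helper name `should_retry_default` and the idea of passing the set of retryable codes as a parameter are assumptions made purely for illustration.

```python
# Sketch only: models the condition quoted earlier in the thread,
#   !use_fb && (efi_status == EFI_INVALID_PARAMETER || efi_status == EFI_NOT_FOUND || ...)
# The helper and set names below are hypothetical, not shim identifiers.

EFI_ERROR_BIT = 1 << 63  # high bit marks an error on 64-bit platforms


def efi_error(code: int) -> int:
    """Encode an EFI error status from its small integer code."""
    return EFI_ERROR_BIT | code


EFI_SUCCESS = 0
EFI_INVALID_PARAMETER = efi_error(2)
EFI_NOT_FOUND = efi_error(14)
EFI_TFTP_ERROR = efi_error(23)
EFI_HTTP_ERROR = efi_error(35)  # 35 == 0x23

# Codes handled before the proposed change (local filesystem failures).
LOCAL_ONLY = frozenset({EFI_INVALID_PARAMETER, EFI_NOT_FOUND})
# Assumption: the network boot failures discussed in this thread are added.
WITH_NETWORK = LOCAL_ONLY | {EFI_TFTP_ERROR, EFI_HTTP_ERROR}


def should_retry_default(status: int, use_fb: bool, retryable) -> bool:
    """Return True when the failed load should be retried with the default loader."""
    return (not use_fb) and status in retryable


if __name__ == "__main__":
    for name, codes in (("local only", LOCAL_ONLY), ("with network", WITH_NETWORK)):
        print(name, should_retry_default(EFI_TFTP_ERROR, False, codes))
```

With the local-only set, a TFTP failure is not retried, which matches the hang seen on network boot; with the extended set it is treated like a missing file on a local filesystem.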
FWIW, we'll have to fix this forward. In addition to the patch that exposed this, we'll need non-hardcoded paths and names for UKIs. Hopefully I can get my hands on a setup that exposes this, but I'm also not opposed to keep trying unless we get a very specific error. |
This is not related to the Dock. I tested the 2-in-1 and a working Dell device. The 2-in-1 failed, the other one succeeded. @olifre mentioned other devices that show the same behaviour ("Dell Latitude 3590, OptiPlex 3040 and others") |
If you want anything tested, I have access to the 2-in-1 convertible I mention in OP. I can test new code |
I can immediately add to the list:
After that, I stopped doing systematic testing, as testing other models (we have an assortment of Dell OptiPlex systems, but no other Latitudes at hand) would mean temporarily stealing them from active users to test them in our test network. I can certainly try to grab a specific model if you know you can get a hand on any OptiPlex, check it and report back here. Combining my list with the information provided by @pjwelsh above, I think the full known Dell list is:
From those numbers, it seems quite likely all the OptiPlex |
Could either of you dump out the raw boot option data and attach it here? I'd like to poke at it in code instead of trying to interpret the hexdump in my head. You can just copy the appropriate |
So I did this on an OptiPlex 3050
There are different devices with an
Boot000F* Onboard NIC
Boot0012* Onboard NIC(IPV4)
Boot0017* Onboard NIC
Boot0019* Onboard NIC(IPV6)
|
And to add some data points, the two entries matching "Onboard" on my Latitude 5300:
Boot0004* Onboard NIC(IPV6)
When this device encounters this error, the boot filename sent in the TFTP request (that should be This is readily visible in a packet capture of the TFTP download request. That is slightly different from the OptiPlex example above, so I wonder if the TFTP file request in @raddirad's example would have 0xc2 in front of "Onboard"? |
Per this comment #649 (comment), it seems that the d0/c2/etc are the length of the boot option's file_path_list[] entry. It's possible that the length in my example just happens to match what is sent in the filename.. so I would be curious to see if it's different with different lengths. |
@nathan-omeara could you tell me how to get those hex values? I would like to provide the same information. |
Maybe not.. "Onboard" shouldn't even be the path looked at, so it's probably more complicated than an off-by-one error. |
I happened to get hands-on with an Optiplex 7070 today, and verified it looks the same as my Latitude 5300: Length is |
I had some time to do some poking around again - made a build with verbose enabled, and added some extra debug prints. I made no real progress, I dumped some boot options from machines that work, and the basic structure seems the same, the option length value (0xd0 on my broken machines) does correctly point at the end of the end marker 0x7FFF0400 in every case, so I don't understand why specifically these Dells confuse the algorithm and end up with the length in the option name (or why it's trying to use the option name as a file path string at all, I feel like that should fail and not work in any of these cases). I think I would need to enroll my custom signing cert in the UEFI of a working machine so I can run the verbose shim on them as well to make any more progress, and I won't have time to do that any time soon. So I am still thinking my proposed fix over in #666 is still a good thing to do, even if this boot option parsing can be fixed. |
It's not Dell in this case, but rather shim. What's happening (I'm pretty sure), is that shim tries to parse the load option. But it's a very strange load option. Here's what it looks like as a regular EFI variable:
Ignore the first 4 bytes (07 00 00 00) as those are the EFI variable attributes. There are 2 of the 0x7FFF0400 end of device path nodes. When shim sees the first one, it stops parsing. Then it checks whether the spot it got to corresponds to what the load option says is the length of the path list (d0 00, i.e. 208 in little-endian). Since the device path length points after the second end of device path node rather than the first, shim thinks it's not a valid load option. When it thinks it's not an actual load option, it treats it like it's just a path and tries its best to strip it out after skipping an initial path, as a workaround for some EFI shell misadventures. In your case, the first "path" is 01 00 00 00, where the 00 00 would be a NUL terminating the string in UCS-2. The second "path" that it ends up matching on is everything from d0 to the 00 00 NUL after the Onboard NIC description. So, I think the fix is to accept the case where the end of device path node is found before the offset the device path length says it should be at. That's unusual but valid. What would end up happening in that case is that shim would try to parse the actual load option optional data it was looking for. Those are the 00 00 42 4f bytes at the end. I don't know what those are supposed to represent, but shim would treat them as an empty string since it would find the leading NUL, and you would keep the default second stage of using grub. I have a patch to try that out.
That or temporarily turn off secure boot. Shim does all the same things in that case except for validating the image it's going to start.
I think it is a good idea, but I think the implementation can be better to convert actual not found cases to |
When looking for load option optional data, the parser asserts that the byte after the end of device path node is the same as what the file path length says it should be. While unusual, it is valid if the end of device path node comes before the end of the file path list. That supports some unusual Dell load options where there are two device paths in the list but the first is terminated by an End Entire Device Path. Maybe they intended to use an End Device Path Instance node there? Who knows. Either way, treating it as invalid ends up trying to read paths from the beginning of the option with obviously poor results. Fixes: rhboot#649 Signed-off-by: Dan Nicholson <[email protected]>
If anyone wants to give #694 a spin, that would be great. |
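The length check described above can be illustrated with a small Python sketch. It is not shim's parser and the load option is synthetic, not a real Dell dump: the vendor nodes carry all-zero GUIDs and the function names (`strict_ok`, `lenient_ok`, and so on) are made up for illustration. It only assumes the standard load option layout (32-bit attributes, 16-bit file path list length, NUL-terminated UCS-2 description, device path nodes, optional data) and the 7F FF 04 00 end node mentioned in the thread.

```python
import struct

END_ENTIRE = bytes([0x7F, 0xFF, 0x04, 0x00])  # End Entire Device Path node


def node(dp_type, subtype, payload=b""):
    """Encode one device path node: type, subtype, 16-bit LE total length."""
    return struct.pack("<BBH", dp_type, subtype, 4 + len(payload)) + payload


def build_load_option(description, file_path_list, optional_data=b""):
    """Encode a load-option-shaped blob (without the EFI variable attributes)."""
    desc = description.encode("utf-16-le") + b"\x00\x00"
    header = struct.pack("<IH", 1, len(file_path_list))  # 1 = active
    return header + desc + file_path_list + optional_data


def split(option):
    """Return (file_path_list_length, file_path_list, optional_data)."""
    _attrs, fpl_len = struct.unpack_from("<IH", option, 0)
    pos = 6
    while pos + 2 <= len(option) and option[pos:pos + 2] != b"\x00\x00":
        pos += 2  # skip the UCS-2 description one character at a time
    pos += 2      # skip the terminating NUL
    return fpl_len, option[pos:pos + fpl_len], option[pos + fpl_len:]


def offset_after_first_end(fpl):
    """Walk nodes; return the offset just past the first End Entire node."""
    pos = 0
    while pos + 4 <= len(fpl):
        dp_type, subtype, length = struct.unpack_from("<BBH", fpl, pos)
        if length < 4 or pos + length > len(fpl):
            return None  # malformed node
        pos += length
        if dp_type == 0x7F and subtype == 0xFF:
            return pos
    return None


def strict_ok(option):
    """Old behaviour as described: the first end node must close the list."""
    fpl_len, fpl, _ = split(option)
    return offset_after_first_end(fpl) == fpl_len


def lenient_ok(option):
    """Proposed behaviour as described: the end node may come earlier."""
    fpl_len, fpl, _ = split(option)
    end = offset_after_first_end(fpl)
    return end is not None and end <= fpl_len


# Synthetic "two device paths, each ended by End Entire" option.
vendor = node(0x04, 0x03, bytes(16))  # media/vendor node, placeholder GUID
dell_like = build_load_option("Onboard NIC",
                              vendor + END_ENTIRE + vendor + END_ENTIRE,
                              b"\x00\x00BO")
ordinary = build_load_option("Disk", vendor + END_ENTIRE)

if __name__ == "__main__":
    print(strict_ok(dell_like), lenient_ok(dell_like), strict_ok(ordinary))
```

In this sketch the strict check rejects the two-path option because the first end node sits at offset 24 while the declared list length is 48, whereas the lenient check accepts it and leaves the trailing optional data (00 00 42 4f) where a parser would expect it.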
That fits what I was seeing, I was able to confirm that we exit Line 210 in e064e7d
with i less than 208.
I feel like that function should be a loop until i >= fplistlen, instead of a single pass, but I was second guessing myself. |
I checked out your branch and built it, and can confirm it requests I see your patch is not looping on Lines 421 to 443 in e064e7d
and then when it runs split_load_options again here: Lines 445 to 446 in e064e7d
it's parsing the second option, and that gets us the right data? I will do a deeper dive tomorrow, but that's my quick impression. |
What's happening is The whole point of
What Not sure about this part of my earlier analysis:
It should only skip the first path if it appears to match the path of the loaded image. I doubt the path to the loaded image looks like |
I considered that, but I think it's more correct the way I have it. All that
It would be nice to determine that all the paths in the device path list were valid, but shim isn't a load option linter. It's just trying to determine how to use the data based on its shape. |
I took a closer look at the file not found fallback, and I think #695 is a nicer way to handle it. It's completely untested, though. |
It's been a few months since this was reported, and various ways to deal with this have been proposed. However, there's still no official solution, i.e. a commit in https://github.com/rhboot/shim. I'd like to help bring this to a close; how can I help with that? I have various Dell devices available for testing. Our customers use them a lot, and of course I'd prefer things to be smooth for them. |
maybe @vathpela @jsetje or @julian-klode could say more on this |
or maybe @aronowski could help with bringing this to upstream |
We at Landesmedienzentrum Baden-Württemberg (https://www.lmz-bw.de/netzwerkloesung), a federal authority in Germany, support about 2000 schools with their school IT networks. We rely on opsi (https://opsi.org/en/), a canny open-source solution, for OS and software deployment. Some schools have been affected by the issues mentioned here in #649 and they are still unable to (re-)install computers in their network. We would greatly appreciate any progress on #649 and #666 to finally get a patch review on #428 (rhboot/shim-review#428). Thanks and cheers to all contributors! |
It would be slightly less of an issue if Dell exposed those options in one
of their wmi-sysman attributes, so they could be controlled like many of the
other BIOS options:
ls /sys/devices/virtual/firmware-attributes/dell-wmi-sysman/attributes/*/current_value
You could at least *remotely* (or programmatically) fix the setting then
instead of needing to touch *every* system like now :(
|
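For anyone who wants to check which settings their firmware exposes, the `ls` above can be scripted. This is a small Python sketch under the assumption that each attribute is a directory containing a readable `current_value` file, as the path in the comment suggests; the function name `read_current_values` is made up, and reading may require root on a real system.

```python
from pathlib import Path

# Path taken from the comment above; only present on systems with dell-wmi-sysman.
DELL_ATTRS = "/sys/devices/virtual/firmware-attributes/dell-wmi-sysman/attributes"


def read_current_values(root=DELL_ATTRS):
    """Map each attribute directory name to the content of its current_value file."""
    values = {}
    for path in sorted(Path(root).glob("*/current_value")):
        try:
            values[path.parent.name] = path.read_text().strip()
        except OSError:
            # Unreadable attribute (e.g. insufficient permissions): skip it.
            continue
    return values


if __name__ == "__main__":
    for name, value in read_current_values().items():
        print(f"{name} = {value}")
```

Searching the output for boot-sequence related names would show whether the relevant option is exposed on a given model.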
Originally, I thought just removing network boot and usb boot from the automatic boot order fixed it... because it does on my test laptop. Then someone told me it didn't work on theirs of the same model. I believe the magic is: If you have a bios/firmware admin password set, it doesn't matter if you remove those from the automatic boot order, it still confuses shim and fails to netboot. So in many secure environments, there is no 'acceptable' answer to this other than usb boot instead. |
A possible workaround is to create the file ( As mentioned already in #649 (comment) and also implemented in abotzung/foguefi@c4c4d6c. In my case (dnsmasq as TFTP server), this worked : ln -s grubx64.efi $(printf "\322")Onboard Apparently the non-printable character can differ, so you might need to find out which one works for you. I have found mine using : sudo systemctl stop dnsmasq
sudo strace -e file dnsmasq -d |
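The symlink workaround can also be scripted when the leading byte differs between machines. The following Python sketch assumes, as discussed earlier in the thread, that the requested name is a single stray byte followed by "Onboard" and that the TFTP server resolves symlinks inside its root; the function names and the example root path are hypothetical. Note that octal 322 from the `printf` above is 0xD2.

```python
import os


def garbled_name(length_byte, label=b"Onboard"):
    """Build the raw file name: one stray byte followed by the option label."""
    return bytes([length_byte]) + label


def link_garbled_name(tftp_root, length_byte, target="grubx64.efi"):
    """Create <stray byte>Onboard -> target inside tftp_root; return the link path."""
    # Work in bytes throughout, since the name is not valid UTF-8.
    link = os.path.join(os.fsencode(tftp_root), garbled_name(length_byte))
    if not os.path.lexists(link):
        os.symlink(os.fsencode(target), link)
    return link


if __name__ == "__main__":
    # Hypothetical TFTP root; the bytes are the values seen in this thread.
    for value in (0xC2, 0xD0, 0xD2):
        print(link_garbled_name("/srv/tftp", value))
```

This only papers over the problem on the server side; it does not change what shim requests.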
Hi
I have a problem with shim 15.8 and a Dell Latitude 5300 2-in-1 notebook.
This notebook uses the latest firmware, 1.29.
It is connected to Ethernet via a Thunderbolt USB-C docking station.
We do a lot of netbooting with a current shim 15.8. This shim is signed by Microsoft, although the problem isn't related to Secure Boot.
When netbooting on this specific machine, we get a strange request via TFTP.
Then the system fails and boots into Dell's SupportAssist mode.
To verify it's not related to our shim, I took the latest 15.8 shim from Canonical, with the same result.
Other systems, like the Dell 5430 or vSphere or Proxmox VMs, aren't affected. For now, this is the only system I know of that has this issue.
Other systems request the grub binary as expected after the revocations.efi is not found.