Add SEL interpretations for SuperMicro X10 #19
Comments
Hi @dragonpaw, I am a little reluctant to add OEM support only by reverse engineering strings between the two tools, as you are suggesting. There are too many subtleties that probably shouldn't be assumed. For starters, how about pinging Supermicro to see if you can get the official documentation for these OEM codes? I have done this with them in the past and they have obliged.
I can reach out to them, but it seems like that might be more difficult than you think, based on my prior attempts with them. (Though if you know someone there, maybe you'd have better luck than I would.) I feel like at least some interpretation is better than none, though.
There are circumstances where a reverse-engineered interpretation may be fine, but DIMM errors are an area I'm cautious about. There are many subtleties and details (not to mention which specific motherboards they apply to) that I'm wary of without full information. I suppose if you had a lot of unique/different examples of DIMM errors, perhaps there could be enough confidence that we were doing the right thing. If you haven't skimmed through
There are definitely unknowns. For DIMM errors, I know the first byte is the type of error and the second is the slot, so those are a bit more complicated than some. I do have examples of at least 3 kinds of errors and 2 slots. (On these machines I only use 2 of the 16 slots they have.) Things like SMART errors are a fair bit simpler and don't seem to have the second OEM byte. I'll take a look at the strings and see...
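To make the "first byte is the error type, second byte is the slot" observation concrete, here is a minimal Python sketch of that two-byte split. Everything in it is an assumption for illustration: the table `HYPOTHETICAL_ERROR_TYPES` holds placeholder values, not real Supermicro codes, and the slot byte is reported raw because nobody in this thread knows how it maps to a slot label.

```python
from typing import Dict

# Placeholder codes only. The real values would have to come from
# Supermicro documentation or from comparing against ipmicfg output.
HYPOTHETICAL_ERROR_TYPES: Dict[int, str] = {
    0x01: "placeholder: correctable error",
    0x02: "placeholder: uncorrectable error",
}


def describe_dimm_event(first_byte: int, second_byte: int) -> str:
    """Describe a DIMM event from its two OEM bytes.

    first_byte selects the error type; second_byte identifies the slot.
    The slot is printed as a raw byte because its mapping is unknown.
    """
    error = HYPOTHETICAL_ERROR_TYPES.get(
        first_byte, f"unknown error type 0x{first_byte:02X}"
    )
    return f"{error}, slot byte 0x{second_byte:02X}"
```

Unknown error types fall through to a raw hex value instead of a guess, which seems like the safer behaviour while only 3 error kinds and 2 slots have been observed.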
Are the sensor types/event types/sensor numbers/generator ids (and maybe other stuff) all the same in all the DIMM events? You may need to use
So we wouldn't know how the other 14 slots are specified? Ping Supermicro. If you can't get anything, I will ping them too.
Across machines, the same codes get mapped the same way. For example, memory errors on two hosts with different DIMM slots have the same first byte but a different second byte. And two events for the same slot on different machines, but with different problems (uncorrectable vs. correctable), show the same second byte.
With the exception of the record ID fields, is everything else in the SEL events identical? You can use
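One way to answer that question mechanically is to split each raw record into fields and list the ones that differ. The sketch below assumes the events are standard 16-byte system event records as laid out in the IPMI specification (record type 0x02); OEM record types use a different layout and are not handled. The function names are made up for this example, and ignoring the timestamp as well as the record ID is my own choice, since two separate events will rarely share one.

```python
import struct
from typing import Dict, List, Tuple

# Field order of a 16-byte IPMI system event record (little-endian).
FIELDS: Tuple[str, ...] = (
    "record_id", "record_type", "timestamp", "generator_id", "evm_rev",
    "sensor_type", "sensor_number", "event_dir_type",
    "event_data1", "event_data2", "event_data3",
)


def parse_sel_record(raw: bytes) -> Dict[str, int]:
    """Split a raw 16-byte system event record into named integer fields."""
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    # H = 2-byte field, I = 4-byte timestamp, B = single byte.
    return dict(zip(FIELDS, struct.unpack("<HBIHBBBBBBB", raw)))


def differing_fields(
    a: bytes, b: bytes, ignore: Tuple[str, ...] = ("record_id", "timestamp")
) -> List[str]:
    """Return the names of fields that differ between two records."""
    rec_a, rec_b = parse_sel_record(a), parse_sel_record(b)
    return [f for f in FIELDS if f not in ignore and rec_a[f] != rec_b[f]]
```

If two DIMM events from different slots differ only in `event_data3`, for instance, that would support the idea that a single byte carries the slot.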
product_ids.txt ipmiutil says this in the comment
Per @TomHetmer's e-mail on the mailing list, these are product IDs: X10DRH => 2201
More notes from ipmiutil.
I have a lot of SuperMicro servers that exhibit various SEL messages that are not currently understood by ipmi-sel. (Memory DIMMs failing, SMART errors, etc. All the things you get when you have a few hundred of the things.) The SuperMicro-provided ipmicfg tool understands these OEM codes, but ipmi-sel does not. It would be really helpful for me if these were added to ipmi-sel, since the official tool only works on x86 and I need to manage these servers from an ARM machine where I can only run ipmi-sel.
I'm happy to provide the comparative output of the two tools for the errors I get.