Add SEL interpretations for SuperMicro X10 #19

Open
dragonpaw opened this issue Jan 5, 2018 · 10 comments

@dragonpaw

I have a lot of SuperMicro servers that exhibit various SEL messages that are not currently understood by ipmi-sel. (Memory DIMMs failing, SMART errors, etc. All the things you get when you have a few hundred of the things.) The SuperMicro-provided ipmicfg tool understands these OEM codes, but ipmi-sel does not. It would be really helpful for me if these were added to ipmi-sel, since the official tool only works on x86 and I need to manage these servers from an ARM machine where I can only run ipmi-sel.

I'm happy to provide the comparative output of the two tools for the errors I get.

@chu11
Owner

chu11 commented Jan 5, 2018

Hi @dragonpaw, I am a little reluctant to add OEM support only by reverse engineering strings between the two tools, as you are suggesting. There are too many subtleties that probably shouldn't be assumed.

For starters, how about pinging Supermicro to see if you can get the official documentation for these OEM codes? I have asked them for this in the past and they have obliged.

@dragonpaw
Author

I can reach out to them, but it seems like that might be more difficult than you think. (Based on my prior attempts with them. Though if you know someone there, maybe you'd have better luck than I.)

I feel like at least some interpretation is better than none though.

@chu11
Owner

chu11 commented Jan 5, 2018

There are circumstances where a reverse-engineered interpretation may be fine, but DIMM errors are an area I'm a bit cautious about. There are many subtleties and details (not to mention which specific motherboards they apply to) that I'm wary of without full information. I suppose if you had a lot of unique/different examples of DIMM errors, perhaps there could be enough confidence that we were doing the right thing.

If you haven't skimmed through libfreeipmi/sel/ipmi-sel-string*, you may want to take a look to see what you might be getting yourself into :-)

@dragonpaw
Author

There are definitely unknowns. For DIMM errors, I know the first byte is the type of error and the second is the slot, so those are a bit more complicated than some. I do have examples of at least 3 kinds of errors, across 2 slots. (Since these machines I only use 2 slots of the 16 they have.) Things like SMART errors are a fair bit simpler and don't seem to have the second OEM byte.

I'll take a look at the strings and see...

@chu11
Owner

chu11 commented Jan 6, 2018

Are the sensor types/event types/sensor numbers/generator ids (and maybe other stuff) all the same in all the DIMM events? You may need to use --debug to look at everything.

> (Since these machines I only use 2 slots of the 16 they have.)

So we wouldn't know how the other 14 slots are specified?

Ping Supermicro. If you can't get anything, I will ping them too.

@dragonpaw
Author

Across machines, the same codes get mapped the same way. For example, memory errors on two hosts with different DIMM slots have the same first byte but a different second byte. And two machines with problems in the same slot, but with different problems (uncorrectable vs. correctable), show the same second byte.

@chu11
Owner

chu11 commented Jan 9, 2018

With the exception of the record ID fields, is everything else in the SEL events identical? You can use --hex-dump to see.
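The comparison asked about here can also be done mechanically on the raw bytes. The following is a minimal sketch, not FreeIPMI code: it assumes each event is a standard 16-byte SEL record with the record ID at offsets 0-1 and the timestamp at offsets 3-6, and the function name `sel_records_match` is made up for illustration. Since timestamps will normally differ between events as well, the sketch has a flag to skip them too.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEL_RECORD_LEN 16

/* Hypothetical helper: returns true when two raw 16-byte SEL records
 * match on every byte outside the record ID (offsets 0-1). When
 * ignore_timestamp is set, the timestamp (offsets 3-6) is skipped too. */
bool sel_records_match(const uint8_t a[SEL_RECORD_LEN],
                       const uint8_t b[SEL_RECORD_LEN],
                       bool ignore_timestamp)
{
    for (size_t i = 0; i < SEL_RECORD_LEN; i++) {
        if (i < 2)
            continue;                   /* record ID */
        if (ignore_timestamp && i >= 3 && i <= 6)
            continue;                   /* timestamp */
        if (a[i] != b[i])
            return false;
    }
    return true;
}
```

Feeding it the bytes printed by `--hex-dump` for two DIMM events would show whether anything other than the record ID (and timestamp) differs between them.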

@totoCZ

totoCZ commented Dec 5, 2018

product_ids.txt
Adding a reference to the mailing list, as requested:
https://lists.gnu.org/archive/html/freeipmi-users/2018-12/msg00003.html

ipmiutil says this in the comment

    /* ver 2 method: 2A 80 = P1_DIMMB1 */
    /* SuperMicro says:
     *  pair: %c (data2 >> 4) + 0x40 + (data3 & 0x3) * 3, (='B')
     *  dimm: %c (data2 & 0xf) + 0x27,
     *  cpu:  %x (data3 & 0x03) + 1);
     */
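To make the formula in that comment concrete, here is a small sketch that applies it literally. It is only an illustration of the quoted comment, not ipmiutil or FreeIPMI code; the function name `smc_dimm_label` and the assumption that `data2`/`data3` are event data bytes 2 and 3 are mine.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper applying the "SuperMicro says" formula from the
 * ipmiutil comment. Writes a label such as "P1_DIMMB1" into buf and
 * returns the snprintf() result. */
int smc_dimm_label(uint8_t data2, uint8_t data3, char *buf, size_t len)
{
    /* pair letter: high nibble of data2, offset by the CPU bits of data3 */
    char pair = (char)((data2 >> 4) + 0x40 + (data3 & 0x3) * 3);
    /* dimm digit: low nibble of data2, so 0x0A becomes '1' */
    char dimm = (char)((data2 & 0xf) + 0x27);
    /* cpu number: low two bits of data3, 1-based */
    unsigned cpu = (data3 & 0x03) + 1u;

    return snprintf(buf, len, "P%x_DIMM%c%c", cpu, pair, dimm);
}
```

Calling `smc_dimm_label(0x2A, 0x80, buf, sizeof buf)` produces `P1_DIMMB1`, which matches the `2A 80` example in the comment. Whether the formula holds for other boards or slot layouts is exactly the open question in this thread.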

@chu11
Owner

chu11 commented Dec 5, 2018

Per @TomHetmer's e-mail on the mailing list, these are the product IDs:

X10DRH => 2201
X10DRW-E => 2148
X11SPi-TF => 2369
X10SLL-F => 2049
X10DRL-i => 2097
X11DDW-NT => 2407
X10SLH-F/X10SLM+-F => 2051
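Since OEM interpretations would have to be keyed on the board, the list above could be carried as a simple lookup table. This is only a sketch: the struct, table, and function names are invented here, and it assumes the IDs listed are decimal values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table built from the product IDs listed above.
 * Assumption: the IDs are decimal. */
struct smc_board {
    uint16_t product_id;
    const char *name;
};

static const struct smc_board smc_boards[] = {
    { 2201, "X10DRH" },
    { 2148, "X10DRW-E" },
    { 2369, "X11SPi-TF" },
    { 2049, "X10SLL-F" },
    { 2097, "X10DRL-i" },
    { 2407, "X11DDW-NT" },
    { 2051, "X10SLH-F/X10SLM+-F" },
};

/* Returns the board name for a product ID, or NULL if it is not listed. */
const char *smc_board_name(uint16_t product_id)
{
    for (size_t i = 0; i < sizeof smc_boards / sizeof smc_boards[0]; i++) {
        if (smc_boards[i].product_id == product_id)
            return smc_boards[i].name;
    }
    return NULL;
}
```

For example, `smc_board_name(2201)` returns `"X10DRH"`, and an unlisted ID returns `NULL` so the caller can fall back to the uninterpreted output.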

@chu11
Owner

chu11 commented Dec 11, 2018

More notes from ipmiutil.

    #define NPAIRS  26
    char rgpair[NPAIRS] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    ...
          cpu = (b3 & 0x0F) + 1; /*0x80=CPU1, 0x81=CPU2*/
          pair = ((bdata & 0xF0) >> 4) - 1; /*0x10=pairA, 0x20=pairB*/
          if (pair < 0) pair = 0;
          if (pair > NPAIRS) pair = NPAIRS - 1;
          dimm = (bdata & 0x0F) - 9; /*0x0A=dimmX1, 0x0B=dimmX2*/
          if (dimm < 0)
             n = sprintf(desc,DIMM_UNKNOWN);  /* invalid */
          else
             n = sprintf(desc,"P%d_DIMM%c%d",cpu,rgpair[pair],dimm);
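For experimenting with this decoding outside ipmiutil, the fragment can be wrapped into a standalone function. The sketch below is an adaptation, not the ipmiutil source: the function name is made up, the `"DIMM_UNKNOWN"` string stands in for the `DIMM_UNKNOWN` macro whose definition is not shown, `snprintf` replaces `sprintf`, and the upper clamp uses `>=` so the index always stays inside the table.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NPAIRS 26
static const char rgpair[NPAIRS + 1] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

/* Hypothetical wrapper around the ipmiutil-style decoding quoted above.
 * bdata and b3 are the two data bytes (e.g. 0x2A and 0x80). */
int ipmiutil_style_dimm_label(uint8_t bdata, uint8_t b3,
                              char *desc, size_t len)
{
    int cpu  = (b3 & 0x0F) + 1;            /* 0x80=CPU1, 0x81=CPU2 */
    int pair = ((bdata & 0xF0) >> 4) - 1;  /* 0x10=pairA, 0x20=pairB */
    int dimm = (bdata & 0x0F) - 9;         /* 0x0A=dimmX1, 0x0B=dimmX2 */

    if (pair < 0)
        pair = 0;
    if (pair >= NPAIRS)                    /* stricter than the original '>' */
        pair = NPAIRS - 1;
    if (dimm < 0)                          /* invalid low nibble */
        return snprintf(desc, len, "DIMM_UNKNOWN");

    return snprintf(desc, len, "P%d_DIMM%c%d", cpu, rgpair[pair], dimm);
}
```

With `0x2A, 0x80` this yields `P1_DIMMB1`, the same label as the "SuperMicro says" formula quoted earlier in the thread. The two do not agree everywhere, though: for `0x2A, 0x81` this version gives `P2_DIMMB1`, while the earlier formula adds `(data3 & 0x3) * 3` to the pair letter and would give pair `E`. That kind of discrepancy is a reason to want the official documentation.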
