CTOR-1075 : Doc(operatingsystems-linux-snmp) : Add troubleshooting section for snmp uptime issue #3850

Merged
@@ -113,7 +113,6 @@ Configure those extra SNMP options in the host/host template configuration in th
| -e | --securityengineid |
| -E | --contextengineid |


### UNKNOWN: SNMP GET Request : Timeout

Often, a timeout comes from:
@@ -160,6 +159,76 @@ run into this error.
For interfaces and storage checks, options exist to ask the probe to use
another OID (e.g. `--oid-filter='ifDesc' --oid-display='ifDesc'`).

### Uptime issue

### Context on sysUpTime in SNMP

When the uptime exceeds 497 days, a specific issue can occur because of how SNMP represents
uptime. `sysUpTime` is a TimeTicks value: the number of hundredths of a second (centiseconds)
elapsed since the system was last restarted. TimeTicks is stored as an unsigned 32-bit integer,
so it can hold values between 0 and 4,294,967,295. The counter therefore reaches its maximum
after approximately 497 days of uptime (4,294,967,295 centiseconds). Past this limit, an
overflow occurs and the counter restarts from zero.

### How to identify the issue?

To confirm that the uptime has exceeded the 497-day limit, check the uptime directly on the
device (when possible) rather than through SNMP. For example, on Linux, use the following command:

```commandline
uptime
14:32:12 up 500 days, 3:04, 2 users, load average: 0.15, 0.10, 0.09
```

This indicates that the system has been running for 500 days, 3 hours, and 4 minutes.

### Proposed solution: the --check-overload option

Most service templates that monitor uptime over SNMP use the `--check-overload` option,
which handles the uptime overflow after 497 days. The plugin uses its cache to look up the
previous uptime, works out that an overflow has occurred, and adjusts the uptime value it
returns accordingly. The overflow is then transparent: it raises no false uptime alert and
requires no action from the user.

### If the overflow occurred but the --check-overload option was not present in the plugin command

If the `--check-overload` option was not present in the plugin command before the overflow
occurred, you can correct the situation as follows.

Run the plugin command with the `--check-overload` option added:

```commandline
/usr/lib/centreon/plugins/centreon_linux_snmp.pl --plugin=os::linux::snmp::plugin --mode=uptime --hostname=XXXX --snmp-version='2c' --snmp-community='public' --check-overload
OK: System uptime is: 11h 28m 39s | 'uptime'=41319.00s;;;0;
```

Then check that the `overload` entry has been added to the plugin's cache:

```commandline
cat /var/lib/centreon/centplugins/cache_<hostname>_uptime
{"last_time":170905862051,"overload":0,"uptime":"4131920"}
```

Set the value of `overload` to 1 and check that the change was applied:

```commandline
sed -i 's/"overload":0/"overload":1/g' /var/lib/centreon/centplugins/cache_<hostname>_uptime
cat /var/lib/centreon/centplugins/cache_<hostname>_uptime
{"last_time":170905862051,"overload":1,"uptime":"4131920"}
```

You can then rerun the plugin command with the `--check-overload` option: the result
should now account for the overflow and match the uptime you checked manually
on the system:

```commandline
/usr/lib/centreon/plugins/centreon_linux_snmp.pl --plugin=os::linux::snmp::plugin --mode=uptime --hostname=XXXX --snmp-version='2c' --snmp-community='public' --check-overload
OK: System uptime is: 497d 13h 58m 41s | 'uptime'=42991121.00s;;;0;
```

## HTTP and API checks

### UNKNOWN: Cannot decode response (add --debug option to display returned content)
@@ -156,6 +156,74 @@ run into this error.
For interfaces and storage checks, options exist to ask the probe to use
another OID (e.g. `--oid-filter='ifDesc' --oid-display='ifDesc'`).

### Uptime issue

### Context on sysUpTime in SNMP

When the uptime exceeds 497 days, a specific issue can occur because of how SNMP represents
uptime. `sysUpTime` is a TimeTicks value: the number of hundredths of a second (centiseconds)
elapsed since the system was last restarted. TimeTicks is stored as an unsigned 32-bit integer,
so it can hold values between 0 and 4,294,967,295. The counter therefore reaches its maximum
after approximately 497 days of uptime (4,294,967,295 centiseconds). Past this limit, an
overflow occurs and the counter restarts from zero.
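The 497-day figure follows directly from the counter width. A minimal Python sketch of the arithmetic (the constant and function names are illustrative and not part of any plugin):

```python
TICKS_PER_SECOND = 100      # TimeTicks are hundredths of a second
COUNTER_MODULUS = 2 ** 32   # number of distinct values in a 32-bit counter
SECONDS_PER_DAY = 86_400


def wrap_period_days() -> float:
    """Return how many days pass before a 32-bit TimeTicks counter wraps."""
    seconds = COUNTER_MODULUS / TICKS_PER_SECOND
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Prints "497.10 days"
    print(f"{wrap_period_days():.2f} days")
```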

### How to identify the issue?

To confirm that the uptime has exceeded the 497-day limit, check the uptime directly on the
device (when possible) rather than through SNMP. For example, on Linux, use the following command:

```commandline
uptime
14:32:12 up 500 days, 3:04, 2 users, load average: 0.15, 0.10, 0.09
```

This indicates that the system has been running for 500 days, 3 hours, and 4 minutes.
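To see why the value obtained over SNMP disagrees with the local `uptime` output once the limit is passed, the sketch below computes what a 32-bit `sysUpTime` counter would hold for a given real uptime. It only illustrates the wraparound; the function names are hypothetical.

```python
TICKS_PER_SECOND = 100      # TimeTicks are hundredths of a second
COUNTER_MODULUS = 2 ** 32   # a 32-bit counter wraps at this value


def wrapped_ticks(real_uptime_seconds: int) -> int:
    """Return the TimeTicks value a 32-bit counter would show for this uptime."""
    return (real_uptime_seconds * TICKS_PER_SECOND) % COUNTER_MODULUS


def has_wrapped(real_uptime_seconds: int) -> bool:
    """Return True if the real uptime no longer fits in a 32-bit TimeTicks counter."""
    return real_uptime_seconds * TICKS_PER_SECOND >= COUNTER_MODULUS


if __name__ == "__main__":
    # 500 days, 3 hours and 4 minutes, as in the `uptime` output above
    real = 500 * 86_400 + 3 * 3_600 + 4 * 60
    ticks = wrapped_ticks(real)
    # Prints: True 26136704 261367
    print(has_wrapped(real), ticks, ticks // TICKS_PER_SECOND)
```

For this example, a wrapped counter would correspond to roughly 3 days (261,367 seconds) instead of 500 days.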

### Proposed solution: the --check-overload option

Most service templates that monitor uptime over SNMP use the `--check-overload` option,
which handles the uptime overflow after 497 days. The plugin uses its cache to look up the
previous uptime, works out that an overflow has occurred, and adjusts the uptime value it
returns accordingly. The overflow is then transparent: it raises no false uptime alert and
requires no action from the user.
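The compensation described above can be sketched as follows. This is an illustrative reconstruction based on the `uptime` and `overload` fields visible in the cache file shown in this section, not the plugin's actual code; the function name and signature are assumptions.

```python
from typing import Optional, Tuple

TICKS_PER_SECOND = 100      # TimeTicks are hundredths of a second
COUNTER_MODULUS = 2 ** 32   # a 32-bit counter wraps at this value


def corrected_uptime(current_ticks: int,
                     previous_ticks: Optional[int],
                     overload: int) -> Tuple[int, int]:
    """Return (uptime in whole seconds, updated overflow count).

    previous_ticks and overload stand for the values kept in the cache
    between two runs (hypothetical model of the cache content).
    """
    # A counter that went backwards since the last run is assumed to have wrapped.
    if previous_ticks is not None and current_ticks < previous_ticks:
        overload += 1
    total_ticks = overload * COUNTER_MODULUS + current_ticks
    return total_ticks // TICKS_PER_SECOND, overload


if __name__ == "__main__":
    # Counter was close to its maximum on the previous run, small value now.
    print(corrected_uptime(100, 4_294_967_000, 0))   # (42949673, 1)
    # Overflow already recorded in the cache, counter still increasing.
    print(corrected_uptime(4_131_920, 4_000_000, 1))  # (42990992, 1)
```

With one recorded overflow and a raw value of 4,131,920 ticks, the sketch yields about 42,990,992 seconds, which is a little over 497 days and 13 hours.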

### If the overflow occurred but the --check-overload option was not used in the plugin command

If the `--check-overload` option was not included in the plugin command
before the overflow occurred, you can correct the situation by following these steps.

Run the plugin command with the `--check-overload` option added:

```commandline
/usr/lib/centreon/plugins/centreon_linux_snmp.pl --plugin=os::linux::snmp::plugin --mode=uptime --hostname=XXXX --snmp-version='2c' --snmp-community='public' --check-overload
OK: System uptime is: 11h 28m 39s | 'uptime'=41319.00s;;;0;
```

Then check that the `overload` entry has been added to the plugin's cache:

```commandline
cat /var/lib/centreon/centplugins/cache_<hostname>_uptime
{"last_time":170905862051,"overload":0,"uptime":"4131920"}
```

Set the value of `overload` to 1 and check that the change was applied:

```commandline
sed -i 's/"overload":0/"overload":1/g' /var/lib/centreon/centplugins/cache_<hostname>_uptime
cat /var/lib/centreon/centplugins/cache_<hostname>_uptime
{"last_time":170905862051,"overload":1,"uptime":"4131920"}
```
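The `sed` substitution depends on the exact text `"overload":0` being present in the file. As an alternative, the same change can be made by parsing the file, assuming the cache is stored as JSON as in the output above. The helper name `set_overload` is hypothetical, and the demonstration below works on a temporary copy of the sample content rather than on a real cache file.

```python
import json
import tempfile
from pathlib import Path


def set_overload(cache_path: str, overload: int = 1) -> dict:
    """Set the "overload" counter in a JSON cache file and return the new state."""
    path = Path(cache_path)
    state = json.loads(path.read_text())
    state["overload"] = overload
    # Compact separators keep the one-line layout shown in the original file.
    path.write_text(json.dumps(state, separators=(",", ":")))
    return state


if __name__ == "__main__":
    sample = '{"last_time":170905862051,"overload":0,"uptime":"4131920"}'
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "cache_demo_uptime"  # stand-in for the real cache file
        demo.write_text(sample)
        set_overload(str(demo))
        print(demo.read_text())
```

Keeping a backup copy of the cache file before modifying it makes it easy to revert if the next check returns an unexpected value.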

You can then rerun the plugin command with the `--check-overload` option: the result
should now account for the overflow and match the uptime you checked manually
on the system:

```commandline
/usr/lib/centreon/plugins/centreon_linux_snmp.pl --plugin=os::linux::snmp::plugin --mode=uptime --hostname=XXXX --snmp-version='2c' --snmp-community='public' --check-overload
OK: System uptime is: 497d 13h 58m 41s | 'uptime'=42991121.00s;;;0;
```

## HTTP and API checks

### UNKNOWN: Cannot decode response (add --debug option to display returned content)