DDN External Checkers is a collection of elementary checks aimed at complementing the monitoring of a DDN SFA-based storage infrastructure.
In a DDN storage infrastructure, primary supervision is expected to go through the SFA OS API. A few items that the SFA OS API does not expose need to be monitored separately in order to:
- Provide insight on the storage infrastructure status
- Prevent forthcoming failures
DDN External Checkers are divided into two distinct, complementary modules:
- DDN SFA Checkers: Monitoring of DDN SFA controllers
- DDN EXAScaler Checkers: Monitoring of DDN Object Storage Servers (OSS)
DDN External Checkers produce ASCII files as output. In an integrated monitoring platform, these files should ideally serve as input data for the main monitoring solution (for example Nagios or Icinga).
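To illustrate how a monitoring solution might consume these files, here is a minimal Python sketch that parses one output line into structured data. It assumes the `host: check: key=value;key=value` layout seen in the sample outputs in this document. The function name `parse_checker_line` is hypothetical and not part of DDN External Checkers.

```python
def parse_checker_line(line):
    """Split 'host: check: k=v;k=v' into (host, check, {k: v}).

    Assumption: the layout is inferred from the sample output lines only.
    """
    host, check, payload = (part.strip() for part in line.split(":", 2))
    fields = {}
    # The sample lines separate fields with either ';' or ','.
    for item in payload.replace(",", ";").split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        fields[key.strip()] = value.strip()
    return host, check, fields


if __name__ == "__main__":
    print(parse_checker_line("sfa01o01: filesystems: fs-root=72;fs-var=12"))
```

Values are kept as strings so that an empty value (as in `disabled=`) is preserved rather than guessed at.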
The DDN EXAScaler Checkers perform the following checks:
Check | Purpose |
---|---|
Collectd | collectd systemd Unit Status |
Connection to MGS | Loss of Connectivity to MGS |
Degraded | Degraded State |
Filesystems | Local Filesystems Occupancy |
HA Resources | High Availability Resources Status |
High CPU Load | Occurrences of High CPU Load Warnings |
MB C3 Threshold | MB C3 Threshold Setting |
Uptime | Uptime |
- Function:
ddn-es-checker::collectd
- Purpose: Reports `collectd` systemd unit status.
- Type: Failure Detection
- Output:
sfa01o01: collectd: status=active
- Function:
ddn-es-checker::connection-to-mgs
- Purpose: Counts the number of times the connection to the MGS was lost on the current day. Frequent disconnections from the MGS might be a symptom of a forthcoming failure.
- Type: Preventive Failure Detection
- Output:
sfa01o01: connection-to-mgs: lost=0
- Function:
ddn-es-checker::degraded
- Purpose: Reports OSSs that have been put in degraded state. An OSS that starts exhibiting errors is put in degraded state to prevent it from failing.
- Type: Storage Infrastructure Status Monitoring
- Function:
ddn-es-checker::filesystems
- Purpose: Reports `/` and `/var` OSS filesystem occupancy. Full occupancy of one of the local filesystems would make the high-availability processes fail immediately.
- Type: Preventive Failure Detection
- Output:
sfa01o01: filesystems: fs-root=72;fs-var=12
- Function:
ddn-es-checker::ha-resources
- Purpose: Counts the number of high-availability resources that are currently configured / disabled. All high-availability resources should be configured whenever the OSSs are in optimal condition.
- Type: Preventive Failure Detection
- Output:
sfa01o01: ha-resources: configured=42;disabled=
- Function:
ddn-es-checker::high-cpu-load
- Purpose: Counts the number of occurrences of high CPU load warnings and reports the maximum / average CPU load. Frequent high CPU load messages are often a symptom of a forthcoming failure.
- Type: Preventive Failure Detection
- Output:
sfa01o01: high_cpu_load: avg=0;max=0;count=0
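As a rough sketch of how the EXAScaler results above could be graded by a monitoring solution, the Python helpers below map already-parsed fields to the conventional Nagios plugin exit codes. The thresholds (80 / 90, and one lost connection) are illustrative assumptions, as is the reading of the filesystem values as percentages. None of these function names come from DDN External Checkers.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def filesystems_status(fields, warn=80, crit=90):
    """Grade local filesystem occupancy.

    Assumptions: values are percentages; warn/crit thresholds are examples.
    """
    worst = max(int(value) for value in fields.values())
    if worst >= crit:
        return CRITICAL
    if worst >= warn:
        return WARNING
    return OK


def collectd_status(fields):
    """Anything other than an active systemd unit is treated as a failure."""
    return OK if fields.get("status") == "active" else CRITICAL


def connection_to_mgs_status(fields, warn=1):
    """Warn once the daily count of lost MGS connections reaches `warn`."""
    return WARNING if int(fields["lost"]) >= warn else OK


if __name__ == "__main__":
    # Field values taken from the sample output lines above.
    print(filesystems_status({"fs-root": "72", "fs-var": "12"}))
    print(collectd_status({"status": "active"}))
    print(connection_to_mgs_status({"lost": "0"}))
```

Appropriate thresholds depend on the site, so they are exposed as parameters rather than fixed.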
The DDN SFA Checkers perform the following checks:
Check | Purpose |
---|---|
Position Status | Position Status of Enclosures |
SCSI Events | Occurrences of SCSI Events |
Write-Back | Write-Back Feature Status |
- Function Name:
ddn-sfa-checker::position-status
- Purpose: Reports the position status of every enclosure (SAS cable connection state). A value other than success tends to indicate that the SAS cabling scheme has not been respected.
- Type: Storage Infrastructure Status Monitoring
- Output:
sfa01c0: position-status: 0=success,1=success,2=success,3=success,4=success,5=success,6=success,7=success
- Function Name:
ddn-sfa-checker::scsi-events
- Purpose: Counts the number of occurrences of SCSI events over the last 3 days. SCSI events often indicate a forthcoming failure of either a SAS cable or an IO module.
- Type: Preventive Failure Detection
- Output:
sfa01c0: scsi-events: 2021-10-15=0;2021-10-14=0;2021-10-13=0
- Function Name:
ddn-sfa-checker::write-back
- Purpose: Reports storage controllers with the Write-Back feature disabled (Write-Through mode). Write-Back should not be kept disabled longer than required by a specific maintenance operation.
- Type: Storage Infrastructure Status Monitoring
- Output:
sfa01c0: write-back: 0=true,1=true,2=true,3=true,4=true,5=true,6=true,7=true
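The SFA results above lend themselves to simple checks once parsed into key/value pairs. The following Python sketch, under that assumption, lists the keys whose value differs from the expected one (`success` for position status, `true` for write-back) and totals SCSI events across the reported days. The helper names and the `failure` value in the sample are hypothetical.

```python
def failing_keys(fields, expected):
    """Return the keys whose value differs from the expected one."""
    return sorted(key for key, value in fields.items() if value != expected)


def total_scsi_events(fields):
    """Sum the per-day SCSI event counts."""
    return sum(int(count) for count in fields.values())


if __name__ == "__main__":
    # Made-up sample: 'failure' is a placeholder for any non-success value.
    position = {"0": "success", "1": "success", "2": "failure"}
    print(failing_keys(position, "success"))

    # Counts taken from the scsi-events sample output above.
    events = {"2021-10-15": "0", "2021-10-14": "0", "2021-10-13": "0"}
    print(total_scsi_events(events))
```

An empty list from `failing_keys` and a total of zero events correspond to the healthy state shown in the sample outputs.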
- xCAT Management Server:
  - Distributed Shell Command `xdsh`
  - DDN:SFA Specific Distributed Shell Configuration:

        $ cat /opt/xcat/share/xcat/devicetype/DDN/SFA/config
        [main]
        [xdsh]
        pre-command=NULL
        post-command=NULL

    Note: The xCAT distributed shell `xdsh` can easily be replaced by a different distributed shell, such as `pdsh`, if preferred.
- Target Objects:
  - DDN SFA storage controllers
  - Lustre Object Storage Servers (OSS)
Clone Git repository:
$ git clone https://github.com/nicolas-tallet/ddn-external-checkers.git
$ export PATH="${PWD}/ddn-external-checkers/bin:${PATH}"
$ ddn-external-checkers -noderange-es NODERANGE_ES -noderange-sfa NODERANGE_SFA -system SYSTEM
Example:
$ ddn-external-checkers -noderange-es "sfa01oss[01-08]" -noderange-sfa "sfa01cont[01-02]" -system "sfa01"
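For readers unfamiliar with the bracketed noderange notation used in the example, the Python sketch below expands the simple `prefix[NN-MM]` form into individual host names. It is only an illustration: the xCAT noderange syntax is richer than this, and the function `expand_noderange` is hypothetical, not something provided by xCAT or by DDN External Checkers.

```python
import re


def expand_noderange(noderange):
    """Expand 'name[NN-MM]' into a list of host names.

    Assumption: only a single zero-padded numeric range is handled;
    anything else is returned unchanged as a one-element list.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", noderange)
    if match is None:
        return [noderange]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep the zero padding of the lower bound
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


if __name__ == "__main__":
    print(expand_noderange("sfa01cont[01-02]"))
    print(expand_noderange("sfa01oss[01-08]"))
```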