- Description
- Dependencies
- How to use LogAider
- 1. Parsing and Filtering
- 2. Across-Field correlation
- Analysis based on RAS Log
- Extract value types for each field
- Generate state features
- Construct value-combination pool based on schema and featureStates
- Calculate the number of value combinations by brute-force method
- Generate analysis for inputMsg.txt
- Generate spatial distribution (to be plotted in a graph by plot.PlotMiraGraph later on)
- Analysis based on job scheduling log (Cobalt)
- Analysis based on RAS Log
- 3. Analyze failure rate of components
- 4. Plot error distribution
- 5. Generate monthly and daily Log Analysis Results
- 6. Analyze the error propagation
- 7. Analyze Spatial-correlation
Today’s large-scale supercomputers are producing a huge amount of log data. Exploring various potential correlations of fatal events is crucial for understanding their causality and improving the working efficiency of system administrators. To this end, we developed a toolkit, named LogAider, that can reveal three types of potential correlations: across-field, spatial, and temporal. Across-field correlation refers to the statistical correlation across fields within a log or across multiple logs based on probabilistic analysis. For analyzing the spatial correlation of events, we developed a generic, easy-to-use visualizer that can view any events queried by users on a system machine graph. LogAider can also mine spatial correlations with an optimized K-means clustering algorithm over a Torus network topology. It is also able to disclose the temporal correlations (or error propagations) over a certain period inside a log or across multiple logs, based on an effective similarity analysis strategy.
This code corresponds to the paper published in CCGrid2017: LogAider - A tool for mining potential correlations in HPC system logs.
This code is also used in the analysis of the BlueGene/Q systems (mainly the ANL MIRA system) in the following two papers:
[1] Sheng Di, Hanqi Guo, Eric Pershey, Marc Snir, Franck Cappello, "Characterizing and Understanding HPC Job Failures over The 2K-day Life of IBM BlueGene/Q System", IEEE/IFIP 49th International Conference on Dependable Systems and Networks (IEEE DSN19), Portland, USA, 2019.
[2] Sheng Di, Hanqi Guo, Rinku Gupta, Eric Pershey, Marc Snir, Franck Cappello, "Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System," in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), 2018.
As the developer, I strongly recommend that you read the CCGrid paper carefully to understand how LogAider works, because the following description uses terms defined in the paper, such as 'value-combination pool'.
The Mira RAS log data are available to download from https://reports.alcf.anl.gov/data/mira.html
The above log data page includes messages of all severity levels (including INFO, WARN and FATAL). If you are interested only in fatal events, we have already extracted them here. You can also extract the fatal events yourself using analysis.RAS.CollectWarnFatalMessags (which requires a large memory setting), as described later.
LogAider is coded in Java, so you need to install JDK 1.8+ (check the JDK version on your machine with 'java -version')
After installing JDK, you are ready to use LogAider by running the corresponding bash scripts or the java programs.
(Optional: If you want to plot spatial locations, you need to have Gnuplot installed on your machine)
LogAider provides a rich set of analysis functions, listed below, for mining the correlations of events in a Reliability, Availability and Serviceability (RAS) log. In the following, we use the RAS log of the MIRA supercomputer (an IBM BlueGene/Q system) as an example. We provide flexible schema files for users to edit, in order to adapt the toolkit to other systems.
This part discusses how to parse and filter the data
- Extract all warn and fatal messages from original log
- Script: CollectWarnFatalMessages.sh
- Source Code: analysis.RAS.CollectWarnFatalMessages.java
- Usage: java analysis.RAS.CollectWarnFatalMessags [schemaPath] [severity_index] [file or directory: -f/-d] [logDir/logFile] [log_extension]
    - Example: java -Xmx50000m analysis.RAS.CollectWarnFatalMessags /home/sdi/Catalog-project/miralog/schema/basicSchema.txt 4 -d /home/sdi/Catalog-project/miralog csv
The schema file is used to specify the format of the log data. For example, for the MIRA RAS log, basicSchema.txt looks like this:
#Column name schema Data type name Length
RECID SYSIBM INTEGER 4 0 No
MSG_ID SYSIBM CHARACTER 8 0 Yes
CATEGORY SYSIBM CHARACTER 16 0 Yes
COMPONENT SYSIBM CHARACTER 16 0 Yes
SEVERITY SYSIBM CHARACTER 8 0 Yes
EVENT_TIME SYSIBM TIMESTAMP 10 6 No
JOBID SYSIBM BIGINT 8 0 Yes
BLOCK SYSIBM CHARACTER 32 0 Yes
LOCATION SYSIBM CHARACTER 64 0 Yes
SERIALNUMBER SYSIBM CHARACTER 19 0 Yes
CPU SYSIBM INTEGER 4 0 Yes
COUNT SYSIBM INTEGER 4 0 Yes
CTLACTION SYSIBM VARCHAR 256 0 Yes
MESSAGE SYSIBM VARCHAR 1024 0 Yes
DIAGS SYSIBM CHARACTER 1 0 No
QUALIFIER SYSIBM CHARACTER 32 0 Yes
In the above example, /home/sdi/Catalog-project/miralog is the directory containing the original RAS log data files, whose extensions are csv.
Some examples are shown below:
[sdi@sdihost RasLog]$ ls
ANL-ALCF-RE-MIRA_20130409_20131231.csv ANL-ALCF-RE-MIRA_20140101_20141231.csv ANL-ALCF-RE-MIRA_20150101_20151231.csv ANL-ALCF-RE-MIRA_20160101_20161231.csv ANL-ALCF-RE-MIRA_20170101_20170831.csv
[sdi@sdihost RasLog]$ cat ANL-ALCF-RE-MIRA_20130409_20131231.csv
"RECID","MSG_ID","CATEGORY","COMPONENT","SEVERITY","EVENT_TIME","JOBID","BLOCK","LOCATION","SERIALNUMBER","CPU","COUNT","CTLACTION","MESSAGE","DIAGS","QUALIFIER","MACHINE_NAME"
13113415,"0008003C","Software_Error","FIRMWARE","INFO","2013-04-01 00:00:37.072514",180219,"MIR-40000-737F1-4096","R0D-M0-N05-J06","74Y9656YL1CK135701C",65,"","","DDR0 PHY was recalibrated(0): taken = 196 usec. Previous cal was 209.5 seconds ago.","F","655148938 ","mira"
13113416,"00080034","DDR","FIRMWARE","INFO","2013-04-01 00:00:42.804000",180268,"MIR-44400-77771-1024","R1C-M0-N12-J14","00E5870YL1FB227033C",8,8713,"","DDR Correctable Error Summary : count=8713 MCFIR error status: [POWERBUS_WRITE_BUFFER_CE] This bit is set when a PBUS ECC CE is detected on a PBus write buffer read op;","F","50419698 ","mira"
13113417,"0008002F","BQC","FIRMWARE","INFO","2013-04-01 00:00:42.806127",180268,"MIR-44400-77771-1024","R1C-M0-N12-J14","00E5870YL1FB227033C",8,27355,"","L1P Correctable Error Summary : count=27355 cores=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15 L1P_ESR : [ERR_RELOAD_ECC_X2] correctable reload data ECC error;","F","50419698 ","mira"
13113418,"00080030","BQC","FIRMWARE","INFO","2013-04-01 00:00:42.807463",180268,"MIR-44400-77771-1024","R1C-M0-N12-J14","00E5870YL1FB227033C",8,400,"","L2 Array Correctable Error Summary : count=400 slices=10 L2_INTERRUPT_STATE error status: [EDR_CE] Coherence array correctable error;","F","50419698 ","mira"
13113419,"0006100E","Optical_Module","MMCS","WARN","2013-04-01 00:00:51.477275","","","R21-M0-N09-O13","","","","","Health Check detected an abnormal condition for the optical module at location R21-M0-N09-O13. The condition is related to POWER 11 .","F","","mira"
.....
Hints:
- -Xmx50000m requests 50 GB of heap for the JVM to run this command. The amount of memory required depends on the volume of data you need to process.
- If you encounter the error 'java.lang.OutOfMemoryError: Java heap space', you need to increase the heap size via -Xmx.
- All the Java classes must be run with their fully qualified (package) names: e.g., java analysis.RAS.CollectWarnFatalMessags .... instead of java CollectWarnFatalMessags ....
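For reference, the essence of this step is just keeping the rows whose SEVERITY column (index 4 in the example above) is WARN or FATAL. Below is a minimal, hypothetical Java sketch of that filtering idea (not LogAider's actual implementation); it only illustrates the quote-aware CSV splitting and the severity check.

```java
import java.io.*;
import java.util.*;

// Minimal sketch (not LogAider's code): keep only WARN/FATAL rows of a RAS csv file.
// Assumes the severity column index (4 here) matches the basicSchema.txt shown above.
public class SeverityFilterSketch {
    // Split one CSV line, honoring double quotes so commas inside MESSAGE stay together.
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == '"') inQuotes = !inQuotes;
            else if (c == ',' && !inQuotes) { fields.add(cur.toString()); cur.setLength(0); }
            else cur.append(c);
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) throws IOException {
        int severityIndex = 4;                       // SEVERITY column in the MIRA RAS schema
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("\"RECID\"")) continue;   // skip the header row
                List<String> f = splitCsv(line);
                if (f.size() <= severityIndex) continue;
                String sev = f.get(severityIndex).trim();
                if (sev.equals("WARN") || sev.equals("FATAL")) out.println(line);
            }
        }
    }
}
```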
- Classify Log Based on MessageID
- Script: -
- Source Code: filter.ClassifyLogBasedonMessageID.java
- Usage: java filter.ClassifyLogBasedonMessageID [inputLogFile] [outputDir]
- Example: java ClassifyLogBasedonMessageID /home/sdi/Catalog-project/miralog/totalFatalMsg.fat /home/sdi/Catalog-project/miralog/FilterAndClassify
inputLogFile refers to the file containing all the fatal messages (assuming you are focused on only fatal messages in the study).
outputDir refers to the directory that will contain the output results.
The output will look as follows:
[sdi@sdihost FilterAndClasify]$ ls
00010001.fltr 00040028.ori 0004009A.ori 000400E7.ori 00040122.ori 00040144.ori 00062005.ori 0008000B.ori 00080029.ori 00090101.ori 00090213.ori 000B0005.ori
00010001.ori 00040037.fltr 000400A0.fltr 000400ED.fltr 00040124.fltr 00040148.fltr 00070214.fltr 0008000C.fltr 0008003A.fltr 00090102.fltr 00090216.fltr allEvents-2015.txt
00010007.fltr 00040037.ori 000400A0.ori 000400ED.ori 00040124.ori 00040148.ori 00070214.ori 0008000C.ori 0008003A.ori 00090102.ori 00090216.ori allEvents.txt
00010007.ori 00040058.fltr 000400A5.fltr 000400EE.fltr 00040125.fltr 0004014A.fltr 00070219.fltr 00080014.fltr 0008003D.fltr 00090103.fltr 00090217.fltr FFFE0015.fltr
0001000A.fltr 00040058.ori 000400A5.ori 000400EE.ori 00040125.ori 0004014A.ori 00070219.ori 00080014.ori 0008003D.ori 00090103.ori 00090217.ori FFFE0015.ori
......
The file names are the message IDs, and the extensions refer to original messages (.ori) or filtered messages (.fltr).
- TemporalSpatialFilter
- Script: TemporalSpatialFilter.sh
- Source Code: filter.TemporalSpatialFilter.java
- Usage: java filter.TemporalSpatialFilter [-t/-s/-ts] [classifiedLogDir] [extension] [maintenance-time-file] [reservation-period-file] [outputDir]
- Example 1: java TemporalSpatialFilter -t /home/sdi/Catalog-project/miralog/FilterAndClassify ori /home/sdi/Work/Catalog-project/Catalog-data/schema/maintainance-period.txt /home/sdi/Work/Catalog-project/Catalog-data/schema/total-reservation.txt /home/sdi/Catalog-project/miralog/FilterAndClassify
- Example 2: java TemporalSpatialFilter -ts /home/sdi/Work/Catalog-project/Catalog-data/FilterAndClassify ori /home/sdi/Work/Catalog-project/Catalog-data/schema/maintainance-period.txt /home/sdi/Work/Catalog-project/Catalog-data/schema/total-reservation.txt /home/sdi/Work/Catalog-project/Catalog-data/FilterAndClassify
-t/-s/-ts selects temporal-only filtering, spatial-only filtering, or temporal-spatial filtering.
classifiedLogDir is the output directory produced by filter.ClassifyLogBasedonMessageID.
extension is the extension of the classified files in that directory.
maintenance-time-file is the file containing the maintenance periods.
reservation-period-file contains all the reservation periods, each corresponding to exactly one user/job. That is, all the failures happening in the same reservation period should be counted as one failure event, according to the MIRA system administrators. Mira's 5-year reservation file is total-reservation.txt, which can be found in schema/ of the package.
outputDir is the output directory that will contain the filtered log data results.
The content of maintenance-time-file is shown below:
2013-04-01 14:00:09,2013-04-02 01:13:42
2013-04-08 14:00:17,2013-04-09 00:34:52
2013-04-15 14:00:04,2013-04-15 21:27:55
2013-04-22 14:00:07,2013-04-23 00:32:10
2013-04-29 14:00:05,2013-04-30 01:04:23
2013-05-06 14:00:21,2013-05-06 23:06:17
2013-05-13 14:00:15,2013-05-14 00:08:54
2013-05-22 22:30:13,2013-05-22 23:22:25
2013-05-23 22:30:04,2013-05-24 03:29:49
2013-05-24 17:00:00,2013-05-25 02:50:02
2013-05-28 14:00:14,2013-05-29 00:26:31
2013-06-03 14:00:04,2013-06-03 22:19:53
2013-06-10 14:00:05,2013-06-11 04:55:29
2013-06-17 11:00:09,2013-06-18 00:00:46
2013-06-24 14:30:04,2013-06-25 02:43:38
......
In the outputDir, a new sub-directory will be generated to store the further-filtered messages, e.g., no-Maint-filter-interval=1800s_43200s, as shown below. Here, no-Maint-filter means the maintenance periods specified by the maintenance-time-file are excluded, and interval=1800s_43200s refers to the two window sizes used to control the filtering. The details can be found in our CCGrid17 paper.
00010001.fltr 00040028.ori 0004009A.ori 000400E7.ori 00040122.ori 00040144.ori 00062005.ori 0008000B.ori 00080029.ori 00090101.ori 00090213.ori 000B0005.ori
00010001.ori 00040037.fltr 000400A0.fltr 000400ED.fltr 00040124.fltr 00040148.fltr 00070214.fltr 0008000C.fltr 0008003A.fltr 00090102.fltr 00090216.fltr allEvents-2015.txt
00010007.fltr 00040037.ori 000400A0.ori 000400ED.ori 00040124.ori 00040148.ori 00070214.ori 0008000C.ori 0008003A.ori 00090102.ori 00090216.ori allEvents.txt
00010007.ori 00040058.fltr 000400A5.fltr 000400EE.fltr 00040125.fltr 0004014A.fltr 00070219.fltr 00080014.fltr 0008003D.fltr 00090103.fltr 00090217.fltr FFFE0015.fltr
0001000A.fltr 00040058.ori 000400A5.ori 000400EE.ori 00040125.ori 0004014A.ori 00070219.ori 00080014.ori 0008003D.ori 00090103.ori 00090217.ori FFFE0015.ori
0001000A.ori 00040059.fltr 000400AA.fltr 000400F8.fltr 00040131.fltr 0004014D.fltr 0007021C.fltr 00080016.fltr 00090001.fltr 00090104.fltr 000A0003.fltr no-Maint-filter-interval=1800s_43200s
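To make the filtering idea concrete: events that fall inside a maintenance window are dropped, and bursts of the same message ID that occur within the configured time window are collapsed into one event. The sketch below is a hypothetical simplification of that idea (see the CCGrid17 paper for the real algorithm); the 1800 s window is just the example value from the directory name above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;

// Hypothetical sketch of the temporal-filter idea: drop events inside maintenance
// windows, then merge events of one message ID that are closer than 'windowSec'.
public class TemporalFilterSketch {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static boolean inMaintenance(LocalDateTime t, List<LocalDateTime[]> windows) {
        for (LocalDateTime[] w : windows)
            if (!t.isBefore(w[0]) && !t.isAfter(w[1])) return true;
        return false;
    }

    // 'events' must be sorted by time; returns the filtered timestamps.
    static List<LocalDateTime> filter(List<LocalDateTime> events,
                                      List<LocalDateTime[]> maintenance, long windowSec) {
        List<LocalDateTime> kept = new ArrayList<>();
        LocalDateTime lastKept = null;
        for (LocalDateTime t : events) {
            if (inMaintenance(t, maintenance)) continue;          // maintenance period: skip
            if (lastKept != null && java.time.Duration.between(lastKept, t).getSeconds() < windowSec)
                continue;                                         // same burst: count as one event
            kept.add(t);
            lastKept = t;
        }
        return kept;
    }

    public static void main(String[] args) {
        List<LocalDateTime[]> maint = Collections.singletonList(new LocalDateTime[]{
                LocalDateTime.parse("2013-04-01 14:00:09", FMT),
                LocalDateTime.parse("2013-04-02 01:13:42", FMT)});
        List<LocalDateTime> ev = Arrays.asList(
                LocalDateTime.parse("2013-04-01 10:00:00", FMT),
                LocalDateTime.parse("2013-04-01 10:10:00", FMT),   // < 1800 s later: merged
                LocalDateTime.parse("2013-04-01 15:00:00", FMT));  // inside maintenance: dropped
        System.out.println(filter(ev, maint, 1800));               // -> [2013-04-01T10:00]
    }
}
```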
- Extract all error messages (i.e., jobs with non-zero exit codes)
- Script: -
- Source Code: analysis.Job.CollectErrorMessages.java
- Usage: java analysis.Job.CollectErrorMessages [schemaPath] [severity_index] [logDir] [log_extension]
- Example: java analysis.Job.CollectErrorMessages /home/sdi/Catalog-project/miralog/RAS-Job/Job/basicSchema/basicSchema.txt 14 /home/sdi/Catalog-project/miralog/RAS-Job/Job csv
basicSchema.txt for the job scheduling log (Cobalt), for example, looks as follows:
QUEUED_TIMESTAMP
START_TIMESTAMP
END_TIMESTAMP
QUEUED_DATE_ID
START_DATE_ID
END_DATE_ID
RUNTIME_SECONDS
WALLTIME_SECONDS
REQUESTED_CORES
USED_CORES
REQUESTED_NODES
USED_NODES
REQUESTED_CORE_HOURS
REQUESTED_CORE_SECONDS
USED_CORE_HOURS
USED_CORE_SECONDS
COBALT_PROJECT_NAME_GENID
COBALT_USER_NAME_GENID
MACHINE_PARTITION
EXIT_CODE
QUEUE_GENID
MODE
MACHINE_NAME
RESID
DELETED_BY_GENID
JOBID
PROJECT_NAME_GENID
In our example (the Mira log), the job log is also stored as CSV, similar to the RAS log, but it has a different schema.
A snapshot of one job log file is shown below:
#"QUEUED_TIMESTAMP","START_TIMESTAMP","END_TIMESTAMP","QUEUED_DATE_ID","START_DATE_ID","END_DATE_ID","RUNTIME_SECONDS","WALLTIME_SECONDS","REQUESTED_CORES","USED_CORES","REQUESTED_NODES","USED_NODES","REQUESTED_CORE_HOURS","REQUESTED_CORE_SECONDS","USED_CORE_HOURS","USED_CORE_SECONDS","COBALT_PROJECT_NAME_GENID","COBALT_USER_NAME_GENID","MACHINE_PARTITION","EXIT_CODE","QUEUE_GENID","MODE","MACHINE_NAME","RESID","DELETED_BY_GENID","JOBID","PROJECT_NAME_GENID"
"2014-12-30 16:03:17.000000","2014-12-31 23:26:29.000000","2015-01-01 00:27:15.000000",20141230,20141231,20150101,"3646.0000","7200.0000","8192.0000","8192.0000","512.0000","512.0000","8296.6756","29868032.0000","8296.6756","29868032.0000",71848090445552,57948927142633,"MIR-088C0-3BBF1-512",0,89991570492271,"script","mira",-1,"",389021,68961232793033
"2014-12-09 09:09:05.000000","2014-12-31 13:56:08.000000","2015-01-01 01:26:14.000000",20141209,20141231,20150101,"41406.0000","43200.0000","16384.0000","16384.0000","1024.0000","1024.0000","188443.3067","678395904.0000","188443.3067","678395904.0000",85022475703164,13148949161706,"MIR-00C00-33F71-1-1024",0,51795839728692,"script","mira",-1,"",378419,53366083443800
"2014-12-18 17:39:38.000000","2014-12-31 18:19:21.000000","2015-01-01 00:19:58.000000",20141218,20141231,20150101,"21637.0000","21600.0000","65536.0000","65536.0000","4096.0000","4096.0000","393889.5644","1418002432.0000","393889.5644","1418002432.0000",55300184085639,68190251275985,"MIR-40000-737F1-4096",143,89991570492271,"c16","mira",-1,"",383815,51754639787485
......
Output: the above command will generate totalFatalMsg.fat, which contains only error messages regarding jobs.
- Calculate job failures based on users
- Script: -
- Source Code: analysis.Job.CalculateFailuresBasedonUsers.java
- Usage: java CalculateFailuresBasedonUsers [wlLengthFailureFile] [proj_exit_file_fs] [proj_exit_file_pe] [proj_outputFile] [user_exit_file_fs] [user_exit_file_pe] [user_outputFile]
- Example: java CalculateFailuresBasedonUsers /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/cobalt/lengthAnalysis/breakWCJobList.ori /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/cobalt/featureState/COBALT_PROJECT_NAME_GENID/COBALT_PROJECT_NAME_GENID-EXIT_CODE.fs /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/cobalt/featureState/COBALT_PROJECT_NAME_GENID/COBALT_PROJECT_NAME_GENID-EXIT_CODE.pe90 /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/cobalt/projFailure.out /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/cobalt/featureState/COBALT_USER_NAME_GENID/COBALT_USER_NAME_GENID-EXIT_CODE.fs /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/cobalt/featureState/COBALT_USER_NAME_GENID/COBALT_USER_NAME_GENID-EXIT_CODE.pe90 /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/cobalt/userFailure.out
Hints:
- We omit the detailed description of the job-related analysis commands. In addition to CalculateFailuresBasedonUsers, there are more analysis codes in the package analysis.Job; please see the source code there for details.
- Extract value types for each field
- Script: -
- Source Code: analysis.RAS.ExtractValueTypes4EachField
- Usage: java ExtractValueType4EachField [schema] [inputDir] [extension] [outputDir]
- Example: java ExtractValueType4EachField /home/sdi/eventlog/schema/basicSchema.txt /home/sdi/eventlog csv /home/sdi/eventlog/schema/fullSchema
The output is a directory that contains multiple files, each containing the value types for one field. The output comes in two flavors: 'withRatio' and 'withCount'. 'withRatio' means each value type is associated with its percentage of the total, while 'withCount' means each value type is associated with its number of occurrences. Some examples are shown below. (Note: these are just examples showing what the output looks like; the specific ratios/counts may differ for different time intervals of the log.)
Example of 'withRatio': the percentage is shown with each value type.
[sdi@sdihost failureRateProperty]$ cd withRatio/
[sdi@sdihost withRatio]$ ls
BLOCKSIZE.fsr BLOCKSIZE-modify.fsr CATEGORY.fsr COMPONENT.fsr EVENT_COUNT.fsr EVENT_ID.fsr FIRST_LOCATION.fsr FIRST_REC_TIME.fsr LAST_REC_TIME.fsr LOCATION_MODE.fsr MSG_ID.fsr SEVERITY.fsr
[sdi@sdihost withRatio]$ cat CATEGORY.fsr
# CATEGORY null null 0 0 false
Node_Board 0.12376238
Infiniband 0.24752475
DDR 0.37128714
Message_Unit 0.6188119
Coolant_Monitor 0.86633664
Process 1.6089109
AC_TO_DC_PWR 1.7326733
Block 2.7227721
Cable 3.4653466
Card 11.014852
BQL 18.935642
BQC 28.712872
Software_Error 29.579206
[sdi@sdihost withRatio]$
Example of 'withCount': the count is shown with each value type.
[sdi@sdihost failureRateProperty]$ cd withCount/
[sdi@sdihost withCount]$ ls
BLOCKSIZE.fsc CATEGORY.fsc COMPONENT.fsc EVENT_COUNT.fsc EVENT_ID.fsc FIRST_LOCATION.fsc FIRST_REC_TIME.fsc LAST_REC_TIME.fsc LOCATION_MODE.fsc MSG_ID.fsc SEVERITY.fsc
[sdi@sdihost withCount]$ cat CATEGORY.fsc
# CATEGORY null null 0 0 false
Node_Board 1
Infiniband 2
DDR 3
Message_Unit 5
Coolant_Monitor 7
Process 13
AC_TO_DC_PWR 14
Block 22
Cable 28
Card 89
BQL 153
BQC 232
Software_Error 239
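The idea behind these files is simply a per-field frequency table; the hypothetical sketch below shows how the count and the percentage for each distinct value of one field (e.g., CATEGORY) could be derived. It is an illustration only, not LogAider's code.

```java
import java.util.*;

// Hypothetical sketch of the 'withCount' / 'withRatio' idea: count each distinct
// value of one field and derive its percentage of the total.
public class ValueTypeSketch {
    public static void main(String[] args) {
        List<String> categoryColumn = Arrays.asList(
                "Software_Error", "BQC", "BQC", "Card", "Software_Error", "DDR");
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : categoryColumn) counts.merge(v, 1, Integer::sum);

        int total = categoryColumn.size();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            double ratio = 100.0 * e.getValue() / total;           // percentage, as in *.fsr
            System.out.printf("%s %d (%.4f%%)%n", e.getKey(), e.getValue(), ratio);
        }
    }
}
```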
- Generate state features
- Script: GenerateStateFeaturs.sh
- Source Code: analysis.RAS.GenerateStateFeatures
- Usage: java analysis.RAS.GenerateStateFeatures [basicSchema] [fullSchemaDir] [schemaExt] [logDir] [logExt] [outputDir] [fields....]
- Example: java analysis.RAS.GenerateStateFeatures /home/sdi/Catalog-project/miralog/schema/basicSchema.txt /home/sdi/Catalog-project/miralog/schema/fullSchema/withCount fsc /home/sdi/Catalog-project/miralog csv /home/sdi/Catalog-project/miralog/featureState CATEGORY COMPONENT CPU CTLACTION LOCATION MSG_ID SEVERITY
This function generates the state features, which are used to calculate posterior probabilities based on observed evidence. By 'state', we mean a specific target value of a field whose probability will be calculated. For instance, a user may want to know the probability of COMPONENT=CNK when MSG_ID=00010001 and CATEGORY=Software_Error. In this example, COMPONENT=CNK is the target state, and MSG_ID=00010001 together with CATEGORY=Software_Error is called the 'evidence'. The across-field correlation analysis can answer this question, as illustrated by the sketch below.
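The following minimal sketch (not LogAider's implementation) illustrates how such a posterior probability can be estimated from the log: P(target | evidence) is approximated by count(evidence AND target) / count(evidence) over the records.

```java
import java.util.*;

// Hypothetical sketch (not LogAider's code): estimate the posterior probability
// P(COMPONENT=CNK | MSG_ID=00010001, CATEGORY=Software_Error) as
// count(evidence AND target) / count(evidence) over the log records.
public class PosteriorSketch {
    static Map<String, String> record(String msgId, String category, String component) {
        Map<String, String> m = new HashMap<>();
        m.put("MSG_ID", msgId);
        m.put("CATEGORY", category);
        m.put("COMPONENT", component);
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = Arrays.asList(
                record("00010001", "Software_Error", "CNK"),
                record("00010001", "Software_Error", "CNK"),
                record("00010001", "Software_Error", "MC"),
                record("000400ED", "Card", "MC"));

        // Evidence: MSG_ID=00010001 and CATEGORY=Software_Error; target state: COMPONENT=CNK.
        long evCount = 0, bothCount = 0;
        for (Map<String, String> r : records) {
            if (!"00010001".equals(r.get("MSG_ID"))) continue;
            if (!"Software_Error".equals(r.get("CATEGORY"))) continue;
            evCount++;
            if ("CNK".equals(r.get("COMPONENT"))) bothCount++;
        }
        double posterior = evCount == 0 ? 0.0 : (double) bothCount / evCount;
        System.out.println("P(COMPONENT=CNK | evidence) = " + posterior);   // 2/3 on this toy data
    }
}
```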
- Construct value-combination pool based on schema and featureStates
- Script: BuildFieldValueCombination.sh
- Source Code: analysis.RAS.BuildFieldValueCombination
- Usage: java BuildFieldCombination [maxElementCount] [basicSchemaFile] [fullSchemaDir] [extension] [featureStateDir] [fsExt] [outputDir] [fieldNames....]
- Example: java BuildFieldCombination 5 /home/sdi/Catalog-project/miralog/schema/basicSchema2.txt /home/sdi/Catalog-project/miralog/schema/fullSchema/withRatio fsr /home/sdi/Catalog-project/miralog/featureState pr /home/sdi/Catalog-project/miralog/fieldCombination CATEGORY COMPONENT CTLOCATION MSG_ID SEVERITY
output: the directory fieldCombination that contains the value-combination probability pool.
- Calculate the number of value combinations by brute-force method (valid for both RAS and Job logs)
- Script: CalculateCountsForValueCombinations.sh
- Source Code: analysis.RAS.CalculateCountsForValueCombinations
- Usage: java CalculateCountsForValueCombinations [basicSchemaFile] [fullSchemaDir] [fullSchemaExt] [logDir] [extension] [outputFile] [fields....]
- Example: java CalculateCountsForValueCombinations /home/sdi/Catalog-project/miralog/schema/basicSchema.txt /home/sdi/Catalog-project/miralog/schema/fullSchema/withRatio fsr /home/sdi/Catalog-project/miralog csv /home/sdi/Catalog-project/miralog/fieldValueCombination CATEGORY COMPONENT CTLACTION MSG_ID SEVERITY
output: Generate vc.count file in the dir fieldValueCombination.
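Conceptually, the brute-force method enumerates, for every log record, the value combinations of the selected fields (up to maxElementCount fields per combination) and counts how often each combination occurs. The sketch below is a hypothetical illustration of that counting, not the LogAider code.

```java
import java.util.*;

// Hypothetical sketch of brute-force value-combination counting: for every record,
// enumerate all non-empty subsets of the selected fields (up to maxElementCount)
// and count how often each value combination appears.
public class ValueCombinationSketch {
    public static void main(String[] args) {
        String[] fields = {"CATEGORY", "COMPONENT", "SEVERITY"};
        String[][] records = {                       // values aligned with 'fields'
                {"Card", "MC", "FATAL"},
                {"Card", "MC", "FATAL"},
                {"BQC",  "FIRMWARE", "INFO"}};
        int maxElementCount = 2;

        Map<String, Integer> pool = new TreeMap<>();
        for (String[] rec : records) {
            // Each bit of 'mask' selects one field of the subset.
            for (int mask = 1; mask < (1 << fields.length); mask++) {
                if (Integer.bitCount(mask) > maxElementCount) continue;
                StringBuilder key = new StringBuilder();
                for (int i = 0; i < fields.length; i++)
                    if ((mask & (1 << i)) != 0)
                        key.append(fields[i]).append('=').append(rec[i]).append(';');
                pool.merge(key.toString(), 1, Integer::sum);
            }
        }
        pool.forEach((combo, count) -> System.out.println(combo + " -> " + count));
    }
}
```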
- Generate analysis for inputMsg.txt
- Script: ComputePostProbabilityBasedonMsg.sh
- Source Code: analysis.RAS.ComputePostProbabilityBasedonMsg.java
- Usage: java ComputePostProbabilityBasedonMsg [fieldListFile] [vcCountHashMapFile] [inputMessageFile] [outputResultFile] [confidenceLevel]
- Example: java ComputePostProbabilityBasedonMsg /home/sdi/Catalog-project/miralog/fieldValueCombination/fieldList.txt /home/sdi/Catalog-project/miralog/fieldValueCombination/vc.count "/home/sdi/Catalog-project/miralog/inputMsg.txt" /home/sdi/Catalog-project/miralog/analyzeMsg 0.95
This function computes the posterior probability based on the given messages (inputMessageFile contains them) and then selects the target states according to the specified confidence level. In the above demonstration, the example inputMsg.txt can be found in the example-input directory of the package; it contains three messages. The ComputePostProbabilityBasedonMsg function analyzes the occurrence probability across fields.
output: the directory analyzeMsg (there is an example in the example-output directory of the package). 0.prob corresponds to the first message in inputMsg.txt. "2:Card,12:END_JOB ==> 1:000400ED : 1.0" indicates that a message with COMPONENT=Card and CTLACTION=END_JOB will definitely have MSG_ID=000400ED.
[sdi@sdihost analyzeMsg]$ ls
0.prob 1.prob 2.prob
[sdi@sdihost analyzeMsg]$ cat 0.prob
# MSG_ID 1 ; CATEGORY 2 ; COMPONENT 3 ; SEVERITY 4 ; CTLACTION 12
# 39684092,000400ED,Card ,MC ,FATAL ,2015-04-20-21.38.10.221308,,MIR-00000-73FF1-16384 ,R02-M1-N14 ,00E5792YL10K135702C,,,END_JOB,FREE_COMPUTE_BLOCK ; BOARD_IN_ERROR ; Detected that this board has become unusable,F, 906789997 ,74
16:74 ==> 1:000400ED : 1.0
2:Card,12:END_JOB ==> 1:000400ED : 1.0
3:MC,12:END_JOB ==> 1:000400ED : 1.0
1:000400ED ==> 2:Card : 1.0
16:74 ==> 2:Card : 1.0
3:MC,12:END_JOB ==> 2:Card : 1.0
1:000400ED ==> 3:MC : 1.0
16:74 ==> 3:MC : 1.0
......
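Each rule line follows the pattern 'fieldIndex:value[,fieldIndex:value...] ==> targetFieldIndex:value : probability', where the field indices are mapped to field names by the header line of the .prob file. A small, hypothetical parser for such a line is sketched below.

```java
// Hypothetical sketch: parse one rule line of a *.prob file, e.g.
//   "2:Card,12:END_JOB ==> 1:000400ED : 1.0"
// into its evidence pairs, target pair, and probability.
public class ProbRuleParseSketch {
    public static void main(String[] args) {
        String line = "2:Card,12:END_JOB ==> 1:000400ED : 1.0";

        String[] sides = line.split("==>");                 // evidence side / target side
        String[] evidencePairs = sides[0].trim().split(",");
        String targetSide = sides[1].trim();                // "1:000400ED : 1.0"

        int lastSep = targetSide.lastIndexOf(" : ");
        String targetPair = targetSide.substring(0, lastSep).trim();   // "1:000400ED"
        double probability = Double.parseDouble(targetSide.substring(lastSep + 3).trim());

        for (String p : evidencePairs) {
            String[] kv = p.split(":", 2);
            System.out.println("evidence field index " + kv[0] + " = " + kv[1]);
        }
        String[] tkv = targetPair.split(":", 2);
        System.out.println("target field index " + tkv[0] + " = " + tkv[1]
                + " with probability " + probability);
    }
}
```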
- Generate spatial distribution (to be plotted in a graph by plot.PlotMiraGraph later on)
- Script: ComputeErrorDistribution.sh
- Source Code: analysis.RAS.ComputeErrorDistribution.java
- Usage: java ComputeErrorDistribution [[filterFieldIndex] [filterValue] ....] [logDir] [logExtension] [locationIndex] [separator] [outputDir] [merge/separate] [isAND (or OR)]
- Example 1: java ComputeErrorDistribution 4 FATAL 12 END_JOB /home/sdi/Catalog-project/miralog csv 8 - /home/sdi/Catalog-project/miralog/errLocDistribution merge true
- Example 2: java ComputeErrorDistribution 4 FATAL 1 00062001 /home/sdi/Catalog-project/miralog csv 8 - /home/sdi/Catalog-project/miralog/errLocDistribution/FATAL_MSGID_00062001 merge false
[[filterFieldIndex] [filterValue] ....] indicates the key-value pairs used to filter the messages for the spatial-correlation analysis. In example 1, the user specifies SEVERITY=FATAL and CTLACTION=END_JOB.
[logDir], such as /home/sdi/Catalog-project/miralog, indicates the directory containing the csv log files.
[logExtension], such as csv, is the extension of the log files.
[locationIndex] indicates the index of the location field. In the above example, the location field's index is 8; e.g., the location information R02-M1-N14-J12 is the 8th field in the following message (indices start at 0).
39684092,000400ED,Card,MC,FATAL,2015-04-20-21.38.10.221308,,MIR-00000-73FF1-16384,R02-M1-N14-J12,00E5792YL10K135702C,,,END_JOB,FREE_COMPUTE_BLOCK ; BOARD_IN_ERROR ; Detected that this board has become unusable,F, 906789997,74
[separator] indicates how to split the location value "R02-M1-N14-J12", using '-' in the above example.
[outputDir] specifies the output directory.
[merge/separate] specifies how the spatial analysis results are output. For instance, 'separate' stores the results in four different files, level0.err, level1.err, level2.err and level3.err, one per level/layer (rack, midplane, node board, and compute card).
[isAND (or OR)] specifies the operator applied to the multiple key-value pairs. For instance, example 1 uses 'true', meaning SEVERITY=FATAL AND CTLACTION=END_JOB will be used to filter the messages.
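The spatial counting itself boils down to splitting each LOCATION value by the separator and accumulating a count for every prefix (rack, midplane, node board, compute card), roughly what ends up in level0.err through level3.err in 'separate' mode. A hypothetical sketch of that idea:

```java
import java.util.*;

// Hypothetical sketch of the per-level spatial counting: split each LOCATION value
// by the separator and accumulate counts for every prefix (rack, midplane,
// node board, compute card), roughly what level0.err .. level3.err contain.
public class LocationLevelSketch {
    public static void main(String[] args) {
        String separator = "-";
        String[] locations = {"R02-M1-N14-J12", "R02-M1-N05-J06", "R21-M0-N09-O13"};

        // counts.get(level) maps a location prefix at that level to its event count.
        List<Map<String, Integer>> counts = new ArrayList<>();
        for (int i = 0; i < 4; i++) counts.add(new TreeMap<>());

        for (String loc : locations) {
            String[] parts = loc.split(separator);
            StringBuilder prefix = new StringBuilder();
            for (int level = 0; level < parts.length && level < 4; level++) {
                if (level > 0) prefix.append(separator);
                prefix.append(parts[level]);
                counts.get(level).merge(prefix.toString(), 1, Integer::sum);
            }
        }
        for (int level = 0; level < 4; level++)
            System.out.println("level" + level + ".err -> " + counts.get(level));
    }
}
```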
- Extract value types for each field
- Script: -
- Source Code: analysis.Job.ExtractValueTypes4EachField.java
- Usage: java ExtractValueType4EachField [schema] [inputDir] [extension] [outputDir]
- Example: java ExtractValueType4EachField /home/sdi/eventlog/schema/basicSchema.txt /home/sdi/eventlog csv /home/sdi/eventlog/schema/fullSchema
- Generate state features
- Script: -
- Source Code: analysis.Job.GenerateStateFeatures.java
- Usage: java GenerateStateFeatures [basicSchema] [fullSchemaDir] [schemaExt] [logDir] [logExt] [outputDir] [fields....]
- Example: java GenerateStateFeatures /home/sdi/Catalog-project/miralog/RAS-Job/Job/basicSchema/basicSchema.txt /home/sdi/Catalog-project/miralog/fullSchema/fullSchema/withRatio fsr /home/sdi/Catalog-project/miralog/RAS-Job/Job csv /home/sdi/Catalog-project/miralog/RAS-Job/Job/featureState capability exit_code major_project mode nodes_cost percentile prod_queue project_name queue science_field science_field_short size_buckets3 size_cost user
Similar to analysis.RAS.GenerateStateFeatures.java
- Calculate the number of value combinations by brute-force method
- Script: -
- Source Code: analysis.Job.CalculateCountsForValueCombinations.java
- Usage: java CalculateCountsForValueCombinations [basicSchemaFile] [fullSchemaDir] [fullSchemaExt] [logDir] [extension] [outputFile] [fields....]
- Example: java CalculateCountsForValueCombinations /home/sdi/Catalog-project/miralog/RAS-Job/Job/basicSchema/basicSchema.txt /home/sdi/Catalog-project/miralog/fullSchema/fullSchema/withRatio fsr /home/sdi/Catalog-project/miralog/RAS-Job/Job csv /home/sdi/Catalog-project/miralog/fieldValueCombination capability exit_code major_project mode nodes_cost percentile prod_queue project_name queue science_field science_field_short size_buckets3 size_cost user
Similar to analysis.RAS.CalculateCountsForValueCombinations.java
- Generate analysis for inputMsg.txt
- Script: -
- Source Code: analysis.Job.ComputePostProbabilityBasedonMsg.java
- Usage: java ComputePostProbabilityBasedonMsg [fieldListFile] [vcCountHashMapFile] [inputMessageFile] [outputResultFile] [confidenceLevel]
- Example: java ComputePostProbabilityBasedonMsg /home/sdi/Catalog-project/miralog/RAS-Job/Job/fieldValueCombination/fieldList.txt /home/sdi/Catalog-project/miralog/RAS-Job/Job/fieldValueCombination/vc.count "/home/sdi/Catalog-project/miralog/RAS-Job/Job/inputMsg.txt" /home/sdi/Catalog-project/miralog/RAS-Job/Job/analyzeMsg 0.95
Similar to analysis.RAS.ComputePostProbabilityBasedonMsg.java
- Generate error distribution
(This class is used for generating/plotting the location distribution in the MIRA graph)
- Script: -
- Source Code: analysis.Job.ComputeJobMessageCounts.java
    - Usage: java ComputeJobMessageCounts [onlyCheckErr?] [jobLog] [exitCodeIndex] [locationCodeIndex] [outputDir]
- Example 1: java ComputeJobMessageCounts true/false /home/sdi/Catalog-project/miralog/RAS-Job/Job/scrubbed-201410-data.csv 14 23 /home/sdi/Catalog-project/miralog/RAS-Job/Job/locDistribution/err
- Example 2: java ComputeJobMessageCounts false /home/sdi/Catalog-project/miralog/Adam-job-log/ascovel_jobhistory.csv 24 3 /home/sdi/Catalog-project/miralog/Adam-job-log/all
Similar to analysis.RAS.ComputeErrorDistribution.java
This part focuses on the analysis of failure rates for specific metrics. Note that the analysis in this subsection should always be conducted on filtered logs; for instance, the input is expected to be the 'FilterAndClassify' directory.
output: Generate fatal-msg-count.txt and monthly errors.
- Compute the distribution of fine-grained components based on messages and location fields (such as Rxx-Ixx-Jxx-Uxx)
- Script: -
- Source Code: filter.Summarize4.java
- Usage: java filter.Summarize4 [classifyFilterDir] [extension] [outputFile]
- Example: java Summarize4 /home/sdi/Catalog-project/miralog/RAS-Job/RAS/FilterAndClassify fltr /home/sdi/Catalog-project/miralog/RAS-Job/RAS/FilterAndClassify/summarize
output: Generate fatal-locationKey-count.txt
- Search for jobs with break-wallclock failures
- Script: -
- Source Code: analysis.Job.SearchJobswithBreakWallClockFailure.java
- Usage: java SearchJobswithBreakWallClockFailure [jobLogFile] [basicSchema] [outputDir]
- Example: java SearchJobswithBreakWallClockFailure /home/sdi/Catalog-project/miralog/RAS-Job/Job/scrubbed-201410-data.csv /home/sdi/Catalog-project/miralog/RAS-Job/Job/basicSchema/basicSchema.txt /home/sdi/Catalog-project/miralog/RAS-Job/Job/lengthAnalysis
output: Generate lengthAnalysis directory.
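A 'break-wallclock' job is, in essence, a job whose RUNTIME_SECONDS reached or exceeded its WALLTIME_SECONDS (compare the third sample job row shown earlier: runtime 21637 s against a 21600 s walltime, exit code 143). The exact criterion used by the tool may differ; the sketch below only illustrates this simple check.

```java
// Hypothetical sketch of the break-wallclock check: a job is flagged when its
// RUNTIME_SECONDS reaches or exceeds its WALLTIME_SECONDS (see the Cobalt schema above).
public class BreakWallClockSketch {
    static boolean breaksWallClock(double runtimeSeconds, double walltimeSeconds) {
        return walltimeSeconds > 0 && runtimeSeconds >= walltimeSeconds;
    }

    public static void main(String[] args) {
        // Values taken from the sample job rows shown earlier in this README.
        System.out.println(breaksWallClock(3646.0, 7200.0));    // false: finished in time
        System.out.println(breaksWallClock(21637.0, 21600.0));  // true: exceeded its walltime
    }
}
```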
(Prerequisite: you need to finish analysis.RAS.ComputeErrorDistribution, analysis.Job.ComputeJobMessageCounts, or a similar distribution-generating step, before doing this step)
- Generate the gnuplot script that plots the machines in an image for the spatial-correlation study
- Script: -
- Source Code: plot.PlotMiraGraph.java
- Usage: java PlotMiraGraph [gnuplotTemplateFile] [distributionDir] [extension] [layoutSchemaFile] [maxLevel] [outputFileName]
- Example 1: java PlotMiraGraph /home/sdi/Catalog-project/miralog/gnuplot/temp-layout.p /home/sdi/Catalog-project/miralog/errLocDistribution/fatal err /home/sdi/Catalog-project/miralog/gnuplot/computeRackLayoutSchema.txt 2 /home/sdi/Catalog-project/miralog/gnuplot/errdis_fatal_compute.p
- Example 2: java PlotMiraGraph /home/sdi/Catalog-project/miralog/gnuplot/temp-layout.p /home/sdi/Catalog-project/miralog/errLocDistribution/fatal err /home/sdi/Catalog-project/miralog/gnuplot/ioRackLayoutSchema.txt 2 /home/sdi/Catalog-project/miralog/gnuplot/errDis_fatal_io.p
- Example 3: java PlotMiraGraph /home/sdi/Catalog-project/miralog/gnuplot/temp-layout.p /home/sdi/Catalog-project/miralog/errLocDistribution/FATAL_MSGID_00062001 err /home/sdi/Catalog-project/miralog/gnuplot/computeRackLayoutSchema.txt 2 /home/sdi/Catalog-project/miralog/errLocDistribution/FATAL_MSGID_00062001/gnuplot/errdis_fatal_compute.p
- Example 4: java PlotMiraGraph /home/sdi/Catalog-project/miralog/gnuplot/temp-layout.p /home/sdi/Catalog-project/miralog/Adam-job-log/err err /home/sdi/Catalog-project/miralog/gnuplot/computeRackLayoutSchema.txt 2 /home/sdi/Catalog-project/miralog/Adam-job-log/err/dis_compute.p
output: the gnuplot file that can be used to plot the graph using Gnuplot.
[gnuplotTemplateFile] specifies the gnuplot script template file. There are two candidate files in example-input/; we suggest using temp-layout-jpeg.p, because the other one, temp-layout.p, generates an .eps file, which takes a long time and a lot of space to store.
[distributionDir] specifies the error distribution data, which is generated using error distribution analysis code such as analysis.RAS.ComputeErrorDistribution.java.
[extension] indicates the extension of data files. In the above examples, the extension is always 'err', pointing to the level1.err, level2.err....
[layoutSchemaFile] specifies the architecture of the system. For MIRA's compute machines, the hierarchy is Rack - Midplane - Node board - Compute card.
[maxLevel] specifies the maximum level to plot (e.g., maxLevel=2 means that only the errors/failures at the rack and midplane levels are plotted).
[outputFileName] specifies the output file name.
Users can modify the 'layout schema' file so that PlotMiraGraph.java suits other supercomputer logs with different architectures. The following shows the content of the layout schema file for compute racks (computeRackLayoutSchema.txt):
#layout schema
#For each layout, level must be put in the first place, followed by the other attributes of the layout. All the attributes following "level" are put in one layout until another level appears.
#mira_cluster:rack
level=0
fullName=Mira System (Compute Nodes)
customize=true
count=48
row=3
column=16
titleRepresentBase=-
#rack:midplane
level=1
fullName=Rack
nickName=R
customize=false
count=2
rowMajor=true
#titleRepresentType could be binary, oct, hex, decimal, or decimal2 (decimal with 2-digits representation)
titleRepresentBase=hex
#midplane:nodeboard
level=2
fullName=Midplane
nickName=M
customize=false
count=16
rowMajor=true
titleRepresentBase=binary
#node:card_board
level=3
fullName=Node
nickName=N
#customize=false
#count=32
#rowMajor=true
titleRepresentBase=decimal2
The following shows the content of the layout schema file for I/O racks (ioRackLayoutSchema.txt):
#layout schema
#For each layout, level must be put in the first place, followed by the other attributes of the layout. All the attributes following "level" are put in one layout until another level appears.
#mira_cluster:i/o rack
level=0
fullName=Mira System (IO nodes)
customize=true
count=6
row=3
column=2
titleRepresentBase=-
#rack:I/O drawer
level=1
fullName=Rack
nickName=Q
customize=false
count=9
rowMajor=true
#titleRepresentType could be binary, oct, hex, or decimal
titleRowBase=decimal
titleColumnBase=binary
#offset: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,.... (starts from 0)
titleRepresentOffset=16
#I/O drawer:Computer Card
level=2
fullName=IO drawer
nickName=I
customize=true
count=8
row=2
column=4
#rowMajor=true
titleRepresentBase=decimal
#Computer Card:core
level=3
fullName=Computer Card
nickName=J
#customize=false
#count=8
#rowMajor=true
titleRepresentBase=hex
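As the comment in both schema files says, every attribute after a 'level=' line belongs to that level until the next 'level=' appears. The hypothetical sketch below shows one way such a key=value layout schema could be parsed; it only illustrates the format, not PlotMiraGraph's parser.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of reading a layout schema file: '#' lines are comments, and
// every attribute after a 'level=' line belongs to that level until the next 'level='.
// Usage (illustrative): java LayoutSchemaParseSketch computeRackLayoutSchema.txt
public class LayoutSchemaParseSketch {
    public static void main(String[] args) throws IOException {
        List<Map<String, String>> levels = new ArrayList<>();
        Map<String, String> current = null;

        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) continue;   // skip comments
                int eq = line.indexOf('=');
                if (eq < 0) continue;
                String key = line.substring(0, eq), value = line.substring(eq + 1);
                if (key.equals("level")) {                // a new layout level starts here
                    current = new LinkedHashMap<>();
                    levels.add(current);
                }
                if (current != null) current.put(key, value);
            }
        }
        for (Map<String, String> level : levels)
            System.out.println("level " + level.get("level") + " -> " + level);
    }
}
```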
- use 'separate' mode to get monthly results
- Script: -
- Source Code: analysis.RAS.ComputeErrorDistribution.java
- Usage: -
- Example: -
- Generate monthly data results for category and component
- Script: -
- Source Code: filter.Summarize_MonthlyFailureRate.java
- Usage: java Summarize_MonthlyFailureRate [fullSchemaDir] [monthlyFailureRateDir_BaseOnMsgID] [extension]
- Example: java Summarize_MonthlyFailureRate /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/schema/withRatio /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/FilterAndClassify/ts/fatalEventMonthDis mct
- Compute daily filtered counts
- Script: -
- Source Code: analysis.RAS.ComputeDailyFilteredCount.java
- Usage: ComputeDailyFilteredCount [filterLogDir] [extension]
- Example: java ComputeDailyFilteredCount /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/FilterAndClassify/no-Maint-no-DIAGS-filter-interval=240s/ts fltr
(This analysis can also be considered a more advanced filtering algorithm, which takes into account the similarity across the filtered messages).
- Analyze the error propagation of the same type (i.e., if a fatal event happens, will it probably happen again within x hours?)
- Script: -
- Source Code: analysis.RAS.ComputeTmporalErrPropagation.java
- Usage: java ComputeTmporalErrPropagation [rootDir] [extension] [minutes]
- Example: java ComputeTmporalErrPropagation /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/FilterAndClassify/no-MaintResv-no-DIAGS-filter-interval=240s/ts fltr 60
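The question this analysis answers can be pictured as follows: given the sorted timestamps of one message ID, how often is an event followed by another event of the same ID within the given number of minutes? The sketch below is a hypothetical illustration of that measurement, not the ComputeTmporalErrPropagation code.

```java
// Hypothetical sketch of the same-type propagation question: given the sorted
// timestamps (seconds) of one message ID, what fraction of events are followed by
// another event of the same ID within 'windowMinutes'?
public class SameTypePropagationSketch {
    static double followedWithin(long[] sortedTimes, long windowMinutes) {
        long windowSec = windowMinutes * 60;
        int followed = 0;
        for (int i = 0; i + 1 < sortedTimes.length; i++)
            if (sortedTimes[i + 1] - sortedTimes[i] <= windowSec) followed++;
        // The last event has no successor, so it is excluded from the denominator.
        return sortedTimes.length < 2 ? 0.0 : (double) followed / (sortedTimes.length - 1);
    }

    public static void main(String[] args) {
        long[] times = {0, 1200, 9000, 9300};           // event times in seconds
        System.out.println(followedWithin(times, 60));  // two of three gaps are within 60 min
    }
}
```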
- Similarity-based Temporal Error Propagation Analysis (Similarity-based Event Filter)
- Script: -
- Source Code: analysis.temporalcorr.SearchErrorPropagation.java
- Usage: java analysis.temporalcorr.SearchErrorPropagation [allEventsFile] [sameIDDelayThreshold(sec)] [diffIDDelayThreshold(sec)] [parameter setting file]
- Example: java analysis.temporalcorr.SearchErrorPropagation /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/FilterAndClassify/no-MaintResv-no-DIAGS-filter-interval=240s/ts/allEvents.txt 240 1500 keyIndexClass.conf
[allEventsFile] is a file generated by filter.TemporalSpatialFilter in a previous step.
[sameIDDelayThreshold(sec)] specifies the delay threshold for messages/events with the same message ID. That is, messages whose timestamps are within this distance in time will be considered candidate correlated messages. Please see Algorithm 2 of our CCGrid17 paper.
[diffIDDelayThreshold(sec)] specifies the delay threshold for the messages with different message IDs.
[parameter setting file] specifies the weights of the different attributes. Please see our CCGrid17 paper to understand the weights in detail.
output: We have put the example output files in the sub-directory: similarity-based-filter. Please read the README.txt in that folder to understand the results.
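To give an intuition for the similarity strategy: each attribute carries a weight, the similarity of two events is the weighted fraction of matching attributes, and two events are only candidate-correlated if their time gap is under the relevant threshold (sameIDDelayThreshold or diffIDDelayThreshold). The weights and the 0.8 cutoff in the sketch below are illustrative assumptions, not LogAider's actual parameters.

```java
import java.util.*;

// Hypothetical sketch of the similarity idea: the weighted fraction of matching
// attributes between two events, combined with the same-ID / different-ID delay
// thresholds. The weights and the 0.8 cutoff are illustrative assumptions only.
public class SimilaritySketch {
    static double similarity(Map<String, String> a, Map<String, String> b,
                             Map<String, Double> weights) {
        double matched = 0, total = 0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            total += w.getValue();
            if (Objects.equals(a.get(w.getKey()), b.get(w.getKey()))) matched += w.getValue();
        }
        return total == 0 ? 0 : matched / total;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        weights.put("MSG_ID", 0.4); weights.put("COMPONENT", 0.3); weights.put("LOCATION", 0.3);

        Map<String, String> e1 = new HashMap<>(), e2 = new HashMap<>();
        e1.put("MSG_ID", "00090210"); e1.put("COMPONENT", "MC"); e1.put("LOCATION", "R02-M1-N14");
        e2.put("MSG_ID", "00090210"); e2.put("COMPONENT", "MC"); e2.put("LOCATION", "R02-M1-N05");

        long gapSec = 120, sameIdThreshold = 240, diffIdThreshold = 1500;
        boolean sameId = e1.get("MSG_ID").equals(e2.get("MSG_ID"));
        long threshold = sameId ? sameIdThreshold : diffIdThreshold;

        double sim = similarity(e1, e2, weights);
        boolean candidateCorrelated = gapSec <= threshold && sim >= 0.8;
        System.out.println("similarity=" + sim + ", correlated candidate=" + candidateCorrelated);
    }
}
```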
(first execute analysis.spatialcorr.GenerateContingencyTableForSigAnalysis.java, then execute analysis.significance.ChiSquareSingleTest)
- Generate Contingency Table For Significance Analysis
- Script: -
- Source Code: analysis.spatialcorr.GenerateContingencyTableForSigAnalysis.java
- Usage: -
- Example: -
This is hard-coded. Please see our CCGrid paper and source code to understand how to use it.
- Perform ChiSquare Single Test
- Script: -
- Source Code: analysis.significance.ChiSquareSingleTest.java
- Usage: java ChiSquaredTest [contingency_table_path]
- Example: java ChiSquaredTest /home/sdi/Catalog-project/miralog/RAS-Job/Job/featureState/science_field_short/science_field_short-mode.fs
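For readers unfamiliar with the test: the chi-square statistic of a contingency table compares the observed cell counts against the counts expected under independence (expected[i][j] = rowTotal[i] * colTotal[j] / grandTotal). The hand-rolled sketch below computes only the statistic (not the p-value) and is not the ChiSquareSingleTest code.

```java
// Hypothetical sketch of the chi-square test statistic on a contingency table:
// expected[i][j] = rowTotal[i] * colTotal[j] / grandTotal, and
// chi2 = sum over all cells of (observed - expected)^2 / expected.
public class ChiSquareSketch {
    static double chiSquare(long[][] observed) {
        int rows = observed.length, cols = observed[0].length;
        double[] rowTotal = new double[rows], colTotal = new double[cols];
        double grand = 0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                rowTotal[i] += observed[i][j];
                colTotal[j] += observed[i][j];
                grand += observed[i][j];
            }
        double chi2 = 0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++) {
                double expected = rowTotal[i] * colTotal[j] / grand;
                chi2 += (observed[i][j] - expected) * (observed[i][j] - expected) / expected;
            }
        return chi2;
    }

    public static void main(String[] args) {
        long[][] table = {{30, 10}, {20, 40}};   // toy contingency table
        System.out.println("chi-square statistic = " + chiSquare(table));
        // Compare the statistic against the chi-square distribution with
        // (rows-1)*(cols-1) degrees of freedom to obtain the p-value.
    }
}
```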
( analysis.spatialcorr.kmeans.KMeansSolution (2 versions of outputs) and analysis.spatialcorr.kmeans.KMeansOpt (4 versions of outputs))
- Generate K-means Clustering results
- The traditional K-means solution
- Script: -
- Source Code: analysis.spatialcorr.kmeans.KMeansSolution
- Usage: java KMeansSolution [allowDuplicate?] [inputFilePath]
- Example: -
- The optimized solution (with an optimized number of clusters)
- Script: -
- Source Code: analysis.spatialcorr.kmeans.KMeansOpt
- Usage: java KMeansOpt [kmeansSolType (fixK or optK)] [initNumOfSets] [allowDuplicate?] [inputFilePath]
- Example: java KMeansOpt fixK 10 true /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/FilterAndClassify/00090210.ori
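For reference, the sketch below is a plain K-means on 2D points (think of midplane coordinates); it only illustrates the basic assignment/update loop. LogAider's KMeansOpt additionally optimizes the number of clusters and accounts for the Torus topology, which this toy example does not.

```java
import java.util.*;

// Hypothetical, plain K-means sketch on 2D points. It only shows the basic
// assignment/update loop; it is not the optimized, Torus-aware KMeansOpt.
public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {8, 8}, {8, 9}, {9, 8}};
        int k = 2, iterations = 20;

        // Initialize the centers with the first k points (simple deterministic init).
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[c].clone();

        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: attach each point to its nearest center.
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[p][0] - centers[c][0], dy = points[p][1] - centers[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < best) { best = d; assignment[p] = c; }
                }
            }
            // Update step: move each center to the mean of its assigned points.
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int p = 0; p < points.length; p++) {
                sum[assignment[p]][0] += points[p][0];
                sum[assignment[p]][1] += points[p][1];
                count[assignment[p]]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0) {
                    centers[c][0] = sum[c][0] / count[c];
                    centers[c][1] = sum[c][1] / count[c];
                }
        }
        System.out.println("assignments: " + Arrays.toString(assignment));
        System.out.println("centers: " + Arrays.deepToString(centers));
    }
}
```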
- Plot the K-means clustering results
- Script: -
- Source Code: plot.PlotKMeansMidplanes.java
- Usage: java PlotMeansMidplanes [gnuplotTemplateFile] [inputFilePath]
- Example: java PlotMeansMidplanes /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/FilterAndClassify/gnuplot/template.p /home/sdi/Catalog-project/miralog/one-year-data/ALCF-Data/RAS/FilterAndClassify/00090210.ori.fxtrue
input: the K-means clustering matrix output by KMeansSolution or KMeansOpt; output: the gnuplot file.