Error handling (introduction)

Errors and replication

A job instance can complete but produce incorrect output files, for example because:

The host's CPU or GPU malfunctions. This is rare, but it can happen on hosts that are overclocked and/or overheated.
The user 'cheats' and uses a program that, masquerading as a BOINC client, returns job results without doing any computation. This can happen in a system (like Gridcoin) that gives monetary rewards for computing.

In some cases it may be possible to detect incorrect results by examining the outputs of a single job instance, perhaps by:

checking the syntax of the output files
checking that numerical values lie in a plausible range
checking that the results plausibly correspond to the inputs; e.g. in physical simulations, that total energy is about the same.
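The three kinds of single-instance check listed above can be sketched as follows. This is only an illustration, not BOINC code: the function name `plausible`, the whitespace-separated output format, the convention that the last value is the final total energy, and the default thresholds are all assumptions made for the example.

```python
import math


def plausible(output_text, initial_energy,
              value_range=(-1e6, 1e6), energy_rel_tol=1e-3):
    """Return True if one instance's output passes basic sanity checks.

    Hypothetical format: whitespace-separated numbers, with the final
    total energy of the simulation as the last value.
    """
    # Syntax check: every token must parse as a number.
    try:
        values = [float(token) for token in output_text.split()]
    except ValueError:
        return False
    if not values:
        return False

    # Range check: every value must be finite and inside the expected range.
    lo, hi = value_range
    if not all(math.isfinite(v) and lo <= v <= hi for v in values):
        return False

    # Consistency with the inputs: total energy should be about the same
    # at the end of the run as it was at the start.
    final_energy = values[-1]
    return math.isclose(final_energy, initial_energy, rel_tol=energy_rel_tol)
```

A check like this catches garbage output and gross errors, but a result that is fabricated to look plausible would still pass.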

But cheaters can potentially evade such checks. So BOINC provides another (optional) mechanism: replication. When this is used, each job is run on two different worker nodes. If the results agree, they are deemed to be correct, and one of the instances is marked as 'canonical'. If they don't agree, a third instance is created and sent to a different worker node. This continues until either a pair of agreeing instances is found, or a threshold on the number of instances is reached (in which case the job is marked as failing).
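The replication logic described above can be sketched as a simple loop. This is a simplified, synchronous illustration rather than how the BOINC server is implemented: `run_with_replication`, `run_instance`, `results_agree`, and `max_instances` are hypothetical names, and `run_instance(i)` is assumed to run the i-th instance on a worker node different from all earlier ones.

```python
def run_with_replication(run_instance, results_agree, max_instances=5):
    """Return the canonical result, or None if the job is marked as failing.

    run_instance(i): hypothetical callable that runs instance i on a
        distinct worker node and returns its result.
    results_agree(a, b): callable that decides whether two results match.
    """
    # Start with two instances on two different worker nodes.
    results = [run_instance(0), run_instance(1)]

    while True:
        # Look for any pair of agreeing instances.
        for i in range(len(results)):
            for j in range(i + 1, len(results)):
                if results_agree(results[i], results[j]):
                    return results[i]  # this instance becomes canonical

        # No agreement yet: give up if the instance limit is reached.
        if len(results) >= max_instances:
            return None

        # Otherwise create one more instance on another worker node.
        results.append(run_instance(len(results)))
```

For example, if the first two hosts return 42 and 13, a third instance is created; if it returns 42, the first result is accepted as canonical.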

Different types of CPUs and GPUs, and different math libraries, can produce slightly different floating-point results. These differences can compound, as in the 'butterfly effect', so two equally correct results can contain different numbers. The comparison of replicated jobs for such applications must therefore be 'fuzzy'. Typically this means that corresponding numbers are allowed to differ by some (application-specific) amount.
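A minimal sketch of such a fuzzy comparison, assuming each result is a list of floating-point numbers. The function name `results_agree` and the default tolerances are illustrative assumptions; suitable tolerances depend on the application.

```python
import math


def results_agree(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if two results match within an application-specific tolerance."""
    # Results with different numbers of values cannot correspond.
    if len(a) != len(b):
        return False
    # Corresponding numbers may differ by a small relative or absolute amount.
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )
```

With these tolerances, `[1.0, 2.0]` and `[1.000001, 2.0]` are treated as agreeing, while `[1.0]` and `[1.01]` are not.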
