
Error handling (introduction)

David Anderson edited this page Jan 15, 2024 · 3 revisions

Errors and replication

A job instance can complete but still produce incorrect output files. This can happen because:

  • The host's CPU or GPU malfunctions. This is rare, but it can happen on hosts that are overclocked and/or overheated.
  • The user 'cheats' and uses a program that, masquerading as a BOINC client, returns job results without doing any computation. This can happen in a system (like Gridcoin) that gives monetary rewards for computing.

In some cases it may be possible to detect incorrect results by examining the outputs of a single job instance, perhaps by

  • checking the syntax of the output files
  • checking that numerical values lie in a plausible range
  • checking that the results plausibly correspond to the inputs; e.g. in physical simulations, that total energy is approximately conserved.
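The three kinds of check listed above can be sketched as a single function. This is only an illustration, not BOINC code: the JSON output format, the `energies` field, the `initial_energy` input, and the 1% tolerance are all assumptions made for the example.

```python
import json
import math

def plausible(output_text, initial_energy, rel_tol=0.01):
    """Return True if one job instance's output passes basic sanity checks.

    Hypothetical format: a JSON object with an "energies" list holding the
    total energy recorded at each step of a physical simulation.
    """
    # 1. Syntax: the output must parse and contain the expected field.
    try:
        data = json.loads(output_text)
        energies = [float(e) for e in data["energies"]]
    except (ValueError, KeyError, TypeError):
        return False
    if not energies:
        return False

    # 2. Range: every value must be a finite number (rejects NaN and infinity).
    if not all(math.isfinite(e) for e in energies):
        return False

    # 3. Consistency with the inputs: total energy stays close to its
    #    initial value throughout the run.
    limit = rel_tol * abs(initial_energy)
    return all(abs(e - initial_energy) <= limit for e in energies)
```

A check like this catches malformed or wildly wrong outputs, but a fabricated output that respects the format and the energy bound would still pass.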

But cheaters can potentially evade such checks. So BOINC provides another (optional) mechanism: replication. When this is used, each job is run on two different worker nodes. If the results agree, they are deemed to be correct, and one of the instances is marked as 'canonical'. If they don't agree, a third instance is created and sent to a different worker node. This continues until either a pair of agreeing instances is found, or a threshold on the number of instances is reached (in which case the job is marked as failing).
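The replication logic described above can be sketched as a small loop. This is a simplified model, not BOINC's scheduler: it runs instances one at a time rather than dispatching the first two in parallel, and the names `run_replicated`, `run_on`, `agree`, and `max_instances` are hypothetical.

```python
from itertools import islice

def run_replicated(job, hosts, run_on, agree, max_instances=5):
    """Run `job` on distinct hosts until two instances agree.

    hosts         -- iterable of distinct worker nodes
    run_on        -- callable (job, host) -> result
    agree         -- callable (result_a, result_b) -> bool
    max_instances -- threshold on the number of instances to create

    Returns the canonical result, or None if the job is marked as failing.
    """
    results = []
    # Never create more than max_instances instances of the job.
    for host in islice(hosts, max_instances):
        new = run_on(job, host)
        # Compare the new result against every earlier one.
        for old in results:
            if agree(old, new):
                return old  # an agreeing pair exists; mark one as canonical
        results.append(new)
    return None  # threshold reached (or hosts exhausted) without agreement
```

For example, if three hosts return 1, 2, and 1, the first two instances disagree, a third instance is run, and its result agrees with the first, so 1 becomes the canonical result.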

Different types of CPUs and GPUs, and different math libraries, can produce slightly different floating-point results. These differences can compound, as in the 'butterfly effect', so two equally correct results can contain different numbers. The comparison of replicated jobs for such applications must therefore be 'fuzzy'. Typically this means that corresponding numbers are allowed to differ by some (application-specific) amount.
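A minimal sketch of such a fuzzy comparison, assuming each result is a flat list of floats, is shown here. The function name and the default tolerances are placeholders; a real application would choose its own tolerances and its own notion of which numbers correspond.

```python
import math

def results_agree(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Return True if two results match number-for-number within tolerance.

    a, b    -- sequences of floats from two instances of the same job
    rel_tol -- allowed relative difference (application-specific)
    abs_tol -- allowed absolute difference, needed for values near zero
    """
    # Results of different length cannot correspond element by element.
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )
```

The absolute tolerance matters because a purely relative test rejects any nonzero difference when one of the values is exactly zero.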
