Error handling (introduction)

David Anderson edited this page Jan 17, 2024 · 3 revisions

Errors

A job instance can appear to complete successfully yet produce incorrect output files. Possible causes include:

  • The host's CPU or GPU malfunctions. This is rare, but it can happen on hosts that are overclocked and/or overheated.
  • The user 'cheats' and runs a program that, masquerading as a BOINC client, returns job results without doing any computation. This can happen in a system (like Gridcoin) that gives monetary rewards for computing.
  • A particular app version (say, a GPU version) may have a bug that other versions don't.

Single-result checking

In some cases it may be possible to detect incorrect results by examining the outputs of a single job instance, for example by:

  • checking the syntax of the output files
  • checking that numerical values lie in a plausible range
  • checking that the results plausibly correspond to the inputs; e.g. in a physical simulation, the total system energy at the end should be about the same as at the start.

BOINC lets you create application-specific validators that check the output files of a job. If the check fails, the job is retried (up to a limit).
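The checks listed above can be sketched roughly as follows. This is only an illustration, not BOINC's validator API: the function name `check_single_result`, the whitespace-separated output format, and the parameters `expected_count`, `lo` and `hi` are all assumptions made for the example.

```python
import math


def check_single_result(text, expected_count, lo, hi):
    """Return True if one instance's output passes basic sanity checks.

    Hypothetical format: the output file holds `expected_count`
    whitespace-separated numbers, each expected to lie in [lo, hi].
    """
    tokens = text.split()

    # Syntax check: the file must contain the expected number of fields.
    if len(tokens) != expected_count:
        return False

    for tok in tokens:
        # Syntax check: every field must parse as a number.
        try:
            value = float(tok)
        except ValueError:
            return False

        # Plausibility check: reject NaN/inf and out-of-range values.
        if not math.isfinite(value):
            return False
        if not (lo <= value <= hi):
            return False

    return True
```

A result that fails such a check would be treated as invalid, and the job would be retried as described above. A check tying outputs to inputs (such as the energy comparison) would need application-specific knowledge and is left out here.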

Replication

But cheaters can potentially evade single-result checks. So BOINC provides another (optional) mechanism: replication. When this is used, each job is run on two different worker nodes. If the results agree, they are deemed to be correct, and one of the instances is marked as 'canonical'. If they don't agree, a third instance is created and sent to a different worker node. This continues until either a pair of agreeing instances is found, or a limit on the number of instances is reached, in which case the job is marked as failing.
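The replication loop described above might look something like the sketch below. It is a simplified model, not BOINC's scheduler or validator code: `replicate`, `run_instance`, `agree` and the default `max_instances=4` are hypothetical names and values chosen for illustration, and `run_instance(i)` is assumed to run the job on a different worker node for each `i`.

```python
from itertools import combinations


def replicate(run_instance, agree, max_instances=4):
    """Run a job on several hosts until two results agree.

    run_instance(i) -> result of the i-th instance (assumed to run on a
                       different worker node each time).
    agree(a, b)     -> True if two results are considered equivalent.

    Returns (canonical_result, instances_used); canonical_result is None
    if the instance limit is reached without any agreeing pair.
    """
    # Start with two instances of the job.
    results = [run_instance(0), run_instance(1)]

    while True:
        # Look for any pair of agreeing results.
        for a, b in combinations(results, 2):
            if agree(a, b):
                return a, len(results)  # `a` is marked canonical

        # No agreement yet: give up at the limit, else add an instance.
        if len(results) >= max_instances:
            return None, len(results)
        results.append(run_instance(len(results)))
```

For example, if the first two instances return `10` and `11` and a third returns `10`, then `replicate(lambda i: [10, 11, 10][i], lambda a, b: a == b)` yields `(10, 3)`: the first result becomes canonical after three instances.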

Different types of CPUs and GPUs, and different math libraries, can produce slightly different floating-point results. These differences can compound, as in the 'butterfly effect', so two equally correct results can contain different numbers. The comparison of replicated jobs for such applications must therefore be 'fuzzy'. Typically this means that corresponding numbers are allowed to differ by some (application-specific) amount.
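A minimal sketch of such a fuzzy comparison is shown below, assuming each result has already been parsed into a list of floats. The name `fuzzy_agree` and the default tolerances are illustrative assumptions; suitable tolerances depend entirely on the application.

```python
import math


def fuzzy_agree(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if two results match within a tolerance.

    a, b: sequences of floats parsed from two instances' output files.
    rel_tol, abs_tol: illustrative defaults, not recommended values.
    """
    # Results with different numbers of values cannot correspond.
    if len(a) != len(b):
        return False

    # Compare corresponding numbers, allowing a small difference.
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )
```

The absolute tolerance matters for values near zero, where a purely relative comparison would reject tiny differences.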

If you use replication, your validator must also (in addition to checking single results) compare the results of two instances of the same job.
