- Use of additional hardware to compensate for failures
- This can be done in two ways:
- The first is fault detection, correction, and masking. Multiple hardware units are assigned to do the same task in parallel and their results are compared. If one or more units are faulty, we can expect this to show up as a disagreement in the results.
- The second is to replace the malfunctioning units.
- Redundancy is expensive; duplicating or triplicating the hardware is justified only in the most critical applications.
- Static Pairing
- N modular Redundancy (NMR)
- Static pairing hardwires processors in pairs and discards the entire pair if either processor fails. This is a very simple scheme.
- The pair runs identical software with identical inputs and should generate identical outputs. If the outputs are not identical, the pair is non-functional, so the entire pair is discarded.
- This approach is depicted in the following figure. It works only when the interface is working correctly and the two processors do not fail identically at around the same time.
[caption id="attachment_297" align="alignnone" width="500" caption="Static Pairing with Monitor"][/caption]
- The interface is therefore watched by a monitor. If the interface fails, the monitor handles it, and if the monitor fails, the interface handles it. If both the interface and the monitor fail, the system is down. The monitor block is shown as a dotted box in the figure above.
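The pair-and-compare rule described above can be sketched in a few lines of Python. This is only an illustrative model, not a hardware design: the class name `StaticPair` and the use of plain functions to stand in for processors are assumptions made for this example, and the interface and monitor are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StaticPair:
    """Hypothetical model of two hardwired processors (illustration only)."""
    cpu_a: Callable[[int], int]
    cpu_b: Callable[[int], int]
    discarded: bool = False

    def run(self, x: int) -> Optional[int]:
        """Return the agreed output, or None if the pair is discarded."""
        if self.discarded:
            return None
        out_a, out_b = self.cpu_a(x), self.cpu_b(x)
        if out_a != out_b:
            # A mismatch does not say which processor failed,
            # so the whole pair is taken out of service.
            self.discarded = True
            return None
        return out_a


healthy = StaticPair(lambda x: x * 2, lambda x: x * 2)
faulty = StaticPair(lambda x: x * 2, lambda x: x * 2 + 1)
```

Note that if both processors return the same wrong value, `run` reports it as a valid output. This is the identical-failure case that the scheme cannot detect.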
- NMR is a scheme for forward error recovery.
- It uses N processors instead of one and votes on their outputs. N is usually odd, which avoids a tie between two competing values.
- An NMR cluster can be arranged in one of two ways:
- There are N voters, and the entire cluster produces N outputs.
- There is just one voter.
[caption id="attachment_299" align="alignnone" width="607" caption="N Modular Redundancy"][/caption]
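A minimal sketch of the voting step, assuming the voter takes a strict majority of the N outputs (the notes do not specify the exact voting rule). The function names `majority_vote` and `disagreeing_units` are hypothetical and chosen for this example.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by more than half of the units, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # Strict majority: more than N/2 units must agree.
    return value if count > len(outputs) // 2 else None


def disagreeing_units(outputs: Sequence[Hashable], voted: Hashable) -> List[int]:
    """Return the indices of units whose output differs from the voted output."""
    return [i for i, out in enumerate(outputs) if out != voted]
```

With three units producing `[7, 7, 9]`, the vote masks the single faulty output, and `disagreeing_units` identifies unit 2 as the one that differs. If no value reaches a majority, the sketch returns `None` rather than guessing.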
- NMR clusters are designed to allow the purging of malfunctioning units. That is, when a failure is detected, the failed unit is checked to see whether or not the failure is transient. If it is not, it must be electrically isolated from the rest of the cluster and a replacement unit is switched on. The faster the unit is replaced, the more reliable the cluster.
[caption id="attachment_298" align="alignnone" width="419" caption="Self Purging"][/caption]
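The purging procedure can be sketched as follows. This is a simplified model under stated assumptions: units are represented by names, the transient check is passed in as a caller-supplied function `is_transient` (the notes do not say how transience is determined), and `handle_failure` is a hypothetical name.

```python
from typing import Callable, List


def handle_failure(active: List[str], spares: List[str], failed: str,
                   is_transient: Callable[[str], bool]) -> List[str]:
    """Return the list of active units after a failure has been detected."""
    if is_transient(failed):
        # Transient failure: the unit stays in the cluster.
        return list(active)
    # Permanent failure: isolate the unit from the rest of the cluster.
    remaining = [unit for unit in active if unit != failed]
    if spares:
        # Switch on a replacement unit, if one is available.
        remaining.append(spares.pop(0))
    return remaining
```

If no spare is left, the cluster in this sketch simply continues with fewer units.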
- Purging can be done either by hardware or by the operating system. In self-purging, a monitor at each unit compares that unit's output against the voted output. If there is a difference, the monitor disconnects the unit from the system. The monitor can be described as a finite state machine with two states, connect and isolate. It has two input signals: diff, which is set to 1 whenever the module output disagrees with the voter output, and reconnect, which is a command from the system to reconnect the module.
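The monitor's finite state machine can be written out as a transition function. This is a sketch, not a definitive specification: the names `State` and `next_state` are hypothetical, and it assumes that an isolated module ignores diff and returns to the connect state as soon as reconnect is 1, a detail the description above leaves open.

```python
from enum import Enum


class State(Enum):
    CONNECT = "connect"
    ISOLATE = "isolate"


def next_state(state: State, diff: int, reconnect: int) -> State:
    """Compute the monitor's next state from its two input signals."""
    if state is State.CONNECT:
        # Disconnect the module as soon as it disagrees with the voter.
        return State.ISOLATE if diff == 1 else State.CONNECT
    # Assumption: while isolated, only the reconnect command matters.
    return State.CONNECT if reconnect == 1 else State.ISOLATE
```

A module that starts in `State.CONNECT` stays there while `diff` is 0, moves to `State.ISOLATE` on the first disagreement, and remains isolated until the system issues `reconnect`.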