

Showing posts from May, 2009

Hardware Redundancy

Hardware redundancy is the use of additional hardware to compensate for failures.
This can be done in two ways:

Fault detection, correction and masking. Multiple hardware units may be assigned to do the same task in parallel and their results compared. If one or more units are faulty, we can expect this to show up as a disagreement in the results.
Replacement of the malfunctioning units.
Redundancy is expensive; duplicating or triplicating the hardware is justified only in the most critical applications.
Two methods of hardware redundancy are given below:
Static Pairing
N-Modular Redundancy (NMR)
Static Pairing

Static pairing hardwires processors in pairs and discards the entire pair if one of the processors fails. This is a very simple scheme.
The pair runs identical software with identical inputs and should generate identical outputs. If the outputs are not identical, the pair is considered non-functional and the entire pair is discarded.
This approach is depicted in the following figure, and it will work only when th…
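The compare-and-discard rule of static pairing can be illustrated with a small Python sketch. This is only an illustration in software of a scheme that is really built in hardware; the names ProcessorPair, PairDiscarded, healthy_unit and faulty_unit are hypothetical and do not come from the original post.

```python
from typing import Any, Callable


class PairDiscarded(Exception):
    """Raised when a pair disagrees or has already been discarded."""


class ProcessorPair:
    """Two units run the same computation on the same input (hypothetical model)."""

    def __init__(self, unit_a: Callable[[Any], Any], unit_b: Callable[[Any], Any]):
        self.unit_a = unit_a
        self.unit_b = unit_b
        self.functional = True

    def run(self, value: Any) -> Any:
        if not self.functional:
            raise PairDiscarded("pair was discarded earlier")
        out_a = self.unit_a(value)
        out_b = self.unit_b(value)
        if out_a != out_b:
            # A disagreement cannot tell us which unit failed,
            # so the whole pair is taken out of service.
            self.functional = False
            raise PairDiscarded(f"outputs differ: {out_a!r} vs {out_b!r}")
        return out_a


def healthy_unit(x: int) -> int:
    return x * x


def faulty_unit(x: int) -> int:
    # Assumed fault model for the example: the result is off by one.
    return x * x + 1
```

A pair built from two healthy units returns the agreed result, while ProcessorPair(healthy_unit, faulty_unit).run(3) raises PairDiscarded and marks the pair as non-functional.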

Fault and Error Containment

A fault in one part of the system can cause large voltage swings in other parts of the system, so it is necessary to prevent faults from spreading through the system. This is called containment.
Containment can be divided into:
Fault Containment Zone (FCZ)
A failure of some part of the computer outside an FCZ cannot cause any element inside that FCZ to fail.
Hardware inside the FCZ must be isolated from the outside system.
Each FCZ should have an independent power supply and its own clock (which may be synchronized with the other clocks).
Typically, an FCZ consists of a whole computer, which includes processors, memory, I/O and control interfaces.
Error Containment Zone (ECZ)
An ECZ prevents errors from propagating across zone boundaries. This is achieved by voting on redundant outputs.
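Voting on redundant outputs can be sketched as a strict-majority voter. The function name majority_vote and the choice to return None when no majority exists are assumptions made for this example, not details from the original post.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of the redundant units.

    Returns None when there are no outputs or no value has a strict majority,
    so the caller can treat the result as an uncontained error.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority means more than half of the units agree.
    if count > len(outputs) // 2:
        return value
    return None
```

With three redundant outputs such as [42, 42, 17], the single erroneous value is outvoted and only 42 crosses the zone boundary. With [1, 2, 3] no value has a majority and the voter returns None.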
Hardware Redundancy
Software Redundancy
Time Redundancy
Information Redundancy

Introduction to Fault Tolerance

Fault Tolerance Techniques
Introduction
Hardware faults – occur due to a physical defect in the system, such as a broken wire or a logic value stuck at 0 in a gate.
Software faults – occur due to a bug introduced into the system, so that the software misbehaves for a given set of inputs.
Error – the manifestation of a fault (a fault may occur at any time, but only the error manifests that fault).
Fault latency – the time between the onset of a fault and its manifestation as an error.
Error Recovery
Forward error recovery – the error is masked without any computations having to be redone.
Backward error recovery – the system is rolled back to a moment in time before the error is believed to have occurred.
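Backward error recovery, as defined above, can be sketched with a simple checkpoint-and-rollback wrapper. This is a minimal illustration under assumed names (CheckpointedState, apply_with_recovery) and an assumed validity check supplied by the caller; real systems decide where to checkpoint and how to detect errors in their own ways.

```python
import copy
from typing import Any, Callable


class CheckpointedState:
    """Holds a mutable state plus a saved copy to roll back to (hypothetical)."""

    def __init__(self, state: Any):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self) -> None:
        # Record the current state as the last known-good point.
        self._saved = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Return to the moment before the error is believed to have occurred.
        self.state = copy.deepcopy(self._saved)


def apply_with_recovery(
    store: CheckpointedState,
    update: Callable[[Any], None],
    is_valid: Callable[[Any], bool],
) -> bool:
    """Apply update in place; keep it if valid, otherwise roll back.

    Returns True if the update was kept, False if the state was rolled back.
    """
    update(store.state)
    if is_valid(store.state):
        store.checkpoint()
        return True
    store.rollback()
    return False
```

For example, with a state of {"balance": 100} and a validity check that requires a non-negative balance, an update that subtracts 500 is rolled back and the state stays at 100, whereas an update that subtracts 30 is kept and becomes the new checkpoint.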
What Causes Failures?
There are three main causes of failures:
Errors in the specification or design
Mistakes in the specification and design are very difficult to guard against. Many hardware failures and all software failures occur due to such mistakes. It is difficult to ensure that the specificatio…