Introduction to Fault Tolerance
Fault Tolerance Techniques
- Hardware faults – occur due to a physical defect in the system, such as a broken wire or a logic gate output stuck at 0.
- Software faults – occur due to a bug in the system that makes the software misbehave for a given set of inputs.
- Error – the manifestation of a fault. A fault can be present at any time, but it becomes observable only when it produces an error.
- Fault latency – the time between the onset of a fault and its manifestation as an error.
- Error recovery
  - Forward error recovery – the error is masked without any computations having to be redone.
  - Backward error recovery – the system is rolled back to a moment in time before the error is believed to have occurred.
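The two recovery styles above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a prescribed implementation: forward recovery is shown as majority voting over three replicated results (one common masking technique), and backward recovery as restoring a saved checkpoint. The names `majority_vote` and `CheckpointedState`, the account example, and the negative-balance error detector are all hypothetical.

```python
import copy
from collections import Counter


def majority_vote(outputs):
    """Forward recovery: mask a faulty replica by voting, with no recomputation."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority; the error cannot be masked")
    return value


class CheckpointedState:
    """Backward recovery: keep a saved copy of the state to roll back to."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save the current state as the point to return to.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


# Forward: one of three replicas returns a wrong value and the voter masks it.
masked = majority_vote([42, 42, 7])

# Backward: an error is detected after an update, so the update is undone.
account = CheckpointedState({"balance": 100})
account.checkpoint()
account.state["balance"] -= 500   # erroneous update
if account.state["balance"] < 0:  # hypothetical error detector
    account.rollback()
```

In the forward case no work is repeated, but extra replicas are needed. In the backward case the work done since the checkpoint is lost and would have to be redone.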
What Causes Failures?
There are three main causes of failures:
- Errors in the specification or design
  - Mistakes in the specification and design are very difficult to guard against.
  - Many hardware failures and all software failures occur due to such mistakes.
  - It is difficult to ensure that the specification is completely right.
- Defects in the components
  - Hardware components can develop defects.
  - Wear and tear degrades components over time.
- Environmental effects
  - Devices can be subjected to a whole array of stresses, depending on the application.
  - High ambient temperatures can melt components or otherwise damage them.
Faults are classified according to their temporal and output behavior:
- Temporal behavior classification
  - Permanent fault
    - Does not die away with time; it remains until it is repaired.
    - Example: broken wires
    - From the diagram above: A(t) > 0; B(t) = C(t) = D(t) = 0
  - Intermittent fault
    - Cycles between the fault-active and fault-benign states.
    - Example: loose wires
    - From the diagram above: A(t) > 0; B(t) > 0; D(t) > 0; C(t) = 0
  - Transient fault
    - Dies away after some time.
    - Example: environmental effects
    - From the diagram above: A(t) > 0; C(t) > 0; B(t) = D(t) = 0
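The three conditions on A(t), B(t), C(t) and D(t) listed above can be written as a small lookup. This is only a sketch: the diagram that defines the four quantities is not reproduced here, so the code treats them simply as non-negative numbers sampled at one instant and encodes nothing beyond the three sign patterns given in the notes. The function name `classify_temporal` and the `"unclassified"` fallback are assumptions made for illustration.

```python
def classify_temporal(a, b, c, d):
    """Map the values of A(t), B(t), C(t), D(t) to a temporal fault class.

    Only the three sign patterns listed in the notes are recognized;
    any other combination is reported as "unclassified" (an assumption).
    """
    pattern = (a > 0, b > 0, c > 0, d > 0)
    table = {
        (True, False, False, False): "permanent",    # B = C = D = 0
        (True, True, False, True): "intermittent",   # B > 0, D > 0, C = 0
        (True, False, True, False): "transient",     # C > 0, B = D = 0
    }
    return table.get(pattern, "unclassified")


# Illustrative values only; they are not taken from the notes.
examples = [
    classify_temporal(0.1, 0.0, 0.0, 0.0),
    classify_temporal(0.1, 0.5, 0.0, 0.3),
    classify_temporal(0.1, 0.0, 0.2, 0.0),
]
```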
- Output behavior classification
  - Non-malicious