Introduction to Fault Tolerance

Fault Tolerance Techniques

  • Introduction
    • Hardware faults – occur due to a physical defect in the system, such as a broken wire or a gate output stuck at logic 0.
    • Software faults – occur due to a bug in the system that makes the software misbehave for a given set of inputs.
    • Error – the manifestation of a fault. A fault may occur at any time, but it becomes visible only when it manifests as an error.
    • Fault latency – the time between the onset of a fault and its manifestation as an error.
    • Error Recovery
      • Forward error recovery – the error is masked without any computations having to be redone.
      • Backward error recovery – the system is rolled back to a moment in time before the error is believed to have occurred, and the computation is redone from that point.
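
The two recovery styles can be contrasted with a minimal Python sketch. It is an illustration under stated assumptions, not a method taken from these notes: forward recovery is shown as 2-of-3 majority voting over replicated results, and backward recovery as a checkpoint followed by a rollback. The names majority_vote and CheckpointedCounter are hypothetical.

```python
import copy


def majority_vote(a, b, c):
    """Forward recovery (illustrative): mask one faulty replica by 2-of-3 voting."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority: more than one replica disagrees")


class CheckpointedCounter:
    """Backward recovery (illustrative): save a checkpoint, roll back on error."""

    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a copy of the current state as the recovery point.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the state saved at the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def add(self, value):
        self.state["total"] += value


# Forward: the second replica returns a wrong value, but the vote masks it
# and nothing has to be recomputed.
masked = majority_vote(42, 7, 42)

# Backward: an error is detected after a bad update, so the system returns
# to the checkpoint and redoes the work.
counter = CheckpointedCounter()
counter.add(10)
counter.checkpoint()
counter.add(-999)    # hypothetical corrupted update
counter.rollback()   # back to the state saved before the error
counter.add(5)       # redo the computation
```

Voting costs extra hardware or replicas but no time is lost to recomputation, whereas rollback needs only stored state but repeats the work done since the checkpoint.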


What Causes Failures?

There are three main causes of failures:

  • Errors in the specification or design
    • Mistakes in the specification and design are very difficult to guard against.
    • Many hardware failures and all software failures occur due to such mistakes.
    • It is difficult to ensure that the specification is completely right.
  • Defects in the components
    • Hardware components can develop defects.
    • Components can also fail through wear and tear.
  • Environmental effects
    • Devices can be subjected to a whole array of stresses, depending on the application.
    • High ambient temperatures can melt components or otherwise damage them.


Fault Types

Faults are classified according to their temporal behavior and their output behavior.

  • Temporal behavior classification

    [Figure: state diagram with transitions labeled A(t), B(t), C(t), and D(t)]


  • Permanent Fault
    • Does not die away with time; it remains until it is repaired.
    • Ex. Broken wires
    • From the diagram above: A(t) > 0; B(t) = C(t) = D(t) = 0
  • Intermittent Fault
    • Cycles between the fault-active and fault-benign states.
    • Ex. Loose wires
    • From the diagram above: A(t) > 0; B(t) > 0; D(t) > 0; C(t) = 0
  • Transient Fault
    • Dies away after some time.
    • Ex. Environmental effects
    • From the diagram above: A(t) > 0; C(t) > 0; B(t) = D(t) = 0
  • Output behavior classification
    • Malicious
    • Non-malicious
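
The temporal classification can be expressed as a small lookup on the four rates. The sketch below encodes only the three conditions listed for permanent, intermittent, and transient faults. The function name classify_fault, the "no fault" result for A(t) <= 0, and the "unclassified" fallback are assumptions added for illustration, and the rates are taken as plain numbers sampled at one instant t.

```python
def classify_fault(a, b, c, d):
    """Classify a fault from the values of A(t), B(t), C(t), D(t) at one time t.

    Only the three conditions given in the notes are encoded; the other
    return values are assumptions of this sketch.
    """
    if a <= 0:
        return "no fault"          # assumption: every listed type has A(t) > 0
    if b == 0 and c == 0 and d == 0:
        return "permanent"         # A(t) > 0; B(t) = C(t) = D(t) = 0
    if b > 0 and d > 0 and c == 0:
        return "intermittent"      # A(t) > 0; B(t) > 0; D(t) > 0; C(t) = 0
    if c > 0 and b == 0 and d == 0:
        return "transient"         # A(t) > 0; C(t) > 0; B(t) = D(t) = 0
    return "unclassified"          # assumption: combination not covered above


# Example rates (made-up values, for illustration only).
broken_wire = classify_fault(0.01, 0.0, 0.0, 0.0)
loose_wire = classify_fault(0.01, 0.5, 0.0, 0.5)
environmental = classify_fault(0.01, 0.0, 0.2, 0.0)
```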
