Dr. Ntokozo Mthembu (Advisor to the ODI Board) writes:
“Sharing Kumar’s concept of Reliability in Maintenance Engineering (Prof Uday Kumar, Luleå University of Technology, Sweden – Luleå tekniska universitet, Sverige)
The History of Reliability
The history of reliability in the modern age can be traced back to two distinct eras, namely post-1900 and post-World War II. During the former, reliability was intuitive, subjective, and qualitative, while during the latter, it was based on a quantitative and mathematical approach characterised by planning, design, operation, and maintenance as its key components. While this is true, reliability could be said to be as old as the exchange of goods and services itself, that is to say, it has existed for as long as there has been social and trade intercourse amongst the people of the world. That earlier period is not covered in this piece of idea sharing and exchange. It requires a different study by scientific philosophers and historians. Therefore, we shall leave it at that, hoping that some of the readers of this piece might be inspired to undertake this exploration. However, we all know the effects of reliability, and the lack thereof, in our daily lives. We use expressions such as ‘he is unreliable’, ‘the system is broken’, ‘this vehicle is unreliable when driving in muddy conditions’, etc. Naturally, we associate reliability with positive results. I wonder how we should define reliability when what performs consistently is a negative condition. Can we say that Eskom load-shedding schedules are reliable?
Our industrial world is replete with examples of the disastrous effects of unreliable systems and assets. The following disasters are but a few that we know of on a world scale:
- Three Mile Island Nuclear meltdown, Pennsylvania, USA, March 1979.
- The Bhopal disaster, or Bhopal gas tragedy, at the Union Carbide India Limited plant, Bhopal, India, December 1984.
- The Space Shuttle Challenger broke apart 73 seconds into its flight, January 1986.
- The Chernobyl disaster, a nuclear accident, April 1986.
- The Space Shuttle Columbia disintegrated as it re-entered the atmosphere over Texas and Louisiana, in February 2003.
- The Fukushima Daiichi Nuclear Power Plant disaster, Fukushima, Japan, March 2011.
Bitter truth: the count continues …
Definitions of Reliability
Reliability spans all industries and owes its formal existence to the aviation, military, nuclear, chemical, and process industries. It is described in many ways, ranging from dictionary definitions to modern definitions inspired by science and engineering theory and applications. The Oxford Dictionary defines reliability as ‘the quality of being trustworthy or of performing consistently well’.
Kumar describes it simply as ‘an integral part of design, operation, and maintenance for most engineering systems/assets’. In a more formal way, the Institute of Electrical and Electronics Engineers (IEEE) defines it as ‘the ability of a system or component to perform its required functions under stated conditions for a specified period of time’.
In essence, reliability is an integral part of the design, operation, and maintenance of most engineering systems/assets.
Unpacking Kumar’s Description of Reliability
There are four keywords in the Kumar – IEEE description of reliability that are worth underscoring, namely:
1 ability,
2 required function,
3 stated conditions, and
4 specified period of time.
The burning question is: what is the significance of each keyword in the description of reliability?
- Ability: refers to the chance or likelihood that the system or component will work properly.
- Required function: means that the system requirements specification is the criterion against which reliability is measured. If the actual performance falls within the tolerance limits of the standards, the intended function of the system is treated as successful.
- Stated conditions: a product may perform its intended function adequately in one set of conditions and quite poorly in another. Stated conditions include air pressure, temperature, humidity, shock, and vibration.
- Specified period of time: deals with the time frame over which the intended function is needed, usually called the mission time.
Kumar’s example of a reliability statement utilizing all four of the definition elements might be: the system or component has a 99% probability of operating at greater than 80% of rated capacity for 500 h without failure, at ambient temperature 25–50°C, with no more than 55% humidity in a dust-free atmosphere.
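To make the numbers in this statement concrete, here is a minimal Python sketch. It assumes a constant failure rate (the exponential model, R(t) = exp(-λt)), which is a simplification chosen for illustration and not something Kumar's statement itself specifies. The function names are hypothetical.

```python
import math

def reliability(t_hours: float, failure_rate: float) -> float:
    """Probability of surviving t_hours without failure.

    ASSUMPTION: constant failure rate (exponential model),
    R(t) = exp(-failure_rate * t).
    """
    return math.exp(-failure_rate * t_hours)

def max_failure_rate(target_reliability: float, mission_hours: float) -> float:
    """Largest constant failure rate that still meets the target
    reliability over the mission time (same exponential assumption)."""
    return -math.log(target_reliability) / mission_hours

# Figures taken from the reliability statement: 99% over 500 h.
lam = max_failure_rate(0.99, 500.0)   # failures per hour
mtbf = 1.0 / lam                      # mean time between failures, hours

print(f"max failure rate: {lam:.3e} per hour")
print(f"implied MTBF:     {mtbf:.0f} hours")
print(f"check R(500 h):   {reliability(500.0, lam):.4f}")
```

Under this assumption, the 99% over 500 h requirement corresponds to a failure rate of roughly 2.0e-5 per hour. The stated conditions (temperature, humidity, dust) do not appear in the formula: they define the environment in which that failure rate is expected to hold.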
Some basic concepts of reliability
- Reliability engineering: deals with the design and construction of systems and products (and related economics).
- Reliability science: is concerned with the properties of materials and the causes for deterioration leading to part and component failures.
- Reliability technology: deals with tools and methodology.
- Reliability management: deals with the various management issues in the context of managing the design (allocating resources), manufacture (ensuring parts are produced as per specification), and/or operation of reliable products and systems (systems are maintained properly).
The emphasis is on the business viewpoint since unreliability has consequences in terms of cost, time wasted, and, in certain cases, the welfare of an individual or even the security of a nation.
Reliability discipline
The reliability discipline deals primarily with two types of reliability, namely hardware reliability and software reliability. Hardware reliability is associated with a hazard rate, which is defined as the failure rate per unit time, often shortened to the failure rate, and is modeled using the so-called “bathtub” curve. Strictly speaking, the hardware bathtub curve does not apply to software, because software does not typically wear out. The relationship between hardware and software in terms of reliability is shown by the wear-out period exhibited in Figure 1 below.
Figure 1: The bathtub curve for hardware and software wear-out-period regimes (DeBardeleben, 2010) (*)
For hardware, the hazard rate decreases initially (the infant mortality phase), then is roughly uniform, and finally increases (the wear-out phase). For software, the curve should in theory look like the one labelled “software in theory” in Figure 1. In software, the counterpart of infant mortality is the debug phase.
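The three hardware phases described above can be sketched numerically. The Python example below is one common way to produce a bathtub shape, not the model behind Figure 1: it adds a decreasing Weibull hazard (shape below 1), a constant hazard, and an increasing Weibull hazard (shape above 1). All parameter values are hypothetical and chosen only to make the shape visible.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1).

    shape < 1 gives a decreasing hazard, shape > 1 an increasing one.
    Requires t > 0.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t: float) -> float:
    """Illustrative hardware hazard rate at time t (hours, t > 0).

    ASSUMPTION: all parameters below are made up for illustration.
    """
    infant_mortality = weibull_hazard(t, shape=0.5, scale=2000.0)   # decreasing
    random_failures = 1e-4                                          # constant
    wear_out = weibull_hazard(t, shape=5.0, scale=10000.0)          # increasing
    return infant_mortality + random_failures + wear_out

# Sample one point in each phase: early life, useful life, wear-out.
for hours in (10.0, 3000.0, 15000.0):
    print(f"t = {hours:>7.0f} h  hazard = {bathtub_hazard(hours):.6f} per hour")
```

With these parameters the hazard is high early on, lowest in the middle of the life, and high again late in life. Dropping the wear-out term gives a curve that never turns upward, in line with the observation that software does not typically wear out.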
Finally, there is a discipline of reliability that lurks behind hardware and software called human reliability. Human reliability is ubiquitous in all systems because of the human footprint in the design, deployment, and operations of all systems. Even the much-touted artificial intelligence cannot escape the clutches of human brains in its conceptual framework through design, commissioning, and implementation.
(*) Jones, William M., Daly, John T., and Nathan DeBardeleben, “Application Monitoring and Checkpointing in HPC: Looking Towards Exascale Systems”. This work builds on the material presented in “The Impact of Sub-optimal Checkpoint Intervals on Application Efficiency in Computational Clusters”, in the 19th ACM International Symposium on High-Performance Distributed Computing, Chicago, Illinois, June 20-25, 2010, pp. 276-279.”
Key 9 of the 20 Keys System for Operations Improvement is all about maintaining machines and equipment.
The key concepts of Key 9 are:
- Effective maintenance is a joint venture between operators and the maintenance/engineering function
- Improving overall equipment effectiveness (OEE)
- Keeping equipment logs, and performance history for equipment
- Preventive maintenance of machines/equipment (roll-out process by starting with important machines/equipment)
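As an illustration of the OEE concept listed above, the following Python sketch uses the widely used definition OEE = availability × performance × quality. The function name and the shift figures are hypothetical, and the 20 Keys material may define or report OEE somewhat differently.

```python
def oee(planned_time_min: float, downtime_min: float,
        ideal_cycle_time_min: float, total_count: int, good_count: int) -> float:
    """Overall equipment effectiveness as availability * performance * quality.

    ASSUMPTION: standard textbook definition; inputs are per shift.
    """
    run_time = planned_time_min - downtime_min
    availability = run_time / planned_time_min                    # share of planned time actually running
    performance = (ideal_cycle_time_min * total_count) / run_time  # actual speed versus ideal speed
    quality = good_count / total_count                             # share of good units
    return availability * performance * quality

# Hypothetical shift: 480 min planned, 60 min downtime,
# ideal cycle time 1 min per unit, 378 units made, 360 of them good.
score = oee(480.0, 60.0, 1.0, 378, 360)
print(f"OEE = {score:.1%}")
```

Keeping equipment logs and performance history, as the key concepts suggest, is what provides the downtime, count, and defect figures this calculation needs.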
To read Part 1: The Kumar’s Why, What, When, How, Who and Where Concepts of Engineering Maintenance, click here and to read Part 2: The Kumar’s How, Who, and Where concepts of engineering maintenance, click here.