Failure Isn’t Just an Option: It’s Unavoidable
In Part 2 of this 2-part series, we look at how to use the P-F Curve to determine the point of functional failure.
In Part 1 (which appeared in the March/April issue of Paper360°), we looked at changing our worldview about what defines failure. I explained the concept of “Functional Failure” and introduced the P-F Curve (see Fig. 1). So, how does the P-F Curve help us?
We use the P-F Curve to identify what can be done to detect a pending functional failure and to determine how often we should inspect for it. We set the inspection interval at one-half the length of the P-F interval. We do this to give us enough time to proactively plan and mitigate the pending functional failure. If half the P-F interval will not give us enough time to proactively plan for corrective maintenance, then we may consider an alternate detection method with a longer P-F interval, or we may inspect at an interval shorter than half the P-F interval so we have ample time to plan.
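The half-interval rule can be written down as a short calculation. The sketch below is only an illustration: the function name, the parameter names, and the way it shortens the interval (so that the worst-case warning time equals the planning time needed) are my assumptions, not a standard formula.

```python
def inspection_interval(pf_interval_days, planning_lead_days=0.0):
    """Return a suggested inspection interval in days, or None.

    Hypothetical helper: starts from half the P-F interval and shortens it
    only when the remaining warning time would be too short to plan the repair.
    """
    if pf_interval_days <= 0:
        raise ValueError("P-F interval must be positive")

    interval = pf_interval_days / 2.0

    # Worst case: the defect becomes detectable right after an inspection,
    # so it is found one full interval later. The warning time left before
    # functional failure is then (P-F interval - inspection interval).
    if pf_interval_days - interval < planning_lead_days:
        interval = pf_interval_days - planning_lead_days

    # No positive interval leaves enough warning: this detection method
    # cannot work here, so look for one with a longer P-F interval.
    if interval <= 0:
        return None
    return interval


print(inspection_interval(90))       # half of a 90-day P-F interval
print(inspection_interval(90, 60))   # shortened to leave 60 days to plan
print(inspection_interval(10, 14))   # None: need a different detection method
```

Notice that criticality does not appear anywhere in the inputs. Only the P-F interval and the time needed to plan the work set the result.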
It is important to note that in no way does the concept of criticality influence the frequency of inspection. Criticality’s only role is in determining if the inspection is worth doing. Those two sentences are sometimes hard to swallow because they fly in the face of traditional thinking and of what probably 95 percent of the maintenance and reliability industry practices. But no matter how critical a system is considered to be, if the failure modes that lead to its functional failure have identifiable P-F intervals, why would we inspect more often than once every half P-F interval? Criticality plays no role in inspection frequency.
If half the P-F interval is not long enough to prevent functional failure, we may decide to shorten the inspection interval; however, we commonly find that inspections are scheduled at intervals much shorter than half of the P-F interval. This is done for a host of reasons: to make us feel better, in response to pressure to “never have another failure,” because it is “best practice,” or because “it’s what everyone else is doing.” The list goes on and on, but none of these reasons is based on a logical, systematic approach. The P-F interval must set the inspection frequency, not criticality, not emotions, not best practices.
So the P-F Curve is fine for mechanical devices, but what about electrical components? They either work or they don’t, right? Well, maybe not.
Let’s go back to that idea of the functional failure and, instead of a pump, let’s consider a finished products line. It doesn’t matter if it is fine paper converting, tissue, a winder, etc., because I bet everyone who reads this article has at one time or another cycled power on a computer, PLC, or drive to clear a fault. It is also very common to clear that fault and still meet the daily/weekly/monthly production standard. In this example, functional failure does not occur at the drive reset, but when the drive fault disrupts production to such a degree that we do not meet our performance standard. Even if we define functional failure as a loss of our production standard, we still must define a P-F interval.
Back to our earlier question. Electrical systems either work or they don’t, right? I will bet again that most of you have experienced a drive reset only to see the drive fault return: first within weeks or months, then within days or weeks, then within hours or days, and finally, not to clear at all. There you go: a P-F Curve. From the first time the drive faults, you enter the P-F Curve at point P. Then, each time you reset the drive, you progress down the curve until you reach functional failure.
Want more proof? How many times have you had a drive fault that you have experienced in the past, and at the first reset you go to the storeroom and make sure you have a spare drive in stock? I’ll bet many times. Without even knowing what the P-F Curve was, you were still acting on the P-F concept and your experience.
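If you log when a drive faults, the shrinking time between resets is easy to check. This is a minimal sketch with made-up numbers; the function names and the fault log are hypothetical, and a real program would need to allow for noise in the data.

```python
def reset_gaps(fault_days):
    """Days between consecutive faults, given fault times in days since the first fault."""
    return [later - earlier for earlier, later in zip(fault_days, fault_days[1:])]


def is_accelerating(fault_days):
    """True when every gap between faults is shorter than the one before it."""
    gaps = reset_gaps(fault_days)
    return len(gaps) >= 2 and all(b < a for a, b in zip(gaps, gaps[1:]))


# Hypothetical log: faults on day 0, 60, 75, and 78.
faults = [0, 60, 75, 78]
print(reset_gaps(faults))       # [60, 15, 3]
print(is_accelerating(faults))  # True
```

A steadily shrinking gap is the signal that you are moving down the curve from point P toward functional failure.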
In this discussion, I have not mentioned criticality; however, there is a clear point of relevance. Criticality will play a role in the storeroom stocking decision.
One last point—and I’m very interested in reader feedback on this. Have you ever heard a manager state, “I don’t want to have another drive fail!”? I have heard this pronouncement for many mechanical components, but never electrical. I wonder if it’s because electrical components lend themselves more easily to the concept of random failures, while mechanical failures seem as if they should be time-based—even though both mechanical and electrical failures are most likely random in nature.
THE CAUSES OF FAILURE
There is another concept that supervisors, planners, and engineers, as well as maintainers, know very well, though it is often overlooked by operations managers. This is the concept of how the initiating cause of a functional failure develops.
A bearing defect caused by normal fatigue of a race will develop over a period of time (OAPOT); a bearing defect caused by human error (lubrication, overload, installation, damage, etc.) will occur suddenly. Understanding how a cause of failure develops (OAPOT or suddenly) will determine which strategy we should employ to mitigate it. For causes that develop OAPOT, we will look for a P-F Curve and a usable P-F interval; but for the causes that occur suddenly, most of the mitigations we should employ will be policy- or procedure-based.
As an example: for the cause of failure related to fatigue we may use vibration analysis, but for a lubrication error we should institute training/standardization/controls. Yet it is very common to determine that a failure has been caused by sudden human error and then react by implementing a strategy that is appropriate for an OAPOT-type cause.
See if this sounds familiar: A bearing failure occurs. The maintainer reviews previous vibration analysis, which shows no signs of a defect. The vibration team is adamant that their program would catch any normal OAPOT causes of failure, so this failure must be sudden in nature. Yet the management team increases the frequency of vibration analysis instead of digging into what caused the sudden (human error) failure. Now we are wasting time taking additional vibration readings that are not needed. We degrade the confidence of the vibration team, and we will not prevent this sudden functional failure from recurring.
There is one additional way a cause of functional failure can occur, and it has to do with protective devices: I call it “this only matters if” (TOMI). The idea is that a failure of the protective device only matters if the protected function fails. For example, if you have a duty/standby pump, the standby pump only matters if the duty pump fails. Under normal circumstances, you will not know if the standby pump works unless you need it. The same holds true for other protective devices such as smoke detectors, spare tires, alarms, or light curtains. For TOMI-type functional failures, we must understand and determine how often the duty system is failing, how often the protective device is failing, and how often we are willing to tolerate both devices being functionally failed at the same time (remember, “never” is unattainable). With these three pieces of information, we can calculate how often we must check our protective devices.
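One way to turn those three pieces of information into a checking interval is an approximation commonly used in the reliability-centered maintenance literature for failure-finding tasks. I am not claiming it is the only method. It assumes random failures and a checking interval that is short compared with the protective device's mean time between failures. The function and parameter names below are hypothetical, and the pump numbers are invented for illustration.

```python
def failure_finding_interval(mtbf_protective, mtbf_protected, tolerable_mtbf_both):
    """Approximate interval for checking a protective device.

    All three inputs must use the same time unit (years here):
      mtbf_protective     - mean time between failures of the protective device
      mtbf_protected      - mean time between failures of the duty system
      tolerable_mtbf_both - how often we will tolerate both being failed at once

    Reasoning behind the approximation:
      average unavailability of the device ~ interval / (2 * mtbf_protective)
      rate of "both failed" = (1 / mtbf_protected) * unavailability
      setting that rate equal to 1 / tolerable_mtbf_both and solving gives:
      interval = 2 * mtbf_protective * mtbf_protected / tolerable_mtbf_both
    """
    values = (mtbf_protective, mtbf_protected, tolerable_mtbf_both)
    if any(v <= 0 for v in values):
        raise ValueError("all inputs must be positive")
    return 2.0 * mtbf_protective * mtbf_protected / tolerable_mtbf_both


# Invented example: the standby pump fails about once in 10 years, the duty
# pump about once in 5 years, and we will tolerate losing both at the same
# time no more than once in 1,000 years.
years = failure_finding_interval(10, 5, 1000)
print(round(years * 365))  # days between standby pump checks
```

If the result comes out as a large fraction of the protective device's own mean time between failures, the approximation no longer holds and the numbers deserve a closer look.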
So, there you have the concepts every supervisor/planner/engineer must understand for real reliability to begin:
1. The P-F interval
2. That criticality plays no role in inspection frequency
3. The new definition of failure: Functional Failure
4. The different types of causes of failure: OAPOT, suddenly, or TOMI
If you have any questions or comments about the concepts I have outlined in this article, I would really like to hear from you. Feel free to contact me directly by e-mail or through the editor.
Jay Shellogg spent the last 16 years of his career working at a large pulp and paper mill, primarily as a senior environmental engineer and maintenance/reliability superintendent. During that time he encountered many challenges; in his own words, “Some I overcame, and some I didn’t.” Contact him at [email protected].