PIP-II FMEA Training
J. Holzbauer on behalf of the Technical Integration Team
Wednesday, December 9, 2020
A Partnership of: US/DOE, India/DAE, Italy/INFN, UK/UKRI-STFC, France/CEA, CNRS/IN2P3, Poland/WUST

Overview
What is an FMEA?
– What is covered?
– How does it fit into the overall documentation of a design?
How do I create an FMEA?
– How do I scope my FMEA?
– Template Overview
– Usage Examples
– Reviewing an FMEA (team-internal and external experts)
How do I USE an FMEA?
– When do I create one?
– How do I include it in my design process?
– How does my FMEA interact with PIP-II Review Requirements?
01/29/2024 J. Holzbauer Failure Mode and Effect Analysis Training

What is an FMEA?
FMEA: Failure Mode and Effect Analysis
– Prevention through Design Table: Personnel Risk/Safety
– FMEA: Technical Risk
Your FMEA is a critical part of your QC process
– Allows fair comparison and prioritization of risks
– Helps to identify where to focus QC efforts
– Focuses conversation within the technical group
– Documents process and provides a place to implement lessons learned as design proceeds
– Grounds existing QC controls to their motivating factors (and vice versa)

Three Pillars of FMEA Content
Process Failures
– Human Factors
– Procedural Errors
Design Failures
– Failure During Operation
– Failure during correct usage
External Failures
– Cascading failure from other systems
– Natural disasters or effects

Breaking up Scope
[Diagram: example scope breakdown with boxes labeled Cavity, Tuner, Coupler, Warm Coupler, Cold Coupler, Clean String, Cryomodule Components, and Cryomodule]
There is no reason to cover an entire L3 within one FMEA.
It is important to group scope logically, but also to ensure that there are no gaps between FMEAs.
Consider where activities are handled by different teams of people; the handoff between teams is a reasonable place to break scope. Also consider who will review each FMEA.
Operations, maintenance, and lifecycle considerations should be captured in the most integrated/last FMEA.

Process Failure Examples
Any time a human is performing work during fabrication, installation, operation, maintenance, etc., failures can occur.
Examples (list by process):
1. Installation of shipping supports for a Cryomodule to be shipped
2. Performing maintenance of a rack-mounted electronic device
3. Updating firmware on a PLC/FPGA system
4. Misconfiguration of software during update
5. Assembly of a beamline
6. Pumping out and leak-checking a beam box
Possible Failures:
1. Damage to beamline components during installation
2. Tool short damages vulnerable components
3. Wrong firmware version uploaded causing unpredictable behavior
4. System fails to start due to configuration errors
5. Bellows damaged during installation
6. Improper back-fill contaminates clean vacuum space

Process Failure Flow Example
Process Step: Installation of CM into Transport Frame
Failure Mode: Unintended Impact of CM and Frame
Effect:
– Beamline vacuum hardware damaged, beamline vented
– Large shock to CM damages non-beamline CM components
– Scuff to CM paint

Design Failure Examples
Failures during the normal usage of a device (transportation, operation, installation, etc.) not caused by process failures
Examples (list by device):
1. Beamline Bellows
2. HPRF Amplifier
3. Device Motion Control
4. Cryogenic Heater
5. User Front End Software
Possible Failures:
1. Cyclical fatigue failure of bellows vents clean vacuum space
2. Overheating during operation causes component damage
3. Intermittent readback errors cause unpredictable machine control
4. Wire disconnects from heater causing lack of function
5. Unanticipated user input causes system damage

Design Failure Flow Example
[Flow diagram: Design Feature (Transportation Frame), then Failure Mode, then Effect. Box labels: Input Vibration Larger than Expected; Isolation Ineffective; Vibration at CM over spec; Exceeds transport envelope; Stress exceeds yield; Unable to Transport; Structure loses Integrity]

External Failure Examples
Failures of your system caused by effects coming from outside your system
– Other systems failing, with that impact cascading into your system in an uncontrolled way
– Truly external factors (environmental, infrastructural)
– Like a What-If Analysis called for by FESHM 5031
Examples (List by External Effect)
– Cryoplant Trip
– Cooling water flow stops
– Fire in the RF Gallery/Linac Tunnel
– Site-wide power outage (including recovery from outage!)
Critical point: Communication of this failure should NOT be via existing interlocks or MPS, although YOUR interlocks may be designed to protect your system.

External Failure Flow Example
External Condition: Significant Delays during transport
Failure Mode:
– Instrumentation Loses Power
– Prolonged Exposure to weather/elements
Effect:
– Corrosion to CM components
– Transport Data Compromised

How granular do you have to get?
It is not important to capture every potential failure!
What is critical is capturing CLASSES of failures if your mitigation of that failure also broadly applies
– Beamline bellows can fail during transport because of vibration, so we will assess/protect all CM bellows to mitigate risk
– Firmware/Software updates may be managed by the same process and procedures
Criticality/impact of the failure is a factor
– Incidental contact between CM and transport frame can happen in many ways, but we only consider damage to critical components (e.g., beamline vacuum hardware)
– Failing to connect a piezo tuner cable in the tunnel is annoying, but just a time loss unless the lack of proper termination causes the driver to fail

FMEA Scope Details
You are the expert on YOUR system; others may have to rely on YOUR expertise for some failures.
This means that you must own your system's FMEAs for their full lifecycle
– Fabrication, transportation, operation, maintenance, etc.
Linac Installation is a good example: they own execution of the installation activities, but they are not, and cannot be expected to be, the technical experts on all these systems
– Constant communication about risks, potential failure modes, mitigations and protections, procedures, etc. is critical for successful installation
– Similar logic applies for storage, partner handoffs, etc.

FMEA Template
PIP-II FMEA Template Link for Reference: PIP-II FMEA Templates (Protected).xlsm

Process Step, Item/Function, or External Condition
Identify the functions of the scope.
– For Process FMEA, identify each of the steps in the process (Process Step). These should be listed in the order in which the steps are performed, i.e. chronologically.
– For Design FMEA, identify the item being analyzed and its function (Item/Function). This can be a design feature, a component, subsystem, or complete system. The function describes what the item does. There may be multiple functions for a single item.
– For External Condition FMEA, identify each external factor that can have a negative impact on the design/process (External Condition). Common external conditions include power outages, transportation delays, etc.

Identifying Failure Modes and Effects
Potential Failure Mode
– Identify all the possible failures at each process step, for each function, or for each external factor identified.
– In what ways could this fail?
Potential Effect(s) of Failure
– For each failure, identify all the consequences.
– If the failure occurs, what will happen?
The template provides guidance in the form of a Severity Matrix. This matrix is defined in QAM 12030 Technical Appendix - Table 1.

Scoring Severity
Severity
– How serious the effect is.
– Select the determined severity level from the drop-down.
– If a failure mode has more than one effect, consider the severity level for each.
The template provides guidance in the form of a Severity Matrix. This matrix is defined in QAM 12030 Technical Appendix - Table 1.

Severity Matrix
Columns: SEVERITY, PEOPLE, ENVIRONMENT, COMPLIANCE, PROPERTY, PROCESS/PROJECT
CRITICAL
– People: Multiple deaths from injury or illness; multiple cases of injuries involving permanent disability; or chronic irreversible illnesses.
– Environment: Permanent loss of a public resource (e.g. drinking water, air, stream, or river).
– Compliance: Willful disregard for the rules and regulations.
– Property: Loss of multiple facilities or program components; (5,000,000 total cost*)
– Process/Project: Total breakdown identified resulting in loss/shut down of a process or project.
HIGH
– People: One death from injury or illness; one case of injury involving permanent disability; or chronic irreversible illnesses.
– Environment: Long-term loss of a public resource (e.g., drinking water, air, stream, or river).
– Compliance: Major noncompliance that exposes the Lab to significant potential fines and penalties.
– Property: Loss of a facility or critical program component; (5,000,000 total cost*)
– Process/Project: Major breakdown identified resulting in the failure to attain the budget, schedule, key performance indicators or customer expectations.
MEDIUM
– People: Injuries or temporary, reversible illnesses resulting in hospitalization of a variable but limited period of disability.
– Environment: Seriously impair the functioning of a public resource.
– Compliance: Significant noncompliance that requires reporting to DOE or other authorities.
– Property: Major property damage or critical program component; (1,000,000 - 5,000,000 total cost*)
– Process/Project: Significant compromise to the attainment of the budget, schedule, key performance indicators or customer expectations which exposes process/project to potential failure if gap cannot be immediately resolved.
LOW
– People: Injuries or temporary, reversible illnesses not resulting in hospitalization with lost time.
– Environment: Isolated and minor, but measurable, impact(s) on some component(s) of a public resource.
– Compliance: Programmatic noncompliance with the Lab's Work Smart set.
– Property: Minor property damage or critical program component; (50,000 - 1,000,000 total cost*)
– Process/Project: Minor breakdown or gap identified which does not result in significant compromise to the attainment of the budget, schedule, key performance indicators or customer expectations; gaps can be resolved.
MINIMAL
– People: Injuries or temporary illnesses requiring only minor supportive treatment and no lost time.
– Environment: No measurable impact on component(s) of a public resource.
– Compliance: Specific instance of a noncompliance with the Lab's Work Smart set.
– Property: Standard property damage or critical program component; (50,000 total cost*)
– Process/Project: Minor gaps identified which do not compromise the attainment of the budget, schedule, key performance indicators or customer expectations; gaps can easily be resolved.
* total cost = total dollar value including parts, labor, contingency plans, etc. that is necessary to repair/replace property or program component.

Potential Causes of Failure
Potential Cause(s)/Mechanism(s) of Failure
– For each failure, identify all the potential root causes.
– What could cause this to happen?

Scoring Probability
Probability
– For each potential root cause, identify the probability of the failure occurring.
– The template provides guidance in the form of a Probability Matrix. This matrix is defined in QAM 12030TA Table 2.
– Select the determined probability level from the drop-down.

Probability Matrix
MISHAP PROBABILITY TABLE
PROBABILITY: DESCRIPTION
– Almost Certain: Could occur annually
– Likely: Could occur once in two years
– Possible: Occurring not more than once in ten years
– Unlikely: Occurring not more than once in thirty years
– Rare: Occurring not more than once in one hundred years

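The Probability Matrix can be read as a lookup from an estimated recurrence interval to a level. The sketch below is illustrative only: the function name, the use of a recurrence interval in years as input, and the choice to round in-between intervals toward the more probable level are all assumptions, not part of the template or QAM 12030TA.

```python
# Lower bounds (in years between occurrences) for each level, taken from the
# Probability Matrix descriptions. Rounding intermediate intervals toward the
# more probable level is an assumption made for this sketch.
PROBABILITY_LEVELS = [
    (100, "Rare"),      # not more than once in one hundred years
    (30, "Unlikely"),   # not more than once in thirty years
    (10, "Possible"),   # not more than once in ten years
    (2, "Likely"),      # could occur once in two years
]


def probability_level(years_between_occurrences: float) -> str:
    """Return the matrix level for an estimated recurrence interval."""
    if years_between_occurrences <= 0:
        raise ValueError("interval must be positive")
    for min_years, level in PROBABILITY_LEVELS:
        if years_between_occurrences >= min_years:
            return level
    return "Almost Certain"  # could occur annually


print(probability_level(5))    # an interval between "Likely" and "Possible"
print(probability_level(150))
```

The drop-down selection in the template remains a judgment by the system expert; a lookup like this is only a consistency aid.
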
Current Controls (Prevention)
For each root cause, identify the controls in place to prevent the failure or reduce its likelihood.
– Describe how the failure is prevented based on the current or planned actions.
– The controls identified here should be used as input in determining the Probability level.

Current Controls (Detection)
For each root cause, identify the current controls in place to detect a failure mode that has been realized.
– Describe how the failure mode or the cause is detected based on the current or planned actions.
– The controls identified here should be used as input in determining the Detection level.

Scoring Detection and Calculating Risk
Detection
– The probability that, if realized, the failure mode will be detected.
– The FMEA template provides guidance in the form of a Detection Matrix table. A table is provided for each FMEA type (Process, Design, External).
– Select the determined detection level from the drop-down.
Risk
– The Risk takes into consideration the severity, probability, and detection level selected.
– Provides a method for ranking potential failures in the order they should be addressed.
– Automatically determined based on severity, probability, and detection.

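The slides do not describe how the template combines the three scores, so the sketch below swaps in the classic FMEA Risk Priority Number (the product of three 1-5 ranks) purely to illustrate how severity, probability, and detection can be combined into a ranking. The numeric ranks, the class and function names, and the scores assigned to the example failures are all assumptions, not the template's method.

```python
from dataclasses import dataclass

# Assumed 1-5 ranks for each level; the template's own lookup may differ.
SEVERITY = {"Minimal": 1, "Low": 2, "Medium": 3, "High": 4, "Critical": 5}
PROBABILITY = {"Rare": 1, "Unlikely": 2, "Possible": 3, "Likely": 4,
               "Almost Certain": 5}
# Better detection lowers risk, so the detection ranks run the other way.
DETECTION = {"Almost Certain": 1, "High": 2, "Moderate": 3, "Low": 4,
             "Almost Impossible": 5}


@dataclass
class Failure:
    name: str
    severity: str
    probability: str
    detection: str

    def rpn(self) -> int:
        """Classic Risk Priority Number: severity x probability x detection."""
        return (SEVERITY[self.severity]
                * PROBABILITY[self.probability]
                * DETECTION[self.detection])


def rank_failures(failures):
    """Order failures from highest to lowest RPN."""
    return sorted(failures, key=lambda f: f.rpn(), reverse=True)


# Hypothetical scores for failures named elsewhere in this training.
failures = [
    Failure("Wrong firmware version uploaded", "Medium", "Unlikely", "High"),
    Failure("Bellows damaged during installation", "High", "Possible", "Low"),
    Failure("Scuff to CM paint", "Minimal", "Likely", "Low"),
]

for failure in rank_failures(failures):
    print(failure.rpn(), failure.name)
```

A product of ranks is only one possible convention; what matters for prioritization is that every failure in the FMEA is scored the same way.
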
Detection Matrix
Process FMEA Detection Guidance
Likelihood of Detection: Criteria
– Almost Impossible: No current process control, cannot detect.
– Low: Failure mode detected by an individual through the use of visual/tactile/audible means.
– Moderate: Failure mode detected by an individual through the use of variable or attribute gauging (calipers, go/no-go, multimeter, torque wrench, etc.)
– High: Failure mode detected by automated controls.
– Almost Certain: Failure mode is prevented as a result of design. Failure mode will not occur as it has been error-proofed by process/product design.

Detection Matrix
Design FMEA Detection Guidance
Likelihood of Detection: Criteria
– Absolute Uncertainty: No current design control, cannot detect or is not analyzed.
– Low: Design verification after final design with pass/fail testing (system or sub-system testing with acceptance criteria), test to failure testing (system or sub-system testing until failure occurs), or degradation testing (system or sub-system testing after durability test, e.g., function check).
– Moderate: Design validation prior to final design using pass/fail testing (acceptance criteria for performance, function checks, etc.), test to failure testing (until leaks, yields, cracks, etc.), or degradation testing (data trends, before/after values, etc.)
– High: Design analysis/detection controls have strong detection capability. Virtual Analysis (e.g. CAE, FEA, etc.) is highly correlated with actual and/or expected operating conditions.
– Almost Certain: Failure mode cannot occur because it is fully prevented through design solutions (e.g. proven design standard, best practice, etc.)
External FMEA Detection Guidance
Likelihood of Detection: Criteria
– Absolute Uncertainty: No current methods of detection, cannot detect.
– Low: Failure mode is detected by an individual through visual/tactile/audible means.
– Moderate: Failure mode is detected by an individual through technological aids (e.g. smoke detector)
– High: Failure mode is detected by automated controls.
– Almost Certain: Failure mode is detected by automated controls and actions are automatically put into place to remediate (e.g. generator turns on during power outage).

Scoring Summary
Severity
– How serious the effect is.
– Critical, High, Medium, Low, Minimal
Probability
– The probability of failure occurring throughout the lifecycle of the scope.
– Almost Certain, Likely, Possible, Unlikely, Rare
Detection
– The probability that, if realized, the failure mode will be detected.
– Almost Impossible, Low, Moderate, High, Almost Certain

Action Items and Anticipated Improvements
The current assessment of each failure generates a Risk level.
– Negligible, Minor, Moderate, Serious, Critical
If a risk is deemed too high given the current design/plan, corrective actions should be created and detailed.
– These columns should contain the actions themselves, the party responsible for implementing these changes, and a target date
– The last group of columns is a reassessment of the risks for a failure assuming that the action plans are implemented
Once an action is completed, the details of the changes should be captured in the main assessment (columns E through N)
– The action plan can then be removed and replaced with a new action plan as necessary to further reduce the risk

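The action-item lifecycle described on this slide can be sketched as a small data structure. Everything below is hypothetical: the class names, field names, and example values are invented for illustration and do not correspond to the actual template columns.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Scores:
    severity: str
    probability: str
    detection: str


@dataclass
class ActionPlan:
    action: str
    responsible: str
    target_date: date
    anticipated: Scores  # reassessment assuming the action is implemented


@dataclass
class FmeaRow:
    failure_mode: str
    controls: str
    current: Scores
    plan: Optional[ActionPlan] = None

    def complete_action(self) -> None:
        """Fold a finished action into the main assessment and clear the plan."""
        if self.plan is None:
            raise ValueError("no open action plan")
        self.controls = f"{self.controls}; {self.plan.action}"
        self.current = self.plan.anticipated
        self.plan = None  # a new plan can now be attached if risk is still too high


# Hypothetical example row.
row = FmeaRow(
    failure_mode="Bellows damaged during installation",
    controls="Installation procedure",
    current=Scores("High", "Possible", "Low"),
    plan=ActionPlan(
        action="Add bellows restraints",
        responsible="Cryomodule team",
        target_date=date(2024, 6, 1),
        anticipated=Scores("High", "Unlikely", "Low"),
    ),
)
row.complete_action()
print(row.current, row.controls)
```

The point of the sketch is the state change: once the action is done, the anticipated scores become the current scores and the plan slot is free for a follow-up action.
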
Down-selects and Relative Risk
FMEAs can be a tool for assessing relative risk of different technical solutions.
The assessment and grading process for failures gives a reasonable comparison of relative technical risk.
In the case of comparisons between different technical options, a limited FMEA can be created for each and the number and criticality of the risks can be compared.

                 Option 1  Option 2  Option 3
Negligible Risk     10         8        15
Minor Risk           7         8         8
Moderate Risk        0         5         2
Serious Risk         2         1         4
Critical Risk        1         0         0

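One way to compare the options in the example table is to rank them by their most serious risks first. The sketch below uses the counts from this slide; the ranking rule (fewest Critical risks, then fewest Serious, and so on) and the function names are assumptions chosen for illustration, not a PIP-II requirement.

```python
RISK_LEVELS = ["Negligible", "Minor", "Moderate", "Serious", "Critical"]

# Counts per option, taken from the example table on this slide.
OPTIONS = {
    "Option 1": {"Negligible": 10, "Minor": 7, "Moderate": 0,
                 "Serious": 2, "Critical": 1},
    "Option 2": {"Negligible": 8, "Minor": 8, "Moderate": 5,
                 "Serious": 1, "Critical": 0},
    "Option 3": {"Negligible": 15, "Minor": 8, "Moderate": 2,
                 "Serious": 4, "Critical": 0},
}


def sort_key(counts):
    """Tuple of counts from Critical down to Negligible (assumed ranking rule)."""
    return tuple(counts[level] for level in reversed(RISK_LEVELS))


def rank_options(options):
    """Return option names ordered from lowest to highest relative risk."""
    return sorted(options, key=lambda name: sort_key(options[name]))


print(rank_options(OPTIONS))
```

A weighted sum of the counts would be an equally defensible rule; either way, the ranking is an input to the down-select discussion rather than a decision by itself.
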
Reviewing an FMEA
Creation of an FMEA should not be a solo effort, but it also does not need to be a full-team effort.
Although your core design team should be involved in creating and reviewing the FMEA, additional people should be included to review risks relevant to their work
– Assembly staff is an often-overlooked group that can provide great insight and feedback on assembly-related risks
– Receiving QC staff can give good feedback on incoming QC-related failures
FMEA subject matter experts are always available to assist in creation and review of these documents at any stage
– Quality Section (T. Digrazia, M. Luedke)
– Technical Integration (J. Holzbauer)

How do I use an FMEA?
An FMEA should be created as early in a design cycle as possible, starting with broad failures and refining as the design evolves.
Every 6-12 months, or as relevant, the FMEA should be refined, including newly discovered potential failures and taking credit for design improvements
– Update action items and anticipated improvements
If a failure occurs, lessons learned should be rolled into the FMEA.
Before each review, you should refresh and update your FMEA; it is a required deliverable for each PIP-II technical review.

Summary
FMEA is a tool to help document and manage technical risks.
All aspects of risk are assessed:
– Human factors/process failures
– Design failures
– External factors
The process of identifying, assessing, and comparing these risks has been discussed using the FMEA template.
The FMEA is a tool to be used in your design process to help focus risk assessment and mitigation.