System Safety: An Overview
The idea or concept of system safety can be traced to the missile production industry of the late 1940s. It was further defined as a separate discipline by the late 1950s and early 1960s, used primarily by the missile, aviation, and aerospace communities. Prior to the 1940s, system designers and engineers relied predominantly on trial and error for achieving safe design.
This approach was somewhat successful in an era when systems were relatively simple compared with those developed later.
A System is:
“…a perceived whole whose elements ‘hang together’ because they continually affect each other over time and operate toward a common purpose.”
Systems Thinking is defined as:
“…a body of methods, tools, and principles, all oriented towards a common goal involving process interrelationships and control feedback…”
In the aviation industry, for example, the trial-and-error process was often referred to as the "fly-fix-fly" approach to design problems. An aircraft was designed based upon existing or known technology. It was then flown until problems developed or, in the worst case, it crashed.
If design errors were determined to be the cause (as opposed to "human" or "pilot" error), then the design problems would be fixed and the aircraft would fly again.
Obviously, this method of after-the-fact design safety worked well when aircraft flew low and slow and were constructed of wood, wire, and cloth. However, as systems grew more complex and aircraft capabilities such as airspeed and maneuverability increased, so did the likelihood of devastating results from a failure of the system or one of its many subtle interfaces.
Elements such as these became the catalyst for the development of systems engineering, out of which eventually grew the concept of systems safety.
The dawn of the manned spaceflight program in the mid-1950s also contributed to the growing necessity for safer system design. Hence, the budding missile and space systems programs became a driving force in the development of system safety engineering.
Those systems under development in the 1950s and early 1960s required a new approach to controlling hazards such as those associated with weapon and space systems (e.g., explosive components and pyrotechnics, unstable propellant systems, and extremely sensitive electronics).
The Minuteman Intercontinental Ballistic Missile (ICBM) was one of the first systems to have had a formal, disciplined, and defined system safety program.
In July 1969, the U.S. Department of Defense (DOD) formalised system safety requirements by publishing MIL-STD-882, entitled "System Safety Program Requirements."
This Standard has since undergone three revisions.
The U.S. National Aeronautics and Space Administration (NASA) soon recognised the need for system safety and has since made extensive system safety programs an integral part of space program activities. The early years of America’s space launch programs are full of catastrophic and quite dramatic examples of failures.
During those early years, it was a known and quite often stated fact that "our missiles and rockets just don't work, they blow up." The many successes since those days can be credited in large part to the successful implementation and utilization of a comprehensive system safety program. However, it should be noted that the Challenger disaster in January 1986 stands as a constant reminder to all that, no matter how exact and comprehensive a design or operating safety program is considered to be, the proper management of that system is still one of the most important elements of success.
This fundamental principle is true in any industry or discipline.
Eventually, the programs pioneered by the military and NASA were adopted by industry in such areas as nuclear power, refining, mass transportation, chemicals, and, more recently, computer programming. Today the system safety process is still used extensively by the various military organizations within the Department of Defense, as well as by many other federal agencies such as NASA, the Federal Aviation Administration, and the Department of Energy.
In most cases, it is a required element of primary concern in the federal agency contract acquisition process. Although it would not be possible to fully discuss the basic elements of system safety without comment and reference to its military/federal connections, the primary focus of this article is on the advantages of utilizing system safety concepts and techniques as they apply to the general safety arena.
In fact, the industrial workplace can be viewed as a natural extension of the past growth experience of the system safety discipline. Many of the safety rules, regulations, statutes, and basic safety operating criteria practiced daily in industry today are, for the most part, the direct result of a real or perceived need for such control doctrine.
The requirement for safety controls (written or physical) developed either because a failure occurred or because someone with enough foresight anticipated a possible failure and implemented controls to avoid such an occurrence. Even though the former example is usually the case, the latter is also responsible for the development of countless safe operating requirements practiced in industry today.
Both, however, are also the basis upon which system safety engineers operate.
The first method, creating safety rules after a failure or accident, is likened to the fly-fix-fly approach discussed earlier.
The second method, anticipating a potential failure and attempting to avoid it with control procedures, regulations, and so on, is exactly what the system safety practitioner does when analyzing system design or an operating condition or method.
However, whenever possible or practical, the system safety concept goes a step further and actually attempts to engineer the risk of hazard(s) out of the process. With the introduction of the system safety discipline, the fly-fix-fly approach to safe and reliable systems was transformed into the "identify, analyze, and eliminate" method of system safety assurance.
This establishes the connection between the system safety discipline and general industry occupational safety practice.
See also article on Systems Failure and Disasters:
Flixborough, Challenger, Bhopal, Chernobyl, Piper Alpha, Longford, Cave Creek.
Systems Failure Analysis is based on the Deming philosophy that Occupational Safety & Health, like Quality, is primarily the result of senior management actions and decisions and not the result of actions taken by "workers".
Deming stresses that it is the "system" of work that determines how work is performed and only managers can create the system and improve it. Only managers can allocate resources, provide training, select the equipment and tools used and provide the plant and the environment necessary to achieve high standards of occupational health and safety.
The people who work in the system have little or no influence over 85+% of the causes of accidents that are built into the system; only management action can change the system.
Employees can only be responsible for resolution of special safety problems caused by actions or events directly under their control.
Are accidents really such or are they more accurately described as performance imperfections of human and physical resources that should be under the control of a responsible management?
What should be investigated?
The message to managers who consider accident investigation just a tedious legal requirement, to be avoided whenever possible, must be that good incident/accident investigation has great potential for improving how managers manage.
Managers must be concerned with collecting and analysing data about any accidents involving the people they supervise.
We know that accidents invariably produce a Loss, but the extent of that loss is mostly chance. The result has little to do with cause.
We need to accept the principle that the severity of personal harm or property damage should not dictate the intensity of the investigation.
The system of investigation must probe into the way management manages and look at the Systems that have Failed.
It is important to note too, that cost does not relate to the cause.
Large Losses do not necessarily reveal mismanagement.
How useful is the accident data we currently collect?
Most data recorded and analysed on accidents is based on OSH requirements that call for a detailed description of the circumstances surrounding the accident, rather than information on why it happened. Many computerised analysis programs end up giving a great deal of accessible, but not useable, information, and very little insight (to management) into why the accidents are occurring.
We need to identify Corrective and Preventive Action(s) to:
a) Eliminate the causes of the accident that occurred;
b) Prevent recurrence; and
c) Prevent potential accidents.
Corrective Action does just that: it corrects the existing problem, which may nevertheless continue to recur.
Preventive Action improves the systems and processes.
Collecting and analysing data on the distribution of accidents by "class" or "type" (i.e. nature of injury, part of body injured, accident type, agency of injury, etc.) does little to identify root cause(s).
For example, the fact that 100 eye injuries occurred in a given plant may impress upon its management the need for safety. But it does not point out any specific causes, nor does it suggest any area in which management should take remedial action. Even the addition of a figure of $10,000 loss from eye injuries does not offer any specifics for management on action to control the Loss.
This "class" and "type" information shows magnitude without the identification of cause.
It does not provide causal information on which management can base decisions to act and evaluate progress on efforts expended to improve Health & Safety.
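The difference between the two kinds of data can be illustrated with a minimal sketch. The records, field names and cause descriptions below are hypothetical and invented purely for illustration; they are not drawn from any real reporting system. The same set of records is tallied twice, once by "class" and once by a recorded root cause.

```python
from collections import Counter

# Hypothetical accident records; field names and values are illustrative only.
records = [
    {"part_of_body": "eye", "root_cause": "no eye protection standard for grinding"},
    {"part_of_body": "eye", "root_cause": "no eye protection standard for grinding"},
    {"part_of_body": "eye", "root_cause": "guard removed, no maintenance schedule"},
    {"part_of_body": "hand", "root_cause": "guard removed, no maintenance schedule"},
]


def tally(records, field):
    """Count how many records share each value of the given field."""
    return Counter(record[field] for record in records)


# Tally by "class": shows magnitude only (how many eye injuries).
by_class = tally(records, "part_of_body")

# Tally by recorded root cause: points at something management can act on.
by_cause = tally(records, "root_cause")

print(by_class.most_common())
print(by_cause.most_common())
```

The first tally only says that eye injuries are the largest group. The second, assuming a root cause was actually recorded during investigation, shows that one missing maintenance schedule lies behind injuries in two different "classes".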
Before we can analyse the systems that have failed, we need to clearly understand the sub-systems involved in accidents and how they relate to each other.
Using Systems Failure as a working tool [See also AS/NZS 4801:2001]
Systems fail for a variety of reasons but usually due to a multiplicity of causes.
Many of the individual causes may have been built into the original systems and processes and lain unnoticed for many years.
When a particular and unique set of conditions and circumstances arises, however, catastrophe may result. Subsequent investigations often reveal flawed systems that have never been previously or officially revealed (in writing) to Senior Management, although the employees working in the process are often aware of the shortcomings.
Some of the problems may actually have been unwittingly introduced by the employees themselves, as they operate the system and related processes.
This is the reason that regular internal audits* should be conducted, together with an investigation and follow-up of all accidents and mishaps. [see training section for NZSC ASA Auditor Course Training]
Audits Check for Compliance with Management Performance Standards.
Audits must be conducted by auditor(s) experienced in the processes being audited.
Any such Audit must actively involve the employees working in the processes being audited in order to uncover any major flaws or shortcomings in their operation.
This will involve using a Team of experienced people to reveal flaws, in either hardware or software systems and/or procedures.
If the quality of written Standards, as well as the operational integrity of the Standards, is regularly tested, then the only problem is usually one of individual compliance with the standard. If, however, the process has been changed for any reason, then a complete re-evaluation will be required.
Carrying out Systems Failure Analysis
Accident Investigation & Analysis involves the methodical and systematic examination of the sequence of events before, during and after a particular mishap, to establish the root causes and actual consequences of the mishap (undesired event(s)).
It also provides feedback on the effective functioning, or otherwise, of the Management System and/or Safety System through analysis of how and which part of the management system failed.
The Systems Failure Analysis Tool
These causes can be grouped under 3 headings:
• Human Systems (errors),
• Physical Systems (failures), and
• Management Systems (oversights).
From this systematic analysis of Accident Data, the root cause(s) of the failure(s) can be established. Trends can also be measured which uncover repeated loss exposures, improperly evaluated risks, and inadequate controls.
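As a rough sketch of how such an analysis might be recorded, the example below groups investigation findings under the three headings and flags causes that recur across separate accidents. Everything in it is an assumption made for illustration: the `System` enum, the `findings` data, and the helper functions `group_by_system` and `repeated_causes` are hypothetical and are not part of any published Systems Failure Analysis tool.

```python
from collections import defaultdict
from enum import Enum


class System(Enum):
    """The three headings under which causes are grouped."""
    HUMAN = "Human Systems (errors)"
    PHYSICAL = "Physical Systems (failures)"
    MANAGEMENT = "Management Systems (oversights)"


# Hypothetical findings: (accident id, heading, cause). Illustrative only.
findings = [
    ("A1", System.HUMAN, "isolation step skipped"),
    ("A1", System.MANAGEMENT, "permit system not audited"),
    ("A2", System.PHYSICAL, "relief valve failed"),
    ("A2", System.MANAGEMENT, "permit system not audited"),
    ("A3", System.MANAGEMENT, "shift hand-over not documented"),
]


def group_by_system(findings):
    """Group (accident id, cause) pairs under their heading."""
    grouped = defaultdict(list)
    for accident_id, system, cause in findings:
        grouped[system].append((accident_id, cause))
    return dict(grouped)


def repeated_causes(findings, min_accidents=2):
    """Return causes found in at least min_accidents distinct accidents."""
    accidents_by_cause = defaultdict(set)
    for accident_id, _system, cause in findings:
        accidents_by_cause[cause].add(accident_id)
    return {
        cause: sorted(ids)
        for cause, ids in accidents_by_cause.items()
        if len(ids) >= min_accidents
    }


for system, items in group_by_system(findings).items():
    print(system.value, items)
print(repeated_causes(findings))
```

In this made-up data set, the same management oversight appears in two separate accidents, which is the kind of repeated loss exposure a trend analysis is meant to uncover.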
Following the analysis, we can then decide on:
(a) prompt and immediate corrective action(s); and
(b) longer term preventive action(s).
Note: Actions (a) and (b) may be different.
What is required, to stop immediate recurrence of the mishap, may be very different to the longer-term strategy that is then adopted.
Preventive actions may involve a re-design of equipment, purchase of new equipment, modified procedures, recruitment changes, different training or re-training of staff and any one of a possible multitude of actions that may emerge from the systems failure analysis.
Sophisticated Health, Safety and Environmental (HSE) Systems in larger organisations will consist of dozens of processes and maybe as many as 175 related HSE practices.
A failure in any one or more of these practices could lead to a major mishap, so establishing root causes may not be a simple task and therefore the corrective and preventive actions may also be multi-tiered.
The Piper Alpha Oil Platform disaster in the North Sea in 1988: A classic ‘Systems Failure’ [The following information is used with the training video ‘Spiral to Disaster’ as part of our training courses.] [See also the separate, more detailed entry for Piper Alpha.]
The Piper Alpha Oil Platform disaster was caused by the ignition of condensate flooding from a blind flange that could not withstand the pressure of condensate in the pipe it closed.
The blind flange was replacing a valve that had been removed for repair.
At the time of removal of the valve, there was no condensate in this pipe, and the blind flange was not intended to withstand the pressure of the condensate.
However, the operators in the control room on the night shift were not properly informed by the day shift that the valve had been taken out for repair. That night, operational irregularities occurred. The operators took the natural action of leading the condensate into the alternative pipe, and the potential for tragedy became a fact.
The explosion occurred and led to a large crude oil fire. The explosion in effect destroyed the platform’s main power supplies and the control room, and a series of serious consequences flowed from this, which led to the destruction of the platform and the tragic loss of 167 lives.
There is little doubt that the immediate cause of the disaster was human error, but there is rarely a single cause of a major accident. More generally, a series of incidents (each normally safe or even trivial in itself) come together, against an existing background, to cause it.
In the case of Piper Alpha, obvious examples are:
• The unavailability of a crane to replace the repaired valve before the night shift took over;
• The lack of knowledge of the absence of the valve by the men on the night shift; and
• The fact that the only valve available malfunctioned that night.
The management system, by way of the “Permit to Work”, was found to be poorly monitored, and the platform itself, through its original design and subsequent adaptation on several occasions, was open to the possibility of accident should circumstances conspire as they did.
Lord Cullen, who chaired the Court of Inquiry, made the following observations:
“…The failure in the operation of the Permit To Work system was not an isolated mistake but that there were a number of respects in which the laid down procedure was not adhered to and unsafe practices were followed. One particular danger, which was relevant to the disaster, was the need to prevent the inadvertent or unauthorised re-commissioning of equipment, which was still under maintenance and not in a state in which it could safely be put into service. The evidence also indicated dissatisfaction with the standard of information which was communicated at shift hand-over”.