## System Reliability Basics

Almost everything we use every day can be thought of as a system built from components. Our lungs form a system. A string of lights forms a system. A manufacturing process and a data center are more complicated systems. Every system is built from **components**, and components are either *functional* or *failed*.

**Remark:** We can get more advanced and consider components in states of varying health and age, but we’re going to consider only the binary states of functional and failed.

Suppose a given system has *n* components. Each component will be either on or off. We call these binary states **component states** and give them a nice, natural mathematical definition.

**Definition (Component states).**

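The **state** of component *i* is denoted x_{i} and is defined by

x_{i}:= \left\{\begin{array}{lr}1, & \text{component } i \text{ is functioning}\\0, &\text{component } i \text{ has failed}\end{array}\right.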

Now, depending on how these components interact, the entire system is either functional or failed. We can write these component states in a vector called the **system state vector**:

**Definition (System State Vector).**

For *n* components in a system, each with state x_{i}, the system state vector is defined by \mathbf{x} = (x_{1},\ldots,x_{n}).

The system state vector is a combination of 1’s and 0’s indicating which components are functioning and which have failed. A single component failure (a component state of 0) does not always result in a system failure. For example, you can still type with 8 fingers instead of 10, so two fingers in a splint will not keep you from your office duties. We can define a **structure function** that takes a system state vector as input and tells us whether the entire system is functioning or failed. The structure function of “typing with fingers” should return a 1 (functional) for any system state vector with two adjacent 0’s (since a splint binds two fingers together). If we let \phi(\mathbf{x}) be the structure function of the act of typing with fingers, then \phi(1,0,0,1,1,1,1,1,1,1) returns a 1, because we can still type with only 8 functioning fingers. Formally,

**Definition (Structure Function).**

The **structure function** of a system, \phi : \{0,1\}^{n} \to \{0,1\}, is defined as

\phi(\mathbf{x}):= \left\{\begin{array}{lr}1, & \text{the system is functioning when the state vector is } \mathbf{x}\\0, &\text{the system has failed when the state vector is } \mathbf{x}\end{array}\right.

**Remark:** *For a system with n components, there are 2^n possible system state vectors. To see this, note that each component state has two possible values and there are n components, so there are 2\cdot2\cdots2 = 2^n system state vectors.*
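The counting argument in the remark can be checked by brute force. The following is a minimal Python sketch; the helper name `all_state_vectors` is made up for illustration and does not come from the text.

```python
from itertools import product

def all_state_vectors(n):
    """Return every system state vector for n binary components.

    Hypothetical helper: each vector is a tuple of 0's (failed)
    and 1's (functioning), one entry per component.
    """
    return list(product((0, 1), repeat=n))

states = all_state_vectors(3)
print(len(states))   # 8, which is 2**3
print(states[:3])    # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
```

Enumerating every vector is only practical for small n, since the list doubles in length with each added component.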

Each system has a *topology*, i.e. its component arrangement. Here, we are concerned with the *reliability topology*, which tells us which combinations of components must function for the system to work. We use a tool called **block diagrams** to represent these.

### Example (Series System)

One of the simplest types of systems is the series system. In this example, if one component fails, the whole system goes down.

The block diagram shows us in a clear visual way whether the system is functioning. The goal is to be able to move from left to right in the diagram. If a component is functioning, we may move through that node. If not, that “path” is closed. If we cannot find a path from left to right, the system is not functioning when the given set of components is out. In the series system, there is only one possible path from left to right, and it passes through every single component. Thus, removing even one component removes the only path we have to a functioning system.

The structure function for a series system is given by

\phi_{\text{series}}(\mathbf{x}) = \prod_{i=1}^{n}x_{i}

Here, we can see that if any component state x_{i} is 0, the whole product is 0, which means the system fails. Conversely, the only way \phi(\mathbf{x}) can be 1 is if x_{i} = 1 for every *i*.
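As a quick illustration, the series structure function can be written as a product over the state vector. This is a minimal sketch in Python; the name `phi_series` is chosen here for illustration only.

```python
from math import prod

def phi_series(x):
    """Series structure function: the product of all component states."""
    return prod(x)

print(phi_series((1, 1, 1, 1)))  # 1: every component works
print(phi_series((1, 1, 0, 1)))  # 0: a single failure brings the system down
```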

**Remark:** The Christmas lights happen to physically match the reliability topology given above. This isn’t always the case. The physical arrangement doesn’t necessarily correspond to the reliability topology. For example, we can view a simple computer as a processor, a motherboard, a hard drive, and a power supply. These are not physically arranged in a line inside your computer case, but if any one of these components fails, your computer is useless.

### Example (Parallel System)

The next simplest type of system is the *parallel system*. This one is the exact opposite of the series system: if even one component in a parallel system is still working, the system still functions.

The structure function for the parallel system is given by

\phi_{\text{parallel}}(\mathbf{x}) = 1-\prod_{i=1}^{n}(1-x_{i})

To see this, remember that the component state is 1 when the component functions, and 0 when it fails. A system is the same way. In a parallel system, the system fails only if all of the components fail. Each factor (1-x_{i}) equals 1 exactly when component *i* has failed, so the product is 1 only when every component has failed. We subtract this product from 1 because we want to know when the system *functions*, not when it fails.
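The same formula translates directly into code. The sketch below assumes a state vector given as a tuple of 0's and 1's; `phi_parallel` is an illustrative name, not one from the text.

```python
from math import prod

def phi_parallel(x):
    """Parallel structure function: 1 minus the product of (1 - x_i)."""
    return 1 - prod(1 - xi for xi in x)

print(phi_parallel((0, 0, 1, 0)))  # 1: one working component is enough
print(phi_parallel((0, 0, 0, 0)))  # 0: the system fails only when all fail
```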

These two are the basic system reliability topologies. We can make ones that are much more complex, but we have a theorem that shows any system’s block diagram can be re-structured into a series system of parallel subsystems, or a parallel system of series subsystems.

From Leemis (*Reliability: Probabilistic Models and Statistical Methods*, 2nd ed., Lawrence M. Leemis),

**Theorem (Decomposition of Systems into Series/Parallel Subsystems).**

Let P_{1},...,P_{s} be the minimal path sets for a system. Then

\phi(\mathbf{x}) = 1-\prod_{i=1}^{s}\left(1-\prod_{j \in P_{i}}x_{j}\right)

where x_{j} is the state of component *j*.

What are **path sets**? The path sets are the sets of components that form a complete path through the block diagram.

If we look at the figure above, we have lots of possible paths from left to right. The sets (1,3), (1,4), (2,4), and (2,5) all provide paths through the diagram. These are also *minimal path sets* in this case: focusing on any particular path set, if we drop one component, the path disappears. The theorem gives the mathematical structure function corresponding to a parallel system of the series subsystems that the minimal path sets generate.

In other words, the system is functioning if Components 1 and 3 are functioning, **or** Components 1 and 4 are functioning, **or** Components 2 and 4, etc.
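The theorem's formula can be evaluated mechanically once the minimal path sets are known. Below is a minimal Python sketch using the four path sets listed above; the function name `phi_from_path_sets` and the assumption that the system has exactly five components (labelled 1 to 5) are illustrative choices, since the figure itself is not reproduced here.

```python
from math import prod

def phi_from_path_sets(x, path_sets):
    """Evaluate 1 - prod_i (1 - prod_{j in P_i} x_j).

    x is a tuple of component states; path_sets holds tuples of
    1-based component labels, so component j is read from x[j - 1].
    """
    return 1 - prod(1 - prod(x[j - 1] for j in path) for path in path_sets)

# Path sets quoted in the text; five components is an assumption.
PATH_SETS = [(1, 3), (1, 4), (2, 4), (2, 5)]

print(phi_from_path_sets((1, 0, 1, 0, 0), PATH_SETS))  # 1: path (1, 3) is intact
print(phi_from_path_sets((0, 1, 1, 0, 0), PATH_SETS))  # 0: no path set is fully working
```

Each inner product is a small series system, and the outer expression combines them in parallel, which mirrors the "or" reading given above.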

Last, here’s an example of how we can take a given block diagram and rearrange it in terms of a parallel system of series subsystems according to the theorem.

In the figure above, the first diagram represents a topology called a *bridge system*. The second diagram is the alternative version where I’ve arranged this into a parallel system of series subsystems. To test your knowledge, try the following exercise.

**Exercise.** Write the structure function for the bridge system.

**Solution.** \phi(\mathbf{x}) = 1-(1-x_{1}x_{3}x_{5})(1-x_{1}x_{4})(1-x_{2}x_{3}x_{4})(1-x_{2}x_{5})
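One way to sanity-check the solution is to compare it, over all 2^5 state vectors, against the plain-language rule "the system works if at least one minimal path set is fully functioning". The Python sketch below does this; the minimal path sets are read off the factors of the solution, and the names `phi_bridge`, `MIN_PATHS`, and `has_working_path` are illustrative.

```python
from itertools import product

def phi_bridge(x):
    """Bridge structure function, copied from the solution above."""
    x1, x2, x3, x4, x5 = x
    return 1 - (1 - x1*x3*x5) * (1 - x1*x4) * (1 - x2*x3*x4) * (1 - x2*x5)

# Minimal path sets implied by the four factors of the solution.
MIN_PATHS = [(1, 3, 5), (1, 4), (2, 3, 4), (2, 5)]

def has_working_path(x):
    """Return 1 if every component of some minimal path set is functioning."""
    return int(any(all(x[j - 1] == 1 for j in path) for path in MIN_PATHS))

agree = all(phi_bridge(x) == has_working_path(x)
            for x in product((0, 1), repeat=5))
print(agree)  # True
```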

Why do we actually care about this? Engineers use block diagrams to help them design reliable systems. We can prove mathematically that the series system is the least reliable and the parallel system is the most reliable, but that doesn’t mean we should always put every component in parallel. An airplane carrying more landing gear than it needs may weigh too much, for example.

This results in complex system designs, particularly for mechanical and electronic devices. These designs may become too difficult to study visually via block diagrams, but the structure function helps. Studying the form of the structure function in terms of the components can help us determine where a critical component is, for example. A *critical component* is a component that will fail the whole system if it fails. In a series system, every component is critical. In a parallel system, no single component is. Since most system designs are mixtures of these two, identifying a critical component isn’t always simple with the block diagram. The structure function allows us to simulate various components failing (having state 0) and quickly see the effect on the system.
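The simulation idea can be sketched in a few lines of Python. The code below assumes the reading of "critical" used in this section: a component is flagged if setting its state to 0, with every other component functioning, makes the structure function return 0. The helper `critical_components` and the small mixed system `phi_mixed` are hypothetical examples, not taken from the text.

```python
from math import prod

def critical_components(phi, n):
    """Return 1-based labels of components whose failure alone fails the system.

    Each component is failed in turn while all others are kept functioning.
    """
    critical = []
    for i in range(n):
        x = [1] * n
        x[i] = 0                      # simulate component i+1 failing
        if phi(tuple(x)) == 0:
            critical.append(i + 1)
    return critical

def phi_series(x):
    return prod(x)

def phi_parallel(x):
    return 1 - prod(1 - xi for xi in x)

def phi_mixed(x):
    """Hypothetical system: component 1 in series with a parallel pair (2, 3)."""
    return x[0] * (1 - (1 - x[1]) * (1 - x[2]))

print(critical_components(phi_series, 3))    # [1, 2, 3]
print(critical_components(phi_parallel, 3))  # []
print(critical_components(phi_mixed, 3))     # [1]
```

The mixed example shows the point made above: component 1 is critical even though the system as a whole is neither purely series nor purely parallel.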

Upon identifying a critical component, an engineer has a couple of options:

- add redundancy
- use a more reliable (and usually more expensive) version of the component

These kinds of decisions require optimization with cost and other conditions, but these basic reliability building blocks are tools that can help make this easier.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.