﻿


## Probabilistic Ways to Represent the Lifetime of an Object

Every item, system, person, or animal has a lifetime. For people and animals, we typically just measure the lifetime in years, but we have other options for items and systems. We can measure airplane reliability in flight hours (hours actually flown), or stress test a manufacturing tool in cycles. Regardless of the units we use, there are many things in common. We don’t know how long any item will “live” before it’s manufactured or deployed, so an item’s lifetime is a random variable. We wish to make decisions about manufacturing, warranties, or even purchasing by taking the reliability of an object into account.

We can represent each class of items (a brand of 100W lightbulbs, USR’s NS4 robots, etc.) by a random variable for the lifetime. We’ll call it $Y$. Like any random variable, it has a probability distribution. There isn’t only one way to represent the distribution of $Y$. We can look at equivalent representations, each one useful for answering different types of questions. This article will run through a few of them and their uses by studying the theoretical lifetime distribution of USR’s famous NS4 robots.

Disclaimer: This example is derived from the fictional company USR from Isaac Asimov’s I, Robot. This should not be construed to represent any real product. I’m sure I’m missing some other disclaimer notices, but assume the standard ones are here.

## Survivor Function

The survivor function is the most common way to study the lifetime of an item. Colloquially, this is the probability that the item survives past time $t$. We denote it by $S(t)$, and we can write mathematically that

$$S(t) = P(Y \geq t)$$

The survivor function can come from a standard probability distribution (the exponential distribution is the most common choice) or from some other formula.

Example (Exponential NS4s)
Without having access to USR’s manufacturing data, let’s assume that the survivor function of the NS4 robot is given by $S(t) = e^{-t/3}$. Let’s also assume that $t$ is measured in years. What is the probability that a brand new NS4 lasts more than 5 years?

From the graph above, we can simply plug $t=5$ into the survivor function to get the answer to our question. The probability that the new NS4 survives longer than 5 years is

$$S(5) = e^{-5/3} \approx 0.189$$

We could use the survivor function to help USR decide where to place the cutoff for warranty claims. Depending on the cost of either repairing the NS4 or replacing the NS4 with the NS5, we can backsolve to find out what $t$ would satisfy management. Suppose the cost function requires that the probability of surviving past the cutoff $t$ is 85%. Then we can use the survivor function to backsolve for $t$:

$$\begin{aligned}0.85 &= e^{-t/3} \\ \ln(0.85) &= \frac{-t}{3} \\ 0.49 &\approx t\end{aligned}$$

Thus, we would set the warranty claims to be valid only for about the first half year after the NS4 is purchased.
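The two calculations above can be reproduced with a few lines of Python. This is only a sketch under the article's assumption that $S(t) = e^{-t/3}$ with $t$ in years; the function names `survivor` and `warranty_cutoff` are made up for illustration.

```python
import math

MEAN_LIFE_YEARS = 3.0  # the 3 in S(t) = exp(-t/3), taken from the example


def survivor(t: float) -> float:
    """S(t) = P(Y >= t) for the assumed exponential NS4 lifetime."""
    return math.exp(-t / MEAN_LIFE_YEARS)


def warranty_cutoff(target_survival: float) -> float:
    """Solve S(t) = target_survival for t by taking logs of both sides."""
    return -MEAN_LIFE_YEARS * math.log(target_survival)


print(round(survivor(5), 3))            # chance a new NS4 lasts past 5 years
print(round(warranty_cutoff(0.85), 2))  # warranty cutoff in years
```

Plugging the cutoff back into `survivor` should return the 85% target, which is a quick sanity check on the algebra.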

Remark. Another way to judge an item is by looking at the shape of the survivor function. A steep decline like the one shown in the above graph tells us that the NS4 isn’t exactly the most reliable robot. Only about half of them survive two years.

### Conditional Survivor Function

For those who wish to dive into a little more math, there is the conditional survivor function. This “spinoff” of the survivor function tells us the probability that an item survives past time $t$ given that it is still functioning at time $a$. The survivor function above assumes the clock starts at 0; that is, the object is brand new. If we have bought a used NS4, or perhaps have been sending it to the grocery store for a while, then we need to take into account the fact that the NS4 has already been operational for some time $a$.

We write the conditional survivor function $S_{Y|Y\geq a}(t)$ for some fixed $a$ and any $t \geq a$. We can use the definition of conditional probability to express this mathematically:

$$S_{Y|Y\geq a}(t) = P(Y \geq t | Y \geq a) = \frac{P(Y \geq t \text{ and } Y \geq a)}{P(Y \geq a)} = \frac{P(Y \geq t)}{P(Y\geq a)} = \frac{S(t)}{S(a)}$$

The third equality holds because $t \geq a$, so the event $Y \geq t$ already implies $Y \geq a$. In words, the probability that the NS4 survives past $t$ given that it has already lived for $a$ years is $S(t)/S(a)$.

Example
Suppose we bought a used NS4 that was 2 years old (and it is working now). What is the probability that this NS4 is still working more than 3 years from now?

We are looking for the probability that the NS4 is still operational after more than 5 years given that it has already been working for 2. So

$$S_{Y|Y\geq 2}(5) = \frac{S(5)}{S(2)} = \frac{e^{-5/3}}{e^{-2/3}} = \frac{1}{e} \approx 0.368$$

Thus, we only have a 36.8% chance of getting more than 3 years out of our used NS4. Perhaps we may want to consider haggling for a lower price…
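As a rough check of the used-robot example, here is a small Python sketch of the ratio $S(t)/S(a)$, again assuming $S(t) = e^{-t/3}$. The helper name `conditional_survivor` is hypothetical. As a side observation not made in the text, the exponential distribution is memoryless, so a 2-year-old NS4 has the same chance of lasting 3 more years as a brand-new one.

```python
import math


def survivor(t: float) -> float:
    """S(t) = exp(-t/3), the assumed NS4 survivor function."""
    return math.exp(-t / 3.0)


def conditional_survivor(t: float, a: float) -> float:
    """P(Y >= t | Y >= a) = S(t) / S(a), valid for t >= a."""
    if t < a:
        raise ValueError("t must be at least a")
    return survivor(t) / survivor(a)


used = conditional_survivor(5, 2)  # 2-year-old robot surviving to age 5
new = survivor(3)                  # brand-new robot surviving 3 years
print(used, new)                   # both equal 1/e for this distribution
```

For a lifetime distribution that is not exponential, the two numbers would generally differ, which is exactly why the conditional form is worth having.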

## Cumulative Distribution Function (Cumulative Failure Probability)

This is the cumulative distribution function straight from basic probability, but we can add an additional interpretation in the context of reliability. The cumulative distribution function of a random variable $Y$ is the probability that the random variable is less than or equal to a fixed value $t$. We denote this by $F_{Y}(t) = P(Y \leq t)$.

You may recognize this as the “opposite” of the survival function ($S_{Y}(t) = P(Y \geq t)$). In probability, we call this the complement, and we can get from one event to its complement by noting that for an event $A$ and its complement $A^{c}$, $P(A) + P(A^{c}) = 1$. For a continuous lifetime, $P(Y = t) = 0$, so it does not matter that both definitions include equality. Thus, moving back to the survivor function and CDF, $S_{Y}(t) + F_{Y}(t) = 1$. Therefore, $F_{Y}(t) = 1-S_{Y}(t)$. With the NS4 example, the CDF is given by

$$F_{Y}(t) = 1-e^{-t/3}$$

The interpretation is exactly the opposite of the survivor function. The CDF gives us the probability that the NS4 will fail before time t.

Example
The probability that a new NS4 will fail within its first 5 years is given by $F_{Y}(5) = 1-e^{-5/3} \approx 1-0.189 = 0.811$.
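One way to build intuition for the CDF is to simulate a large batch of robots and count how many have failed by year 5. The sketch below is a Monte Carlo illustration, not anything from USR: it assumes exponential lifetimes with a mean of 3 years, and the seed and sample size are arbitrary choices.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable
N = 100_000     # number of simulated robots (arbitrary)

# random.expovariate takes the rate, i.e. 1 / (mean lifetime in years)
lifetimes = [random.expovariate(1 / 3) for _ in range(N)]

# Empirical estimate of F(5) = P(Y <= 5)
failed_by_5 = sum(1 for y in lifetimes if y <= 5) / N
print(failed_by_5)  # should land close to 1 - exp(-5/3)
```

The fraction that is still alive at year 5 is simply `1 - failed_by_5`, mirroring the relationship between the CDF and the survivor function.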

## Hazard Function

One of the most common ways to look at a lifetime distribution in reliability is the hazard function. It is also called the failure rate in some circles, and it gives a measure of risk. We’ll take a little bit of a dive into math to derive it, since the derivation illuminates its interpretation.

The hazard function is denoted by $h(t)$. The question we want to answer here is: what is the conditional probability that the item fails in a time interval $[t, t+\Delta t]$, given that it has not failed before time $t$? We want this probability of failure per unit time, so for a small interval length $\Delta t$,

$$h(t)\Delta t \approx P(t \leq Y \leq t + \Delta t| Y \geq t).$$

We’re going to get into some calculus here, so this can be skipped if you’d rather not deal with this.

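$$h(t) \approx \frac{P(t \leq Y \leq t + \Delta t| Y \geq t)}{\Delta t}= \frac{P(t \leq Y \leq t + \Delta t)}{P(Y \geq t)\,\Delta t} = \frac{F(t+\Delta t) - F(t)}{S(t)\Delta t}$$

Now, if we let the interval length $\Delta t$ shrink to 0, the difference quotient $\frac{F(t+\Delta t) - F(t)}{\Delta t}$ becomes $F'(t)$. Since $F(t) = 1 - S(t)$, we have $F'(t) = -S'(t)$, and we get the instantaneous failure rate.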

$$h(t) = \frac{-S'(t)}{S(t)}$$

Remark Those sharp in calculus will notice via the chain rule that $h(t) = -\frac{d}{dt}\ln(S(t))$, so we can express the hazard function in terms of the survivor function. While the lifetime distribution doesn’t look so good for the NS4, this author would politely request USR work on improving the reliability of this model rather than moving forward with the NS5 project…

Example.
We can directly derive the hazard rate for the NS4 population.

$$h(t) = -\frac{d}{dt}\ln(e^{-t/3}) = \frac{d}{dt}\frac{t}{3} = \frac{1}{3}$$

which is a constant rate of one failure every three years of operation, on average.
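The same result can be checked numerically straight from the definition $h(t) = -S'(t)/S(t)$. The sketch below approximates the derivative with a central difference; the step size `dt` and the sample times are arbitrary choices, and `hazard` is an illustrative name.

```python
import math


def survivor(t: float) -> float:
    """S(t) = exp(-t/3), the assumed NS4 survivor function."""
    return math.exp(-t / 3.0)


def hazard(t: float, dt: float = 1e-5) -> float:
    """Approximate h(t) = -S'(t) / S(t) with a central difference."""
    slope = (survivor(t + dt) - survivor(t - dt)) / (2 * dt)
    return -slope / survivor(t)


for t in (0.5, 2.0, 10.0):
    print(t, hazard(t))  # each value should sit very close to 1/3
```

Because only `survivor` needs to change, the same function can be pointed at a different survivor function to see a hazard that rises or falls with age.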

The hazard function is commonly used in engineering maintenance to determine schedules for checks or component replacement. For example, the hazard function can be used to determine how many flight hours a Lockheed F-22 fighter jet can be operated before a certain component is at risk for failure and should be inspected for replacement.

There are other forms we can use to express the distribution of an object’s lifetime, but these are the most common. Another thing to note is that we can easily move from one form to another. They all represent the same thing, the lifetime of a system, but in slightly different ways. We were able to make several different decisions about USR’s NS4 robots thanks to these different representations.

## System Reliability Basics

Almost everything we use every day can be thought of as a system built from components. Our lungs form a system. A string of lights forms a system. A manufacturing process and a data center are also systems, just more complicated ones. Every system is built from components, and components are either functional or failed.

Remark: We can get more advanced and consider components in states of varying health and age, but we’re going to consider only the binary states of functional and failed.

Suppose a given system has n components. Each component will be either on or off. We call these binary states component states and give them a nice, natural mathematical definition.

#### Definition (Component states).

The state of component i is denoted $x_{i}$ and is defined by

$$x_{i} := \left\{\begin{array}{lr}1, & \text{ component }i\text{ is functional}\\0, & \text{ component }i\text{ is failed}\end{array}\right.$$

Now, depending on how these components interact, the entire system is either functional or failed. We can write these component states in a vector called the system state vector:

Definition (System State Vector).
For n components in a system, each with state $x_{i}$, the system state is defined by $\mathbf{x} = (x_{1},\ldots,x_{n})$.

The system state vector is a combination of 1’s and 0’s corresponding to which components are functioning or failing. It is not always true that a single component failure (where its component state is 0) will result in a system failure. For example, you can still type with 8 fingers instead of 10, so two fingers in a splint will not keep you from your office duties. We can define a structure function that takes a system state vector as an input and tells us if the entire system is functioning or failed. The structure function of “typing with fingers” should return a 1 (functional) for any system state vector with two adjacent 0’s (since a splint binds two fingers together). If we let $\phi(\mathbf{x})$ be the structure function of the act of typing with fingers, then $\phi(1,0,0,1,1,1,1,1,1,1)$ will return a 1, because we can still type with only 8 functioning fingers. Formally,

Definition (Structure Function)
The structure function of a system with $n$ components, $\phi : \{0,1\}^{n} \to \{0,1\}$, is defined as
$$\phi(\mathbf{x}):= \left\{\begin{array}{lr}1, & \text{ the system is functioning when the state vector is }\mathbf{x}\\0, &\text{ the system has failed when the state vector is }\mathbf{x}\end{array}\right.$$

Remark: For a system with $n$ components, there are $2^n$ possible system state vectors. To see this, remember that if each component state has two possibilities, and there are $n$ components, then there are $2\cdot2\cdots2 = 2^n$ system state vectors.
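The counting argument in this remark is easy to see by listing the state vectors outright. The sketch below uses `itertools.product` for a small system; the choice of $n = 3$ is arbitrary.

```python
from itertools import product

n = 3  # number of components (arbitrary small example)

# Every system state vector is one choice of 0 or 1 per component
states = list(product((0, 1), repeat=n))

print(len(states))  # 2 ** n vectors in total
for x in states:
    print(x)
```

The list grows exponentially, so exhaustive enumeration like this is only practical for systems with a modest number of components.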

Each system has a topology, i.e. its component arrangement. Here, we are concerned with the reliability topology, which tells us which combinations of components must function for the system to work. We use a tool called block diagrams to represent these.

### Example (Series System)

One of the simplest types of systems is the series system. In this example, if one component fails, the whole system goes down. The block diagram shows us in a clear visual way if the system is functioning. The goal is to be able to move from left to right in the diagram. If a component is functioning, we may move through that node. If not, that “path” is closed. In the series system, there is only one possible path from left to right through every single component. If I cannot find a path from left to right, the system is not functioning when the given set of components is out. Thus, for the series system, removing even one component removes the only path I have to a functioning system.

The structure function for a series system is given by
$$\phi_{\text{series}}(\mathbf{x}) = \prod_{i=1}^{n}x_{i}$$

Here, we can see that if any component state $x_{i}$ is 0, the whole product is 0, which means the system fails. Conversely, the only way $\phi(\mathbf{x})$ can be 1 is if $x_{i} = 1$ for every $i$.
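In code, the series structure function is just a product over the component states. This is a minimal sketch with state vectors written as tuples of 0s and 1s; the name `phi_series` is illustrative.

```python
from math import prod


def phi_series(x):
    """Structure function of a series system: product of component states."""
    return prod(x)


print(phi_series((1, 1, 1, 1)))  # every component up: system functions
print(phi_series((1, 0, 1, 1)))  # one component down: system fails
```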

Remark: The Christmas lights happen to physically match the reliability topology given above. This isn’t always the case. The physical arrangement doesn’t necessarily correspond to the reliability topology. For example, we can view a simple computer as a processor, a motherboard, a hard drive, and a power supply. These are not physically arranged in a line inside your computer case, but if any one of these components fails, your computer is useless.

### Example (Parallel System)

The next simplest type of system is the parallel system. This one is the exact opposite of the series system: if even one component in a parallel system is still working, the system still functions.

The structure function for the parallel system is given by
$$\phi_{\text{parallel}}(\mathbf{x}) = 1-\prod_{i=1}^{n}(1-x_{i})$$

To see this, remember that the component state is 1 when the component functions, and 0 when it fails. A system is the same way. In a parallel system, the system fails only if all of the components fail. Each component contributes a factor $(1-x_{i})$, which is 1 when component $i$ has failed and 0 otherwise, so the product is 1 exactly when every component has failed. We subtract this product from 1 because we want to know when the system functions, not when it fails.
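The parallel structure function translates to code just as directly. As before, this is only a sketch, with `phi_parallel` as an illustrative name and tuples of 0s and 1s as state vectors.

```python
from math import prod


def phi_parallel(x):
    """Structure function of a parallel system: 1 minus the product of failures."""
    return 1 - prod(1 - xi for xi in x)


print(phi_parallel((0, 0, 0)))  # every component down: system fails
print(phi_parallel((0, 1, 0)))  # a single survivor keeps the system up
```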

These two are the basic system reliability topologies. We can make ones that are much more complex, but we have a theorem that shows any system’s block diagram can be re-structured into a series system of parallel subsystems, or a parallel system of series subsystems.
From Leemis (Reliability: Probabilistic Models and Statistical Methods, Lawrence M. Leemis, 2nd ed.),

Theorem (Decomposition of Systems into Series/Parallel Subsystems).
Let $P_{1},...,P_{s}$ be the minimal path sets for a system. Then
$$\phi(\mathbf{x}) = 1-\prod_{i=1}^{s}\left(1-\prod_{j \in P_{i}}x_{j}\right)$$
where $x_{j}$ is the state of component $j$.

What are path sets? A path set is a set of components that forms a complete path through the block diagram. A path set is minimal if removing any one component from it breaks the path.

If we look at the figure above, we have lots of possible paths from left to right. The sets (1,3), (1,4), (2,4), and (2,5) all provide paths through the diagram. These are also minimal path sets in this case: focusing on any particular path set, if I drop one component, the path disappears. The theorem gives the mathematical structure function corresponding to a parallel system of the series subsystems that the minimal path sets generate.

In other words, the system is functioning if Components 1 and 3 are functioning, or Components 1 and 4 are functioning, or Components 2 and 4, etc.
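The theorem's formula can be turned into a generic function that takes the minimal path sets as data. The sketch below uses the four path sets quoted above with 1-based component labels; the function name `phi_from_path_sets` and the test vectors are illustrative choices.

```python
from math import prod

# Minimal path sets quoted in the text (1-based component labels)
MIN_PATH_SETS = [(1, 3), (1, 4), (2, 4), (2, 5)]


def phi_from_path_sets(x, path_sets):
    """phi(x) = 1 - prod_i (1 - prod_{j in P_i} x_j); x[0] is component 1."""
    return 1 - prod(1 - prod(x[j - 1] for j in p) for p in path_sets)


print(phi_from_path_sets((1, 0, 1, 0, 0), MIN_PATH_SETS))  # path (1, 3) is open
print(phi_from_path_sets((1, 1, 0, 0, 0), MIN_PATH_SETS))  # no complete path
```

Each inner product is a series subsystem for one path, and the outer expression combines those subsystems in parallel, which is the decomposition the theorem describes.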

Last, here’s an example of how we can take a given block diagram and rearrange it in terms of a parallel system of series subsystems according to the theorem.

In the figure above, the top diagram represents a topology called a bridge system. The right side is the alternative version where I’ve arranged this into a parallel system of series subsystems. To test your knowledge, try the following exercise.

Exercise. Write the structure function for the bridge system.

Solution. $\phi(\mathbf{x}) = 1-(1-x_{1}x_{3}x_{5})(1-x_{1}x_{4})(1-x_{2}x_{3}x_{4})(1-x_{2}x_{5})$
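To experiment with the solution, the structure function can be coded exactly as written and evaluated over all $2^5$ state vectors. This is a sketch for exploration only; the variable names are illustrative.

```python
from itertools import product


def phi_bridge(x):
    """Bridge structure function from the exercise solution."""
    x1, x2, x3, x4, x5 = x
    return 1 - ((1 - x1 * x3 * x5) * (1 - x1 * x4)
                * (1 - x2 * x3 * x4) * (1 - x2 * x5))


# Collect every state vector that leaves the bridge system functioning
working = [x for x in product((0, 1), repeat=5) if phi_bridge(x) == 1]
print(len(working), "of", 2 ** 5, "state vectors keep the bridge up")

print(phi_bridge((1, 0, 1, 0, 1)))  # path through components 1, 3, 5
print(phi_bridge((1, 1, 1, 0, 0)))  # components 4 and 5 both down
```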

Why do we actually care about this? Engineers will use block diagrams to help them design reliable systems. We can prove mathematically that the series system is the least reliable, and the parallel system is the most reliable, but that doesn’t mean we should always put every component in parallel. An airplane with more landing gear than it needs may end up weighing too much, for example.

This results in complex system designs, particularly for mechanical and electronic devices. These designs may become too difficult to study visually via block diagrams, but the structure function helps. Studying the form of the structure function in terms of the components can help us determine where a critical component is, for example. A critical component is a component that will fail the whole system if it fails. In a series system, every component is critical. In a parallel system with more than one component, no single component is. Since most system designs are mixtures of these two, identification of a critical component isn’t always simple with the block diagram. The structure function allows us to simulate various components failing (having state 0) and quickly see the effect on the system.
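Here is a rough sketch of that kind of simulation. It assumes "critical" means that failing the component alone, with every other component still functioning, brings the system down. The helper `critical_components` and the small mixed system `phi_mixed` are hypothetical examples, not taken from any real design.

```python
from math import prod


def critical_components(phi, n):
    """Return 1-based labels i where failing only component i fails the system."""
    critical = []
    for i in range(n):
        x = [1] * n
        x[i] = 0  # fail component i, keep the rest functioning
        if phi(x) == 0:
            critical.append(i + 1)
    return critical


def phi_series(x):
    return prod(x)


def phi_parallel(x):
    return 1 - prod(1 - xi for xi in x)


def phi_mixed(x):
    # Hypothetical system: component 1 in series with the parallel pair (2, 3)
    return x[0] * (1 - (1 - x[1]) * (1 - x[2]))


print(critical_components(phi_series, 3))    # [1, 2, 3]
print(critical_components(phi_parallel, 3))  # []
print(critical_components(phi_mixed, 3))     # [1]
```

The loop only needs $n$ evaluations of the structure function, so it stays cheap even when drawing the block diagram would be unwieldy.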

Upon identification of a critical component, an engineer has a couple options 