Browsed by
Tag: reliability theory

On Server Efficiency

On Server Efficiency

For the full text of the paper, including all proofs and supplementary lemmata, click to download thesis-ch-2

Abstract

Editor’s note: This paper comprises the second chapter of the PhD dissertation by Rachel Traylor. Cha and Lee defined a mathematical notion of server performance by measuring efficiency \psi defined as the long run average number of jobs completed per unit time. The service time distribution heavily influences the shape of the server efficiency as a function of a constant arrival rate \lambda. Various classes of distributions are studied in order to find sufficient conditions for the existence of a single maximum. The existence of a maximum allows for simple binary control policies to handle traffic and optimize performance.

Introduction, Motivation, and Background

Cha and Lee [1] studied the reliability of a single server under a constant stress workload, and also defined a notion of server efficiency \psi for a given intensity \lambda(t) as the long-run average number of jobs completed per unit time as a way to measure server performance. With the number of jobs completed as M, the efficiency is defined

\psi := \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}

Upon breakdown and rebooting, the server is assumed to be ‘as good as new’, in that performance of the server does not degrade during subsequent reboots. In addition, the model assumes the arrival process after reboot, denoted \{N^{*}(t), t \geq 0\}, is a nonhomogenous Poisson process with the same intensity function \lambda(t) as before, and that \{N^{*}(t), t \geq 0\} is independent of the arrival process before reboot. In a practical setting, this model assumes no ‘bottlenecking’ of arrivals occurs in the queue during server downtime that would cause an initial flood to the rebooted server. In addition, the reboot time is assumed to follow a continuous distribution H(t) with expected value \nu. This process is a renewal reward process, with the renewal \{R_{n}\} = \{M_{n}\}, the number of jobs completed. The length of a renewal cycle is Y_{n} + H_{n}, where Y_{n} is the length of time the server was operational, and H_{n} is the time to reboot after a server crash. Then, by [2],

\psi = \frac{E[M]}{E[Y]+ \nu}

where M is the number of jobs completed in a particular renewal cycle, \nu is the mean time to reboot of the server, and Y is the length of a particular renewal cycle. Then, using the definition of \psi the following closed form of the efficiency of a server under all assumptions of Cha and Lee’s model is derived.


Theorem 1 (Server Efficiency under Cha/Lee)
Suppose \{N(t), t \geq 0\} is a nonhomogenous Poisson process with intensity \lambda(t)\geq 0. Then the efficiency is given by

\begin{aligned}\psi&=\frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\exp\left(-\int_{0}^{t}r_{0}(x)dx-\int_{0}^{t}\lambda(x)dx + a(t) + b(t)\right)\right.\\&\qquad\qquad\left.\times\left(r_{0}(t)a(t)+\eta a(t)b(t) \right)\right]\end{aligned} where a(t) = \int_{0}^{t}e^{-\eta v}g_{W}(v)m(t-v)dv, b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr, \bar{G}_{W}(x) = 1-\int_{0}^{x}g_{W}(s)ds, and m(x) = \int_{0}^{x}\lambda(s)ds.


 

Numerical Example and Control Policies

Figure 1: Efficiency under Rayleigh Distribution

As an illustrative example, Cha and Lee considered the case when \lambda(t) \equiv \lambda,
r_{0}(t) \equiv r_{0} = 0.2, \eta = 0.01, \nu = 1, and g_{W}(w) = we^{-w^{2}/2} (the PDF of the Rayleigh distribution). As shown in Figure 1, there exists a \lambda^{*} such that \psi(\lambda) is maximized. Thus one may implement the obvious optimal control policy for server control to avoid server overload:

 (1) If the real time arrival rate \lambda < \lambda^{*}, do not interfere with arrivals.

 (2) If \lambda \geq \lambda^{*}, facilitate some appropriate measure of interference.

Examples of interference for a web server in particular include rejection of incoming requests or possible re-routing. Cha and Lee give an interference policy of rejection with probability 1-\frac{\lambda^{*}}{\lambda}.

The Rayleigh distribution used in Figure 1 has applications in physics, typically when the magnitude of a vector is related to its directional components. It is a special case of the Weibull distribution which is widely used in survival analysis, failure analysis, weather forecasting, and communications. These distributions are not typically used to model service times. The exponential distribution is the most common due to its memoryless properties, followed by the Erlang and uniform distributions.

The efficiency, \psi, under the Rayleigh distribution example in Figure 1 shows the existence of a 0 < \lambda^{*} < \infty such that \psi(\lambda) is maximized at \lambda^{*}. This useful feature of \psi(\lambda) in this case allows for the implementation of a simple control policy for arrivals to the server to prevent overload, given above. 

Numerical simulations under a variety of possible distribution classes, including convex, concave, exponential, uniform, and Erlang suggest that the mathematical properties of \psi are heavily influenced by the choice and characteristics of service time distribution g_{W}(w). In particular, it is of interest to seek sufficient conditions of g_{W}(w) that will guarantee the existence of a \lambda^{*} that maximizes \psi. This is done for the uniform, compact support, and Erlang classes. Furthermore, it is shown under certain conditions, not only does the server efficiency lack a maximum, but \psi increases without bound. This is not representative of real server behavior, and thus of mathematical and practical interest to note for further modeling.

 

Efficiency of a Server under Uniform Service Life Distribution

Suppose \lambda(x) \equiv \lambda, and suppose r_{0}(x) \equiv r_{0} = \max_{x \in (0,\infty)}r_{0}(x). The efficiency \psi is given for constant \eta and \lambda here:
\psi(\lambda) = \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\int_{0}^{\infty}\exp\left(-r_{0}t-\lambda t + a(t)+b(t)\right)(r_{0}+b(t))a(t)dt\right]

where S_{Y}(t) is the survival function of the node, a(t) = \int_{0}^{t}e^{-\eta v}g(v)(t-v)dv, b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}(t-r)dr,
g(v) is the pdf of the service time distribution, and \bar{G}(x) = 1-\int_{0}^{x}g(s)ds.

The following theorem gives sufficient conditions for the uniform distribution and \eta that guarantee the existence of a finite maximum efficiency.


Theorem 2(Efficiency under Uniform Service Distribution)
Suppose the service life distribution is given by Uniform(c,d) for some 0<c<d. Then if
\sigma > \frac{ce^{-c\eta}}{\sqrt{12}\phi(-\eta)(1+\eta(c+d))+c\eta - 1} where \phi(-\eta) the standard deviation of the service life W, \psi(\lambda) has a maximum on (0,\infty) is the moment generating function of a uniform distribution evaluated at -\eta.


 

Numerical simulations suggest that \psi increases without bound for c=0, d>1. The following lemma proves this fact.

 


Lemma
Suppose the service life distribution is given by Uniform(0,d), with d>1. Then \psi increases without bound.


This is worth discussing here. It makes no sense whatsoever for the efficiency of a server to increase forever as the arrival rate increases. So what’s happening here? Notice that if the uniform distribution include 0 as an endpoint, then there is a positive probability that the service time for a job will be exactly 0 (in whatever time units you like). This is impossible in reality. A small service time is possible, but it’s still strictly greater than 0. What we see here is that using distributions with positive support at 0 causes issues in the efficiency function; it’s not the fault of the definition of efficiency, but rather a consequence of using distributions that cannot mirror reality. The next section explores this further with a more broad class of distributions.

Extension of the Uniform Distribution: Compact Support

The ideas and techniques presented in the previous section yield a powerful extension to any service life distribution with compact support away from 0. Supposing g_{W}(w) has compact support [a,b], it may be bounded above by a positively scaled uniform distribution. In practice, service times are finite and nonzero, thus this extension allows for very simple control policies to be implemented for a much larger class of distributions.

 


Theorem 3 (Efficiency under Compact Support)
Let g_{W}(w) be the pdf of the service times having compact support [c,d], d > 0. Let m = \max_{w}g_{W}(w) < \infty, and R = (d-c) be the length of the support. Then \psi(\lambda) has a maximum if m < \frac{c}{R\eta +e^{-b\eta}-e^{-a\eta}+\eta e^{-a\eta}}.


 

Comments and Illustrations

Figure 2: Efficiency under an increasing triangular service distribution

Example 1

Let g_{W}(w) = \frac{2}{5}w\mathbb{1}_{[2,3]}(w), and let r_{0} = \nu = \eta = 1. Then m = \frac{6}{5}. By Theorem 2, m < \frac{2}{e^{-2}+e^{-3}+1}\approx 1.678. Thus, the existence of a maximum \psi is guaranteed. Figure 2 gives the numerical result with step size of 0.1. The maximum occurs around \lambda^{*} = 0.5

 

Figure 3: Efficiency under an increasing distribution with support on [1,2]
Example 2: Sufficient conditions aren’t always necessary conditions
The condition given in Theorem 2 is rather weak and is only a sufficient condition. As an illustration, consider the same increasing density shifted to a support of [1,2]. Thus
g_{W}(w) = \frac{2}{3}w\mathbb{1}_{[1,2]}(w). Retain all other assumptions from Example 1. Then
m = \frac{4}{3} > \frac{1}{e^{-2}+1} \approx 0.88 which violates the condition. However, consider Figure 3 above. Clearly \psi has a maximum at approximately \lambda^{*} = 0.8
The condition given in Theorem 2 relies on overestimating g_{W}(w) by a constant. If the variance \sigma_{W}^{2} of g_{W}(w) is large, the maximum of the pdf will decrease and be more comparable to the rest of the distribution. In these cases, bounding g_{W}(w) by its max m over the support [a,b] will give a reasonable approximation. However, in the case of a small support and high enough skew compared to the size and location of the support, much of the mass is concentrated at the right end of the support, and m will be higher. In these cases, bounding g_{W}(w) by m will result in a large amount of overestimation, and thus the condition may fail, but \psi still has a maximum. The numerical example wherein g_{W}(w) = \frac{2}{3}w\mathbb{1}_{[1,2]}(w) illustrates the conservative nature of this type of estimation.

Efficiency under Erlang Service Distribution

Now suppose g(v) is of the Erlang class but shifted \delta > 0 to the right. For motivation, consider that service times can never be 0 in a practical setting. Then the PDF and the complement of the CDF are given by
\begin{aligned}g(v;k,\gamma,\delta) &= \left\{\begin{array}{lr}0, & 0 \leq v \leq \delta \\\frac{\gamma^{k}(v-\delta)^{k-1}e^{-\gamma(v-\delta)}}{(k-1)!}, & v\geq \delta\end{array}\right.\\\bar{G}(v;k,\gamma,\delta) &= \left\{\begin{array}{lr}1, & 0 \leq v \leq \delta \\e^{\gamma(\delta-v)}\sum_{j=0}^{k-1}\frac{\gamma^{j}(v-\delta)^{j}}{j!}, & v\geq \delta\end{array}\right.\end{aligned}


Theorem 4 (Server Efficiency under Erlang Distribution)
Let \delta > 0, \eta > 0. Let \alpha(\delta) = \left(\frac{\gamma}{\gamma + \eta}\right)^{k}e^{-\eta\delta} + \frac{\gamma^{k}e^{\gamma\delta}(k-1)}{(\eta + \gamma)^{k-1}}, and 0 < \beta(\delta,\eta) < 1. If the service life distribution is of the \delta-shifted Erlang class, then \psi(\lambda) has a maximum in \lambda on (0,\infty) for \delta,\eta such that \alpha(\delta) + \beta(\delta,\eta) < 1.


Thus, we see that the Erlang service distribution produces a maximum as long as the support is shifted away from 0. 

Efficiency under Exponential Distribution

The exponential distribution is the most common service distribution assumed for a queue. The M/M/1 queue assumes a Poisson distribution for arrival times, and an exponential distribution or service times. However, the exponential distribution has most of its mass concentrated at or near 0. As we have seen in other distributions, this becomes a problem, both in considering what makes sense in practical applications, and in its effect on the efficiency. 
Suppose g(v) is of the exponential class. That is, g(v) = \gamma e^{-\gamma v}, \gamma > 0. It will be proven that under certain conditions on \eta and \gamma, an exponential g(v) causes \psi to increase without bound.

 


Theorem 5 (Efficiency under Exponential Distribution)
Suppose g(v) = \gamma e^{-\gamma v}. Then if \frac{2\gamma}{\gamma + \eta} > 1 + \frac{2}{\frac{2}{\gamma} + W\left(-\frac{\gamma}{\gamma +\eta}e^{-2-\frac{2\eta}{\gamma}}\right)}, \psi(\lambda) \to \infty,  as \lambda \to \infty where W(\cdot) is the Lambert W-function.


 

Extension to Random Stress and Nonconstant Intensity

All the above theorems regarding the server efficiency  assumed constant stress \eta and constant intensity \lambda. This section generalizes the analyses in the previous sections for nonconstant intensity and random stress.

The stochastic reliability models in both [1]  and the previous chapter both assumed a time-dependent intensity \lambda(t). By setting \lambda \equiv \max_{t}\lambda(t), the established efficiency theorems provide a conservative set of conditions under which a maximum efficiency may be obtained. In these cases, \psi is actually a function of all possible \lambda_{\max}.

If the job stresses are random, as in Theorem 2 of Traylor (2016), the above sections may still be utilized. Assume the sample space for \mathcal{H} is compact and nonnegative. The efficiency is given by Theorem 2 of Traylor (2016).
WLOG, again suppose the sample space of \mathcal{H} is discrete, given by \{\eta_{1},...,\eta_{m}\} with respective probabilities p_{i}, i =1,...,m. Now suppose all mass is concentrated at \eta_{[m]}. Let a_{m}(t) = \lambda\int_{0}^{t}e^{-\eta_{m}v}g(v)(t-v)dv, and b_{m}(t) = \lambda\int_{0}^{t}e^{-\eta_{m}(t-r)}\bar{G}_{W}(t-r)dr. Then, the following are true:
(1) E_{\mathcal{H}}[a(t)+b(t)] \leq a_{m}(t) + b_{m}(t) 

(2)E_{\mathcal{H}}[a(t)] \leq a_{m}(t)

(3) E_{\mathcal{H}}[\mathcal{H} a(t)b(t)] \leq \eta_{m}a_{m}(t)b_{m}(t)

Thus, by replacing the expectations in (1)-(3) with their respective upper bounds in the efficiency theorems, analyses of the efficiency for the uniform, compact support, and Erlang classes may proceed as previously detailed. These estimates are conservative but sufficient.

For an exponential service life distribution and random stress, create a lower bound for the expectations in
(1)-(3) by concentrating all mass at \eta_{[1]}. Then the conditions from Theorem 5 guarantee an explosion for the exponential distribution.

Implications and Numerical Illustrations

Uniform Service Distribution

 Click on an image in the gallery to expand and explore the efficiency of a server under different instances of the uniform service distribution.
For the uniform service distribution, both the variance of g_{W}(w) and the location of the support affect the efficiency itself. The gallery above shows \psi(\lambda) for various uniform distributions as an illustration. In all four cases, r_{0} = \nu =\eta =1. The variance of a uniform distribution is given by \sigma^{2} = \frac{d-c}{12}. With the exception of the fourth gallery image, which illustrates the explosion of \psi when the distribution has positive mass at 0, Table 1 below compares possible values of \psi for different Uniform distributions. Notice that while the variance \sigma^{2} does affect the range of \psi by several orders of magnitude, the location of the support has a much more powerful effect. Thus, if all service times are equally likely, a server is less efficient if it is consistently but mildly slow (Uniform[10,11]) compared to an inconsistent server (Uniform[1,500]).

\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approximate Range of } \psi & \text{Approximate } \lambda^{*}\\\text{Uniform}[1,2] & 1/12 &1.5 & (0,0.012) & 1.3\\ \text{Uniform}[10,11] & 1/12 &10.5 & (0, 2\times10^{-11}) & 0.1 \\ \text{Uniform}[1,500] & 499/12 & 250.5 & (0,3\times 10^{-5}) & 1.6\end{array}

 

Increasing but Compact Service Distribution

Click to expand and explore. 

\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approx. Range of } \psi & \text{Approx. } \lambda^{*}\\\small{\frac{2w}{3}}\mathbb{1}_{[1,2]}(w)& \approx 0.0802&\approx 1.56 & (0,7\times10^{-3}) & 0.75\\\small{\frac{2w}{5}}\mathbb{1}_{[2,3]}(w)&\approx 0.0822 &\approx 2.53 & (0, 7\times10^{-4}) & 0.5 \\\small{\frac{2w}{500^{2}-1}}\mathbb{1}_{[1,500]}(w)& \approx 13888& 333.35 & (0,1.3\times 10^{-7})& 1.4\end{array}

As an illustration of a distribution on compact support, consider the class of increasing densities g_{W}(w) = cx\mathbb{1}_{[a,b]}(x). Several examples are given in the above gallery table. For both compact supports of length 1, the variance is approximately the same, but the mean changes, producing an order of magnitude decrease in efficiency. Compared to the compact support of length 499, with a much larger mean, the efficiency decreases by 3 orders of magnitude. Notice, however, that the decline after the maximum is much less sharp.

 

Erlang Service Distribution

Click on an image to explore. 

 

\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approx. Range of } \psi & \text{Approx. } \lambda^{*}\\\text{Erlang}(2,1)& 2&2 & (0,0.7)& 9\\\text{Erlang}(9,1)&9 &9 & (0, 4\times10^{-6}) & 0.5 \\\text{Rayleigh}(1)& (4-\pi)/2& \sqrt{\pi/2} & (0,0.9)& 8\end{array} The gallery above gives two examples of \psi under an Erlang distribution. Notice the change in the efficiency as the mean increases. Here, since \lambda =1, \sigma^{2} = \mu, so the mean likely has the largest effect on \psi.

Comparing all the examples

Comparing all examples, the Rayleigh(1) service distribution imposes the highest maximum efficiency, followed closely by the Erlang(2,1) service distribution, with the Uniform[1,2] service distribution following. \lambda^{*} under the Erlang(2,1) service distribution is larger than for the Rayleigh(1) service distribution, indicating that a server whose service times follow the former distribution can handle a larger arrival intensity before its efficiency begins to decline than the latter.

The means for the Rayleigh(1), Erlang(2,1), and Uniform[1,2] distributions are similar, as shown in the tables, but the Uniform[1,2] distribution has equal probability of any service time in its support with large negative excess kurtosis. It is posulated that kurtosis, skew, and variance play large roles in the behavior and range of \psi. Compare the efficiency under the Erlang(2,1) service distribution with the efficiency under the Erlang(9,1) service distribution. Not only is the mean much lower for the Erlang(2,1) distribution, but the distribution is more strongly positive-skewed than the Erlang(9,1). Thus, more mass is concentrated at the left side of the distribution, indicating that the service times are more often shorter.

Finally, to note the effect of the typical stress level \eta on the range of \psi, we compare the original Cha/Lee figure with our Rayleigh figure above. The service distribution and all other quantities remain the same, but Cha and Lee’s numerical example set \eta = 0.01, whereas the third image in the Erlang gallery shows \psi under \eta = 1. The range of \psi decreases by two orders of magnitude with a 100 fold increase in \eta, with the shape remaining similar. In addition, the location of the maximum \lambda^{*} also inversely varies by the same magnitude.

Studying the efficiency under various service distributions aids not only in deciding when to implement a server intervention, but also aids in evaluating the performance of various servers given their service times.

Conclusion

Many questions have been opened up about the various effects of the service distribution and stress on the server efficiency. The conditions given above are merely sufficient conditions, and have not been proven necessary. Studying the efficiency as a function of the arrival rate under various service distributions and stress provides a way to understand the performance of a server and the various factors that can affect it mathematically.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

References

  1.  CHA, K. H., AND LEE, E. Y. A stochastic breakdown model for an unreliable web server system and an optimal admission control policy. Journal of Applied Probability 48, 2 (2011), 453–466.
  2.  ROSS, S. Stochastic Processes, 2nd edition.
  3.  TRAYLOR, R. Stochastic reliability of a server under random workload. Ph.D. thesis (2016).
System Reliability Basics

System Reliability Basics

For the pdf version, click here.

Almost everything we use every day can be thought of as a system built from components. Our lungs form a system. A string of lights forms a system. A manufacturing process, and a data center are also systems that are more complicated. Every system is built from components, and components are either functional or failed.

Remark: We can get more advanced and consider components in states of varying health and age, but we’re going to consider only the binary states of functional and failed.

Suppose a given system has n components. Each component will be either on or off. We call these binary states component states and give them a nice, natural mathematical definition.


Definition (Component states). 

The state of component i is denoted x_{i} and is defined by

x_{i} := \left\{\begin{array}{lr}1, & \text{ component }i\text{ is functional}\\0, & \text{ component }i\text{ is failed}\end{array}\right.

Now, depending on how these components interact, the entire system is either functional or failed. We can write these component states in a vector called the system state vector:


Definition (System State Vector).
For n components in a system, each with state x_{i}, the system state is defined by \mathbf{x} = (x_{1},\ldots,x_{n}).


The system state vector is a combination of 1’s and 0’s corresponding to which components are functioning or failing. It is not always true that a single component failure (where its component state is 0) will result in a system failure. For example, you can still type with 8 fingers instead of 10, so two fingers in a splint will not keep you from your office duties. We can define a structure function that takes a system state vector as an input and tells us if the entire system is functioning or failed. So for example, the structure function of “typing with fingers” should return a 1 (functional) for any system state vector with two 0’s adjacent (since a split binds two fingers together). For example if we let \phi(\mathbf{x}) be the structure function of the act of typing with fingers, \phi(1,0,0,1,1,1,1,1,1,1) will return a 1, because we can still type with only 8 functioning fingers. Formally,


Definition (Structure Function)
The structure function of a system \phi : \mathbf{x} \to \{0,1\} is defined as
\phi(\mathbf{x}):= \left\{\begin{array}{lr}1, & \text{ the system is functioning when the state vector is x}\\0, &\text{the system has failed when the state vector is x}\end{array}\right.


Remark: For a system with n components, there are 2^n possible system state vectors. To see this, remember that if each component state has two possibilities, and there are n states, then there are 2\cdot2\cdots2 = 2^n system state vectors

Each system has a topology, i.e. its component arrangement. Here, we are concerned with the reliability topology, which tells us what combinations of components are needed to function in order for our system to be working. We use a tool called block diagrams to represent these.

 

Example (Series System)

One of the simplest types of systems is the series system. In this example, if one component fails, the whole system goes down.

That string of Christmas lights is the classic example. One bulb going out can ruin your whole tree.

The block diagram shows us in a clear visual way if the system is functioning. The goal is to be able to move from left to right in the diagram. If a component is functioning, we may move through that node. If not, that “path” is closed. In the series system, there is only one possible path from left to right through every single component. If I cannot find a path from left to right, the system is not functioning when the given set of components is out. Thus, for the series system, removing even one component removes the only path I have to a functioning system.

The structure function for a series system is given by
\phi_{\text{series}}(\mathbf{x}) = \prod_{i=1}^{n}x_{i}

Here, we can see that if any state vector is 0, the whole product is 0, which means the system fails. Conversely, the only way \phi(\mathbf{x}) can be 1 is if x_{i} = 1 for every i.

Remark: The Christmas lights happen to physically match that reliability topology given in above. This isn’t always the case. The physical arrangement doesn’t necessarily correspond to the reliability toplogy. For example, we can view a simple computer as a processor, a motherboard, a hard drive, and a power supply. These are not physically arranged in a line inside your computer case, but if any one of these components fails, your computer is useless.

Example (Parallel System)

 

Our lungs function in parallel. If one lung fails, we still survive.
Block diagram for the parallel system with n components

 

 

 

 

 

 

 

 

 

The next simplest type of system is the parallel system. This one is the exact opposite of the series system: if even one component in a parallel system is still working, the system still functions.

The structure function for the parallel system is given by
\phi_{\text{parallel}}(\mathbf{x}) = 1-\prod_{i=1}^{n}(1-x_{i})

To see this, remember that the component state is 1 when the component functions, and 0 when it fails. A system is the same way. In a parallel system, the system fails if all of the components fail. Each component failure contributes a (1-x_{i}), and we must subtract this product from 1, because we want to know when it functions, not fails.

These two are the basic system reliability topologies. We can make ones that are much more complex, but we have a theorem that shows any system’s block diagram can be re-structured into a series system of parallel subsystems, or a parallel system of series subsystems.
From Leemis (Reliability: Probabilistic Models and Statistical Methods, Lawrence M. Leemis 2nd ed.) ,


Theorem (Decomposition of Systems into Series/Parallel Subsystems).
Let P_{1},...,P_{s} be the minimal path sets for a system. Then
\phi(\mathbf{x}) = 1-\prod_{i=1}^{s}\left(1-\prod_{j \in P_{i}}x_{j}\right)
where x_{j} is the component state vector.


What are path sets? The path sets are the sets of components that form a complete path through the block diagram.

A general block diagram for some random system

If we look at the figure above, we have lots of possible paths from left to right. The sets (1,3), (1,4), (2,4), and (2,5) all provide paths through the diagram. These are also minimum path sets in this case. Focusing on any particular path set, if I drop one component, the path disappears. The theorem gives the mathematical structure function corresponding to a parallel system of the series subsystems that the minimal path sets generate.

In other words, the system is functioning if Components 1 and 3 are functioning, or Components 1 and 4 are functioning, or Components 2 and 4, etc.

Last, here’s an example of how we can take a given block diagram and rearrange it in terms of a parallel system of series subsystems according to the theorem.

Bridge System
Bridge System as a parallel system of series subsystems

 

In the figure above, the top diagram represents a topology called a bridge system. The right side is the alternative version where I’ve arranged this into a parallel system of series subsystems. To test your knowledge, try the following exercise.


Exercise. Write the structure function for the bridge system.

Solution. \phi(\mathbf{x}) = 1-(1-x_{1}x_{3}x_{5})(1-x_{1}x_{4})(1-x_{2}x_{3}x_{4})(1-x_{2}x_{5})


Why do we actually care about this? Engineers will use block diagrams to help them design reliable systems. We can prove mathematically that the series system is the least reliable, and the parallel system is the most reliable, but that doesn’t mean we should always put every component in parallel. An airplane with more landing gear than it needs can cause the airplane to weigh too much, for example.

This results in complex system designs, particularly for mechanical and electronic devices. These designs may become too difficult to study visually via block diagrams, but the structure function helps. Studying the form of the structure function in terms if the components can help us determine where a critical component is, for example. A critical component is a component that will fail the whole system if it fails. In a series system, every component is critical. In a parallel system, no one is. Since most system designs are mixtures of these two, identification of a critical component isn’t always simple with the block diagram. The structure function allows us to simulate various components failing (having state 0), and quickly seeing its effect on the system.

Upon identification of a critical component, an engineer has a couple options

  • he can add redundancy
  • he can use a more reliable (and usually more expensive) version of the component

These kinds of decisions require optimization with cost and other conditions, but these basic reliability building blocks are tools that can help make this easier.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.