Browsed by
Category: Applied Math

Stochastic Reliability of a Server under a Random Workload

Stochastic Reliability of a Server under a Random Workload

For the full paper, which includes all proofs, click here if you dare.


Editor’s note: This paper is the first chapter from a PhD thesis published in 2016 by Rachel Traylor. We generalize a 2011 model from Cha and Lee that gave a closed form for the survival function of a server under a random workload, and with random service times. In this work, the constant stress assumption of Cha and Lee is relaxed and allowed to be random; that is, every request to the server brings a random stress or workload to the server for the duration of its stay until completion. The efficiency measure of a random stress server is defined as the long run average number of jobs completed per unit time, and provides a way to measure server performance.


There are many types of systems which can be dubbed servers, such as a retail checkout counter, a shipping company, a web server, or a customer service hotline. All of these systems have common general behavior. Requests or customers arrive via a stochastic process, the service times vary randomly, and each request stresses the server if only temporarily. A general stochastic model that describes the reliability of such a server can provide the necessary information for optimal resource allocation and efficient task scheduling, leading to significant cost savings for businesses and improved performance metrics [6]. Such topics have been studied in literature for several decades [1, 2, 3, 20].

Much attention was devoted to reliability principles that model software failures and bug fixes, starting with Jelinski and Moranda in 1972 [11]. The hazard function under this model shows the time between the i-th failure and the (+ 1)st failure. Littlewood (1980) [16] extended this initial reliability model for software by assuming differences in error size [12].

These models have been extended into software testing applications [4, 5, 19] and optimal software release times [7, 8, 17, 22]. The explosion of e-commerce and the resulting increase in internet traffic have led to the development of reliability models for Web applications. Heavy traffic can overload and crash a server; thus, various control policies for refreshing content and admission of page requests were created [10, 13, 15, 18, 23].

In particular, Cha and Lee (2011) [9] proposed a stochastic breakdown model for an unreliable web server whose requests arrive at random times according to a nonhomogenous Poisson process and bring a constant stress factor to the system that dissipates upon service completion. The authors provide a fairly general survival function under any service distribution g_{W}(w), define server efficiency to measure performance, and illustrate a possible admission control policy due to an observed property of the server efficiency under a specific numerical example.

Thus far, no extensions of [9] have been proposed. This work generalizes the model put forth by Cha and Lee in a variety of ways. First, the assumption of constant job stress is relaxed and replaced by a random variable, and a new survival function and efficiency equation are derived. This work, while suitable for IT applications, is general enough for use in almost any industry, including logistics, retail, manufacturing, and engineering systems.

Background (Cha and Lee’s Model)


System Description and Survival Function

Sample Trajectory of Breakdown Rate Process Under Original Model


Cha and Lee considered a web server wherein each request arrives via a nonhomogenous Poisson process \{N(t): t \geq 0\} with intensity function \lambda(t). Each request adds a constant stress \eta, increasing the breakdown rate for the duration of service. Suppose r_{0}(t) is the breakdown rate of the idle server. Then the breakdown rate process \mathscr{B}(t) is defined as

\mathscr{B}(t) := r_{0}(t) + \eta\sum_{j=1}^{N(t)}\mathbb{1}(T_{j} \leq t \leq T_{j} + W_{j})

where N(t), \{T_{j}\}_{j=1}^{N(t)}, \{W_{j}\}_{j=1}^{N(t)} are the random variables that describe the number of arrivals, arrival times, and service times, respectively. It is assumed that \{T_{j}\}_{j=1}^{N(t)} are independent of each other. Furthermore, the \{W_{j}\}_{j=1}^{N(t)} \sim g_{W}(w) are i.i.d. and are mutually independent of all T_{j}‘s.

Under these conditions, Cha and Lee proved the following theorem that gives the explicit form of the survival function under this model:


Suppose that \{N(t), t \geq 0\} is a nonhomogenous Poisson process with intensity function \lambda(t), i.e. m(t) \equiv \int_{0}^{t}\lambda(x)dx. Assuming the conditional survival function is given by

P\left(Y > t \left| N(t), \{T_{j}\}_{j=1}^{N(t)}, \{W_{j}\}_{j=1}^{N(t)}\right.\right) = \bar{F}_{0}(t)\exp\left(-\eta\sum_{j=1}^{N(t)}\min(W_{j}, t-T_{j})\right)

and m(t) has an inverse, the survival function of Y is given by

S_{Y}(t) = \bar{F}_{0}(t)\exp\left(-\eta\int_{0}^{t}\exp(-\eta w)\bar{G}_{W}(w)m(t-w)dw\right)

and the hazard function of Y, denoted r(t), is given by

r(t) = r_{0}(t) + \eta\int_{0}^{t}e^{-\eta w}\bar{G}_{W}(w)\lambda(t-w)dw


Efficiency of the Server

It is natural to develop some measure of server performance. Cha and Lee measure such performance by defining the efficiency, \psi, of the web server as the long-run expected number of jobs completed per unit time. That is, with the number of jobs completed as M, the efficiency is defined

\psi := \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}

Upon breakdown and rebooting, the server is assumed to be ‘as good as new’, in that performance of the server does not degrade during subsequent reboots. In addition, the model assumes the arrival process after reboot, denoted \{N^{*}(t), t \geq 0\}, is a nonhomogenous Poisson process with the same intensity function \lambda(t) as before, and that \{N^{*}(t), t \geq 0\} is independent of the arrival process before reboot. In a practical setting, this model assumes no ‘bottlenecking’ of arrivals occurs in the queue during server downtime that would cause an initial flood to the rebooted server. In addition, the reboot time is assumed to follow a continuous distribution H(t) with expected value \nu. This process is a renewal reward process, with the renewal \{R_{n}\} = \{M_{n}\}, the number of jobs completed. The length of a renewal cycle is Y_{n} + H_{n}, where Y_{n} is the length of time the server was operational, and H_{n} is the time to reboot after a server crash. Then, by [21],

\psi = \frac{E[M]}{E[Y]+ \nu}

where M is the number of jobs completed in a particular renewal cycle, \nu is the mean time to reboot of the server, and Y is the length of a particular renewal cycle. Then, using the definition of efficiency, the following closed form of the efficiency of a server under all assumptions of Cha and Lee’s model is derived.


Theorem. (Efficiency of the Constant Stress Server)
Suppose \{N(t), t \geq 0\} is a nonhomogenous Poisson process with intensity \lambda(t)\geq 0. Then the efficiency is given by

\begin{aligned}\psi &= \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\exp\left(-\int_{0}^{t}r_{0}(x)dx-\int_{0}^{t}\lambda(x)dx + a(t) + b(t)\right)\right.\\&\kern{10em}\biggl.\times\left(r_{0}(t)a(t) + \eta a(t)b(t)\right)\biggr]\end{aligned}

where a(t) = \int_{0}^{t}e^{-\eta v}g_{W}(v)m(t-v)dv, b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr, \bar{G}_{W}(x) = 1-\int_{0}^{x}g_{W}(s)ds, and m(x) = \int_{0}^{x}\lambda(s)ds.


Numerical Examples and Control Policies

Efficiency as a Function of Stress under Rayleigh Service Time Distribution

As an illustrative example, Cha and Lee considered the case when \lambda(t) \equiv \lambda,
r_{0}(t) \equiv r_{0} = 0.2, \eta = 0.01, \nu = 1, and g_{W}(w) = we^{-w^{2}/2} (the PDF of the Rayleigh distribution). As shown in the figure above, there exists a \lambda^{*} such that \psi(\lambda) is maximized. Thus one may implement the obvious optimal control policy for server control to avoid server overload:

  1. If the real time arrival rate \lambda < \lambda^{*}, do not interfere with arrivals.
  2. If \lambda \geq \lambda^{*}, facilitate some appropriate measure of interference.

Examples of interference for a web server in particular include rejection of incoming requests or possible re-routing. Cha and Lee give an interference policy of rejection with probability 1-\frac{\lambda^{*}}{\lambda}. The next section presents a generalization of the above model by relaxing the assumption that the job stress \eta is constant. This yields a far more generalized model that can encompass random stress with any distribution.

Random Stress Reliability Model

Sample Trajectory of Breakdown Rate Process under Random Stress Model

The basic model here is very similar to Cha and Lee’s. The main difference is that we relax the constant stress assumption and allow it to be random.

Assume that each job j coming into the server adds a random stress \mathcal{H}_{j} to the server for the duration of its time in the system. Suppose \{\mathcal{H}_{j}\}_{j=1}^{N(t)} \sim \mathcal{H}, and are i.i.d. where WLOG \mathcal{H} is a discrete random variable with a finite sample space S = \{\eta_{i}: \eta_{i} \in \mathbb{R}^{+}, i = 1,...,m \text{ for } m \in \mathbb{N}\} and probability distribution given by

P(\mathcal{H} = \eta_{i}) = p_{i}, \quad i = 1,\ldots,m

The following assumptions from Cha and Lee are retained:

  1. (CL1) Requests arrive via a nonhomogenous Poisson Process \{N(t), t \geq 0\} with intensity \lambda(t).
  2. (CL2) Arrival times \{T_{j}\}_{j=1}^{N(t)} are independent.
  3. (CL3) Service times \{W_{j}\}_{j=1}^{N(t)} are i.i.d. with pdf g_{W}(w) and mutually independent of all arrival times.

Then, the random stress breakdown rate (RSBR) process \mathscr{B}(t) is given by

\mathscr{B}(t) = r_{0}(t) + \sum_{j=1}^{N(t)}\mathcal{H}_{j}\mathbb{1}(T_{j} < t \leq T_{j} + W_{j}), \quad t \geq 0

Compare the sample trajectory shown in the figure above to the sample trajectory of breakdown rate under the original model (see page 2). The random stress brought by each request still disappears upon job completion, but the effect on the server is no longer deterministic. Thus, for the same set of arrival times and respective completion times, the realization of the breakdown rate process under the RSBR model has one more element of variation.

Let Y be the random time to breakdown of the web server given the workload from client requests. Let \mathfrak{T} = \{T_{j}\}_{j=1}^{N(t)}, \mathfrak{W} = \{W_{j}\}_{j=1}^{N(t)}, and \mathfrak{H} = \{\mathcal{H}_{j}\}_{j=1}^{N(t)}, with observed values \mathfrak{t} = \{t_{j}\}_{j=1}^{N(t)}, \mathfrak{w} = \{w_{j}\}_{j=1}^{N(t)}, and \mathfrak{h} = \{\eta_{i_{j}}\}_{i_{j}=1}^{N(t)}. Then the conditional survival function of the server, given the arrival process of client requests (N(t)), job stresses (\mathfrak{H}), service times (\mathfrak{W}), and arrival times (\mathfrak{T}) is

\begin{aligned}S_{\scriptstyle{Y|N(t), \mathfrak{T}, \mathfrak{W}, \mathfrak{H} }}(t|n, \mathfrak{t}, \mathfrak{w}, \mathfrak{h}) &=e^{ -\int_{0}^{t}\mathscr{B}(x)dx}\\&= \bar{F}_{0}(t)e^{-\sum_{j=1}^{N(t)} \mathcal{H}_{i}\min(W_{j}, t-T_{j})}\end{aligned}

where \bar{F}_{0}(t) = \exp\left(-\int_{0}^{t}r_{0}(x)dx\right).


Survival Function for the RSBR Server

Under the RSBR generalization, the survival function of the server is given in the following theorem.


Theorem 1. (Survival Function for a Random Stress Server)
Suppose that jobs arrive to a server according to a nonhomogenous Poisson process
\{N(t), t \geq 0\} with intensity function \lambda(t) \geq 0 and m(t) \equiv E[N(t)] = \int_{0}^{t}\lambda(x)dx. Let the arrival times \{T_{j}\}_{j=1}^{N(t)} be independent, and let the service times \{W_{j}\}_{j=1}^{N(t)} g_{W}(w) be i.i.d. and mutually independent of all arrival times. Assume the random job stresses \mathcal{H}_{j} \sim \mathcal{H}. Then

S_{Y}(t) = \bar{F}_{0}(t)\exp\left(- E_{\mathcal{H}}\left[\mathcal{H}\int_{0}^{t}e^{-\mathcal{H} w}m(t-w)\bar{G}_{W}(w)dw\right]\right)

where \bar{F}_{0}(t) = \exp\left(-\int_{0}^{t}r_{0}(s)ds\right).


The compound failure rate function (or hazard function) r(t) is given by

r(t) = -\frac{d}{dt}\ln(S_{Y}(t)) = r_{0}(t) + E_{\mathcal{H}}\left[\mathcal{H}\int_{0}^{t}e^{-\mathcal{H} w}m(t-w)\bar{G}_{W}(w)dw\right]


See [14] or [21].

Efficiency Measure of a Server under RSBR

Upon server crash, the server must be rebooted. This section gives the server efficiency as defined in [9], but for the RSBR model. The server efficiency given in Theorem 1 holds quite a few similarities to that of [9], except that, like Theorem 1, the expectation must be taken over the random stress variable. The following assumptions from the original model are retained:

  1. (E1) The arrival process after rebooting, \{N^{rb}(t), t \geq 0\}, remains a nonhomogenous Poisson process with the same intensity function \lambda(t), t \geq 0 as before.
  2. (E2) \{N^{rb}(t), t \geq 0\} is independent of the arrival process of client requests before rebooting. Hence, \{N^{rb}(t), t \geq 0\} = \{N(t), t \geq 0\}, since it retains all the same characteristics as before.
  3. (E3) The time to reboot the server follows a continuous distribution H(t) with mean \nu.

Recall that M(t) is defined as the total number of jobs completed by the server during the time (0,t]. Also, recall the definition of server efficiency from [9]:

\psi \equiv \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}

The efficiency of the server under a random stress environment is given in the following theorem.


Theorem (Server Efficiency under Random Stress Environment)
Suppose that \{N(t): t \geq 0\} is a nonhomogenous Poisson process with intensity function \lambda(t), t \geq 0. Suppose also the conditions of Theorem 1 and the conditions (E1)-(E3) are met. Then the efficiency of the server is given by

\begin{aligned}\psi &= \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\\&\kern{2em}\times\left\{\int_{0}^{\infty}e^{-\int_{0}^{t}r_{0}(x)-\int_{0}^{t}\lambda(x)dx+E_{\mathcal{H}}[a(t)+b(t)]}(r_{0}(t)E_{\mathcal{H}}[a(t)] + E_{\mathcal{H}}[\mathcal{H} a(t)b(t)])dt \right\}\end{aligned}


a(t) = \int_{0}^{t}e^{-\mathcal{H} v}g_{W}(v)m(t-v)dv


b(t) = \int_{0}^{t}e^{-\mathcal{H}(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr.

Remarks and Implications

Note that \mathcal{H} was assumed discrete, but the proofs of Theorems 1 and 2 are unaffected if \mathcal{H} is continuous. Thus the generality of this model is significantly stronger than in [9]. In practical considerations, these integrals will need to be numerically evaluated.

For certain distributions of \mathcal{H}, S_{Y}(t) has a fairly compact form. Subsequent installments will examine the case where \mathcal{H} has a binomial distribution, formed from both independent and dependent Bernoulli trials. We will also explore the effects of various service life distributions on \psi.


This paper has given a simple but large generalization of [9]. The random stress model allows for the reality that we likely will not know the stress a request will bring to the server, but may know its distribution. Subsequent installments will further study the properties of the RSBR model and extend it to other single-server models, and networks.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.