Browsed by
Author: Rachel Traylor

On Permuted First-Kind Dependence of Categorical Random Variables

On Permuted First-Kind Dependence of Categorical Random Variables

 

This paper discusses the notion of horizontal dependency in sequences of first-kind dependent categorical random variables. We examine the necessary and sufficient conditions for a sequence of first-kind dependent categorical random variables to be identically distributed when the conditional probability distribution of subsequent variables after the first are permuted from the identity permutation used in previous works.

Read More Read More

A Partition by any Other Name

A Partition by any Other Name

I promise I’m actually a probability theorist, despite many of my posts being algebraic in nature. Algebra, as we’ve seen in several other posts, elegantly generalizes many things in basic arithmetic, leading to highly lucrative applications in coding theory and data protection.  Some definitions in mathematics may not have obvious “practical use”, but turn out to yield theorems and results so powerful we can use them to send image data cleanly from space. 

Read More Read More

Time Series Analysis Part 1: Regression with a Twist

Time Series Analysis Part 1: Regression with a Twist

We’re surrounded by time series. It’s one of the more common plots we see in day-to-day life. Finance and economics are full of them – stock prices, GDP over time, and 401K value over time to name a few. The plot looks deceptively simple; just a nice univariate squiggle. No crazy vectors, no surfaces, just one predictor – time. It turns out time is a tricky and fickle explanatory variable, which makes analysis of time series a bit more nuanced than first glance. This nuance is obscured by the ease of automatic implementation of time series modeling in languages like R1 As nice as this is for practitioners, the mathematics behind this analysis is lost. Ignoring the mathematics can lead to improper use of these tools. This series will examine some of the mathematics behind stationarity and what is known as ARIMA (Auto-Regressive Integrated Moving Average) modeling. Part 1 will examine the very basics, showing that time series modeling is really just regression with a twist.

Read More Read More

Commentary: Infrastructure Considerations for Machine Learning

Commentary: Infrastructure Considerations for Machine Learning

Welcome to another brief commentary and departure from the heavier mathematics. I have been endeavoring to expand the breadth of my knowledge on the tech side of things, and chronicling some things I’ve learned and observed from speaking with different companies, both as an independent and as a Tech Field Day delegate. Many of these articles have focused on considerations for a practitioner rather than a mathematician, but occasionally we theorists have to show some business value, so I try to keep current on the tools and methods utilized in the corporate world.

It’s fairly common knowledge now that most machine learning and deep learning algorithms are highly data-dependent. That is, the data you feed something like a neural network heavily affects the results. Since the common analogy for machine learning, artificial intelligence, and neural networks is one of a biological learning process, let me continue that analogy. These algorithms are like small children; they’re sponges. They learn based on the type and amount of data given, and in surprising ways. If you want to teach a child (or baby Terminator, perhaps one named John Henry) what a cow looks like, you must be very careful what you give him. If you only give him pictures of birds and cows, he may decide that a cow is identified by the number of legs he has. Then what happens when he is given a picture of a cat?

Perhaps you think of this and throw in pictures of dogs too. Aha! So a cow has four legs and hoofed feet! Until John Henry sees a zebra. This silly example illustrates just how long we took to learn even simple things as children, and how important large amounts of repetitive and varied data were to us converging on how to recognize a cow. These AI/ML/NN algorithms are designed to mimic this learning process, and thus require vast amounts of highly varied data. Good performance by an algorithm on a subset of the data may not hold up to the expanse of the real world data, just like the example of learning to recognize a cow. Thus, these algorithms are not ergodic, to borrow a term from dynamics and probability. The models and methods are not independent of the initial data you feed them. In other words, if two different people feed the same algorithm different datasets and let the algorithm “learn”, the end results can be vastly different.

To get around this, most practitioners of data science want to throw as much data as possible, ideally the entirety of everything. If you’re wanting to learn the shopping habits on an e-commerce site, you’d prefer to let your John Henry learn on the whole database rather than just a subset.1

However, your IT department would likely be unhappy with your request to run tests on a production environment for a multitude of reasons: security and performance being two of those. Having a bunch of copies floating around takes up massive amounts of storage, not to mention the security risks. A mistake in the code run against the production environment can take the whole e-store down due to a bad query.2 I spoke twice with Actifio about their Sky Infrastructure, first hearing from them at Tech Field Day 15, then interviewing them again to get some more details about use cases rather than an overview of the infrastructure itself. 

Bryan Achilles presents the Sky Infrastructure at Tech Field Day 15.

As a quick overview (Mr. Achilles does a great job on the tech details in this video here), Actifio basically creates what they term a “golden copy” of your data, after which updates are done incrementally to save storage space, and everyone can get their own virtual copy (which are really more like pointers) to interact with. Now a data scientist can’t affect the production database when he/she queries against it, and can also use far more data in testing than before. This should shorten a data science development cycle, because the workaround for using subsets of data to train is to sample many subsets and train the algorithm over and over again, which takes time. In addition, the data scientist can find out very quickly if the code that worked for 100,000 rows will hold up against 10 million (guilty as charged of writing unscalable code in my past experience as a data scientist). 

Being more of a theoretician, I don’t tend to step out of my bubble to consider the various infrastructures that are necessary to provide me my sandbox. To fix that, I endeavor to occasionally speak with various tech companies about their work. I like the way Actifio has streamlined a good solution that aims to satisfy the IT gatekeepers and the developers/data scientists/users of the data. Overall, I’m not exactly a fan of the semi-blind deep-learning approaches to make all business decisions, but those methods do have their uses, particularly in exploration and discovery. This platform definitely has a good potential to help a data science team in their development. 

[Disclaimer: I have never been compensated by Actifio or any other company for my commentary articles.] Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Poisson Processes and Data Loss

Poisson Processes and Data Loss

There are many applications for counting arrivals over time. Perhaps I want to count the arrivals into a store, or shipments into a postal distribution center, or node failures in a cloud cluster, or hard drive failures in a traditional storage array. It’s rare that these events come neatly, one after the other, with a constant amount of time between each event or arrival. Typically those interarrival times, the time between two events in a sequence arriving, are random. How then do we study these processes?

Read More Read More

Commentary: Technical Debt in Machine Learning

Commentary: Technical Debt in Machine Learning

I recently had the opportunity to be a guest on an episode of the On-Premise IT Roundtable podcast, the topic of which was technical debt. (You can listen to the twenty minute episode here, or watch the video version here.) The conventional definition of technical debt, for both consumer and enterprise technology, is the lagging of upgrades, potentially causing issues down the line when the software or hardware is no longer supported. For example, should you immediately upgrade the operating system on your smartphone, or wait with an older version? If you wait, how long should you wait? The discussion was lively, pleasant, and informative. 

In particular, one panelist, Mr. James Green of Actual Tech Media, brought up an interesting and different notion of technical debt: that of the “quick and dirty” solution to put out a fire or to be seen as a player in the latest buzzword space.

From Mr. Greene, Technical debt can also be “…when I start building something small and narrow to address a problem I have right now without thinking about three years from now, what are my needs going to be?…You create technical debt when you have to build something that meets your new needs…”

This struck me in a slightly different context than the data centers and storage technologies that comprised the undercurrent of the discussion. I see companies and people scrambling to board the data science/machine learning/AI train, and quickly implement, well, anything that shows they use these concepts and tools. It’s great for marketing, and in marketing, timing is everything. You have to be at the forefront, but not the front. What happens is “research” on a development cycle: new dashboards and “analytics” churned out every few weeks, and slight twists on the same models every time the problem or dataset changes. 

It creates technical debt from an advanced development standpoint to wait too long to begin looking into a promising area or crafting a solution using what could be the next-big-thing. You risk jumping on the bandwagon as everyone else jumps to a new one. But perhaps more costly long-term is lurching from bandwagon to bandwagon, desperately trying to use that shiny new buzzword-hammer you bought into in your organization, whether or not it’s really the right tool. When you deploy a canned algorithm, or an unsupervised learning algorithm you don’t fully understand, how will you know when it begins to fail? How will you identify slow changes to the model as the data slowly changes, and how will you know if those changes are due to truly evolving circumstances or simply corrupt data? Could there have been a simpler, more elegant, more mathematical solution to the problem that would have applied across verticals, saving you months of retraining the same neural network on a different dataset? 

I’ll make an analogy with open-water swimming. Unlike swimming in a pool, where you have a nice, straight, black line at the bottom of the pool to guide you, open water comes with a new set of hurdles. Current, waves, and murky depths. The technique for an open water swim is very different from a pool swim. In the pool, you can look down, let go, and cruise, knowing you’ll never get off course. Out in open water, you pause every 5-7 strokes briefly to look up and reorient yourself with the landmarks and buoys to make sure you’re still on course. You also spend time studying the course and terrain for days prior to the start of the swim. If you fail to do so, you can get hundreds of meters off course and have to correct1, which adds minutes to your race time. In open water, the winner isn’t always just the fastest swimmer charging ahead, but the smartest one whose pace is a little slower, but pauses to ensure he is on his course. 

Frantically deploying the latest neural network, machine learning technique, or black-box AI solution is just like an open water swimmer treating the race like a pool swim. We’re in uncharted or barely charted territory here. Plowing full speed ahead deploying ultimately uninterpretable solutions may cause an organization to look up 3 years later and realize it’s way off course, with little to show for it. The conversation around the utility or cost of conventional technical debt converged on the conclusion that there are more nuances than simply deeming it “good” or “bad”. Obviously, it’s bad to look up every single stroke when swimming; that will slow you down unnecessarily. As with most things, moderation is key. A pause may seem ill-advised this week, this month, or even this year, but globally, those occasional pauses ensure you’re on course to deliver business value long-term. Study the problem at hand carefully. It may generalize abstractly to a problem in another vertical or industry that has already been solved elegantly, and deploying that solution may save millions of dollars and hundreds of hours over the next 5 years. 
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Commentary: On Straight As and Salaries

Commentary: On Straight As and Salaries

(Fair warning: this is a personal account.)

The systems were designed well, I think. When we were in school or college, passing was supposed to mean you knew the material, basically. A B showed you were pretty good, and an A was only for the smartest students. Not relatively the smartest, but objectively the smartest. Most people could at least pass, fewer were B students, very few were A students, and a tiny fraction were straight A students. At least, that’s how it was supposed to work.

I was a straight A student all through high school. Actually, I didn’t just average As, I almost never was handed back a piece of paper with less than a 90% mark. Honestly, I didn’t feel the sense of accomplishment I thought I was supposed to.

Fast forward to college. Oh man, Georgia Tech hit me hard. Not only did I experience my first B, I got my first C. Advanced Linear Algebra, or what was meant to be the first proofs class. I was told by the professor to quit mathematics which was devastating to hear. Under the way grades were supposed to work, he was right. It meant I was barely treading water in a field that was only going to get harder.

I kept going. In the next course, the first in real analysis, all those ideas that didn’t click during Advanced Linear Algebra just made sense. I started excelling, and I learned one of my life-changing lessons: Grades aren’t indicative of your performance, nor do they indicate the ability to apply that knowledge elsewhere or retain the lessons learned.

Grades are just a barometer for what a school or a professor deems worthy.

That course challenged me, fundamentally. I slipped and fell over and over again, and didn’t find my footing until the next semester. But once I realized As didn’t really matter; once I realized that the letters on my transcript were due to someone else’s value system I didn’t agree with, I lost inhibition and fear. I grew by taking on things I wasn’t sure I could do, and finished my PhD in mathematics about a year and a half ago.

Once we leave school, the measure of a person’s value became job title and salary. It makes sense, right? We’re willing to pay more for something or someone we need very much. But is that really true?

Airline pilots go through grueling training for years, and amortized over their careers, make very little. EMTs are paid far less than a director of marketing. None of these observations are new, nor is complaining about them.

Sticking strictly in the business world, the highest paid individuals are in executive management, marketing, and sales, mostly. But CEOs of large companies are ultimately pretty interchangeable – in fact, it happens all the time. Even moving down to comparing branches, businesses spend billions on marketing, and very little by comparison on research.

Is the system of salaries and spending really broken? Maybe it’s not. Maybe it’s a barometer of what businesses value. The word “value” is quite vague; we all have different values. Marketers, salespeople, and CEOs are highly valued because good ones produce high stock prices, and favorable quarterly profits. If a business’s value system ranks these things more highly than long term exploration, then it makes perfect sense that they structure their spending and salaries the way they do.

It means that the vast majority of what companies value1 differs from what I or other individuals value.

It means that salaries are just a barometer for what corporations deem most worthy.

And there is my next lesson, the “business version” of what I learned regarding grades. Salaries are ultimately a metric for someone else’s value system. Most corporations look quarter-by-quarter, maybe even a couple years out. What if you see further than that? Then the things you think money should be spent on will differ.

If you were ever laid off, if you are making less than you think you should2, then your values didn’t align with your employer’s. That’s all. Straight-A students are not necessarily better than students who made a C or three. Derive your value internally, by looking in a mirror and evaluating yourself objectively, not relative to someone else.

I realize this is hard. I also realize that my idealistic notions of not caring about salaries or grades sound a little new age. Obviously, there are penalties for these attitudes, especially in the short run. My undergraduate GPA wasn’t a 4.0 – that cost me admission into some of the more prestigious PhD programs, which would have led to good academic jobs. Refusing to align with what businesses deem valuable today means that you will have less discretionary income.

Honestly, after doing both, I’m happier for it. I’m not afraid anymore. I don’t have the best academic pedigree, but someone recognized the work I did, and for a while, I had my dream job. It came from an unexpected place, after I was rejected from almost every national lab in the country. Now again, I believe the same thing – uncompromised vision. I believe in fair trades – I’ll find someone whose value system is analogous to mine. Until then, my salary doesn’t matter.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Commentary: High Level Data Filtration

Commentary: High Level Data Filtration

The consensus over the last five or so years has converged on a conclusion regarding data: we’re drowning in it. We have more than we can possibly monitor with our own eyeballs, and certainly more than we know what to do with intelligently. The motto for data scientists has been “More is better.” Well, ask and ye shall receive. The image I see is a virtual version of someone looking at the end of a firehose as it’s turned on full blast, then trying to spot an occasional speck in the stream of water. More data is good, only if it’s good data.

In modeling as well, we have generated great predictive algorithms1 for a multitude of applications, including sales, cybersecurity, and customer support. These models are meant to alert a good sales prospect, a high priority customer support need, or a security issue on your network. One potential issue is the notion of too many alerts, or alerts generated from repeat or replicated data. Alert fatigue is a real thing; humans can become adjusted to what used to be a panicked state, and consequently, alerts get ignored. How many times do you hear a car alarm and assume it’s actually due to a thief? Most of the time, we put the pillow over our head and grumble as we try to get back to sleep. What about an alert that pops up on your computer repeatedly? You end up ignoring it because it keeps bothering you about the same thing you acknowledged three hours ago. 

The issue with these scenarios is that when a real alert comes in between these ones you prefer to dismiss, you’ll likely ignore the new one as well.  Enter the need for filtration. We have so much data (much of it repeats) in enterprise scenarios that we need a way to filter these streams by eliminating the obvious and duplicated, as it were. To focus on a very real illustration of this need, take a look at enterprise network traffic. Enterprises have thousands of devices sending massive amounts of data within the network. They also deal with thousands of traffic that moves into and out of the network. The amount of packet data and metadata you could capture in an attempt to monitor this is difficult for us to really fathom. We need to decide what data is useful, and what isn’t; and we need to decide this in real time, as the information flows. 

An intelligent way to approach data filtration is similar to how we look at water filtration. The first thing you want to do is get the obvious large chunks of…whatever…out. Then you apply progressively tighter layers of filtration (plus some chemical treatment) until you get clean water. Data cleaning can be a bit like that. Data scientists2 recognize that they will never be handed clean water to work with, and that they’ll have to do some cleaning themselves. But rarely does anyone who is actually tasked with developing data science or cybersecurity solutions want to be the ones removing the obvious big garbage. 

I watched, as part of Tech Field Day 153, a presentation by Ixia (video can be found here) on what is effectively a real-time filtration system for network treat intelligence. The idea is to leverage their database of known obvious issues and “bad players”, as it were, to quickly filter out and mitigate these “large chunks” before passing the data to the more refining and advanced cybersecurity monitoring products. Ixia’s products also look for duplicate data, and remove that as well.  I like that they stepped in to do the “dirty work”, as it were, and offer a solution to help with that high level filtration in real time. They were very clear about what their analytics did and did not do, which I respect. 

The benefit of clearing out these known issues or duplicated data is clear whenever someone downstream, as it were, feeds data into some variation of a predictive machine learning algorithm that is meant to monitor, evaluate, and alert. The algorithms can run more efficiently with less data flowing through it, and unnecessary alerts can be eliminated, allowing a security monitoring system to only deal with the data that suggest potentially new threats that may require human intervention, as the known threats were identified and mitigated many steps earlier. 

The world needs all kinds. Everyone wants to be the surgeon, because he’s famous, but the surgeon cannot perform well without good technicians and nurses to help with the prep work. Ixia stepped in and offers a technician for the job of filtering prepping network traffic for better analysis and better security monitoring, which will keep the surgeon from fatigue. 
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

On Server Efficiency

On Server Efficiency

For the full text of the paper, including all proofs and supplementary lemmata, click to download thesis-ch-2

Abstract

Editor’s note: This paper comprises the second chapter of the PhD dissertation by Rachel Traylor. Cha and Lee defined a mathematical notion of server performance by measuring efficiency \psi defined as the long run average number of jobs completed per unit time. The service time distribution heavily influences the shape of the server efficiency as a function of a constant arrival rate \lambda. Various classes of distributions are studied in order to find sufficient conditions for the existence of a single maximum. The existence of a maximum allows for simple binary control policies to handle traffic and optimize performance.

Introduction, Motivation, and Background

Cha and Lee [1] studied the reliability of a single server under a constant stress workload, and also defined a notion of server efficiency \psi for a given intensity \lambda(t) as the long-run average number of jobs completed per unit time as a way to measure server performance. With the number of jobs completed as M, the efficiency is defined

\psi := \lim\limits_{t \to \infty}\frac{E[M(t)]}{t}

Upon breakdown and rebooting, the server is assumed to be ‘as good as new’, in that performance of the server does not degrade during subsequent reboots. In addition, the model assumes the arrival process after reboot, denoted \{N^{*}(t), t \geq 0\}, is a nonhomogenous Poisson process with the same intensity function \lambda(t) as before, and that \{N^{*}(t), t \geq 0\} is independent of the arrival process before reboot. In a practical setting, this model assumes no ‘bottlenecking’ of arrivals occurs in the queue during server downtime that would cause an initial flood to the rebooted server. In addition, the reboot time is assumed to follow a continuous distribution H(t) with expected value \nu. This process is a renewal reward process, with the renewal \{R_{n}\} = \{M_{n}\}, the number of jobs completed. The length of a renewal cycle is Y_{n} + H_{n}, where Y_{n} is the length of time the server was operational, and H_{n} is the time to reboot after a server crash. Then, by [2],

\psi = \frac{E[M]}{E[Y]+ \nu}

where M is the number of jobs completed in a particular renewal cycle, \nu is the mean time to reboot of the server, and Y is the length of a particular renewal cycle. Then, using the definition of \psi the following closed form of the efficiency of a server under all assumptions of Cha and Lee’s model is derived.


Theorem 1 (Server Efficiency under Cha/Lee)
Suppose \{N(t), t \geq 0\} is a nonhomogenous Poisson process with intensity \lambda(t)\geq 0. Then the efficiency is given by

\begin{aligned}\psi&=\frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\exp\left(-\int_{0}^{t}r_{0}(x)dx-\int_{0}^{t}\lambda(x)dx + a(t) + b(t)\right)\right.\\&\qquad\qquad\left.\times\left(r_{0}(t)a(t)+\eta a(t)b(t) \right)\right]\end{aligned} where a(t) = \int_{0}^{t}e^{-\eta v}g_{W}(v)m(t-v)dv, b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}_{W}(t-r)\lambda(r)dr, \bar{G}_{W}(x) = 1-\int_{0}^{x}g_{W}(s)ds, and m(x) = \int_{0}^{x}\lambda(s)ds.


 

Numerical Example and Control Policies

Figure 1: Efficiency under Rayleigh Distribution

As an illustrative example, Cha and Lee considered the case when \lambda(t) \equiv \lambda,
r_{0}(t) \equiv r_{0} = 0.2, \eta = 0.01, \nu = 1, and g_{W}(w) = we^{-w^{2}/2} (the PDF of the Rayleigh distribution). As shown in Figure 1, there exists a \lambda^{*} such that \psi(\lambda) is maximized. Thus one may implement the obvious optimal control policy for server control to avoid server overload:

 (1) If the real time arrival rate \lambda < \lambda^{*}, do not interfere with arrivals.

 (2) If \lambda \geq \lambda^{*}, facilitate some appropriate measure of interference.

Examples of interference for a web server in particular include rejection of incoming requests or possible re-routing. Cha and Lee give an interference policy of rejection with probability 1-\frac{\lambda^{*}}{\lambda}.

The Rayleigh distribution used in Figure 1 has applications in physics, typically when the magnitude of a vector is related to its directional components. It is a special case of the Weibull distribution which is widely used in survival analysis, failure analysis, weather forecasting, and communications. These distributions are not typically used to model service times. The exponential distribution is the most common due to its memoryless properties, followed by the Erlang and uniform distributions.

The efficiency, \psi, under the Rayleigh distribution example in Figure 1 shows the existence of a 0 < \lambda^{*} < \infty such that \psi(\lambda) is maximized at \lambda^{*}. This useful feature of \psi(\lambda) in this case allows for the implementation of a simple control policy for arrivals to the server to prevent overload, given above. 

Numerical simulations under a variety of possible distribution classes, including convex, concave, exponential, uniform, and Erlang suggest that the mathematical properties of \psi are heavily influenced by the choice and characteristics of service time distribution g_{W}(w). In particular, it is of interest to seek sufficient conditions of g_{W}(w) that will guarantee the existence of a \lambda^{*} that maximizes \psi. This is done for the uniform, compact support, and Erlang classes. Furthermore, it is shown under certain conditions, not only does the server efficiency lack a maximum, but \psi increases without bound. This is not representative of real server behavior, and thus of mathematical and practical interest to note for further modeling.

 

Efficiency of a Server under Uniform Service Life Distribution

Suppose \lambda(x) \equiv \lambda, and suppose r_{0}(x) \equiv r_{0} = \max_{x \in (0,\infty)}r_{0}(x). The efficiency \psi is given for constant \eta and \lambda here:
\psi(\lambda) = \frac{1}{\int_{0}^{\infty}S_{Y}(t)dt + \nu}\left[\int_{0}^{\infty}\exp\left(-r_{0}t-\lambda t + a(t)+b(t)\right)(r_{0}+b(t))a(t)dt\right]

where S_{Y}(t) is the survival function of the node, a(t) = \int_{0}^{t}e^{-\eta v}g(v)(t-v)dv, b(t) = \int_{0}^{t}e^{-\eta(t-r)}\bar{G}(t-r)dr,
g(v) is the pdf of the service time distribution, and \bar{G}(x) = 1-\int_{0}^{x}g(s)ds.

The following theorem gives sufficient conditions for the uniform distribution and \eta that guarantee the existence of a finite maximum efficiency.


Theorem 2(Efficiency under Uniform Service Distribution)
Suppose the service life distribution is given by Uniform(c,d) for some 0<c<d. Then if
\sigma > \frac{ce^{-c\eta}}{\sqrt{12}\phi(-\eta)(1+\eta(c+d))+c\eta - 1} where \phi(-\eta) the standard deviation of the service life W, \psi(\lambda) has a maximum on (0,\infty) is the moment generating function of a uniform distribution evaluated at -\eta.


 

Numerical simulations suggest that \psi increases without bound for c=0, d>1. The following lemma proves this fact.

 


Lemma
Suppose the service life distribution is given by Uniform(0,d), with d>1. Then \psi increases without bound.


This is worth discussing here. It makes no sense whatsoever for the efficiency of a server to increase forever as the arrival rate increases. So what’s happening here? Notice that if the uniform distribution include 0 as an endpoint, then there is a positive probability that the service time for a job will be exactly 0 (in whatever time units you like). This is impossible in reality. A small service time is possible, but it’s still strictly greater than 0. What we see here is that using distributions with positive support at 0 causes issues in the efficiency function; it’s not the fault of the definition of efficiency, but rather a consequence of using distributions that cannot mirror reality. The next section explores this further with a more broad class of distributions.

Extension of the Uniform Distribution: Compact Support

The ideas and techniques presented in the previous section yield a powerful extension to any service life distribution with compact support away from 0. Supposing g_{W}(w) has compact support [a,b], it may be bounded above by a positively scaled uniform distribution. In practice, service times are finite and nonzero, thus this extension allows for very simple control policies to be implemented for a much larger class of distributions.

 


Theorem 3 (Efficiency under Compact Support)
Let g_{W}(w) be the pdf of the service times having compact support [c,d], d > 0. Let m = \max_{w}g_{W}(w) < \infty, and R = (d-c) be the length of the support. Then \psi(\lambda) has a maximum if m < \frac{c}{R\eta +e^{-b\eta}-e^{-a\eta}+\eta e^{-a\eta}}.


 

Comments and Illustrations

Figure 2: Efficiency under an increasing triangular service distribution

Example 1

Let g_{W}(w) = \frac{2}{5}w\mathbb{1}_{[2,3]}(w), and let r_{0} = \nu = \eta = 1. Then m = \frac{6}{5}. By Theorem 2, m < \frac{2}{e^{-2}+e^{-3}+1}\approx 1.678. Thus, the existence of a maximum \psi is guaranteed. Figure 2 gives the numerical result with step size of 0.1. The maximum occurs around \lambda^{*} = 0.5

 

Figure 3: Efficiency under an increasing distribution with support on [1,2]
Example 2: Sufficient conditions aren’t always necessary conditions
The condition given in Theorem 2 is rather weak and is only a sufficient condition. As an illustration, consider the same increasing density shifted to a support of [1,2]. Thus
g_{W}(w) = \frac{2}{3}w\mathbb{1}_{[1,2]}(w). Retain all other assumptions from Example 1. Then
m = \frac{4}{3} > \frac{1}{e^{-2}+1} \approx 0.88 which violates the condition. However, consider Figure 3 above. Clearly \psi has a maximum at approximately \lambda^{*} = 0.8
The condition given in Theorem 2 relies on overestimating g_{W}(w) by a constant. If the variance \sigma_{W}^{2} of g_{W}(w) is large, the maximum of the pdf will decrease and be more comparable to the rest of the distribution. In these cases, bounding g_{W}(w) by its max m over the support [a,b] will give a reasonable approximation. However, in the case of a small support and high enough skew compared to the size and location of the support, much of the mass is concentrated at the right end of the support, and m will be higher. In these cases, bounding g_{W}(w) by m will result in a large amount of overestimation, and thus the condition may fail, but \psi still has a maximum. The numerical example wherein g_{W}(w) = \frac{2}{3}w\mathbb{1}_{[1,2]}(w) illustrates the conservative nature of this type of estimation.

Efficiency under Erlang Service Distribution

Now suppose g(v) is of the Erlang class but shifted \delta > 0 to the right. For motivation, consider that service times can never be 0 in a practical setting. Then the PDF and the complement of the CDF are given by
\begin{aligned}g(v;k,\gamma,\delta) &= \left\{\begin{array}{lr}0, & 0 \leq v \leq \delta \\\frac{\gamma^{k}(v-\delta)^{k-1}e^{-\gamma(v-\delta)}}{(k-1)!}, & v\geq \delta\end{array}\right.\\\bar{G}(v;k,\gamma,\delta) &= \left\{\begin{array}{lr}1, & 0 \leq v \leq \delta \\e^{\gamma(\delta-v)}\sum_{j=0}^{k-1}\frac{\gamma^{j}(v-\delta)^{j}}{j!}, & v\geq \delta\end{array}\right.\end{aligned}


Theorem 4 (Server Efficiency under Erlang Distribution)
Let \delta > 0, \eta > 0. Let \alpha(\delta) = \left(\frac{\gamma}{\gamma + \eta}\right)^{k}e^{-\eta\delta} + \frac{\gamma^{k}e^{\gamma\delta}(k-1)}{(\eta + \gamma)^{k-1}}, and 0 < \beta(\delta,\eta) < 1. If the service life distribution is of the \delta-shifted Erlang class, then \psi(\lambda) has a maximum in \lambda on (0,\infty) for \delta,\eta such that \alpha(\delta) + \beta(\delta,\eta) < 1.


Thus, we see that the Erlang service distribution produces a maximum as long as the support is shifted away from 0. 

Efficiency under Exponential Distribution

The exponential distribution is the most common service distribution assumed for a queue. The M/M/1 queue assumes a Poisson distribution for arrival times, and an exponential distribution or service times. However, the exponential distribution has most of its mass concentrated at or near 0. As we have seen in other distributions, this becomes a problem, both in considering what makes sense in practical applications, and in its effect on the efficiency. 
Suppose g(v) is of the exponential class. That is, g(v) = \gamma e^{-\gamma v}, \gamma > 0. It will be proven that under certain conditions on \eta and \gamma, an exponential g(v) causes \psi to increase without bound.

 


Theorem 5 (Efficiency under Exponential Distribution)
Suppose g(v) = \gamma e^{-\gamma v}. Then if \frac{2\gamma}{\gamma + \eta} > 1 + \frac{2}{\frac{2}{\gamma} + W\left(-\frac{\gamma}{\gamma +\eta}e^{-2-\frac{2\eta}{\gamma}}\right)}, \psi(\lambda) \to \infty,  as \lambda \to \infty where W(\cdot) is the Lambert W-function.


 

Extension to Random Stress and Nonconstant Intensity

All the above theorems regarding the server efficiency  assumed constant stress \eta and constant intensity \lambda. This section generalizes the analyses in the previous sections for nonconstant intensity and random stress.

The stochastic reliability models in both [1]  and the previous chapter both assumed a time-dependent intensity \lambda(t). By setting \lambda \equiv \max_{t}\lambda(t), the established efficiency theorems provide a conservative set of conditions under which a maximum efficiency may be obtained. In these cases, \psi is actually a function of all possible \lambda_{\max}.

If the job stresses are random, as in Theorem 2 of Traylor (2016), the above sections may still be utilized. Assume the sample space for \mathcal{H} is compact and nonnegative. The efficiency is given by Theorem 2 of Traylor (2016).
WLOG, again suppose the sample space of \mathcal{H} is discrete, given by \{\eta_{1},...,\eta_{m}\} with respective probabilities p_{i}, i =1,...,m. Now suppose all mass is concentrated at \eta_{[m]}. Let a_{m}(t) = \lambda\int_{0}^{t}e^{-\eta_{m}v}g(v)(t-v)dv, and b_{m}(t) = \lambda\int_{0}^{t}e^{-\eta_{m}(t-r)}\bar{G}_{W}(t-r)dr. Then, the following are true:
(1) E_{\mathcal{H}}[a(t)+b(t)] \leq a_{m}(t) + b_{m}(t) 

(2)E_{\mathcal{H}}[a(t)] \leq a_{m}(t)

(3) E_{\mathcal{H}}[\mathcal{H} a(t)b(t)] \leq \eta_{m}a_{m}(t)b_{m}(t)

Thus, by replacing the expectations in (1)-(3) with their respective upper bounds in the efficiency theorems, analyses of the efficiency for the uniform, compact support, and Erlang classes may proceed as previously detailed. These estimates are conservative but sufficient.

For an exponential service life distribution and random stress, create a lower bound for the expectations in
(1)-(3) by concentrating all mass at \eta_{[1]}. Then the conditions from Theorem 5 guarantee an explosion for the exponential distribution.

Implications and Numerical Illustrations

Uniform Service Distribution

 Click on an image in the gallery to expand and explore the efficiency of a server under different instances of the uniform service distribution.
For the uniform service distribution, both the variance of g_{W}(w) and the location of the support affect the efficiency itself. The gallery above shows \psi(\lambda) for various uniform distributions as an illustration. In all four cases, r_{0} = \nu =\eta =1. The variance of a uniform distribution is given by \sigma^{2} = \frac{d-c}{12}. With the exception of the fourth gallery image, which illustrates the explosion of \psi when the distribution has positive mass at 0, Table 1 below compares possible values of \psi for different Uniform distributions. Notice that while the variance \sigma^{2} does affect the range of \psi by several orders of magnitude, the location of the support has a much more powerful effect. Thus, if all service times are equally likely, a server is less efficient if it is consistently but mildly slow (Uniform[10,11]) compared to an inconsistent server (Uniform[1,500]).

\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approximate Range of } \psi & \text{Approximate } \lambda^{*}\\\text{Uniform}[1,2] & 1/12 &1.5 & (0,0.012) & 1.3\\ \text{Uniform}[10,11] & 1/12 &10.5 & (0, 2\times10^{-11}) & 0.1 \\ \text{Uniform}[1,500] & 499/12 & 250.5 & (0,3\times 10^{-5}) & 1.6\end{array}

 

Increasing but Compact Service Distribution

Click to expand and explore. 

\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approx. Range of } \psi & \text{Approx. } \lambda^{*}\\\small{\frac{2w}{3}}\mathbb{1}_{[1,2]}(w)& \approx 0.0802&\approx 1.56 & (0,7\times10^{-3}) & 0.75\\\small{\frac{2w}{5}}\mathbb{1}_{[2,3]}(w)&\approx 0.0822 &\approx 2.53 & (0, 7\times10^{-4}) & 0.5 \\\small{\frac{2w}{500^{2}-1}}\mathbb{1}_{[1,500]}(w)& \approx 13888& 333.35 & (0,1.3\times 10^{-7})& 1.4\end{array}

As an illustration of a distribution on compact support, consider the class of increasing densities g_{W}(w) = cx\mathbb{1}_{[a,b]}(x). Several examples are given in the above gallery table. For both compact supports of length 1, the variance is approximately the same, but the mean changes, producing an order of magnitude decrease in efficiency. Compared to the compact support of length 499, with a much larger mean, the efficiency decreases by 3 orders of magnitude. Notice, however, that the decline after the maximum is much less sharp.

 

Erlang Service Distribution

Click on an image to explore. 

 

\begin{array}{l|rrrr}g_{W}(w) & \sigma^{2}&\mu& \text{Approx. Range of } \psi & \text{Approx. } \lambda^{*}\\\text{Erlang}(2,1)& 2&2 & (0,0.7)& 9\\\text{Erlang}(9,1)&9 &9 & (0, 4\times10^{-6}) & 0.5 \\\text{Rayleigh}(1)& (4-\pi)/2& \sqrt{\pi/2} & (0,0.9)& 8\end{array} The gallery above gives two examples of \psi under an Erlang distribution. Notice the change in the efficiency as the mean increases. Here, since \lambda =1, \sigma^{2} = \mu, so the mean likely has the largest effect on \psi.

Comparing all the examples

Comparing all examples, the Rayleigh(1) service distribution imposes the highest maximum efficiency, followed closely by the Erlang(2,1) service distribution, with the Uniform[1,2] service distribution following. \lambda^{*} under the Erlang(2,1) service distribution is larger than for the Rayleigh(1) service distribution, indicating that a server whose service times follow the former distribution can handle a larger arrival intensity before its efficiency begins to decline than the latter.

The means for the Rayleigh(1), Erlang(2,1), and Uniform[1,2] distributions are similar, as shown in the tables, but the Uniform[1,2] distribution has equal probability of any service time in its support with large negative excess kurtosis. It is posulated that kurtosis, skew, and variance play large roles in the behavior and range of \psi. Compare the efficiency under the Erlang(2,1) service distribution with the efficiency under the Erlang(9,1) service distribution. Not only is the mean much lower for the Erlang(2,1) distribution, but the distribution is more strongly positive-skewed than the Erlang(9,1). Thus, more mass is concentrated at the left side of the distribution, indicating that the service times are more often shorter.

Finally, to note the effect of the typical stress level \eta on the range of \psi, we compare the original Cha/Lee figure with our Rayleigh figure above. The service distribution and all other quantities remain the same, but Cha and Lee’s numerical example set \eta = 0.01, whereas the third image in the Erlang gallery shows \psi under \eta = 1. The range of \psi decreases by two orders of magnitude with a 100 fold increase in \eta, with the shape remaining similar. In addition, the location of the maximum \lambda^{*} also inversely varies by the same magnitude.

Studying the efficiency under various service distributions aids not only in deciding when to implement a server intervention, but also aids in evaluating the performance of various servers given their service times.

Conclusion

Many questions have been opened up about the various effects of the service distribution and stress on the server efficiency. The conditions given above are merely sufficient conditions, and have not been proven necessary. Studying the efficiency as a function of the arrival rate under various service distributions and stress provides a way to understand the performance of a server and the various factors that can affect it mathematically.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

References

  1.  CHA, K. H., AND LEE, E. Y. A stochastic breakdown model for an unreliable web server system and an optimal admission control policy. Journal of Applied Probability 48, 2 (2011), 453–466.
  2.  ROSS, S. Stochastic Processes, 2nd edition.
  3.  TRAYLOR, R. Stochastic reliability of a server under random workload. Ph.D. thesis (2016).
Commentary: Returning to Fundamentals in Tech

Commentary: Returning to Fundamentals in Tech

I once heard a great analogy about the difference between mathematicians and engineers in their problem-solving approaches. If an engineer and a mathematician are tasked with crossing a river, the engineer will create a set of stepping stones and get you across quickly, and safely in most cases. The mathematician will spend a month examining the problem, and more months or years carefully constructing a very sturdy stone bridge that will last for lifetimes. 

Read More Read More