Browsed by
Category: Analysis

Simulating Soundscapes Using Convolutions

Simulating Soundscapes Using Convolutions

One of the most powerful areas of electrical engineering that flourished in the 20th century is the field of signal processing. The field is broad and rich in some beautiful mathematics, but by way of introduction, here we’ll take a look at some basic properties of signals and how we can use these properties to find a nice compact representation of operations on them. As a motivating application, we’ll use what we study today to apply certain effects to audio signals. In particular, we’ll take a piece of audio, and be able to make it sound like it’s being played in a cathedral, or in a parking garage, or even through a metal spring.

First things first: what is a signal? For this discussion we’ll limit ourselves to looking at the space \ell = \{x :\mathbb{Z} \rightarrow \mathbb{R}\} – the set of functions which take an integer and return a real number. Another way to think of a signal then is as an infinite sequence of real numbers. We’re limiting ourselves to functions where the domain is discrete (the integers), rather than continuous (the real numbers), since in many applications we’re looking at signals that represent some measurement taken at a bunch of different times1. It’s worth noting that any signal that’s been defined on a countable domain \{..., t_{n-1}, t_n, t_{n+1},...\} can be converted to one defined on the integers via an isomorphism. We like to place one further restriction on the signals, in order to make certain operations possible. We restrict the space to so-called finite-energy signals:

\ell_2 = \left\{x \in \ell : \sum_{n = -\infty}^{\infty} |x(n)|^2 < \infty\right\}.

This restriction makes it much easier to study and prove things around these functions, while still giving us lots of useful signals to study, without having to deal with messy things like infinities. In practice, when dealing with audio we usually have a signal with a finite length and range, so this finite-energy property is trivially true.

Studying signals is only as useful if we can also define operations on them. We’ll study the interaction of signals with systems, which take one signal and transform it into another – essentially, a function operating on signals. Here, we’ll say that a system H : \ell_2 \rightarrow \ell_2 takes an input signal x(t) and produces output signal H\{x(t)\} = y(t).

Linearity and Time Invariance

There are certain properties that are useful for systems to have. The first is linearity. A system H is considered linear if for every pair of inputs x_1, x_2 \in \ell_2, and for any scalar values \alpha, \beta \in R, we have

H\{\alpha x_1 + \beta x_2\} = \alpha H\{x_1\} + \beta H\{x_2\}

This is very useful, because it allows us to break down a signal into simpler parts, study the response of the system to each of those parts, and understand the response to the more complex original signal.

The next property we’re going to impose on our systems is time-invariance:

\forall s \in \mathbb{Z}, H\{x(n)\} = y(n) \Rightarrow H\{x(n-s)\} = y(n-s)

This means that shifting the input by s corresponds to a similar shift in the output. In our example of playing music in a cathedral, we expect our system to be time-invariant, since it shouldn’t matter whether we play our music at noon or at midnight, we’d expect it to sound the same. However, if we were playing in a building that, for example, lowered a bunch of sound-dampening curtains at 8pm every night, then the system would no longer be time-invariant.

So what are some more concrete examples of systems that are linear and time-invariant?
Let’s consider an audio effect which imitates an echo – it outputs the original signal, plus a quieter, delayed version of that signal. We might express such a system as

H_{\Delta, k}\{x(n)\} = x(n) + kx(n-\Delta)

where \Delta \in \mathbb{Z} is the time delay of the echo (in terms of the number of samples), and k \in \mathbb{R} is the relative volume of the echoed signal. We can see that this signal is time-invariant, because there is no appearance of the time variable outside of the input. If we replaced k by a function k(n) = \sin(n), for example, we would lose this time invariance. Additionally, the system is plainly linear:

\begin{aligned}H_{\Delta, k}\{\alpha x_1(n) + \beta x_2(n)\} &= \alpha x_1(n) + \beta x_2(n) + k x_1(n-\Delta) +k x_2(n-\Delta) \\&= H_{\Delta, k}\{x_1(n)\} + H_{\Delta, k}\{x_2(n)\}\end{aligned}

A common non-linearity in audio processing is called clipping — we limit the output to be between -1 and 1: H\{x(n)\} = \max(\min(x(n), 1), -1). This is clearly non-linear since doubling the input will not generally double the output.

The Kronecker delta signal

There is a very useful signal that I would be remiss not to mention here: the Kronecker delta signal. We define this signal as

\delta(n) = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases}

The delta defines an impulse, and we can use it to come up with a nice compact description of linear, time-invariant systems. One property of the delta is that it can be used to “extract” a single element from another signal, by multiplying:

\forall s \in \mathbb{Z}, \delta(n-s)x(s) = \begin{cases} x(n) & n=s \\ 0 & n \neq s\end{cases}

Similarly, we can then write any signal as an infinite sum of these multiplications:

x(n) = \sum_{s=-\infty}^{\infty} \delta(n-s)x(s) = \sum_{s=-\infty}^{\infty}\delta(s)x(n-s)

Why would we want to do this? Let H be a linear, time-invariant system, and let h(t) = H\{\delta(t)\}, the response of the system to the delta signal. Then we have

\begin{aligned} H\{x(n)\} &= H\left\{\sum_{s=-\infty}^{\infty} \delta(n-s)x(s)\right\}\\&=\sum_{s=-\infty}^{\infty}H\{\delta(n-s)\}x(s) \text{ by linearity}\\&=\sum_{s=-\infty}^{\infty}h(n-s)x(s) \text{ by time-invariance}\end{aligned}

We can write any linear, time-invariant system in this form. We call the function h the impulse response of the system, and it fully describes the behaviour of the system. This operation where we’re summing up the product of shifted signals is called a convolution, and appears in lots of different fields of math 2.

Firing a Gun in The Math Citadel

The power of this representation of a system is that if we want to understand how it will act on any arbitrary signal, it is sufficient to understand how it responds to an impulse. To demonstrate this, we’ll look at the example of how audio is affected by the environment it is played in. Say we were a sound engineer, and we wanted to get an instrument to sound like it was being played in a big, echoing cathedral. We could try to find such a place, and actually record the instrument, but that could be expensive, requiring a lot of setup and time. Instead, if we could record the impulse response of that space instead, we could apply the convolution to a recording we did back in a studio. How do we capture an impulse response? We just need a loud, very short audio source – firing a pistol or popping a balloon are common. To demonstrate, here are some example impulse responses, taken from OpenAirLib, and their effects on different audio signals.

First, here is the unprocessed input signal – a short piece of jazz guitar:

Here is the same clip, as if it were played in a stairwell at the University of York. First, the impulse response, then the processed audio.

That sounds a little different, but we can try a more extreme example: the gothic cathedral of York Minster. Again, here is the impulse response, followed by the processed signal.

In this case, we have a much more extreme reverberation effect, and we get the sound of a guitar in a large, ringing room. For our last example, we’ll note that impulse responses don’t have to be natural recordings, but instead could be entirely synthetic. Here, I’ve simply reversed the first impulse response from the stairwell, which creates this pre-echo effect, which doesn’t exist naturally.

This is just one of the most basic examples of what can be done with signal processing, but I think it’s a particularly good one – by defining some reasonable properties for signals and systems, we’re able to derive a nice compact representation that also makes a practical application very simple.

Sequences & Tendency: Topology Basics Pt. 2

Sequences & Tendency: Topology Basics Pt. 2

Introduction

In my previous post I presented abstract topological spaces by way of two special characteristics. These properties are enough to endow a given set with vast possibilities for analysis. Fundamental to mathematical analysis of all kinds (real, complex, functional, etc.) is the sequence.

We have covered the concept of sequences in some of our other posts here at The Math Citadel. As Rachel pointed out in her post on Cauchy sequences, one of the most important aspects of the character of a given sequence is convergence.

In spaces like the real numbers, there is convenient framework available to quantify closeness and proximity, and which allows naturally for a definition of limit or tendency for sequences. In a general topological space missing this skeletal feature, convergence must be defined.

This post will assume only some familiarity with sequences as mathematical objects and, of course, the concepts mentioned in Part 1. For a thorough treatment of sequences, I recommend Mathematical Analysis by Tom M. Apostol.

Neighborhoods

Suppose (X,\mathscr{T}) is a given topological space, and nothing more is known. At our disposal so far are only open sets (elements of \mathscr{T}), and so it is on these a concept of vicinity relies.

Definition. Given a topological space (X,\mathscr{T}), a neighborhood of a point x\in X is an open set which contains x.

That is, we say an element T\in\mathscr{T} such that x\in T is a neighborhood1 of x. To illustrate, take the examples from my previous post.

The Trivial Topology

When the topology in question is the trivial one: \{\emptyset,X\}, the only nonempty open set is X itself, hence it is the only neighborhood of any point x\in X.

The Discrete Topology

Take X=\{2,3,5\} and \mathscr{T} to be the collection of all subsets of X:

  \emptyset  
\{2\} \{3\} \{5\}
\{2,3\} \{2,5\} \{3,5\}
  \{2,3,5\}  

 

Then, for, say x=5, neighborhoods include \{5\}, \{2,5\}, \{3,5\}, and \{2,3,5\}.

The Standard Topology on \mathbb{R}

The standard topology on \mathbb{R} is defined to be the family of all sets of real numbers containing an open interval around each of its points. In this case, there are infinitely2 many neighborhoods of every real number. Taking x=\pi for instance, then (3,4), (-2\pi,2\pi), and even

\bigcup_{n=1}^{\infty}\left(\pi-\frac{1}{n},\pi+\frac{1}{n}\right)

are all neighborhoods of \pi.

Remark. A special type of neighborhood in the standard topology is the symmetric open interval. Given a point x and a radius r>0, the set

(x-r,x+r)=\{y\in\mathbb{R}\mathrel{:}|x-y|<r\}

is a neighborhood of x. These sets form what is referred to as a basis for the standard topology and are important to definition of convergence in \mathbb{R} as a metric space.

Convergence

“…the topology of a space can be described completely in terms of convergence.” —John L. Kelley, General Topology

At this point in our discussion of topological spaces, the only objects available for use are open sets and neighborhoods, and so it is with these that convergence of a sequence are built3.

Definition. A sequence (\alpha_n) in a topological space (X,\mathscr{T}) converges to a point L\in X if for every neighborhood U of L, there exists an index N\in\mathbb{N} such that \alpha_n\in U whenever n\geq N. The point L is referred to as the limit of the sequence (\alpha_n).

Visually, this definition can be thought of as demanding the points of the sequence cluster around the limit point L. In order for the sequence (\alpha_n) to converge to L, it must be the case that after finitely many terms, every one that follows is contained in the arbitrarily posed neighborhood U.

As you might expect, the class of neighborhoods available has a dramatic effect on which sequences converge, as well as where they tend. Just how close to L are the neighborhoods built around it in the topology?

We will use the example topologies brought up so far to exhibit the key characteristics of this definition, and what these parameters permit of sequences.

The Trivial Topology

In case it was to this point hazy just how useful the trivial topology is, sequences illustrate the issue nicely. For the sake of this presentation, take the trivial topology on \mathbb{R}. There is precisely one neighborhood of any point, namely \mathbb{R} itself. As a result, any sequence of real numbers converges, since every term belongs to \mathbb{R}. Moreover, every real number is a limit of any sequence. So, yes, the sequence (5,5,5,\ldots) of all 5‘s converges to \sqrt{2} here.

The Discrete Topology

Whereas with the trivial topology a single neighborhood exists, the discrete topology is as packed with neighborhoods as can be. So, as the trivial topology allows every sequence to converge to everything, we can expect the discrete topology to be comparatively restrictive. Taking the set \{2,3,5\} with the discrete topology as mentioned above, we can pinpoint the new limitation: every set containing exactly one point is a neighborhood of that point. Notice the sets4 \{2\}, \{3\}, and \{5\} are all open sets.

What does this mean? Any sequence that converges to one of these points, say 3, must eventually have all its terms in the neighborhood \{3\}. But that requires all convergent sequences to be eventually constant! This seems to be a minor issue with the finite set \{2,3,5\}, but it presents an undesirable, counter-intuitive problem in other sets.

Take \mathbb{R} with the discrete topology, for example. Under these rules, the sequence

(\alpha_n)=\left(\frac{1}{n}\right)=\left(1,\frac{1}{2},\frac{1}{3},\frac{1}{4},\ldots\right),

though expected to converge to 0, does not converge at all.

So, the discrete topology is too restrictive, and the trivial topology lets us get away with anything. Fortunately, a happy middle ground exists by being a little more selective with neighborhoods.

The Standard Topology

By requiring an open set to contain an open interval around each of its points, it is impossible that a singleton be an open set. Therefore a singleton cannot be a neighborhood, and we eliminate the trouble posed by the discrete topology. Yet every open interval around a real number L contains a smaller one, and each of these is a neighborhood.

This effectively corrals the points of any convergent sequence, requiring the distance between the terms and the limit to vanish as n increases. Take again the sequence

(\alpha_n)=\left(\frac{1}{n}\right)=\left(1,\frac{1}{2},\frac{1}{3},\frac{1}{4},\ldots\right).

We suspect (\alpha_n) converges to 0, but this requires proof. Therefore, we must consider an arbitrary neighborhood of 0, and expose the index N\in\mathbb{N} such that all terms, from the Nth onward, exist in that neighborhood.

Suppose U is a given neighborhood of 0, so that U contains an open interval surrounding 0. Without loss of generality, we may assume this interval is symmetric; that is, the interval has the form (-r,r) for some radius r>0. Take N to be any integer greater than \tfrac{1}{r}. Then, whenever n\geq N,

\alpha_n = \frac{1}{n} \leq \frac{1}{N} < \frac{1}{1/r} = r.

But this means \alpha_n\in(-r,r)\subset U so long as n\geq N. Since we chose U arbitrarily, it follows (\alpha_n) converges to 0.

Conclusion

The behavior of a sequence in a given set can change rather drastically depending on the network of neighborhoods the topology allows. However, with careful construction, it is possible to have all the sequential comforts of metric spaces covered under a briefly put definition.

My next post in this series will push the generalization of these concepts much further, by relaxing a single requirement. In order to define convergence in the preceding discussion, the set of indices \mathbb{N} was important not for guaranteeing infinitely many terms, but instead for providing order. This allows us to speak of all terms subsequent to one in particular. It turns out that if we simply hold on to order, we can loosen the nature of the set on which it is defined. That is the key to Moore-Smith Convergence, to be visited next.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Paper Review: Active Queue Management with Non-Linear Packet Dropping Function

Paper Review: Active Queue Management with Non-Linear Packet Dropping Function

As promised in the previous article, I plan to review Reference 2, Active Queue Management with Non-Linear Packet Dropping Function, by D. Augustyn, A. Domanski, and J. Domanska, published in HET-NETs 2010, which discusses a change in the structure of the packet drop probability function using the average queue length in a buffer. I mentioned previously that choosing a linear function of the average queue length can be viewed as a bit of an arbitrary choice, since we’re designing a control mechanism here, and this paper attempts to define a new form of this packet drop probability function.

In summary, the best lesson one can take from this paper is that publication in a journal or conference proceedings does not guarantee that the paper withstands scrutiny. The paper is linked above for the interested reader to peruse himself, and to investigate the claims. 

Summary

The paper intended to give a new function to calculate the probability of proactively dropping a packet in a queue in order to prevent a full buffer. It seemed to be presented as an alternative to RED, described in my previous article. The authors define this new function, then set up a simulation in order to examine the effects. 

When Orthogonal is Abused

The authors describe using a finite linear combination of orthogonal basis polynomials defined on a finite interval as the underlying mathematical structure. 

First, we should discuss what we mean by orthogonal in context of functions. Orthogonal is most commonly understood in terms of vectors, and when we’re in two dimensions, orthogonal becomes our familiar perpendicular. 

Orthogonality

Beginning with the familiar notion of perpendicular, we can generalize this to understand orthogonality. The geometric interpretation of two vectors  being perpendicular is that the angle between them is 90^{\circ}. Once we leave two and three dimensions (or jump to the space of polynomials, as we’ll do soon), the concept of an angle isn’t as helpful. 

Another way to define perpindicular is through an operation known as the dot productSuppose we take two 2D vectors, \mathbf{x} and \mathbf{y}. Each vector will have coordinates: \mathbf{x} = (x_{1},x_{2}) and \mathbf{y} = (y_{1}, y_{2}). The dot product is a special type of multiplication defined on vectors, and denoted \cdot:

\mathbf{x}\cdot\mathbf{y} = x_{1}y_{1} + x_{2}y_{2}

The dot product can be described in words as the sum of the component-wise multiplication of the coordinates of the two vectors. 

Now, we can say that two vectors are perpendicular if their dot product is 0. That is, \mathbf{x} and \mathbf{y} are perpindicular if \mathbf{x}\cdot\mathbf{y} = 0. (This article gives a nice overview of how we move from the algebraic to the geometric definition of perpendicular and orthogonal.)

Remember, perpendicular doesn’t make sense once we get into higher dimensions. Orthogonal is a more general notion of vectors being perpendicular, and is defined for two vectors (of any length) as their dot product equalling zero.

From Dot Product to Inner Product

The dot product is used on vectors, and defines another type of product that is different from the scalar multiplication we know. In fact, we can generalize the notion of a dot product to something called an inner product, which can be defined on many different spaces than just vectors. We can define operations and products however we like, but for our definition to qualify as an inner product (denoted \langle \cdot, \cdot\rangle), it must meet certain criteria

For instance, on the set of real valued functions with domain [a,b], we define the inner product of two functions f(x) and g(x) as 

\langle f, g\rangle := \int_{a}^{b}f(x)g(x)dx

The concept of orthogonality generalizes to an inner product as well. If the inner product of two functions is zero (as defined above), we say the functions are orthogonal. 

Back to the paper

The authors claim to be using a set of orthogonal polynomials to define their drop probability function, and they give the structure of such functions. For a domain [a,b], and for \phi_{j} in the set of polynomials, they define \phi_{j} = (x-a)^{j-1}(b-x). So, for example, \phi_{1} = (b-x), and \phi_{5} = (x-a)^{4}(b-x)

Now, in order to be an orthogonal basis for a space1, the set of functions that are claimed to form the basis of the set must be pairwise orthogonal. That is, I need to be able to take the inner product of any two functions \phi_{i} and \phi_{j} and get 0. If that isn’t true for even one pair, then the set is not orthogonal. 

As it turns out, if we take the inner product of any two functions in the basis set over the domain given, we find that there are no pairs that are orthogonal. To do this in general, we compute the integral

\int_{a}^{b}(x-a)^{i-1}(b-x)\cdot (x-a)^{j-1}(b-x)dx

The integral computation is one of simple polynomial integration, and can be done either by hand or using your favorite software (Mathematica) of choice. What we find here is that this set of functions defined in general this way is never orthogonal, yet the paper claims they are. 

Applying to the particular situation of designing a drop probability function, they give the following for average queue length thresholds T_{\min} and T_{\max} p(x,a_{1},a_{2}) = \left\{\begin{array}{lr}0, &x < T_{\min}\\\phi_{0} + a_{1}\phi_{1}(x) + a_{2}\phi_{2}(x),&T_{\min}\leq x \leq T_{\max}\\1,&x > T_{\max}\end{array}\right.

where the basis functions are 

\begin{aligned}\phi_{0}(x) &= p_{m}\frac{x-T_{\min}}{T_{\max}-T_{\min}}\\\phi_{1}(x) &= (x-T_{\min})(T_{\max}-x)\\\phi_{2}(x) &= (x-T_{\min})^{2}(T_{\max}-x)\end{aligned}

The reader will recognize \phi_{0} as the original drop probability function from the RED algorithm. These functions are absolutely not orthogonal though (as the authors claim), and a simple check as we did above will verify it.

Other issues

Another issue is that these mysterious coefficients a_{1} and a_{2} need to be determined. How, you ask? The authors do not say, other than to note that one can define “a functional” implicit on the unknown p(x,a_{1},a_{2}) that can be minimized to find the optimal values for those coefficients. They write that this mysterious functional2 can be based on either the average queue length or average waiting time, yet provide no details whatsoever as to the functional they have chosen for this purpose. They provide a figure with a sample function, but give no further details as to how it was obtained.

One other issue I have in their methodology is discussing the order of estimation. For those familiar with all sorts of ways to estimate unknown functions, from Taylor series, to splines, to Fourier series, we know that a function is exactly equal to an infinite sum of such functions. Any finite sum is an estimation, and the number of terms we use for estimation with a desired accuracy may change with the function being estimated. 

For instance, if I want to use Taylor series (a linear combination of polynomials) to estimate a really icky function, how many terms should I include to ensure my accuracy at a point is within 0.001 of the real value? It depends on the function.3.

The authors simply claim that a second order term is appropriate in this case. The issue I take with that is these queueing management drop probability functions are designed. We’re not estimating a function describing a phenomenon, we are designing a control policy to seek some sort of optimal behavior. Fundamentally, the authors posing this as an estimation problem of some unknown drop probability function is incorrect. This isn’t a law of nature we seek to observe; it’s a control policy we seek to design and implement and optimize. Using language that implies the authors are estimating some “true function” is misleading.

Regarding the simulation itself, since the design was arbitrary, and not based on sound mathematical principles, I cannot give any real comment to the simulations and the results. The authors briefly discuss and cite some papers that explore the behavior of network traffic, and claim to take this into account in their simulations. To those, I cannot comment yet. 

Conclusion

Always verify a paper for yourself, and don’t accept anything at face value. Research and technical publications should be completely transparent as to choices and methodologies (and obviously free of mathematical inaccuracies), and derivations and proofs should be present when necessary, even if in an appendix. 

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

 

The Red-Headed Step-Distributions

The Red-Headed Step-Distributions

Almost every textbook in probability or statistics will speak of classifying distributions into two different camps: discrete (singular in some older textbooks) and continuous. Discrete distributions have either a finite or a countable sample space (also known as a set of Lebesgue measure 0), such as the Poisson or binomial distribution, or simply rolling a die. The probability of each point in the sample space is nonzero. Continuous distributions have a continuous sample space, such as the normal distribution. A distribution in either of these classes is either characterized by a probability mass function (pmf) or probability distribution function (pdf) derived from the distribution function via taking a derivative. There is, however, a third kind.

One rarely talked about, or mentioned quickly and then discarded. This class of distributions is defined on a set of Lebesgue measure 0, yet the probability of any point in the set is 0, unlike discrete distributions. The distribution function is continuous, even uniformly continuous, but not absolutely continuous, meaning it’s not a continuous distribution. The pdf doesn’t exist, but one can still find moments of the distribution (e.g. mean, variance). They are almost never encountered in practice, and the only real example I’ve been able to find thus far is based on the Cantor set. This class is the set of red-headed step-distributions– the singular continuous distributions.

Back up, what is Lebesgue measure?

Measure theory itself can get extremely complicated and abstract. The idea of measures is to give the “size” of subsets of a space. Lebesgue measure is one type of measure, and is actually something most people are familiar with: the “size” of subsets of Euclidean space in n dimensions. For example, when n=1, we live in 1D space. Intervals. The Lebesgue measure of an interval [a,b] on the real line is just the length of that interval: b-a. When we move to two dimensions, \mathbb{R}\times \mathbb{R}, the Cartesian product of 1D space with itself, our intervals combine to make rectangles. The Lebesgue measure in 2D space is area; so a rectangle built from [a,b]\times [c,d] has Lebesgue measure (b-a)(d-c). Lebesgue measure in 3D space is volume. And so forth. 

Now, points are 0-dimensional in Euclidean space. They have no size, no mass. They have Lebesgue measure 01. Intuitively, we can simply see that Lebesgue measure helps us see how much “space” something takes up in the Euclidean world, and points take up no space, and hence should have measure 0. 

In fact, any countable set of points has Lebesgue measure 0. Even an infinite but countable set. The union of disjoint Lebesgue measurable sets has a measure equal to the sum of the individual sets. Points are certainly disjoint, and they each have measure 0, and summing 0 forever still yields 0.2 So, the set \{0,1,2\} has Lebesgue measure 0. But so do the natural numbers \mathbb{N}and the rational numbers \mathbb{Q}, even though the rational numbers contain the set of natural numbers.

It is actually possible to construct an uncountable infinite set that has Lebesgue measure 0, and we will need that in constructing our example of a singular continuous distribution. For now, we’ll examine discrete and continuous distributions briefly.

Discrete (Singular) Distributions

These are the ones most probability textbooks begin with, and most of the examples that are familiar.

Roll a fair die. 

The sample space for a roll of a fair die X is S =\{1,2,3,4,5,6\}. The PMF is P(X = x) = 1/6, where x \in S. The CDF is given by the function P(X\leq x) = \sum_{j\leq x}P(X=j) 

Example:

P(X \leq 4) = \sum_{j\leq 4}\frac{1}{6} = \frac{2}{3}

Binomial Distribution

A binomial random variable X counts the number of “successes” or 1s in a binary sequence of n Bernoulli random variables. Think a sequence of coin tosses, and counting the number of heads. In this case, the sample space is infinite, but countable: S = \{0,1,2,\ldots\}. If the probability of a 1, or “success” is p, then the PMF of X is given by 

P(X=x) = {n \choose x}p^{x}(1-p)^{n-x}

Note here again that the sample space is of Lebesgue measure 0, but the probability of any point in that space is a positive number. 

Continuous Distributions

Continuous distributions operate on a continuous sample space, usually an interval or Cartesian product of intervals or even a union of intervals. Continuous distribution functions F are absolutely continuous, meaning that (in one equivalent definition), the distribution function has a derivative f=F' almost everywhere that is Lebesgue integrable, and obeys the Fundamental Theorem of Calculus:

F(b)-F(a) = \int_{a}^{b}f(x)dx

for a< b. This f is the probability distribution function (PDF), derived by differentiating the distribution function. Let’s mention some examples of these:

The Continuous Uniform Distribution

Suppose we have a continuous interval [a,b], and the probability mass is spread equally along this interval, meaning that the probability that our random variable X lies in any subinterval of size s has the same probability, regardless of location. Suppose we do not allow the random variable to take any values outside the interval. The sample space is continuous but over a finite interval. The distribution function for this X is given by 

F(x) = \left\{\begin{array}{lr}0&x< a\\\frac{x-a}{b-a}&a\leq x \leq b\\1&x > b\end{array}\right.

This is an absolutely continuous function. Then we may easily derive the PDF by differentiating F:

f(x) = \mathbb{1}_{x \in [a,b]}\frac{1}{b-a}

where \mathbb{1}_{x \in [a,b]} is the indicator function that takes value 1 if x is in the interval, and 0 otherwise. 

This distribution is the continuous version of a die roll. The die roll is the discrete uniform distribution, and here we just allow for a die with uncountably many sides with values in [a,b]. The probability of any particular point is 0, however, even though it is possible to draw a random number from this interval. To see this, note that the probability that the random variable X lies between two points in the interval, say x_{1} and x_{2} is given by multiplying the height of the PDF by the length (Lebesgue measure) of the subinterval. The Lebesgue measure of a point is 0, so even though a value for the PDF exists at that point, the probability is 0. 

We don’t run into issues here mathematically because we are on a continuous interval. 

The Normal Distribution

Likely the most famous continuous distribution, the normal distribution is given by the famous “bell curve.” In this case, the sample space is the entire real line. The probability that a normally distributed random variable X lies between any two points a and b is given by 

P(a\leq X \leq b) = \int_{a}^{b}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)dx

where \mu is the mean and \sigma^{2} is the variance. 

Singular Continuous Distributions

We’re going to begin this section by discussing everyone’s favorite counterexample in mathematics: the Cantor set. 

The Cantor set

The Cantor set is given by the limit of the following construction:

  1. Take the interval [0,1]
  2. Remove the middle third: (1/3, 2/3), so you’re left with [0,1/3]\cup[2/3,1]
  3. Remove the middle third of each of the remaining intervals. So you remove (1/9,2/9) from [0,1/3] and (7/9,8/9) from [2/3,1], leaving you with the set [0,1/9]\cup[2/9,1/3]\cup[2/3,7/9]\cup[8/9,1]

Continue this process infinitely.

This is an example of a set that is uncountable, yet has Lebesgue measure 0. Earlier, when we discussed Lebesgue measure, we noted that all countable sets had measure 0. Thus we may conclude that only uncountable sets (like intervals) have nonzero Lebesgue measure. However, the Cantor set illustrates that not all uncountable sets have positive Lebesgue measure. To see why the Cantor set has Lebesgue measure 0, we will look at the measure of the sets that are removed (the complement of the Cantor set):

At the first step, we have removed one interval of size 1/3. At the second step, we remove two intervals of size 1/9. At the third step, we remove four intervals of size 1/27. Let’s call S_{n} the subset removed from the interval [0,1] by the nth step. By the end of the third step, we have removed a set of size

m(S_{3}) = \frac{1}{3} + \frac{2}{3^{2}} + \frac{4}{3^{3}}

By the nth step, 

m(S_{n}) = \sum_{j=0}^{n}\frac{2^{j}}{3^{j+1}}

This is the partial sum of a geometric series, so

m(S_{n}) = 1-\left(\frac{2}{3}\right)^{n}

Now, the Cantor set is formed when n \to \infty. The measure of the complement of the Cantor set, which we called S_{\infty} then has measure

m(S_{\infty}) = \lim_{n \to \infty}m(S_{n}) = \lim_{n \to \infty}1-\left(\frac{2}{3}\right)^{n} = 1

But the original interval we started with had Lebesgue measure 1, and the union of the Cantor set with its complement S_{\infty} is the interval [0,1]. That means that the measure of the Cantor set plus the measure of its complement must add to 1, which implies that the Cantor set is of measure 0. However, since we removed open intervals during the construction, there must be something left; in fact, there are uncountably many points left. 

Now we have an uncountable set of Lebesgue measure 0. We’re going to use this set to construct the only example I could find of a singular continuous distribution. It is very important that the Cantor set is an uncountable set of Lebesgue measure 0. 

Building the Cantor distribution

Update: Following a correction from an earlier version, I’m going to show how to construct this distribution directly and via the complement of the Cantor set. The latter was used in a textbook I found, and is a bit convoluted in its construction, but I’m going to leave it.

The direct construction is to look at the intervals left behind at each stage n of constructing the Cantor set. Assign a probability mass of \frac{1}{2^{n}} to each of the 2^{n} intervals left behind, and this is your distribution function. It’s basically a continuous uniform distribution, but on stages of the Cantor set construction. Sending n \to \infty yields the Cantor set, but the probability distribution moves to 0 on a set of measure 0. Thus, unlike the continuous uniform distribution, where the probability of any single point was 0, but the support has positive measure, we essentially have the continuous uniform distribution occurring on a set of measure 0, which means we have a continuous distribution function on a singular support of measure 0 that is uncountable and thus not discrete. This distribution is therefore neither continuous nor discrete. 

Another way to construct this is by complement, via Kai Lai Chung’s A Course in Probability Theory. 

(Note: after a second glance at this, I found this to be a relatively convoluted way of constructing this distribution, since it can be fairly easily constructed directly. However, I imagine the author’s purpose was to be very rigid and formal to cover all his bases, so I present a review of it here:)

Let’s go back to the construction of the Cantor set. At each step n we have removed in total 2^{n}-1 disjoint intervals. Let’s number those intervals, going from left to right as J_{n,k}, where k = 1,2,\ldots, 2^{n}-1

For example, at n=2 we have that J_{2,1} = (1/9,2/9),J_{2,2} = (1/3,2/3), and J_{2,3} = (7/9,8/9)

Now let the quantity c_{n,k} = \frac{k}{2^{n}}. This will be the probability mass assigned to interval J_{n,k}. So we define the distribution function as 

F(x) = c_{n,k}, x \in J_{n,k}

Let U_{n} = \cup_{k=1}^{2^{n}-1}J_{n,k}, and U = \lim_{n\to\infty}U_{n} The function F is indeed a distribution function and can be shown to be uniformly continuous on the set D = (-\infty,0)\cup U \cup (1,\infty). However, none of the points in D is in the support of F, so the support of F is contained in the Cantor set (and in fact is the Cantor set).  The support (the Cantor set) has measure 0, so it is singular, but the distribution function is continuous, so it cannot be a discrete distribution. This distribution fits nowhere in our previous two classes, so we must now create a third class — the singular continuous distribution.

(By the way, even though the PDF doesn’t exist, the Cantor distribution still has mean of 1/2 and a variance of 1/8, but no mode. It does have a moment generating function.)

Any other examples?

With some help, I spent some time poring through quite a few probability books to seek further study and other treatment of singular continuous distributions. Most said absolutely nothing at all, as if the class didn’t exist. 

One book, Modern Probability Theory and Its Applications has a rather grumpy approach:

There also exists another kind of continuous distribution function, called singular continuous, whose derivative vanishes at almost all points. This is a somewhat difficult notion to picture, and examples have been constructed only by means of fairly involved analytic operations. From a practical point of view, one may act as if singular continuous distribution functions do not exist, since examples of these functions are rarely, if ever, encountered in practice.

This notion also has led me to a couple papers, which I intend to review and continue presenting my findings. I happen to have a great fondness for these “edge cases” and forgotten areas of mathematics. I believe they are the most ripe for groundbreaking new material. 

 

Cauchy Sequences: the Importance of Getting Close

Cauchy Sequences: the Importance of Getting Close

I am an analyst at heart, despite my recent set of algebra posts. Augustin Louis Cauchy can be argued as one of the most influential mathematicians in history, pioneering rigor in the study of calculus, almost singlehandedly inventing complex analysis and real analysis, though he also made contributions to number theory, algebra, and physics. 

One of the fundamental areas he studied was sequences and their notion of convergence. Suppose I give you a sequence of numbers, and ask you what happens to this sequence if I kept appending terms forever? Would the path created by the sequence lead somewhere? 

Read More Read More