﻿

### Browsed byAuthor: Jason Hathcock

Sequences & Tendency: Topology Basics Pt. 2

## Introduction

In my previous post I presented abstract topological spaces by way of two special characteristics. These properties are enough to endow a given set with vast possibilities for analysis. Fundamental to mathematical analysis of all kinds (real, complex, functional, etc.) is the sequence.

We have covered the concept of sequences in some of our other posts here at The Math Citadel. As Rachel pointed out in her post on Cauchy sequences, one of the most important aspects of the character of a given sequence is convergence.

In spaces like the real numbers, there is convenient framework available to quantify closeness and proximity, and which allows naturally for a definition of limit or tendency for sequences. In a general topological space missing this skeletal feature, convergence must be defined.

This post will assume only some familiarity with sequences as mathematical objects and, of course, the concepts mentioned in Part 1. For a thorough treatment of sequences, I recommend Mathematical Analysis by Tom M. Apostol.

## Neighborhoods

Suppose $(X,\mathscr{T})$ is a given topological space, and nothing more is known. At our disposal so far are only open sets (elements of $\mathscr{T}$), and so it is on these a concept of vicinity relies.

Definition. Given a topological space $(X,\mathscr{T})$, a neighborhood of a point $x\in X$ is an open set which contains $x$.

That is, we say an element $T\in\mathscr{T}$ such that $x\in T$ is a neighborhood1 of $x$. To illustrate, take the examples from my previous post.

#### The Trivial Topology

When the topology in question is the trivial one: $\{\emptyset,X\}$, the only nonempty open set is $X$ itself, hence it is the only neighborhood of any point $x\in X$.

#### The Discrete Topology

Take $X=\{2,3,5\}$ and $\mathscr{T}$ to be the collection of all subsets of $X$:

 $\emptyset$ $\{2\}$ $\{3\}$ $\{5\}$ $\{2,3\}$ $\{2,5\}$ $\{3,5\}$ $\{2,3,5\}$

Then, for, say $x=5$, neighborhoods include $\{5\}$, $\{2,5\}$, $\{3,5\}$, and $\{2,3,5\}$.

#### The Standard Topology on $\mathbb{R}$

The standard topology on $\mathbb{R}$ is defined to be the family of all sets of real numbers containing an open interval around each of its points. In this case, there are infinitely2 many neighborhoods of every real number. Taking $x=\pi$ for instance, then $(3,4)$, $(-2\pi,2\pi)$, and even

$$\bigcup_{n=1}^{\infty}\left(\pi-\frac{1}{n},\pi+\frac{1}{n}\right)$$

are all neighborhoods of $\pi$.

Remark. A special type of neighborhood in the standard topology is the symmetric open interval. Given a point $x$ and a radius $r>0$, the set

$$(x-r,x+r)=\{y\in\mathbb{R}\mathrel{:}|x-y|

is a neighborhood of $x$. These sets form what is referred to as a basis for the standard topology and are important to definition of convergence in $\mathbb{R}$ as a metric space.

## Convergence

“…the topology of a space can be described completely in terms of convergence.” —John L. Kelley, General Topology

At this point in our discussion of topological spaces, the only objects available for use are open sets and neighborhoods, and so it is with these that convergence of a sequence are built3.

Definition. A sequence $(\alpha_n)$ in a topological space $(X,\mathscr{T})$ converges to a point $L\in X$ if for every neighborhood $U$ of $L$, there exists an index $N\in\mathbb{N}$ such that $\alpha_n\in U$ whenever $n\geq N$. The point $L$ is referred to as the limit of the sequence $(\alpha_n)$.

Visually, this definition can be thought of as demanding the points of the sequence cluster around the limit point $L$. In order for the sequence $(\alpha_n)$ to converge to $L$, it must be the case that after finitely many terms, every one that follows is contained in the arbitrarily posed neighborhood $U$.

As you might expect, the class of neighborhoods available has a dramatic effect on which sequences converge, as well as where they tend. Just how close to $L$ are the neighborhoods built around it in the topology?

We will use the example topologies brought up so far to exhibit the key characteristics of this definition, and what these parameters permit of sequences.

#### The Trivial Topology

In case it was to this point hazy just how useful the trivial topology is, sequences illustrate the issue nicely. For the sake of this presentation, take the trivial topology on $\mathbb{R}$. There is precisely one neighborhood of any point, namely $\mathbb{R}$ itself. As a result, any sequence of real numbers converges, since every term belongs to $\mathbb{R}$. Moreover, every real number is a limit of any sequence. So, yes, the sequence $(5,5,5,\ldots)$ of all $5$‘s converges to $\sqrt{2}$ here.

#### The Discrete Topology

Whereas with the trivial topology a single neighborhood exists, the discrete topology is as packed with neighborhoods as can be. So, as the trivial topology allows every sequence to converge to everything, we can expect the discrete topology to be comparatively restrictive. Taking the set $\{2,3,5\}$ with the discrete topology as mentioned above, we can pinpoint the new limitation: every set containing exactly one point is a neighborhood of that point. Notice the sets4 $\{2\}$, $\{3\}$, and $\{5\}$ are all open sets.

What does this mean? Any sequence that converges to one of these points, say $3$, must eventually have all its terms in the neighborhood $\{3\}$. But that requires all convergent sequences to be eventually constant! This seems to be a minor issue with the finite set $\{2,3,5\}$, but it presents an undesirable, counter-intuitive problem in other sets.

Take $\mathbb{R}$ with the discrete topology, for example. Under these rules, the sequence

$$(\alpha_n)=\left(\frac{1}{n}\right)=\left(1,\frac{1}{2},\frac{1}{3},\frac{1}{4},\ldots\right),$$

though expected to converge to $0$, does not converge at all.

So, the discrete topology is too restrictive, and the trivial topology lets us get away with anything. Fortunately, a happy middle ground exists by being a little more selective with neighborhoods.

#### The Standard Topology

By requiring an open set to contain an open interval around each of its points, it is impossible that a singleton be an open set. Therefore a singleton cannot be a neighborhood, and we eliminate the trouble posed by the discrete topology. Yet every open interval around a real number $L$ contains a smaller one, and each of these is a neighborhood.

This effectively corrals the points of any convergent sequence, requiring the distance between the terms and the limit to vanish as $n$ increases. Take again the sequence

$$(\alpha_n)=\left(\frac{1}{n}\right)=\left(1,\frac{1}{2},\frac{1}{3},\frac{1}{4},\ldots\right).$$

We suspect $(\alpha_n)$ converges to $0$, but this requires proof. Therefore, we must consider an arbitrary neighborhood of $0$, and expose the index $N\in\mathbb{N}$ such that all terms, from the $N$th onward, exist in that neighborhood.

Suppose $U$ is a given neighborhood of $0$, so that $U$ contains an open interval surrounding $0$. Without loss of generality, we may assume this interval is symmetric; that is, the interval has the form $(-r,r)$ for some radius $r>0$. Take $N$ to be any integer greater than $\tfrac{1}{r}$. Then, whenever $n\geq N$,

$$\alpha_n = \frac{1}{n} \leq \frac{1}{N} < \frac{1}{1/r} = r.$$

But this means $\alpha_n\in(-r,r)\subset U$ so long as $n\geq N$. Since we chose $U$ arbitrarily, it follows $(\alpha_n)$ converges to $0$.

## Conclusion

The behavior of a sequence in a given set can change rather drastically depending on the network of neighborhoods the topology allows. However, with careful construction, it is possible to have all the sequential comforts of metric spaces covered under a briefly put definition.

My next post in this series will push the generalization of these concepts much further, by relaxing a single requirement. In order to define convergence in the preceding discussion, the set of indices $\mathbb{N}$ was important not for guaranteeing infinitely many terms, but instead for providing order. This allows us to speak of all terms subsequent to one in particular. It turns out that if we simply hold on to order, we can loosen the nature of the set on which it is defined. That is the key to Moore-Smith Convergence, to be visited next.

Building a Ground Floor: Topology Basics Pt. 1

## Building a Ground Floor: Topology Basics Pt. 1

Like some other terms in mathematics (“algebra” comes to mind), topology is both a discipline and a mathematical object. Moreover like algebra, topology as a subject of study is at heart an artful mathematical branch devoted to generalizing existing structures like the field of real numbers for their most convenient properties. It is also a favorite subject of mine, ever since my first introduction to it. This is due in large part to its exceedingly simple first principles, which make the wealth of expansion they allow all the more impressive.

It is my intent to discuss some of these starting points here, in the first of a short series of posts toward the goal of presenting one of my favorite results arising in topology: Moore-Smith convergence, a vast extension of the notion of the limit of a sequence. My representation here follows the explanation given by John L. Kelley in his classic text General Topology, which I recommend to all curious readers.

## What is a topology?

Definition. By topology is meant any collection $\mathscr{T}$ of sets satisfying two conditions:

$$\begin{array}{lrcl}\text{(1)}&A,B\in\mathscr{T}&\Rightarrow&A\cap B\in\mathscr{T};\\\text{(2)}&\mathscr{C}\subset\mathscr{T}&\Rightarrow&\bigcup\{C\in\mathscr{C}\}\in\mathscr{T}\end{array}$$

It is worthwhile to break this definition down. Condition $(1)$ requires that the intersection of any two elements of the collection $\mathscr{T}$ must itself be a member of $\mathscr{T}$. Condition $(2)$ states that the union of any subcollection of $\mathscr{T}$ must also belong to $\mathscr{T}$. These are referred to as closure to finite intersection and closure to arbitrary union, respectively, in some texts.

Notably, the definition speaks only of a collection of sets with no specification beyond the two conditions. Yet, even with these, one can deduce some further characteristic properties.

Corollary. If $\mathscr{T}$ is a topology, then

$$\begin{array}{ll}\text{(i)}&\emptyset\in\mathscr{T};\\\text{(ii)}&\bigcup\{T\in\mathscr{T}\}\in\mathscr{T}.\end{array}$$

Since $\emptyset\subset S$ for every set $S$, and $\mathscr{T}\subset\mathscr{T}$, it is enough to apply $(2)$ to both of these cases to prove the corollary. In fact, many texts make the definition $X\mathrel{:=}\bigcup\{T\in\mathscr{T}\}$, and refer to the pair $(X,\mathscr{T})$ as the topological space defined by $\mathscr{T}$.

This way, the space is given its character by way of the scheme that builds $\mathscr{T}$, rather than the set $X$. It is an important distinction, for many topologies are possible on a given set. With that, we can look at some examples.

## From Trivial to Complicated

### 1. The Trivial Topology

Based on the corollary just presented, it is enough to gather a given set $X$ and the empty set $\emptyset$ into a collection $\{\emptyset,X\}$ and have created a topology on $X$. Because $X$ and $\emptyset$ are its only members, the collection is easily closed to arbitrary union and finite intersection of its elements. This is known as the trivial or indiscrete topology, and it is somewhat uninteresting, as its name suggests, but it is important as an instance of how simple a topology may be. As per the corollary, every topology on $X$ must contain $\emptyset$ and $X$, and so will feature the trivial topology as a subcollection.

### 2. The Discrete Topology

For this example, one can start with an arbitrary set, but in order to better illustrate, take the set of the first three primes: $\{2,3,5\}$. Suppose we consider the collection of all possible subsets of $\{2,3,5\}$. This is also referred to as the power set of $\{2,3,5\}$, and denoted $\wp(\{2,3,5\})$. Fortunately, the set is small enough to list exhaustively1. Here they are listed from top-to-bottom in order of increasing inclusion:

 $\emptyset$ $\{2\}$ $\{3\}$ $\{5\}$ $\{2,3\}$ $\{2,5\}$ $\{3,5\}$ $\{2,3,5\}$

Note these are all possible subsets of $\{2,3,5\}$. It is clear any union or intersection of the pieces in the table above exists as an entry, and so this meets criteria $(1)$ and $(2)$. This is a special example, known as the discrete topology. Because the discrete topology collects every existing subset, any topology on $\{2,3,5\}$ is a subcollection of this one.

For example, taking the sets

$$\emptyset,\quad\{5\},\quad\{2,3\},\quad\{2,3,5\}$$

from the collection in the table is enough to produce a topology2.

Remark. Given a topological space $(X,\mathscr{T})$, the elements of $\mathscr{T}$ are referred to as open sets. This nomenclature is motivated in the next example.

### 3. $\mathbb{R}$ and Open Intervals

This example will be more constructive than the previous ones. Consider the set of real numbers, $\mathbb{R}$. Let us define a special collection $\mathscr{T}$ of subsets of real numbers the following way: a set $T$ belongs to $\mathscr{T}$ if, and only if, for every $x\in T$, there exist real numbers $a$ and $b$ such that $x\in(a,b)$ and $(a,b)\subset T.$ That is, we say $T\in\mathscr{T}$ to mean $T$ contains an open interval around each of its elements.

It is good practice to take the time to prove this collection defines a topology on $\mathbb{R}$. To do so, it must be shown that $\bigcup\{T\in\mathscr{T}\}=\mathbb{R}$, and that $\mathscr{T}$ meets conditions $(1)$ and $(2)$.

Proof. To show $\bigcup\{T\in\mathscr{T}\}=\mathbb{R}$, it must be verified that $\bigcup\{T\in\mathscr{T}\}\subset\mathbb{R}$ and $\mathbb{R}\subset\bigcup\{T\in\mathscr{T}\}$. The first containment follows by defining every $T\in\mathscr{T}$ as a subset of $\mathbb{R}$ to begin with, so the reverse containment is all that is left. Let $x\in\mathbb{R}$ be given. Then certainly $x\in(x-1,x+1)$, and surely $(x-1,x+1)\in\mathscr{T}$, as it contains an open interval around all its points by its very design. Thus $x\in\bigcup\{T\in\mathscr{T}\}$.

On to proving $\mathscr{T}$ satisfies $(1)$ and $(2)$. For $(1)$, let $A,B\in\mathscr{T}$ be given and suppose3 $x\in A\cap B$. This holds if, and only if, $x\in A$ and $x\in B$. Since $A$ and $B$ both belong to $\mathscr{T}$, there exist real numbers $a$, $b$, $c$, and $d$ such that $x\in(a,b)\subset A$, and $x\in(c,d)\subset B$. But this means $x\in(a,b)\cap(c,d)$. Fortunately, two intervals of real numbers may only overlap in one way: this means either $c or $a. Without loss of generality, suppose it is the former case, that $c. Then $(a,b)\cap(c,d)=(c,b)$, and it is so that $x\in(c,b)$, an open interval contained in $A\cap B$ (precisely as desired), and it follows $A\cap B\in\mathscr{T}$.

To show $(2)$ is much easier. Let $\{T_\alpha\}_{\alpha\in\mathscr{A}}$ be a collection4 of sets belonging to $\mathscr{T}$, and suppose $x\in\bigcup_{\alpha\in\mathscr{A}}T_\alpha$. Then there exists an index, say $\alpha_0\in\mathscr{A}$, such that $x\in T_{\alpha_0}$. Since $T_{\alpha_0}\in\mathscr{T}$, there exist real numbers $a$ and $b$ such that $x\in(a,b)\subset T_{\alpha_0}$. But this means $x\in(a,b)\subset\bigcup_{\alpha\in\mathscr{A}}T_\alpha$. Since $x$ was chosen arbitrarily, it follows $\bigcup_{\alpha\in\mathscr{A}}T_\alpha\in\mathscr{T}$.

The proof above shows $(\mathbb{R},\mathscr{T})$ is a topological space; the collection $\mathscr{T}$ is referred to as the standard topology on $\mathbb{R}$. The open sets in this space are all the subsets of real numbers contained in open intervals. Fittingly, then, open intervals are open sets in the standard topology.

## Conclusion

This first post is meant to express the simple starting points of topology as a subject of study. It only takes the two criteria mentioned here to define a topology of sets, and yet an entire realm of theory rests upon them. This is a recurring theme in topology, algebra, and mathematics in general. Building the fully-featured universes that hold the answers for more specific inquiry: the complete ordered field of real numbers5 $\mathbb{R}$, the space $\mathcal{C}^{\infty}(\mathbb{C})$ of infinitely differentiable functions $f\mathrel{:}\mathbb{C}\to\mathbb{C}$, the class of all real-valued Lebesgue-integrable functions on $\mathbb{R}$, each of these requires a well-made foundation.

The next post in this series will cover the nature of sequences in topological spaces, particularly those instances where the convenient features afforded by the real numbers are no longer available. With the metric space structure stripped away, how does one define convergence and limit of sequences? What does it mean for elements in topological spaces to be close when distance is otherwise without definition?

Carnival of Mathematics 154

## Carnival of Mathematics 154

The Math Citadel is happy to host Carnival of Mathematics edition $154$, where we take a look at math blog posts from all around. For all those interested in Carnival of Mathematics future and past, visit The Aperiodical.

To get things started, we present a few interesting properties of $154$, in keeping with Carnival of Mathematics custom.  Firstly, $154$ is a palindrome when written in bases $6$, $8$, $9$, and $13$:

$$154=414_6=282_8=181_9=\text{BB}_{13}.$$

The nearest primes to 154 are 151 and 157.  Since 154 is equidistant from these, that is,

$$|154-151|=3=|154-157|,$$

154 is referred to as an interprime number

Finally, a triangle rotated internally 154 times1

Now on to this month’s post selection.

First off, Anthony Bonato discusses the hierarchy of infinities and formative study in axiomatic set theory that brought about The Continuum Hypothesis, in a post by the same name at The Intrepid Mathematician. Included is a brief account of mathematicians Gödel and Cohen and their respective constructions of set theory to uphold or prove false the hypothesis.

Next, Arvind Rao examines the case of defining a circle by three non-colinear points in the plane in 3 Points Make a Circle. The evidence in the affirmative that any such three points do indeed define a circle is presented via straight edge and compass as well as programmatically2.

In abstract algebra, we include a post of our own, All the Same Opposites, which demonstrates some of the many forms taken by $\text{GF}(2)$ and the significance of their structural equivalence.

At Quanta Magazine, Patrick Honner presents the mathematical reasoning behind vaccines in How Math (and Vaccines) Keep You Safe From the Flu, comparing linear and exponential growth rates as basis for understanding how illnesses spread. And Kevin Hartnett discusses the Navier-Stokes Equations and why these merit a spot on the list of Millennium Problems in mathematics in What Makes the Hardest Equations in Physics So Difficult?

We close with probability theory. Our own Dr. Rachel Traylor writes about the rare (and historically neglected) “singular continuous distributions” in The Red-Headed Step-Distributions. Lastly, an older post: Will Kurt at Count Bayesie gives a detailed look into Kullback-Leibler divergence and its information theoretic roots in Kullback-Leibler Divergence Explained.

All the Same Opposites

## All the Same Opposites

###### Editor’s note: see this appendix for supporting proofs.

Fields are among the most convenient algebraic structures, preserving much of the arithmetic we know and love from familiar fields like the rationals $\mathbb{Q}$ and the real numbers $\mathbb{R}$.

Now, it is unnecessary that a set possess infinitely many elements to possibly constitute a field (under the right binary operations). Indeed, Dr. Rachel Traylor recently invited readers to finite field theory by way of GF(4), the field with four elements. In this post, I propose to discuss the simplest of all possible fields, namely $\text{GF}(2)$.

## What is $\text{GF}(2)$?

As Dr. Traylor explains in her post on GF(4), a field is built via a nonempty set $\mathbb{F}$ and two binary operations, typically denoted $+$ and $\cdot$ when no further specifying is needed.  Speaking abstractly, $\text{GF}(2)$ is the case when $\mathbb{F}$ is taken to be any set of two elements, say, $\{a,b\}$, satisfying the operation tables below1.

 $+$ $a$ $b$ $a$ $a$ $b$ $b$ $b$ $a$

 $\cdot$ $a$ $b$ $a$ $a$ $a$ $b$ $a$ $b$

## So, what makes $\text{GF}(2)$ simplest?

Briefly put, there is no field with fewer elements than has $\text{GF}(2)$. Why is this so? The operations $+$ and $\cdot$ each require an identity element, and these must be distinct. As a result, any field must contain at least two elements. And, as it happens, those are enough to define a fully-fledged field.

As it is the most basic of fields, it might be expected that $\text{GF}(2)$ is only trivially interesting, or only appears in coverage of the subject on the theoretical front. (No judgment here; sometimes the simplest examples of structures are illustrative of very little. See the trivial topology or trivial algebra of sets on a set $X$.) However, $\text{GF}(2)$ is the mathematical representation of many dualities we encounter on a daily basis. We will take a look at some of the prominent cases.

## Even & Odd

Let’s denote by ${\bf e}$ and ${\bf o}$ arbitrarily chosen even and odd integers, respectively. Truly, in this way, we are defining two equivalence classes on the set of integers, $\mathbb{Z}$, by way of modulo 2 addition.

Reminder, we say an integer $m$ is even to mean there exists an integer $k$ such that $m=2k$, and $m$ is odd to mean there exists an integer $j$ such that $m=2j+1$. Collect these in a set: $\{{\bf e},{\bf o}\}$, and consider this set under ordinary addition $+$ and multiplication $\cdot$ .

These results are summarized in tables.  For instance, in the $+$ table below, we can think of ${\bf e}+{\bf o}$ as

$$(2k)+(2j+1)=2(k+j)+1\equiv{\bf o},$$

since $k+j$ is also an integer.

 $+$ ${\bf e}$ ${\bf o}$ ${\bf e}$ ${\bf e}$ ${\bf o}$ ${\bf o}$ ${\bf o}$ ${\bf e}$

 $\cdot$ ${\bf e}$ ${\bf o}$ ${\bf e}$ ${\bf e}$ ${\bf e}$ ${\bf o}$ ${\bf e}$ ${\bf o}$

From these tables, ${\bf e}$ serves as additive identity for the field, and ${\bf o}$ the multiplicative identity.

## Binary

Even more readily encountered than even and odd is binary arithmetic. We have a series of posts here on the theory of coding, and all of it rests on $\{0,1\}$ when taken with the operations of $+_2$ (addition modulo 2) and $\cdot$ .

 $+_2$ $0$ $1$ $0$ $0$ $1$ $1$ $1$ $0$

 $\cdot$ $0$ $1$ $0$ $0$ $0$ $1$ $0$ $1$

The similarities to the tables for $(\{{\bf e},{\bf o}\},+,\cdot)$ are evident.  With binary, $0$ is the additive (modulo 2) identity, and $1$ is the multiplicative identity.  This seems to follow naturally.  After all, $0$ is an even integer, and $1$ is an odd integer, so these elements are easily believed to work alike.  With that in mind, I move to an example with a less immediate connection.

## Truth & Falsehood

Now I want to consider the set $\{{\bf T},{\bf F}\}$ of truth values2 true and false, respectively.

It is worthwhile to stop and think on which operations make sense for this set. Two are needed to construct a field. Just as even and odd integers may be combined by adding and multiplying, mathematical statements may be combined via disjunction (“or,” $\vee$), conjunction (“and,” $\wedge$), and implication (“if, then,” $\Rightarrow$).  For this case, I am interested in the special “exclusive or,” also called XOR3, denoted by $\oplus$, and conjunction.

 $\oplus$ ${\bf F}$ ${\bf T}$ ${\bf F}$ ${\bf F}$ ${\bf T}$ ${\bf T}$ ${\bf T}$ ${\bf F}$

 $\wedge$ ${\bf F}$ ${\bf T}$ ${\bf F}$ ${\bf F}$ ${\bf F}$ ${\bf T}$ ${\bf F}$ ${\bf T}$

## Opposite… in Precisely the Same Way

The only thing setting these examples apart is an exchange of symbols. Truly,

$a$, ${\bf e}$, $0$, and ${\bf F}$,

are interchangeable, just as are

$b$, ${\bf o}$, $1$, and ${\bf T}$.

What matters is that these individual elements behave in exactly the same way with respect to their operations.  In the language of algebra, it is said

$(\{a,b\},+,\cdot)$

$(\{{\bf e},{\bf o}\},+,\cdot)$

$(\{0,1\},+_2,\cdot)$, and

$(\{{\bf F},{\bf T}\},\oplus,\wedge)$

are isomorphic, that is, structurally equivalent.

Definition (Isomorphism of fields). We say two fields $(\mathbb{F},+,\times)$ and $(\mathbb{H},\boxplus,\boxtimes)$ are isomorphic to mean there exists a function $\varphi\mathrel{:}\mathbb{F}\to\mathbb{H}$ such that $\varphi$ is one-to-one, onto, and, for all $x,y\in\mathbb{F}$,

$$\varphi(x+y)=\varphi(x)\boxplus\varphi(y);\quad\varphi(x\times y)=\varphi(x)\boxtimes\varphi(y).$$

In the case of $\text{GF}(2)$, proving isomorphism amounts to simply defining the function that swaps the symbols as needed.  For example, to show $(\{{\bf e},{\bf o}\},+,\cdot)$ and $(\{0,1\},+_2,\cdot)$ are isomorphic, define4$\varphi\mathrel{:}\{{\bf e},{\bf o}\}\to\{0,1\}$ by putting $\varphi({\bf e})=0$ and $\varphi({\bf o})=1$.

## Concluding Remarks

Mathematics in many respects is the art5 of extension and generalization. The goal is frequently to take an existing object, assume an aerial viewpoint to learn its structure and what makes it work. (This is often carried out by seeing how few pieces of the whole are really needed to accomplish those key characteristics.)

With the right perspective, even these sets of opposites can bear likeness.  I take comfort in that.

Extensions of the Single Server Efficiency Model

## Abstract

Editor’s note: this paper comprises the third chapter of the PhD dissertation by Rachel Traylor. Visit here and here to see chapters one and two, respectively. Herein we further generalize the single server model presented in [3]. In particular, we consider a multichannel server under the cases of both singular and clustered tasks. In the instance of singular tasks, we present a load balancing allocation scheme and obtain a stochastic breakdown rate process, as well as derive the conditional survival function as a result. The model of a multichannel server taking in clustered tasks gives rise to two possibilities, namely, independent and correlated channels. We derive survival functions for both of these scenarios.

## Load Balancing Allocation for a Multichannel Server

### Model Description

Previously, we had assumed that a web server functions as a single queue that attempts to process jobs as soon as they arrive. These jobs originally brought a constant stress $\eta$ to the server, with the system stress reducing by $\eta$ at the completion of each job.

Now, suppose we have a server partitioned into K channels. Denote each channel as $Q_k,$ $k = 1,\ldots,K$. Jobs arrive via a nonhomogenous Poisson process with rate $\lambda(t)$. Upon arrival, each job falls (or is routed) to the channel with the shortest queue length. If all queue lengths are equal or multiple channels have the shortest length, the job will enter one of the appropriate queues with equal probability.

We retain the previous notation for the baseline breakdown rate, or hazard function. This is denoted by $r_0(t)$ and is the hazard function under an idle system. We also retain the assumption that the arrival times $\mathbf{T}$ are independent. In addition, the service times $\mathfrak{W}$ are i.i.d. with distribution $G_W(w)$. We assume that all channels are serving jobs at the same time, i.e., a job can be completed from any queue at any time. We do not require load balancing for service. In other words, any queue can empty with others still backlogged. We also retain the FIFO service policy for each queue.

Since we have now “balanced,” or distributed, the load of jobs in the server, not all jobs will cause additional stress to the system. Suppose all jobs bring the same constant stress $\eta$ upon arrival. Under load balancing, we will define the additional stress to the system as $\eta\max_k|Q_k|$. Figure 1 shows an example server with current stress of $4\eta$.

### Examples

Due to the dynamic nature of arrival times, allocation to queues, and service times, we have many possible configurations of jobs at any point in time. Therefore, the allocation scheme adds an additional layer of variation to the service times and order of service. The placement of jobs in the various queues (and thus the order of service and service times) is wholly dependent on all arrival times and service times of the prior arrivals. The following examples illustrate the effect on the workload stress added to the system in various scenarios.

###### Figure 2(b). Breakdown Rate Process Trajectory

Example 1Suppose for simplicity we have 2 channels. Suppose at the time of observation of the system, 3 jobs have arrived and none have finished. WLOG, suppose job 3 fell into $Q_{1}$. See Figure 2(a). The stress to the system at $t=t_{\text{obs}}$ is $r_{0}(t_{\text{obs}}) + 2\eta$, as shown in Figure 2(b).

Note in Example 1 that Job 2 does not add any additional stress to the system. Job 1 sees an empty queue upon arrival, and $\max_{K}|Q_{K}| = 1$ when it falls into any particular queue. Job 2 arrives as Job 1 is still being processed, and thus the placement of Job 1 forces Job 2 into the empty channel. Since $\max_{K}|Q_{K}|$ is still 1, the stress to the system doesn’t change. Job 3 arrives as Jobs 1 and 2 are in service, and thus its choice of queue is irrelevant due to the configuration of the two queues at $T_{3}$. Regardless of which queue Job 3 falls into, $\max_{K}|Q_{K}| = 2$. Thus the arrival of Job 3 increases the breakdown rate by $\eta$ again.

The next example shows the change in system stress Job 1 from Example 1 when one job has finished processing before $T_{3}$.

###### Figure 3(b). Example 2 Breakdown Rate Process Trajectory

Example 2. Consider the same two-channel system from Example 1. However, now suppose WLOG that $T_{3} < T_{1}+W_{1}$. In other words, service for Job 1 was completed before Job 3 arrived. Hence Job 3 will fall into the opposite queue as Job 2. The stress to the system at the time of observation would be $r_{0}(t) + \eta$. See Figures 3(a) and 3(b).

In this scenario, the workload due to Job 3 does not contribute any additional stress to the server. Also observe that upon completion of Job 1, the workload stress to the server does not decrease, as Job 2 still resides in the system and is being served.

Contrast this behavior with the breakdown rate process given in [3]. In the single-channel, single-server model described in both [1] and [3], each job adds stress to the server upon arrival. Under the load balancing allocation scheme, the additional stress to the server depends on the arrival and service times of all prior jobs. From a stochastic perspective, this breakdown rate process has full memory.

The examples above illustrate that $\max_{K}|Q_{K}|$ depends on the intersection of the intervals $I_{j} = [T_{j}, T_{j} + W_{j}]$, $j = 1,\ldots,N(t)$. The next section details the methodology to obtain the configuration of jobs in the server at time $t$ by deccomposition of $\bigcup_{j=1}^{N(t)}I_{j}$ into disjoint atoms and derives the stochastic breakdown rate process under the load balancing allocation scheme.

### Breakdown Rate Process and Conditional Survival Function

Let $\epsilon = (\epsilon_{1},\ldots,\epsilon_{N(t)})$ be a $N(t)$-tuple whose components $\epsilon_{j}\in\{\emptyset, c\}$, where $\emptyset$ denotes the empty set, and $c$ denotes the complement of the set. Let $E = \{\epsilon \mathrel{:} \epsilon_{j}\in\{\emptyset, c\}\}$ denote the set of all possible $\epsilon$, excepting $\epsilon = (c,\ldots,c)$. Then by Lemma 1 (see Appendix),

$$\bigcup_{j=1}^{N(t)}I_{j} = \bigcup_{\epsilon \in E}\bigcap_{j=1}^{N(t)}I_{j}^{\epsilon_{j}}$$

Remark. $\cap_{j=1}^{N(t)}I_{j}^{\epsilon_{j}}$ indicates which jobs are still in the server at time $t$. The union is disjoint; thus only one $\epsilon$ will describe the server configuration at any given time $t$. For example, if 3 jobs have arrived to the server at time $t_{\text{obs}}$, $|E| = 3\times 2 - 1 = 5$. These may be enumerated:

• $I_{1} \cap I_{2} \cap I_{3}$

• $I_{1}^{c} \cap I_{2}^{c} \cap I_{3}$

• $I_{1} \cap I_{2}^{c} \cap I_{3}^{c}$
• $I_{1}^{c} \cap I_{2} \cap I_{3}$

• $I_{1}^{c} \cap I_{2} \cap I_{3}^{c}$

As an illustration, refer to Example 1. All three jobs are in the system at $t = t_{\text{obs}}$ (that is, none have completed service), and thus $t_{\text{obs}} \in I_{1} \cap I_{2} \cap I_{3}$. Expanding, $t_{\text{obs}} \in [T_{1}, T_{1}+W_{1}]$, $[T_{2}, T_{2}+W_{2}]$, and $[T_{3}, T_{3}+W_{3}]$.

Compare the case with that of Example 2. In this case, three jobs have arrived at $t = t_{\text{obs}}$, but Job 1 has finished by $t_{\text{obs}}$. Thus $t_{\text{obs}} \not\in I_{1}$, but since Jobs 2 and 3 are still in the system, $t_{\text{obs}} \in I_{2} \cap I_{3}$. Thus $t_{\text{obs}} \in I_{1}^{c} \cap I_{2} \cap I_{3}$.

Now, since the additional workload stress to the server is a multiple of $\eta\max_{K}|Q_{K}|$, it remains to derive the appropriate multiplier that accounts for the number of jobs that contribute additional stress to the system.

Let $n = \sum_{j=1}^{N(t)}\mathbf{1}(\epsilon_{j} = \emptyset | \epsilon_{j} \in \epsilon)$ for a particular $\epsilon$, and let $\alpha_{\epsilon}$ be the multiplier that indicates the number of jobs that contribute stress $\eta$ to the system. Under [1] and the generalization in [3], every uncompleted job in the system contributes stress, thus $\alpha_{\epsilon} = n$.

Under the load balancing scheme, $\alpha_{\epsilon} = \lfloor\frac{n+1}{K}\rfloor$, where $K$ is the number of channels in the server. This is due to the allocation scheme’s attempts to evenly distribute jobs across channels. Thus, for Example 1, $n=3$, and $K=2$, meaning $\alpha_{\epsilon} = 2$, as illustrated in Figure 2(b) and for Example 2, $\alpha_{\epsilon} = \lfloor\frac{3+1}{2}\rfloor = 1$, as in Figure 3(b).

Then, the stochastic breakdown rate process under the load balancing allocation scheme is given by
$$\mathcal{B}(t) = r_{0}(t) + \eta\sum_{\epsilon \in E}\alpha_{\epsilon}\mathbf{1}_{I_{1}^{\epsilon_{1}}\cap I_{2}^{\epsilon_{2}} \cap I_{N(t)}^{\epsilon_{N(t)}}}(t)$$

Under this expression, only one indicator function will be nonzero at any given point in time, since all atoms are disjoint. Now, $I_{1}^{\epsilon_{1}} \cap I_{2}^{\epsilon_{2}} \cap \ldots \cap I_{N(t)}^{\epsilon_{N(t)}}$ may be expressed as one interval $[L_{\epsilon},R_{\epsilon}]$, where

\begin{aligned}L_{\epsilon}&=\max\left(\{T_{j}\mathrel{:}\epsilon_{j} = \emptyset\}_{j=1}^{N(t)}\right);\\R_{\epsilon}&=\min\left(\{T_{j} + W_{j}\mathrel{:}\epsilon_{j} = \emptyset \}_{j=1}^{N(t)},\{T_{j}\mathrel{:}\epsilon_{j} = c\}_{j=1}^{N(t)}\right)\end{aligned}

Thus, for a server with $K$ channels under a load balancing routing scheme with all jobs bringing constant stress $\eta$, the breakdown rate process $\mathcal{B}(t)$ may be expressed as

$$\mathcal{B}(t) = r_{0}(t) + \eta\sum_{\epsilon \in E}\alpha_{\epsilon}\mathbf{1}_{[L_{\epsilon}, R_{\epsilon}]}(t)\qquad(*)$$

Thus, the conditional survival function under the load balancing scheme is given by
\begin{aligned}S_{Y|\mathfrak{T},\mathfrak{W}, N(t)}(t|\mathfrak{t},\mathfrak{w}, n) &= e^{-\int_{0}^{t}\mathcal{B}(s)ds}\\&=\bar{F}_{0}(t)\exp\left(-\eta\int_{0}^{t}\sum_{\epsilon \in E}\alpha_{\epsilon}\mathbf{1}_{[L_{\epsilon}, R_{\epsilon}]}(s)ds\right)\\&=\bar{F}_{0}(t)\exp\left(-\eta\sum_{\epsilon\in E}\alpha_{\epsilon}\min(t-L_{\epsilon},R_{\epsilon})\right) \end{aligned}

### Remarks

Finding the survival function of the single-channel environment relied on the independence of the set of arrival times and service times. From $(*)$, the independence is clearly lost. As noted before, the random breakdown process has full memory, and thus is completely dependent upon the entire trajectory up to $t = t_{\text{obs}}$.

## Clustered Tasks in a Multichannel Server

###### Figure 4. Clustered Tasks in a Multichannel Server

The previous multichannel server model in Section 1 implicitly assumed each job comes with one task, and all channels are identical in their ability to serve any task brought by a job. A classic illustration is a block of registers at a retail establishment. Each customer will survey the length of the various queues at each register before choosing the shortest queue. Viewing each of these separate registers as a channel in a single server under these conditions gave rise to the load balancing allocation model detailed in the previous section. This section presents a different interpretation of a multichannel, single-server model.

Suppose a server has multiple channels $Q_{1},\ldots,Q_{K}$, but each channel serves a different type of task. A customer arrives to the server and may select any number from $0$ to $K$ tasks for the server to perform. Said customer will select each possible task $j$ with probability $p_{j}$. Figure 4 illustrates an example of such a situation in which three customers visit the server and each customer picks a different number and set of tasks at random. A customer is considered fully serviced (i.e. the job is complete) upon completion of the last task belonging to that particular customer.

### Model Assumptions

The following mathematical assumptions are made for the multichannel server with clustered tasks:

1. Customers arrive to the server with $K$ channels via a nonhomogenous Poisson process (NHPP) with intensity $\lambda(t)$.
2. The breakdown rate of the idle server is given by $r_{0}(t)$.
3. Each channel corresponds to a different task the server can perform.
4. The selection of each task is a Bernoulli random variable with probability $p_{k}$. Thus the number of tasks selected by each customer is a binomial random variable.
5. The workload stress to the server is a constant multiple $\eta$ of the number of tasks requested by the customer, i.e. the additional stress is given by $\eta N$, where $N$ is the number of tasks requested.
6. The PDF of each channel’s service time is given by $g_{i}(w)$, $i = 1,\ldots ,K$. Since the customer’s service is not complete until all requested tasks have finished, the service life distribution for the customers is given by $\max_{i}G_{i}(w)$.

Under these assumptions, this model is a special interpretation of the random stress environment developed in [3]. In this case, the random workload stress is $\eta N$, where $N$ is a binomial random variable, and the service life distribution $G_{W}(w) = \max_{i}G_{i}(w)$, which may be easily obtained through the mathematical properties of order statistics. Two variations are considered in the  next section: independent channels and correlated channels.

### Independent Channels in a Clustered Task Server

Suppose the selection probabilities for each task in the server are identical, that is, $p_{1}=p_{2}=\ldots=p_{K}=p$. Then $N\sim\text{Bin}(K,p)$. Using Theorem 3 in [3], the survival function of the multichannel server is given in the following theorem.

Theorem 1 (Survival Function of Multichannel Server with Clustered Tasks and Independent Channels). Suppose model assumptions (1)-(6) above are satisfied. In addition, assume $p_{1}=p_{2}=\ldots=p_{K}=p$. Then the survival function of the server is given by
\begin{aligned}S_{Y}(t)&=\bar{F}_{0}(t)\exp\left(-K\eta\left[e^{-\eta t}\left(1-p+pe^{-\eta t}\right)^{K-1}-p(1-p)^{K-1}\right]\right.\\&\qquad\left.\times\int_{0}^{t}m(t-w)\bar{G}_{W}(w)dw\right)\end{aligned}

where $m(x) = \int_{0}^{x}\lambda(s)ds$, $\bar{F}_{0}(t) = e^{-\int_{0}^{t}r_{0}(s)ds}$, $\bar{G}_{W}(w) = 1-G_{W}(w)$, and $G_{W}(w) = \max_{i}G_{i}(w)$.

### Correlated Channels in a Cluster Server

Now suppose the server tasks are correlated, in that the selection of one particular task may affect the selection of any or all of the other tasks. Thus the channels are a sequence of dependent Bernoulli random variables. The construction of dependent Bernoulli random variables is given in [2], and a summary is given.

#### Dependent Bernoulli Random Variables and the Generalized Binomial Distribution

Korzenwioski [2] constructs a sequence of dependent Bernoulli random variables using a binary tree that distributes probability mass over dyadic partitions of $[0,1]$. Let $0\leq\delta\leq1$, $0, and $q=1-p$. Then define the following quantities:
\begin{aligned}q^{+}\mathrel{:=}q+\delta p&\qquad p^{+}\mathrel{:=}p+\delta q\\q^{-}\mathrel{:=}q(1-\delta)&\qquad p^{-}\mathrel{:=}p(1-\delta)\end{aligned}

The quantities above satisfy the following conditions:
\begin{aligned}q^{+}+p^{-}=q^{-}&+p^{+}=q+p=1\\qq^{+}+pq^{-}=q&,\quad qp^{-}+pp^{+}=1\end{aligned}

###### Figure 5. Construction of Dependent Bernoulli Random Variables

Figure 5 shows the construction shows the dependencies. The following examples using coin flips illustrate the effect of the dependency coefficient $\delta$:

Example 1 $(\delta=1)$. For $\delta = 1$, $q^{+} = q+p=1, q^{-} = 0$, $p^{+} = p+q = 1$, and $p^{-} = 0$. Supposing the first coin flip $\epsilon_{1} = 1$. Then every successive $\epsilon_{i}$ will also be $1$. Similarly if $\epsilon_{1} = 0$. Thus the result of the first coin flip completely determines the outcomes of all the rest.

Example 2 $(\delta = 0)$For $\delta = 0$, $q^{+} = q^{-} = q$, and $p^{+} = p^{-} = p$. Thus, the first coin flip (and all subsequent ones) have no effect on the ones that follow.

Example 3 $(\delta = \frac{1}{4})$Suppose $p=q=\frac{1}{2}$. Then $p^{+} = q^{+} = \frac{5}{8}$, and $p^{-} = q^{-} = \frac{3}{8}$. Then the subsequent outcomes $\epsilon_{i}$, $i \geq 2$ are more likely to match the outcomes of $\epsilon_{1}$ than not.

Now suppose $p = \frac{1}{4}$, $q = \frac{3}{4}$. Then $p^{+} = \frac{7}{16}$$p^{-} = \frac{3}{16}$, $q^{+} = \frac{13}{16}$, and $q^{-} = \frac{9}{16}$. In this example of an unfair coin, the dependency coefficient $\delta$ still attempts to skew the results following the first coin flip in favor of the outcome of $\epsilon_{1}$. However, the dependency here heightens the effect of $\epsilon_{1} = 0$ on subsequent flips, and cannot overcome the discrepancy between the probability of success and failure to skew $\epsilon_{i}$, $i \geq 2$ in favor of a $1$ following the outcome of $\epsilon_{1} = 1$.

Using these dependent Bernoulli random variables, [2] presents a Generalized Binomial Distribution for identically distributed but dependent Bernoulli random variables.

Generalized Binomial Distribution

Let $X = \sum_{i=1}^{n}\epsilon_{i}$, where $\epsilon_{i}$, $i = 1,\ldots ,n$ are identically distributed Bernoulli random variables with probability of success $p$ and dependency coefficient $\delta$.  Then

$$P(X=k) = q\tbinom{n-1}{k}(p^{-})^{k}(q^{+})^{n-1-k} + p\tbinom{n-1}{k-1}(p^{+})^{k-1}(q^{-})^{n-1-(k-1)}$$

#### Survival Function of Correlated Channels in a Cluster Server

Suppose the selection of tasks may be modeled by the dependent Bernoulli random variables given in the previous section. That is, suppose the customer selects Tasks 1-$K$ in sequence, and the selection or rejection of Task 1 affects all subsequent tasks by a dependency coefficient $\delta$. From [2], the correlation between task selections $\epsilon_{i}$ and $\epsilon_{j}$ is given by
$$\rho = \text{Cor}(\epsilon_{i},\epsilon_{j})=\begin{cases}\delta,&i=1;j=2,\ldots,K\\\delta^{2},&i\neq j;i,j\geq 2\end{cases}$$

This illustrates the dependency of Tasks 2-$K$ on the outcome of Task 1, and notes that while Tasks 2-$K$ are still correlated with each other, the dependency is much lower. In a similar fashion to the independent channel server, the survival function is derived.

Theorem 2 (Survival Function of Multichannel Server with Clustered Tasks and Dependent Channels). Suppose model assumptions 1-6 are satisfied. In addition, suppose the selection of channels 1-$K$ are determined by identically distributed Bernoulli random variables with dependency coefficient $\delta$ as defined in [2]. Then the survival function of the server is given by
$$S_{Y}(t)=\bar{F}_{0}(t)\exp\left(-\eta\int_{0}^{t}m(t-w)\bar{G}_{W}(w)\mathscr{S}(w)dw\right)$$

where $m(x) = \int_{0}^{x}\lambda(s)ds$, and

\begin{aligned}\mathscr{S}(w)&=\sum_{n=0}^{K}e^{-\eta n w}\sum_{j=0}^{K-n-1}\tbinom{K-1}{n-1, j,K-1-n-j}p^{K-1-j}(1-p)^{j+1}\delta^{K-1-n-j}(1-\delta)^{n}\\&\quad+\sum_{n=0}^{K}ne^{-\eta nw}\sum_{i=0}^{n-1}\tbinom{K-1}{K-1-n,i,n-1-i}p^{i+1}(1-p)^{K-n}\delta^{n-1-j}(1-\delta)^{K-n-j}\end{aligned}

## Conclusion

The generalized model of a server under random workload proposed in [3] admits further expansion by way of relaxing the assumption that incoming tasks have exactly one queue to enter on arrival. In considering a server partitioned into several channels, a cost is incurred, namely that additional stress to the server is dependent upon arrival and service times of all previous jobs. However, even under these circumstances, we may obtain a breakdown rate process and satisfactory conditional survival function for the server, and the door is opened to further discussion. By examining the multichannel server, we consider the interrelations of the channels themselves, and derive survival functions to meet the case when the channels are independent as well as when they are correlated.

### References

[1] Ki Hwan Cha and Eui Yong Lee. A stochastic breakdown model for an unreliable web server system and an optimal admission control policy. Journal of Applied Probability, 48(2):453–466, 2011.

[2] Andrzej Korzeniowski. On correlated random graphs. Journal of Probability and Statistical Science, 11:43–58, 2013.

[3] R. Traylor. Stochastic reliability of a server under random workload. Academic Advances of the CTO, 1, 2017.

## A Note on Russell’s Paradox

Topics in mathematics very frequently rely on set theory (I am hard-pressed to quickly think of one that doesn’t). Set theory itself is a very abstract area of study. Even so, mathematicians built it on axioms (self-evident truths taken as fundamental starting points), and, of course, debate continues to this day as to which axioms are the best/most efficient.

As mathematicians delight in doing, one can begin with these axioms and search for boundaries. That is, ask the question, “What can I get away with while still following these rules?” And paradoxes can follow. Here set theory is no exception. But paradoxes aren’t the end of the world; on the contrary, they can be nicely illuminating. Taking just one of the first principles of axiomatic set theory, one can arrive at a substantial fact of mathematics, while in the course of uncovering a well-worn conundrum.

This post, of course, concerns Russell’s Paradox, as covered in Naïve Set Theory by Paul Halmos. I highly recommend the book to readers who enjoy the discussion to follow; it is a wonderfully readable treatment of axiomatic set theory.

## A Few Definitions & Notations

Throughout, the word set refers to a general collection of objects; note these objects may themselves be sets. The symbol $\in$ denotes belonging to a set, and will translate $x\in A$ to mean any of the following equivalent expressions in English:

• $x$ is an element of $A$;
• $x$ belongs to $A$;
• $x$ is a member of $A$.

Similarly, $\notin$ denotes not belonging. A set will also be noted with curly braces, e.g., $\{$all real numbers$\}$. Sometimes variables are used as well, as in

$$\{x\in S\mathrel{:}x>5\},$$

read as the set of all $x$ belonging to $S$ such that $x$ is greater than 5.

Finally, a contradiction is any mathematical statement which is always false. (Contradictions are not allowed to exist.1) An example of the ‘canonical contradiction’ insists a statement and its negation are true all at once. Take the clause ‘$n>8$ and $n\leq8$,’ for instance. These properties cannot simultaneously hold for any number $n$, otherwise said the statement is false for all numbers $n$.

## Axiomatic Set Theory

As with most branches of mathematics, set theory has been axiomatized, more than once, as mentioned earlier. In order to come to the mathematical understanding I focus on in this post, we only need one of these building blocks.

Axiom (Specification). Given a set $A$ and a property $p$, there is a set $B$ whose elements are exactly those elements of $A$ for which $p$ is true.

This axiom lays the ground for defining sets in terms of existing ones. All partitioning, categorizing, classifying, and so forth make use of this. As an example, we can speak of the set of all positive integers, $\mathbb{N}=\{1,2,3,\ldots\}$, and specify a subset like the set of all even positive integers:

$$\{2,4,6,\ldots\}=\{n\in\mathbb{N}\mathrel{:}n\text{ is even}\}.$$

Put plainly, given a set, you can create a new one (referred to as a subset) meeting criteria of your choosing. And it’s the choosing that makes set theory so fun. With that, let’s make a choice, and see what we can get away with while following this rule.

First we suppose a set $A$ is given; here $A$ can be any set you like (odd positive integers, irrational numbers, even the set of your long-term goals).

Next define the property $p(x)$ to mean $x\notin x$. That is, $p(x)$ is true if, and only if, $x\notin x$.

Let’s stop for a moment and consider $p(x)$. You might think it nonsensical, but it can be seen to hold even in the physical sense, e.g., a box is not contained in itself, even when it contains several smaller boxes. For a math example, the set of natural numbers is not itself a natural number, so $\mathbb{N}\notin\mathbb{N}$. (Perhaps less reasonable is the opposite $x\in x$, but that can be saved for another discussion.

With a touch of grandeur, you may now INVOKE the Axiom of Specification to create a new set

$$B=\{x\in A\mathrel{:}p(x)\}=\{x\in A\mathrel{:}x\notin x\}.$$

To clarify, this means that in order for it to be true that $y\in B$, then it must be the case that $y\in A$ and $y\notin y$. As you might have guessed, the definition of $B$ is less interesting than what we can do with it. On that note, a query:

Is it possible that $B\in A$?

This question, relatively easily posed, quickly leads to trouble. Let’s examine.

By specifying the new set $B$ by way of the property $p(x)$, we split membership in $A$. If $y\in A$, then either $y\in B$ or $y\notin B$, by definition.

Now we apply that to $B$, and investigate both cases. If we suppose $B\in A$:

• Case 1. If $B\in B$, then $p(B)$ is untrue. By definition of $p(B)$, it follows $B\notin B$, a contradiction.
• Case 2. If $B\notin B$, then $p(B)$ is true. But this means $B\in B$, a contradiction again.

As all our options yield contradictions, the assumption that $B\in A$ in the first place must have been a faulty one2. That is, it can only be the case that $B\notin A$.

The arrival at a contradiction under all possible cases above is known as Russell’s Paradox, attributed to its first recorded discoverer, the logician Bertrand Russell.

## Conclusion (Why does this matter?)

The abstract nature of set theory makes it somewhat easy to regard Russell’s Paradox as more a minor mathematical curiosity/oddity than, say, The Fundamental Theorem of Calculus. In particular, the discussion above only makes use of nondescript sets and minimally defined elements thereof.

Unlikely as it may seem, the importance lies in leaving the set $A$ arbitrary. Remember, we said let $A$ be any set you like. And still we found something that cannot possibly belong to that arbitrary set. Even leaving almost nothing to specifics, we created something necessarily separate.

What this means is that no set which contains everything can exist. There is no ‘universal’ set in mathematics. A frame of reference cannot be truly complete. Indeed, that speaks even to the axiomatic system by which we derived this very fact!3 No matter how ‘large’ the set, there exists something apart from it. On a further philosophical lean, there is always room to take a grander view and think outside the present mode, to pick up something new.