### Browsed byCategory: Spines

Exploiting Chemistry for Better Packet Flow Management 2: Formal Model

## Exploiting Chemistry for Better Packet Flow Management 2: Formal Model

This post is the second breaking down a report/review of a technical report by Meyer and Tschudin [11] that modifies the formal notion of an artificial chemistry and creates an artificial packets chemistry with the goal of designing better flow management by exploiting the natural behavior of chemical reactions.

Note: for those more interested in the application and implementation review and discussion, this section can be skipped.

## Formal Model of Artificial Packet Chemistry

### Artificial Chemistry

The notion of an formal artificial chemistry has been around for some time. There is an excellent paper by Dittrich, Ziegler, and Banzhaf that gives a survey of the work in this area[1]. Put simply, an artificial chemistry is a tuple $(\mathcal{S},\mathcal{R},\mathcal{A})$ where $\mathcal{S} = \{s_{1},s_{2},\ldots s_{n}\}$ is a set of all valid molecules, $\mathcal{R}$ is a set of rules $r$ that describe interactions between molecules, and $\mathcal{A}$ is the reactor algorithm that determines how the set of rules in $\mathcal{R}$ is applied to a collection of molecules termed $\mathcal{P}$. $\mathcal{P}$ may be a reaction vessel, reactor, or “soup” (as Dittrich et al call it). It’s also notable that $\mathcal{P}$ cannot be identical to $\mathcal{S}$.1

To expand a bit more, we’ll note that the rules $r \in \mathcal{R}$ all take the form

$$s_{1} + s_{2} + \ldots s_{n}\longrightarrow \tilde{s}_{1}+\tilde{s}_{2}+\ldots \tilde{s}_{m}$$

where all $s, \tilde{s} \in \mathcal{S}$. These rules are fairly abstract, and don’t explicitly seem to describe just a reactant to product type reaction. These can be collisions or other types of interactions. The set $\mathcal{S}$ of valid molecules presumably can be partitioned into disjoint subsets of different species of molecule as well, though the representation in [1] is more general.

Regarding the reactor algorithm $\mathcal{A}$, Dittrich et al [1]  give several different descriptions/approaches by which it can be defined, depending on whether each molecule is treated explicitly, or all molecules of a type are represented by a single number (frequency or concentration):

1. Stochastic Molecular Collisions.  Every single molecule is worked with, where a sample of molecules from the reaction vessel $\mathcal{P}$ is drawn and the algorithm checks to see if a particular rule $r \in \mathcal{R}$ applies.
2. Differential Rate Equations: This approach seeks to describe the dynamics of a chemical system using concentrations of molecular species. The rules under this algorithm take a species approach: $$r: a_{1}s_{1} + a_{2}s_{2} + \ldots a_{N}s_{N} \longrightarrow b_{1}s_{1} + b_{2}s_{2} + \ldots + b_{N}s_{N}$$
Here, the $s_{i}$‘s are species, not individual molecules. The coefficients are stoichiometric factors of the reaction. They are simply indicator functions to denote whether species $s_{i}$ is a reactant or product. That is $a_{i} = 1$ if and only if $s_{i}$ is a reactant in the rule $r$, and $b_{i} = 1$ if and only if $s_{i}$ is a product in the rule $r$. It is this form of $\mathcal{A}$ that Meyer and Tschudin [11] utilize in their packet chemistry.

The change of overall concentration (concentration denoted $c_{s_{i}}$) is given by a system of differential equations
$$\frac{\text{d}c_{s_{i}}}{\text{d}t} = (b_{i}-a_{i})\prod_{j=1}^{N}c_{s_{j}}^{a_{j}}, \quad i=1,\ldots,N$$
according to the Law of Mass Action discussed earlier. There may be multiple rules/reactions $r \in \mathcal{R}$ that affect the concentration of species $s_{i}$, so
$$\frac{\text{d}c_{s_{i}}}{\text{d}t} = \sum_{r\in \mathcal{R}}\left[(b_{i}^{r}-a_{i}^{r})\prod_{j=1}^{N}c_{s_{j}}^{a_{j}^{r}}\right], \quad i=1,\ldots,N$$
3. Others: There are other options, such as metadynamics (where the number of species and thus differential equations may change over time), mixed approaches, or symbolic analysis of the differential equations. As this article would be far too cumbersome to discuss these, they are omitted, but may be found in [1].

According to Dittrich et al [1], the reactor algorithm $\mathcal{A}$ depends on the representation of the elements of $s_{i}$ and thus the population. Meyer and Tschudin [11] utilize the second approach, though they do not explicitly state this.

### Artificial Packet Chemistry

Meyer and Tschudin adapt the artificial chemistry in the previous section to suit their queuing networks in a given computer network. They add an element $\mathcal{G}$ to the artificial chemistry tuple to get an artificial packet chemistry $PC = (\mathcal{G},\mathcal{S},\mathcal{R},\mathcal{A})$, where $\mathcal{G}$ is the digraph that gives the topology of the computer network. $\mathcal{G}$ consists of a set of nodes $V_{\mathcal{G}}$ which represent the chemical reaction vessels. (These were the $\mathcal{P}$ in the previous section.), and $E_{\mathcal{G}}$ is the set of directed arcs that represent network links connecting adjacent nodes.2

Here, since a molecular species is the analogue of a queue, $\mathcal{S} = \cup_{i \in V}\{S_{i}\}$. Here, $\{S_{i}\}$ is a set of all queue instances in a particular node $i$. At this point, some discussion on clarity is warranted. It is possible to have more than one queueing instance (more than one molecular species) inside each reaction vessel (here the nodes of the network). I don’t think this is meant to be a disjoint union, since a reaction species can show up in more than one reaction vessel, so there may be repeats of certain species in this representation of $\mathcal{S}$ when written this way. Perhaps it’s just a nitpick, but it’s worth mentioning.

$\mathcal{R} = \cup_{i \in V}\{R_{i}\}$ gives all the flow relations among the queues. Here the rules $r$ take form 2 of $\mathcal{A}$:

$$r \in R_{i}: \sum_{s \in S_{i}}a_{s,r}s \longrightarrow \sum_{s \in S_{i} \cup \{S_{j}\} : j \in N_{i}}b_{s,r}s$$

The reaction rules basically describe what’s going on in a particular reaction vessel. We can send packets to neighboring nodes/vessels ($N_{i}$ is the notation for the neighborhood of node $i$, or the set of adjacent nodes), or we can keep packets in the same node after the reaction is done. The reactions that send packets to neighboring nodes are transmissions.

The mean reaction rate $\nu_{r}$ of each reaction is given by the Law of Mass Action as applied to forward reactions:

$$\nu_{r} = k_{r}\prod_{s\in S}c_{s}^{a_{s,r}}$$

just as described in the previous section.

Figure 3 from Meyer and Tschudin [11] gives an explicit example to help solidify these abstract ideas. The network consists of 4 nodes, so $V = \{n_{1}, n_{2}, n_{3}, n_{4}\}$. Each node has a bidirectional link with its neighbors, so $E = \{n_{1}n_{2}, n_{2}n_{1}, n_{2}n_{3}, n_{3}n_{2}, n_{2}n_{4}, n_{4}n_{2}, n_{3}n_{4}, n_{4}n_{3}\}$. In this case, we only have one species of molecule (one queue) per node, so $\mathcal{S} = \{X_{1}, X_{2}, X_{3}, X_{4}\}$. The set of reactions is simply a first-order reaction per arc: $\mathcal{R} = \{r_{a,b}: X_{a} \to X_{b}: ab \in E\}$

From a review standpoint, I would have liked to see a less trivial example, such as one with multiple queues in a node, and rules that may keep packets in a node instead of just transmitting. These types of scenarios would be interesting to model this way, and demonstrate better the power of this approach.

## Continuation

The next post in the series will discuss the mathematical analyses of the artificial packet chemistry described here.

## References

1. Dittrich, P., Ziegler, J., and Banzhaf, W. Artificial chemistries – a review. Artificial Life 7(2001), 225–275.
1. Feinburg, M. Complex balancing in general kinetic systems. Archive for Rational Mechanics and Analysis 49 (1972).
2. Gadgil, C., Lee, C., and Othmer, H. A stochastic analysis of first-order reaction networks. Bulletin of Mathematical Biology 67 (2005), 901–946.
3. Gibson, M., and Bruck, J. Effcient stochastic simulation of chemical systems with many species and many channels. Journal of Physical Chemistry 104 (2000), 1876–1889.
4. Gillespie, D. The chemical langevin equation. Journal of Chemical Physics 113 (2000).
5. Gillespie, D. The chemical langevin and fokker-planck equations for the reversible isomerizationreaction. Journal of Physical Chemistry 106 (2002), 5063–5071.
6. Horn, F. On a connexion between stability and graphs in chemical kinetics. Proceedings of the RoyalSociety of London 334 (1973), 299–330.
7. Kamimura, K., Hoshino, H., and Shishikui, Y. Constant delay queuing for jitter-sensitive iptvdistribution on home network. IEEE Global Telecommunications Conference (2008).
8. Laidler, K. Chemical Kinetics. McGraw-Hill, 1950.
9. McQuarrie, D. Stochastic approach to chemical kinetics. Journal of Applied Probability 4 (1967), 413–478.
10.  Meyer, T., and Tschudin, C. Flow management in packet networks through interacting queues and law-of-mass-action-scheduling. Technical report, University of Basel.
11.  Pocher, H. L., Leung, V., and Gilles, D. An application- and management-based approach to atm scheduling. Telecommunication Systems 12 (1999), 103–122.
12. Tschudin, C. Fraglets- a metabolistic execution model for communication protocols. Proceedings of the 2nd annual symposium on autonomous intelligent networks and systems (2003).

Exploiting Chemistry for Better Packet Flow Management 1: Introduction

## Exploiting Chemistry for Better Packet Flow Management 1: Introduction

Perhaps the phrase “don’t reinvent the wheel” is overused. However, many newer disciplines, particularly in the technology sector, seem to insist on it. One thing physical engineers learned long ago was to study the world around them, work with it, and emulate it in their designs. Network engineering should be no different. In a technical report from 2011, authors Thomas Meyer and Christian Tschudin from the University of Basel describe a highly elegant natural flow management method [11] that exploits much of the hard work done in the chemistry sector in chemical kinetics. They describe a scheduling approach that creates an artificial chemistry as an analogue of a queueing network, and uses the Law of Mass Action to schedule events naturally. The analysis of such networks utilizing their implementation is simplified regardless of the depth one wishes to go into. In addition, they show a congestion control algorithm based on this framework is TCP fair and give evidence of actual implementation (a relief to practitioners).

This report will discuss their paper at length with a goal of covering not only their work, but the underlying ideas. Since the paper requires knowledge of chemical kinetics, probability, queueing theory, and networking, it should be of interest to specialists in these disciplines, but a comprehensive discussion of the fundamentals glossed over in the paper would make dissemination more likely.

Note: the full pdf of the report is attached to each section of this series at the bottom.

## Overall Review of the Paper

The paper is well written, readable, and quite clear despite the breadth of scope it contains. It’s a major benefit to the authors and readers that their theoretical model has been tested in a proof-of-concept network. The work shows much promise for the networking space as a whole.

## Overview of the Paper and Chemistry Basics

Just as in chemistry and physics, packet flow in a network has microscopic behavior controlled by various protocols, and macro-level dynamics. We see this in queueing theory as well–we can study (typically in steady-state to help us out, but in transient state as well) the stochastic behavior of a queue, but find in many cases that even simple attempts to scale the analysis up to networks (such as retaining memorylessness) can become overwhelming. What ends up happening in many applied cases is a shift to an expression of the macro-level properties of the network in terms of average flow. The cost of such smoothing is an unpreparedness to model and thus deal effectively with erratic behavior. This leads to overprovisioning and other undesirable and costly design choices to mitigate those risks.

Meyer and Tschudin have adapted decades of work in the chemical and physical literature to take advantage of the Law of Mass Action, designing an artifical chemistry that takes an unconventional non-work-conserving approach to scheduling. Non-work-conserving queues add a delay to packets and have tended to be avoided for various reasons, typically efficiency. Put simply, they guarantee a constant wait time of a packet regardless of the number of packets in a queue by varying the processing rate with fill level of the queue. The more packets in queue, the faster the server processes those packets.

### Law of Mass Action in Chemistry

If we have some chemical reaction with reactants $A_{1}$, $A_{2}$,$\ldots$, $A_{n}$ and products $B_{1}$, $B_{2}$, $\ldots$, $B_{m}$, and the reaction is only forward1, then we may express the reaction as

$$A_{1} + A_{2} + \ldots + A_{n} \longrightarrow_{k} B_{1} + B_{2} + \ldots + B_{n}$$

where $k$ is the rate constant. In a simple reaction $A \to P$, with $P$ as the product, we can see the rate expressed nicely in a very basic differential equation form[9]:

$$-\frac{\text{d}c_{A}}{\text{d}t} = k\cdot c_{A}$$

This should actually look somewhat similar to problems seem in basic calculus courses as well. The rate of change of draining the reactant is a direct function of the current concentration.

The reaction rate $r_{f}$ of a forward reaction is proportional to the concentrations of the reactants:

$$r_{f} = k_{f}c_{A_{1}}c_{A_{1}}\cdots c_{A_{N}}$$

for a set of reactants $\{A_{i}\}$.

### The Queuing Analogue and Assumptions

Meyer and Tschudin[11] express the networking version of these chemical reactions in a very natural way. Packets are molecules. A molecular species is a queue, so molecules of species $X$ go into queue $X$. The molecular species is a temporary buffer that stores particular packets types until they are consumed by a reaction (processed by some server in the queueing space). FIFO (first-in-first-out) discipline is assumed.

The figure above from the technical report shows how a small system of reactions looks in the chemical space and the queuing space. Where analysis and scheduling can get complicated is in the coupled nature of the two reactions. The servers both drain packets from queue $Y$, so they are required to coordinate their actions in some way. It’s important to note here that this equivalence rests on treating the queuing system as M/M/1 queues with a slightly modified birth-death process representation.

Typically, in an M/M/1 queue, the mean service rate is constant. That is, the service rate is independent of the state the birth-death process is in. However, if we model the Law of Mass Action using a birth-death process, we’d see that the rate of service (analogously, the reaction rate) changes depending on the fill-level of the queue (or concentration of the reactant). We’ll investigate this further in the next sections, discussing their formal analysis.

## Related Work and Precedent

The authors noted that adding packet delay is not unheard of in the networking space. Delay Frame Queuing[12] utilizes non-work-conserving transmission at edge nodes in an ATM network in order to guarantee upper bounds on delay and jitter for virtual circuits. Three other researchers in 2008 (Kamimura et al) proposed a Constant Delay Queuing policy that assigns a constant delay to each packet of a particular priority stream and forward other best-effort packets during the delay[8].

## Continuation

The next article will discuss the formal mathematical model of an artificial packet chemistry.

## References

1. Dittrich, P., Ziegler, J., and Banzhaf, W. Artificial chemistries – a review. Artificial Life 7(2001), 225–275.
1. Feinburg, M. Complex balancing in general kinetic systems. Archive for Rational Mechanics and Analysis 49 (1972).
2. Gadgil, C., Lee, C., and Othmer, H. A stochastic analysis of first-order reaction networks. Bulletin of Mathematical Biology 67 (2005), 901–946.
3. Gibson, M., and Bruck, J. Effcient stochastic simulation of chemical systems with many species and many channels. Journal of Physical Chemistry 104 (2000), 1876–1889.
4. Gillespie, D. The chemical langevin equation. Journal of Chemical Physics 113 (2000).
5. Gillespie, D. The chemical langevin and fokker-planck equations for the reversible isomerizationreaction. Journal of Physical Chemistry 106 (2002), 5063–5071.
6. Horn, F. On a connexion between stability and graphs in chemical kinetics. Proceedings of the RoyalSociety of London 334 (1973), 299–330.
7. Kamimura, K., Hoshino, H., and Shishikui, Y. Constant delay queuing for jitter-sensitive iptvdistribution on home network. IEEE Global Telecommunications Conference (2008).
8. Laidler, K. Chemical Kinetics. McGraw-Hill, 1950.
9. McQuarrie, D. Stochastic approach to chemical kinetics. Journal of Applied Probability 4 (1967), 413–478.
10.  Meyer, T., and Tschudin, C. Flow management in packet networks through interacting queues and law-of-mass-action-scheduling. Technical report, University of Basel.
11.  Pocher, H. L., Leung, V., and Gilles, D. An application- and management-based approach to atm scheduling. Telecommunication Systems 12 (1999), 103–122.
12. Tschudin, C. Fraglets- a metabolistic execution model for communication protocols. Proceedings of the 2nd annual symposium on autonomous intelligent networks and systems (2003).

Expecting the Unexpected: Borel’s Paradox

## Expecting the Unexpected: Borel’s Paradox

One of the best ways to shorten a proof in statistics or probability is to use conditioning arguments. I myself have used the Law of Total Probability extensively in my work, as well as other conditioning arguments in my PhD dissertation. Like many things in mathematics, there are subtleties that, if ignored, can cause quite a bit of trouble. It’s a theme on which I almost feel like I sound preachy, because subtlety, nuance, and deliberation followed by cautious proceeding is about as old-fashioned as my MS Statistics1

One particularly good paper that discusses this was written by Michael Proschan and Brett Presnell  in The American Statistician  in August 1998 titled “Expect the Unexpected from Conditional Expectation”. In it, the authors noted the following seemingly innocuous question posed on a statistics exam

If $X$ and $Y$ are independent standard normal random variables, what is the conditional distribution of $Y$ given $Y=X$?

There are three approaches to this problem.

### (1) Interpret the statement that $Y=X$ by declaring a new random variable $Z_{1} = Y-X$ where $Z_{1}=0$.

Here, the argument proceeds as follows:

$Y$ and $Z_{1}$ have a bivariate normal distribution with $\mu = (0,0)$, $\sigma_{Y}^{2}=1$, $\sigma_{Z_{1}}^{2}=2$, and correlation $\rho = \tfrac{1}{\sqrt{2}}$. Thus, we know that the conditional distribution of $Y$ given $Z_{1}=0$ is itself normal with mean $\mu_{Y}+\rho\frac{\sigma_{Y}}{\sigma_{Z_{1}}}(0-\mu_{Z_{1}})=0$ and variance $\sigma_{Y}^{2}(1-\rho^{2}) = \tfrac{1}{2}$. Thus, the conditional density is

$$f(y|Z_{1}=0) = \frac{1}{\sqrt{2\pi}}e^{-y^{2}}$$

This was the expected answer. However, one creative student did a different argument:

### (2) Interpret the statement that $Y=X$ by declaring a new random variable $Z_{2} = \tfrac{Y}{X}$ where $Z_{2}=1$.

This student used the Jacobian method 2 and transformed the variables via declaring $s=y$, $z_{2}=\tfrac{y}{x}$ and finding the joint density of $S$ and $Z_{2}$. The marginal density for $S$ was then found by dividing the joint density by the marginal density of $Z_{2}$ evaluating at $z_{2}=1$. The reason for the ratio is that the marginal density of $Z_{2}$ being the ratio of independent standard normal random variables has a Cauchy distribution3. This student’s answer was

$$f(y|Z_{2}=1) = |y|e^{-y^{2}}$$

Not the expected answer. This is a correct interpretation of the condition $Y=X$, so the calculations are correct. There was a third different answer, from a student who had taken a more advanced probability course.

### (3) Interpret the statement $Y=X$ as $Z_{3} = 1$ where $Z_{3} = \mathbb{I}[Y=X]$

Here $\mathbb{I}(\cdot)$ is the indicator function, where the variable is 1 if the condition is met, and 0 otherwise. The argument here is that $Y$ and $Z_{3}$ are independent. Why? $Z_{3}$ is a constant zero with probability 1. A constant is independent of any random variable. Thus the conditional distribution of $Y$ given $Z_{3}=1$ must be the same as the unconditional distribution of $Y$, which is standard normal.

This is also a correct interpretation of the condition.

From the paper, “At this point the professor decided to drop the question from the exam and seek therapy.”

## What happened?

At this point, both the paper did and we shall revisit exactly what conditional probability means. Suppose we have continuous random variables $X$ and $Y$. We’ll usually write expressions like $\mathbb{P}(Y\leq y|X=x)$ or $\mathbb{E}(Y|X=x)$. However, an acute reader might already ask the question about conditioning on sets of probability 0. For a continuous random variable, the probability that we land on any specific real value $x$ is indeed 0, which hearkens back to the measure-theoretic basis of probability. As it turns out, this little subtlety is indeed the culprit.

### Formal definition of conditional distribution and conditional expectation

First, we take a look at the formal definition of conditional expectation

Definition.  A conditional expected value $\mathbb{E}(Y|X)$ of $Y$ given $X$ is any Borel function $g(X) = Z$ that satisfies

$$\mathbb{E}(Z\mathbb{I}(X \in B)) = \mathbb{E}(Y\mathbb{I}(X \in B))$$

for every Borel set $B$. Then $g(x)$4 is the conditional expected value of $Y$ given $X=x$ and we can write $\mathbb{E}(Y|X=x)$

What this means is that the conditional expectation is actually defined as a random variable whose integral over any Borel set agrees with that of $X$5. Of import here is the fact that the conditional expectation is defined only in terms of an integral. From Billingsley, 1979, there always exists at least one such Borel function $g$, but the problem here is that there may be infinitely many. Each Borel function $g$ that satisfies the above definition is called a version of $\mathbb{E}(Y|X)$

So let’s say we have two versions of $\mathbb{E}(Y|X)$, called $Z_{1} = g_{1}(X)$ and $Z_{2} = g_{2}(X)$. Then we have that $\mathbb{P}(Z_{1}=Z_{2})=1$. Still seems pedantic, but if we look at this from a measure-theoretic perspective, this means that two versions are equal except on a set $N$ of $x$ such that $\mathbb{P}(X \in N)=0$.

What does this have to do with conditional distributions?

For each $y$ we fix we can find some function6 $G_{y}(x) = G(y|x)$ such that

$G(y|x) = \mathbb{P}(Y\leq y|X=x)$

In other words7, $G(y|X)$ is a version of $\mathbb{E}(\mathbb{I}(Y\leq y)|X)$. This last expression here is a conditional distribution of $Y$ given $X=x$

Notice I said “a” conditional distribution, and not “the” conditional distribution. The words were chosen carefully. This leads us into the Borel Paradox, and the answer to why all three answers of that exam question are technically correct.

## The Borel Paradox

Also known as the Equivalent Event Fallacy, this paradox was noted by Rao (1988) and DeGroot (1986). If we attempt to condition on a set of probability (or measure) 0, then the conditioning may not be well-defined, which can lead to different conditional expectations and thus different conditional distributions.

In the exam question, $Z_{1}, Z_{2}$ and $Z_{3}$ are all equivalent events. The fallacy lies in assuming that this equivalence would mean that conditioning on, say, $\{Y-X=0\}$ is the same as conditioning on the event $\{Y/X=1\}$. In almost all8 cases, this is true. If the events in question have nonzero probability, then the professor’s assumption was true. However, when we’re conditioning on events that have zero probability, the classic Bayes’s formula interpretation doesn’t hold anymore, because the denominator is now 0.9

If we have an event like $\{Y=X\}$, then from the previous section we know that there are infinitely many versions of this event. We have to think of conditioning on a random variable, not just a value. There were three versions given above:

• $Z_{1} = Y-X$, which has value $z_{1}=0$
• $Z_{2} = Y/X$, with value $z_{2}=1$
• $Z_{3} = \mathbb{I}(Y=X)$ with value $z_{3} =1$

Proschan and Presnell dig a bit deeper and discuss the details here on exactly where these equivalent interpretations diverge10 and yield different conditional distributions. They also discuss the substitution fallacy, which again notes the consequences of having infinitely many versions of $\mathbb{E}(Y|X)$, and why the convolution argument typically given to derive the distribution of the sum of independent random variables is nowhere near as air-tight as it appears.11

## What’s the solution?

The Borel Paradox reared its ugly head because different yet correct interpretations of the conditioning events/sets of probability 0 yielded different conditional expectations and different conditional distributions. The core of the problem was those sets having probability 0. How do we avoid this? We actually would like to calculate $\mathbb{E}(Y|X=x)$ (and conditional distributions that result from it), so how do we get around this paradox of infinite versions?

We take the analysts’ favorite trick: sequences and limits. Take a sequence of sets $(B_{n})$ where $B_{n} \downarrow x$. Now we’re only ever actually computing $\mathbb{E}(Y|X \in B_{n})$ where $\mathbb{P}(X \in B_{n})>0$. Now define

$\mathbb{E}(Y|X=x) = \lim_{n\to\infty}\mathbb{E}(Y|X \in B_{n})$

(I admit I’m seriously sidestepping some careful considerations of convergence and choices of sequence here. There is more subtlety and consideration in the analytic study of sequences of sets, but I don’t want to push too far down the rabbit hole here. The point is that we can avoid the paradox with care.)

## Conclusion

Subtleties and paradoxes occur all over mathematics. This doesn’t mean mathematics is broken, that all statistics are lies, or any other variation of frustrated exasperation I’ll hear when discussing these. What these subtleties, fallacies, and paradoxes do show is that careful consideration is paramount to the study, practice, and application of mathematics.

## References

1. Billingsley P,. (1979), Probability and Measure(1st ed.),New York:Wiley
2. Casella, G. and Berger, R.(2002) Statistical Inference (2nd ed.), Wadsworth
3. DeGroot,M.H. (1986), Probability and Statistics (2nd ed.), Reading,MA: Addison-Wesley.
4. Proschan, M.A. and Presnel, B. (1998), “Expect the Unexpected from Conditional Expectation”, American Statistician, 48, 248-252
5. Rao,M.M.(1988),”Paradoxes in Conditional Probability,” Journal of Multivariate Analysis,27, 434-446
Using Boolean Algebra to Find all Maximal Independent Sets in a Graph

## Using Boolean Algebra to Find all Maximal Independent Sets in a Graph

Graph theory may be one of the most widely applicable topics I’ve seen in mathematics. It’s used in chemistry, coding theory, operations research, electrical and network engineering, and so many other places. The subject is mainly credited to have begun with the famous  Seven Bridges of Königsberg problem posed by Leonard Euler in 1736. Frank Harary should also be credited with his massive work in bringing applications of graph theory to the sciences and engineering with his famous textbook written in 1969.

My own research forced me to stumble into this area once my  research partner, Jason Hathcock, suggested we explore the idea of viewing dependency relations in the sequences of variables we were studying as digraphs. Since then, I’ve been buried in graph theory texts, finding a wealth of fascinating topics to explore.

Of this article’s particular interest is finding all maximally independent sets in a graph using Boolean algebra.

## What’s a maximally independent set?

Firstly, what’s an independent set?

Definition (Independent Set): A set of vertices of a graph is independent if no two vertices in the set are adjacent.

If we take a look at the digraph above (from our paper on vertical dependence), and look at the underlying graph1, $\{1,6,11\}$ form an independent set, as an example. There are lots more, and of varying sizes. Of particular interest here are maximal independent sets.

Definition:(Maximal Independent Set): An independent set to which no other vertex in the graph can be added to retain the independence property

An example from the graph above is $\{2,3,4,5,13\}$. If we added any other vertex to that set, it would be adjacent to some vertex already in there.

A few notes:

(1) There are many maximal independent sets in a graph, and they may not all have the same cardinality.

(2) Maximal and maximum are not the same thing. An independent set may be a maximal independent set without being the largest independent set in the graph. The largest cardinality among all the maximal independent sets is called the independence number of the graph and is denoted $\beta(G)$.

## Why do we care about maximal independent sets?

Of the many applications that arise, one in particular is in coding theory. We want to find the largest error correcting codes we can, particularly in internet transmissions that can lose packets. A paper discussing this can be found here. (Paywall warning). We’ve discussed some basics of coding theory on this site as well. Finding error correcting codes with desirable properties is equivalent to solving the problem of finding maximal independent sets. The purpose of this article isn’t to discuss the applications here, but I’ve learned long ago that no one will keep reading unless I mention at least one application.

## Finding a maximal independent set

Finding a maximal independent set is relatively simple. Start with any vertex $v \in V(G)$. Add another vertex $u$ that is not adjacent to $v$. Continue adding vertices that are not adjacent to any already in the set. For a finite graph2, this process will terminate and the result will be a maximally independent set.

Will it be one of largest cardinality? Not necessarily.

For example, using one more of our dependency graphs generated by $\alpha(n) = \sqrt{n}$, we can take the order to be 24 as shown, and build a maximal independent set starting with vertex 3. Note that none of vertices 9-15 or 1 can be in the set, since they’re all adjacent to vertex 3. Vertex 2 is not adjacent to vertex 3, so we add it into our set: $V = \{2,3\}$. Now, the next vertex we add can’t be adjacent to either 2 or 3, so that rules out 1, 9-15, and 4-8. Grab vertex 16. Now $V = \{2,3,16\}$. Notice that none of the remaining vertices are adjacent to any of the previous vertices. Continuing this process, we’ll get that $V = \{2,3,16,17,18,19,20,21,22,23,24\}$. Notice that if we add any other vertices to this set, they’ll be adjacent to something already in it.

## Finding all Maximal Independent Sets

We’re rarely interested in just finding one maximal independent set. We’d prefer to find them all, and doing it by inspection is not very palatable. The heart of the article is an admittedly not optimal but still interesting way to find all maximal independent sets for reasonably small graphs.

We’ll illustrate the method on the 6-node graph above.

### Getting started

First, we’ll assign a Boolean variable to each vertex according to its inclusion in a maximal independent set. For example $A = 1$ implies $A$ is in the maximal independent set. Recall from Boolean algebra that

$$x+y = \left\{\begin{array}{lr}1, & x = 1 \text{ or } y = 1 \text{ or } (x=y=1)\\0,&x=0 \text{ and } y=0\end{array}\right.$$

Remark: $x+y$ is just another way of writing a union. This isn’t addition mod 2 here.

$$xy=\left\{\begin{array}{lr}1, & x = 1 =y\\0,&\text{ otherwise}\end{array}\right.$$

What we’ve done here is set up inclusion into our maximal independent sets in a Boolean fashion. So $x+y = 1$ corresponds to the inclusion of either vertex $x$ OR vertex $y$ OR both vertices $x$ and $y$. Similarly, $xy = 1$ corresponds to the inclusion of both vertices $x$ and $y$.

Now, we can express an edge of a graph as a Boolean product $xy$, where $x$ and $y$ are the vertices at either end of the edge.

Finally, set up the sum of all edges and call it $\phi$:

$$\phi = \sum xy \text{ for all } (x,y) \in G$$

For our graph above,

$$\phi = AB + AD + AE + BC + CE + CF + DE + EF$$

### Why did we do this?

For a vertex to be in an independent set, it can’t be adjacent to any other vertices in the set. Put another way, for each edge, we can only have at most one of the vertices that make it up. If we include $A$ in the independent set $V$, then $B$ cannot be in there.

Returning to our $\phi$, note that its value under Boolean algebra can only be 0 or 1. If $\phi = 1$, then at least one edge has both of its vertices “on”. This means, only combinations of $A, B, C, D, E, F$ that yield $\phi = 0$ will give us a maximally independent set.

### Solving the problem

Our goal now is to find all combinations of our Boolean vertex variables that yield $\phi = 0$. As it turns out, solving this directly is pretty annoying3. If we want $\phi = 0$, that’s logically equivalent to seeking $\phi^{c} = 1$, where $\phi^{c}$ is the Boolean complement (or negation) of $\phi$

Recall from Boolean algebra the following4:

\begin{aligned}(xy)^{c}&=x^{c}+y^{c}\\(x+y)^{c} &= x^{c}y^{c}\end{aligned}

So, if we take $\phi^{c}$ for our graph above,

\begin{aligned}\phi^{c}&=(A^{c}+B^{c})(A^{c}+D^{c})(A^{c}+E^{c})(B^{c}+C^{c})(C^{c}+E^{c})\\&\quad(C^{c}+F^{c})(D^{c}+E^{c})(E^{c}+F^{c})\end{aligned}

What does the negation here actually mean? By taking the complement, instead of finding vertices to include, now we’re finding vertices to excludeWhen we multiply this expression out, we’ll get a sum of terms, where each term is a product of complements of our original Boolean variables. To get $\phi^{c} = 1$, all we need is one of those terms to be 1. To get a term to be 1, all members of the product must themselves be 1, meaning each term gives us a set of variables to exclude. Excluding these variables gives us one maximally independent set for each term, so this gives us all the maximally independent sets.

The nice thing about dealing with Boolean arithmetic is that we can program a computer to do this for us. Any time we can invoke a relationship with Boolean algebra, we can enlist a friendly helpful computer.

### Finishing the example

We’ll do it by hand here, because I’m old-school like that. For larger graphs, obviously one would want to enlist some computational help, or just be very patient. We’ll remember a few other rules for Boolean algebra before we finish5:

\begin{aligned}xx &=x\\x+x &=x\\x +xy&=x\end{aligned}

After an insane amount of tedious Boolean algebra,

$$\phi^{c} = A^{c}C^{c}E^{c}+A^{c}B^{c}E^{c}F^{c}+A^{c}C^{c}D^{c}F^{c}+B^{c}C^{c}D^{c}E^{c}+B^{c}D^{c}E^{c}F^{c}$$

Recall that each term now tell us which sets of vertices to exclude from a maximal independent set. We negated the question logically. That means we have 5 maximal independent sets:

$$\{B,D,F\}, \{C,D\}, \{B,E\}, \{A,F\}, \{A,C\}$$

We can actually say what the independence number is as well, since we just have to find the maximum cardinality among the sets listed. For this graph, $\beta(G) = 3$.

## Conclusion

I happened to find this interesting, and ended up obsessed with it for a day, much to the chagrin of my daily planner, which expected me to be working on writing my research monograph. I tried several different ways of solving this beyond the one given. I tried using the direct equation $\phi$, and tried using regular arithmetic on just $\{0,1\}$, setting up a sort-of structure function similar to the reliability block diagrams detailed here.

I always hesitate to blame the method, and rather my own arithmetic errors, but I didn’t have much luck with the structure-function method, though I may try again to see if it’s an equivalent method. I believe it should be.

Looking at $\phi^{c}$ makes more sense after playing with this problem for some hours. The sum/union is quite nice, because it neatly separates out the various sets to exclude. It’s a better exploitation of Boolean algebra than trying to work with $\phi$ but aiming for a sum of 0. I still think it should be possible to work with it directly, even if not advisable. If I decide to torture myself with it further, and end up with something to write about, perhaps I’ll append here.

I always end up ending my articles with some takeaway. I don’t have much of one here, except it was a curiosity worth sharing. Perhaps a decent takeaway is to reveal a bit of the tedium and dead-ends mathematicians can run into when exploring something. That’s just part of research and understanding. It’s entirely possible to spend hours, days, weeks on something and all you conclude is that the original method you saw is definitely superior than the one you were trying to develop.

Beyond Cookbook Mathematics, Part 2

## Beyond Cookbook Mathematics, Part 2

The previous article discussed the importance of definitions to mathematical thought. We looked at a definition (of an end-vertex in a graph), and picked it apart by finding multiple ways to look at it. We also directly used the definition in a practical manner to find “weak links” in a network. This time, we’ll look at the structure of mathematical theorems and proofs, and how we can read the proofs to understand the theorems.

A mathematical theorem (informally) is a statement that takes the following structure:

IF {we have this stuff}, THEN {we get to say this about other stuff}.

Let’s pick this apart a bit even at the abstract level, because even this simple structure is important when we consider using theorems:

## IF {we have this stuff}:

That IF is really important. You can’t move to the THEN without the IF part being true. Let’s say a particular theorem (say, the Central Limit theorem), has a set of requirements in the IF part. We have to have all of those parts satisfied before we can invoke the conclusion. I see data scientists just invoking “Central Limit Theorem” like a chant when they are analyzing a dataset. However, in many cases, their data does not fit the IF conditions given. The Central Limit Theorem requires, among other things, that the random variables be independent. No independence, no Central Limit Theorem. Do not move forward.

Many complaints I hear, especially in statistics, revolve around “you can use statistics to say anything you like.” No. No you cannot. It only appears that way because practitioners are applying theorems blindly without ensuring the hypotheses (the IF bits) are all met. A theorem written and proven is not some magic silver bullet that allows for universal application and use.

## IF -> THEN:

When we write (or read) a proof, we assume the IF part is true, and use those conditions in the IF to logically deduce the conclusions in the THEN.

(A remark: I just described a particular method of proof—the direct proof. There are other methods by which we may prove a statement that are logically equivalent to a direct proof such as proof by contradiction, or proof by contrapositive. Since the idea here is to understand how all the parts of a theorem and proof work, we’ll stick with direct proofs here. We’re also discussing a particular type of logical statement—a one directional implication. We can have bidirectional statements as well, but these are an extension of understanding this first type, so we’ll start here.)

The first part discussing definitions is essential to understanding the IF part of our theorem and how this part connects to our “then” conclusion. Suppose we take the statement

If a real-valued function $f$ is differentiable, then $f$ is continuous.

We can use the statement without understanding. As long as we understand the definitions reasonably well (as in, we know how to tell if a function is differentiable), then we say it’s continuous and move on with our day. This isn’t good enough anymore. Why does differentiability imply continuity? A good proof should illuminate this for us. My suggestion at this point is to find a calculus book that proves this statement and have it open for this next part. I’m going to outline a series of thought-steps to consider as you read a proof.

### (1) Write down definitions.

Write down the definition of differentiability, and that of continuity. Study both. Play with examples of differentiable and continuous functions. See if all the differentiable functions you play with are also continuous. Try to play with some weird ones.

Examples: $f(x) = x^{2}$. Definitely differentiable. Definitely continuous. (I do suggest actually using the definitions here to really show that $f$ fits these.)

$f(x) = x+5$. Fits nicely. Try some nonpolynomials now.

$f(x) = \sin(x)$. Still good.

$f(x) = e^{x}$. Really nice. Love differentiating that guy.

As you’re playing with these, try to move between formally showing these functions fit the definitions and developing a visual picture of what it means for these functions to fit the definitions. (My high school calculus teacher, Andy Kohler, gave a nice intuitive picture of continuity—you shouldn’t have to pick up your pen to draw the function.)

This step helps you visualize your starting point and your destination. Every single time I develop a theorem, this is the process I go through. This isn’t always short either. I’ve spent weeks on this part—exploring connections between my start and my target to develop an intuition. Often, this leads me to a way to prove the desired theorem. Occasionally I manage to create an example that renders my statement untrue. This isn’t a trivial or unimportant part, so I implore you to temper any building frustration with the speed of this bit. However, since here we’re focused on reading proofs rather than writing them, we’ll likely assume we’re working with a true statement.

### (2) Now take a look at the proof.

A good proof should take you gently by the hand and guide you through the author’s reasoning in nice, comfortable steps of logic. The author shouldn’t require you to fill in gaps or make huge leaps, and especially never require you to just take something on faith.

The proof contains all the pieces we need for understanding. We just need to know how to read it. At some point in the proof, every single part of the IF should be invoked as a justification for a logical step. Find these parts.

“Because $f$ is differentiable, [STATEMENT]”

Study this part of the proof. Flag it. Do this for each time an IF condition is invoked. Make sure you can use the definition or condition invoked to make the leap the author does. Here I implore you to not just “pass over”. Really take the time to convince yourself this is true. Go back to the definition. Return to the proof. Again, this study isn’t necessarily quick.1

Doing this for each line in the proof will help you see what’s going on between IF and THEN.

### (3) Break things.

The last piece of advice I’ll give is one that was drilled into me by my graduate analysis professor at UTA, Dr. Barbara Shipman. Breaking things (or trying to) in mathematics is the best way to really cement your understanding.

Go back to all those points you flagged in the proof where the IF conditions were invoked. Now imagine you don’t have that condition anymore. The proof should fail in some way. Either you end up with a false conclusion (“if I don’t have X anymore, then I definitely cannot have Y”), or you just end up stuck (“if I don’t have X anymore, I can’t say anything about Y”.) This failure point helps you understand why that condition was necessary in the IF. Doing this for each condition in the proof helps you see the interplay among all the conditions, and showing you how and why all those parts were needed to get you to your conclusion.

None of this is trivial. Please don’t take the informal descriptions of this process to mean that you can breeze through and develop this skill as quickly as you memorized derivatives of functions. This is a skill that is developed over years. The reason I recommend using material you already know to study proofs is that it removes the challenge of learning new material while studying the “why”. There’s a reason mathematicians take calculus classes before real analysis courses (which spend time deeply developing and proving many things from calculus.) You’re familiar with how the subject works and that it works, then you can understand why it works.

This happens in engineering too. We flew a plane before we understood airflow deeply, but due to our understanding of airflow, we were able to advance flight technology far beyond what I imagine early aviators and inventors even thought possible. There’s a beautiful feedback loop between mathematics, formal proofs, and engineering. Mathematical proofs aren’t just the dry formalisms we use and throw away—they’re the keys to understanding why things work the way they do. Developing the skill to read and understand these arguments is not a waste of time for an engineer; it will help propel engineering forward.

Beyond Cookbook Mathematics, Part 1

## Beyond Cookbook Mathematics, Part 1

This post is due to the requests of several independent engineers and programmers. They expressed disappointment at their mathematics education and its failure to impart a deeper understanding of the formulas and algorithms they were taught to use.

This also reflects my observations of teaching university mathematics over the years. I started as a TA (and frequent substitute lecturer) in 2008, and have taught all levels of calculus, basic statistics, and advanced undergraduate statistics thus far. I’ve certainly noticed even since 2008 a de-emphasis on proofs and “why” in favor of more examples, applications, and formulas in general. In fact, proofs were passed over in lectures in calculus courses meant for general engineers because “there wasn’t enough time”, or it was not considered something engineers needed to know.

This attitude is actually fairly recent. Many of the older calculus books in my library were written for engineers in an undergraduate program, and these books are quite proof-heavy. Some examples are F. Hildebrand’s Advanced Calculus for Engineers (1949), Tom Apostol’s Calculus (Volumes 1 and 2), and R. Courant’s Differential and Integral Calculus (1938), to name a few. For a more modern text that still has some proof treatment, the 10th edition of Calculus: One and Several Variables, by Salas et al is a good resource. I used this one both as a student at Georgia Tech and an instructor.1. However, when I taught at University of Texas at Arlington as a grad student from 2013-2015 and then Foothill College in early 2018, the texts they chose to use was woefully inadequate to suit a college-level calculus course; proofs were nonexistent, and reasoning was thrown away in favor of contrived examples. The course was designed to steer engineers away from any proofs or thorough reasoning, showing the experience is somewhat widespread.2

I do not want to discuss the reasons for this departure; this isn’t an education site. What is important now is to discuss how to satisfy the desire of an engineer or other highly technical person using various mathematical topics/formulas to develop a deeper understanding of what he or she is doing. It wouldn’t be particularly helpful to simply suggest acquiring books and reading the proofs.

The best thing an education can give is the ability to teach oneself through developing methods of logical thought and creativity. There are books on formal logic and mathematical proofs circulating, but they can be a bit daunting to start with, as they are quite abstract and discuss mathematical logic and proof theory in general.3What I’ll give here can be perhaps considered a friendly introduction as to how to begin using mathematical proofs to facilitate understanding of material.

Most of you who are reading this have likely taken a calculus course or two, and probably some basic linear algebra/matrix theory (especially if you’re in computer science). You’ve been exposed to limits, differentiation, continuity, determinants, and linear maps. You know more math than you think. We’ll use this material to begin learning how to deconstruct mathematical statements and arguments in order to understand how the pieces all fit together. (It’s really not unlike diagramming sentences.)

Pick a topic you know well. That way we’re not trying to introduce new material and learn how to read proofs at the same time. Dust off your old calculus book, or differential equations book.

## Definitions

Definitions are the most important thing in mathematics, and perhaps the most ignored by those using it. “Continuous” has a meaning. “Differentiable” has a meaning. Spending time to really understand the definition of a mathematical term will provide an unshakable foundation.

Example: Let’s take something visual: the degree of a vertex in a graph.

You might see this definition of degree of a vertex:

Def. 1: The degree of a vertex $v$ in a graph $G$ is the number of edges connected to $v$

This is fairly intuitive and straightforward. One way to go about understanding this definition is to find other equivalent ways to express it. For example, we know that if there’s an edge sticking out of a vertex $v$, there must be something on the other end of the edge. (Graphs don’t allow dangling edges.) Thus, we might reframe this definition as

Def. 2: The degree of a vertex $v$ is the number of vertices adjacent (connected to) $v$

If we go one step further and collect all the vertices adjacent to $v$ into a set or bucket, and name that bucket the neighborhood of $v$, we can write one more equivalent definition of the degree of a vertex.

Def. 3: The degree of a vertex $v$ is the number of vertices in (or cardinality of) the neighborhood of $v$. (Formally, mathematicians would say that the number of elements in a set is the cardinality of the set.)

Notice what we’ve done here. We’ve taken one simple definition and expressed it three equivalent ways. Each one of these gives us a slightly different facet of what the degree of a vertex is. We can look at it from the perspective of edges or vertices.

Let’s take this definition and use it in another definition.

An end-vertex or pendant vertex in a graph is a vertex of degree 1.

New definition using our previously defined degree. But what does it really tell us here? Can we visualize this? If a vertex only has degree 1, then we know that only one edge sticks out of that vertex. Equivalently, we can also say that it has only one neighbor vertex adjacent to it. The size of its neighborhood is 1. We now can picture an end-vertex quite nicely.

## Using Definitions

Many applications rely on checking to see if a definition is satisfied.

• Is $f(x) = \sin(x)$ continuous?
• Do I have any end vertices in my network?

Here we are taking a specific example and looking to see if a definition is satisfied, typically because we know that (due to theorems) we get certain properties we either want (or maybe don’t want) if the definition is satisfied.

For example, network engineers like to have resilient networks. By “resilient”, I mean that they’d like to be able to tolerate a link failure and still be able to send information anywhere on the network. Intuitively, it would be really bad if a particular link failure isolated a node/switch so that no information could reach it.

Let’s try to frame that in mathematical terms. A network can be drawn as a graph, with circles representing nodes/switches/computers/whatever, and edges between nodes representing the physical links connecting them. We want to design a network so that a single link failure anywhere in the network will not isolate a node.

We can mathematically represent a link failure by the removal of an edge. So we can take our graph representing our network, and start testing edges by deleting them to see if a node ends up isolated with no edges emanating from it. Or…we could return to a definition from earlier and think about this mathematically.

If a single link failure isolates a particular node, then that means only one edge sticks out from it. That means that node has degree 1, by our first definition of degree. Thus, we may now conclude that this node also fits the definition of an end-vertex.

Moving back to the practical space, we conclude: a vertex can only be isolated via a single-link failure if it’s an end-vertex. We can also write the statement the other way:  If a vertex is an end-vertex, then the deletion of its incident edge isolates it.

Now, thanks to these definitions, and our understanding of them, we can find a way to spot all the end-vertices in a network. Since we have multiple ways of looking at this problem, we can find the way that suits us best.

Using definition 1, we can count the number of links from each node. Any node that has only one link connected to it is an end-vertex, and the failure of that link will isolate that node.

Using definition 2, we can count the number of adjacent neighbors, especially if we have a forwarding table stored for each node. If any node only has one other node in its table, it’s an end-vertex, and the removal of the link connecting the two nodes would isolate our vertex.

I used a fairly visual, practical definition and example here. Other definitions in mathematics can get fairly involved; I’ve spent hours simply picking apart a definition to understand it. But the strategy doesn’t change. The first step in developing a deeper understanding of mathematics is to pay attention to definitions–not just what they say, but what they mean. The next article will discuss how we use definitions to write theorems and understand proofs.

Simulating Soundscapes Using Convolutions

## Simulating Soundscapes Using Convolutions

One of the most powerful areas of electrical engineering that flourished in the 20th century is the field of signal processing. The field is broad and rich in some beautiful mathematics, but by way of introduction, here we’ll take a look at some basic properties of signals and how we can use these properties to find a nice compact representation of operations on them. As a motivating application, we’ll use what we study today to apply certain effects to audio signals. In particular, we’ll take a piece of audio, and be able to make it sound like it’s being played in a cathedral, or in a parking garage, or even through a metal spring.

First things first: what is a signal? For this discussion we’ll limit ourselves to looking at the space $\ell = \{x :\mathbb{Z} \rightarrow \mathbb{R}\}$ – the set of functions which take an integer and return a real number. Another way to think of a signal then is as an infinite sequence of real numbers. We’re limiting ourselves to functions where the domain is discrete (the integers), rather than continuous (the real numbers), since in many applications we’re looking at signals that represent some measurement taken at a bunch of different times1. It’s worth noting that any signal that’s been defined on a countable domain $\{..., t_{n-1}, t_n, t_{n+1},...\}$ can be converted to one defined on the integers via an isomorphism. We like to place one further restriction on the signals, in order to make certain operations possible. We restrict the space to so-called finite-energy signals:

$$\ell_2 = \left\{x \in \ell : \sum_{n = -\infty}^{\infty} |x(n)|^2 < \infty\right\}.$$

This restriction makes it much easier to study and prove things around these functions, while still giving us lots of useful signals to study, without having to deal with messy things like infinities. In practice, when dealing with audio we usually have a signal with a finite length and range, so this finite-energy property is trivially true.

Studying signals is only as useful if we can also define operations on them. We’ll study the interaction of signals with systems, which take one signal and transform it into another – essentially, a function operating on signals. Here, we’ll say that a system $H : \ell_2 \rightarrow \ell_2$ takes an input signal $x(t)$ and produces output signal $H\{x(t)\} = y(t)$.

## Linearity and Time Invariance

There are certain properties that are useful for systems to have. The first is linearity. A system $H$ is considered linear if for every pair of inputs $x_1, x_2 \in \ell_2$, and for any scalar values $\alpha, \beta \in R$, we have

$$H\{\alpha x_1 + \beta x_2\} = \alpha H\{x_1\} + \beta H\{x_2\}$$

This is very useful, because it allows us to break down a signal into simpler parts, study the response of the system to each of those parts, and understand the response to the more complex original signal.

The next property we’re going to impose on our systems is time-invariance:

$$\forall s \in \mathbb{Z}, H\{x(n)\} = y(n) \Rightarrow H\{x(n-s)\} = y(n-s)$$

This means that shifting the input by $s$ corresponds to a similar shift in the output. In our example of playing music in a cathedral, we expect our system to be time-invariant, since it shouldn’t matter whether we play our music at noon or at midnight, we’d expect it to sound the same. However, if we were playing in a building that, for example, lowered a bunch of sound-dampening curtains at 8pm every night, then the system would no longer be time-invariant.

So what are some more concrete examples of systems that are linear and time-invariant?
Let’s consider an audio effect which imitates an echo – it outputs the original signal, plus a quieter, delayed version of that signal. We might express such a system as

$$H_{\Delta, k}\{x(n)\} = x(n) + kx(n-\Delta)$$

where $\Delta \in \mathbb{Z}$ is the time delay of the echo (in terms of the number of samples), and $k \in \mathbb{R}$ is the relative volume of the echoed signal. We can see that this signal is time-invariant, because there is no appearance of the time variable outside of the input. If we replaced $k$ by a function $k(n) = \sin(n)$, for example, we would lose this time invariance. Additionally, the system is plainly linear:

\begin{aligned}H_{\Delta, k}\{\alpha x_1(n) + \beta x_2(n)\} &= \alpha x_1(n) + \beta x_2(n) + k x_1(n-\Delta) +k x_2(n-\Delta) \\&= H_{\Delta, k}\{x_1(n)\} + H_{\Delta, k}\{x_2(n)\}\end{aligned}

A common non-linearity in audio processing is called clipping — we limit the output to be between -1 and 1: $H\{x(n)\} = \max(\min(x(n), 1), -1)$. This is clearly non-linear since doubling the input will not generally double the output.

## The Kronecker delta signal

There is a very useful signal that I would be remiss not to mention here: the Kronecker delta signal. We define this signal as

$$\delta(n) = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases}$$

The delta defines an impulse, and we can use it to come up with a nice compact description of linear, time-invariant systems. One property of the delta is that it can be used to “extract” a single element from another signal, by multiplying:

$$\forall s \in \mathbb{Z}, \delta(n-s)x(s) = \begin{cases} x(n) & n=s \\ 0 & n \neq s\end{cases}$$

Similarly, we can then write any signal as an infinite sum of these multiplications:

$$x(n) = \sum_{s=-\infty}^{\infty} \delta(n-s)x(s) = \sum_{s=-\infty}^{\infty}\delta(s)x(n-s)$$

Why would we want to do this? Let $H$ be a linear, time-invariant system, and let $h(t) = H\{\delta(t)\}$, the response of the system to the delta signal. Then we have

\begin{aligned} H\{x(n)\} &= H\left\{\sum_{s=-\infty}^{\infty} \delta(n-s)x(s)\right\}\\&=\sum_{s=-\infty}^{\infty}H\{\delta(n-s)\}x(s) \text{ by linearity}\\&=\sum_{s=-\infty}^{\infty}h(n-s)x(s) \text{ by time-invariance}\end{aligned}

We can write any linear, time-invariant system in this form. We call the function $h$ the impulse response of the system, and it fully describes the behaviour of the system. This operation where we’re summing up the product of shifted signals is called a convolution, and appears in lots of different fields of math 2.

## Firing a Gun in The Math Citadel

The power of this representation of a system is that if we want to understand how it will act on any arbitrary signal, it is sufficient to understand how it responds to an impulse. To demonstrate this, we’ll look at the example of how audio is affected by the environment it is played in. Say we were a sound engineer, and we wanted to get an instrument to sound like it was being played in a big, echoing cathedral. We could try to find such a place, and actually record the instrument, but that could be expensive, requiring a lot of setup and time. Instead, if we could record the impulse response of that space instead, we could apply the convolution to a recording we did back in a studio. How do we capture an impulse response? We just need a loud, very short audio source – firing a pistol or popping a balloon are common. To demonstrate, here are some example impulse responses, taken from OpenAirLib, and their effects on different audio signals.

First, here is the unprocessed input signal – a short piece of jazz guitar:

Here is the same clip, as if it were played in a stairwell at the University of York. First, the impulse response, then the processed audio.

That sounds a little different, but we can try a more extreme example: the gothic cathedral of York Minster. Again, here is the impulse response, followed by the processed signal.

In this case, we have a much more extreme reverberation effect, and we get the sound of a guitar in a large, ringing room. For our last example, we’ll note that impulse responses don’t have to be natural recordings, but instead could be entirely synthetic. Here, I’ve simply reversed the first impulse response from the stairwell, which creates this pre-echo effect, which doesn’t exist naturally.

This is just one of the most basic examples of what can be done with signal processing, but I think it’s a particularly good one – by defining some reasonable properties for signals and systems, we’re able to derive a nice compact representation that also makes a practical application very simple.

Sequences & Tendency: Topology Basics Pt. 2

## Introduction

In my previous post I presented abstract topological spaces by way of two special characteristics. These properties are enough to endow a given set with vast possibilities for analysis. Fundamental to mathematical analysis of all kinds (real, complex, functional, etc.) is the sequence.

We have covered the concept of sequences in some of our other posts here at The Math Citadel. As Rachel pointed out in her post on Cauchy sequences, one of the most important aspects of the character of a given sequence is convergence.

In spaces like the real numbers, there is convenient framework available to quantify closeness and proximity, and which allows naturally for a definition of limit or tendency for sequences. In a general topological space missing this skeletal feature, convergence must be defined.

This post will assume only some familiarity with sequences as mathematical objects and, of course, the concepts mentioned in Part 1. For a thorough treatment of sequences, I recommend Mathematical Analysis by Tom M. Apostol.

## Neighborhoods

Suppose $(X,\mathscr{T})$ is a given topological space, and nothing more is known. At our disposal so far are only open sets (elements of $\mathscr{T}$), and so it is on these a concept of vicinity relies.

Definition. Given a topological space $(X,\mathscr{T})$, a neighborhood of a point $x\in X$ is an open set which contains $x$.

That is, we say an element $T\in\mathscr{T}$ such that $x\in T$ is a neighborhood1 of $x$. To illustrate, take the examples from my previous post.

#### The Trivial Topology

When the topology in question is the trivial one: $\{\emptyset,X\}$, the only nonempty open set is $X$ itself, hence it is the only neighborhood of any point $x\in X$.

#### The Discrete Topology

Take $X=\{2,3,5\}$ and $\mathscr{T}$ to be the collection of all subsets of $X$:

 $\emptyset$ $\{2\}$ $\{3\}$ $\{5\}$ $\{2,3\}$ $\{2,5\}$ $\{3,5\}$ $\{2,3,5\}$

Then, for, say $x=5$, neighborhoods include $\{5\}$, $\{2,5\}$, $\{3,5\}$, and $\{2,3,5\}$.

#### The Standard Topology on $\mathbb{R}$

The standard topology on $\mathbb{R}$ is defined to be the family of all sets of real numbers containing an open interval around each of its points. In this case, there are infinitely2 many neighborhoods of every real number. Taking $x=\pi$ for instance, then $(3,4)$, $(-2\pi,2\pi)$, and even

$$\bigcup_{n=1}^{\infty}\left(\pi-\frac{1}{n},\pi+\frac{1}{n}\right)$$

are all neighborhoods of $\pi$.

Remark. A special type of neighborhood in the standard topology is the symmetric open interval. Given a point $x$ and a radius $r>0$, the set

$$(x-r,x+r)=\{y\in\mathbb{R}\mathrel{:}|x-y|

is a neighborhood of $x$. These sets form what is referred to as a basis for the standard topology and are important to definition of convergence in $\mathbb{R}$ as a metric space.

## Convergence

“…the topology of a space can be described completely in terms of convergence.” —John L. Kelley, General Topology

At this point in our discussion of topological spaces, the only objects available for use are open sets and neighborhoods, and so it is with these that convergence of a sequence are built3.

Definition. A sequence $(\alpha_n)$ in a topological space $(X,\mathscr{T})$ converges to a point $L\in X$ if for every neighborhood $U$ of $L$, there exists an index $N\in\mathbb{N}$ such that $\alpha_n\in U$ whenever $n\geq N$. The point $L$ is referred to as the limit of the sequence $(\alpha_n)$.

Visually, this definition can be thought of as demanding the points of the sequence cluster around the limit point $L$. In order for the sequence $(\alpha_n)$ to converge to $L$, it must be the case that after finitely many terms, every one that follows is contained in the arbitrarily posed neighborhood $U$.

As you might expect, the class of neighborhoods available has a dramatic effect on which sequences converge, as well as where they tend. Just how close to $L$ are the neighborhoods built around it in the topology?

We will use the example topologies brought up so far to exhibit the key characteristics of this definition, and what these parameters permit of sequences.

#### The Trivial Topology

In case it was to this point hazy just how useful the trivial topology is, sequences illustrate the issue nicely. For the sake of this presentation, take the trivial topology on $\mathbb{R}$. There is precisely one neighborhood of any point, namely $\mathbb{R}$ itself. As a result, any sequence of real numbers converges, since every term belongs to $\mathbb{R}$. Moreover, every real number is a limit of any sequence. So, yes, the sequence $(5,5,5,\ldots)$ of all $5$‘s converges to $\sqrt{2}$ here.

#### The Discrete Topology

Whereas with the trivial topology a single neighborhood exists, the discrete topology is as packed with neighborhoods as can be. So, as the trivial topology allows every sequence to converge to everything, we can expect the discrete topology to be comparatively restrictive. Taking the set $\{2,3,5\}$ with the discrete topology as mentioned above, we can pinpoint the new limitation: every set containing exactly one point is a neighborhood of that point. Notice the sets4 $\{2\}$, $\{3\}$, and $\{5\}$ are all open sets.

What does this mean? Any sequence that converges to one of these points, say $3$, must eventually have all its terms in the neighborhood $\{3\}$. But that requires all convergent sequences to be eventually constant! This seems to be a minor issue with the finite set $\{2,3,5\}$, but it presents an undesirable, counter-intuitive problem in other sets.

Take $\mathbb{R}$ with the discrete topology, for example. Under these rules, the sequence

$$(\alpha_n)=\left(\frac{1}{n}\right)=\left(1,\frac{1}{2},\frac{1}{3},\frac{1}{4},\ldots\right),$$

though expected to converge to $0$, does not converge at all.

So, the discrete topology is too restrictive, and the trivial topology lets us get away with anything. Fortunately, a happy middle ground exists by being a little more selective with neighborhoods.

#### The Standard Topology

By requiring an open set to contain an open interval around each of its points, it is impossible that a singleton be an open set. Therefore a singleton cannot be a neighborhood, and we eliminate the trouble posed by the discrete topology. Yet every open interval around a real number $L$ contains a smaller one, and each of these is a neighborhood.

This effectively corrals the points of any convergent sequence, requiring the distance between the terms and the limit to vanish as $n$ increases. Take again the sequence

$$(\alpha_n)=\left(\frac{1}{n}\right)=\left(1,\frac{1}{2},\frac{1}{3},\frac{1}{4},\ldots\right).$$

We suspect $(\alpha_n)$ converges to $0$, but this requires proof. Therefore, we must consider an arbitrary neighborhood of $0$, and expose the index $N\in\mathbb{N}$ such that all terms, from the $N$th onward, exist in that neighborhood.

Suppose $U$ is a given neighborhood of $0$, so that $U$ contains an open interval surrounding $0$. Without loss of generality, we may assume this interval is symmetric; that is, the interval has the form $(-r,r)$ for some radius $r>0$. Take $N$ to be any integer greater than $\tfrac{1}{r}$. Then, whenever $n\geq N$,

$$\alpha_n = \frac{1}{n} \leq \frac{1}{N} < \frac{1}{1/r} = r.$$

But this means $\alpha_n\in(-r,r)\subset U$ so long as $n\geq N$. Since we chose $U$ arbitrarily, it follows $(\alpha_n)$ converges to $0$.

## Conclusion

The behavior of a sequence in a given set can change rather drastically depending on the network of neighborhoods the topology allows. However, with careful construction, it is possible to have all the sequential comforts of metric spaces covered under a briefly put definition.

My next post in this series will push the generalization of these concepts much further, by relaxing a single requirement. In order to define convergence in the preceding discussion, the set of indices $\mathbb{N}$ was important not for guaranteeing infinitely many terms, but instead for providing order. This allows us to speak of all terms subsequent to one in particular. It turns out that if we simply hold on to order, we can loosen the nature of the set on which it is defined. That is the key to Moore-Smith Convergence, to be visited next.

Paper Review: Active Queue Management with Non-Linear Packet Dropping Function

## Paper Review: Active Queue Management with Non-Linear Packet Dropping Function

As promised in the previous article, I plan to review Reference 2, Active Queue Management with Non-Linear Packet Dropping Function, by D. Augustyn, A. Domanski, and J. Domanska, published in HET-NETs 2010, which discusses a change in the structure of the packet drop probability function using the average queue length in a buffer. I mentioned previously that choosing a linear function of the average queue length can be viewed as a bit of an arbitrary choice, since we’re designing a control mechanism here, and this paper attempts to define a new form of this packet drop probability function.

In summary, the best lesson one can take from this paper is that publication in a journal or conference proceedings does not guarantee that the paper withstands scrutiny. The paper is linked above for the interested reader to peruse himself, and to investigate the claims.

## Summary

The paper intended to give a new function to calculate the probability of proactively dropping a packet in a queue in order to prevent a full buffer. It seemed to be presented as an alternative to RED, described in my previous article. The authors define this new function, then set up a simulation in order to examine the effects.

## When Orthogonal is Abused

The authors describe using a finite linear combination of orthogonal basis polynomials defined on a finite interval as the underlying mathematical structure.

First, we should discuss what we mean by orthogonal in context of functions. Orthogonal is most commonly understood in terms of vectors, and when we’re in two dimensions, orthogonal becomes our familiar perpendicular.

### Orthogonality

Beginning with the familiar notion of perpendicular, we can generalize this to understand orthogonality. The geometric interpretation of two vectors  being perpendicular is that the angle between them is $90^{\circ}$. Once we leave two and three dimensions (or jump to the space of polynomials, as we’ll do soon), the concept of an angle isn’t as helpful.

Another way to define perpindicular is through an operation known as the dot productSuppose we take two 2D vectors, $\mathbf{x}$ and $\mathbf{y}$. Each vector will have coordinates: $\mathbf{x} = (x_{1},x_{2})$ and $\mathbf{y} = (y_{1}, y_{2})$. The dot product is a special type of multiplication defined on vectors, and denoted $\cdot$:

$$\mathbf{x}\cdot\mathbf{y} = x_{1}y_{1} + x_{2}y_{2}$$

The dot product can be described in words as the sum of the component-wise multiplication of the coordinates of the two vectors.

Now, we can say that two vectors are perpendicular if their dot product is 0. That is, $\mathbf{x}$ and $\mathbf{y}$ are perpindicular if $\mathbf{x}\cdot\mathbf{y} = 0$. (This article gives a nice overview of how we move from the algebraic to the geometric definition of perpendicular and orthogonal.)

Remember, perpendicular doesn’t make sense once we get into higher dimensions. Orthogonal is a more general notion of vectors being perpendicular, and is defined for two vectors (of any length) as their dot product equalling zero.

### From Dot Product to Inner Product

The dot product is used on vectors, and defines another type of product that is different from the scalar multiplication we know. In fact, we can generalize the notion of a dot product to something called an inner product, which can be defined on many different spaces than just vectors. We can define operations and products however we like, but for our definition to qualify as an inner product (denoted $\langle \cdot, \cdot\rangle$), it must meet certain criteria

For instance, on the set of real valued functions with domain $[a,b]$, we define the inner product of two functions $f(x)$ and $g(x)$ as

$$\langle f, g\rangle := \int_{a}^{b}f(x)g(x)dx$$

The concept of orthogonality generalizes to an inner product as well. If the inner product of two functions is zero (as defined above), we say the functions are orthogonal.

### Back to the paper

The authors claim to be using a set of orthogonal polynomials to define their drop probability function, and they give the structure of such functions. For a domain $[a,b]$, and for $\phi_{j}$ in the set of polynomials, they define $\phi_{j} = (x-a)^{j-1}(b-x)$. So, for example, $\phi_{1} = (b-x)$, and $\phi_{5} = (x-a)^{4}(b-x)$

Now, in order to be an orthogonal basis for a space1, the set of functions that are claimed to form the basis of the set must be pairwise orthogonal. That is, I need to be able to take the inner product of any two functions $\phi_{i}$ and $\phi_{j}$ and get 0. If that isn’t true for even one pair, then the set is not orthogonal.

As it turns out, if we take the inner product of any two functions in the basis set over the domain given, we find that there are no pairs that are orthogonal. To do this in general, we compute the integral

$$\int_{a}^{b}(x-a)^{i-1}(b-x)\cdot (x-a)^{j-1}(b-x)dx$$

The integral computation is one of simple polynomial integration, and can be done either by hand or using your favorite software (Mathematica) of choice. What we find here is that this set of functions defined in general this way is never orthogonal, yet the paper claims they are.

Applying to the particular situation of designing a drop probability function, they give the following for average queue length thresholds $T_{\min}$ and $T_{\max}$ $$p(x,a_{1},a_{2}) = \left\{\begin{array}{lr}0, &x < T_{\min}\\\phi_{0} + a_{1}\phi_{1}(x) + a_{2}\phi_{2}(x),&T_{\min}\leq x \leq T_{\max}\\1,&x > T_{\max}\end{array}\right.$$

where the basis functions are

\begin{aligned}\phi_{0}(x) &= p_{m}\frac{x-T_{\min}}{T_{\max}-T_{\min}}\\\phi_{1}(x) &= (x-T_{\min})(T_{\max}-x)\\\phi_{2}(x) &= (x-T_{\min})^{2}(T_{\max}-x)\end{aligned}

The reader will recognize $\phi_{0}$ as the original drop probability function from the RED algorithm. These functions are absolutely not orthogonal though (as the authors claim), and a simple check as we did above will verify it.

## Other issues

Another issue is that these mysterious coefficients $a_{1}$ and $a_{2}$ need to be determined. How, you ask? The authors do not say, other than to note that one can define “a functional” implicit on the unknown $p(x,a_{1},a_{2})$ that can be minimized to find the optimal values for those coefficients. They write that this mysterious functional2 can be based on either the average queue length or average waiting time, yet provide no details whatsoever as to the functional they have chosen for this purpose. They provide a figure with a sample function, but give no further details as to how it was obtained.

One other issue I have in their methodology is discussing the order of estimation. For those familiar with all sorts of ways to estimate unknown functions, from Taylor series, to splines, to Fourier series, we know that a function is exactly equal to an infinite sum of such functions. Any finite sum is an estimation, and the number of terms we use for estimation with a desired accuracy may change with the function being estimated.

For instance, if I want to use Taylor series (a linear combination of polynomials) to estimate a really icky function, how many terms should I include to ensure my accuracy at a point is within 0.001 of the real value? It depends on the function.3.

The authors simply claim that a second order term is appropriate in this case. The issue I take with that is these queueing management drop probability functions are designed. We’re not estimating a function describing a phenomenon, we are designing a control policy to seek some sort of optimal behavior. Fundamentally, the authors posing this as an estimation problem of some unknown drop probability function is incorrect. This isn’t a law of nature we seek to observe; it’s a control policy we seek to design and implement and optimize. Using language that implies the authors are estimating some “true function” is misleading.

Regarding the simulation itself, since the design was arbitrary, and not based on sound mathematical principles, I cannot give any real comment to the simulations and the results. The authors briefly discuss and cite some papers that explore the behavior of network traffic, and claim to take this into account in their simulations. To those, I cannot comment yet.

## Conclusion

Always verify a paper for yourself, and don’t accept anything at face value. Research and technical publications should be completely transparent as to choices and methodologies (and obviously free of mathematical inaccuracies), and derivations and proofs should be present when necessary, even if in an appendix.