Interventions, not Anomalies
Anomaly detection is now widely regarded as a “hot topic” across industries and business fields.
A venture capitalist wants to know if a potential investment is an anomaly that will make billions at the time of IPO. A credit card company wants to know if a particular transaction is anomalous, as it could be fraudulent. A systems administrator wants to know if the latency on his storage system is unique to him, or widespread throughout the network. A logistics manager wants to know if the recent influx of orders to his warehouse is unusual and/or permanent. In each of these cases, the person is implicitly asking the same question:
Is my current observation abnormal?
Before we discuss how to detect anomalies/abnormalities/outliers, the term must be defined. What is an anomaly? The brutal secret is that there is no rigid (i.e. mathematical) definition of an anomaly. (See the previous discussion here.) Some attempts at a definition are:

 “An outlying observation, or “outlier,” is one that appears to deviate markedly from other members of the sample in which it occurs.” [1]
 An outlier is an observation that is far removed from the rest of the observations. [2]
 Anything that lies outside the median \pm 1.5IQR.^{1}
All of these statements are relative to some definition of “typical” or “normal”, and each defines an outlier as a point that lies “too far” from it according to some distance metric. Many debates focus on exactly which metric to use, depending on the type of data (L^{2} norm, absolute distance from the median, more than three standard deviations from the mean, etc.), but these debates sidestep the crux of the issue.
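To see how much the choice of metric matters, here is a minimal Python sketch that applies two of the rules mentioned above to the same made-up sample. The function names and the data are illustrative assumptions, and the first rule follows the median \pm 1.5IQR wording used in this post.

```python
import statistics

def median_iqr_rule(data, k=1.5):
    """Flag points outside median +/- k * IQR (the rule as worded in this post)."""
    q1, med, q3 = statistics.quantiles(data, n=4)
    spread = k * (q3 - q1)
    return [x for x in data if abs(x - med) > spread]

def three_sigma_rule(data, k=3.0):
    """Flag points more than k sample standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * sd]

# Made-up sample: nine unremarkable values and one large one.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 16, 30]

flagged_by_iqr = median_iqr_rule(sample)
flagged_by_sigma = three_sigma_rule(sample)
print("median/IQR rule flags:", flagged_by_iqr)
print("three-sigma rule flags:", flagged_by_sigma)
```

On this sample the first rule flags the value 30 while the second flags nothing, because the large value inflates the very standard deviation it is measured against. The same observation is an “outlier” or not depending only on which rule was chosen.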
In order to define an anomaly, one must define “normal”, and then decide how much variability in “normal” is tolerable. There are two stages of subjectivity at play here:
 1. The definition of “normal”
 2. How much variability is acceptable
These two items must be decided before a definition of an anomaly is attempted, which means the resulting definition of an anomaly has a shaky foundation at best. That is, the working definition of an anomaly is relative, and relative to moving targets. Subjectivity has no place in mathematics, because applications of vague mathematics are weak and prone to misuse and abuse. (See previous post.)
There are occasions when the definition of an anomaly using the above two steps can be made clear. These occasions are rare, but should be noted. They typically show up in rigid, well-defined environments, such as manufacturing, or wherever the laws of engineering and physics apply.
Manufacturing specifications define “normal” clearly and unambiguously, like the location of a hole in sheet metal. Laws of physics are immutable, and not dependent on judgment or subjective interpretation. The laws of motion define how a “macro” object behaves (I hear you, quantum physicists), regardless of the object. We know where both a bowling ball and a golf ball are going to land given their initial position and velocity on Earth.
Acceptable variability is also clearly defined in both of these cases. In manufacturing, a design engineer specifies an acceptable tolerance for deviation of the hole’s location. In the case of observing physical laws, the variability comes from measurement error, and is given by the accuracy of the measurement tool.
Example. (The Ideal Gas Law)
The ideal gas law describes the relationship between pressure (in atmospheres), volume (in liters), temperature (in kelvins), and amount of gas (in moles) of an “ideal gas”. Nitrogen behaves very nearly as an ideal gas, so for this example, we will assume that nitrogen is an ideal gas and follows the ideal gas law exactly. The ideal gas law is given by
PV = nRT
where R = 0.08 \tfrac{L\cdot \text{atm}}{\text{mol}\cdot K} is the ideal gas constant (rounded here from approximately 0.0821 for simplicity). Suppose now that we have exactly 1 mol of nitrogen in a 1 L container. We wish to increase the temperature of the nitrogen to see the effect on the pressure. In this case, we know exactly what the result should be, because we have a physical law that defines the model for this behavior. Here we will also note that the hypothetical pressure sensor has a tolerance of 0.001 atm. That is, the sensor can be up to 0.001 atm off when a measurement is taken.
The graph above shows the exact pressure readings we should expect as we increase the temperature of the container. For example, if the temperature of the container is known to be 2 degrees Celsius (about 275 K), the law predicts a pressure of 0.08 \times 275 = 22 atm. If the pressure sensor reads more than 22.001 atm (or less than 21.999 atm), then we can say the pressure reading is anomalous, because the tolerance is defined by the sensor, and the model is defined exactly by a physical law.
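The check in this example can be written out directly. The following Python sketch uses the rounded constant R = 0.08 and the temperature of about 275 K from the text; the function names are hypothetical and chosen only for illustration.

```python
R = 0.08  # rounded ideal gas constant used in this post, in L*atm/(mol*K)

def expected_pressure(n_mol, volume_l, temp_k, r=R):
    """Pressure in atm predicted by the ideal gas law, P = nRT / V."""
    return n_mol * r * temp_k / volume_l

def is_anomalous(reading_atm, n_mol, volume_l, temp_k, tolerance_atm=0.001):
    """A reading is anomalous if it differs from the law by more than the sensor tolerance."""
    predicted = expected_pressure(n_mol, volume_l, temp_k)
    return abs(reading_atm - predicted) > tolerance_atm

# 1 mol of nitrogen in a 1 L container at about 275 K (2 degrees Celsius, rounded).
predicted = expected_pressure(n_mol=1.0, volume_l=1.0, temp_k=275.0)
print("predicted pressure (atm):", predicted)

for reading in (22.0005, 22.002, 21.998):
    print(reading, "anomalous:", is_anomalous(reading, 1.0, 1.0, 275.0))
```

Nothing in this check requires a judgment call: the prediction comes from the law, and the threshold comes from the sensor's specification.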
That is, both (1) the definition of “normal” and (2) the acceptable variability are specified rigidly and objectively, and not according to a person’s judgment. In the vast majority of use cases for “anomaly detection”, measurements are not being taken according to a known model. Most of the time, the mathematical model that governs the data is unknown, often before anyone has even thought about detecting anomalies (whatever we mean by that) in the dataset. Thus, we cannot extend the principles used in manufacturing and physics to the uncertainty of data science.
Intervention vs. Anomalies
The data science world requires a different approach that encompasses the inherent uncertainty involved in modeling behavior in the first place, let alone determining “excessive” deviations from it. Enter intervention analysis.
Intervention analysis is a broader, yet more rigid term that describes the study of changes to a data set.^{2} Conceptually, this term embodies the question
Did the behavior change, and is it permanent?
When we express a desire for an “anomaly detection system”, this is the fundamental question we are asking. Typical algorithms used for outlier analysis, like the Extreme Studentized Deviate test,^{3} only look for “strange points” within the sequence or dataset relative to what you currently see. If a decision is to be made based on a model or a deviation from it, we would like to know what kind of deviation it is. Intervention analysis formally classifies five types of deviation^{4}:

 1. Level shift: At a particular point, the model makes a stepwise change that is permanent.
 2. Level shift with lag: Same as (1), but the shift takes a few time steps to occur. This shows up in the dataset as a gradual increase to the new permanent level.
 3. Temporary change: The model experiences a level shift whose effect decays over time.
 4. Innovation outlier: This is the most complex type, typically defined to represent the onset of a causal effect or a change in the model itself.
 5. Additive outlier: The “point outlier”. It has no effect on future behavior.
For examples of each, visit this page for a good introductory description. I also cite the relevant papers in the references.
The important aspect of this approach is that a model is formally defined (for time series, this is called ARIMA), as are the changes to that model. Thus, when we talk about a level shift, there is no ambiguity. Moreover, with formal definitions and types, we can classify a change according to the effect the point has on future points.^{5}
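As a rough illustration of how the five types differ in their effect on future points, the Python sketch below generates the effect that an intervention of a given size at time t0 would add to a series. This is a simplified sketch and not the formal ARIMA-based definitions: the linear ramp for the lagged shift, the decay rate delta, and the AR(1)-style weights phi ** k for the innovation outlier are all assumptions made for illustration, as are the function names.

```python
def additive_outlier(n, t0, size):
    """Affects only the single point at t0."""
    return [size if t == t0 else 0.0 for t in range(n)]

def level_shift(n, t0, size):
    """Permanent step change starting at t0."""
    return [size if t >= t0 else 0.0 for t in range(n)]

def lagged_level_shift(n, t0, size, lag):
    """Step change reached gradually over `lag` time steps (linear ramp assumed)."""
    return [size * min(t - t0 + 1, lag) / lag if t >= t0 else 0.0 for t in range(n)]

def temporary_change(n, t0, size, delta):
    """Shift at t0 that decays geometrically at rate delta (0 < delta < 1)."""
    return [size * delta ** (t - t0) if t >= t0 else 0.0 for t in range(n)]

def innovation_outlier(n, t0, size, phi):
    """Shock at t0 propagated through the model itself.
    Assumes an AR(1) model with coefficient phi, so the weights are phi ** k."""
    return [size * phi ** (t - t0) if t >= t0 else 0.0 for t in range(n)]

n, t0, size = 6, 2, 4.0
effects = {
    "additive outlier": additive_outlier(n, t0, size),
    "level shift": level_shift(n, t0, size),
    "level shift with lag": lagged_level_shift(n, t0, size, lag=2),
    "temporary change": temporary_change(n, t0, size, delta=0.5),
    "innovation outlier": innovation_outlier(n, t0, size, phi=-0.5),
}
for name, effect in effects.items():
    print(f"{name:22s}", effect)
```

Note that the innovation outlier is the only type whose shape depends on the underlying model: with a different assumed model, the same shock would propagate differently.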
Proposed Strategy
Detecting interventions in practice is difficult, and this post is not meant to diminish that difficulty. Creating a model from data without much prior knowledge of the behavior of the data is also difficult, as these interventions can have adverse effects on building the model in the first place. That is, if a level shift is present in the data at a certain point, we may end up with a different initial model that doesn’t notice the level shift, because it got built into the model as if it were normal, simply because we didn’t know any better. There are estimation techniques, and some papers are referenced at the end for those interested. Here we are interested in an overall strategy to consider these problems in the first place.
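The problem of a level shift being absorbed into the model can be shown with a toy example. In the Python sketch below, a simple mean-and-standard-deviation summary stands in for a real time series model; this stand-in, the three-standard-deviation threshold, and the made-up data are all assumptions for illustration.

```python
import statistics

def flag_three_sigma(points, reference):
    """Flag points more than 3 sample standard deviations from the reference mean."""
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    return [x for x in points if abs(x - mean) > 3 * sd]

# Made-up series with a level shift from about 10 to about 20 halfway through.
before = [10, 11, 9, 10, 10, 11, 9, 10]
after = [20, 21, 19, 20, 20, 21, 19, 20]
series = before + after

# "Model" built from all of the data, shift included.
flags_naive = flag_three_sigma(series, reference=series)
# "Model" built only from data known to precede the shift.
flags_informed = flag_three_sigma(series, reference=before)

print("flags when the shift is built into the model:", flags_naive)
print("flags when the model predates the shift:", flags_informed)
```

When the shift is included in the reference data, the mean lands between the two levels and the spread widens enough that nothing is flagged. When the reference data predates the shift, every point after it is flagged.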
Since we must estimate the model, then attempt to identify interventions and classify them, a sensible solution is a probabilistic approach. That is, the modelling is done via some (hopefully robust) method X, and then the potential intervention points are identified via another method Y. Behind the scenes, we ultimately want to classify each point as
 nothing
 one of the types of interventions described above
and rank these possibilities according to the likelihood that a particular point belongs to each of the 6 possibilities (the 5 intervention types plus “nothing”). The uncertainty is therefore demonstrated in a true probabilistic fashion, and a decision can be made taking into account all the possibilities for a particular intervention.
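One very simplified way to produce such a ranking is sketched below in Python. It is not a method from the referenced papers. It assumes that the residuals from some already-fitted model are available, that the noise is Gaussian with a known standard deviation sigma, and that each intervention type has a fixed effect shape (the lag, delta, and phi values are arbitrary). Each hypothesis is scored by its Gaussian log-likelihood after fitting the intervention size by least squares, with a BIC-style penalty for the extra parameter, and the scores are normalized into probabilities under equal priors. All names are hypothetical.

```python
import math

def unit_shapes(n, t0, lag=3, delta=0.7, phi=-0.5):
    """Unit-size effect pattern for each intervention type (shape parameters assumed)."""
    def after(f):
        return [f(t - t0) if t >= t0 else 0.0 for t in range(n)]
    return {
        "additive outlier": after(lambda k: 1.0 if k == 0 else 0.0),
        "level shift": after(lambda k: 1.0),
        "level shift with lag": after(lambda k: min(k + 1, lag) / lag),
        "temporary change": after(lambda k: delta ** k),
        "innovation outlier": after(lambda k: phi ** k),
    }

def rank_hypotheses(residuals, t0, sigma=1.0):
    """Return (hypothesis, probability) pairs, most probable first."""
    n = len(residuals)
    # "nothing": the residuals are pure noise, no parameter to fit.
    scores = {"nothing": -sum(r * r for r in residuals) / (2 * sigma ** 2)}
    for name, x in unit_shapes(n, t0).items():
        # Least-squares estimate of the intervention size for this shape.
        size = sum(xi * ri for xi, ri in zip(x, residuals)) / sum(xi * xi for xi in x)
        sse = sum((ri - size * xi) ** 2 for xi, ri in zip(x, residuals))
        # Log-likelihood up to a constant, minus a BIC-style penalty for one parameter.
        scores[name] = -sse / (2 * sigma ** 2) - 0.5 * math.log(n)
    # Normalize to probabilities (equal priors), subtracting the max for stability.
    top = max(scores.values())
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    total = sum(weights.values())
    return sorted(((name, w / total) for name, w in weights.items()),
                  key=lambda pair: pair[1], reverse=True)

# Made-up residuals that jump by about 5 at index 3 and stay there.
residuals = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2, 5.0, 4.8]
ranking = rank_hypotheses(residuals, t0=3)
for name, prob in ranking:
    print(f"{name:22s} {prob:.4f}")
```

The output is a full ranking over all 6 possibilities rather than a single yes/no label, so a decision-maker can see how strongly the data favor one explanation over the others.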
Conclusion
In order to identify and solve business problems effectively, the terms must be rigidly defined. By looking at intervention analysis, we can approach very difficult problems with a more solid foundation, since the model and the interventions/outliers have unambiguous, formal definitions. Moreover, a “big picture” strategy for employment should be probabilistic, rather than threshold-based or classification-based. In this way, the uncertainty can be expressed fully, which allows for more informed decision-making.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
References
 [1] Grubbs, F. E. (February 1969). “Procedures for detecting outlying observations in samples”. Technometrics. 11 (1): 1–21. doi:10.1080/00401706.1969.10490657.
 [2] Maddala, G. S. (1992). “Outliers”. Introduction to Econometrics (2nd ed.). New York: MacMillan. pp. 88–96 [p. 89]. ISBN 0023745452.
 [3] Chen, C. and Liu, L.-M. (1993). “Joint Estimation of Model Parameters and Outlier Effects in Time Series”. Journal of the American Statistical Association. 88 (421).
Footnotes
 1. IQR stands for the Interquartile Range, and is defined as the difference between the third and first quartiles (IQR = Q3 - Q1).
 2. Typically, the term is used when studying time series, but I shall extend its use here.
 3. Description here.
 4. There are formal mathematical definitions for these in the context of ARIMA modeling. A future post will address this.
 5. I have not addressed the problem of parameter estimation of the model, which is a valid concern. Here we are only concerned about having a good definition of the model family, as opposed to allowing “black box” algorithms such as neural networks or various other “deep learning” strategies that have no interpretable definition.