Predictive analytics is a phrase that shows up often as a feature in tech offerings. Predicting when a problem is likely to occur lets either a human or an automated system act to mitigate the issue before it happens and has potentially catastrophic effects. Take data storage, for example: the worst thing that can happen on your storage array is the loss of data. Data loss is permanent (unless you have a backup), so when it’s gone, it’s gone.
Mitigating problems proactively involves modeling system behavior (whatever that system happens to be) according to some metric or set of metrics, then evaluating new data as it arrives and deciding whether it is “abnormal” or indicative of a potential problem. This means there are three steps: modeling the behavior of the system (or the aspect of the system) you are trying to monitor, determining what is normal (and by complement, abnormal), and deciding how you will handle an abnormal data point, event, or other metric.
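To make the three steps concrete, here is a minimal sketch in Python. It assumes the simplest possible model (a mean and standard deviation of past samples) and a fixed tolerance of three standard deviations; the function names and the latency values are made up for illustration and do not come from any particular product.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """Step 1: model the metric as a mean and a sample standard deviation."""
    return mean(history), stdev(history)


def is_abnormal(value, baseline, tolerance=3.0):
    """Step 2: 'normal' means within `tolerance` standard deviations of the mean."""
    mu, sigma = baseline
    return abs(value - mu) > tolerance * sigma


def handle(value, baseline):
    """Step 3: decide what to do with the new point (here, just label it)."""
    return "alert" if is_abnormal(value, baseline) else "ok"


# Hypothetical disk latency samples in milliseconds (made-up values).
history = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
baseline = fit_baseline(history)

print(handle(5.1, baseline))  # close to the baseline mean
print(handle(9.0, baseline))  # far outside the baseline
```

A real system would almost certainly need a richer model than this (trends, seasonality, multiple metrics), but the shape of the problem stays the same: fit, compare, act.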
These three steps, while simple to articulate, are three distinct and very difficult problems, each involving a whole host of interrelated considerations. For instance:
- How do we decide which metrics to use in our modeling?
- How do we verify that this model is an accurate representation of system behavior?
- Once we have a model, how do we define unusual or anomalous behavior?
- Once we define anomalous behavior, how do we decide our courses of action? Do we act on any “weird” point that crosses some threshold, or should we see the threshold crossed repeatedly, or something else?
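The last question, whether to act on a single “weird” point or wait for repeated crossings, can be sketched as two alternative policies. This is only an illustration: the threshold, the choice of `k`, and the sample values are assumptions, not recommendations.

```python
def alert_on_any(values, threshold):
    """Policy A: return the index of every point above the threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def alert_on_repeated(values, threshold, k=3):
    """Policy B: return indices where the threshold has just been crossed
    k times in a row. Fires once per sustained run, not on every point."""
    alerts, streak = [], 0
    for i, v in enumerate(values):
        streak = streak + 1 if v > threshold else 0
        if streak == k:
            alerts.append(i)
    return alerts


# Made-up samples: one isolated spike, then a sustained run of high values.
samples = [5, 9, 5, 5, 9, 9, 9, 9, 5]

print(alert_on_any(samples, threshold=8))
print(alert_on_repeated(samples, threshold=8, k=3))
```

Policy A reacts to the isolated spike as well as the sustained run, while Policy B ignores the spike but reacts later to the real run. Neither is right in general; which one fits depends on how costly a late alert is compared to a false one.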
Proactively mitigating system issues is a well-justified goal for many companies, because it increases system reliability. I watched Starwind present their Virtual Tape Library at Storage Field Day 15, and they, like many other companies, strive to detect impending failure patterns and take preventative measures before a catastrophic failure. The presentation was only two hours long and covered their entire architecture, not just the failure pattern detection feature, so we did not have time to discuss the specifics of Starwind’s Proactive Support, as they call it.
Detecting any kind of pattern in data is difficult, especially a failure pattern. There are always tradeoffs. If we set our tolerance for what we consider “normal behavior” too low, so that even small deviations count as abnormal, we risk alerting on potential issues too often. When this happens, alerts get ignored, and real problems are assumed to be just another “false alarm.” On the other hand, if we set the tolerance too high, we run the risk of not detecting an issue at all.
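The tradeoff is easy to see with a small experiment. The sketch below counts false alarms and missed problems at a few tolerance settings, assuming we already have a baseline mean and standard deviation and a handful of labeled samples. All of the numbers are invented for illustration.

```python
def evaluate(points, mu, sigma, tolerance):
    """Count (false_alarms, missed) for one tolerance setting.

    `points` is a list of (value, is_real_problem) pairs.
    """
    false_alarms = missed = 0
    for value, is_real_problem in points:
        flagged = abs(value - mu) > tolerance * sigma
        if flagged and not is_real_problem:
            false_alarms += 1
        elif not flagged and is_real_problem:
            missed += 1
    return false_alarms, missed


# Made-up labeled samples; mu and sigma are assumed to come from an earlier baseline.
points = [(5.0, False), (5.5, False), (4.4, False),
          (6.2, True), (5.1, False), (7.0, True)]
mu, sigma = 5.0, 0.2

for tolerance in (1.0, 4.0, 8.0):
    print(tolerance, evaluate(points, mu, sigma, tolerance))
```

With this toy data, the tightest tolerance flags harmless points, the loosest one lets a real problem slip through, and the middle setting happens to get everything right. Real data is rarely that cooperative, which is exactly why picking the tolerance is hard.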
At this point, I’d like to open a dialogue in the comments, particularly because these subjects are deep; in many cases so deep that an entire two-hour presentation could be devoted to just a few aspects of this very large challenge. How do we balance these tradeoffs? How do we decide whether an unusual data point or set of points is really something bad? If we are generating too many “false alarms,” is it possible that our original behavior model is off?
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.