Commentary: High Level Data Filtration

Commentary: High Level Data Filtration

The consensus over the last five or so years has converged on a conclusion regarding data: we’re drowning in it. We have more than we can possibly monitor with our own eyeballs, and certainly more than we know what to do with intelligently. The motto for data scientists has been “More is better.” Well, ask and ye shall receive. The image I see is a virtual version of someone looking at the end of a firehose as it’s turned on full blast, then trying to spot an occasional speck in the stream of water. More data is good, only if it’s good data.

In modeling as well, we have generated great predictive algorithms1 for a multitude of applications, including sales, cybersecurity, and customer support. These models are meant to alert a good sales prospect, a high priority customer support need, or a security issue on your network. One potential issue is the notion of too many alerts, or alerts generated from repeat or replicated data. Alert fatigue is a real thing; humans can become adjusted to what used to be a panicked state, and consequently, alerts get ignored. How many times do you hear a car alarm and assume it’s actually due to a thief? Most of the time, we put the pillow over our head and grumble as we try to get back to sleep. What about an alert that pops up on your computer repeatedly? You end up ignoring it because it keeps bothering you about the same thing you acknowledged three hours ago. 

The issue with these scenarios is that when a real alert comes in between these ones you prefer to dismiss, you’ll likely ignore the new one as well.  Enter the need for filtration. We have so much data (much of it repeats) in enterprise scenarios that we need a way to filter these streams by eliminating the obvious and duplicated, as it were. To focus on a very real illustration of this need, take a look at enterprise network traffic. Enterprises have thousands of devices sending massive amounts of data within the network. They also deal with thousands of traffic that moves into and out of the network. The amount of packet data and metadata you could capture in an attempt to monitor this is difficult for us to really fathom. We need to decide what data is useful, and what isn’t; and we need to decide this in real time, as the information flows. 

An intelligent way to approach data filtration is similar to how we look at water filtration. The first thing you want to do is get the obvious large chunks of…whatever…out. Then you apply progressively tighter layers of filtration (plus some chemical treatment) until you get clean water. Data cleaning can be a bit like that. Data scientists2 recognize that they will never be handed clean water to work with, and that they’ll have to do some cleaning themselves. But rarely does anyone who is actually tasked with developing data science or cybersecurity solutions want to be the ones removing the obvious big garbage. 

I watched, as part of Tech Field Day 153, a presentation by Ixia (video can be found here) on what is effectively a real-time filtration system for network treat intelligence. The idea is to leverage their database of known obvious issues and “bad players”, as it were, to quickly filter out and mitigate these “large chunks” before passing the data to the more refining and advanced cybersecurity monitoring products. Ixia’s products also look for duplicate data, and remove that as well.  I like that they stepped in to do the “dirty work”, as it were, and offer a solution to help with that high level filtration in real time. They were very clear about what their analytics did and did not do, which I respect. 

The benefit of clearing out these known issues or duplicated data is clear whenever someone downstream, as it were, feeds data into some variation of a predictive machine learning algorithm that is meant to monitor, evaluate, and alert. The algorithms can run more efficiently with less data flowing through it, and unnecessary alerts can be eliminated, allowing a security monitoring system to only deal with the data that suggest potentially new threats that may require human intervention, as the known threats were identified and mitigated many steps earlier. 

The world needs all kinds. Everyone wants to be the surgeon, because he’s famous, but the surgeon cannot perform well without good technicians and nurses to help with the prep work. Ixia stepped in and offers a technician for the job of filtering prepping network traffic for better analysis and better security monitoring, which will keep the surgeon from fatigue. 
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


  1. even if they are uninterpretable
  2. the good ones, at least
  3. Disclaimer: I was not compensated for my contributions or attendance in any way, other than travel/lodging expenses covered.