# Commentary: Infrastructure Considerations for Machine Learning

Welcome to another brief commentary and departure from the heavier mathematics. I have been endeavoring to expand the breadth of my knowledge on the tech side of things, chronicling what I've learned and observed from speaking with different companies, both as an independent and as a Tech Field Day delegate. Many of these articles have focused on considerations for a practitioner rather than a mathematician, but occasionally we theorists have to show some business value, so I try to keep current on the tools and methods used in the corporate world.

It's fairly common knowledge now that most machine learning and deep learning algorithms are highly data-dependent. That is, the data you feed something like a neural network heavily affects the results. Since the common analogy for machine learning, artificial intelligence, and neural networks is one of a biological learning process, let me continue that analogy. These algorithms are like small children; they're sponges. They learn based on the type and amount of data given, and in surprising ways. If you want to teach a child (or baby Terminator, perhaps one named John Henry) what a cow looks like, you must be very careful what you give him. If you only give him pictures of birds and cows, he may decide that a cow is identified by the number of legs it has. Then what happens when he is given a picture of a cat?

Perhaps you think of this and throw in pictures of dogs too. Aha! So a cow has four legs and hoofed feet! Until John Henry sees a zebra. This silly example illustrates just how long we took to learn even simple things as children, and how important large amounts of repetitive and varied data were to us converging on how to recognize a cow. These AI/ML/NN algorithms are designed to mimic this learning process, and thus require vast amounts of highly varied data. Good performance by an algorithm on a subset of the data may not hold up against the expanse of real-world data, just as in the example of learning to recognize a cow. Thus, these algorithms are not ergodic, to borrow a term from dynamics and probability: the models and methods are not independent of the initial data you feed them. In other words, if two different people feed the same algorithm different datasets and let it "learn", the end results can be vastly different.
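The cow/bird story above can be made concrete with a toy sketch. The snippet below is a minimal, hypothetical illustration (my own construction, not any particular library's API): the same simple learner, a nearest-centroid classifier, trained on two different subsets of data, gives different answers for the same input. The features `(leg_count, has_hooves)` and the animal labels are assumptions made up for the example.

```python
# Hypothetical illustration of data-dependence: the same learning rule
# (nearest centroid) trained on two different subsets disagrees on the
# same test input.

def train_centroids(samples):
    """Compute one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label of the nearest centroid (squared distance)."""
    return min(centroids,
               key=lambda lb: sum((a - b) ** 2
                                  for a, b in zip(centroids[lb], features)))

# Feature vector: (leg_count, has_hooves).
# Subset A only contrasts birds with cows, so leg count alone separates them.
subset_a = [((2, 0), "bird"), ((2, 0), "bird"), ((4, 1), "cow")]
# Subset B also contains dogs, which forces hooves to matter.
subset_b = subset_a + [((4, 0), "dog"), ((4, 0), "dog")]

cat = (4, 0)  # four legs, no hooves
print(predict(train_centroids(subset_a), cat))  # -> cow (leg count fooled it)
print(predict(train_centroids(subset_b), cat))  # -> dog (richer data helps)
```

The point is not the classifier itself but that nothing in the learning rule changed between the two runs; only the data did, and the resulting model changed with it.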

To get around this, most practitioners of data science want to throw as much data as possible at the algorithm, ideally the entirety of everything. If you want to learn the shopping habits on an e-commerce site, you'd prefer to let your John Henry learn on the whole database rather than just a subset.1

However, your IT department would likely be unhappy with your request to run tests on a production environment, for a multitude of reasons, security and performance being two of them. Having a bunch of copies floating around consumes massive amounts of storage, to say nothing of the security risks, and a bad query run against the production environment can take the whole e-store down.2 I spoke twice with Actifio about their Sky Infrastructure, first hearing from them at Tech Field Day 15, then interviewing them again to get more detail about use cases rather than an overview of the infrastructure itself.

As a quick overview (Mr. Achilles does a great job on the tech details in this video), Actifio creates what they term a "golden copy" of your data, after which updates are applied incrementally to save storage space, and everyone gets their own virtual copy (really more like a pointer) to interact with. Now a data scientist can't affect the production database when they query against it, and can also use far more data in testing than before. This should shorten the data science development cycle, because the usual workaround for training on subsets of data is to sample many subsets and train the algorithm over and over again, which takes time. In addition, the data scientist can find out very quickly whether the code that worked for 100,000 rows will hold up against 10 million (I am guilty of writing unscalable code in my past experience as a data scientist).
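The golden-copy idea is essentially copy-on-write. The sketch below is my own conceptual illustration of that general technique, not Actifio's actual implementation: each virtual copy is a thin overlay that records only its own changes, reads fall through to the shared base, and writes never touch production. The class and key names are invented for the example.

```python
# Conceptual copy-on-write sketch (NOT Actifio's implementation): one
# shared "golden copy" plus cheap virtual copies that overlay changes.

class GoldenCopy:
    """The single authoritative dataset, never modified by experiments."""
    def __init__(self, rows):
        self._rows = dict(rows)

    def get(self, key):
        return self._rows[key]

class VirtualCopy:
    """A copy-on-write view: writes land in a private overlay, so the
    underlying golden copy is never altered."""
    def __init__(self, base):
        self._base = base
        self._overlay = {}   # only this copy's changes live here

    def get(self, key):
        # Prefer this copy's own change; otherwise fall through to base.
        if key in self._overlay:
            return self._overlay[key]
        return self._base.get(key)

    def put(self, key, value):
        self._overlay[key] = value   # never touches the base data

golden = GoldenCopy({"order_1": "pending", "order_2": "shipped"})
sandbox = VirtualCopy(golden)

sandbox.put("order_1", "cancelled")   # data scientist experiments freely
print(sandbox.get("order_1"))         # -> cancelled (from the overlay)
print(golden.get("order_1"))          # -> pending (production untouched)
```

Storage cost scales with the changes each user makes, not with the size of the dataset, which is why every data scientist can afford a "full" copy.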

Being more of a theoretician, I don't tend to step out of my bubble to consider the infrastructure necessary to provide me my sandbox. To fix that, I endeavor to occasionally speak with various tech companies about their work. I like the way Actifio has streamlined a solution that aims to satisfy both the IT gatekeepers and the developers, data scientists, and other users of the data. Overall, I'm not exactly a fan of semi-blind deep-learning approaches to making all business decisions, but those methods do have their uses, particularly in exploration and discovery, and this platform has real potential to help a data science team in its development work.

[Disclaimer: I have never been compensated by Actifio or any other company for my commentary articles.]