24-01-2020 By: Keith McCormick

Are Proofs of Concept Best for Buy-In in Machine Learning and Predictive Analytics?


I am a skeptic when it comes to Proof of Concept projects (or "POC"s) in predictive analytics. Naturally, I do love the idea of looking before you leap, and I like the notion of "failing quickly". The intentions are always good. But there are mistakes I see so often that I keep my eye out for them and you should too.

The basic issue is that POCs create a rushed atmosphere. I can always predict what everybody wants to rush or skip over. And it’s almost everything except the modeling phase of the project. The bottom line for me is this. I have a clear vision for what has to be accomplished during the first 2-3 weeks of a real project and none of it should be skipped over just because it is a POC.

Here are my issues
First, there is no focus on measurable benefit or returns. You might be thinking “Why do I need an ROI, it’s just a POC?” I’m of the opinion that you should always desire a clear benefit. Unfortunately, very few try to measure it with a POC. Even if a vendor has bundled it into the price of their software, you will still have dozens or hundreds of person-hours that you are going to contribute. Nothing is free. I would rather have a 6-month project that will produce a tangible benefit over time than a 2-month trial project that produces nothing more than some interesting insights in a slide deck.
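To make the "nothing is free" point concrete, here is a minimal sketch of the arithmetic. The function name, the hourly rate, and every figure in it are hypothetical placeholders, not numbers from any real project.

```python
def poc_net_benefit(person_hours, hourly_cost, vendor_fee, expected_benefit):
    """Return the expected benefit minus the full cost of the POC.

    All inputs are the team's own estimates. vendor_fee may be 0 when the
    vendor bundles the POC into the price of the software.
    """
    internal_cost = person_hours * hourly_cost
    return expected_benefit - (internal_cost + vendor_fee)


# Hypothetical "free" POC: no vendor fee, 300 person-hours at 100 per hour,
# and nothing measurable at the end beyond a slide deck.
print(poc_net_benefit(person_hours=300, hourly_cost=100,
                      vendor_fee=0, expected_benefit=0))  # prints -30000
```

Even with a vendor fee of zero, the result is negative as soon as the team's own hours are counted and no benefit is measured.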

Second, I’ve never heard of a POC where practitioners take data preparation seriously. They use “readily available” data to “see what we can do”. You should never do that on a real project. What you get is something vaguely like a concept car, and concept cars are never production ready. You might get an insight or two, but you have no tangible work product to bring into the real project if that day ever comes. When you try to capitalize on what you’ve learned, you discover that you truly have to start from scratch.

Third, there is usually no discussion of the purpose of the model. The sense is that first you uncover predictive potential and then you seek implementation details. Initially, this may seem harmless and even appealing. However, one of the most fundamental decisions to make about any predictive model is the time horizon of the prediction.
For example, if you want to predict tomorrow’s weather you would want to use the most current information available as of the time that you make the prediction. You might sketch out the following ideas on the whiteboard as possible features to create.

Input Variables                                        Target Variable
Temperature difference between Friday and Saturday     Will it rain tomorrow? (Sunday)
Map activity showing rain a day away?
Presence of clouds Saturday?
Saturday humidity
Difference in pressure between Friday and Saturday

 

If you want to make the same prediction a week in advance, then none of these features would exist yet on the previous Sunday. You would need to use a completely different set of variables, because there would be a different time gap between what you are predicting and what you are predicting it with. For that longer-term forecast, you would need something more suited to the time horizon. You simply can’t choose the relevant data without deciding what the goal is. If someone were to suggest that we should just “figure that out later”, then they are ignoring this critical component of an effective model.
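The time-horizon point can be sketched in a few lines of code. This is only an illustration: the feature names, the lag values, and the two long-range features at the bottom of the catalog are assumptions made up for the example, not part of any real forecasting system.

```python
# Hypothetical catalog of candidate features. Each value is the feature's lag:
# roughly how many days before the target day (Sunday) its newest input is observed.
CANDIDATE_FEATURES = {
    "temp_diff_fri_sat": 1,
    "rain_on_map_one_day_away": 1,
    "clouds_saturday": 1,
    "humidity_saturday": 1,
    "pressure_diff_fri_sat": 1,
    # The two entries below are invented examples of longer-range inputs.
    "avg_rainfall_same_month_prior_years": 365,
    "rained_same_date_last_year": 365,
}


def usable_features(candidates, horizon_days):
    """Keep only features whose newest input already exists at prediction time.

    horizon_days is the gap between making the prediction and the target day.
    """
    return sorted(name for name, lag in candidates.items() if lag >= horizon_days)


# Predicting on Saturday for Sunday: every candidate is available.
print(usable_features(CANDIDATE_FEATURES, horizon_days=1))
# Predicting a week ahead: the Friday/Saturday features do not exist yet.
print(usable_features(CANDIDATE_FEATURES, horizon_days=7))
```

Changing only the horizon changes which data is even eligible, which is why the purpose of the model has to be agreed before the data is chosen.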

I have other concerns, but I’ll share just one more, my fourth. There is almost always a rush to the most complicated model that the technology will allow, which is almost always a so-called black box model. These models certainly have their place. But you shouldn’t start with them.
You should start with transparent models that teach you about your data and give you a baseline sense of performance. You then proceed to trying a more complex model to see if you get a bump in performance that is worthy of the interpretability tradeoff. Jumping right in with an opaque model gives you no value-add in terms of data exploration. It also typically gives a false sense of performance because complex models are difficult to diagnose when they are broken. The level of accuracy you think you are experiencing might have no basis in reality.
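One way to picture the "transparent first" workflow is sketched below. It fits a one-rule model (a decision stump) that anyone can read, then asks whether a more complex model beats it by enough to justify the loss of interpretability. The toy humidity data, the function names, and the 0.02 minimum gain are all assumptions for illustration, and a real comparison would score both models on held-out data rather than on the training set.

```python
def fit_stump(xs, ys):
    """Find the threshold t that makes 'predict 1 when x >= t' most accurate.

    Returns (threshold, training_accuracy). Ties keep the lowest threshold.
    """
    best = None
    for t in sorted(set(xs)):
        preds = [1 if x >= t else 0 for x in xs]
        acc = sum(p == y for p, y in zip(preds, ys)) / len(ys)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best


def worth_the_opacity(baseline_acc, complex_acc, min_gain=0.02):
    """True only if the complex model beats the baseline by at least min_gain."""
    return complex_acc - baseline_acc >= min_gain


# Invented toy data: Saturday humidity (%) and whether it rained on Sunday.
humidity = [30, 45, 50, 62, 70, 81, 88, 93]
rained = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, baseline_acc = fit_stump(humidity, rained)
print(f"Rule: rain if humidity >= {threshold} (accuracy {baseline_acc:.3f})")

# complex_acc would come from evaluating the opaque model the same way;
# 0.88 here is a placeholder, not a measured result.
print(worth_the_opacity(baseline_acc, complex_acc=0.88))
```

The stump's rule doubles as data exploration, and the baseline gives a reference point: a black box score that looks implausibly far above it is a reason to go looking for a bug.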

In summary: always measure benefit; take data prep seriously; agree upon the purpose; and start with “glass box” models. Don’t rule out POCs completely, but before you jump into one ask yourself some final questions:
• Am I assuming that it is worth a try simply because there is readily available data and the commitment seems to be low?
• Will I truly derive more value than I’m investing while taking all of my team’s time and effort into account?
• What tangible and reusable work product will I have when I’m done?

The last one is perhaps the most important. Do you truly believe that a month from now you will be that much farther down the road toward a valuable outcome? If you like where these questions take you, then you might have something worth doing. If, on the other hand, you are just gambling with a month’s work because you are getting the POC for “free” from a vendor, think about whether there is a better way for your team to invest a month’s work.

Keith McCormick will present a keynote during the Datawarehousing & Business Intelligence Summit: 'Data Preparation for Machine Learning: Why Feature Engineering Remains a Human-Driven Activity' on June 10th.

Keith McCormick

Keith McCormick is a highly accomplished professional senior consultant, mentor, and trainer, having served as keynote and moderator at international conferences focused on analytic practitioners and leadership alike. Keith has leveraged statistical software since 1990 along with deep expertise utilizing popular industry advanced analytics solutions such as IBM SPSS Statistics, IBM SPSS Modeler, AMOS, Answer Tree, popular open source and other tools involving text and big data analytics.
Keith McCormick has guided organizations to establish highly effective analytical practices across industries, including public sector, media, marketing, healthcare, retail, finance, manufacturing and higher education. He holds a unique blend of tactical and strategic skill along with the business acumen to ensure superior project design, oversight and outcomes that align with organizational targets.
