I am a skeptic when it comes to Proof of Concept projects (or "POC"s) in predictive analytics. Naturally, I do love the idea of looking before you leap, and I like the notion of "failing quickly". The intentions are always good. But there are mistakes I see so often that I keep my eye out for them and you should too.
The basic issue is that POCs create a rushed atmosphere. I can always predict what everybody wants to rush or skip over. And it’s almost everything except the modeling phase of the project. The bottom line for me is this. I have a clear vision for what has to be accomplished during the first 2-3 weeks of a real project and none of it should be skipped over just because it is a POC.
Here are my issues:
First, there is no focus on measurable benefit or returns. You might be thinking, "Why do I need an ROI? It's just a POC." I'm of the opinion that you should always insist on a clear benefit, yet very few try to measure one in a POC. Even if a vendor has bundled the POC into the price of their software, you will still contribute dozens or hundreds of person-hours. Nothing is free. I would rather have a 6-month project that produces a tangible benefit over time than a 2-month trial that produces nothing more than some interesting insights in a slide deck.
Second, I’ve never heard of a POC where practitioners take data preparation seriously. They use “readily available” data to “see what we can do”. You should never do that on a real project. What you get is something like a concept car: it is never production ready. You might get an insight or two, but you have no tangible work product to bring into the real project if that day ever comes. When you try to capitalize on what you’ve learned, you discover that you truly have to start from scratch.
Third, there is usually no discussion of the purpose of the model. The sense is that first you uncover predictive potential and then you seek implementation details. Initially, this may seem harmless and even appealing. However, one of the most fundamental decisions to make about any predictive model is the time horizon of the prediction.
For example, if you want to predict tomorrow’s weather you would want to use the most current information available as of the time that you make the prediction. You might sketch out the following ideas on the whiteboard as possible features to create.
| Input Variables | Target Variable |
| --- | --- |
| Map activity showing rain a day away? | Will it rain tomorrow? (Sunday) |
| Presence of clouds Saturday? | |
| Saturday humidity | |
| Difference in pressure between Friday and Saturday | |
If you want to make the same prediction a week in advance, then none of these features would exist yet on the previous Sunday. You would need a completely different set of variables, with a different time gap between what you are predicting and what you are predicting it with. For that longer-term forecast, you would need inputs suited to the longer horizon. You simply can’t choose the relevant data without deciding what the goal is. Anyone who suggests that we should just “figure that out later” is ignoring this critical component of an effective model.
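The horizon-dependence above can be sketched in code. This is a minimal illustration, not a real weather pipeline: the column names (`pressure`, `humidity`, `clouds`, `rain`) and the lagging scheme are hypothetical, chosen only to show that a next-day model and a week-ahead model cannot share a feature set.

```python
import pandas as pd

def make_features(df: pd.DataFrame, horizon_days: int) -> pd.DataFrame:
    """Build only the features that would actually be known
    `horizon_days` before the date being predicted."""
    out = pd.DataFrame(index=df.index)
    if horizon_days == 1:
        # Next-day forecast: the freshest readings are legitimate inputs.
        out["pressure_change"] = df["pressure"].diff()
        out["humidity_today"] = df["humidity"]
        out["clouds_today"] = df["clouds"]
    else:
        # Week-ahead forecast: only values observed `horizon_days`
        # earlier may be used, so everything must be lagged back.
        out["pressure_lagged"] = df["pressure"].shift(horizon_days)
        out["humidity_lagged"] = df["humidity"].shift(horizon_days)
    # The target is the same either way: rain on the prediction date.
    out["target_rain"] = df["rain"]
    return out
```

The point is that `make_features(df, 1)` and `make_features(df, 7)` return different columns entirely; deferring the horizon decision means deferring the data preparation itself.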
And finally, I have other concerns, but I’ll share just one more - our fourth reason. There is almost always a rush to the most complicated model that the technology will allow... which is almost always a so-called black box model. These models certainly have their place. But you shouldn’t start with them.
You should start with transparent models that teach you about your data and give you a baseline sense of performance. You then proceed to trying a more complex model to see if you get a bump in performance that is worthy of the interpretability tradeoff. Jumping right in with an opaque model gives you no value-add in terms of data exploration. It also typically gives a false sense of performance because complex models are difficult to diagnose when they are broken. The level of accuracy you think you are experiencing might have no basis in reality.
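The baseline-first workflow can be sketched as follows. This is an illustration under stated assumptions: the dataset (`load_breast_cancer`) is a stand-in for your own data, and the 2-point accuracy threshold is a placeholder for whatever bump your team decides justifies giving up interpretability.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: glass-box baseline. Its coefficients tell you which inputs
# drive the prediction, and its score anchors your expectations.
baseline = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
baseline_acc = baseline.score(X_te, y_te)

# Step 2: a more complex, opaque model, judged against that baseline.
black_box = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
black_box_acc = black_box.score(X_te, y_te)

# Step 3: is the bump worth the interpretability tradeoff?
# (0.02 is an arbitrary illustrative threshold.)
worth_it = (black_box_acc - baseline_acc) > 0.02
```

If the transparent model gets you within a point or two of the black box, as it often does, the opaque model has earned nothing but your distrust.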
In summary: Always measure benefit; take data prep seriously; make sure to agree upon the purpose; and start with “glass box” models first. Don’t rule out POCs completely, but before you jump into one ask yourself some final questions:
• Am I assuming it is worth a try simply because there is readily available data and the commitment seems low?
• Will I truly derive more value than I’m investing while taking all of my team’s time and effort into account?
• What tangible and reusable work product will I have when I’m done?
The last one is perhaps the most important. Do you truly believe that a month from now you will be that much farther down the road toward a valuable outcome? If you like where these questions take you, then you might have something worth doing. If, on the other hand, you are just gambling with a month’s work because you are getting the POC “free” from a vendor, think about whether there is a better way for your team to invest that month.
Keith McCormick will present a keynote during the Datawarehousing & Business Intelligence Summit: 'Data Preparation for Machine Learning: Why Feature Engineering Remains a Human-Driven Activity' on June 10th.