I am a skeptic when it comes to Proof of Concept projects (or "POC"s) in predictive analytics. Naturally, I do love the idea of looking before you leap, and I like the notion of "failing quickly". The intentions are always good. But there are mistakes I see so often that I keep my eye out for them and you should too.
The basic issue is that POCs create a rushed atmosphere. I can always predict what everybody wants to rush or skip over. And it’s almost everything except the modeling phase of the project. The bottom line for me is this. I have a clear vision for what has to be accomplished during the first 2-3 weeks of a real project and none of it should be skipped over just because it is a POC.
Here are my issues:
First, there is no focus on measurable benefit or return. You might be thinking, “Why do I need an ROI? It’s just a POC.” I’m of the opinion that you should always insist on a clear benefit, yet very few try to measure one in a POC. Even if a vendor has bundled the POC into the price of their software, you will still contribute dozens or hundreds of person-hours. Nothing is free. I would rather have a 6-month project that produces a tangible benefit over time than a 2-month trial that produces nothing more than some interesting insights in a slide deck.
Second, I’ve never heard of a POC where practitioners take data preparation seriously. They use “readily available” data to “see what we can do”. You should never do that on a real project. What you get is something vaguely like a concept car. They are never production ready. You might get an insight or two, but you have no tangible work product to bring into the real project if that day ever comes. When you try to capitalize on what you’ve learned, you discover that you truly have to start from scratch.
Third, there is usually no discussion of the purpose of the model. The sense is that first you uncover predictive potential and then you seek implementation details. Initially, this may seem harmless and even appealing. However, one of the most fundamental decisions to make about any predictive model is the time horizon of the prediction.
For example, if you want to predict tomorrow’s weather you would want to use the most current information available as of the time that you make the prediction. You might sketch out the following ideas on the whiteboard as possible features to create.
| Input Variables | Target Variable |
| --- | --- |
| Map activity showing rain a day away? | Will it rain tomorrow? (Sunday) |
| Presence of clouds Saturday? | |
| Saturday humidity | |
| Difference in pressure between Friday and Saturday | |
If you want to make the same prediction a week in advance, none of these features would exist yet on the previous Sunday. You would need a completely different set of variables, with a different time gap between what you are predicting and what you are predicting it with. For that longer-term forecast, you would need inputs suited to the time horizon. You simply can’t choose the relevant data without deciding what the goal is. Anyone who suggests we just “figure that out later” is ignoring this critical component of an effective model.
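To make the time-horizon point concrete, here is a minimal sketch (using pandas, with hypothetical column names and made-up data) of how shifting predictors by the forecast horizon changes which data you may legitimately use:

```python
import pandas as pd

def build_features(df: pd.DataFrame, target_col: str, horizon_days: int) -> pd.DataFrame:
    """Pair each target date with only the predictor values that were
    already known `horizon_days` earlier, avoiding look-ahead leakage."""
    predictors = df.drop(columns=[target_col]).shift(horizon_days)
    return predictors.join(df[[target_col]]).dropna()

# Hypothetical daily weather history: humidity and a rain indicator.
history = pd.DataFrame(
    {"humidity": [60, 65, 70, 72, 68, 75, 80, 78, 74, 71],
     "rain":     [0,  0,  1,  1,  0,  1,  1,  0,  0,  1]},
    index=pd.date_range("2021-03-01", periods=10, freq="D"),
)

day_ahead = build_features(history, "rain", horizon_days=1)   # predict tomorrow
week_ahead = build_features(history, "rain", horizon_days=7)  # predict a week out
print(len(day_ahead), len(week_ahead))  # the week-ahead training set is far smaller
```

Notice that the two horizons produce different feature sets from the same raw data, which is exactly why the purpose has to be settled before feature engineering begins.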
And finally, I have other concerns, but I’ll share just one more - our fourth reason. There is almost always a rush to the most complicated model that the technology will allow... which is almost always a so-called black box model. These models certainly have their place. But you shouldn’t start with them.
You should start with transparent models that teach you about your data and give you a baseline sense of performance. You then proceed to trying a more complex model to see if you get a bump in performance that is worthy of the interpretability tradeoff. Jumping right in with an opaque model gives you no value-add in terms of data exploration. It also typically gives a false sense of performance because complex models are difficult to diagnose when they are broken. The level of accuracy you think you are experiencing might have no basis in reality.
In summary: Always measure benefit; take data prep seriously; make sure to agree upon the purpose; and start with “glass box” models first. Don’t rule out POCs completely, but before you jump into one ask yourself some final questions:
• Am I assuming it is worth a try simply because there is readily available data and the commitment seems low?
• Will I truly derive more value than I’m investing while taking all of my team’s time and effort into account?
• What tangible and reusable work product will I have when I’m done?
The last one is perhaps the most important. Do you truly believe that a month from now you will be that much farther down the road toward a valuable outcome? If you like where these questions take you, then you might have something worth doing. If, on the other hand, you are just gambling with a month’s work because a vendor is offering the POC “for free,” think about whether there is a better way for your team to invest that month.
Keith McCormick will present a keynote during the Datawarehousing & Business Intelligence Summit: 'Data Preparation for Machine Learning: Why Feature Engineering Remains a Human-Driven Activity' on June 10th.