I am a skeptic when it comes to Proof of Concept (POC) projects in predictive analytics. Naturally, I love the idea of looking before you leap, and I like the notion of "failing quickly". The intentions are always good. But there are mistakes I see so often that I keep my eye out for them, and you should too.
The basic issue is that POCs create a rushed atmosphere. I can always predict what everybody wants to rush or skip over. And it’s almost everything except the modeling phase of the project. The bottom line for me is this. I have a clear vision for what has to be accomplished during the first 2-3 weeks of a real project and none of it should be skipped over just because it is a POC.
Here are my issues:
First, there is no focus on measurable benefit or returns. You might be thinking, "Why do I need an ROI? It's just a POC." I'm of the opinion that you should always insist on a clear benefit, yet very few try to measure it in a POC. Even if a vendor has bundled the POC into the price of their software, you will still contribute dozens or hundreds of person-hours. Nothing is free. I would rather have a 6-month project that produces a tangible benefit over time than a 2-month trial that produces nothing more than some interesting insights in a slide deck.
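To make the "nothing is free" point concrete, here is a back-of-envelope sketch; the hours and the loaded rate below are purely hypothetical placeholders, not figures from any real engagement.

```python
# A back-of-envelope sketch of the internal cost of a "free" POC.
# All numbers are hypothetical; plug in your own.
team_hours = 200    # person-hours your team contributes to the POC
loaded_rate = 120   # fully loaded cost per person-hour, in dollars

poc_cost = team_hours * loaded_rate
print(f"Internal POC cost: ${poc_cost:,}")  # $24,000 even if the software is "free"

# If you cannot name a measurable benefit that plausibly exceeds this,
# the POC is an expense, not an experiment.
```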
Second, I've never heard of a POC where practitioners take data preparation seriously. They use "readily available" data to "see what we can do". You should never do that on a real project. What you get is something like a concept car, and concept cars are never production ready. You might get an insight or two, but you have no tangible work product to carry into the real project if that day ever comes. When you try to capitalize on what you've learned, you discover that you truly have to start from scratch.
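One way to avoid that trap is to capture even POC-stage preparation as code that can be refit later. Below is a minimal sketch using scikit-learn; the dataset and column names are hypothetical stand-ins, not a prescription.

```python
# A minimal sketch of data prep captured as a reusable pipeline rather than
# one-off manual edits. Columns and values here are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["tenure_months", "monthly_spend"]  # hypothetical
categorical_cols = ["region", "plan_type"]         # hypothetical

prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

df = pd.DataFrame({
    "tenure_months": [3, 24, np.nan],
    "monthly_spend": [49.0, 99.0, 79.0],
    "region": ["north", "south", "north"],
    "plan_type": ["basic", np.nan, "pro"],
})
X = prep.fit_transform(df)

# Because the steps live in the `prep` object rather than in a spreadsheet,
# the same pipeline can be refit and reused if the POC graduates to a real
# project: a tangible work product instead of a dead end.
```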
Third, there is usually no discussion of the purpose of the model. The assumption is that you first uncover predictive potential and only later work out the implementation details. Initially, this may seem harmless and even appealing. However, one of the most fundamental decisions to make about any predictive model is the time horizon of the prediction.
For example, if you want to predict tomorrow’s weather you would want to use the most current information available as of the time that you make the prediction. You might sketch out the following ideas on the whiteboard as possible features to create.
Input variables:
• Temperature difference between Friday and Saturday
• Map activity showing rain a day away?
• Presence of clouds Saturday?
• Saturday humidity
• Difference in pressure between Friday and Saturday

Target variable:
• Will it rain tomorrow? (Sunday)
If you want to make the same prediction a week in advance, then none of these features would exist yet the previous Sunday. You would need a completely different set of variables, because the time gap between what you are predicting and what you are predicting it with has changed. For that longer-term forecast, you would need features suited to the longer horizon. You simply can't choose the relevant data without deciding what the goal is. Anyone who suggests we should just "figure that out later" is ignoring this critical component of an effective model.
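Here is a minimal sketch of how the forecast horizon drives feature construction. The tiny daily weather frame and its column names are hypothetical; the point is that changing one `horizon` value changes which data you are even allowed to use.

```python
# A minimal sketch: features may only use information available `horizon`
# days before the date being predicted. Data below is hypothetical.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-06-01", periods=10, freq="D"),
    "temp": [21, 23, 22, 25, 24, 26, 27, 25, 24, 23],
    "pressure": [1012, 1010, 1011, 1008, 1007, 1009, 1010, 1006, 1005, 1007],
    "rained": [0, 0, 1, 0, 0, 1, 1, 0, 0, 1],
}).set_index("date")

horizon = 1  # days ahead; change to 7 for a week-ahead model

# Shifting by the horizon enforces the gap between the features and the
# date being predicted.
features = pd.DataFrame({
    "temp_change": df["temp"].diff(),        # day-over-day change
    "pressure_change": df["pressure"].diff(),
}).shift(horizon)

target = df["rained"]
training = features.join(target).dropna()

# With horizon = 7, each row's features come from a week earlier: a
# completely different modeling problem than the next-day forecast.
print(training.head())
```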
And finally, I have other concerns, but I'll share just one more: our fourth reason. There is almost always a rush to the most complicated model the technology will allow, which is almost always a so-called black box model. These models certainly have their place. But you shouldn't start with them.
You should start with transparent models that teach you about your data and give you a baseline sense of performance. Then try a more complex model to see whether the bump in performance is worth the interpretability tradeoff. Jumping straight to an opaque model gives you no value-add in terms of data exploration. It also tends to give a false sense of performance, because complex models are difficult to diagnose when they are broken: the level of accuracy you think you are seeing might have no basis in reality.
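Here is a minimal sketch of that glass-box-first workflow, using synthetic data as a stand-in for a real POC dataset; the specific models and metric are illustrative choices, not a prescription.

```python
# A minimal sketch: fit an interpretable baseline first, inspect it, then
# check whether a black-box model earns its opacity. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

baseline = LogisticRegression(max_iter=1000)
black_box = GradientBoostingClassifier(random_state=0)

base_auc = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc").mean()
bbox_auc = cross_val_score(black_box, X, y, cv=5, scoring="roc_auc").mean()

print(f"baseline AUC:  {base_auc:.3f}")
print(f"black box AUC: {bbox_auc:.3f}")

# The baseline's coefficients are directly inspectable, which also teaches
# you about the data; adopt the black box only if the AUC gain justifies
# losing that transparency.
baseline.fit(X, y)
print(dict(zip(range(10), baseline.coef_[0].round(2))))
```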
In summary: always measure benefit, take data prep seriously, agree on the model's purpose up front, and start with "glass box" models. Don't rule out POCs completely, but before you jump into one, ask yourself some final questions:
• Am I assuming it is worth a try simply because there is readily available data and the commitment seems low?
• Will I truly derive more value than I'm investing, taking all of my team's time and effort into account?
• What tangible and reusable work product will I have when I’m done?
The last one is perhaps the most important. Do you truly believe that a month from now you will be that much farther down the road toward a valuable outcome? If you like where these questions take you, then you might have something worth doing. If, on the other hand, you are just gambling with a month's work because a vendor is offering the POC "for free", think about whether there is a better way for your team to invest that month.
Keith McCormick will present a keynote, 'Data Preparation for Machine Learning: Why Feature Engineering Remains a Human-Driven Activity', at the Datawarehousing & Business Intelligence Summit on June 10th.