I am a skeptic when it comes to Proof of Concept projects (or "POC"s) in predictive analytics. Naturally, I do love the idea of looking before you leap, and I like the notion of "failing quickly". The intentions are always good. But there are mistakes I see so often that I keep my eye out for them and you should too.
The basic issue is that POCs create a rushed atmosphere. I can always predict what everybody wants to rush or skip over. And it’s almost everything except the modeling phase of the project. The bottom line for me is this. I have a clear vision for what has to be accomplished during the first 2-3 weeks of a real project and none of it should be skipped over just because it is a POC.
Here are my issues.
First, there is no focus on measurable benefit or returns. You might be thinking “Why do I need an ROI, it’s just a POC?” I’m of the opinion that you should always desire a clear benefit. Unfortunately, very few try to measure it with a POC. Even if a vendor has bundled it into the price of their software, you will still have dozens or hundreds of person-hours that you are going to contribute. Nothing is free. I would rather have a 6-month project that will produce a tangible benefit over time than a 2-month trial project that produces nothing more than some interesting insights in a slide deck.
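A back-of-the-envelope calculation can make the "nothing is free" point concrete. The sketch below is only an illustration: the function name, its parameters, and every number in the usage line are hypothetical, not figures from any real project.

```python
def net_benefit(expected_benefit, person_hours, hourly_cost, vendor_fee=0.0):
    """Estimated benefit minus the full cost of the effort.

    All inputs are estimates supplied by the team (hypothetical here).
    The team's own time is counted even when the vendor fee is zero.
    """
    internal_cost = person_hours * hourly_cost
    return expected_benefit - internal_cost - vendor_fee


# Hypothetical "free" POC: no vendor fee, but 200 person-hours of team time.
print(net_benefit(expected_benefit=10_000, person_hours=200, hourly_cost=75.0))
```

With these made-up inputs the result is negative, which is the situation described above: a POC bundled into the software price can still cost more than it returns once the team's hours are counted.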
Second, I’ve never heard of a POC where practitioners take data preparation seriously. They use “readily available” data to “see what we can do”. You should never do that on a real project. What you get is something vaguely like a concept car, and concept cars are never production ready. You might get an insight or two, but you have no tangible work product to bring into the real project if that day ever comes. When you try to capitalize on what you’ve learned, you discover that you truly have to start from scratch.
Third, there is usually no discussion of the purpose of the model. The sense is that first you uncover predictive potential and then you seek implementation details. Initially, this may seem harmless and even appealing. However, one of the most fundamental decisions to make about any predictive model is the time horizon of the prediction.
For example, if you want to predict tomorrow’s weather you would want to use the most current information available as of the time that you make the prediction. You might sketch out the following ideas on the whiteboard as possible features to create.
|Input Variables|Target Variable|
|Map activity showing rain a day away?|Will it rain tomorrow? (Sunday)|
|Presence of clouds Saturday?| |
|Saturday humidity| |
|Difference in pressure between Friday and Saturday| |
If you want to make the same prediction a week in advance, then none of these features would exist yet on the previous Sunday. You would need to use a completely different set of variables, because you would have a different time gap between what you are predicting and what you are predicting it with. For that longer-term forecast, you would need something more suited to the time horizon. You simply can’t choose the relevant data without deciding what the goal is. If someone were to suggest that we should just “figure that out later”, then they are ignoring this critical component of an effective model.
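The dependence of feature choice on time horizon can be sketched in a few lines. This is a minimal illustration, assuming each candidate feature is tagged with how many days before the target day its value becomes known. The class name, the field name, and the fifth, longer-range feature are hypothetical additions for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateFeature:
    name: str
    # How many days before the target day this value becomes known.
    known_days_before_target: int


# Target: "Will it rain on Sunday?"
CANDIDATES = [
    CandidateFeature("Map activity showing rain a day away?", 1),
    CandidateFeature("Presence of clouds Saturday?", 1),
    CandidateFeature("Saturday humidity", 1),
    CandidateFeature("Difference in pressure between Friday and Saturday", 1),
    # Hypothetical longer-range feature, not part of the whiteboard list.
    CandidateFeature("Long-run average rainfall for this calendar week", 365),
]


def usable_features(horizon_days: int) -> List[str]:
    """Keep only features already known `horizon_days` before the target day."""
    return [
        f.name for f in CANDIDATES
        if f.known_days_before_target >= horizon_days
    ]


print(usable_features(1))  # one-day-ahead forecast: all five candidates
print(usable_features(7))  # one-week-ahead forecast: only the long-run average
```

Changing the horizon from one day to seven removes every Saturday-based input, which is why the horizon has to be agreed before the data is chosen.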
Fourth and finally (I have other concerns, but I’ll share just one more), there is almost always a rush to the most complicated model that the technology will allow, which is almost always a so-called black box model. These models certainly have their place. But you shouldn’t start with them.
You should start with transparent models that teach you about your data and give you a baseline sense of performance. You then proceed to trying a more complex model to see if you get a bump in performance that is worthy of the interpretability tradeoff. Jumping right in with an opaque model gives you no value-add in terms of data exploration. It also typically gives a false sense of performance because complex models are difficult to diagnose when they are broken. The level of accuracy you think you are experiencing might have no basis in reality.
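The "baseline first, then check the bump" step can be written as an explicit decision rule. This is only a sketch: the function name, the `min_lift` threshold, and the scores in the usage lines are hypothetical, and the right threshold depends on how much interpretability matters for the project.

```python
def worth_the_tradeoff(baseline_score, complex_score, min_lift=0.02):
    """Return True only if the opaque model beats the transparent
    baseline by at least `min_lift` on the same holdout data.

    Both scores are assumed to use the same metric (higher is better).
    """
    return (complex_score - baseline_score) >= min_lift


# Hypothetical holdout scores for a transparent model and a black box model.
print(worth_the_tradeoff(baseline_score=0.80, complex_score=0.81))  # small bump
print(worth_the_tradeoff(baseline_score=0.80, complex_score=0.86))  # larger bump
```

Without the transparent baseline there is nothing to pass as `baseline_score`, so there is no way to tell whether the complex model's number reflects a real gain.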
In summary: Always measure benefit; take data prep seriously; make sure to agree upon the purpose; and start with “glass box” models first. Don’t rule out POCs completely, but before you jump into one ask yourself some final questions:
• Am I assuming that it is worth a try simply because there is readily available data and the commitment seems to be low?
• Will I truly derive more value than I’m investing while taking all of my team’s time and effort into account?
• What tangible and reusable work product will I have when I’m done?
The last one is perhaps the most important. Do you truly believe that a month from now you will be that much farther down the road toward a valuable outcome? If you like where these questions take you, then you might have something worth doing. If, on the other hand, you are just gambling with a month’s work because you are getting the POC for “free” from a vendor, think about whether there is a better way for your team to invest a month’s work.
Keith McCormick will present a keynote during the Datawarehousing & Business Intelligence Summit: 'Data Preparation for Machine Learning: Why Feature Engineering Remains a Human-Driven Activity' on March 26th.
Furthermore, he will present a unique post-conference workshop, ‘Putting Machine Learning to Work’, on March 30th and 31st.