In the 10 years since Hadoop became an Apache project, the momentum behind it as a key platform in big data analytics has been nothing short of enormous. In that time we have seen huge strides in the technology, with new ‘component’ Hadoop applications being contributed by vendors and organisations alike. Hive, Pig, Flume and Sqoop, to name a few, have all become part of the Hadoop landscape, accessing data in the Hadoop Distributed File System (HDFS). However, it was perhaps the emergence of Apache Hadoop YARN in 2013 that opened the floodgates by breaking the dependency on MapReduce.
Today we have Hadoop distributions from Cloudera, Hortonworks, MapR, IBM and Microsoft, as well as cloud vendors like Amazon with EMR, Altiscale and Qubole. Yet the darling technology is Spark, with its scalable, massively parallel in-memory processing. It can run on Hadoop or on its own cluster, and it can access HDFS, cloud storage, RDBMSs and NoSQL DBMSs. It is a key technology, combining the ability to do streaming analytics, machine learning, graph analytics and SQL data access all in the same execution environment, and even in the same application. AMPLab and then Databricks progressed Spark functionality to the point where even vendors the size of IBM have strategically committed to its future. From a developer perspective we have progressed way beyond just Java, with languages like R, Scala and Python now in regular use. Interactive workbenches like Apache Zeppelin have also taken hold in the development and data science communities, speeding up analysis.
Today we are entering a new era: the era of automation and of lowering the skills barrier, turning self-service business analysts into so-called ‘citizen data scientists’. Data mining tools like KNIME, IBM SPSS and RapidMiner are already supporting in-memory analytics by leveraging the analytic algorithms in Spark. SAS is also running at scale in the cluster, but with its own in-memory LASR server. There is also a flood of analytic libraries emerging, like ADAM and GeoTrellis, with IBM also open sourcing SystemML. The ETL vendors have all moved over to run data cleansing and integration jobs natively on Hadoop (e.g. Informatica Blaze, IBM BigIntegrate and BigQuality) or on top of Spark (e.g. Talend). Also, Spark-based self-service data preparation startups have emerged, such as Paxata, Trifacta, Tamr and ClearStory Data. On the analytical tools front, too, we have seen enormous strides. Search-based vendors like Attivio, Lucidworks, Splunk and Connexica all crawl and index big data in Hadoop and relational data warehouses. New analytical tools like Datameer and Platfora were born on Hadoop, and the mainstream BI vendors (e.g. Tableau, Qlik, MicroStrategy, IBM, Oracle, SAP, Microsoft, Information Builders and many more) have all built connectors to Hive and other SQL-on-Hadoop engines.
If that is not enough, check out the cloud. Amazon, Microsoft, IBM, Oracle and Google all offer Hadoop as a service. Spark is available as a service, and there are analytics clouds everywhere. If you think we are done, you must be kidding. Apache Flink is emerging, and security is still being built out with Apache Sentry, Apache Ranger (from Hortonworks), Zettaset, IBM Guardium and more. Oh, and data governance is finally getting done, though it is still a work in progress, with the emergence of the information catalog (Alation, Waterline Data, Semanta, IBM) together with reservoir management, data refineries… Exhausting, isn’t it?
Without a doubt, Hadoop, along with Spark, has transformed and is still transforming the analytical landscape. It has pushed analytics into the boardroom. It has extended the analytical environment way beyond the data warehouse, but it is not replacing it. ETL offload is a common use case, taking staging areas off data warehouses so that CIOs can avoid data warehouse upgrades. And yet more and more data continues to pour into the enterprise to be processed and analysed. There is an explosion of data sources, with a tsunami of them coming over the horizon from the Internet of Things. But strangely, here we are with increasingly fractured, distributed data, and yet the business demands more agility! Thank goodness SQL prevails. Like it or loathe it, the most popular API on the planet is coming over the top of all of it. I’m tracking 23 SQL-on-Hadoop engines right now, and that excludes the data virtualisation vendors! Thank goodness for data virtualisation and for external tables in relational DBMSs that reach into Hadoop. If you want to create the logical data warehouse, this is where it is going to happen. Who said relational is dead? Federated SQL queries and optimisers are here to stay. So… are you ready for all this? Do you have a big data and analytics strategy, a business case, a maturity model and a reference architecture? Are you organised for success? If you want to be disruptive in business, you will need all of this.
Happy Birthday Hadoop!