05-02-2016 By: Mike Ferguson

Celebrating Hadoop's 10-year anniversary


In the 10 years since Hadoop became an Apache project, the momentum behind it as a key platform for big data analytics has been nothing short of enormous. In that time we have seen huge strides in the technology, with new ‘component’ Hadoop applications being contributed by vendors and organisations alike. Hive, Pig, Flume and Sqoop, to name a few, have all become part of the Hadoop landscape, accessing data in the Hadoop Distributed File System (HDFS). However, it was perhaps the emergence of Apache Hadoop YARN in 2013 that opened the floodgates, by breaking the dependency on MapReduce.
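For readers who have never looked under the hood, the MapReduce model that YARN decoupled Hadoop from can be illustrated with a toy, single-process sketch in plain Python (no Hadoop involved): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This is the classic word-count example, not Hadoop's actual implementation.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key (here: sum the counts)."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["Hadoop stores data in HDFS", "Spark and Hadoop process data"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["hadoop"] is 2, counts["data"] is 2, counts["spark"] is 1
```

On a real cluster the map and reduce tasks run in parallel across many nodes and the shuffle moves data over the network; the dataflow, however, is exactly this.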

Today we have Hadoop distributions from Cloudera, Hortonworks, MapR, IBM and Microsoft, as well as cloud offerings such as Amazon EMR, Altiscale and Qubole. Yet the darling technology is Spark, with its scalable, massively parallel in-memory processing. It can run on Hadoop or on its own cluster, and it can access HDFS, cloud storage, RDBMSs and NoSQL DBMSs. It is a key technology because it combines streaming analytics, machine learning, graph analytics and SQL data access in the same execution environment, and even in the same application. AMPLab and then Databricks progressed Spark's functionality to the point where even vendors the size of IBM have strategically committed to its future. From a developer perspective we have progressed way beyond just Java, with languages like R, Scala and Python now in regular use. Interactive workbenches like Apache Zeppelin have also taken hold in the development and data science communities, speeding up analysis.

The era of self-service

Today we are entering a new era: the era of automation and of lowering the skills barrier, turning self-service business analysts into so-called ‘citizen data scientists’. Data mining tools like KNIME, IBM SPSS and RapidMiner already support in-memory analytics by leveraging the analytic algorithms in Spark. SAS also runs at scale in the cluster, but with its own in-memory LASR server. There is also a flood of analytic libraries emerging, like ADAM and GeoTrellis, with IBM also open-sourcing SystemML. The ETL vendors have all moved over to run data cleansing and integration jobs natively on Hadoop (e.g. Informatica Blaze, IBM BigIntegrate and BigQuality) or on top of Spark (e.g. Talend). Spark-based self-service data preparation startups have also emerged, such as Paxata, Trifacta, Tamr and ClearStory Data. On the analytical tools front, too, we have seen enormous strides. Search-based vendors like Attivio, Lucidworks, Splunk and Connexica all crawl and index big data in Hadoop and relational data warehouses. New analytical tools like Datameer and Platfora were born on Hadoop, and the mainstream BI vendors (e.g. Tableau, Qlik, MicroStrategy, IBM, Oracle, SAP, Microsoft, Information Builders and many more) have all built connectors to Hive and other SQL-on-Hadoop engines.

If that is not enough, check out the cloud. Amazon, Microsoft, IBM, Oracle and Google all offer Hadoop as a service. Spark is available as a service, and there are analytics clouds everywhere. If you think we are done, you must be kidding. Apache Flink has arrived for stream processing, and security is still being built out with Apache Sentry, Hortonworks Ranger, Zettaset, IBM Guardium and more. Oh, and data governance is finally getting done, but it is still a work in progress, with the emergence of the information catalog (Alation, Waterline Data, Semanta, IBM) together with reservoir management, data refineries… Exhausting, isn't it?

Without a doubt, Hadoop along with Spark has transformed, and is still transforming, the analytical landscape. It has pushed analytics into the boardroom. It has extended the analytical environment way beyond the data warehouse, but it is not replacing it. ETL offload is a common use case: taking staging areas off data warehouses so that CIOs can avoid data warehouse upgrades. And yet more and more data continues to pour into the enterprise to be processed and analysed. There is an explosion of data sources, with a tsunami of them coming over the horizon from the Internet of Things. But strangely, here we are with increasingly fractured, distributed data, and yet the business demands more agility! Thank goodness SQL prevails. Like it or loathe it, the most popular API on the planet is coming over the top of all of it. I'm tracking 23 SQL-on-Hadoop engines right now, and that excludes the data virtualisation vendors! Thank goodness for data virtualisation and for external tables in relational DBMSs that reach into Hadoop. If you want to create the logical data warehouse, this is where it is going to happen. Who said relational is dead? Federated SQL queries and optimisers are here to stay. So… are you ready for all this? Do you have a big data and analytics strategy, a business case, a maturity model and a reference architecture? Are you organised for success? If you want to be disruptive in business, you'll need all of this.
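The federated-query idea behind the logical data warehouse — one SQL statement spanning physically separate stores — can be sketched in miniature with Python's standard-library sqlite3 module and its ATTACH command. This is a toy stand-in, not a real data virtualisation layer or Hadoop external table, and the table and column names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")                  # stands in for the warehouse
con.execute("ATTACH DATABASE ':memory:' AS lake")  # a second, separate database

# Warehouse side: aggregated sales facts.
con.execute("CREATE TABLE sales (product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("widget", 100.0), ("gadget", 250.0)])

# "Data lake" side: raw clickstream counts, living in the attached database.
con.execute("CREATE TABLE lake.clicks (product TEXT, clicks INTEGER)")
con.executemany("INSERT INTO lake.clicks VALUES (?, ?)",
                [("widget", 40), ("gadget", 10)])

# One federated SQL statement joining across both stores.
rows = con.execute("""
    SELECT s.product, s.amount, c.clicks
    FROM sales AS s
    JOIN lake.clicks AS c ON c.product = s.product
    ORDER BY s.product
""").fetchall()
# rows is [("gadget", 250.0, 10), ("widget", 100.0, 40)]
```

A real federated SQL engine or data virtualisation layer does the same thing at scale: it pushes predicates down to each source, joins the results, and presents the logical data warehouse to users as a single schema.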

Happy Birthday Hadoop!


Hadoop, Spark, NoSQL

Mike Ferguson

Mike Ferguson is the founder of Intelligent Business Strategies Ltd. and, as an analyst and consultant, specialises in business intelligence, big data, data management and enterprise business integration. He has more than 30 years of experience in IT, including in the areas of BI and Corporate Performance Management, Data Management and Big Data Analytics (Hadoop, MapReduce, Hive, Graph DBMSs).

Mike operates at board level, IT management level and specialised technical IT levels, covering BI, corporate performance management strategy, technology and tool selection, enterprise architecture, MDM and data integration. He is a much sought-after speaker at international conferences and has published many articles in trade journals and on weblogs, including his own channel on the B-Eye-Network. Mike also speaks for Adept Events and regularly presents at the DW&BI Summit in Amsterdam.

All blogs by this author