Big revolutions are happening, and Patrick Wendell feels that at Databricks he is right in the middle of them. Patrick is a co-founder of Databricks as well as a founding committer and PMC member of Apache Spark. In the Spark project, Patrick has acted as release manager for several Spark releases, in addition to maintaining several subsystems of Spark's core engine. At Databricks, Patrick directs the company's maintenance and development of Spark.
The Databricks team was present at Spark Summit Europe in Amsterdam from October 27th to 29th. After the impressive days of the Spark Summit, BI-platform thanks the very friendly Patrick for finding some time in his busy schedule to talk with us about Databricks, Spark, and the training of data scientists.
Patrick, Databricks was founded by the people who created and built Spark, and Databricks delivers Spark as a service. Can you tell us something about the background?
Patrick: "Yes, I was part of the founding team of Databricks. I was at UC Berkeley in a PhD program, and there is a lab there called the AMPLab. It is a research lab focused on Big Data, and actually one of the earliest research groups with this focus. Spark was really intended as a research project, but suddenly you had companies putting Spark into production to solve business problems.
And we got the feeling that maybe we should start a company to help support this open source project, because most successful open source is backed by a commercial entity today. So we started a company after we saw many people using Spark. The original founding team is largely people from that research group at Berkeley, of which I'm one. And then we have a software product called Databricks, which has the same name as the company. The problem we are trying to solve at Databricks is that, as much as we make the APIs in Spark very developer friendly, very powerful and fast, for many companies actually going into operational production with Spark is still very difficult. And this is not really specific to Spark; it's true of many other Big Data technologies as well.
The goal at Databricks was: let us actually build this software-as-a-service platform and provide a way to put down a credit card and have a Spark environment up and running in less than ten minutes. Previously many companies would take weeks or months, or even more than a year, just to deploy a full Spark cluster. This is kind of the broad background on the Berkeley AMPLab, the founding of the company and our product strategy."
There are new things going on. For instance, the whole SMACK (Spark, Mesos, Akka, Cassandra, Kafka) stack is quite a new concept; it replaces a deployment of Hadoop DFS and MapReduce. There is a lot of new information to follow for people from the traditional RDBMS world!
"Yes. It is a constellation of many different open source technologies. It's a lot to keep track of; there are many moving pieces. Basically what you see is that along all of these dimensions where you used to have proprietary software offerings, things like virtualization, storage and processing, you now have open source alternatives in many of these areas. That provides many benefits for the users; in particular, they have much more flexibility in the way that they think about storing and processing data.
For instance, Spark has a data source API, and we basically implemented it nine months ago. Today there are more than thirty implementations of data source integrations, because it is an open source API and all of the other systems that integrate with it are also open source, like Cassandra or MapR's data storage system.
The speed at which the community can evolve and innovate with open source is staggering. It is just much faster than in the traditional, heavily vendor-controlled type of ecosystem. On the other hand, the one tradeoff is that there is a lot of complexity that goes into managing and operating this whole cacophony of different technologies. There are so many different things that even just knowing what they are all supposed to do is very hard.
So, that is part of our vision with Databricks: we want it to be easy to consume Spark as more of a service. You don't have to think too much about which Docker containers things are running in, or exactly how the monitoring and health checking work. We kind of do that for you, because from an operational perspective, dealing with all of the moving parts is very costly for some companies."
So this is also relevant when enterprises want to scale up, for instance with a new office or new customers, because they don't want to worry about all the IT management.
"Exactly, we have many customers who set up a data pipeline with Databricks. For instance, they are analyzing some data from their customer applications and suddenly they have a tenfold increase in customers. For them that is a great situation, but what they used to have to worry about was going into panic mode and rebuilding the infrastructure for a larger scale. With a managed service that has been tested at a very large scale, they can sit back and not worry about that. And they are willing to pay for that. These are open source technologies, but we are building a company that we want to be a successful company.
We need to find some way to provide value, and the way we have chosen is a SaaS offering. I don't know if you have seen the growth metrics for cloud services in the last year, but Amazon recently announced they had almost doubled their AWS revenue in the last twelve months. And this is a very large number that is doubling and growing like crazy. Amazon is by far the market leader in cloud services, and we have built Databricks to rest on top of Amazon Web Services."
Is it possible for companies to have Spark services not in the cloud but on-premises, in their own systems?
"Absolutely. Databricks is focusing on the cloud, but here (in the vendor hall of Spark Summit) you see more than ten companies that are offering some distribution of Spark, or an application based on Spark, for on-premises use. And Databricks is a partner of many of those companies. Databricks is a very small startup, and we are focused exclusively on the cloud because, if we look at the next five years, this is where we think there will be a major growth area. But nonetheless we have partners. For instance, we are partnering with Hortonworks, a Hadoop company, and we have a very good partner relationship with DataStax to help them distribute Spark to their customers.
So on-premises users can consume Spark with help from Databricks, but they consume it through these other vendors. The market is already very well covered there."
How are new releases of open source software planned?
"In the open source business model, in many cases some companies employ a significant portion of the main committers: the people who maintain the product work at the company. For instance, many of the very active Cassandra committers work at DataStax.
So the company is not 100 percent neutral, whereas the whole point is that an open source project should have some autonomy. But in general many of the open source projects have a steady release timeline. For Spark we release every three months, and all the vendors and everyone working on Spark agreed on this three-month release cadence (personally I was the one who put this in place, so it's close to my heart). Every three months we deliver new code; that way we know ahead of time when things will be delivered."
There is quite a shortage of data science professionals. Given the amount of work to come and the number of problems to solve, what is needed to educate and train good data scientists?
"We have a huge gap between the necessary skills and the current training of people. It's a pervasive problem, especially in the Big Data area. I think the way to fix that problem is to provide open source and widely available training materials and to let people self-educate. One really nice thing about open source is that anyone in the world can download Spark for free, start reading the documentation and start playing with it. I think this big revolution around open source technology in the last ten years will actually help a whole generation of people self-educate and become more empowered and able to use these technologies. Another aspect which we focus on at Databricks is providing widely available public training; we have a training business offering all sorts of training. But how many people can we train directly in one year, maybe a few thousand at best?
What we have done is take all of our training materials and post them publicly online: videos, example notebooks. We have this kind of notebook type of abstraction, and it is a great way to teach people about data science. The way we get there, in my view, is by having widely available public training datasets, and that is what we are working on at Databricks. It is a very key part of our mission, coming out of the academic context."
There is a lot of development happening with open source database solutions. It seems like the open source platform is leading?
"I think there are a few things going on now. One is that the quality of open source software is increasing a lot. In some cases open source is moving faster technologically than proprietary software. It is almost the inverse of the normal case.
The second thing is that more and more companies which are making platform infrastructure investments really want at least that platform level to be open source. Because they are making a long-term commitment, they don't want to be a slave to a particular vendor in the long term. It is almost the case now that someone purchasing software has to defend the decision not to go with open source, if they choose that. The default assumption is that they would do something with open source software.
There have been some big changes in the last few years, and we feel that at Databricks we are right in the middle of this!"