Patrick McFadin works for DataStax, a solutions provider specializing in Apache Cassandra, a distributed NoSQL database. DataStax has contributed the most to Cassandra, and Patrick McFadin, who has been involved in the project from the beginning, is considered the Chief Evangelist for Apache Cassandra. “We put a lot of our resources in it to make sure that Apache Cassandra is well-received and we built the enterprise tools around it.” At the first European edition of the Spark Summit, held from October 27 to 29, 2015, McFadin was one of the keynote speakers, talking about how to use Cassandra and Spark together, and BI-Platform had the opportunity to chat with him.
As Cassandra is an open-source project governed by the Apache Software Foundation, DataStax is not the owner of Cassandra. However, DataStax can call itself the owner of the Spark-Cassandra Connector, which it created. The company works with Databricks, which specializes in Spark, to make sure that Cassandra and Spark work well together.
So you own the love story?
"Yes we own the love story, we’re the matchmaker!"
"Apache Cassandra is the core of the DataStax Enterprise offering. And as you know, this is open source. The part we contribute to the most is the base, the data layer, where we store the data. We build out the tools and integrations that are somewhat difficult. So for instance, we ship a product, in this case DataStax Enterprise, with Spark already included. With a single download from our website, people have a complete package with a database in which Spark is integrated, and they can use it right away. This is a very unique offering, as there are not a lot of ‘shipping Spark products’ out there. We also integrate Apache Solr for indexing, and we also use Spark in Solr. This means that we allow Solr queries inside Spark jobs, which is pretty unique. All of these integrations are packaged together in DataStax Enterprise. We have some tooling around it as well: we have a thing called OpsCenter that allows us to manage all of that, and we offer more integrated solutions for the security that enterprises need, such as LDAP, encryption, and auditing. Large companies have to have this, otherwise they cannot use the product. So putting that all together into one package is what we do with DataStax Enterprise. It makes it easier for a company to move. The main problem for big companies is that they know that this is the right choice, but they do not want to hire all the people for it because that requires a lot of effort and money. Therefore, they prefer taking a smaller route by purchasing our product. And along with that they get the support. They can call us when they have a problem. We do a lot of short-term consulting around the initial project. It helps them get set up. People and companies moving away from relational databases into this new arena have no experience with these new technologies. They need help to get there. There are larger consulting companies that will help them for months or years, but we help them with that initial transition."
Do you see a lot of companies moving from relational into these new environments? Or is it always in addition to their existing relational database?
Normally it is a combination of things. However, we certainly see more people moving away from relational databases. In many cases, they have a new product or project and therefore need to change some things. Because of these changes, they choose not to use a relational database for it. They may have Oracle, MySQL, or some other traditional SQL server in a traditional set-up. If companies want to create a web or mobile product, for example, they know that it is not going to work with the traditional set-up. The biggest issues that make companies move from relational to Cassandra are scale and uptime. When you only have a single server, you will not have either of those. Eventually you will run out of both. You either get unlucky and your server goes down, or you run out of server capacity and cannot scale anymore. That is a serious problem. That is why we are replacing relational databases all the time.
What kind of data modelling do you see being used for these environments?
Because of the way Cassandra works, not being relational, there are no joins and no data-warehouse-style queries. That is exactly why we integrated Apache Spark. Data modelling with Cassandra is based on views. You build denormalized views for a specific case. For instance, if a website needs data, you build a view that stores that data so it can be displayed quickly. A website or mobile application typically has a response time of milliseconds. However, when a lot of data is collected and part of it needs to be fed back to the customer, one does not want to feed back all the raw data. That is pointless, because most of the data is irrelevant to the customer. Instead, one can use Spark for data analytics and create new data, so that only a fraction of the raw data is used and made useful. The result is then put back into the view data model in Cassandra. In this way, Spark creates materialized views that can be served to an application.
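The pattern described above (a batch job reduces raw data to a small, query-specific view that the application reads by key) can be sketched in plain Python. This is only an illustration under stated assumptions: the event fields, `build_spend_view`, and `read_view` are hypothetical names, an in-memory dictionary stands in for a Cassandra table, and an ordinary loop stands in for the Spark job. It uses neither the Cassandra nor the Spark API.

```python
from collections import defaultdict

# Hypothetical raw events, as an application might collect them.
raw_events = [
    {"customer_id": "c1", "product": "book", "amount": 12.0},
    {"customer_id": "c1", "product": "pen", "amount": 3.0},
    {"customer_id": "c2", "product": "book", "amount": 12.0},
]


def build_spend_view(events):
    """Batch step (the role Spark plays in the text): reduce the raw
    events to one small, precomputed row per customer."""
    view = defaultdict(lambda: {"orders": 0, "total": 0.0})
    for event in events:
        row = view[event["customer_id"]]
        row["orders"] += 1
        row["total"] += event["amount"]
    return dict(view)


def read_view(view, customer_id):
    """Serving step: a single lookup by key, with no join and no scan
    of the raw data. Returns None for an unknown customer."""
    return view.get(customer_id)


spend_by_customer = build_spend_view(raw_events)
```

The application only ever calls `read_view`, which is why the read path stays fast regardless of how much raw data was collected; the cost is paid once, in the batch step.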
So traditional ER data modelling is not used in this type of environment?
ER data modelling still exists, but it is just not built into the data model itself. When we do formal data modelling with Cassandra, we do start by creating an ER diagram, because at that point there is no technology behind it. There is a free online course on DataStax’s website on how to set up an ER diagram. This is still important because the first question we receive from customers is “how do you do this?”. In this formal data modelling course, which takes several hours, people start by creating an ER diagram. Afterwards they are walked through the application-specific settings. They then learn that the relationships from the ER diagram can be expressed in a different way, namely in a more real-time but denormalized data model. However, the formal process is still necessary to get to this point.
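As a rough illustration of how a relationship from an ER diagram can be expressed in a denormalized model, the following plain-Python sketch turns two normalized entities into one query-specific table. The entities (`users`, `comments`) and the function `comments_by_user` are hypothetical examples, not taken from the DataStax course, and a dictionary stands in for a Cassandra table partitioned by user.

```python
# Normalized, ER-style entities (hypothetical example data).
users = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
comments = [
    {"comment_id": 10, "user_id": 1, "text": "Nice"},
    {"comment_id": 11, "user_id": 2, "text": "Thanks"},
    {"comment_id": 12, "user_id": 1, "text": "Agreed"},
]


def comments_by_user(users, comments):
    """Express the user-comment relationship as a denormalized table:
    one partition per user, with the user's name copied into each row
    so that reading it needs no join."""
    table = {}
    for comment in comments:
        user_id = comment["user_id"]
        table.setdefault(user_id, []).append(
            {
                "comment_id": comment["comment_id"],
                "user_name": users[user_id]["name"],
                "text": comment["text"],
            }
        )
    return table


by_user = comments_by_user(users, comments)
```

The user's name is deliberately duplicated in every row. That duplication is the trade-off the denormalized model accepts in exchange for answering "all comments for this user" with a single lookup.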
Is there tooling available for interfacing with the Embarcadero tools or other applications for BI use cases?
Yes, there is. Tableau, Jaspersoft, and many others offer this, and native Cassandra connectors exist for those tools. What we get in addition is Spark SQL. With Spark SQL, you can create really nice, rich workloads from arbitrary SQL. BI tools love that: they all talk to Hive, and you can set up a Hive listener that accepts SQL commands from BI tools. I think this is interesting because you can build these things out yourself. Furthermore, the tooling built around Cassandra can give you faster dashboarding and other functionality. It is not the same as doing it with Oracle, but it is still relevant in a BI space.
Does the SMACK stack make implementation very complex as compared to implementing traditional database management systems? Or is that your business model, to make it more simplified?
We are not the only ones doing this. And you are right, it is complicated because of the multiple parts. It is comparable to how airplanes were purchased in the early years: you had to buy the separate parts and assemble them yourself. I would be terrified to build my own airplane because I do not trust myself to build one. I think we are seeing that transition now with SMACK (ed. Spark, Mesos, Akka, Cassandra, Kafka). Mesosphere is using it now. We have separate parts that we know work well together. The next step is to combine them in a useful way, so that they can be used without having to know that multiple individual parts are working together. Over the next couple of years we are going to see a lot of development in these combined parts, because we need them. All those parts together make sense, but it is very difficult to put them all together. Therefore, companies selling a complete package offer a great solution to this.
When looking at the marketplace, how does Cassandra differentiate from offerings such as Couch and Cloudera?
Cloudera is basically Hadoop. It is for enterprise data warehouses, with large-scale analysis of data. Cassandra is closer to the customer; that is the easiest way to put it. It would sit right next to the mobile application, or the IoT or web application, that a company’s customer will use. Cloudera is further away from the customer because it is about collecting large amounts of data and analyzing it. They are making it feed-forward, but not as close to the customer.
Couch is closer to the customer, like Cassandra. However, they are in-memory, so they are very different in some respects. They do smaller, short-request types of workloads. The main difference from Cassandra is the amount of data collected. With Cassandra, we try to collect and store a lot of data, and this is not what Couch is trying to accomplish. They want to process a small amount of data quickly.
So depending on the needs and wants of the company, you would choose either one of the products?
Exactly, and within a company this can even differ from project to project. When you look into different projects of a large enterprise, they might use all three of these tools. Things are starting to come more together though.
In the 1980s there were many relational databases, with Ashton-Tate being one of the first and one of the largest vendors with its product dBase. Then there was a small startup called Oracle, which changed the relational database business. All of those database companies made their initial play into relational in the 1980s, because small computers had become a thing and mainframes were on their way out. Relational databases were needed because they eliminated a lot of duplicated data. Since 40-megabyte hard disks cost thousands of dollars, organizations saw duplicated data as a problem, and relational databases offered a solution. That database market collapsed and consolidated, and eventually two or three established players remained.
We are seeing that happening again, but now with a different problem. Now we could say the current market is broken again, as there is a whole bunch of new database companies that are solving this new problem and are now starting to come together again. And we can see the trend evolving now.
Regarding risk-averse companies, don’t you think they would want to stick with their current databases from Oracle, IBM and Microsoft and try to build things on top of that instead of migrating?
One of the most poisonous things that an enterprise can do is to say: “that’s the way we’ve always done it”. This has already caused many bankruptcies. When it happens, it is sad because it could have been avoided. Sticking to what has always worked will not result in success. There is a barrier to get past. At this moment in time it might “work fine”, but it will not take you to the next level. Ten to fifteen years ago, people did not expect their bank to be available at three o’clock in the morning. A bank could easily take its whole service down at midnight and it would still be fine. Nowadays, however, consumers want to be able to check their bank account in the middle of the night. As a result, banks have to adapt.
Look at newspapers. They never adapted properly and look at how many are disappearing. The world is changing but they did not want to change. Now they are disappearing. The survivors are online now, for a reason.