Big Data Performance Anxiety and Data Grids
An In-Memory Data Grid (IMGD) is a data structure that is increasingly cited as a solution to the problem of scaling big data applications. Unlike in-memory applications, IMGDs distribute only the data across the RAM of multiple servers. With memory prices continuing to fall and application data volumes continuing to rise, memory-based solutions look more and more attractive for managing the performance bottlenecks of applications using Big Data. Should IMGD be on your radar for a Big Data application?
To explore this and other questions about IMGDs, Carpe Datum Rx spoke to Miko Matsumura, VP of Marketing and Developer Relations at Hazelcast, who has seen recent adoption of this technology in banks, telcos and technology companies. Here is an extract from our discussion.
Why is it so important to distribute data in a data grid? Why should it be in-memory?
When you run an app on your laptop, where is the data? Where is the code? It’s in RAM. There are exceptions: for example, if you are using Photoshop and handling a really big image, you get this thing called “swapping”. As soon as the application or OS starts swapping RAM to disk, things slow down a lot. Why is that? In terms of orders of magnitude, RAM access is on the nanosecond scale and disk access is on the millisecond scale, which means that RAM is on the order of one million times faster than disk. When you run enterprise applications, where is the data? In some cases the data is in a database, which we expect to be on a disk. In the case of something like Hadoop, we expect the data to be in a filesystem (HDFS), which is again on a disk. This is generally because we are dealing with very big data sets, on the order of Petabytes. In the case of an In-Memory Data Grid, you bring the data to the application tier. This means that the code and the data reside together on the same machine, just as they do when you’re running an application on your laptop or mobile device. What makes this complex to achieve is that the application tier can be radically clustered, so if there are dozens or even hundreds of application servers changing the data, you have potentially big problems with data consistency and concurrency. This is where the In-Memory Data Grid shines. In-Memory Data Grids typically handle Terabyte scales of data in RAM but typically not Petabyte scale. We just aren’t there yet, although in a few years we will be. Because of this, what is being called “Big Data” will typically not be held entirely in memory; rather, there will be an analytic working memory where large data sets are pulled into memory for analytics. In addition, an In-Memory Data Grid is ideal for handling “Big Fast Data” situations where there are extremely high data ingest rates, on the order of tens of millions of records per second.
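To make the idea of keeping data in the RAM of the application servers concrete, here is a minimal Python sketch of a partitioned in-memory map. It is an illustration only and not Hazelcast's API: the class name ToyGrid, the node names, the partition count of 8 and the round-robin assignment are all assumptions made for this example.

```python
import hashlib

PARTITION_COUNT = 8  # assumed small value, chosen only for illustration


def partition_for(key: str) -> int:
    """Map a key to a partition using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % PARTITION_COUNT


class ToyGrid:
    """Hypothetical toy grid: each node keeps its share of the data in a local dict."""

    def __init__(self, node_names):
        names = sorted(node_names)
        # One in-RAM store per application server.
        self.nodes = {name: {} for name in names}
        # Assign partitions to nodes round-robin.
        self.owner = {p: names[p % len(names)] for p in range(PARTITION_COUNT)}

    def owner_of(self, key: str) -> str:
        """Return the name of the node whose RAM holds this key."""
        return self.owner[partition_for(key)]

    def put(self, key: str, value) -> None:
        self.nodes[self.owner_of(key)][key] = value

    def get(self, key: str):
        return self.nodes[self.owner_of(key)].get(key)


grid = ToyGrid(["app-1", "app-2", "app-3"])
for i in range(100):
    grid.put(f"order-{i}", {"id": i})
```

Because every node can compute owner_of for any key, a request for a given key can be routed to the server that already holds it in memory. A real grid also has to handle the consistency and concurrency problems described above, which this sketch ignores.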
There are other solutions that use the power of the in-memory approach to solve analytics-related problems. Can you comment on these solutions? Do you see in-memory as an important trend for the future?
In-memory is exceedingly important because the amount of RAM available at the same price keeps doubling, and this exponential growth enables more and more applications to run in the most sensible way (the way they do on your laptop): in memory. Right now the biggest amount of hype is around In-Memory Databases, namely Oracle 12c and SAP HANA. Unfortunately, an in-memory database, while it speeds up the database, does not solve your problem, because in most cases you have a database “tier” and an application server “tier”, which means the data and the code are on different machines. In addition, the database vendors often ask you to beef up your hardware in order to accommodate in-memory databases, which is a poorly kept “secret” in this industry. Putting more and more processors and memory into the database tier is actually a wrong-headed architecture. Even if you have an in-memory database, you still have to reach across the network to get the data from the database into the application server. Even fast networks add a lot of latency compared to RAM. So the problem is that we need to have the code and the data on the same machine. One approach is to bring the code to the data, i.e. Database Stored Procedures. Database Stored Procedures not only make better use of the processors in the database tier, but they also execute exceedingly fast, especially on tasks like stream processing. This approach has generally been frowned upon because as soon as you store procedures in your database, you are going to experience vendor lock-in and incredible complexity in maintaining the code. The In-Memory Data Grid approach instead brings the data up to the application layer. It’s a bit like distributed caching, only much smarter.
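The point about network latency can be put into rough numbers. The short Python sketch below compares a local RAM access with a network round trip to a separate database tier. The two latency figures are assumed order-of-magnitude values chosen for illustration, not measurements from the interview or from any particular system.

```python
# Assumed order-of-magnitude latencies in nanoseconds (illustrative, not measured).
LATENCY_NS = {
    "local RAM": 100,
    "network round trip": 500_000,
}


def total_seconds(per_access_ns: int, accesses: int) -> float:
    """Time for a number of sequential accesses, in seconds."""
    return per_access_ns * accesses / 1_000_000_000


def slowdown_vs_ram(tier: str) -> int:
    """How many times slower one access to this tier is than one local RAM access."""
    return LATENCY_NS[tier] // LATENCY_NS["local RAM"]


for tier, ns in LATENCY_NS.items():
    seconds = total_seconds(ns, 1_000_000)
    print(f"{tier}: {seconds:g} s per million sequential accesses "
          f"({slowdown_vs_ram(tier)}x local RAM)")
```

Under these assumptions, a million sequential lookups take about a tenth of a second from local RAM and several minutes when each one crosses the network, even if the database on the other side answers from memory.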
Can you give examples of companies where IMGD can provide a competitive advantage to their core business? Does it let a business achieve something that was impossible to do before?
That’s exactly what it’s about. If your platform is able to leverage radical advancements in the underlying technology such as the exponential RAM doubling interval, you are able to gain significant competitive advantage. We see this of course in financial trading, but more and more industries care about this kind of explosive performance. The gaming industry is increasingly dealing with massively multiplayer games, the travel industry is dealing with “burst” situations such as weather-based cancellations, and eCommerce vendors are building huge platforms to address multiple “burst” channels such as retail stores and online stores, both of which explode in activity during Black Friday. One major technology hardware vendor had a problem where they were selling so many devices that the front-end web store was unable to know when the inventory was depleted. This extreme case in eCommerce was mitigated by building very large in-memory data structures in Hazelcast, which allowed them to coordinate thousands of retail outlets and millions of online users during their product launch. As I mentioned, Hazelcast also has profound ease of use and developer friendliness, and it manages memory in a way that confers competitive advantage. So embedding Hazelcast into your new initiatives can help you grow your business. Several key features support this ability beyond just ultra-low-latency transaction processing. One of them is elastic scalability: nodes that boot up autodiscover one another and pool their resources dynamically. This kind of elasticity can also be deployed into cloud configurations, including Amazon EC2. So Hazelcast not only provides a competitive advantage to your application, but it can also scale elastically with your application growth needs.
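The inventory problem described above comes down to many buyers decrementing one shared count without ever overselling. The following Python sketch shows that idea in a single process. It is not how the vendor or Hazelcast implemented it: InventoryCounter and simulate are hypothetical names, and an ordinary thread lock stands in for the coordination a data grid would provide across machines.

```python
import threading


class InventoryCounter:
    """Hypothetical shared stock counter; a lock stands in for grid-wide coordination."""

    def __init__(self, stock: int):
        self._stock = stock
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        """Reserve one unit, or return False once the stock is depleted."""
        with self._lock:
            if self._stock == 0:
                return False
            self._stock -= 1
            return True

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._stock


def simulate(stock: int, buyers: int):
    """Let many concurrent buyers compete for a limited stock."""
    counter = InventoryCounter(stock)
    results = [False] * buyers  # each thread writes only to its own slot

    def buy(i: int) -> None:
        results[i] = counter.try_reserve()

    threads = [threading.Thread(target=buy, args=(i,)) for i in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), counter.remaining


sold, remaining = simulate(stock=50, buyers=200)
```

The check and the decrement happen as one step under the lock, so the number of successful reservations can never exceed the stock, and every buyer after that point is told immediately that the item is sold out.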
Was there a need for IMGD 2 years ago? Will there be a need 2 years from now?
There has always been a need for faster performance, elastic scalability and reliability in applications. What is causing an explosion of interest is several simultaneous events. One is the radical adoption of Mobile, which means huge increases in clients and overall internet traffic, and therefore data. Another is the emergence of a big data ecosystem. A third is the emergence of virtualization technology to enable the software-defined data center and to support radically distributed architectures. Finally, the exponentially dropping cost of RAM and its doubling interval have caused these capabilities to become feasible for dozens of industries, not just Financial Services and Telcos. Two years from now, it will be unreasonable to consider deploying “Enterprise” software without an in-memory data grid. Consider that every leading ESB now embeds an in-memory data grid (WSO2 and Mule happen to embed Hazelcast). Commercial ESB providers are all embedding their own proprietary in-memory data grids. In-memory data grids are popping up inside and alongside databases and even some new programming languages. So we are going to see ubiquity of this technology.
Businesses classify IT investments as “run your business” versus “grow your business” solutions. Where does your IMGD fit in and why?
Both. One of the key properties of Hazelcast is that it is extremely resilient. Everyone knows that RAM is very performant, but also volatile. Hazelcast recovers from loss of nodes by protecting and recovering data quickly as each node performs backup functions for other nodes. Because of this, Hazelcast is in use in mission-critical systems in government, financial services and health care, areas where data losses can have extremely serious consequences. So Hazelcast is something to run your business on.
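The idea that each node performs backup functions for other nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hazelcast's actual mechanism: ReplicatedGrid, the single backup copy per entry, and the placement of that backup on the next node in a sorted ring are all choices made for this example.

```python
class ReplicatedGrid:
    """Hypothetical toy grid: every entry has one primary copy and one backup copy."""

    def __init__(self, node_names):
        self.names = sorted(node_names)
        self.primary = {n: {} for n in self.names}
        self.backup = {n: {} for n in self.names}

    def _owners(self, key: str):
        # Stable index from the key's bytes; the backup lives on the next node in the ring.
        i = sum(key.encode("utf-8")) % len(self.names)
        return self.names[i], self.names[(i + 1) % len(self.names)]

    def put(self, key: str, value) -> None:
        owner, replica = self._owners(key)
        self.primary[owner][key] = value
        self.backup[replica][key] = value

    def get(self, key: str):
        # Search every node's primary store so lookups still work after a failover.
        for store in self.primary.values():
            if key in store:
                return store[key]
        return None

    def fail(self, node: str) -> None:
        """Simulate losing a node: its RAM is gone and surviving backups are promoted."""
        lost_keys = set(self.primary.pop(node))
        self.backup.pop(node)
        self.names.remove(node)
        # Promote the surviving backup copies of the lost primaries.
        for name in self.names:
            for key in [k for k in self.backup[name] if k in lost_keys]:
                self.primary[name][key] = self.backup[name].pop(key)
        # Re-create one backup per entry on the next surviving node.
        for store in self.backup.values():
            store.clear()
        for i, name in enumerate(self.names):
            replica = self.names[(i + 1) % len(self.names)]
            self.backup[replica].update(self.primary[name])


cluster = ReplicatedGrid(["n1", "n2", "n3"])
for i in range(100):
    cluster.put(f"k{i}", i)
cluster.fail("n2")
```

After the simulated failure every key is still readable, because each entry that lived on the lost node also had a copy in another node's memory. A single backup only protects against losing one node at a time, and with only one node left the backup would sit on the same machine as the primary.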
We always ask industry leaders to provide one piece of business advice to leaders who are investing in big data analytics. What is your Carpe Datum Prescription for business leaders?
My advice is to be careful about “Big Data” hype. We’re hearing about implementations that simply take too much time, between 9 and 12 months, to produce value. The “Data Scientist” concept is often divorced from the domain expertise needed to produce real business insight. Listen to your developers and build solutions that are easy to use and can scale like crazy.