The third big data myth in this series concerns how some people define big data. They state that big data is data that is too big for a relational database, by which they undoubtedly mean a SQL database such as Oracle, DB2, SQL Server, or MySQL.
To prove that such statements are being made, I present two examples. First, the following statement is from PredictiveAnalyticsToday.com: “Big data is data that is too large, complex and dynamic for any conventional data tools to capture, store, manage and analyze.” By “conventional” they mean, among other things, the well-known SQL databases. Here is the second one: “There are times when data is either being updated too quickly or the data are simply too large to be handled practically by a relational database.” Again, they probably mean a SQL database.
To be honest, this is a silly and non-constructive way to define big data and to distinguish it from “small” data. True, there are some SQL products that are really not designed to support big data workloads. Even for some of the more well-known products it’s a challenge to store hundreds of terabytes of data and still offer decent performance.
Big data systems can be developed with SQL database server technology. This has been proven not only on paper, but in real-life projects as well. I will give two examples of categories of SQL products with which big data systems can be developed.
First, besides the traditional SQL database products, many so-called analytical SQL database servers exist today. These products have been designed and optimized to support analytics on big databases, and they all use SQL. With some of them, petabyte-scale big data systems have been developed. For example, as early as 2010 eBay operated a ten-petabyte database supported by Teradata. Granted, not every SQL database product is suitable for every possible type of big data workload, but the same holds for NoSQL products: most of them are also designed and optimized for specific big data workloads.
Second, don’t forget how popular SQL-on-Hadoop engines have become. Some claim that more than thirty-five of them already exist. Now, if the interface of a big data system is SQL, then that system is a SQL system, regardless of whether the SQL interface is internally supported by a classic SQL database server or by Hadoop. SQL-on-Hadoop engines can support massive databases.
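To make the point about the interface concrete, here is a minimal sketch in Python. It assumes a hypothetical `sales` table and a hypothetical helper called `top_customers`; neither comes from a real project. The helper only talks to a Python DB-API connection, so it does not know or care what engine sits behind it. For the sake of a runnable example the connection is an in-memory SQLite database, which is obviously not a big data system. The assumption is that a driver for an analytical SQL database server or a SQL-on-Hadoop engine could be passed in instead, provided that engine accepts the same `LIMIT` syntax.

```python
import sqlite3


def top_customers(conn, limit):
    """Return the customers with the highest total sales.

    `conn` can be any DB-API 2.0 connection. The function sees only
    the SQL interface, not the engine that executes the query.
    """
    cur = conn.cursor()
    # int() keeps the limit a plain number; no driver-specific
    # parameter style is used, so the statement stays portable.
    cur.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM sales "
        "GROUP BY customer "
        "ORDER BY total DESC "
        "LIMIT " + str(int(limit))
    )
    return cur.fetchall()


# Stand-in backend: an in-memory SQLite database (an assumption made
# purely so the sketch runs anywhere; it is not a big data engine).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("alice", 10.0), ("bob", 5.0), ("alice", 20.0)],
)

result = top_customers(conn, 2)
print(result)
```

From the point of view of the code that calls `top_customers`, the system is a SQL system. What stores and processes the data underneath is an implementation detail.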
In conclusion, the myth “big data is too big for SQL systems” has never made any sense, and it certainly makes no sense today. It really is a myth. SQL is definitely suitable for developing big data systems. Maybe not for all big data systems, but that applies to every technology. No database technology is perfect for every possible type of big data system.