With an array of new architectures -- parallelism, NoSQL, NewSQL, Hadoop, SQL on Hadoop, columnar RDBMS extensions and more -- it is an exciting time for data architecture, maybe too much so. We turned to longtime database management system (DBMS) industry observer Curt Monash, president of analyst company Monash Research, to help sort through the maze. In a series of interviews, we peppered him with questions about data technologies, and he responded.
Tech decision-makers in corporations interested in "real-time analytics" face a great variety of potential architectures. How do they go about sorting through the options and choosing one?
Curt Monash: Well, to quote Mike Stonebraker, 'One size doesn't fit all.' Now, he said that in a slightly different context, but one that is closely related to this.
We could look at this as a horses-for-courses scenario, which holds that, for any particular task you might want to do, the ideal data management system is likely to be different. But people need to make tradeoffs, because nobody can stand to run as many different data management systems as that would require.
So the right choice is positioned at neither extreme. A large enterprise is not going to have one [database management system] for everything, but it is also not going to have the perfect system for each little task.
We can ask what kinds of architectures are commonly going to be good choices. But there is no one thing that is going to be the right choice for everybody.
Often, the answer is: 'Sometimes something isn't broken, and doesn't need fixing.' And just as often the answer is: 'Sometimes something is broken, and still doesn't need fixing.'
Of course, for a sufficiently small enterprise, most of that doesn't apply. If they are small enough, then they can put all their data in one DBMS, not even separating for data warehousing and OLTP. Actually, those types of organizations are probably using SaaS anyway; but for whatever they are doing in-house, one or two DBMSs should suffice.
Looking for a thread, is it fair to say that one common theme running through many of the newly competitive architectures is parallelism? It's a big part of Hadoop and of Spark, which seems to be a big part of Hadoop 2.
Monash: First, let's back up. There are two major needs for parallel computing: one, splitting work among different cores on the same node, and two, splitting work among different nodes. Most people think in terms of the latter, which is fine, because if you can solve the latter you can also solve the former.
And basically, for certain magnitudes of data or work, you have to scale out to get it done. This is in essence because any one core on a CPU has stopped growing in performance. Chip makers had to stop simply increasing the clock speed, because that requires a lot of power and generates a lot of heat.
But improvement in chips has given you more cores, more cost-effectively. At a certain point you just end up having more nodes, and that just turns out to be the most cost-effective way to go forward.
The thing is that some tasks are incredibly easy to parallelize and some aren't. A term for the former -- 'embarrassingly parallel' -- is used a lot. You are basically doing the same simple job on many rows of data.
The MapReduce parallelization paradigm is very well suited for embarrassingly parallel jobs, and some others, but is not well suited for more general parallelization.
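To make "embarrassingly parallel" concrete, here is a minimal sketch of the map-then-reduce pattern in plain Python. It is not Hadoop code: the function names (`map_chunk`, `reduce_counts`, `word_count`) are hypothetical, and threads stand in for cluster nodes purely for illustration. The point is that each chunk is processed with no knowledge of any other chunk, and the partial results are merged only at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(rows):
    """Map step: count words in one chunk, independently of every other chunk."""
    counts = Counter()
    for row in rows:
        counts.update(row.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(rows, workers=4):
    # Split the rows into one chunk per (hypothetical) worker.
    chunks = [rows[i::workers] for i in range(workers)]
    # Threads stand in for nodes here; a real cluster would ship
    # each chunk to a separate machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Combine the independent partial counts into a single total.
    return reduce(reduce_counts, partials, Counter())


rows = ["one size does not fit all", "one size fits some", "all or some"]
totals = word_count(rows)
```

Because the map step never needs to see another chunk's data, adding workers requires no coordination beyond the final merge. That property is what makes a job of this kind easy to spread across many nodes.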
Hadoop 2 is trying to address that. And a consensus has arisen that the best next-generation parallelization paradigm is the one in Spark. It divides work and data into chunks more flexibly than MapReduce does. One thing that Spark seems particularly well suited for is machine learning of the sort that is quite iterative -- which MapReduce is pretty bad for. Also, MapReduce has a limitation in that it has to write intermediate results to disk frequently. That limitation should be lifted in many cases, and Spark in particular can lift it, because it can keep operating in memory. As well, a way was found early on for Spark to be helpful in handling streaming data.
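The difference between writing intermediate results to disk and keeping them in memory matters most for iterative work, where each pass depends on the previous one. The sketch below is plain Python, not Spark or Hadoop code, and every name in it (`gradient_step`, `fit_via_disk`, `fit_in_memory`) is hypothetical. It fits a single slope by repeated passes over a toy data set, once with a disk round trip between passes and once without.

```python
import json
import os
import tempfile


def gradient_step(w, data, lr=0.1):
    """One pass over the data: nudge w toward the least-squares slope for y ~ w*x."""
    grad = sum((w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def fit_via_disk(data, iterations, path):
    """Each pass reads the previous result from disk and writes its own back."""
    with open(path, "w") as f:
        json.dump({"w": 0.0}, f)
    for _ in range(iterations):
        with open(path) as f:
            w = json.load(f)["w"]
        w = gradient_step(w, data)
        with open(path, "w") as f:
            json.dump({"w": w}, f)
    with open(path) as f:
        return json.load(f)["w"]


def fit_in_memory(data, iterations):
    """Same passes, but the intermediate result never leaves memory."""
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(w, data)
    return w


# Toy data lying exactly on y = 2x, so the fitted slope should approach 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
with tempfile.TemporaryDirectory() as tmp:
    w_disk = fit_via_disk(data, 50, os.path.join(tmp, "state.json"))
w_mem = fit_in_memory(data, 50)
```

Both versions arrive at the same answer. The disk-backed one simply pays for a read and a write on every pass, a cost that grows with the number of iterations and with the size of the intermediate state.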