In the early days of big data that followed the invention of Hadoop at Yahoo, proponents emphasized its potential for replacing bulging enterprise data warehouses focused on business intelligence.
Open source Hadoop data tooling was positioned as an alternative to existing systems seen as expensive and ill-suited to the ever-larger volumes of arriving data.
That emphasis has shifted over time to complementing existing data warehouses and more. Hadoop applications have often come to be called data lakes, and the shifts just keep on coming.
Big data tooling has expanded far beyond mere data warehouses, according to Mike Matchett, an analyst and founder of the Small World Big Data consultancy.
"We are seeing increasing capabilities on the Hadoop and open source side to take over more and more of the corporation's data and workloads, including BI," Matchett said.
Today, complementing data warehouses means quite a few things, including the following:
- forging self-service descriptive analytics;
- creating analytics systems tied to real-time operations that work on many kinds of data;
- building out enterprise management traits for big data analytics;
- supporting AI-oriented predictive analytics, such as machine learning; and
- providing cloud versions of big data analytics tools.
The original independent Hadoop distribution providers have had to be agile amid such major shifts.
A look at recent efforts of pioneer Hadoop vendors, such as Cloudera, Hortonworks and MapR, forms a backdrop as many look toward next week's Strata Data Conference in New York. While big data tooling is still a prominent focus of an event that tends to highlight progress in big data, the conference's emphasis -- like that of original Hadoop boosters -- has moved deeply into the management of processes and far beyond just Hadoop.
Self-service comes to big data
In August, Cloudera rolled out Workload XM management services for cloud-based analytics. Almost at the same time, the company made a hybrid Cloudera Data Warehouse and a Cloudera Altus Data Warehouse service generally available on both AWS and Microsoft Azure clouds.
The management services seek to bring visibility into diverse data workloads. Workload XM is also designed to help administrators provide dependable service-level agreements for self-service analytics applications, according to Anupam Singh, general manager of analytics at Cloudera, based in Palo Alto, Calif.
Meanwhile, Singh said, the cloud warehouse offering enables encryption for data at rest or in motion, and it provides a view into the lineage of data sets in analytics workloads. Such capabilities have grown in importance, as GDPR and other data privacy initiatives have gained momentum.
All these moves play to corporations' needs to increase use of big data analytics, Singh said.
"Customers don't look at buzzwords like Hadoop and cloud. But they do want more business units to access the data," he said.
Individual business units want to spin up data warehouses on the cloud, in no small part, because doing so as an operating expenditure rather than a capital expenditure is beneficial, Singh said.
Data as a moving target
Cloud has been a clear preoccupation for Hadoop player Hortonworks, too. In June, the company expanded its Google Cloud presence with Google Cloud Storage support. Improving the management of real-time data analytics on cloud and on premises has also been a goal.
In August, Hortonworks sought to improve the handling of streaming data, launching Streams Messaging Manager (SMM) to provide administrators better views into Kafka messaging clusters that have become increasingly prevalent in big data pipelines.
Such management tools are an important part of moving Hadoop-style big data analytics into production in an area where established data warehouses may stumble -- that is, real-time applications that use incoming data to affect ongoing operations.
Recommendation engines and fraud detection are among the most cited application types that involve so-called data in motion.
Differences between the data-in-motion analytics tooling and traditional enterprise data warehousing show up in the rate and the volume of data being ingested, according to Jamie Engesser, vice president of product management at Hortonworks, based in Santa Clara, Calif.
"Now, we have analytics not only on data at rest, but also on data in motion," Engesser said. That creates a need for administrators and others to view message streams as they move through operations.
Engesser said SMM's capabilities can help in this regard, providing a deep view of traveling Kafka data, while displaying the lineage of that data for purposes of governance.
At the same time as it advanced the Kafka-related capabilities in SMM, Hortonworks released Hortonworks DataFlow 3.2, with improved performance for streaming based on enhancements to an underlying Apache Hive 3.0 implementation.
AI as a glowing orb
Like its competitors, MapR has expanded beyond its original use as a data warehouse replacement. And earlier this year, it released a version of its MapR Data Platform with improved streaming data analytics and new object data services that work either on cloud or on premises.
As the "platform" in the name implies, it's intended as an all-purpose place for handling various types of data and analytics, including machine learning and deep learning -- predictive approaches that have gained momentum since Hadoop first appeared.
These AI-related approaches don't necessarily make analytics easier for users, according to Ted Dunning, chief application architect at MapR, based in Santa Clara, Calif.
"People get excited about AI as if it's a wonderful glowing orb in the sky, like it's magic," he added. "But it isn't magical. There is a lot of hard work that has to be done." Much of the work is on the data side of the process, he emphasized.
Like its fellow Hadoopsters, MapR supplements SQL data analytics tooling with notebook-style data science tooling for the machine learning crowd.
In MapR's case, it's called the MapR Data Science Refinery. It provides the data tooling for Python and R programmers working on machine learning and deep learning applications.
Dunning and Ellen Friedman, principal technologist at MapR, have co-authored a book, AI and Analytics in Production: How to Make It Work, that is scheduled for release at next week's Strata. The book puts forward the idea that while a single platform can serve multiple analytics purposes, AI is truly a distinct discipline of analytics.
It's the analytics, foremost
Big data's potential has led vendors and users alike into fields beyond Hadoop. The horizons seem to be constantly expanding. Data tooling keeps changing, but there is a common thread.
"Basically, it's analytics," said Tony Baer, an analyst at Ovum.
"If you were doing analytics a few years ago for big data -- that is, data in sizes beyond that of the largest traditional data warehouse -- Hadoop was your choice," Baer said.
But, today in the cloud, "the choice may be an Apache Spark service or a machine learning service," he said. "Hadoop is now just one of the options for doing big data analytics."
Baer noted, however, that Hadoop is unique as a multipurpose platform, and the data governance abilities that vendors are building in will increasingly be called on over time.