External storage might make sense for Hadoop

Using Hadoop to drive big data analytics doesn't necessarily mean building clusters of distributed storage; a good old array might be a better choice.

The original architectural design for Hadoop made use of relatively cheap commodity servers and their local storage in a scale-out fashion. Hadoop's original goal was to enable cost-effective exploitation of data that previously wasn't economically viable to process. We've all heard about big data volume, variety, velocity and a dozen other "v" words used to describe these previously hard-to-handle data sets. Given such a broad target by definition, most businesses can point to some kind of big data they'd like to exploit.

Big data is growing bigger every day, and storage vendors, with their relatively expensive SAN and network-attached storage (NAS) systems, are starting to work their way into the big data party. They can't simply leave all that data to server vendors filling boxes with commodity disk drives. Even though Hadoop adoption is still in its early stages, the competition and the confusing marketing noise are ratcheting up.

High-level Hadoop and HDFS

In a Hadoop scale-out design, each physical node in the cluster hosts both local compute and a share of data; it's intended to support applications, such as search, that often need to crawl through massively large data sets. Much of Hadoop's value lies in how it effectively executes parallel algorithms over distributed data chunks across a scale-out cluster.

Hadoop is made up of a compute engine based on MapReduce and a data service called the Hadoop Distributed File System (HDFS). Hadoop takes advantage of high data "locality" by spreading big data sets over many nodes using HDFS, farming out parallelized compute tasks to each data node (the "map" part of MapReduce), followed by various shuffling and sorting consolidation steps to produce a result (the "reduce" part).
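The map, shuffle and reduce steps described above can be illustrated with a minimal, single-process sketch in plain Python. This is not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. In a real cluster, each map task would run on the node that already holds its chunk of data, which is where the "locality" benefit comes from.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key across every mapper's output."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Here the map calls run one after another; Hadoop would run them
    # in parallel, one per data chunk, on the nodes storing the chunks.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    chunks = ["big data big storage", "data locality matters", "big clusters"]
    print(word_count(chunks))
```

The point of the sketch is the shape of the computation: the map step needs only its own chunk, so it can be farmed out to wherever that chunk lives, while the shuffle step is the part that has to move data between nodes.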

Commonly, each HDFS data node will be assigned DAS disks to work with. HDFS will then replicate each block of data across the data nodes, usually making two or three copies on different data nodes. Replicas are placed on different server nodes, with the second replica placed on a different "rack" of nodes to help avoid rack-level loss. Obviously, replication takes up more raw capacity than RAID, but it also has some advantages, like avoiding rebuild windows.
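The rack-aware placement just described can be sketched in a few lines of Python. This is a simplified, deterministic illustration rather than HDFS's actual placement code, which chooses nodes randomly and also weighs factors such as node load and free space. The function name place_replicas and the topology dictionary are assumptions made for this example.

```python
def place_replicas(writer_node, topology, replication=3):
    """Pick nodes for one block's replicas.

    topology maps a rack name to the list of node names in that rack.
    The first replica stays on the writing node, the second goes to a
    different rack, and any further replicas prefer the second replica's
    rack before falling back to any unused node.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]
    placement = [writer_node]

    # Second replica: first node found on any rack other than the writer's.
    remote_nodes = [
        node
        for rack in sorted(topology)
        if rack != local_rack
        for node in topology[rack]
    ]
    if replication >= 2 and remote_nodes:
        placement.append(remote_nodes[0])

    # Remaining replicas: same rack as the second replica first, then anywhere.
    preferred = list(topology[rack_of[placement[1]]]) if len(placement) == 2 else []
    everything = [node for rack in sorted(topology) for node in topology[rack]]
    for node in preferred + everything:
        if len(placement) >= replication:
            break
        if node not in placement:
            placement.append(node)
    return placement


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", racks))
```

With two racks of two nodes each, a block written from n1 ends up on n1, n3 and n4, so losing either rack still leaves at least one copy.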

Why enterprise-class storage?

So if HDFS readily handles the biggest of data sets in a way native to the MapReduce style of processing, uses relatively cheap local disks and provides built-in "architecture-aware" replication, why consider enterprise-class storage? For one, there is still a lurking vulnerability in the HDFS metadata server nodes (the NameNode). While each version of Hadoop improves on HDFS's reliability, there's still a good argument for placing the HDFS metadata servers on more reliable RAID-based storage.

There are a lot of IT reasons for using external shared storage for the bulk of the data. First, while Hadoop can scale out to handle multiple petabytes of data, most big data sets are likely to be in the 10 TB to 50 TB range. Those are multi-TB sizes that traditional database offerings can't practically handle, but that are well within the cost-effective range of scale-out SAN and NAS solutions. And those shared data sets, often integral to a company's existing business processes, can be more efficiently mastered, managed and integrated on enterprise storage than in HDFS.

While there are evolving security-conscious components being built for the Hadoop ecosystem (e.g., Sentry, Accumulo), data security and data protection are other key reasons to consider external storage. Native HDFS isn't easy to back up, protect, secure or audit. NAS and SAN systems, of course, are built with strong data protection features such as snapshots.

Using external enterprise storage, a highly available Hadoop application (becoming more common as Hadoop evolves more real-time query and streaming analytics capabilities) might never know disk failures have even happened.

And by architecting Hadoop with external storage, you not only separate out the storage management, but can also take advantage of separate "vectors of growth": It's easier to add storage or compute without adding other, unneeded resources. There's some Capex benefit as well, as enterprise RAID solutions will use a smaller disk footprint than Hadoop's "gross" replication.
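A rough sense of that footprint difference comes from simple arithmetic. The sketch below compares the raw capacity needed to hold the same usable data under three-way replication and under a parity RAID group. The RAID layout (eight data disks plus two parity disks) is an assumption chosen for illustration, not a figure from any particular array, and the calculation ignores hot spares, file system overhead, compression and the scratch space Hadoop jobs need.

```python
def raw_capacity_replication(usable_tb, replicas=3):
    """Raw TB needed when every block is stored `replicas` times."""
    return usable_tb * replicas


def raw_capacity_raid(usable_tb, data_disks=8, parity_disks=2):
    """Raw TB needed for a parity RAID group (assumed 8+2 layout by default)."""
    return usable_tb * (data_disks + parity_disks) / data_disks


if __name__ == "__main__":
    usable_tb = 50  # upper end of the 10 TB to 50 TB range mentioned earlier
    print("3x replication:", raw_capacity_replication(usable_tb), "TB raw")
    print("RAID 8+2:      ", raw_capacity_raid(usable_tb), "TB raw")
```

Under these assumptions, 50 TB of usable data needs 150 TB of raw disk with three copies, versus 62.5 TB in the assumed RAID layout. Whether the cheaper commodity disks still win on cost depends on per-terabyte prices, which is why the comparison has to be made on total cost rather than on capacity alone.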

Sharing is a big win for external storage, as moving big data into and out of a Hadoop cluster can be challenging. With external storage, multiple applications and users can access the same "master" data set with different clients, even updating and writing data while it's being used by Hadoop applications.

Virtualizing Hadoop

External storage may also offer advantages in a virtualized Hadoop scenario, which we expect will become a more common way to deploy Hadoop in enterprises. Deploying Hadoop scale-out nodes as virtual machines allows on-demand provisioning, and makes it easy to expand or shrink clusters.

Multiple virtual Hadoop nodes can be hosted on each hypervisor and can easily be allocated more or fewer resources for a given application. Hypervisor-level high-availability (HA)/fault tolerance capabilities can be tapped for production Hadoop applications. Performance is a concern, but more resources can be applied dynamically where needed to produce parity, if not superior performance, for certain Hadoop applications.

Virtually storing big data

One compelling reason to look at a physical Hadoop architecture is to avoid expensive SANs, especially as data sets grow larger. Yet in a virtual environment it may make even more sense to consider external storage. One reason is that provisioning compute-only virtual Hadoop clusters is quite simple, but throwing around big data sets will still be a challenge. By hosting the data on external shared storage, provisioning virtual Hadoop hosting becomes almost trivial, and hypervisor features like DRS and HA can be fully leveraged.

Since a single big data set can be readily shared "in place" among multiple virtualized Hadoop clusters, there's an opportunity to serve multiple clients with the same storage. By eliminating multiple copies of data sets, reducing the amount of data migration, and ensuring higher availability and data protection, Hadoop becomes more manageable and readily supported as an enterprise production application. The TCO of hosting virtualized Hadoop on fewer, but relatively more expensive, virtual servers with more expensive storage options could still be lower than standing up a dedicated physical cluster of commodity servers.

It's how you use it that matters

External storage is more expensive than the default DAS option, but it's the "other" things about storing data that even up the accounting. The decision to use external storage needs to be made on a TCO basis, taking into account both the incoming source and the end-to-end workflow of the data sets. Other workloads might be able to share a single data repository effectively, and existing assets and skills can be leveraged. On the other hand, there may be limits to the ingestion, performance, capacity or scalability of high-end storage.

There are a lot of choices, with more on the way. But a knowledgeable storage manager has plenty of experience that applies to big data whether it's in HDFS on local disk or hosted on external storage.

About the author:
Mike Matchett is a senior analyst and consultant at Taneja Group.