Can shared storage be used with Hadoop architecture?
That's controversial, I would say, at the moment. The original implementers of Hadoop didn't use shared storage for a couple of reasons. One, they were looking for performance. They were trying to squeeze as much latency out of these Hadoop clusters as they possibly could, so they embedded disks within each one of the processing nodes for a massively parallel processing architecture. Hadoop architecture uses a distributed file system so it is distributed across nodes, and the data is usually stored in just a bunch of disks (JBOD) storage embedded within each one of these nodes. This is done for performance reasons.
Another reason is the original developers were trying to hold costs of these frameworks to an absolute minimum, so the code was open source, the servers were commodity, the interconnect was standard Gigabit Ethernet and the storage was either just disk or a collection of disks. They would argue that using shared storage in any form number introduces another network to the cluster, which adds latency. It also adds expense unnecessarily. They would argue that some storage services you get from shared storage, data protection services like RAID, for example, are accounted for in the cluster itself and the HDFS code because the cluster maintains three copies of the data at all times.
But, you still might want to look at shared storage in the context of Hadoop architecture for other reasons. It depends on what you want to do with shared storage. If you come to the conclusion that HDFS has some storage and data management services, but they're not all of the ones you'd like to see (there's still no snapshot capability, deduplication or concept of storage tiering, for example), but you'd like to apply these extra storage services, then start looking to vendors' shared storage architectures. Think about how you can apply the services you're familiar with to Hadoop storage architecture.
One way to do that is to replace the direct-attached storage (DAS) that's embedded in each one of the data nodes with a shared storage layer. This is something that EMC has proposed with Isilon. Isilon acts as the shared storage layer for HDFS in the Hadoop cluster, and they've run HDFS as a protocol layer on top of the Isilon 1FS file system. Built into 1FS is all of those storage features that you might want to apply to the Hadoop clusters that aren't in HDFS. That's one way to do it.
When you've come to the conclusion that you need to keep a certain amount of data within the cluster at all times, but you don't necessarily need to keep all data from the inception of the cluster, there are a couple of ways to approach that. When data just keeps building up and building up, you might want to let data age for a certain period of time and then flush it out. Enterprises seem to have these policies of keeping everything forever and saving everything. But it doesn't really make a lot of sense to keep letting data build up in the cluster. You might want to have some way of siphoning some older, inactive data off, but keep it available to the cluster in case you need it.
One way to do that is to use shared storage to build a secondary storage layer under the Hadoop cluster that's using direct-attached storage. You'll have to write some code or make modifications in order to do that because HDFS natively doesn't understand the secondary storage layer or storage tiering. But that can be done. People are doing it.
People are also using clouds as a secondary storage layer. They implement an S3 application programming interface, Amazon S3 API, and siphon data off to the cloud. The data is maintained and seen as available to the cluster, but it's actually used as an active archive and, in a way, it's also backup storage.
A third way to do this is something that NetApp has been working on with Cloudera. Their idea is to use one of the E-Series storage boxes as direct-attached storage for up to four nodes in a Hadoop cluster. It looks on the surface architecturally like four nodes in a Hadoop cluster are sharing a single NetApp E-Series device, but, in fact, you're actually partitioning inside the E-Series box and making only certain disks available to certain nodes within the cluster. In a way the device is being shared across nodes, but it's still preserving the direct-attached storage nature of Hadoop. That's another implementation of shared storage with Hadoop.
There's also growing interest in running Hadoop in virtual machines, either under VMware or Hyper-V. There are a few projects going on right now. VMware's Serengeti, for example, is a project that will take Hadoop nodes and run them under VMware virtual machines. This is another application for shared storage.
About the author:
John Webster is a senior partner at Evaluator Group Inc., where he contributes to the firm's ongoing research into data storage technologies, including hardware, software and services management.