Big data basics: Quick tips for prepping data

In his Storage Decisions conference presentation, "What is big data and why should you care: Architecting storage for big data environments," Ben Woo, managing director of Neuralytix, drove home the point that storage managers are directly and inevitably tied to the big data phenomenon.

To get a clear understanding of some of the big data basics, Woo said, it's crucial to eliminate some of the myths associated with big data. For starters, many people mistakenly think that "big data is about Hadoop, or that Hadoop equals big data," Woo said. "They're related, but they're not exactly one for one."

Hadoop is a part of the big data stack, Woo explained, and essentially has a couple of functions. It has a storage or data management function, which is the Hadoop Distributed File System (HDFS), and it also has a process management function, which is typically MapReduce. "Always remember that Hadoop is a set of applications that work together, and it's a framework. It's not a singular solution," he said. Even so, Woo said, Hadoop is required to run big data environments, along with applications that "understand" Hadoop.
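
To make the process management side more concrete, here is a minimal sketch of the MapReduce pattern in plain Python. It is not Hadoop code and does not come from Woo's presentation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions. On a real cluster the framework would distribute these steps across many nodes that read their input from HDFS.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on a single machine."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input, purely for illustration.
counts = word_count(["big data is big", "Hadoop is a framework"])
```

The three steps are kept separate because that separation is what lets a framework run many map and reduce tasks in parallel.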

To prepare your data for a big data transformation, your data must be considered centralized, multiprotocol and shareable, Woo said. When talking about multiprotocol access, Woo said, "I'm talking about REST APIs, and other forms of APIs, where you can get to the data natively. You can extract the data out -- for analytics purposes but not necessarily for transformation purposes -- along the way."
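
As one illustration of REST-style access, HDFS exposes a REST interface called WebHDFS. The sketch below only builds request URLs and sends nothing over the network. It assumes WebHDFS is enabled on the cluster; the host name, the file paths and the helper name `webhdfs_url` are hypothetical, and port 9870 is the usual NameNode HTTP port in Hadoop 3.x (older releases commonly use 50070). Woo did not name a specific API, so treat this as one possible example rather than the approach he described.

```python
from urllib.parse import quote, urlencode

def webhdfs_url(host, path, op, port=9870, **params):
    """Build a WebHDFS request URL for an HDFS path (hypothetical helper).

    host and port identify the NameNode's HTTP endpoint; both are
    assumptions that depend on how the cluster is configured.
    """
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"

# List a directory (hypothetical host and path).
list_url = webhdfs_url("namenode.example.com", "/data/logs", "LISTSTATUS")

# Read the first 1,024 bytes of a file (hypothetical host and path).
open_url = webhdfs_url(
    "namenode.example.com", "/data/logs/app.log", "OPEN", offset=0, length=1024
)
```

An analytics tool could issue HTTP GET requests against URLs like these to pull data out where it sits, without first copying it into another system.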

Woo offered this short list of big data basics designed to give storage pros an idea of what is necessary to transform existing systems into a big data framework. For hardware, he said there are four areas to focus on:

  • Most likely, new big data "clusters" will have to be created. "Most organizations are not set up to have clusters of storage equipment. That is an evolutionary process," Woo said.
  • Hadoop Virtualization Extensions could be leveraged to keep everything virtualized and highly available. "Again, we're extracting different layers out where they should be extracted."
  • Storage that supports HDFS
  • Faster storage networking