
Three ways to build a big data system

In a book excerpt, author Dale Neef outlines and compares different approaches organizations can take when trying to bring a big data system into their IT environments.

This is an excerpt from Chapter 10, "Doing Business in a Big Data World," from Dale Neef's book Digital Exhaust: What Everyone Should Know About Big Data, Digitization and Digitally Driven Innovation. Neef is a technology consultant, speaker and author who focuses on big data management, electronic monitoring and reporting.

In the chapter, Neef explores architectural, organizational and security issues that organizations must take into account when planning a big data system and integrating Hadoop clusters, NoSQL databases and other big data technologies with their current systems.

Many organizations are in a quandary about whether the advantages of big data search and analytics justify an infrastructure upheaval, and they are wondering how best to combine the two different frameworks (conventional and big data) for their particular organization. There are three broad configuration choices available to them.

1. Do-it-yourself 'build-on' to a company's current enterprise IT structure

Even if they are not completely satisfied with the current level of data capture and analysis, most companies considering a move toward adopting big data technologies already have a well-staffed and relatively modern IT framework based on relational database (RDB) management systems and conventional data warehousing.

Copyright info

This excerpt is from the book Digital Exhaust: What Everyone Should Know About Big Data, Digitization and Digitally Driven Innovation, by Dale Neef. Published by Pearson FT Press. ISBN 978-0133837964. Copyright 2014.

Any company already managing a large amount of structured data with enterprise systems and data warehouses is therefore fairly well versed in the day-to-day issues of large-scale data management. Because big data is the next big thing in the evolution of information technology, it would seem natural for those companies to simply build a NoSQL-type or Hadoop-type infrastructure themselves and incorporate it directly into their current conventional framework. In fact, ESG, the advisory and IT market research firm, estimated that by the beginning of 2014, more than half of large organizations would have begun this type of do-it-yourself approach. As we've seen, because it is open source software, the price of a Hadoop-type framework (free) is attractive, and, provided the company has employees with the requisite skills, it is relatively easy to begin to work up Hadoop applications using in-house data or data stored in the cloud.


There are also various methods of experimenting with Hadoop-type technologies using data outside a company's normal operations, through pilot programs, or what Paul Barth and Randy Bean on the Harvard Business Review blog network describe as an "analytical sandbox," in which companies can try their hand at applying big data analytics to both structured and unstructured data to see what types of patterns, correlations or insights they can discover.

But experimenting with some Hadoop/NoSQL applications for the marketing department is a far cry from developing a fully integrated big data system capable of capturing, storing and analyzing large, multistructured data sets. In fact, successful implementation of enterprise-wide Hadoop frameworks is still relatively uncommon, and mostly the domain of very large and experienced data-intensive companies in the financial services or pharmaceutical industries. As we have seen, many of those big data projects still primarily involve structured data and depend on SQL and relational data models. Large-scale analysis of totally unstructured data, for the most part, still remains in the rarefied realm of powerful Internet tech companies like Google, Yahoo, Facebook and Amazon, or massive retailers like Wal-Mart.


Because so many big data projects are still largely based on structured or semistructured data and relational data models that complement current data management operations, many companies turn to their primary support vendors -- like Oracle or SAP -- to help them create a bridge between old and new and to incorporate Hadoop-like technologies directly into their existing data management approach. Oracle, for example, asserts that its preconfigured Big Data Appliance -- once various costs are taken into account -- is nearly 40% less expensive than an equivalent do-it-yourself system and can be up and running in a third less time.
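To make the vendor's claim concrete, here is a minimal arithmetic sketch. The baseline cost and timeline are hypothetical figures chosen only for illustration; they do not come from Oracle or from the book. Only the two ratios (nearly 40% and one third) are taken from the claim above.

```python
# Hypothetical do-it-yourself baseline (illustrative numbers, not from the source).
DIY_COST = 1_000_000   # assumed total cost in dollars
DIY_WEEKS = 12         # assumed time to get the system up and running

# Ratios taken from the vendor claim quoted in the text.
COST_REDUCTION = 0.40      # "nearly 40% less expensive"
TIME_REDUCTION = 1 / 3     # "a third less time"


def preconfigured_estimate(diy_cost, diy_weeks):
    """Apply the claimed reductions to a do-it-yourself baseline."""
    cost = diy_cost * (1 - COST_REDUCTION)
    weeks = diy_weeks * (1 - TIME_REDUCTION)
    return cost, weeks


cost, weeks = preconfigured_estimate(DIY_COST, DIY_WEEKS)
print(f"Preconfigured estimate: ${cost:,.0f} over about {weeks:.0f} weeks")
```

Because the result scales directly with the baseline, the comparison is only as good as the do-it-yourself estimate a company plugs in, including the "various costs" the vendor factors into its own figure.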

And, of course, the more fully big data technologies are incorporated directly into a company's IT framework, the more the complexity and the potential for data sprawl grow. Depending on configurations, full integration into a single, massive data pool (as advocated by big data purists) means pulling unstructured, unclean data into a company's central data reservoir (even if that data is distributed) and potentially sharing it out to be analyzed, copied and possibly altered by various users throughout the enterprise, often using different configurations of Hadoop or NoSQL written by different programmers for different reasons. Add to that the need to hire expensive Hadoop programmers and data scientists. For traditional RDB managers, that type of approach raises the specter of untold additional data disasters, costs and rescue work requests to already overwhelmed IT staff.

2. Let someone else do it in the cloud

The obvious alternative to the build-it-yourself approach is to effectively rent the key big data applications, computation and storage using a cloud-sourced, Hadoop-like solution, pulling data from your own organization into a common repository held in the cloud and accessed (or potentially even fully administered) by your own data engineers. In this scenario, that cloud-based repository can consist of both structured and unstructured data and can be held entirely separate from the structured day-to-day operational, financial and transactional company data, which would remain ring-fenced in the company's enterprise and relational database management system. This approach takes a bit of thought and data management at the front end, but once the cloud repository of structured and unstructured data is available, companies can experiment with large data sets and cloud-based big data analytical technologies, oblivious to the underlying framework.

The best thing about this approach -- apart from the fact that companies don't have to buy and maintain the hardware and software infrastructure -- is that it is scalable. Companies can experiment with different types of data from different sources, without a huge up-front capital investment. Projects can be as small (analyzing a handful of products or customers or social media sites) or as large and complex as a company wants. And, most importantly, a company doesn't have to modify its current systems or run a parallel internal system on its own.

It sounds like the perfect solution, but, as always, there are drawbacks. First, even if the rental technologies are really able to cope with hugely varying data, it doesn't mean that the resulting patterns or correlations will mean anything unless a thorough process of data cleansing and triage happens first. Although cloud-based tools have obvious advantages, every company has different data and different analytical requirements, and as we've seen in the past, one-size-fits-all tools are seldom as productive or as easy to use as advertised. And, of course, when the reports come back with distorted findings (and after a futile effort at resolving the technical issues themselves), users from marketing or sales will very likely turn to the IT department for help, anyway. That essentially means that a good portion of IT staff still needs to be engaged in big data management and trained in the tools and data schema preparation that will allow this approach to work. And as noted before, ultimately, using small subsets of data, even when that data is from a variety of sources and is analyzed with Hadoop or NoSQL technologies, is really more conventional business intelligence (with bells and whistles) than it is big data.
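The "data cleansing and triage" step mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular cloud tool: the field names (customer_id, comment) and the rules (reject blank fields, drop duplicates after normalizing text) are hypothetical.

```python
# Hypothetical required fields for a customer-feedback record.
REQUIRED_FIELDS = ("customer_id", "comment")


def is_blank(value):
    """Treat None, empty strings and whitespace-only strings as missing."""
    return value is None or not str(value).strip()


def cleanse(records):
    """Split raw records into (clean, rejected).

    A record is rejected if a required field is blank or if it duplicates
    an earlier record once whitespace and letter case are normalized.
    """
    seen = set()
    clean, rejected = [], []
    for rec in records:
        if any(is_blank(rec.get(field)) for field in REQUIRED_FIELDS):
            rejected.append(rec)
            continue
        key = (str(rec["customer_id"]).strip(),
               str(rec["comment"]).strip().lower())
        if key in seen:
            rejected.append(rec)
            continue
        seen.add(key)
        clean.append({"customer_id": key[0], "comment": key[1]})
    return clean, rejected


raw = [
    {"customer_id": "C1", "comment": "Great product "},
    {"customer_id": "C1", "comment": "great product"},  # duplicate once normalized
    {"customer_id": "", "comment": "no id"},            # missing customer_id
    {"customer_id": "C2", "comment": None},             # missing comment
    {"customer_id": "C3", "comment": "Late delivery"},
]
clean, rejected = cleanse(raw)
print(f"{len(clean)} clean records, {len(rejected)} rejected")
```

Keeping the rejected records, rather than silently discarding them, gives IT staff something to inspect when marketing or sales users come back with questions about distorted findings.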

Cloud-based providers are obviously aware of these issues. They know that to make this model work, they need to make their offerings as easy, flexible and powerful as possible. A good example of this is the strategic alliance between Hortonworks and Red Hat (Hortonworks provides the Hadoop distribution and Red Hat provides the cloud-based storage), which they say includes preconfigured, business-friendly and reusable data models, and an emphasis on collaborative customer support.

3. Running parallel database frameworks

A third configuration involves building a big data system separately from, and in parallel with (rather than integrated into), the company's existing production and enterprise systems. In this model, most companies still take advantage of the cloud for data storage but develop and experiment with enterprise-held big data applications themselves. This two-state approach allows the company to construct the big data framework of the future, while building valuable resources and proprietary knowledge within the company. It provides complete internal control in exchange for duplicating much of the functionality of the current system, and it allows for a future migration to a full-fledged big data platform in which both systems (conventional and big data) eventually merge.

The problem with this approach is that, in many ways, the very nature of a big data framework is different from conventional IT. Traditional IT still involves applications, operating systems, software interfaces, hardware and database management, whereas big data involves some database work but is mostly about complex analytics and structuring meaningful reports -- something that requires a different set of skills than is found in most IT departments today. Although this side-by-side configuration assumes some level of savings from economies of scale (sharing existing computing power, utilizing current staff and so on), the reality is that those savings may come only at the expense of complicated interfaces between old and new systems that have to be designed and managed.
