Why Are Data Warehouses Growing So Fast?

Richard Winter provides observations on the causes and effects of the explosion of digital information.

This article originally appeared on the BeyeNETWORK.

We have gotten used to hearing about bigger data warehouses every year. Recently I did some work for a customer – a large commercial entity – that will soon have about 300 TB (terabytes) of data in its enterprise warehouse. That is expected to double over the next three years, growing to 600 TB by 2011! I have met privately with other users who have over 200 TB in production in one integrated data warehouse – and who anticipate continued rapid growth in data volume through 2008.

Figure 1: Exponential Data Warehouse Growth
(size in terabytes of user data)

We know that this roaring growth is happening all around us – in fact, it has been happening for a while. WinterCorp primary research, from the Winter TopTen Program, shows a consistent trend since 1998: the size of the largest data warehouse we validate triples approximately every two years.

This is a remarkable growth rate within the universe of measured data warehouse sizes: a factor of 9 in four years. Few other measures have sustained such a high, consistent growth rate over the last ten years. Even the growth in processor capacity predicted by Moore’s Law is somewhat slower.
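The comparison can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it assumes the common reading of Moore’s Law as a doubling every two years (some statements of the law use 18 months, which would narrow the gap).

```python
def growth_factor(multiple, period_years, horizon_years):
    """Compound growth over horizon_years when a quantity is
    multiplied by `multiple` every period_years."""
    return multiple ** (horizon_years / period_years)


# Largest validated warehouse: triples about every two years (from the article).
warehouse_4yr = growth_factor(3, 2, 4)
warehouse_annual = growth_factor(3, 2, 1)

# Moore's Law: ASSUMED here to mean a doubling every two years.
moore_4yr = growth_factor(2, 2, 4)
moore_annual = growth_factor(2, 2, 1)

print(f"Warehouse size: {warehouse_4yr:.1f}x in four years, "
      f"{warehouse_annual:.2f}x per year")
print(f"Moore's Law:    {moore_4yr:.1f}x in four years, "
      f"{moore_annual:.2f}x per year")
```

Under these assumptions, tripling every two years works out to about 1.73x per year, against about 1.41x per year for a two-year doubling.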

This explosion of digital information is affecting us profoundly in many ways. But, because it is business that is paying for the creation, storage and management of much of this information, the growth of data warehouses leads most obviously to a financial question. Why are these data warehouses growing so fast? And, from a business point of view, is this really necessary?

So, from my experience in consulting with users, here are a few observations on the subject.

Business Strategies Require Comprehensive Atomic Data

A wide variety of business strategies leads one to analyze data at the most detailed level possible. Some examples are:

  • Acquiring profitable customers;

  • Fraud detection;

  • Rewarding the best performing suppliers;

  • Investing in the most profitable products/services, channels and market segments;

  • Determining which cost reductions detract the least from business performance;

  • Triaging customers to get the best return on customer retention expenditures (aka managing churn);

  • Quantifying the relationship between manufacturing process variables and product quality; and,

  • Replenishing stores to optimize profit.

Each of these strategies rewards the analysis of data at the most detailed level available.

How profitable is a given customer? That can only be determined by analyzing the full detail of the customer’s behavior – not only how much the customer purchased but what products and services were purchased. What price was paid? Did the customer return purchases for refunds? Did the customer make changes after ordering? Did the customer require exceptional service and support? And so on. So, you need full detail on all aspects of the customer’s interactions with the enterprise.

Figure 2: Impact of Fundamental Business Uses

But it goes beyond customer behavior, because you also need full detail on all the associated costs. What did it cost to deliver each product or service purchased? What did it cost to deal with the returns?

The story is similar in strategies related to suppliers; products; channels; employees; and design, engineering and manufacturing processes.

So, in many enterprises, you need to warehouse the most detailed data you can get – and you need to do so on a wide variety of subjects.

Business Strategies Require Long-Term Retention

To have the detailed data to support this analysis means collecting and retaining a large volume of information every day. But it is not just daily volumes that are driving up data warehouse size – it is also the length of time that data must be retained.

Many companies are now retaining – or working toward retaining – seven years of data on virtually every type of business activity. This is in sharp contrast to practices five years ago, when a three-year retention period was far more common.
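As a rough illustration of what the longer retention period alone does to warehouse size, the sketch below assumes a constant daily load. The 50 GB per day figure and the function name are hypothetical, chosen only to make the arithmetic concrete; real daily volumes also grow over time, which would widen the gap further.

```python
def retained_volume_tb(daily_load_gb, retention_years):
    """Steady-state retained volume in TB, assuming a constant daily load
    and 365 load days per year (both simplifying assumptions)."""
    return daily_load_gb * 365 * retention_years / 1000


DAILY_LOAD_GB = 50  # hypothetical figure, not taken from the article

three_year_tb = retained_volume_tb(DAILY_LOAD_GB, 3)
seven_year_tb = retained_volume_tb(DAILY_LOAD_GB, 7)

print(f"3-year retention: {three_year_tb:.2f} TB")
print(f"7-year retention: {seven_year_tb:.2f} TB")
print(f"Ratio: {seven_year_tb / three_year_tb:.2f}x")
```

With a flat daily load, moving from three to seven years of retention multiplies the retained volume by 7/3, or roughly 2.3x, before any growth in daily volume is counted.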

How did we get to longer retention periods?

The first answer you get from many is “regulation.” New regulations since 2001, including Sarbanes-Oxley, require that many types of data be retained for seven years. This includes: data that supports financial statements; data made available to investors; data supporting securities trading; even, it turns out, data supporting energy trading (e.g., companies that use large volumes of petroleum products – whether for manufacturing, transportation, heating or power generation – often make complex long-term trades in energy markets). But data retained strictly for regulatory reasons is used relatively infrequently.

From a data management perspective, I believe the more significant trends in long-term data retention have to do with fundamental business needs. Every business needs to know how to acquire and retain profitable customers. And, for many businesses, it is the long-term, recurring customer that generates most of the profit. Understanding the needs of such customers – how to satisfy them efficiently and how to acquire more of them – is one of the key challenges of business management. And, in virtually all businesses, this must be done analytically.

But, taking one example, analysis of the development of a profitable customer requires long-term data. In many businesses, it takes a year or more for a customer to develop his or her purchasing pattern. For example, in a men’s clothing store (or chain), a customer may visit to look at the merchandise on several occasions before buying anything. Then, he may purchase a small item. Over a period of time, purchases become more frequent. Initially, perhaps he buys casual clothes at the store. But, one day he looks at the suits. It may take two or three years before the customer develops the habit of buying most of his clothes – both casual and dress – from that store. Analysis of how profitable customers develop, then, requires access to multiple years of data.

This pattern is present in many industries. In financial services and banking, the longest lasting – and typically most profitable – customer relationships are the ones in which the customer uses multiple services (not just checking, but also mortgage, car loan, and funding for education and retirement). These patterns, also, can develop over a period of years.

This phenomenon does not apply only to customers. Similar long-term analysis requirements exist in regard to products, to manufacturing capabilities, to suppliers or supply chains, and to the productivity of employees (e.g., in sales).

Some types of products – vehicles, appliances and computers are three examples – have a usual lifetime of three or more years. Understanding which products are most profitable, which design or manufacturing options worked out best and which uses certain products suit best are all important factors in product success. And an analytical capability to answer any of these questions – or to manage product profitability, warranty cost, repair frequency or customer satisfaction – requires the retention and ongoing analysis of long-term data.

Long-term data is with us now and is not likely to go away, whether or not regulations and regulatory methods change. Some enterprises have begun warehousing data for seven years or more for regulatory reasons. But, in my experience, most find that the data – if cleaned and integrated – is essential for other business uses as well.

Other Sources of Data Volume Growth

Figure 2 identifies some other sources of growth in data volume in data warehousing. We have already discussed how fundamental business uses are driving up the need for full atomic detail, for long-term retention and for data on an increasing array of subjects. Also, the drive to understand and address customer needs and interests has resulted in a financial interest in analyzing all types of customer interactions with the enterprise – not just transactions, but service requests, complaints, questions and so on. In connection with this interest, there are also new types of data available (text from e-mail and web interactions, audio from the call center, GPS data from cell phones and other sources, etc.) that are further increasing data warehouse sizes.

Summing Up

What can we conclude from these trends? I believe that virtually all organizations are facing continued rapid growth in data volumes – typically 1.5 to 2.5x per year – and for business reasons that are more or less inescapable. Executives are keenly aware of this ongoing growth already as they grapple with its year-to-year budget effects.

However, at any given point in time, I believe the more profound effects are in the 5-year time frame. In that time frame, the growth factors are roughly in the 10x – 100x range: 1.5x to 2.5x per year compounds to about 8x to 98x over five years. In that range of growth, you need to anticipate architectural change. Data growth rates outstrip even Moore’s Law, which means that some architectures can’t keep up. The growth facing data warehouse architectures is in several dimensions in addition to data volume. My next article will discuss these other aspects of requirements growth.
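The five-year figures follow from compounding the annual rates quoted above. The short sketch below tabulates that arithmetic; the intermediate 2.0x rate and the variable names are illustrative additions, not figures from the article.

```python
YEARS = 5

# Annual growth multiples; 1.5 and 2.5 are the article's range,
# 2.0 is an illustrative midpoint added here.
annual_rates = [1.5, 2.0, 2.5]

# Compound each annual rate over the five-year horizon.
five_year_factors = {rate: rate ** YEARS for rate in annual_rates}

for rate, factor in five_year_factors.items():
    print(f"{rate:.1f}x per year -> {factor:.1f}x over {YEARS} years")
```

At the low end the five-year factor comes out a little under 10x and at the high end a little under 100x, so the 10x – 100x range is best read as an order-of-magnitude estimate.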
