How much historical data is enough?

There are many factors that impact the amount of historical data a corporation requires.

There is an old maxim about how much historical data the end user wants: end users want two more years of historical data than they currently have. If end users have no historical data, then they want the data for the past two years. If they have historical data for 3 years, then they want historical data for 5 years, and so forth. The interesting thing about this old maxim is that it is neither an exaggeration nor an underestimation. It is pretty much correct.

So let’s examine what is behind this desire for an additional two years of historical data. Another way to look at this proposition is that the end user wants to have two full cycles of data on which to do analysis. And in most corporations, a business cycle is usually measured by the year. There are peak periods in the year and slack time in the year. Having just one year’s worth of data means that the analyst can look at just one cycle of business. Analysts are always concerned that if they study just one cycle’s worth of numbers, there will somehow be a bias in that cycle. Looking at two cycles of data reduces the chances that a freak year has snuck into the analysis. So there is actually a rationale for wanting to have at least two years' worth of data.

However, some businesses do not operate on the basis of an annual cycle. Some businesses operate on the basis of very different business cycles. Consider life insurance companies. Life insurance companies examine the life of people. In order to understand the life cycle of people, it is necessary to get 90 to 100 years' worth of data. And other businesses have different life cycles as well. A real estate life cycle may be ten years long. An inflationary period may be five or more years long. A period of economic prosperity may be twenty years long.

In any case, there are plenty of circumstances where there is no annual cycle. In these cases, having two years' worth of data does not reflect the life cycle of business at all. So there are companies that will want to store a whole lot more data than two years' worth.

A big part of how much historical data a corporation needs depends on the type of user that will be using the data. There are two basic types of users – farmers and explorers. Farmers are those analysts who know what they want. They do the same type of analysis repeatedly. Typically, farmers submit many requests in a day’s time and are satisfied with only a small amount of data. The only thing that changes for a farmer is the actual content of the data that is being analyzed, not the type of analysis. Farmers are very predictable people. They often find little flakes of gold – little nuggets of wisdom. They seldom find nothing.

Explorers are people who don’t know what they want. Explorers are people who think outside the box. They are very unpredictable and they usually look at very large amounts of data. Explorers often find nothing at all. They have the attitude – “I don’t know what I want, but I will know it when I see it.” Explorers may go six months and submit no requests. Then, in the next week, the explorer may submit ten requests. When explorers find something useful, it may be spectacular. Occasionally, explorers find unexpected huge nuggets of wisdom.

Farmers traditionally do not need a great deal of history. One to two years' worth of history usually meets the needs of a farmer (depending of course on the business cycle).

Explorers, on the other hand, need a lot of historical data. Explorers do the kind of processing that occasionally needs to look at a very lengthy amount of history. Explorers look for patterns of data. And often, there simply is no pattern to be sought and studied. On other occasions, there is a pattern of interest, but that pattern only becomes apparent over a lengthy period of time. If an organization has a lot of explorers, then a great deal of historical data is needed in order to satisfy the curiosity of the explorer.

Once the organization has gathered its historical data, it makes sense to periodically monitor the usage of the data. It is absolutely normal for current and very current data to be used frequently. However, the older the data becomes, the less frequently the data is needed. This is true even when there are explorers in the mix of analysts.

When historical data reaches an age where there is very infrequent access of the data, the data can be removed from the system. By removing unused historical data and by placing that data in a remote part of the environment, performance is enhanced. In addition, the cost of the environment is lowered.
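One way to picture this kind of usage monitoring is the minimal sketch below. It is only an illustration, not a prescribed method. It assumes a hypothetical access log of (access date, data date) pairs, and the function names and the access threshold are made up for the example. The sketch counts accesses by the age of the data in years, then reports the age beyond which every bucket falls below the threshold.

```python
from collections import Counter
from datetime import date


def accesses_by_age(access_log):
    """Count accesses per whole year of data age at the time of access.

    access_log is a hypothetical format: an iterable of
    (access_date, data_date) pairs.
    """
    counts = Counter()
    for access_date, data_date in access_log:
        # Approximate a year as 365 days; good enough for coarse buckets.
        age_years = (access_date - data_date).days // 365
        counts[age_years] += 1
    return counts


def archive_cutoff(counts, min_accesses):
    """Return the youngest age (in years) such that that age and every
    older age has fewer than min_accesses accesses, or None if even the
    oldest data is still accessed at least min_accesses times."""
    if not counts:
        return None
    cutoff = None
    # Walk from the oldest age bucket toward the newest.
    for age in range(max(counts), -1, -1):
        if counts.get(age, 0) >= min_accesses:
            break
        cutoff = age
    return cutoff


# Illustrative log: recent data is accessed often, old data rarely.
today = date(2024, 6, 1)
log = [
    (today, date(2024, 5, 1)),
    (today, date(2024, 2, 1)),
    (today, date(2024, 1, 10)),
    (today, date(2023, 3, 1)),
    (today, date(2022, 12, 1)),
    (today, date(2020, 1, 1)),
]
counts = accesses_by_age(log)
cutoff = archive_cutoff(counts, min_accesses=2)
print(dict(counts), cutoff)
```

In this made-up log, data two or more years old is touched fewer than twice, so that is the data a sketch like this would flag as a candidate for the remote part of the environment. The right threshold is a judgment call that depends on the business cycle and on how many explorers are in the mix of analysts.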

There are, then, many considerations in deciding how much historical data the organization should plan on keeping.

Bill is universally recognized as the father of the data warehouse. He has more than 36 years of database technology management experience and data warehouse design expertise. He has published more than 40 books and 1,000 articles on data warehousing and data management, and his books have been translated into nine languages. He is known globally for his data warehouse development seminars and has been a keynote speaker for many major computing associations. Bill can be reached at 303-681-6772.