This article originally appeared on the BeyeNETWORK
If the enterprise information management function is to succeed where the data administration function did not, then it must address the issue of the enterprise data landscape. One of the characteristics of data administration was that it lived and moved and had its being in the logical data layer. This is the realm of logical data models, and there is much value in logical data models. However, legacy data administrators saw the world as divided between logical and physical, and they were never going to go to the physical side. An attitude of “we design databases, but what the users put in them is their business” was the rule, and very often it was openly expressed.
Inconveniently, data actually exists in physically implemented databases, and business users actually care about the stuff. These business users were never impressed by data administrators who had an attitude to physical data resembling that of pre-revolutionary French aristocrats to their uncouth peasantry. Those days are now over and physical data is beginning to be addressed. However, as data professionals engage more openly with the need to manage an enterprise’s physical data assets, they face a lack of any mature methodological or conceptual frameworks. This is hardly surprising given two decades or so of neglect of, if not downright revulsion at, the physical data layer. Areas like data governance and master data management are attempting to address physical data, but they are broad and not yet very mature.
Perhaps a more productive way to deal with the problem domain of physical data is to decompose it into its parts and examine them one at a time. A particularly interesting concept that has recently emerged is the notion of the physical data landscape and the necessity of mapping it.
The Data Landscape
One of the principles introduced in the Manifesto for Enterprise Information Knowledge Management is that knowledge workers should be able to know what data an enterprise manages. Another principle is that knowledge workers should know where the data is stored. This second principle implies that the data is located somewhere. A set of physical locations, and their interrelationships, is a good candidate for presentation in the form of a map. If the physical data can be mapped in this way, then its totality, since it is not bounded by political frontiers or geomorphological features, is best described as a landscape. This description is also fitting because individual IT personnel and knowledge workers typically cannot see out very far across this landscape. They can usually see data out to some horizon that is very near to them, as on a small planet with a small radius of curvature. The term “landscape” is thus appropriately neutral and very apt in this context.
The data landscape is the totality of the enterprise’s physical data assets. It is not the logical view of how the enterprise sees, or should see, its data in terms of its business, a view that is, in theory, captured in logical data models. The idea of being able to look at a map of the data landscape and just see what is out there, and then be able to zoom in on an area of interest, is incredibly powerful. It correlates to the way we now think of our own planet. The first pictures of the Earth as a distant orb rising over the moon, taken by Apollo 8 in 1968, introduced this revolution. Satellite imagery grew more commonplace in the following decades, and by now, most of us have been captivated by Google Earth. The notion of virtually hovering over the terrestrial landscape and being able to see it broadly from a high level, and then to be able to magnify it to any level of detail, is almost an expectation today. It is exactly the kind of approach we need to work with the enterprise’s physical data assets.
How Do We Map the Data Landscape?
Unfortunately, while the idea of a data landscape is very powerful, we do not yet have a good way to represent it in diagrammatic form. To do that requires somebody to think about how to present the constructs that exist within the landscape in a graphical manner. Such a representation, if it is ever achieved, is likely to be rich and complex. I well remember the long map-reading classes I went through at school, and the satisfaction that came with understanding how to do it well. That satisfaction was repeated when I was able to read and understand logical data models. Since we have not reached that point for the data landscape, the next best thing is to treat the components of the landscape as structured metadata items stored in a repository.
If representing the data landscape is one problem, figuring out how to map it is another, and much larger, problem. Documentation is rarely available for such an exercise. Of course, “documentation” is nearly always done to be in compliance with some kind of directive. In my experience, it is never kept up to date, and it is never trusted. If it is used at all, it is as a kind of starting point. An additional problem is that documentation is frequently put into directories that are structured like the labyrinthine tombs of some of the Egyptian pharaohs. Once it is in this final resting place, it often lies undiscovered and undisturbed for a very long time.
Data models should in theory be helpful, but if they are logical data models, they are of little help in understanding physically instantiated databases. Also, there is more to know about a physical database than just its data model. The platform, the server, the IP address, and a host of additional metadata items are required. Sometimes when a physical database is found, an attempt is made to reverse-engineer its data model. The results are rarely satisfactory. It is usually difficult to understand what the column names mean with any degree of certainty, and relationships are often missing altogether.
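One way to picture the structured metadata items that a repository of the data landscape might hold is as a simple record per physical database. The sketch below is only illustrative: the class names, field names, and sample values are all assumptions made for this example, not a schema defined in this article or by any product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnEntry:
    """One physical column as found in a database (hypothetical fields)."""
    name: str
    data_type: str
    declared_key: bool = False  # whether the database itself declares it a key


@dataclass
class TableEntry:
    """One physical table and the columns discovered in it."""
    name: str
    columns: List[ColumnEntry] = field(default_factory=list)


@dataclass
class DatabaseEntry:
    """A physical database plus location metadata beyond its data model."""
    name: str
    platform: str
    server: str
    ip_address: str
    tables: List[TableEntry] = field(default_factory=list)

    def column_count(self) -> int:
        """Total number of columns across all tables in this database."""
        return sum(len(t.columns) for t in self.tables)


# Illustrative values only; none of these come from a real system.
orders_db = DatabaseEntry(
    name="orders",
    platform="example-dbms",
    server="srv-01",
    ip_address="192.0.2.10",
    tables=[
        TableEntry("customer", [ColumnEntry("cust_id", "INTEGER"),
                                ColumnEntry("cust_nm", "VARCHAR")]),
    ],
)
print(orders_db.column_count())
```

Keeping the platform, server, and address alongside the tables and columns reflects the point that a physical database is more than its data model.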
The Scale of the Landscape
At this point, most data professionals will attempt to map the landscape by forward human analysis. Individuals are given the tasks of looking at tables and columns in particular databases and figuring out what it all means. The analysts may use SQL to query the data values, and may even have data profiling tools to help them in their work. However, such efforts are doomed from the start if they go beyond just a handful of columns.
Imagine an organization with 50 databases, each with an average of 100 tables. This is not a big organization, and 100 tables per database is not huge. If each table has an average of 10 columns, we now have 50 x 100 x 10 = 50,000 columns spread out across the data landscape. No organization has enough analysts to crawl through such a landscape and map it out. Not only are there too many columns, but the tasks required on a per-column basis are too complex and too large.
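The arithmetic can be taken one step further to show how quickly the work grows. The sketch below reproduces the 50,000-column figure and then counts the ordered pairs of columns that would exist if every column had to be compared with every other column. The pairwise count is an assumption added here for illustration; it is not part of the estimate above.

```python
databases = 50
tables_per_database = 100
columns_per_table = 10

total_columns = databases * tables_per_database * columns_per_table

# Ordered pairs (A, B) with A != B, i.e. "is column B related to column A?"
candidate_pairs = total_columns * (total_columns - 1)

print(f"{total_columns:,} columns")
print(f"{candidate_pairs:,} ordered column pairs")
```

With these inputs the pair count comes to roughly 2.5 billion, which is far beyond what a team of analysts could examine by hand.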
For instance, a column could be considered a key, or candidate key, if every value in it is unique. However, it may not be declared as a key in the database table. I have seen plenty of examples of this. Suppose we are also dealing with a database that has no declared foreign keys, something very common in my experience. If there is another table with a column that is actually a foreign key referencing the column that functions as a key in the first table, how will we ever find it? A human analyst will have to make guesses and then test them. The testing will involve examining the contents of the two columns in question to see how closely related they are.
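The two checks described here, uniqueness of a possible key and overlap between two columns, can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module and an in-memory database. The table names, column names, and data are invented, the function names are hypothetical, and treating NULLs as disqualifying a candidate key is an added assumption rather than something stated above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER, cust_nm TEXT);
    CREATE TABLE orders (ord_no INTEGER, cust_ref INTEGER);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3), (13, 9);
""")
# Neither table declares a primary key or a foreign key.


def is_candidate_key(conn, table, column):
    """True if the column has no NULLs and every value in it is unique."""
    # Identifiers are interpolated directly; use only with trusted names.
    total, non_null, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return total > 0 and total == non_null == distinct


def distinct_values(conn, table, column):
    """Set of distinct non-NULL values in a column."""
    rows = conn.execute(
        f"SELECT DISTINCT {column} FROM {table} WHERE {column} IS NOT NULL"
    )
    return {row[0] for row in rows}


def inclusion_ratio(conn, child_table, child_col, parent_table, parent_col):
    """Fraction of distinct child values that also occur in the parent column."""
    child = distinct_values(conn, child_table, child_col)
    parent = distinct_values(conn, parent_table, parent_col)
    return len(child & parent) / len(child) if child else 0.0


print(is_candidate_key(conn, "customer", "cust_id"))
print(inclusion_ratio(conn, "orders", "cust_ref", "customer", "cust_id"))
```

A ratio close to 1.0 suggests, but does not prove, that the second column holds foreign keys into the first. In the sample data one order refers to a customer that does not exist, so the ratio falls short of 1.0, which is the kind of ambiguity an analyst then has to judge.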
The Way Forward
Not only is such an approach impractical, it is also unreliable because human analysts may not have the time or capacity to think about all the pattern matching they need to perform. It is also unsustainable. Suppose a team of human analysts were able to map out a tiny portion of the data landscape. As they moved on to the next area, there would be a good chance that the ground they had just covered would change. The more time that passes, the greater the doubt about the current accuracy of what they have mapped.
The answer to this is that mapping the data landscape cannot be done by forward human analysis. It needs tools. These tools are now appearing in the marketplace, although the concepts behind them are poorly understood and the discipline of enterprise information management is not mature enough to utilize them fully. Nevertheless, the direction is clear. The enterprises that understand their data landscapes will be able to manage their data orders of magnitude more effectively than those that do not. The data landscape is a new paradigm of extraordinary power, and it is coming.