Organizations generate metadata so quickly that it's turning into today's big data challenge. However, the best way and location to store all this information is not always clear.
Much depends on how organizations manage and use the metadata and whether they separate it from its primary data. Many organizations move their metadata into a central repository to improve operations and better use the information. As a result, the issue of metadata storage is front and center.
What is metadata storage and why is it important?
Simply put, metadata is data about data. It can include a variety of information about a data file, depending on the type of file and the type of storage. For example, a file's metadata might include the following:
- file name;
- type of file;
- date and time it was created;
- GPS coordinates of where it was created;
- copyright information; and
- data lineage.
Although applications often generate metadata automatically, organizations can add it manually and customize it.
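As an illustration, the short Python sketch below gathers a few automatically available attributes from the file system and merges in manually supplied fields. The function name `collect_metadata`, the field names and the sample values are assumptions made for this example, not part of any particular product.

```python
import os
import tempfile
from datetime import datetime, timezone


def collect_metadata(path, custom=None):
    """Return automatically gathered file metadata merged with custom fields."""
    info = os.stat(path)
    record = {
        "file_name": os.path.basename(path),
        "file_type": os.path.splitext(path)[1].lstrip(".") or "unknown",
        "size_bytes": info.st_size,
        # Last-modified time is used here because creation time is not
        # reported consistently across operating systems.
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }
    # Manually added fields (copyright, lineage and so on) are merged last.
    record.update(custom or {})
    return record


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "report.csv")
    with open(sample, "wb") as f:
        f.write(b"a,b\n1,2\n")
    record = collect_metadata(sample, {"copyright": "Example Corp"})
    print(record)
```

Fields such as GPS coordinates or data lineage would normally come from the application that created the file or from a lineage tool, so they are passed in here as custom values rather than read from the file system.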
As data volumes have grown, so has the role that metadata plays in managing and optimizing data. Metadata makes it easier to index, find, sort and categorize data, and it helps organizations better understand that data through advanced analytics. Metadata also helps to improve data quality, optimize data management, simplify storage administration and facilitate greater productivity, all of which can lead to more efficient operations and lower costs.
To realize these benefits, however, an organization must put into place a system for effective metadata storage and management. Metadata storage must meet the needs of the larger metadata management strategy by providing a safe and efficient system for hosting the data. Without a carefully planned and implemented storage system, performance can suffer, data resources can be difficult to find, and metadata can even get lost. The storage system must ensure that the metadata is continuously available.
Best practices for metadata storage and management
For a metadata management system that relies on a central repository separate from the source data, storage teams need to consider several factors, including how they will implement and distribute the platform.
1. Don't go it alone
The team's metadata storage strategy should be part of the organization's larger metadata management strategy, which in turn should be part of the organization's larger data governance policy. Effective metadata management requires participation across the entire organization, including the team responsible for metadata storage.
The storage team should be involved in defining metadata objectives and adopting standards. In this way, the team can bring an important perspective to the discussion and get in on the ground floor.
2. Look at the big picture
Although the storage team focuses primarily on metadata storage, it should have a good grasp of the underlying infrastructure and technology that will support the metadata effort. Understand what components the organization will deploy, how those components fit together and how the metadata will move between those components before finally landing in storage.
The team needs to know these details:
- how the organization will implement the catalog;
- which database system to use;
- additional information (other than metadata) that needs storage in the catalog;
- how to deploy supporting applications;
- whether there will be an abstraction layer;
- which third-party management tools to use; and
- any other information about the various systems that could affect storage.
3. Think big and then think bigger
The storage team should have a clear sense of how the metadata management platform will scale out to meet user demand. Consider such issues as how many sites to deploy and how many storage nodes per site. Understand how the organization will distribute the metadata. Have a clear sense of what it will take to scale storage systems up or out to meet future demand.
4. Don't treat metadata as a second-class citizen
Until recently, metadata was barely on the radar for most organizations, but the onslaught of big data volumes and improvements in analytical technologies have made them recognize its value. Storage teams should therefore ensure that the storage system will meet the expected performance demands, regardless of the platform. The metadata repository plays a critical role in accessing resources, so metadata storage that does not perform well can represent a significant bottleneck in data access.
The performance stakes go even higher as organizations move from a passive metadata model to an active one. Passive metadata is relatively static. Active metadata is intelligence-driven and operates in real time, so it is constantly collecting metadata from across the network.
5. Get a handle on data requirements
The storage team needs a complete picture of the data that it will store -- not only the metadata, but also any data that supports the metadata management platform. The total amount of data is the most important piece of that picture. Provide enough capacity to support the operations carried out by the metadata platform, including metadata extraction; other extract, transform and load processes; and other supporting tools or systems that require storage space.
Object storage use is on the rise. Account for the fact that object storage metadata is highly customizable, which can add to the total amount of data. Determine whether to store the metadata as binary or text, how long to retain it, whether to archive it and how much storage analytics will require.
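A back-of-the-envelope calculation can make these requirements concrete. The sketch below is only a rough sizing aid; the function name, its parameters and the idea of folding indexes, ETL staging and analytics scratch space into a single overhead ratio are all assumptions for illustration.

```python
def estimate_capacity_bytes(object_count, avg_metadata_bytes, replicas=1,
                            retained_versions=1, overhead_ratio=0.0):
    """Rough capacity estimate for a metadata repository, in bytes."""
    # Raw metadata: one record per object for each retained version.
    raw = object_count * avg_metadata_bytes * retained_versions
    # overhead_ratio is a hypothetical allowance for indexes, ETL staging
    # and analytics working space, expressed as a fraction of raw size.
    with_overhead = raw * (1 + overhead_ratio)
    # Every replica stores a full copy.
    return round(with_overhead * replicas)


# Example: 100 million objects, 2 KiB of metadata each, two retained
# versions, 50% overhead and three replicas.
total = estimate_capacity_bytes(100_000_000, 2048, replicas=3,
                                retained_versions=2, overhead_ratio=0.5)
print(f"{total / 1e12:.2f} TB")
```

Heavily customized object metadata raises the average record size, so it is worth rerunning an estimate like this whenever the metadata schema changes.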
6. Protect the metadata like any other corporate data
Metadata can contain sensitive information and provide inroads for cyber attacks. Take the steps necessary to implement a secure storage environment and comply with applicable laws and regulations. Protect against data loss that can result from natural disasters, cyber attacks, mishandling of data or other threatening circumstances. Use tools such as replication, backups or air-gapped archives. As a side note, an organization can also use its metadata to help safeguard data and stay in compliance with applicable regulations -- if the metadata itself is current and reliable.
Other considerations and examples of metadata storage
An organization might store metadata along with the source data or in a separate location. When stored with the data, the metadata is often embedded in the same file as the primary data, in which case metadata storage considerations are much the same as those for the primary data. Sometimes metadata is stored in external files that accompany the main data files, but in this case, too, the storage considerations are much the same, except for perhaps requiring more space.
Keeping the metadata close to the data provides a simple way to deal with metadata and the storage that goes with it. The metadata stays with the primary data when it moves and can be easily read and updated. However, if the metadata is stripped from the data file or the external metadata file is removed, the advantages of either approach are lost. Neither approach enables central management across the network, which has become a growing concern as data volumes have expanded and metadata has become more valuable. The growing data volumes also make it more difficult to search for specific data when the metadata is stored with the data.
As a result of these limitations, many organizations now store their metadata in a central repository separate from the source data. A central metadata repository or catalog is typically part of a larger metadata management strategy in which the metadata is extracted from the source data and stored in the repository. A central repository makes it easier to search for specific types of data across the entire organization, regardless of the data's volume or location. This approach also streamlines management, which results in more efficient operations and more consistent metadata across the organization.
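The idea can be sketched in a few lines of Python, assuming SQLite as a stand-in for whatever database system the organization actually chooses. The table layout, the source URIs and the function names are hypothetical.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) a small central metadata catalog."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS catalog (
               source_uri TEXT PRIMARY KEY,
               file_type  TEXT,
               owner      TEXT,
               created    TEXT
           )"""
    )
    return conn


def register(conn, source_uri, file_type, owner, created):
    """Store metadata extracted from a source object, one row per object."""
    conn.execute(
        "INSERT OR REPLACE INTO catalog VALUES (?, ?, ?, ?)",
        (source_uri, file_type, owner, created),
    )
    conn.commit()


def find_by_type(conn, file_type):
    """Search the whole catalog, wherever the source data lives."""
    rows = conn.execute(
        "SELECT source_uri FROM catalog WHERE file_type = ? ORDER BY source_uri",
        (file_type,),
    )
    return [row[0] for row in rows]


conn = open_catalog()
register(conn, "nas://site-a/q1.csv", "csv", "finance", "2023-01-05")
register(conn, "s3://bucket/q2.csv", "csv", "sales", "2023-04-02")
register(conn, "nas://site-b/logo.png", "png", "marketing", "2022-11-30")
print(find_by_type(conn, "csv"))
```

Because every record sits in one place, a single query covers files held on different sites and storage systems, which is the main advantage over metadata embedded in each file.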
Separating the metadata from the data also makes it possible to deploy storage that can best accommodate metadata-specific workloads. A centralized repository can facilitate advanced analytics to derive more value from the metadata. In some cases, an organization might take a hybrid approach to metadata management, building a central repository but leaving the metadata embedded in some files.
A centralized approach comes with other challenges. If the metadata becomes out of sync with the data, the metadata could be less useful. The management system must be able to continuously sync the metadata with the source data to ensure ongoing accuracy -- a process that can affect storage resources. A metadata management system might not be able to understand the metadata in certain files, in which case the system might need to save the metadata to binary large object storage for access by a third-party tool. Even if these factors are not an issue, the storage team must still ensure it has the right storage in place to support the read-heavy workloads typical of a metadata repository.
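One simple way to spot out-of-sync entries is to compare what the repository recorded with the current state of each source file. The sketch below assumes the catalog is a plain dictionary keyed by file path that stores a size and a modification timestamp; that structure and the function name are hypothetical, and a real platform would more likely rely on change notifications or scheduled scans.

```python
import os


def find_stale_entries(catalog, stat=os.stat):
    """Return paths whose recorded size or mtime no longer matches the source."""
    stale = []
    for path, recorded in catalog.items():
        try:
            info = stat(path)
        except FileNotFoundError:
            # The source file was moved or deleted after it was cataloged.
            stale.append(path)
            continue
        current = (info.st_size, info.st_mtime_ns)
        if current != (recorded["size"], recorded["mtime_ns"]):
            stale.append(path)
    return stale
```

Each check costs one `stat` call per cataloged file, which hints at why continuous synchronization can put a noticeable load on storage resources as the catalog grows.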