In this era of digital transformation, organizations of all sorts are becoming data-centric information managers. Technologies such as AI, IoT, 5G and edge computing are creating data at unprecedented rates. That data is being used to deliver more insights, customized services, products and innovation. Meanwhile, stricter privacy laws and regulations on personally identifiable information -- with harsh financial penalties for noncompliance -- are complicating the situation.
For organizations that want the value they derive from data to exceed the cost of storing and using it, effective data management and storage are more important than ever. Abstracting data management from storage systems so that it runs on its own is one approach to better data management.
Data management systems today
Data management means different things depending on the vendor. It has traditionally been defined as ingesting, storing, organizing and maintaining the data an organization creates. That definition is outdated: it was adequate for legacy storage systems, but it falls short for modern ones.
Data management today means considerably more, including:
- classifying data;
- aggregating, harvesting and parsing the data's metadata;
- protecting data and metadata from natural and human-caused outages;
- moving data locally and geographically for sharing, archiving, replication, data protection, storage system technical refresh and migrations, and access for the required analytics engines that do deeper dives on that data;
- maintaining transparent user and application access to data after one or more moves;
- providing user-definable policies that automatically move, replicate, copy and delete data;
- deploying AI and machine learning to optimize and automate most data management functions;
- searching data and providing actionable information and insights;
- keeping the data compliant with personally identifiable information laws and regulations; and
- scaling data management to the hundreds of petabytes and even exabytes of rapidly expanding data.
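As one illustration of the user-definable policies in the list above, here is a minimal Python sketch that moves files untouched for a configurable number of days from a hot tier to an archive tier. The class name and threshold are invented for illustration; commercial products express such policies declaratively and increasingly drive them with AI and machine learning rather than a simple age check.

```python
import shutil
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TieringPolicy:
    """A user-definable policy (illustrative only): move files whose last
    access is older than `max_idle_days` from the hot tier to the archive
    tier, preserving the directory layout."""
    hot_tier: Path
    archive_tier: Path
    max_idle_days: float

    def apply(self) -> list[Path]:
        moved = []
        cutoff = time.time() - self.max_idle_days * 86400
        for path in self.hot_tier.rglob("*"):
            if path.is_file() and path.stat().st_atime < cutoff:
                dest = self.archive_tier / path.relative_to(self.hot_tier)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(path), str(dest))
                moved.append(dest)
        return moved
```

A real policy engine would also handle replication, copy and deletion rules, but the shape is the same: a declarative condition plus an automated action.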
Data management and storage challenges
That's quite a bit for a data management system to do -- and do well. Remember, the most important responsibilities of a data management and storage system are still ingesting, storing, organizing and maintaining the data. All those other data management capabilities are resource-intensive and interfere with the system's primary responsibilities.
Most storage systems don't work well with other storage systems -- that's not news to storage admins. Many have problems working with cloud storage, too. Few -- and that's being generous -- actively work with tape systems.
Multivendor heterogeneous storage is a bigger problem. Storage vendors rarely work seamlessly with one another, which is why storage system-centric data management tends to focus on a single vendor. This approach bypasses the multivendor problem while locking in users to that specific vendor's data management and storage products.
Another data management storage issue is the complicated data management licensing structure. There's the data management software licensing, of course, but it doesn't stop there. There are typically other licensing fees, such as a capacity license fee on the data the storage system moves to cloud storage or other storage systems. Then, there's the cloud storage capacity license fee and potentially egress fees for data access. In addition, when users and applications access moved data, it often must be rehydrated back to the originating storage system. Data movement takes time, adding substantial latency to each access request. It makes more sense to access the data where it resides.
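To see how these fees stack up, here's a back-of-the-envelope sketch with hypothetical prices -- none of these figures come from a specific vendor -- showing how egress charges for rehydrated data can dwarf the storage cost itself.

```python
# Hypothetical prices for illustration only: an archive cloud tier at
# $0.004/GB-month, egress at $0.09/GB, and a data management capacity
# license at $0.01/GB-month on moved data.
ARCHIVE_PER_GB_MONTH = 0.004
EGRESS_PER_GB = 0.09
MGMT_LICENSE_PER_GB_MONTH = 0.01


def monthly_cost(gb_archived: float, gb_rehydrated: float) -> float:
    """Monthly cost of keeping data in a cloud archive, including the
    egress charged each time archived data is rehydrated for access."""
    storage = gb_archived * ARCHIVE_PER_GB_MONTH
    license_fee = gb_archived * MGMT_LICENSE_PER_GB_MONTH
    egress = gb_rehydrated * EGRESS_PER_GB
    return storage + license_fee + egress
```

Under these assumed rates, rehydrating just 10% of a 100 TB archive in a month adds more in egress than the entire monthly storage and license bill -- which is the arithmetic behind accessing data where it resides.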
One storage-centric approach to this problem is to put all an organization's data in a single scale-out, all-encompassing storage system, historically referred to as a "god box." This system would have all the storage performance and cost tiers, data protection, archiving and so on, along with all data management.
Even if that storage system could meet every performance requirement, scale every tier to meet hundreds of petabytes or exabytes of data, and do everything data management needs to do today, there are other intractable problems. The data management software would still be a heavy draw on the storage controllers, negatively affecting performance. More importantly, data still must be moved or migrated from wherever it currently resides to this system. And it fails to solve multiorganizational data-sharing problems.
Separating data management from storage
These issues have led to a new approach where data management is abstracted from storage systems. The data management software runs on its own server hardware. It sits out of band, in band or a combination of the two.
Abstracted data management captures the data and metadata in one of three ways:
- It mounts all storage systems with administrative privileges. Dell EMC ClarityNow, Hammerspace, iRODS (open source), Komprise, Spectra Logic StorCycle, Starfish Storage and StrongBox Data Solutions StrongLink take this approach.
- It sits in front of the storage, looking and acting like a high-speed network switch -- think InfiniteIO here.
- It uses clients, or agents, as Aparavi does.
Most of these systems have some level of AI or machine learning built into the software to optimize operations. Each methodology and vendor has its own pros and cons that will be covered in a future article.
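To make the out-of-band mounting approach concrete, here's a minimal Python sketch of a harvester that walks a mounted storage system and aggregates per-file metadata into an index. This is illustrative only, not any vendor's implementation; real products also parse application-specific metadata and feed the index to search and policy engines.

```python
import os
from pathlib import Path


def harvest_metadata(mount_point: Path) -> dict:
    """Walk a mounted storage system and build a metadata index keyed by
    path relative to the mount point -- the out-of-band approach reduced
    to its essentials."""
    index = {}
    for root, _dirs, files in os.walk(mount_point):
        for name in files:
            p = Path(root) / name
            st = p.stat()
            index[str(p.relative_to(mount_point))] = {
                "size": st.st_size,
                "mtime": st.st_mtime,
                "atime": st.st_atime,
            }
    return index
```

Because the index lives outside the storage system, queries, policies and migrations can run against the metadata without touching the storage controllers -- which is also why metadata harvesting makes large migrations so fast.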
These abstracted data management systems have an outsized positive impact on IT organizations. They commoditize the storage system, reducing both the amount and the cost of storage for each tier. They do that by right-sizing the data to the proper tier and eliminating vendor lock-in.
Simplified data management and storage operations
Storage system compatibility is no longer an issue. Storage systems are merely the containers in which the data resides based on their locality, performance and cost characteristics. These abstracted data management systems also simplify operations in several ways:
- They streamline data protection via automatic data archiving. Archived data doesn't need to be recovered during an outage, and less data to recover accelerates recovery time objectives.
- They facilitate data access with a global namespace. It doesn't matter whether the data has been moved to another storage system, cloud storage or tape; users and applications still see and have direct access to their data, typically without rehydration.
- They make storage system technical refresh easy and a nondisruptive online process.
- They provide data copies for DevOps, DevSecOps and test/dev.
- They migrate petabytes and even exabytes of data with eye-popping speed. (One organization was able to migrate 12 petabytes of data from one tape system to another in an hour because of the metadata harvesting.)
- They let organizations share big data with one another -- something that previously wasn't possible.
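The transparent-access point above can be illustrated with a toy stub scheme in Python: when data moves, a tiny stub records the new location, and reads follow the stub instead of rehydrating. The stub format and `.stub` suffix are invented for illustration; real products use their own mechanisms, such as symlinks, reparse points or namespace metadata.

```python
import json
import shutil
from pathlib import Path

STUB_SUFFIX = ".stub"  # hypothetical marker, for illustration only


def move_with_stub(src: Path, dest: Path) -> None:
    """Move a file to another tier, leaving behind a tiny stub that
    records where the data went."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    Path(str(src) + STUB_SUFFIX).write_text(json.dumps({"moved_to": str(dest)}))


def read_through(path: Path) -> bytes:
    """Resolve a path transparently: if only a stub remains, follow it to
    the data's current location instead of rehydrating."""
    if path.exists():
        return path.read_bytes()
    stub = Path(str(path) + STUB_SUFFIX)
    target = Path(json.loads(stub.read_text())["moved_to"])
    return target.read_bytes()
```

The same resolution step is what lets users keep their original paths after an archive, a tier move or even a tape migration.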
Scaling and licensing requirements
Each of these data management systems scales differently. Some are designed to scale to hundreds of petabytes and exabytes. Others scale from terabytes to dozens of petabytes. It depends on their architecture, and most are, by design, storage-agnostic.
Vendors have different licensing requirements. Some license by terabyte of capacity managed. Others vary that capacity licensing by hot and cold data. Still others license according to the number of servers and server cores required to run their software at the performance level an organization requires.
What does external abstracted data management mean? It means IT organizations can choose storage based on cost and performance, not data management functions. It means storage vendors will no longer have an incumbent advantage. It means simplified IT operations and lower costs. When it comes to data management and storage, all that bodes well for the future.