Unstructured data is the fastest-growing category of data. It's increasing at a compound annual growth rate of 61%, according to IDC, and will account for 80% of worldwide data by 2025. For many large IT organizations, it passed that mark a while ago.
Unstructured data growth is no longer being driven by the usual suspects -- documents, spreadsheets, presentations, photos, videos and audio. The impetus behind its growth today is sources such as logs, IoT devices, social media, CCTV, sensors, metadata and even search engine queries.
Dragon Slayer Consulting's own survey revealed that most unstructured data in the enterprise is cool data (more than 30 days old and infrequently accessed) or cold data (more than 90 days old and rarely accessed). And yet, there it sits on expensive primary storage, constantly consuming budget.
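Those age thresholds lend themselves to a simple first-pass classification. The sketch below is a minimal illustration, not any vendor's method: it buckets a file by its last-access time using the 30- and 90-day cutoffs from the survey (the bucket names are assumptions).

```python
from datetime import datetime, timedelta

# Thresholds taken from the survey's definitions:
# cool = last accessed more than 30 days ago, cold = more than 90.
COOL_DAYS = 30
COLD_DAYS = 90

def classify_by_age(last_access: datetime, now: datetime) -> str:
    """Bucket a file as hot, cool or cold from its last-access time."""
    age = now - last_access
    if age > timedelta(days=COLD_DAYS):
        return "cold"
    if age > timedelta(days=COOL_DAYS):
        return "cool"
    return "hot"
```

In practice, the last-access time would come from filesystem metadata (e.g., `atime`), which is exactly the kind of attribute the management products discussed later harvest at scale.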
The challenge with managing unstructured data is how to do it cost-effectively. Unstructured data isn't easily classified or indexed, nor is it easily stored in traditional databases. Additionally, it typically doesn't originate in databases equipped to analyze it, such as JSON, key-value and XML databases. That means the data must be extracted, transformed and loaded into a useful database. It's a labor-intensive, time-consuming and error-prone process that requires scripts or an outside service provider. Moving data around can also create multiple copies of it, meaning more storage, rack space, switch ports, software licenses, power, cooling, cables, transceivers, allocated overhead and administrators. That doesn't make financial sense.
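To make the extract-transform-load burden concrete, here is a minimal sketch of the kind of script that step requires: it parses unstructured log lines into structured records a JSON or key-value database could load. The log format and field names are illustrative assumptions, not a real product's schema.

```python
import json
import re

# Assumed log format: "<timestamp> <level> <message>" -- one of many
# ad hoc formats such scripts must handle.
LOG_PATTERN = re.compile(r"(?P<ts>\S+) (?P<level>\w+) (?P<msg>.*)")

def extract_transform(lines):
    """Extract fields from free-form log lines into key-value records."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:  # lines that don't match the pattern are silently dropped
            records.append(m.groupdict())
    return records

def load(records):
    """Stand-in for the load step: serialize records for a JSON store."""
    return [json.dumps(r) for r in records]
```

Even this toy version shows why the process is error-prone: every new source format means another pattern, and unmatched lines quietly disappear unless someone notices.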
Managing unstructured data -- or not
The common approach to unstructured data management is simply to not manage it at all. Many IT shops opt to add capacity to their primary storage systems rather than classify, manage, analyze or even archive unstructured data. They figure the data's there if it's ever needed, though it may be difficult to find. The problem with this method is that it's financially unsustainable for several reasons.
The first reason is that data consumes capacity -- often, primary storage capacity. And, once consumed, that capacity isn't available for other data. Primary storage is the most expensive storage, usually consisting of some type of flash SSD media. Storage system software -- and many other kinds, such as backup and replication software -- is licensed or subscribed to based on capacity, increasing the cost of the unstructured data even when it's never accessed.
All storage systems must be refreshed every three to five years. When a system is upgraded, the new system must include capacity for all existing unstructured data, as well as any that will be stored over the new system's life, adding more infrastructure and costs. In addition, the data must be migrated from the old to the new storage system. That takes time, effort and software or scripting. And it's not just primary storage being consumed. Secondary storage is needed, too, because all that stored unstructured data must be backed up. Besides the cost of backing up unstructured data, a bigger cost can be recovering data from an outage. The time it takes to restore cool and cold data can delay getting systems back up and running, adding even more costs to this outdated process.
Another reason why keeping unstructured data on primary storage creates a problem is global privacy laws and regulations such as the California Consumer Privacy Act, the European Union's GDPR, Japan's Act on Protection of Personal Information and Thailand's Personal Data Protection Act. Compliance isn't optional, and there are significant financial consequences for failing to comply. That means IT organizations must know whether there's personally identifiable information (PII) in the unstructured data they're keeping around and what it is.
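As a rough illustration of what "knowing whether there's PII" entails, the sketch below does a naive pattern-based scan for two common PII types. Real compliance tooling is far more sophisticated; the patterns here (email address, US-style Social Security number) are assumptions chosen only for illustration.

```python
import re

# Naive, illustrative patterns -- a real scanner would cover many
# more categories and validate matches in context.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Return each PII category found in the text and its matches."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

Running even a crude scan like this across cool and cold data is what makes content-level analysis, not just metadata analysis, part of the compliance picture.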
Unstructured data management tools
The key to managing unstructured data to optimize performance and lower costs is capturing, harvesting, parsing and analyzing the metadata. In some cases, such as PII, that means analyzing the content itself. Several companies have products and services aimed at managing unstructured data and its costs. These products include Aparavi, InfiniteIO, the open source iRODS, Komprise, Spectra Logic StorCycle, Starfish Storage and StrongBox Data Solutions StrongLink.
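To show what "capturing and harvesting metadata" means at the filesystem level, here is a minimal sketch -- not any vendor's actual method -- that walks a directory tree and records the per-file attributes such products index and analyze.

```python
from pathlib import Path

def harvest_metadata(root: str) -> list:
    """Walk a directory tree and capture per-file metadata --
    the raw material an unstructured data management product
    would index, classify and act on."""
    index = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            index.append({
                "path": str(path),
                "size_bytes": st.st_size,
                "modified": st.st_mtime,   # last modification time
                "accessed": st.st_atime,   # last access, if the mount tracks it
                "extension": path.suffix.lower(),
            })
    return index
```

A commercial product does this continuously, at petabyte scale, across file and object protocols -- but the harvested attributes (age, size, type, location) are the same inputs that drive its tiering and retention policies.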
When unstructured data management is done correctly, everything changes for the better. Data is deleted from costly primary storage or moved and archived to more cost-effective secondary, cloud or tape storage. The data management software determines where to move it based on the data's characteristics and performance requirements. Access is maintained through client software, symbolic links, a global namespace or a combination of those.
These intelligent and autonomous data management systems have different ways of accessing and classifying unstructured data. They either mount the file or object storage with administrative privileges (iRODS, Komprise, Spectra Logic, Starfish, StrongBox), sit in the data path looking like a switch (InfiniteIO) or run on the computational systems themselves (Aparavi). In each case, the software captures metadata, classifies content, and copies, moves, archives and deletes data. This reduces the capacity consumed on primary storage and the backup or replicated data on secondary storage.
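One of the access-preservation strategies named above -- moving a file to cheaper storage while leaving a symbolic link at its original path -- can be sketched in a few lines. This is a simplification of what the commercial products do, offered only to show the mechanism.

```python
import shutil
from pathlib import Path

def tier_out(file_path: str, archive_root: str) -> None:
    """Move a file to lower-cost storage and leave a symbolic link
    behind so applications keep using the original path."""
    src = Path(file_path)
    dest = Path(archive_root) / src.name
    shutil.move(str(src), str(dest))   # relocate to secondary storage
    src.symlink_to(dest)               # original path still resolves
```

The payoff is transparency: users and applications open the same path they always did, while the bytes now live on capacity that costs a fraction of primary flash.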
How to pick an unstructured data management system
After data is moved from costly primary storage to lower-cost storage, it can often be accessed without rehydrating it back onto the original storage. This is huge. A good management system classifies the data, enables policy-based movement and storage, and commoditizes the storage systems underneath.
Picking the best intelligent or autonomous unstructured data management system for a situation requires knowledge and research. You'll want to answer the following five questions about your requirements and the products you're looking at:
- How much data will be moved or migrated upfront and over time?
- Do you require both metadata and data indexing?
- What levels of scalability and performance are required? Will you need a system that scales into exabytes or will one that goes into low petabytes be sufficient?
- How automated, simple and intuitive do you want the management system to be?
- And, finally, how is each system licensed or subscribed to? Most charge per terabyte, though one licenses by the number of cores in the physical or virtual machines where the software runs. This matters for total cost of ownership.
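The licensing question in particular can be settled with back-of-the-envelope arithmetic. In the sketch below, every price and count is a made-up assumption used only to show how the per-terabyte and per-core models diverge as capacity grows.

```python
# Hypothetical numbers -- substitute real quotes and your own
# capacity forecast before drawing any conclusions.

def annual_cost_per_tb(capacity_tb: float, price_per_tb: float) -> float:
    """Per-terabyte model: cost tracks managed capacity."""
    return capacity_tb * price_per_tb

def annual_cost_per_core(cores: int, price_per_core: float) -> float:
    """Per-core model: cost tracks the compute running the software."""
    return cores * price_per_core

# Example: 500 TB at an assumed $20/TB vs. 32 cores at an assumed $400/core.
tb_model = annual_cost_per_tb(500, 20.0)
core_model = annual_cost_per_core(32, 400.0)
```

Note the structural difference: the per-terabyte bill grows every year with your data, while the per-core bill grows only when you add compute -- which is why the same product can be the cheaper or the more expensive choice depending on your growth curve.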
Done right, the total cost of managing unstructured data should be less than the previous approach of not managing it at all.