Data classification is a well-established tool for protecting sensitive data and guarding against data leakage. Organizations have used it for years to safeguard data, but its potential role in backup and compliance is less widely known.
Data classification comes in many forms. Generally speaking, data classification refers to the practice of automatically identifying specific data types. This is usually done to avoid accidental data exposure or leakage.
For example, financial institutions routinely scan outbound email to ensure that messages do not contain certain types of data, such as Social Security numbers, account numbers, or tax ID numbers. Outbound messages pass through a filter that looks for specific patterns. In the case of a Social Security number, the classifier looks for nine digits with dashes after the third and fifth digits. If a message contains data that matches one of the classifier's patterns, it is automatically intercepted for review, which prevents the user from sending sensitive data through email.
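The pattern check described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular product works: the names `SSN_PATTERN` and `should_intercept` are hypothetical, and a production classifier would typically add validation and context checks to reduce false positives.

```python
import re

# Nine digits with dashes after the third and fifth digits, e.g. 123-45-6789.
# Hypothetical pattern for illustration only; it does not validate that the
# number is a real, issued Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def should_intercept(message_body: str) -> bool:
    """Return True if the outbound message contains an SSN-like string."""
    return SSN_PATTERN.search(message_body) is not None


if __name__ == "__main__":
    for body in ("Your invoice is attached.", "My SSN is 123-45-6789."):
        action = "hold for review" if should_intercept(body) else "send"
        print(f"{action}: {body}")
```

The word-boundary anchors keep the pattern from matching inside a longer run of digits, which is one simple way to cut down on accidental matches.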
How data classification applies to backup
Although data classification tools are probably best known for preventing data leakage, organizations also apply them to backups. Data classification for backup is primarily designed to help an organization maintain regulatory compliance.
Many organizations are subject to regulations and standards such as GDPR, PCI DSS, or CCPA. These and similar laws and industry standards stipulate how organizations must handle sensitive data. Not only must a covered organization take steps to prevent data leakage, but it is typically also required to ensure data privacy and to retain data for a specific length of time.
With that in mind, consider the ways in which data backups have evolved in recent years. While tape backup was once the norm, it has largely given way to cloud backup and to disk-based continuous data protection. Both of these platforms are capable of long-term data storage, but there is a direct cost associated with any stored data.
Without data classification, it is difficult for backup operators to differentiate between regulated data and non-regulated data. If an organization in a regulated industry is required to store customer data for five years, it might put in place a blanket five-year retention policy to make sure that the regulated data is retained for the required period. The problem with this approach is that the organization also retains non-regulated data for the full five years, which incurs needless storage costs.
Data classification for backup essentially does two things: First, it identifies regulated or sensitive data so that the organization can handle it appropriately. Admins can make sure, for example, that backups of sensitive data are encrypted and retained for the required period. Second, data classification for backup gives organizations the opportunity to reduce their backup storage costs by purging non-regulated data once it outlives its usefulness, rather than simply retaining it for the same period as regulated data.
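As a rough sketch of the second point, the Python below applies a different retention period to each classification instead of a blanket policy. Everything here is illustrative: `BackupItem`, `RETENTION_DAYS`, and `is_purgeable` are hypothetical names, the five-year figure mirrors the example above, and the 90-day period for non-regulated data is an assumption chosen only to show the contrast.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical retention periods in days. Five years follows the example in
# the text; 90 days for non-regulated data is an assumed value.
RETENTION_DAYS = {
    "regulated": 5 * 365,
    "non_regulated": 90,
}


@dataclass
class BackupItem:
    name: str
    classification: str  # "regulated" or "non_regulated"
    backed_up_on: date


def is_purgeable(item: BackupItem, today: date) -> bool:
    """Return True once the item has outlived its class's retention period."""
    keep_for = timedelta(days=RETENTION_DAYS[item.classification])
    return today - item.backed_up_on > keep_for


if __name__ == "__main__":
    today = date(2021, 1, 1)
    items = [
        BackupItem("customer_records.db", "regulated", date(2020, 1, 1)),
        BackupItem("marketing_drafts.zip", "non_regulated", date(2020, 1, 1)),
    ]
    for item in items:
        status = "purge" if is_purgeable(item, today) else "retain"
        print(f"{status}: {item.name}")
```

With a blanket policy, both items would be kept for five years; keyed on classification, only the regulated item is.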