Data analytics pipeline best practices: Data classification
Data analytics pipelines collect a variety of data categories requiring efficient data organization. These data classification best practices can help improve pipeline performance.
Classified and declassified data is a familiar concept in the context of security agencies such as the NSA, but businesses rely on a broader kind of data classification, one that factors into the success of a data analytics pipeline.
There is a hierarchy of data classification levels, depending on sensitivity, that determines who can access what. Some classifications are required by law, for instance when dealing with employee personal information.
Independent of the legal and security aspects, there are many reasons why a company would want to create a data taxonomy. This article discusses the various types of data categories, with a focus on best practices and how to automate this process.
Typically, company data breaks down into the following categories: public, internal, restricted and confidential. Public data can be shared freely outside the company. Internal data is available to employees with access. It includes internal email and communications, lists of employees and internal reports (financial, sales, vendor list, etc.).
Confidential data includes M&A documents, information protected by non-disclosure agreements and sensitive personal information protected by law (HIPAA, GDPR) such as personal medical or financial records, Social Security numbers, personal addresses and so on. Restricted data is critical to the company's survival -- leaks or lack of appropriate protection could lead to hijacking or criminal charges.
The status assigned to specific data depends on the context (metadata, source, format and timestamp) and the content. The format includes Excel, video, PDF and raw text. Organizations can share restricted data with select employees after proper encryption. While the original data is restricted, the encrypted version is internal.
For instance, a shared set of credit card transactions might include the location of user and merchant, merchant category, date, item purchased, item category, card issuer (the bank), dollar amount, type of transaction (online or point of sale) and status (failed or accepted). However, the cardholder names are absent and the credit card numbers are encrypted.
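As a rough illustration of the credit card example above, the sketch below derives a shareable copy of a transaction record by dropping the cardholder name and replacing the card number with a keyed hash. Note the substitution: this is tokenization with HMAC-SHA256 rather than reversible encryption, chosen here only because it needs nothing beyond the Python standard library. The field names, the `SECRET_KEY` value and the `to_internal_record` function are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a key management system.
SECRET_KEY = b"replace-with-a-managed-secret"

# Assumed field names. Dropped fields identify the cardholder directly.
DROPPED_FIELDS = {"cardholder_name"}
# Tokenized fields are replaced by a keyed hash so records for the same
# card can still be grouped without exposing the number itself.
TOKENIZED_FIELDS = {"card_number"}


def to_internal_record(record: dict) -> dict:
    """Return a copy of a restricted record with identifying fields removed or tokenized."""
    shared = {}
    for field, value in record.items():
        if field in DROPPED_FIELDS:
            continue
        if field in TOKENIZED_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
            shared[field] = digest.hexdigest()
        else:
            shared[field] = value
    return shared
```

Because the hash is keyed and deterministic, the same card always maps to the same token, so analysts can still count transactions per card. Whether the resulting copy really qualifies as internal remains a policy decision, not something the code can settle.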
Frequently, a data category applies to particular fields rather than to the data set as a whole. It also depends on the level of aggregation. Summaries may or must be public -- such as quarterly Wall Street reports sent to analysts -- while the granular data (full list of clients ranked by sales volume with contact information and buying history) is internal or restricted.
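One way to picture field-level classification is to label each field and treat a data set as being as sensitive as its most sensitive field. The sketch below assumes an ordering of the four categories (public, then internal, then confidential, then restricted) that the article does not spell out, and the field names and their labels are invented for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    # Assumed ordering, from least to most sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Hypothetical field-to-level mapping.
FIELD_LEVELS = {
    "quarterly_revenue": Level.PUBLIC,
    "client_name": Level.INTERNAL,
    "contact_email": Level.CONFIDENTIAL,
    "buying_history": Level.RESTRICTED,
}


def dataset_level(fields, unknown=Level.RESTRICTED):
    """Return the most restrictive level among the given fields.

    Fields missing from FIELD_LEVELS get the `unknown` level, so an
    unlabeled field is treated as restricted rather than public.
    An empty field list yields PUBLIC.
    """
    return max((FIELD_LEVELS.get(name, unknown) for name in fields),
               default=Level.PUBLIC)
```

Under these assumptions, a quarterly summary containing only `quarterly_revenue` stays public, while adding `contact_email` raises the whole extract to confidential.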
Even when data is restricted, government agencies such as the SEC or the IRS, or a potential acquirer, may still want to access part of it. The fight between Facebook and the Justice Department over requests for personal information in criminal probes is an example of a potential issue. Organizations should address it well before the case presents itself.
How to automate data categorization
Data categorization was traditionally performed manually, typically by the IT, finance or legal departments. Given the increasing volume of documents requiring storage, modern approaches involve automation, at least to some extent.
One way to do this is to automatically detect sensitive fields, such as email addresses, credit card numbers, Social Security numbers and dates of birth, especially when a document contains many of these elements. Natural language processing (NLP) can categorize documents -- structuring unstructured data -- to automatically assign a particular label to a document.
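A minimal sketch of sensitive-field detection might rely on regular expressions, as below. The patterns are simplified assumptions (U.S.-style Social Security numbers, 13- to 19-digit card numbers), and the Luhn checksum is added here to cut down on random digit runs being flagged as card numbers. The names `PATTERNS`, `find_sensitive` and `looks_sensitive` are illustrative, as is the `min_kinds` cutoff.

```python
import re

# Simplified, assumed patterns; production rules would need to be broader.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    "card": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum, ignoring any non-digit separators."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_sensitive(text: str) -> dict:
    """Return the kinds of sensitive values found in text, with their matches."""
    hits = {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}
    hits["card"] = [match for match in hits["card"] if luhn_valid(match)]
    return {kind: found for kind, found in hits.items() if found}


def looks_sensitive(text: str, min_kinds: int = 2) -> bool:
    """Flag text that contains at least `min_kinds` different kinds of sensitive value."""
    return len(find_sensitive(text)) >= min_kinds
```

Requiring several kinds of match before flagging a document reflects the point that a document containing many of these elements is the stronger signal. A stricter policy could set `min_kinds` to 1.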
This is a supervised classification problem: a model learns from documents that are already labeled, using a training set to fit the model and a validation set to check it. Ensemble methods such as XGBoost are particularly efficient. Naive Bayes is a basic algorithm routinely used in this context, typically with good performance. It became widely known as an early technique for detecting spam in email data.
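To make the Naive Bayes idea concrete, here is a small multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written in plain Python so it runs without extra libraries. It is a teaching sketch, not a tuned implementation: the whitespace tokenizer, the class name `NaiveBayes` and the two labels used in the usage note are all assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list:
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {
            word for counts in self.word_counts.values() for word in counts
        }
        return self

    def log_scores(self, text: str) -> dict:
        """Return an unnormalized log-probability for each label."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)  # log prior
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text: str) -> str:
        scores = self.log_scores(text)
        return max(scores, key=scores.get)
```

A model would be trained with something like `NaiveBayes().fit(train_texts, train_labels)`, where the labels are categories such as "public" and "confidential", and then checked against a held-out validation set before anyone relies on its predictions.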
A simple ensemble method, of the kind used in fraud detection and to identify well-performing articles, is another option.
The first step is to create a list of all attributes attached to a document. These attributes serve as the features the classification algorithm uses to label documents. They include type (PDF, Excel, etc.), author of the document (job title, company or organization, and email address), source, date received or created and last updated, who it was sent to initially, the size of the document and the presence of specific keywords in the text or subject line.
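The attribute list above could be turned into model features along these lines. The `Document` structure, its field names and the keyword list are hypothetical, and how each feature is encoded (for example, keeping only the author's email domain) is one possible choice among many.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical keyword list; a real one would come from the legal or IT team.
KEYWORDS = ("confidential", "nda", "salary", "ssn")


@dataclass
class Document:
    file_type: str       # e.g. "PDF" or "Excel"
    author_title: str
    author_email: str
    source: str
    created: date
    last_updated: date
    recipients: list     # who the document was initially sent to
    size_bytes: int
    subject: str
    text: str


def extract_features(doc: Document) -> dict:
    """Map a document's attributes to a flat dictionary of features."""
    # Match whole words so that "agenda" does not count as "nda".
    words = set(re.findall(r"[a-z0-9]+", f"{doc.subject} {doc.text}".lower()))
    features = {
        "file_type": doc.file_type.lower(),
        "author_title": doc.author_title.lower(),
        "author_domain": doc.author_email.rsplit("@", 1)[-1].lower(),
        "source": doc.source.lower(),
        "created_year": doc.created.year,
        "days_created_to_update": (doc.last_updated - doc.created).days,
        "recipient_count": len(doc.recipients),
        "size_bytes": doc.size_bytes,
    }
    for keyword in KEYWORDS:
        features[f"has_{keyword}"] = keyword in words
    return features
```

The categorical values would still need encoding (one-hot, for instance) before being passed to most classifiers.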
It is a good strategy to tune the algorithm's parameters to minimize false negatives, meaning documents erroneously classified as public. Documents that a black-box algorithm labels as non-public can then be manually reviewed to eliminate false positives.
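One simple way to bias a classifier against false negatives is to release a document as public only when the model's estimated probability of "public" clears a threshold chosen on validation data. The sketch below assumes the model outputs such a probability. The function names, the `max_leak_rate` parameter and the "manual_review" label are illustrative.

```python
def choose_threshold(scores, is_public, max_leak_rate=0.0):
    """Pick the lowest threshold on P(public) with an acceptable leak rate.

    scores:     model estimates of P(public) for validation documents
    is_public:  True where the document really is public
    A "leak" is a non-public document scored at or above the threshold,
    in other words one that would be erroneously classified as public.
    """
    sensitive_total = sum(1 for public in is_public if not public)
    for threshold in sorted(set(scores)):
        leaks = sum(
            1 for score, public in zip(scores, is_public)
            if score >= threshold and not public
        )
        leak_rate = leaks / sensitive_total if sensitive_total else 0.0
        if leak_rate <= max_leak_rate:
            return threshold
    # No threshold is safe enough: nothing is released automatically.
    return float("inf")


def route(score: float, threshold: float) -> str:
    """Release confident public documents; send everything else to a reviewer."""
    return "public" if score >= threshold else "manual_review"
```

Choosing the lowest acceptable threshold keeps the manual review queue as small as possible while holding the share of sensitive documents released as public at or below `max_leak_rate` on the validation set. Performance on new documents can differ, so the threshold should be rechecked periodically.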
It is also important to constantly update the list of people allowed to access specific data based on the category.
For instance, in a previous position, I was running a Perl script against live databases -- including personal data -- to produce summaries, show trends and make predictions. When the company was acquired, the purchasing company believed I was a hacker (the issue was compounded by the fact I was working remotely).
At no point did the company change access privileges, and I was never told to stop running these scripts or accessing these live databases. The acquiring company was probably unaware that this work had been part of my job prior to the acquisition. It also never changed the passwords. The issue was quickly resolved, but it is a reminder of all the required precautions, especially during mergers and acquisitions. The situation could have been a lot worse: Imagine if someone hacked my computer and accessed the live database to extract large chunks of data.
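Keeping access lists current, as urged above, can be partly automated by flagging every access grant that has not been reviewed since a triggering event such as an acquisition. This is a minimal sketch under assumed names: the `Grant` record, its fields and the default set of sensitive categories are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Grant:
    user: str
    category: str        # e.g. "internal", "confidential", "restricted"
    last_reviewed: date  # when someone last confirmed the grant is still needed


def grants_needing_review(grants, event_date,
                          categories=("confidential", "restricted")):
    """Return grants to sensitive categories not reviewed since event_date."""
    return [
        grant for grant in grants
        if grant.category in categories and grant.last_reviewed < event_date
    ]
```

Run with the acquisition date as `event_date`, a check like this would have surfaced a remote analyst's standing access to live databases for explicit confirmation or revocation.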
Data categorization should be a priority for any organization dealing with sensitive data. It is not expensive to do, whether through automation or a hybrid approach, using natural language processing techniques or products. It can free the legal or IT team of some cumbersome work. The risks of not following data classification best practices are not insignificant -- ignoring them can result in security issues, loss, theft or alteration of data and potential litigation.