The NoSQL challenge: What's in store for big data and security

Big data offers horizontal scalability, but how do you get your database security to scale along with it?

Security teams at many organizations are past the point where they can call big data projects a fad. They recognize that they are staring straight into the face of the next generation of data storage platforms. It's not just Google, Facebook and the NSA that find big data critical to their business models; thousands of companies, including startups, have already jumped in with both feet to take advantage of big data architectures.

Many security professionals who encounter big data environments for the first time don't understand why security is a big issue. They assume they are going to rely on the same tools and techniques that they use to protect their relational databases.

You can't. Most security features built into RDBMSes are not present in big data platforms. Many of the third-party security tools you own don't work with big data either. These non-relational data management platforms are different enough -- architecturally -- that you need to rethink how you approach database security.

What makes big data technology attractive is that it is open source, providing virtually free analysis and data management capabilities. And unlike data warehouses of old -- which cost millions of dollars and run on proprietary systems -- big data runs atop commodity hardware. Couple that with the huge volumes of data generated by every type of electronic device imaginable, and firms of all sizes can derive valuable information at very low cost.

But as with all types of databases -- and make no mistake, big data systems are databases -- high-quality data produces better analysis results. Senior management is not usually briefed on how the sausage is made: They only care about the quality of the results and how it helps the bottom line. So big data architects grab customer data, transactional information, intellectual property, supply chain details and financial information to support business intelligence and high-quality analytics. It's not really a question of whether sensitive data is stored within the cluster; it's a question of how that information is being used. With broader adoption of big data projects, rapidly advancing platforms and growing concerns about sensitive data, it's time to re-examine how to secure these systems and the data they store.

The trouble with big data

Of the many differences between traditional databases and big data architectures, three stand out. The first, and most important, is that a big data system comprises a cluster of servers, each with a slice of the stored data. Applications do not speak with a single node in this cluster; instead, they communicate with hundreds or even thousands of nodes. Protecting big data is closer to securing an entire data center than a single server.
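
To make the first point concrete, here is a minimal sketch of how records might be spread across a cluster by hashing their keys. The function names and the simple modulo scheme are assumptions for illustration only; real platforms use their own partitioning schemes.

```python
import hashlib


def node_for_key(key: str, node_count: int) -> int:
    """Map a record key to a node index (illustrative scheme, not any platform's)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def nodes_touched(keys, node_count: int) -> set:
    """Return the set of nodes an application must contact to read these keys."""
    return {node_for_key(key, node_count) for key in keys}


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    touched = nodes_touched(keys, node_count=8)
    print(f"1,000 keys are spread across {len(touched)} of 8 nodes")
```

Even a modest batch of keys lands on many nodes, so every one of those nodes, and every network path to it, becomes part of the surface that has to be protected.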

Second, there really is no such thing as a standard cluster. There are more than 150 big data variants available, each specializing in something different -- graph data, tuple stores, document stores and wide columnar stores just to name a few. Many of these variations allow you to swap components: The data model, resource manager, data access layer, query facilities and orchestration tools are interchangeable. And because these platforms are relatively new, the architects who built them focused on scalability, performance and accomplishing a specific data processing task in a unique way. Security was not on the agenda. While some commercial distributions of big data platforms are closing the gaps, security with open source variations is largely an afterthought, offering limited capabilities.

The third, and final, difficulty with big data security is that some traditional tools don't work well with big data technologies. With the multi-node architecture, the sheer scale, variety and velocity of data outpace the capabilities of traditional security products. Some forms of security either have trouble scaling (row-level encryption, masking, packet analysis) or they simply don't work (content filtering, query monitoring) with many NoSQL clusters. Taken together, these issues make big data security challenging.

NoSQL security reference architectures

Organizations take different approaches to securing their big data clusters. Most enterprises adopt a modicum of security, but many companies back into a security model based on how they deployed the cluster -- which is to say they implemented what they could after it was launched. Four common approaches address security while still allowing big data to horizontally scale (as it's designed to do). It's important to note there is no single right approach here, as each model reflects the skills, infrastructure and risks an organization deems important.

A quick note on terminology: NoSQL is the term used by the development community to differentiate these platforms from relational databases. It was first used as derogatory slang for "Not SQL" (not relational), which later morphed into the politically correct "Not Only SQL." You'll hear other terms in the coming months, such as the currently fashionable "data lake," but they all refer to the same clustered database paradigm.

The single most common approach to big data security is a "walled garden" security model. You can think of this as the "moat" model used for mainframe security for many years: IT places the entire cluster onto its own network and tightly controls logical access through firewalls and access controls.

With this security model, there is virtually no security within the NoSQL cluster itself, as data and database security are dependent upon the outer "protective shell" of the network and applications that surround the database. The advantage is simplicity: Any firm can implement this security model with existing tools and skills without performance or functional degradation to the database. On the downside, security is fragile; once a failure of the firewall or application occurs, the system is exposed. This security model also does not provide the means to prevent credentialed users from misusing the system or viewing and modifying data stored in the cluster. For organizations that are not particularly worried about security, this is a simple, cost-effective approach.
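
The perimeter idea can be sketched in a few lines. This is an illustration under stated assumptions, not a firewall configuration: the subnet and the function name are hypothetical, and a real deployment would enforce the rule in network equipment rather than in application code.

```python
import ipaddress

# Hypothetical subnet for the application tier that is allowed to reach the cluster.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.20.0.0/24")]


def perimeter_allows(source_ip: str) -> bool:
    """Return True if the source address falls inside an allowed network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("10.20.0.15", "192.168.1.5"):
        print(ip, "allowed" if perimeter_allows(ip) else "blocked")
```

Note what the check does not do: once a request passes the perimeter, nothing here limits which data it can read or change. That is the fragility described above.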

Another approach to big data security, shown in Figure 1, leverages security tools or third-party products built into the NoSQL cluster. Security in this case is systemic and part of the base function. Tools may include SSL/TLS for secure communication, Kerberos for node authentication, transparent encryption for data-at-rest security, and identity and authorization (groups, roles) management just to name a few. This cluster security approach is more difficult, requiring setup of several security functions targeted at specific risks to the database infrastructure. And, as third-party security tools are often required, it's typically more expensive. However, it does secure a cluster from attackers, rogue admins and the occasional witless application programmer. It's the most effective, and most comprehensive, approach to NoSQL security.

NoSQL cluster security
Figure 1. This cluster security approach, which relies on integrated tools, is more difficult, requiring setup of several security functions targeted at specific risks to the database infrastructure.
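
As one small piece of this model, the role-based authorization mentioned above might look like the following sketch. The role names, users and permission sets are hypothetical; actual platforms and third-party tools define their own authorization models and configuration formats.

```python
# Hypothetical role definitions; real clusters define these in their own tooling.
ROLE_PERMISSIONS = {
    "analyst": {"read"},
    "etl_service": {"read", "write"},
    "admin": {"read", "write", "grant"},
}

# Hypothetical mapping of identities to roles.
USER_ROLES = {
    "alice": {"analyst"},
    "loader": {"etl_service"},
}


def is_authorized(user: str, action: str) -> bool:
    """Allow an action only if one of the user's roles grants it."""
    roles = USER_ROLES.get(user, set())
    return any(action in ROLE_PERMISSIONS.get(role, set()) for role in roles)


if __name__ == "__main__":
    print(is_authorized("alice", "read"))    # analyst role grants read
    print(is_authorized("alice", "write"))   # analyst role does not grant write
    print(is_authorized("mallory", "read"))  # unknown user has no roles
```

Unlike the walled garden, a check of this kind applies to credentialed users as well, which is what lets the cluster limit misuse from the inside.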

Data-centric approach to security

Big data systems typically share data from dozens of sources. As firms do not always know where their data is going, or what security controls are in place when it is stored, they've taken steps to protect the data regardless of where it is used. The third NoSQL security model is called data-centric security because the controls are part of the data, not the database. Data is protected before it is moved into a big data repository.

The three basic tools that support data-centric security are tokenization, masking and data element encryption, as shown in Figure 2. You can think of tokenization just like a subway or arcade token; it has no cash value but can be used to ride the train or play a game. In this case, a data token is provided in lieu of sensitive data. Tokenization is commonly used in credit card processing systems to substitute credit card numbers. The token has no intrinsic value other than as a reference to the original value in some other (more secure) database. Masking is another tool commonly used to protect data elements while retaining the aggregate value of a data set. For example, firms may substitute an individual's Social Security number with a random number, replace a name with one chosen arbitrarily from a phone book, or replace a date value with a date within some range. In this way the original sensitive data value is removed entirely, but the value of the data set is preserved for analysis. Finally, data elements can be encrypted and passed without fear of compromise; only legitimate users with the right encryption keys can view the original value.

Data-centric security
Figure 2. The three basic tools that support data-centric security are tokenization, masking and data-element encryption.
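
A minimal sketch of tokenization and masking, applied to a record before it is loaded into the cluster, might look like this. The class, function and field names are all hypothetical. The in-memory vault stands in for the separate, more secure database described above, and the seeded generator is only there to keep the demo repeatable. Data element encryption is left out because it would need a third-party cryptography library.

```python
import random
import secrets
from datetime import date, timedelta


class TokenVault:
    """Hypothetical in-memory vault; a real one is a separate, hardened store."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


def mask_ssn(rng: random.Random) -> str:
    """Produce a random stand-in with the same NNN-NN-NNNN shape."""
    return f"{rng.randint(0, 999):03d}-{rng.randint(0, 99):02d}-{rng.randint(0, 9999):04d}"


def mask_date(original: date, rng: random.Random, max_days: int = 30) -> date:
    """Shift a date by a random offset within +/- max_days."""
    return original + timedelta(days=rng.randint(-max_days, max_days))


def protect_record(record: dict, vault: TokenVault, rng: random.Random) -> dict:
    """Return a protected copy of the record; the original is left untouched."""
    protected = dict(record)
    protected["card_number"] = vault.tokenize(record["card_number"])
    protected["ssn"] = mask_ssn(rng)
    protected["birth_date"] = mask_date(record["birth_date"], rng)
    return protected


if __name__ == "__main__":
    vault = TokenVault()
    rng = random.Random(0)  # seeded only so the demo is repeatable
    record = {
        "card_number": "4111111111111111",
        "ssn": "123-45-6789",
        "birth_date": date(1980, 5, 17),
        "purchase_total": 42.50,
    }
    print(protect_record(record, vault, rng))
```

The token can be traded back for the card number only by a system with access to the vault. The masked values cannot be reversed at all, and non-sensitive fields such as the purchase total pass through unchanged, so they remain available for analysis.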

The data-centric security model provides a great deal of security when the systems that process data cannot be fully trusted. And many enterprises, given the immaturity of the technology, do not fully trust big data clusters to protect information. But a data-centric security model requires careful planning and tool selection, as it's more about information lifecycle management. You define the controls over what data can be moved, and what protection must be applied before it is moved. Short of deleting sensitive data, this is the best model when you must populate a big data cluster for analysis work but cannot guarantee security.

Re-architecting for the cloud

The final big data security model organizations use with growing regularity is to leverage the security facilities embedded in the services provided by cloud providers for both platform and infrastructure as a service. Cloud computing environments provide many advantages in secure deployment options, security orchestration, and built-in features to secure platforms and the network. For example, the ability to run trusted server images and dynamically preconfigure and validate the server instance prior to deployment provides a trusted environment for the cluster to run in. Dynamic and on-demand, cloud services do not limit the performance of big data databases.

Tools such as identity management, session encryption, data encryption, monitoring and logging are built into the cloud. Many big data platform providers, such as Cloudera, DataStax, Hortonworks, MapR and Zettaset, offer trusted bundles of Hadoop or Cassandra variants as part of their service. Even if the NoSQL variant you choose is devoid of security capabilities, you can close most of the gaps with the tools the platform providers offer. And with metered, on-demand services you only pay for what you use. The downside is that this approach requires companies to understand the cloud and re-architect their deployments to leverage these facilities. You can't go into the cloud thinking a network monitoring and firewall security model will address your risk; your focus must realign onto orchestration, management plane and data security. If you've not yet jumped into big data, this approach is worth a serious look.

Time to rethink database security

Big data is the database of the future: it's what companies select when they start new projects. It's fast, easily customized for specific tasks and still cost-effective. That said, these are very early days for big data and NoSQL security, and no one said it's going to be easy. There will be growing pains as the platforms mature, and you'll discover many facets of big data that alter how you accomplish security tasks you thought you had already mastered.

First and foremost, never assume that you don't have sensitive data stored -- you probably do or will at some point in the near future. Second, examine the NoSQL security reference architectures and see which best fits your organization's deployment model, employee skills and budget. It does not do you much good to select a clustered security approach if you don't have the budget or staff to set up security controls. Finally, leverage audit logs as much as possible. It's how you'll collect data in order to monitor database security, but the data is also applicable to performance, operations and business analytics. It will help you discover and understand what you don't yet know.
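
As a rough illustration of putting audit logs to work, the sketch below counts denied requests per user and flags anyone at or above a threshold. The record fields, sample events and threshold are assumptions; each platform writes audit data in its own format.

```python
from collections import Counter

# Hypothetical audit records; real platforms emit their own formats and fields.
AUDIT_EVENTS = [
    {"user": "alice", "action": "read", "resource": "customers", "allowed": True},
    {"user": "bob", "action": "read", "resource": "payroll", "allowed": False},
    {"user": "bob", "action": "read", "resource": "payroll", "allowed": False},
    {"user": "bob", "action": "write", "resource": "payroll", "allowed": False},
    {"user": "carol", "action": "read", "resource": "orders", "allowed": False},
]


def denied_counts(events) -> Counter:
    """Count denied requests per user."""
    return Counter(event["user"] for event in events if not event["allowed"])


def users_over_threshold(events, threshold: int = 3) -> list:
    """Return users with at least `threshold` denied requests, sorted by name."""
    return sorted(user for user, count in denied_counts(events).items() if count >= threshold)


if __name__ == "__main__":
    print(users_over_threshold(AUDIT_EVENTS))
```

The same records, grouped by resource or by hour instead of by user, would feed the performance and operations questions mentioned above.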

About the author
Adrian Lane is a Securosis senior security strategist who specializes in database architecture, data security and secure code development. Prior to joining Securosis, he was chief technology officer at database security firm IPLocks; vice president of engineering at Touchpoint; and CTO of the secure payment and digital rights management firm Transactor/Brodia. He holds a degree in computer science from the University of California at Berkeley and did post-graduate work in operating systems at Stanford University.
