This is an excerpt from Chapter 15 from the book NoSQL for Mere Mortals by Dan Sullivan, an independent database consultant and author. In the chapter, Sullivan takes a look at the four primary types of NoSQL databases -- key-value, document, column family and graph databases -- and provides insights into which applications are best suited for each of them. He also discusses the differences between relational and NoSQL database design, and the need for coexistence between relational and NoSQL technologies in many organizations.
In relational database design, the structure and relations of entities drive design -- not so in NoSQL database design. Of course, you will model entities and relations, but performance is more important than preserving the relational model.
The relational model emerged for pragmatic reasons -- that is, to address data anomalies and the difficulty of reusing existing databases for new applications. NoSQL databases also emerged for pragmatic reasons -- specifically, the inability of relational databases to scale to meet growing demands for high volumes of read and write operations.
In exchange for improved read and write performance, you may lose other features of relational databases, such as immediate consistency and ACID transactions (although this is not always the case).
Throughout this book, queries have driven the design of data models. This is the case because queries describe how data will be used. Queries are also a good starting point for understanding how well various NoSQL databases will meet your needs. You will also need to understand other factors, such as:
- The volume of reads and writes
- Tolerance for inconsistent data in replicas
- The nature of relations between entities and how that affects query patterns
- Availability and disaster recovery requirements
- The need for flexibility in data models
- Latency requirements
The following sections provide some sample use cases and some criteria for matching different NoSQL database models to different requirements.
Criteria for selecting key-value databases
Key-value databases are well-suited to applications that have frequent small reads and writes along with simple data models. The values stored in key-value databases may be simple scalar values, such as integers or Booleans, but they may also be structured data types, such as lists and JSON structures.
Key-value databases generally have simple query facilities that allow you to look up a value by its key. Some key-value databases support search features that provide for somewhat more flexibility. Developers can use tricks, such as enumerated keys, to implement range queries, but these databases usually lack the query capabilities of document, column family and graph databases.
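The enumerated-key trick mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dictionary stands in for the key-value database, and the `entity:date:sequence` key format and the `ticketLog` name are hypothetical choices, not the convention of any particular product.

```python
# Toy in-memory key-value store; a dict stands in for a real database.
store = {}

def make_key(entity, date, seq):
    """Build an enumerated key such as 'ticketLog:20140617:2' (hypothetical scheme)."""
    return f"{entity}:{date}:{seq}"

def put(entity, date, seq, value):
    """Store a value under its enumerated key."""
    store[make_key(entity, date, seq)] = value

def range_query(entity, date, first_seq, last_seq):
    """Emulate a range query by generating every key in the range and looking each one up."""
    results = []
    for seq in range(first_seq, last_seq + 1):
        value = store.get(make_key(entity, date, seq))
        if value is not None:
            results.append(value)
    return results

put("ticketLog", "20140617", 1, "opened")
put("ticketLog", "20140617", 2, "assigned")
put("ticketLog", "20140617", 3, "resolved")
put("ticketLog", "20140618", 1, "opened")

print(range_query("ticketLog", "20140617", 2, 3))  # ['assigned', 'resolved']
```

The database itself only ever sees single-key lookups. The range logic lives entirely in the application, which is why this approach is a workaround rather than a real query capability.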
Key-value databases are used in a wide range of applications, such as the following:
- Caching data from relational databases to improve performance
- Tracking transient attributes in a Web application, such as a shopping cart
- Storing configuration and user data information for mobile applications
- Storing large objects, such as images and audio files
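The first use case in the list, caching relational data, might look like the following minimal sketch. It assumes an in-memory SQLite database as the relational source and a plain dictionary as the key-value cache; the `customers` table, the key format and the `get_customer_name` function are hypothetical names chosen for illustration.

```python
import sqlite3

# Relational source of truth (in-memory SQLite, for illustration only).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers (id, name) VALUES (?, ?)",
               [(1, "Ada"), (2, "Grace")])

cache = {}                 # stands in for a key-value database
stats = {"db_reads": 0}    # counts how often the relational database is queried

def get_customer_name(customer_id):
    """Return a customer's name, checking the cache before the relational database."""
    key = f"customer:{customer_id}:name"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute("SELECT name FROM customers WHERE id = ?",
                     (customer_id,)).fetchone()
    name = row[0] if row else None
    if name is not None:
        cache[key] = name  # populate the cache for later reads
    return name

get_customer_name(1)       # first call reads from the relational database
get_customer_name(1)       # second call is served from the cache
print(stats["db_reads"])   # 1
```

A production cache would also need an expiration or invalidation policy so that cached values do not drift from the relational source. That is omitted here to keep the sketch short.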
In addition to key-value databases you install and run on premises, there are a number of cloud-based choices. Amazon Web Services offers SimpleDB and DynamoDB, whereas Microsoft Azure's Table service provides key-value storage.
Use cases and criteria for selecting document databases
Document databases are designed for flexibility. If an application requires the ability to store varying attributes along with large amounts of data, then document databases are a good option. For example, to represent products in a relational database, a modeler may use a table for common attributes and additional tables for each subtype of product to store attributes used only in the subtype of product. Document databases can handle this situation easily.
Document databases provide for embedded documents, which are useful for denormalizing. Instead of storing data in different tables, data that is frequently queried together is stored together in the same document.
Additionally, document databases improve on the query capabilities of key-value databases with indexing and the ability to filter documents based on attributes in the document.
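To make the product example concrete, here is a small sketch that uses Python dictionaries as stand-ins for documents. The product data, the attribute names and the `find` and `find_where` helpers are all invented for illustration and do not correspond to the API of any specific document database.

```python
# Each product is one document. Common attributes sit at the top level,
# while subtype-specific attributes live in an embedded "details" document.
products = [
    {"sku": "B-100", "type": "book", "title": "Example Book", "price": 40.0,
     "details": {"pages": 500, "format": "paperback"}},
    {"sku": "C-200", "type": "cd", "title": "Example Album", "price": 12.0,
     "details": {"tracks": 11}},
    {"sku": "A-300", "type": "appliance", "title": "Example Toaster", "price": 25.0,
     "details": {"watts": 800, "warranty_years": 2}},
]

def find(documents, **criteria):
    """Return documents whose top-level attributes equal every given criterion."""
    return [doc for doc in documents
            if all(doc.get(key) == value for key, value in criteria.items())]

def find_where(documents, predicate):
    """Return documents for which the predicate function is true."""
    return [doc for doc in documents if predicate(doc)]

books = find(products, type="book")
cheap = find_where(products, lambda doc: doc["price"] < 30)
high_power = find_where(products, lambda doc: doc["details"].get("watts", 0) >= 500)

print([doc["sku"] for doc in books])       # ['B-100']
print([doc["sku"] for doc in cheap])       # ['C-200', 'A-300']
print([doc["sku"] for doc in high_power])  # ['A-300']
```

No subtype tables or joins are needed: each document carries only the attributes that apply to it, and the embedded `details` document is read together with the rest of the product. A real document database would typically add indexes so that such filters do not have to scan every document.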
Document databases are probably the most popular of the NoSQL databases because of their flexibility, performance and ease of use.
These databases are well-suited to a number of use cases, including:
- Back-end support for websites with high volumes of reads and writes
- Managing data types with variable attributes, such as products
- Tracking variable types of metadata
- Applications that use JSON data structures
- Applications benefiting from denormalization by embedding structures within structures
Document databases are also available as cloud services, such as Microsoft Azure DocumentDB and Cloudant.
Use cases and criteria for selecting column family databases
Column family databases are designed for large volumes of data, read and write performance, and high availability. Google introduced Bigtable to address the needs of its services. Facebook developed Cassandra to back its Inbox Search service.
These database management systems run on clusters of multiple servers. If your data is small enough to run with a single server, then a column family database is probably more than you need -- consider a document or key-value database instead.
Column family databases are well-suited for use with:
- Applications that require the ability to always write to the database
- Applications that are geographically distributed over multiple data centers
- Applications that can tolerate some short-term inconsistency in replicas
- Applications with dynamic fields
- Applications with the potential for truly large volumes of data, such as hundreds of terabytes
Google demonstrated the capabilities of Cassandra running on Google Compute Engine. Google engineers deployed:
- 330 Google Compute Engine virtual machines
- 300 1 TB Persistent Disk volumes
- Debian Linux
- DataStax Cassandra 2.2
- A write configuration in which data was written to two nodes (quorum commit of 2)
- 30 virtual machines to generate 3 billion records of 170 bytes each
With this configuration, the Cassandra cluster reached 1 million writes per second, with 95% completing in under 23 milliseconds. When one-third of the nodes were lost, the 1 million writes were sustained, but with higher latency.
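A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below uses only the numbers quoted above. It assumes the write rate is sustained for the whole load and ignores storage overhead and compression, so the results are rough estimates rather than measurements from the test.

```python
# Figures quoted in the text above.
records = 3_000_000_000          # records generated
record_bytes = 170               # bytes per record
writes_per_second = 1_000_000    # sustained write rate
nodes_per_write = 2              # quorum commit of 2

# Raw payload, before any replication or storage overhead.
payload_bytes = records * record_bytes
payload_gb = payload_bytes / 10**9                   # decimal gigabytes

# Time to write every record at the sustained rate.
load_seconds = records / writes_per_second
load_minutes = load_seconds / 60

# Payload throughput, and the minimum written when each record goes to two nodes.
payload_mb_per_second = writes_per_second * record_bytes / 10**6
min_written_mb_per_second = payload_mb_per_second * nodes_per_write

print(payload_gb)                 # 510.0
print(load_minutes)               # 50.0
print(payload_mb_per_second)      # 170.0
print(min_written_mb_per_second)  # 340.0
```

In other words, the workload is roughly 510 GB of raw payload, which takes about 50 minutes to load at 1 million writes per second.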
Several areas can use this kind of big data processing capability, such as:
- Security analytics using network traffic and log data
- Big Science, such as bioinformatics using genetic and proteomic data
- Stock market analysis using trade data
- Web-scale applications such as search
- Social network services
Key-value, document and column family databases are well-suited to a wide range of applications. Graph databases, however, are best suited to a particular type of problem.
Use cases and criteria for selecting graph databases
Problem domains that lend themselves to representations as networks of connected entities are well-suited for graph databases. One way to assess the usefulness of a graph database is to determine if instances of entities have relations to other instances of entities.
For example, two orders in an e-commerce application probably have no connection to each other. They might be placed by the same customer, but that is a shared attribute, not a connection.
Similarly, a game player's configuration and game state have little to do with other game players' configurations. Entities like these are readily modeled with key-value, document or relational databases.
Now, consider examples mentioned in the discussion of graph databases, such as highways connecting cities, proteins interacting with other proteins and employees working with other employees. In all of these cases, there is some type of connection, link or direct relationship between two instances of entities.
These are the types of problem domains that are well-suited to graph databases. Other examples of these types of problem domains include:
- Network and IT infrastructure management
- Identity and access management
- Business process management
- Recommending products and services
- Social networking
From these examples, it is clear that when there is a need to model explicit relations between entities and rapidly traverse paths between entities, graph databases are a good option.
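Path traversal is the operation that sets these problems apart. As a rough illustration, the sketch below models the highways-and-cities example as an adjacency structure in plain Python and finds a route with breadth-first search. The list of highway links is illustrative only, and the `build_graph` and `shortest_path` helpers are hypothetical. A graph database would expose traversal through its own query language rather than through code like this.

```python
from collections import deque

# Illustrative highway links between cities (undirected edges).
highways = [
    ("Portland", "Seattle"),
    ("Seattle", "Spokane"),
    ("Portland", "Boise"),
    ("Boise", "Salt Lake City"),
]

def build_graph(edges):
    """Build an adjacency mapping: city -> set of directly connected cities."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a path with the fewest hops, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        city = path[-1]
        if city == goal:
            return path
        for neighbor in sorted(graph[city]):  # sorted for a deterministic order
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

graph = build_graph(highways)
print(shortest_path(graph, "Spokane", "Boise"))
# ['Spokane', 'Seattle', 'Portland', 'Boise']
```

Each step follows a direct link from one instance to another, which is the kind of access pattern that is awkward to express as lookups by key or as filters on document attributes.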
Large-scale graph processing, such as with large social networks, may actually use column family databases for storage and retrieval. Graph operations are built on top of the database management system. The Titan graph database and analysis platform takes this approach.
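One way to picture graph operations layered on column family storage is to keep each vertex as a row and each outgoing edge as a column in that row. The sketch below is a deliberately simplified illustration of that idea using nested dictionaries. The `edges` column family, the `label:target` column naming and the helper functions are assumptions made for this example and are not a description of Titan's actual storage layout.

```python
# Column family layout: row key -> column family -> {column name: value}
table = {}

def put_column(row_key, family, column, value):
    """Write one column into a row, creating the row and family if needed."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

def add_edge(source, label, target):
    """Store an edge as a column named 'label:target' in the source vertex's row."""
    put_column(source, "edges", f"{label}:{target}", "")

def neighbors(source, label):
    """Graph operation built on top of the storage layer: read one row, filter its columns."""
    columns = table.get(source, {}).get("edges", {})
    prefix = f"{label}:"
    return sorted(name[len(prefix):] for name in columns if name.startswith(prefix))

add_edge("alice", "works_with", "bob")
add_edge("alice", "works_with", "carol")
add_edge("alice", "manages", "dave")

print(neighbors("alice", "works_with"))  # ['bob', 'carol']
```

Because all of a vertex's edges sit in one row, finding its neighbors is a single row read, and rows can be distributed across a cluster in the way column family databases already support.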
Key-value, document, column family and graph databases meet different types of needs. Unlike relational databases, which essentially displaced their predecessors, these NoSQL databases will continue to coexist with each other and with relational databases because there is a growing need for different types of applications with varying requirements and competing demands.