What is a columnar database?
A columnar database is a database management system (DBMS) that stores data in columns instead of rows. The purpose of a columnar database is to efficiently write and read data to and from hard disk storage in order to speed up the time it takes to return a query. Columnar databases store data in a way that greatly improves disk I/O performance. They are particularly helpful for data analytics and data warehousing.
Columnar database vs. row-oriented database
Column-oriented databases and row-oriented databases are both methods for processing data in data warehouses. However, they have different approaches: While column-oriented databases store data in columns, row-oriented databases store data in rows. Instead of keeping a record of every column in a table in a single row, a column-oriented database will store the data for each column in a single column.
The main benefit of a columnar database is faster performance compared to a row-oriented one. That's because it accesses less memory to output data. Because a columnar database stores data by columns instead of rows, it can store more data in a smaller amount of memory. And because the initial data retrieval is done on a column-by-column basis, only the columns that need to be used are retrieved. This makes it possible for a columnar database to scale efficiently and handle large amounts of data.
Columnar database example
In a columnar database, all the values in a column are physically grouped together. For example, all the values in column 1 are grouped together; then all values in column 2 are grouped together; etc. The data is stored in record order, so the 100th entry for column 1 and the 100th entry for column 2 belong to the same input record. This enables individual data elements, such as customer name to be accessed in columns as a group, rather than individually row-by-row.
Here is an example of a simple database table with four columns and three rows.
|Purchase (in dollars)
In a columnar DBMS, the data would be stored like this: 0411,0412,0413;Moriarty,Richards,Diamond;Angela,Jason,Samantha;52.35,325.82,25.50.
In a row-oriented DBMS, the data would be stored like this: 0411,Moriarty,Angela, 52.35;412, Richards,Jason,325.82;0413,Diamond,Samantha,25.50.
Benefits of using a columnar database
Columnar databases have been around for decades but offer benefits for modern business applications, such as data analytics, business intelligence (BI) and data warehousing -- but that's not all. Here are three key advantages of columnar databases:
- Multipurpose. Columnar databases receive a lot of attention with big data applications. They're also used for other purposes: running online analytical processing (OLAP) cubes, storing metadata and doing real-time analytics. Columnar databases are efficient for these tasks because they excel at loading new data quickly.
- Compressible data. Data can be highly compressed in a columnar database. The compression permits columnar operations -- like MIN, MAX, SUM, COUNT and AVG -- to be performed fast.
- Self-indexing. Another benefit of a column-based DBMS is self-indexing, which uses less disk space than a relational database management system containing the same data.
- Speed and efficiency. Columnar databases perform analytical queries faster than other database methodologies. They are also quick and efficient at performing joins, a way of combining data from two tables in a relational database. Although it's a standard way of combining data, a join can be inefficient and slow performance. A columnar database can join any number of data sets quickly, and it can aggregate the results of a query into a single output.
Columnar database limitations
Traditional databases are more suitable for incremental data loading than columnar databases. Incremental data loading is a technique that implements a bulk data load into a database by loading only a subset of the data.
The data is loaded according to a trigger, which is a point where the data can be loaded more efficiently. An example of a trigger is when another user adds data or when a certain time of the day occurs. If the trigger occurs, then the subset of data before the trigger point is loaded into the database. This technique is ideal for loading historical data or recently updated or created data.
Online transaction processing (OLTP) applications are also not suitable in column-oriented databases. Row-oriented databases work better for OLTP applications because they have better concurrent processing and isolation capabilities, and they use disk space more efficiently.
OLTP is a type of DBMS that handles large amounts of data and short-term queries. Data is stored until it is modified or deleted, and the OLTP system will typically remain operational during this storage period. In contrast, OLAP system data is aggregated and analyzed to provide a strategic business view. The data storage needs of OLTP and OLAP are different. OLTP mainly stores insert, update and delete operations, while OLAP mainly stores aggregated data.
As the use of in-memory analytics increases, the relative benefits of row-oriented versus column-oriented databases may become less important. In-memory analytics is not concerned with efficiently reading and writing data to a hard disk. Instead, it allows data to be queried in random access memory.
Databases are a key part of data management. Find out what you need to know to effectively manage data in today's enterprise.