Despite all the hoopla about Hadoop, NoSQL databases and other big data technologies, relational database management systems continue to be the cornerstone of the IT infrastructure for processing, storing and managing data in most organizations. Companies of all sizes, from those with less than $10 million in sales to ones with $10 billion and more, rely on relational databases to handle their growing data volumes. But if the relational database design process isn't handled properly, the systems are at risk of choking on all the data being fed into them -- and that can have severe consequences for business operations and IT teams.
One of the fundamental considerations in designing databases so they can meet performance expectations is what kinds of applications they'll be supporting. There are three broad usage categories to factor into the design planning and data modeling stages:
- Transactional or business workflow applications;
- Business intelligence applications;
- Data integration processes for funneling information into data warehouses, master data management hubs and other systems.
Each of those categories can be effectively supported by relational databases, but different data modeling approaches are called for to optimize database performance and application throughput.
Take the right road in modeling data
From a database standpoint, transactional applications primarily involve inserting and updating data. As a result, the data structure needs to be designed to eliminate, or at least minimize, redundant records. The goal is to ensure that inserts and modifications are processed only once, which helps boost performance and avoid data inconsistencies. That's usually accomplished through a logical data model built in third normal form, or 3NF.
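As a minimal sketch of the idea, the following uses Python's built-in sqlite3 module with hypothetical customer and orders tables: in 3NF, customer attributes live in one table and are referenced by key, so a change of address is a single-row update no matter how many orders that customer has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF: customer attributes are stored once, not repeated on every order row.
cur.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    order_date  TEXT NOT NULL,
    amount      REAL NOT NULL
);
""")

cur.execute("INSERT INTO customer VALUES (1, 'Acme Corp', 'Boston')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                [(100, 1, '2024-01-05', 250.0),
                 (101, 1, '2024-02-10', 400.0)])

# An address change is processed exactly once, with no risk of some order
# rows carrying the old city and others the new one.
cur.execute("UPDATE customer SET city = 'Cambridge' WHERE customer_id = 1")
conn.commit()

row = cur.execute("""
    SELECT c.city, COUNT(o.order_id)
    FROM customer c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id
""").fetchone()
print(row)  # ('Cambridge', 2)
```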
BI applications, on the other hand, typically are read-only, with users querying data for various analytics uses. But query patterns and database loads can vary widely across those uses. Dimensional models are the best performance option for relational databases that support BI systems, though care must be taken not to build them too simplistically.
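To make the contrast with 3NF concrete, here is a small star-schema sketch, again in sqlite3 with hypothetical table names: a central fact table keyed to denormalized dimension tables, which keeps typical slice-and-aggregate BI queries down to simple joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: one fact table surrounded by small, denormalized dimensions.
cur.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    quarter  TEXT
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    revenue     REAL
);
""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, 'Widget', 'Hardware'), (2, 'Gizmo', 'Hardware')])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(20240101, 2024, 'Q1'), (20240401, 2024, 'Q2')])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 20240101, 100.0), (2, 20240101, 50.0),
                 (1, 20240401, 75.0)])

# A typical read-only BI query: slice the facts by a dimension attribute.
rows = cur.execute("""
    SELECT d.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.quarter
    ORDER BY d.quarter
""").fetchall()
print(rows)  # [('Q1', 150.0), ('Q2', 75.0)]
```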
Data integration applications are essentially a cross between the transactional and BI sides and are best implemented using a hybrid data model. Such hybrids incorporate elements of the 3NF format to support data updates, then layer on advanced dimensional modeling techniques, such as slowly changing dimensions, for updating information in data warehouses while preserving the full historical record.
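The slowly changing dimension idea can be sketched as follows, using the common Type 2 variant with a hypothetical customer dimension: instead of overwriting a changed attribute, the current row is expired and a new current row is inserted, so the full history survives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Type 2 slowly changing dimension: each attribute change adds a new row
# rather than overwriting the old one, preserving the historical record.
cur.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  INTEGER,   -- natural/business key from the source system
    city         TEXT,
    valid_from   TEXT,
    valid_to     TEXT,      -- NULL means the row is still in effect
    is_current   INTEGER
);
""")
cur.execute("INSERT INTO dim_customer "
            "(customer_id, city, valid_from, valid_to, is_current) "
            "VALUES (1, 'Boston', '2023-01-01', NULL, 1)")

def scd2_update(cur, customer_id, new_city, change_date):
    """Expire the current row, then insert a new current row."""
    cur.execute("UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE customer_id = ? AND is_current = 1",
                (change_date, customer_id))
    cur.execute("INSERT INTO dim_customer "
                "(customer_id, city, valid_from, valid_to, is_current) "
                "VALUES (?, ?, ?, NULL, 1)",
                (customer_id, new_city, change_date))

scd2_update(cur, 1, 'Cambridge', '2024-06-01')

history = cur.execute("SELECT city, is_current FROM dim_customer "
                      "WHERE customer_id = 1 "
                      "ORDER BY customer_key").fetchall()
print(history)  # [('Boston', 0), ('Cambridge', 1)]
```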
Time to get physical on database design
Building the appropriate type of logical data model establishes a blueprint for effective relational database design. Next comes a physical data model that describes the specific structures, entities and elements to be incorporated into a database. Below are some database design best practices for doing so. There will be variations in syntax and, more importantly, in implementation, based on which database technology is being used, but the basic constructs are broadly applicable.
Create lots of indexes -- but not too many. Indexes are used in databases to speed up the processing of SELECT statement queries. Columns that are used to select rows within a table, or to join tables together, are prime candidates to be indexed. A healthy amount of indexing should be part of any good database design. But the number of unique column values and total rows in a table will affect whether an index will improve performance.
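As a quick illustration, the sqlite3 sketch below (with a hypothetical orders table and index name) indexes a column used in a WHERE clause and then confirms, via the database's query plan output, that the filter is satisfied by an index search rather than a full table scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
            "customer_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(i, i % 1000, float(i)) for i in range(10000)])

# Index the column used to select rows; it's a prime candidate.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# The planner should now satisfy the filter via the index.
plan = cur.execute("EXPLAIN QUERY PLAN "
                   "SELECT * FROM orders WHERE customer_id = 42").fetchall()
for row in plan:
    print(row[-1])  # the plan detail should mention idx_orders_customer
```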
And, of course, there's no free lunch. Although queries are typically faster with indexes, data inserts and updates, including modifications and deletions, are often slowed down. And in some database products, update operations may lock out queries to specific tables or rows. When deciding what to index and what not to, weigh query performance against update performance based on business priorities.
Take advantage of partitioning. The data and indexes of partitioned database tables are split horizontally into separate zones based on criteria such as insert dates or column values. Partitioning schemes vary considerably between databases, but their typical benefits include faster inserts, updates and queries plus easier management of large tables since many administrative functions, such as backups and rollbacks, can be applied to subsets of the overall table.
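SQLite, used for the sketches here, has no declarative partitioning, so the following simulates date-based horizontal partitioning by hand: one physical table per month behind a UNION ALL view. Real RDBMS products (for example, PostgreSQL's PARTITION BY RANGE) do the routing and pruning natively, but the payoff shown at the end, retiring a month of history as a cheap table drop instead of a huge DELETE, is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simulated horizontal partitioning: one physical table per month,
# presented as one logical table through a view.
for month in ('2024_01', '2024_02'):
    cur.execute(f"CREATE TABLE sales_{month} "
                "(sale_id INTEGER, sale_date TEXT, amount REAL)")
cur.execute("CREATE VIEW sales AS "
            "SELECT * FROM sales_2024_01 "
            "UNION ALL SELECT * FROM sales_2024_02")

cur.execute("INSERT INTO sales_2024_01 VALUES (1, '2024-01-15', 100.0)")
cur.execute("INSERT INTO sales_2024_02 VALUES (2, '2024-02-03', 200.0)")

# Queries see a single logical table...
total = cur.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 300.0

# ...while administrative work can target one partition: dropping a month
# of old history is a quick DROP TABLE, not a row-by-row DELETE.
cur.execute("DROP VIEW sales")
cur.execute("DROP TABLE sales_2024_01")
```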
Tap a database's built-in query optimizer. Relational database software typically includes an embedded tool for optimizing query performance. Examining the optimizer's execution plan for important queries can help in performance tuning. The plan will show how tables are joined and what indexes are used; with that information, either the queries or the underlying table structures can be altered in an effort to further boost performance. It sometimes can also be useful to provide optimization hints in queries when the optimizer needs to be guided to better performance.
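Here is a sketch of both halves of that advice in sqlite3, with hypothetical tables: EXPLAIN QUERY PLAN exposes the join order and index choices, and SQLite's INDEXED BY clause stands in as one example of a vendor-specific optimizer hint (the hint syntax differs from product to product).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                     customer_id INTEGER, amount REAL);
CREATE INDEX idx_orders_customer ON orders (customer_id);
""")

# The execution plan shows how tables are joined and which indexes are used.
plan = cur.execute("""
    EXPLAIN QUERY PLAN
    SELECT c.region, SUM(o.amount)
    FROM customer c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.region
""").fetchall()
for row in plan:
    print(row[-1])

# A vendor-specific hint can guide the optimizer when needed; in SQLite
# that is the INDEXED BY clause, which forces use of the named index.
hinted = cur.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM orders INDEXED BY idx_orders_customer
    WHERE customer_id = 7
""").fetchall()
for row in hinted:
    print(row[-1])
```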
There are also plenty of lower-level design tactics and techniques that can be used to tease more throughput out of relational databases -- favoring primary keys with integer data types over character-based ones, for example. What works for one organization or application might not be as effective for another. But finding the right relational database design formula will help keep corporate systems humming -- and business executives and users happy.
About the author: Rick Sherman is the founder of Athena IT Solutions, a consulting and training company that focuses on business intelligence, data warehousing and data management. He's also an adjunct faculty member at Northeastern University's Graduate School of Engineering and the author of Business Intelligence Guidebook: From Data Integration to Analytics. Email him at [email protected].