Database normalization is the process of organizing data into tables in such a way that the results of using the database are always unambiguous and as intended. Such normalization is intrinsic to relational database theory. It may have the effect of duplicating data within the database and often results in the creation of additional tables.
The concept of database normalization is generally traced back to E.F. Codd, an IBM researcher who, in 1970, published a paper describing the relational database model. What Codd described as "a normal form for database relations" was an essential element of the relational technique. Such data normalization found a ready audience in the 1970s and 1980s -- a time when disk drives were quite expensive and a highly efficient means for data storage was very necessary. Since that time, other techniques, including denormalization, have also found favor.
Data normalization rules
While data normalization rules tend to increase the duplication of data, it does not introduce data redundancy, which is unnecessary duplication. Database normalization is typically a refinement process after the initial exercise of identifying the data objects that should be in the relational database, identifying their relationships and defining the tables required and the columns within each table.
Data normalization example
|Customer||Item purchased||Purchase price|
If this table is used for the purpose of keeping track of the price of items and the user want to delete one of the customers, he or she will also delete the price. Normalizing the data would mean understanding this and solving the problem by dividing this table into two tables, one with information about each customer and the product they bought and the second with each product and its price. Making additions or deletions to either table would not affect the other.
Normalization degrees of relational database tables have been defined and include:
First normal form (1NF). This is the "basic" level of database normalization, and it generally corresponds to the definition of any database, namely:
- It contains two-dimensional tables with rows and columns.
- Each column corresponds to a subobject or an attribute of the object represented by the entire table.
- Each row represents a unique instance of that subobject or attribute and must be different in some way from any other row (that is, no duplicate rows are possible).
- All entries in any column must be of the same kind. For example, in the column labeled "Customer," only customer names or numbers are permitted.
Second normal form (2NF). At this level of normalization, each column in a table that is not a determiner of the contents of another column must itself be a function of the other columns in the table. For example, in a table with three columns containing the customer ID, the product sold and the price of the product when sold, the price would be a function of the customer ID (entitled to a discount) and the specific product. In this instance the data in the third column is said to be dependent upon the data in the first and second columns. This dependency does not occur in the 1NF case.
The column labeled customer ID is considered a primary key because it is a column that uniquely identifies the rows in that table, and it meets the other accepted requirements in standard database management schema: It does not have NULL values and its values won't change over time.
In the example above, the other column headers are considered candidate keys. The attributes of those candidate keys that make them unique are called prime attributes.
Third normal form (3NF). At the second normal form, modifications are still possible because a change to one row in a table may affect data that refers to this information from another table. For example, using the customer table just cited, removing a row describing a customer purchase (because of a return, perhaps) will also remove the fact that the product has a certain price. In the third normal form, these tables would be divided into two tables so that product pricing would be tracked separately.
Extensions of basic normal forms include the domain/key normalized form, in which a key uniquely identifies each row in a table, and the Boyce-Codd normal form (BCNF), which refines and enhances the techniques used in the 3NF to handle some types of anomalies.
Database normalization's ability to avoid or reduce data anomalies, data redundancies and data duplications, while improving data integrity, has made it an important part of the data developer's toolkit for many years. It has been one of the hallmarks of the relational data model.
The relational model arose in an era when business records were, first and foremost, on paper. Its use of tables was, in some part, an effort to mirror the type of tables used on paper that acted as the original representation of the (mostly accounting) data. The need to support that type of representation has waned as digital-first representations of data have replaced paper-first records.
But other factors have also contributed to challenging the dominance of database normalization.
Over time, continued reductions in the cost of disk storage, as well as new analytical architectures, have cut into normalization's supremacy. The rise of denormalization as an alternative began in earnest with the advent of data warehouses, beginning in the 1990s. More recently, document-oriented NoSQL databases have arisen; these and other nonrelational systems often tap into nondisk-oriented storage types. Now, more than in the past, data architects and developers balance data normalization and denormalization as they design their systems.
Database normalization tools
Data modeling software can incorporate features that help automate preparing incoming data for analysis. IT managers still need to develop a plan to address common problems, including data normalization. Vendors in data normalization include 360Science, ApexSQL and many other smaller niche developers.