A data warehouse is a repository for data generated and collected by an enterprise's various operational systems. Data warehousing is often part of a broader data management strategy and emphasizes the capture of data from different sources for access and analysis by business analysts, data scientists and other end users.
Typically, a data warehouse is a relational database housed on a mainframe, another type of enterprise server or, increasingly, in the cloud. Data from various online transaction processing (OLTP) applications and other sources is selectively extracted and consolidated for business intelligence (BI) activities that include decision support, enterprise reporting and ad hoc querying by users. Data warehouses also support online analytical processing (OLAP) technologies, which organize information into data cubes that are categorized by different dimensions to help accelerate the analysis process.
Basic components of a data warehouse
A data warehouse stores data that is extracted from internal data stores and, in many cases, external data sources. The data records within the warehouse must contain details to make it searchable and useful to business users. Taken together, there are three main components of data warehousing:
- A data integration layer that extracts data from operational systems, such as Excel, ERP, CRM or financial applications.
- A data staging area where data is cleansed and organized.
- A presentation area where data is warehoused and made available for use.
A data warehouse architecture can also be understood as a set of tiers, where the bottom tier is the database server, the middle tier is the analytics engine and the top tier is data warehouse software that presents information for reporting and analysis.
Data analysis tools, such as BI software, enable users to access the data within the warehouse. An enterprise data warehouse stores analytical data for all of an organization's business operations; alternatively, individual business units may have their own data warehouses, particularly in large companies. Data warehouses can also feed data marts, which are smaller, decentralized systems in which subsets of data from a warehouse are organized and made available to specific groups of business users, such as sales or inventory management teams.
In addition, Hadoop has become an important extension of data warehouses for many enterprises because the distributed data processing platform can improve components of a data warehouse architecture -- from data ingestion to analytics processing to data archiving. In some cases, Hadoop clusters serve as the staging area for traditional data warehouses. In others, systems that incorporate Hadoop and other big data technologies are deployed as full-fledged data warehouses themselves.
Data warehouse benefits and options
Data warehouses can benefit organizations from both an IT and a business perspective. For example:
- Separating analytical processes from operational ones can enhance the performance of operational systems and enable data analysts and business users to access and query relevant data faster from multiple sources.
- Data warehouses can offer enhanced data quality and consistency for analytics uses, thereby improving the accuracy of BI applications.
- Businesses can choose on-premises systems, conventional cloud deployments or data-warehouse-as-a-service (DWaaS) offerings.
- On-premises data warehouses offer flexibility and security so IT teams can maintain control over their data warehouse management and configuration; they're available from IBM, Oracle and Teradata as an example.
- Cloud-based data warehouses such as Amazon Redshift, Google BigQuery, Microsoft Azure SQL Data Warehouse and Snowflake enable companies to quickly scale up their systems while eliminating the initial infrastructure investments and ongoing system maintenance requirements.
- DWaaS, an offshoot of database as a service, provides a managed cloud service that frees organizations from the need to deploy, configure and administer their data warehouses. Such services are being offered by a growing number of cloud vendors.
Types of data warehouses
There are three main approaches to implementing a data warehouse, which are detailed below. Some organizations have also adopted federated data warehouses that integrate separate analytical systems already put in place independently of one another -- an approach proponents describe as a practical way to take advantage of existing deployments.
- Top-down approach: Created by data warehouse pioneer William H. Inmon, this method calls for building the enterprise data warehouse first. Data is extracted from operational systems and possibly third-party external sources and may be validated in a staging area before being integrated into a normalized data model. Data marts are then created from the data stored in the data warehouse.
- Bottom-up method: Consultant Ralph Kimball developed an alternative data warehousing architecture that calls for dimensional data marts to be created first. Data is extracted from operational systems, moved to a staging area and modeled into a star schema design, with one or more fact tables connected to one or more dimensional tables. The data is then processed and loaded into data marts, each of which focuses on a specific business process. The data marts are integrated using a data warehouse bus architecture to form an enterprise data warehouse.
- Hybrid method: Hybrid approaches to data warehouse design include aspects from both the top-down and bottom-up methods. Organizations often seek to combine the speed of the bottom-up approach with the data integration capabilities achieved in a top-down design.
Data warehouses vs. databases vs. data lakes
Databases and data lakes are often confused with data warehouses, but there are important differences between them. While data warehouses typically store data from multiple sources and utilize predefined schemas designed for data analytics, an operational database is generally used to capture, process and store data from a single source, such as a transactional system, and its schema is normalized. Such databases typically aren't designed to run across very large data sets, as data warehouses are.
By contrast, a data lake is a central repository for all types of raw data, whether structured or unstructured, from multiple sources. Data lakes are most commonly built on Hadoop or other big data platforms. A schema doesn't need to be defined upfront in them, which allows for more types of analytics than data warehouses, which have defined schemas. For example, data lakes can be used for text searches, machine learning and real-time analytics.
Data warehouse innovations throughout history
The concept of data warehousing can be traced back to work conducted in the mid-1980s by IBM researchers Barry Devlin and Paul Murphy. The duo coined the term business data warehouse in their 1988 paper, "An architecture for a business and information system," which stated:
"The [business information system] architecture is based on the assumption that such a service runs against a repository of all required business information that is known as the Business Data Warehouse (BDW). ... A necessary prerequisite for the physical implementation of a business data warehouse service is a business process and information architecture that defines (1) the reporting flow between functions and (2) the data required."
Bill Inmon, as he is more familiarly known, furthered data warehouse development with his 1992 book Building the Data Warehouse, as well as by writing some of the first columns about the topic. Inmon's top-down design method for building a data warehouse describes the technology as a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports an organization's decision-making process.
The technology's growth continued with the founding of The Data Warehousing Institute, now known as TDWI, in 1995, and with the 1996 publication of Ralph Kimball's book The Data Warehouse Toolkit, which introduced his dimensional modeling approach to data warehouse design.
In 2008, Inmon introduced the concept of data warehouse 2.0, which focuses on the inclusion of unstructured data and corporate metadata.