Data warehouses offer a window into an organization's historical performance and ongoing operations. To help drive decision-making, they provide business intelligence teams, data analysts and business users with information on things such as customer behavior, business trends, operational efficiency and sales. Despite the emergence of data lakes based on big data technologies, the growing need to capture and analyze business data from various source systems is keeping the data warehouse as relevant as ever, if not more so.
But before investing in a data warehouse platform as part of your data management architecture, the first step is to examine whether your organization really needs one and what business benefits it can get by implementing one. In connection with that, you must consider the different data warehouse deployment options -- enterprise-wide or departmental, as well as on premises or in the cloud.
You also need to determine if the unstructured and semistructured data commonly stored in big data systems will be a component of the data warehouse environment -- and decide whether to integrate traditional data warehousing for business intelligence (BI), enterprise reporting and online analytical processing (OLAP) with data processing and management for big data analytics. Finally, you must match your data warehousing use case to the most appropriate type of data warehouse platform.
Why are data warehouses needed?
The general concept of data warehousing is quite simple: Copies of data are regularly extracted from the operational systems that support business processes and loaded into a separate repository, where the data is consolidated and then made available for analysis and reporting. Business users and analysts can then access the data via self-service BI and analytics tools, dashboards, portals and prepared reports.
The following conditions may indicate that your organization could benefit from a data warehouse:
- The company is struggling to report effectively on business activities because required data isn't readily available.
- Business data is being copied for analysis by different departments and groups of users into spreadsheets that aren't always consistent with one another.
- Uncertainties about data quality and accuracy are causing corporate executives and business managers to question the veracity of reports.
- BI reporting against production databases delays nightly or end-of-the-month processing of transaction data, requiring the processing windows to be extended.
- Running ad hoc queries against the databases during the workday slows down operational systems, affecting internal users and customers, suppliers and other external ones.
A properly implemented data warehouse can help your organization accurately answer questions about business operations, such as what happened and why. Effective data warehousing improves data accessibility for analytics applications because it combines data from disparate locations and sources into a central repository. As data is moved to the data warehouse, it typically is also cleansed and transformed so it's consistent for analysis, which should help improve the quality of the information in query results and reports.
In addition, once the data warehouse is in place and actively being used, operational workflows likely will become more efficient because BI, reporting and analytics activity has been moved out of production databases.
Data warehouse deployment options
A data warehouse environment can differ greatly from organization to organization. From an architectural standpoint, deployments can follow multiple paths -- an enterprise data warehouse (EDW), a group of smaller data marts or a combination of those two approaches.
An EDW is architected to contain all the pertinent data from an enterprise's operational systems, and perhaps some collected from external data sources (see Figure 1). It's a single, unified repository for BI and analytics data, and it's meant to be used across all departments and business units. As a result, building an EDW is often a big undertaking, especially in large companies.
In an EDW architecture, organizations can also implement an operational data store (ODS) as an interim step between their operational systems and the enterprise data warehouse. Operational data is copied to the ODS and then extracted and loaded into the data warehouse. The ODS serves as a staging area for data that has yet to be transformed for analysis, and it can be used to run near-real-time queries that require more detailed data about recent business operations than is available in the data warehouse.
Data marts are downsized data warehouses that focus on individual business units and subject areas. Organizations often opt to build data marts when meeting specific departmental needs for BI and reporting capabilities is a priority. Instead of requiring an expansive project that encompasses the entire enterprise, data marts are more focused and can provide business benefits more quickly.
As a result, adopting the data mart approach enables an organization to develop a data warehouse architecture in an iterative way by tackling separate parts of the business one at a time instead of constructing a monolithic EDW in a single large initiative. One data mart or many of them can be deployed, depending on the organization's size and structure.
Different data marts can then be integrated with one another to create a virtual EDW or used to physically populate an EDW in organizations that decide to combine the two approaches (see Figure 2). Alternatively, organizations that start with an EDW can feed subsets of the warehoused data to data marts that they set up later for discrete business operations.
On-premises vs. cloud data warehouses
The different data warehouse deployment options are possible with each type of platform that's available to be used: conventional database management system (DBMS) software, usually based on relational database technology; specialized analytical DBMSes; data warehouse appliances that bundle together the necessary hardware and DBMS software in a single package; and cloud data warehouses.
Using the cloud for data warehousing has become a more viable choice. Several vendors have introduced cloud-specific data warehouse platforms, and the vendors that originally developed data warehouse databases for on-premises installations now offer cloud-based versions of them. The product offerings include conventional data warehouse systems that organizations run in the cloud themselves and data warehouse as a service (DWaaS) technologies that vendors deploy and manage for users.
Like other types of cloud-based systems, cloud data warehouses can reduce deployment time and provide easier scalability and increased flexibility compared to on-premises data warehouses. In addition, DWaaS environments eliminate system installation and management tasks and routine data warehouse administration work for users. There's also the potential for cost savings, but that isn't guaranteed -- IT managers need to keep a close eye on the use of data warehouses in the cloud to make sure costs don't end up being higher than expected. Also, data security and privacy issues may be a factor in decisions to deploy data warehouses on premises.
Data lake vs. data warehouse
After Hadoop and other big data technologies first emerged, the market was awash in buzz -- and confusion -- about whether they were on the verge of replacing data warehouses. But in most organizations, data warehouses aren't going away. Instead, they and big data systems usually coexist, each supporting different types of analytics uses.
A data warehouse platform typically is based on a relational DBMS and contains structured data that originates in an organization's operational and transaction processing systems. Data warehouses are primarily accessed by business analysts and executives looking to run basic SQL-based BI queries, and by BI developers who create dashboards and reports that present analytics information to business users.
Big data systems, on the other hand, typically are based on nonrelational technologies such as Hadoop, Spark and NoSQL databases. The data they contain can be structured, unstructured or semistructured and can originate in all types of internal systems, plus social networks and other external data sources. Big data analytics is aimed at pattern discovery, correlations and similar insights -- for example, predicting future trends and customer behavior based on past activities. It's usually done with complex analytical models developed by data scientists and statisticians in languages such as Python, R and Scala.
Big data technologies also enable the concept of the data lake, a repository of raw data from various sources that can be stored as is and then filtered and prepared for analysis as needed.
In the end, though, data lakes aren't direct replacements for data warehouses and data marts. In fact, there's some synergy between data warehouses and data lakes. For example, sets of structured data collected in a data lake can be moved to a data warehouse for conventional OLAP storage and analysis. Vendors have also integrated big data access into data warehouse platforms, enabling data stored in different systems and formats to be accessed from a single place.
Data warehouse use cases and scenarios
Now let's examine some use cases and scenarios that could influence your organization's data warehouse strategy and decisions on what types of platforms to deploy for warehousing projects.
First-time vs. longtime data warehouse users. Companies that are looking to expand an existing data warehouse deployment typically would do well to continue using the platform they already have in place. For organizations implementing their first data warehouse, a data warehouse appliance can be a compelling option. The chosen vendor will work with you to determine the appropriate size and configuration of the appliance, which is delivered ready-to-go. But the different types of platforms are all viable options for first-time users, and you should examine the additional use cases in this section to develop your specific requirements and deployment plans.
Company size. Large companies might choose a mainstream relational DBMS, an analytical database or a hybrid transaction/analytical processing (HTAP) system, either on premises or in the cloud. These organizations have the infrastructure and staff to successfully use such technologies. In the HTAP approach, the same DBMS can be used both to process operational transactions and for BI and analytical modeling. To support HTAP, a relational database engine may offer built-in columnar access or an add-on appliance for analytical processing.
On the other hand, smaller organizations should consider a data warehouse appliance or a cloud data warehouse service. Both options minimize administrative needs: The appliance comes preconfigured, and the data warehouse in the cloud that a DWaaS environment offers is set up and managed by the cloud service provider.
Size of the IT department. Organizations with relatively small IT departments tend to have fewer database administrators and other IT pros who can set up, tune and manage a data warehouse system. For them, it also makes sense to consider a data warehouse appliance or DWaaS offering.
Cloud usage. If your organization runs a lot of applications in the cloud, a DWaaS deployment or a self-managed cloud data warehouse platform may be your best option. Keeping business data that's generated and stored in the cloud there for data warehousing and analysis is a logical approach.
Data latency requirements. For organizations that require minimal data latency in analytics applications, optimized data warehouse appliances likely make the most sense. It isn't impossible to achieve near-real-time latency with other approaches, but you may need to augment your data warehouse environment with additional high-speed hardware and software to meet your performance objectives.