alphaspirit - Fotolia
Book excerpt: How the SAP HANA platform stores data
This excerpt from 'Implementing SAP HANA, 2nd Edition,' explains the ins and outs of row and column storage, and how SAP HANA's approach to data storage can improve compression and performance.
The SAP HANA platform holds promise for a wide variety of business use cases, but that means companies have a lot of choices to make about how to implement HANA to best suit their needs.
In the second edition of the SAP Press book, experts from SAP partner Decision First Technologies explain how the SAP HANA platform can work in the real world, with implementation advice and case studies. The book covers steps from the initial implementation to connecting HANA to a business intelligence platform.
This excerpt, from Chapter 3: Data Storage in SAP HANA takes an in-depth look at in-memory data storage in the SAP HANA platform.
3 Data Storage in SAP HANA
In this chapter, we’ll go into great detail on how data is stored in SAP HANA.
Understanding data storage in SAP HANA is an important foundation because data storage differs from traditional database management systems in a number of ways. First, we’ll start with on overview of data storage in SAP HANA to highlight these differences, and then we’ll move into all of the components that make this possible (Section 3.1 and Section 3.2, respectively). We’ll then discuss physical data modeling for SAP HANA in Section 3.3 to draw clear differences between traditional database systems and techniques and tools that are available in SAP HANA, and why it makes sense to actually think backward about a data model in certain cases. This chapter ends in Section 3.4 with a case study for data modeling using our sample organization, AdventureWorks Cycle Company.
This excerpt from Implementing SAP HANA, 2nd Edition by Jonathan Haun, Chris Hickman, Don Loden and Roy Wells is reprinted here with the publisher's permission. Learn more about this book by downloading the PDF.
3.1 OLAP and OLTP Data Storage
Storing data in SAP HANA is quite different from doing so in a traditional disk-based database. The first and most obvious point is that SAP HANA is a relational database management system (RDBMS), where data is stored entirely in memory, instead of relational data being stored entirely on spinning disks.
Storing data entirely in memory was once a revolutionary concept that first had its detractors making statements such as, “Data for an entire application or data warehouse structure would never all fit into memory.” In fact, it was such an unconventional idea that it took some time to gain ground. However, many leading vendors now have in-memory solutions and are touting both the in-memory platform and stance for the same reason SAP sought to use this strategy in the first place -- unbelievable performance. Data loaded into SAP HANA and consumed by external applications performs at an unbelievable speed -- almost as if the data were staged for a demonstration. The response times are simply too fast.
SAP HANA Real-World Performance: Exhibit A
In our lab at Decision First Technologies, we took data from a customer paired with the SQL produced by an SAP BusinessObjects Web Intelligence report and placed the supporting data in SAP HANA. We then took the underlying query provided by the Web Intelligence report and ran it at the command line against the SQL Server database. The original SQL Server–based query runtime? More than an hour. The query was tuned, and the data was optimized in the SQL Server database, but the query was, frankly, quite complex, and the data volume was large. The report was critical to the customer’s business, so more than an hour of runtime was simply too long to wait for the data.
As a proof of concept, we moved the data to SAP HANA for the customer, used the same exact SQL from the Web Intelligence report. We did not tune the database tables or structures for SAP HANA; we merely ported the data from SQL Server to SAP HANA.We did not tune the query. This was simply a copy-and-paste exercise. The new SAP HANA query runtime? Four seconds.
Although we did absolutely nothing to the data or the report, the runtime was immediate. Needless to say, this was a compelling story for the customer, even before we invoked the modeling techniques that exploit the storage and engine processes in SAP HANA (we’ll discuss these later in this chapter).
The example in the preceding box is a real-world result that this particular customer would benefit from immediately just by porting its data to SAP HANA. These are the incredible performance benefits of in-memory computing that SAP has not been shy about touting -- and rightfully so.
However, as with any great software platform, a developer must consider the needs of the platform and embrace techniques that envelop all of its strengths. This is where a gap has existed in the discussion of SAP HANA. SAP HANA simply performs so well that it allows some sloppiness in the design and still performs at an incredible pace. We believe that you can avoid this sloppiness by merely taking a step back and catering the pillars of the development effort to the needs and special characteristics native to the SAP HANA platform. As you weigh design considerations at the onset of the project, begin by considering how you want to store the data in the architecture that is unique to SAP HANA. In this section, we’ll prepare you for these considerations by introducing you to the spinning disk problem, and then talk about how this problem can be combated with some of the unique features that SAP HANA brings to the development effort.
3.1.1 The Spinning Disk Problem
Spinning disks have been a performance bottleneck ever since they were introduced. The closer the disk is to the CPU, the faster data is rendered, searched, sorted, and processed; in SAP HANA, you take the physically spinning disk completely out of the equation to fully maximize this realization. Take, for instance, the process flow of information in a typical system and database:
1. Data is collected from an application via user input from a screen or form.
2. Data is passed to the database in a process known as an input/output (or I/O) transfer of information.
3. Data may be written to or read from a cache in memory on the database server.
4. Data is finally stored on a spinning disk.
I/O transfers performed without the use of a cache can take much longer to process. Factors that contribute to extra time include physical disk platter spinning rates, time needed to move mechanical components of the drive heads to read the disk platter, and numerous other factors that are inherent to this disk-based process and that add additional latency. This is a rather archaic process that hasn’t changed greatly since the onset of computing. Conventional database systems try to improve on this by targeting specific systems that provide disk caching controllers.
Caching data is a method used to speed up this process of data access from a spinning disk, and all of the major database vendors work closely with the disk manufacturers to tune the needs and specific requirements of the database I/O processing needs. In most cases, the database vendors seek to exploit caching techniques to limit that final disk interaction as much as possible. This is simply to avoid the native issues present with disk seek and write times by using the various optimizations of the caching controllers. This is all an effort to work around the slowness of the disk, whose performance can be maximized only so far.
3.1.2 Combating the Problem
Many technologies that we rely on today were invented to work around the inherent slowness caused by the disk. Take, for instance, online analytical processing (OLAP) technologies (which enable faster read performance by physically restructuring the data), online transaction processing (OLTP) technologies (whose goal is to make writing data to disk as fast as possible), or, finally, column storage technologies (whose goal is compression to both minimize storage and increase the speed of access to the data). The important thing to keep in mind is that all of these technologies, at their core, were designed around the spinning disk and its native challenges. We’ll introduce each of these technologies briefly and then talk about how they all fit into SAP HANA.
OLTP Storage Methods
An OLTP, or relational database, stores data in a normalized fashion at its core. Data is normalized to reduce redundant data and data storage patterns to optimize precious disk space and make the writing of that data to disk as fast as possible. Without techniques to minimize the storage factor, relational databases, by nature, use lots of space to store these redundant values. Consider Figure 3.1, which shows a typical normalized RDBMS table structure that’s been designed to reduce redundant data storage.
Data is normalized or reduced into multiple tables so that repeating values are removed into multiple tables to store repeating values once and contain a pointer to those repeating values. For example, in Figure 3.1, SALE_HEADER records are normalized into their own table instead of just storing the columns into the SALE_HEADER table. This concept is the pinnacle of an OLTP system. This is simply the design principal on which OLTP systems are based.
There is nothing wrong with this design for inserting or storing data in conventional RDBMS systems. In fact, for this purpose, it’s quite good. (There is a reason this methodology is the way the world stores its data!) However, there is one fundamental problem with this system: getting data out.
Retrieving data from an OLTP system requires multiple joins and combinations of various related tables. This is expensive in terms of processing in these database designs. Often, reporting in these systems is certainly an afterthought. It is problems like this one—combined with the slowness and natural speed impediment—that many technologies evolve to solve. Techniques such as OLAP technologies were invented to solve this problem.
OLAP Storage Methods
OLAP data storage methods were conceived to combat slowness caused by both data access to disk and the way that data was stored in conventional relational databases, as just described. Technologies such as OLAP data storage physically store the data in a different way because traversing a relational database on disk isn’t exactly the fastest solution for reading or retrieving data. Figure 3.2 shows this alternative data storage in an OLAP database, in a typical star schema (named so because of the shape the related tables resemble).
In an OLAP database, data is organized into concepts called facts and dimensions. The facts and dimensions are just standard tables, but their names denote what they store. Facts are the heart of the star schema or dimensional data model. For example, FACT_SALE is the fact table in Figure 3.2. Fact tables store all of the measures or values that will be used as metrics to measure or describe facts about a business concept. Fact tables may also contain foreign keys to the date dimension tables to allow pivoting or complex date metrics. Fact tables will be arranged with differing granularities. Fact tables could have a high granularity and be at an aggregate level, aggregating measures by calendar week or a product line, for instance, or a fact table could be at the lowest level of granularity: a transaction line from a source system or combined source systems. Fact tables also contain foreign keys that refer back to dimension tables by the primary key of the dimension table. A fact is the “many” side of the relationship.
Dimension tables are the ancillary tables prefixed with “DIM_” in Figure 3.2.
Dimension tables are somewhat the opposite of fact tables because dimensions contain descriptions of the measures in the form of accompanying text to describe the data set for analysis by labeling the data, or the dimensions are often used to query or filter the data quickly. In Figure 3.2, the DIM_CUSTOMER table provides details about customer data or attributes and is used to filter and query sales from the prospect of customer data. The same can be said for DIM_PRODUCT.
This is a dramatic solution because an entirely different table structure had to be established and created. If the modeling task symbolized in Figure 3.2 isn’t enough, another element adds to the complexity: a batch-based process is created out of necessity.
A batch-based process is needed to both load and transform the data from the
OLTP normalized data structure into the denormalized OLAP structure needed for fast querying. That batch process is typically called extract, transform, and load (ETL). An ETL process physically transforms the data to conform to this OLAP storage method.
Typical ETL Process Workflow
1. After data is extracted from one or multiple source systems, the data loads to a staging database, where multiple transformations occur.
2. Staging is a key layer where the data loses the mark of the source system and is standardized into business concepts.
3. Data is loaded into the data warehouse tables and finalized into an OLAP structure to allow for both high-performing reads and flexibility in analytical methods and ad hoc data access.
SAP’s solution for ETL data integration is SAP Data Services. SAP Data Services is a fully functional ETL and data quality solution that makes building very complex processes relatively straightforward. SAP Data Services is used to extract and transform data through complex transforms with many powerful, built-in functions.
Because it’s the primary means to provision non-SAP data into SAP HANA,
SAP Data Services plays a pivotal role in setting up data models and data storage the right way for maximum performance in SAP HANA. We’ll discuss this tool’s capabilities at length later in this book.
OLAP data structures like those shown in Figure 3.2 are the focus and primary use case of ad hoc analytical tools such as SAP BusinessObjects BI. The OLAP or star schema structure allows the tool to be quite flexible with the data in terms of drilling if hierarchies exist in the data or if you are using a date dimension (in the preceding example, this is DIM_DATE) to not only search and sort but also effortlessly provide running calculations or more complex, date-based aggregations. Analytic activities like these would be quite difficult to address in a fully normalized OLTP system. Certainly, this data storage and system design eases the burden placed by the slowness of the platform, as well as adding nice features for analytics.
Columnar Data Storage Technologies
One final data storage technology, and the one most relevant to SAP HANA, is the columnar database architecture. Columnar databases also take on the problem of working around the slowness of the disk by changing the way that data is stored on the disk. We’ll walk through the SAP HANA specifics later in this chapter, but it’s important to have a basic understanding of columnar architectures now.
Columnar databases have been around for quite some time, and the concept was certainly not invented with SAP HANA. Rather, this was a design methodology that was integrated into SAP HANA for the unique features and data storage aspects that a columnar database brings to the data storage equation. Columnar databases still use tables to store data as logical components, but the way that the data is laid out on the disk differs considerably from standard, row-store tables. Data values are gathered and stored in columns instead of rows. A very simple example is a product table with colors and product descriptions. In Figure 3.3, the data is stored in rows as it’s represented in the logical tables in the database.
Data is organized into rows in the physical storage of the data on disk. This is a great design for OLTP systems and is the standard for most of the world’s data. So, data in a column-store table would be arranged quite differently on the disk. Data is arranged by the columns of the data in Figure 3.4.
Notice that the repeating values are stored only once, to minimize the physical footprint of the data on the disk.
Column-store tables can still be relational tables and data. The difference lies in the way the data is arranged on the disk.
For reasons like these, porting an existing structure to a columnar form—while not an insurmountable task—certainly has more considerations than simply moving the data over to a different platform. As mentioned, SAP HANA mitigates some of these issues because in memory storage is so much faster. In a sense, SAP HANA masks some of these issues, but you should still consider them when you’re porting very large existing data warehouse structures that require some type of ETL process with, most often, non-SAP data.
Inserts in Disk-Based Column-Store Tables
In our lab at Decision First Technologies, we recently ported data for a data warehouse
OLAP structure from SQL Server to SAP (Sybase) IQ to take advantage of the superior compression and read technology available in columnar SAP (Sybase) IQ tables. However, we did notice some considerations that should be made in this port. These considerations are somewhat alleviated by the in-memory storage in SAP HANA, but they are still worth considering because they are in the domain of a column-based database:
- SELECT statements or reads are much faster than with a conventional row-based database. The data then loads to a staging database, where multiple transformations occur.
- Using BULK INSERT uploading data is considerably faster and should be used whenever possible, especially with large record sets.
- UPDATES or MERGE target operations are considerably slower than a conventional row-based database.
- DELETE inserts are faster when updates are needed.
The main takeaway is that SELECT SQL statements or reading data for operations such as a report do not need to be altered too much, but the ETL process will most likely require INSERTS, UPDATES, and DELETES to be altered, especially for delta or change-based loads.
Solutions Used by SAP HANA
We’ve discussed OLTP, OLAP, and columnar data storage methods and the reasons they were introduced, and SAP HANA is unique in the sense that it can be a bit of a chameleon. SAP HANA can act as any of these platforms by first physically storing data in both row and column fashions; however, even more than that, it can also act as an OLAP structure and even process data by interpreting multidimensional expressions (MDX query language). It also has a common and conventional SQL interface.
In essence, SAP takes advantage of the best of all of these platforms natively. This adaptable nature has been great for SAP because it allows SAP HANA to quickly and seamlessly be addressed under many conventional applications. If a multidimensional, cube-based application, such as SAP BW or SAP Business Planning and Consolidation (SAP BPC), needs MDX to interface data, then no problem. SAP HANA has an interface layer to behave just like a cube to the application. Most applications interact with a database via SQL, and SAP HANA is just as comfortable interfacing as SQL.
It’s important to note that most of these technologies were invented to combat the slowness of disk-based data access. But SAP HANA is different. Even though it can masquerade as any of these technologies, it’s taking on the performance problems directly. Instead of working around the issues of the platform, SAP HANA simply changes the platform altogether. It skips the disk, and data is stored directly in memory close to the CPU, where it performs better. That SAP HANA works natively as any of these technologies is merely a product-related strategy to foster adoption of SAP HANA as a platform capable of replacing existing, underlying database technologies while offering developers new and exciting ways to both access and model data. We’ll cover both accessing and modeling data in later chapters of this book.
SAP HANA goes even further in rethinking the developer’s platform by moving the various data processing layers in an application so that a developer must re-envision what he or she is trying to achieve. It’s truly a revolutionary platform.
© 2015 by Rheinwerk Publishing, Inc. Implementing SAP HANA / Jonathan Haun, Chris Hickman, Don Loden, Roy Wells. ISBN 978-1-4932-1176-0
About the authors
Jonathan Haun currently serves as the lead SAP HANA consultant and consulting manager with Decision First Technologies. He is an SAP Certified Application Associate and SAP Certified Technology Associate for SAP HANA 1.0.
Chris Hickman is a certified SAP BusinessObjects BI consultant and principal consultant at Decision First Technologies. His specific areas of expertise include reporting, analysis, dashboard development, and visualization techniques.
Don Loden is a principal consultant at Decision First Technologies with full lifecycle data warehouse and information governance experience in multiple verticals. He is an SAP Certified Application Associate on SAP BusinessObjects Data Integrator and is very active in the SAP community, speaking globally at numerous SAP and ASUG conferences and events.
Roy Wells is a principal consultant at Decision First Technologies, where he uses his 15 years of experience in system and application architecture to lead clients in the successful implementation of end-to-end BI solutions.
Ten traits to look for in an SAP consultant
Sapphire attendees react to the S/4 HANA roadmap
Tips for upgrading SAP business intelligence
SAP releases more componentized HANA