Download the presentation: Storage Processors Accelerate Database Operations
00:02 Mark Mokryn: Hi. Welcome to "Session D-5: Storage Processors Accelerate Data Workload." Our company is Pliops. We're based in Tel Aviv and San Jose, California. We're about 60 employees with deep knowledge and experience in databases, storage and semiconductors and we're currently in evaluation by over 10 tier-one cloud server and storage providers with our initial product.
00:33 MM: I am Mark Mokryn. I am VP Product of the company and Eddy Bortnikov is VP Technology and we'll be presenting this today.
00:46 MM: What's the problem that we're really addressing? In the past few years, we've been reaching a new storage-compute bottleneck. Just a few years ago, storage was based on hard drives, which were capable of about 100-200 IOPS and were giving latencies of about 10 milliseconds, and now people are used to NVMe SSDs with half a million IOPS and latencies in the tens of microseconds. So, the performance of the underlying storage has gone up by many orders of magnitude.
01:22 MM: On the other hand, the CPU performance, well it used to be doubling just not too long ago, right now, it's flattening out and at the current rate, CPU performance is doubling about once every 20 years. So, this phenomenon, the increasing gap, the greatly increasing gap between storage performance and CPU performance, is creating a need for essentially new storage architectures.
01:49 MM: Essentially, most storage software today is really based on hard drive technologies and this is just not keeping up with flash. Now, some differences between hard drives and flash. Flash is expensive while hard drives are, of course, a lot cheaper. So, number one, you want to compress your data on flash in order to save costs. However, compression and object management are typically very expensive in both CPU and DRAM costs.
02:24 MM: Additionally, flash does not behave like hard drives. Flash is asymmetrical in read performance and write performance. Generally, the write performance is much slower, especially the random write performance, which generates a tremendous amount of underlying garbage collection inside the SSDs.
02:43 MM: So in order to gain the max performance, what we really want to do is, number one, to compress the written data in order to write less, and number two, to write the data sequentially, which increases the SSD performance since it decreases the amount of garbage collection that needs to be done by the drive.
03:04 MM: The problem is exacerbated when we get to denser and denser technologies. For example, QLC experiences much worse random write performance, of course, than TLC and of course with PLC, it'll be even much, much worse than that. If you tack on protection such as RAID 5 or RAID 6, you increase the random write workload due to the read-modify-writes of the RAID and the performance becomes unbearable, which really means that people are not using RAID 5 and RAID 6 with flash technologies.
03:42 MM: Now, due to the expense of flash, disaggregation is touted as a good solution for that and it is in many cases. However, it must also be taken into account that disaggregation for flash comes with a serious cost. If we compare, for example, disaggregated storage with hard drives, then the network latency is really a very small fraction of the hard drive latency. We're talking network latencies in a few . . . in 100 microseconds, something like that. That's very small compared to a hard drive but for flash, this is a doubling of the flash latencies, so disaggregation does have its cons as well.
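The latency comparison above can be worked through with rough numbers. This is a back-of-the-envelope sketch only: the hard drive and network figures come from the talk, while the 100 microsecond flash latency is an assumption chosen to match the "doubling" remark, and the function name is hypothetical.

```python
# Rough figures; the flash latency is an assumption, not a measured value.
HDD_LATENCY_US = 10_000      # about 10 ms per hard drive access (from the talk)
NETWORK_LATENCY_US = 100     # about 100 microseconds per network hop (from the talk)
FLASH_LATENCY_US = 100       # assumed, so that the network hop doubles the latency

def relative_overhead(device_latency_us, network_latency_us=NETWORK_LATENCY_US):
    """Fraction by which the added network hop increases end-to-end latency."""
    return network_latency_us / device_latency_us

hdd_overhead = relative_overhead(HDD_LATENCY_US)      # about 1% extra on a hard drive
flash_overhead = relative_overhead(FLASH_LATENCY_US)  # about 100% extra on flash

print(f"HDD: +{hdd_overhead:.0%}, flash: +{flash_overhead:.0%}")
```

With faster flash (latencies in the tens of microseconds), the same network hop would cost proportionally even more.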
04:30 MM: As far as data efficiency, if we look here in gray on the left-hand side, we can see the size of the initial raw data set. When it's stored on fixed sized block storage, such as drives or flash, there is typically quite a bit of data fragmentation, for example, lots of free space inside B-Tree pages.
04:54 MM: Then additionally, on top of that, we over-provision the storage for future growth and also for free space and for the file system and so forth. Then if we want to protect the data with, for example, RAID 1, then we double that capacity. So, what happens in most DAS deployments is that the actual storage capacity greatly exceeds the data set size. Now, disaggregated storage often compresses the data and is also thinly provisioned, but latencies go up as mentioned earlier.
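To make the stacking of these overheads concrete, here is a small illustrative calculation. The fill factor and headroom values are hypothetical placeholders, not figures from the talk; only the RAID 1 doubling is taken directly from it.

```python
def provisioned_capacity_tb(raw_tb, page_fill=0.7, headroom=0.3, raid1=True):
    """Estimate the capacity that must be purchased for a given raw data set.

    page_fill: assumed fraction of each fixed-size page holding real data
               (the rest is internal fragmentation, e.g. free space in B-Tree pages).
    headroom:  assumed fraction of the volume kept free for growth and file system use.
    raid1:     mirroring doubles whatever capacity is needed.
    """
    on_disk = raw_tb / page_fill                # fragmentation inflates the footprint
    with_headroom = on_disk / (1.0 - headroom)  # over-provisioning inflates it further
    return with_headroom * (2 if raid1 else 1)

# Under these assumed values, 1 TB of raw data needs roughly 4 TB of drives.
print(f"{provisioned_capacity_tb(1.0):.2f} TB")
```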
05:39 MM: So, what are some current approaches to optimizing data access on top of flash? Number one, there is of course . . . There is software such as RocksDB, which is written with awareness of flash capabilities. For example, with RocksDB, the technology knows to take advantage of the high parallelism capabilities of flash, with many outstanding read requests.
However, still with RocksDB, you cannot fully take advantage of the sequential performance of flash since there are multiple software processes going on. For example, there could be multiple databases running on the same server, and there could be multiple simultaneous compactions happening in RocksDB. So, we're not really optimizing for flash performance since we're doing, essentially from the FTL perspective, random writes.
06:43 MM: Additionally, the data management that is required for software such as RocksDB, for example the concurrent compactions, is extremely CPU-intensive in order to gain the space efficiency, and of course this type of software requires application-specific integration. So, if you want to use a technology such as RocksDB, you need to integrate this into your database or into your application.
07:14 MM: Another approach is a transition to flash-specific interfaces, such as Zoned Namespaces (ZNS) and Open-Channel. So, these interfaces are essentially trying to take advantage of, or actually they even mandate, using flash optimally, writing to flash sequentially. However, they require custom integration. You cannot run standard file systems such as XFS and standard applications on top of ZNS or Open-Channel drives.
07:50 MM: So essentially, there is no good solution today for high-performance, cost-effective standard applications, standard file systems, on top of flash but that is until now. With Pliops, we're coming out with a storage processor, which enables high-performance capacity expansion, data protection, enhancing the endurance of flash devices.
08:23 MM: The Pliops card is essentially a key-value-based technology. We provide a KV user-space library API and we also provide a standard block interface, which can be used by any application and with any file system. Inside the processor, we do compression. We're doing zstd compression, which is fully hardware offloaded, and the core IP, really, of the Pliops Storage Processor is the KV store, where we do the indexing, the merging of objects, the packing, sorting and garbage collection. In addition, we do RAID 5 or RAID 6 at no performance cost and also encryption with AES-256.
09:23 MM: As far as the solution overview, it is KV-based hardware acceleration with an extremely low cost as far as DRAM required per object, only two bytes per object, which is about 85% lower than the best competing solution . . . software solution that we're aware of. What this really means is that we can index a tremendous number of objects with a very low memory footprint and thus, we can guarantee a single flash access per read, which leads to very low tail latencies.
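The DRAM figure can be turned into a quick sizing estimate. This sketch assumes the 85% reduction refers to index bytes per object, which implies a software baseline of roughly 13 bytes per object; that baseline is derived here, not stated in the talk, and the helper name is hypothetical.

```python
PSP_BYTES_PER_OBJECT = 2     # from the talk
CLAIMED_REDUCTION = 0.85     # "about 85% lower", assumed to mean bytes per object

# Implied baseline: 2 bytes is 15% of the software figure, so roughly 13.3 bytes.
implied_software_bytes = PSP_BYTES_PER_OBJECT / (1.0 - CLAIMED_REDUCTION)

def index_dram_gib(num_objects, bytes_per_object):
    """DRAM needed to index num_objects at a given per-object cost, in GiB."""
    return num_objects * bytes_per_object / 2**30

objects = 10 * 2**30  # hypothetical example: about 10.7 billion objects
print(f"2 B/object index:  {index_dram_gib(objects, PSP_BYTES_PER_OBJECT):.1f} GiB")
print(f"implied baseline:  {index_dram_gib(objects, implied_software_bytes):.1f} GiB")
```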
10:05 MM: There is hardware-offloaded indexing, garbage collection, compression and encryption, and the key thing is that we support any SSD. We support TLC, QLC, Optane, as well as ZNS and Open-Channel. Essentially, all common flash technologies and SSDs from any vendor are supported. We do log-structured writes. There are no random overwrites. So essentially, we perform random application writes at sequential SSD performance and because we do the log-structured writes, we can also do RAID 5 and RAID 6 without any read-modify-writes.
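The difference between a full-stripe write and a read-modify-write can be sketched with generic XOR parity. This is a textbook RAID 5 style illustration with hypothetical function names, not the Pliops implementation, whose internals are not described in the talk.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def full_stripe_write(data_blocks):
    """Log-structured case: the whole stripe is new, so parity is computed
    from data already in memory. Cost: N writes, zero reads."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def small_write_parity(old_data, old_parity, new_data):
    """In-place overwrite case: updating one block requires reading the old
    data and old parity first. Cost: 2 reads + 2 writes per small write."""
    return xor_blocks([old_parity, old_data, new_data])

# Three data blocks plus one parity block.
d0, d1, d2 = b"\x01\x02", b"\x04\x08", b"\x10\x20"
stripe = full_stripe_write([d0, d1, d2])

# Any single lost block can be rebuilt by XOR-ing the survivors.
rebuilt_d1 = xor_blocks([stripe[0], stripe[2], stripe[3]])
```

When every write fills a complete stripe, the parity never has to be patched in place, which is why sequential, log-structured writing avoids the extra reads.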
10:52 MM: Now, since we're providing a block device driver on top of the standard key-value interface, essentially we can support any application on any file system, and we're seeing a lot of benefit with relational databases, with NoSQL databases, with providing back ends to software-defined storage devices and also with analytics applications as well. So, we do have the block device driver and we also support direct integration of the key-value API for application-specific enhancements, as for example we did with Redis, as will be discussed later.
11:45 MM: Now, as far as dynamic capacity expansion, we of course . . . We compress the data and then we store the data. We pack it, essentially with no gaps between the objects so there is no internal fragmentation and on top of this, when using the block storage API, we also thinly provision the volume as well. So really, we enable you to use the full capacity of the SSDs and to oversubscribe on the SSDs in order to utilize the full capacity and importantly, this is done at maximum performance.
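A minimal sketch of the "compress, then pack with no gaps" idea follows. It is a toy append-only log with an in-memory index, using hypothetical names; zlib stands in for zstd only so that the example runs with the Python standard library, and garbage collection of overwritten objects is left out.

```python
import zlib

class PackedLog:
    """Toy append-only store: compressed objects are laid end to end."""

    def __init__(self):
        self.log = bytearray()   # stands in for sequentially written flash
        self.index = {}          # key -> (offset, compressed length)

    def put(self, key, value):
        blob = zlib.compress(value)
        # Append at the tail: no padding to a block boundary, no overwrite in place.
        self.index[key] = (len(self.log), len(blob))
        self.log += blob

    def get(self, key):
        offset, length = self.index[key]
        return zlib.decompress(bytes(self.log[offset:offset + length]))

store = PackedLog()
store.put("row:1", b"a" * 1000)
store.put("row:2", b"b" * 3000)
packed_size = len(store.log)   # far smaller than two 4 KiB blocks for this data
```

An overwrite of an existing key simply appends a new version and repoints the index, leaving the old bytes behind as garbage to be reclaimed later.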
12:27 MM: So, for example, most people take an SSD and then consciously choose to use just a fraction of that SSD in order to get good performance. You don't need to do that with us. You can use the full available capacity at a very high and consistent performance.
12:55 MM: As far as the drive failure protection, we can support RAID 5, RAID 6. For multiple drive failures, we have what we call virtual hot capacity. Essentially, there is no standby drive. All the drives are used and then we just use the free space in the entire group to rebuild on top of that and essentially, because we are indexing the data and we are aware of the data, then we only need to rebuild the actual data content so we don't need to rebuild the full capacity of the drive.
13:40 MM: So, this is our cloud-optimized architecture and now . . .
13:48 Edward Bortnikov: Hi, my name is Edward Bortnikov. I am VP Technology for Pliops and today, I'll walk you through a number of studies that present the performance advantages provided by our Pliops Storage Processor platform or PSP in short.
14:07 EB: Let's start from something really simple: raw block I/O. The performance of random writes and especially their scalability with the growing parallelism of the load on top of flash drives is long known to be a sore point of the storage systems. So, in this case, we studied two systems. The first one is a RAID 5 managed by the PSP and the second one is a RAID 0 managed by a standard software solution. Both run on top of a system comprised of four 2-terabyte SSDs. We gradually changed the load from 4 to 64 concurrent tasks and looked at both the bandwidth and the four nines tail latency metrics.
15:14 EB: As we see, the software RAID capability to scale is limited and it only can deliver about 1/3 of the bandwidth delivered by the Pliops RAID 5 device, under the maximum load. Likewise, with the system's tail latency, we see that the Pliops-managed storage system provides a much faster tail latency, which of course drives higher system interactivity.
16:02 EB: All this happens thanks to the Pliops write-optimized architecture that, as you heard in Mark's presentation, transforms random writes to flash-friendly sequential writes. Needless to say, we achieve all this performance on the RAID 5 system, which delivers high availability that is not part of the RAID 0 features. We'll revisit the RAID 5 system configuration in other use cases.
16:43 EB: Next up is MariaDB, a very popular SQL database which we study under the industry-standard TPC-C workload that simulates an online transaction-processing environment. We run MariaDB under a challenging workload. Namely, we run eight instances of the database on a 40-core computer . . . Excuse me, 80-core computer over a single 2-terabyte SSD. All in all, 1,000 concurrent clients issue queries that hit the systems.
What we see here is that the Pliops-managed system delivers an unparalleled throughput and tail latency. On the throughput side, we deliver over 230 transactions per second, which is almost 20 times more than in the software-driven system and the tail latency follows suit. The three nines tail latency is more than seven times faster than in the baseline.
18:30 EB: You can also see an illustration of this run over time. As you see, the latencies provided by the Pliops-managed systems are not only low but also have a very low variance in contrast with those delivered by the baseline software system. So, we don't only manage a very high load but also can provide very predictable interactive performance.
19:15 EB: Let's get back to the RAID system. In the next experiment, we look at the . . . Again, at a system comprised of four 2-terabyte SSDs, once managed by the PSP RAID 5 configuration and the other time by the standard software RAID 0. The basic content is about four times compressible so as you see, our system ultimately consumes over three times less disk space, which is close to the content's compression potential.
20:10 EB: While it's clear that the RAID 5 implementation consumes part of the system cycles and it's also much harder to saturate the system of four flash drives, we still deliver almost twice as many queries per second as the parallel software system and our tail latency is 2.4X lower.
20:50 EB: The next experiment zooms into RAID 5 in action. Once again, we run the two systems side by side, the Pliops-managed RAID 5 and the software-managed RAID 0. In the first experiment, after a little bit more than an hour of continuous execution, we crash one of the SSDs in a controlled way. What we see is that immediately the recovery process is triggered in the background and the whole system is rebuilt in less than two hours. All this happens with less than 10% loss in the overall client perceived throughput while in the normal operation mode, the system delivers 2.5X sustained throughput compared to software RAID 0.
22:05 EB: Ultimately, that means that the client applications can run uninterrupted with only a minor impact on their perceivable performance. All of that is not possible in the world of software-managed RAID systems. It's worth noting that the balance between the rebuild speed and the loss of the application throughput is a parameter that can be tuned in either direction. In our case, it's tuned to rebuild one terabyte of data in an hour, but it can also be either increased or reduced in accordance with the customer requirements.
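The rebuild figures can be related with simple arithmetic. The sketch below assumes rebuild time scales linearly with the amount of live data on the failed drive, which follows from the earlier remark that only actual data content is rebuilt; the 60% utilization is a hypothetical example and the function name is an assumption.

```python
def rebuild_hours(drive_capacity_tb, utilization, rate_tb_per_hour=1.0):
    """Time to rebuild only the live data of a failed drive.

    rate_tb_per_hour defaults to the one terabyte per hour setting from the talk.
    utilization is the assumed fraction of the drive holding live data.
    """
    return drive_capacity_tb * utilization / rate_tb_per_hour

full_capacity_rebuild = rebuild_hours(2.0, 1.0)  # whole 2 TB drive: 2 hours
data_aware_rebuild = rebuild_hours(2.0, 0.6)     # 60% full drive: about 1.2 hours

print(f"{full_capacity_rebuild:.1f} h vs {data_aware_rebuild:.1f} h")
```

Raising the rate shortens the rebuild at the cost of more impact on application throughput, which is the trade-off described above.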
23:14 EB: We now turn to a different use case. MongoDB is a very popular NoSQL database, a document database that is used by many developers in multiple use cases. In this case too, we use a data set that is 4X compressible and we track the overall system space as well as the throughput and latency metrics. In this setting, we are interested in studying the performance of MongoDB on top of two hardware configurations. The first one is a slow QLC flash drive managed by PSP and the other is a fast TLC flash drive managed by the . . . software.
24:21 EB: First of all, we note that we deliver about 3X capacity savings, similar to the MariaDB use case, thanks to our built-in compression algorithms.
24:43 EB: On the performance side, we see the throughput delivered in two write-intensive workloads, the first containing 100% of reads followed by writes, named YCSB-F, and the second a 50:50 mix of reads and writes, named YCSB-A. In both cases, the throughput delivered by the Pliops-managed system is above 250% of the baseline, despite the fact that we are actually running on top of a much slower flash drive.
25:46 EB: As we zoom in to the tail latencies delivered by the two systems, this time both running on top of our QLC SSD, we also see that there is a decisive gap of over three times faster latency both for puts and for gets provided by the Pliops-managed system. So once again, it's not only that the system is overall faster in the sense that it can handle a lot more transactions, it also provides much higher interactivity and predictable speed.
26:46 EB: Alright, one might ask the following question: many databases including MongoDB can use software compression in order to mitigate the speed versus disk space trade-off. We decided to study what that means in terms of performance. Well, we compared the PSP-managed system versus the software with the internal compression algorithm, namely the zstd algorithm, applied. While the two systems now produce approximately a system image of . . . a data image of similar size, the performance gap is once again decisive. As you see, the Pliops-managed system delivers up to a 2.78x throughput advantage compared to the MongoDB that applies software-based compression.
28:19 EB: Let's sum up all these examples and see what you get by using the Pliops block device system. Overall, your product becomes much faster and cheaper. You get decisive improvements in capacity, throughput and latency across a range of very diverse workloads. As you saw today, we studied raw block I/O, SQL and NoSQL databases, and saw the performance gains across the board. Your product also becomes more reliable thanks to the built-in drive protection feature. With PSP RAID 5, you can enjoy the best of both worlds. The system is both more reliable and faster than the software-managed RAID 0.
29:16 EB: Moreover, the online rebuild process lets your application run uninterrupted with only a minor impact from the data reconstruction process that is going on in the background. Last but not least, with PSP you can use slower hardware and achieve a higher speed. We saw that in the experiment conducted with MongoDB running on slower QLC hardware under PSP.
So, at the end of the day, my message would be, there's no need to optimize for any specific performance metric when you're using the Pliops Storage Processor. You get all of them in one, the capacity, throughput and latency, all in one nice package, without the need for concessions or tradeoffs. Thank you very much.