Tech Accelerator Flash Memory Summit 2020

The Evolution of Data Centers in the Data-Centric Era

Data processing units (DPUs) can break through storage roadblocks. Hear more from Pradeep Sindhu, CEO and co-founder of Fungible.

Download the presentation: Evolution of data centers in the data-centric era

00:05 Pradeep Sindhu: Hi, I'm Pradeep Sindhu, CEO and co-founder of Fungible Inc. Let's begin by recounting the top data center problems today. Number one, the power and space footprint of these data centers is way too large. Second, there are way too many server variants, and as a result there is a lot of complexity in managing data centers; volume economics also suffers from all these variants. Finally, security is either weak or expensive, or both.

Now, we can trace the root cause of all of these problems to an outdated compute-centric server architecture. The fact is that in 1987, CPUs were some 20 times faster than networks and it made complete sense to put I/O drivers in x86 software. Well, if you fast forward to 2020, CPUs are some 30 times slower on a per-core basis than networks. And I/O drivers are still largely in software, inside the operating system.

01:08 PS: Now, this makes absolutely no sense, given that the I/O-to-CPU speed ratio has increased somewhere between 60 times and 600 times over the last 33 years. The consequence of this outdated compute-centric architecture in a data-centric world is that software cannot keep up with the I/O.

In fact, over half the cores are lost doing either inefficient things or simply being multiplexed to death. An I/O workload is also an interfering workload: it thrashes caches, so applications, even those not interacting heavily with I/O, will not perform well, because the contents of the caches, both code and data, are churned over by I/O transactions.

Now, another problem in this architecture is that local resources are stranded. They can be accessed very efficiently by the local CPUs, but they cannot be accessed by remote CPUs, which is what you want to do when you're in a scale-out environment.

02:14 PS: There are also many serious security problems caused by vulnerabilities inside operating systems and the hypervisor. And of course, as we pointed out earlier, given the large number of resources inside these servers and the desire to optimize each server to a given workload, it causes the number of server variants to absolutely explode.

It turns out that certain headwinds in applications and technology are going to make this problem much worse. First, network and storage speeds continue to increase faster than CPU speeds. Second, new workloads like AI, machine learning and analytics need to access increasingly large data sets, which exacerbates East-West traffic, because this data cannot be kept on one node but has to be sharded across many, many nodes. The number of node types continues to increase as people try to optimize this compute-centric architecture more and more, and security threats are still accelerating.

03:10 PS: Finally, it's been clear now for almost a decade that Moore's law is slowing down, and probably within the next two nodes of CMOS, will flatten. These headwinds actually threaten to destroy the agility and volume economics of scale-out architectures. This is actually ironic given the fact that the very reason for going to scale-out architectures was to provide agility, as well as much better economics.

So we'd like to introduce the concept of Fungible data centers. In these data centers, the idea of data-centricity will drive the architecture. What you saw in the evolution of computing is that, prior to 1947, people put computers together and wired them up by hand; then you had an explosion of different machine types, which eventually converged on a single type of microprocessor, the x86. And then, about 15 years ago, people invented the idea of scale-out of general-purpose CPUs, which we call compute-centric.

04:25 PS: And then in the data-centric era, we believe that you will have hyper-disaggregated compute nodes, each of which is specialized to do a particular kind of task, but the total number of node types will still be very, very small. So a Fungible data center is a data center that is hyper-disaggregated in that the resource types in the data center are divided into four or five different kinds, and the server nodes that implement these resources are separated by a high performance fabric that we call TrueFabric. These resources are put together by a piece of software, and that act of putting these resources back together, we call "composition."

A Fungible data center is really one in which hardware provides the disaggregation of resources, and software provides the composition. You might be wondering, "If hyper-disaggregation is so interesting, why has it not happened up until this point?" Well, the reason is that the current compute-centric architecture poses two fundamental problems, and neither of these problems had been solved prior to Fungible coming on the scene.

05:35 PS: The first problem is that data interchange between server nodes in a data center is inefficient. Second problem is that the execution of certain critical data-centric computations inside server nodes is inefficient.

Now, it's important to define precisely what a data-centric computation is. By this definition, a data-centric computation is one where all work arrives at a server in the form of packets on a wire; packets from many, many different contexts are intermixed; I/O dominates arithmetic and logic, unlike in regular applications; and, finally, the workload requires modifying large amounts of state. The Fungible DPU solves both of these problems, and also provides fundamental improvements to agility, security and reliability. A good way to look at the Fungible DPU is that it is an infrastructure processor, the third socket.

06:45 PS: There are two types of application processors already in data centers: one is called the CPU, and the other the GPU. These two microprocessors are responsible for running applications, but they don't perform infrastructure workloads efficiently. Examples of these infrastructure workloads are the network stack, the security stack, the storage stack and the virtualization stack.

The Fungible DPU, on the other hand, is designed to do just those applications, and to do them much more efficiently than either CPUs or GPUs, perhaps as much as 10 to 20 times more efficiently. So, what's very important to say is that the Fungible DPU is able to do these things without having to change a single line of either application code or operating system code. Secondly, the Fungible DPU also comes with industry-standard interfaces, so in fact, insertion is very easy. And finally, because the Fungible DPU is fully programmable, we don't compromise anything at all. We also provide agility. So, to summarize, the Fungible DPU enables hyper-disaggregation.

07:56 PS: What I've shown here is an example where a data center is built using four different server types, each of which contains a Fungible DPU connecting that server type to the network and forming a TrueFabric to which all these servers are connected.

The first server type is a general-purpose x86 with DRAM and a DPU. The second one is a high-performance storage node with SSDs and a DPU. The third one is a set of JBODs and a DPU. And the last one is a GPU server, an AI or ML server, which has some number of GPUs and a DPU. Now, you can see that with these four building blocks, I can make an entire data center by replicating each of these node types the appropriate number of times.

08:46 PS: As I mentioned, hardware is responsible for hyper-disaggregation, but it is software that enables composability. So, the key element that enables data center composability is what we call a Fungible Data Center Composer. This is a piece of centralized software whose job it is to understand all of the resources that are available in the data center, and then to combine these resources into what we call bare-metal virtualized data centers, which are kind of ships in the night that operate exactly the same way that a physical data center would operate, with the same levels of performance.

But the important thing is that these bare-metal virtualized data centers could be stood up in less than a few minutes, and you could increase and decrease the resources that are inside these as you like. So, what you get as a result is a simple, standardized infrastructure with no wasted resources. You also get very rapid deployment, essentially one-click deployment from a marketplace of virtualized resources, and you get an infrastructure that is enabled for multiple tenants, where these tenants are separated by very, very strong security guarantees.

10:02 PS: Let's turn next to the value proposition of a Fungible data center. Fungible data centers have the potential to provide TCO improvements of up to 12x. A factor of three of this comes from improved efficiency, and a factor of four comes from eliminating resource silos that exist, especially in enterprise private data centers.

So, let's look at where the factor of three comes from. If you take a look at a standard data center today and at how its Capex is distributed between networking, compute and storage, it's roughly 45% in compute, 45% in storage and 10% in networking. What the Fungible DPU does is improve the network economics, in fact the price-performance, by a factor of three, because it enables us to run the network at least three times more efficiently than standard TCP/IP, where you're using ECMP as the technique for sending packets from one node to another.

11:08 PS: On the compute server side, the Fungible DPU improves efficiency by approximately a factor of two, because the data-centric workloads, for example virtualization, the storage stack, the network stack and the security stack, are all offloaded from general-purpose CPUs and run much, much more efficiently, perhaps as much as 20 times more efficiently, on the DPU. You end up freeing almost half the resources of the x86 general-purpose CPU, so the $45 spend drops to $22.50.

11:47 PS: And then finally, storage is where the Fungible DPU provides the most benefit of these three. For high-performance storage especially, it's customary to provide durability by making multiple copies. Typically, people make three copies because they want to be resilient to two failures. Since the Fungible DPU has built-in inline erasure coding at very, very high performance, we do not need to spend a factor of three on replication; we can do it much, much more efficiently. For example, if you use something like a 20,2 erasure code, you only have to spend about a factor of 1.1, or 10% overhead (a 16,4 code costs 25%). So that saves you almost a factor of three.
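The overhead arithmetic here is easy to check. A minimal sketch in plain Python (not Fungible's implementation, just the ratio a k+m erasure code implies):

```python
def storage_overhead(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of user data for a k+m erasure code.

    The code tolerates the loss of any parity_shards shards.
    """
    return (data_shards + parity_shards) / data_shards

# 3-way replication behaves like a 1+2 code: survives 2 failures, costs 3.0x.
print(storage_overhead(1, 2))   # 3.0
# The erasure codes mentioned in the talk survive 2 or 4 failures
# at a fraction of that cost.
print(storage_overhead(20, 2))  # 1.1  (10% overhead)
print(storage_overhead(16, 4))  # 1.25 (25% overhead)
```

Both codes keep the two-failure (or better) resilience of triple replication while cutting the raw-capacity bill by roughly the factor of three the talk claims.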

12:38 PS: The Fungible DPU also has inline compression, which is lossless, of course. This lossless compression algorithm compares with the best known algorithms, which run in software with very, very large histories, so we get a compression ratio anywhere between a factor of three and sometimes a factor of eight or more. The difference is that our compression algorithm is implemented in hardware, so it runs very, very fast.

So, what happens is that your cost per bit actually drops by a combination of this factor of three from compression, and a factor of almost three, let's say two and a half, from erasure coding. And so that's a factor of seven and a half, so your $45 spend drops to about $6.

Now, when you sum all these things up, you come up with something around $30. So, your $100 spend in aggregate drops to about $30, that's a 3x improvement. In today's world, getting a 3x improvement on Capex is actually a stunning amount of improvement, but what is more, this same level of improvement will also be seen in the power consumption portion of Opex. Now, these improvements do not take into account other potential benefits you could get by deduplication, that is of course something that will come on top.
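The Capex arithmetic above can be replayed in a few lines. The dollar split and improvement factors below are exactly the ones quoted in the talk, nothing more:

```python
# Capex split of a standard data center, per $100 of spend (as quoted above).
capex = {"compute": 45.0, "storage": 45.0, "network": 10.0}

# Improvement factors claimed for the DPU: 2x on compute (offloading the
# infrastructure stacks), 3x * ~2.5x = 7.5x on storage (compression plus
# erasure coding), and 3x on the network.
factor = {"compute": 2.0, "storage": 7.5, "network": 3.0}

after = {k: spend / factor[k] for k, spend in capex.items()}
total = sum(after.values())

print(after)            # compute 22.5, storage 6.0, network ~3.33
print(round(total, 1))  # ~31.8: roughly $30 of the original $100, i.e. ~3x
```

The residual network spend of about $3.33 is what rounds the aggregate up from the $28.50 of compute plus storage to the "around $30" figure in the talk.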

14:01 PS: Now, I mentioned earlier that you get almost a factor of four by removing resource silos, by doing hyper-disaggregation over a TrueFabric. Where do I get these numbers from? Hyperscale deployments typically run at somewhere between 30% and 40% utilization, while a typical enterprise data center runs somewhere between 5% and 10% utilized, despite the fact that people are running these data centers virtualized. So, there's clearly a factor of four of potential improvement sitting out there. The reason the hyperscalers are able to get these improvements is that they have eliminated some of these resource silos, not all of them, and therefore they're able to run their facilities at higher utilizations.

I mentioned to you that a factor of four is possible by doing resource pooling, and I gave you the example of enterprise data centers versus hyperscale data centers. Well, there's another way to look at this: if you go back to Peter Denning's Ph.D. thesis, he proved, under fairly liberal conditions, that if I have N resource silos and I eliminate these silos and combine the resources into a single pool, you can get a benefit of almost the square root of N. So that's another way to look at how pooling provides resource efficiencies.
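The statistical intuition behind that pooling benefit can be sketched with a small Monte Carlo experiment (the demand distribution below is an illustrative assumption, not data from the talk): each silo must be provisioned for its own peak, while a single pool only has to cover the peak of the combined demand, and that combined headroom grows like the square root of N rather than linearly.

```python
import random

def required_capacity(samples, quantile=0.999):
    """Capacity needed to cover demand at the given quantile."""
    return sorted(samples)[int(quantile * (len(samples) - 1))]

random.seed(0)
N, T = 16, 50_000
# N silos with independent demand: mean 100, stddev 30 (illustrative numbers).
demand = [[max(0.0, random.gauss(100, 30)) for _ in range(T)] for _ in range(N)]

# Siloed: provision each silo for its own 99.9th-percentile peak.
siloed = sum(required_capacity(d) for d in demand)
# Pooled: provision one shared pool for the combined demand's peak.
pooled = required_capacity([sum(col) for col in zip(*demand)])

print(round(siloed / pooled, 2))  # noticeably > 1: pooling needs less capacity
```

The savings show up entirely in the headroom above mean demand, which is exactly the part that sits idle in a siloed deployment.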

15:35 PS: Latency is also very important in data centers, particularly in regards to how it affects user perception, that is, the perception of end users as they use the services offered by a data center. When an end-user request comes into a data center, it typically sparks off a flurry of internal requests to microservices, and if your internal latency, the East-West network latency, has a long tail, as shown in the picture on the bottom left, the end-user perception of latency will be very poor.

So, these long tail latencies are highly undesirable. What Fungible's TrueFabric is capable of doing is actually having very, very deterministic latency, which means that you don't see this long tail and you see very predictable latency as far as end-users are concerned.
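The fan-out effect that makes tail latency so punishing is easy to quantify (this is the general tail-at-scale argument, not a model of TrueFabric specifically): if each internal call has even a small chance of landing in the tail, a request that fans out to many microservices almost always hits it.

```python
def p_request_hits_tail(p_slow_per_call: float, fan_out: int) -> float:
    """Probability that at least one of fan_out internal calls is slow."""
    return 1 - (1 - p_slow_per_call) ** fan_out

# A 1% tail per service is invisible for a single call...
print(round(p_request_hits_tail(0.01, 1), 2))    # 0.01
# ...but a request fanning out to 100 microservices is slow ~63% of the time.
print(round(p_request_hits_tail(0.01, 100), 3))  # 0.634
```

This is why trimming the tail of East-West latency improves end-user perception far more than improving the median would.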

Something that should not be underestimated is that using a Fungible DPU makes the infrastructure much, much simpler. First of all, you end up with far fewer server variants, which means you get to fully leverage volume economics; there are also fewer components per server variant. Compare a hyper-converged server with any of the four server types we talked about: a general-purpose server with a DPU, a GPU server with a DPU, an SSD server with a DPU, or a hard drive server with a DPU.

17:06 PS: Each of these latter server types is much, much simpler because it has fewer components, and it is therefore much more reliable. The presence of the DPU also enables the network elements themselves to be simplified, which means the network itself will be much more reliable. And finally, a uniform architecture means the management operations of data centers will be simpler, and therefore the Opex will also potentially be lower.

Security is something we thought about very, very carefully right from the beginning when we were building the DPU. First of all, there is pervasive encryption over the network: any time one DPU talks to another DPU, it is over an encrypted tunnel. We also encrypt all data at rest, so whenever we write to a storage medium, the data is encrypted, and it is decrypted when read back. The DPU also provides a secure root of trust in every node, it provides secure key storage, and it has facilities for anti-cloning, so, for example, if the DPU is somehow copied illegally, our software simply won't run.

Additionally, all software that runs either on the DPU or on the server inside which the DPU is kept can be signed binaries. In other words, no software will run without first checking to see that the software is actually authentic.

18:19 PS: Finally, the DPU also implements what we call "secure partitions." And these secure partitions are done using two different mechanisms. One is at the level of TrueFabric itself, where we have bare-metal virtualized data centers which are separated by completely bulletproof security firewalls. And then the second one is software firewalls, which are East-West firewalls that run inside the DPU. And then we also provide extensive telemetry to the Composer to show what is actually going on inside each server.

When it comes to reliability, the DPU actually improves the reliability of both the storage infrastructure, as well as the network infrastructure fundamentally. Now for storage, what it does is it provides data durability at a low cost by the use of erasure coding. On the network side, we fundamentally improve the reliability of the network by making sure that the network is fully scaled out, not only at the spine layer, but also at the top-of-rack layer.

19:39 PS: We also enable the network to recover from any failure within a very, very short period, essentially by retransmitting all the packets that were lost and by eliminating the path that failed, regardless of why it failed.

So then to summarize all of the benefits that accrue from using the Fungible DPU, you end up with a single, straightforward, simple architecture that scales all the way from very, very small scales to very large scales. You get a factor of three TCO savings for a hyperscale deployment which is already doing resource pooling. You get over a 10x TCO improvement for enterprise-scale data centers. You get low predictable latency. You get very high levels of security. You get fundamental improvements to reliability. And as a bonus, you end up with infrastructure that is simpler to manage.

20:34 PS: Now, it's dangerous to make predictions, but I'd like to make a few for the next five to 10 years. I believe that a majority of servers will have something like a Fungible DPU embedded inside. Servers will be heterogeneous and disaggregated, and connected by something like a TrueFabric.

In terms of disaggregation, storage disaggregation is happening as we speak, GPUs are coming next, and DRAM disaggregation is likely to happen in the next two to three years. Data center resources will be allocated on demand, and rapidly composable into what we call bare-metal data centers.

Two weeks ago, we rolled out our first system solution to the market: the Fungible Storage Cluster. This is a revolutionary product. It has extraordinarily high performance, and it's designed to scale out from the get-go. It's also very, very secure, standards-based, and a complete solution, not just piece parts. And of course, it is powered by the Fungible DPU. Please join me in the next keynote live session, which describes the Fungible Storage Cluster in much greater detail. I'm sure you'll find it very interesting.
