Guest Post

Open Source Processors for Next-Generation Storage Controllers

Open source software tools are enabling a high level of innovation for storage controllers and data center architectures. Here is an introduction to storage controller SoC devices and more.

00:03 Zvonimir Bandic: Hi, my name is Zvonimir Bandic, and I work as a senior director in Western Digital Corporation research, managing the NextGen platforms technologies department. Our focus is new hardware technologies for Western Digital. On the development side, these are the RISC-V embedded processor cores and the RISC-V firmware toolchain. And on the more research side, these are machine learning accelerators, as well as new smart networking technologies, such as OmniXtend controllers for very low-latency memories, and neuromorphic computing. So, today at Flash Memory Summit, I will be presenting our open source RISC-V processors for storage controllers.

01:20 ZB: So, the agenda for the presentation is an introduction to storage controller SoC devices and an explanation of what typical flash controller implementations require from the embedded CPU cores. And then our RISC-V SweRV Cores roadmap. And we'll go through use cases, such as acceleration of legacy code; software and firmware toolchain support; multithreading and IPC improvements; ultra-low interrupt latency; and our small EL2 cores targeting hardware accelerators. So, what a typical storage controller SoC device needs to do is provide a familiar SATA, SAS or NVM Express interface to the host and manage the persistent media present in the device. That media is typically flash, but it could also be magnetic media or some of the new media, such as phase-change memory or NVRAM. The requirements on a controller are, in no particular order: reliability, because, obviously, a storage customer expects no loss of data and expects high reliability in storing files; security, such as boot-time security features that guarantee that firmware is not compromised, and data-at-rest protection, which encrypts data in real time; performance, because in the case of SSDs, consumers expect high IOPS and high data rates; and, finally, power, because the shift to mobile devices requires shaving every watt. And it's the same in the data center, there's no difference.

03:14 ZB: So, a typical flash controller implementation will have hardware that connects to the host, which can be SATA, SAS or NVM Express, a logical protocol that operates on a PCI Express physical layer. It will have a main CPU complex that runs the main firmware, communicates with the host and receives the read and write I/O commands. And then it will have a datapath CPU that manages the datapath complex, which, in the case of SSD devices, needs to manage individual NAND channels. So, depending on the class of device, you will have two, four, eight or even more channels, behind which you will have NAND devices. And each channel needs a dedicated IP that generates the sequencing. That's another opportunity for an embedded core. And the datapath CPU will also have to manage the error correction engine, meaning when it's writing data to the channels, it will have to add ECC code. When it's reading, it will verify the ECC code and, if necessary, correct the errors before it communicates error-free data to the main CPU. And then the main CPU talks to the host and delivers properly formatted NVM Express data back to the host file system.
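
To make the write and read paths concrete, here is a minimal sketch of "add ECC on write, verify and correct on read". It uses a Hamming(7,4) code purely for illustration; the talk does not say which code the controller uses, and production flash controllers rely on much stronger codes. The function names (`write_page`, `read_page`) and the list-of-bits storage model are assumptions made for this example.

```python
# Toy model of the datapath ECC step. Hamming(7,4) is an illustrative
# stand-in, not the code used by any particular controller.

def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) into a 7-bit codeword, as a list of bits."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]  # parity over codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over codeword positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over codeword positions 4, 5, 6, 7
    # Layout by position: 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]

def hamming74_decode(codeword):
    """Return (nibble, corrected) after fixing at most one flipped bit."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # 1-based position of the bad bit, 0 if clean
    if syndrome:
        c[syndrome - 1] ^= 1
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return nibble, syndrome != 0

def write_page(data):
    """Write path: attach ECC to every nibble before it goes to the channel."""
    return [hamming74_encode(n) for b in data for n in (b & 0xF, b >> 4)]

def read_page(stored):
    """Read path: verify ECC, correct errors, return (clean data, corrected count)."""
    out = bytearray()
    corrected = 0
    for lo, hi in zip(stored[0::2], stored[1::2]):
        lo_nibble, lo_fixed = hamming74_decode(lo)
        hi_nibble, hi_fixed = hamming74_decode(hi)
        out.append(lo_nibble | (hi_nibble << 4))
        corrected += lo_fixed + hi_fixed
    return bytes(out), corrected

stored = write_page(b"NVMe")
stored[3][5] ^= 1  # simulate a single bit flip in the media
data, corrected = read_page(stored)
print(data, corrected)
```

The host-facing CPU only ever sees the bytes returned by `read_page`, which mirrors the division of labor described above: correction happens in the datapath before data is handed on.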

04:42 ZB: There is also root of trust and encryption management, another opportunity for a core. This will verify firmware authenticity at boot, and it will also manage the user credentials for security. And it will only unwrap the encryption keys used for real-time encryption if the user provides a correct credential, for example, with the TCG security protocol. And, finally, there will be a power management CPU that manages power consumption.
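
The two root-of-trust duties mentioned here, checking firmware at boot and releasing the media encryption key only for a correct credential, can be sketched as follows. This is a simplified illustration under stated assumptions, not the TCG protocol and not a secure design: the firmware check is a plain hash comparison (real devices verify a signature), and the key wrap is a hash-derived XOR pad with an HMAC tag (real devices use a standardized key-wrap algorithm). All function names are hypothetical.

```python
import hashlib
import hmac
import os

def firmware_is_authentic(image, expected_digest):
    """Boot-time check: compare the image hash with a trusted reference digest."""
    return hmac.compare_digest(hashlib.sha256(image).digest(), expected_digest)

def _derive_kek(credential, salt):
    # Key-encryption key derived from the user credential (illustrative parameters).
    return hashlib.pbkdf2_hmac("sha256", credential.encode(), salt, 100_000)

def _pad(kek):
    return hashlib.sha256(kek + b"wrap-pad").digest()

def wrap_key(media_key, credential):
    """Store the 32-byte media key so that only the credential can recover it."""
    if len(media_key) != 32:
        raise ValueError("this sketch only handles 32-byte keys")
    salt = os.urandom(16)
    kek = _derive_kek(credential, salt)
    wrapped = bytes(a ^ b for a, b in zip(media_key, _pad(kek)))
    tag = hmac.new(kek, wrapped, hashlib.sha256).digest()
    return salt, wrapped, tag

def unwrap_key(blob, credential):
    """Return the media key, or None if the credential is wrong."""
    salt, wrapped, tag = blob
    kek = _derive_kek(credential, salt)
    expected = hmac.new(kek, wrapped, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # wrong credential: the key stays wrapped
    return bytes(a ^ b for a, b in zip(wrapped, _pad(kek)))

media_key = os.urandom(32)
blob = wrap_key(media_key, "user passphrase")
print(unwrap_key(blob, "user passphrase") == media_key)
print(unwrap_key(blob, "wrong passphrase"))
```

The point of the structure is that the media key never exists in usable form unless the credential check succeeds, which is the behavior the talk attributes to the root-of-trust core.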

So, what are the CPU core requirements, then? For the main CPU and datapath CPU, in the upper part of this diagram, the first is high performance, typically evaluated on the Dhrystone, CoreMark or Embench benchmarks. And this is important because you want to run legacy code at higher performance. You had your previous core, and you had a certain performance level. If you're upgrading hardware, the expectation is that power will not go up, but you'll be able to run faster. Low power consumption is always key. Scalability is very important if you want schemes that bring in more cores.

05:53 ZB: For example, cache coherence. Software toolchain support is of extreme importance, especially when you're transitioning to a new architecture like RISC-V. Firmware developers do not want any disruption; they want to continue coding in C++ or any other higher-level language that's typical for the platform -- for flash controllers, it's mostly C++. And, finally, interrupt latency is very important. These are real-time systems: if an interrupt is asserted, you want to react in nanoseconds. For auxiliary cores, low power is even more important, because these small cores are often used to implement finite-state machines that provide a variety of services and hardware acceleration to a main processor. They're often replicated dozens of times, so every milliwatt saved is of utmost importance. Efficiency, in instructions per clock cycle -- you want that to be as close to 1 as possible. And, finally, performance is still very important.

07:08 ZB: This is the RISC-V cores roadmap. These cores are available with their source code from the organization called Chips Alliance, which has many members in our industry, including important companies such as Google, Alibaba, SiFive, Esperanto Technologies, Intel, Samsung and many others. There are two generations of cores. SweRV EH1 is our first core. It's a 32-bit, nine-stage, dual-issue core. Then there is SweRV EH2, which is a dual-threaded version of EH1, upgraded for performance and targeting 16nm TSMC. And then there's SweRV EL2, which is a small four-stage, single-issue core, also targeting 16nm TSMC, which targets small FSMs in storage controllers. Western Digital partners with Codasip, which is another Chips Alliance member, and Codasip currently provides design verification and integration services for SweRV EH1 commercially. The core is open source, but Codasip will work with customers and help them with integration and verification.

08:35 ZB: This is the SweRV EH1 core architecture. This is a nine-stage pipeline, and it implements the RISC-V 32-bit architecture, including the integer, multiply and compressed instructions. Compressed instructions are important for code density. It has only three stall points, so we keep the pipeline busy with minimum stalling; the stall points are only at fetch, align and decode. It's superscalar: it has a dual-issue pipeline, and this is very important because you can get approximately double the amount of work done by having two execution pipelines.

09:19 ZB: And, more importantly, inside each of the pipes there are dual execution engines. So, if you look carefully at this pipe, consider an instruction that comes right after a load, which is very common: you load a register and then want to do some addition. The data loaded into the register may not be available immediately due to memory latency, so there is a second opportunity. We go through the pipeline and then, two clock cycles later, there is a second execution unit, after the register has actually been successfully loaded. So, this gives another performance boost. The overall CoreMark per megahertz for this core is approximately 5, and for those of you who are maybe not familiar with this benchmark, we provided this data comparing the CoreMark scores of some popular industry cores. Obviously, Intel processors will be very high on these scores. These are renormalized for a single thread and then renormalized to frequency.

10:26 ZB: If you look at aggregate CoreMark scores, then the big chips from Intel and AMD will have very high aggregate results, but this is renormalized for a single thread. What's important here is that this type of core gives about a 50% boost on average compared to single-issue cores, such as RISC-V cores that came from . . . This is obviously important: if you're accelerating legacy code, you want to get a boost. And for those of you who are interested in learning more details, just go to the Chips Alliance GitHub and download the source code.

11:06 ZB: Firmware and software toolchain support is of utmost importance. Western Digital has open sourced a lot of RISC-V toolchain and infrastructure. RISC-V is supported by two main open source toolchains: GNU, which is the GCC compiler, and LLVM. Stable upstream GCC versions now include working 32- and 64-bit RISC-V support, and LLVM 9 recently gained RISC-V support upstream. On code density, a new working group has been launched inside RISC-V, and Western Digital is vice-chairing it. A lot of effort has been put into this, and I can say that in the last 12 months the community has done significant work, and RISC-V code density is now at parity with other commercial cores. You can basically say within 5%, and in some cases actually better.

12:04 ZB: We worked hard on this and, for example, at Western Digital we presented this new code last year; we did a common goldstar patch for GCC where we took better advantage of compressed instructions. It's important to provide the software for embedded IoT applications. Western Digital has launched a new overlay working group inside RISC-V, and we're chairing it. It's an old-fashioned technique, very applicable to IoT and embedded, where instead of having a more complex . . . that takes a lot of space, you use this technique to load code in real time, at the moment when it's needed. It's integrated with the toolchain, and it allows significant savings in SRAM, leading to a small footprint, lower power and lower cost, which is important for IoT.
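
The overlay idea, keeping only the code that is currently needed in a small fast memory and loading the rest on demand, can be illustrated with a toy model. This is a sketch under assumptions, not the working group's design: the `OverlayManager` class, the byte sizes and the least-recently-used eviction policy are all invented for the example.

```python
from collections import OrderedDict

class OverlayManager:
    """Toy model: code overlays are loaded into a small fast memory on demand."""

    def __init__(self, fast_memory_bytes, overlay_sizes):
        self.capacity = fast_memory_bytes
        self.overlay_sizes = overlay_sizes  # overlay name -> code size in bytes
        self.resident = OrderedDict()       # loaded overlays, least recently used first
        self.loads = 0                      # how many times code was fetched from slow storage

    def call(self, name):
        """Make sure an overlay is resident before 'calling' it."""
        size = self.overlay_sizes[name]
        if size > self.capacity:
            raise ValueError(f"overlay {name!r} does not fit in fast memory")
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return "hit"
        # Evict least recently used overlays until the new one fits (assumed policy).
        while sum(self.resident.values()) + size > self.capacity:
            self.resident.popitem(last=False)
        self.resident[name] = size
        self.loads += 1
        return "load"

sizes = {"garbage_collect": 3000, "wear_level": 2000, "event_log": 2500}
mgr = OverlayManager(fast_memory_bytes=6000, overlay_sizes=sizes)
trace = [mgr.call(n) for n in
         ["garbage_collect", "wear_level", "garbage_collect", "event_log", "wear_level"]]
print(trace, mgr.loads)
```

In this example 7,500 bytes of code run out of a 6,000-byte memory at the cost of occasional reloads, which is the trade the talk describes: less fast memory in exchange for load-on-demand overhead.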

13:03 ZB: Then we came out with the SweRV EH2 core, offering even more performance. So, this is the first fully verified RISC-V embedded core that supports two threads. We support the RISC-V 32-bit integer, multiply, atomic and compressed instructions, which are important for embedded, and also the bit manipulation instructions with these funky names, Zbb and Zbs. The typical throughput gains will be, depending on the code, between 1.6x and even 2x -- I will show some data on that. There is an aligner, which basically figures out compressed instructions in RISC-V and gets them ready for decode. Above the aligner, threading is vertical, and there's plenty of throughput in the fetch engines; in the aligner and below, we actually have simultaneous multithreading.

14:01 ZB: This core supports fast interrupts in hardware -- we'll talk about that later -- but most important, it brings a significant performance improvement. So, in single-threaded mode, if you run one thread only, we now get about 6 CoreMark per megahertz, and in dual-threaded mode, we get 7.8 CoreMark per megahertz combined. That is about 3.9 CoreMark per megahertz per thread. So, this is very interesting because it basically . . . The industry averages around 4, so this core, in the area of a single core, will give the performance of two cores and still support very low-latency interrupts. The target frequency is 1.2 gigahertz in 16nm at the worst-case corner. If you look at the . . . In the best cases, the target frequency goes much higher.
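
The arithmetic behind these figures is simple enough to write down. The sketch below uses only the numbers quoted in the talk (6 and 7.8 CoreMark per megahertz, 1.2 gigahertz); the helper names are made up, and scaling a per-megahertz score linearly with clock frequency is an assumption that ignores memory effects.

```python
def per_thread_score(combined_per_mhz, threads):
    """Split a combined CoreMark/MHz figure evenly across hardware threads."""
    return combined_per_mhz / threads

def absolute_score(per_mhz, freq_mhz):
    """Estimate an absolute CoreMark score, assuming linear scaling with frequency."""
    return per_mhz * freq_mhz

SINGLE_THREAD = 6.0   # CoreMark/MHz, one thread running
DUAL_COMBINED = 7.8   # CoreMark/MHz, both threads running
TARGET_MHZ = 1200     # 1.2 GHz target frequency

print(round(per_thread_score(DUAL_COMBINED, 2), 2))         # 3.9 per thread
print(round(DUAL_COMBINED / SINGLE_THREAD, 2))              # 1.3x total throughput
print(round(absolute_score(DUAL_COMBINED, TARGET_MHZ)))     # 9360 at 1.2 GHz
```

Read this way, each thread in dual-threaded mode runs at roughly the quoted industry average of 4, while the core as a whole delivers about 30% more CoreMark throughput than it does with a single thread.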

14:53 ZB: Multithreading is a key feature of this embedded core. Instead of using two distinct cores, multithreading allows a significant efficiency improvement by running two software threads, which we call harts in RISC-V lingo, on top of an already fairly capable superscalar core. And this diagram at the top roughly explains what's happening. If you have use cases, like in datapath firmware, where there are many excursions to the bus that may take multiple clock cycles, or accesses to slower memories for controller operations, with a single-threaded core you will have a lot of stalling. With a dual-threaded core, you are running two independent threads, and the other thread can take advantage of the first thread stalling and execute at full speed. Overall, if you have this kind of use case, CPU utilization can become very high, and in practice you get two cores for the area and power of one.
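
A small simulation shows why a second thread helps when the first one stalls. This is a deliberately simplified model, not the EH2 pipeline: it assumes a single shared issue slot per cycle (the real core is dual-issue), and every thread repeats the same pattern of `work` instructions followed by `stall` cycles of waiting on the bus or a slow memory.

```python
def utilization(num_threads, work, stall, cycles):
    """Fraction of cycles in which the shared issue slot did useful work."""
    ready_at = [0] * num_threads     # first cycle at which each thread may issue again
    run_left = [work] * num_threads  # instructions left before the thread's next stall
    issued = 0
    for cycle in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                issued += 1
                run_left[t] -= 1
                if run_left[t] == 0:
                    # The thread now waits `stall` cycles before it can issue again.
                    ready_at[t] = cycle + 1 + stall
                    run_left[t] = work
                break  # only one instruction issues per cycle in this model
    return issued / cycles

# One instruction, then one stall cycle: a single thread keeps the slot half idle.
print(utilization(1, work=1, stall=1, cycles=1000))  # 0.5
# A second thread fills the idle cycles.
print(utilization(2, work=1, stall=1, cycles=1000))  # 1.0
# With longer stalls, two threads still double the throughput of one.
print(utilization(1, work=1, stall=3, cycles=1000))  # 0.25
print(utilization(2, work=1, stall=3, cycles=1000))  # 0.5
```

In all three cases the second thread consumes cycles that would otherwise be wasted, which is why the gain comes at close to the area and power of a single core.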

15:57 ZB: This is an example of a string compare running from the closely coupled memories. You can see that if it's run as a single thread, we get about 1.4 IPC executing from the instruction cache and a data closely coupled memory. If you run in multithreaded mode, where we ran two instances of string compare, we literally get an IPC of 2, because it's a dual-issue pipeline. So, this tells you that in some use cases, you literally get double performance. Another interesting feature is fast interrupt redirect. This is a complex topic to explain in one or two minutes per chart, so for details please see Chapter 7.61 of the SweRV EH2 core documentation, which is available on the Chips Alliance GitHub.

16:58 ZB: The RISC-V architecture has so-called vectored external interrupts, and when the core takes an external interrupt, it initiates a specific sequence of events that basically requires us to: save the registers that are used in the handler, which is standard; store to the meicpct control/status register to capture a consistent claim ID, which is standard; load the meihap control/status register; load the memory location it points to; and, finally, jump to that address to start executing the specific interrupt handler.

What we have done in fast interrupt redirect is take this first part of the architecture, including the external interrupt vector table, and put everything in hardware. Hardware gates actually execute this whole process: capture the claim ID and look it up in the external interrupt vector table, which is now in a closely coupled SRAM. With this kind of architecture, we got the ISR latency down to maybe three clock cycles. So, if you're executing at 1.2 gigahertz, we are looking at a 1 to 2 nanosecond interrupt latency. This is implemented entirely in hardware, and it's a build argument for the EH2 core -- so you may or may not want to use it, depending on the use case -- and it supports up to 255 interrupt handlers.
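
The dispatch sequence described here (capture a claim ID, look it up in a vector table, jump to the handler) can be modeled in a few lines. This is only a behavioral sketch: the class and method names are hypothetical, claim IDs are assumed to run from 1 to 255, and the selection rule (highest priority wins, lowest ID breaks ties) is an assumption of the example rather than something stated in the talk.

```python
MAX_HANDLERS = 255  # the talk quotes support for up to 255 interrupt handlers

class ExternalInterruptTable:
    """Behavioral model of a claim-ID-indexed external interrupt vector table."""

    def __init__(self):
        self.handlers = {}  # claim ID -> handler function

    def register(self, claim_id, handler):
        if not 1 <= claim_id <= MAX_HANDLERS:
            raise ValueError("claim ID out of range")
        self.handlers[claim_id] = handler

    def dispatch(self, pending):
        """pending maps claim ID -> priority. Returns the handler's result, or None."""
        if not pending:
            return None
        # Step 1: capture one claim ID (assumed rule: highest priority, then lowest ID).
        claim_id = max(pending, key=lambda i: (pending[i], -i))
        # Step 2: look the handler up in the vector table.
        handler = self.handlers[claim_id]
        # Step 3: jump to it.
        return handler(claim_id)

table = ExternalInterruptTable()
table.register(3, lambda cid: f"nand-channel-done:{cid}")
table.register(7, lambda cid: f"host-command:{cid}")

print(table.dispatch({3: 2, 7: 5}))  # host-command:7 (higher priority)
print(table.dispatch({3: 4, 7: 4}))  # nand-channel-done:3 (tie goes to lower ID)
```

In software, each of the three steps costs instructions and memory accesses before the handler runs; the point of doing them in hardware is that the handler's first instruction can start after only a few cycles.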

18:42 ZB: Big cores are not all that's important. Small cores, you could argue, are even more important, especially if you're using a large number of them for different hardware acceleration functions in the SoC. We have released the EL2 core, which is a second-generation core that also supports the atomic, compressed and bit manipulation instructions. It's a simple, highly optimized four-stage pipe with a nonblocking microarchitecture, a zero-cycle branch target buffer and a worst-case branch mispredict penalty of only one cycle, and it's basically designed to run very close to 1 IPC on a wide range of workloads. It also supports fast interrupt redirect as a build option, and it has fully pipelined DMA support, so DMA accesses are pipelined. We get about 3.6 CoreMark per megahertz from this core, and rising. Actually, more recent measurements are trending toward 4.2.

19:46 ZB: The target frequency is 600 megahertz, and this is for the 16 nanometer TSMC worst-case corner. These are some of the IPC results for this core, on a wide range of benchmarks: Dhrystone, CoreMark, matrix multiply, SHA-256. In most cases, IPC is between 0.95 and 1, and that's not surprising. That's exactly what we designed this core for.

And this is an interesting chart that tells you a little bit about the breadth of the acceleration that you can get with SweRV EH2. On the X axis, we put the IPC for single-threaded code execution. And on the Y axis, we put the multithreaded instructions per clock cycle. And you can see that applications with a relatively low IPC, like 0.4, get roughly 2x acceleration if you run them on the EH2, which is exactly what's expected. So there you get nearly ideal acceleration. Obviously, once you cross into the range between 1 and 2, you cannot expect doubling, but the gain actually holds up for a wide range of workloads. You are getting, as you can see, an average speedup between 1.6 and 2. And that's pretty awesome. And I think we see, in storage applications, closer to 1.9.
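
The shape of this chart follows from a simple upper bound. Assuming two identical threads on a dual-issue core, the combined IPC cannot exceed twice the single-threaded IPC, and it cannot exceed the issue width of 2. The sketch below computes that idealized bound; it is not measured data, and the function names are invented for the example.

```python
ISSUE_WIDTH = 2.0  # a dual-issue pipeline retires at most 2 instructions per cycle

def ideal_dual_thread_ipc(single_thread_ipc):
    """Upper bound on combined IPC when two identical threads share the core."""
    return min(2.0 * single_thread_ipc, ISSUE_WIDTH)

def ideal_speedup(single_thread_ipc):
    """Upper bound on throughput gain from running the second thread."""
    return ideal_dual_thread_ipc(single_thread_ipc) / single_thread_ipc

for ipc in (0.4, 1.0, 1.25, 1.4):
    print(ipc, ideal_dual_thread_ipc(ipc), round(ideal_speedup(ipc), 2))
```

Under this bound, code with a single-threaded IPC of 1.0 or less can double, while code above 1.0 is limited by the issue width. A single-threaded IPC of 1.4 reaches a combined IPC of 2, matching the string compare figure quoted earlier.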

21:40 ZB: So, we have described a range of customized SweRV RISC-V cores. They are finding new applications in flash controllers: as the main CPU, the datapath CPU, finite-state machines, sequencers for individual NAND channels, power management and root of trust. And I should point out that the open source root of trust project called OpenTitan also uses a RISC-V core, called Ibex.

So, to conclude, we described the use cases for embedded and storage controller applications. They have very high requirements for reliability, performance, power and security. And we have delivered two generations of SweRV embedded cores, EH1, EH2 and EL2, that meet or exceed the demands of most embedded designs for storage controllers. And I would say that, especially in performance, power and the ratio of performance to power, they significantly exceed cores from other architectures. We have demonstrated very high benchmark scores and actual usage performance improvements through the superscalar architecture and through multithreaded designs. And it's important to emphasize that this is the first multithreaded, fully verified RISC-V core for embedded use cases, and that it supports ultra-low interrupt latency through hardware optimizations. And for small cores, which are important for hardware accelerators in larger SoC systems, we have demonstrated high IPC and very low power in practical use cases.

23:30 ZB: If you want to learn more about open source RISC-V cores, visit Chips Alliance, or join Chips Alliance and participate in the revolution of open source, high-performance embedded RISC-V cores. Thank you very much for listening; it was a pleasure to present at Flash Memory Summit. Please don't hesitate to ask any questions, either through the FMS application or by sending me an email directly at [email protected]. Thank you very much.
