
https://www.techtarget.com/searchstorage/feature/The-significance-of-parallel-I-O-in-data-storage

The significance of parallel I/O in data storage

By Jon Toigo

Based on recent Storage Performance Council SPC-1 benchmark results, we are poised for a watershed moment in data storage performance. You could call it -- as the leader in the new technology, DataCore Software, almost certainly will -- parallel I/O.

SPC-1 measures the I/O operations per second (IOPS) handled by a storage system under a predefined enterprise workload that typifies the random queries and updates commonly found in OLTP, database and mail server applications. It's similar to a Transaction Processing Performance Council benchmark in the database world. In practical terms, SPC-1 rates the IOPS a data storage infrastructure can handle and, by virtue of the price of the kit evaluated, the cost per IOPS.

Until parallel I/O technology (re-)emerged, the two SPC-1 metrics generally rose together: IOPS could be accelerated, usually via expensive and proprietary hardware enhancements, which in turn drove up the cost per IOPS. Parallel I/O from DataCore Software breaks that relationship: IOPS climbs while the cost of the kit -- and with it the cost per IOPS -- falls. The result is steady improvement in I/O performance at a steadily decreasing price.
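
To make the price-performance metric concrete, here is a minimal sketch in Python. The dollar and IOPS figures are hypothetical, chosen only to illustrate how the two SPC-1 numbers relate; they are not drawn from any published result.

# Hypothetical figures, not actual SPC-1 results, used only to show how
# the two headline SPC-1 metrics relate to one another.
def cost_per_iops(total_system_price_usd: float, measured_iops: float) -> float:
    """SPC-1 price-performance: tested system price divided by SPC-1 IOPS."""
    return total_system_price_usd / measured_iops

# A traditional array that buys its performance with proprietary hardware:
print(f"${cost_per_iops(1_500_000, 500_000):.2f} per IOPS")    # $3.00 per IOPS

# A commodity server running a parallel I/O software stack:
print(f"${cost_per_iops(120_000, 1_500_000):.2f} per IOPS")    # $0.08 per IOPS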

Origins of parallel I/O

The term parallel I/O may sound like exotic new technology to some, but it is a simple concept based on well-established technology -- albeit technology that hasn't been much discussed outside the rarefied circles of high-performance computing for nearly three decades.

Parallel I/O is a subset of parallel computing, an area of computer science research and development that was all the rage from the late 1970s through the early 1990s. Back then, computer scientists and engineers worked on algorithms, interconnects and mainboard designs that would let them install and operate multiple low-performance central processor chips in parallel to support the requirements of new high-performance transaction processing applications. Those development efforts mostly fell on hard times when Intel and others pushed to the forefront a microprocessor architecture that used a single processor chip design (Unicore) and a serial bus to move application instructions in and out of memory and deliver I/O to the storage infrastructure.

The Unicore processor-based system -- on which the PC revolution, most client-server computing and the bulk of distributed server computing technology came to be based -- dominated the business computing scene for approximately 30 years. In accordance with Moore's Law, Unicore technology saw a doubling of transistors on a chip every two years; in accordance with House's Hypothesis, chip clock speeds doubled at roughly the same clip.

The impact of that progression put the kibosh on multiprocessor parallel processing development. PCs and servers based on Unicore CPUs evolved too quickly for parallel computing developers to keep pace. By the time a more complex and difficult-to-build multichip parallel processing machine could be designed, its performance had already been eclipsed by faster single-processor systems with a serial bus. Even as applications became more I/O-intensive, Unicore computers met their requirements with brute-force improvements in chip capacities and speeds.

Until they didn't. At the beginning of the millennium, House's Hypothesis fell apart. The trend line for processor clock rates became decoupled from the trend line for transistors per integrated circuit. For a number of technical reasons, mostly related to power consumption and heat, chip speeds plateaued. Instead of producing faster Unicore chips, developers began shipping multicore chips, capitalizing on the ongoing doubling of transistors on a chip die that Moore had forecast.

Today, multicore processors are de rigueur in servers, PCs, laptops, and even tablets and smartphones. Though some observers have failed to notice, multicore is quite similar to multichip from an architectural standpoint, which reopens the possibilities of parallel computing, including parallel I/O, for improved application performance.

Parallel I/O will improve performance

Most applications are not written to take advantage of parallel processing. Even the most sophisticated hypervisor-based software, while it may use separate processor cores to host specific virtual machines (VMs), still assigns each logical core to a VM and processes that VM's hosted application workload sequentially. Below this layer of application processing, however, parallelism can be applied to improve overall performance. That's where parallel I/O comes in.

DataCore has resurrected some of the parallel I/O algorithms that company chief scientist and co-founder Ziya Aral was working on back in the heyday of multiprocessor systems engineering and implemented them using the logical cores of a multicore chip. Each physical core in a multicore processor supports multithreading in hardware, which lets operating systems and hypervisors make more efficient use of chip resources. The combination of physical cores and multithreading presents the system with two or more logical cores for each physical core.
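
The physical-versus-logical split is easy to see on any machine. The following sketch assumes the third-party psutil package is installed; Python's built-in os.cpu_count() reports only the logical count.

# Minimal sketch: report physical cores vs. the logical cores exposed by
# simultaneous multithreading (e.g., Intel Hyper-Threading).
import psutil  # third-party package, assumed installed via `pip install psutil`

physical = psutil.cpu_count(logical=False)  # physical cores
logical = psutil.cpu_count(logical=True)    # logical cores (hardware threads)

print(f"Physical cores: {physical}")
print(f"Logical cores:  {logical} ({logical // physical} per physical core)")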

With so many logical processors available, many cores go unused. DataCore's technology takes some of those idle cores and creates a parallel processing engine developed explicitly to do nothing but service I/O requests from all hosted applications. Such an engine enables I/O to be processed in and out of many applications concurrently -- rather than sequentially -- which translates into much less time spent servicing I/O. This is parallel I/O, and it is exactly what DataCore has now demonstrated with the Storage Performance Council SPC-1 benchmark.
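
A toy sketch illustrates the difference between servicing I/O requests one at a time and servicing them concurrently. This is not DataCore's engine; a generic Python thread pool simply stands in for the idea of spare logical cores dedicated to nothing but I/O work.

# Toy comparison of serial vs. concurrent servicing of simulated I/O requests.
# A generic thread pool stands in for cores dedicated to I/O; this illustrates
# the concept only, not DataCore's implementation.
import time
from concurrent.futures import ThreadPoolExecutor

def io_request(req_id: int) -> int:
    time.sleep(0.05)  # simulate a 50 ms storage round trip
    return req_id

requests = list(range(32))

start = time.perf_counter()
for r in requests:                                  # serviced one at a time
    io_request(r)
serial_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:     # eight "I/O cores"
    list(pool.map(io_request, requests))            # serviced concurrently
parallel_time = time.perf_counter() - start

print(f"Sequential: {serial_time:.2f} s  Concurrent: {parallel_time:.2f} s")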

By itself, parallel I/O may not seem like much more than an interesting nuance in data storage stack design. But its practical implications are significant.

27 Jan 2016
