The data centers and systems that power hyperscale cloud platforms represent the pinnacle of IT infrastructure design and implementation. They offer levels of scalability, reliability and throughput beyond what any average enterprise will ever need.
That said, enterprise IT teams -- including storage admins -- have much to learn from Google, AWS and other major public cloud providers. Through the application of certain hyperscale data center design principles, admins can work toward more scalable, resilient and automated IT storage systems.
Key similarities and differences
Both hyperscale cloud providers and enterprise IT operators struggle to accommodate an explosion of data. They also share similarities when it comes to spending. Every dollar counts for cloud operators and online service providers when building servers and storage systems; seemingly small savings add up when multiplied across tens of thousands of systems. While enterprises aren't quite as cost-conscious and are willing to pay more for products from a trusted vendor, no IT organization has money to waste.
To minimize operational costs -- a critical aspect of IT cost optimization -- hyperscale cloud providers automate every task that doesn't require manual oversight. The key to task automation is software, which in the context of cloud infrastructure requires replacing function-specific hardware with extensible software that can run on standard servers.
These and other requirements of hyperscale cloud operators have reshaped the server, networking and storage industries in several ways, including:
- new techniques for distributed redundancy and scalability;
- a focus on flexible hardware built from commodity components; and
- a concomitant transition from purpose-built appliances to software-defined services that run on easily replaced, standard servers.
Once IT organizations and engineers adopt the cloud ethos of treating systems like cattle, not pets -- managed as a herd rather than cared for as individuals -- every IT service, whether compute resources or storage pools, becomes software.
Hyperscale data center design and storage implications
While there are similarities between traditional enterprise and public cloud infrastructures, the analogy isn't perfect -- which Google points out in a blog on cloud-native architectures.
For example, traditional architectures tend to involve high-cost infrastructure that IT teams must manage and modify manually. These architectures also tend to have a small, fixed number of components. This kind of traditional, fixed infrastructure, however, doesn't make sense for the public cloud, because of the cloud's pay-per-use model; organizations can cut costs if they reduce their infrastructure footprints. Public cloud resources also scale automatically. As a result, there are attributes of on-demand cloud services that aren't applicable to private infrastructure.
Nonetheless, IT teams can apply the following hyperscale data center design principles to optimize enterprise storage:
Embrace the software abstraction layer
Servers were the first infrastructure layer to be virtualized, with a software abstraction layer between the physical hardware and logical resources. Virtual machines (VMs) became the standard runtime environment for enterprise applications. Over the past decade, software virtualization has spread across the data center, with VMs now complemented by lightweight containers. Software-defined networking has spawned software-defined WAN, network functions virtualization and virtual network overlays. Software-defined storage (SDS) has decoupled data storage devices from the information management and data placement control plane.
Initial SDS platforms were designed for particular uses, such as providing block volumes for VM instances and databases. Recent products have become format- and protocol-agnostic, able to shard data across multiple nodes and present it as a logical volume, network file share or object store. For hardware flexibility, SDS runs on standard servers with direct-attached JBODs of SSDs, HDDs and NVMe devices.
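To make the sharding idea concrete, here is a minimal Python sketch -- with a toy block size and hypothetical node names, not any particular SDS product's layout -- of how a control plane might split a blob into fixed-size blocks, spread them round-robin across storage nodes, and still present them as one logical object:

```python
from typing import Dict, List

BLOCK_SIZE = 4  # bytes, for illustration; real systems use 64 KB to 64 MB blocks


def shard(blob: bytes, nodes: List[str]) -> Dict[str, List[bytes]]:
    """Split a blob into fixed-size blocks and round-robin them across nodes."""
    placement: Dict[str, List[bytes]] = {n: [] for n in nodes}
    blocks = [blob[i:i + BLOCK_SIZE] for i in range(0, len(blob), BLOCK_SIZE)]
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement


def reassemble(placement: Dict[str, List[bytes]], nodes: List[str]) -> bytes:
    """Interleave blocks back in round-robin order to rebuild the logical blob."""
    out = []
    queues = [placement[n][:] for n in nodes]  # copy so placement is untouched
    i = 0
    while any(queues):
        q = queues[i % len(nodes)]
        if q:
            out.append(q.pop(0))
        i += 1
    return b"".join(out)
```

The consumer of the service only ever sees the reassembled blob; where the blocks physically live is the control plane's concern.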
Build services, not infrastructure
By isolating resources from physical hardware, software abstraction layers provide the flexibility to mix and match hardware. They let teams package resources as services instead of raw infrastructure. To take a cue from the hyperscale cloud providers, use SDS to deliver an object, file or volume service that includes not just capacity, but also valuable ancillary features such as backup, long-term archival, versioning and QoS levels.
The delivery of services instead of infrastructure also provides flexibility in infrastructure design and the packaging of related services. It enables feature and performance upgrades without changes to the delivery and billing models. With storage as a service, admins can also use servers and drives with different performance and cost characteristics to deliver different service tiers, as well as spread data across multiple data centers and regions for higher availability.
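A tier catalog is one way to picture this packaging. The sketch below is illustrative only -- the tier names, media pools, IOPS targets and feature lists are invented, not taken from any vendor's offering:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorageTier:
    """A storage service tier: capacity plus ancillary features, not raw hardware."""
    name: str
    media: str              # hardware pool backing the tier
    iops_target: int        # QoS level promised to consumers
    features: List[str] = field(default_factory=list)


# Hypothetical catalog: consumers pick a tier, never a drive model.
CATALOG = [
    StorageTier("premium", media="nvme", iops_target=50_000,
                features=["snapshots", "versioning", "cross-region replicas"]),
    StorageTier("standard", media="ssd", iops_target=5_000,
                features=["snapshots", "backup"]),
    StorageTier("archive", media="hdd", iops_target=200,
                features=["long-term retention"]),
]


def pick_tier(min_iops: int) -> StorageTier:
    """Return the lowest tier that still meets the requested QoS level."""
    eligible = [t for t in CATALOG if t.iops_target >= min_iops]
    return min(eligible, key=lambda t: t.iops_target)
```

Because consumers bind to the tier name rather than the hardware behind it, the team can swap drive generations or add features without changing the delivery or billing model.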
Design for automation
Replacing raw storage with software-defined data and information management services also facilitates task automation. This, in turn, reduces Opex, decreases provisioning time and increases reliability. SDS enables programmatic control because it exposes a host of APIs for storage configuration, deployment, software updates and user provisioning. To deliver storage like a hyperscale cloud provider, use the APIs exposed by SDS products in automation and infrastructure-as-code platforms like Terraform, Ansible, SaltStack or VMware vRealize Automation, as this transforms manual processes into programmable scripts.
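The core pattern behind those infrastructure-as-code platforms is declarative reconciliation: compare a desired state against the actual state and emit the actions that close the gap. A minimal sketch of that loop -- a simplification of what tools like Terraform do, not their actual implementation:

```python
from typing import Dict, List, Tuple


def reconcile(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Tuple]:
    """Compute the create/update/delete actions that move storage from its
    actual state to the declared desired state."""
    actions: List[Tuple] = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions
```

Running the same declaration twice produces no actions the second time, which is what makes scripted provisioning safe to automate.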
Plan for failures
Servers and storage devices routinely die. For cloud providers with hundreds of thousands of servers and millions of drives, failures happen constantly. To embrace the cattle-not-pets philosophy, design for failures. Ensure that a dead drive or server doesn't knock out a storage volume or object blob. One standard technique involves sharding files, blobs or volumes into blocks replicated and spread across multiple drives, nodes and data centers, using erasure coding, hashing or similar algorithms to guarantee data integrity.
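The simplest erasure code is XOR parity: one extra shard lets the system rebuild any single lost data shard. Production systems use stronger codes such as Reed-Solomon, but this toy Python version shows the principle:

```python
from typing import List


def xor_parity(shards: List[bytes]) -> bytes:
    """Compute a parity shard over equal-length data shards. With it stored
    on a separate node, any one lost shard can be reconstructed."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)


def recover(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing data shard: XOR of survivors plus parity."""
    return xor_parity(surviving + [parity])
```

Spread the data shards and the parity shard across different drives, nodes and data centers, and a dead drive costs a rebuild, not a volume.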
Some failures don't involve data destruction, but rather corruption or performance loss. Cloud operators continually monitor such events and use automated notification systems and scripts to repair or mitigate the damage without manual intervention -- and, hopefully, before users notice. Monitoring can also determine the extent of any corruption or outage, and route incoming storage requests to intact replicas and unaffected data centers.
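Automated routing of this kind usually rests on something as simple as heartbeat freshness. A sketch, with an assumed 15-second timeout and invented replica names:

```python
from typing import Dict, List

HEARTBEAT_TIMEOUT = 15.0  # seconds; an assumed threshold, tune per deployment


def healthy_replicas(last_heartbeat: Dict[str, float], now: float) -> List[str]:
    """Return replicas whose last heartbeat is recent enough to trust."""
    return [r for r, t in last_heartbeat.items() if now - t <= HEARTBEAT_TIMEOUT]


def route_read(last_heartbeat: Dict[str, float], now: float) -> str:
    """Send the read to an intact replica; fail loudly only if none survive."""
    live = healthy_replicas(last_heartbeat, now)
    if not live:
        raise RuntimeError("no intact replica available")
    return live[0]
```

A real system would also weigh locality and load when choosing among healthy replicas; the point is that the reroute happens in code, before a user notices.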
Design for growth
IT has always struggled to keep up with storage capacity demands. But today, accelerating data growth has created a crisis in many organizations. To build storage like a hyperscale cloud platform, design for Moore's Law-type growth. Admins should be able to add storage nodes and JBOD arrays to expand scale-out systems nondisruptively.
SDS is pivotal to such designs, as it separates the control plane -- volume, file and node management and configuration -- from the data plane -- storage nodes and arrays. Thus, adding capacity to a distributed system doesn't require taking down and migrating a volume. Instead, IT staff can add nodes and enable the system to automatically redistribute data across the newly available capacity.
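Automatic redistribution is commonly driven by consistent hashing, which keeps data movement minimal when nodes join. A toy Python sketch with one hash-ring position per node (real systems assign many virtual positions per node for smoother balance):

```python
import hashlib
from typing import List


def _h(s: str) -> int:
    """Hash a key or node name onto the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


def owner(key: str, nodes: List[str]) -> str:
    """Assign the key to the first node at or past its position on the ring."""
    ring = sorted(nodes, key=_h)
    kh = _h(key)
    for node in ring:
        if _h(node) >= kh:
            return node
    return ring[0]  # wrap around the ring
```

The payoff: when a node is added, each key either keeps its old owner or moves to the new node -- no shuffling between existing nodes, and no volume downtime.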
Unlike traditional SAN-based enterprise storage designs, hyperscale clouds don't scale up and consolidate -- they scale out and distribute. They also use monitoring telemetry and predictive machine learning algorithms to determine scaling profiles for capacity additions. The goal is to have enough capacity without wasting too much in reserved space.
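Hyperscalers' predictive models are far richer than this, but even a straight-line fit over recent usage samples illustrates how telemetry can trigger capacity additions before reserves run out. A toy sketch, assuming weekly usage samples in terabytes:

```python
from typing import List


def weeks_until_full(usage_tb: List[float], capacity_tb: float) -> float:
    """Fit a least-squares line to weekly usage samples and estimate how many
    weeks remain before the cluster hits its capacity ceiling."""
    n = len(usage_tb)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(usage_tb) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, usage_tb)) \
        / sum((x - x_mean) ** 2 for x in xs)
    if slope <= 0:
        return float("inf")  # usage flat or shrinking; no expansion needed
    return (capacity_tb - usage_tb[-1]) / slope
```

An automation hook could compare this estimate against hardware lead time and open a procurement ticket when the runway gets short.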
Remember machines are substitutable
Standard servers running an SDS stack save money compared to traditional storage systems: organizations can substitute cheap, commodity servers for expensive, proprietary storage hardware. These machines are also substitutable for one another -- like storage Lego blocks in a larger distributed system. Because every file or data block is replicated across drives on several nodes, the failure of one or two systems doesn't affect an entire data volume. Machine interchangeability and data redundancy also let IT staff perform repairs or replacements in bulk and at a convenient time, not as a reactionary fire drill.
To act like a cloud operator, an IT organization must operate at a scale that justifies the number of systems a scale-out distributed design requires. When you only have one dairy cow, it's impossible to treat Elsie like just another animal in the herd.