Alex - stock.adobe.com
AWS offers many services to store, manage and analyze data in a scalable and reliable way. Two well-known examples are RDS and Redshift. Pick the wrong one for an application, and performance and budgets can suffer.
The Amazon Relational Database Service (RDS) manages database servers in the cloud. Amazon Redshift supports data warehouse and data lake approaches, enabling it to access and analyze large amounts of data. While they have similarities, these two AWS database services solve different problems. They're based on different design patterns, and they scale and access data in different ways.
What are the key similarities?
Both RDS and Redshift use SQL to access data. They both manage compute and storage infrastructure to deliver scalability, availability and security features. For software, data is grouped in logical databases and schemas, stored in tables, and organized into rows and columns. Amazon Redshift is based on PostgreSQL, a widely used database engine. Both equip application owners with a programmatic way to store and access data and make it available to end users.
What are the key differences?
RDS and Redshift differ in key areas, including underlying design, provisioning and scaling approach, and data access. Cost is also a distinction.
Structure. The most important difference between RDS and Redshift is their data processing design. As it is a relational database service, RDS follows an online transaction processing (OLTP) design. OLTP programs follow a transactional process, meaning data is protected from concurrent changes and from corruption from failed processes. Redshift follows an online analytical processing (OLAP) approach. OLAP cleans and organizes data from data warehouses into structured data cubes to prepare it for queries. This difference means they solve two different problems.
RDS is meant to serve online in real-time transactions that require an immediate response. Redshift is suited to jobs with longer and heavier data analysis that can be executed asynchronously.
Provisioning and scalability. At a high level, provisioning an RDS database consists of several steps:
- launching a database instance;
- selecting its instance family, such as T3, M5 or R5;
- selecting its size, such as large, xlarge, 4xlarge or others;
- choosing the storage type, either general purpose SSD or provisioned IOPS; and
- setting the storage capacity.
In the case of RDS Aurora, a MySQL- and PostgreSQL-compatible database within AWS RDS, the customer must launch a database within an RDS cluster. A cluster consists of a single primary node and an optional number of read replicas, as well as Multi-AZ or regional backup alternatives. RDS stores all the source data in a single node. The only way to scale storage is to increase disk capacity in the RDS instance. Adding more read replicas can offload the primary node, but the primary node is still the single point where source data is stored and managed.
To provision a Redshift cluster, the customer selects the following:
- node type, such as RA3, DS2 or DC2;
- node size, such as large, xlarge, 8xlarge or another size; and
- number of nodes.
In the case of RA3 nodes, Redshift users can specify the amount of storage per node. Otherwise, it's predefined according to the node size. Application owners can increase the cluster storage capacity by adding nodes or updating managed storage settings. Data gets distributed evenly across nodes, which delivers scalability to application owners. Compute and storage are distributed across multiple nodes. Thus, scalability for a Redshift cluster is much higher when compared with an RDS deployment.
Data access. RDS databases are not designed to access data stored outside their local storage system and predefined format. Redshift can access data stored either locally in the cluster or in external data sources.
For data stored externally, Redshift supports multiple formats, such as ORC, Parquet, JSON and CSV. Redshift is designed and optimized to store and access much larger data sets than RDS. This can be up to 128 TB per node, reaching potentially petabytes of data in a cluster. Compare this with RDS, which reaches 100 gibibytes to 64 TB for most database engines. Redshift is limited only by the external data storage limitations. In the case of AWS S3 cloud storage, limits are virtually nonexistent.
Cost. Regarding cost, comparing the two solutions isn't so straightforward.
RDS pricing is driven by the database engine, instance size and storage. The most expensive RDS instance (Microsoft SQL Server Enterprise db.x1.32xlarge) can reach close to $45,000 per month. But there are production-ready options that customers can deploy for $1,000 or less, such as an RDS MySQL m5.4xlarge.
The most expensive Redshift node (ra3.16xlarge) can run approximately $9,400 per month. This would make a multi-node cluster potentially reach $50,000 to $100,000 per month. Cost-management strategies, such as node and cluster right-sizing, can alleviate costs in both products.
While the cost for a Redshift cluster is higher than its equivalent in RDS, Redshift allows for significantly higher compute and storage capacity than an RDS deployment.
In practical terms, a relational database managed by RDS suits use in online requests. Consider it for requests coming from a web or mobile application, such as user login, a product catalog search, user details and so on.
A Redshift cluster is a fit for generating reports out of large amounts of data, such as site traffic reports, user activity, log analysis, market or business reports, billing analysis for large platforms and public data set analysis. The Redshift user might trigger report generation automatically with the expectation that it will take some time to process. Upon report completion, Redshift's output can be exported to a format where it can be accessed by other application components, such as business intelligence visualization tools, internal reports and online applications.
Making the right choice
Even though Redshift is based on PostgreSQL, there are many of that engine's features and design patterns not supported by Redshift. The AWS documentation describes these in detail.
Redshift supports both a data warehouse and a data lake approach. Redshift allows for local storage in the cluster nodes, which makes this model consistent with a data warehouse approach, where data is processed and stored internally following a predefined structure. However, Redshift also allows for a data lake model, where it accesses data stored externally, such as in S3 or even RDS databases. Redshift interacts with a data catalog, which can be based on the Amazon Athena interactive query service, AWS Glue serverless data integration service or EMR Hive data warehouse and analytic package to access these external data sources. The Redshift federated query feature allows it to connect to databases managed by RDS using the Aurora MySQL, MySQL and PostgreSQL engines.
Both RDS and Redshift help solve complex problems for application owners, especially when it comes to infrastructure provisioning and data storage management. They are different services that solve different problems. When applied to the right use case, both platforms are powerful and should be considered as part of the tool set to deploy modern applications in the cloud.