With Redshift Spectrum, an analyst can run SQL queries against data stored in Amazon S3 buckets. This can save time and money because the data does not have to be moved from a storage service into a database; it is queried directly inside the S3 bucket. Redshift Spectrum also expands the scope of a given query, which can reach beyond a user's existing Redshift data warehouse nodes and into large volumes of unstructured data held in S3 data lakes.
How Redshift Spectrum works
Redshift Spectrum breaks a user query into filtered subsets that are run concurrently. Those requests are spread across thousands of AWS-managed nodes to maintain query speed and consistent performance. Redshift Spectrum can scale to run a query across more than an exabyte of data, and once the S3 data is aggregated, it's sent back to the local Redshift cluster for final processing.
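The scatter-and-gather pattern described above can be sketched in a few lines of Python. This is only a toy illustration, not how the AWS-managed fleet is implemented: the partitions, the `amount` field, and the function names `scan_partition` and `run_query` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_partition(rows, predicate):
    """Filter one subset of the data and return a partial aggregate."""
    # "amount" is a hypothetical column used only for this illustration.
    matched = [row["amount"] for row in rows if predicate(row)]
    return len(matched), sum(matched)


def run_query(partitions, predicate, max_workers=4):
    """Scan all partitions concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda part: scan_partition(part, predicate), partitions))
    # Final processing: in the real service this step happens on the
    # Redshift cluster after the aggregated S3 results are sent back.
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total
```

Each worker returns only a small partial aggregate rather than the rows it scanned, which is why the final step on the cluster stays cheap even when the underlying data set is very large.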
Redshift Spectrum requires a Redshift cluster and a connected SQL client. Multiple clusters can query the same S3 data set at the same time, but a cluster can only query data stored in its own AWS Region.
Redshift Spectrum can be used in conjunction with any other AWS compute service that has direct S3 access, including Amazon Athena and Amazon EMR (Elastic MapReduce) running Apache Spark, Apache Hive and Presto.
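As a rough sketch of what the connected SQL client would send, the Python helpers below build the two statements that typically expose S3 data to a cluster: an external schema and an external table. The schema name, catalog database, IAM role ARN, bucket path and column list are all hypothetical placeholders, and the naive string formatting is for illustration only, not for untrusted input.

```python
def external_schema_sql(schema, catalog_db, iam_role_arn):
    """Build a CREATE EXTERNAL SCHEMA statement backed by a data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(schema, table, s3_location):
    """Build a CREATE EXTERNAL TABLE statement over Parquet files in S3."""
    # The three columns are hypothetical; a real table mirrors the files' layout.
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    sale_id INTEGER,\n"
        "    amount DECIMAL(10,2),\n"
        "    sale_date DATE\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )
```

For example, `external_table_sql("spectrum", "sales", "s3://example-bucket/sales/")` yields a statement that, once run, would let `spectrum.sales` be joined with ordinary tables already stored in the cluster.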
Redshift Spectrum vs. Athena
Amazon Athena is similar to Redshift Spectrum, though the two services typically address different needs. An analyst who already works with Redshift will benefit most from Redshift Spectrum because it can quickly access data in the cluster and extend out to infrequently accessed external tables in S3. Redshift Spectrum is also better suited to fast, complex queries across multiple data sets.
Athena, by contrast, is a simpler way to run interactive, ad hoc queries on data stored in S3. It doesn't require any cluster management, and an analyst needs only to define a table before running standard SQL queries.
Other cloud vendors also offer similar services, such as Google BigQuery and Microsoft Azure SQL Data Warehouse.
Amazon Redshift Spectrum follows a per-use billing model, at $5 per terabyte of data scanned in S3, with a 10 MB minimum per query. AWS recommends that customers compress their data or store it in a columnar format, which reduces the amount of data each query scans and therefore its cost. These charges do not include Redshift cluster and S3 storage fees.
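The per-terabyte rate and the 10 MB minimum above translate into a simple cost estimate. The sketch below assumes binary units (1 TB = 1024^4 bytes) and ignores any rounding the actual bill may apply; the function name `estimate_query_cost` is made up for this example.

```python
MB = 1024 ** 2
TB = 1024 ** 4

PRICE_PER_TB_USD = 5.0          # rate quoted in the text
MIN_BYTES_PER_QUERY = 10 * MB   # 10 MB minimum per query


def estimate_query_cost(bytes_scanned):
    """Estimate the Spectrum charge in USD for a single query."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    # Assumption: every query is billed for at least the 10 MB minimum.
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / TB * PRICE_PER_TB_USD
```

Under these assumptions a query that scans 1 TB costs about $5. If compression or a columnar layout cut the scanned volume to a quarter (a hypothetical ratio), `estimate_query_cost(TB // 4)` gives $1.25, which is the saving the AWS recommendation is aiming at.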