AWS adds data quality, scalability services for cloud data

The cloud giant expanded its data portfolio with a series of features designed to help organizations more easily scale database and data warehouse deployments.

AWS continued growing its cloud data capabilities with a series of features to help enterprises scale their use of its database services and ensure data quality.

The features the tech giant revealed on Wednesday follow a series of updates it introduced on Nov. 29 at its re:Invent 2022 conference, including the new DataZone data catalog and governance service.

Among the new services AWS rolled out on Wednesday is the Amazon DocumentDB Elastic Clusters service intended to help document database workloads more easily scale up and down based on traffic requirements.

The Amazon Redshift cloud data warehouse also got a new multi-zone, high-availability configuration. AWS additionally brought data quality capabilities to the AWS Glue metadata discovery service.

AWS competes mainly against Microsoft Azure and Google Cloud Platform. Helping users easily manage scalability of database services is a challenge all three major public cloud vendors are addressing.

AWS and cloud technology itself are maturing, so there are fewer new areas for cloud vendors to push into, said Doug Henschen, an analyst at Constellation Research. The tech giants' shift to polishing services and filling gaps in their existing portfolios of services is understandable.

"One of those gaps was data quality. So the Glue Data Quality was a welcome -- and one could say overdue -- announcement," Henschen said. "It provides automated ways to generate data quality rules."

Henschen noted that organizations that previously struggled with data quality were likely already turning to third-party partners for data quality services through the AWS Marketplace.

Improving data quality in the cloud

Organizations now commonly use data lakes, often built on Amazon S3 cloud object storage, as a foundational element of data analytics and business intelligence.

In a keynote at the conference, Swami Sivasubramanian, vice president of databases, analytics and machine learning at AWS, said a challenge with data lakes is that if organizations don't monitor the data quality, the lakes may become "data swamps."

"Customers told us building [those] data quality rules across data lakes and data pipelines is very, very time consuming and very error prone," he said.

The AWS Glue Data Quality service can automatically generate data quality rules for data sets. The rules help ensure the accuracy and freshness of data in a data lake or data pipeline, Sivasubramanian said.

"Rules can be applied to your data pipelines so poor-quality data does not even make it to your data lakes in the first place," he said.

The new service can run continuously; if data quality deteriorates for any reason, the organization is alerted.
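To make the idea of a data quality rule concrete, the following is a minimal Python sketch of two rule types, completeness and freshness, evaluated against a batch of records before it is loaded. It is an illustration only and does not use or represent the AWS Glue Data Quality API; the `Rule` class, the helper functions and the `order_id` and `updated_at` columns are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, List


@dataclass
class Rule:
    """A named check that a batch of records either passes or fails."""
    name: str
    check: Callable[[List[dict]], bool]


def completeness(column: str, min_ratio: float) -> Rule:
    """Rule: at least min_ratio of the records have a non-null value in column."""
    def check(records: List[dict]) -> bool:
        if not records:
            return False
        filled = sum(1 for r in records if r.get(column) is not None)
        return filled / len(records) >= min_ratio
    return Rule(f"completeness({column}) >= {min_ratio}", check)


def freshness(column: str, max_age: timedelta, now: datetime) -> Rule:
    """Rule: the newest timestamp in column is no older than max_age."""
    def check(records: List[dict]) -> bool:
        stamps = [r[column] for r in records if r.get(column) is not None]
        return bool(stamps) and now - max(stamps) <= max_age
    return Rule(f"freshness({column}) <= {max_age}", check)


def failed_rules(records: List[dict], rules: List[Rule]) -> List[str]:
    """Return the names of the rules the batch fails; an empty list means it passes."""
    return [rule.name for rule in rules if not rule.check(records)]


if __name__ == "__main__":
    now = datetime(2022, 12, 1, tzinfo=timezone.utc)
    batch = [
        {"order_id": 1, "updated_at": now - timedelta(hours=2)},
        {"order_id": None, "updated_at": now - timedelta(days=3)},
    ]
    rules = [
        completeness("order_id", 0.9),
        freshness("updated_at", timedelta(hours=24), now),
    ]
    # A pipeline could hold back the batch and raise an alert if this list is non-empty.
    print(failed_rules(batch, rules))
```

In this sketch a pipeline would call `failed_rules` on each incoming batch and decline to load it when any rule fails, which mirrors the general idea of keeping poor-quality data out of the lake.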

AWS intros more scalability and security for cloud data

Amazon DocumentDB is AWS's JSON-based document NoSQL database service. DocumentDB can automatically scale up to 64 terabytes of data per cluster, serving millions of requests per second.

While DocumentDB could already scale a single database, Sivasubramanian said users have been looking for easier ways to manage throughput for multiple DocumentDB database clusters.

"[The customers] told us that scaling out or sharding the data sets across multiple database instances is really, really complex," he said.

With Amazon DocumentDB Elastic Clusters, AWS is helping users more easily scale multiple DocumentDB database clusters up to petabytes of capacity.

The AWS service automatically handles the underlying database configuration required to enable the scalability without users manually configuring the deployment, according to the vendor.
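Spreading a data set across multiple database instances is commonly done by hashing a key from each document to pick a shard. The Python sketch below shows that general technique only; it is not AWS's implementation and does not use the DocumentDB API, and the function names and the `customer_id` key are hypothetical.

```python
import hashlib
from typing import List


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index using a stable hash (the same key always yields the same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(documents: List[dict], shard_key: str, shard_count: int) -> List[List[dict]]:
    """Group documents into shard_count buckets according to the hash of their shard key."""
    shards: List[List[dict]] = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[shard_for(str(doc[shard_key]), shard_count)].append(doc)
    return shards


if __name__ == "__main__":
    docs = [{"customer_id": f"c{i}", "total": i * 10} for i in range(8)]
    for index, bucket in enumerate(distribute(docs, "customer_id", 3)):
        print(index, [d["customer_id"] for d in bucket])
```

Even in this simplified form, changing the shard count moves most keys to a different shard, which is part of why doing this by hand is complex and why a managed service that handles it is attractive.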

Sivasubramanian also used the keynote stage to unveil the Amazon Redshift Multi-AZ capability, bringing multiple availability zones to the cloud data warehouse service.

The Multi-AZ configuration improves availability for analytics applications by failing over automatically if one zone suffers a disruption. The service lets users run workloads across multiple availability zones simultaneously.
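The core idea behind automated failover can be sketched in a few lines of Python: route work to a preferred zone while it is healthy and fall back to another healthy zone otherwise. This is a conceptual illustration, not how Amazon Redshift implements Multi-AZ; the `pick_zone` function and the zone names are hypothetical.

```python
from typing import Mapping


def pick_zone(zone_health: Mapping[str, bool], preferred: str) -> str:
    """Return the preferred zone if it is healthy, otherwise the first healthy zone."""
    if zone_health.get(preferred, False):
        return preferred
    for zone, healthy in zone_health.items():
        if healthy:
            return zone
    raise RuntimeError("no healthy availability zone")


if __name__ == "__main__":
    # Hypothetical zone names; "zone-a" is reported as disrupted.
    health = {"zone-a": False, "zone-b": True}
    print(pick_zone(health, preferred="zone-a"))
```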

While availability and scalability are important, so too is security. To that end, AWS extended its GuardDuty security service to its Amazon Aurora relational database.

The service can help protect Aurora database deployments against threats. It also provides security reporting to help users track and identify where incursions are coming from.
