AWS, Microsoft Azure and Google Cloud Platform are more than IaaS -- they have morphed into sophisticated data science environments. However, the cloud can also complicate data processing pipelines, as it requires analysts to build complex workflows that extract data from multiple sources, funnel it through various filters and then feed it to different data warehouses and analytics services.
To simplify data pipeline development, Google Cloud Platform (GCP) users can deploy Cloud Composer, a managed workflow orchestration service based on the open source Apache Airflow project. Let's look closer at the features and pricing for Cloud Composer, and review how it works with Airflow to ensure the tasks are done at the right time and in the right order.
Apache Airflow features
Initially developed by Airbnb, Apache Airflow automates data processing workflows that were previously written as long, intricate batch jobs. Users define each Airflow pipeline as a directed acyclic graph (DAG) written in Python, so workflows can be instantiated dynamically from code.
Apache Airflow's four primary components are:
- Web server: A web UI for visually inspecting DAG definitions, dependencies and variables; monitoring log files and task duration; and reviewing source code.
- Scheduler: A persistent service that instantiates workflows ready to run.
- Executor: A set of worker processes to execute workflow tasks.
- Metadata database: A database that stores metadata related to DAGs, job status, connections and more.
Other parts of the architecture include workers, which are the nodes that run tasks; brokers, which enable executors and workers to communicate; and configuration files.
Google Cloud Composer features
Cloud Composer provides operations teams functionality similar to that of infrastructure-as-code services, such as Google Cloud Deployment Manager or AWS CloudFormation. It includes a library of connectors, an updated UI and a code editor.
Google Cloud Composer also includes:
- full integration with various GCP data and analytics services, such as BigQuery and Cloud Storage;
- support for Stackdriver Logging and Monitoring;
- simplified interfaces for DAG workflow management and configuration of the Airflow runtime and development environments;
- client-side developer support via Google Developer Console, Cloud SDK and Python package management;
- access controls to the Airflow web UI using the Google Identity-Aware Proxy;
- support for connections to external environments, both on premises and on other clouds;
- network boundaries and isolation with shared VPC and private IP;
- compatibility with open source Airflow; and
- support for other community-developed integrations.
Cloud Composer basics
The Composer Airflow environment runs on a Kubernetes cluster built from Google Compute Engine instances. When creating an environment, users need to specify various parameters, such as node count, machine type, location, VM disk size, tags and network configurations.
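The parameters listed above can be kept together in one small, validated record before an environment is requested. The following is a minimal sketch in plain Python, not the Cloud Composer API: the class name `EnvironmentSpec`, its field names, the validation rules and the sample values are all hypothetical and chosen only to mirror the parameters named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical container for the settings chosen at creation time."""
    name: str
    location: str          # region in which the environment runs
    node_count: int        # number of cluster nodes
    machine_type: str      # Compute Engine machine type for each node
    disk_size_gb: int      # VM disk size per node
    tags: tuple = ()       # instance tags applied to the nodes
    network: str = "default"

    def __post_init__(self):
        # Illustrative sanity checks only; real limits come from the service.
        if self.node_count < 1:
            raise ValueError("node_count must be at least 1")
        if self.disk_size_gb <= 0:
            raise ValueError("disk_size_gb must be positive")


# Sample values are placeholders, not recommendations.
spec = EnvironmentSpec(
    name="example-env",
    location="us-central1",
    node_count=3,
    machine_type="n1-standard-1",
    disk_size_gb=100,
    tags=("composer",),
)
```

Collecting the settings in one immutable object makes it easier to review or version them before they are passed to whichever tool actually creates the environment.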
Once a Composer user creates the environment, they can configure email notifications via the SendGrid service. Developers receive 12,000 free emails per month. If they don't want to use SendGrid, users can go with third-party Simple Mail Transfer Protocol options.
Google Cloud Composer workflows are described as DAGs. A DAG is a set of tasks to be run, together with their order, relationships and dependencies. Its primary function is to ensure each task is completed as specified. To cite an example from Google, if a three-node DAG has tasks A, B and C, the workflow might specify that task A needs to run before B, but C can run at any time. A DAG might also specify constraints, such as, if task A fails, it can be restarted up to five times. Tasks A, B and C could be anything, such as running a Spark job on Google Cloud Dataproc.
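The A, B and C example can be modeled in a few lines of standard-library Python to show what the ordering and retry rules mean in practice. This is a rough sketch and not Airflow code: it uses `graphlib.TopologicalSorter` (Python 3.9 or later) in place of the Airflow scheduler, and the helper names `run_with_retries` and `run_dag`, as well as the simulated failure in task A, are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

MAX_RESTARTS = 5  # "restarted up to five times" in the example above


def run_with_retries(action, max_restarts=MAX_RESTARTS):
    """Run action(); on failure, restart it up to max_restarts times.

    Returns the total number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return attempts
        except Exception:
            if attempts > max_restarts:
                raise  # restarts exhausted, surface the failure


def run_dag(dependencies, actions):
    """Run every task after the tasks it depends on.

    dependencies maps a task name to the set of tasks that must run first.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    attempts = {name: run_with_retries(actions[name]) for name in order}
    return order, attempts


# Simulated task A: fails twice, then succeeds.
calls = {"A": 0}


def task_a():
    calls["A"] += 1
    if calls["A"] <= 2:
        raise RuntimeError("transient failure")


# A must run before B; C has no dependencies and can run at any point.
dependencies = {"A": set(), "B": {"A"}, "C": set()}
actions = {"A": task_a, "B": lambda: None, "C": lambda: None}

order, attempts = run_dag(dependencies, actions)
print("execution order:", order)
print("attempts per task:", attempts)
```

In a real deployment, the tasks would be Airflow operators and the scheduler would decide when each one runs; the sketch only illustrates the dependency and retry semantics.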
Cloud Composer pricing
Like other GCP services, Google Cloud Composer pricing is based on resource consumption, which is measured by the size of the environment and how long it runs. Specifically, users are billed per minute based on the number and size of web server nodes, database storage and network egress.
In addition to Cloud Composer, users pay for the services that enable it to run:
- underlying Kubernetes nodes, which run the Airflow worker and scheduler processes;
- Google Cloud Storage buckets, which store DAG workflows and task logs; and
- Stackdriver Monitoring data collection.
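As a rough illustration of how per-minute billing adds up, the sketch below totals the three Composer components named earlier: web server nodes, database storage and network egress. Every rate in it is a made-up placeholder rather than a Google price, the function and field names are hypothetical, and the charges for Kubernetes nodes, Cloud Storage and Stackdriver listed above are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceholderRates:
    """Hypothetical rates in dollars; NOT actual Cloud Composer prices."""
    web_node_per_minute: float = 0.001
    database_gb_per_minute: float = 0.00001
    egress_per_gb: float = 0.10


def estimate_composer_fee(minutes, web_nodes, database_gb, egress_gb,
                          rates=PlaceholderRates()):
    """Sum the per-minute and per-GB components of the Composer fee."""
    node_cost = web_nodes * minutes * rates.web_node_per_minute
    storage_cost = database_gb * minutes * rates.database_gb_per_minute
    egress_cost = egress_gb * rates.egress_per_gb
    return node_cost + storage_cost + egress_cost


# A 30-day month for an environment that is never shut down.
minutes_in_month = 60 * 24 * 30
total = estimate_composer_fee(
    minutes=minutes_in_month, web_nodes=2, database_gb=20, egress_gb=10
)
print(f"estimated monthly Composer fee: ${total:.2f}")
```

Because the fee scales with running time, an environment left up around the clock accrues charges even when no workflows are executing.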
To more accurately estimate your costs, refer to the GCP documentation's pricing example.