
Tip

Hadoop workflow automation lets IT pros go with the flow

Hadoop workflow managers are not just resource schedulers. They handle intricate tasks and service handoffs that are essential for managing big data services.

Kurt Marko, MarkoInsights

Published: 05 Jul 2016

IT pros have been managing batch processes since mainframe jobs were run on Hollerith cards -- prompting a reliance on system automation software. So it's no surprise that batch processing of Hadoop jobs has spawned a plethora of workflow managers.

Hadoop has a built-in process and resource scheduler, YARN, for managing job allocation to various physical systems and storage volumes. Hadoop workflow managers allow IT teams to create complex scripts that control job creation, execution and output. Using a workflow manager with YARN is akin to using a scripting language such as Awk, Bash, Perl or Python on Linux in conjunction with the operating system's native cron scheduler.

IT teams that need to execute complex Hadoop tasks should understand the major features and various options of a workflow manager on Hadoop, including some that are native to a particular Hadoop distribution.

Workflow vs. process scheduling

People often confuse a resource scheduler, called a negotiator in Hadoop-speak, with a workflow manager. Negotiators are necessary elements of distributed systems such as Hadoop. The scheduler spawns processes on multiple nodes and allocates resources based on application requirements and the available capacity within the cluster. Hadoop's Yet Another Resource Negotiator (YARN) operates transparently to the user.

In contrast, workflow managers coordinate complex Hadoop tasks that consist of multiple jobs that run sequentially, in parallel or in response to event triggers. A job can be many things: running an individual Java app, accessing the Hadoop file system or other data stores, or running any of various Hadoop applications, including:

Hive data warehouse;
Pig high-level, data-flow language and execution framework for parallel computation;
MapReduce programming framework for parallel processing in massively scalable applications;
Sqoop bulk data transfer between Hadoop and a relational database management system; and
Spark general compute engine for applications such as extract, transform and load (ETL) as well as machine learning, stream processing and graph computation.

Features of a Hadoop workflow manager

The complexity and variety of big data workflows necessitated the creation of several different workflow management tools. The Apache Software Foundation, which manages the Hadoop open source project, developed some of these managers. Third parties also initiated and contributed workflow managers to the Hadoop project, while still others are part of different commercial Hadoop distributions.

All Hadoop workflow managers provide control over task execution, options and dependencies. Most workflow managers break down tasks as a directed acyclic graph (DAG) in which each action is a node or vertex that can trigger other actions, but cannot loop back on itself. Actions can be Hadoop applications or even subsidiary workflows, and a workflow can also include flow-control operators for logical decisions, forks and joins. The workflow defines all of the properties and parameters each action uses and then reports progress using standard state descriptions -- created, running, suspended, succeeded, killed and failed.
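
To make the DAG model concrete, here is a minimal sketch in plain Python rather than in any particular workflow manager's format. The function run_workflow, the State enum and the FORK_JOIN example are hypothetical names chosen for illustration, and only a subset of the states listed above is modeled. It assumes Python 3.9 or later for the standard library's graphlib module.

```python
from enum import Enum
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class State(Enum):
    # A subset of the states named in the article.
    CREATED = "created"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


def run_workflow(dag, actions):
    """Run each action once all of its upstream actions have succeeded.

    dag maps an action name to the set of actions it depends on.
    actions maps an action name to a zero-argument callable.
    """
    states = {name: State.CREATED for name in actions}
    # Building the full order first validates the graph: a loop in the
    # dependencies raises graphlib.CycleError before any action runs.
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        upstream = dag.get(name, ())
        if any(states[dep] is not State.SUCCEEDED for dep in upstream):
            continue  # an upstream action did not succeed; stay in CREATED
        states[name] = State.RUNNING
        try:
            actions[name]()
            states[name] = State.SUCCEEDED
        except Exception:
            states[name] = State.FAILED
    return states


# Hypothetical fork/join workflow: ingest forks into two cleaning
# actions, which join again at the report action.
FORK_JOIN = {
    "ingest": set(),
    "clean_logs": {"ingest"},
    "clean_users": {"ingest"},
    "report": {"clean_logs", "clean_users"},
}
```

This sketch runs actions one at a time. A real workflow manager would typically run the two forked actions in parallel and would also handle suspension, kills and retries.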

Apache Hadoop vs. third-party distributions

Hadoop workflow managers demonstrate fundamental differences in the programming model/language, code complexity, property/parameter description format, supported applications, scalability, documentation and support.

Azkaban is one of the first Hadoop workflow schedulers and is designed around a simple web user interface (UI). Azkaban tackles the problem of Hadoop job dependencies, such as in data warehouse applications where tasks need to run in a specific order, from data ingestion through ETL to data analysis. The scheduler was developed by LinkedIn.

Oozie is the default workflow scheduler for Apache Hadoop. Oozie represents jobs as DAGs that can be scheduled or triggered by events or data availability. An XML variant called Hadoop Process Definition Language (hPDL) defines workflows; however, as with Azkaban, a web UI is used to simplify workflow design.
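
The idea of triggering on data availability can be sketched in a few lines of plain Python. This is not Oozie's API or its XML format. It is an illustration that uses the local file system in place of the Hadoop file system and assumes that upstream jobs drop a marker file named _SUCCESS into their output directory when they finish, which is the convention Hadoop MapReduce output follows. The function names inputs_ready and maybe_trigger are hypothetical.

```python
from pathlib import Path


def inputs_ready(input_dirs, flag_name="_SUCCESS"):
    """Return True only when every input directory contains the done flag."""
    return all((Path(d) / flag_name).is_file() for d in input_dirs)


def maybe_trigger(input_dirs, job):
    """Run job() if all inputs are available; report whether it ran."""
    if not inputs_ready(input_dirs):
        return False
    job()
    return True
```

A scheduler would call something like maybe_trigger on a timer for each pending run, so a job starts as soon as its last input lands rather than at a fixed clock time.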


Airflow is also based on DAGs and programmed via a command-line interface or web UI. Airflow was developed by Airbnb to author and monitor data pipelines. Its pipelines are defined via Python code, which allows for dynamic pipeline generation -- pipelines that build other pipelines -- and extensibility. Its modular design is highly scalable, using a message queue to orchestrate an arbitrary number of workers.
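
Dynamic pipeline generation is easier to picture with a small example. The sketch below deliberately avoids Airflow's own classes and uses a plain dictionary as the pipeline description, so build_pipeline and the task names are hypothetical. The point is only that when a pipeline is ordinary Python code, a loop over a list can create its tasks and dependencies.

```python
def build_pipeline(tables):
    """Return a task -> upstream-tasks mapping generated from a table list."""
    # One final task that waits for every per-table load task.
    pipeline = {"publish_report": set()}
    for table in tables:
        extract = f"extract_{table}"
        load = f"load_{table}"
        pipeline[extract] = set()        # extract has no upstream tasks
        pipeline[load] = {extract}       # load runs after its extract
        pipeline["publish_report"].add(load)
    return pipeline
```

Adding a table to the input list adds two tasks and one dependency without anyone editing the pipeline definition by hand. In Airflow, a similar loop would create operator objects inside a DAG definition instead of dictionary entries.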

Pinball is a flexible Hadoop workflow manager that works on a customizable workflow abstraction layer. In Pinball, workflows take action on tokens, which are elements of system state, uniquely tracked and controlled by a master controller and temporarily checked out to worker tasks. According to its developers at Pinterest, workers in Pinball impersonate application-specific clients. Executor workers run jobs or delegate the execution to an external system and monitor progress. Scheduler workers instantiate workflows at predefined time intervals. Pinball also includes a UI for visualizing workflow execution and job logs.
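
The token model can be illustrated with a toy master that leases tokens to workers for a limited time. This is a sketch of the general idea under stated assumptions, not Pinball's actual API: the Token and Master classes, their fields and the lease mechanism shown here are hypothetical simplifications.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    name: str
    owner: Optional[str] = None
    lease_expires: float = 0.0


class Master:
    """Single owner of all tokens; workers only hold time-limited leases."""

    def __init__(self):
        self._tokens = {}

    def add(self, name):
        self._tokens[name] = Token(name)

    def checkout(self, worker, lease_seconds, now=None):
        """Lease the first unowned or expired token to worker, else None."""
        now = time.time() if now is None else now
        for token in self._tokens.values():
            if token.owner is None or token.lease_expires <= now:
                token.owner = worker
                token.lease_expires = now + lease_seconds
                return token.name
        return None

    def release(self, worker, name):
        """Remove a finished token; only the current owner may do so."""
        token = self._tokens.get(name)
        if token is None or token.owner != worker:
            return False
        del self._tokens[name]
        return True
```

Because a lease expires, a token held by a crashed worker becomes available to another worker after the timeout, while the master remains the only place where state is tracked.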

Luigi was developed by the online service Spotify to handle its big data needs. Its workflows exist as Python code, and a web UI is available to help in workflow creation. Because it was designed for long-running batch processes, Luigi tasks cannot be defined dynamically. However, tasks can include many major Hadoop packages such as Hive, Pig and Spark.

Other workflow managers target specific types of applications:

BMC Software's Control-M is a commercial replacement for, and superset of, Oozie;
Zaloni Bedrock is a Hadoop-based data management platform;
Hortonworks' Tez is a framework for data processing pipelines;
NiFi is data-flow software for stream analytics from Hortonworks;
Falcon, also from Hortonworks, is a framework for data replication, lifecycle management and compliance archival; and
Schedoscope is a scheduling framework for Hadoop application development, testing and deployment.

Implications for data admins

When Hadoop enters an IT environment, it comes with an entire ecosystem of data analysis, storage, application and workflow software. Each category requires careful study to determine the best tool for the job and how it fits into an organization's overall big data strategy. Due diligence in implementing big data should prevent developers from reinventing the wheel and overloading IT staff with tools, each of which creates administrative and training overhead.

When it comes to Hadoop workflow managers, it's best to start with the default choice, namely Oozie, until you identify specific requirements that it doesn't cover, and alternatives that are a better fit.

