Apache Falcon is a data management tool for overseeing data pipelines in Hadoop clusters, with the goal of ensuring consistent and dependable execution of complex processing jobs.
Using XML, system administrators can define operational and data governance policies for Hadoop workflows in Falcon. The Falcon tooling can then manage the dependencies between system infrastructure, data and processing logic. That can be very useful because large-scale Hadoop applications are difficult to manage by hand: they can involve hundreds or thousands of compute nodes, with many such jobs typically running on a cluster at any given time.
Under the hood, Falcon relies on the Apache Oozie job scheduler to generate and run the processing workflows, while users work with Falcon declaratively through simple APIs.
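As a rough illustration of that declarative style, a Falcon process entity ties together a cluster, input and output data feeds, and an Oozie workflow. The sketch below follows the general shape of Falcon's process XML; the entity names, dates and HDFS paths are invented for the example:

```xml
<!-- Illustrative Falcon process definition; names and paths are hypothetical -->
<process name="cleanse-logs" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="primary-cluster">
            <validity start="2014-11-01T00:00Z" end="2099-01-01T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <!-- Declares a dependency on a separately defined feed entity -->
        <input name="input" feed="raw-logs" start="today(0,0)" end="today(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleansed-logs" instance="today(0,0)"/>
    </outputs>
    <!-- Falcon hands the actual execution to Oozie -->
    <workflow engine="oozie" path="/apps/cleanse/workflow.xml"/>
    <retry policy="periodic" delay="minutes(10)" attempts="3"/>
</process>
```

Because the process only names the feeds it consumes and produces, Falcon can track those dependencies and schedule the workflow whenever the input data becomes available.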
The software also lets administrators define policies for replication, retention and archiving of incoming data. For example, it provides capabilities for tagging data to comply with data retention and discovery requirements. Next steps include expanded support for data governance via enhanced monitoring and tracing of job failures and other processing events.
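Those lifecycle policies attach to Falcon's feed entities. A rough sketch of a feed definition, with a retention rule and a source/target cluster pair for replication, might look like the following; again, the cluster names, limits, tags and paths are illustrative rather than taken from a real deployment:

```xml
<!-- Illustrative Falcon feed definition; all names, tags and paths are hypothetical -->
<feed name="raw-logs" xmlns="uri:falcon:feed:0.1">
    <!-- Free-form tags support discovery and compliance classification -->
    <tags>owner=web-team,classification=audit-required</tags>
    <frequency>hours(1)</frequency>
    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2014-11-01T00:00Z" end="2099-01-01T00:00Z"/>
            <!-- Retention policy: evict data older than 90 days -->
            <retention limit="days(90)" action="delete"/>
        </cluster>
        <!-- A target cluster tells Falcon to replicate the feed there -->
        <cluster name="backup-cluster" type="target">
            <validity start="2014-11-01T00:00Z" end="2099-01-01T00:00Z"/>
            <retention limit="months(36)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/data/logs/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="etl-user" group="hadoop" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
```

Defining retention and replication on the feed itself, rather than inside each job, is what lets Falcon apply these governance rules uniformly across every workflow that touches the data.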
The Falcon software was initially created by developers at online ad broker InMobi. Together with engineers from Hadoop distribution provider Hortonworks, the InMobi development team initiated Falcon as an incubator project at the Apache Software Foundation in April 2013.
As of this writing, Apache has made four releases of the technology available -- the most recent was Apache Falcon 0.6, which was released in late 2014 and included an improved user interface and additional REST APIs, as well as updated documentation and bug fixes. At the same time, Falcon moved out of incubation status and became an Apache top-level project. Hortonworks became the first vendor to incorporate Falcon into its commercial Hadoop distribution, starting with Version 2.2 of the Hortonworks Data Platform, also in late 2014.