
Apache Airflow: Tutorial and Beginner's Guide

This Apache Airflow tutorial is for you if you’ve ever scheduled jobs with Cron and are familiar with the following situation:

Image source: xkcd: Data Pipeline

We all know Cron is great: simple, easy, fast, reliable… until it isn’t. As the complexity of scheduled jobs grows, we often find ourselves with a giant house of cards that is virtually unmanageable. How can you improve that? Here’s where Apache Airflow comes to the rescue. What is Airflow? It’s a platform to programmatically author, schedule and monitor workflows. It was started a few years ago at Airbnb, has since been open-sourced and has gained a lot of traction in recent years. After a few years in incubation at Apache, it has recently become an Apache Top-Level Project (TLP). In this article, I will show you what problems can be solved using Airflow, how it works, what its key components are and how to use it, all on a simple example. Let’s get started!

Airflow overview

Both Airflow itself and all the workflows are written in Python. This has a lot of benefits, mainly that you can easily apply good software development practices to the process of creating your workflows (which is harder when they are defined, say, in XML). These include code versioning, unit testing, avoiding duplication by extracting common elements, and so on. Moreover, Airflow provides an out-of-the-box browser-based UI where you can view logs, track the execution of workflows and order reruns of failed tasks, among other things. Airflow has seen high adoption since its inception, with over 230 companies officially listed as users at the time of writing. Here are some of them:

[Image: logos of some of the companies using Apache Airflow]

Use cases

Think of Airflow as an orchestration tool that coordinates work done by other services. It is not a data streaming solution: even though tasks can exchange some metadata, they do not move data between themselves. Examples of use cases well suited to Airflow:

  • ETL (extract, transform, load) jobs - extracting data from multiple sources, transforming it for analysis and loading it into a data store
  • Machine Learning pipelines
  • Data warehousing
  • Orchestrating automated testing
  • Performing backups

It is generally best suited for regular operations which can be scheduled to run at specific times.

Core concepts

Let’s now go over a few basic concepts in Airflow and the building blocks that enable you to create your workflows.

Airflow DAG

The first of these is the DAG, short for Directed Acyclic Graph. It’s a collection of all the tasks you want to run, organized in a way that reflects the dependencies between them. The DAG doesn’t actually care about what goes on inside its tasks and doesn’t do any processing itself; its job is to make sure that whatever the tasks do happens at the right time and in the right order. Airflow DAGs are defined in standard Python files, and in general one DAG file should correspond to a single logical workflow.
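To make this concrete, here is a minimal sketch of what such a DAG file could look like. This is not the DAG from this article’s example: the dag_id, task ids, start date and schedule below are made up for illustration, and the imports assume an Airflow 1.10-style installation.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="example_dag",                 # made-up workflow name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # The DAG only defines ordering; the tasks themselves do the actual work.
    extract >> load

The “>>” syntax tells Airflow that “load” may only run after “extract” has finished, which is exactly the kind of dependency a DAG exists to enforce.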

Image source: Developing elegant workflows with Apache Airflow

Airflow operators

While DAGs describe how to run a workflow, Airflow operators determine what actually gets done. There are several types of operators:

  • action operators, which perform a single operation and return (e.g. BashOperator, GoogleCloudStorageDownloadOperator),
  • sensors, which pause execution until certain criteria are met, such as a certain object appearing in a storage bucket (e.g. GoogleCloudStorageObjectSensor),
  • and finally transfer operators, which connect two services and enable sending data between them (e.g. GoogleCloudStorageToS3Operator).

An operator is simply a Python class with an “execute()” method, which gets called when it is being run.

from airflow.models import BaseOperator

class ExampleOperator(BaseOperator):
    def execute(self, context):
        # Do something
        pass

In the same vein, a sensor operator is a Python class with a “poke()” method, which gets called repeatedly until it returns “True”.

from airflow.sensors.base_sensor_operator import BaseSensorOperator  # import path as of Airflow 1.10

class ExampleSensorOperator(BaseSensorOperator):
    def poke(self, context):
        # Check if the condition occurred
        return True

Airflow ships with many built-in operators for various services (and new ones are being added all the time). However, if you need functionality which isn’t there, you can always write an operator yourself. There are a few good practices one should follow when writing operators:

  • Idempotence: They should be idempotent, meaning that each subsequent execution should produce the same end result and that it should be safe to execute them multiple times.

  • Atomicity: An Airflow operator should represent a non-divisible unit of work. It should either fail or succeed completely, just like a database transaction.

  • Metadata exchange: Because Airflow is a distributed system, operators can actually run on different machines, so you can’t exchange data between them using, for example, variables in the DAG file. If you need to exchange metadata between tasks, you can do it in two ways (see the sketch after this list):

    • For small portions of metadata use XCOM (name comes from cross-communication), which is just a record in a central database that the operators can write to and read from.
    • For larger data, such as feeding the output of one operator into another, it’s best to use a shared network storage or a data lake such as S3, and just pass its URI via XCOM to other operators.
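To make the XCom mechanism a bit more concrete, below is a minimal sketch of two PythonOperator tasks passing a storage URI between them. The dag_id, task ids, the “gcs_uri” key and the bucket path are all made up for illustration, and the imports plus the provide_context flag assume an Airflow 1.10-style installation.

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def produce(**context):
    # Push a small piece of metadata: here, the URI of a file written elsewhere.
    context["ti"].xcom_push(key="gcs_uri", value="gs://example-bucket/output.csv")


def consume(**context):
    # Pull the URI pushed by the upstream task and act on it.
    uri = context["ti"].xcom_pull(task_ids="produce_uri", key="gcs_uri")
    print("Received from upstream: %s" % uri)


with DAG(
    dag_id="xcom_example",                # made-up workflow name
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,
) as dag:
    producer = PythonOperator(
        task_id="produce_uri",
        python_callable=produce,
        provide_context=True,             # Airflow 1.10: pass the context as kwargs
    )
    consumer = PythonOperator(
        task_id="consume_uri",
        python_callable=consume,
        provide_context=True,
    )
    producer >> consumer

Note that only the URI travels through XCom; the actual data would be written to and read from the shared storage by the tasks themselves.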

Task

In order to execute an operator, we need to create a task, which is a representation of the operator with a particular set of input arguments. For example, we have one BashOperator, but we can create three different “bash tasks” in a DAG, where each task is passed a different bash command to execute, as shown in the sketch below. So far we have the DAG, operators and tasks.
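As a quick sketch of that idea (the task ids and shell commands below are made up, and a dag object like the one shown earlier is assumed):

# Three tasks built from the same BashOperator class, each with different arguments.
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)
list_tmp = BashOperator(task_id="list_tmp", bash_command="ls /tmp", dag=dag)
say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)

Each of these tasks shows up separately in the UI and can succeed or fail independently of the others.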

But wait - there’s more! When the DAG is run, each Task spawns a TaskInstance - an instance of a task tied to a particular time of execution. All task instances in a DAG are grouped into a DagRun.

So to summarize: a DAG consists of tasks, which are parameterized representations of operators. Each time the DAG is executed a DagRun is created which holds all TaskInstances made from tasks for this run.

Image source: Understanding Apache Airflow’s key concepts

Above is an example of the UI showing a DAG, all the operators (upper-left) used to generate tasks (lower-left) and the TaskInstance runs inside DagRuns (lower-right). White box - task not run, light green - task running, dark green - task completed successfully.

If all that’s still a bit unclear, make sure to check the example below to see how it’s used in practice.

Example

To show the elements of our Apache Airflow tutorial in practice, we’ve created an example DAG which is available on GitHub. We use it internally in our company to monitor instances of various services running in our project on Google Cloud Platform (GCP). The DAG runs every day at 5 PM, queries each service for the list of instances, then aggregates the results and sends us a message via Slack and email. Feel free to take a look at the code to see what a full DAG can look like.
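The “every day at 5 PM” part lives in the DAG definition itself. As a sketch, it can be expressed with a cron-style schedule (the dag_id below is hypothetical, not necessarily the one used in the repository):

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="gcp_instance_monitoring",     # made-up name for illustration
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 17 * * *",       # cron expression: every day at 5 PM
)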

Summary

Thanks to its out-of-the-box features, Python-defined workflows, wide adoption and vibrant community, Airflow is a really great tool that’s here to stay. Make sure to try it out for yourself and see if it can help you get rid of those pesky, unmaintainable cron jobs in your pipelines ;)

Szymon

Software Engineer

Did you enjoy the read?

If you have any questions, don’t hesitate to ask!
