
Apache Airflow: Tutorial and Beginners Guide

This Apache Airflow tutorial is for you if you’ve ever scheduled jobs with Cron and are familiar with the following situation:

image1

Image source: xkcd: Data Pipeline

We all know Cron is great: simple, easy, fast, reliable… Until it isn’t. When the complexity of scheduled jobs grows, we often find ourselves with a giant house of cards that is virtually unmanageable. How can you improve that? Here’s where Apache Airflow comes to the rescue! It’s a platform to programmatically author, schedule and monitor workflows. It was started a few years ago at Airbnb, has since been open-sourced and has gained a lot of traction in recent years. After a few years in incubation at Apache, it has recently become an Apache Top-Level Project (TLP). In this article, I will show you what problems can be solved using Airflow, how it works, what its key components are and how to use it, with a simple example. Let’s get started!

Airflow overview

Both Airflow itself and all the workflows are written in Python. This has a lot of benefits, mainly that you can easily apply good software development practices to the process of creating your workflows (which is harder when they are defined, say, in XML). These include code versioning, unit testing, avoiding duplication by extracting common elements, etc. Moreover, Airflow provides an out-of-the-box browser-based UI where you can view logs, track the execution of workflows and order reruns of failed tasks, among other things. Airflow has seen high adoption since its inception, with over 230 companies (officially) using it as of now. Here are some of them:

image5

Use cases

Think of Airflow as an orchestration tool that coordinates work done by other services. It’s not a data streaming solution: even though tasks can exchange some metadata, they do not move data among themselves. Examples of use cases suitable for Airflow:

  • ETL (extract, transform, load) jobs - extracting data from multiple sources, transforming for analysis and loading it into a data store
  • Machine Learning pipelines
  • Data warehousing
  • Orchestrating automated testing
  • Performing backups

It is generally best suited for regular operations which can be scheduled to run at specific times.

Core concepts

Let’s now go over a few basic concepts in Airflow and the building blocks that enable you to create your workflows.

Airflow DAG

The first of these is the DAG - short for Directed Acyclic Graph. It’s a collection of all the tasks you want to run, taking into account the dependencies between them. The DAG doesn’t actually care about what goes on in its tasks - it doesn’t do any processing itself. Its job is to make sure that whatever the tasks do happens at the right time and in the right order. Airflow DAGs are defined in standard Python files, and in general one DAG file should correspond to a single logical workflow.
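To give you a feel for how that looks, below is a minimal sketch of a DAG definition file. The dag_id, schedule and start date are illustrative assumptions, not part of any particular workflow.

from datetime import datetime

from airflow import DAG

# A minimal, hypothetical DAG definition: it runs once a day from the given start date.
dag = DAG(
    dag_id="example_workflow",
    schedule_interval="@daily",
    start_date=datetime(2019, 1, 1),
)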

image2

Image source: Developing elegant workflows with Apache Airflow

Airflow operators

While DAGs describe how to run a workflow, Airflow operators determine what actually gets done. There are several types of operators:

  • action operators, which perform a single operation and return (e.g. BashOperator, GoogleCloudStorageDownloadOperator),
  • sensors, which pause execution until certain criteria are met, such as a certain object appearing in a bucket (e.g. GoogleCloudStorageObjectSensor),
  • and finally transfer operators, which connect two services and enable sending data between them (e.g. GoogleCloudStorageToS3Operator).

An operator is simply a Python class with an “execute()” method, which gets called when it is being run.

from airflow.models import BaseOperator

class ExampleOperator(BaseOperator):
    def execute(self, context):
        # Do whatever the task is supposed to do
        pass

In the same vein, a sensor operator is a Python class with a “poke()” method, which gets called repeatedly until it returns “True”.

from airflow.sensors.base_sensor_operator import BaseSensorOperator

class ExampleSensorOperator(BaseSensorOperator):
    def poke(self, context):
        # Check whether the awaited condition has occurred
        return True

In Airflow there are many built-in operators for various services (and new ones are being added all the time). However, if you need functionality which isn’t there, you can always write an operator yourself. There are a few good practices one should follow when writing operators:

  • Idempotence: They should be idempotent, meaning that each subsequent execution should produce the same end result and that it should be safe to execute them multiple times.

  • Atomicity: An Airflow operator should represent a non-divisible unit of work. It should either fail or succeed completely, just like a database transaction.

  • Metadata exchange: Because Airflow is a distributed system, operators can actually run on different machines, so you can’t exchange data between them through, for example, variables in the DAG file. If you need to exchange metadata between tasks you can do it in two ways:

    • For small portions of metadata use XCom (the name comes from cross-communication), which is just a record in a central database that the operators can write to and read from (see the sketch after this list).
    • For larger data, such as feeding the output of one operator into another, it’s best to use shared network storage or a data lake such as S3, and just pass its URI via XCom to other operators.
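As a rough illustration, here is a hedged sketch of two custom operators exchanging a small piece of metadata via XCom; the key name, task id and URI are made up for the example.

from airflow.models import BaseOperator

class ProduceOperator(BaseOperator):
    def execute(self, context):
        # Store the location of the produced data in the central database (XCom).
        context["ti"].xcom_push(key="data_uri", value="s3://some-bucket/output.csv")

class ConsumeOperator(BaseOperator):
    def execute(self, context):
        # A downstream task reads the location back and fetches the data from there.
        uri = context["ti"].xcom_pull(key="data_uri", task_ids="produce_task")
        print("Reading data from %s" % uri)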

Task

In order to execute an operator we need to create a task, which is a representation of the operator with a particular set of input arguments. For example, there is only one BashOperator, but we can create three different “bash tasks” in a DAG, where each task is passed a different bash command to execute, as shown in the sketch below. So far we have the DAG, operators and tasks.
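For illustration, here is a sketch of what that could look like; the DAG name, task ids and commands are arbitrary examples.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG("bash_examples", schedule_interval="@daily", start_date=datetime(2019, 1, 1))

# One operator class, three different tasks - each parameterized with its own command.
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)
list_files = BashOperator(task_id="list_files", bash_command="ls -l", dag=dag)
say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)

# Define the order: print the date first, then run the other two tasks.
print_date >> list_files
print_date >> say_hello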

But wait - there’s more! When the DAG is run, each Task spawns a TaskInstance - an instance of a task tied to a particular time of execution. All task instances in a DAG are grouped into a DagRun.

So to summarize: a DAG consists of tasks, which are parameterized representations of operators. Each time the DAG is executed a DagRun is created which holds all TaskInstances made from tasks for this run.

image7

Image source: Understanding Apache Airflow’s key concepts

Above is an example of the UI showing a DAG, all the operators (upper-left) used to generate tasks (lower-left) and the TaskInstance runs inside DagRuns (lower-right). White box - task not run, light green - task running, dark green - task completed successfully.

If all that’s still a bit unclear, make sure to check the example below to see how it’s used in practice.

Example

To show you the elements of our Apache Airflow tutorial in practice, we’ve created an example DAG which is available on GitHub. We use it internally in our company to monitor instances of various services running in our project in Google Cloud Platform (GCP). The DAG runs every day at 5 PM, queries each service for the list of instances, then aggregates the results and sends us a message via Slack and email. Feel free to take a look at the code to see what a full DAG can look like.

image9
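Below is a rough, hypothetical skeleton showing how such a daily 5 PM schedule and task ordering could be expressed; it is not the actual code from the linked repository, and the task names are placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG(
    "instance_monitoring",
    schedule_interval="0 17 * * *",  # every day at 5 PM
    start_date=datetime(2019, 1, 1),
)

# Placeholder tasks standing in for the real "query", "aggregate" and "notify" steps.
query_services = DummyOperator(task_id="query_services", dag=dag)
aggregate_results = DummyOperator(task_id="aggregate_results", dag=dag)
send_notifications = DummyOperator(task_id="send_notifications", dag=dag)

query_services >> aggregate_results >> send_notifications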

Summary

Thanks to its out-of-the-box features, Python-defined workflows, wide adoption and vibrant community, Airflow is a really great tool that’s here to stay. Make sure to try it out for yourself and see if it can help you get rid of those pesky, unmaintainable cron jobs in your pipelines ;)



Szymon, Software Engineer
