December 15, 2020 | 16 min read
Airflow 2.0 Providers
Airflow 2.0 is a huge change in the workflow management ecosystem. We already wrote about the new way of authoring DAGs, but there are so many new things in Airflow 2.0 that it’s hard to keep up. However, one topic is very dear to my heart—the project I had been driving in the Airflow team for nearly a year. Let’s talk about Airflow Providers.
Airflow 1.10 has a very monolithic approach. It contains the core scheduling system and all the integrations with external services—hooks, operators, sensors. Everything but the kitchen sink was thrown into a single “apache-airflow” package, whether you wanted to use it or not. That’s convenient (one package to rule them all!), but not very friendly, especially when it comes to dependencies. Even if you never use, for example, the Papermill operator, you still have the Papermill modules installed. And sometimes, if you don’t install the right extras, the modules will not even import properly, because they lack the required dependencies!

Those classes are also scattered across different packages in 1.10: airflow.operators and airflow.contrib.operators. This made sense when Airflow was created at Airbnb: operators developed there were added to “airflow.operators”, while operators developed outside of Airbnb and contributed to Airflow were added to the “contrib” folder. Years passed, and the distinction between “core” and “contrib” operators disappeared—not even the long-timers in the Airflow community remember why they live in separate folders.
How to make sense of it and add some order to the way we manage different operators, hooks and sensors?
Meet Airflow Providers!
Airflow Providers is an idea that was forged in the community and took more than 2 years to complete. With Airflow 2.0 we are introducing a new way in which Apache Airflow is built internally. Airflow 2.0 consists of (altogether) more than 60 (!) packages. Yeah, you read that right: six-ty, not six-teen. Airflow’s core is the tiny (but how important!) “apache-airflow” package, but if you look there for Google, Amazon, or even Postgres or HTTP operators, you won’t find any! Why? Because they are separated from the core of Airflow. Each of the “providers” (as we call them in Apache Airflow nomenclature) has its own package, nicely separated in its own “airflow.providers” sub-package, with fully controlled and documented dependencies between them.
Your first reaction might be: “Noooooooooooo! I want my simple Airflow installation back! Should I install all those packages now instead of one? Will that make it a nightmare to manage and upgrade all those packages separately? Does it mean that I have to rewrite all my DAGs?”
Calm down! We’ve got all your questions covered. Just read on!
First things first—what benefits do you get as a user from splitting Airflow into providers?
Quite a few in fact:
- Size of Airflow deployment
- Earlier access to the latest integrations
- Faster, incremental upgrade path for providers
- Backport providers for 1.10 series
- Safer migration path to Airflow 2.0
- Custom, 3rd-party providers
Let’s discuss those benefits one-by-one.
Your Apache Airflow installation can be a lot smaller if you only use some of the providers. Especially if you run Airflow in containers on Docker, Kubernetes, or another container platform, the Airflow images customized for your case can be very small. The official Airflow Production Reference image is just 200 MB (some older images were over 400 MB), and it can be made even smaller if you customize it. If you want to learn how to customize or extend the official image to suit your needs, check out this documentation. If you prefer to watch and listen, feel free to go to the talk I gave at the Airflow Summit 2020.
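For a flavor of what that customization looks like, a slimmed-down image with only the providers you actually need can be built by extending the official image. This is only a minimal sketch: the base image tag and provider package names are real, but pick the versions that match your deployment.

```dockerfile
# Extend the official Airflow 2.0 production image with just two providers.
FROM apache/airflow:2.0.0

# Install only the providers this deployment needs; everything else
# stays out of the image, keeping it small.
RUN pip install --no-cache-dir \
    apache-airflow-providers-google \
    apache-airflow-providers-amazon
```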
The world around us changes. Fast. And Airflow—being a workflow management system—should change as fast as the services it orchestrates (which is to say, every single day). With 60 providers covering even more services (the Google provider alone covers more than 50 Google Cloud Platform services), and assuming each of those services changes just once a year, keeping up with the updates would mean upgrading Airflow almost every single day. This is obviously not possible. With the Airflow release cadence (more or less every 2-3 months), you would have to wait at least those 2-3 months even if a new feature for a particular service was added immediately after a release. That’s not good, and this is how it worked for a while in Airflow 1.10. But with the providers, we can do so much better. Thanks to the split, we can release every provider separately, on its own cadence. You don’t have to wait until the next version of Airflow is out if there is only a change to the Amazon provider. We are currently working out ways to release the providers more frequently, on an ad-hoc basis. Expect much more frequent releases of the providers than of Airflow itself—especially for providers like Google that change almost daily.
Have you ever upgraded Apache Airflow to the latest version? Was it a difficult task? Did it take a lot of testing and involvement of your DevOps team, security reviews, and going through staging?
You probably answered yes to all those questions—and for a good reason. When you upgrade a system like Airflow and your business continuity depends on it, you’d better be careful. This means that the delay before you see the latest version of your service’s operator is even longer than the Airflow release schedule. Your DevOps team might decide that they need more time and don’t want to install a particular version of Airflow. Yet you might want to use one new operator that appeared in Airflow a week ago. What do you do? Providers to the rescue! When the providers are released separately, they can also be… upgraded separately. Yeah. You will be able to upgrade a single provider, without impacting the whole platform! How cool is that? And you will fly through security reviews and DevOps approvals, because this time they only have to review one small, tiny package. Best of all, your test and deployment team will be able to upgrade that new provider, and downgrade it right away if they find any problem! It can’t get any better than that.
Lately, the situation with using new operators, hooks, and sensors in 1.10 was actually even worse. Remember the last time you saw a fresh batch of new operators in Airflow 1.10? Probably not, and again for a good reason. Airflow 2.0 development started a long time ago, and after a while we slowed down the cherry-picking of the latest operators from 2.0 to 1.10. In the beginning cherry-picking was easy, but then it became almost impossible: with Airflow 2.0 dropping support for Python 2 and the package structure changing to providers, trying to keep up became harder and harder. At some point we almost abandoned it, and new integrations were only added to the “in-development” 2.0. The result? We had literally hundreds of operators covering many services developed in Airflow 2.0 that were unusable for Airflow 1.10 users (the very people who should be able to use them)!
Providers like Singularity or Apache Livy are only available in 2.0. Also, at some point we had almost 150 new operators, hooks, and sensors in the Google provider alone (!) that were only available for Airflow 2.0 (and thus not really usable). Then the idea came: why don’t we introduce “Backport Providers” to let Airflow 1.10 users use the new operators NOW?
To be perfectly honest, Backport Providers came to be long before the regular “Providers”. They were our testing ground, where we checked how easy and manageable it was to have 60+ packages and what kind of dependency hell we were getting ourselves into. Only when we saw them in action and proved that separate providers work did we decide that providers should become first-class citizens in Airflow 2.0.
To explain what Backport Providers are: they are automatically refactored Providers that can be installed on, and work with, Airflow 1.10. If you use Airflow 1.10 and haven’t used those providers yet, now is the right time to do so. You get the benefits of the 2.0 provider split in your 1.10 environment. This is rather cool, especially since (except for security fixes) we don’t plan to release any new versions of Airflow 1.10.
One thing to note is that those backport providers only work with Python 3.6+, but since we are already almost a year past the official Python 2 end of life, if you are running Airflow with Python 2, you’d better migrate, like, yesterday!
With backport providers, your migration path to 2.0 will be safer, simply because backport providers are the same providers you will have in 2.0. They use the same imports and they are built from the same Python code. They have the same APIs, and if the 2.0 providers added new features, changed parameters, or changed behavior compared to their 1.10 predecessors, you get exactly the same thing in the backported 1.10 version as you do in 2.0. This means that at migration time you have one less thing to worry about.
You can migrate all your DAGs to backport providers gradually and incrementally, while all your users keep running the 1.10 version without any risk: backported and original operators can happily co-exist in the same 1.10 environment, and you can always switch back to the original 1.10 operators. Keep in mind that migrating also means making a new Airflow deployment and converting your configuration to 2.0, so it’s quite a big deal to be sure all your DAGs are 2.0-ready before you migrate them. We also have the “upgrade-check” tool in Airflow that helps you determine whether you are still using any of the deprecated classes or have already moved to the backported providers.
However, there is one catch—we are going to release new versions of Backport Providers for only 3 months (!) after Airflow 2.0 is released. This means that you should already start thinking about switching to Airflow 2.0! If you switch to backport providers now, in your 1.10 Airflow, while the backport providers are fresh and updated, you will have a much smoother migration to the Airflow 2 series.
So, seize the opportunity while you can!
There is one more thing that should be mentioned here. While with Apache Airflow 2.0 we are releasing 60+ providers supported by the community, it is entirely possible to prepare your own providers. We have documentation on how to create your own provider. That includes some interesting capabilities, like registering your own connections in the core for automated hook creation, adding custom forms for the operators, or adding extra links to operators. You can easily build your own operators, follow the Airflow conventions, and release them separately on your own. In the community we are discussing whether to bring more operators and providers into Airflow or to build a bigger ecosystem around Airflow where 3rd parties are free to develop and publish their providers as standard Python packages. Both approaches are viable, and I think both will co-exist.
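As a taste of what a custom provider involves: a provider package exposes its metadata through a get_provider_info callable wired up as an “apache_airflow_provider” entrypoint in the package’s setup. The sketch below is hypothetical—all package and class names are made up, and the exact set of metadata keys is defined by the provider schema in the official documentation.

```python
# Hypothetical provider metadata callable for a 3rd-party provider.
# Airflow 2.0 discovers it through the "apache_airflow_provider"
# entrypoint declared in the package's setup.cfg/setup.py.
def get_provider_info():
    return {
        # All names below are illustrative, not real packages/classes.
        "package-name": "acme-airflow-provider",
        "name": "Acme Provider",
        "description": "Operators and hooks for Acme's internal services.",
        "versions": ["1.0.0"],
        "hook-class-names": ["acme_provider.hooks.AcmeHook"],
        "extra-links": ["acme_provider.links.AcmeJobLink"],
    }
```

This metadata is what lets Airflow register your connections for automated hook creation and show your extra operator links, as described above.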
What does the future hold? Who knows? But by introducing providers, we now have the opportunity to take both roads. And I am sure the community will make the right decisions here.
After all this sugarcoating, it’s time to answer the difficult questions you undoubtedly wanted to ask but were afraid to.
How do you install the providers? This is actually one of the simplest things ever. If you want to install Airflow, you don’t really need to change the way you’ve installed it so far. In most cases, if you want to use Airflow 2.0, you shouldn’t have to install providers at all, because they will be installed automatically when you choose the right extras. For example, when you want to install the Google provider, you add the “[google]” extra. If you additionally want the Amazon one, you simply add “[google,amazon]”. The “legacy” extras from Airflow 1.10 (for example “aws” or “gcp”) work too, though you will get a deprecation warning and they will be removed in Airflow 2.1.

There are a few exceptions to that. A few integrations that used to work out-of-the-box now require installing a provider (they are quite “niche”, for example openfaas or discord). A few others, for example http or ftp, are also separated out as providers, but since they are rather popular, those provider packages will always be installed together with Airflow. You don’t have to worry about them; they will be there, under the new import path (in “airflow.providers.http”, for example), and all your old imports will still work. They will show deprecation warnings, though, so you’d better migrate your DAGs later or, better, before the Airflow 2.0 migration, using Backport Providers.
In general, it’s nothing scary and it will work out-of-the-box in most cases. Additionally, you can always install any provider manually in the same way you install other Python packages.
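To make this concrete, here is what the install commands look like. The version numbers and constraints URL are just examples; substitute the ones for your target release.

```shell
# Airflow 2.0 with the Google and Amazon providers pulled in via extras:
pip install "apache-airflow[google,amazon]==2.0.0" \
  --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.0.0/constraints-3.6.txt"

# Or add a single provider manually, like any other Python package:
pip install apache-airflow-providers-amazon
```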
How do you upgrade Airflow together with the providers? This is a great question, and there is an easy answer: the same way you update any of your other Python requirements. We use dependency management that is standard in Python, and we generate the packages automatically, in a way that keeps their dependencies in sync. If you want to upgrade Airflow and all your providers together, you can simply run:
pip install --upgrade apache-airflow[YOUR EXTRAS] --upgrade-strategy eager --constraint https://raw.githubusercontent.com/apache/airflow/constraints-master/constraints-3.6.txt
This command will upgrade Airflow and all dependencies (including all the providers) to the latest supported and tested versions.
And here is a little secret you might not have realized so far. The “constraint” switch looks like a simple, helpful thing, but there is a little magic behind it.
We know that Airflow is a beast when it comes to dependencies, and if you know the term “dependency hell”, Apache Airflow is a pure manifestation of that evil. With 60+ packages, each with its own dependencies, you might imagine how hard it is to keep them all in sync. Yet in the Airflow community, we’ve worked out some secret sauce in our Continuous Integration system that gets this issue under control. Maybe it doesn’t solve it entirely, but it at least guarantees a repeatable installation of any released Airflow version with ALL the providers. Yeah, you read that right again. In our CI system, we make sure that not only Airflow but also all the latest 60+ providers and all their dependencies can be installed together without raising any version conflicts.
For those of you who know the details of Python dependency management, this might seem like magic. The constraints you refer to in the installation command are nothing more than a list of package versions that we KNOW work well with that particular Airflow version. And the list gets continuously updated and upgraded while we develop new Airflow versions! But wait, there are more than 400 packages on that list. And wait, we keep a separate list for each supported Python version (currently 3.6, 3.7, and 3.8). Do we have an army of Elves (yeah, Xmas is coming!) who sit there continuously checking which packages have new versions and testing all the combinations? Almost, but not quite entirely unlike that.
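To demystify it a bit: a constraints file is nothing more than lines of “package==version” pins. The tiny parser below is only an illustration (real files can also contain comments), and the version numbers are made up for the example.

```python
# A toy slice of a constraints file: every dependency pinned to a version
# that is known to work with a given Airflow release.
CONSTRAINTS = """\
apache-airflow-providers-amazon==1.0.0
apache-airflow-providers-google==1.0.0
requests==2.25.0
"""


def parse_constraints(text: str) -> dict:
    """Map each package name to its pinned version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, _, version = line.partition("==")
        pins[name] = version
    return pins
```

When pip is given such a file via --constraint, it simply refuses to install any listed package at a version other than the pinned one.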
We’ve employed our Continuous Integration (we use GitHub Actions) to do it for us. This is something I am rather proud of! It took me quite a lot of time to polish, but finally, with Airflow 2.0, we have the system in place. During our CI builds, the CI upgrades all the dependencies to the latest versions allowed by the limits we define in our setup. Next, the CI checks that the requirements do not conflict, and it also runs the more than 4,000 unit and integration tests we have. Then a CI job even tries to do exactly what you would do: install Airflow from scratch with all the providers using those dependencies, and finally it checks that the installation is sound by querying Airflow through the command-line interface.
How cool is that?
Can you upgrade the providers separately? Yes, if you choose to do so. Upgrading only what you need, when you need it, is the safest way to go. You can choose various upgrade strategies, depending on the environment you have and on how difficult and potentially dangerous such upgrades might be for your business. While you can upgrade both Airflow and the providers using the “eager” upgrade strategy, the way providers are managed allows you to choose your own way.
You can monitor the release announcements on the devlist, or subscribe to the PyPI RSS feeds, and upgrade whenever new providers are released. You can upgrade all the providers you use at regular intervals, or you can keep the same strategy as in the Airflow 1.10 days and only upgrade to the latest providers when you upgrade Airflow. We’ve got all those scenarios covered, and you can adjust them to the way you handle other upgrades in your company.
However, there is one thing we have to help you with when you make those decisions. We’ve decided in the Airflow community that the versioning of all the providers will be strictly based on the SemVer specification, and that we will follow this approach rigidly. Since we have more than 420 dependencies in Airflow, we know first-hand how hard you can be hit by backwards-incompatible changes in minor versions of your dependencies, so we want to spare our users that pain. You will be able to see whether a particular release is a bugfix, a feature upgrade, or a backwards-incompatible change just by looking at the version number. And of course, the details of what is included in each provider release are automatically prepared in per-provider release notes. We will also make sure to keep the cross-dependencies between the providers in check: whenever we find a breaking dependency between provider versions, we will update the requirements of the affected providers following standard package dependency schemes, so that you can see whether different packages are compatible with each other.
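In code terms, the SemVer promise boils down to something like this simplified check (a sketch that ignores pre-release tags and build metadata):

```python
# Under SemVer, only a MAJOR version bump may contain
# backwards-incompatible changes; MINOR adds features, PATCH fixes bugs.
def is_breaking_upgrade(installed: str, candidate: str) -> bool:
    """True if moving from installed to candidate crosses a MAJOR version."""
    return int(candidate.split(".")[0]) > int(installed.split(".")[0])
```

So a provider jump from 1.0.0 to 1.2.3 is safe to take casually, while a jump from 1.x to 2.0.0 deserves a look at the release notes first.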
Are you feeling a little calmer now? Good. Now it’s time for the nicest part of this article. Apache Airflow is an Apache project, and “Community over Code” is the main motto of the Apache Software Foundation. Whatever code we produce is the result of the great community we have in Apache Airflow. And community means people.
I would like to thank the whole Airflow community for the work on providers. Introducing providers for Airflow 2.0 was a true community effort, and many people contributed to it. It was a nearly Herculean effort to pull off, with more than 750 (!) operators, hooks, sensors, and transfers affected. It took many iterations and corrections, and even a few extra rounds of voting in the community when we found out that the original decisions needed to be adjusted.
I’d like to point out just how many people were involved. My colleague Kamil Breguła prepared the design for the changed import paths and documented the options, Ash Berlin-Taylor came up with the “providers” name, and I came up with the idea of backport providers and then led the discussion and voting. Many people helped with the actual work, and there would be no providers without the effort of the whole team. Thank you for your support, input, guidance, building, reviewing, implementing, and testing.
My initial thanks go to Tim Swast from Google, Kamil Breguła, and Felix Uellendall for creating the original Airflow Improvement Proposals: AIP-8: Split Providers into Separate Packages for Airflow 2.0, AIP-21: Changes in import paths, and AIP-9: Automated Dependency Management. Together with the automated tests enabled by AIP-26: Production-ready Docker Image, which I created, those were the AIPs that made the providers work.
Additionally, thanks to the whole Polidea team, who did the vast majority of the work: relentlessly moving all the operators to their new places, testing them, and figuring out all the teething troubles.
Also thanks to the Astronomer team: Kaxil Naik, Daniel Imberman and, once again, Ash Berlin-Taylor. They are fantastic colleagues, and without their involvement, suggestions, reviews, comments, and occasional pushes in the right direction, the providers would never be where they are.
And, finally, special thanks to my fellow team members from Polidea!
Tomek Urbaszek, for the initial fights with the Bowler refactoring tool, and for making sure our automated tests pass for many of the providers (especially the Google one).
Thanks again to Kamil Breguła for his relentless efforts to improve and split the documentation structure for Airflow and the providers. For a community-driven, open-source project like Apache Airflow, documentation is often a “make or break” matter, and keeping it in order is hard work that pays off a million times.
A big thank you to Karolina Rosół, Head of Cloud and manager of the Polidea Airflow team, the silent hero of our work. She helped us navigate the complex world of customers and community, and was always there to help prioritise our decisions. All this in a true agile spirit. Karolina is a manager who not only removes obstacles but is also a true part of the team.
Finally, thanks to the whole Airflow community for your help and support :) It was quite a journey.
Principal Software Engineer