August 07, 2019 | 10min read
It’s a "Breeze" to Develop Apache Airflow
This is a tale about a modern approach to developer productivity. Improving productivity has been a recurring theme in most of the roles and places I have worked.
Before you read on, here’s a webinar where I show you how to achieve better productivity in Airflow using Breeze. Worth checking out!
It started at the beginning of my career with my older and more experienced colleague—one of the first “real” software engineers I had interacted with by then. I learned a lot from him. I specifically remember him saying: “When I develop code, I spend the first half of my time on developing tools to make my work 10x faster—that’s the key reason why I always deliver on time.”
His words stuck with me. I’ve been a part of many successful projects in my career but they pretty much always looked like this: when I first joined a project I’d look around, trying to learn the ropes and identify bottlenecks, such as the lack of tools or what part of the development process could slow me down. I’d also try to come up with some ideas which could enhance productivity, both for me and my team.
As an All-Stack-Engineer, I believe the best use of my time is to put effort not only into the code that I deliver but, above all, into the tools, DevOps setup, continuous integration, automated code analysis and everything else that developers need. The best-case scenario is when your developers can use them independently and you don’t need to worry about them. Another phrase that I often repeat is “automation is my second name.”
“Productivity is not about the number of lines of code you’ve written or features you’ve delivered. It’s about building the potential for future gains while focusing on delivering the project.” (Jarek, Principal Software Engineer)
How do you define productivity, though? And how can you measure it? I have a possibly surprising answer: it’s not about the number of lines of code you’ve written or features you’ve delivered. For me, productivity is something rather different. It’s all about focusing on the present and the future rather than the past. Productivity is more of a bet that in the future your team will be able to deliver more in the same amount of time. It has far more to do with your perceived speed and enjoyment of work as a developer than with the outcome of work you’ve already done. It is about building the potential for future gains while focusing on delivering the project.
I could think of a number of projects where I followed that philosophy of work and where it boosted the productivity of my team. Today, however, I’d like to focus on one project in particular, that I am currently most involved in: Apache Airflow.
Initially, we started contributing to this fantastic open-source project with a team of three, which then grew to five. When we kicked it off a year ago, I realized pretty soon where the biggest bottlenecks and areas for improvement in terms of productivity were. Even with the help of our client, who provided us with a “homegrown” development environment, it took us literally days to set it up and learn the basics.
Apache Airflow is a thoroughly tested project: it has almost 4,000 tests of varying complexity (from simple unit tests to end-to-end system tests) with around 80% coverage. Airflow follows a modern software project philosophy: a Pull Request can only be merged if all the tests pass. But that creates another problem: all the tests have to run for all the backends (there are 3 of them) and for different Python versions. When we started the project we still supported 2.7, 3.5 and 3.6, so you sometimes had to wait several hours for the results of the CI build, depending on the delays and queues. At times, you have no control over the CI system you use. For example, we recently experienced hours of delays before we could even see the build starting. And then you might lose several hours of waiting because of one badly formatted line of code detected by pylint (!).
This is hardly acceptable to any developer. Usually you can work on several issues in parallel, so the waiting itself is not the worst thing that can happen, but there are always the costs of context-switching, distractions, getting out of the flow and the good old “I already forgot what I was doing” phase by the time the build is completed. It’s even worse when you want to make a small Pull Request, for example when you find a bug and have a one-line fix for it. You submit the Pull Request and… half a day later you see that your single-line change is badly formatted. That’s hardly encouraging for a project such as Airflow, which very few people are lucky enough to work on full time and which depends on building a community and encouraging people to contribute.
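To make the size of that CI workload concrete, here is a minimal sketch of the build matrix described above. It assumes that every Python version is tested against every backend, which the real CI configuration may not have done, and the backend names are placeholders chosen for illustration (the post only says there are three).

```python
from itertools import product

# Assumed values for illustration: the post says "3 backends" and names
# Python 2.7, 3.5 and 3.6; the backend names below are placeholders.
BACKENDS = ["sqlite", "mysql", "postgres"]
PYTHON_VERSIONS = ["2.7", "3.5", "3.6"]


def build_matrix(backends, python_versions):
    """Return every (python_version, backend) pair the CI would have to test."""
    return list(product(python_versions, backends))


if __name__ == "__main__":
    matrix = build_matrix(BACKENDS, PYTHON_VERSIONS)
    for python_version, backend in matrix:
        print(f"python {python_version} / {backend}")
    print(f"{len(matrix)} full test runs per Pull Request")
```

Under these assumptions a single Pull Request triggers nine full runs of the test suite, which is why a queue on a shared CI system builds up so quickly.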
First, we started with the needs of our team: we had to work out a way to work together. Here is the list of things that we needed for efficient collaboration:
- a fast iteration loop on tests
- shared configuration of our project within the team
- shared Google Cloud Platform information: project ID, service account keys, configuration of GCP resources
- good documentation (we knew our team members would change over time)
- an easy way to onboard a new person
- an easy way to upgrade the environment semi-automatically when needed
Inspired by the environment provided by our customer, we developed our work environment based on a shared Dockerfile. The Dockerfile and the resulting image are used for the Airflow system test execution. We automated the rebuilding of the Docker image and added an easy way to select the Python version and the backend, plus easy integration with pydevd remote debugging.
We open-sourced the framework at the very beginning of the project. We built it so that users can run the system (end-to-end) tests, which communicate with a real Google Cloud Platform project, directly from the IntelliJ/PyCharm IDE via the normal “run as unit test” feature. How cool is that? It’s cool as a breeze, or rather, thanks to the environment, “It’s a breeze to develop Airflow” now ;) Hence, we named the tool Airflow Breeze. And what’s even cooler, our Graphic Designer Milka created an amazing logo for it.
What did we achieve by using Breeze? Faster (MUCH faster) iterations during development, mainly because we cut down the test cycle time for every developer and got fully reproducible system tests that could be run by anyone. A team member’s vacation or the “bus factor” was no longer a problem. Most of all, we got rid of the “works on my machine” problem. Everyone in the team had the same environment, which was easy to start and iterated over tests very quickly on a local machine, so others could pick up and continue the work.
But… this was just the beginning and my goal was much bigger. In the meantime, due to our contributions to Airflow, I was invited (together with my fellow engineer Kamil Breguła) to become a committer. I quickly realized that this was my chance to share a better development experience with the whole community of Airflow developers. I also realized that Breeze contained so many GCP-specific solutions that bringing it directly to the community was not possible: developers who weren’t already using Google Cloud Platform would not want to adopt it just for this. After all, Airflow is used to interact with many services and GCP is just a small subset of them.
I knew well that it would be difficult to convince long-time Airflow developers to dramatically change the way they work. If you work with something long enough, you get used to some of the problems you face and develop muscle memory, shortcuts and aliases that make you productive. You quickly forget what your daily work looked like when you were just starting out, so you don’t feel the need to change. I decided that I had to approach my mission in a different way.
The whole CI / build system for Airflow was pretty complex. The combination of Docker images (managed in a separate repository), Docker Compose configuration, tox and a mesh of bash scripts that downloaded Kubernetes and installed minikube locally was a piece of work. It was really nice for the CI case though: it automatically set up the Airflow image and external services (postgres/mysql/redis/kerberos) and used those even for complex integration and system tests. However, even if you managed to set it up locally (which required a bit of dark magic), it was designed to run the whole suite of tests, which would take at least 30 minutes. The setup was not designed for individual tests: the 30-40 seconds of overhead on every run was OK for the whole test suite, but for running individual tests (which take milliseconds) it was a major pain.
If you follow the documentation literally, you end up in a situation where running a single test—which normally takes 10 ms—introduces 30-40 seconds of overhead for database initialization or even minutes for installing missing dependencies.
That’s hardly a good cycle speed for locally run tests.
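To put rough numbers on that overhead, here is a back-of-the-envelope sketch using the figures quoted above: a 10 ms test, a 30-minute full suite and a setup cost taken as 35 seconds, the midpoint of the 30-40 second range. The function name is hypothetical and exists only for this illustration.

```python
def overhead_share(test_seconds, setup_seconds):
    """Fraction of wall-clock time spent on setup rather than on running tests."""
    return setup_seconds / (setup_seconds + test_seconds)


if __name__ == "__main__":
    setup = 35.0            # assumed midpoint of the 30-40 second overhead
    single_test = 0.010     # one 10 ms test
    whole_suite = 30 * 60   # a 30-minute run of the full suite

    print(f"single test: {overhead_share(single_test, setup):.2%} of the time is setup")
    print(f"whole suite: {overhead_share(whole_suite, setup):.2%} of the time is setup")
```

The same setup cost that is a small fraction of a full run takes up nearly all of the time when you run one test, which is the gap a local iteration loop has to close.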
You could, of course, have a local virtualenv setup, but maintaining it, keeping up with the changes and making sure that it matched the CI environment was not really a viable solution. Some more experienced people tuned the environment and manually modified and configured their local machines to be more productive, mostly to iterate faster. There is even a talk given by one of the most active committers and PMC members of Airflow, Ash Berlin-Taylor from Astronomer, where he explains all the magic needed to test Airflow. While the talk was inspiring and very helpful, it was unrealistic to expect less experienced Airflow contributors to learn and follow all the tricks, or even to realize that they needed to apply them. A lot of Ash’s tricks ended up in our environment.
Coming back to how to introduce such a change to all Airflow developers: I decided to tackle a much bigger problem, one that was a pain for many other people, namely the complexity of the CI build I mentioned before. It turned out that the best way of implementing such an environment was to fix the CI environment itself and make it easy to install locally and quick to iterate tests in. Additionally, we could make it self-managing: the Docker image is periodically rebuilt with the latest versions of the base images (including security fixes), so we can make sure that everyone gets the latest version of the images. We also optimized the time needed to rebuild it. This opened the door to a number of further improvements. This was the road I decided to take. I embarked on quite a long journey, which took more than 6 months, slowly replacing the CI environment with one inspired by and built on the original Breeze.
To get to where we are now with Breeze, it took 4 Pull Requests (one of them quite huge), more than 500 comments in GitHub discussions, tens of threads on the development mailing list, a screencast and a webinar. We have a brand new Breeze environment that makes it much easier to replicate the CI environment and is optimized both for running in CI and for local development with quick iteration cycles over individual tests. At the same time, this environment replicates exactly what the CI build does, so when the tests pass locally, you can be sure they also pass in CI. You can also easily run static code checks in exactly the same way they are run in the CI environment.
It also opened up an opportunity to move to another CI system, because all the test steps are now fully Dockerized and easy to port. The system is optimized enough that we can even run it as pre-commit hooks that use the same Docker image and prevent developers from committing files that do not follow the strict rules of the Apache Airflow community (currently a work in progress). This helps catch some errors at the very beginning of the development lifecycle.
This is what I call a super productive development environment! What’s even more important, it takes less than 10 minutes to get the environment up and running from scratch (!). You get auto-complete that suggests the options you can use when you start the environment (for example, which Python version you can choose) and another auto-complete that helps when you are trying to remember the name of the test you were supposed to run.
You can literally start hacking with Airflow in less than 10 minutes. Cool, isn’t it?
To sum up, we liked the name Breeze so much that we applied it to the new environment. What we originally called Breeze has become a GCP Breeze Extension that is going to be built on top of Breeze. We are now working on improving it, to make it much slimmer and, potentially, also reusable for other developers who would like to develop GCP operators for Airflow.
There is no point in discussing the details of the environment here; it’s much better to see it in action. Here are some resources where you can learn more about Breeze:
- My screencast
- Webinar recorded by me and Ash showing the new environment
- Documentation for the Pull Request (still in progress)
By the way, we have something else cooking right now that will show the results of our “10x faster” philosophy, but hold your horses: it’s a story for another blog post ;)
Have fun using Breeze!