November 19, 2020 | 4 min read
Why Apache Airflow? Interview with Asseco Business Solutions
Today we talk to Konrad Łyda, Machine Learning Engineer & Team Lead at Asseco Business Solutions. He is responsible for delivering high-quality deep learning and machine learning models to production. He stays on top of the newest developments in the machine learning world and applies them in his work to solve business problems.
Asseco Business Solutions is a company that has been developing and deploying business management software for more than 20 years. We develop and implement our own products across many business sectors. I work in the Computer Vision department, which is responsible for delivering solutions based on machine learning models that are then used by other teams developing mobile and web applications.
In this interview, I will speak from the perspective of my department, the Image Recognition and Machine Learning Department, as it's my "mini company" inside the larger company :)
Airflow was the first solution we introduced to author, schedule and monitor workflows. We researched the topic and wrote down all the pros and cons of Airflow and similar tools. At that point, we found Airflow to be the best fit for our needs.
Why have you decided to implement Apache Airflow? What was the problem you were facing and tried to solve?
In our daily work, we prepare many machine learning models based on data. We found that some tasks were being done manually but could easily be replaced with code. However, we didn't want to end up with a giant house of cards built from random scripts. We wanted a solution that would help us organize these pieces of code into workflows.
Because we are all programmers, we looked for a workflow solution based on Python. From the very beginning, we rejected solutions where workflows are built from XML/YAML files or by "clicking". That is why we dropped Oozie, NiFi and StackStorm. Only Luigi and Airflow were left on the battlefield.
Airflow is a well documented, open-source solution with a large community, many documented use cases and out-of-the-box integrations available. These factors convinced us that it’s a solution for us.
Nope, apart from Airflow my department doesn’t use any other solution of this type.
We wanted to start simple. Therefore, we used the Apache Airflow Dockerfile already available on GitHub for Docker's automated builds. We changed it a little and deployed it on a self-hosted server. Two Docker containers (one with a Postgres database, one with the web server and scheduler) using the LocalExecutor have been doing the job for a year and a half now. However, we recently noticed a performance drop on the current server and decided that moving to our self-hosted Kubernetes cluster would be the best solution for us. At first we thought about moving to Kubeflow, but thanks to the continued development of Airflow, it became possible to deploy Airflow on Kubernetes with dedicated operators. Therefore, taking into account that we already had workflows built in the "Airflow spirit", we decided to make some tweaks to our pipelines and stick with Apache Airflow.
No, our team of machine learning engineers is also responsible for keeping our pipelines working and healthy. As the solution itself is well documented, it is not a problem to manage it ourselves.
I think there were no such issues. Of course, we had to develop some operators and sensors tailored to our needs, but the fundamentals offered by Airflow are stable enough to build on top of.
Most of the time, we just used the documentation and Stack Overflow. Sometimes we looked directly into the Airflow source code because some features or configuration tricks weren't mentioned in the docs, but all in all we managed to develop the whole solution ourselves.
It has helped us train and deploy machine learning models quickly, directly to production. We have been saving a lot of time with automated pipelines. Even though we sometimes need to tweak parts of a workflow, the overall time saved is impressive.
Of course, there were situations when we found out that our pipeline could be improved a lot because we’d missed some built-in feature or we had simply developed it “our way”. It happens in all software development projects all the time.
First, you should know Python to build pipelines in Airflow. Then, if you see that some tasks in your company are done manually every day unnecessarily (and many of those tasks consist of moving data from one place to another with standard, predictable processing steps), Airflow is the tool for you.
In our case of machine learning pipelines, where there is a continuous loop (prepare data - train - validate - store - repeat), Airflow has been a great choice for automating our daily routine.