Process your data easily and effectively, thanks to Apache Airflow!


Ensure the successful digital and cloud transformation of your company with our Apache Airflow services.

Apache Airflow is a platform created by the Apache Software Foundation to programmatically author, schedule, and monitor workflows. It is especially effective for cloud projects involving data processing and machine learning. Polidea proudly contributes to the development of this platform.

Let’s talk about your project!

Looking for help with Apache Airflow services? You’re in the right place! Get in touch and our Airflow committers and contributors will walk you through it.

What we do

"Since Airflow is an open-source technology, we could implement its customizations for the different use cases to achieve the desired scale and functionality. Furthermore, our automotive partners already had some experience with Airflow and felt more comfortable with us using it as well."

Amr Noureldin

Solution Architect,
DXC Technology

We have 3 Apache Airflow committers & 3 Project Management Committee members, who are responsible for 10% of all the contributions to the project since its kickoff!

Find our committers on GitHub!

Our projects

Open Source

Apache Airflow


An All-in-One Scheduler for Seamless Workflows

As part of the open-source community, the Polidea team developed and implemented an extensive set of operators that let Airflow work with different cloud service providers. Our Airflow committers contributed 70+ operators for Airflow DAGs to the open-source Airflow project, meeting the highest standards of open-source development. As a result, building multidimensional data workflows is faster than ever before.
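For instance, moving files from Google Cloud Storage into BigQuery takes a single operator from this family. A minimal sketch, assuming the apache-airflow-providers-google package is installed; the bucket, object, and table names are hypothetical:

```python
from datetime import datetime

from airflow import DAG
# Requires the apache-airflow-providers-google package
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bq_example",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered manually in this sketch
    catchup=False,
) as dag:
    # Load a CSV file staged in Cloud Storage into a BigQuery table
    load_events = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="my-data-bucket",                   # hypothetical bucket
        source_objects=["events/2021-01-01.csv"],  # hypothetical object
        destination_project_dataset_table="my_project.analytics.events",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",  # replace the table's contents
    )
```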

Open Source

Apache Airflow


Optimizing the tool’s performance for the community

Many Apache Airflow users face performance issues, which manifest as delays in scheduling tasks and excessive use of resources. Together with Databand, we’ve made major progress in optimizing the Airflow scheduler’s performance. As a result, tests show 10x faster scheduling and over 2,000 fewer database queries.

Open Source

Apache Airflow


It’s a Breeze to Develop Apache Airflow

Seeing the need to improve the Airflow development process and make it more efficient, our committer created a tool called Breeze. It’s an easy-to-use development environment, available both for local use and in Airflow CI tests. Thanks to our solution and involvement in the Apache community, developing Airflow is now faster and simpler for everyone.

Our clients
are our partners

Have more burning questions about Apache Airflow?

We’ve got the answers!

How does Apache Airflow work?

Apache Airflow is one of the most highly recommended schedulers: it executes interdependent tasks in a precise order, with everything set up as code. As part of the Apache open-source software projects, it is developed by a whole community of skilled software engineers, which makes it remarkably bullet-proof.

Both Airflow itself and all the workflows are written in Python. This has a lot of benefits, mainly that you can easily apply good software development practices to the process of creating your workflows: code versioning, unit testing, avoiding duplication by extracting common elements, and so on. Moreover, Airflow provides an out-of-the-box browser-based UI where you can view logs, track the execution of workflows, and order reruns of failed tasks, among other things.
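As an illustration, here is a minimal sketch of a workflow defined as code (Airflow 2.x style; the DAG id and task callables are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: pull data from a source system
    print("extracting data")


def transform():
    # Placeholder: clean and reshape the extracted data
    print("transforming data")


with DAG(
    dag_id="example_workflow",       # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",      # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The >> operator declares the dependency: transform runs after extract
    extract_task >> transform_task
```

Because this is ordinary Python, the file can live in version control and the callables can be unit-tested like any other functions.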

Do I need a special team to do a project in Airflow?

You can, of course, try to hire developers with a specific set of cloud skills internally; however, that takes time and money. Remember, cloud OSS tools do not come with paid support. The better option is to hire a team of experts externally, preferably engineers who are involved in the Airflow project itself, like Apache committers and contributors. Lucky for you, some of them are at Polidea ;)

When should I consider Apache Airflow services?

Think of Airflow as an orchestration tool to coordinate work done by other services. It’s not a data streaming solution—even though tasks can exchange some metadata, they do not move data among themselves. Here are some examples of use cases suitable for Airflow:

  • ETL (extract, transform, load) jobs—extracting data from multiple sources, transforming it for analysis, and loading it into a data store (see the sketch after this list)
  • Machine Learning pipelines
  • Data warehousing
  • Orchestrating automated testing
  • Performing backups
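As a sketch of the first use case, here is what a simple ETL pipeline could look like (all names hypothetical). Note that the tasks exchange only metadata—the path to the staged file—never the data itself:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder: fetch source data and stage it somewhere durable
    staged_path = "/tmp/staged/orders.csv"  # hypothetical location
    return staged_path  # return values are stored as XCom metadata


def transform(ti):
    # Pull the staged file's path (metadata, not data) from the extract task
    staged_path = ti.xcom_pull(task_ids="extract")
    print(f"transforming data staged at {staged_path}")


def load():
    # Placeholder: load the transformed data into the data store
    print("loading into the warehouse")


with DAG(
    dag_id="etl_example",            # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```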

Will Apache Airflow boost my team’s productivity?

Short answer—yes! It speeds up the work of your data scientists. You can also speed up and simplify the development and testing of Airflow and its workflows by using Breeze—a tool co-designed by Polidea.

Why Apache Airflow? An interview with Co-Founder of Banacha Street

We talk to Jan Kościałowski, co-founder of the startup Banacha Street, to discuss why and how they use Apache Airflow in their business. Jan is an actuary, a data scientist, a massive coffee geek, and lately a multi-tasking tech person responsible for development in two young startups. He is always happy to learn something new and use it to solve problems.

Please briefly describe what your company does.

At Banacha Street, we build quant_kit—an Insight Search Engine that provides sophisticated alternative-data-based analyses in response to any query input by the user. It works similarly to a web browser, but instead of a list of webpages, it returns a grid of insights in graphical form (graphs, maps, tables). It is intended for any area in which data-intensive research is (or could be) important: hedge funds, VCs, large industrial/energy enterprises, etc.

What is a specific use case of Airflow at Banacha Street?

We have a bunch of serverless services that collect data from various sources (websites, meteorology and air quality reports, publications, etc.), return it in a parsed format, and put it in a database. Once a day, at night, we want to gather specific data, analyze it, and, based on the result of this analysis, get some more data and analyze it further. The results of such processes are quantities and plots. They can be put in a separate database and displayed to our end users by the front end.
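For illustration only—this is our sketch, not Banacha Street’s actual code—such a nightly pipeline could be expressed with a branching DAG (all names hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import BranchPythonOperator, PythonOperator


def analyze_initial():
    # Placeholder: analyze the data gathered by the serverless services
    print("running the initial analysis")


def decide_next():
    # Hypothetical decision: fetch more data only if the analysis calls for it
    needs_more_data = True  # would be derived from the analysis result
    return "fetch_more_data" if needs_more_data else "store_results"


def fetch_more():
    print("fetching additional data and analyzing it further")


def store_results():
    print("writing the resulting quantities and plots to the results database")


with DAG(
    dag_id="nightly_analysis",       # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="0 2 * * *",   # once a day, at night
    catchup=False,
) as dag:
    initial = PythonOperator(task_id="analyze_initial", python_callable=analyze_initial)
    branch = BranchPythonOperator(task_id="decide_next", python_callable=decide_next)
    fetch = PythonOperator(task_id="fetch_more_data", python_callable=fetch_more)
    store = PythonOperator(
        task_id="store_results",
        python_callable=store_results,
        # Run even when the fetch branch is skipped (Airflow 2.1+ rule name)
        trigger_rule="none_failed_min_one_success",
    )

    initial >> branch >> [fetch, store]
    fetch >> store
```

The BranchPythonOperator decides at runtime which downstream task runs, mirroring the “based on the result of this analysis, get some more data” step.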

What other solutions did you consider?

No other, in fact. We were already aware of what Airflow is capable of when we had several data fetching processes scheduled with AWS CloudWatch Events and some initial ideas about what to do with the data we had.

What type of deployment of Airflow have you used, and why? Was it determined by your current architecture (i.e., the fact that you are cloud-native)?

Our product is currently in the development/PoC stage, so we wanted a semi-production environment. For the sake of cost management, we don’t want any VMs running continuously, so we have Terraform scripts that set up a VM with Airflow running the LocalExecutor and a separate MySQL database. As we are currently exploring possible analyses, the data volume is not too high, and such a setup is sufficient. We will most likely switch to Kubernetes when we start generating analyses of moderate complexity daily.

How did the implementation influence your processes?

A centralized scheduler for the serverless services responsible for data fetching was an excellent replacement for cloud-provider-specific schedulers. It allows for more flexibility and easier integration with the other processes, and it will probably help us transition to cloud-agnosticism in the future. The greater change, however, on a more cultural level, was the application of Airflow to our analytical processes. We realized that calling data sources and analyzing what they yield are interconnected steps, each dependent on the previous ones, so together they constitute a DAG (<3).

Not only did it help us run what needed to be run at specific points in time, but it also made us drop the previous vague way of thinking about sets of interconnected analyses in favor of a more systematic approach using directed graphs.

Was there any unexpected effect? Both positive and negative.

No negative effects to date. The shift in our approach can definitely be seen as a positive effect.

What was the challenge in the integration process?

The documentation could be more detailed in places, and the community, however solicitous and helpful, is not the size of a well-established framework’s. We spent some time getting MySQL to work properly as the backend.

How did you approach the integration process—did you seek any help, assistance from a development studio?

No, we develop everything in-house. However, we will be looking for someone experienced in Airflow. Despite its novelty, Airflow has gained a lot of traction among data engineering teams all around Warsaw, so there is a considerable ecosystem from which we could draw an experienced hire.

To whom would you recommend considering using Airflow?

To any Python-centric organization or individual who needs to schedule recurring tasks. As the set of such tasks is surprisingly broad, I would advise considering any repetitive task, manual or partly automated, as a candidate for an Airflow DAG.