July 09, 2020 | 4min read
Why Apache Airflow? An interview with Co-Founder of Banacha Street
We talk to Jan Kościałowski, Co-Founder of the startup Banacha Street, about why and how they use Apache Airflow in their business. Jan is an actuary, data scientist, massive coffee geek, and lately a multi-tasking tech person responsible for development in two young startups. He is always happy to learn something new and use it to solve problems.
At Banacha Street, we build quant_kit, an Insight Search Engine that returns sophisticated analyses based on alternative data for any query the user enters. It works similarly to a web search engine, but instead of a list of webpages, it returns a grid of insights in graphical form (graphs, maps, tables). It is intended for any area in which data-intensive research is (or could be) important: hedge funds, VCs, large industrial/energy enterprises, etc.
We have a bunch of serverless services that collect data from various sources (websites, meteorology and air quality reports, publications, etc.), return it in a parsed format, and put it in a database. Once a day, at night, we want to gather specific data, analyze it, and, based on the result of this analysis, get some more data and analyze it further. The results of such processes are quantities and plots, which can be put in a separate database and displayed to our end users by the front end.
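The nightly flow described here (fetch, analyze, then fetch more depending on the result) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the threshold, and the sample readings are hypothetical stand-ins, not quant_kit code and not Airflow's API.

```python
from statistics import mean
from typing import Callable, Dict, List

Fetcher = Callable[[str], List[float]]


# Hypothetical stand-ins for serverless fetchers; real ones would call a service.
def fetch_air_quality(region: str) -> List[float]:
    samples = {"north": [12.0, 15.0, 18.0], "south": [40.0, 55.0, 61.0]}
    return samples.get(region, [])


def fetch_wind_speed(region: str) -> List[float]:
    samples = {"north": [5.0, 7.0], "south": [2.0, 4.0]}
    return samples.get(region, [])


def analyze(readings: List[float]) -> float:
    """Reduce raw readings to a single quantity (here, the mean)."""
    return mean(readings) if readings else 0.0


def nightly_run(
    regions: List[str],
    fetch_primary: Fetcher,
    fetch_secondary: Fetcher,
    threshold: float = 30.0,  # assumed cut-off that triggers the follow-up
) -> Dict[str, Dict[str, float]]:
    """Analyze primary data; fetch and analyze secondary data only when needed."""
    results: Dict[str, Dict[str, float]] = {}
    for region in regions:
        summary = {"primary_mean": analyze(fetch_primary(region))}
        if summary["primary_mean"] > threshold:
            # The first result decides whether the second fetch happens at all.
            summary["secondary_mean"] = analyze(fetch_secondary(region))
        results[region] = summary
    return results


results = nightly_run(["north", "south"], fetch_air_quality, fetch_wind_speed)
```

With these sample values, only the "south" region crosses the assumed threshold, so only it triggers the second fetch.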
No other, in fact. We were already aware of what Airflow is capable of when we had several data fetching processes scheduled with AWS CloudWatch Events and some initial ideas about what to do with the data we had.
What type of deployment of Airflow have you used, and why? Was it determined by your current architecture (i.e., the fact that you are cloud-native)?
Our product is currently in the development/PoC stage, so we wanted a semi-production environment. For the sake of cost management, we don’t want any VMs to run continuously, so we have Terraform scripts that set up a VM with Airflow running LocalExecutor and a separate MySQL DB. As we are currently exploring possible analyses, the data volume is not too high, and such a setup is sufficient. We will most likely switch to Kubernetes when we start generating analyses of moderate complexity daily.
A centralized scheduler for the serverless services responsible for data fetching was an excellent replacement for cloud-provider-specific schedulers. It allows for more elasticity and easier integration with the other processes and will probably help us transition to cloud-agnosticism in the future. The greater change, however, was on a more cultural level: the application of Airflow to our analytical processes. We realized that calling data sources and analyzing what they yield are interconnected and depend on the previous steps, so together they constitute a DAG (<3).
Not only did it help us run what needed to be run at specific points in time, but it also made us drop the previous vague way of thinking about sets of interconnected analyses in favor of a more systematic approach using directed graphs.
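To make the directed-graph view concrete, here is a minimal sketch using only Python's standard library (`graphlib`, Python 3.9+). The task names are hypothetical, and this is not how Airflow is programmed; Airflow resolves such an ordering itself. The sketch only shows that, once dependencies are written down explicitly, a valid execution order follows mechanically.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Hypothetical tasks; each one maps to the set of tasks it depends on.
dependencies: Dict[str, Set[str]] = {
    "fetch_weather": set(),
    "fetch_air_quality": set(),
    "analyze_pollution": {"fetch_weather", "fetch_air_quality"},
    "fetch_publications": {"analyze_pollution"},
    "analyze_further": {"analyze_pollution", "fetch_publications"},
    "store_results": {"analyze_further"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def is_dag(deps: Dict[str, Set[str]]) -> bool:
    """A dependency graph is a DAG only if it contains no cycle."""
    try:
        execution_order(deps)
    except CycleError:
        return False
    return True


order = execution_order(dependencies)
print(order)
```

The two fetch tasks have no dependencies, so either may come first; everything else is constrained by the edges. A circular dependency, such as two analyses that each wait for the other, would be rejected by `is_dag`.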
No negative effects to date. The shift in our approach can definitely be seen as a positive effect.
The documentation could have been more detailed at some points, and the community, however attentive and helpful, is not yet the size of a well-established framework's. We also spent some time getting MySQL to work properly as the back end.
How did you approach the integration process? Did you seek any help or assistance from a development studio?
No, we develop everything in-house. However, we will be looking for a person experienced in Airflow. Despite its novelty, Airflow has gained much traction in data engineering teams all around Warsaw. So there is a considerable ecosystem from which we could draw an experienced hire.
To any Python-centric organization or individual needing to schedule tasks that recur every once in a while. As the set of such tasks is surprisingly broad, I would advise considering any repetitive task, manual or somehow automated, as a candidate for an Airflow DAG.
Co-Founder, Banacha Street