August 25, 2020 | 4 min read
Why Apache Airflow? Interview with GoDaddy
Amit Kumar is a Principal SDE at GoDaddy, an American publicly traded Internet domain registrar and web-hosting company. Amit is a big-data professional with experience building and operating data platforms, data engineering and ingestion pipelines, visualizations, business intelligence, and DWH solutions using a range of technologies.
GoDaddy is a well-known company that owns many brands globally, including HEG in Europe. It registers domains, hosts websites, and provides related services and products that help small businesses come online and succeed in their online ventures.
Yes, Oozie and cron.
We have many different batch-analytics and data teams that need an orchestration tool, ready-made operators for building ETL pipelines, and a consistent way to integrate all components and related tasks.
No, just Airflow.
What made you choose Airflow over other technologies? Did the fact that Airflow is an open-source solution help?
Most decision-makers initially did not like the idea of using Airflow, but all the teams who started using it and got the hang of it really enjoyed it. So yes, being open source really helps, because you can get to know the details of how the solution works internally. I created a hybrid Airflow cluster on EKS in June 2018, with some workers on on-prem edge nodes, but our decision-makers wanted to wait, given that we did not have the capacity or a team to maintain it at the time. This is why we decided to go the route of edge nodes.
Source: GoDaddy website
Many teams are still using Oozie, since we are still in the process of migrating from our on-prem Hadoop cluster to AWS.
Currently, we have multiple Airflow instances running on on-prem Hadoop edge nodes. We have created reusable scripts to install Airflow on edge nodes for any team that wants to operate it. Each team owns its own Airflow instance and can use multiple edge nodes to scale horizontally. To deploy DAGs, teams use Jenkins to push DAGs and plugins to the Airflow nodes. We have recently created one cluster using the Celery executor on AWS EKS, but given that we are still migrating to AWS, it is not yet in use. In the near future we want a large shared Airflow cluster to serve the needs of all teams and use cases.
I have wanted an orchestration team since March 2018. I run multiple Airflow Slack channels, with admins and users maintaining different instances of Airflow. Data teams have raised the need for an orchestration team multiple times in the past, and as a result we are now seriously considering building one as part of our data platform team.
There were challenges due to the limitations of the Hadoop edge-node VMs, such as maintenance and operation, which is why we decided to defer ownership to the user teams. DAG isolation, sub-DAGs, data lineage, multi-DAG dependencies, and a UI/template for submitting DAGs are a few areas we look forward to seeing in Airflow. Moreover, some of the edge nodes available to us only had CentOS 6.
No, we did everything in-house, with a big help from the open-source community.
We decided early on not to use a central, shared Airflow cluster due to the lack of DAG isolation, scheduler issues, tasks being queued or not progressing, circular dependencies created via XComs in sub-DAGs, and our developers being new to Airflow. It took us a few months to understand the nuances of Airflow and achieve stability.
Most of our data teams and I love using Airflow. I really like the community discussions, and I want to contribute to the open-source community. I would recommend Airflow for batch data ingress and egress, data migration, FTP file ingestion, enterprise data, analytics, data science, and ML teams.
Principal SDE, GoDaddy