
August 25, 2020 | 4 min read

Why Apache Airflow? Interview with GoDaddy

Amit Kumar is a Principal SDE at GoDaddy, an American publicly traded Internet domain registrar and web hosting company. Amit is a big data professional with experience in building and working with data platforms, data engineering, ingestion, visualization, business intelligence, and DWH solutions across a range of technologies.

Can you briefly describe what your company does?

GoDaddy is a well-known company that owns many brands globally, including HEG in Europe. It registers domains, hosts websites, and provides related services and products that enable small businesses to get online and succeed in their online ventures.

Prior to implementing Airflow, had you been using any other orchestration tools?

Yes, Oozie and cron.

What is a specific use case of Airflow at GoDaddy?

We have many different batch analytics and data teams that need an orchestration tool with ready-made operators for building ETL pipelines, as well as a consistent way to integrate all components and related tasks.
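For illustration, here is a minimal sketch of the kind of ETL pipeline such teams build from Airflow's ready-made operators. The imports follow Airflow 1.10, current at the time of this interview; the DAG name, schedule, and transform logic are hypothetical, not an actual GoDaddy pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def transform(**context):
    # Placeholder transform step; a real task would read the extracted
    # batch, apply business logic, and write the result back out.
    print("transforming batch for", context["ds"])


with DAG(
    dag_id="example_etl",  # hypothetical name
    start_date=datetime(2020, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pull source data for {{ ds }}'",
    )
    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform,
        provide_context=True,  # required on Airflow 1.10
    )
    load = BashOperator(
        task_id="load",
        bash_command="echo 'load transformed data for {{ ds }}'",
    )

    # The >> operator expresses task dependencies declaratively.
    extract >> transform_task >> load
```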

Did you consider any other solutions?

No, just Airflow.

What made you choose Airflow over other technologies? Did the fact that Airflow is an open-source solution help?

Most decision-makers initially did not like the idea of using Airflow, but all the teams who started using it and got the hang of it really enjoyed it. So yes, being open source really helps in getting to know the details and how the solution works internally. I created a hybrid Airflow cluster on EKS in June 2018, with some workers on on-prem edge nodes, but our decision-makers wanted to wait, given that we did not have the capacity and team to maintain it at the time. This is why we decided to go down the route of edge nodes.

GoDaddy website screenshot. Source: GoDaddy website

Does your company use any other workflow solution right now and if so, why?

Many teams are still using Oozie, since we are still in the process of migrating to AWS from our on-prem Hadoop cluster.

What type of deployment of Airflow have you used, and why?

Currently, we have multiple Airflow instances running on on-prem Hadoop edge nodes. We have created reusable scripts to install Airflow on edge nodes for any team that wants to operate it. Each team owns its own Airflow instance and can use multiple edge nodes to scale horizontally. To deploy DAGs, the teams use Jenkins to push DAGs and plugins to the Airflow nodes. We have recently created one cluster using Celery on AWS EKS, but given that we are still migrating to AWS, it is not yet being used. In the near future we want to have a large shared Airflow cluster to serve the needs of all teams and use cases.
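As a side note, one way a Celery-backed deployment like the EKS cluster mentioned above can serve many teams is by routing tasks to per-team worker queues. A minimal sketch, assuming Airflow 1.10 with the CeleryExecutor enabled; the queue name and task are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="per_team_routing",  # hypothetical name
    start_date=datetime(2020, 8, 1),
    schedule_interval=None,
) as dag:
    ingest = BashOperator(
        task_id="ingest_partition",
        bash_command="echo 'ingest {{ ds }}'",
        # With the CeleryExecutor, this task is picked up only by workers
        # started with: airflow worker -q team_a
        queue="team_a",
    )
```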

Did you build a team dedicated to Airflow inside your company?

I have wanted an orchestration team since March 2018. I run multiple Airflow Slack channels, with admins and users maintaining different instances of Airflow. The need for an orchestration team has been raised by the data teams multiple times in the past, and as a result we are now seriously considering building one as part of our data platform team.

What was the challenge in the integration process? Did you face any technical difficulties?

There were challenges due to the limitations of the Hadoop edge node VMs, such as maintenance and operation, which is why we decided to delegate ownership to the user teams. DAG isolation, sub-DAGs, data lineage, multi-DAG dependencies, and a UI/template for submitting DAGs are a few areas we are looking forward to seeing in Airflow. Moreover, some of the edge nodes available to us only had CentOS 6.

How did you do it? Did you seek any help or assistance from a development studio?

No, we did everything in-house, with a lot of help from the open-source community.

How did the implementation influence your processes?

We decided early on not to use a central, shared Airflow cluster due to the lack of DAG isolation, scheduler issues, tasks being queued but not progressing, circular dependencies caused by XCom use in sub-DAGs, and our developers being new to Airflow. It took us a few months to understand the nuances of Airflow and achieve stability.
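For readers unfamiliar with XCom: it is Airflow's built-in mechanism for passing small values between tasks, and heavy use of it inside sub-DAGs can create implicit dependencies like the circular ones mentioned above. A minimal sketch of plain, cycle-free XCom usage (Airflow 1.10 imports, task names hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def push_count(**context):
    # Publish a small value for downstream tasks to read.
    context["ti"].xcom_push(key="row_count", value=42)


def pull_count(**context):
    # Read the value the upstream task published.
    rows = context["ti"].xcom_pull(task_ids="push_count", key="row_count")
    print("upstream reported", rows, "rows")


with DAG(
    dag_id="xcom_example",  # hypothetical name
    start_date=datetime(2020, 8, 1),
    schedule_interval=None,
) as dag:
    push = PythonOperator(task_id="push_count", python_callable=push_count,
                          provide_context=True)
    pull = PythonOperator(task_id="pull_count", python_callable=pull_count,
                          provide_context=True)
    push >> pull  # trouble starts when data flows against this direction
```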

To whom would you recommend using Airflow?

Most of our data teams and I love using Airflow. I really like the community discussions, and I want to keep contributing to the open-source community. I would recommend Airflow to batch data ingress, data migration, FTP file ingestion, egress, enterprise data, analytics, data science, and ML teams.

Tomek Urbaszek

Software Engineer

Amit Kumar

Principal SDE, GoDaddy

Did you enjoy the read?

If you have any questions, don’t hesitate to ask!