April 21, 2020 | 8min read
Big Data for Your Business
While I was looking for inspiration for this blog post (and generating volumes of data while doing it), I came across Geoffrey Moore’s tweet about his thoughts on big data. It said, “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” I particularly liked this statement because it is not only a funny metaphor but also a true one, especially nowadays.
However, before taking a closer look at big data analytics itself, it’s worth remembering what big data is. The term describes large sets of data, either structured or unstructured, whose size and complexity are so great that conventional data processing tools (Excel sheets, Access databases, and other solutions prone to human error) are not suitable for them. Big data can also be defined using the famous 3Vs model, in which each V stands for:
- Volume: the amount of data generated, e.g., through social media websites such as Facebook, Twitter, and Instagram. An interesting fact is that about 300 million photos are uploaded to Facebook per day. (Source: Zephoria)
- Velocity: the speed at which data is generated. Google now processes over 40,000 search queries every second on average! This translates to roughly 3.5 billion searches per day and 1.2 trillion searches per year worldwide. (Source: Internet Live Stats)
- Variety: the range of structured and unstructured data that can be generated either by humans or by machines. Structured data follows a fixed format, such as rows in a database table, whereas texts, pictures, videos, tweets, voicemails, emails, and recordings are typically unstructured. Variety is all about the ability to classify incoming data into various categories. (Source: Whishworks)
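The search figures quoted under Velocity are easy to sanity-check with a bit of arithmetic. The snippet below is only a back-of-the-envelope sketch: it assumes a constant average of 40,000 queries per second, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the search-volume figures quoted above.
# Assumption: a constant average of 40,000 queries per second.
QUERIES_PER_SECOND = 40_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

queries_per_day = QUERIES_PER_SECOND * SECONDS_PER_DAY
queries_per_year = queries_per_day * 365

print(f"{queries_per_day / 1e9:.2f} billion per day")
print(f"{queries_per_year / 1e12:.2f} trillion per year")
```

Under that assumption the totals come out at about 3.46 billion searches per day and 1.26 trillion per year, in line with the rounded numbers in the list.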
Now it’s time to put that data to use so that it can bring value. How exactly can you do that? Use big data analytics: the process of examining large data sets to uncover hidden patterns. Thanks to big data analytics, you can learn about market trends, business performance, and much more.
Every business, no matter its size or field, needs valuable information and data. When it comes to taking strategic initiatives, improving relationships with customers or business partners, big data plays a significant role.
This matters especially now that all of us stay at home and leave traces, as well as valuable information, on the web. Because of COVID-19, many businesses have to take better care of their online presence and use infrastructure that will bring them the best insights. In fact, big data analytics is currently helping healthcare workers, policymakers, scientists, and epidemiologists to aggregate data about the virus, utilize it, and make better decisions for the future. You can read more about it in the latest Forbes article.
From a business perspective, the most crucial thing is being able to process the data and put it to use. You can analyze it to strengthen your business strategy or prepare well-thought-out marketing campaigns.
However, to be successful at this, you need specific knowledge of and skills in data analytics, or you can help yourself a bit and use one of the tools available on the market. Some of them are open source and used by companies ranging from SMEs to big corporations.
In this article, I’d like to show a few examples of using Apache Airflow and Apache Beam for business intelligence. At Polidea, we actively contribute to both of these open-source projects. I’m super proud to be a part of this journey and to have the opportunity to collaborate with Apache committers and PMC members daily. Most of the ones I know are my colleagues from Polidea :-)
Apache Airflow is an orchestration tool, developed by an active and wonderful community. It allows you to author and monitor your batch data pipelines in an iterative way. Many companies are using Airflow to orchestrate their data workflows in a centralized manner.
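To make the idea of orchestration a bit more concrete, here is a minimal sketch in plain Python. It deliberately does not use Airflow’s API: the task names and the `run_pipeline` helper are hypothetical, and the standard library’s `graphlib` module (Python 3.9+) stands in for a scheduler that runs each task only after its upstream tasks have finished.

```python
from graphlib import TopologicalSorter

# Hypothetical batch pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
    "report": {"load"},
}


def run_pipeline(deps):
    """Visit every task after its upstream tasks and return the execution order."""
    order = []
    for task in TopologicalSorter(deps).static_order():
        # A real orchestrator would execute the task here and track its state.
        order.append(task)
    return order


execution_order = run_pipeline(dependencies)
print(execution_order)
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this basic idea of running tasks in dependency order.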
Airflow was born at Airbnb. As the platform became more popular and its user base grew, the company struggled with the rapid growth of the data it generated. To become an entirely data-driven organization, Airbnb’s data engineering team needed to hire more people. Also, to keep the processes working, all of the engineers had to run batch jobs. All of this naturally consumed a lot of time and was full of repetitive tasks that, I assume, no one really liked. This was the moment when their data engineer Maxime Beauchemin decided to build the scheduling tool that became Apache Airflow and to release it as an open-source project. (Source: Astronomer)
Another example of a company that uses Airflow is Devoted Health. I had the pleasure of talking to Adam Boscarino, who is a Senior Data Engineer at Devoted Health, and he explained how Apache Airflow solved business and technological issues in a healthcare product.
Devoted Health offers Medicare Advantage to qualified people who look for tailored health insurance plans in the US. Currently available in parts of Texas and Florida, Devoted’s mission is to provide care to its members that you would want your family to have. Devoted works with various kinds of data from different parties such as plans’ members, providers (hospitals, doctors, pharmacies, etc.), insurers, as well as internal sources like the Devoted sales team.
From a business perspective, there is an ever-increasing amount of data that can be shown on dashboards and used by Devoted’s executive team. The value of this data has grown significantly thanks to Apache Airflow and its validation and check operators, which give the team confidence that they are reporting accurate metrics. Accuracy matters because the data plays a key role in Devoted’s business, powering metrics that are used to provide the best healthcare possible to its members.
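The general idea behind such validation steps can be sketched in a few lines of plain Python. This is not Airflow’s operator API and not Devoted’s actual code: the sample rows, the check functions, and the `run_checks` helper are all hypothetical. The point is only that a pipeline stops with an error before a questionable number reaches a dashboard.

```python
# Hypothetical rows that would feed a dashboard metric.
rows = [
    {"member_id": 1, "plan": "A"},
    {"member_id": 2, "plan": "B"},
    {"member_id": 3, "plan": "A"},
]


def check_not_empty(records):
    """Fail if the dataset has no rows."""
    if not records:
        raise ValueError("check failed: dataset is empty")


def check_unique(records, key):
    """Fail if the given key contains duplicate values."""
    values = [record[key] for record in records]
    if len(values) != len(set(values)):
        raise ValueError(f"check failed: duplicate values in {key!r}")


def run_checks(records):
    """Run all checks; return True only if every check passes."""
    check_not_empty(records)
    check_unique(records, "member_id")
    return True


checks_passed = run_checks(rows)
print("checks passed:", checks_passed)
```

If a check raises an error, downstream steps such as publishing the metric would simply not run.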
What did reality look like without Airflow, before Adam joined Devoted Health’s engineering team? Back then, there were approximately 12 data scientists producing code that had to be put into pipelines, with only one engineer able to deploy it with cron. Since the introduction of Airflow, the data scientists have become more self-sufficient. They now use internal tooling developed by the Data Engineering team to build pipelines without needing to know the internals of Airflow, and they can deploy those pipelines with confidence. Thanks to Airflow, the Data Science team owns its entire process from development to production. Writing a Data Science pipeline has never been easier.
Recently, I’ve also checked out Astronomer’s Airflow podcast, where I came across a great interview with Maksim Pecherskiy (CDO, San Diego), in which he explained how Apache Airflow is used in the public sector. I won’t present the full case study here, but I want to mention the business benefits that Maksim presented. In the interview, he summarized how using Apache Airflow helped them:
“As a result, we help city staff save time, access data faster, get answers faster, and make better decisions more quickly. In addition, we save money in software development costs - our vendors charge quite a bit of money to create new reports.”
To sum up, Airflow itself won’t solve a business problem. However, it helps the people who work on business problems in which data plays a significant role, both when they experiment and when they scale. It also makes working with that data more efficient.
Apache Beam is another excellent tool to mention. It is a unified programming model for defining both batch and streaming data-parallel processing pipelines. Beam takes care of running a pipeline written in your language of choice on different data processing engines. This portability, along with the unified model, makes Beam suitable both for smaller tasks on finite datasets and for large, complicated operations on infinite streams of data. Developers can therefore focus on the data processing itself and on the big picture of their large-scale pipelines, instead of customizing programs to fit each task or data processing backend. It is an open-source project that great folks from Polidea contribute to as well.
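The idea of one pipeline definition for both finite and infinite data can be illustrated without Beam itself. The sketch below is plain Python and does not use Beam’s API. The `pipeline` function, the event fields, and the `event_stream` generator are all hypothetical, but they show the same transform logic applied first to a bounded list and then to an unbounded generator.

```python
from itertools import count, islice


def pipeline(events):
    """Keep purchase events and convert their amounts from cents to dollars."""
    purchases = (e for e in events if e["type"] == "purchase")
    return (e["amount_cents"] / 100 for e in purchases)


# Bounded (batch) input: a finite list of hypothetical events.
batch = [
    {"type": "purchase", "amount_cents": 1250},
    {"type": "page_view", "amount_cents": 0},
    {"type": "purchase", "amount_cents": 800},
]
batch_result = list(pipeline(batch))


# Unbounded (streaming) input: an endless generator of hypothetical events.
def event_stream():
    for i in count():
        yield {"type": "purchase", "amount_cents": 100 * (i + 1)}


# The stream never ends, so only the first three results are taken here.
stream_result = list(islice(pipeline(event_stream()), 3))

print(batch_result)
print(stream_result)
```

A real Beam pipeline additionally handles distribution across workers, windowing, and the choice of execution engine, none of which this toy example attempts.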
As for businesses that use Apache Beam as their programming model, I can name at least a few. The best known include Lyft, Sky, Spotify, and, on the Polish market, Allegro.
When is it a good idea to use the Beam model? Let’s investigate Lyft’s case. For those of you who are not familiar with Lyft, it’s a ride-sharing company based in San Francisco and now one of Uber’s biggest competitors. In transport technology, you have two sides of the market that you have to balance to maintain an efficient system. I’m talking about real-time price changes during high demand as well as keeping a good passenger experience.
“This complex system makes real-time decisions using various data sources; machine learning models; and a streaming infrastructure for low latency, reliability and scalability. In this streaming infrastructure, our system consumes a massive number of events from different sources to make these pricing decisions.”
The quote comes from a great presentation given by Rakesh Kumar, Software Engineer at Lyft, at Beam Summit 2019 in North America.
A big challenge Rakesh talked about was reacting to various events with legacy infrastructure based on cron (a time-based job scheduler in Unix-like operating systems).
By leveraging Apache Beam, Lyft’s Streaming Platform powers pricing by bringing together the best of two worlds: ML models in Python and Apache Flink on JVM as the streaming engine.
My conclusion is to highlight the importance of real-time stream processing tools such as Apache Beam. If your business generates big volumes of data and has to react to rapid, real-time market changes, you might consider Apache Beam, which can run on several execution engines, such as Dataflow or Spark.
Thinking ahead about the growth of your business and knowing how to analyze tons of data to stay ahead of your competitors sounds like a good choice. Naturally, you’d have to cooperate with data scientists or engineers to be able to do it. However, diverse open-source big data analytics tools that can automate their work and save you from additional costs can do miracles when it comes to setting up marketing campaigns, preparing business plans, and reacting to your customers’ needs on a regular basis. That’s the true power of big data analytics.
If you have any questions about Apache Airflow or Beam-related projects, don’t hesitate to contact us. Have a big data project in mind? Our Committers and Contributors will be happy to help!
Head of Cloud & Open Source