July 16, 2020 | 3 min read
Apache Spark vs. Apache Beam—What to Use for Data Processing in 2020?
The big data industry has seen the emergence of a variety of new data processing frameworks in the last decade. One of them is Apache Spark, a data processing engine that offers in-memory cluster computing with built-in extensions for SQL, streaming and machine learning. Apache Spark was open sourced in 2010 and donated to the Apache Software Foundation in 2013. Since then, the project has become one of the most widely used big data technologies. According to the results of a survey conducted by AtScale, Cloudera and ODPi.org, Apache Spark is the most popular framework when it comes to artificial intelligence and machine learning.
Apache Beam is a different story. According to the project’s description, Apache Beam is a unified programming model for both batch and streaming data processing. If you haven’t heard about Apache Beam yet, or you aren’t sure about its role in the big data world, just visit my previous blog post.
How do Apache Spark and Apache Beam relate to each other? Beam’s intention isn’t to replace Apache Spark. Instead, Beam promises to unify all data processing engines under a single model and syntax. The idea is to write a pipeline using one of the SDKs that Beam provides and execute it on one of the supported engines, Apache Spark being one of them. You may ask: why should I use Apache Beam when Spark’s API is well known and many tutorials can be found on the Internet? Let’s go through the pros and cons of using Beam over Spark.
The main advantage of using Beam is portability across data processing engines. If Spark no longer satisfies your company’s needs, the transition to a different execution engine is painless with Beam. On the other hand, if your code is written natively for Spark, the cost of retraining data analysts and software developers (or even hiring new ones!) is tremendously high. This scenario becomes more likely once we realise that the market hasn’t settled on any single engine as the default standard.
It is worth noting that Beam is neither an intersection nor a union of the capabilities offered by the execution engines. Beam defines its own model. As a result, Beam may give you features that are not available natively. For example, running Beam on top of Spark used to be the only way to use features like watermarks, windowing and triggering on that engine.
However, that handy abstraction layer comes at a cost.
First, Beam will often be one step behind new functionality that is available, or about to become available, in the underlying engines. For Apache Spark, the biggest gap at the moment is probably the Spark runner’s limited support for streaming pipelines. Another example: there is no easy way to run Beam pipelines on a Spark cluster managed by YARN.
Apart from the functionality cost, there is also a performance cost. Researchers from the University of Potsdam developed a set of benchmarks that read messages from Apache Kafka, execute a query such as sampling data or matching a regular expression, and write the results back to Kafka. They showed that Beam has a noticeably negative impact on performance in almost all cases. However, keep in mind that the version of Beam the authors used was quite old: 2.3.0. Beam is still under heavy development, and a lot has changed since then. One sign of this: in 2019, Beam’s mailing list for developers was the most active among all Apache projects!
Finally, if you would like to use the R programming language for developing your applications, Beam is not a good choice for you: its SDKs currently cover Java, Python and Go, while Spark supports R through SparkR.
If you are starting your project from scratch, Apache Beam gives you a lot of flexibility. The Beam model is constantly adapting to market changes, with the ultimate goal of bringing its benefits to all execution engines. However, before you decide on Beam, make sure all the features you need are available.
Beam’s community is steadily growing, so you can expect support from developers for a long time to come. And, of course, if you have any questions or need help with data processing, our team is also happy to help!