A Testing Framework for Apache Beam
Apache Beam is a unified programming model used to implement both batch and streaming data processing jobs. Polidea delivered a testing framework that helps users make more informed decisions regarding whether or not Beam is the right choice for their project.
Let’s imagine that you run an e-commerce business. You probably want to provide your customers with the best possible service, by offering them products that might actually interest and encourage them to make a better purchase. Extracting usable insights from data—aka data processing—can help you with that. This goes for all the industries that rely heavily on Big Data, such as finance, travel, insurance, health, and more.
The Apache Software Foundation (ASF) community—the world’s largest open source non-profit foundation—offers a lot of tools to deal with Big Data, data processing, and cloud. One of them is Apache Beam—a unified programming model used to implement both batch and streaming data processing jobs.
KatarzynaSenior Software Engineer
Apache Beam gave the cloud community a completely new way of creating data processing systems. It was exciting to be able to bring this project closer to the users.
Apache Beam enables running jobs on different data processing engines (runners) and developing code in multiple languages (SDKs). It aims to open the world of data processing to unprecedented scenarios, for example running Python code on Apache Flink or even mixing different languages in one data processing pipeline. However, currently not every feature is supported equally in all SDKs and runners. Moreover, the performance and accuracy of Beam’s features is in some cases unknown due to the lack of proper tools to test them.
The client needed a testing framework that would allow Beam developers and users to:
Check if their solution is correct and performant (integration & performance testing), which could help to assess which parts of the Apache Beam code need fixing or optimizing.
Make an easy estimation of which Beam’s runner/IO/filesystem to use with Apache Beam. Running tests in simulated environments would clarify early on which approach might potentially generate the least amount of bugs and problems, as well as save the engineering team’s time and money.
Apache Beam logo—source
Scope of the project:
- Testing framework implementation
- ParquetIO connector development
- Java 11 support in Apache Beam
- Project migration from Maven to Gradle build system
- Code quality reviews
- Engagement with the community
Trusted by the client to be self-reliant, Polidea team worked independently in the project that required high-quality engineering skills. At the same time, the team made sure all work was transparent and compatible with the ASF community standards.
Our engineers became Apache Beam contributors and successfully implemented the testing framework, simplifying the client’s initial idea and—along the way— fixing a number of bugs in the Beam’s core source code.
The framework allows writing integration and performance tests that use different SDKs, real data processing engine instances, data sources (databases, filesystems), and significant loads of data. Thanks to the solution, the users and project developers can also:
- write tests that collect metrics and store them in BigQuery database for reference,
- run the tests in isolation (which is key in getting accurate and trustworthy results),
- create dashboards with collected results for analysis.
Our team also developed the ParquetIO connector, that allows users to access Parquet files easily from Beam’s data processing pipelines. Additionally, we speeded up work related to migrating the project to a new build system and providing Java 11 support for Beam.
Thanks to the implementation, both the client and Beam developers have more knowledge about the performance and general state of the Beam project, which allows them to make more informed business decisions. The project has been successfully migrated to Gradle build system and is close to fully support Java 11. The client can also take advantage of the ParquetIO connector in Java SDK.
Along the way, Polidea became recognizable in the Beam community— they contributed to the improvement of documentation and added their input in discussions surrounding the evolution of the project, proposing practical solutions. The team gave multiple talks promoting the project at tech events like Geecon 2019, Beam Summit (in 2018 and 2019) and GDG Warsaw. Plus, as a sentiment to the valuable contribution, we now have an official Apache Beam Committer on board!
Besides developing the testing framework our team has a broad experience in developing and optimizing data processing applications. So, if you need to build, scale or improve your data processing systems don’t hesitate to contact us.