A Testing Framework for Apache Beam

Apache Beam is a unified programming model used to implement both batch and streaming data processing jobs. Polidea delivered a testing framework that helps users make more informed decisions regarding whether or not Beam is the right choice for their project.


Let’s imagine that you run an e-commerce business. You probably want to provide your customers with the best possible service by offering them products that might actually interest them and encourage a purchase. Extracting usable insights from data, aka data processing, can help you with that. The same goes for all the industries that rely heavily on Big Data, such as finance, travel, insurance, health, and more.

The Apache Software Foundation (ASF) community, the world’s largest open source non-profit foundation, offers many tools for dealing with Big Data, data processing, and the cloud. One of them is Apache Beam: a unified programming model used to implement both batch and streaming data processing jobs.

Katarzyna, Senior Software Engineer:

“Apache Beam gave the cloud community a completely new way of creating data processing systems. It was exciting to be able to bring this project closer to the users.”

The challenge

Apache Beam enables running jobs on different data processing engines (runners) and developing code in multiple languages (SDKs). It aims to open the world of data processing to unprecedented scenarios, for example running Python code on Apache Flink or even mixing different languages in one data processing pipeline. However, not every feature is currently supported equally in all SDKs and runners. Moreover, the performance and accuracy of Beam’s features are in some cases unknown due to the lack of proper tools to test them.
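To give a flavor of that portability, here is a minimal, hypothetical pipeline written with Beam’s Java SDK. The same code can be submitted to different runners (for example the Direct or Flink runner) simply by changing the --runner pipeline option at launch time, provided the chosen runner’s dependency is on the classpath:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RunnerAgnosticPipeline {
  public static void main(String[] args) {
    // The execution engine is picked at launch time,
    // e.g. --runner=DirectRunner or --runner=FlinkRunner,
    // while the pipeline definition itself stays unchanged.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply(Create.of("batch", "and", "streaming"))
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((String word) -> word.toUpperCase()));

    pipeline.run().waitUntilFinish();
  }
}
```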

The client needed a testing framework that would allow Beam developers and users to:

  • Check if their solution is correct and performant (integration & performance testing), which could help to assess which parts of the Apache Beam code need fixing or optimizing.

  • Easily estimate which Beam runner, IO, or filesystem to use with Apache Beam. Running tests in simulated environments would clarify early on which approach might generate the fewest bugs and problems, as well as save the engineering team’s time and money.

Other than that, the client asked Polidea to implement the ParquetIO connector, as well as to help the Beam community provide tests for Java 11 and migrate the project from Maven to the Gradle build system.

[Image: Apache Beam logo]

Scope of the project:

  • Testing framework implementation
  • ParquetIO connector development
  • Java 11 support in Apache Beam
  • Project migration from Maven to Gradle build system
  • Documentation
  • Code quality reviews
  • Engagement with the community

The solution

Trusted by the client to be self-reliant, the Polidea team worked independently on a project that required high-quality engineering skills. At the same time, the team made sure all work was transparent and compatible with the ASF community standards.

Our engineers became Apache Beam contributors and successfully implemented the testing framework, simplifying the client’s initial idea and, along the way, fixing a number of bugs in Beam’s core source code.

The framework allows writing integration and performance tests that use different SDKs, real data processing engine instances, data sources (databases, filesystems), and significant loads of data. Thanks to the solution, users and project developers can also (see the sketch after this list):

  • write tests that collect metrics and store them in a BigQuery database for reference,
  • run the tests in isolation (which is key to getting accurate and trustworthy results),
  • create dashboards with the collected results for analysis.
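As an illustration, here is a minimal sketch of the kind of test the framework enables, based on the Beam Java SDK’s Metrics API. The pipeline, class name, and data below are hypothetical; in the actual framework, dedicated utilities read such counters from the PipelineResult and publish them, together with run times, to BigQuery.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class SimpleMetricsTest {

  // A DoFn that counts every element it processes using Beam's Metrics API.
  static class CountingFn extends DoFn<String, String> {
    private final Counter processed = Metrics.counter(CountingFn.class, "processed_elements");

    @ProcessElement
    public void processElement(ProcessContext c) {
      processed.inc();
      c.output(c.element());
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        .apply("GenerateTestData", Create.of("a", "b", "c"))
        .apply("CountElements", ParDo.of(new CountingFn()));

    PipelineResult result = pipeline.run();
    result.waitUntilFinish();

    // Query the collected counter; a real performance test would publish values
    // like this (together with run time) to an external store such as BigQuery.
    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named(CountingFn.class, "processed_elements"))
            .build());
    metrics.getCounters().forEach(
        counter -> System.out.println(counter.getName() + " = " + counter.getAttempted()));
  }
}
```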

Our team also developed the ParquetIO connector, which allows users to easily access Parquet files from Beam’s data processing pipelines. Additionally, we sped up work related to migrating the project to a new build system and providing Java 11 support for Beam.
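For instance, reading Parquet files in a pipeline boils down to providing an Avro schema and a file pattern. The schema and path below are made up for illustration; the ParquetIO.read(...).from(...) call is the connector’s documented entry point:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.parquet.ParquetIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadParquetExample {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Avro schema describing the records stored in the Parquet files
    // (the record and field names are illustrative).
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Purchase\",\"fields\":["
            + "{\"name\":\"user_id\",\"type\":\"string\"},"
            + "{\"name\":\"amount\",\"type\":\"double\"}]}");

    // ParquetIO.read takes the Avro schema and a file pattern
    // and produces a PCollection of GenericRecords.
    PCollection<GenericRecord> records =
        pipeline.apply("ReadParquet", ParquetIO.read(schema).from("/path/to/input/*.parquet"));

    // ...further transforms and a sink would follow here...

    pipeline.run().waitUntilFinish();
  }
}
```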


The outcomes

Thanks to the implementation, both the client and Beam developers have more knowledge about the performance and general state of the Beam project, which allows them to make more informed business decisions. The project has been successfully migrated to the Gradle build system and is close to fully supporting Java 11. The client can also take advantage of the ParquetIO connector in the Java SDK.

Along the way, Polidea became recognizable in the Beam community: the team contributed to improving the documentation and added their input to discussions surrounding the evolution of the project, proposing practical solutions. The team gave multiple talks promoting the project at tech events like GeeCON 2019, Beam Summit (in 2018 and 2019), and GDG Warsaw. Plus, as a testament to this valuable contribution, we now have an official Apache Beam Committer on board!

Besides developing the testing framework, our team has broad experience in developing and optimizing data processing applications. So, if you need to build, scale, or improve your data processing systems, don’t hesitate to contact us.


Interested in the project?

Set up a call to learn more!

