first steps with Apache Beam in Java: tutorial

written by Software Engineer, EPAM Anywhere, Colombia

I'm a Java developer with 7+ years of experience and high capabilities in all aspects of the Java ecosystem, such as Spring Boot and GCP, designing solutions for people. I also love clean code and algorithms.


In this brief Apache Beam tutorial, you will learn how to get started with it using Java and try a quick example on your own.

All you need is:

  • Basic knowledge in Java
  • Your favorite IDE (I recommend IntelliJ IDEA)

What is Apache Beam?

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, covering ETL (Extract, Transform, Load) jobs as well as batch and stream data processing.

Common Apache Beam use cases:

  • Creating powerful batch and stream processing of big data
  • Processing billions of events per day in real time
  • Reducing stream and batch processing costs
  • Creating robust and scalable machine learning infrastructure to allow integrations with hundreds of applications

Apache Beam has some important concepts to know, as illustrated and described below:

A diagram of Apache Beam concepts

A Pipeline encapsulates the full set of operations applied to a dataflow, as shown in the example below:

A sequence of Apache Beam pipeline operations

The SDK is the set of dependencies used to create a pipeline and to build and apply transforms and PCollections to it. Apache Beam provides SDKs for Java, Python, Go, and TypeScript.

The Runner is the most important component because it actually executes the pipeline. We can choose among numerous runners such as Apache Flink, Apache Spark, and GCP Dataflow. For our example, we'll use the direct runner to test the pipeline locally.

A PTransform is a data processing operation, such as ParDo or Combine, that processes the elements of a PCollection.

A PCollection is the dataset (bounded or unbounded) that flows through the pipeline and that the transforms operate on.

Aggregation is used when we need to compute a result from multiple input elements, typically by grouping the elements of a PCollection by a common key.

Get started with Apache Beam step by step

Now that you know the basics, it’s time to try a real example using Apache Beam in Java. We’re going to filter two lists by a static letter and then merge them. Let’s get started.

1. First, download this repository with all of the required dependencies to save some time.

2. Open the repository, move to the example branch, go to the PipelineDemo class, and add the following code to create a pipeline:
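The exact code lives in the repository's example branch; a minimal sketch of this step (the class and variable names here are my own, the Beam packages are real) looks like this:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PipelineDemo {
    public static void main(String[] args) {
        // Build default pipeline options and create the Pipeline object
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline pipeline = Pipeline.create(options);
    }
}
```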

3. Import the Pipeline class from org.apache.beam.sdk. Now you’ve created a Pipeline object.

4. We can start creating PCollections to apply some transforms to, as follows:
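As a sketch, assuming the pipeline object created in the previous step (the concrete country names here are illustrative, not necessarily the ones used in the repository):

```java
// Turn two in-memory lists into PCollections with Create.of
PCollection<String> countriesOne = pipeline.apply("CountriesOne",
        Create.of("Colombia", "Canada", "Chile", "Peru"));
PCollection<String> countriesTwo = pipeline.apply("CountriesTwo",
        Create.of("Brazil", "Belgium", "Bolivia", "Spain"));
```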

Next, import PCollection from org.apache.beam.sdk.values, and Create from org.apache.beam.sdk.transforms.

5. Now it’s time to apply some transforms to the pipeline, so we’ll apply a ParDo transform to process and extract the countries that start with the letters C and B.
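One way to write such a ParDo, as a sketch (the variable names carry over from the previous step; the DoFn uses Beam's ProcessContext signature):

```java
// Keep only the countries that start with "C";
// an analogous DoFn filters the second list for "B"
PCollection<String> startsWithC = countriesOne.apply("FilterStartsWithC",
        ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                if (c.element().startsWith("C")) {
                    c.output(c.element());
                }
            }
        }));
```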

6. The previous step extracts the countries that start with a given letter and produces two PCollections. We’re going to join the two outputs and apply Flatten.pCollections() to flatten them and get the final merged list:
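A sketch of the merge, assuming the two filtered PCollections are named startsWithC and startsWithB:

```java
// Group both filtered PCollections into a PCollectionList, then flatten them
PCollection<String> merged = PCollectionList.of(startsWithC).and(startsWithB)
        .apply("MergeLists", Flatten.pCollections());
```

PCollectionList and Flatten come from org.apache.beam.sdk.values and org.apache.beam.sdk.transforms respectively.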

7. Now, apply the TextIO.write() transform to write the final output in a common format like a .txt file.
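For example (the output file name here is an assumption; withoutSharding() keeps everything in a single local file, which is convenient with the direct runner):

```java
// Write the merged PCollection to a single local text file
merged.apply(TextIO.write()
        .to("mergepcollections")
        .withSuffix(".txt")
        .withoutSharding());
```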

8. Finally, run the pipeline using the direct runner to execute it locally.
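With the beam-runners-direct-java dependency on the classpath, the direct runner is picked up by default, so running the pipeline locally is a single line:

```java
// Execute the pipeline and block until it finishes
pipeline.run().waitUntilFinish();
```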

The pipeline will create a mergepcollections output file with the following result:

A pipeline running result

That’s it! We’ve completed our quick Apache Beam tutorial, and now you know the basic steps you need to get started with it using Java. If you want to see more examples, feel free to explore the main branch.

Happy coding!
