creating a streaming pipeline in apache beam

updated 24 Feb 2024

written by

Erick Romero

Software Engineer, EPAM Anywhere, Colombia

I'm a Java developer with 7+ years of experience and high capabilities in all aspects of the Java ecosystem, such as Spring Boot and GCP, designing solutions for people. I also love clean code and algorithms.

In this quick tutorial, we will learn how to stream files through Google Cloud Pub/Sub and store their contents in BigQuery using Apache Beam.

All you need is:

A Google account (and a bank card to enable the free trial that comes with a US$300 credit)
Basic knowledge of Java
Basic knowledge of Apache Beam
Your favorite IDE (I recommend IntelliJ IDEA)

Step 1: Download the repository

First off, to save some time, we are going to download this GitHub repository, which already includes all the required dependencies.

Step 2: Create a new Google Cloud project

In the Google Cloud console, on the project selector page, create a new Google Cloud project.

Give the project any name and click “create.”

We need to enable billing for our project, but do not worry: the free trial comes with US$300 in credit to experiment with.

Click on the navigation menu on the left and select the “Billing” option. Then click on “LINK A BILLING ACCOUNT.”

Select an organization type using the drop-down menu (in my case I chose “Personal Project”), then accept the Terms of Service.

The second step is to provide a phone number and verify your identity.

You will need to add a bank card and confirm that you are not a robot. After that, click on “START MY FREE TRIAL.”

Now that billing is activated for our GCP project, we can try out the services and disable them afterwards to avoid extra charges.

Step 3: Enable the Pub/Sub service

First, search for “Cloud Pub/Sub API” and click on the first result, then click on the “enable” button.

Now it is time to create a service account to manage access to your services.

Search for “Service Accounts” and select it.
Click on “New Service Account” and enter a name for the service account.

Click on “CREATE AND CONTINUE.”

Now we need to grant the service account access to our project. Click on “SELECT A ROLE” and add these roles: Pub/Sub Admin and BigQuery Admin.

Then click on “CONTINUE” and “DONE.”

Step 4: Create a service account key

Create a service account key to access the account from outside of GCP.

On the service accounts page, click on the email of the new service account in the table and go to the “Keys” tab.

Click on “ADD KEY,” create a new key in JSON format, and store it on your computer.

Step 5: Create a topic and subscription in this project

Go to the Google Cloud Shell terminal and run the following commands:

Awesome! Now that our topic and subscription are created, be sure to copy the project name and the subscription name because we will use them later.
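When the project and subscription names are used later in code, Pub/Sub generally expects them as fully qualified resource paths rather than bare names. As a rough plain-Java sketch of that format (the class name and the `my-project`, `my-topic`, and `my-subscription` values are placeholders of mine, not the names used in this tutorial):

```java
// Plain-Java sketch: builds fully qualified Pub/Sub resource paths.
// All IDs passed in below are hypothetical placeholders; use your own values.
class PubSubNames {

    // Path format for a topic: projects/<project>/topics/<topic>
    static String topicPath(String projectId, String topicId) {
        return String.format("projects/%s/topics/%s", projectId, topicId);
    }

    // Path format for a subscription: projects/<project>/subscriptions/<subscription>
    static String subscriptionPath(String projectId, String subscriptionId) {
        return String.format("projects/%s/subscriptions/%s", projectId, subscriptionId);
    }

    public static void main(String[] args) {
        System.out.println(topicPath("my-project", "my-topic"));
        System.out.println(subscriptionPath("my-project", "my-subscription"));
    }
}
```

Keeping the path building in one place makes it harder to mix up the topic path and the subscription path later on.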

Step 6: Set up a streaming pipeline

Search for the BigQuery service and go to the “Viewing pinned projects” option. You will see that our new project is already listed there. Click on the three dots and select “create dataset,” then name it “streamingtest.”

Within streamingtest, click on the new dataset to create a table.

Name the table “pubsubtest” and add the fields you need by clicking the + symbol in the Schema options, then create the table.

Now we need to set some arguments in IntelliJ and add these parameters:
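The exact parameters from the original listing are not reproduced here. As a rough illustration only, Beam pipeline options are passed as program arguments in the `--name=value` form, and the plain-Java sketch below shows how arguments of that shape break down into names and values. The `ArgsSketch` class and the `project` and `subscription` argument names are assumptions of mine; in a real pipeline Beam does this parsing for you.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch, not Beam code: splits "--name=value" program arguments.
class ArgsSketch {

    // Collects "--name=value" arguments into a map; anything else is ignored.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                continue; // not an option
            }
            int eq = arg.indexOf('=');
            if (eq <= 2) {
                continue; // no '=' present, or the name is empty
            }
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        return options;
    }

    public static void main(String[] args) {
        // Hypothetical values; replace them with your own project and subscription.
        String[] example = {"--project=my-project", "--subscription=my-subscription"};
        System.out.println(parse(example));
    }
}
```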

Create an environment variable called “GOOGLE_APPLICATION_CREDENTIALS” and point it to the JSON service key that we downloaded before.
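A wrongly set variable tends to show up only as an authentication error once the pipeline starts, so it can be worth checking it up front. The following is a minimal sketch, assuming nothing beyond the JDK; the `CredentialsCheck` class and its status strings are names I made up for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Plain-Java sketch: sanity-checks the value of GOOGLE_APPLICATION_CREDENTIALS.
class CredentialsCheck {

    // Returns a short status for the given variable value (status names are hypothetical).
    static String check(String value) {
        if (value == null || value.trim().isEmpty()) {
            return "NOT_SET";
        }
        Path keyFile = Paths.get(value);
        if (!Files.isRegularFile(keyFile)) {
            return "FILE_MISSING";
        }
        if (!Files.isReadable(keyFile)) {
            return "NOT_READABLE";
        }
        return "OK";
    }

    public static void main(String[] args) {
        System.out.println(check(System.getenv("GOOGLE_APPLICATION_CREDENTIALS")));
    }
}
```

Running it from the same IntelliJ run configuration as the pipeline confirms that the IDE actually passes the variable through.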

Now that the GCP project is set up, we need to import the initial project that we downloaded in Step 1 into the IDE.

Once it is imported, we will add some lines of code.

First, create an interface where we can add a few custom arguments:

Next, create a class called “saveToBigquery” which extends the DoFn class:

Here we map the file contents into a TableRow object to insert into BigQuery.
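The mapping idea itself can be sketched without Beam. A BigQuery TableRow behaves like a map from column name to value, so the plain-Java version below uses an ordinary `Map` as a stand-in. The `content` and `received_at` column names are hypothetical, since the table schema is up to you, and this is not the original “saveToBigquery” class.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the mapping step; a Map stands in for BigQuery's TableRow.
class RowMappingSketch {

    // Builds one row from one incoming text element.
    // The column names "content" and "received_at" are assumptions, not the tutorial's schema.
    static Map<String, Object> toRow(String message, Instant receivedAt) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("content", message.trim());
        row.put("received_at", receivedAt.toString());
        return row;
    }

    public static void main(String[] args) {
        System.out.println(toRow("  hello from a file  ", Instant.now()));
    }
}
```

In the Beam version, the same logic would sit inside the DoFn's element-processing method and emit the row instead of returning it.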

Finally, add the logic to create a pipeline and apply several transformations to our file.

In the main method, add the following code:

Here we create a pipeline that reads the files we drop into the input folder, checks every minute for new files, publishes their contents to the topic, and reads them back from the subscription.

Finally, we map this information and save it in BigQuery.
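The once-a-minute check for new files comes down to remembering which files have already been seen and reporting only the rest. Beam ships its own support for watching a file pattern, so the pipeline does not need this code; the plain-Java sketch below, with a hypothetical `NewFileTracker` class, only illustrates the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch: reports only the file names that earlier polls have not seen.
class NewFileTracker {

    private final Set<String> seen = new HashSet<>();

    // Returns the names not reported by any earlier poll, in listing order.
    List<String> poll(List<String> currentListing) {
        List<String> fresh = new ArrayList<>();
        for (String name : currentListing) {
            if (seen.add(name)) { // add() returns false for names already seen
                fresh.add(name);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        NewFileTracker tracker = new NewFileTracker();
        System.out.println(tracker.poll(Arrays.asList("a.txt")));          // prints [a.txt]
        System.out.println(tracker.poll(Arrays.asList("a.txt", "b.txt"))); // prints [b.txt]
    }
}
```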

To make it work, just add any .txt file to the input folder in the root of the project, and its content should be saved in BigQuery.

And there you go!

In this quick tutorial, we learned how to create a streaming pipeline in Apache Beam using Pub/Sub. You can also do this with Amazon Kinesis.

Additionally, you can use Google Cloud Storage or Cloud Functions to publish files to the topic and to consume them from subscriptions.

Important! Do not forget to remove the topic, subscription, and project from the GCP console to avoid extra charges.

You can find the final project on my GitHub.

Happy coding!
