In this quick tutorial, we will learn how to stream files using Google Cloud Pub/Sub and store them in BigQuery using Apache Beam.
All you need is:
- A Google account (and a bank card to enable the free trial that comes with a US$300 credit)
- Basic knowledge of Java
- Basic knowledge of Apache Beam
- Your favorite IDE (I recommend IntelliJ IDEA)
Step 1: Download the repository
Step 2: Create a new Google Cloud project
Give the project any name and click “create.”
We need to enable billing for our project, but do not worry: Google provides US$300 in free trial credit to experiment with.
Click on the navigation menu on the left and select “Billing.” Then click on “LINK A BILLING ACCOUNT.”
Select an organization type using the drop-down menu (in my case I chose “Personal Project”), then accept the Terms of Service.
The second step is to provide a phone number and verify your identity.
You will need to add a bank card and confirm that you are not a robot. After that, click on “START MY FREE TRIAL.”
Now that billing is activated for our GCP project, we can try out the services and disable them all afterwards to avoid extra charges.
Step 3: Enable the Pub/Sub service
First, search for “Cloud Pub/Sub API” and click on the first result, then click on the “enable” button.
Step 4: Create a service account
Now it is time to create a service account to manage access to your services.
- Search for “Service Accounts” and select it
- Click on “New Service Account” and enter a name for the service account
Click on “CREATE AND CONTINUE.”
Now we need to grant the service account access to our project by adding these roles: Pub/Sub Admin and BigQuery Admin.
Click on “SELECT A ROLE,” then click on “CONTINUE” and “DONE.”
Step 5: Create a topic and subscription in this project
Open the Google Cloud Shell terminal and run the following commands:
Awesome! Now we have our topic and subscription created, so be sure to copy the name of the project and the subscription because we will use them later.
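Pub/Sub clients, including Beam's Pub/Sub connector, identify topics and subscriptions by their fully qualified resource names, which follow the patterns projects/<project>/topics/<topic> and projects/<project>/subscriptions/<subscription>. The small helper below is only a sketch of how the names you just copied fit together; the class name and the placeholder values my-project, my-topic and my-subscription are assumptions, not names from this tutorial's repository.

```java
// Sketch: build fully qualified Pub/Sub resource names from short names.
// The class and the placeholder values are illustrative, not part of the tutorial project.
final class PubSubNames {
    private PubSubNames() {}

    /** Returns projects/<project>/topics/<topic>. */
    static String topicPath(String project, String topic) {
        return "projects/" + requireNonBlank(project, "project")
                + "/topics/" + requireNonBlank(topic, "topic");
    }

    /** Returns projects/<project>/subscriptions/<subscription>. */
    static String subscriptionPath(String project, String subscription) {
        return "projects/" + requireNonBlank(project, "project")
                + "/subscriptions/" + requireNonBlank(subscription, "subscription");
    }

    // Rejects null or blank input so a missing name fails early with a clear message.
    private static String requireNonBlank(String value, String label) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException(label + " must not be empty");
        }
        return value.trim();
    }

    public static void main(String[] args) {
        System.out.println(topicPath("my-project", "my-topic"));
        System.out.println(subscriptionPath("my-project", "my-subscription"));
    }
}
```

Replace the placeholders with your own project ID, topic, and subscription before passing the resulting strings to the pipeline.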
Step 6: Set up a streaming pipeline
Search for the BigQuery service and open it. Among the pinned projects you will see the project we just created. Click on the three dots next to it and select “create dataset,” then name it “streamingtest.”
Within streamingtest, click on the new dataset to create a table.
Name the table “pubsubtest” and add additional fields by clicking the + symbol in Schema options, then create the table.
Now we need to set some arguments in IntelliJ and add these parameters:
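Create an environment variable called “GOOGLE_APPLICATION_CREDENTIALS” and point it to the JSON key file of the service account. If you have not downloaded a key yet, open the service account, go to the “Keys” tab, and choose “Add Key,” then “Create new key” with the JSON type.
Now that the project is set up, we need to import the initial project that we downloaded in Step 1.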
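A missing or mistyped credentials path is a common reason for authentication errors when the pipeline starts, so it can help to check the variable before running anything. The following is a minimal sketch using only the JDK; the class name CredentialsCheck is hypothetical and the check only confirms that the file exists and is readable, not that the key is valid.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Sketch: verify that GOOGLE_APPLICATION_CREDENTIALS points at a readable file.
// This class is illustrative and not part of the tutorial project.
final class CredentialsCheck {
    static final String ENV_NAME = "GOOGLE_APPLICATION_CREDENTIALS";

    private CredentialsCheck() {}

    /** Returns a description of the problem, or an empty Optional if the path looks usable. */
    static Optional<String> problemWith(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Optional.of(ENV_NAME + " is not set");
        }
        Path key = Paths.get(value.trim());
        if (!Files.isRegularFile(key)) {
            return Optional.of("no file at " + key);
        }
        if (!Files.isReadable(key)) {
            return Optional.of("file is not readable: " + key);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> problem = problemWith(System.getenv(ENV_NAME));
        System.out.println(problem.map(p -> "Problem: " + p).orElse("Credentials file found"));
    }
}
```

When running from IntelliJ, remember that the variable has to be set in the run configuration, since the IDE does not necessarily inherit variables exported in a terminal.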
Once we import it, we will add some lines of code.
First, create an interface where we can add a few custom arguments:
Next, create a class called “saveToBigquery” which extends the DoFn class:
Here we are mapping the file content into a TableRow object to insert into BigQuery.
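A TableRow behaves like a map from column name to value, so the mapping step boils down to deciding which column each piece of text goes into. The sketch below shows that idea with a plain java.util.Map so it runs without any Google libraries. The column names message and ingested_at are assumptions for illustration; use whatever fields you added to the “pubsubtest” table.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn one line of text into a row keyed by column name.
// "message" and "ingested_at" are hypothetical columns, not the tutorial's schema.
final class RowMapper {
    private RowMapper() {}

    static Map<String, Object> toRow(String line, Instant now) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("message", line.trim());        // the text read from the file
        row.put("ingested_at", now.toString()); // ISO-8601 timestamp string
        return row;
    }
}
```

Inside the DoFn, the same assignments would be made on a TableRow, for example with its set(name, value) method, and the keys must match the table schema exactly.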
Finally, add the logic to create a pipeline and apply several transformations to our file.
In the main method, add the following code:
Here we create a pipeline that reads the files we place in the input folder, checks for new files every minute, publishes their content to the topic, and then reads it back from the subscription.
Finally, we map this information and save it to BigQuery.
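To make the "check for new files every minute" part more concrete, here is a plain-JDK sketch of the idea: remember which files have been seen and report only the ones that appeared since the last check. This is not how Beam implements it (Beam provides its own watch transforms, such as FileIO.match().continuously(...)), and the class name NewFilePoller is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: detect .txt files that appeared in a folder since the previous poll.
// Illustrative only; the Beam pipeline uses its own file-watching transforms.
final class NewFilePoller {
    private final Path folder;
    private final Set<Path> seen = new HashSet<>();

    NewFilePoller(Path folder) {
        this.folder = folder;
    }

    /** Returns the .txt files not reported by any earlier call. */
    List<Path> poll() throws IOException {
        List<Path> fresh = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder, "*.txt")) {
            for (Path file : stream) {
                if (seen.add(file)) { // add() returns false for files already seen
                    fresh.add(file);
                }
            }
        }
        return fresh;
    }
}
```

Calling poll() in a loop with a one-minute sleep between calls approximates the behavior described above: each file is picked up once, shortly after it lands in the folder.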
To make it work, just add any .txt file to the input folder in the root of the project, and its content should be saved to BigQuery.
And there you go!
In this quick tutorial, we learned how to create a streaming pipeline in Apache Beam using Pub/Sub. A similar pipeline can be built with Amazon Kinesis.
Additionally, you can use Google Cloud Storage or Cloud Functions to publish files to the topic and to consume messages from subscriptions.
Important! Do not forget to remove the topic, subscription, and project from the GCP console to avoid extra charges.
You can find the final project on my GitHub.