
creation of an ETL pipeline with GCP Dataflow and Apache Beam

8 min read

updated 24 Feb 2024

written by

Oleksandr Reshetnik

Software Engineering Manager


Processing large amounts of raw data from different sources requires appropriate tools and solutions for data integration. To minimize expenses and streamline their processes, many businesses prefer serverless tools and codeless solutions. This article shows how to use an open-source data processing platform to perform Extract, Transform, and Load (ETL) operations. Building an ETL pipeline with Apache Beam and running it on Google Cloud Dataflow is EPAM’s example of creating an Operational Data Store (ODS) solution.

Challenge 1: new complex data integration solution


A client contacted EPAM with a challenging task: to build a data integration solution for data analysis and reporting, with the possibility of integrating it with machine learning and AI-based applications. The solution had to gather data from different types of data sources (SQL, NoSQL, REST) and transform the data for reporting, discovery, and ML. The client wanted a solution that could easily integrate with multiple traditional and non-traditional data sources and adapt to any change, and was interested in an analytics tool like Tableau or Data Studio.

The client expected that the developed solution would include a reliable and flexible data pipeline architecture, an internal common data model for the abstraction of incoming data extracted from different data sources, as well as data dictionaries for reporting, data pipeline orchestration, and reporting in Tableau.

What is the ETL pipeline: ETL vs. data pipeline

A data pipeline is a process of moving data from a source to a destination for storage and analysis. Generally, a data pipeline doesn’t specify how the data is processed along the way. One feature of the data pipeline is that it may also filter data and ensure resistance to failure.

If that is a data pipeline, what is an ETL pipeline? It is also a data pipeline but the name – ETL data pipeline – identifies the sequence of steps that are applied to the data. In an ETL data pipeline, data is read (Extract), processed (Transform), and handed over to the target system (Load).

An Extract-Transform-Load pipeline: enables data migration from a source system to new storage; centralizes and standardizes multiple sources for a consolidated view; and provides a vast dataset for BI and analytics. Generally speaking, ETL is a sub-process, while “data pipeline” is a broader term that represents the entire process of transporting data.
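
To make the three stages concrete, here is a minimal plain-Java sketch of Extract, Transform, and Load. Everything in it is an assumption made for illustration: the class name EtlSketch, the in-memory CSV lines standing in for a source system, and the list standing in for a target table are not part of the project described in this article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Plain-Java sketch of Extract, Transform, Load; all names are illustrative. */
final class EtlSketch {

    /** Extract: read raw records from a source (here, hard-coded CSV lines). */
    static List<String> extract() {
        return List.of(
                "1, Alice ,alice@example.com",
                "2,Bob,",
                "3, carol ,CAROL@EXAMPLE.COM");
    }

    /** Transform: drop rows without an email, then trim and normalize the fields. */
    static List<String[]> transform(List<String> rawLines) {
        return rawLines.stream()
                .map(line -> line.split(",", -1)) // -1 keeps trailing empty fields
                .filter(f -> f.length == 3 && !f[2].trim().isEmpty())
                .map(f -> new String[] {
                        f[0].trim(), f[1].trim(), f[2].trim().toLowerCase(Locale.ROOT)})
                .collect(Collectors.toList());
    }

    /** Load: hand the cleaned rows to the target (a list standing in for a table). */
    static void load(List<String[]> rows, List<String> target) {
        for (String[] row : rows) {
            target.add(String.join("|", row));
        }
    }

    public static void main(String[] args) {
        List<String> target = new ArrayList<>();
        load(transform(extract()), target);
        target.forEach(System.out::println);
    }
}
```

The incomplete second row is rejected during the Transform step, so only two cleaned rows reach the target.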

EPAM solution

The EPAM team identified the first step as creating the ODS (Operational Data Store), a central repository of data from multiple systems with no specific data integration requirement. The ODS makes data available for business analysis and reporting by synthesizing data in its original format from various sources into a single destination.

In contrast to a data warehouse containing static data and performing queries, the ODS is an intermediary for a data warehouse. It provides a consolidated repository available for all systems, overwriting and changing data frequently, reporting on a more sophisticated level, and supporting BI tools.

In the search for a self-service integration platform, the EPAM team wanted a manageable solution for external teams and external sources that was not as costly as P2P integration.

First, we found a platform to run our solution. Then we decided on a solution. Finally, we put it all together.

What is Dataflow?

The Google Cloud Platform ecosystem provides a serverless data processing service, Dataflow, for executing batch and streaming data pipelines. As a fully managed, fast, and cost-effective data processing tool used with Apache Beam, Cloud Dataflow allows users to develop and execute a range of data processing patterns, including Extract-Transform-Load (ETL), batch, and streaming.

Data can be brought in from multiple data sources (CRM, Databases, REST API, file systems). A pipeline executed by Dataflow extracts and reads data, then transforms it, and finally writes it out, loading it into the cloud or on-premise storage.

Dataflow is a perfect solution for building data pipelines, monitoring their execution, and transforming and analyzing data, because it fully automates operational tasks like resource management and performance optimization for your pipeline. In Cloud Dataflow, all resources are provided on-demand and automatically scaled to meet requirements.

Dataflow works with both batch and streaming data. It is designed for big data and can process different chunks of the data in parallel. It autoscales resources, rebalances work dynamically, reduces the processing cost per data record, and provides ready-to-use real-time AI patterns. This scope of features enables Dataflow to execute complex pipelines using basic and custom transforms.

With Dataflow providing a range of opportunities, your results are limited only by your imagination and your team’s skills.

Why did EPAM choose Dataflow?

There are other data processing platforms in addition to Dataflow. Google’s Dataproc service is also an option, but the EPAM team liked that Dataflow is serverless and automatically adds workers if needed along the way. Most importantly, Google intended Apache Beam programs to run on Dataflow or a user’s own systems.

For standard data processing solutions, Google Cloud Dataflow provides quick-start templates. By eliminating the need to develop pipeline code, Cloud Dataflow templates facilitate pipeline building. Custom data processing solutions, however, require Apache Beam programming experience.

Google promotes Dataflow as one of the major components of a Big Data architecture on GCP. With the ability to extract data from open sources, this serverless solution is native to the Google Cloud Platform, enabling quick implementation and integration. Dataflow can also run custom ETL solutions since it has: building blocks for Operational Data Store and data warehousing; pipelines for data filtering and enrichment; pipelines to de-identify PII datasets; features to detect anomalies in financial transactions; and log exports to external systems.

What is Apache Beam?

Evolved from a number of Apache projects, Apache Beam emerged as a programming model for creating data processing pipelines.

The Apache Beam framework does the heavy lifting of large-scale distributed data processing. It handles details like sharding datasets and managing individual workers, allowing users to focus on what exactly they need to do rather than how they’re going to do it.

You can use Apache Beam to build your pipelines and then run them on distributed processing back ends (runners) like Dataflow, Apache Flink, Apache Nemo, Apache Samza, Apache Spark, Hazelcast Jet, and Twister2. You can build Apache Beam pipelines not only in Java but also in Python and Go.

Why did EPAM choose Apache Beam?

Apache Beam is effective and easy to use with Java. In contrast to Apache Spark, Apache Beam requires less configuration. It is also an open ecosystem, independent of any particular vendor, and run by a community.

Beam has four essential components:

The pipeline is a complete process consisting of steps that read, transform, and save data. It starts with the input (source or database table), involves transformations, and ends with the output (sink or database table). Transformation operations can include filtering, joining, aggregating, etc., and are applied to data to give it meaning and the form desired by the end user.
PCollection is a specialized container of nearly unlimited size representing a set of data in the pipeline. PCollections are the input and the output of every single transformation operation.
PTransform is a data processing step inside your pipeline. Whatever operation you choose – data format conversion, mathematical computation, data grouping, combining, or filtering – you specify it, and a transform performs it on each element of the PCollection. The P in the name stands for parallel, as transforms can operate in parallel across a large number of distributed workers.
The runner determines where the pipeline will operate.

Apache Beam Code Examples

Even though either Java or Python can be used, let’s take a look at the pipeline structure of Apache Beam code using Java.

Here’s an example created for the Dataflow runner. To develop and run a pipeline, you need to:

Create a PCollection
Apply a sequence of PTransforms
Run the pipeline

[Figure: Creating a PCollection]

Here’s an example created for the Apache Spark runner. It replicates the code but uses a different runner options fragment.

Now, let’s run this code for basic transformations. Here’s a WordCount example. The pipeline uses TextIO, which supports both ‘read’ and ‘write’ operations on text files.

The first transform is ‘TextIO.read’, and its output is a PCollection of strings, one for each line of text in the input file. The second transform splits each line into words, producing a PCollection<String> of individual words. The third transform filters out empty words. The fourth transform counts the number of times each word shows up, so each word is paired with its count. The fifth transform is a map transform (MapElements): it applies a function to each element in the input PCollection and produces one output element, here a string formatted from the word and its count, with TypeDescriptors.strings() declaring the output type. Finally, ‘TextIO.write’ writes the strings to an output text file.
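
The logic of these steps can be modeled in plain Java. The sketch below is not Beam code: it uses java.util.stream so that it runs without the Beam SDK, and the class name WordCountSketch is an assumption for illustration. The comments note which Beam transform each step stands in for. Unlike a Beam pipeline, whose output order is not guaranteed, the sketch sorts words alphabetically to keep the result deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Plain-Java model of the WordCount steps; illustrative only, not the Beam API. */
final class WordCountSketch {

    /** Takes the lines of a text file (step 1) and returns formatted word counts. */
    static List<String> countWords(List<String> lines) {
        Map<String, Long> counts = lines.stream()
                // Step 2: split every line into words (FlatMapElements in Beam).
                .flatMap(line -> Arrays.stream(line.split("[^\\p{L}]+")))
                // Step 3: filter out empty words (Filter in Beam).
                .filter(word -> !word.isEmpty())
                // Step 4: count occurrences per word (Count.perElement in Beam).
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));

        // Step 5: map each (word, count) pair to one output string (MapElements in Beam).
        return counts.entrySet().stream()
                .map(entry -> entry.getKey() + ": " + entry.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The returned strings stand in for the lines written by TextIO.write.
        countWords(List.of("to be or not to be", "")).forEach(System.out::println);
    }
}
```

For the two input lines in main, the sketch prints "be: 2", "not: 1", "or: 1", and "to: 2", one per line.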

This example is executed with standard transformations of the Beam framework.

Now let’s see how to run a pipeline with user-defined custom transforms. After the pipeline is created, a text file Read transform is applied. The CountWords transform is a custom transform that counts words. The map transform applies a new custom function, FormatAsTextFn, which formats each word count into a string. The strings then go into an output file.
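
The idea of a named, reusable formatting function can be sketched in plain Java. This is a model, not Beam code: the name FormatAsTextFn is borrowed from the text, while the wrapper class FormatSketch, the use of java.util.function.Function, and the "word: count" output format are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Plain-Java model of a custom formatting function; illustrative only. */
final class FormatSketch {

    /** Formats one (word, count) pair as a line of text. */
    static final class FormatAsTextFn implements Function<Map.Entry<String, Long>, String> {
        @Override
        public String apply(Map.Entry<String, Long> wordCount) {
            return wordCount.getKey() + ": " + wordCount.getValue();
        }
    }

    /** Applies the function to every element, producing one output per input. */
    static List<String> format(List<Map.Entry<String, Long>> counts) {
        return counts.stream()
                .map(new FormatAsTextFn())
                .collect(Collectors.toList());
    }
}
```

Keeping the function in its own class means it can be unit-tested on its own and reused in more than one pipeline.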

Now let’s examine how to create composite transforms. Here, we use a ParDo step (a transform for generic parallel processing) and a transform provided by the SDK to count words. The PTransform subclass CountWords consists of two transforms: a ParDo that extracts words and the SDK-provided Count.perElement transform.

A similar algorithm is used for extracting words: we split the line into words and filter out empty values. In this example, we specify an explicit DoFn. Its processing method receives exactly one element of the input PCollection at a time, along with a receiver to which it emits the output elements.
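
The shape of such a composite step can be modeled in plain Java. In the sketch below, a java.util.function.Consumer plays the role of the receiver, and the names CompositeSketch, extractWords, and countWords are assumptions for illustration; none of this is the Beam API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Plain-Java model of a composite transform; illustrative only. */
final class CompositeSketch {

    /** Models an explicit DoFn: one input element in, outputs pushed to a receiver. */
    static void extractWords(String line, Consumer<String> receiver) {
        for (String word : line.split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                receiver.accept(word); // zero, one, or many outputs per input element
            }
        }
    }

    /** Models the composite step: extract the words, then count each distinct word. */
    static Map<String, Long> countWords(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            extractWords(line, words::add); // first inner step (the ParDo)
        }
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum); // second inner step (the per-element count)
        }
        return counts;
    }
}
```

Callers see only countWords, which is the point of a composite transform: the two inner steps can change without affecting the rest of the pipeline.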

Challenge 2: unexpected context

The customer was happy with the solution that EPAM suggested based on Dataflow. A fully serverless data processing system, built on top of GCP and using Google’s Big Data architecture, was just what they needed to support multiple sources and to use with Machine Learning and AI.

After the solution was presented, however, the customer revealed an unexpected piece of information: the databases were extremely important and sensitive. So, for the sake of security, the customer couldn’t trust Google or use the bulk of the GCP services.

Our solution

In response, the EPAM team decided to use available software services and build a separate ETL solution for each data source. Moving forward, each new data source will be added by generating a new project from the common ETL archetype and adding the source-specific business logic code.
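
One way to picture the archetype approach is a shared skeleton that fixes the overall flow and leaves only the source-specific logic to each new project. The sketch below is a hypothetical illustration in plain Java, not the project’s actual archetype: the names EtlJob, SqlCustomersJob, and RestOrdersJob, the hard-coded sample data, and the pipe-delimited staging rows are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a common ETL skeleton with per-source business logic. */
final class ArchetypeSketch {

    /** Common part: the flow is the same for every data source. */
    abstract static class EtlJob<R> {
        protected abstract List<R> extract();        // source-specific
        protected abstract String transform(R raw);  // source-specific

        final void run(List<String> stagingArea) {
            for (R raw : extract()) {
                stagingArea.add(transform(raw));
            }
        }
    }

    /** Business logic for an SQL-like source that yields delimited rows. */
    static final class SqlCustomersJob extends EtlJob<String> {
        @Override protected List<String> extract() {
            return List.of("42;Ada");
        }
        @Override protected String transform(String row) {
            String[] fields = row.split(";");
            return "customer|" + fields[0] + "|" + fields[1];
        }
    }

    /** Business logic for a REST-like source that yields key/value documents. */
    static final class RestOrdersJob extends EtlJob<Map<String, String>> {
        @Override protected List<Map<String, String>> extract() {
            return List.of(Map.of("id", "7", "total", "19.90"));
        }
        @Override protected String transform(Map<String, String> doc) {
            return "order|" + doc.get("id") + "|" + doc.get("total");
        }
    }

    public static void main(String[] args) {
        List<String> stagingArea = new ArrayList<>();
        new SqlCustomersJob().run(stagingArea);
        new RestOrdersJob().run(stagingArea);
        stagingArea.forEach(System.out::println);
    }
}
```

Adding a third source in this model means writing one more subclass; the run method stays untouched.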

While designing the customer’s project pipeline, the EPAM team studied the customer’s data sources and decided to unify all data in the staging area before loading it to the data warehouse. The staging area acts as a buffer and protects data from corruption.

Basically, we arrived at a Lambda Architecture variation that allows users to process data from multiple sources in different processing modes.

EPAM’s solution is open to expansions and external integrations. Using a batch system and a stream processing system in parallel, EPAM’s solution has two independent stages, increasing system reliability and expanding possibilities for external systems.
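
As a rough illustration of the general Lambda Architecture idea (not of EPAM’s specific variation, whose details are not given here), the plain-Java sketch below keeps a batch view and a speed view separate and merges them only when a result is requested. The class name LambdaSketch and the event-counting use case are assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Generic Lambda Architecture sketch: batch view + speed view, merged on read. */
final class LambdaSketch {

    /** Counts events per source; used by both layers in this simplified model. */
    static Map<String, Long> count(List<String> events) {
        Map<String, Long> view = new HashMap<>();
        for (String source : events) {
            view.merge(source, 1L, Long::sum);
        }
        return view;
    }

    /** Serving step: combine the precomputed batch view with the recent speed view. */
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> speedView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        speedView.forEach((source, n) -> merged.merge(source, n, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        // Batch layer: recomputed periodically from the full history.
        Map<String, Long> batchView = count(List.of("sql", "rest", "sql"));
        // Speed layer: covers only events that arrived after the last batch run.
        Map<String, Long> speedView = count(List.of("rest", "nosql"));
        System.out.println(merge(batchView, speedView));
    }
}
```

Because the two views are computed independently, a failure in one layer does not block the other, which is the reliability benefit of running the stages in parallel.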

External systems can passively integrate through access to a data source and actively send data to the staging area, either to Google BigQuery or right to a reporting BI tool. Having the staging area in the project’s pipeline allows users to cleanse data and combine multiple data sources.
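
A staging area that cleanses and combines records might look roughly like the plain-Java sketch below. It is a hypothetical model: the StagedRecord type, its three fields, and the rule that a later record replaces an earlier one with the same key are assumptions made for illustration, not details of the customer’s pipeline.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical staging-area sketch: validate and combine before loading. */
final class StagingSketch {

    /** Assumed common record shape shared by all sources. */
    static final class StagedRecord {
        final String source;
        final String key;
        final String value;

        StagedRecord(String source, String key, String value) {
            this.source = source;
            this.key = key;
            this.value = value;
        }
    }

    /** Promotes staged records to the target, keyed by business key. */
    static Map<String, String> promote(List<StagedRecord> staging) {
        Map<String, String> warehouse = new LinkedHashMap<>();
        for (StagedRecord record : staging) {
            if (record.key == null || record.key.isEmpty()) {
                continue; // cleanse: malformed input never reaches the target
            }
            warehouse.put(record.key, record.value); // combine: later record wins
        }
        return warehouse;
    }
}
```

Here the staging list acts as the buffer: bad records are dropped and duplicates from different sources are resolved before anything is written to the target.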

EPAM integrators built a software architecture design ideal for data integration, and a GCP Dataflow framework to use as a starting point for ETL pipeline building. EPAM’s software architecture design allows using ready-made ETL solutions and codeless integrations. The customer and integrators will eventually increase the number of integrations, and extend the Apache Beam framework with new components. This will greatly facilitate subsequent integrations.

Summing up

Our conclusions from building an ETL pipeline for a customer can be summarized as follows:

Google Dataflow is a reliable, fast, and powerful data processing tool
Using a serverless approach significantly accelerates data processing software development
Apache Beam is a programming model for data processing pipelines with rich DSL and many customization options
Designing the ETL pipeline in the framework style enables users to create reusable solutions with self-service capabilities
Serverless and decoupled architecture is a cost-effective approach to accommodating your customer needs
