Processing large amounts of raw data from different sources requires the appropriate tools and solutions to effectively carry out data integration. To minimize expenses and streamline their processes, many businesses prefer to work with serverless tools and codeless solutions. This article will show you how to use an open-source data processing platform to perform Extract, Transform, and Load (ETL) operations. As an example, it walks through how EPAM created an ODS solution by building an ETL pipeline with Apache Beam and running it on Google Cloud Dataflow.
Challenge 1: a complex data integration solution supporting very different types of data sources
A client contacted EPAM with a challenging task: to build a data integration solution for data analysis and reporting, with the possibility of integrating it with machine learning and AI-based applications. The data integration solution had to gather data from different types of data sources (SQL, NoSQL, REST) and transform the data for reporting, discovery, and ML. Seeking a solution that could easily integrate with multiple traditional and non-traditional data sources and remain flexible in the face of change, the client was also interested in an analytics tool like Tableau or Data Studio.
The client expected that the developed solution would include a reliable and flexible data pipeline architecture, an internal common data model for the abstraction of incoming data extracted from different data sources, as well as data dictionaries for reporting, data pipeline orchestration, and reporting in Tableau.
What is the ETL pipeline: ETL vs. data pipeline
A data pipeline is a process for moving data from a source to a destination for storage and analysis. Generally, a data pipeline doesn’t specify how the data is processed along the way. A data pipeline may also filter data and provide resilience against failure.
If that is a data pipeline, what is an ETL pipeline? An ETL pipeline is also a data pipeline, but its name identifies the sequence of steps applied to the data: data is read (Extract), processed (Transform), and handed over to the target system (Load).
An Extract-Transform-Load pipeline: enables data migration from a source system to new storage; centralizes and standardizes multiple sources for a consolidated view; and provides a vast dataset for BI and analytics. Generally speaking, ETL is a sub-process, while “data pipeline” is a broader term that represents the entire process of transporting data.
The EPAM team identified the first step as creating the ODS (Operational Data Store); a central repository of data from multiple systems with no specific data integration requirement. The ODS makes data available for business analysis and reporting by synthesizing data in its original format from various sources into a single destination.
In contrast to a data warehouse, which contains static data and serves queries, the ODS is an intermediary for a data warehouse. It provides a consolidated repository available to all systems, is overwritten and updated frequently, supports reporting on a more sophisticated level, and feeds BI tools.
In the search for a self-service integration platform, the EPAM team wanted a solution that external teams and external sources could manage, and that was not as costly as point-to-point (P2P) integration.
First, we found a platform to run our solution. Then we decided on a solution. Finally, we put it all together.
What is Dataflow?
The Google Cloud Platform ecosystem provides a serverless data processing service, Dataflow, for executing batch and streaming data pipelines. As a fully managed, fast, and cost-effective data processing tool used with Apache Beam, Cloud Dataflow allows users to develop and execute a range of data processing patterns, including Extract-Transform-Load (ETL) as well as batch and streaming jobs.
Data can be brought in from multiple data sources (CRM systems, databases, REST APIs, file systems). A pipeline executed by Dataflow extracts and reads data, then transforms it, and finally writes it out, loading it into cloud or on-premises storage.
Dataflow is a perfect solution for building data pipelines, monitoring their execution, and transforming and analyzing data, because it fully automates operational tasks like resource management and performance optimization for your pipeline. In Cloud Dataflow, all resources are provided on-demand and automatically scaled to meet requirements.
Dataflow works for both batch and streaming data. It can process different chunks of the data in parallel, and it is designed to process big data. Dataflow autoscales resources, balances work dynamically, reduces the processing cost per data record, and provides ready-to-use real-time AI patterns. This scope of features enables Dataflow to execute complex pipelines using basic and custom transforms.
With Dataflow providing a range of opportunities, your results are limited only by your imagination and your team’s skills.
Why did EPAM choose Dataflow?
There are other data processing platforms in addition to Dataflow. Google’s Dataproc service is also an option, but the EPAM team liked that Dataflow is serverless and automatically scales clusters up as needed along the way. Most importantly, Google designed Apache Beam programs to run both on Dataflow and on a user’s own systems.
For standard data processing solutions, Google Cloud Dataflow provides quick-start templates. By eliminating the need to develop pipeline code, Cloud Dataflow templates facilitate pipeline building. Custom data processing solutions, however, require Apache Beam programming experience.
Google promotes Dataflow as one of the major components of a Big Data architecture on GCP. With the ability to extract data from open sources, this serverless solution is native to the Google Cloud Platform, enabling quick implementation and integration. Dataflow can also run custom ETL solutions since it has: building blocks for Operational Data Store and data warehousing; pipelines for data filtering and enrichment; pipelines to de-identify PII datasets; features to detect anomalies in financial transactions; and log exports to external systems.
What is Apache Beam?
Evolved from a number of Apache projects, Apache Beam emerged as a unified programming model for creating data processing pipelines.
The Apache Beam framework does the heavy lifting of large-scale distributed data processing. It handles details like sharding datasets and managing individual workers, allowing users to focus on what they need to do rather than how they are going to do it.
You can use Apache Beam to build your pipelines and then run them on runners such as Dataflow, Apache Flink, Apache Nemo, Apache Samza, Apache Spark, Hazelcast Jet, and Twister2. You can write Beam pipelines not only in Java but also in Python and Go.
Why did EPAM choose Apache Beam?
First, Apache Beam is very effective, and it is straightforward to use with Java. In contrast to Apache Spark, Apache Beam requires less configuration. Apache Beam is an open ecosystem, independent of any particular vendor, and run by a community.
Beam has four essential components:
- The pipeline is a complete process consisting of steps that read, transform, and save data. It starts with the input (source or database table), involves transformations, and ends with the output (sink or database table). Transformation operations can include filtering, joining, aggregating, etc., and are applied to data to give it meaning and the form desired by the end user.
- PCollection is a specialized container of nearly unlimited size representing a set of data in the pipeline. PCollections are the input and the output of every single transformation operation.
- PTransform is a data processing step inside your pipeline. Whatever operation you choose (data format conversion, mathematical computation, data grouping, combining, or filtering), you specify it, and the transform performs it on each element of the PCollection. The “P” in these names stands for parallel, as transforms can operate in parallel across a large number of distributed workers.
- The runner determines where the pipeline will operate.
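To make these four components concrete, here is a minimal sketch in Beam's Java SDK. The class name and element values are illustrative, not taken from the EPAM project; with no runner specified in the options, Beam falls back to its local DirectRunner.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamComponentsSketch {

  public static void main(String[] args) {
    // Runner: chosen via the pipeline options; with no runner specified,
    // Beam's local DirectRunner executes the pipeline.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // Pipeline: the complete process from input to output.
    Pipeline pipeline = Pipeline.create(options);

    // PCollection: the (potentially huge) dataset flowing between steps.
    PCollection<String> input = pipeline.apply(Create.of("alpha", "beta"));

    // PTransform: one processing step, applied to every element in parallel.
    PCollection<String> upper = input.apply(
        MapElements.into(TypeDescriptors.strings())
            .via((String s) -> s.toUpperCase()));

    pipeline.run().waitUntilFinish();
  }
}
```

Each `apply` call adds one PTransform to the graph; nothing runs until `pipeline.run()` hands the graph to the runner.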
Apache Beam Code Examples
Even though either Java or Python can be used, let’s take a look at the pipeline structure of Apache Beam code using Java.
Here’s an example created with Dataflow. To develop and run a pipeline, you need to:
- Create a PCollection
- Apply a sequence of PTransforms
- Run the pipeline
Here’s an example created with the Apache Spark runner; it replicates the same pipeline code but uses a different runner options fragment.
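A hedged fragment showing what changes when targeting Spark instead of Dataflow; the Spark master URL here is a local placeholder:

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SparkRunnerOptionsSketch {

  public static void main(String[] args) {
    // Only the runner options change; the pipeline graph itself is untouched.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("local[4]"); // placeholder: embedded Spark, 4 threads

    Pipeline pipeline = Pipeline.create(options);
    // ... apply the same read/transform/write steps as in the Dataflow example ...
    pipeline.run().waitUntilFinish();
  }
}
```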
Now, let’s run this code for basic transformations using a WordCount example. Supporting ‘read’ and ‘write’ operations, the pipeline reads from multiple text formats.
The first transform is ‘TextIO.read’, and its output is a PCollection of text lines from the input file. The second transform splits each line into individual words, producing a PCollection&lt;String&gt; of words. The third transform filters out empty words. The fourth transform counts the number of times each word shows up, so each word gets a count. The fifth and final transform is a map transform: it applies a function to each element of the input PCollection, producing one output element, and formats each word count into a string with MapElements and TypeDescriptors, creating an output text file.
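Sketched in Beam's Java SDK, the five transforms described above might look like the following. The file paths are placeholders, and transforms 2 through 5 are factored into a helper method so they can be reused independently of the file I/O:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class WordCountSketch {

  /** Transforms 2-5: split, filter, count, and format. */
  public static PCollection<String> countWords(PCollection<String> lines) {
    return lines
        // 2. Split each line of text into individual words.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via((String line) -> Arrays.asList(line.split("[^\\p{L}]+"))))
        // 3. Filter out empty words.
        .apply(Filter.by((String word) -> !word.isEmpty()))
        // 4. Count the occurrences of each word.
        .apply(Count.perElement())
        // 5. Format each word/count pair as a printable string.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()));
  }

  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // 1. Read lines of text from the input file (placeholder path).
    countWords(pipeline.apply(TextIO.read().from("input.txt")))
        .apply(TextIO.write().to("word_counts")); // placeholder output prefix
    pipeline.run().waitUntilFinish();
  }
}
```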
This example is executed with standard transformations of the Beam framework.
Now let’s see how to run a pipeline with user-defined custom transforms. After the pipeline is created, a text file Read transform is applied. The Count transform is a custom transform that counts words. The map transform applies a new custom function, FormatAsTextFn, and formats each word count into a string. The strings then go into an output file.
Now let’s examine how to create composite transforms. Here, we use a ParDo step (a transform for generic parallel processing) together with a transform provided by the SDK to count words. The PTransform subclass CountWords is composed of two transforms: a ParDo that extracts words and the SDK-provided Count.perElement transform.
A similar algorithm is used for extracting words: we split the line into words and filter out empty values. In this example, we specify an explicit DoFn. This function receives exactly one element from the input PCollection at a time and emits its output through an output receiver.
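A hedged sketch of that composite transform, modeled on Beam's canonical WordCount example (the class names follow the description above):

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class CompositeTransformSketch {

  /** A DoFn that receives one line at a time and emits its non-empty words. */
  public static class ExtractWordsFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> receiver) {
      for (String word : line.split("[^\\p{L}]+")) {
        if (!word.isEmpty()) {
          receiver.output(word); // emit each word to the output PCollection
        }
      }
    }
  }

  /** Formats each word/count pair into a printable string. */
  public static class FormatAsTextFn extends SimpleFunction<KV<String, Long>, String> {
    @Override
    public String apply(KV<String, Long> wordCount) {
      return wordCount.getKey() + ": " + wordCount.getValue();
    }
  }

  /** A composite PTransform: a ParDo extracts words, then the SDK's Count.perElement counts them. */
  public static class CountWords
      extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
    @Override
    public PCollection<KV<String, Long>> expand(PCollection<String> lines) {
      return lines
          .apply(ParDo.of(new ExtractWordsFn()))
          .apply(Count.perElement());
    }
  }
}
```

Packaging the two steps inside one PTransform keeps the pipeline graph readable and makes CountWords reusable across pipelines.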
Challenge 2: unexpected context
The customer was happy with the solution that EPAM suggested based on Dataflow. A fully serverless data processing system, built on top of GCP and using Google’s Big Data architecture, was just what they needed to support multiple sources and to use with Machine Learning and AI.
After the solution was presented, however, the customer revealed an unexpected piece of information: the databases were extremely important and sensitive. So, for the sake of security, the customer couldn’t trust Google or use the bulk of the GCP services.
In response, the EPAM team decided to use available software services and build a separate ETL solution for each data source. Moving forward, each new data source will be added by generating a new project from the common ETL archetype and using the business logic code.
While designing the customer’s project pipeline, the EPAM team studied the customer’s data sources and decided to unify all data in the staging area before loading it to the data warehouse. The staging area acts as a buffer and protects data from corruption.
Basically, we arrived at a Lambda Architecture variation that allows users to process data from multiple sources in different processing modes.
EPAM’s solution is open to expansions and external integrations. Using a batch system and a stream processing system in parallel, EPAM’s solution has two independent stages, increasing system reliability and expanding possibilities for external systems.
External systems can passively integrate through access to a data source and actively send data to the staging area, either to Google BigQuery or right to a reporting BI tool. Having the staging area in the project’s pipeline allows users to cleanse data and combine multiple data sources.
EPAM integrators built a software architecture design ideal for data integration, along with a GCP Dataflow framework to use as a starting point for ETL pipeline building. EPAM’s software architecture design allows the use of ready-made ETL solutions and codeless integrations. The customer and integrators will eventually increase the number of integrations and extend the Apache Beam framework with new components, which will greatly facilitate subsequent integrations.
Our conclusions from building an ETL pipeline for a customer can be summarized as follows:
- Google Dataflow is a reliable, fast, and powerful data processing tool
- Using a serverless approach significantly accelerates data processing software development
- Apache Beam is a programming model for data processing pipelines with rich DSL and many customization options
- Designing the ETL pipeline in the framework style enables users to create reusable solutions with self-service capabilities
- Serverless and decoupled architecture is a cost-effective approach to accommodating customer needs