Snowflake Inc. disrupted the world of data analytics with its eponymous platform. Founded in 2012, the company has since rolled out numerous updates and new functionality, turning Snowflake into a robust and secure data platform.
One of EPAM's clients reached out to us with a request to develop a complex analytics platform for its end clients. Its key features included self-service data discovery, dashboards, and real-time data connections.
Not all of the source systems we needed to integrate with allowed us to extract data in real time, so our task was to design a solution that combined batch and streaming data pipelines. On top of that, our core technologies had to be Snowflake and Tableau.
Snowflake is the key technology of the analytical system, but on its own it was not enough; we needed extra tools to build a full-fledged platform.
Another requirement was to use the AWS cloud as the hosting platform. Considering all of these conditions, we came up with a solution similar to the Lambda Architecture.
At a high level, the architecture consists of the following layers:
Now, let's walk through the implementation of the end-to-end data pipeline.
Here's the step-by-step data workflow:
As part of the speed layer, the data goes through the following steps:
The serving layer is implemented as a set of Snowflake views that combine information from the data marts (prepared by the batch dataflow) with the Snowflake external tables (Live View on the diagram). As a result, up-to-date data is ready for consumption from the Tableau server through customized dashboards and self-service data discovery. Using Live Connection mode, the Tableau server queries Snowflake directly.
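To illustrate the idea, here is a minimal sketch of what such a serving-layer view could look like. The schema, table, and column names (serving.orders_v, analytics.orders_mart, live.orders_ext, order_id, and so on) are hypothetical stand-ins, not the actual objects from the project; the point is simply to union a batch data mart with the external table that backs the Live View.

```sql
-- Hypothetical serving-layer view: batch data mart + external table (speed layer)
CREATE OR REPLACE VIEW serving.orders_v AS
SELECT order_id, customer_id, amount, event_ts
FROM analytics.orders_mart                            -- data mart built by the batch pipeline
UNION ALL
SELECT
    value:order_id::NUMBER         AS order_id,       -- an external table exposes each row
    value:customer_id::NUMBER      AS customer_id,    -- through a single VARIANT column
    value:amount::NUMBER(10, 2)    AS amount,         -- named VALUE
    value:event_ts::TIMESTAMP_NTZ  AS event_ts
FROM live.orders_ext                                  -- external table over the S3 bucket
WHERE value:event_ts::TIMESTAMP_NTZ >
      (SELECT COALESCE(MAX(event_ts), '1970-01-01'::TIMESTAMP_NTZ)
       FROM analytics.orders_mart);
```

Tableau's Live Connection then points at a view like this one, so every dashboard query sees both the historical data and the freshly landed events in a single result set.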
If you have experience with Snowflake, you might wonder why we didn't use Snowpipe to implement continuous data loading into the database. Snowpipe loads data in micro-batches from files as soon as they are available in a stage and makes them queryable within minutes. Our requirement was to shrink this time frame to seconds.
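For comparison, a continuous-loading pipe could be defined roughly like this; the stage, pipe, and table names are made up for illustration, and this is not something we used in the final solution.

```sql
-- Illustrative Snowpipe definition (not used in our solution):
-- auto-ingest new JSON files from an external S3 stage into a raw table.
CREATE OR REPLACE PIPE raw.events_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw.events
  FROM @raw.events_stage/live/
  FILE_FORMAT = (TYPE = 'JSON');
```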
Also, with Snowpipe's serverless compute model, Snowflake manages the load capacity for you, ensuring optimal resources to meet demand, but you pay for that compute. With external tables, we do not load data into Snowflake at all (it stays in an S3 bucket), so we do not pay for data loading.
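Here is roughly what such an external table could look like, again with hypothetical object names and an assumed Parquet file format; the actual stage and file layout depend on the project.

```sql
-- Hypothetical external table over the S3 bucket: the data stays in S3,
-- Snowflake only keeps metadata about the files it can see in the stage.
CREATE OR REPLACE EXTERNAL TABLE live.orders_ext
  WITH LOCATION = @live.events_stage/orders/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;          -- metadata is refreshed explicitly, see below
```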
Keep in mind that as files with new events land in S3, the external tables must be refreshed (ALTER EXTERNAL TABLE … REFRESH).
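A minimal example of that refresh for the hypothetical table above is shown below; alternatively, an external table can be created with AUTO_REFRESH = TRUE and S3 event notifications so that Snowflake registers new files automatically.

```sql
-- Register newly added S3 files in the external table's metadata.
ALTER EXTERNAL TABLE live.orders_ext REFRESH;

-- Or limit the refresh to a specific sub-path to keep it cheap:
ALTER EXTERNAL TABLE live.orders_ext REFRESH 'orders/2020/06/';
```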
As organizations become more data-driven, business requirements grow more complicated.
For instance, lately we have seen a trend toward near real-time processing. This demand pushes software engineers to look for cutting-edge technologies like Snowflake. In this post, we shared our successful experience of implementing near real-time analytics with Snowflake.