Data Strategy: Data Lake Solutions

(Looking for help in determining the best data lake solution for your organization? Take a look at our big data consulting services and contact us for more information)

In this article we will be exploring three major cloud vendors – Amazon Web Services, Azure, and Google Cloud Platform – and their distinct data lake/big data tools.

In the previous article we explored a few of the tools that help to build and use data lakes. Almost all of these toolsets are available as hosted versions on each of the major cloud platforms. This allows you to spin up, say, Hadoop or Elastic and evaluate it before you commit to building a full implementation. In addition to the tools listed previously, each cloud provider has its own native flavors of data lake tools to work with. In some cases these native versions may offer performance gains, as they are more deeply integrated into the cloud infrastructure.

Amazon Web Services (AWS) For Data Lakes

AWS Lake Formation provides a wizard-style interface over various pieces of the AWS ecosystem that allows you to easily build a data lake. The primary backend storage of an AWS data lake is Amazon S3. S3 is highly scalable and available, and can be made redundant across a number of availability zones. For less frequently used data and archives, Amazon S3 Glacier gives you the option to store data in archives that have much longer retrieval times but cost a fraction of standard S3 storage. This allows you to contain your storage costs by matching each dataset's storage tier to how quickly it needs to be available.
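As a rough illustration of dividing storage by availability needs, the sketch below builds an S3 lifecycle configuration that moves objects under a prefix to the Glacier storage class after a number of days. The function name, the `raw/logs/` prefix, and the 90-day threshold are assumptions made for this example, not recommendations. The resulting dictionary has the shape that could be passed to boto3's `put_bucket_lifecycle_configuration` call, which is not invoked here.

```python
import json


def glacier_lifecycle_rule(prefix: str, days_until_archive: int) -> dict:
    """Build a lifecycle configuration that archives objects under `prefix`.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    if days_until_archive < 0:
        raise ValueError("days_until_archive must be non-negative")
    # Derive a readable rule ID from the prefix, e.g. "raw/logs/" -> "raw-logs".
    rule_name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": f"archive-{rule_name}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # After the given number of days, move objects to Glacier.
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


# Example values (assumed): archive raw logs after 90 days.
config = glacier_lifecycle_rule("raw/logs/", 90)
print(json.dumps(config, indent=2))
```

Keeping the rule as plain data makes it easy to review or version before it is applied to a bucket.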

AWS Glue is a set of tools that allow you to perform ETL procedures on data. You can use it as a sort of data mapping tool that can be pointed at other AWS data stores to define relationships. Once you have defined these relationships, you can schedule tasks that query, copy, and move data from one place to another. These tools are helpful when you are dealing with a large number of data assets in the AWS ecosystem.

AWS Athena is a clever tool which allows you to bypass the step of creating a traditional relational data warehouse or Hadoop cluster by leveraging data already residing in S3 storage. It allows you to create virtual table definitions on top of your data and then run standard SQL queries to obtain results from those definitions.
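To make the idea of a virtual table definition concrete, here is a minimal sketch that generates the kind of `CREATE EXTERNAL TABLE` statement Athena accepts, followed by an ordinary SQL query against it. The database, table, column names, and bucket path are all hypothetical, and the sketch assumes the files are stored as Parquet; it only builds the SQL text and does not submit anything to Athena.

```python
def external_table_ddl(database, table, columns, s3_location):
    """Return DDL for an external table over files already sitting in S3.

    Hypothetical helper for illustration; assumes simple identifiers
    and Parquet-formatted data.
    """
    if not s3_location.startswith("s3://"):
        raise ValueError("s3_location must be an s3:// URI")
    # The location is treated as a folder, so make sure it ends with a slash.
    if not s3_location.endswith("/"):
        s3_location += "/"
    column_list = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_list}\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


# Assumed example schema: web request logs.
ddl = external_table_ddl(
    "weblogs",
    "requests",
    [("request_time", "timestamp"), ("status", "int"), ("path", "string")],
    "s3://example-data-lake/requests",
)

# Once the table is defined, it can be queried with standard SQL.
query = (
    "SELECT status, COUNT(*) AS hits\n"
    "FROM weblogs.requests\n"
    "GROUP BY status\n"
    "ORDER BY hits DESC"
)

print(ddl)
print(query)
```

The data itself never moves: the statement only records a schema and a location, which is why no separate warehouse load step is needed.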

AWS Redshift is AWS's high-performance data warehouse, which allows for querying and analysis at exabyte scale.

Microsoft Azure For Data Lakes

Azure Data Lake is the competitor to AWS Lake Formation. As with AWS, Azure Data Lake is centered around its storage capacity, with Azure Blob Storage being the equivalent of Amazon S3 storage. Azure Data Lakes rely heavily on the Hadoop architecture.

HDInsight is Microsoft’s management layer above Hadoop and Spark. You can use HDInsight to easily spin up Hadoop or Spark clusters and manage workloads.

Azure Data Lake Analytics allows you to create, schedule, and process big data querying jobs. Azure handles the creation of compute resources necessary to perform the analytics so you can focus on the query tasks themselves rather than the associated resources and infrastructure.

U-SQL is a Microsoft-developed language for big data that combines data mapping and querying facilities. It provides language expressions for ETL operations, schema creation, and querying.

Power BI allows you to create presentation-layer dashboards on top of existing Azure data stores, turning the data in those stores into actionable information.

Google Cloud Data Lake (Google Cloud Platform)

Google Cloud Storage is the backend storage mechanism driving data lakes built on Google Cloud Platform. As with other cloud vendors, Google Cloud Storage is divided into tiers (standard/nearline/coldline) by availability and access time (with less accessible storage being much cheaper).
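The trade-off between access frequency and cost can be sketched as a simple rule of thumb. The function below suggests a storage class from how often data is expected to be read. The thresholds (roughly monthly for nearline, roughly quarterly for coldline) are illustrative assumptions for this example; real choices should also account for retrieval fees and minimum storage durations.

```python
def suggest_storage_class(accesses_per_year: float) -> str:
    """Suggest a Google Cloud Storage class from expected read frequency.

    Hypothetical helper: thresholds are a rough rule of thumb, not pricing advice.
    """
    if accesses_per_year < 0:
        raise ValueError("accesses_per_year must be non-negative")
    if accesses_per_year > 12:
        # Read more often than about once a month: keep it readily available.
        return "STANDARD"
    if accesses_per_year > 4:
        # Read about monthly or less, but more than quarterly.
        return "NEARLINE"
    # Read about once a quarter or less.
    return "COLDLINE"


# Assumed example workloads and their expected reads per year.
for label, reads in [("dashboards", 365), ("monthly reports", 12), ("audit archive", 1)]:
    print(f"{label}: {suggest_storage_class(reads)}")
```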

Google Dataproc manages the creation of computing and storage clusters for Hadoop and Spark.

BI Engine allows for the creation of dashboards and visualizations on top of Google Cloud Platform data lakes.

BigQuery is the GCP engine which allows for data mapping to resources, querying, and automatic provisioning of computing resources on a query-by-query basis. It also offers the BigQuery Data Transfer Service to move data from many other Google platforms (Google Ads, YouTube, etc.) into BigQuery.

BigQuery ML is Google's tool for creating machine learning models based on BigQuery data. Google has long been a leader in machine learning/AI research, so expect this component to become a centerpiece of Google's data offering.
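BigQuery ML models are defined in SQL, so training a model looks much like writing a query. The sketch below assembles a `CREATE OR REPLACE MODEL` statement as text. The dataset, model, table, and column names are hypothetical, and logistic regression is just one example model type; the code does not connect to BigQuery.

```python
def create_model_sql(model, model_type, label_column, training_query):
    """Return a BigQuery ML statement that trains `model` on a query's results.

    Hypothetical helper for illustration; all names passed in are assumed.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"{training_query}"
    )


# Assumed example: predict customer churn from two usage features.
sql = create_model_sql(
    model="analytics.churn_model",
    model_type="logistic_reg",
    label_column="churned",
    training_query=(
        "SELECT monthly_visits, support_tickets, churned\n"
        "FROM `analytics.customer_features`"
    ),
)
print(sql)
```

Because the training data is just the result of a `SELECT`, the same tables that feed dashboards can feed models without an export step.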

Summary

| | AWS | Azure | Google Cloud Platform |
|---|---|---|---|
| Storage | Amazon S3 | Azure Blob Storage | Google Cloud Storage |
| Compute | AWS EC2 Instances | Azure Virtual Machines | Compute Engine |
| Queries at scale | AWS Redshift | U-SQL | BigQuery |
| Data Visualization | Amazon Quicksight | Power BI | Looker Studio |
| Machine Learning/AI | Amazon Sagemaker | Azure AI Studio | Vertex AI |

Which data lake is right for you?

Choosing the right data lake for your organization can be difficult, and there is no clear winner from the data lake perspective. However, some factors might influence your decision to build a data lake in one environment rather than another. Here are the factors that might tip you toward one particular solution:

Reasons to build your data lake on AWS:

Your organization is heavily invested in the AWS ecosystem via other products such as EC2, S3 storage, Route 53 network services, etc.
You want an environment that has stood the test of time and offers very compelling support.
You aren't heavily tied to a particular ecosystem such as Microsoft or Oracle.
You want an environment where there are many talented people available on the market.

Reasons to build your data lake on Azure:

Your organization is heavily invested in the Microsoft ecosystem (using SQL Server, Power BI, Office Online).
Your technical staff tend to specialize in Microsoft solutions.
Your organization might be thinking about integrating with OpenAI products in the future.

Reasons to build your data lake on Google Cloud Platform:

Your organization tends toward open source solutions, is not tied closely to any particular vendor, and wants to stay platform independent.
You are already using BigQuery or Looker Studio in your organization.
Your data lake will be relying on a lot of data from Google Analytics or Google Ads.

Additional Data Lake Resources

What is a data lake?

What is a data lake and why does it matter?

Why build a data lake?

Do you really need a data lake?

Do you need a data lake if you already have a data warehouse?

Conclusion

In this series we learned the rationale behind building a data lake. We then explored how to build a data lake and the common tools used with data lakes. Finally, in this article we explored the cloud data lake options.
