Building a data lake involves several distinct steps. In this article we will cover the basic process of building a scalable data lake.
Before you start building a data lake, you have to decide where it will actually reside. This usually entails a choice between on-premises storage and cloud storage. Whatever the choice, your data store needs to be scalable, available, performant, secure, and redundant.
Once you have decided where your data lake will reside, it is time to begin “loading” data into it. We prefer the term “ingest” to “load” because “loading” often carries connotations of the Extract/Transform/Load (ETL) process familiar from data warehousing. Loading a data warehouse involves a great deal of data preparation, whereas a data lake ingests data in its raw form.
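As a minimal sketch of the raw-ingest idea (the function name, landing-zone layout, and sample records here are made up for illustration), a batch ingest job might simply append records to a date-partitioned landing area without imposing any schema:

```python
import json
import os
import tempfile
from datetime import date

def ingest_raw(records, landing_root):
    """Append records to a date-partitioned landing zone, untransformed."""
    partition = os.path.join(landing_root, f"dt={date.today().isoformat()}")
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "events.jsonl")
    with open(path, "a") as out:
        for record in records:
            # No schema, no cleansing: each record is stored as it arrived.
            out.write(json.dumps(record) + "\n")
    return path

# Usage: heterogeneous events land as-is; any cleaning happens later, at read time.
path = ingest_raw([{"user": "a", "clicks": 3}, {"raw_payload": "<xml/>"}],
                  landing_root=tempfile.mkdtemp())
```

Contrast this with a warehouse load, where records would be validated and reshaped to a fixed schema before they were written.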
This data ingestion can take many forms.
Once the data is in the data lake, it is often transformed into other formats and loaded into tools suited to particular uses of the data. From there it can be operated upon for further insight, and there are several common use cases for this kind of data manipulation.
There are a number of tools available to you as you build your data lake. These tools facilitate the various steps of building the data lake, from storage and ingestion to analysis and search. We will explore several of these toolsets here.
Hadoop is an open-source framework for handling big data. At its core is a distributed file system: you feed data in through one point of entry, and Hadoop spreads the storage of that data across many computing nodes. This provides several advantages. First, it ensures data redundancy: a single file might be replicated across dozens of nodes in a Hadoop cluster, so your data lake remains fully functional even if several nodes go down. This protects the availability of the data lake in the event of hardware or software problems.
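The replication idea can be illustrated with a toy block-placement routine (a deliberate simplification; real HDFS placement is also rack-aware and tracks node health):

```python
from itertools import cycle

def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    With copies on several nodes, any single node can fail and every
    block remains readable from its surviving replicas.
    """
    placement = {}
    ring = cycle(range(len(nodes)))
    for block in blocks:
        start = next(ring)
        placement[block] = [nodes[(start + i) % len(nodes)]
                            for i in range(replication)]
    return placement

placement = place_blocks(["blk_1", "blk_2", "blk_3"],
                         ["node-a", "node-b", "node-c", "node-d"])
# Every block lives on 3 of the 4 nodes, so losing one node loses no data.
```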
Hadoop’s distributed nature also powers its ability to perform fast querying over large-scale data sets. Its core querying technology, MapReduce, lets a user define instructions for operations on the data; Hadoop then coordinates the execution of those instructions across its many data nodes. Because the data itself is distributed, queries in practice run on many nodes simultaneously. This “divide and conquer” approach allows fast querying over enormous datasets: a search or calculation that might take a very long time on one computing node can complete in a fraction of the time when spread over many.
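The MapReduce pattern itself is simple enough to sketch in plain Python (this is an illustration of the idea, not the Hadoop API): map each record to key/value pairs, shuffle the pairs by key, then reduce each group — it is the map and reduce phases that Hadoop fans out across nodes.

```python
from collections import defaultdict
from functools import reduce

def map_phase(lines):
    # Map: emit (word, 1) for every word, as each node would do locally.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a final result.
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big lake"])))
# counts == {"big": 2, "data": 1, "lake": 1}
```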
Hadoop contains some querying facilities, but Spark is focused almost exclusively on performing intensive queries over large data sets. It can connect to large data sets wherever they live and perform parallel operations on them, with APIs in languages such as Python, Java, and R, as well as support for SQL.
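Spark’s style of chained parallel operations can be sketched with a toy RDD-like class (a single-process illustration of the style, not the PySpark API; in Spark each transformation would execute in parallel across the cluster):

```python
from functools import reduce

class ToyRDD:
    """A minimal, single-process stand-in for Spark's chained operations."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        return reduce(fn, self.data)

# Chained transformations, as in Spark: square the even numbers and sum them.
total = (ToyRDD(range(10))
         .filter(lambda x: x % 2 == 0)
         .map(lambda x: x * x)
         .reduce(lambda a, b: a + b))
# total == 0 + 4 + 16 + 36 + 64 == 120
```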
Elastic is optimized to work with JSON documents and allows fast, easy management of documents within its document store. Fast document search is the primary reason Elastic is deployed as part of a data lake. Like Hadoop, Elastic can be configured to spread its data over many computing/storage nodes, but it presents a unified interface and API through which data can easily be added, removed, updated, and searched, all expressed as intuitive JSON.
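For example, a full-text search is itself expressed as a JSON document using Elasticsearch’s Query DSL; here is the body of a `match` query (the index name, field name, and search text are hypothetical):

```python
import json

# Body of a search request, e.g. POST /articles/_search
query = {
    "query": {
        "match": {
            "body": "data lake ingestion"  # full-text match on the 'body' field
        }
    },
    "size": 10,  # return at most 10 hits
}

request_body = json.dumps(query)
```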
Elastic offers Beats and Logstash as easy ways to move data into the Elasticsearch engine. These tools function as easy-to-configure services and monitors that run on your servers, collect high-frequency data such as server logs and network traffic, and ship it to Elasticsearch.
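Under the hood, shippers batch events into Elasticsearch’s bulk format: newline-delimited JSON that alternates an action line with the document itself. A sketch of building such a payload (the index name and log events are made up):

```python
import json

def to_bulk_payload(events, index="server-logs"):
    """Render events as Elasticsearch _bulk NDJSON: action line, then document."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action metadata
        lines.append(json.dumps(event))                          # the document
    return "\n".join(lines) + "\n"  # _bulk payloads end with a newline

payload = to_bulk_payload([
    {"host": "web-1", "status": 200, "path": "/"},
    {"host": "web-1", "status": 500, "path": "/login"},
])
```

Batching many events into one bulk request is what lets these shippers keep up with high-frequency sources like access logs.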
Kibana is the data-visualization arm of the Elastic family. It lets an Elastic user build dashboards based on queries run against the backend Elasticsearch store. Kibana is frequently used for server monitoring and for tracking web hits across various properties, but it can be configured to provide dashboards of all sorts of data.
TensorFlow and PyTorch are machine learning frameworks that allow data scientists to build machine learning models with ease.
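Both frameworks automate the same core training loop: compute a loss, follow its gradient, update the parameters. As a framework-free illustration of that loop, here is gradient descent fitting a one-parameter linear model in plain Python (the toy data and learning rate are made up; TensorFlow and PyTorch compute the gradient for you automatically):

```python
# Fit y = w * x to toy data by gradient descent on mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the true model y = 2x

w = 0.0              # initial guess for the weight
learning_rate = 0.01

for step in range(200):
    # Gradient of mean squared error with respect to w: mean of 2*(w*x - y)*x.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad  # the update step the frameworks automate

# w has converged close to the true weight of 2.0
```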