Many organizational leaders are familiar with the idea of a data warehouse but not a data lake, and wonder what purpose a data lake might serve. In this article we explore how a data lake differs from a data warehouse, and why a data lake might be a good addition to an organization’s data strategy.
It must first be said that data lakes and data warehouses are not mutually exclusive. In fact, the whole concept of a data lake arose to SUPPLEMENT, not SUPPLANT, traditional data warehouses. A traditional data warehouse is engineered to solve a particular set of problems, and while it is good at some things, it is not great at other use cases where a data lake excels. We explore below some of the tradeoffs of data storage and how data lakes and data warehouses differ in their approach. A data lake is just another tool for organizational data strategy.
The first major difference between data warehouses and data lakes is that data warehouses are usually optimized for READ operations and analytics whereas data lakes are optimized for WRITE operations and quick storage. Data warehouses that are performant for analytics operations are often heavily tuned so that important data can be accessed in a particular manner conducive to fast reporting. This involves creating and maintaining indexes on many fields and, in the case of magnetic storage, physically optimizing the location of data on disk to ensure that the data can be accessed quickly.
The optimization for read access has some advantages – when most of the workload on the data involves generating well-defined reports, data warehouses can be tuned precisely to make those specific reports as fast as possible. However, this optimization for quick read access has some tradeoffs. The biggest tradeoff is that when data is ADDED to the data warehouse some process must optimize the storage of the new data into the existing indexes and layout of the data warehouse. These operations have some performance cost, particularly when a data warehouse grows larger and larger.
In contrast, data lakes should be optimized for high frequency WRITE operations. A data lake does not necessarily care how the data is structured as long as it can be saved quickly to permanent storage. This means that a data lake is the best place for storing high frequency data such as real-time server activity logs, IoT information coming in from millions of devices every few seconds, or geolocation data of assets on the move. Of course, this optimization for writing comes at the expense of slower read operations – querying raw data lake data often takes much longer as the data is usually not indexed for fast access.
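The write-first pattern can be sketched in a few lines. This is a minimal illustration, not a production ingest pipeline: the function name `write_event` and the `lake/raw/events` directory layout are invented for the example, and the date-partitioned newline-delimited JSON layout is just one common convention. The point is what the code does not do — no schema validation, no indexing, just a fast append to durable storage.

```python
import json
import os
from datetime import datetime, timezone

def write_event(event: dict, lake_root: str = "lake/raw/events") -> str:
    """Append one raw event as a line of JSON (NDJSON) to a
    date-partitioned file -- no schema checks, no indexing."""
    now = datetime.now(timezone.utc)
    partition = os.path.join(lake_root, now.strftime("dt=%Y-%m-%d"))
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, "events.ndjson")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return path

path = write_event({"device_id": "sensor-42", "temp_c": 21.7})
```

In a real data lake the append would target object storage (S3, Azure Blob Storage, Google Cloud Storage) rather than a local disk, but the tradeoff is the same: the write is cheap precisely because nothing is done to make later reads fast.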
A common data workflow is for data to come into a data lake in real time, and then at spaced intervals for batch processes to copy or move some of that information into the data warehouse for reporting.
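A toy version of that batch step might look like the following sketch, which uses an in-memory SQLite database as a stand-in for the warehouse and an invented event format; the names `events.ndjson`, `batch_load`, and `temperature_readings` are all illustrative. Note that the batch job extracts only the fields the reports need and discards the rest — structure is imposed on the way into the warehouse, not on the way into the lake.

```python
import json
import sqlite3

# A few raw events as they might sit in the lake (illustrative schema).
with open("events.ndjson", "w") as f:
    for e in [{"device_id": "sensor-42", "temp_c": 21.7, "extra": "ignored"},
              {"device_id": "sensor-7", "temp_c": 19.2}]:
        f.write(json.dumps(e) + "\n")

def batch_load(ndjson_path: str, conn: sqlite3.Connection) -> int:
    """Periodic batch job: parse the raw lake data and load only the
    fields the reports need into a structured warehouse table."""
    conn.execute("""CREATE TABLE IF NOT EXISTS temperature_readings (
                        device_id TEXT NOT NULL,
                        temp_c REAL NOT NULL)""")
    with open(ndjson_path) as f:
        rows = [(e["device_id"], e["temp_c"])
                for e in (json.loads(line) for line in f)]
    conn.executemany("INSERT INTO temperature_readings VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = batch_load("events.ndjson", conn)
```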
Traditional data warehouses have been optimized around the storage of well-defined, well-structured data. In most cases data is defined in the form of tables and columns, and each column has a strict definition of the kind of data that can be stored in that column, frequently along with size constraints on the data to be stored in that column.
These constraints have a number of advantages. They allow bad data to be detected before it even enters the data warehouse: if a field must be a number and a string is entered, the data is rejected before it ever lands in the data warehouse. The constraints also allow for the read optimization discussed above. Since the data warehouse knows the type of data it is expecting beforehand, it can allocate exactly the right amount of space for that data. This makes a data warehouse very efficient for dealing with data of certain known datatypes – fixed-length text, integers, floating point numbers, dates. However, it also means that data warehouses are not particularly good at storing unstructured, variable-length binary content (commonly referred to as Binary Large Object, or “BLOB”, content). This BLOB content – media files, binary document formats, save files from programs, extremely long text – is best located outside of a data warehouse.
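The reject-at-the-door behavior is easy to demonstrate. The sketch below uses SQLite with an invented `orders` table; because SQLite is loosely typed by default, `CHECK` clauses stand in for the strict typing that a warehouse engine enforces natively, but the effect is the same: the bad row never lands in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Type and size constraints are declared up front; the CHECK clauses
# emulate the strict typing a warehouse engine enforces out of the box.
conn.execute("""CREATE TABLE orders (
    order_id INTEGER CHECK (typeof(order_id) = 'integer'),
    sku      TEXT    CHECK (length(sku) <= 12),
    quantity INTEGER CHECK (typeof(quantity) = 'integer'))""")

conn.execute("INSERT INTO orders VALUES (1, 'WIDGET-9', 3)")  # accepted

rejected = False
try:
    # A string where a number must be: rejected before it lands.
    conn.execute("INSERT INTO orders VALUES (2, 'WIDGET-9', 'three')")
except sqlite3.IntegrityError:
    rejected = True
```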
Data lakes are perfect storage for unstructured data, particularly emerging data that “needs to go somewhere” but does not yet have an obvious use case. An example of this kind of data might be image data coming off the quality-assurance step of a manufacturing process. As the price of data storage drops and as regulatory data retention laws multiply, organizations are increasingly taking a “save everything” stance to data, and data lakes are a perfect first-pass storage solution.
The rise of JSON as a format for web applications and data exchange has also led many organizations to de-emphasize structured relational data storage in favor of a “NoSQL” approach. Tools are emerging around that use case, most prominently Elasticsearch, which excels at storing and searching JSON documents.
The reality of today’s competitive business environment is that many organizations must innovate faster than ever, and that often means multiplying data faster than the data definitions can keep up. This means that data lake storage will begin to assume the front line in scooping up data.
There are some pieces of organizational data which, while they are important, exist in their own silos and are not cross-correlated with other organizational data. An example of this would be server access logs at a company that does not specialize in cybersecurity. The server log data is important to IT operations staff, but it does not really “touch” the other aspects of the organization. For these pieces of data “vertical analytics” is most important.
Vertical analytics are analytics generated by rolling up one piece of data into day/week/month/quarter/year views without reference to other data. This kind of data is perfect for a data lake because vertical analytics can often be run on the raw input data without loading it into any other relational structure. The major cloud platforms are now offering tools to do precisely that – to allow “data-warehouse-like” querying on raw text files.
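A vertical rollup really can run straight against the raw lake data, with no relational loading step at all. The sketch below aggregates some invented access-log lines into daily counts; the log format and the name `daily_rollup` are illustrative, and the cloud tools mentioned above do essentially this at scale with SQL instead of Python.

```python
from collections import Counter
from datetime import datetime

# Raw access-log lines straight out of the lake (illustrative format).
raw_lines = [
    "2024-03-01T10:15:00 GET /index.html 200",
    "2024-03-01T18:40:00 GET /about.html 200",
    "2024-03-02T09:05:00 GET /index.html 500",
]

def daily_rollup(lines):
    """Vertical analytics: aggregate a single data source by day,
    with no reference to any other organizational data."""
    counts = Counter()
    for line in lines:
        timestamp = line.split()[0]
        day = datetime.fromisoformat(timestamp).date()
        counts[day] += 1
    return dict(counts)

print(daily_rollup(raw_lines))
# {datetime.date(2024, 3, 1): 2, datetime.date(2024, 3, 2): 1}
```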
Horizontal analytics are all about looking at data ACROSS an organization to get a bigger picture of what’s going on. As an example of a horizontal analytics operation, point of sale data might be combined with warehouse inventory data and website click data to optimize pricing and supply chain operations. To achieve this kind of analysis, data must be pulled from several different data sources and stored in a way that allows cross-comparisons and correlations.
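That cross-source join is the defining move of horizontal analytics. The sketch below combines three invented datasets (point of sale, warehouse inventory, web clicks) that have already been conformed to share a common `sku` key — which is precisely the integration work the next paragraph describes; all the names and figures are illustrative.

```python
# Three data sources from different parts of the organization, already
# conformed to a shared key (sku) so they can be joined.
sales = {"WIDGET-9": 140, "GADGET-3": 55}      # units sold (point of sale)
inventory = {"WIDGET-9": 12, "GADGET-3": 400}  # units on hand (warehouse)
clicks = {"WIDGET-9": 9800, "GADGET-3": 1200}  # product-page views (web)

def sell_through_report(sales, inventory, clicks):
    """Join the three sources on sku to flag products that are selling
    faster than the warehouse can cover them."""
    report = []
    for sku in sales:
        on_hand = inventory.get(sku, 0)
        report.append({
            "sku": sku,
            "sold": sales[sku],
            "on_hand": on_hand,
            "views": clicks.get(sku, 0),
            "low_stock": on_hand < sales[sku],
        })
    return report

for row in sell_through_report(sales, inventory, clicks):
    print(row)
```

In a data warehouse this would be a SQL join across fact tables; the hard part is not the join itself but agreeing, across departments, on what a “sku”, a “sale”, and a “view” actually mean.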
Data governance also comes into play with such horizontal analytics – because the data comes from many different parts of the organization it’s important to define up front what those pieces of data actually mean. Doing horizontal analytics is a slower process, subject to much integration work and conversation around data definitions. These horizontal analytics are more suited to a traditional data warehouse where the data is structured in known ways and storage is optimized for more well-defined reports.
Traditional data warehouses came of age at a time when it was mostly large vendors innovating on large-scale data platforms. As a result, a few vendors (particularly Oracle, Microsoft, and IBM) came to dominate the data warehouse platform market. These platforms tend to be very costly, and as a result organizations feel more locked into their commitment to a particular vendor and platform.
In contrast, data lakes lend themselves to the use of open source tools and a “choose your own pipeline” approach to building data architecture. At each step of the data journey one can choose the disk storage, ETL tools, and analytics and machine learning tools to run on data lake data. Many of the tools commonly used in a data lake pipeline are open source, including Apache Spark, Kafka, Hadoop, Elasticsearch, TensorFlow, and PyTorch. This means that an organization has more flexibility in trying different solutions. The three major cloud platforms – Amazon Web Services, Microsoft Azure, and Google Cloud Platform – all offer hosted variants of these tools if you prefer a managed version.
Because there is still significant overlap between data warehouses and data lakes, and because data engineers have traditionally been trained with a particular set of tools (a relational mentality, the SQL query language), cloud data lake providers have built tools that can be run directly against data lake data to provide virtual data warehousing structure and query capabilities. Amazon Redshift Spectrum, Azure U-SQL, and Google BigQuery all exist to provide a “data-warehouse-like” experience on top of raw data lake data.
Our prediction is that in the next ten years data lakes will take the place of preeminence in the world of data strategy and data warehouses will recede to a smaller, but still very important, role. Instead of trying to twist a monolithic data warehouse to suit a thousand disparate organizational needs, organizations will begin to see that a data lake and the surrounding processes can feed many different use cases. One of these use cases will indeed be providing the data that feeds the traditional data warehouse, but others might be low latency text search using Elasticsearch, machine learning using TensorFlow, or image recognition using neural networks.