Thirty years ago, only a small sliver of organizations had to worry about “big data”. Data storage was relatively expensive, and developing the processes to capture that data was time-consuming. Most organizations were happy to develop a few procedures to move their data into data warehouses.
Much has changed in the ensuing decades. Regulations have dictated that organizations save much more data than before. The price of storage and compute power has dropped to the point where it feels like a mistake NOT to save as much data as possible. Cloud platforms have emerged that enable a “save everything” approach. Many organizations that would not have called themselves “data driven” a decade ago are now struggling with the big data problem.
Traditionally, data warehouses have been used to deal with the increasing volume of data. Over the past decade, however, the data landscape has changed, and data warehouses are no longer the best “first pass” data storage. Data warehouses are optimized around well-structured data that has been cleansed and formatted for analytics. They do not work well with image data, JSON/XML, or high-frequency data such as logs and readings from IoT endpoints.
The concept of a data lake has arisen in response to several relatively new trends in organizational data storage:
At its root, a data lake is nothing more than a highly scalable and performant raw data store where data can be staged. What you do with that data is up to you.
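To make the idea of staging raw data concrete, here is a minimal sketch that uses a local directory as a stand-in for the lake. The function name `stage_raw`, the `raw/<source>/` layout, and the sample payloads are all assumptions made for illustration; a real data lake would typically sit on cloud object storage or a distributed file system.

```python
import tempfile
from pathlib import Path


def stage_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Write payload unchanged under lake_root/raw/<source>/ and return its path.

    The raw/<source>/ layout is a hypothetical convention for this sketch.
    """
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no parsing, cleansing, or schema check
    return target


if __name__ == "__main__":
    # A temporary directory stands in for the lake's storage.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        stage_raw(root, "web", "event.json", b'{"user": 1, "action": "login"}')
        stage_raw(root, "app", "server.log", b"2024-06-01 12:00:00 INFO started\n")
        for path in sorted(root.rglob("*")):
            if path.is_file():
                print(path.relative_to(root))
```

The bytes are stored exactly as they arrive, whether JSON, log lines, or an image. Any structure is applied later by whatever reads the data.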
The following are some of the differences between data lakes and more traditional data warehouses:
Scalable – There is not much point in creating a data lake if your data storage solution is not highly scalable. Data lakes can grow to petabytes and even exabytes in size depending on your storage needs. Cloud vendors have for the most part solved this problem by offering access to virtually unlimited storage. On-premises data lakes will have to deal with this problem by acquiring extensive network-attached storage (NAS).
Available – This means that the data lake can be accessed by all the people and applications that need to use it – whether for writing data for storage or reading data for analytics or other uses. Cloud vendors solve this problem through partitioning and access roles. On-premises solutions will involve similar provisioning of network/storage access.
Performant – Because of the amount of data being ingested, data lake storage needs to be highly performant, both from the network access perspective and the actual disk storage perspective. This is one area where on-premises storage perhaps has an advantage over cloud storage. Internal network connections are often much faster and offer far lower latency than the external network connections required to move data to the cloud. Some of this difference is mitigated if many of your “data generating applications” already exist in cloud infrastructure. Flexibility in disk storage is an advantage of cloud providers – plans exist where you can direct your “freshest” data to highly responsive data stores, and as data ages, it can be archived to slower, lower-priority (and less expensive) storage.
Secure – With so much critical data in one place, security becomes paramount. Unfortunately, data security often runs at cross-purposes with other elements in this checklist. There are several layers to security. Cloud providers often handle operating system and network security for you, but, in theory, their data stores are accessible to anyone with the right permissions. In contrast, an on-premises solution might allow you to hide the data behind your network firewalls, but at the end of the day you are responsible for patching, monitoring, and mitigating all attacks against your network and infrastructure.
Redundant – Because so much organizational data is stored in the data lake, it is critical to build some form of redundancy into it. Cloud vendors handle this by allowing you to mirror data storage across a number of availability zones. This physical separation of data copies reduces the risk of a catastrophic event wiping out years of organizational data.
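The age-based storage plans mentioned under "Performant" can be sketched as a simple rule that maps an object's age to a storage tier. The tier names and the 30-day and 365-day thresholds below are assumptions chosen for illustration, not any vendor's actual policy. Cloud providers expose their own lifecycle settings for this.

```python
from datetime import date, timedelta

# Hypothetical tier names and age thresholds; real services define their own.
TIERS = [
    (timedelta(days=30), "hot"),    # freshest data, most responsive storage
    (timedelta(days=365), "cool"),  # older data, slower and cheaper
]
ARCHIVE_TIER = "archive"            # anything older than the last threshold


def choose_tier(created: date, today: date) -> str:
    """Return the storage tier for an object created on `created`."""
    age = today - created
    for max_age, tier in TIERS:
        if age <= max_age:
            return tier
    return ARCHIVE_TIER


if __name__ == "__main__":
    today = date(2024, 6, 1)
    for created in (date(2024, 5, 20), date(2023, 12, 1), date(2020, 1, 1)):
        print(created, "->", choose_tier(created, today))
```

With these assumed thresholds, the three sample dates land in the hot, cool, and archive tiers respectively. In practice, the thresholds would be tuned to how often older data is actually read.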