What is a data lake?
Before deciding whether to build one, it helps to have a clear idea of what a data lake actually is. The term was coined by James Dixon, founder and CTO of Pentaho, who describes a data lake as follows:
“If you think of a data mart as a shop filled with water bottles – cleansed, packaged, and structured for consumption – then a data lake is a large body of water in a more natural state. The contents of this lake stream in from various sources and combine to fill up the lake. However, it’s impure and not packaged. Multiple users can dive in, fish, do research, and take samples.”
Data warehouse versus data lake
The most significant difference between a data warehouse and a data lake is that a data warehouse is filled with structured data, whereas a data lake holds data in its raw, largely unstructured form. The advantage of a warehouse is that its structure makes it easier to find answers to BI questions than in a data lake. A data lake, however, can house more complex data and much larger quantities of it, which can then be used for analysis when needed.
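To make the contrast concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the column names, the helper functions, and the sample records are all hypothetical, and plain lists stand in for the warehouse and the lake. The warehouse checks structure when data is written; the lake accepts anything and leaves the interpretation to whoever analyzes it later.

```python
import json

# Hypothetical warehouse table: fixed columns, enforced when data is written.
WAREHOUSE_COLUMNS = {"customer_id": int, "amount": float}


def load_into_warehouse(table, record):
    """Reject any record that does not match the fixed structure."""
    if set(record) != set(WAREHOUSE_COLUMNS):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for name, expected_type in WAREHOUSE_COLUMNS.items():
        if not isinstance(record[name], expected_type):
            raise ValueError(f"{name} must be {expected_type.__name__}")
    table.append(record)


def load_into_lake(lake, raw):
    """Accept anything as-is; structure is worked out later, at analysis time."""
    lake.append(raw)


def amounts_from_lake(lake):
    """Analysis-time parsing: keep only the items that turn out to be usable."""
    for raw in lake:
        try:
            item = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON, so this analysis cannot use it
        if isinstance(item, dict) and "amount" in item:
            yield item["amount"]


warehouse, lake = [], []
load_into_warehouse(warehouse, {"customer_id": 1, "amount": 9.5})
load_into_lake(lake, json.dumps({"customer_id": 1, "amount": 9.5}))
load_into_lake(lake, "2024-01-01 12:00 sensor-7 temp=21.3")  # free-form log line
print(list(amounts_from_lake(lake)))
```

The warehouse side fails fast on bad input, which is what makes BI questions easy to answer later. The lake side never fails on input, so the effort moves to the analyst who has to make sense of the raw items.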
Is more always better?
Many people think more is always better, and a lot of organizations started working with data lakes right away. Some saw a data lake as a good addition to, or even a replacement for, their data mart or data warehouse. However, there are six good reasons to think carefully before building a data lake of your own.
1. The lake stays dry
Filling a lake takes millions of liters of water, not just a few bottles. In data terms, that means a very large volume of data and, more specifically, the right data. So ask yourself whether your organization can fill and maintain a data lake, and what you want to achieve with all this big data.
2. Rules and responsibilities
Many data lakes store privacy-sensitive data, which comes with its own set of rules and therefore risks for your organization. Data about employees, customers, patients, and clients is not all subject to the same rules. That sounds obvious, but many organizations with a data lake don’t know exactly which data they’re collecting, where it comes from, and which risks and responsibilities come with it.
Because you collect as much data as possible, data streams into your data lake without a clear picture of its contents. On top of that, raw data is not prioritized, which makes it harder to comply with all the applicable laws and regulations.
3. The lake becomes a swamp
When a data lake fills up, it’s easily polluted. A data lake, by definition, accepts any kind of data. Given all the raw data that flows into the lake, it becomes very difficult to safeguard and guarantee the data quality.
The volume of raw data makes it nearly impossible to determine which discoveries other analysts or users have already made using the same data from the lake. Without any descriptive metadata, every researcher must start from scratch.
Tracking down the correct data becomes a real nightmare in this situation.
If you don’t find a way to maintain your data lake carefully from the beginning, you run the risk of the lake turning into a swamp in no time.
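One way to keep a lake from turning into a swamp is to record descriptive metadata for every dataset the moment it enters the lake. The Python sketch below is a simplified illustration, not a reference to any particular catalog product: the `DatasetEntry` fields, the `Catalog` class, and the sample datasets are all hypothetical. It also shows how the same record can flag personal data, which ties back to point 2.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetEntry:
    """Descriptive metadata for one dataset in the lake (hypothetical fields)."""
    name: str
    source: str
    owner: str
    loaded_on: date
    contains_personal_data: bool
    description: str = ""
    known_findings: list = field(default_factory=list)


class Catalog:
    """In-memory stand-in for a data catalog."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse undocumented data: without a description it cannot be found later.
        if not entry.description.strip():
            raise ValueError(f"{entry.name}: a description is required")
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return entries whose name or description mentions the keyword."""
        keyword = keyword.lower()
        return [
            e for e in self._entries.values()
            if keyword in e.name.lower() or keyword in e.description.lower()
        ]

    def personal_data(self):
        """List the datasets that fall under privacy rules."""
        return [e.name for e in self._entries.values() if e.contains_personal_data]


catalog = Catalog()
catalog.register(DatasetEntry(
    name="web_clicks", source="website logs", owner="marketing",
    loaded_on=date(2024, 1, 1), contains_personal_data=False,
    description="Raw click events from the public website",
))
catalog.register(DatasetEntry(
    name="patient_visits", source="clinic system", owner="operations",
    loaded_on=date(2024, 1, 2), contains_personal_data=True,
    description="Visit records per patient",
))
print([e.name for e in catalog.search("click")])
print(catalog.personal_data())
```

With entries like these, an analyst can search the catalog before digging through raw files, and earlier discoveries can be noted in `known_findings` so the next researcher does not start from scratch.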
4. Required: good fishermen
You have the technology to build a data lake. But do you have the right fishermen, that is, data scientists? In other words, do you have the expertise to “fish out” the data and use it effectively for your organization? Keep points 2 and 3 in mind as well. Do you have the right people to dam your lake, to maintain it, and to make sure that it doesn’t flood and that people aren’t just fishing in it unchecked?
5. The BI tools are not finished yet
So, you have the fishermen, but do they have the right fishing rods? Most BI tools are not yet equipped to fish in a data lake. The new tools built for data lakes are rather different from what you might be used to (see also point 4). Before you build a data lake, make sure you have the right BI rods (and fishermen), or you won’t catch anything.
6. Start small at first
Have you figured out your small data, and are you getting the most out of it? Instead of filling an entire lake, many organizations would benefit more from starting with small bottles of water.
Is your company ready for a data lake?
A data lake can provide great benefits for your company, but you should make sure you’ve checked all the boxes before you start building one. Otherwise, the lake will quickly devolve into a swamp that will swallow up your organization.
Should you build a data lake?
Contact us now and together we can determine if there’s a good business case to be made for a data lake in your organization.