Handling large amounts of data is a prerequisite of digital transformation, and key to this are the concepts of data lakes and data warehouses, as well as data hubs and data marts.
In this article, we’ll start at the top of that hierarchy and look at data lakes. As organisations try to get a grip on their data and wring as much value from it as they can, the data lake is a core concept.
It’s an area of data management and analysis that depends on storage – sometimes lots of it – and it’s an activity that’s ripe for a move to the cloud, but can also be handled on-premise.
Data lake vs data warehouse
The data lake is conceived of as the first place an organisation’s data flows to. It is the repository for all data collected from the organisation’s operations, where it will reside in a more or less raw format.
Perhaps there will be some metadata tagging to facilitate searches of data elements, but it is intended that access to data in the data lake will be by specialists such as data scientists and those who develop touchpoints downstream of the lake.
Downstream is appropriate because the data lake is seen, like a real lake, as something into which all data sources flow, and those sources are potentially many, varied and unprocessed.
From the lake, data would go downstream to the data warehouse, which is taken to imply something more processed, packaged and ready for consumption.
While the data lake contains multiple stores of data, in formats not easily accessible or readable by the vast majority of employees – unstructured, semi-structured and structured – the data warehouse is made up of structured data in databases to which applications and employees are afforded access. A data mart or hub may allow for data that is even more easily consumed by departments.
So, a data lake holds large quantities of data in its original form. Unlike queries to the data warehouse or mart, interrogating the data lake requires a schema-on-read approach, in which structure is applied to the data when it is read rather than when it is written.
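To make schema-on-read concrete, here is a minimal sketch in Python. It assumes raw records arrive as JSON lines with inconsistent fields; the record contents, the SALES_SCHEMA mapping and the read_with_schema function are all hypothetical names invented for illustration, not part of any product mentioned here.

```python
import json

# Raw records land in the lake untouched; fields and types are inconsistent.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2021-03-01"}',
    '{"user": "bob", "amount": 7, "channel": "web"}',
    '{"user": "carol"}',
]

# Hypothetical schema chosen by the reader at query time: field -> (type, default).
SALES_SCHEMA = {"user": (str, ""), "amount": (float, 0.0)}


def read_with_schema(raw_lines, schema):
    """Parse each raw line and project it onto the schema supplied by the reader."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
total = sum(row["amount"] for row in rows)
```

Nothing is validated or reshaped when the records are stored. A different team could read the same raw lines with a different schema, which is the flexibility, and the cost in query effort, that the approach implies.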
Data lake: Data types and access methods
Sources of data in a data lake will include all data from an organisation or one of its divisions.
It might include structured data from relational databases; semi-structured data such as CSV and log files, as well as data in XML and JSON formats; unstructured data like emails, documents and PDFs; and binary data, such as images, audio and video.
In terms of storage protocol that means it will need to store data that originated in file, block and object storage.
But, of those, object storage is a common choice of protocol for the data lake itself. Don’t forget, access will not be to the data itself, but to the metadata headers that describe the data, which could be attached to anything from a database to a photo. Detailed querying of the data often happens elsewhere, not in the data lake.
Object storage is very well-suited to storing vast amounts of data in unstructured form. That is, you can’t query it like you can a database in block storage, but you can store multiple object types in a large flat structure and find out what’s there.
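The idea of a flat structure that is searched by metadata rather than by content can be sketched with a toy in-memory store. This is an illustration only: FlatObjectStore, its put and find methods, and the sample keys and tags are hypothetical and do not reflect any real object storage API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes  # opaque payload; the store never inspects it
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy stand-in for an object store: one flat namespace of keys."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        """Store a payload under a key, with free-form metadata tags."""
        self._objects[key] = StoredObject(data, metadata)

    def find(self, **criteria):
        """Return the sorted keys of objects whose metadata matches every criterion."""
        return sorted(
            key
            for key, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


store = FlatObjectStore()
store.put("sales/2021/orders.csv", b"id,total\n1,9.99\n", kind="csv", source="erp")
store.put("media/logo.png", b"\x89PNG", kind="image", source="marketing")
store.put("logs/app.log", b"GET /index", kind="log", source="erp")

erp_keys = store.find(source="erp")
```

The slashes in the keys are only a naming convention. There is no directory tree, and the search touches the metadata tags, never the payload bytes.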
Object storage is generally not designed for high performance, and queries against the data lake are more complex to construct and process than those against a relational database in a data warehouse. But that’s fine, because much of the querying at the data lake stage will be to provide more easily queryable data stores for the downstream data warehouse.
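As a rough sketch of that lake-to-warehouse step, the Python below parses raw log lines and loads the usable ones into a structured table. It uses an in-memory SQLite database purely as a stand-in for a warehouse; the log format, the requests table and the parse_line and load_warehouse functions are assumptions made for this example.

```python
import sqlite3

# Raw lines as they might sit in the lake, including one that does not parse.
RAW_LOG_LINES = [
    "2021-03-01 GET /products 200",
    "2021-03-01 GET /checkout 500",
    "corrupted line",
    "2021-03-02 POST /checkout 200",
]


def parse_line(line):
    """Return a (day, method, path, status) tuple, or None if the line does not fit."""
    parts = line.split()
    if len(parts) != 4 or not parts[3].isdigit():
        return None
    day, method, path, status = parts
    return day, method, path, int(status)


def load_warehouse(lines):
    """Load parsable lines into an in-memory SQLite table standing in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (day TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = load_warehouse(RAW_LOG_LINES)
loaded = conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0]
errors = conn.execute(
    "SELECT COUNT(*) FROM requests WHERE status >= 500"
).fetchone()[0]
```

The slow, awkward work of parsing and filtering happens once, at the lake stage. Downstream users then ask simple SQL questions of a table whose structure is already fixed.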
Data lake on-prem vs cloud
All the usual on-premise vs cloud arguments apply to data lake operations.
On-prem data lake deployment has to take account of space and power requirements, design, hardware and software procurement, management, the skills to run it and ongoing costs in all these areas.
Outsourcing the data lake to the cloud has the advantage of turning the capital expenditure (capex) of infrastructure into the operational expenditure (opex) of payments to the cloud provider. That, however, could result in unexpected costs as data volumes scale, and as data flows to and from the cloud, for which you will also be charged.
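A back-of-the-envelope model shows how those charges can compound. The sketch below is plain arithmetic in Python; the two rates are hypothetical placeholders, not figures from any provider’s price list, and the function names and the assumption that a fixed share of stored data is transferred each month are invented for illustration.

```python
# Hypothetical monthly rates per terabyte; substitute a provider's real prices.
STORAGE_RATE_PER_TB = 20.0
TRANSFER_RATE_PER_TB = 90.0


def monthly_cloud_cost(stored_tb, transferred_tb):
    """Return the storage charge plus the data-transfer charge for one month."""
    return stored_tb * STORAGE_RATE_PER_TB + transferred_tb * TRANSFER_RATE_PER_TB


def project_costs(start_tb, monthly_growth, transfer_share, months):
    """Project monthly bills while stored volume grows by a fixed rate each month.

    transfer_share is the assumed fraction of stored data moved each month.
    """
    costs = []
    stored = start_tb
    for _ in range(months):
        cost = monthly_cloud_cost(stored, stored * transfer_share)
        costs.append(round(cost, 2))
        stored *= 1 + monthly_growth
    return costs


# 100TB growing 10% a month, with 5% of stored data transferred each month.
costs = project_costs(100, 0.10, 0.05, 3)
```

Because the transfer charge scales with the stored volume in this model, growth in the lake pushes up both parts of the bill at once, which is the kind of effect worth testing against real prices before committing.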
So, a careful analysis of the benefits and drawbacks of each is needed. That could also take into account issues such as compliance and connectivity that go beyond just storage and data lake architecting.
Of course, you can also operate across the two locations in hybrid cloud fashion, bursting to the cloud when needed.
On-prem data lake products
In terms of storage, a data lake will often need a fair amount of it. If it’s the data lake for an enterprise-scale organisation, that will definitely be the case.
In the middle of the past decade, storage vendors seemed to test the waters with data lake products. EMC, for example, had its Federation Business Data Lake, launched in 2015, which delivered EMC storage plus VMware and Pivotal big data products.
But that seemed to be short-lived. By 2017, Dell EMC was targeting its Elastic Data Platform at data lake deployments.
Elsewhere, Dell EMC has also targeted its scale-out network-attached storage (NAS) Isilon product range at data lake use cases.
Hitachi Vantara has perhaps more of an emphasis on analytics, big data and the internet of things (IoT) since its rebrand. It offers data lake capability based on its Hitachi Content Platform storage in conjunction with the Lumada IoT platform and Pentaho data integration environments.
Pentaho Data Integration and Analytics is aimed at big data. Reports and analytics can be accessed remotely, and once a user gains access to data, it can be processed and consumed anywhere. Pentaho supports Hadoop, Spark, NoSQL data stores and analytic databases. The Lumada IoT platform uses Pentaho data orchestration, visualisation and analytics software.
IBM also comes under the category of storage vendors that make some noise about data lakes. It offers its storage arrays and consulting, alongside partnering with Cloudera to offer data lake solutions. Cloudera is a data management platform that allows for orchestration and analytics of large volumes of data.
NetApp doesn’t make a great play about data lakes as such, but it does offer its Ontap-powered arrays as storage for big data, Hadoop, and Splunk, for example.
HPE likewise doesn’t make any very specific plays toward data lake deployment, except to say you can build one using its GreenLake pay-per-use product portfolio.
It’s fair to say you can build data lakes on any supplier’s hardware, and white box commodity kit is also a popular choice. It seems some of the big storage suppliers went through a brief period of offering products tailored to data lakes, with talk even of data lake appliances, but such projects are big ones with many tentacles and lend themselves more to a consulting and solutions-type approach.
Enter the cloud
The hardware suppliers dabbled with discrete data lake products, but eventually seem to have concluded it’s an amorphous area in terms of marketing and sales and that their consulting arms will pick it up.
The big cloud suppliers, meanwhile, have gone the other way, with all three offering defined data lake services.
The AWS data lake solution offers a console from which customers can search for and browse available data sets. Then they can tag, search, share, transform, analyse, and govern specific subsets of data across a company or with other external users.
It is based on AWS’s S3 object storage and is knitted together with a variety of AWS services, including AWS Lambda microservices, Amazon Elasticsearch Service, Amazon Cognito user authentication, AWS Glue for data transformation and Amazon Athena analytics.
Azure’s data lake offering is along similar lines, and offers the ability to run massively parallel data transformation and processing programs in U-SQL (Azure’s own language), R, Python and .NET over petabytes of data.
You can then use Azure’s HDInsight, which is a managed open-source analytics service that includes frameworks such as Hadoop, Apache Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm and R.
Google Cloud Platform comes across a little less like a one-stop shop for data lake deployment than AWS and Azure. There’s no doubt you can build data lakes on GCP – and Google boasts that Twitter does so, for one – but it’s probably more of a consultancy-heavy project than the off-the-shelf offerings from the other two.
Data lakes better defined by cloud providers
There’s no doubt the idea of the data lake is a useful concept. The idea of a repository into which all corporate data flows and where it is selected and then made more easily accessible is a good one.
And it’s quite easy to see that certain types of storage are better suited to it. Its needs are not immediate and rapid, so fairly cheap and deep storage, such as object storage, is ideal.
What’s interesting is that the on-prem storage vendors seemed to make a big deal of big data/data lakes, and in some cases even touted the idea of a data lake appliance.
But the reality of data lake deployment has been something rather larger and multi-tentacled, which makes it poorly suited to discrete products. So the hardware vendors have largely flirted with it and moved on, unless consulting and services provide their route to it.
Meanwhile, however, the big cloud providers – being predominantly service-based – have been able to knit together solutions to build data lakes with relative ease and so, at least in the offerings of AWS and Azure, data lake solutions are prominent and well-defined.