
Discovering the value of IoT with data lakes

Dremio

The vast number of devices connected to the internet has led to an explosion of data. IoT technology has enabled communication among humans, devices and systems, making it a fundamental element of fast, game-changing decision-making. Data collected from IoT devices has value on its own. However, the complexity and non-standard nature of IoT data can make that value difficult to extract.

To extract and amplify its business value, IoT data must be combined with existing non-IoT data. The way to unlock this value is to create a modern cloud data lake and use well-established best practices to prevent it from becoming an IoT data swamp. This approach allows enterprises to use the latest innovations not only to make the most of their IoT infrastructure, but to do so in the most cost-effective way.

Data lake storage has become the new data aggregation layer

Cloud data lake storage is transforming the way we think about data lakes. Cloud storage services such as Azure Data Lake Storage (ADLS) and Amazon S3 offer virtually unlimited scalability, cost flexibility, ease of maintenance, high availability and the elimination of silos. These properties were not available with the on-premises data lakes of less than a decade ago.

These characteristics have changed the way organizations view data lake storage. In the past, data lakes were just afterthoughts. Now, data lake storage is the first place where data lands.

Cloud data lakes provide separation of compute and data

Data lake storage also changed the game in terms of how data gets processed because you can now benefit from the separation of compute and data. Less than a decade ago, we talked about bringing compute to data with the goal of running everything in the same cluster and getting faster performance as a result. In the cloud, things work differently; not just because networking infrastructure has gotten better, but also because of the notion of separation between compute and storage.

Traditional cloud data warehouse approaches separate compute and storage, and that is not a bad start. However, what you really want is complete separation of compute and data, which is only possible with an open cloud architecture.

What that means is that now you can have storage as a service and separately leverage pipeline services and compute engines, such as Spark, Dremio, and Hive, that can run directly on your data. This is a key advantage for IoT data lake infrastructure because you can scale your data and your compute separately, giving you fine-grained control over speed and cost.

Building a successful IoT data lake architecture

In traditional data warehouse approaches, data obtained from IoT devices has to be transformed, standardized and blended before it is ready for analysis. This process is slow and expensive and can translate into missed business opportunities. To get it right, you have to learn how to simplify the process by building a data lake that is flexible, fast, secure and cost-efficient. Here’s how it’s done:

Data should stay where it landed

Though cloud storage is inexpensive, the practice of siloing and keeping multiple copies of data is not only inefficient, but it can also increase storage costs. IoT analysts want fast, self-service access to data and you can achieve this by eliminating the complexity of the pipelines required to satisfy their demands.

IoT infrastructures generate data in multiple shapes and sizes, and while heavy-lifting extract, transform and load processes are still needed to land this data in the lake, you can avoid having to implement similar processes when providing data to end users. It is possible to have scenarios where data is stored in several buckets within the same cloud, or even a multi-cloud environment where different sets of data are stored in separate cloud storages.

The key is to implement a self-service infrastructure where a range of users can consume data directly from the data lake, using tools that they are familiar with without needing additional help from IT.
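One way to picture data staying where it landed is a small catalog that maps a logical dataset name to the physical locations behind it, so users ask for a name rather than a copy. The following is a minimal sketch under stated assumptions, not a description of any particular product: the `Catalog` class, the dataset names and the bucket paths are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Catalog:
    """Maps logical dataset names to the locations where the data landed."""

    datasets: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, name: str, *locations: str) -> None:
        # Record where the files already live; nothing is copied or moved.
        self.datasets.setdefault(name, []).extend(locations)

    def resolve(self, name: str) -> List[str]:
        # Users ask for a logical name and get every location behind it.
        if name not in self.datasets:
            raise KeyError(f"unknown dataset: {name}")
        return list(self.datasets[name])


catalog = Catalog()
# Hypothetical bucket names and paths spanning two clouds.
catalog.register(
    "sensor_readings",
    "s3://plant-a-telemetry/readings/",
    "abfss://telemetry@plantb.dfs.core.windows.net/readings/",
)
catalog.register("asset_master", "s3://erp-exports/assets/")

for location in catalog.resolve("sensor_readings"):
    print(location)
```

In this sketch, an analyst who wants sensor readings from both plants only needs the name `sensor_readings`; the engine that executes the query would read from each location in place.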

Maintain security and governance

Industry estimates have put the global cost of cybercrime as high as $6 trillion a year, so it is critical to make security a fundamental part of your IoT cloud data lake strategy. Always establish security measures such as encryption of data in transit and at rest, as well as role-based access control.

Your main focus should be simplicity. Security systems can be so complex that users try to work around them and turn to less-governed alternatives. Giving users enough access to get the data they need keeps them from going outside the system. You can do this by providing a governed mechanism for data sharing that prevents disconnected copies and avoids restricting access to data unnecessarily.

You should also enable coarse-grained ownership when possible. The scalability and elasticity of the cloud make it easier to create separate resources for different teams. Full resource isolation is emerging as a common model for data lakes, allowing data teams to use their resources without sharing them with other organizational units. In addition, coarse-grained access control is easier to set up and maintain than fine-grained access control.
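As a rough illustration of coarse-grained ownership, access can be decided at the level of whole storage prefixes owned by a team rather than individual files or columns. The sketch below assumes a hypothetical `TEAM_PREFIXES` map and `can_read` helper; the team names and paths are invented for the example, and a real deployment would rely on the access controls of the cloud provider or query engine.

```python
# Hypothetical team-to-prefix ownership map; names are illustrative only.
TEAM_PREFIXES = {
    "maintenance": ["lake/iot/vibration/", "lake/iot/temperature/"],
    "finance": ["lake/erp/billing/"],
}


def can_read(team: str, path: str) -> bool:
    """Return True if the path falls under a prefix owned by the team."""
    return any(path.startswith(prefix) for prefix in TEAM_PREFIXES.get(team, []))


# A team can read anything under its own prefixes and nothing else.
print(can_read("maintenance", "lake/iot/vibration/2020/01/part-0.parquet"))
print(can_read("finance", "lake/iot/vibration/2020/01/part-0.parquet"))
```

Because each rule covers an entire prefix, the policy stays short enough to review by hand, which is the maintenance advantage over per-file or per-column rules.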

Control costs with efficient workload management and elastic scalability

This is likely the most difficult, yet the most necessary, best practice. Teams across your organization will have different workload requirements and different SLAs, but share a common pool of resources. This can be challenging because you must find the right balance when deploying resources to avoid both over-provisioning and under-provisioning. Straying too far to either side will heavily impact costs and workload performance.

Consider leveraging processing engines that allow you to efficiently size and automate the deployment of resources based on workload sizes. This way you can have full control over how many resources are deployed for each workload, and which resources can be decommissioned due to inactivity. This means you can eliminate unnecessary expenses for idle workloads.
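The sizing and decommissioning logic described here can be sketched in a few lines. This is only an illustration under assumed numbers: the `Engine` record, the 500 GB-per-node ratio, the 32-node cap and the 15-minute idle limit are hypothetical defaults, not recommendations or the behavior of any specific engine.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Engine:
    name: str
    nodes: int
    idle_minutes: float


def nodes_for(workload_gb: float, gb_per_node: float = 500.0, max_nodes: int = 32) -> int:
    """Pick a node count proportional to workload size, within fixed bounds."""
    needed = math.ceil(workload_gb / gb_per_node)
    return max(1, min(max_nodes, needed))


def to_decommission(engines: List[Engine], idle_limit_minutes: float = 15.0) -> List[str]:
    """Return the names of engines that have been idle longer than the limit."""
    return [e.name for e in engines if e.idle_minutes > idle_limit_minutes]


# Hypothetical engines serving two different workloads.
engines = [
    Engine("dashboards", nodes=4, idle_minutes=2.0),
    Engine("nightly_batch", nodes=16, idle_minutes=45.0),
]

print(nodes_for(1200))           # size a 1,200 GB workload
print(to_decommission(engines))  # engines that can be shut down
```

Keeping the sizing rule and the idle rule as separate functions mirrors the two levers in the text: how many resources each workload gets, and when unused resources stop costing money.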

Final thoughts

Discovery analytics is critical to the success of IoT; however, it can be challenging because of the scale of data generated by IoT platforms. Designing and deploying a data lake helps address these challenges by allowing you to control costs, enable real-time access to data and ensure governance.

A data lake is a virtually bottomless repository that can be filled with data of any shape and size. That said, when working with IoT data it is possible to create an unmanageable data swamp if you don’t follow best practices when designing and implementing your data lake strategy.

By eliminating data movement, leveraging flexible scalability, tailoring the size of resources that you need to handle heavy analytical workloads and making data easily accessible to all users, you can amplify the value of your IoT data and the value that it adds to your business.

All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.