The amount of data that we produce is truly mind-boggling. This influx of data is generated by all of the connected devices that organizations and consumers use every day. It's predicted that by 2020 there will be nearly 30.73 billion connected IoT devices.
The data generated by IoT devices is only useful if you analyze it. Because this kind of data is highly unstructured, it's nearly impossible to analyze with traditional business intelligence (BI) tools and analytics software, which were designed for structured data.
Organizations typically put this type of data in data lakes, such as Amazon S3, Azure Data Lake Storage or Hadoop. This means that analysts need to find a new place to combine those datasets before they can query them. As a result, many analysts leave IoT data untouched, and it languishes in data lakes as an underperforming asset. To truly get the most out of IoT data, organizations need to figure out how to fish it out of data lakes and into analytics tools.
Use low-cost object stores
Fortunately, object stores can help break down data silos by providing massively scalable, cost-effective storage for any type of data in its native format. This is especially important for the massive volumes typically associated with IoT. But there's a catch: object storage is decoupled from compute, so you'll need a data lake engine to analyze the data. The right data lake engine simplifies things considerably. Ideally, you want something that performs analytics directly on the data lake, reducing the need for extract, transform and load (ETL) pipelines and data warehouses, and replacing cubes and extracts. This holds even for on-premises data lakes, though there are more options there.
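To make the "analytics directly on the data lake" idea concrete, here is a toy sketch in plain Python (not a real engine): an aggregation scans CSV objects where they sit in the lake, rather than first loading them into a warehouse. The file contents and field names are hypothetical.

```python
# Toy sketch of querying data in place: scan objects in the "lake" and
# aggregate on the fly, with no copy into a warehouse first.
# Fields and values are hypothetical.
import csv
import io
import statistics

def scan_lake(objects):
    """Yield one temperature reading per row across all objects."""
    for obj in objects:
        for row in csv.DictReader(obj):
            yield float(row["temperature"])

# Two in-memory stand-ins for objects sitting in an object store.
lake = [
    io.StringIO("device,temperature\na,20.5\nb,21.0\n"),
    io.StringIO("device,temperature\nc,19.5\n"),
]
mean_temp = statistics.mean(scan_lake(lake))
print(round(mean_temp, 2))  # 20.33
```

A real engine adds a query planner, parallel scans and format readers (Parquet, ORC, JSON), but the shape is the same: compute goes to the files, not the other way around.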
You'll also need something that supports standard SQL for interactive analytics. Data consumers need ad hoc querying, low latency, high concurrency, workload management and BI tool integration, as well as the ability to consume any data from any source with the robustness and flexibility of SQL. For most organizations, SQL is the data access language most users already know.
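The kind of ad hoc SQL in question can be sketched with Python's built-in sqlite3 standing in for a SQL-capable data lake engine. The table and column names are hypothetical.

```python
# Sketch of ad hoc SQL over device data, using sqlite3 as a stand-in
# for a SQL-capable engine. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 20.5), ("a", 21.5), ("b", 19.0)],
)

# An ad hoc question an analyst might ask: average temperature per
# device, hottest first.
rows = conn.execute(
    "SELECT device, AVG(temperature) FROM readings "
    "GROUP BY device ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('a', 21.0), ('b', 19.0)]
```

Because the query is plain SQL, the same question could be posed from a BI tool or a notebook without the analyst learning a new language.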
Self-service data platforms without vendor lock-in
Self-service and collaboration are key to making data consumers independent. A self-service platform should let any analyst access all needed data in a single place, regardless of where and how that data is stored. The platform should implement a unified data layer for self-service access, so users can retrieve a wide range of physical datasets, from IoT and big data to customer records, from many different kinds of repositories, regardless of format or location. Whether data sits in data warehouses, data lakes, NoSQL repositories or file systems, the platform should make it easily accessible to users via their favorite tools, such as Tableau and Python. This creates a radically wide landscape of accessible data.
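The unified-data-layer idea can be sketched in a few lines: one access function exposes datasets from two different kinds of repositories (a SQL database and a CSV file) in the same tabular shape. The dataset names and fields are hypothetical.

```python
# Toy sketch of a unified data layer: one interface over heterogeneous
# repositories. Dataset names and fields are hypothetical.
import csv
import io
import sqlite3

# Repository 1: a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Acme')")

# Repository 2: a CSV object, standing in for a file in a data lake.
iot_csv = io.StringIO("device,customer_id\nsensor-1,1\n")

def read_dataset(name):
    """Return rows as dicts, regardless of where the dataset lives."""
    if name == "customers":
        return [{"id": i, "name": n}
                for i, n in db.execute("SELECT id, name FROM customers")]
    if name == "iot":
        iot_csv.seek(0)
        return list(csv.DictReader(iot_csv))
    raise KeyError(name)

# Analysts see both sources through the same interface:
print(read_dataset("customers")[0]["name"])  # Acme
print(read_dataset("iot")[0]["device"])      # sensor-1
```

A production platform would add cataloging, access control and query pushdown, but the principle is the same: consumers ask for a dataset by name, not by storage system.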
An in-memory columnar engine based on Apache Arrow can deliver maximum efficiency in query speed and in the memory and compute resources used. Many different engines can access a single in-memory representation, and this sharing avoids the serialization and deserialization that slow down other in-memory data stores.
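Arrow itself is a language-independent columnar format with bindings for many languages; the core idea, that each column lives in one contiguous typed buffer so a scan touches only the bytes it needs and consumers can share the buffer without re-serializing, can be sketched in plain Python. This is an illustration of the layout concept, not the Arrow format itself.

```python
# Sketch of the columnar idea behind Arrow, in plain Python.
from array import array

# Row-oriented layout: each record interleaves all fields.
rows = [("a", 20.5), ("b", 21.0), ("c", 19.5)]

# Column-oriented layout: one contiguous typed buffer per field.
devices = ["a", "b", "c"]
temperatures = array("d", [20.5, 21.0, 19.5])  # contiguous doubles

# Aggregating one column never touches the other column's data.
avg = sum(temperatures) / len(temperatures)
print(round(avg, 2))  # 20.33

# The buffer can be handed to another consumer without copying or
# re-serializing it:
view = memoryview(temperatures)  # zero-copy view of the same bytes
print(view.format, len(view))    # d 3
```

In Arrow proper, that zero-copy handoff is what lets multiple engines and languages operate on one shared in-memory table.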
Also look for solutions that store metadata and cached data in open formats such as Parquet. Solutions that require copying data to another service, or into a proprietary data format, drive up expenses, and since many of these services charge for compute, they raise the cost of gaining insight from the data.
An infrastructure that’s based on open source delivers several key benefits to enterprises, including greater security, more thoroughly reviewed code, no vendor lock-in and faster development cycles that build on the work of the community of open source contributors.
IoT security is critical
There are many security challenges associated with IoT data and cloud services. Data breaches have grown in volume, scale and impact. According to an IBM study, the cost of a data breach has risen 12% over the past five years and now averages $3.92 million. Due to the growth in connectivity between the digital and physical worlds, as well as the accelerating deployment of IoT and AI technologies, there are many more avenues for cyberattack.
Another issue is that the flow of data from sensors to the cloud is often insecure. Keep in mind that the transport layer of an IoT stack has only two standardized protocols: Transmission Control Protocol (TCP) and User Datagram Protocol (UDP). Both can be attacked through a variety of methods. In addition, analytical systems often bring together sensitive data from many different parts of an organization, making them a natural target for cybercrime.
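Neither TCP nor UDP encrypts anything on its own; transport security is layered on top (TLS over TCP, DTLS over UDP). As one illustration, here is how a Python-based sensor client might wrap its TCP connection in TLS. The host name is hypothetical, and a real deployment would also provision certificates for device authentication.

```python
# Sketch: securing a sensor-to-cloud TCP connection with TLS.
# The endpoint is hypothetical; 8883 is the registered port for
# MQTT over TLS.
import socket
import ssl

context = ssl.create_default_context()  # verifies server certs by default

def open_secure_channel(host, port=8883):
    """Return a TLS-wrapped socket to the ingestion endpoint."""
    raw = socket.create_connection((host, port))
    return context.wrap_socket(raw, server_hostname=host)

# Defaults worth confirming before shipping firmware: hostname checking
# is on, and unverified server certificates are rejected.
print(context.check_hostname)                    # True
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
```

The point of checking those defaults is that downgrading either one, which some device SDKs do for convenience, reopens the insecure sensor-to-cloud path described above.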
Effective security operations require staying ahead of threats. To keep your IoT security reliable and up to date, look for a platform that can expand security options through role-based permissions, as well as limit and mask data access using virtual datasets, all while letting teams help themselves.
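The virtual-dataset approach to masking can be sketched with a SQL view: analysts are granted access to the view, which masks the sensitive column, while the underlying table stays untouched. The table, view and column names here are hypothetical.

```python
# Sketch of masking via a virtual dataset: analysts query a view that
# hides a sensitive column. Names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE owners (device TEXT, email TEXT)")
conn.execute("INSERT INTO owners VALUES ('sensor-1', 'ada@example.com')")

# The "virtual dataset" analysts are actually granted access to:
conn.execute("""
    CREATE VIEW owners_masked AS
    SELECT device, substr(email, 1, 1) || '***' AS email
    FROM owners
""")

masked = conn.execute("SELECT email FROM owners_masked").fetchone()[0]
print(masked)  # a***
```

Because the view is defined once and shared, the masking policy travels with the dataset instead of being re-implemented in every downstream tool.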
To truly make IoT data a performing asset, look for a platform that gives users governed, secure access to data from any source without creating copies of it. By eliminating the need to move and copy data, data consumers can easily discover and curate data with their favorite tools, without depending on IT for their data requests. This type of open source data platform empowers business analysts and data scientists to be self-directed in their analysis, so companies gain more value from their IoT data, faster.
All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.