Are time series databases the key to handling the IoT data deluge?
It’s pretty obvious that data is being collected at an astonishing — and rapidly increasing — rate. We’re collecting more data, on more systems and across more industries than ever before in human history. Keeping up with that flow of data is one of the major challenges in the IT industry today.
Unfortunately, I believe the growth of data collection is only just beginning. The amount and velocity of data being collected are not only going to grow, but grow at a faster pace than ever before. We are in for a deluge of data.
Why so much data?
The answer to this question is, of course, a long one, but it boils down to the fact that we are instrumenting more systems and more “things” than ever. From the increasing instrumentation of applications and systems — what we now call DevOps — to the exploding growth of IoT, everything around us is beginning to emit data. For now, I’m going to focus on the growth of IoT data to illustrate what we’re in for.
Every analyst has a prediction for how many IoT devices will come online by year X. Back in 2017, Gartner reported that the number of IoT devices had grown 31% over the previous year to 8.4 billion, and predicted that more than 20 billion devices would be online by 2020 (that’s only next year!). For simplicity’s sake, let’s use that 20 billion number as a baseline example.
How much data is that?
I’ve built many IoT devices. In fact, I have a dozen sitting on my desk right now. Some of these devices produce only a single data stream, meaning they produce a single data point for each reading. Others produce upward of a dozen data streams. Consumer and industrial sensors, for example, can monitor far more and produce dozens of data streams per device.
To give a more concrete example of how this data volume is calculated, let’s say each device produces an average of 10 data streams and writes data out once per second (which is very low for many industrial sensors, for the record). My single-stream sensor, for instance, reads the CO2 content and writes it out to a database every second. That reading, between 0 and 10,000 parts per million of CO2, can be anywhere from one to five bytes long. So, to keep the calculation simple, let’s assume each data stream is a 5-byte reading, once per second. We now have a single device producing 5 bytes per second on each of 10 data streams. That’s only 50 bytes per second!
While this doesn’t seem like much, if you were to multiply this number by 20 billion devices, you’d get 1 trillion bytes per second, or one terabyte of IoT data. Every second. Of every day. Forever.
My laptop has a 1 TB drive in it, so I’d fill that up in a single second. At that rate, a petabyte fills in under 17 minutes, and a single year adds up to roughly 31.5 exabytes of data.
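The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: every constant is one of the round numbers assumed in the text (5-byte readings, 10 streams per device, one write per second, 20 billion devices), and decimal units are used throughout (1 TB = 10^12 bytes).

```python
# Back-of-the-envelope estimate using the article's round numbers.
# Every constant is an assumption taken from the text, not a measurement.
BYTES_PER_READING = 5
STREAMS_PER_DEVICE = 10
READINGS_PER_SECOND = 1
DEVICES = 20_000_000_000
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years

# Decimal (SI) storage units.
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18

per_device = BYTES_PER_READING * STREAMS_PER_DEVICE * READINGS_PER_SECOND
per_second = per_device * DEVICES
per_year = per_second * SECONDS_PER_YEAR

print(f"Per device: {per_device} bytes/s")
print(f"All devices: {per_second / TERABYTE:.1f} TB/s")
print(f"Time to fill 1 PB: {PETABYTE / per_second / 60:.1f} minutes")
print(f"Per year: {per_year / EXABYTE:.1f} EB")
```

Under these assumptions, the script reports 50 bytes per second per device, 1.0 TB per second across all devices, about 16.7 minutes to fill a petabyte and about 31.5 EB per year.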
What are we going to do with all that data?
Now, this is the real question.
All of that data must be ingested into some sort of searchable database in real time. It must be stored, manipulated, queried and acted upon by businesses and organizations every hour of every day to get the most out of the business insights that rich data holds. Mind you, it’s not all going into the same database, but that’s still a lot of data to manage for any organization.
When talking about ingesting and storing data, we also need to take a look at what kind of data it is, because not all data is created equal. We can break down IoT data into several buckets. The first is the metadata about the sensors and devices we’re using to collect the data. This can consist of everything from sensor model numbers to date placed in service, physical location and any other data about the sensor itself. This data is typically not updated often and probably doesn’t change much over time.
The really valuable data is the sensor data itself. Sensor data is typically time-stamped readings from a sensor, sent in a constant stream from device to storage platform. It could be a CO2 reading, environmental data or data from heart rate monitors, industrial equipment and so forth. No matter where this data comes from, it almost always follows the basic formula of <data reading>@time-stamp. This, some of you may recognize, is time series data — data for which time is a critical component.
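To make the two buckets concrete, here is a minimal sketch of how the slow-changing metadata and the time-stamped readings might be modeled. The class and field names (`DeviceMetadata`, `Reading`, `device_id`, `stream` and so on) are hypothetical and chosen only for illustration; they are not taken from any particular database or product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeviceMetadata:
    """Slow-changing facts about a sensor (hypothetical fields)."""
    device_id: str
    model: str
    location: str
    in_service: str  # ISO date the device was placed in service


@dataclass(frozen=True)
class Reading:
    """One time series point: a value plus the moment it was taken."""
    device_id: str
    stream: str      # e.g. "co2_ppm"
    value: float
    timestamp: datetime

    def as_text(self) -> str:
        # Mirrors the "<data reading>@time-stamp" shape from the text.
        return f"{self.value}@{self.timestamp.isoformat()}"


meta = DeviceMetadata("sensor-1", "CO2-A", "desk", "2019-01-01")
reading = Reading(
    device_id=meta.device_id,
    stream="co2_ppm",
    value=412.0,
    timestamp=datetime(2019, 1, 1, tzinfo=timezone.utc),
)
print(reading.as_text())
```

The metadata record is written once and rarely touched, while a new `Reading` is produced every second per stream, which is why the two kinds of data put such different demands on storage.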
How do we store time series data?
There are as many possibilities for storing time series data as there are databases in the world. You could store it in a traditional relational database management system (RDBMS), as unstructured data in a NoSQL database or even in a spreadsheet or CSV file. But just because you can do something doesn’t mean that you should.
Traditional RDBMSes are designed to store, access and update relational tables of data, while unstructured NoSQL databases are suited to storing and retrieving, well, unstructured data. IoT data, as we’ve seen, is neither of these things. It is highly specific time series data, and for that, you need a time series database.
Time series databases are designed specifically to ingest, store and query time series data because it’s different from other types of data. It demands very high ingestion rates and the ability to query data across time ranges to surface trends and business insights.
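As an illustration of what querying across time means, the sketch below groups readings into fixed time windows and averages each window, which is the sort of aggregation a time series database runs inside its own query engine. The function name `mean_per_window` and the sample data are assumptions made for this example; a real database would do this at far larger scale and without loading every point into memory.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def mean_per_window(points, window_seconds):
    """Average (timestamp, value) pairs over fixed-size time windows.

    Returns a dict mapping each window's UTC start time to the mean
    of the values that fall inside that window.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        epoch = int(ts.timestamp())
        window_start = epoch - (epoch % window_seconds)
        buckets[window_start].append(value)
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): sum(vals) / len(vals)
        for start, vals in sorted(buckets.items())
    }


# Two minutes of made-up CO2 readings, one per second.
start = datetime(2019, 1, 1, tzinfo=timezone.utc)
points = [(start + timedelta(seconds=i), 400.0 + i) for i in range(120)]

result = mean_per_window(points, window_seconds=60)
for window_start, mean in result.items():
    print(window_start.isoformat(), mean)
```

Downsampling one reading per second into one-minute averages like this is a common way to spot trends without scanning every raw point.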
The growth of time series data as a category
As time series data has grown, so has awareness of the need for systems built specifically for it. This growing data problem, along with the growth of time series databases, has created a whole new category of database vendors. That’s why, over the past 24 months, time series databases have been the fastest-growing segment of the database market.
With the growth of IoT data, it’s easy to see why.
All IoT Agenda network contributors are responsible for the content and accuracy of their posts. Opinions are of the writers and do not necessarily convey the thoughts of IoT Agenda.