
Guest Post

Use device reliability engineering to derisk IoT products

Device reliability engineering uses over-the-air updates, performance metrics and remote debugging to help ensure that a product can run as smoothly as possible.

Development teams know this cycle well. They finally get a product to launch and spend the next 48 hours celebrating while everything feels amazing. But, soon, the cold reality and hard work of having a product in the field set in: There will never be a launch without issues.

It's not a failure of talent or technology. Jack Ganssle, firmware author and speaker, contends that every 1,000 lines of code contains between 10 and 100 defects. As most IoT devices have thousands of lines of code, Ganssle's logic suggests hundreds -- possibly thousands -- of defects exist within each one. While many are insignificant, some defects are severe and can't be ignored.

No matter how good an organization's QA processes are, some issues only surface in production. Why? If a bug happens once every 10,000 hours, that bug is extremely hard to find in a handful of devices in QA. But, once there are 10,000 units in the field, that bug surfaces roughly once an hour somewhere in the fleet.

Bugs, security issues, missing features -- all these result in unhappy customers and, often, a barrage of customer complaints. And planning for these post-launch issues should be part of the product lifecycle. This approach, called device reliability engineering (DRE), includes the engineering practices, infrastructures and tools that can be used to manage reliability at scale, post-launch.

With the inevitability of bugs and issues in the field, adopting three key DRE techniques can help teams everywhere derisk a product launch.

1. Comprehensive OTA

Think of over-the-air (OTA) updates, which are wireless deliveries of new software, firmware or other data to connected devices, as an insurance policy: Without OTA, the only remedy for a serious defect in the field is a product recall. With OTA, developers can push out fixes and keep devices operational.

An OTA system only works well if it is architected correctly, which includes thorough test coverage of the update path itself. Successful systems support cohorts, staged rollouts and firmware signing.

With cohorts, developers group their devices and update each group separately. Cohorts offer a simple way to test releases, enabling A/B testing and other kinds of experimentation. They also come in handy when working with multiple industrial customers that each want updates on a different schedule.
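One common way to implement cohorts is to derive a stable group from each device's serial number. The sketch below is a hypothetical illustration, not any particular vendor's API: it hashes the serial with FNV-1a so a device always lands in the same cohort across reboots and releases.

```c
#include <stdint.h>

/* Hypothetical sketch: assign each device to a stable update cohort by
 * hashing its serial number. FNV-1a is deterministic, so a device always
 * maps to the same cohort, which keeps A/B groups consistent over time. */
static uint32_t fnv1a_hash(const char *s) {
    uint32_t h = 2166136261u;           /* FNV offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}

/* Returns a cohort index in [0, num_cohorts). */
int cohort_for_device(const char *serial, uint32_t num_cohorts) {
    return (int)(fnv1a_hash(serial) % num_cohorts);
}
```

Hashing rather than, say, round-robin assignment means no central registry of cohort membership is needed; the device and the backend can both compute it.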

Staged rollouts offer the ability to push new updates to the device fleet incrementally. Because every release introduces risk, rolling out updates incrementally limits the blast radius of any new issue and can prevent a problem from impacting all customers at once. Ideally, reported issues feed back into the OTA system automatically; if none surface, the system increases the rollout percentage on its own until the update reaches the entire fleet.
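A staged-rollout check can also be built on a stable hash. In this hypothetical sketch, each device hashes its serial into a bucket from 0 to 99 and becomes eligible once the configured percentage exceeds its bucket, so raising the percentage only ever adds devices and never drops any.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical sketch: a stable staged-rollout gate. Buckets never
 * change for a given device, so increasing rollout_percent from 1 to
 * 100 monotonically grows the set of eligible devices. */
static uint32_t fnv1a_hash(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

/* True if this device is inside the current rollout percentage. */
bool update_eligible(const char *serial, uint8_t rollout_percent) {
    return (fnv1a_hash(serial) % 100u) < rollout_percent;
}
```

At 0% no device is eligible, at 100% every device is, and each intermediate step includes roughly that fraction of the fleet.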

Firmware signing is a method that proves a file was created by a trusted source and hasn't been tampered with. It does this by creating a verifiable signature for a file. By implementing signature verification in a bootloader, developers can identify the authenticity of a given firmware update, and the bootloader can decide to either warn the user, void the device's warranty or simply refuse to run an unauthenticated binary.
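The bootloader's verify-then-boot decision can be sketched as follows. Note that this is a simplified stand-in: a real implementation verifies an asymmetric signature (e.g. Ed25519 or ECDSA, with only the public key stored on-device), whereas here a plain FNV-1a digest plays the role of the signature so the control flow stays short and runnable.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Simplified sketch of a bootloader authenticity check. A plain digest
 * is NOT a real signature -- it proves integrity, not origin -- but the
 * decision flow (recompute, compare, refuse to boot on mismatch) is the
 * same shape as signature verification. */
static uint32_t fnv1a_digest(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) { h ^= data[i]; h *= 16777619u; }
    return h;
}

typedef struct {
    const uint8_t *image;   /* firmware binary */
    size_t len;
    uint32_t digest;        /* "signature" appended at build time */
} fw_slot_t;

/* The bootloader refuses to run an image whose check fails. */
bool firmware_is_authentic(const fw_slot_t *slot) {
    return fnv1a_digest(slot->image, slot->len) == slot->digest;
}
```

On a mismatch, the bootloader would then take one of the policies described above: warn the user, flag the device or decline to boot the image.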

2. Performance metrics

Post-launch, IoT developers need access to hard data to monitor the health of the device fleet.

The five most useful metrics are connectivity, battery life, memory usage, sensor performance and system responsiveness. The system collecting these metrics needs to have three essential characteristics:

  1. Low overhead. Collecting metrics shouldn't impact device performance.
  2. Easy to extend. When teams decide to add a metric, it can't require three different teams to collaborate.
  3. Privacy preserving. Given the regulatory landscape in California, Europe and other locations where a device may be used, privacy protections must be built in from the start.
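A common low-overhead pattern is a fixed "heartbeat" record: counters live in a small struct, are bumped with single increments, and the whole struct is shipped as one blob each interval. The field names below are illustrative assumptions, not a prescribed schema.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a heartbeat metrics record. No allocation and
 * no on-device string formatting, so recording a metric costs one
 * increment -- negligible overhead on a constrained device. */
typedef struct {
    uint32_t uptime_s;
    uint32_t wifi_disconnects;
    uint32_t flash_writes;
    uint16_t battery_mv;
    uint16_t max_heap_used;
} heartbeat_metrics_t;

static heartbeat_metrics_t g_metrics;

void metrics_record_disconnect(void)  { g_metrics.wifi_disconnects++; }
void metrics_record_flash_write(void) { g_metrics.flash_writes++; }

/* Copy the record out for transmission, then reset for the next interval. */
void metrics_flush(heartbeat_metrics_t *out) {
    *out = g_metrics;
    memset(&g_metrics, 0, sizeof g_metrics);
}
```

Extending the system is then a matter of adding a field and an increment call, which keeps the "easy to extend" requirement cheap in practice.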

Typically, there are two primary use cases for these metrics. The first is looking at device metrics. By collecting different data points for individual devices, developers can investigate specific reports of devices misbehaving, either through customer support or engineering teams. Organizations should be able to capture device timelines so, when a customer calls with a battery life complaint, the customer support team or engineers can quickly see operational correlations, such as battery use and writing to flash. A strong metric system makes that possible.

A key thing to remember about capturing performance metrics is that it can be done asynchronously, an especially critical feature for limited-connectivity devices. Beyond individual metrics, there should be some level of aggregation and dashboards to give an indication of overall fleet performance and a way to quickly identify data trends.

The other main use case is alerting. A metrics system should support configurable alerts: Set up the system to send notifications to email, IM or incident management platforms when certain conditions are met. Rather than waiting for someone to look at the charts, alerts bring issues to the team's immediate attention.
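An alert condition often looks at the fleet as a whole rather than a single device. The sketch below is a hypothetical server-side rule: fire when more than a configured fraction of devices report a metric above a per-device threshold.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical sketch: evaluate a fleet-level alert rule against the
 * latest reported values of one metric (e.g. battery millivolts drawn,
 * disconnect counts). */
typedef struct {
    uint32_t threshold;     /* per-device metric threshold */
    double   max_fraction;  /* fire if more than this fraction exceed it */
} alert_rule_t;

bool alert_should_fire(const alert_rule_t *rule,
                       const uint32_t *values, size_t n) {
    size_t exceeding = 0;
    for (size_t i = 0; i < n; i++)
        if (values[i] > rule->threshold) exceeding++;
    return n > 0 && (double)exceeding / (double)n > rule->max_fraction;
}
```

Rating by fraction rather than absolute count means the same rule stays meaningful as the fleet grows from hundreds to hundreds of thousands of devices.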

3. Remote debugging

Consider the various steps involved in traditional debugging. Typically, it starts with several reports of different customer issues; all may stem from the same problem, but they aren't described similarly enough for engineering teams to know. The support team answers the phone or responds to individual emails. Eventually, the organization gathers enough feedback to ask customers to manually collect a couple of different logs. With this data, teams get devices back into the lab and then hand findings off to the engineers. It's time-intensive and expensive.

Remote debugging converts this long and expensive process into something automated that happens much faster. Devices report issues automatically, feeding them into a cloud pipeline that analyzes the data, deduplicates and collates reports into distinct error instances, and shares those reports with engineering.
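The deduplication step typically works by fingerprinting each crash: two reports are treated as the same error instance when their backtraces hash to the same signature. This is a hypothetical sketch of that fingerprinting, assuming backtraces arrive as arrays of return addresses.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: compute a stable signature for a crash report
 * from its backtrace (a list of return addresses). The pipeline groups
 * incoming reports by this signature and counts occurrences per group,
 * turning thousands of raw reports into a short list of distinct bugs. */
uint64_t trace_signature(const uint32_t *frames, size_t n) {
    uint64_t h = 1469598103934665603ull;      /* 64-bit FNV-1a basis */
    for (size_t i = 0; i < n; i++) {
        h ^= frames[i];
        h *= 1099511628211ull;                /* 64-bit FNV prime */
    }
    return h;
}
```

Production systems usually symbolize the addresses first so that the signature survives rebuilds, but the grouping logic is the same.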

Core dumps are a standard debugging technique: automatic, detailed diagnostics captured whenever an issue occurs. They include logs, backtraces and memory contents, giving engineers the information they need to resolve the problem. Developers must collect the diagnostic data, upload it and build some way to inspect it.
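On-device, capture can be as simple as serializing a small record into a RAM region that survives reboot, to be uploaded on the next connection. The layout below is an illustrative assumption; real register capture is architecture-specific (on Arm Cortex-M, for example, it happens in the fault handler).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a minimal core dump record -- magic number,
 * fault registers and a slice of the stack -- written at fault time
 * and uploaded later. */
#define COREDUMP_MAGIC 0x434F5245u  /* "CORE" */

typedef struct {
    uint32_t magic;
    uint32_t pc;            /* program counter at fault */
    uint32_t lr;            /* link register at fault */
    uint32_t stack_words;   /* number of stack words that follow */
    uint32_t stack[16];
} coredump_t;

void coredump_capture(coredump_t *dump, uint32_t pc, uint32_t lr,
                      const uint32_t *stack, uint32_t n) {
    if (n > 16) n = 16;
    dump->magic = COREDUMP_MAGIC;
    dump->pc = pc;
    dump->lr = lr;
    dump->stack_words = n;
    memcpy(dump->stack, stack, n * sizeof(uint32_t));
}
```

The magic number lets the boot code detect a valid dump after reset and queue it for upload rather than treating the region as garbage.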

Connectivity has transformed device development by extending the product lifecycle. Before, shipping a product usually meant never interacting with it again. Now, the product lifecycle continues well after a product has been sent out into customers' hands.

By adopting DRE techniques, developers can derisk a product launch, prepare for the inevitability of post-launch issues and deliver a continuously improving, higher-quality product overall.

About the author
François Baldassari is founder and CEO of Memfault, a connected device observability platform provider. An embedded software engineer by trade, Baldassari's passion for tooling and automation in software engineering drove him to start Memfault. Previous to Memfault, Baldassari led the firmware team at Oculus and built the OS at Pebble. Baldassari has a B.S. in electrical engineering from Brown University.
