Cellular IoT Devices Fail; Make Sure They Fail Gracefully

There is a sort of paradox built into big cellular IoT deployments. The instant you send your devices into the field, you lose a lot of control over performance. Now and then, at least, connections in your cellular IoT devices will fail. You can’t control what people do with your device, and some of that behavior will disrupt any network connection.

Speaking of networks, there’s another thing you can’t control. The last big, global survey of mobile network operators (MNOs) on the subject of failure is from 2016. But even old news suggests the limits of network infrastructure. Back then, 30 percent of responding MNOs said they had network outages and service problems up to three times a year. Even more, 34 percent, admitted to more than 15 outages or “service degradations” per year. More recently, J.D. Power’s 2022 U.S. Wireless Network Quality Performance Study (Volume 2) found the proliferating cellular devices of 2022 leading to more widespread problems with “network quality.”

Cellular connectivity is great for IoT, and it should get better and better as new technologies like 5G New Radio go mainstream. But it probably won’t ever be 100 percent reliable. You can and should design devices for consistent connectivity. Still, when your products hit the market, expect the unexpected. Cellular IoT designers can’t prevent every connectivity failure, but they can program devices to fail gracefully. Here’s a glimpse of what that might look like.

Designing for Graceful Failure

In the context of cellular IoT connectivity failure, what do we mean by graceful? Four things, really. Let’s look at each of these goals in turn. 

#1: Fail-Over Connectivity

As we mentioned, networks sometimes fail. But where one network fails, a well-designed device can fail over to a backup. Depending on the device, you may include fail-over modes that shift to WiFi, satellite, or just another cellular network. A successful fail-over will keep your device operating until it can reconnect to the primary network. But redundant connections can fail, too, and that can be dangerous. Picture smart traffic lights at a busy intersection, for example. That’s why you must also program firmware for another layer of protection, which brings us to the next item on our list.
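As a rough illustration of the fail-over idea, the sketch below tries a list of links in priority order and reports which one connected. The link names and the `connect` callables are hypothetical stand-ins for whatever modem or radio drivers a real device would use; this is a minimal sketch under those assumptions, not a production connection manager.

```python
from typing import Callable, Optional, Sequence, Tuple

# A link is a (name, connect) pair. connect() returns True on success.
# Both are hypothetical placeholders for real driver calls.
Link = Tuple[str, Callable[[], bool]]


def connect_with_failover(links: Sequence[Link]) -> Optional[str]:
    """Try each link in priority order; return the name of the first that connects."""
    for name, connect in links:
        try:
            if connect():
                return name
        except OSError:
            # Treat a driver-level error the same as a failed attempt
            # and move on to the next link.
            continue
    # Every link failed. The caller should switch to its default failure mode.
    return None
```

A caller might pass something like `[("primary_cellular", ...), ("backup_cellular", ...), ("wifi", ...)]`. A return value of `None` is the signal that even the redundant connections are down.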

#2: Default Failure Modes

Your firmware must include instructions on what to do when connections are hopelessly lost. In the traffic light example, devices might default to a standard, non-smart but serviceable pattern that keeps cars from colliding in the intersection. Safe failure modes will look different from one device to another. The trick is to anticipate real-world scenarios and design basic device behavior that keeps users safe until a connection can be re-established.
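One way to picture a default failure mode is a small watchdog: if the device has not heard from the network for too long, it drops into a fixed, pre-programmed behavior and returns to smart operation once contact resumes. The class name, the mode names, and the 30-second limit below are illustrative assumptions, not values taken from any real traffic controller.

```python
from enum import Enum


class Mode(Enum):
    SMART = "smart"        # timing driven by data from the network
    FALLBACK = "fallback"  # fixed, pre-programmed cycle


class TrafficLightController:
    """Hypothetical controller that falls back to a fixed cycle when offline."""

    def __init__(self, offline_limit_s: float = 30.0) -> None:
        self.offline_limit_s = offline_limit_s  # assumed threshold
        self.last_contact_s = 0.0
        self.mode = Mode.SMART

    def on_server_contact(self, now_s: float) -> None:
        # Any successful exchange with the network restores smart mode.
        self.last_contact_s = now_s
        self.mode = Mode.SMART

    def tick(self, now_s: float) -> Mode:
        # Called periodically. Switch to the safe default after too long offline.
        if now_s - self.last_contact_s > self.offline_limit_s:
            self.mode = Mode.FALLBACK
        return self.mode
```

The important design choice is that the fallback decision depends only on a local clock, so it still works when every connection is down.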

#3: Preventing Cascading Network Problems

Poorly programmed IoT devices are persistent if nothing else. They don’t just make mistakes; they repeat them to the point of disaster. Say a smart thermometer is programmed to send more frequent notifications the higher the temperature rises. Then, say the sensor breaks and the system interprets the lack of signal as a temperature of infinity. That device could start sending notifications every second; it could send so much data that the network becomes congested. Then other devices might start repeatedly re-sending their own failed transmissions. In the end, that single runaway device could cause a signal storm that brings down the whole network.  

That’s a problem for connectivity platforms, firmware designers, or both. Somewhere in the system, devices need a rate limit on data. With that limit in place, no matter what triggers a burst of traffic, a single device can’t escalate it into cascading network failure.
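A device-side rate limit can be as simple as a token bucket, a common general-purpose technique (the article does not prescribe a specific one). In this sketch, each message costs one token, and tokens refill at a fixed rate up to a cap. The class name and parameters are assumptions chosen for illustration.

```python
class TokenBucket:
    """Minimal token-bucket limiter: each sent message costs one token."""

    def __init__(self, rate_per_s: float, capacity: float) -> None:
        self.rate_per_s = rate_per_s  # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start with a full bucket
        self.last_s = 0.0

    def allow(self, now_s: float) -> bool:
        """Return True if a message may be sent at time now_s (seconds)."""
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last_s = now_s
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: drop or queue the message
```

With a limiter like this in front of the radio, the broken thermometer from the example could still ask to send a notification every second, but only a bounded number would ever reach the network.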

#4: Graceful Recovery

Finally, devices must reconnect to the network without tearing it to the ground. The real risk, following a network outage, is the signal storm. If 100,000 devices try to reconnect to the network at once, you’ll have a congestion problem that could start a traffic fiasco all over again. The simplest way to ensure graceful recovery is to program reconnection attempts with an exponential back-off. A device can try to reconnect. If the connection doesn’t work the first time, it can try again. But between each attempt, there’s an exponentially increasing buffer of wait time. That helps to prevent network collisions that lead to signal storms.
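The back-off logic described above can be sketched as follows. The random jitter is an addition beyond what the text describes: it is a widely used refinement that keeps a large fleet of devices from retrying in lockstep. The function names, the 1-second base, the 300-second cap, and the `connect` and `sleep` callables are all hypothetical choices for illustration.

```python
import random
from typing import Callable, Optional


def backoff_delay_s(attempt: int, base_s: float = 1.0, cap_s: float = 300.0,
                    rng: Optional[random.Random] = None) -> float:
    """Wait time after failed attempt number `attempt` (0-based).

    The ceiling doubles with each attempt, up to cap_s. The actual delay is
    drawn at random below that ceiling (jitter, an assumption added here).
    """
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def reconnect(connect: Callable[[], bool], sleep: Callable[[float], None],
              max_attempts: int = 8) -> bool:
    """Call connect() until it succeeds, backing off between attempts."""
    for attempt in range(max_attempts):
        if connect():
            return True
        if attempt < max_attempts - 1:
            sleep(backoff_delay_s(attempt))
    return False  # give up for now; the caller decides what happens next
```

On a real device, `sleep` would typically be `time.sleep` or a low-power timer. Passing it in as a parameter keeps the sketch easy to test without actually waiting.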

Include Experts

Of course, we can’t stress enough how different every IoT deployment is. The examples we discuss above may or may not apply to your project. The best thing you can do to create cellular IoT products that fail gracefully is to confer with proven IoT experts. Get started with a list of IoT services every product creator needs. You may not be able to prevent the occasional connection failure, but you can control your device’s response—and limit the impact on users. That’s more than good design. It’s downright graceful.
