Today I am happy to announce the latest addition to the Sternum platform: an anomaly detection engine that uses AI to automatically identify unexpected problems, alarming trends, and unusual behaviors.
Awareness of such anomalies brings early attention to emerging security and performance issues, allowing them to be addressed before they impact the device, the user or the deployment environment.
In this blog post, I'll talk about how these new capabilities complement our existing observability offering and how, together, they can dramatically improve your ability to manage high-volume fleets without losing sight of any individual device.
Peering Inside the ‘Black Box’
Any number of fleet visibility/observability tools provide basic information on devices deployed in the field, such as where a device is located (geo-IP), what it's connecting to (ports), and what it is running (firmware versions). These tools, however, don't tell you much about what's going on inside the individual device, treating it as a sort of "black box". And lacking device-level visibility, they also fall short when it comes to spotting emerging issues before they manifest as full-blown incidents.
At Sternum, we already solved the "black box" problem with the introduction of our Observability SDK, a lightweight solution, purpose-built for IoT, that can be easily customized to trace any metric from your devices.
For instance, if you manufacture infusion pumps, you can monitor pressure to ensure it doesn't reach dangerous levels. Or, if you produce industrial controllers, you can track temperature readings to ensure the hydraulics are not compromised, and so on.
Access to such granular device-level information is invaluable. Yet all this data can also become overwhelming.
The most common way of dealing with this is to set threshold rules for different alert conditions. This helps narrow the focus, but setting up alerts can prove tedious, forcing you to constantly balance what's "noisy" against what's "important". Moreover, no matter how meticulously configured, no manual setup can ever account for every possible scenario.
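To see why manual thresholds scale poorly, consider a minimal sketch of rule-based alerting (hypothetical metric names and bounds, not the Sternum API): every metric needs hand-tuned limits, and anything without a rule is a blind spot by construction.

```python
# Hypothetical manual threshold rules: each metric needs its own
# hand-tuned bounds, and every new device type or firmware revision
# may require re-tuning all of them.
THRESHOLDS = {
    "pump_pressure_psi": (5.0, 45.0),      # assumed safe operating range
    "board_temp_c": (0.0, 70.0),
    "update_requests_per_hour": (0, 10),
}

def check_thresholds(metric: str, value: float):
    """Return an alert message if the value falls outside its manual bounds."""
    bounds = THRESHOLDS.get(metric)
    if bounds is None:
        return None  # unmonitored metric: a blind spot by construction
    low, high = bounds
    if not (low <= value <= high):
        return f"ALERT: {metric}={value} outside [{low}, {high}]"
    return None
```

Every rule here had to be chosen by a human, and any scenario the rules don't anticipate passes silently.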
And this is exactly where our new anomaly detection capabilities step in, leveraging automation for a better way of managing alert data at scale.
The process starts with a short learning period, usually only 48 hours, during which the AI engine uses telemetry from the SDK to generate a baseline profile for each device in your fleet, defining exactly how that device is expected to function and how it interacts with (and is interacted with by) users and other systems.
Note that this learning requires no additional action or configuration changes on your end. As soon as you connect your device to our platform, the AI automatically starts building the profile from the metrics you're tracing, which also has the side benefit of ensuring it focuses only on the metrics that matter most.
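Conceptually, the learning phase boils down to turning raw telemetry into per-device statistics. The sketch below (illustrative only, not Sternum's actual engine) builds a simple baseline of how often each traced event occurs per hour during the learning window:

```python
# A minimal, assumed model of baseline learning: record how often each
# traced event occurs per hour, then store mean and standard deviation
# per event for the device.
from collections import Counter, defaultdict
from statistics import mean, pstdev

def build_baseline(hourly_event_logs):
    """hourly_event_logs: one list of event names per hour of the window.
    Returns {event_name: (mean_count_per_hour, stddev)}."""
    all_events = {e for hour in hourly_event_logs for e in hour}
    counts = defaultdict(list)
    for hour in hourly_event_logs:
        c = Counter(hour)
        for event in all_events:
            counts[event].append(c.get(event, 0))  # 0 if absent that hour
    return {e: (mean(v), pstdev(v)) for e, v in counts.items()}
```

Note how no configuration is needed: the profile is derived entirely from whatever metrics the device already emits.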
Once the baseline profile is outlined, the anomaly detection kicks in, constantly tracking the device and automatically highlighting any significant deviations such as:
- Communication pattern violations
- Abnormal presence of an event or several events together
- Absence of an event (e.g., an unfulfilled update request)
- Abnormal number of events (e.g., update requests)
- Unusual variable values (e.g., an unfamiliar connection to an internal IPC)
- A new combination of several variables
- Sequence violations (e.g., command execution without authentication)
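To make a few of these anomaly classes concrete, here is a simplified sketch (again, illustrative rather than the actual engine, which covers more classes such as sequence and combination violations) of checking a new observation window against a learned baseline:

```python
# Simplified anomaly checks against a baseline of {event: (mean, stddev)}:
# abnormal event counts, absent expected events, and never-seen events.
from collections import Counter

def detect_anomalies(baseline, window_events, k=3.0, min_band=1.0):
    """window_events: list of event names observed in the current window."""
    anomalies = []
    counts = Counter(window_events)
    for event, (mu, sigma) in baseline.items():
        n = counts.get(event, 0)
        band = max(k * sigma, min_band)  # floor avoids zero-width bands
        if n > mu + band:
            anomalies.append(f"abnormal count of '{event}': {n}")
        elif mu >= 1 and n == 0:
            anomalies.append(f"expected event absent: '{event}'")
    for event in counts:
        if event not in baseline:
            anomalies.append(f"never-seen event: '{event}'")
    return anomalies
```

The key point is that none of these checks were configured by hand; they all fall out of the learned profile.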
Exposing these allows anomaly detection to act as an ‘extra set of eyes’, showing you things you would otherwise miss or wouldn’t even think to expect.
In our UI, each identified anomaly becomes an alert, which can be viewed in correlation with all other events for broader context.
Furthermore, every alert also comes with its own drill-down investigation view that offers additional details. For instance, the screenshot below shows an investigation of a ‘Failed Update’ event. If missed, this could leave the device exposed to security and operational issues.
Another example, below, shows a communication issue auto-identified by the anomaly detection engine. Left unspotted, it could have gone unnoticed for a while, potentially escalating into a much bigger problem.
Learning from the Unexpected
The above, of course, are just a few examples collected during the first week of the feature rollout. We expect to encounter many more in the coming months. After all, the number of potential breakage scenarios is practically endless.
And yet, the value of anomaly detection goes beyond proactive troubleshooting and rapid root cause analysis. To me, every encounter with an unexpected behavior (a new error type, a blind spot you didn't know about, a usage spike, etc.) also promotes a deeper understanding of the device and its functions. This, in turn, fuels future design decisions, offering new ideas for innovation.
Curious to learn more? Want to see our anomaly detection in action? Visit here to get your free demo.