Zephyr OS and On-Device Runtime Protection – Embedded Open Source Summit

Natali Tshuva

16 min read | 17/07/2023


Transcript

Natali Tshuva: Thank you everyone for joining me today. I’m Natali, the CEO and cofounder of Sternum, and we are here to talk about security, especially in the context of embedded devices, and how hackers perceive embedded device vulnerabilities and opportunities for exploitation.

So it’s going to be interesting and we are going to also do a bit of a deep dive into the technical aspects of exploitations and also integrating security solutions.

While the talk will focus on the Zephyr operating system, it is relevant to embedded Linux and real-time operating systems in general.

So I just like this quote, especially the beginning. “The challenge for connected device manufacturers is to strike a delicate balance between security, reliability, scalability.” I think more importantly performance and being deterministic in the way they operate, right? They cannot fail.

A little bit about myself. So I grew up studying computer engineering, started when I was 14, which paved the way to Unit 8200, the Israeli Intelligence Unit. This is where I actually got my cybersecurity education and I was an expert in finding zero-day vulnerabilities in Linux kernel especially, but also Wi-Fi drivers, bootloaders and Android operating systems.

Finding the vulnerabilities is the part that most people focus on but actually the exploitation was the most interesting part of the job because taking a vulnerability and then creating an exploit chain where you can actually get remote access to devices or assets and then gain intelligence is really challenging even more than finding the vulnerable piece of code.

So this was really what I did, and I then led research and development teams until cofounding Sternum, on a mission to secure and bring visibility into the most embedded, resource-constrained, mission-critical systems. I’m talking about PLCs, medical devices and gateways, everything you can think of.

Here I am talking to you today. We are working with leaders across the globe, we have raised $40 million to date and we’re happy to be one of the sponsors of the Zephyr Foundation.

So I want to start with just an overview of how we protect applications and the different security layers used in the industry in general, not specifically in the embedded industry. We need to take a look at three main layers: the operating system and infrastructure, the user space where we actually deploy applications, and the application itself, our code, our program.

So in the infrastructure itself, we see techniques like memory isolation, segmentation, stack canaries. We see secure APIs with the user space, secure boot and over-the-air updates, hardware security in some places. That’s really the infrastructure we are given to secure our systems.

Then comes user space security. In the user space, we also have stack canaries and other compiler flags to make our code more secure. We also usually have continuous monitoring in modern systems, endpoints, servers. We monitor them continuously to see what’s going on with the user space in real time.

We have permissions and policies to manage it, to manage processes and we have endpoint protection that is protecting the entire user space.

Then there is application security, and I’m talking about what we’re doing during development of the application and what we’re doing post-production. So usually during development, we will see static analysis tools, software composition analysis tools, dynamic analysis, and best practices from an engineering perspective like implementing encryption, data protection, user management.

In post-production, once the device, be it an endpoint, a cloud server and so on, is deployed, we usually see some endpoint protection, something running on the asset preventing attacks in real time. We usually see some zero-day attack prevention and detection, focusing not only on known vulnerabilities but also on the next threat, what’s called a zero-day vulnerability, one that is not yet known. We usually have real-time alerts. Something is going on, we immediately get alerts. Actually the situation today is that we have too many alerts in the traditional industry and that’s a problem as well. And we have application performance monitoring and other tools to just monitor our own application.

So that’s the standard layers used today in the industry but let’s take a look in the embedded industry status for a second. So from an infrastructure perspective and especially if we’re talking about SAFERTOS or Zephyr, then some of the capabilities exist. Trust zone, OTA, secure boot, memory isolation.

When we go a step higher to the user space security, sometimes we have permissions and policies, usually depending on how much the engineers invested in them. You have stack canaries if you compile your code with them and if the third-party code provided to you comes with that security mechanism.

Continuous monitoring and endpoint protection is usually absent. In the application security stage, dynamic analysis is missing during development, which means that we miss complex bugs, memory leaks and so on.

In post-production, we’re actually missing most of the capabilities. We’re missing zero-day prevention, endpoint protection, continuous monitoring, real-time alerting. So we are really left blind in the post-production scenario on our own application.

This means that vulnerabilities that attack the application or the user space, zero-day vulnerabilities and advanced threats cannot be handled by the existing tools.

So what happened to real-time monitoring and protection? Why is it so difficult to take these standard tools from the industry and embed them into our own applications?

So of course IoT and embedded security is very different, from various perspectives. One is the deterministic nature. Endpoints and servers have user interfaces. You can surf the web. You can download applications. You can get phishing emails.

So the attack surface is really different. In the IoT and embedded space, the nature of the system is deterministic. It’s predictable. There are minimal input channels, which actually means that most of the threats are software vulnerabilities, not phishing attempts and the things that antiviruses are designed to handle. The different attack landscape means different solutions.

Third, limited available resources: compute, memory, battery. So when you hear about integrating a new solution, usually the first thing someone asks, especially an engineer, is what the overhead is. How difficult is it to integrate and how much CPU and memory is it going to take?

Those resources are so limited in some of the systems that integrating something so advanced seems very difficult. The diversity is also a difficulty. If you build an endpoint protection, usually you protect Windows or Linux devices or servers. If you want to protect embedded devices, you will need to support more than 100 different operating systems.

How do you do that when most solutions use the kernel and the operating system mechanisms to protect? So that’s very difficult where the operating systems are changing and diversified.

So what do we see today in terms of security? So we see that OS security features cover the first layer. I mentioned that Zephyr has some great security features that give you the infrastructure to build secure applications.

We see some best practices used in the industry, static analysis, software bill of materials, encryption, vulnerability management. A lot of passive or post-attack tools that help us patch or understand the software that is within our devices.

But what are the gaps? So first, static analysis tools miss around 50 percent of the vulnerabilities. So instead of going out with 50 bugs per thousand lines of code, you’re going out with 7. This is still a lot when you have a major code base.

SBOM and vulnerability management take care of public, well-known vulnerabilities. They do not handle zero-days. And around 50 percent of exploitations are exploiting zero-days. None of those tools can really handle that.

Patching takes time and money. Patching is not easy in the embedded space. It’s not like updating your iPhone or mobile device. In some cases, if you are regulated, it can take six months. Six months is a lot of time to be exposed to a one-day vulnerability.

Encryption. Sometimes people think if they encrypt the data, they are secure. But when an endpoint gets attacked, the data on the endpoint is decrypted. If someone is hacking your mobile, they can read your texts, right? Because you can read your texts.

So encryption really secures the data during transit. It cannot prevent software vulnerability exploitation and it cannot protect your data if you are exploited.

No real-time application security and monitoring in most cases. OS memory protection does not prevent memory vulnerabilities in user space and in your own application.

What I mean by that is sometimes I hear people think that if Zephyr has stack protection or memory separation, they are not vulnerable to, let’s say, a heap overflow. That’s not true. If you have a heap overflow in your own code, Zephyr cannot prevent it, and we see a lot of those.
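
As an illustration of this point (not part of the talk), here is a minimal Python simulation of an overflow inside an application's own heap. The arena layout, the `NAME_OFF` and `FLAG_OFF` offsets and the `unchecked_copy` helper are all hypothetical; the sketch only shows that when two objects sit next to each other in memory the application itself owns, a copy without a bounds check corrupts the neighbor, and isolation between the application and the kernel has nothing to say about it.

```python
# Simulated heap: one flat arena shared by all of the application's allocations.
arena = bytearray(32)

NAME_OFF, NAME_LEN = 0, 8   # hypothetical 8-byte "device name" buffer
FLAG_OFF = 8                # hypothetical 1-byte "is_admin" flag stored right after it


def unchecked_copy(dst_off: int, data: bytes) -> None:
    """Copy like a C memcpy with an attacker-controlled length: no bounds check."""
    arena[dst_off:dst_off + len(data)] = data


arena[FLAG_OFF] = 0                              # device starts unprivileged
unchecked_copy(NAME_OFF, b"A" * NAME_LEN + b"\x01")  # 9 bytes into an 8-byte buffer
is_admin = arena[FLAG_OFF]                       # the neighbor was silently overwritten
```

Every byte written here stays inside memory the application legitimately owns, which is why a protection boundary drawn around the application does not notice the corruption.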

The result is that embedded endpoints are far behind and blind when they go to production.

So we have no monitoring and protection in real time, and that leads to everything that we all saw in the past few years. The reason security is such a hot topic right now is that we’re seeing more and more attacks, and not just attacks. When we’re not monitoring our applications in real time, in the real environment where they operate, we also see recalls, problems, battery issues, communication issues, customer calls and complaints.

It’s hard to debug when you don’t have real-time visibility and logs into what’s going on. So the blindness leads not only to security issues but also to performance and quality issues.

This is a nice video. I want to take a few seconds to look at it: exploiting a device that is actually implementing all of the best practices in the market.

[Video plays]

Natali: It’s hard to hear. So what you’re seeing on the screen, because it’s hard to hear, is two people from, I think, Germany exploiting a Cisco business router remotely via a buffer overflow vulnerability and actually getting root access to the device.

This exploit is actually available publicly. There might be some businesses out there that are still unpatched. The router uses encryption and all of the best practices, and still a simple programming bug enabled them to exploit the device.

The reason I am showing that is that actually vulnerabilities are inevitable and endless. We can’t really release products without vulnerabilities and it’s not me saying so. It’s just the numbers and the statistics.

Seventy percent of what Microsoft patches on Patch Tuesday, its regular Tuesday security release, is due to memory vulnerabilities. So if Microsoft has memory vulnerabilities in every release, how can we produce products without vulnerabilities? We have 2,000 new CVEs each month.

Fifty vulnerabilities per thousand lines of code, as I mentioned. And take third parties, for example: Urgent/11, BlueBorne, Amnesia, Ripple20. All of them are in publicly available open source libraries that were scanned by multiple static analysis tools, and all of the tools missed those vulnerabilities.

If you have a third party with a vulnerability, static analysis misses it. You need to go out and release a patch. So the only reasonable conclusion from a hacker’s perspective is that every device I’m going to look at, every library I’m going to look at, TCP/IP, Bluetooth, I’m going to find a vulnerability and honestly, this is how we felt when we looked for vulnerabilities. There is no way we’re not going to find one.

This is why I personally don’t believe that we should target vulnerability-less devices, because that’s just not possible. There are also many attack vectors. So how do hackers actually get into a device? I want to show you this simple draft. It shows really how many components can be part of the entry point for the hacker.

So first, inside the devices. All of them. It doesn’t matter if you’re a medical device or a smart plug. You have chips and modules within the device. Those have software within them, parsing packets, Wi-Fi packets, Bluetooth packets. This software can contain bugs and issues. You have third-party code.

So a medical device and a gateway can use the same third party. The Bluetooth library, the TCP/IP library: this is how hackers can gain access to multiple different devices and attack at scale, instead of investigating or researching just one device.

Then your device communicates with the outside. So there are protocol vulnerabilities. Bluetooth, not parsing user input properly, connecting to a mobile application, network vulnerabilities and so on. So every input that the hacker can control, a packet that is coming from the outside to the inside, a USB connection to the device, everything like that is an attack vector.

If you are completely not connected and you don’t have any physical connections as well, then nobody can hack you even if you have vulnerabilities.

So a vulnerability in order to be exploited has to be exposed and triggered remotely.

So the situation is – put it simply, as defenders, we have to stop all of the vulnerabilities and as attackers, we only need to find one. So usually hackers are the Messi in the game and they usually score even if you stop 99 percent of the problems.

So let’s take a look at the video that we’ve seen and break down the attack very simply. There is a hacker on the internet. There is a CVE memory corruption exploit publicly available. The hacker downloads the exploit, finds an exposed router and takes over the router or a VPN gateway completely.

From that point on, even if the router itself is not interesting, the hacker has a way into the full enterprise network exposed behind the router. So they can perform lateral movement, change controls, disrupt service with ransomware and so on.

At this point, after we are already attacked, we have limited options. React, patch, try to stop, try to detect as fast as possible and we don’t have data.

Another example, Sternum actually disclosed multiple vulnerabilities in the past few months. In Zyxel devices, in Belkin devices, in QNAP devices and this is actually a critical vulnerability we discovered in Wemo smart plug.

The company actually said that they were not going to patch because the device was three years old. So that was a problem for some consumers, and now Belkin has decided that they are going to patch. But the vulnerability itself was again memory corruption.

So Belkin is using all best practices, static analysis, encryption but still finding a memory corruption was very much possible and it was – it enabled us to exploit remotely and gain a complete takeover on the device. You can read the full research on our website. It’s actually very interesting including firmware extraction and how you go into researching a device.

So the current approach, as we already understood, is reacting. Patching is reactive and costly. Encryption, and this is a quote that I like by Adi Shamir, the S in RSA, one of the inventors of one of the most widely used encryption algorithms.

“Usually there are much simpler ways of penetrating the security system than cracking the crypto.”

So nobody is going to crack the cryptographic algorithm if they can just find the buffer overflow and take over the device. That’s the easiest way to do it and we already discussed that.

So what can we do to protect against zero-day and unpatched vulnerabilities? So as you already understood by now, I’m not a big believer in preventing vulnerabilities. But the way of exploiting vulnerabilities actually has a unique fingerprint. When we come to exploit a buffer overflow, there are some things that we have to do.

So exploitation has – or the attack chain has four main steps. The first step is the vulnerability. The second step is weaponizing code, taking advantage of the vulnerabilities to manipulate the system behavior. So this is from Wikipedia.

To exploit a vulnerability, an attacker must have at least one applicable tool or technique that can connect to a system weakness. So exploitation is a piece of software, sequence of commands that takes advantage of a bug or vulnerability to cause unintended or unanticipated behavior.

What can we infer from it? First exploitation is software. It has a chain of commands to execute something on the system. Second, it connects to a specific system weakness. The way to exploit buffer overflow isn’t like exploiting a command injection. The vulnerability actually is enforcing a specific exploitation technique. I need different techniques for different vulnerability types.

Third, it must cause an unanticipated or unintended behavior. So if you remember, embedded devices are very deterministic. So if they are not behaving as intended, we can assume that there is something interrupting the legitimate behavior of the device.

So what does it mean to prevent exploitation, and how does zero-day prevention really work? We at Sternum call it the exploitation fingerprint. We have four patterns on that and it’s really about bringing the benefits of endpoint protection and RASP, runtime application self-protection, into the embedded space.

Some of this fingerprint includes for example preventing memory overflows, command injections, information leaks, manipulation of execution flow. Why are we focusing on those things? Because manipulation of execution flow means that someone is trying to manipulate the deterministic execution flow that can be calculated on each software to cause unintended behaviors.
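
To make the idea of a calculable execution flow concrete, here is a minimal Python sketch of a control-flow check. It is an illustration only and not Sternum's implementation: the function names, the `ALLOWED_TRANSITIONS` set and the `check_trace` helper are hypothetical, and a real system would derive the allowed transitions from the firmware image and enforce them on the device rather than on a recorded trace.

```python
# Hypothetical call graph, computed ahead of time for a deterministic firmware image.
ALLOWED_TRANSITIONS = {
    ("main", "read_packet"),
    ("read_packet", "parse_header"),
    ("parse_header", "handle_command"),
}


class ControlFlowViolation(Exception):
    """Raised when execution deviates from the precomputed call graph."""


def check_trace(trace):
    """Validate every caller -> callee step in an observed execution trace."""
    for caller, callee in zip(trace, trace[1:]):
        if (caller, callee) not in ALLOWED_TRANSITIONS:
            raise ControlFlowViolation(f"unexpected transition {caller} -> {callee}")
    return True


legit = ["main", "read_packet", "parse_header", "handle_command"]
hijacked = ["main", "read_packet", "parse_header", "spawn_shell"]
```

Calling `check_trace(legit)` returns `True`, while `check_trace(hijacked)` raises `ControlFlowViolation` at the step into `spawn_shell`, which is the kind of deviation from legitimate behavior the talk describes.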

If we can calculate the expected behavior, we can know in real time that there are deviations from the legitimate behavior. Memory corruptions. In order to exploit a memory corruption, you have to corrupt the memory. You have to overflow buffers for example.

So what if we monitor in real time on the memory to make sure there are no overflows and no corruptions in real time? So during development, we would be able to find memory leaks and overflows and in post-production, we would be able to prevent exploitations and so on and so forth.
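
As a rough sketch of what checking memory writes at runtime can look like, here is a small Python model of a buffer that validates every copy against its own bounds. This is an assumption-laden illustration rather than the product's mechanism: `GuardedBuffer`, `copy_in` and `OverflowDetected` are hypothetical names, and on a real device the equivalent check would sit around the native copy routines.

```python
class OverflowDetected(Exception):
    """Raised when a write would cross the end of the destination buffer."""


class GuardedBuffer:
    """A fixed-size buffer whose writes are checked against its bounds."""

    def __init__(self, size: int) -> None:
        self._data = bytearray(size)

    def copy_in(self, offset: int, src: bytes) -> None:
        # Reject the write before any byte is copied, so memory is never corrupted.
        if offset < 0 or offset + len(src) > len(self._data):
            raise OverflowDetected(
                f"write of {len(src)} bytes at offset {offset} "
                f"exceeds buffer of {len(self._data)} bytes"
            )
        self._data[offset:offset + len(src)] = src

    def snapshot(self) -> bytes:
        return bytes(self._data)


buf = GuardedBuffer(8)
buf.copy_in(0, b"12345678")   # fits exactly: allowed
```

A later `buf.copy_in(4, b"ABCDEFGH")` would raise `OverflowDetected` and leave the buffer unchanged. During development that exception is a bug report; in production it is the point where an exploit attempt gets stopped.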

So it really means the power flips, because now the hacker needs to find both an exploitation technique and a vulnerability of a type that is not protected, not just one specific vulnerability, that he can exploit while bypassing all the continuous monitoring, anomaly detection and deterministic protection embedded on the system.

That could be really tricky. So when I used to find stack overflows, for example, and the system used stack canaries, the finding would go to waste. I just couldn’t exploit it because the canary would prevent me from doing so. So I had to look for something else, more sophisticated.
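
For readers who have not seen a stack canary up close, here is a simplified Python model of the mechanism. It is a sketch under stated assumptions: the `Frame` class, its memory layout and the `unsafe_write` method are hypothetical stand-ins for a compiler-generated stack frame, where a secret value sits between the local buffer and the return address and is verified before the function returns.

```python
import secrets

CANARY_SIZE = 4


class Frame:
    """Simulated stack frame laid out as [ buffer | canary | return address ]."""

    def __init__(self, buf_size: int, canary: bytes = None) -> None:
        # A real canary is random per boot or per process; a fixed one can be
        # passed in to make the simulation reproducible.
        self.canary = canary if canary is not None else secrets.token_bytes(CANARY_SIZE)
        self.buf_size = buf_size
        self.memory = bytearray(buf_size) + bytearray(self.canary) + bytearray(b"RET0")

    def unsafe_write(self, data: bytes) -> None:
        """Write from the start of the buffer with no bounds check."""
        data = data[:len(self.memory)]   # keep the simulated frame a fixed size
        self.memory[:len(data)] = data

    def canary_intact(self) -> bool:
        """The check a protected function performs before returning."""
        stored = bytes(self.memory[self.buf_size:self.buf_size + CANARY_SIZE])
        return stored == self.canary


frame = Frame(8, canary=b"\xde\xad\xbe\xef")
frame.unsafe_write(b"A" * 8)            # fills the buffer exactly
ok_before = frame.canary_intact()
frame.unsafe_write(b"A" * 16)           # runs over the canary and return address
ok_after = frame.canary_intact()
```

To reach the return address, a linear overflow has to write through the canary first, so the check fails and the program can abort instead of jumping to an attacker-chosen address.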

So if you have stack protection and heap protection and command injection protection and you also have continuous monitoring and anomaly detection, it becomes really hard to do something to your system without you noticing or preventing.

So for example, we deployed our solution on some of the devices that we found to be vulnerable and actually we were able to stop the exploitation. So now using this exploitation fingerprint technique, instead of a complete takeover, notification was sent to the customer that his device is under attack including the line of code, memory addresses that were attacked, visibility into the bigger picture and device integrity and real-time operation is actually maintained.

Uptime and all the mission-critical – Wemo is not mission-critical device but other devices, it could be very significant and more importantly, no reaction required. Everything happens automatically.

So now really the question at this point is, “Sounds interesting. But integrating runtime protection and monitoring to an embedded device? This must be a nightmare and will take me a few months.”

So let me just walk you through the Zephyr integration with Sternum and how easy it is. It is just as easy to integrate with embedded Linux, FreeRTOS and other systems.

So first you just add our directory and a few lines to the CMakeLists of Zephyr. No other changes are necessary. Sternum runtime security will immediately auto activate and can be controlled directly from Kconfig. Lastly, you can define your own traces. We didn’t really discuss the Sternum solution but Sternum allows you to collect any kind of logs, events and data from your embedded system.

You can really easily define that on our system. Collecting button presses, battery status, error codes, logs, crash logs and then you can monitor those events in real time.

So it really takes a few minutes, maybe 10 minutes, to activate and maybe three days to decide to monitor some things. Then you get really four layers of security – memory protection, control flow integrity, antiexploitation on the operating system itself like command injections, monitoring for file operations and so on. Lastly, based on the data that we collect and you decide to collect, we apply automatic anomaly detection that is running from your specific customized data to detect specific anomalies including what was the expected range, what is the legitimate pattern and what is actually going on in real time.
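
The talk does not say which anomaly detection algorithm is used, so the following Python sketch substitutes a deliberately simple one: learn an expected range from baseline readings as mean plus or minus k standard deviations, then flag values outside it. The function names, the battery-current example and the choice of k = 3 are all assumptions made for illustration.

```python
from statistics import mean, stdev


def learn_range(baseline, k=3.0):
    """Derive an expected (low, high) range from baseline readings."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return (mu - k * sigma, mu + k * sigma)


def is_anomalous(value, expected):
    """Report whether a live reading falls outside the learned range."""
    low, high = expected
    return not (low <= value <= high)


# Hypothetical trace: battery current in mA reported by a device under normal use.
baseline_ma = [100, 102, 98, 101, 99]
expected = learn_range(baseline_ma)
```

With this baseline a reading of 101 mA is inside the expected range and 140 mA is flagged, and the alert can report both the observed value and the range it was compared against, in the spirit of the expected range and legitimate pattern mentioned above.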

The Sternum platform in general includes three components – embedded security which we already discussed, continuous monitoring which is about live remote monitoring and analysis powered by anomaly detection and AI, and business and operational insights that give you a complete view of many analytics and trends happening on your post-production fleets.

What it really looks like. So some simple examples. Our system is divided into security, business intelligence, continuous monitoring and compliance, plus some elements, or a workspace, that help you define events, monitors and traces and manage the fleet of devices.

The kind of things that you can view is really up to you but here are a few examples. Attack types, data sent by firmware version, reported errors, devices that are affected by recent CVEs. You can see zero days or attack attempts. You can see here it was prevented. You can respond. You can click on the alert and view the exact details including the IP addresses, the line of code that has the vulnerability.

You can see correlations between different events. You can see security anomalies. You can actually get crash reasons and crash reports from the devices and really so on and so forth, including downtime per customer. Some users want to monitor batteries because they cause a lot of pain to the organization.

What you don’t see on the screen is really our device view and debugging capabilities. When you go into each of those events, an anomaly, a crash, an attack, you really get a complete view of what’s going on in real time, and it can reduce solving issues from a 5-month process to 15 minutes.

So security and monitoring are actually tied together and this is why we’re doing that. Being able to collect any kind of data and learn the specific data from devices, not predefined values like CPU or memory, combined with on-device proactive protection, very similar to endpoint protection in other industries, only brought with less than three percent overhead and around seven percent increase in code size. You are able to really do a lot in terms of operations, security and scalability.

Sternum is the device-centric security and data platform with the three layers I mentioned before. We are already embedded in millions of medical, industrial and consumer devices and you can read some of the recent case studies on our website.

Thank you very much.

[Applause]

Q&A Session

Natali: Time for questions. Yeah.

Participant: … memory corruptions, stack overflows, things like that, right? So I would love to hear your opinion going forward. Do you believe in the promise of Rust of avoiding those vulnerabilities, do you believe that’s kind of true, and should people pivot to those kinds of memory-safe languages?

Natali: To be able to …

Participant: Rust, memory-safe languages.

Natali: Yeah. So no, I don’t believe that. So first, there are a few types of vulnerabilities: software vulnerabilities, logical vulnerabilities, design vulnerabilities. And I think deterministic protection or runtime protection that prevents zero-days can be applied only to software vulnerabilities, software mistakes like overflows and command injections.

For the other types, you really need continuous monitoring and antimalware techniques and this is why we provide the cloud platform that continuously monitors the overall behavior because we can’t prevent everything. We can prevent specific set of vulnerabilities that are very common but it’s not enough.

Then you are talking and asking about secure languages. Unfortunately, I’ve been to places where we exploited applications like WhatsApp and others. Usually they are written in Java or Python. Those still have those types of issues. Moreover, they are based on libraries that are eventually using memory allocations. If it’s the kernel, so it’s – of course it helps. It’s not like coding in C and C++. But if it was really preventing memory vulnerabilities, we wouldn’t have so many of them.

Like 70 percent of the vulnerabilities are eventually memory vulnerabilities in applications, in mobile applications and so on. Sure. Yeah.

Participant: So the SoCs I regularly work with usually combine a processor system and an FPGA. Have you ever looked into, or do you see any potential in, memory security concepts where, for example, critical data is in a block RAM within the FPGA and it’s continuously monitored by a state machine, so that basically the FPGA bitstream is a second source of security because an attacker would have to tamper with that as well? Are such concepts suitable for mitigation of memory leakage issues?

Natali: Yes. So should I repeat the question or …? We’re good? OK. So of course every layer that you add helps, and I even had hardware security as part of the infrastructure layers, and I’m talking about TrustZone and what you just mentioned. It helps secure the data, but it just means there is another hop that the attacker needs to make.

So for example, there are always APIs between the TrustZone or FPGA and the application processor. If someone is controlling your application processor because they found a memory issue, of course it’s not protected just because you have hardware security.

Then, if there is also a problem in the API, you can exploit it and go into the TrustZone or secure environment and then yes, leak the data and tamper with other state machines and tools.

It’s harder because you need two vulnerabilities now and not one and this is why those tools are great. But one question to be asked is, “Are you OK with your application processor to be hacked?” and second, “Can we be sure that there are zero vulnerabilities between the secure environment and the application?” and usually that’s a tough question.

Participant: Thanks.

Natali: Sure. Yeah.

Participant: Quick question. What are the system resources required to run your monitoring system in terms of memory and CPU cycles?

Natali: Yeah. So around three percent of CPU overhead or latency and it can be reduced to around 1.5 percent in really mission-critical systems. There is code increase of around 7 percent of the existing code size. So we protect the entire code including third party libraries, which means as you have more code, we need more space to protect it.

Third, memory consumption is really close to zero in real time systems. In embedded Linux, it’s around – bigger than that but around five kilobytes of RAM [0:35:37] [Phonetic].
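
The figures quoted in this answer can be turned into a quick back-of-the-envelope estimate. The Python sketch below assumes the roughly 7 percent code-size and 3 percent CPU figures apply as simple multipliers; the function name and the 512 KiB, 40 percent load firmware are hypothetical, chosen only to show the arithmetic.

```python
def protected_footprint(code_kib, cpu_load_pct, code_overhead=0.07, cpu_overhead=0.03):
    """Estimate code size and CPU load after adding runtime protection.

    The default overheads are the averages quoted in the talk; treating them
    as flat multipliers is a simplifying assumption.
    """
    return {
        "code_kib": code_kib * (1 + code_overhead),
        "cpu_load_pct": cpu_load_pct * (1 + cpu_overhead),
    }


# Hypothetical firmware: 512 KiB of code running at 40 percent CPU load.
estimate = protected_footprint(512, 40)
```

For that hypothetical image the estimate comes out to about 548 KiB of code and about 41.2 percent CPU load. Passing `cpu_overhead=0.015` models the reduced figure mentioned for mission-critical systems.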

Participant: Just to bounce on that, so it will be possible – so you say three percent. But does that mean the worst case execution time or that’s an average?

Natali: No, this is the average. This is the average, and the reason it’s an average is that if you have a function that, let’s say, performs a “memcpy” every three lines of code, then we protect each “memcpy” and that means a lot of protections on that specific function.

So there are specific opcodes that are being executed for each protected line of code and the average is between one and three percent. But if you have a function that is very critical, you can put it on a blacklist and it will just remain untouched.
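
The exclusion list described in this answer can be modeled in a few lines of Python. This is a sketch of the idea only: the `protect` decorator, the `EXCLUDED` set and the two example functions are hypothetical, and the counter stands in for whatever per-call checks a real instrumentation pass would insert.

```python
import functools

EXCLUDED = {"motor_isr"}      # hypothetical timing-critical functions left untouched
checks_run = {"count": 0}     # stand-in for the cost of the inserted protections


def protect(func):
    """Wrap a function with runtime checks unless it is on the exclusion list."""
    if func.__name__ in EXCLUDED:
        return func           # returned as-is: zero added overhead

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        checks_run["count"] += 1   # a real system would validate memory operations here
        return func(*args, **kwargs)

    return wrapper


@protect
def copy_config(data):
    return bytes(data)


@protect
def motor_isr(ticks):
    return ticks + 1
```

Calling both functions once leaves `checks_run["count"]` at 1, because only `copy_config` was instrumented. The trade-off is the one implied in the answer: the excluded function keeps its worst-case timing but also gets no protection.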

OK. Thank you everyone. If you want to see a live demo, step over to our booth.

[Applause]
