
Detecting and responding to attacks on embedded devices is a fundamentally different discipline from IT security operations. Embedded devices cannot run endpoint detection agents, have kilobytes rather than gigabytes of storage for logs, operate for years between maintenance windows, and often lack a persistent management interface that an incident responder can reach remotely. Yet the same core principles apply: without logs, you cannot detect an attack; without a response plan, you cannot contain one; and without post-incident analysis, the same vulnerability will be exploited again. This article builds the complete embedded threat detection and incident response capability: designing tamper-evident on-device logging under storage constraints, establishing behavioral baselines and identifying the anomaly patterns that indicate compromise, building and running a six-phase incident response lifecycle, writing scenario-specific playbooks for the attack types most likely to target embedded devices, and conducting the post-incident checks and root cause analysis that turn a security incident into a permanent improvement.
You cannot detect, investigate or recover from an attack you have no record of. This is true in all of computing, but it is particularly acute for embedded devices because the attack itself often leaves no visible trace in hardware state: a command injected over MQTT, a firmware downgrade installed via an open OTA endpoint, or authentication credentials extracted through a debug interface all leave the device functioning normally while the attacker operates with full access.
Logs serve three functions in the context of embedded security. First, they enable real-time detection: a rule that fires when five consecutive authentication failures appear in the log within sixty seconds is the trigger for an automated lockout or alert. Second, they enable forensic reconstruction: after an incident, the log provides the timeline of what happened, in what sequence and from which source. Without a log, the investigation starts from zero. Third, they provide compliance evidence: regulations including the EU Cyber Resilience Act, IEC 62443 for industrial systems and FDA guidance for medical devices all require that security-relevant events be recorded and that the records be available for audit.
The challenge for embedded devices is resource constraint. A desktop SIEM (Security Information and Event Management) system can store gigabytes of logs and run megabytes of detection logic. A microcontroller running bare-metal firmware may have 256 KB of total flash and no network stack capable of forwarding events in real time. Effective embedded logging requires careful decisions about what events to record, how much space to allocate to log storage, how to protect the log from tampering, and when and how to forward events off-device.
Log every event that would be relevant to a security investigation or that indicates an attempt to violate a security control. The six categories of security-relevant events for embedded devices:
Authentication events. Every authentication attempt: the timestamp, the authentication method (certificate, password, challenge-response), the outcome (success or failure), the source identifier (IP address, device ID, MAC address where applicable) and the account or certificate involved. Failed authentication attempts are more forensically valuable than successful ones: a pattern of failures from a single source is an active brute-force attack in progress. A pattern of failures followed immediately by a success may indicate a successful credential guessing attack.
Authorisation failures. Every attempt to access a resource or execute an action that the authenticated identity does not have permission for. An authenticated device attempting to subscribe to MQTT topics outside its authorised namespace, or a valid session attempting to invoke the firmware update API without the required privilege level, both indicate either a misconfigured device or an attacker operating with stolen credentials trying to escalate access.
Configuration changes. Any modification to security-relevant configuration: changes to network endpoints, authentication credentials, TLS certificate stores, firewall rules, access control lists, or the security settings of the device itself (RDP level, flash protection, JTAG state). Configuration changes made outside the normal update process are a high-fidelity indicator of compromise.
Firmware update events. Every firmware update attempt: the initiating source, the firmware version and image hash, the signature verification outcome, the installation outcome, and the post-update firmware version. Log both successful and rejected updates. A stream of rejected firmware update attempts from the same source indicates an attacker attempting to install malicious firmware. A successful update from an unexpected source at an unexpected time warrants immediate investigation.
Network connection events. New inbound connections (source IP, port, protocol), outbound connections to unexpected destinations, and connection failures (certificate validation failures, refused connections). For resource-constrained devices that cannot log every packet, log connection establishment and termination events rather than per-packet events. The connection log is sufficient to identify unexpected communication partners.
System and hardware anomalies. Unexpected resets (especially watchdog resets, which may indicate fault injection or a hung attack loop), stack overflow faults, memory allocation failures, cryptographic operation failures, and clock or voltage anomalies where hardware monitoring is in place. These events are individually ambiguous but collectively form a pattern that distinguishes hardware failure from active attack.
Every log entry should contain a fixed set of fields that enable both real-time detection and post-incident reconstruction: a sequence number, a timestamp, the event type and outcome, a source identifier and event-specific detail, all protected by a per-entry HMAC:
/* Structured security log entry for embedded firmware.
Stored in a circular buffer in a dedicated flash sector.
Forwarded to the cloud backend over MQTT TLS when connectivity is available.
Each entry is HMAC-SHA256 authenticated (see tamper-evident section below). */
#include <stdint.h> /* uint32_t */
#include <string.h> /* memcpy */
#include <stddef.h> /* offsetof, size_t */
#include "hmac_sha256.h" /* Platform HMAC implementation */
/* Security event types - extend as needed for your device */
typedef enum {
SEC_EVENT_AUTH_SUCCESS = 0x01,
SEC_EVENT_AUTH_FAILURE = 0x02,
SEC_EVENT_AUTHZ_DENIED = 0x03,
SEC_EVENT_CONFIG_CHANGED = 0x04,
SEC_EVENT_FW_UPDATE_ATTEMPT = 0x05,
SEC_EVENT_FW_UPDATE_SUCCESS = 0x06,
SEC_EVENT_FW_UPDATE_REJECTED = 0x07,
SEC_EVENT_CONN_NEW = 0x08,
SEC_EVENT_CONN_FAILED = 0x09,
SEC_EVENT_TAMPER_DETECTED = 0x0A,
SEC_EVENT_WATCHDOG_RESET = 0x0B,
SEC_EVENT_CRYPTO_FAILURE = 0x0C,
SEC_EVENT_CERT_PIN_MISMATCH = 0x0D,
SEC_EVENT_BOOT_INTEGRITY_FAIL = 0x0E
} SecurityEventType;
/* Outcome codes for events that have a success/failure dimension */
typedef enum {
OUTCOME_SUCCESS = 0,
OUTCOME_FAILURE = 1,
OUTCOME_ERROR = 2
} EventOutcome;
/* Log entry structure: 50 bytes, packed for efficient flash storage.
Fields are sized to minimise storage while retaining all forensically
relevant information. Event type and outcome are stored as single
bytes, since enum-typed fields would be int-sized on most compilers.
The HMAC covers all fields except itself. */
typedef struct __attribute__((packed)) {
uint32_t sequence; /* Monotonically increasing, never resets */
uint32_t timestamp_sec; /* Unix timestamp from RTC */
uint8_t event_type; /* SecurityEventType value */
uint8_t outcome; /* EventOutcome value */
uint8_t source_id[6]; /* Source MAC, device ID or IP bytes */
uint8_t detail[18]; /* Event-specific detail (version, topic) */
uint8_t hmac[16]; /* Truncated HMAC-SHA256 (first 16 bytes) */
} SecurityLogEntry; /* Total: 4+4+1+1+6+18+16 = 50 bytes, no padding */
/* Assumed to be defined elsewhere in the firmware: the monotonic
sequence counter and the log signing key loaded from encrypted NVS. */
extern uint32_t g_log_sequence;
extern const uint8_t g_log_signing_key[];
/* Log a security event to the circular buffer and queue for forwarding */
void log_security_event(SecurityEventType event_type,
EventOutcome outcome,
const uint8_t *source_id,
const uint8_t *detail,
size_t detail_len) {
SecurityLogEntry entry = {0};
entry.sequence = atomic_increment_get(&g_log_sequence);
entry.timestamp_sec = rtc_get_unix_timestamp();
entry.event_type = event_type;
entry.outcome = outcome;
if (source_id != NULL) {
memcpy(entry.source_id, source_id, sizeof(entry.source_id));
}
if (detail != NULL && detail_len > 0) {
size_t copy_len = (detail_len < sizeof(entry.detail))
? detail_len : sizeof(entry.detail);
memcpy(entry.detail, detail, copy_len);
}
/* Compute HMAC over all fields except the HMAC field itself.
Key is the device's log signing key, stored in encrypted NVS. */
hmac_sha256_truncated(
g_log_signing_key, LOG_SIGNING_KEY_LEN,
(const uint8_t *)&entry,
offsetof(SecurityLogEntry, hmac),
entry.hmac,
sizeof(entry.hmac)
);
/* Write to flash circular buffer and enqueue for MQTT forwarding */
flash_log_write(&entry, sizeof(entry));
mqtt_log_queue_push(&entry, sizeof(entry));
}
Log files are often less well protected than the data they describe. A log that records the plaintext password from a failed authentication attempt stores the credential in a location that may be accessible to an attacker who compromises the logging subsystem, may be transmitted to a SIEM with weaker access controls than the device itself, and may be retained for years beyond the session's lifetime. Never log plaintext passwords or passphrases, cryptographic keys or other key material, session tokens or API keys, or personal data beyond the minimum identifiers an investigation requires.
Embedded devices face a fundamental tension between log completeness and storage capacity. A 2 MB dedicated log partition holding roughly 50-byte log entries stores approximately 40,000 events. At ten security events per day under normal operation, that represents over ten years of capacity. Under active attack (thousands of failed authentication attempts per minute), the same partition fills in under 30 minutes.
Three design decisions manage this tension:
Circular buffer storage. Implement log storage as a circular buffer in a dedicated flash sector. When the buffer is full, new entries overwrite the oldest ones. This guarantees that recent events, which matter most for detection, are always preserved at the cost of the oldest entries, which matter mainly for historical audit. The sequence number field in each entry ensures that a reader can detect wrapping and reconstruct the correct chronological order.
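The wrap-and-reorder logic can be sketched in a few lines. This is a minimal in-RAM model under stated assumptions: `LogRing`, `ring_append` and `ring_oldest` are illustrative names, and real firmware would program flash pages and respect erase-sector granularity rather than writing an array.

```c
#include <stdint.h>

#define LOG_CAPACITY 8 /* entries per sector; illustrative */

/* In-RAM model of the circular log: each slot records the sequence
   number of the entry stored there, 0 = empty slot. */
typedef struct {
    uint32_t sequence[LOG_CAPACITY];
    uint32_t next_seq; /* last sequence number assigned */
} LogRing;

/* Append an entry: the slot is derived from the sequence number,
   so the newest entry always overwrites the oldest after a wrap. */
void ring_append(LogRing *r) {
    r->next_seq++;
    r->sequence[(r->next_seq - 1) % LOG_CAPACITY] = r->next_seq;
}

/* After a wrap, chronological order is recovered by starting from the
   slot with the smallest non-zero sequence number. Returns -1 if empty. */
int ring_oldest(const LogRing *r) {
    int oldest = -1;
    for (int i = 0; i < LOG_CAPACITY; i++) {
        if (r->sequence[i] == 0) continue;
        if (oldest < 0 || r->sequence[i] < r->sequence[oldest]) oldest = i;
    }
    return oldest;
}
```

After ten appends into an eight-slot ring, entries with sequence numbers 1 and 2 have been overwritten and the reader resumes at sequence 3.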
Wear levelling. Flash memory has a finite write cycle count, typically 10,000 to 100,000 program/erase cycles per sector. Writing a new log entry to the same address on every event would exhaust a flash sector rapidly. Use a dedicated logging library that distributes writes across multiple pages within the log sector, or use an external EEPROM with higher write endurance for the log partition. FRAM (ferroelectric RAM) provides effectively unlimited write cycles and is available in I2C packages suitable for embedded designs where high write frequency is anticipated.
Rate limiting and deduplication. Under active attack, the log may receive thousands of identical events (authentication failures from the same source). Rate-limit identical event types: after ten consecutive authentication failures from the same source within 60 seconds, log a single "AUTH_BURST_FAILURE" event with the count rather than thousands of individual entries. This preserves log space for detection and retains the forensically relevant information (that a burst occurred, from which source, over what time window) without filling the log partition with duplicate entries.
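A rate limiter of this shape can be sketched as follows. The `BurstFilter` type and its window/threshold constants are illustrative assumptions, not a fixed API: the filter admits the first N identical events per window and reports how many it absorbed, so the caller can emit one aggregate burst entry carrying the count.

```c
#include <stdint.h>
#include <stdbool.h>

#define BURST_WINDOW_SEC 60
#define BURST_THRESHOLD 10 /* identical events admitted per window */

typedef struct {
    uint32_t window_start; /* Unix time when the current window opened */
    uint32_t count;        /* identical events seen in the window */
} BurstFilter;

/* Returns true if this event should be written to the log as-is,
   false when it is absorbed into the burst counter. When a window
   expires, *suppressed reports how many events were absorbed so the
   caller can emit a single aggregate entry with the count. */
bool burst_filter_admit(BurstFilter *f, uint32_t now, uint32_t *suppressed) {
    *suppressed = 0;
    if (now - f->window_start > BURST_WINDOW_SEC) {
        if (f->count > BURST_THRESHOLD)
            *suppressed = f->count - BURST_THRESHOLD;
        f->window_start = now;
        f->count = 0;
    }
    f->count++;
    return f->count <= BURST_THRESHOLD;
}
```

Keyed per source (as in the authentication tracker shown later), this keeps the log partition usable even under a sustained credential-stuffing burst.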
A local log file that an attacker can modify after the fact is worse than no log: it provides a false alibi. Tamper-evident logging uses cryptographic techniques to ensure that any modification to or deletion of a log entry is detectable even without a real-time connection to a remote log server.
The two standard techniques are:
Per-entry HMAC: Each log entry includes an HMAC computed over its content using a device-unique secret key (stored in the secure element or encrypted NVS). An entry that has been modified will fail HMAC verification. This is the approach shown in the log entry structure above. It detects modification of individual entries but does not detect deletion of an entire range of entries.
Hash chaining: In addition to the per-entry HMAC, each entry includes the hash of the previous entry. This creates a chain where tampering with any entry breaks all subsequent chain verifications, and deleting entries from the middle of the log is detectable because the chain link from the entry before the deletion to the entry after it will not verify. The sequence number field reinforces this: gaps in sequence numbers indicate deleted entries even if the hash chain is not broken.
/* Hash-chained log entry verification.
Called during log export or audit to verify the entire log chain.
Returns the index of the first entry that fails verification,
or LOG_VERIFY_OK if all entries pass. */
#define LOG_VERIFY_OK UINT32_MAX
#define PREVIOUS_HASH_LEN 32
typedef struct __attribute__((packed)) {
SecurityLogEntry entry;
uint8_t prev_hash[PREVIOUS_HASH_LEN]; /* SHA-256 of previous entry */
} ChainedLogEntry;
uint32_t verify_log_chain(const ChainedLogEntry *entries, uint32_t count) {
uint8_t computed_prev_hash[PREVIOUS_HASH_LEN] = {0}; /* Genesis block: all zeros */
uint8_t entry_hash[PREVIOUS_HASH_LEN];
for (uint32_t i = 0; i < count; i++) {
/* Step 1: Verify that this entry's prev_hash matches the hash of the
previous entry (computed_prev_hash from the last iteration).
Note: constant_time_memcmp() is assumed to return nonzero (true)
when the buffers are equal - the opposite of memcmp's convention. */
if (!constant_time_memcmp(entries[i].prev_hash,
computed_prev_hash,
PREVIOUS_HASH_LEN)) {
return i; /* Chain broken at entry i: deletion or modification detected */
}
/* Step 2: Verify the per-entry HMAC over the entry content */
uint8_t expected_hmac[16];
hmac_sha256_truncated(
g_log_signing_key, LOG_SIGNING_KEY_LEN,
(const uint8_t *)&entries[i].entry,
offsetof(SecurityLogEntry, hmac),
expected_hmac, sizeof(expected_hmac)
);
if (!constant_time_memcmp(entries[i].entry.hmac,
expected_hmac, sizeof(expected_hmac))) {
return i; /* Entry i has been tampered with */
}
/* Step 3: Compute the hash of the current entry to use as prev_hash
for the next entry verification. */
sha256((const uint8_t *)&entries[i].entry,
sizeof(SecurityLogEntry),
computed_prev_hash);
}
return LOG_VERIFY_OK;
}
On-device logs are a last resort: they exist for devices that lose network connectivity or for post-incident forensics when remote logging was not available. The primary log store for any internet-connected embedded device should be a remote SIEM or log aggregation service. Remote logs cannot be tampered with by compromising the device, are retained beyond the device's local storage capacity, and can be correlated across hundreds or thousands of devices to detect fleet-wide attack patterns.
Forward security events to the remote log server over the same authenticated TLS channel used for telemetry data. Use MQTT QoS 1 for log message delivery: this provides at-least-once delivery semantics, ensuring that events are not silently lost during brief connectivity interruptions. Buffer events locally (in the on-device circular buffer) during connectivity loss and forward the buffered events on reconnection, preserving the sequence number and timestamp from when the event occurred.
/* MQTT log forwarding with QoS 1 and local buffering on disconnect.
Log entries are published to a device-specific secure log topic.
The broker's topic ACL restricts each device to publishing to its
own log topic only (as shown in the Section 6 MQTT ACL example). */
#define LOG_TOPIC_FMT "devices/%s/security-log"
#define LOG_TOPIC_MAXLEN 64
/* Publish a log entry to the remote SIEM via MQTT.
If not connected, the entry is buffered locally and will be
published when connectivity is restored. */
void mqtt_log_forward(const SecurityLogEntry *entry) {
if (entry == NULL) return;
/* Serialise the entry to JSON for SIEM compatibility.
In production, use a proper JSON serialisation library.
Field names should match your SIEM's expected schema. */
char payload[256];
int payload_len = snprintf(payload, sizeof(payload),
"{"
"\"seq\":%"PRIu32","
"\"ts\":%"PRIu32","
"\"type\":%d,"
"\"outcome\":%d,"
"\"src\":\"%02x:%02x:%02x:%02x:%02x:%02x\""
"}",
entry->sequence,
entry->timestamp_sec,
(int)entry->event_type,
(int)entry->outcome,
entry->source_id[0], entry->source_id[1], entry->source_id[2],
entry->source_id[3], entry->source_id[4], entry->source_id[5]
);
/* Check for snprintf truncation (payload_len >= sizeof(payload)) */
if (payload_len < 0 || payload_len >= (int)sizeof(payload)) {
/* Log a truncation event locally; do not forward the malformed payload */
log_internal_error(ERR_LOG_PAYLOAD_TRUNCATED);
return;
}
char topic[LOG_TOPIC_MAXLEN];
snprintf(topic, sizeof(topic), LOG_TOPIC_FMT, g_device_id);
/* QoS 1: broker acknowledges delivery; message is retained in local
buffer until acknowledgement received */
MQTTMessage msg = {
.qos = QOS1,
.retained = 0,
.payload = payload,
.payloadlen = (size_t)payload_len
};
int rc = MQTTPublish(&g_mqtt_client, topic, &msg);
if (rc != SUCCESS) {
/* Push to local buffer for retry on reconnection */
local_log_buffer_push(entry);
}
}
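The reconnection path deserves the same care as the publish path. The sketch below factors the drain loop out behind an injected publish callback (a hypothetical `publish_fn_t`; in firmware it would wrap `MQTTPublish` on the live client), so chronological order and at-least-once delivery survive a broker that drops again mid-drain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Publish callback: returns true on broker acknowledgement. Injected
   so the drain logic is testable without a broker. */
typedef bool (*publish_fn_t)(const void *entry, void *ctx);

/* Forward buffered entries oldest-first. Stops at the first failure so
   ordering is preserved across retries. Returns how many entries were
   sent; the caller removes exactly that many from the buffer front. */
size_t drain_log_buffer(const void **entries, size_t count,
                        publish_fn_t publish, void *ctx) {
    size_t sent = 0;
    for (size_t i = 0; i < count; i++) {
        if (!publish(entries[i], ctx))
            break; /* broker unreachable again: retry remainder later */
        sent++;
    }
    return sent;
}

/* Example stub: acknowledges the first fail_after publishes, then fails. */
typedef struct { size_t calls, fail_after; } StubCtx;
bool stub_publish(const void *entry, void *ctx) {
    (void)entry;
    StubCtx *s = (StubCtx *)ctx;
    return s->calls++ < s->fail_after;
}
```

Stopping at the first failure, rather than skipping ahead, is what keeps the sequence numbers in the SIEM monotonic and makes gaps meaningful during audit.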
Behavioral monitoring detects threats that do not appear as a single anomalous event but as a pattern of activity that deviates from normal device operation. A single authentication failure is noise. One hundred authentication failures in ten seconds from a single IP address is an active attack. A device that normally uploads 2 KB of telemetry per minute suddenly uploading 500 KB per minute has likely been compromised and is exfiltrating data.
Effective behavioral monitoring requires a baseline: a documented characterisation of what normal device behaviour looks like. Without a baseline, you cannot define what "anomalous" means for your device. Establish the baseline during controlled testing before deployment; trying to derive it afterwards, once uncharacterised real-world variation is mixed in, makes normal and anomalous much harder to separate.
The baseline for a typical sensor IoT device should document: the expected communication endpoints (IP addresses, hostnames, ports and protocols), the normal frequency and volume of uploads and downloads, typical CPU and memory utilisation, the expected reset rate, the set of sources that legitimately authenticate to the device, and the scheduled maintenance and update windows.
The following behavioral indicators, individually or in combination, indicate a device that warrants investigation:
| Indicator | What It May Indicate | Initial Response |
|---|---|---|
| 5+ authentication failures in 60 seconds from the same source | Active brute-force attack on credentials | Temporary block of source IP; alert to security team |
| Successful authentication from a previously unseen source IP | Credential theft and use from a different location | Alert; request re-authentication from known device |
| Outbound connection to an IP not in the baseline destination list | C2 (command and control) callback, data exfiltration | Block at network level; isolate device; investigate |
| Upload volume 10x above normal baseline | Data exfiltration of stored sensor data or credentials | Network isolation; credential rotation |
| Firmware version after update does not match a known release | Malicious firmware installed via compromised OTA path | Immediate network isolation; reflash from known-good image |
| Watchdog resets more than 3x the normal weekly rate | Active fault injection, software instability post-compromise | Log correlation; check for firmware integrity; physical inspection |
| Tamper switch triggered | Physical access to device enclosure | Tamper response sequence: key zeroization, lockdown, alert |
| Configuration change outside maintenance window | Unauthorised remote access, compromised management channel | Revert configuration; revoke active sessions; investigate |
Anomalies in embedded device behaviour fall into six categories. Recognising the category helps narrow down the cause and the appropriate initial response.
Point anomalies are single events that are statistically unusual: one authentication failure is a point anomaly relative to a device that normally has zero. They are low signal individually but high signal when they occur on a device that normally never generates them.
Contextual anomalies are events that would be normal in one context but are unusual in another. A firmware update is a normal event; a firmware update at 3 AM on a Saturday, when the maintenance window is Tuesday mornings, is a contextual anomaly. Access from an IP address in a country where no authorised users are located is contextually anomalous even if the authentication itself succeeds.
Collective anomalies are groups of events that individually appear normal but form a suspicious pattern collectively. Five authentication failures from five different IP addresses over five hours would each appear normal individually, but the pattern of distributed, slow authentication testing is a distributed credential guessing attack.
Temporal anomalies are events that occur at the wrong time relative to expected patterns: network activity during scheduled off-hours, authentication events after a device has been officially decommissioned, or a firmware update initiated between expected release cycle dates.
Spatial anomalies involve unexpected sources or destinations: connections from geographic regions inconsistent with the device's deployment location, or outbound connections to IP ranges associated with known malicious infrastructure.
Behavioral anomalies involve deviation from the device's characteristic interaction pattern: a temperature sensor that suddenly begins making HTTPS requests to an external API, a device that begins initiating connections to other devices on the local network when it has never done so before, or a device that begins consuming significantly more CPU cycles than the baseline for its application.
Three complementary detection approaches cover the space of embedded threat detection. Using all three in combination minimises both false negatives (attacks that go undetected) and false positives (legitimate activity flagged as suspicious).
The simplest and most predictable approach: define an upper or lower bound for a metric, and generate an alert when the bound is crossed. Authentication failures exceeding N per minute, CPU utilisation exceeding X percent for more than Y minutes, upload volume exceeding Z bytes per hour. Threshold detection is deterministic, has zero computational overhead beyond the comparison itself, and is appropriate for bare-metal firmware where no complex analytics are possible. The limitation is that sophisticated attackers operate just below the threshold: five failed attempts per minute if the threshold is ten.
/* Threshold-based authentication failure detector.
Counts failures per source over a sliding 60-second window.
When threshold exceeded, locks out the source and logs the event. */
#define AUTH_FAIL_WINDOW_SEC 60
#define AUTH_FAIL_THRESHOLD 5
#define MAX_TRACKED_SOURCES 16
typedef struct {
uint8_t source_id[6]; /* Source MAC or IP bytes */
uint32_t window_start; /* Start of current counting window (Unix time) */
uint16_t failure_count; /* Failures in current window */
bool locked_out; /* Whether this source is currently blocked */
} AuthFailTracker;
static AuthFailTracker g_auth_trackers[MAX_TRACKED_SOURCES];
/* Call this on every authentication failure.
Returns true if the source should be blocked. */
bool auth_failure_detector(const uint8_t *source_id, uint32_t current_time) {
AuthFailTracker *tracker = NULL;
AuthFailTracker *oldest = NULL;
/* Find existing tracker for this source, or the oldest slot to reuse */
for (int i = 0; i < MAX_TRACKED_SOURCES; i++) {
if (memcmp(g_auth_trackers[i].source_id, source_id, 6) == 0) {
tracker = &g_auth_trackers[i];
break;
}
if (oldest == NULL ||
g_auth_trackers[i].window_start < oldest->window_start) {
oldest = &g_auth_trackers[i];
}
}
/* Create a new entry if this source is not being tracked yet */
if (tracker == NULL) {
tracker = oldest;
memcpy(tracker->source_id, source_id, 6);
tracker->window_start = current_time;
tracker->failure_count = 0;
tracker->locked_out = false;
}
/* Reset the window if it has expired */
if ((current_time - tracker->window_start) > AUTH_FAIL_WINDOW_SEC) {
tracker->window_start = current_time;
tracker->failure_count = 0;
tracker->locked_out = false;
}
tracker->failure_count++;
if (tracker->failure_count >= AUTH_FAIL_THRESHOLD) {
if (!tracker->locked_out) {
/* First time reaching threshold: log the burst event */
log_security_event(SEC_EVENT_AUTH_FAILURE,
OUTCOME_FAILURE,
source_id,
(const uint8_t *)&tracker->failure_count,
sizeof(tracker->failure_count));
tracker->locked_out = true;
}
return true; /* Block this source */
}
return false;
}
Statistical detection compares current behaviour against a characterised baseline and alerts when the current value deviates beyond a defined number of standard deviations from the mean. On embedded Linux devices with sufficient resources, this can be implemented using exponential moving averages: lightweight enough to run continuously without significant CPU overhead, adaptive to gradual legitimate changes in device behaviour, and sensitive to sudden deviations.
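A minimal sketch of such a detector follows, under the assumption that floating point is available (on FPU-less parts the same recursion works in fixed point). The mean and variance are tracked as exponentially weighted averages, and the baseline is updated only after the anomaly check, so an attack burst cannot immediately drag the baseline towards itself.

```c
#include <stdbool.h>

/* Exponentially weighted mean/variance detector. alpha controls how
   quickly the baseline adapts (0.1 = slow drift tracking); k is the
   deviation multiplier, roughly a "number of sigmas". */
typedef struct {
    float mean;  /* EWMA of the metric */
    float var;   /* EWMA of squared deviation */
    bool primed; /* first sample seeds the mean */
} EmaDetector;

bool ema_update_is_anomalous(EmaDetector *d, float sample,
                             float alpha, float k) {
    if (!d->primed) {
        d->mean = sample;
        d->var = 0.0f;
        d->primed = true;
        return false;
    }
    float dev = sample - d->mean;
    /* Floor the variance so a perfectly flat history still flags a jump */
    float v = (d->var > 1e-6f) ? d->var : 1e-6f;
    bool anomalous = dev * dev > k * k * v;
    /* Update the baseline only after the check, so an attack burst
       cannot immediately pull the baseline towards itself. */
    d->mean += alpha * dev;
    d->var = (1.0f - alpha) * (d->var + alpha * dev * dev);
    return anomalous;
}
```

Feeding a steady telemetry rate produces no alerts; a sudden jump of hundreds of times the baseline trips the detector on the first deviant sample.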
Signature detection matches observed events against known attack patterns. A known Mirai scan pattern (rapid TCP SYN to a sequence of ports), a known format string attack payload in an incoming packet, or a known malicious firmware hash are all detectable by signature. Signatures require maintenance: the signature database must be updated when new attack patterns emerge. For constrained devices, maintain a minimal set of high-confidence signatures that cover the attack types most commonly observed against your device class.
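For the constrained end of the spectrum, a signature check can be as simple as a table lookup. The hash values below are placeholders; in practice the known-bad list ships to devices alongside the rest of the signature set in firmware updates.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define SIG_HASH_LEN 32

/* Placeholder table of known-malicious firmware image hashes;
   unspecified bytes default to zero in C initialisers. */
static const uint8_t g_known_bad[][SIG_HASH_LEN] = {
    { 0xde, 0xad, 0xbe, 0xef },
    { 0xba, 0xd1, 0x0f, 0xf1 }
};
#define SIG_COUNT (sizeof(g_known_bad) / sizeof(g_known_bad[0]))

/* Linear scan is adequate for the small, high-confidence signature
   sets recommended for constrained devices. */
bool hash_is_known_malicious(const uint8_t hash[SIG_HASH_LEN]) {
    for (size_t i = 0; i < SIG_COUNT; i++) {
        if (memcmp(hash, g_known_bad[i], SIG_HASH_LEN) == 0)
            return true;
    }
    return false;
}
```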
An alert system that generates constant false positives rapidly becomes ignored, which is worse than no alert system. Alert fatigue is a documented contributing factor in major security incidents: the signal was there, but it was buried in noise. Managing false positives is an ongoing operational task, not a one-time configuration activity.
Four practices that keep the false positive rate manageable:
Tune thresholds against real traffic: After initial deployment, monitor the alert rate over two to four weeks of normal operation without taking action on low-severity alerts. Use this period to characterise the normal variation in your metrics and set thresholds that sit above the 99th percentile of normal variation, not just above the mean.
Whitelist known-good activity: Maintenance windows, scheduled update deployments and known monitoring tools all generate activity that looks suspicious out of context. Add explicit whitelist entries for these activities with a time window and source restriction so the detection system knows to treat them as expected.
Correlate across multiple indicators before alerting: A single authentication failure from a new IP address is low confidence. The same IP failing authentication, then immediately attempting an HTTP probe of the management interface, and then appearing in an outbound connection log, is high confidence. Require two or more correlated indicators before generating a high-severity alert. Cloud-side SIEM platforms (Splunk, Elastic SIEM, AWS GuardDuty for IoT) automate this correlation across the fleet.
Review and update detection rules on a schedule: Detection rules that were accurate at deployment drift as device behaviour evolves. Review alert rates and false positive rates monthly and update rules accordingly. Retire rules that have generated no true positives in six months; they are adding noise without value.
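The correlation practice above can be sketched as a per-source indicator bitmask: each indicator class sets one bit, and a high-severity alert fires only once two distinct classes have been observed. The indicator names and the two-class bar are illustrative assumptions; a fleet-scale SIEM applies the same idea with richer rules and time windows.

```c
#include <stdint.h>
#include <stdbool.h>

/* Indicator classes as bit flags (names illustrative). */
enum {
    IND_AUTH_FAILURE  = 1u << 0,
    IND_HTTP_PROBE    = 1u << 1,
    IND_OUTBOUND_CONN = 1u << 2,
    IND_CONFIG_CHANGE = 1u << 3
};

/* Per-source state: one bit per indicator class observed. */
typedef struct { uint32_t indicators; } SourceState;

static unsigned count_bits(uint32_t v) {
    unsigned n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}

/* Record an indicator for a source. Returns true once two or more
   distinct classes have been seen: repeated events of a single class
   never cross the bar on their own, suppressing single-signal noise. */
bool correlate_indicator(SourceState *s, uint32_t indicator) {
    s->indicators |= indicator;
    return count_bits(s->indicators) >= 2;
}
```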
The incident response lifecycle for embedded device security incidents follows the same six-phase structure used in enterprise IT security, with adaptations for the specific constraints of embedded systems: limited remote access, physical deployment locations that may not be immediately reachable, and OTA update pipelines that may take days to reach all affected devices.
Preparation is done before any incident occurs. It includes writing the incident response plan, defining team roles and contact lists, setting up the logging and monitoring infrastructure, preparing the tools needed for investigation (firmware analysis environment, network capture capability, a tested recovery process), and conducting tabletop exercises where the team walks through simulated incident scenarios. A team that has never practised its incident response procedures will make costly mistakes under the time pressure of a real incident.
Detection is identifying that a security incident has occurred or is in progress. Detection sources for embedded devices include: SIEM alerts triggered by the anomaly detection rules, reports from customers or field technicians observing abnormal device behaviour, intelligence feeds reporting a vulnerability in a component you use, and routine security review finding unexpected firmware versions in the fleet inventory.
Containment is stopping the attack from spreading or causing further damage. Containment for embedded devices takes two forms. Short-term containment is immediate action to limit damage without fully restoring normal operation: revoking the credentials associated with the compromised device, blocking the device's IP address at the network firewall, or pushing an emergency configuration change that disables the exploited feature. Long-term containment is stable isolation that can be maintained while the investigation and remediation proceed: moving the compromised device fleet to a quarantine VLAN with no access to production systems, or issuing revocation for the affected device certificates.
Eradication is removing the threat from the environment. For embedded devices, eradication typically means: reflashing from a known-good firmware image (not the same version that was compromised if the compromise was through the firmware itself), rotating all credentials associated with the affected devices, and verifying that the exploit that enabled the compromise is patched in the new firmware version. If the attack was through a supply chain compromise or a compromised signing key, the scope of eradication expands to the entire production signing infrastructure.
Recovery is restoring normal operations with confidence that the threat has been removed. Recovery includes: deploying the patched firmware to the affected device fleet, restoring devices from quarantine to production network access after verifying they are running the correct firmware version, monitoring the recovered devices closely for 30 to 60 days for signs of re-compromise, and updating device credentials in the cloud backend to reflect rotated keys.
Lessons learned is analysing the incident to prevent recurrence, conducted within two weeks of incident closure while details are fresh. It produces the threat register updates, code fixes, monitoring rule improvements and process changes that close the gaps exposed by the incident. The lessons learned meeting is not a blame exercise: it is a structured process to improve the security posture of the product and the team's response capability.
A clear role assignment ensures that critical tasks are not duplicated or missed during the chaos of an active incident. For small embedded development teams, some roles will be combined; the important thing is that each function has a named owner for any given incident.
| Role | Responsibilities | Who Typically Fills It |
|---|---|---|
| Incident Commander | Coordinates the overall response, makes escalation decisions, declares incident severity, owns the timeline | Engineering lead or security lead |
| Technical Lead | Leads technical investigation: firmware analysis, log analysis, vulnerability identification, patch development | Senior firmware engineer or security engineer |
| Communications Lead | Manages notifications to customers, management, regulators and (if required) law enforcement; drafts public disclosures | Product manager or engineering manager |
| Documentation Lead | Maintains the incident timeline in real time; records every action taken, every finding, every decision made | Rotating assignment; any team member not on critical path |
| Legal and Compliance | Advises on regulatory notification obligations, handles evidence preservation for potential law enforcement referral | Legal counsel (internal or external) |
A playbook is a step-by-step procedure for responding to a specific incident type. Writing playbooks in advance means that during an active incident, the responder is executing a tested procedure rather than improvising under pressure. The four playbooks most relevant to embedded IoT devices:
```
PLAYBOOK: Compromised Device Credentials
Severity: High
Estimated response time: 2-4 hours
Trigger: Authentication to the cloud backend from an unexpected
geographic location or source; multiple simultaneous active sessions
for a device that should have only one.

IMMEDIATE ACTIONS (within 30 minutes):
1. Revoke the device certificate in the cloud CA (marks it as invalid;
   cloud backend will refuse new connections using this certificate).
2. Block the device's current source IP at the network edge firewall.
3. Revoke all active sessions associated with the device ID in the backend.
4. Log the incident with the current timestamp and indicators observed.

INVESTIGATION ACTIONS (within 4 hours):
5. Pull the device's security log from the SIEM for the 48 hours preceding
   detection. Look for: first authentication from an unexpected source,
   configuration changes, firmware update events.
6. Determine whether the credential was extracted from the device or
   obtained through another channel (phishing, insider threat, build
   pipeline exposure).
7. Check whether other devices in the fleet show the same anomaly
   (indicating a fleet-wide credential compromise vs. single-device).

REMEDIATION:
8. Issue a new device certificate via the provisioning infrastructure.
9. Deploy updated firmware if the credential extraction was via a
   firmware vulnerability.
10. Push the new certificate to the device via an emergency OTA update.
11. Restore the device to the production network after verifying that
    the new certificate is active and the old certificate is revoked.
12. Update the threat model with the credential extraction vector
    identified.
```
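The immediate actions in this playbook are good candidates for automation, so containment does not depend on a responder typing commands under pressure. A minimal Python sketch, assuming hypothetical `ca`, `firewall` and `sessions` client objects (the article does not prescribe a particular backend stack):

```python
import time

def contain_compromised_device(device_id, source_ip, ca, firewall,
                               sessions, audit_log):
    """Execute the immediate containment steps for a compromised-credential
    incident. The ca, firewall and sessions collaborators are hypothetical
    interfaces; substitute your cloud CA, edge firewall and session store."""
    actions = []
    ca.revoke_certificate(device_id)   # step 1: backend refuses new connections
    actions.append("certificate_revoked")
    firewall.block_ip(source_ip)       # step 2: cut the attacker's current path
    actions.append("source_ip_blocked")
    sessions.revoke_all(device_id)     # step 3: kill any live sessions
    actions.append("sessions_revoked")
    audit_log.append({                 # step 4: record what was done, and when
        "ts": time.time(),
        "device_id": device_id,
        "source_ip": source_ip,
        "actions": list(actions),
    })
    return actions
```

Injecting the collaborators keeps the routine testable against fakes, so the containment path can be exercised in drills before it is trusted in a live incident.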
```
PLAYBOOK: Malicious or Unauthorised Firmware
Severity: Critical
Estimated response time: 4-8 hours
Trigger: Firmware version in the fleet inventory does not match any
known release hash; device behaviour anomalies consistent with a
compromised agent (unexpected outbound connections, resource spike).

IMMEDIATE ACTIONS (within 15 minutes):
1. Move affected devices to a quarantine VLAN with no outbound internet
   access (limits C2 callback and data exfiltration).
2. Preserve a copy of the flash image from one affected device via the
   OTA infrastructure (for forensic analysis) before reflashing.
3. Notify the incident commander; this is a Critical severity incident.

INVESTIGATION ACTIONS (within 2 hours):
4. Analyse the unknown firmware image: binwalk extraction, strings
   analysis, comparison to the last known-good image using a binary diff.
5. Check the OTA server logs: when was the firmware image delivered?
   From which source IP was the delivery triggered?
6. Check the signing key audit logs: was the production signing key
   used? If yes, assume the signing key is compromised and escalate to
   the signing infrastructure incident response.
7. Determine scope: how many devices received the firmware?

REMEDIATION:
8. Build and sign a verified clean firmware image from a known-good
   source commit.
9. Deploy via an emergency OTA push to all affected devices.
10. Verify that the post-update firmware hash on each device matches
    the expected hash for the clean image.
11. Release devices from the quarantine VLAN after hash verification.
12. If the signing key was compromised: rotate the signing key, re-sign
    all production firmware images, and re-provision root public keys
    to all fleet devices via a separately signed configuration update.
```
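Both the trigger condition (a firmware hash that matches no known release) and remediation step 10 reduce to comparing reported hashes against a release manifest. A small illustrative sketch; the inventory and manifest shapes are assumptions:

```python
import hashlib

def image_sha256(image_bytes):
    """SHA-256 hex digest of a firmware image, as recorded in the manifest."""
    return hashlib.sha256(image_bytes).hexdigest()

def find_unknown_firmware(fleet_inventory, known_release_hashes):
    """Device IDs whose reported firmware hash matches no known release.
    This is the trigger condition for the playbook; for step 10, run it
    with only the clean image's hash as the 'known' set and expect an
    empty result before releasing devices from quarantine."""
    known = set(known_release_hashes.values())
    return sorted(dev for dev, fw_hash in fleet_inventory.items()
                  if fw_hash not in known)
```

Running the same function for both detection and post-remediation verification means the closure check is exactly as strict as the check that opened the incident.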
The remaining two playbooks follow the same structure; their triggers:

Trigger (denial of service): Fleet-wide network saturation; devices reporting connection failures; cloud backend API rate limit exhaustion.

Trigger (physical tamper): Tamper switch or conductive mesh sensor activated; device reports SEC_EVENT_TAMPER_DETECTED to the SIEM; device enters lockdown mode autonomously.
Who you notify during an incident, in what order and with what information, is as important as the technical response. Notification failures during incidents regularly produce secondary problems: customers discover a breach from a news article rather than from the company, regulatory authorities impose penalties for notification delays, or internal teams take conflicting actions because they were not informed.
Define notification triggers and timelines in advance, before the first incident forces those decisions to be made under pressure.
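One way to make those triggers executable is a severity-keyed notification matrix that incident tooling can query. The parties and deadlines below are illustrative placeholders, not regulatory advice; substitute the obligations that actually apply to your product and jurisdiction:

```python
from datetime import timedelta

# Hypothetical notification matrix: severity -> (party, deadline after
# incident declaration). Deadlines here are placeholders; real ones come
# from contracts and regulation (e.g. statutory breach-notification windows).
NOTIFICATION_MATRIX = {
    "critical": [("incident_commander",     timedelta(minutes=15)),
                 ("executive_management",   timedelta(hours=1)),
                 ("affected_customers",     timedelta(hours=24)),
                 ("regulator",              timedelta(hours=72))],
    "high":     [("incident_commander",     timedelta(minutes=30)),
                 ("engineering_management", timedelta(hours=4))],
    "medium":   [("incident_commander",     timedelta(hours=4))],
}

def notifications_due(severity, elapsed):
    """Parties whose notification deadline has already passed, given the
    time elapsed since the incident was declared."""
    return [party for party, deadline in NOTIFICATION_MATRIX.get(severity, [])
            if elapsed >= deadline]
```

Polling this from the incident tooling turns a notification failure from a silent omission into an alert the communications lead cannot miss.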
Declaring an incident resolved before confirming that the threat has been completely eliminated is a common failure mode. A device that has been reflashed but whose credential remains unrevoked, or a patch that addresses the exploited vulnerability but leaves a related vulnerability open, provides false confidence while leaving the device exposed.
The recovery verification checklist confirms, at a minimum, that the exploited vulnerability and any related vulnerabilities found during the investigation are patched, that every credential the attacker could have reached has been revoked and reissued, that the running firmware hash matches a known-good release, and that post-recovery device behaviour matches its baseline.
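To enforce such a checklist rather than eyeball it, the failure modes just described (an unrevoked credential, an unverified firmware image) can be expressed as a gate that must pass before the incident is closed. The device-record fields here are assumptions for illustration:

```python
def verify_recovery(device):
    """Illustrative recovery gate: the incident may only be closed when
    every check passes. 'device' is a hypothetical record assembled from
    the fleet inventory, the CA and the anomaly-detection backend."""
    checks = {
        "clean_firmware_hash":    device["firmware_hash"] == device["expected_hash"],
        "old_credential_revoked": device["old_cert_status"] == "revoked",
        "new_credential_active":  device["new_cert_status"] == "active",
        "no_anomalous_traffic":   device["anomaly_count_24h"] == 0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed
```

Returning the list of failed checks, not just a boolean, gives the responder an immediate worklist instead of a bare rejection.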
Root cause analysis answers three questions for every security incident: how did it happen (the entry point and exploit chain), why was it not prevented (which controls failed or were absent), and how will it be prevented from recurring (the specific changes being made). A root cause analysis that does not produce actionable items with owners and deadlines is an exercise in documentation without improvement.
The five-whys technique is effective for embedded security incidents. Starting from the observable effect, ask "why" repeatedly until the underlying systemic cause is reached. The goal is to reach a cause that is within your control to change.
Example applied to a hardcoded credential discovered in production firmware: why was the credential in the firmware? A developer embedded it to simplify integration testing. Why did it reach production? Code review did not flag it, and no automated check existed. Why did no automated check exist? The static analysis configuration had no rule for credential patterns. Why was there no rule? The security requirements did not prohibit hardcoded credentials, so nobody had a reason to configure one.
The root cause is a gap in the security requirements and the static analysis configuration. The corrective action is: add a security requirement prohibiting hardcoded credentials, add a Semgrep rule that detects strings matching credential patterns in the source code, and add this check to the CI pipeline as a blocking step. This change prevents the same class of failure in all future firmware, not just the specific credential that was exposed.
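As a rough stand-in for the Semgrep rule described above, the same class of check can be sketched as a small scanner run as a blocking CI step. The patterns here are illustrative only; a production rule set needs to be broader and tuned against false positives:

```python
import re

# Illustrative credential patterns; a real rule set (e.g. the Semgrep rule
# suggested in the text) would cover more formats and languages.
CREDENTIAL_PATTERNS = [
    re.compile(r'(password|passwd|secret|api[_-]?key)\s*=\s*"[^"]+"', re.I),
    re.compile(r'-----BEGIN (RSA |EC )?PRIVATE KEY-----'),
]

def scan_source(text):
    """Return (line_number, line) pairs that look like hardcoded
    credentials; a CI wrapper would fail the build if any are found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in CREDENTIAL_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

Wiring this (or the equivalent Semgrep rule) into the pipeline as a blocking step is what turns the root cause analysis into a control, rather than a document.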
Measuring incident response performance over time identifies where the process is working and where investment is needed. Track these six metrics for every incident and review trends quarterly:
| Metric | Definition | Target Benchmark |
|---|---|---|
| Mean Time to Detect (MTTD) | Time from incident start to detection by the security team | Under 4 hours for high-severity; under 24 hours for medium |
| Mean Time to Respond (MTTR) | Time from detection to initial containment action | Under 1 hour for critical; under 4 hours for high |
| Mean Time to Contain | Time from detection to confirmed containment of the threat | Under 8 hours for critical; under 24 hours for high |
| Mean Time to Recovery | Time from detection to full restoration of normal operations | Defined per incident type in the playbook |
| Fleet Patch Velocity | Percentage of affected devices receiving the security patch within 72 hours of availability | Above 95% within 7 days for critical vulnerabilities |
| Recurrence Rate | Percentage of incident types that recur within 6 months | Zero recurrence of same root cause |
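The time-based metrics in the table are straightforward to compute automatically from incident records. A sketch, assuming each record carries `started`, `detected` and `responded` timestamps (field names are hypothetical):

```python
from datetime import timedelta

def incident_metrics(incidents):
    """Compute MTTD (incident start to detection) and MTTR (detection to
    initial containment action), as defined in the metrics table, from a
    list of incident records with datetime fields."""
    def mean(deltas):
        return sum(deltas, timedelta()) / len(deltas)
    return {
        "mttd": mean([i["detected"] - i["started"] for i in incidents]),
        "mttr": mean([i["responded"] - i["detected"] for i in incidents]),
    }
```

Computing the numbers per severity band and per quarter, rather than fleet-wide, is what makes the trend review actionable against the targets in the table.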
Threat detection and incident response for embedded devices is not the same problem as IT security operations, but the underlying discipline is identical: you need visibility into what is happening on your devices, the ability to recognise when something is wrong, a practised process for responding effectively and the analytical rigour to prevent the same thing from happening again. Tamper-evident logging solves the visibility problem within the constraints of embedded storage. Behavioural baselines and the three detection approaches solve the recognition problem. The six-phase incident response lifecycle and scenario-specific playbooks solve the response problem. Root cause analysis and the measurement of detection and response metrics solve the recurrence problem. Each piece is individually achievable with the tools and techniques described here. Together they give you the confidence that when an attack occurs (and for internet-connected embedded devices it is a question of when, not whether), you will know about it, know what to do about it, and be able to demonstrate to customers and regulators that you handled it correctly.






