
Detecting and responding to attacks on embedded devices is a fundamentally different discipline from IT security operations. Embedded devices cannot run endpoint detection agents, have kilobytes rather than gigabytes of storage for logs, operate for years between maintenance windows, and often lack a persistent management interface that an incident responder can reach remotely. Yet the same core principles apply: without logs, you cannot detect an attack; without a response plan, you cannot contain one; and without post-incident analysis, the same vulnerability will be exploited again. This article builds the complete embedded threat detection and incident response capability: designing tamper-evident on-device logging under storage constraints, establishing behavioral baselines and identifying the anomaly patterns that indicate compromise, building and running a six-phase incident response lifecycle, writing scenario-specific playbooks for the attack types most likely to target embedded devices, and conducting the post-incident checks and root cause analysis that turn a security incident into a permanent improvement.
You cannot detect, investigate or recover from an attack you have no record of. This is true in all of computing, but it is particularly acute for embedded devices because the attack itself often leaves no visible trace in hardware state: a command injected over MQTT, a firmware downgrade installed via an open OTA endpoint, or authentication credentials extracted through a debug interface all leave the device functioning normally while the attacker operates with full access.
Logs serve three functions in the context of embedded security. First, they enable real-time detection: a rule that fires when five consecutive authentication failures appear in the log within sixty seconds is the trigger for an automated lockout or alert. Second, they enable forensic reconstruction: after an incident, the log provides the timeline of what happened, in what sequence and from which source. Without a log, the investigation starts from zero. Third, they provide compliance evidence: regulations including the EU Cyber Resilience Act, IEC 62443 for industrial systems and FDA guidance for medical devices all require that security-relevant events be recorded and that the records be available for audit.
The challenge for embedded devices is resource constraint. A desktop SIEM (Security Information and Event Management) system can store gigabytes of logs and run megabytes of detection logic. A microcontroller running bare-metal firmware may have 256 KB of total flash and no network stack capable of forwarding events in real time. Effective embedded logging requires careful decisions about what events to record, how much space to allocate to log storage, how to protect the log from tampering, and when and how to forward events off-device.
Log every event that would be relevant to a security investigation or that indicates an attempt to violate a security control. The six categories of security-relevant events for embedded devices:
Authentication events. Every authentication attempt: the timestamp, the authentication method (certificate, password, challenge-response), the outcome (success or failure), the source identifier (IP address, device ID, MAC address where applicable) and the account or certificate involved. Failed authentication attempts are more forensically valuable than successful ones: a pattern of failures from a single source is an active brute-force attack in progress. A pattern of failures followed immediately by a success may indicate a successful credential guessing attack.
Authorisation failures. Every attempt to access a resource or execute an action that the authenticated identity does not have permission for. An authenticated device attempting to subscribe to MQTT topics outside its authorised namespace, or a valid session attempting to invoke the firmware update API without the required privilege level, both indicate either a misconfigured device or an attacker operating with stolen credentials trying to escalate access.
Configuration changes. Any modification to security-relevant configuration: changes to network endpoints, authentication credentials, TLS certificate stores, firewall rules, access control lists, or the security settings of the device itself (RDP level, flash protection, JTAG state). Configuration changes made outside the normal update process are a high-fidelity indicator of compromise.
Firmware update events. Every firmware update attempt: the initiating source, the firmware version and image hash, the signature verification outcome, the installation outcome, and the post-update firmware version. Log both successful and rejected updates. A stream of rejected firmware update attempts from the same source indicates an attacker attempting to install malicious firmware. A successful update from an unexpected source at an unexpected time warrants immediate investigation.
Network connection events. New inbound connections (source IP, port, protocol), outbound connections to unexpected destinations, and connection failures (certificate validation failures, refused connections). For resource-constrained devices that cannot log every packet, log connection establishment and termination events rather than per-packet events. The connection log is sufficient to identify unexpected communication partners.
System and hardware anomalies. Unexpected resets (especially watchdog resets, which may indicate fault injection or a hung attack loop), stack overflow faults, memory allocation failures, cryptographic operation failures, and clock or voltage anomalies where hardware monitoring is in place. These events are individually ambiguous but collectively form a pattern that distinguishes hardware failure from active attack.
Every log entry should contain a fixed set of fields that enable both real-time detection and post-incident reconstruction: a sequence number, a timestamp, the event type and outcome, a source identifier and event-specific detail, all protected by a per-entry HMAC:
/* Structured security log entry for embedded firmware.
Stored in a circular buffer in a dedicated flash sector.
Forwarded to the cloud backend over MQTT TLS when connectivity is available.
Each entry is HMAC-SHA256 authenticated (see tamper-evident section below). */
#include <stdint.h> /* uint32_t */
#include <string.h> /* memcpy */
#include <stddef.h> /* offsetof, size_t */
#include "hmac_sha256.h" /* Platform HMAC implementation */
/* Security event types - extend as needed for your device */
typedef enum {
SEC_EVENT_AUTH_SUCCESS = 0x01,
SEC_EVENT_AUTH_FAILURE = 0x02,
SEC_EVENT_AUTHZ_DENIED = 0x03,
SEC_EVENT_CONFIG_CHANGED = 0x04,
SEC_EVENT_FW_UPDATE_ATTEMPT = 0x05,
SEC_EVENT_FW_UPDATE_SUCCESS = 0x06,
SEC_EVENT_FW_UPDATE_REJECTED = 0x07,
SEC_EVENT_CONN_NEW = 0x08,
SEC_EVENT_CONN_FAILED = 0x09,
SEC_EVENT_TAMPER_DETECTED = 0x0A,
SEC_EVENT_WATCHDOG_RESET = 0x0B,
SEC_EVENT_CRYPTO_FAILURE = 0x0C,
SEC_EVENT_CERT_PIN_MISMATCH = 0x0D,
SEC_EVENT_BOOT_INTEGRITY_FAIL = 0x0E
} SecurityEventType;
/* Outcome codes for events that have a success/failure dimension */
typedef enum {
OUTCOME_SUCCESS = 0,
OUTCOME_FAILURE = 1,
OUTCOME_ERROR = 2
} EventOutcome;
/* Log entry structure: 50 bytes, packed for efficient flash storage.
Fields are sized to minimise storage while retaining all forensically
relevant information. Event type and outcome are stored as single
bytes, since enum-typed fields would be int-sized on most compilers.
The HMAC covers all fields except itself. */
typedef struct __attribute__((packed)) {
uint32_t sequence; /* Monotonically increasing, never resets */
uint32_t timestamp_sec; /* Unix timestamp from RTC */
uint8_t event_type; /* SecurityEventType value */
uint8_t outcome; /* EventOutcome value */
uint8_t source_id[6]; /* Source MAC, device ID or IP bytes */
uint8_t detail[18]; /* Event-specific detail (version, topic) */
uint8_t hmac[16]; /* Truncated HMAC-SHA256 (first 16 bytes) */
} SecurityLogEntry; /* Total: 4+4+1+1+6+18+16 = 50 bytes, no padding */
/* Assumed to be defined elsewhere in the firmware: the monotonic
sequence counter and the log signing key loaded from encrypted NVS. */
extern uint32_t g_log_sequence;
extern const uint8_t g_log_signing_key[];
/* Log a security event to the circular buffer and queue for forwarding */
void log_security_event(SecurityEventType event_type,
EventOutcome outcome,
const uint8_t *source_id,
const uint8_t *detail,
size_t detail_len) {
SecurityLogEntry entry = {0};
entry.sequence = atomic_increment_get(&g_log_sequence);
entry.timestamp_sec = rtc_get_unix_timestamp();
entry.event_type = event_type;
entry.outcome = outcome;
if (source_id != NULL) {
memcpy(entry.source_id, source_id, sizeof(entry.source_id));
}
if (detail != NULL && detail_len > 0) {
size_t copy_len = (detail_len < sizeof(entry.detail))
? detail_len : sizeof(entry.detail);
memcpy(entry.detail, detail, copy_len);
}
/* Compute HMAC over all fields except the HMAC field itself.
Key is the device's log signing key, stored in encrypted NVS. */
hmac_sha256_truncated(
g_log_signing_key, LOG_SIGNING_KEY_LEN,
(const uint8_t *)&entry,
offsetof(SecurityLogEntry, hmac),
entry.hmac,
sizeof(entry.hmac)
);
/* Write to flash circular buffer and enqueue for MQTT forwarding */
flash_log_write(&entry, sizeof(entry));
mqtt_log_queue_push(&entry, sizeof(entry));
}
Log files are often less well protected than the data they describe. A log that records the plaintext password from a failed authentication attempt stores the credential in a location that may be accessible to an attacker who compromises the logging subsystem, may be transmitted to a SIEM with weaker access controls than the device itself, and may be retained for years beyond the session's lifetime. Never log plaintext passwords or passphrases, cryptographic keys or other key material, session tokens or API keys, or personal data beyond the minimum identifiers an investigation requires.
Embedded devices face a fundamental tension between log completeness and storage capacity. A 2 MB dedicated log partition holding roughly 50-byte log entries stores approximately 40,000 events. At ten security events per day under normal operation, that represents over ten years of capacity. Under active attack (thousands of failed authentication attempts per minute), the same partition fills in under 30 minutes.
Three design decisions manage this tension:
Circular buffer storage. Implement log storage as a circular buffer in a dedicated flash sector. When the buffer is full, new entries overwrite the oldest ones. This guarantees that recent events, which matter most for detection, are always preserved at the cost of the oldest entries, which matter mainly for historical audit. The sequence number field in each entry ensures that a reader can detect wrapping and reconstruct the correct chronological order.
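The wrap-and-reorder logic can be sketched in a few lines. This is a minimal in-RAM model under stated assumptions: `LogRing`, `ring_append` and `ring_oldest` are illustrative names, and real firmware would program flash pages and respect erase-sector granularity rather than writing an array.

```c
#include <stdint.h>

#define LOG_CAPACITY 8 /* entries per sector; illustrative */

/* In-RAM model of the circular log: each slot records the sequence
   number of the entry stored there, 0 = empty slot. */
typedef struct {
    uint32_t sequence[LOG_CAPACITY];
    uint32_t next_seq; /* last sequence number assigned */
} LogRing;

/* Append an entry: the slot is derived from the sequence number,
   so the newest entry always overwrites the oldest after a wrap. */
void ring_append(LogRing *r) {
    r->next_seq++;
    r->sequence[(r->next_seq - 1) % LOG_CAPACITY] = r->next_seq;
}

/* After a wrap, chronological order is recovered by starting from the
   slot with the smallest non-zero sequence number. Returns -1 if empty. */
int ring_oldest(const LogRing *r) {
    int oldest = -1;
    for (int i = 0; i < LOG_CAPACITY; i++) {
        if (r->sequence[i] == 0) continue;
        if (oldest < 0 || r->sequence[i] < r->sequence[oldest]) oldest = i;
    }
    return oldest;
}
```

After ten appends into an eight-slot ring, entries with sequence numbers 1 and 2 have been overwritten and the reader resumes at sequence 3.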
Wear levelling. Flash memory has a finite write cycle count, typically 10,000 to 100,000 program/erase cycles per sector. Writing a new log entry to the same address on every event would exhaust a flash sector rapidly. Use a dedicated logging library that distributes writes across multiple pages within the log sector, or use an external EEPROM with higher write endurance for the log partition. FRAM (ferroelectric RAM) provides effectively unlimited write cycles and is available in I2C packages suitable for embedded designs where high write frequency is anticipated.
Rate limiting and deduplication. Under active attack, the log may receive thousands of identical events (authentication failures from the same source). Rate-limit identical event types: after ten consecutive authentication failures from the same source within 60 seconds, log a single "AUTH_BURST_FAILURE" event with the count rather than thousands of individual entries. This preserves log space for detection and retains the forensically relevant information (that a burst occurred, from which source, over what time window) without filling the log partition with duplicate entries.
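A rate limiter of this shape can be sketched as follows. The `BurstFilter` type and its window/threshold constants are illustrative assumptions, not a fixed API: the filter admits the first N identical events per window and reports how many it absorbed, so the caller can emit one aggregate burst entry carrying the count.

```c
#include <stdint.h>
#include <stdbool.h>

#define BURST_WINDOW_SEC 60
#define BURST_THRESHOLD 10 /* identical events admitted per window */

typedef struct {
    uint32_t window_start; /* Unix time when the current window opened */
    uint32_t count;        /* identical events seen in the window */
} BurstFilter;

/* Returns true if this event should be written to the log as-is,
   false when it is absorbed into the burst counter. When a window
   expires, *suppressed reports how many events were absorbed so the
   caller can emit a single aggregate entry with the count. */
bool burst_filter_admit(BurstFilter *f, uint32_t now, uint32_t *suppressed) {
    *suppressed = 0;
    if (now - f->window_start > BURST_WINDOW_SEC) {
        if (f->count > BURST_THRESHOLD)
            *suppressed = f->count - BURST_THRESHOLD;
        f->window_start = now;
        f->count = 0;
    }
    f->count++;
    return f->count <= BURST_THRESHOLD;
}
```

Keyed per source (as in the authentication tracker shown later), this keeps the log partition usable even under a sustained credential-stuffing burst.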
A local log file that an attacker can modify after the fact is worse than no log: it provides a false alibi. Tamper-evident logging uses cryptographic techniques to ensure that any modification to or deletion of a log entry is detectable even without a real-time connection to a remote log server.
The two standard techniques are:
Per-entry HMAC: Each log entry includes an HMAC computed over its content using a device-unique secret key (stored in the secure element or encrypted NVS). An entry that has been modified will fail HMAC verification. This is the approach shown in the log entry structure above. It detects modification of individual entries but does not detect deletion of an entire range of entries.
Hash chaining: In addition to the per-entry HMAC, each entry includes the hash of the previous entry. This creates a chain where tampering with any entry breaks all subsequent chain verifications, and deleting entries from the middle of the log is detectable because the chain link from the entry before the deletion to the entry after it will not verify. The sequence number field reinforces this: gaps in sequence numbers indicate deleted entries even if the hash chain is not broken.
/* Hash-chained log entry verification.
Called during log export or audit to verify the entire log chain.
Returns the index of the first entry that fails verification,
or LOG_VERIFY_OK if all entries pass. */
#define LOG_VERIFY_OK UINT32_MAX
#define PREVIOUS_HASH_LEN 32
typedef struct __attribute__((packed)) {
SecurityLogEntry entry;
uint8_t prev_hash[PREVIOUS_HASH_LEN]; /* SHA-256 of previous entry */
} ChainedLogEntry;
uint32_t verify_log_chain(const ChainedLogEntry *entries, uint32_t count) {
uint8_t computed_prev_hash[PREVIOUS_HASH_LEN] = {0}; /* Genesis block: all zeros */
uint8_t entry_hash[PREVIOUS_HASH_LEN];
for (uint32_t i = 0; i < count; i++) {
/* Step 1: Verify that this entry's prev_hash matches the hash of the
previous entry (computed_prev_hash from the last iteration).
Note: constant_time_memcmp() is assumed to return nonzero (true)
when the buffers are equal - the opposite of memcmp's convention. */
if (!constant_time_memcmp(entries[i].prev_hash,
computed_prev_hash,
PREVIOUS_HASH_LEN)) {
return i; /* Chain broken at entry i: deletion or modification detected */
}
/* Step 2: Verify the per-entry HMAC over the entry content */
uint8_t expected_hmac[16];
hmac_sha256_truncated(
g_log_signing_key, LOG_SIGNING_KEY_LEN,
(const uint8_t *)&entries[i].entry,
offsetof(SecurityLogEntry, hmac),
expected_hmac, sizeof(expected_hmac)
);
if (!constant_time_memcmp(entries[i].entry.hmac,
expected_hmac, sizeof(expected_hmac))) {
return i; /* Entry i has been tampered with */
}
/* Step 3: Compute the hash of the current entry to use as prev_hash
for the next entry verification. */
sha256((const uint8_t *)&entries[i].entry,
sizeof(SecurityLogEntry),
computed_prev_hash);
}
return LOG_VERIFY_OK;
}
On-device logs are a last resort: they exist for devices that lose network connectivity or for post-incident forensics when remote logging was not available. The primary log store for any internet-connected embedded device should be a remote SIEM or log aggregation service. Remote logs cannot be tampered with by compromising the device, are retained beyond the device's local storage capacity, and can be correlated across hundreds or thousands of devices to detect fleet-wide attack patterns.
Forward security events to the remote log server over the same authenticated TLS channel used for telemetry data. Use MQTT QoS 1 for log message delivery: this provides at-least-once delivery semantics, ensuring that events are not silently lost during brief connectivity interruptions. Buffer events locally (in the on-device circular buffer) during connectivity loss and forward the buffered events on reconnection, preserving the sequence number and timestamp from when the event occurred.
/* MQTT log forwarding with QoS 1 and local buffering on disconnect.
Log entries are published to a device-specific secure log topic.
The broker's topic ACL restricts each device to publishing to its
own log topic only (as shown in the Section 6 MQTT ACL example). */
#define LOG_TOPIC_FMT "devices/%s/security-log"
#define LOG_TOPIC_MAXLEN 64
/* Publish a log entry to the remote SIEM via MQTT.
If not connected, the entry is buffered locally and will be
published when connectivity is restored. */
void mqtt_log_forward(const SecurityLogEntry *entry) {
if (entry == NULL) return;
/* Serialise the entry to JSON for SIEM compatibility.
In production, use a proper JSON serialisation library.
Field names should match your SIEM's expected schema. */
char payload[256];
int payload_len = snprintf(payload, sizeof(payload),
"{"
"\"seq\":%"PRIu32","
"\"ts\":%"PRIu32","
"\"type\":%d,"
"\"outcome\":%d,"
"\"src\":\"%02x:%02x:%02x:%02x:%02x:%02x\""
"}",
entry->sequence,
entry->timestamp_sec,
(int)entry->event_type,
(int)entry->outcome,
entry->source_id[0], entry->source_id[1], entry->source_id[2],
entry->source_id[3], entry->source_id[4], entry->source_id[5]
);
/* Check for snprintf truncation (payload_len >= sizeof(payload)) */
if (payload_len < 0 || payload_len >= (int)sizeof(payload)) {
/* Log a truncation event locally; do not forward the malformed payload */
log_internal_error(ERR_LOG_PAYLOAD_TRUNCATED);
return;
}
char topic[LOG_TOPIC_MAXLEN];
snprintf(topic, sizeof(topic), LOG_TOPIC_FMT, g_device_id);
/* QoS 1: broker acknowledges delivery; message is retained in local
buffer until acknowledgement received */
MQTTMessage msg = {
.qos = QOS1,
.retained = 0,
.payload = payload,
.payloadlen = (size_t)payload_len
};
int rc = MQTTPublish(&g_mqtt_client, topic, &msg);
if (rc != SUCCESS) {
/* Push to local buffer for retry on reconnection */
local_log_buffer_push(entry);
}
}
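The reconnection path deserves the same care as the publish path. The sketch below factors the drain loop out behind an injected publish callback (a hypothetical `publish_fn_t`; in firmware it would wrap `MQTTPublish` on the live client), so chronological order and at-least-once delivery survive a broker that drops again mid-drain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Publish callback: returns true on broker acknowledgement. Injected
   so the drain logic is testable without a broker. */
typedef bool (*publish_fn_t)(const void *entry, void *ctx);

/* Forward buffered entries oldest-first. Stops at the first failure so
   ordering is preserved across retries. Returns how many entries were
   sent; the caller removes exactly that many from the buffer front. */
size_t drain_log_buffer(const void **entries, size_t count,
                        publish_fn_t publish, void *ctx) {
    size_t sent = 0;
    for (size_t i = 0; i < count; i++) {
        if (!publish(entries[i], ctx))
            break; /* broker unreachable again: retry remainder later */
        sent++;
    }
    return sent;
}

/* Example stub: acknowledges the first fail_after publishes, then fails. */
typedef struct { size_t calls, fail_after; } StubCtx;
bool stub_publish(const void *entry, void *ctx) {
    (void)entry;
    StubCtx *s = (StubCtx *)ctx;
    return s->calls++ < s->fail_after;
}
```

Stopping at the first failure, rather than skipping ahead, is what keeps the sequence numbers in the SIEM monotonic and makes gaps meaningful during audit.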
Behavioral monitoring detects threats that do not appear as a single anomalous event but as a pattern of activity that deviates from normal device operation. A single authentication failure is noise. One hundred authentication failures in ten seconds from a single IP address is an active attack. A device that normally uploads 2 KB of telemetry per minute suddenly uploading 500 KB per minute has likely been compromised and is exfiltrating data.
Effective behavioral monitoring requires a baseline: a documented characterisation of what normal device behaviour looks like. Without a baseline, you cannot define what "anomalous" means for your device. Establish the baseline during controlled testing before deployment; trying to derive it afterwards, once uncharacterised real-world variation is mixed in, makes normal and anomalous much harder to separate.
The baseline for a typical sensor IoT device should document: the expected communication endpoints (IP addresses, hostnames, ports and protocols), the normal frequency and volume of uploads and downloads, typical CPU and memory utilisation, the expected reset rate, the set of sources that legitimately authenticate to the device, and the scheduled maintenance and update windows.
The following behavioral indicators, individually or in combination, indicate a device that warrants investigation:
| Indicator | What It May Indicate | Initial Response |
|---|---|---|
| 5+ authentication failures in 60 seconds from the same source | Active brute-force attack on credentials | Temporary block of source IP; alert to security team |
| Successful authentication from a previously unseen source IP | Credential theft and use from a different location | Alert; request re-authentication from known device |
| Outbound connection to an IP not in the baseline destination list | C2 (command and control) callback, data exfiltration | Block at network level; isolate device; investigate |
| Upload volume 10x above normal baseline | Data exfiltration of stored sensor data or credentials | Network isolation; credential rotation |
| Firmware version after update does not match a known release | Malicious firmware installed via compromised OTA path | Immediate network isolation; reflash from known-good image |
| Watchdog resets more than 3x the normal weekly rate | Active fault injection, software instability post-compromise | Log correlation; check for firmware integrity; physical inspection |
| Tamper switch triggered | Physical access to device enclosure | Tamper response sequence: key zeroization, lockdown, alert |
| Configuration change outside maintenance window | Unauthorised remote access, compromised management channel | Revert configuration; revoke active sessions; investigate |
Anomalies in embedded device behaviour fall into six categories. Recognising the category helps narrow down the cause and the appropriate initial response.
Point anomalies are single events that are statistically unusual: one authentication failure is a point anomaly relative to a device that normally has zero. They are low signal individually but high signal when they occur on a device that normally never generates them.
Contextual anomalies are events that would be normal in one context but are unusual in another. A firmware update is a normal event; a firmware update at 3 AM on a Saturday, when the maintenance window is Tuesday mornings, is a contextual anomaly. Access from an IP address in a country where no authorised users are located is contextually anomalous even if the authentication itself succeeds.
Collective anomalies are groups of events that individually appear normal but form a suspicious pattern collectively. Five authentication failures from five different IP addresses over five hours would each appear normal individually, but the pattern of distributed, slow authentication testing is a distributed credential guessing attack.
Temporal anomalies are events that occur at the wrong time relative to expected patterns: network activity during scheduled off-hours, authentication events after a device has been officially decommissioned, or a firmware update initiated between expected release cycle dates.
Spatial anomalies involve unexpected sources or destinations: connections from geographic regions inconsistent with the device's deployment location, or outbound connections to IP ranges associated with known malicious infrastructure.
Behavioral anomalies involve deviation from the device's characteristic interaction pattern: a temperature sensor that suddenly begins making HTTPS requests to an external API, a device that begins initiating connections to other devices on the local network when it has never done so before, or a device that begins consuming significantly more CPU cycles than the baseline for its application.
Three complementary detection approaches cover the space of embedded threat detection. Using all three in combination minimises both false negatives (attacks that go undetected) and false positives (legitimate activity flagged as suspicious).
The simplest and most predictable approach: define an upper or lower bound for a metric, and generate an alert when the bound is crossed. Authentication failures exceeding N per minute, CPU utilisation exceeding X percent for more than Y minutes, upload volume exceeding Z bytes per hour. Threshold detection is deterministic, has zero computational overhead beyond the comparison itself, and is appropriate for bare-metal firmware where no complex analytics are possible. The limitation is that sophisticated attackers operate just below the threshold: five failed attempts per minute if the threshold is ten.
/* Threshold-based authentication failure detector.
Counts failures per source over a sliding 60-second window.
When threshold exceeded, locks out the source and logs the event. */
#define AUTH_FAIL_WINDOW_SEC 60
#define AUTH_FAIL_THRESHOLD 5
#define MAX_TRACKED_SOURCES 16
typedef struct {
uint8_t source_id[6]; /* Source MAC or IP bytes */
uint32_t window_start; /* Start of current counting window (Unix time) */
uint16_t failure_count; /* Failures in current window */
bool locked_out; /* Whether this source is currently blocked */
} AuthFailTracker;
static AuthFailTracker g_auth_trackers[MAX_TRACKED_SOURCES];
/* Call this on every authentication failure.
Returns true if the source should be blocked. */
bool auth_failure_detector(const uint8_t *source_id, uint32_t current_time) {
AuthFailTracker *tracker = NULL;
AuthFailTracker *oldest = NULL;
/* Find existing tracker for this source, or the oldest slot to reuse */
for (int i = 0; i < MAX_TRACKED_SOURCES; i++) {
if (memcmp(g_auth_trackers[i].source_id, source_id, 6) == 0) {
tracker = &g_auth_trackers[i];
break;
}
if (oldest == NULL ||
g_auth_trackers[i].window_start < oldest->window_start) {
oldest = &g_auth_trackers[i];
}
}
/* Create a new entry if this source is not being tracked yet */
if (tracker == NULL) {
tracker = oldest;
memcpy(tracker->source_id, source_id, 6);
tracker->window_start = current_time;
tracker->failure_count = 0;
tracker->locked_out = false;
}
/* Reset the window if it has expired */
if ((current_time - tracker->window_start) > AUTH_FAIL_WINDOW_SEC) {
tracker->window_start = current_time;
tracker->failure_count = 0;
tracker->locked_out = false;
}
tracker->failure_count++;
if (tracker->failure_count >= AUTH_FAIL_THRESHOLD) {
if (!tracker->locked_out) {
/* First time reaching threshold: log the burst event */
log_security_event(SEC_EVENT_AUTH_FAILURE,
OUTCOME_FAILURE,
source_id,
(const uint8_t *)&tracker->failure_count,
sizeof(tracker->failure_count));
tracker->locked_out = true;
}
return true; /* Block this source */
}
return false;
}
Statistical detection compares current behaviour against a characterised baseline and alerts when the current value deviates beyond a defined number of standard deviations from the mean. On embedded Linux devices with sufficient resources, this can be implemented using exponential moving averages: lightweight enough to run continuously without significant CPU overhead, adaptive to gradual legitimate changes in device behaviour, and sensitive to sudden deviations.
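A minimal sketch of such a detector follows, under the assumption that floating point is available (on FPU-less parts the same recursion works in fixed point). The mean and variance are tracked as exponentially weighted averages, and the baseline is updated only after the anomaly check, so an attack burst cannot immediately drag the baseline towards itself.

```c
#include <stdbool.h>

/* Exponentially weighted mean/variance detector. alpha controls how
   quickly the baseline adapts (0.1 = slow drift tracking); k is the
   deviation multiplier, roughly a "number of sigmas". */
typedef struct {
    float mean;  /* EWMA of the metric */
    float var;   /* EWMA of squared deviation */
    bool primed; /* first sample seeds the mean */
} EmaDetector;

bool ema_update_is_anomalous(EmaDetector *d, float sample,
                             float alpha, float k) {
    if (!d->primed) {
        d->mean = sample;
        d->var = 0.0f;
        d->primed = true;
        return false;
    }
    float dev = sample - d->mean;
    /* Floor the variance so a perfectly flat history still flags a jump */
    float v = (d->var > 1e-6f) ? d->var : 1e-6f;
    bool anomalous = dev * dev > k * k * v;
    /* Update the baseline only after the check, so an attack burst
       cannot immediately pull the baseline towards itself. */
    d->mean += alpha * dev;
    d->var = (1.0f - alpha) * (d->var + alpha * dev * dev);
    return anomalous;
}
```

Feeding a steady telemetry rate produces no alerts; a sudden jump of hundreds of times the baseline trips the detector on the first deviant sample.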
Signature detection matches observed events against known attack patterns. A known Mirai scan pattern (rapid TCP SYN to a sequence of ports), a known format string attack payload in an incoming packet, or a known malicious firmware hash are all detectable by signature. Signatures require maintenance: the signature database must be updated when new attack patterns emerge. For constrained devices, maintain a minimal set of high-confidence signatures that cover the attack types most commonly observed against your device class.
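For the constrained end of the spectrum, a signature check can be as simple as a table lookup. The hash values below are placeholders; in practice the known-bad list ships to devices alongside the rest of the signature set in firmware updates.

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define SIG_HASH_LEN 32

/* Placeholder table of known-malicious firmware image hashes;
   unspecified bytes default to zero in C initialisers. */
static const uint8_t g_known_bad[][SIG_HASH_LEN] = {
    { 0xde, 0xad, 0xbe, 0xef },
    { 0xba, 0xd1, 0x0f, 0xf1 }
};
#define SIG_COUNT (sizeof(g_known_bad) / sizeof(g_known_bad[0]))

/* Linear scan is adequate for the small, high-confidence signature
   sets recommended for constrained devices. */
bool hash_is_known_malicious(const uint8_t hash[SIG_HASH_LEN]) {
    for (size_t i = 0; i < SIG_COUNT; i++) {
        if (memcmp(hash, g_known_bad[i], SIG_HASH_LEN) == 0)
            return true;
    }
    return false;
}
```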
An alert system that generates constant false positives rapidly becomes ignored, which is worse than no alert system. Alert fatigue is a documented contributing factor in major security incidents: the signal was there, but it was buried in noise. Managing false positives is an ongoing operational task, not a one-time configuration activity.
Four practices that keep the false positive rate manageable:
Tune thresholds against real traffic: After initial deployment, monitor the alert rate over two to four weeks of normal operation without taking action on low-severity alerts. Use this period to characterise the normal variation in your metrics and set thresholds that sit above the 99th percentile of normal variation, not just above the mean.
Whitelist known-good activity: Maintenance windows, scheduled update deployments and known monitoring tools all generate activity that looks suspicious out of context. Add explicit whitelist entries for these activities with a time window and source restriction so the detection system knows to treat them as expected.
Correlate across multiple indicators before alerting: A single authentication failure from a new IP address is low confidence. The same IP failing authentication, then immediately attempting an HTTP probe of the management interface, and then appearing in an outbound connection log, is high confidence. Require two or more correlated indicators before generating a high-severity alert. Cloud-side SIEM platforms (Splunk, Elastic SIEM, AWS GuardDuty for IoT) automate this correlation across the fleet.
Review and update detection rules on a schedule: Detection rules that were accurate at deployment drift as device behaviour evolves. Review alert rates and false positive rates monthly and update rules accordingly. Retire rules that have generated no true positives in six months; they are adding noise without value.
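The correlation practice above can be sketched as a per-source indicator bitmask: each indicator class sets one bit, and a high-severity alert fires only once two distinct classes have been observed. The indicator names and the two-class bar are illustrative assumptions; a fleet-scale SIEM applies the same idea with richer rules and time windows.

```c
#include <stdint.h>
#include <stdbool.h>

/* Indicator classes as bit flags (names illustrative). */
enum {
    IND_AUTH_FAILURE  = 1u << 0,
    IND_HTTP_PROBE    = 1u << 1,
    IND_OUTBOUND_CONN = 1u << 2,
    IND_CONFIG_CHANGE = 1u << 3
};

/* Per-source state: one bit per indicator class observed. */
typedef struct { uint32_t indicators; } SourceState;

static unsigned count_bits(uint32_t v) {
    unsigned n = 0;
    while (v) { v &= v - 1; n++; }
    return n;
}

/* Record an indicator for a source. Returns true once two or more
   distinct classes have been seen: repeated events of a single class
   never cross the bar on their own, suppressing single-signal noise. */
bool correlate_indicator(SourceState *s, uint32_t indicator) {
    s->indicators |= indicator;
    return count_bits(s->indicators) >= 2;
}
```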
The incident response lifecycle for embedded device security incidents follows the same six-phase structure used in enterprise IT security, with adaptations for the specific constraints of embedded systems: limited remote access, physical deployment locations that may not be immediately reachable, and OTA update pipelines that may take days to reach all affected devices.
Preparation is done before any incident occurs. It includes writing the incident response plan, defining team roles and contact lists, setting up the logging and monitoring infrastructure, preparing the tools needed for investigation (firmware analysis environment, network capture capability, a tested recovery process), and conducting tabletop exercises where the team walks through simulated incident scenarios. A team that has never practised its incident response procedures will make costly mistakes under the time pressure of a real incident.
Detection is identifying that a security incident has occurred or is in progress. Detection sources for embedded devices include: SIEM alerts triggered by the anomaly detection rules, reports from customers or field technicians observing abnormal device behaviour, intelligence feeds reporting a vulnerability in a component you use, and routine security review finding unexpected firmware versions in the fleet inventory.
Containment is stopping the attack from spreading or causing further damage. Containment for embedded devices takes two forms. Short-term containment is immediate action to limit damage without fully restoring normal operation: revoking the credentials associated with the compromised device, blocking the device's IP address at the network firewall, or pushing an emergency configuration change that disables the exploited feature. Long-term containment is stable isolation that can be maintained while the investigation and remediation proceed: moving the compromised device fleet to a quarantine VLAN with no access to production systems, or issuing revocation for the affected device certificates.
Eradication is removing the threat from the environment. For embedded devices, eradication typically means: reflashing from a known-good firmware image (not the same version that was compromised if the compromise was through the firmware itself), rotating all credentials associated with the affected devices, and verifying that the exploit that enabled the compromise is patched in the new firmware version. If the attack was through a supply chain compromise or a compromised signing key, the scope of eradication expands to the entire production signing infrastructure.
Recovery is restoring normal operations with confidence that the threat has been removed. Recovery includes: deploying the patched firmware to the affected device fleet, restoring devices from quarantine to production network access after verifying they are running the correct firmware version, monitoring the recovered devices closely for 30 to 60 days for signs of re-compromise, and updating device credentials in the cloud backend to reflect rotated keys.
Lessons learned is analysing the incident to prevent recurrence, conducted within two weeks of incident closure while details are fresh. It produces the threat register updates, code fixes, monitoring rule improvements and process changes that close the gaps exposed by the incident. The lessons learned meeting is not a blame exercise: it is a structured process to improve the security posture of the product and the team's response capability.
A clear role assignment ensures that critical tasks are not duplicated or missed during the chaos of an active incident. For small embedded development teams, some roles will be combined; the important thing is that each function has a named owner for any given incident.
| Role | Responsibilities | Who Typically Fills It |
|---|---|---|
| Incident Commander | Coordinates the overall response, makes escalation decisions, declares incident severity, owns the timeline | Engineering lead or security lead |
| Technical Lead | Leads technical investigation: firmware analysis, log analysis, vulnerability identification, patch development | Senior firmware engineer or security engineer |
| Communications Lead | Manages notifications to customers, management, regulators and (if required) law enforcement; drafts public disclosures | Product manager or engineering manager |
| Documentation Lead | Maintains the incident timeline in real time; records every action taken, every finding, every decision made | Rotating assignment; any team member not on critical path |
| Legal and Compliance | Advises on regulatory notification obligations, handles evidence preservation for potential law enforcement referral | Legal counsel (internal or external) |
A playbook is a step-by-step procedure for responding to a specific incident type. Writing playbooks in advance means that during an active incident, the responder is executing a tested procedure rather than improvising under pressure. The four playbooks most relevant to embedded IoT devices:
```
PLAYBOOK: Compromised Device Credentials
Severity: High
Estimated response time: 2-4 hours
Trigger: Authentication to the cloud backend from an unexpected
geographic location or source; multiple simultaneous active sessions
for a device that should have only one.

IMMEDIATE ACTIONS (within 30 minutes):
1. Revoke the device certificate in the cloud CA (marks it as invalid;
   cloud backend will refuse new connections using this certificate).
2. Block the device's current source IP at the network edge firewall.
3. Revoke all active sessions associated with the device ID in the backend.
4. Log the incident with the current timestamp and indicators observed.

INVESTIGATION ACTIONS (within 4 hours):
5. Pull the device's security log from the SIEM for the 48 hours preceding
   detection. Look for: first authentication from an unexpected source,
   configuration changes, firmware update events.
6. Determine whether the credential was extracted from the device or
   obtained through another channel (phishing, insider threat, build
   pipeline exposure).
7. Check whether other devices in the fleet show the same anomaly
   (indicating a fleet-wide credential compromise vs. single-device).

REMEDIATION:
8. Issue a new device certificate via the provisioning infrastructure.
9. Deploy updated firmware if the credential extraction was via a
   firmware vulnerability.
10. Push the new certificate to the device via an emergency OTA update.
11. Restore the device to the production network after verifying that
    the new certificate is active and the old certificate is revoked.
12. Update the threat model with the credential extraction vector
    identified.
```
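The immediate actions in this playbook are good candidates for automation, so containment does not depend on a responder typing commands under pressure. A minimal Python sketch, assuming hypothetical `ca`, `firewall` and `sessions` client objects (the article does not prescribe a particular backend stack):

```python
import time

def contain_compromised_device(device_id, source_ip, ca, firewall,
                               sessions, audit_log):
    """Execute the immediate containment steps for a compromised-credential
    incident. The ca, firewall and sessions collaborators are hypothetical
    interfaces; substitute your cloud CA, edge firewall and session store."""
    actions = []
    ca.revoke_certificate(device_id)   # step 1: backend refuses new connections
    actions.append("certificate_revoked")
    firewall.block_ip(source_ip)       # step 2: cut the attacker's current path
    actions.append("source_ip_blocked")
    sessions.revoke_all(device_id)     # step 3: kill any live sessions
    actions.append("sessions_revoked")
    audit_log.append({                 # step 4: record what was done, and when
        "ts": time.time(),
        "device_id": device_id,
        "source_ip": source_ip,
        "actions": list(actions),
    })
    return actions
```

Injecting the collaborators keeps the routine testable against fakes, so the containment path can be exercised in drills before it is trusted in a live incident.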
```
PLAYBOOK: Malicious or Unauthorised Firmware
Severity: Critical
Estimated response time: 4-8 hours
Trigger: Firmware version in the fleet inventory does not match any
known release hash; device behaviour anomalies consistent with a
compromised agent (unexpected outbound connections, resource spike).

IMMEDIATE ACTIONS (within 15 minutes):
1. Move affected devices to a quarantine VLAN with no outbound internet
   access (limits C2 callback and data exfiltration).
2. Preserve a copy of the flash image from one affected device via the
   OTA infrastructure (for forensic analysis) before reflashing.
3. Notify the incident commander; this is a Critical severity incident.

INVESTIGATION ACTIONS (within 2 hours):
4. Analyse the unknown firmware image: binwalk extraction, strings
   analysis, comparison to the last known-good image using a binary diff.
5. Check the OTA server logs: when was the firmware image delivered?
   From which source IP was the delivery triggered?
6. Check the signing key audit logs: was the production signing key
   used? If yes, assume the signing key is compromised and escalate to
   the signing infrastructure incident response.
7. Determine scope: how many devices received the firmware?

REMEDIATION:
8. Build and sign a verified clean firmware image from a known-good
   source commit.
9. Deploy via an emergency OTA push to all affected devices.
10. Verify that the post-update firmware hash on each device matches
    the expected hash for the clean image.
11. Release devices from the quarantine VLAN after hash verification.
12. If the signing key was compromised: rotate the signing key, re-sign
    all production firmware images, and re-provision root public keys
    to all fleet devices via a separately signed configuration update.
```
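Both the trigger condition (a firmware hash that matches no known release) and remediation step 10 reduce to comparing reported hashes against a release manifest. A small illustrative sketch; the inventory and manifest shapes are assumptions:

```python
import hashlib

def image_sha256(image_bytes):
    """SHA-256 hex digest of a firmware image, as recorded in the manifest."""
    return hashlib.sha256(image_bytes).hexdigest()

def find_unknown_firmware(fleet_inventory, known_release_hashes):
    """Device IDs whose reported firmware hash matches no known release.
    This is the trigger condition for the playbook; for step 10, run it
    with only the clean image's hash as the 'known' set and expect an
    empty result before releasing devices from quarantine."""
    known = set(known_release_hashes.values())
    return sorted(dev for dev, fw_hash in fleet_inventory.items()
                  if fw_hash not in known)
```

Running the same function for both detection and post-remediation verification means the closure check is exactly as strict as the check that opened the incident.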
The remaining two playbooks follow the same structure; their triggers:

Trigger (denial of service): Fleet-wide network saturation; devices reporting connection failures; cloud backend API rate limit exhaustion.

Trigger (physical tamper): Tamper switch or conductive mesh sensor activated; device reports SEC_EVENT_TAMPER_DETECTED to the SIEM; device enters lockdown mode autonomously.
Who you notify during an incident, in what order and with what information, is as important as the technical response. Notification failures during incidents regularly produce secondary problems: customers discover a breach from a news article rather than from the company, regulatory authorities impose penalties for notification delays, or internal teams take conflicting actions because they were not informed.
Define notification triggers and timelines in advance, before the first incident forces those decisions to be made under pressure.
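One way to make those triggers executable is a severity-keyed notification matrix that incident tooling can query. The parties and deadlines below are illustrative placeholders, not regulatory advice; substitute the obligations that actually apply to your product and jurisdiction:

```python
from datetime import timedelta

# Hypothetical notification matrix: severity -> (party, deadline after
# incident declaration). Deadlines here are placeholders; real ones come
# from contracts and regulation (e.g. statutory breach-notification windows).
NOTIFICATION_MATRIX = {
    "critical": [("incident_commander",     timedelta(minutes=15)),
                 ("executive_management",   timedelta(hours=1)),
                 ("affected_customers",     timedelta(hours=24)),
                 ("regulator",              timedelta(hours=72))],
    "high":     [("incident_commander",     timedelta(minutes=30)),
                 ("engineering_management", timedelta(hours=4))],
    "medium":   [("incident_commander",     timedelta(hours=4))],
}

def notifications_due(severity, elapsed):
    """Parties whose notification deadline has already passed, given the
    time elapsed since the incident was declared."""
    return [party for party, deadline in NOTIFICATION_MATRIX.get(severity, [])
            if elapsed >= deadline]
```

Polling this from the incident tooling turns a notification failure from a silent omission into an alert the communications lead cannot miss.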
Declaring an incident resolved before confirming that the threat has been completely eliminated is a common failure mode. A device that has been reflashed but whose credential remains unrevoked, or a patch that addresses the exploited vulnerability but leaves a related vulnerability open, provides false confidence while leaving the device exposed.
The recovery verification checklist confirms, at a minimum, that the exploited vulnerability and any related vulnerabilities found during the investigation are patched, that every credential the attacker could have reached has been revoked and reissued, that the running firmware hash matches a known-good release, and that post-recovery device behaviour matches its baseline.
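To enforce such a checklist rather than eyeball it, the failure modes just described (an unrevoked credential, an unverified firmware image) can be expressed as a gate that must pass before the incident is closed. The device-record fields here are assumptions for illustration:

```python
def verify_recovery(device):
    """Illustrative recovery gate: the incident may only be closed when
    every check passes. 'device' is a hypothetical record assembled from
    the fleet inventory, the CA and the anomaly-detection backend."""
    checks = {
        "clean_firmware_hash":    device["firmware_hash"] == device["expected_hash"],
        "old_credential_revoked": device["old_cert_status"] == "revoked",
        "new_credential_active":  device["new_cert_status"] == "active",
        "no_anomalous_traffic":   device["anomaly_count_24h"] == 0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return len(failed) == 0, failed
```

Returning the list of failed checks, not just a boolean, gives the responder an immediate worklist instead of a bare rejection.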
Root cause analysis answers three questions for every security incident: how did it happen (the entry point and exploit chain), why was it not prevented (which controls failed or were absent), and how will it be prevented from recurring (the specific changes being made). A root cause analysis that does not produce actionable items with owners and deadlines is an exercise in documentation without improvement.
The five-whys technique is effective for embedded security incidents. Starting from the observable effect, ask "why" repeatedly until the underlying systemic cause is reached. The goal is to reach a cause that is within your control to change.
Example applied to a hardcoded credential discovered in production firmware: why was the credential in the firmware? A developer embedded it to simplify integration testing. Why did it reach production? Code review did not flag it, and no automated check existed. Why did no automated check exist? The static analysis configuration had no rule for credential patterns. Why was there no rule? The security requirements did not prohibit hardcoded credentials, so nobody had a reason to configure one.
The root cause is a gap in the security requirements and the static analysis configuration. The corrective action is: add a security requirement prohibiting hardcoded credentials, add a Semgrep rule that detects strings matching credential patterns in the source code, and add this check to the CI pipeline as a blocking step. This change prevents the same class of failure in all future firmware, not just the specific credential that was exposed.
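As a rough stand-in for the Semgrep rule described above, the same class of check can be sketched as a small scanner run as a blocking CI step. The patterns here are illustrative only; a production rule set needs to be broader and tuned against false positives:

```python
import re

# Illustrative credential patterns; a real rule set (e.g. the Semgrep rule
# suggested in the text) would cover more formats and languages.
CREDENTIAL_PATTERNS = [
    re.compile(r'(password|passwd|secret|api[_-]?key)\s*=\s*"[^"]+"', re.I),
    re.compile(r'-----BEGIN (RSA |EC )?PRIVATE KEY-----'),
]

def scan_source(text):
    """Return (line_number, line) pairs that look like hardcoded
    credentials; a CI wrapper would fail the build if any are found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in CREDENTIAL_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

Wiring this (or the equivalent Semgrep rule) into the pipeline as a blocking step is what turns the root cause analysis into a control, rather than a document.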
Measuring incident response performance over time identifies where the process is working and where investment is needed. Track these six metrics for every incident and review trends quarterly:
| Metric | Definition | Target Benchmark |
|---|---|---|
| Mean Time to Detect (MTTD) | Time from incident start to detection by the security team | Under 4 hours for high-severity; under 24 hours for medium |
| Mean Time to Respond (MTTR) | Time from detection to initial containment action | Under 1 hour for critical; under 4 hours for high |
| Mean Time to Contain | Time from detection to confirmed containment of the threat | Under 8 hours for critical; under 24 hours for high |
| Mean Time to Recovery | Time from detection to full restoration of normal operations | Defined per incident type in the playbook |
| Fleet Patch Velocity | Percentage of affected devices receiving the security patch within 72 hours of availability | Above 95% within 7 days for critical vulnerabilities |
| Recurrence Rate | Percentage of incident types that recur within 6 months | Zero recurrence of same root cause |
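The time-based metrics in the table are straightforward to compute automatically from incident records. A sketch, assuming each record carries `started`, `detected` and `responded` timestamps (field names are hypothetical):

```python
from datetime import timedelta

def incident_metrics(incidents):
    """Compute MTTD (incident start to detection) and MTTR (detection to
    initial containment action), as defined in the metrics table, from a
    list of incident records with datetime fields."""
    def mean(deltas):
        return sum(deltas, timedelta()) / len(deltas)
    return {
        "mttd": mean([i["detected"] - i["started"] for i in incidents]),
        "mttr": mean([i["responded"] - i["detected"] for i in incidents]),
    }
```

Computing the numbers per severity band and per quarter, rather than fleet-wide, is what makes the trend review actionable against the targets in the table.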
Threat detection and incident response for embedded devices is not the same problem as IT security operations, but the underlying discipline is identical: you need visibility into what is happening on your devices, the ability to recognise when something is wrong, a practised process for responding effectively and the analytical rigour to prevent the same thing from happening again. Tamper-evident logging solves the visibility problem within the constraints of embedded storage. Behavioural baselines and the three detection approaches solve the recognition problem. The six-phase incident response lifecycle and scenario-specific playbooks solve the response problem. Root cause analysis and the measurement of detection and response metrics solve the recurrence problem. Each piece is individually achievable with the tools and techniques described here. Together they give you the confidence that when an attack occurs (and for internet-connected embedded devices it is a question of when, not whether), you will know about it, know what to do about it, and be able to demonstrate to customers and regulators that you handled it correctly.






