Deployment, Updates and Maintenance for Embedded Devices

Muhammad | Embedded Security

The firmware update mechanism is simultaneously the most important security capability in a deployed embedded device and one of the most dangerous attack surfaces it exposes. A robust OTA (Over-the-Air) update process is how you patch vulnerabilities discovered after deployment and keep a device fleet defensible across a five-to-fifteen-year operational lifetime. A broken or insecure update process is how an attacker delivers malicious firmware to thousands of devices at once, or how a single power failure during a write operation permanently bricks a field unit with no recovery path. This article covers the complete lifecycle of deployed embedded device security: designing a secure OTA update mechanism with dual-bank failover, choosing and hardening remote management protocols, managing security risks at every stage from manufacturing through secure disposal, implementing a patch management process that works for heterogeneous embedded fleets, and building a long-term support strategy that keeps devices defensible without requiring constant re-engineering.

OTA Update Threats

Five categories of attack specifically target the firmware update process. Understanding each one is the starting point for building a system that defeats all of them:

Malicious firmware installation: An attacker delivers a firmware image containing a backdoor, a cryptominer, a botnet agent, or destructive code. This is the primary motivation for firmware signing: the device must be able to verify that the firmware image came from the authorised manufacturer before writing it to flash.

Man-in-the-middle modification: An attacker intercepts the firmware delivery channel and replaces the legitimate firmware with a malicious image, or modifies fields in the image header (such as the version number) while in transit. This is defeated by verifying the firmware signature at the device level over the complete image, not just over a header or a hash that was checked at the server.

Rollback and downgrade attack: An attacker forces the device to install an older, signed firmware version that contains a known vulnerability that was patched in the current version. All older versions are validly signed, so the device’s signature verification accepts them. The anti-rollback counter (discussed in Section 5 of this course) defeats this by rejecting any firmware image whose minimum version field is below the counter burned into OTP memory.

Bricking through interrupted update: A power failure, hardware fault or network interruption during a flash write operation corrupts the firmware image, leaving the device unable to boot. This is not an attack in the traditional sense but it has the same effect as a successful DoS: the device is permanently non-functional. Dual-bank flash layout with automatic rollback is the standard mitigation.

Downgrade through update server compromise: An attacker compromises the OTA update server and replaces the available firmware image with an older version or a malicious one. The device’s signature verification still applies: a validly signed older image will pass signature verification but fail the anti-rollback check. A malicious image will fail signature verification at the device. The signing key, stored offline in an HSM (Hardware Security Module) and never present on the update server, is the critical separation.

The Six Requirements of a Secure OTA System

A secure OTA update system must satisfy six properties simultaneously. A system missing any one of them has a meaningful security gap:

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Requirement          What It Provides                              Mechanism
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Authentication       Proof that the firmware came from the         ECDSA or RSA digital signature verified
                     authorised manufacturer                       at the device
Integrity            Confirmation that the firmware has not been   Cryptographic hash covers the entire firmware
                     modified since signing                        image; verified before installation
Confidentiality      Protection of firmware intellectual property  TLS transport encryption; optionally
                     from extraction from the OTA channel          AES-encrypted firmware image
Rollback protection  Prevention of intentional or accidental       Minimum version field in signed header;
                     downgrade to a known-vulnerable version       monotonic counter in OTP
Recovery mechanism   Restoration to a working state if an update   Dual-bank flash with automatic rollback;
                     fails partway through                         failsafe bootloader with minimal recovery mode
Authorisation        Control over which entities can initiate an   Update initiation requires authenticated session
                     update                                        with update-specific permission; out-of-band
                                                                   commands on data channel rejected
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Secure OTA Workflow: Download, Verify, Install

The secure OTA workflow has three stages that must be executed in strict sequence. Nothing is written to the active application partition, and a staged image is never marked bootable, until stage two is complete and successful.

Stage 1: Download

The device connects to the update server over mutual TLS. The server presents its certificate; the device verifies it against the pinned update server public key (not just against the root CA). The server verifies the device’s client certificate. The firmware image and its accompanying manifest (containing the image hash, version number, size, minimum version for anti-rollback, and the signature) are downloaded over this authenticated, encrypted channel into a staging area in RAM or into the inactive flash bank.

Validate size before download: if the manifest states a firmware size larger than the available flash partition, abort before downloading a single byte. This prevents a memory exhaustion DoS through a crafted manifest.
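A minimal sketch of this pre-download check follows. The `OtaManifest` layout, its field names and the `manifest_size_fits` helper are assumptions made for illustration; they do not come from any particular OTA library. In a real device the manifest signature would be verified before any of its fields are trusted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical manifest layout: field names and sizes are illustrative. */
typedef struct {
    uint32_t image_size;       /* Size of the firmware image in bytes       */
    uint32_t version;          /* Version of this image                     */
    uint32_t minimum_version;  /* Anti-rollback floor carried by the image  */
    uint8_t  image_hash[32];   /* Hash of the complete image (e.g. SHA-256) */
    uint8_t  signature[64];    /* Signature over the fields above           */
} OtaManifest;

/* Returns true only if the declared size is non-zero and fits the slot.
   Intended to run before the first byte of the image is requested. */
bool manifest_size_fits(const OtaManifest *m, uint32_t slot_size) {
    if (m == NULL) {
        return false;
    }
    if (m->image_size == 0u) {
        return false;          /* An empty image is never a valid update */
    }
    return m->image_size <= slot_size;
}
```

With the 480 KB slots used in the example layout in this article, the caller would pass `480u * 1024u` as `slot_size` and abort the update when the function returns false.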

Stage 2: Verify

Before writing anything to the active flash region, perform all of the following checks in order. Any single failure aborts the update entirely and leaves the current firmware in place:

  1. Verify the manifest signature against the provisioned root public key.
  2. Verify the firmware image hash against the hash in the signed manifest.
  3. Check the anti-rollback counter: reject if minimum_version in the manifest is below the current OTP counter value.
  4. Check the image size matches the manifest size field exactly.
  5. Check the magic number and header format for basic sanity.
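The ordering of the five checks can be sketched as a single function that returns on the first failure. This is an illustration under stated assumptions: the `OtaVerifyInput` structure, the result codes and the `OTA_HEADER_MAGIC` value are hypothetical, and the two boolean fields stand in for the outcome of real signature and hash routines, which are outside the scope of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define OTA_HEADER_MAGIC 0x4F544131u   /* Hypothetical magic value ("OTA1") */

typedef enum {
    OTA_VERIFY_OK = 0,
    OTA_VERIFY_BAD_SIGNATURE,
    OTA_VERIFY_BAD_HASH,
    OTA_VERIFY_ROLLBACK,
    OTA_VERIFY_BAD_SIZE,
    OTA_VERIFY_BAD_HEADER
} OtaVerifyResult;

/* Inputs to the check sequence. signature_valid and hash_matches are
   placeholders for the results of real cryptographic verification. */
typedef struct {
    bool     signature_valid;   /* Check 1: manifest signature            */
    bool     hash_matches;      /* Check 2: image hash equals manifest    */
    uint32_t minimum_version;   /* Check 3: value from the signed manifest */
    uint32_t otp_counter;       /*          current anti-rollback counter */
    uint32_t manifest_size;     /* Check 4: size declared in the manifest */
    uint32_t received_size;     /*          bytes actually received       */
    uint32_t header_magic;      /* Check 5: magic read from the image     */
} OtaVerifyInput;

/* Runs the checks in the documented order and stops at the first failure. */
OtaVerifyResult ota_verify(const OtaVerifyInput *in) {
    if (!in->signature_valid) {
        return OTA_VERIFY_BAD_SIGNATURE;
    }
    if (!in->hash_matches) {
        return OTA_VERIFY_BAD_HASH;
    }
    if (in->minimum_version < in->otp_counter) {
        return OTA_VERIFY_ROLLBACK;
    }
    if (in->received_size != in->manifest_size) {
        return OTA_VERIFY_BAD_SIZE;
    }
    if (in->header_magic != OTA_HEADER_MAGIC) {
        return OTA_VERIFY_BAD_HEADER;
    }
    return OTA_VERIFY_OK;
}
```

Checking the signature first means no unauthenticated field of the manifest influences any later decision, which is the reason the order matters.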

Stage 3: Install

Write the verified image to the inactive flash bank. Use a watchdog timer during the write operation: if the write hangs, the watchdog will reset the device, which will boot from the still-intact active bank. After the write completes, verify the hash of the written flash contents against the manifest hash a second time (belt-and-suspenders: catches flash write errors that the hardware did not report). Set the boot flag to the new bank and reset. On the next boot, the bootloader verifies the new image before jumping to it. If verification passes, the boot flag is confirmed and the old bank is marked as available for the next update. If verification fails, the bootloader falls back to the old bank and logs the failure.

Dual-Bank Flash Layout for Safe Updates

Dual-bank flash is the standard architecture for safe OTA updates on resource-sufficient embedded targets. The flash is divided into two equal partitions, each large enough to hold a complete firmware image. At any time, one bank is active (currently running) and one bank is inactive (available for staging an update).

Flash Memory Layout (Dual-Bank OTA Example - STM32F4, 1MB Flash)
───────────────────────────────────────────────────────────────
Address Range    Size    Region         Description
───────────────────────────────────────────────────────────────
0x08000000       32KB    Bootloader     First-stage bootloader (write-protected)
                                        Contains root public key for signature verification
                                        Contains anti-rollback counter reference
                                        Write-protected via option bytes
───────────────────────────────────────────────────────────────
0x08008000       4KB     Metadata       Boot flags, active bank indicator, version info
                                        Not write-protected (bootloader writes here)
───────────────────────────────────────────────────────────────
0x08009000       480KB   Bank A         Firmware slot A
                                        Currently active (example state)
                                        Bootloader-verified before execution
───────────────────────────────────────────────────────────────
0x08081000       480KB   Bank B         Firmware slot B
                                        Inactive: available for OTA staging
                                        Written during update, verified before swap
───────────────────────────────────────────────────────────────
0x080F9000       28KB    NVS/Config     Device configuration (encrypted)
                                        Preserved across firmware updates
───────────────────────────────────────────────────────────────
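One way to keep a layout like this consistent is to derive each region address from the previous one and let the compiler check the result. The sketch below encodes the example table with C11 `_Static_assert`; the macro and function names are hypothetical. Note that STM32F4 flash is erased in whole sectors of 16, 64 or 128 KB, so a production layout would also need every region boundary to fall on a sector boundary, which this illustrative table does not attempt.

```c
#include <stdint.h>

#define KB(n)                   ((uint32_t)(n) * 1024u)

#define LAYOUT_FLASH_BASE       0x08000000u
#define LAYOUT_FLASH_SIZE       KB(1024)

/* Each region starts where the previous one ends. */
#define BOOTLOADER_REGION_ADDR  LAYOUT_FLASH_BASE
#define BOOTLOADER_REGION_SIZE  KB(32)
#define METADATA_REGION_ADDR    (BOOTLOADER_REGION_ADDR + BOOTLOADER_REGION_SIZE)
#define METADATA_REGION_SIZE    KB(4)
#define BANK_A_REGION_ADDR      (METADATA_REGION_ADDR + METADATA_REGION_SIZE)
#define BANK_REGION_SIZE        KB(480)
#define BANK_B_REGION_ADDR      (BANK_A_REGION_ADDR + BANK_REGION_SIZE)
#define NVS_REGION_ADDR         (BANK_B_REGION_ADDR + BANK_REGION_SIZE)
#define NVS_REGION_SIZE         KB(28)

/* Compile-time checks against the addresses listed in the table. */
_Static_assert(METADATA_REGION_ADDR == 0x08008000u, "metadata address mismatch");
_Static_assert(BANK_A_REGION_ADDR   == 0x08009000u, "bank A address mismatch");
_Static_assert(BANK_B_REGION_ADDR   == 0x08081000u, "bank B address mismatch");
_Static_assert(NVS_REGION_ADDR      == 0x080F9000u, "NVS address mismatch");
_Static_assert(NVS_REGION_ADDR + NVS_REGION_SIZE ==
               LAYOUT_FLASH_BASE + LAYOUT_FLASH_SIZE,
               "regions must fill the 1 MB flash exactly");

/* Hypothetical helper: start address of bank 0 (A) or any other value (B). */
uint32_t layout_bank_start(unsigned bank) {
    return (bank == 0u) ? BANK_A_REGION_ADDR : BANK_B_REGION_ADDR;
}
```

If a region size changes and the table is not updated to match, the build fails instead of producing a bootloader that jumps to the wrong address.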
/* Bootloader bank selection logic (simplified).
   Runs immediately after hardware initialisation, before any application code.
   The selected bank is verified before the jump. If it fails verification,
   the other bank is verified and booted instead. If both banks fail,
   the bootloader enters recovery mode. */

#include <stdbool.h>
#include <stdint.h>
#include "flash_driver.h"
#include "crypto_verify.h"

typedef enum {
    BANK_A = 0,
    BANK_B = 1
} FlashBank;

typedef struct {
    FlashBank active_bank;       /* Bank currently designated as active    */
    FlashBank pending_bank;      /* Bank written but not yet confirmed      */
    bool      pending_valid;     /* True if pending bank passed verification */
    uint32_t  boot_attempt_count;/* How many times we have tried this bank  */
    uint32_t  confirmed;         /* 0xDEADBEEF when active bank is confirmed */
} BootMetadata;

#define MAX_BOOT_ATTEMPTS   3
#define CONFIRMED_MAGIC     0xDEADBEEFu
#define METADATA_ADDR       0x08008000u

static BootMetadata *meta = (BootMetadata *)METADATA_ADDR;

void bootloader_select_and_boot(void) {
    FlashBank boot_bank = meta->active_bank;

    /* If there is a pending bank awaiting trial, try it first */
    if (meta->pending_valid) {
        if (meta->boot_attempt_count < MAX_BOOT_ATTEMPTS) {
            boot_bank = meta->pending_bank;
            flash_write_word(&meta->boot_attempt_count,
                             meta->boot_attempt_count + 1);
        } else {
            /* Pending bank has failed to boot MAX_BOOT_ATTEMPTS times.
               Mark it as invalid and revert to the known-good active bank. */
            flash_write_word((uint32_t *)&meta->pending_valid, 0);
            log_boot_event(BOOT_EVENT_ROLLBACK, boot_bank);
            boot_bank = meta->active_bank;
        }
    }

    /* Verify the selected bank's signature before jumping to it */
    VerifyResult result = verify_firmware_bank(boot_bank);
    if (result != VERIFY_OK) {
        /* Verification failed: try the other bank */
        boot_bank = (boot_bank == BANK_A) ? BANK_B : BANK_A;
        result = verify_firmware_bank(boot_bank);
        if (result != VERIFY_OK) {
            /* Both banks failed: enter recovery mode */
            enter_recovery_mode();
            /* Does not return */
        }
    }

    log_boot_event(BOOT_EVENT_SUCCESS, boot_bank);
    jump_to_firmware(get_bank_start_address(boot_bank));
    /* Does not return */
}

/* Called by the application after it has run successfully for the
   confirmation period (e.g., 60 seconds of normal operation).
   Confirms the pending bank as the new active bank and clears the
   pending flag. Without this call, the next reset will retry the
   pending bank up to MAX_BOOT_ATTEMPTS times then revert. */
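void confirm_firmware_update(void) {
    if (!meta->pending_valid) {
        return;   /* No update awaiting confirmation: leave metadata untouched */
    }
    BootMetadata new_meta = *meta;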
    new_meta.active_bank       = meta->pending_bank;
    new_meta.pending_valid     = false;
    new_meta.boot_attempt_count = 0;
    new_meta.confirmed          = CONFIRMED_MAGIC;
    flash_write_block(METADATA_ADDR, &new_meta, sizeof(BootMetadata));
    log_boot_event(BOOT_EVENT_CONFIRMED, new_meta.active_bank);
}

The confirmation step is critical and commonly omitted. With it, a firmware update that causes a boot loop (the new firmware crashes on startup) is self-healing: after MAX_BOOT_ATTEMPTS failures, the bootloader automatically reverts to the previous firmware. Without it, the new firmware would be adopted as permanent on its first boot, and a crash on startup would leave no path back. The application must actively confirm that it has booted successfully and is functioning correctly before the new firmware is adopted as permanent.

OTA Implementation on ESP32

ESP-IDF provides a mature OTA subsystem that implements dual-partition OTA with automatic rollback. The partition table defines two OTA application partitions (ota_0 and ota_1) and an OTA data partition that tracks which slot is active:

/* ESP32 OTA update handler using esp_https_ota (ESP-IDF v5.x API).
   Connects to the update server over HTTPS and verifies the server
   certificate against the embedded root CA. For mutual TLS, also set
   client_cert_pem and client_key_pem in esp_http_client_config_t.
   Downloads the signed firmware image into the inactive OTA partition,
   verifies the image (ESP-IDF verifies the secure boot signature automatically
   when secure boot is enabled), and restarts the device.
   The application calls esp_ota_mark_app_valid_cancel_rollback() after
   confirming normal operation post-restart. */

#include "esp_https_ota.h"
#include "esp_ota_ops.h"
#include "esp_app_desc.h"     /* esp_app_get_description() */
#include "esp_system.h"       /* esp_restart() */
#include "esp_log.h"

static const char *TAG = "OTA";

/* Root CA certificate for the OTA update server.
   Embedded in the firmware binary; verified before any data is downloaded. */
extern const uint8_t ota_server_root_ca_pem_start[] asm("_binary_ota_server_ca_pem_start");
extern const uint8_t ota_server_root_ca_pem_end[]   asm("_binary_ota_server_ca_pem_end");

typedef enum {
    OTA_RESULT_OK,
    OTA_RESULT_SAME_VERSION,
    OTA_RESULT_VERIFY_FAILED,
    OTA_RESULT_DOWNLOAD_FAILED,
    OTA_RESULT_ROLLBACK_BLOCKED
} OtaResult;

OtaResult perform_ota_update(const char *update_url) {
    ESP_LOGI(TAG, "Starting OTA update from: %s", update_url);

    /* Configure HTTPS connection with server certificate verification */
    esp_http_client_config_t http_config = {
        .url                   = update_url,
        .cert_pem              = (const char *)ota_server_root_ca_pem_start,
        .timeout_ms            = 10000,
        .keep_alive_enable     = true,
    };

    esp_https_ota_config_t ota_config = {
        .http_config = &http_config,
    };

    /* Check current firmware version to prevent unnecessary updates */
    const esp_app_desc_t *current_desc = esp_app_get_description();
    ESP_LOGI(TAG, "Current firmware version: %s", current_desc->version);

    /* Perform OTA download and write to inactive partition.
       esp_https_ota() handles:
       - HTTPS download with server certificate verification
       - Writing to the inactive OTA partition
       - Firmware image validation (magic bytes, image header)
       - Secure boot signature verification (when secure boot is enabled) */
    esp_err_t ret = esp_https_ota(&ota_config);

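    /* ESP_ERR_OTA_SMALL_SEC_VER is the esp_ota_ops.h error for an image whose
       secure version is lower than that of the running firmware
       (anti-rollback must be enabled in the project configuration). */
    if (ret == ESP_ERR_OTA_SMALL_SEC_VER) {
        ESP_LOGW(TAG, "OTA rejected: server firmware older than current");
        return OTA_RESULT_ROLLBACK_BLOCKED;
    }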

    if (ret != ESP_OK) {
        ESP_LOGE(TAG, "OTA download/verify failed: %s", esp_err_to_name(ret));
        log_security_event(SEC_EVENT_FW_UPDATE_REJECTED, OUTCOME_FAILURE,
                           NULL, NULL, 0);
        return OTA_RESULT_DOWNLOAD_FAILED;
    }

    log_security_event(SEC_EVENT_FW_UPDATE_ATTEMPT, OUTCOME_SUCCESS,
                       NULL, NULL, 0);

    ESP_LOGI(TAG, "OTA image verified. Restarting to apply update.");
    esp_restart();

    return OTA_RESULT_OK;   /* Not reached; kept for compiler satisfaction */
}

/* Call this from the application main loop after the device has been
   running normally for the confirmation period (60-300 seconds is typical).
   With CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE set, if this is not called
   before a reset occurs, the bootloader rolls back to the previous
   firmware on the next boot. Without that option there is no automatic
   rollback. */
void confirm_successful_boot(void) {
    esp_err_t ret = esp_ota_mark_app_valid_cancel_rollback();
    if (ret == ESP_OK) {
        ESP_LOGI(TAG, "Firmware update confirmed. Rollback cancelled.");
        log_security_event(SEC_EVENT_FW_UPDATE_SUCCESS, OUTCOME_SUCCESS,
                           NULL, NULL, 0);
    } else {
        ESP_LOGE(TAG, "Failed to confirm update: %s", esp_err_to_name(ret));
    }
}

OTA Implementation on STM32 with SBSFU

STM32 OTA is most commonly implemented using the SBSFU (Secure Boot and Secure Firmware Update) reference package from ST, which provides a complete dual-image OTA system with ECDSA signature verification, anti-rollback through header version fields, and a failsafe first-stage bootloader. For bare-metal custom implementations, the dual-bank logic shown earlier in this article applies directly.

For a minimal custom OTA implementation on STM32F4 without SBSFU, the key flash operations are:

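/* Minimal OTA write-to-inactive-bank implementation for STM32F4.
   Called after full image verification passes (signature, hash, anti-rollback).
   Erases the inactive bank sectors then writes the new image word by word.
   A watchdog must be kicked during the write to prevent timeout on large images.
   Sector erase also blocks for a long time, so size the watchdog timeout
   accordingly. */

#include <stdint.h>
#include <string.h>          /* memcmp */
#include "stm32f4xx_hal.h"

/* Illustrative values. STM32F4 flash erases in whole sectors of 16, 64 or
   128 KB, so in a real design the bank must start on a sector boundary and
   the sector range below must cover the bank and nothing else. Check the
   reference manual for your exact part. */
#define BANK_B_START_ADDR    0x08081000UL
#define BANK_B_SECTOR_START  6           /* First sector of Bank B (illustrative) */
#define BANK_B_SECTOR_COUNT  6           /* Number of sectors in Bank B           */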

typedef enum {
    FLASH_WRITE_OK,
    FLASH_WRITE_ERR_ERASE,
    FLASH_WRITE_ERR_WRITE,
    FLASH_WRITE_ERR_VERIFY
} FlashWriteResult;

FlashWriteResult write_firmware_to_bank_b(const uint8_t *image, uint32_t image_len) {
    FLASH_EraseInitTypeDef erase_init;
    uint32_t sector_error;
    HAL_StatusTypeDef status;

    HAL_FLASH_Unlock();

    /* Step 1: Erase all sectors in Bank B */
    erase_init.TypeErase    = FLASH_TYPEERASE_SECTORS;
    erase_init.Sector       = BANK_B_SECTOR_START;
    erase_init.NbSectors    = BANK_B_SECTOR_COUNT;
    erase_init.VoltageRange = FLASH_VOLTAGE_RANGE_3;

    status = HAL_FLASHEx_Erase(&erase_init, &sector_error);
    if (status != HAL_OK) {
        HAL_FLASH_Lock();
        return FLASH_WRITE_ERR_ERASE;
    }

    /* Step 2: Write the new image word by word.
       Each word is assembled with memcpy so an unaligned source buffer is
       safe, and a final partial word is padded with 0xFF (the erased-flash
       value) instead of reading past the end of the image. */
    for (uint32_t offset = 0; offset < image_len; offset += 4) {
        uint32_t word = 0xFFFFFFFFu;
        uint32_t remaining = image_len - offset;
        memcpy(&word, image + offset, remaining < 4 ? remaining : 4);

        status = HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD,
                                   BANK_B_START_ADDR + offset,
                                   word);
        if (status != HAL_OK) {
            HAL_FLASH_Lock();
            return FLASH_WRITE_ERR_WRITE;
        }

        /* Kick the watchdog every 64 words to prevent reset during long write */
        if ((offset % 256) == 0) {
            watchdog_kick();
        }
    }

    HAL_FLASH_Lock();

    /* Step 3: Verify written flash matches the source image */
    if (memcmp((void *)BANK_B_START_ADDR, image, image_len) != 0) {
        return FLASH_WRITE_ERR_VERIFY;
    }

    return FLASH_WRITE_OK;
}

Delta Updates for Bandwidth-Constrained Devices

A full firmware image for a medium-complexity embedded device is typically 100 KB to 1 MB. For devices on constrained cellular (NB-IoT, LTE-M) connections billed by the megabyte, or devices on low-power LPWAN (Low-Power Wide-Area Network) links with severe bandwidth limits, transmitting a full image for every patch is prohibitively expensive. Delta updates send only the binary difference between the current firmware version and the new one, reducing payload size by 80 to 95% for typical security patch releases that change only a few functions.

The two commonly used delta update implementations in embedded firmware are:

  • bsdiff/bspatch: General binary diff/patch algorithm. Produces small diffs, but the classic bspatch holds both the old and the new image in memory while patching, so RAM use is roughly the old image size plus the new image size. Suitable for embedded Linux targets with tens of megabytes of RAM.
  • Janpatch: A stream-based binary patch library designed for microcontrollers with as little as 4 KB of RAM. It reads the old image from flash, the patch from the incoming stream, and writes the new image to the inactive bank without requiring the full new image in RAM simultaneously.

Delta updates must include the same signature, hash and anti-rollback protections as full image updates. The delta patch itself is signed; the device applies the patch to produce the new image and then verifies the resulting image's hash against the manifest before marking the new bank as pending.
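The apply-then-verify step can be sketched with a deliberately simple patch format. Everything here is an assumption for illustration: the two-opcode COPY/INSERT format is a toy, not the bsdiff or Janpatch format; FNV-1a stands in for the SHA-256 hash a real manifest would carry; and the output goes to a RAM buffer rather than the inactive flash bank. A real device would also verify the patch signature before applying it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy patch opcodes (hypothetical format):
   COPY:   0x01, 4-byte LE offset into old image, 2-byte LE length
   INSERT: 0x02, 2-byte LE length, then that many literal bytes     */
enum { DELTA_OP_COPY = 0x01, DELTA_OP_INSERT = 0x02 };

/* FNV-1a, used only as a stand-in for a cryptographic hash. */
static uint32_t fnv1a32(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Reads an n-byte little-endian unsigned value (n <= 4). */
static uint32_t read_le(const uint8_t *p, size_t n) {
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++) {
        v |= (uint32_t)p[i] << (8u * i);
    }
    return v;
}

/* Applies the patch. Returns bytes written, or -1 on a malformed patch
   or any out-of-bounds read or write. */
long delta_apply(const uint8_t *old_img, size_t old_len,
                 const uint8_t *patch, size_t patch_len,
                 uint8_t *out, size_t out_cap) {
    size_t pos = 0, written = 0;
    while (pos < patch_len) {
        uint8_t op = patch[pos++];
        if (op == DELTA_OP_COPY) {
            if (patch_len - pos < 6) return -1;
            uint32_t off = read_le(patch + pos, 4);
            uint32_t len = read_le(patch + pos + 4, 2);
            pos += 6;
            if (off > old_len || len > old_len - off) return -1;
            if (len > out_cap - written) return -1;
            memcpy(out + written, old_img + off, len);
            written += len;
        } else if (op == DELTA_OP_INSERT) {
            if (patch_len - pos < 2) return -1;
            uint32_t len = read_le(patch + pos, 2);
            pos += 2;
            if (len > patch_len - pos) return -1;
            if (len > out_cap - written) return -1;
            memcpy(out + written, patch + pos, len);
            pos += len;
            written += len;
        } else {
            return -1;
        }
    }
    return (long)written;
}

/* Applies the patch, then accepts the result only if its length and hash
   match the values taken from the (already signature-checked) manifest. */
bool delta_apply_and_verify(const uint8_t *old_img, size_t old_len,
                            const uint8_t *patch, size_t patch_len,
                            uint8_t *out, size_t out_cap,
                            size_t expected_len, uint32_t expected_hash) {
    long n = delta_apply(old_img, old_len, patch, patch_len, out, out_cap);
    if (n < 0 || (size_t)n != expected_len) {
        return false;
    }
    return fnv1a32(out, (size_t)n) == expected_hash;
}
```

The point of the final check is that the device never trusts the patching process itself: a corrupted old image or a patch built against the wrong base version produces a hash mismatch, and the new bank is never marked as pending.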

Remote Device Management

Remote management enables operators to monitor device health, push configuration changes, collect logs, perform remote diagnostics and initiate firmware updates across an entire fleet without physical access. It is also a significant attack surface: a remote management channel that can update firmware and change configuration is exactly what an attacker wants to compromise.

The security requirements for any remote management system:

  • Mutual authentication: The device verifies the management server's identity (certificate pinning) and the management server verifies the device's identity (client certificate or token). Neither party accepts management commands from an unauthenticated counterpart.
  • Encrypted transport: All management traffic runs over TLS 1.2 or 1.3, or over DTLS for UDP-based protocols such as CoAP. No management channels on plaintext protocols.
  • Authorised commands only: The device maintains a whitelist of operations the management channel is permitted to perform. A firmware update command must come through a dedicated, separately authenticated update channel, not through the same MQTT topic as telemetry configuration changes.
  • Command authentication: Individual commands are signed or carried on a session that required mutual authentication. Replay attacks on management commands are prevented by nonce or sequence number in each command.
  • Comprehensive audit log: Every management action is logged with timestamp, initiating server identity, command type and outcome. The log is forwarded to the SIEM and is tamper-evident.
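The sequence-number form of replay protection can be sketched in a few lines. The `ReplayGuard` type and function name are hypothetical. The sketch assumes the sequence number is covered by the command's signature or carried inside the authenticated session, and that the check runs only after that authentication has succeeded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tracks the highest sequence number accepted so far.
   A real device would persist this value across reboots. */
typedef struct {
    uint32_t last_accepted_seq;
} ReplayGuard;

/* Accepts a command only if its sequence number is strictly greater than
   any previously accepted one; a replayed or reordered command is refused. */
bool replay_guard_accept(ReplayGuard *g, uint32_t seq) {
    if (seq <= g->last_accepted_seq) {
        return false;
    }
    g->last_accepted_seq = seq;
    return true;
}
```

Keeping the counter in RAM only would let an attacker replay old commands after forcing a reset, which is why the last accepted value belongs in non-volatile storage.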

LwM2M: The Standard for Embedded Device Management

LwM2M (Lightweight Machine-to-Machine) is the OMA (Open Mobile Alliance) standard protocol for embedded device management. It runs over CoAP (Constrained Application Protocol) with DTLS, making it suitable for devices that cannot support the overhead of MQTT over TLS. LwM2M defines a standardised object model for common device management operations including firmware update (Object 5), device information (Object 3), connectivity monitoring (Object 4) and security credentials (Object 0).

LwM2M's firmware update object implements the full secure OTA workflow as a standardised protocol exchange:

LwM2M Firmware Update Object (ID: 5) — Resource Summary
─────────────────────────────────────────────────────────────
Resource ID  Name              Access   Description
─────────────────────────────────────────────────────────────
/5/0/0       Package           W        Firmware binary (push delivery)
/5/0/1       Package URI       W        URL for pull delivery (device downloads)
/5/0/2       Update            E        Execute to trigger installation
/5/0/3       State             R        0=Idle, 1=Downloading, 2=Downloaded, 3=Updating
/5/0/5       Update Result     R        0=Init, 1=Success, 2=Insufficient flash,
                                        3=Out of RAM, 4=Connection lost,
                                        5=Integrity check failure, 6=Unsupported pkg,
                                        7=Invalid URI, 8=Update failed, 9=Unsupported protocol
/5/0/6       PkgName           R        Firmware package name
/5/0/7       PkgVersion        R        Firmware package version
/5/0/9       Delivery Method   R        Supported delivery: pull (0), push (1), both (2)
─────────────────────────────────────────────────────────────

Update Result code 5 (Integrity check failure) is what the device
returns when the firmware signature verification fails. The server
receives this result and can alert the operations team to a potential
supply chain or delivery channel attack.
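On the device side, the State and Update Result values from the table map naturally onto two enums. The sketch below is illustrative only: the type and function names are hypothetical and not taken from Wakaama, Anjay or Zephyr, and the return to Idle after a failed download reflects the state machine as commonly described, so confirm it against the specification version you target.

```c
#include <stdbool.h>

/* Values of the State resource (/5/0/3). */
typedef enum {
    LWM2M_FW_STATE_IDLE        = 0,
    LWM2M_FW_STATE_DOWNLOADING = 1,
    LWM2M_FW_STATE_DOWNLOADED  = 2,
    LWM2M_FW_STATE_UPDATING    = 3
} Lwm2mFwState;

/* Values of the Update Result resource (/5/0/5). */
typedef enum {
    LWM2M_FW_RESULT_INITIAL              = 0,
    LWM2M_FW_RESULT_SUCCESS              = 1,
    LWM2M_FW_RESULT_NOT_ENOUGH_FLASH     = 2,
    LWM2M_FW_RESULT_OUT_OF_RAM           = 3,
    LWM2M_FW_RESULT_CONNECTION_LOST      = 4,
    LWM2M_FW_RESULT_INTEGRITY_FAILED     = 5,
    LWM2M_FW_RESULT_UNSUPPORTED_PACKAGE  = 6,
    LWM2M_FW_RESULT_INVALID_URI          = 7,
    LWM2M_FW_RESULT_UPDATE_FAILED        = 8,
    LWM2M_FW_RESULT_UNSUPPORTED_PROTOCOL = 9
} Lwm2mFwResult;

/* Maps the outcome of the local checks to the Update Result to report.
   A failed signature or hash check is reported as code 5. */
Lwm2mFwResult lwm2m_result_after_download(bool fits_in_flash, bool integrity_ok) {
    if (!fits_in_flash) {
        return LWM2M_FW_RESULT_NOT_ENOUGH_FLASH;
    }
    if (!integrity_ok) {
        return LWM2M_FW_RESULT_INTEGRITY_FAILED;
    }
    return LWM2M_FW_RESULT_INITIAL;   /* No error yet; update not executed */
}

/* Downloaded on success, back to Idle when the download was rejected. */
Lwm2mFwState lwm2m_state_after_download(Lwm2mFwResult result) {
    return (result == LWM2M_FW_RESULT_INITIAL) ? LWM2M_FW_STATE_DOWNLOADED
                                               : LWM2M_FW_STATE_IDLE;
}
```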

Popular open-source LwM2M client implementations for embedded targets include Eclipse Wakaama (C, suitable for bare-metal with CoAP stack), Anjay (C, commercial with community edition, well-maintained) and the Zephyr RTOS built-in LwM2M client.

Secure Management API Design

For devices managed through a custom REST API (common for embedded Linux gateways), apply these design principles to prevent the management API from becoming the easiest path into the device fleet:

# Example: Secure management API endpoint with authentication,
# rate limiting, input validation and audit logging.
# Using Python/FastAPI for illustration; the principles apply to any backend.

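from fastapi import FastAPI, Depends, HTTPException, Request
from fastapi.security import HTTPBearer
import time

# Helpers assumed to be defined elsewhere in the application (not shown):
# verify_jwt, has_admin_scope, validate_device_config, apply_config_to_device,
# audit_log, and the DeviceConfigUpdate Pydantic model.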

app = FastAPI()
security = HTTPBearer()

# Rate limiter: track requests per client per time window
rate_limit_store: dict = {}

def check_rate_limit(client_id: str, max_per_minute: int = 10) -> bool:
    """Return False if client has exceeded the rate limit."""
    now = time.time()
    window_start = now - 60
    
    if client_id not in rate_limit_store:
        rate_limit_store[client_id] = []
    
    # Clean up old entries outside the window
    rate_limit_store[client_id] = [
        t for t in rate_limit_store[client_id] if t > window_start
    ]
    
    if len(rate_limit_store[client_id]) >= max_per_minute:
        return False
    
    rate_limit_store[client_id].append(now)
    return True

def verify_device_token(token: str) -> str:
    """Validate the device JWT and return the device ID, or raise 401."""
    # In production: verify JWT signature, expiry and scope claims
    # using a library like python-jose with your signing key.
    # This stub returns the device ID from a valid token.
    try:
        payload = verify_jwt(token)   # Raises on invalid/expired token
        return payload["device_id"]
    except Exception:
        raise HTTPException(status_code=401, detail="Invalid or expired token")

@app.post("/api/v1/devices/{device_id}/config")
async def update_device_config(
    device_id: str,
    request: Request,
    config: DeviceConfigUpdate,
    credentials = Depends(security)
):
    """Apply a configuration update to a specific device."""
    
    # 1. Authenticate: verify the caller's token
    caller_device_id = verify_device_token(credentials.credentials)
    
    # 2. Authorise: the caller can only update their own device's config,
    #    not another device's (unless they have an admin-scoped token)
    if caller_device_id != device_id and not has_admin_scope(credentials.credentials):
        audit_log(caller_device_id, "CONFIG_UPDATE_DENIED",
                  target=device_id, reason="insufficient_scope")
        raise HTTPException(status_code=403, detail="Forbidden")
    
    # 3. Rate limit: prevent API abuse from a compromised device credential
    if not check_rate_limit(caller_device_id, max_per_minute=5):
        audit_log(caller_device_id, "RATE_LIMIT_EXCEEDED", target=device_id)
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    
    # 4. Input validation: validate config fields before applying
    if not validate_device_config(config):
        audit_log(caller_device_id, "CONFIG_VALIDATION_FAILED", target=device_id)
        raise HTTPException(status_code=422, detail="Invalid configuration")
    
    # 5. Apply the configuration change
    result = apply_config_to_device(device_id, config)
    
    # 6. Audit log: record the successful action with full context
    audit_log(
        actor=caller_device_id,
        action="CONFIG_UPDATED",
        target=device_id,
        details={"fields_changed": list(config.dict().keys())},
        source_ip=request.client.host
    )
    
    return {"status": "applied", "device_id": device_id}

Security Risks Across the Device Lifecycle

Security risks are not uniform across a device's lifetime. Each lifecycle phase introduces distinct threats that require phase-specific controls. Designing security only for the operational phase and ignoring manufacturing, deployment and end-of-life is a common gap that leads to vulnerabilities in the boundary stages.

Manufacturing Phase

The factory floor is one of the highest-risk environments for embedded device security. Risks include: supply chain attacks (counterfeit or backdoored components introduced by a malicious supplier), unauthorised device cloning (a factory worker copies the provisioning fixture output to create unauthorised devices that pass authentication checks), provisioning errors (wrong credentials loaded to the wrong device batch, or debug firmware shipped instead of production firmware) and insider threats (modified firmware or credentials inserted by factory staff with access to the provisioning station).

Controls: vendor qualification process for all component suppliers, hardware component authentication using secure elements with manufacturer-signed certificates, provisioning station access control with individual operator authentication and audit logging, automated post-provisioning verification tests that confirm the correct firmware version, credentials and security configuration, and a batch traceability system that links each device serial number to its provisioning session record.

Deployment Phase

Risks during initial installation at the deployment site: insecure default configuration left in place because the installer did not follow the hardening guide, physical tampering during installation at a site with poor physical access control, insecure network placement (device connected to the wrong VLAN or directly to the internet without the intended firewall), and documentation leaks (installation guides containing device credentials emailed to installers or left in unprotected shared drives).

Controls: automated configuration verification as part of the installation test (the device reports its security posture and the installation system rejects it if any hardening step was missed), site access control requirements in the installation specification, network connectivity verification against the intended network architecture as part of commissioning, and credential delivery through a secure channel rather than in plaintext documentation.

Operational Phase

The longest phase and the one with the most exposure to network attacks. Key risks: unpatched vulnerabilities accumulate as new CVEs are discovered in used components, configuration drift as successive maintenance changes move settings away from the hardened baseline, credential compromise through social engineering, insider threat or credential reuse, and physical access by field technicians who may not follow security procedures.

Controls: the patch management and fleet monitoring processes covered in this article, configuration drift detection by comparing periodic attestation reports against the baseline, regular credential rotation, and field service procedures that require operator authentication and log all physical access events.
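Configuration drift detection can be as simple as comparing each periodic report against the hardened baseline and flagging the fields that differ. The sketch below is a minimal illustration: the `PostureReport` fields, the drift flags and the function name are all hypothetical, and a real attestation report would be signed by the device and verified before being compared.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical security posture report sent periodically by a device. */
typedef struct {
    uint32_t firmware_version;             /* Running firmware version       */
    bool     secure_boot_enabled;
    bool     debug_port_locked;
    bool     default_credentials_removed;
    uint32_t config_hash;                  /* Hash of the configuration blob */
} PostureReport;

/* One bit per field that has drifted from the baseline. */
enum {
    DRIFT_NONE        = 0,
    DRIFT_FIRMWARE    = 1 << 0,
    DRIFT_SECURE_BOOT = 1 << 1,
    DRIFT_DEBUG_PORT  = 1 << 2,
    DRIFT_CREDENTIALS = 1 << 3,
    DRIFT_CONFIG      = 1 << 4
};

/* Returns a bitmask of fields where the report falls short of the baseline.
   Firmware newer than the baseline is not treated as drift. */
uint32_t posture_drift(const PostureReport *baseline, const PostureReport *report) {
    uint32_t drift = DRIFT_NONE;
    if (report->firmware_version < baseline->firmware_version) {
        drift |= DRIFT_FIRMWARE;
    }
    if (report->secure_boot_enabled != baseline->secure_boot_enabled) {
        drift |= DRIFT_SECURE_BOOT;
    }
    if (report->debug_port_locked != baseline->debug_port_locked) {
        drift |= DRIFT_DEBUG_PORT;
    }
    if (report->default_credentials_removed != baseline->default_credentials_removed) {
        drift |= DRIFT_CREDENTIALS;
    }
    if (report->config_hash != baseline->config_hash) {
        drift |= DRIFT_CONFIG;
    }
    return drift;
}
```

A non-zero result would typically raise an alert naming the drifted fields, so the operations team can tell an authorised maintenance change from an unexplained one.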

End-of-Life Phase

A device that has passed its end-of-life date but remains operational represents an accumulating security liability. Its firmware will no longer receive patches; known vulnerabilities will remain unaddressed; its cryptographic credentials may have expired. If it remains on the network, it provides an attacker with a persistent, unpatched foothold.

Secure Device Disposal

A device removed from service still contains credentials, encryption keys, customer data and potentially proprietary firmware. If disposed of without proper data destruction, it becomes an intelligence resource for an attacker who acquires it through secondary markets, recycling facilities or physical dumpster diving.

/* Secure device wipe procedure.
   Called before decommissioning or returning a device for repair.
   Revokes cloud credentials, then overwrites all sensitive data.
   Every step is attempted even if an earlier one fails; the return
   value reports the first failure. */

#include "nvs_flash.h"         /* ESP-IDF NVS API */
#include "esp_efuse.h"
#include "secure_storage.h"

typedef enum {
    WIPE_OK,
    WIPE_ERR_NVS,
    WIPE_ERR_KEY_DESTROY,
    WIPE_ERR_CLOUD_REVOKE
} WipeResult;

WipeResult secure_device_wipe(void) {
    WipeResult result = WIPE_OK;

    /* Step 1: Revoke the device certificate at the cloud CA.
       This runs before the local wipe because the later steps destroy
       the keys and stored credentials the device needs to reach the
       backend. Revocation marks the device's X.509 certificate as
       revoked in the CRL (Certificate Revocation List) and OCSP
       responder, preventing the certificate from being used for
       authentication even if the private key was somehow extracted
       before the wipe. */
    CloudApiResult revoke_result = cloud_api_revoke_device_certificate(
        g_device_id, REVOKE_REASON_DECOMMISSIONED
    );
    if (revoke_result != CLOUD_OK) {
        result = WIPE_ERR_CLOUD_REVOKE;
    }

    /* Step 2: Remove the device from the cloud backend device registry.
       The device will no longer be accepted even with a valid certificate
       because its registration record no longer exists.
       (Assumes this call returns CloudApiResult like the revoke call.) */
    if (cloud_api_deregister_device(g_device_id) != CLOUD_OK
            && result == WIPE_OK) {
        result = WIPE_ERR_CLOUD_REVOKE;
    }

    /* Step 3: Zeroize all key material in secure element.
       The ATECC608B has a secure key slot lock mechanism.
       Writing zeros to a key slot overwrites the stored key.
       Keys in OTP slots cannot be erased: they must be considered
       permanently invalidated at the cloud backend instead. */
    if (secure_element_zeroize_all_slots() != SE_OK) {
        log_security_event(SEC_EVENT_WIPE_PARTIAL, OUTCOME_ERROR, NULL, NULL, 0);
        if (result == WIPE_OK) {
            result = WIPE_ERR_KEY_DESTROY;
        }
    }

    /* Step 4: Erase the encrypted NVS partition.
       nvs_flash_erase() erases the default NVS partition (label "nvs")
       including all stored credentials, configuration and key material.
       With NVS encryption enabled, the erased data was already
       ciphertext; this removes the ciphertext itself. */
    esp_err_t err = nvs_flash_erase();
    if (err != ESP_OK && result == WIPE_OK) {
        result = WIPE_ERR_NVS;
    }

    /* Step 5: Log the wipe event with disposition record */
    log_security_event(
        SEC_EVENT_DEVICE_WIPED,
        (result == WIPE_OK) ? OUTCOME_SUCCESS : OUTCOME_ERROR,
        NULL, NULL, 0
    );

    return result;
}

For devices handling particularly sensitive data (medical, financial, industrial control), a software wipe alone does not meet the highest-security disposal requirements. The flash chip should be physically destroyed: degaussing does not work on flash memory, but shredding, incineration or drilling through the chip package are effective. Document the physical destruction with photographic evidence and a chain of custody record.

Patch Planning and Vulnerability Management

Embedded patch management is harder than desktop patch management for five structural reasons: device diversity (a fleet may run dozens of hardware variants, each requiring a separately tested firmware build), update logistics (devices may be physically inaccessible, on constrained networks or in environments where downtime is not acceptable), testing burden (compatibility must be verified across all hardware variants before release), OTA infrastructure dependency (a fleet that cannot receive OTA updates requires physical technician visits), and long field lives (a device shipped in 2018 may still be in service in 2028, requiring patches for a component whose vendor has long since dropped support).

A practical patch management programme for an embedded fleet:

Vulnerability Monitoring

Subscribe to the CVE feeds for every third-party component in your firmware: the RTOS, the TLS library, the MQTT client, the JSON parser, the HTTP library and every other open source dependency. The NVD (National Vulnerability Database) provides CVSS (Common Vulnerability Scoring System) scores that give a first-pass severity assessment. Also subscribe to the security advisories published by your silicon vendor (STMicroelectronics, Espressif, NXP, Nordic Semiconductor all publish security bulletins) because vulnerabilities in the SoC hardware or vendor SDK may not appear in the standard CVE feeds.

Severity Classification and Response Times

Severity | CVSS Score | Definition | Target Patch Time
Critical | 9.0–10.0 | Actively exploited; remote code execution; no authentication required | Emergency patch within 72 hours of confirmed exploitability
High | 7.0–8.9 | Significant exposure; unauthenticated remote access; data exfiltration | Patch within 14 days
Medium | 4.0–6.9 | Meaningful risk; requires authentication or special conditions | Patch within 60 days or next scheduled release
Low | 0.1–3.9 | Limited exposure; requires physical access or rare conditions | Patch in next scheduled release; document if deferred
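
The score bands and patch targets in the table above translate directly into a small triage helper. The sketch below is illustrative: the type and function names are assumptions, and the score is passed as an integer scaled by ten (7.5 becomes 75) so that band boundaries can be compared exactly without floating point.

```c
typedef enum {
    SEV_NONE,
    SEV_LOW,
    SEV_MEDIUM,
    SEV_HIGH,
    SEV_CRITICAL
} Severity;

/* Maps a CVSS base score, scaled by 10 (e.g. 9.8 -> 98), to the
   severity bands in the table. Scores above 100 are not valid CVSS
   values; this sketch treats them as critical rather than rejecting them. */
Severity classify_cvss(unsigned score_x10) {
    if (score_x10 >= 90) return SEV_CRITICAL;
    if (score_x10 >= 70) return SEV_HIGH;
    if (score_x10 >= 40) return SEV_MEDIUM;
    if (score_x10 >= 1)  return SEV_LOW;
    return SEV_NONE;
}

/* Target patch window in days. Returns -1 where the table says
   "next scheduled release" with no fixed day count. For critical
   issues the 3 days (72 hours) run from confirmed exploitability. */
int target_patch_days(Severity severity) {
    switch (severity) {
    case SEV_CRITICAL: return 3;
    case SEV_HIGH:     return 14;
    case SEV_MEDIUM:   return 60;
    default:           return -1;
    }
}
```

The CVSS score is only the first-pass input. The definitions column also depends on facts the score does not capture, such as whether the issue is being actively exploited, so a triage process would normally allow an analyst to override the computed band.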

The SBOM

An SBOM (Software Bill of Materials) is a machine-readable inventory of every software component in your firmware, including version numbers and source identifiers. With an SBOM, you can answer the question "does our firmware contain the component affected by CVE-XXXX-YYYY?" in seconds rather than hours. The EU Cyber Resilience Act and FDA 2023 cybersecurity guidance for medical devices both require manufacturers to maintain an SBOM and use it for ongoing vulnerability monitoring. Generate the SBOM at build time using tools like Syft, Trivy or the CycloneDX toolchain, and store it alongside the release artefacts. Update it with every firmware release.
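
The "is this component in our firmware?" lookup can be sketched as below. Real SBOMs are CycloneDX or SPDX documents; this example assumes the SBOM has already been parsed into an array of name and version pairs, that versions follow a simple major.minor.patch scheme, and that an advisory gives its affected range as first affected version (inclusive) and first fixed version (exclusive). All type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One SBOM entry, already extracted from the SBOM document. */
typedef struct {
    const char *name;
    const char *version;
} SbomComponent;

typedef struct {
    int major, minor, patch;
} SemVer;

/* Parses "major.minor.patch"; returns false for any other format. */
static bool parse_semver(const char *text, SemVer *out) {
    return sscanf(text, "%d.%d.%d", &out->major, &out->minor, &out->patch) == 3;
}

/* Returns -1, 0 or 1 when a is older than, equal to or newer than b. */
static int compare_semver(SemVer a, SemVer b) {
    if (a.major != b.major) return a.major < b.major ? -1 : 1;
    if (a.minor != b.minor) return a.minor < b.minor ? -1 : 1;
    if (a.patch != b.patch) return a.patch < b.patch ? -1 : 1;
    return 0;
}

/* True when the SBOM lists `component` at a version inside
   [first_affected, first_fixed). A listed component whose version, or
   whose advisory range, cannot be parsed is reported as affected so
   that it is reviewed manually rather than silently skipped. */
bool sbom_is_affected(const SbomComponent *sbom, size_t count,
                      const char *component,
                      const char *first_affected,
                      const char *first_fixed) {
    for (size_t i = 0; i < count; i++) {
        SemVer have, lo, hi;
        if (strcmp(sbom[i].name, component) != 0) {
            continue;
        }
        if (!parse_semver(sbom[i].version, &have) ||
            !parse_semver(first_affected, &lo) ||
            !parse_semver(first_fixed, &hi)) {
            return true;
        }
        if (compare_semver(have, lo) >= 0 && compare_semver(have, hi) < 0) {
            return true;
        }
    }
    return false;
}
```

Running this check against the stored SBOM of every supported release, not only the latest one, shows which firmware versions still in the field need the patch.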

Staged Patch Rollout

Deploying a patch to 100,000 devices simultaneously is the fastest way to brick your entire fleet if the patch has an undetected compatibility issue with a hardware variant or a field configuration combination that was not covered in lab testing. Staged rollout distributes the risk across time by deploying to progressively larger groups:

Staged Rollout Plan (Example: Critical Security Patch)

Phase 0: Internal validation (Day 1-2)
  Target:  Lab test devices + internal development devices (~50 units)
  Purpose: Confirm basic functionality, no obvious boot failures
  Go/No-Go: Zero boot failures; all security tests pass

Phase 1: Canary deployment (Day 3-4)
  Target:  1% of fleet, selected for hardware variant diversity
  Purpose: Detect hardware-specific issues in real-world conditions
  Monitor: Watchdog reset rate, authentication success rate, MQTT uptime
  Go/No-Go: <0.1% increase in watchdog resets; no authentication failures

Phase 2: Early adopter (Day 5-7)
  Target:  10% of fleet (non-critical devices; devices with good connectivity)
  Purpose: Surface configuration-specific issues at meaningful scale
  Monitor: Full telemetry + security log metrics vs. pre-patch baseline
  Go/No-Go: All metrics within 5% of pre-patch baseline

Phase 3: General rollout (Day 8-14)
  Target:  Remaining 89% of fleet in batches
  Monitor: Automated fleet dashboard; any single-metric deviation > 10%
           pauses the rollout automatically
  Go/No-Go: Continuous; automated rollout pause on anomaly

Emergency rollback trigger:
  If at any phase: watchdog reset rate increases >5%, authentication
  failure rate increases >2%, or more than 0.1% of devices fail to
  report telemetry after the update — pause rollout, investigate,
  and if confirmed as patch-caused, initiate rollback to previous version.
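
The emergency trigger in the plan above can be expressed as a check that the fleet dashboard runs after each batch. This is a sketch under stated assumptions: the struct and function names are hypothetical, all metrics are percentages of devices, and "increases >5%" is read as percentage points above the pre-patch baseline (a relative-increase reading would need a different comparison).

```c
/* Fleet health metrics, each expressed as a percentage of devices (0-100). */
typedef struct {
    double watchdog_reset_pct;
    double auth_failure_pct;
    double telemetry_silent_pct;  /* updated devices that stopped reporting */
} FleetMetrics;

typedef enum {
    ROLLOUT_CONTINUE,
    ROLLOUT_PAUSE
} RolloutDecision;

/* Thresholds taken from the emergency rollback trigger in the plan. */
#define MAX_WATCHDOG_INCREASE_PCT  5.0
#define MAX_AUTH_INCREASE_PCT      2.0
#define MAX_TELEMETRY_SILENT_PCT   0.1

/* Pauses the rollout if any single trigger condition is met.
   The decision to roll back is left to the investigation that follows. */
RolloutDecision evaluate_rollout(const FleetMetrics *baseline,
                                 const FleetMetrics *current) {
    if (current->watchdog_reset_pct - baseline->watchdog_reset_pct
            > MAX_WATCHDOG_INCREASE_PCT) {
        return ROLLOUT_PAUSE;
    }
    if (current->auth_failure_pct - baseline->auth_failure_pct
            > MAX_AUTH_INCREASE_PCT) {
        return ROLLOUT_PAUSE;
    }
    if (current->telemetry_silent_pct > MAX_TELEMETRY_SILENT_PCT) {
        return ROLLOUT_PAUSE;
    }
    return ROLLOUT_CONTINUE;
}
```

The function only pauses. Keeping the rollback decision separate matches the plan, which calls for investigation before a rollback is initiated.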

Patch Management Policy

A patch management policy documents the organisation's commitments and procedures for keeping devices patched. It is the document regulators and enterprise customers will ask to review. It should cover:

  • Patch schedule: Regular release cadence (e.g., monthly security patches, quarterly feature releases) plus the emergency patch process for critical vulnerabilities.
  • Testing requirements: Minimum test coverage required before a patch can be approved for deployment (functional regression test, security control regression test, hardware variant matrix).
  • Approval process: Who must sign off before a patch is deployed to production (technical lead, security lead, QA lead; separate approvals for emergency vs. scheduled patches).
  • Rollback plan: Defined trigger conditions for rollback and the process for executing it across the fleet.
  • Communication plan: How and when customers are notified of security patches and the risk of not applying them.
  • Tracking and reporting: How patch compliance is measured (percentage of fleet on latest version, percentage within policy SLA) and how often it is reported to management.

Long-Term Support Strategy

An embedded device shipped with a five-to-fifteen-year expected service life commits the manufacturer to providing security updates across that entire period. Without a planned long-term support strategy, devices that were secure at shipment become progressively more vulnerable as new CVEs accumulate against their firmware components and no patches are provided.

The Three-Phase Support Lifecycle

Active support: The device receives both feature updates and security patches on the regular release cadence. New hardware variants may be added. Full engineering team engagement.

Maintenance mode: The device receives security patches only, no new features. The release cadence may slow to quarterly or semi-annual. A smaller maintenance team handles patch backporting. This phase typically covers years three through seven for a product with a ten-year support commitment.

End of life: No further updates of any kind. The manufacturer publishes the EOL (End of Life) date at least twelve months in advance, provides a final security patch release covering all known vulnerabilities, and documents the migration path to a supported successor product. After EOL, the device should be retired or segregated from critical networks.

Architectural Choices That Enable Long-Term Supportability

Decisions made at product design time determine how difficult it will be to maintain security across a long product lifetime:

  • Modular firmware architecture: Separating the BSP (Board Support Package), RTOS, networking stack and application logic into independently updatable layers makes it possible to patch the TLS library without re-testing the entire application, and to upgrade the RTOS without touching the application code.
  • Dependency version pinning with annual review: Record the exact version of every third-party component used in each release. Schedule an annual review to update components to the latest stable release, incorporating security fixes while the component is still actively maintained.
  • Delta update capability: Reduces the bandwidth and time cost of frequent security patches, making it economically viable to patch more frequently on constrained connectivity devices.
  • Documented architecture: A product whose internal architecture is fully documented can be maintained by engineers who were not part of the original development team. Products whose architecture exists only in the original developers' memories become unmaintainable when those developers leave.

Managing Legacy Devices That Cannot Be Updated

Despite best efforts, some deployed devices will reach a state where they cannot be updated: the OTA infrastructure has changed in a way incompatible with the device's old OTA client, a hardware variant has a bug that prevents flash updates, or the device is too resource-constrained to run the new firmware. These devices cannot simply be abandoned without managing the security risk they represent.

The four approaches for managing legacy devices that cannot be patched:

Network isolation: Move unpatched devices to a dedicated VLAN with firewall rules that allow only the specific traffic required for their function, and block all other communication. An unpatched temperature sensor that can only reach the telemetry ingestion endpoint over port 8883 on the operations VLAN is significantly less dangerous than the same device on an unrestricted network.

Compensating controls: Add external security controls that mitigate the unpatched vulnerabilities without requiring a firmware change. A reverse proxy in front of the device's management interface that requires authentication before forwarding requests compensates for a weak authentication vulnerability in the device's own management service. A WAF (Web Application Firewall) rule that blocks the specific exploit pattern for a known CVE compensates for an unpatched vulnerability in the device's HTTP stack.

Increased monitoring: Unpatched devices that remain operational should be monitored with lower alert thresholds than patched devices. An anomaly that might be acceptable noise from a current-firmware device is a higher-confidence indicator of compromise from a device known to have unpatched vulnerabilities.

Retirement planning: Track the number of unpatched devices by vulnerability exposure and age, and use this as input to a hardware refresh planning process. The security risk from a large population of unpatched devices is a legitimate business justification for an accelerated hardware refresh cycle.
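
The network isolation approach amounts to a default-deny allowlist. In practice it is expressed as firewall or VLAN configuration rather than code, but the policy logic can be modelled in a few lines. The sketch below is only that model: the rule structure and names are hypothetical, and the address in the usage note comes from the 192.0.2.0/24 documentation range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Builds a host-order IPv4 address from its four octets. */
#define IPV4(a, b, c, d) (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
                          ((uint32_t)(c) << 8)  |  (uint32_t)(d))

/* One permitted destination for an isolated legacy device. */
typedef struct {
    uint32_t dst_ipv4;
    uint16_t dst_port;
} AllowRule;

/* Default deny: a flow is permitted only if it matches a rule exactly.
   An empty rule list therefore blocks everything. */
bool flow_allowed(const AllowRule *rules, size_t count,
                  uint32_t dst_ipv4, uint16_t dst_port) {
    for (size_t i = 0; i < count; i++) {
        if (rules[i].dst_ipv4 == dst_ipv4 && rules[i].dst_port == dst_port) {
            return true;
        }
    }
    return false;
}
```

For the temperature sensor example, the rule list would hold a single entry such as `{ IPV4(192, 0, 2, 10), 8883 }` for the telemetry ingestion endpoint, and every other destination or port would be rejected.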

Conclusion

Deployment, updates and maintenance are where the security investment of development either pays off or falls apart. A device with excellent firmware security that cannot be patched when a critical vulnerability is discovered will be exploited. A device with a working OTA mechanism that has no anti-rollback protection can have its security stripped by an attacker forcing a downgrade. A device that is wiped and disposed of without revoking its cloud credentials provides the attacker who acquires it with valid authentication material. Each element of this lifecycle is a specific answer to a specific way that a shipped device can be compromised in the field: the OTA workflow with dual-bank failover, the remote management protocol with command authorisation, the phase-specific controls from manufacturing through disposal, the staged patch rollout with automated halt criteria, the SBOM-driven vulnerability monitoring, and the long-term support commitment with maintenance mode and EOL planning. Getting all of them right is what allows you to ship a device with confidence that you can keep it secure for its full operational lifetime.
