saveContainerState() should use atomic writes to prevent state.json corruption

### Problem

`saveContainerState()` in `pkg/unikontainers/unikontainers.go` (lines 694–709) writes
`state.json` using `os.WriteFile()` directly. If the urunc process is killed or the
system crashes mid-write, `state.json` ends up partially written — and every subsequent
`urunc start`, `urunc kill`, or `urunc delete` fails with a JSON parse error.

The real pain here is what happens *after* the corruption: the VMM process (QEMU,
Firecracker, etc.) is already running, but urunc can no longer read the state it needs
to manage it. The VM becomes an orphan — still consuming resources, still holding the
TAP device and network config, but invisible to the container lifecycle. You have to
find and kill it manually.

### Root Cause

The current implementation does a direct overwrite:

```go
stateName := filepath.Join(u.BaseDir, stateFilename) return os.WriteFile(stateName, data, 0o644) 

`os.WriteFile()` truncates the file and writes in-place. Any interruption between truncate and write-completion leaves 
the file in a corrupt state causing either empty, partially written, or containing a mix of old and new data.
```

### What the fix should look like:

The codebase already has the correct pattern ,`writePidFile()` in
`pkg/unikontainers/utils.go` (lines 110–125) does exactly this:

1.Write to a temporary file in the same directory (ensures same filesystem for rename)
2.Sync / close the temp file
3.`os.Rename()` the temp file to the target path

`os.Rename()` is atomic on POSIX filesystems when source and destination are on the
same mount. Either the old state.json is there or the new one is — never a half-written
file.

The fix would be straightforward: extract a small `atomicWriteFile()` helper (or just
inline the temp+rename pattern in `saveContainerState()`), and ideally refactor
`writePidFile()` to use the same helper so both paths stay consistent.

### Impact
- Severity: High because corruption leaves running VMs unmanageable
- Likelihood: Low under normal operation, but real under OOM kills, `kill -9`, or system crashes lead exactly the scenarios where you most need state recovery to work
- Blast radius: All VMM backends (QEMU, Firecracker, Solo5, Cloud Hypervisor) are affected since they all depend on `state.json` for lifecycle management

### Reproduction scenario
- Start a unikernel container via urunc 
- Kill the urunc process (e.g., `kill -9`) during a state transition more specifically while `saveContainerState()` is writing
- Attempt any lifecycle operation (`urunc kill`, `urunc delete`)
- Observe JSON parse error; the VMM process is now orphaned

(Timing the kill precisely is tricky in practice, but the window exists on every
state write — and the consequence is severe enough to warrant the fix regardless.)

### Additional context
This was found during a code audit of the state management paths. The fact that
`writePidFile()` already implements the atomic pattern suggests this was a known
concern at some point — `saveContainerState()` just wasn't updated to match.



---


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

saveContainerState() should use atomic writes to prevent state.json corruption #598

Problem

Root Cause

What the fix should look like:

Impact

Reproduction scenario

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

saveContainerState() should use atomic writes to prevent state.json corruption #598

Description

Problem

Root Cause

What the fix should look like:

Impact

Reproduction scenario

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions