Skip to content

fix: serialize shim shutdown and quiesce guest before poweroff#32

Merged
aledbf merged 14 commits into
mainfrom
aledbf/fix
Jun 15, 2026
Merged

fix: serialize shim shutdown and quiesce guest before poweroff#32
aledbf merged 14 commits into
mainfrom
aledbf/fix

Conversation

@aledbf

@aledbf aledbf commented Apr 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Serialize shutdown flow: introduce beginRPC / shutdownOnce to prevent new RPCs during shutdown and ensure in-flight calls drain before exit
  • Quiesce guest filesystem: add PrepareShutdown RPC so the host can flush the guest writable layer before VM poweroff, ensuring snapshot/commit consistency
  • Bump toolchain and deps: Go 1.26.2, QEMU 10.2.2, containerd v2.2.2, gRPC v1.78.0, OpenTelemetry v1.38.0, EROFS snapshotter v20260419.01
  • Remove stale api/next.pb.txt descriptor snapshot

Test plan

  • Unit tests added for PrepareShutdown guest service
  • Unit tests added for beginRPC / shutdown serialization in task service
  • task lint passes
  • go test -race ./... passes
  • Integration tests with KVM verify clean shutdown under concurrent RPCs

🤖 Generated with Claude Code

aledbf and others added 11 commits April 9, 2026 08:39
Replace the checksum-flaky golangci-lint install.sh download (which
failed CI with a SHA-256 mismatch on the cached tarball) with a pinned
`go install` that routes through the Go module proxy + GOSUMDB.

Also bump all actions to their latest major versions: checkout v6,
setup-go v6, setup-task v2, setup-buildx-action v4, login-action v4,
action-gh-release v3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CI bump to golangci-lint v2.12.2 surfaced new findings. Address them:

- .golangci.yml: rename deprecated gomodguard -> gomodguard_v2; exclude
  gosec G703 (path traversal taint, same intent as the existing G304
  exclusion) and G602 in tests; configure goconst with ignore-tests and
  min-occurrences: 4 to avoid churn on generic short literals.
- Remove dead code: service.getContainerID (unused).
- Extract constants for repeated high-value literals: QMP/log field keys
  (cpu_id, slot_id, data, container, exec, container_id, resource,
  required), mount types/options (tmpfs, nosuid, noexec, nodev, cgroup2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
newTestService() left service.vmLifecycle nil, but the production
constructor (NewTaskService) always initializes it via NewManager().
The shutdown cleanup orchestrator's VMShutdown phase calls
vmLifecycle.Shutdown unconditionally, so the nil fixture panicked with
a nil pointer dereference in TestShutdownReleasesNetworkOnce (a
linux-only test). This was latent: CI never reached the test step
because the golangci-lint install was failing first.

Initialize vmLifecycle with lifecycle.NewManager() so the fixture
mirrors production; Manager.Shutdown is a safe no-op with no instance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "valid configuration" case pointed confDir at the host's
/etc/cni/net.d, so NewCNIManager failed with "no CNI configuration
files found" on CI runners (and any clean environment) that lack CNI
configs. This was latent: CI never reached the test step before because
the golangci-lint install was failing first.

Write a valid conflist into a t.TempDir() and use that as confDir so the
test no longer depends on host CNI state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
aledbf and others added 3 commits June 15, 2026 14:06
The release task failed with curl exit 92 ("HTTP/2 stream not closed
cleanly: PROTOCOL_ERROR") while fetching nerdctl-full from GitHub's CDN,
a known intermittent HTTP/2 issue.

Add a download() helper that forces HTTP/1.1 (eliminating the root
cause), retries transient failures, and writes to a temp file renamed on
success so a failed attempt never leaves a truncated tarball that a later
run would reuse. Use it for the nerdctl-full and erofs-snapshotter
downloads.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each container runs in its own shim process, but CNI plugins (bridge,
firewall) mutate shared host state (iptables chains, bridges). When a new
container's CNI ADD overlaps a prior container's asynchronous CNI DEL,
the two race on the kernel xtables lock and the firewall plugin
intermittently aborts the ADD with an empty error — observed in
TestContainerdRunMultipleContainers (3rd container).

Wrap Setup/Teardown in a host-wide advisory file lock (flock on
/run/spinbox/cni.lock) so the plugin chain runs serially across all shim
processes. Internal cleanup paths use new setupLocked/teardownLocked
helpers to avoid recursive lock acquisition (flock is not reentrant). The
lock is released on fd close, so a crashed shim never leaves it held.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aledbf aledbf merged commit 0153feb into main Jun 15, 2026
5 checks passed
@aledbf aledbf deleted the aledbf/fix branch June 15, 2026 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant