
OCPNODE-4055: Add comprehensive test suite for Additional Storage Support feature #31083

Open
BhargaviGudi wants to merge 1 commit into openshift:main from BhargaviGudi:additionalArtifactsAutomation

Conversation

BhargaviGudi (Contributor) commented Apr 28, 2026

This adds API validation and E2E tests for the Additional Storage Configurations feature (TechPreviewNoUpgrade):

  • additionalArtifactStores (max 10 stores)
  • additionalImageStores (max 10 stores)
  • additionalLayerStores (max 5 stores, supports :ref suffix)
  • Combined storage configurations
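
As a rough sketch (not the PR's exact helper code), a combined configuration submitted by these tests might look like the following. The Additional*Stores field names come from this PR's diff; their []string types, the object name, and the pool-selector label are assumptions based on the standard machineconfiguration v1 API:

// import mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
// import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
ctrcfg := &mcfgv1.ContainerRuntimeConfig{
	ObjectMeta: metav1.ObjectMeta{Name: "combined-stores-example"}, // hypothetical name
	Spec: mcfgv1.ContainerRuntimeConfigSpec{
		MachineConfigPoolSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"pools.operator.machineconfiguration.openshift.io/worker": ""},
		},
		ContainerRuntimeConfig: &mcfgv1.ContainerRuntimeConfiguration{
			// Assumed []string fields; names taken from this PR's diff.
			AdditionalArtifactStores: []string{"/var/lib/artifact-store-1"},       // up to 10 entries
			AdditionalImageStores:    []string{"/var/lib/image-store-1"},          // up to 10 entries
			AdditionalLayerStores:    []string{"/var/lib/stargz-store/store:ref"}, // up to 5, optional :ref suffix
		},
	},
}
_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})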

Test Coverage

API Validation Tests (Non-disruptive, parallel):

  • Path format validation (absolute paths, no spaces, valid characters)
  • Count limits (max 10 for artifact/image stores, max 5 for layer stores)
  • Duplicate path detection
  • Combined store configurations

E2E Tests (Disruptive, serial):

  • ContainerRuntimeConfig lifecycle (create, update, delete)
  • MachineConfigPool rollout verification
  • Node configuration validation (storage.conf, CRI-O configs)
  • Lazy pulling with stargz-store
  • Fallback behavior and performance testing
  • Combined storage type scenarios

CodeRabbit Review Fixes

Negative test assertions - Tests now fail explicitly when API accepts invalid configs
stargz-store cleanup - Comprehensive node-level cleanup (uninstall binaries, services, configs)
SNO cluster support - Tests work on Single Node OpenShift (fallback to master+worker nodes)
MicroShift compatibility - Added apigroup tags for proper test filtering
Import cleanup - Removed unused imports after assertion fixes
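
The negative-assertion fix (first item above) boils down to the pattern below, sketched from the review threads further down; tc.path and cleanupContainerRuntimeConfig are names from the test files:

_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
if err == nil {
	// The API accepted the object; clean it up, then fail the spec.
	defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
}
o.Expect(err).To(o.HaveOccurred(), "expected create to fail for path %s", tc.path)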

Suite Tags

Tests are tagged with:

  • [Feature:AdditionalStorageSupport] - Feature flag
  • [Suite:openshift/disruptive-longrunning] - Long-running disruptive suite
  • [apigroup:config.openshift.io][apigroup:machineconfiguration.openshift.io] - MicroShift compatibility
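
Illustrative (not verbatim) shape of a spec title carrying these tags:

g.It("should reject relative paths for additionalImageStores [TC2] "+
	"[Feature:AdditionalStorageSupport]"+
	"[apigroup:config.openshift.io][apigroup:machineconfiguration.openshift.io]", func() {
	// ...
})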

All test cases were run locally and all passed.

Summary by CodeRabbit

  • Tests
    • Added comprehensive API validation and disruptive E2E suites for additional artifact, image, layer, and combined stores — covering path validation, create/update/delete lifecycles, rollout/cleanup verification, performance/fallback scenarios, and cross-type behaviors. Tests are gated by TechPreviewNoUpgrade and some disruptive cases skip on Azure.
  • Utilities
    • Added worker-node helpers for directory management, rollout coordination, stargz-store deployment/verification, and test orchestration to support additional-store scenarios.
  • Testdata
    • Added embedded test asset for additional-store setup.

openshift-merge-bot (Contributor) commented:

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

openshift-ci Bot commented Apr 28, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot commented Apr 28, 2026

@BhargaviGudi: This pull request references OCPNODE-4055 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

This adds API validation and E2E tests for the Additional Storage Configurations feature (TechPreviewNoUpgrade):

  • additionalArtifactStores (max 10 stores)
  • additionalImageStores (max 10 stores)
  • additionalLayerStores (max 5 stores, supports :ref suffix)
  • Combined storage configurations

Test suite: openshift/additional-storage-support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 28, 2026
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2026
coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds comprehensive Ginkgo E2E tests and helpers for CRI-O AdditionalLayerStores, AdditionalImageStores, AdditionalArtifactStores, and combined-store scenarios, plus node utilities and a stargz-store deployable; includes parallel API validation suites, serial disruptive lifecycle tests, node tooling, and embedded testdata.

Changes

Additional Stores Test Suite & Helpers

Layer / File(s) | Summary
MCP/MachineConfig and node directory helpers
test/extended/node/node_utils.go
Adds waitForMCPToStartUpdating, MachineConfig creation/deletion helpers, Ignition storage.directories helpers, coordinated directory setup/cleanup flows, and MCP update/rollout polling used by E2E flows.
ContainerRuntimeConfig polling & test pod/DaemonSet helpers
test/extended/node/node_utils.go (continued)
Adds helpers to construct ContainerRuntimeConfig/CRC objects with Additional*Stores, poll for ContainerRuntimeConfig success/failure conditions, create node-pinned restricted test Pods, wait for DaemonSet readiness, optionally pre-populate images on nodes via podman, and cleanup utilities.
Stargz store deployable helper
test/extended/node/stargz_store_setup.go
Introduces StargzStoreSetup with NewStargzStoreSetup, Deploy, Cleanup, GetStorePath, IsDeployed, and VerifyStorageConfContainsStargz. Implements namespace/ServiceAccount/SCC setup, ConfigMap and privileged DaemonSet installer script, DaemonSet readiness wait, per-worker verification (systemd + mount checks), CRI-O restart handling, snapshot-count helpers, test Pod spec creation, and pod deletion polling.
API validation suites (non-disruptive)
test/extended/node/additional_layer_stores.go, test/extended/node/additional_image_stores.go, test/extended/node/additional_artifact_stores.go, test/extended/node/additional_stores_combined.go
Adds parallel Ginkgo API validation tests (TC1–TC7/TC1–TC9 variants) that submit ContainerRuntimeConfig objects and assert acceptance/rejection for path rules: require absolute paths, reject empty/relative paths, enforce per-type max counts, reject duplicates, spaces/invalid characters, over-length (>256), consecutive slashes; all gated by TechPreviewNoUpgrade.
Disruptive E2E lifecycle suites (serial)
test/extended/node/additional_layer_stores.go, test/extended/node/additional_image_stores.go, test/extended/node/additional_artifact_stores.go, test/extended/node/additional_stores_combined.go
Adds serial disruptive tests (skipped on Azure) that select worker nodes (prefer “pure” workers), create directories on nodes, apply ContainerRuntimeConfig/CRC, wait for MCO/MCP rendered MachineConfig and MCP rollout, verify generated runtime/config files (/etc/containers/storage.conf, /etc/crio/crio.conf.d/01-ctrcfg-*), assert crio active and nodes Ready, validate updates (add/remove paths), and confirm cleanup on deletion.
Functional scenarios: pre-populate, stargz, artifacts, performance
test/extended/node/additional_image_stores.go, test/extended/node/additional_layer_stores.go, test/extended/node/additional_stores_combined.go
Adds functional E2E flows: pre-populate additional image store and assert cache-hit vs registry fallback (pod Events), measure startup-time speedup; deploy stargz-store, assert stargz snapshot creation/reuse and lazy-pull behavior, stargz fallback scenarios, and artifact store read/write checks.
Update/remove flows and combined-store interactions
test/extended/node/additional_stores_combined.go, test/extended/node/additional_*_stores.go
Adds tests for combined configs validating per-type maxima, duplicate detection within type, shared-path behavior across types (skip if API rejects), updates to add/remove stores and subsequent verification across node configs.
Test helpers & CTR/CRC constructors
test/extended/node/additional_layer_stores.go, test/extended/node/additional_image_stores.go, test/extended/node/additional_artifact_stores.go (bottom ranges)
Adds unexported helper constructors used by E2E tests: createAdditionalLayerStoresCTRCfg, createAdditionalImageStoresCTRCfg, createAdditionalArtifactStoresCTRCfg, and supporting polling/validation helpers.
Embedded testdata asset
test/extended/testdata/bindata.go
Adds embedded test/extended/testdata/node/additional-storage/nydus-store-setup.yaml asset and registers it in the generated testdata asset maps.
sequenceDiagram
    participant TestRunner as Test Runner (Ginkgo)
    participant MCO as Machine Config Operator
    participant MCP as MachineConfigPool
    participant Node as Worker Node
    participant CRIO as CRI-O
    participant Stargz as stargz-store

    TestRunner->>MCO: create ContainerRuntimeConfig / CRC
    MCO->>MCP: render MachineConfig
    MCP->>MCP: roll out MachineConfig
    MCP->>Node: apply MachineConfig (write configs, create dirs)
    Node->>CRIO: restart/reload service
    CRIO->>Node: create/update /etc/containers/storage.conf or /etc/crio/... fragment
    TestRunner->>Node: run verification commands (check files, service state)
    alt stargz tests
        TestRunner->>Stargz: deploy privileged DaemonSet installer
        Stargz->>Node: install stargz-store, mount FUSE
        TestRunner->>Node: run pods, inspect snapshot counts and logs
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 48.48%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality | ⚠️ Warning | Many assertions lack meaningful diagnostic messages: plain o.Expect(err).NotTo(o.HaveOccurred()) and comparisons without context messages appear throughout. | Add meaningful failure messages to assertions lacking them, especially complex assertions in E2E tests checking resource properties and node configurations.
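
The resolution for the second warning is mechanical; a sketch using Gomega's optional description arguments:

// Before: failure output gives no context.
o.Expect(err).NotTo(o.HaveOccurred())
// After: the failure output says what was being verified.
o.Expect(err).NotTo(o.HaveOccurred(), "waiting for worker MCP rollout after applying %s", ctrcfg.Name)
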
✅ Passed checks (10 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding comprehensive test coverage for the Additional Storage Support feature.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names | ✅ Passed | All 42 Ginkgo test cases use stable, deterministic names with no dynamic content. Test titles contain only static descriptive strings with TC identifiers.
Microshift Test Compatibility | ✅ Passed | All 8 test suites use [apigroup:config.openshift.io] and [apigroup:machineconfiguration.openshift.io] tags. MicroShift automatically skips these tests as it doesn't serve these API groups.
Single Node Openshift (Sno) Test Compatibility | ✅ Passed | E2E tests use getPureWorkerNodes() with SNO fallback, iterate over pureWorkers without multi-node assertions, and don't require multiple nodes. API tests have no node dependencies.
Topology-Aware Scheduling Compatibility | ✅ Passed | Test files support SNO via the getPureWorkerNodes() fallback. No topology-unsafe scheduling constraints found in test or DaemonSet definitions.
Ote Binary Stdout Contract | ✅ Passed | All files pass the OTE Binary Stdout Contract check. No process-level stdout writes detected; all logging uses framework.Logf (intercepted by Ginkgo).
Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | All E2E test suites with external connectivity requirements are properly tagged with [Skipped:Disconnected]. No IPv4 assumptions or IPv4-only logic detected in the code.


coderabbitai Bot left a comment

Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/testsuites/standard_suites.go`:
- Around line 504-506: The suite's Qualifiers currently filter for
`"[Suite:openshift/additional-storage-support]"` but the tests are tagged
`"[Feature:AdditionalStorageSupport]"`, so update the Qualifiers slice in the
suite definition (the `Qualifiers: []string{ ... }` entry) to match the test
tag—e.g. replace or add `name.contains("[Feature:AdditionalStorageSupport]")`
(or include both selectors if you want to run both forms) so the new tests are
actually selected.
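
A sketch of that fix, using the CEL-style selector quoted above; any surrounding suite fields are assumed unchanged:

Qualifiers: []string{
	// Select by the tag the tests actually carry (optionally keep the
	// old suite selector alongside it).
	`name.contains("[Feature:AdditionalStorageSupport]")`,
},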

In `@test/extended/node/additional_artifact_stores.go`:
- Around line 460-481: The test races because it calls waitForMCP(...)
immediately after mutating a ContainerRuntimeConfig; add a prior gate call to
waitForMCPToStartUpdating(...) so the helper observes the pool beginning to
update before waiting for rollout: call waitForMCPToStartUpdating(ctx, mcClient,
"worker", <short timeout>) and check the error, then call waitForMCP(ctx,
mcClient, "worker", 25*time.Minute) as before; apply the same change for all
other ContainerRuntimeConfig mutation sites (e.g., the blocks referencing
ContainerRuntimeConfig, waitForMCP, and generatedMCName).
- Around line 45-89: The test currently only logs when an invalid
ContainerRuntimeConfig (ctrcfg) is accepted instead of failing the spec; change
the negative-path assertions so each invalid case explicitly fails the test:
inside the loop replace the "if err != nil { framework.Logf(...) } else { ... }"
logic with an assertion that the create returned an error (e.g.,
o.Expect(err).To(o.HaveOccurred())) OR, if the API accepts the object, poll MCO
status and assert the object reaches ContainerRuntimeConfigFailure=True (use the
existing cleanupContainerRuntimeConfig helper for accepted cases). Apply this
pattern for the current g.It block and the other similar blocks referenced
(lines 91-141, 143-177, 209-255, 257-291, 293-325) so tests fail when invalid
configs are accepted instead of merely logging.

In `@test/extended/node/additional_image_stores.go`:
- Around line 534-587: The test currently only logs failures from
prepopulateImageOnNode and registry pulls instead of failing; change it to
assert failures so the spec enforces use of additionalImageStores: replace the
soft log when prepopulateImageOnNode returns an error with a hard assertion
(e.g., o.Expect(err).NotTo(o.HaveOccurred()) or framework.Failf) referencing
prepopulateImageOnNode, and after scanning pod events use the pulledFromRegistry
boolean to assert the pod did NOT pull from the registry
(o.Expect(pulledFromRegistry).To(o.BeFalse()) or framework.Failf if true) so the
test fails when prepopulation fails or when a registry pull is observed for
testPod.
- Around line 612-615: The tests mutate ContainerRuntimeConfig then immediately
call waitForMCP(...), which can race if the MCP is still marked Updated on the
old config; insert a call to waitForMCPToStartUpdating(ctx, mcClient, "worker",
<suitable timeout>) immediately after each
waitForContainerRuntimeConfigSuccess(...) (and before the subsequent
waitForMCP(...)) to ensure the MCP has begun rolling before asserting readiness.
Update all occurrences mentioned (e.g., the instances around
waitForContainerRuntimeConfigSuccess -> waitForMCP at lines referenced and the
other listed locations) so the sequence is:
waitForContainerRuntimeConfigSuccess(...), waitForMCPToStartUpdating(...), then
waitForMCP(...).
- Around line 848-966: The baseline measurement is contaminated because the
image pulled by baselinePod remains in CRI-O local storage on testNode and
speeds up optimizedPod; after baselinePod finishes (or before creating
optimizedPod) remove the image from the node's local container storage so the
optimized test only benefits from additionalImageStores. Specifically, after
waitForPodDeleted/baseline cleanup and before prepopulateImageOnNode/creating
optimizedPod, run a removal on testNode (via ExecOnNodeWithChroot) to delete the
image (e.g., crictl rmi or podman/rpm command appropriate for CRI-O) targeting
testImageDefault, or alternatively change the baseline/optimized pod placement
or pull policies so the baseline pull occurs on a different node than testNode;
update references: baselinePod, optimizedPod, testNode, testImageDefault,
ExecOnNodeWithChroot, prepopulateImageOnNode, waitForPodDeleted accordingly.
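
A sketch of that eviction step; the ExecOnNodeWithChroot signature (node name plus command, returning output and error) is assumed from its other uses in this review:

// Evict the baseline pull so the optimized run starts cold.
out, err := ExecOnNodeWithChroot(oc, testNode.Name, fmt.Sprintf("crictl rmi %s", testImageDefault))
o.Expect(err).NotTo(o.HaveOccurred(), "failed to evict %s from node %s: %s", testImageDefault, testNode.Name, out)
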
- Around line 47-91: Tests currently treat API acceptance alone as success,
masking validation regressions; update the spec so each invalid-path case fails
unless the Create call returns an error or the MachineConfigOperator surfaces
the ContainerRuntimeConfigFailure=True condition. Concretely, in the test that
calls mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and in
the other similar blocks), change the success branch to poll the MCO for the
ContainerRuntimeConfigFailure condition for the created resource (and only defer
cleanupContainerRuntimeConfig if you will delete it on success), and assert the
test fails if neither the Create returned an error nor the MCO condition becomes
True within a timeout. Include these checks for all analogous test blocks
mentioned (lines ranges in the comment).

In `@test/extended/node/additional_layer_stores.go`:
- Around line 721-725: After mutating the ContainerRuntimeConfig (e.g., after
calling waitForContainerRuntimeConfigSuccess with ctrcfg.Name), ensure you block
until the MCP has left its old steady state by calling
waitForMCPToStartUpdating(ctx, mcClient, "worker", <suitable timeout>) before
calling waitForMCP(ctx, mcClient, "worker", ...); specifically, insert a call to
waitForMCPToStartUpdating between waitForContainerRuntimeConfigSuccess(...) and
waitForMCP(...). Apply the same insertion for the other similar flows referenced
(lines around the other waitForMCP usages) so the final waitForMCP polls the new
rollout rather than returning on the previous rendered config.
- Around line 167-213: The test currently treats successful creation of a
ContainerRuntimeConfig with an invalid path as a warning; change it to fail the
test when the API accepts an invalid path. In the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and
currently logs "Warning: Path ... accepted"), replace that branch to defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and then assert
failure using the test's assertion helpers (e.g.,
o.Expect(err).To(o.HaveOccurred(), "expected create to fail for path %s",
tc.path) or explicitly call g.Fail/fatal) so the test fails if the create
succeeds; keep the existing behavior that logs success when creation is
rejected.
- Around line 1448-1453: Replace the hardcoded personal image string assigned to
eStargzImage with the project-managed image lookup using image.LocationFor(...)
and add that image's repository:tag to the allowedImages registry in the image
utility so CI doesn't depend on a personal quay repo; specifically, add the
stable repository entry to allowedImages (in the image registry map) and change
eStargzImage to use image.LocationFor("your-org/your-image:tag") or the chosen
stable org image name following the same pattern used by other node tests.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 209-218: The test currently treats both success and failure as
passing because it only logs on error; change the branching to assert
explicitly: where you call
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}) (and the similar block at the other occurrence), replace
the current if/else with an assertion that err did not occur (e.g.,
o.Expect(err).NotTo(o.HaveOccurred()) or equivalent) when the spec is supposed
to be accepted, then defer cleanupContainerRuntimeConfig and run the o.Expect
checks on created.Spec...; if the test is intended to assert rejection instead,
assert that err did occur (o.Expect(err).To(HaveOccurred())) and remove the
success-path expectations. Reference symbols:
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create,
cleanupContainerRuntimeConfig,
created.Spec.ContainerRuntimeConfig.AdditionalLayerStores/AdditionalImageStores/AdditionalArtifactStores,
and apply the same change to the second block around the other Create call.
- Around line 563-566: After computing pureWorkers via getPureWorkerNodes
(called after getNodesByLabel), add a guard that fails the test when pureWorkers
is empty; specifically, immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) assert that len(pureWorkers) > 0 (e.g.,
o.Expect(len(pureWorkers)).NotTo(o.BeZero()) with a descriptive message) so
TC12/TC13 don’t silently no-op when there are zero pure workers.
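
The suggested guard is a two-liner (sketch):

pureWorkers := getPureWorkerNodes(workerNodes)
o.Expect(len(pureWorkers)).NotTo(o.BeZero(), "TC12/TC13 require at least one pure worker node")
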
- Around line 363-365: The current test only asserts an error occurred when
creating the ContainerRuntimeConfig but may match unrelated failures; after the
existing o.Expect(err).To(o.HaveOccurred()) add a stronger assertion that the
error message references the specific invalid image path used in the test (the
invalid image path value supplied when building ctrcfg). Locate the Create call
(mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and the
ctrcfg variable and assert err.Error() (or the testing matcher equivalent)
contains the exact invalid image path string used to construct ctrcfg so only
the intended validation rejection passes.
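
Sketch of the tightened check, with invalidImagePath standing in for the value used to build ctrcfg:

_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
o.Expect(err).To(o.HaveOccurred())
// Tie the rejection to the specific invalid path, not just any failure.
o.Expect(err.Error()).To(o.ContainSubstring(invalidImagePath))
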
- Around line 70-73: The test is creating real ContainerRuntimeConfig objects
(mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create with
metav1.CreateOptions{}) which triggers MCO reconciliation; change these TC1–TC10
Create calls to use a dry-run by passing metav1.CreateOptions{DryRun:
[]string{"All"}} so the API validates without persisting, and remove or skip the
matching deferred cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) for
those dry-run cases since nothing is created; update every occurrence of
Create(...) and the subsequent defer in the file (references:
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create,
metav1.CreateOptions{}, and cleanupContainerRuntimeConfig) accordingly.
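
For reference, the dry-run shape suggested above (metav1.DryRunAll is the standard "All" constant):

// Validated by the API server but never persisted: no MCO reconciliation,
// so no deferred cleanupContainerRuntimeConfig is needed.
created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(
	ctx, ctrcfg, metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}})
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(created.Spec.ContainerRuntimeConfig.AdditionalLayerStores).To(o.HaveLen(1)) // example check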

In `@test/extended/node/stargz_store_setup.go`:
- Around line 409-425: The loop over pureWorkers currently swallows errors from
ExecOnNodeWithChroot (used for "systemctl is-active stargz-store" and "mount"),
logging warnings and continuing, which allows Deploy to falsely succeed; change
both error branches to return a wrapped error instead of continuing so the
caller fails fast (include node.Name and the original err in the returned
fmt.Errorf) and keep the existing status check that returns an error when the
service is not "active"; update the branches around ExecOnNodeWithChroot calls
in this block to propagate errors rather than logging and continuing.
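
A fail-fast sketch of those branches (helper signature assumed as elsewhere in this review):

status, err := ExecOnNodeWithChroot(oc, node.Name, "systemctl is-active stargz-store")
if err != nil {
	return fmt.Errorf("checking stargz-store on node %s: %w", node.Name, err)
}
if strings.TrimSpace(status) != "active" {
	return fmt.Errorf("stargz-store is %q on node %s, want \"active\"", strings.TrimSpace(status), node.Name)
}
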
- Around line 79-107: The Cleanup method (StargzStoreSetup.Cleanup) currently
only deletes the namespace (s.namespace) and must also undo node-level changes:
for each worker node reachable via s.oc (e.g., using the cluster's node exec/SSH
helper), stop and disable the stargz-store systemd unit, remove the installed
binary and config directories (the host paths the deployment created), reload
systemd, and restart CRI-O to clear side effects; perform these steps before or
after namespace delete, ensure s.deployed is set false only after node cleanup,
and log warnings on non-fatal failures while returning an error only on fatal
failures.
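
A best-effort sketch of the node-level teardown described above; the unit name comes from the review comment, while the install paths are assumptions:

for _, node := range workerNodes {
	for _, cmd := range []string{
		"systemctl disable --now stargz-store",
		"rm -rf /usr/local/bin/stargz-store /etc/stargz-store", // assumed install locations
		"systemctl daemon-reload",
		"systemctl restart crio",
	} {
		// Best effort: log and continue on non-fatal failures.
		if _, err := ExecOnNodeWithChroot(s.oc, node.Name, cmd); err != nil {
			framework.Logf("cleanup warning on node %s (%q): %v", node.Name, cmd, err)
		}
	}
}
s.deployed = false
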
- Around line 45-75: Set the setup as deployed earlier and make cleanup safe to
call on partial failures: in Deploy, mark s.deployed = true immediately after
the cluster-mutating resources exist (e.g., after successful createNamespace and
createConfigMap or at least right after createDaemonSet is called) so
AfterEach/Cleanup will run even on later verification failures; also update
Cleanup/IsDeployed/Cleanup helpers to be idempotent (safe to delete
namespace/daemonset/configmap even if some resources were never created) and
make createDaemonSet, waitForDaemonSetReady, and verifyStargzStoreRunning
tolerant of repeated cleanup attempts.
- Around line 217-220: The test currently downloads the stargz-snapshotter
binary via curl (the stargz-snapshotter release fetch block); remove that
external GitHub download and instead load the binary from a cluster-local or
prepackaged source: change the setup to first check a configurable local path or
environment variable (e.g., STARGZ_SNAPSHOTTER_BIN) and use that binary if
present, otherwise fall back to a cluster-local payload URL served inside CI;
also remove the curl/URL usage and add a clear error message if neither local
binary nor cluster payload is available so tests fail fast in restricted-network
environments.
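
A sketch of the suggested lookup order; STARGZ_SNAPSHOTTER_BIN comes from the comment above, while the payload-URL variable name is hypothetical:

func resolveStargzSource() (string, error) {
	if bin := os.Getenv("STARGZ_SNAPSHOTTER_BIN"); bin != "" {
		if _, err := os.Stat(bin); err == nil {
			return bin, nil // locally provided binary
		}
	}
	if url := os.Getenv("STARGZ_PAYLOAD_URL"); url != "" { // hypothetical cluster-local payload
		return url, nil
	}
	return "", fmt.Errorf("no stargz-snapshotter source: set STARGZ_SNAPSHOTTER_BIN or serve a cluster-local payload")
}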

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 626a6a31-816a-4934-9431-1563660f7dba

📥 Commits

Reviewing files that changed from the base of the PR and between b6f09c2 and bf419f3.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go

Comment on lines +217 to +220
# Download stargz-snapshotter release
echo "Downloading stargz-snapshotter %s..."
curl -L -o /tmp/stargz.tar.gz \
https://github.com/containerd/stargz-snapshotter/releases/download/%s/stargz-snapshotter-%s-linux-amd64.tar.gz

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the target file and surrounding context
head -n 230 test/extended/node/stargz_store_setup.go | tail -n 25

Repository: openshift/origin

Length of output: 949


🏁 Script executed:

# Search for references to stargz in the test setup
rg "stargz" test/extended/node/ -A 3 -B 3

Repository: openshift/origin

Length of output: 50372


🏁 Script executed:

# Check if there are any test image definitions or Dockerfiles
fd -e Dockerfile -e dockerfile test/

Repository: openshift/origin

Length of output: 386


🏁 Script executed:

# Look for environment variable handling or cluster-local payload alternatives
rg "stargz|cluster.*local|payload" test/extended/ --type go -i

Repository: openshift/origin

Length of output: 50374


Remove GitHub release download dependency during test initialization.

The curl download from a public GitHub release (lines 217-220) introduces an external internet dependency that makes the test vulnerable to network flakiness and restricted network policies. In CI environments with limited or no external connectivity, this turns the test suite into an infrastructure verification rather than a feature verification.

Pre-package the stargz-snapshotter binary (v0.18.2) in the test container image or implement a cluster-local payload source with a fallback mechanism to handle environments where GitHub is unreachable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 217 - 220, The test
currently downloads the stargz-snapshotter binary via curl (the
stargz-snapshotter release fetch block); remove that external GitHub download
and instead load the binary from a cluster-local or prepackaged source: change
the setup to first check a configurable local path or environment variable
(e.g., STARGZ_SNAPSHOTTER_BIN) and use that binary if present, otherwise fall
back to a cluster-local payload URL served inside CI; also remove the curl/URL
usage and add a clear error message if neither local binary nor cluster payload
is available so tests fail fast in restricted-network environments.

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from bf419f3 to f09da80 Compare April 28, 2026 07:43
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2026
coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (9)
test/extended/node/additional_layer_stores.go (3)

167-213: ⚠️ Potential issue | 🟠 Major

TC5 still allows a green pass when invalid paths are accepted.

In the success branch, the test only logs a warning and continues, so invalid-character validation regressions won’t fail the spec.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 167 - 213, The
test loop currently treats acceptance of invalid paths as a warning instead of
failing the spec; update the loop in the "should reject path with invalid
characters for additionalLayerStores [TC5]" test so that when
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create returns no
error for a test case (i.e., the API accepted the invalid path), the test fails
(use the test assertion helper, e.g., o.Expect(err).To(o.HaveOccurred()) or
o.Expect(false).To(o.BeTrue()) with a clear message) rather than merely logging
a warning; keep the cleanup via cleanupContainerRuntimeConfig(ctx, mcClient,
ctrcfg.Name) for the success branch but ensure the assertion runs so the spec
fails on regressions.

721-725: ⚠️ Potential issue | 🟠 Major

Several mutation flows still miss MCP “start updating” synchronization.

Direct waitForMCP(...) calls here can complete against the pre-change state and race subsequent assertions.

Also applies to: 738-740, 791-795, 815-819, 861-866, 938-942, 1049-1053, 1129-1135, 1192-1212, 1496-1500, 1726-1730

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 721 - 725, The
direct waitForMCP(...) calls can race the MCP "start updating" transition;
before each waitForMCP(ctx, mcClient, "worker", ...), insert a call to a
synchronization helper that waits for the MCP to enter the "Updating" state
(e.g., call waitForMCPStartUpdating(ctx, mcClient, "worker", <short timeout>))
so the subsequent waitForMCP observes the in-progress update rather than the
pre-change state; update every occurrence (the shown call and the other
duplicated spots) to perform waitForMCPStartUpdating(...) immediately before
waitForMCP(...).

1435-1453: ⚠️ Potential issue | 🟠 Major

Replace the personal external image dependency with a project-managed image lookup.

Hardcoding quay.io/bgudi/test-small:estargz makes CI fragile (ownership/tag availability risk). Prefer a repository-managed image via the standard test-image registry mechanism.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 1435 - 1453,
Replace the hardcoded personal image string assigned to eStargzImage in the test
(additional_layer_stores.go) with the project-managed test-image lookup helper
(instead of "quay.io/bgudi/test-small:estargz"); call the centralized test image
registry/helper used elsewhere in the repo (for example the project’s test image
getter or framework image helper) to obtain the estargz test image, assign its
result to eStargzImage, and remove the literal so CI uses a repository-managed
image tag.
test/extended/node/additional_image_stores.go (4)

612-616: ⚠️ Potential issue | 🟠 Major

Rollout wait is racy without MCP “start updating” gate.

These flows mutate ContainerRuntimeConfig and then immediately call waitForMCP(...), which can observe the prior steady state and return early.

Proposed sequence for each mutation path
 err = waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, 5*time.Minute)
 o.Expect(err).NotTo(o.HaveOccurred())

+err = waitForMCPToStartUpdating(ctx, mcClient, "worker", 5*time.Minute)
+o.Expect(err).NotTo(o.HaveOccurred())
+
 err = waitForMCP(ctx, mcClient, "worker", 25*time.Minute)
 o.Expect(err).NotTo(o.HaveOccurred())

Also applies to: 629-631, 682-686, 706-710, 753-757, 809-813, 1028-1032, 1100-1104, 1210-1214, 1290-1296, 1353-1357, 1369-1373

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 612 - 616, The
test is racy because after mutating ContainerRuntimeConfig (ctrcfg.Name) the
code calls waitForMCP(...) immediately and may see the old steady state; update
each mutation flow (where waitForContainerRuntimeConfigSuccess(ctx, mcClient,
ctrcfg.Name, ...) is followed by waitForMCP(ctx, mcClient, "worker", ...)) to
first wait for the MCP to enter the "updating" phase (i.e., a start-updating
gate) before calling waitForMCP for completion: add a helper/check that polls
the MachineConfigPool/MC status for .status.degraded===false and
.status.conditions includes an Updating/Progressing condition (or reuse an
existing waitForMCPState helper if present) using mcClient and ctrcfg.Name, then
only call waitForMCP(ctx, mcClient, "worker", ...) to wait for finished rollout;
apply this change to all similar sites (the listed line groups).

47-91: ⚠️ Potential issue | 🟠 Major

Negative-path specs can pass even when invalid configs are accepted.

These branches only log on successful create and/or don’t assert the poll result, so validation regressions can still report green.

Proposed pattern fix
- _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
- if err != nil {
-   framework.Logf("... rejected ...")
- } else {
-   defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
-   framework.Logf("... accepted ..., checking MCO validation")
- }
+ _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+ if err == nil {
+   defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
+   err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
+     cfg, getErr := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Get(ctx, ctrcfg.Name, metav1.GetOptions{})
+     if getErr != nil {
+       return false, getErr
+     }
+     for _, condition := range cfg.Status.Conditions {
+       if condition.Type == machineconfigv1.ContainerRuntimeConfigFailure && condition.Status == corev1.ConditionTrue {
+         return true, nil
+       }
+     }
+     return false, nil
+   })
+ }
+ o.Expect(err).To(o.HaveOccurred(), "invalid config must be rejected by API or by MCO failure condition")

Also applies to: 120-142, 145-179, 211-257, 259-293, 295-327

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 47 - 91, The test
currently only logs when an invalid ContainerRuntimeConfig (ctrcfg) is accepted,
so create success can silently pass; update the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to assert
outcomes: if Create returns an error, keep the current log; if Create succeeds,
immediately defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and
then poll the MCO validation (replace the current log path) — use an assertion
(o.Expect / g.Eventually) to fail the test if the config remains accepted by the
MCO (i.e., wait for the object to be rejected/rolled back or for the MCO to
surface an error), and fail the test instead of just logging; ensure these
checks reference ctrcfg, ContainerRuntimeConfigs().Create, and
cleanupContainerRuntimeConfig so the validation branch cannot silently pass.

535-538: ⚠️ Potential issue | 🟠 Major

Feature assertion is soft; tests still pass on prepopulation or fallback pull failures.

Both TC10 and TC15 only log when prepopulation fails or registry pulls occur, so the core “use additionalImageStores” behavior isn’t enforced.

Also applies to: 559-587, 933-937, 989-999

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 535 - 538, The
tests currently only log failures from prepopulateImageOnNode and registry pulls
(e.g., calls using prepopulateImageOnNode(ctx, oc, testNode, testImageDefault,
additionalImageStorePath)), making the feature check soft; change those log-only
branches to assert/fail the test instead (use the test framework's failure
helper such as framework.Failf or ExpectNoError) so the test fails when
prepopulation or pull fails; apply the same change to the other occurrences
where prepopulateImageOnNode or registry pull errors are merely logged (the
duplicated call sites mentioned) so the behavior is actually enforced rather
than only advertised in logs.

850-883: ⚠️ Potential issue | 🟠 Major

Performance comparison is biased by baseline cache warm-up.

The baseline pod pulls the same image on the same node, but local image cache is not cleared before the optimized run. The “optimized faster” result can be unrelated to additionalImageStores.

Also applies to: 939-987

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 850 - 883, The
baseline measurement is biased by a warm image cache because the same image
(testImageDefault) is pulled on the node and not removed before the optimized
run; fix by ensuring a consistent cache state before each measurement: remove
the image from the node(s) (or use a fresh unique image tag) prior to creating
baselinePod and again prior to the optimized run so both runs start from the
same cold-cache state. Locate the pod creation flow (baselinePod,
waitForPodRunning) and add a pre-step that evicts the image (for example by
running a node-level image remove via the test harness/oc or executing a
privileged pod that runs the container runtime image remove) or switch to using
a uniquely tagged test image for each run; apply the same change for the similar
block around the second range referenced in the comment.
test/extended/node/additional_artifact_stores.go (2)

45-89: ⚠️ Potential issue | 🟠 Major

Invalid-input tests can false-pass when bad configs are accepted.

Several “should reject …” specs only log acceptance (or poll without asserting final failure), so regressions can pass undetected.

Also applies to: 91-141, 143-177, 209-255, 257-291, 293-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 45 - 89, The
test currently just logs when an invalid ContainerRuntimeConfig is accepted (in
the loop around invalidPaths) which can false-pass; change the logic so that
after mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
returns nil you do not merely log but explicitly fail the test if MCO does not
reject the bad config: use the Gomega/o.Expect (or g.Fail) to assert the
creation should not succeed, and if you must accept API creation for negative
tests then replace the log+defer with a deterministic check that the MCO rejects
the config (polling status/conditions or checking the rendered/validated state)
and call o.Expect(...).To(o.BeTrue())/o.Fail() on the rejection condition; keep
cleanupContainerRuntimeConfig to remove any created resources on failure. Ensure
this pattern is applied to the invalid-path loop and the other similar blocks
referenced (the other test ranges).

460-481: ⚠️ Potential issue | 🟠 Major

Add MCP update-start gating before readiness waits.

After CRC create/update/delete, these tests call waitForMCP(...) directly. That can return on the old rendered state before rollout actually starts.

Also applies to: 548-567, 618-646, 688-693, 744-749, 810-814, 920-924, 1000-1006, 1063-1083

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 460 - 481, The
test must gate the rollout readiness wait with an MCP "update-start" check after
creating/updating/deleting the ContainerRuntimeConfig and after you discover the
generated MachineConfig (variables/functions: ContainerRuntimeConfig,
generatedMCName, mcClient, waitForMCP); add a short wait that asserts the worker
MachineConfigPool has begun rolling out the new config (for example
implement/use a helper like waitForMCPUpdateStart(ctx, mcClient, "worker",
generatedMCName, shortTimeout) or poll the worker MachineConfigPool status and
return when its current/configuration name equals generatedMCName and
UpdatedMachineCount < MachineCount or when a Condition/Updating flag is set)
before calling the existing waitForMCP(...) readiness waiter so the test no
longer returns the old rendered state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/additional_image_stores.go`:
- Line 883: The call to waitForPodDeleted(ctx, oc, baselinePod.Name,
2*time.Minute) ignores its returned error which can mask pod-deletion timing
failures; update the test to capture and assert the error (e.g., use
require.NoError/Expect/if err != nil { t.Fatalf(...) }) immediately after
calling waitForPodDeleted with the same ctx, oc and baselinePod.Name so the test
fails fast on deletion timeouts instead of proceeding and introducing racey
cleanup behavior.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 1605-1623: The test currently ignores errors from
ExecOnNodeWithChroot calls (used for "crictl rmi", "crictl images", and "ls
/var/lib/stargz-store/store/"), which can hide failures; modify the code around
the ExecOnNodeWithChroot invocations that set rmiOutput, crioImages, and
storeAfterDelete to capture and check the returned error and fail the test when
non-nil (e.g., use o.Expect(err).NotTo(o.HaveOccurred()) or return the error),
and ensure any useful output (rmiOutput, crioImages, storeAfterDelete) is still
logged only after the error check so the test reliably asserts that the CRI-O
image removal and subsequent verifications actually succeeded.

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 45-89: The test currently just logs when an invalid
ContainerRuntimeConfig is accepted (in the loop around invalidPaths) which can
false-pass; change the logic so that after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...) returns
nil you do not merely log but explicitly fail the test if MCO does not reject
the bad config: use the Gomega/o.Expect (or g.Fail) to assert the creation
should not succeed, and if you must accept API creation for negative tests then
replace the log+defer with a deterministic check that the MCO rejects the config
(polling status/conditions or checking the rendered/validated state) and call
o.Expect(...).To(o.BeTrue())/o.Fail() on the rejection condition; keep
cleanupContainerRuntimeConfig to remove any created resources on failure. Ensure
this pattern is applied to the invalid-path loop and the other similar blocks
referenced (the other test ranges).
- Around line 460-481: The test must gate the rollout readiness wait with an MCP
"update-start" check after creating/updating/deleting the ContainerRuntimeConfig
and after you discover the generated MachineConfig (variables/functions:
ContainerRuntimeConfig, generatedMCName, mcClient, waitForMCP); add a short wait
that asserts the worker MachineConfigPool has begun rolling out the new config
(for example implement/use a helper like waitForMCPUpdateStart(ctx, mcClient,
"worker", generatedMCName, shortTimeout) or poll the worker MachineConfigPool
status and return when its current/configuration name equals generatedMCName and
UpdatedMachineCount < MachineCount or when a Condition/Updating flag is set)
before calling the existing waitForMCP(...) readiness waiter so the test no
longer returns the old rendered state.

In `@test/extended/node/additional_image_stores.go`:
- Around line 612-616: The test is racy because after mutating
ContainerRuntimeConfig (ctrcfg.Name) the code calls waitForMCP(...) immediately
and may see the old steady state; update each mutation flow (where
waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, ...) is
followed by waitForMCP(ctx, mcClient, "worker", ...)) to first wait for the MCP
to enter the "updating" phase (i.e., a start-updating gate) before calling
waitForMCP for completion: add a helper/check that polls the
MachineConfigPool/MC status for .status.degraded===false and .status.conditions
includes an Updating/Progressing condition (or reuse an existing waitForMCPState
helper if present) using mcClient and ctrcfg.Name, then only call
waitForMCP(ctx, mcClient, "worker", ...) to wait for finished rollout; apply
this change to all similar sites (the listed line groups).
- Around line 47-91: The test currently only logs when an invalid
ContainerRuntimeConfig (ctrcfg) is accepted, so create success can silently
pass; update the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to assert
outcomes: if Create returns an error, keep the current log; if Create succeeds,
immediately defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and
then poll the MCO validation (replace the current log path) — use an assertion
(o.Expect / g.Eventually) to fail the test if the config remains accepted by the
MCO (i.e., wait for the object to be rejected/rolled back or for the MCO to
surface an error), and fail the test instead of just logging; ensure these
checks reference ctrcfg, ContainerRuntimeConfigs().Create, and
cleanupContainerRuntimeConfig so the validation branch cannot silently pass.
- Around line 535-538: The tests currently only log failures from
prepopulateImageOnNode and registry pulls (e.g., calls using
prepopulateImageOnNode(ctx, oc, testNode, testImageDefault,
additionalImageStorePath)), making the feature check soft; change those log-only
branches to assert/fail the test instead (use the test framework's failure
helper such as framework.Failf or ExpectNoError) so the test fails when
prepopulation or pull fails; apply the same change to the other occurrences
where prepopulateImageOnNode or registry pull errors are merely logged (the
duplicated call sites mentioned) so the behavior is actually enforced rather
than only advertised in logs.
- Around line 850-883: The baseline measurement is biased by a warm image cache
because the same image (testImageDefault) is pulled on the node and not removed
before the optimized run; fix by ensuring a consistent cache state before each
measurement: remove the image from the node(s) (or use a fresh unique image tag)
prior to creating baselinePod and again prior to the optimized run so both runs
start from the same cold-cache state. Locate the pod creation flow (baselinePod,
waitForPodRunning) and add a pre-step that evicts the image (for example by
running a node-level image remove via the test harness/oc or executing a
privileged pod that runs the container runtime image remove) or switch to using
a uniquely tagged test image for each run; apply the same change for the similar
block around the second range referenced in the comment.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 167-213: The test loop currently treats acceptance of invalid
paths as a warning instead of failing the spec; update the loop in the "should
reject path with invalid characters for additionalLayerStores [TC5]" test so
that when mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create
returns no error for a test case (i.e., the API accepted the invalid path), the
test fails (use the test assertion helper, e.g.,
o.Expect(err).To(o.HaveOccurred()) or o.Expect(false).To(o.BeTrue()) with a
clear message) rather than merely logging a warning; keep the cleanup via
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) for the success branch
but ensure the assertion runs so the spec fails on regressions.
- Around line 721-725: The direct waitForMCP(...) calls can race the MCP "start
updating" transition; before each waitForMCP(ctx, mcClient, "worker", ...),
insert a call to a synchronization helper that waits for the MCP to enter the
"Updating" state (e.g., call waitForMCPStartUpdating(ctx, mcClient, "worker",
<short timeout>)) so the subsequent waitForMCP observes the in-progress update
rather than the pre-change state; update every occurrence (the shown call and
the other duplicated spots) to perform waitForMCPStartUpdating(...) immediately
before waitForMCP(...).
- Around line 1435-1453: Replace the hardcoded personal image string assigned to
eStargzImage in the test (additional_layer_stores.go) with the project-managed
test-image lookup helper (instead of "quay.io/bgudi/test-small:estargz"); call
the centralized test image registry/helper used elsewhere in the repo (for
example the project’s test image getter or framework image helper) to obtain the
estargz test image, assign its result to eStargzImage, and remove the literal so
CI uses a repository-managed image tag.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 60404f2d-f230-4e26-93ed-7e3d1c85b473

📥 Commits

Reviewing files that changed from the base of the PR and between bf419f3 and f09da80.

📒 Files selected for processing (6)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/node/stargz_store_setup.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/node/additional_stores_combined.go

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from f09da80 to af7b257 Compare April 30, 2026 09:50
BhargaviGudi (Contributor, Author) commented:

/test all

coderabbitai Bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (6)
test/extended/node/stargz_store_setup.go (4)

411-425: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on per-node verification command errors.

If ExecOnNodeWithChroot(...) fails on a worker, Deploy() currently logs and continues. That allows success without actually verifying every targeted node.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 411 - 425, The
per-node checks in Deploy currently swallow ExecOnNodeWithChroot errors (they
call framework.Logf and continue), which can allow Deploy to report success
without verifying all nodes; change the behavior so that when
ExecOnNodeWithChroot returns a non-nil err (both for the "systemctl is-active"
call and the "mount" call) the function returns an error immediately (e.g.,
fmt.Errorf) that includes the node name and the original error; update the logic
around ExecOnNodeWithChroot usages in Deploy to fail-fast instead of continuing
so calling code knows verification failed.

79-107: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Cleanup only deletes the namespace; it does not revert host changes.

The service, binary, systemd unit, config, mount, and CRI-O side effects remain on each worker after Cleanup(). That contaminates later specs with modified nodes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 79 - 107, The Cleanup
method on StargzStoreSetup only deletes the namespace but must also undo
host-level changes applied during deployment: update Cleanup to iterate the
affected nodes and idempotently stop/disable the stargz-store systemd unit,
remove the installed binary and config files, unmount any mounts created, revert
any CRI-O config changes (and reload/restart CRI-O/kubelet as needed), and
remove any sidecar/static-pod artifacts; use the StargzStoreSetup.Cleanup
function to call helper routines (or add new helpers) that run the required
commands via the existing cluster exec/ssh mechanism (referencing s.namespace,
s.oc, s.deployed and the Cleanup method) and ensure each step logs failures but
continues so cleanup is best-effort and safe to re-run.

217-220: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid downloading the stargz binary from public GitHub during test setup.

This makes the suite depend on external internet reachability and GitHub availability instead of just cluster behavior. A prepackaged or cluster-local artifact source would make the test much more stable in CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 217 - 220, Replace the
external GitHub curl download (the curl -L -o /tmp/stargz.tar.gz and the
hardcoded https://github.com/.../stargz-snapshotter-%s-linux-amd64.tar.gz URL
and the echo "Downloading stargz-snapshotter %s...") with a
cluster-local/prepackaged artifact lookup: read the stargz-snapshotter tarball
from the test artifact store or a local registry (e.g., a helper like
getLocalArtifact("stargz-snapshotter", version) or the test fixture artifact
directory) and write it to /tmp/stargz.tar.gz, failing fast if the artifact is
missing; ensure any code that expects the filename
stargz-snapshotter-%s-linux-amd64.tar.gz still works with the local artifact
source.
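
A minimal sketch of the cluster-local lookup this suggests. The env var and fixture directory are hypothetical; only the expected tarball filename matches the existing script:

package node

import (
	"fmt"
	"os"
	"path/filepath"
)

// getLocalStargzArtifact resolves the tarball from a CI-provided fixture
// directory instead of GitHub and fails fast when it is missing.
func getLocalStargzArtifact(version string) (string, error) {
	dir := os.Getenv("TEST_ARTIFACT_DIR") // hypothetical CI-provided location
	if dir == "" {
		dir = "testdata/artifacts"
	}
	path := filepath.Join(dir, fmt.Sprintf("stargz-snapshotter-%s-linux-amd64.tar.gz", version))
	if _, err := os.Stat(path); err != nil {
		return "", fmt.Errorf("stargz-snapshotter artifact not found at %s: %w", path, err)
	}
	return path, nil
}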

48-75: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Partial deploy failures can still leak cluster state.

deployed flips only after all verification succeeds, but node mutation starts much earlier. If namespace/configmap/daemonset creation succeeds and a later step fails, callers using IsDeployed() will skip cleanup entirely.

Also applies to: 79-83

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 48 - 75, The current
flow only sets s.deployed = true after all verification succeeds, so partial
failures (namespace/configmap/daemonset created but verification failed) leave
resources in cluster because IsDeployed() returns false; change the logic to
mark the deployment as present as soon as cluster-affecting resources are
created: set s.deployed = true immediately after successful createDaemonSet (or
right after createConfigMap/createDaemonSet sequence) and before calling
waitForDaemonSetReady/verifyStargzStoreRunning, and mirror the same change in
the similar block around lines 79-83 so callers invoking IsDeployed() can
trigger cleanup when verification fails.
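
A sketch of the ordering fix, with the setup struct stubbed out since the PR's real type is not shown in this thread:

package node

import "context"

// stargzSetupSketch stubs the fields this sketch needs; the real
// StargzStoreSetup methods are assumed from the review above.
type stargzSetupSketch struct {
	deployed bool
	createNamespace, createConfigMap, createDaemonSet,
		waitForDaemonSetReady, verifyStargzStoreRunning func(context.Context) error
}

// deploy flips deployed as soon as cluster-affecting resources exist, so a
// failed readiness or node check still leaves IsDeployed() gating cleanup.
func (s *stargzSetupSketch) deploy(ctx context.Context) error {
	if err := s.createNamespace(ctx); err != nil {
		return err
	}
	s.deployed = true // namespace exists; cleanup must run from here on
	if err := s.createConfigMap(ctx); err != nil {
		return err
	}
	if err := s.createDaemonSet(ctx); err != nil {
		return err
	}
	if err := s.waitForDaemonSetReady(ctx); err != nil {
		return err
	}
	return s.verifyStargzStoreRunning(ctx)
}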
test/extended/node/additional_artifact_stores.go (1)

79-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still do not fail when invalid configs are accepted.

Several "should reject ..." cases log success when the API rejects the config and merely continue when it does not, and TC2 still does not assert the MCO failure path. A validation regression can still produce a green run here.

Also applies to: 118-140, 168-177, 246-255, 283-291, 317-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 89, The
test currently only logs when an invalid ContainerRuntimeConfig is accepted
(using mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
cleanupContainerRuntimeConfig) instead of failing; update each negative case
(including the blocks around the ranges noted) to assert failure when the API
accepts an invalid config by calling t.Fatalf or framework.Failf (so the test
fails immediately) and, for cases that expect MCO-level rejection (e.g., TC2),
add the additional wait/assert logic to verify the MCO validation rejects the
config (polling the MCO status or relevant resource and asserting the expected
failure) rather than just logging “checking MCO validation”; ensure each branch
that currently only logs on API acceptance is replaced with an explicit test
failure or a follow-up MCO validation assertion referencing tc.path, the Create
call, and cleanupContainerRuntimeConfig to locate the affected code.
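
A sketch of the assertion pattern both this and the image-store finding below call for: accept create-time rejection, otherwise require the MCO to report a Failure condition. The condition-type string and timeouts are assumptions:

package node

import (
	"context"
	"time"

	o "github.com/onsi/gomega"
	mcfgv1 "github.com/openshift/api/machineconfiguration/v1"
	mcfgclient "github.com/openshift/client-go/machineconfiguration/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// expectCtrcfgRejected passes on create-time rejection; if the API accepts
// the object, it polls for a Failure status condition and fails otherwise.
func expectCtrcfgRejected(ctx context.Context, mcClient mcfgclient.Interface, ctrcfg *mcfgv1.ContainerRuntimeConfig) {
	_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
	if err != nil {
		return // rejected at the API layer, as expected
	}
	defer func() {
		_ = mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Delete(ctx, ctrcfg.Name, metav1.DeleteOptions{})
	}()
	err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 3*time.Minute, true, func(ctx context.Context) (bool, error) {
		cur, getErr := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Get(ctx, ctrcfg.Name, metav1.GetOptions{})
		if getErr != nil {
			return false, getErr
		}
		for _, c := range cur.Status.Conditions {
			// "Failure" is assumed to match ContainerRuntimeConfigFailure.
			if string(c.Type) == "Failure" && c.Status == "True" {
				return true, nil
			}
		}
		return false, nil
	})
	o.Expect(err).NotTo(o.HaveOccurred(),
		"invalid ContainerRuntimeConfig %s was accepted and never reported Failure", ctrcfg.Name)
}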
test/extended/node/additional_image_stores.go (1)

80-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still go green when invalid configs are accepted.

Several "should reject ..." cases only log acceptance instead of failing, and TC2 also ignores the result of the MCO polling path. If API validation regresses, this suite can still pass without ever asserting ContainerRuntimeConfigFailure=True.

Also applies to: 119-140, 169-177, 247-255, 284-291, 318-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 80 - 89, The test
currently only logs when invalid ContainerRuntimeConfig paths are accepted by
the API (using
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and never
fails the test if MCO validation doesn't set ContainerRuntimeConfigFailure;
update the test logic so that for each tc.path: if Create returns an error,
assert/log that as expected, but if Create succeeds (and you defer
cleanupContainerRuntimeConfig), actively poll the MCO status for the created
config (check for the ContainerRuntimeConfigFailure condition on the
corresponding object) within a timeout and fail the test if the failure
condition is not observed; apply this change to the other similar blocks
referenced (the ranges around lines 119-140, 169-177, 247-255, 284-291, 318-325)
so all “should reject” cases assert MCO-level rejection rather than only logging
acceptance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 567-570: After calling getPureWorkerNodes(workerNodes) check
whether pureWorkers is empty and skip the spec if so; specifically, add a guard
like if len(pureWorkers) == 0 { o.Skip("no pure worker nodes available; skipping
spec") } immediately after the pureWorkers := getPureWorkerNodes(workerNodes)
line in the block that uses getNodesByLabel and getPureWorkerNodes (this
prevents the node-verification loops from being vacuously successful). Apply the
same guard to the other occurrence where getPureWorkerNodes is used (the block
around the other call at the later occurrence).
- Around line 540-545: Replace the fragile "cat" check with a command that
succeeds but only prints when the file exists so we can assert absence
specifically: use ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "if [ -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores ]; then echo present;
fi") and then assert the command returned no error and produced an empty stdout
(meaning the file is absent); keep framework.Logf("Node %s: CRI-O config file
removed successfully", node.Name) and update the expectation to check output ==
"" rather than just expecting ExecOnNodeWithChroot to fail.

In `@test/extended/node/additional_image_stores.go`:
- Around line 718-720: The tests call getNodesByLabel(...) and derive
pureWorkers via getPureWorkerNodes(workerNodes) but don't skip when pureWorkers
is empty, causing SNO/compact clusters to exercise no-op loops and falsely pass;
after computing pureWorkers (the variable), add a guard like if len(pureWorkers)
== 0 { o.Skip("no pure worker nodes for this cluster topology") } where the
first instance occurs (around the getNodesByLabel/getPureWorkerNodes usage) and
apply the same guard at the other similar location that computes pureWorkers
later in the file so these specs are skipped when no pure workers exist.
- Around line 631-644: The hard 2x timing assertion using podStartupTime and
pod2StartupTime is too brittle; change the Expect that currently uses
o.BeNumerically("<", pod2StartupTime/2) to a relaxed check (e.g.,
o.BeNumerically("<", pod2StartupTime)) or remove the numeric timing assertion
entirely and rely on the existing event/source checks for additionalImageStores;
also update the failure message near the Expect and the Logf that prints
"Speedup" to avoid implying a strict 2x requirement.
- Around line 550-576: The test reads Events once right after
waitForPodRunning(...) which can miss the asynchronous "Pulled" event; change
the logic around oc.AdminKubeClient().CoreV1().Events(...).List(...) and the
foundAlreadyPresentEvent check to poll until a timeout (e.g., using
framework/poll or wait.PollImmediate) and retry listing events and scanning for
event.Reason == "Pulled" and the "already present on machine and can be accessed
by the pod" substring, failing only after the timeout; apply the same polling
fix to the other similar block that checks events (the block around the second
occurrence noted in the comment, currently at lines 607-629) so both checks
reliably wait for the Pulled event before asserting using
additionalImageStorePath and testPod.
- Around line 583-599: After calling Delete on testPod, wait for the pod to be
fully removed (or to reach a terminal state) before running the podman rmi and
creating the fallback pod: poll with wait.PollImmediate
(k8s.io/apimachinery/pkg/util/wait) against
oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions{}) and loop until the Get returns a NotFound error or the
pod.Status.Phase is Succeeded/Failed (or deletionTimestamp is set and pod is
gone); only then run ExecOnNodeWithChroot (the removeCmd) and call createTestPod
for testPod2. Ensure the wait has a reasonable timeout and fail the test if the
pod does not terminate in time.

In `@test/extended/node/node_utils.go`:
- Around line 950-959: waitForDaemonSetReady currently always uses
oc.Namespace() when calling
oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(...), which can
poll the wrong DaemonSet when createTestDaemonSet created resources in a
different namespace; change the helper to accept an explicit namespace parameter
(e.g., add ns string to waitForDaemonSetReady signature) and use
oc.AdminKubeClient().AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
instead, and apply the same namespace-parameter fix to the other occurrence
referenced (lines 1125-1158) so callers pass the intended namespace rather than
relying on oc.Namespace().
- Around line 1085-1092: In waitForPodDeleted, non-NotFound GET errors are being
swallowed (returning false, nil) which masks RBAC/transport/apiserver failures;
update the closure in waitForPodDeleted to return the actual error (return
false, err) for any err that is not apierrors.IsNotFound(err) so polling will
surface and abort on real API errors while still returning (true, nil) when
apierrors.IsNotFound(err).

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 79-89: The test currently only logs when an invalid
ContainerRuntimeConfig is accepted (using
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
cleanupContainerRuntimeConfig) instead of failing; update each negative case
(including the blocks around the ranges noted) to assert failure when the API
accepts an invalid config by calling t.Fatalf or framework.Failf (so the test
fails immediately) and, for cases that expect MCO-level rejection (e.g., TC2),
add the additional wait/assert logic to verify the MCO validation rejects the
config (polling the MCO status or relevant resource and asserting the expected
failure) rather than just logging “checking MCO validation”; ensure each branch
that currently only logs on API acceptance is replaced with an explicit test
failure or a follow-up MCO validation assertion referencing tc.path, the Create
call, and cleanupContainerRuntimeConfig to locate the affected code.

In `@test/extended/node/additional_image_stores.go`:
- Around line 80-89: The test currently only logs when invalid
ContainerRuntimeConfig paths are accepted by the API (using
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and never
fails the test if MCO validation doesn't set ContainerRuntimeConfigFailure;
update the test logic so that for each tc.path: if Create returns an error,
assert/log that as expected, but if Create succeeds (and you defer
cleanupContainerRuntimeConfig), actively poll the MCO status for the created
config (check for the ContainerRuntimeConfigFailure condition on the
corresponding object) within a timeout and fail the test if the failure
condition is not observed; apply this change to the other similar blocks
referenced (the ranges around lines 119-140, 169-177, 247-255, 284-291, 318-325)
so all “should reject” cases assert MCO-level rejection rather than only logging
acceptance.

In `@test/extended/node/stargz_store_setup.go`:
- Around line 411-425: The per-node checks in Deploy currently swallow
ExecOnNodeWithChroot errors (they call framework.Logf and continue), which can
allow Deploy to report success without verifying all nodes; change the behavior
so that when ExecOnNodeWithChroot returns a non-nil err (both for the "systemctl
is-active" call and the "mount" call) the function returns an error immediately
(e.g., fmt.Errorf) that includes the node name and the original error; update
the logic around ExecOnNodeWithChroot usages in Deploy to fail-fast instead of
continuing so calling code knows verification failed.
- Around line 79-107: The Cleanup method on StargzStoreSetup only deletes the
namespace but must also undo host-level changes applied during deployment:
update Cleanup to iterate the affected nodes and idempotently stop/disable the
stargz-store systemd unit, remove the installed binary and config files, unmount
any mounts created, revert any CRI-O config changes (and reload/restart
CRI-O/kubelet as needed), and remove any sidecar/static-pod artifacts; use the
StargzStoreSetup.Cleanup function to call helper routines (or add new helpers)
that run the required commands via the existing cluster exec/ssh mechanism
(referencing s.namespace, s.oc, s.deployed and the Cleanup method) and ensure
each step logs failures but continues so cleanup is best-effort and safe to
re-run.
- Around line 217-220: Replace the external GitHub curl download (the curl -L -o
/tmp/stargz.tar.gz and the hardcoded
https://github.com/.../stargz-snapshotter-%s-linux-amd64.tar.gz URL and the echo
"Downloading stargz-snapshotter %s...") with a cluster-local/prepackaged
artifact lookup: read the stargz-snapshotter tarball from the test artifact
store or a local registry (e.g., a helper like
getLocalArtifact("stargz-snapshotter", version) or the test fixture artifact
directory) and write it to /tmp/stargz.tar.gz, failing fast if the artifact is
missing; ensure any code that expects the filename
stargz-snapshotter-%s-linux-amd64.tar.gz still works with the local artifact
source.
- Around line 48-75: The current flow only sets s.deployed = true after all
verification succeeds, so partial failures (namespace/configmap/daemonset
created but verification failed) leave resources in cluster because IsDeployed()
returns false; change the logic to mark the deployment as present as soon as
cluster-affecting resources are created: set s.deployed = true immediately after
successful createDaemonSet (or right after createConfigMap/createDaemonSet
sequence) and before calling waitForDaemonSetReady/verifyStargzStoreRunning, and
mirror the same change in the similar block around lines 79-83 so callers
invoking IsDeployed() can trigger cleanup when verification fails.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5c9d8fe-95ec-4b74-8c24-f0b1dcb4c3b2

📥 Commits

Reviewing files that changed from the base of the PR and between f09da80 and af7b257.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/testsuites/standard_suites.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go

Comment on lines +540 to +545
g.By("PHASE 6: Verifying CRI-O config file is removed after CRC deletion")
for _, node := range pureWorkers {
_, err := ExecOnNodeWithChroot(oc, node.Name, "cat", "/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores")
o.Expect(err).To(o.HaveOccurred(),
"CRI-O config file should be removed after ContainerRuntimeConfig deletion on node %s", node.Name)
framework.Logf("Node %s: CRI-O config file removed successfully", node.Name)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert that the file is absent, not just that cat failed.

Any ExecOnNodeWithChroot(...) failure satisfies this expectation today, so a broken debug session or transient node error looks identical to successful cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" check with a command that succeeds but only prints
when the file exists so we can assert absence specifically: use
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "if [ -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores ]; then echo present;
fi") and then assert the command returned no error and produced an empty stdout
(meaning the file is absent); keep framework.Logf("Node %s: CRI-O config file
removed successfully", node.Name) and update the expectation to check output ==
"" rather than just expecting ExecOnNodeWithChroot to fail.

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from af7b257 to 990e5f3 on May 5, 2026 08:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (8)
test/extended/node/node_utils.go (2)

1085-1092: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't swallow non-NotFound pod GET errors.

Line 1091 returns false, nil for every error besides IsNotFound, so real API failures only show up as a timeout instead of failing the poll immediately.

Suggested fix
 func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
 		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
 		if apierrors.IsNotFound(err) {
 			return true, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return false, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 1085 - 1092, The poll in
function waitForPodDeleted is swallowing non-NotFound GET errors; modify the
closure used by wait.PollUntilContextTimeout so that after calling
oc.AdminKubeClient().CoreV1().Pods(...).Get(...) you check
apierrors.IsNotFound(err) -> return true,nil, else if err != nil -> return
false, err (so the poll fails fast with the actual API error), and only return
false,nil when err == nil and the pod still exists; update the closure around
the Get call in waitForPodDeleted accordingly.

950-958: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll the intended DaemonSet and surface real GET failures.

Line 952 hardcodes oc.Namespace() even though createTestDaemonSet(...) lets callers choose a namespace, so this helper can wait on the wrong DaemonSet. Lines 953-954 also turn every API error into a silent timeout, which makes RBAC/apiserver failures look like readiness flakes.

Suggested fix
-func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, name string, timeout time.Duration) error {
+func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, namespace, name string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(ctx, name, metav1.GetOptions{})
-		if err != nil {
+		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
+		if apierrors.IsNotFound(err) {
 			return false, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled &&
 			ds.Status.DesiredNumberScheduled > 0, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 950 - 958, The helper
waitForDaemonSetReady currently always polls oc.Namespace() and silences GET
errors; change its signature to accept an explicit namespace (e.g.
waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, namespace string,
name string, timeout time.Duration)) then call
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) so it polls the
intended DaemonSet created by createTestDaemonSet, and do not swallow API
errors—return the actual err from the GET (return false, err) instead of
converting all errors into silent timeouts; update all call sites to pass the
correct namespace.
test/extended/node/additional_layer_stores.go (3)

460-462: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid a personal Quay image here.

Line 461 hardcodes quay.io/bgudi/test-small:estargz, which makes this suite depend on a non-repository-owned external image. If that tag is removed, permissions change, or mirroring rules differ, CI breaks for reasons unrelated to the code under test. Prefer a repository-managed image routed through image.LocationFor(...) and registered in the allowed image list.

#!/bin/bash
set -euo pipefail

rg -n 'quay\.io/bgudi/test-small:estargz' test/extended/node
fd -i 'image.go' test/extended/util | xargs rg -n 'allowedImages|bgudi/test-small|estargz'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 460 - 462,
Replace the hardcoded external image value assigned to eStargzImage
("quay.io/bgudi/test-small:estargz") with a repository-managed image lookup
using image.LocationFor(...) (or another test helper that returns an allowed
image), and ensure the chosen image is registered in the allowed image list used
by the test suite; update the assignment where eStargzImage is defined and any
test harness references so they use the LocationFor result rather than the
personal quay URL.
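
For reference, the shape of a repository-managed lookup. image.LocationFor is origin's helper for allow-listed test images; the concrete reference below is a placeholder, and a real eStargz image would first need to be added to that list:

package node

import (
	"github.com/openshift/origin/test/extended/util/image"
)

// eStargzImageLocation resolves the test image through the suite's
// image-mirroring machinery instead of a personal Quay repository.
func eStargzImageLocation() string {
	// Hypothetical allow-listed reference; not currently registered.
	return image.LocationFor("registry.k8s.io/e2e-test-images/estargz-test:latest")
}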

207-213: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail TC5 if the invalid path is accepted.

Lines 207-213 only log a warning on a successful create, so this spec still passes if invalid-character validation regresses.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 207 - 213, The
test currently only logs a warning when an invalid path is accepted, so add a
failing assertion instead: after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
succeeds, call the test-failure helper (e.g. framework.Failf or t.Fatalf
depending on test harness) with a clear message including tc.char and
ctrcfg.Name rather than just logging; keep the existing
cleanupContainerRuntimeConfig defer and replace the framework.Logf warning
branch with a failure call to ensure the spec fails when invalid paths are
accepted.

744-748: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the worker MCP to leave the old steady state before waitForMCP(...).

These paths jump straight from waitForContainerRuntimeConfigSuccess(...) to waitForMCP(...). If the pool is still Updated when polling starts, waitForMCP(...) can return on the previous rendered config and race the later storage.conf assertions.

Also applies to: 770-774, 840-844

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 744 - 748, After
waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, ...), ensure
the worker MachineConfigPool leaves its previous steady state before calling
waitForMCP(...): insert a short wait that polls the "worker" MCP until its
status no longer reports the prior steady-state condition (e.g., not
"Updated"/not matching the old rendered configuration) — you can reuse the
existing waitForMCP helper with a flag/variant or add a small helper (e.g.,
waitForMCPToLeaveSteadyState) to poll the "worker" pool and return only when it
has transitioned away from the old steady state, then proceed to the existing
waitForMCP(ctx, mcClient, "worker", 25*time.Minute); apply the same insertion at
the other occurrences noted (around the blocks at the other two locations).
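
A sketch of the small helper named in the prompt; the Updating condition string and the timeouts are assumptions about the machineconfiguration API and cluster behavior:

package node

import (
	"context"
	"time"

	mcfgclient "github.com/openshift/client-go/machineconfiguration/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForMCPToLeaveSteadyState polls until the pool reports Updating=True,
// so a later waitForMCP cannot return on the previous rendered config.
func waitForMCPToLeaveSteadyState(ctx context.Context, mcClient mcfgclient.Interface, pool string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
		mcp, err := mcClient.MachineconfigurationV1().MachineConfigPools().Get(ctx, pool, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		for _, c := range mcp.Status.Conditions {
			// "Updating" is assumed to match MachineConfigPoolUpdating.
			if string(c.Type) == "Updating" && c.Status == "True" {
				return true, nil
			}
		}
		return false, nil
	})
}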
test/extended/node/additional_artifact_stores.go (3)

567-575: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

After getPureWorkerNodes(...), both specs continue with an empty slice. On SNO/compact clusters that makes the node-verification loops vacuously succeed and the test can pass without checking a worker at all.

Also applies to: 656-664

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 567 - 575, The
test continues even when getPureWorkerNodes(...) returns an empty slice, causing
vacuous success on SNO/compact clusters; add an early skip when pureWorkers is
empty by checking len(pureWorkers)==0 and invoking the test skip helper (e.g.,
o.Skip or t.Skip) with a clear message before calling createDirectoriesOnNodes;
apply the same guard at the other occurrence around the block referencing
createDirectoriesOnNodes/cleanupDirectoriesOnNodes (the second instance around
lines 656-664) so both specs bail out when no pure workers exist.
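
The guard itself is small; as an in-spec fragment (getPureWorkerNodes is this PR's helper and g the usual ginkgo alias):

	pureWorkers := getPureWorkerNodes(workerNodes)
	if len(pureWorkers) == 0 {
		g.Skip("no pure worker nodes on this topology; skipping spec")
	}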

79-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make these negative-path specs fail when invalid configs are accepted.

Lines 79-85 and the similar blocks only log when the API accepts an invalid ContainerRuntimeConfig instead of failing, so the spec still goes green on a validation regression. TC2 has the same problem after the poll: if ObservedGeneration == Generation is reached without ContainerRuntimeConfigFailure=True ever being set, the test still passes.

Also applies to: 118-140, 168-174, 246-253, 283-290, 317-324

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 85, The
test currently treats API acceptance of invalid ContainerRuntimeConfig as
success; change the Create() success branch (the block around
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and the defer
cleanupContainerRuntimeConfig call) to explicitly fail the test when an invalid
config is accepted (use the test framework's failure/assert method instead of
only logging), and likewise update the poll/validation checks that compare
ObservedGeneration to Generation to fail if ObservedGeneration==Generation
occurs without ever observing the ContainerRuntimeConfigFailure=True condition
(i.e., after your existing poll of the resource state, assert failure when no
failure condition was ever set). Target the create/accept branches and the poll
that checks ObservedGeneration/Generation and ContainerRuntimeConfigFailure to
make negative-path specs fail on invalid-config acceptance.

540-545: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert file absence explicitly instead of expecting cat to fail.

Any ExecOnNodeWithChroot(...) failure satisfies this check today, so a broken debug session is indistinguishable from successful cleanup. Check for existence with a shell test and assert empty output instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" failure check with an explicit existence test: call
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "test -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores && echo exists") for
each node in pureWorkers, capture the stdout, and assert that the output is
empty (e.g. o.Expect(out).To(o.BeEmpty(), "CRI-O config file should be removed
after ContainerRuntimeConfig deletion on node %s", node.Name)); leave the
framework.Logf success message unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_layer_stores.go`:
- Around line 507-514: After successfully creating the ContainerRuntimeConfig
(the call to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{})), immediately schedule its cleanup by calling defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) right after the
o.Expect(err).NotTo(o.HaveOccurred()) that follows the create; this ensures the
ctrcfg (variable ctrcfg) is removed on test exit even if subsequent assertions
(including waitForContainerRuntimeConfigSuccess) fail.
- Around line 712-719: The test currently proceeds even when
getPureWorkerNodes(...) returns an empty slice, letting directory
setup/verification be no-ops and falsely passing; fix by checking
len(pureWorkers) immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) and, if zero, skip the spec (or return) with a
clear message so the createDirectoriesOnNodes and subsequent verification loops
are not run; apply the same guard in both places where pureWorkers is used (the
block around createDirectoriesOnNodes and the later verification loop) and
reference getPureWorkerNodes, pureWorkers, and createDirectoriesOnNodes when
adding the check.

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 567-575: The test continues even when getPureWorkerNodes(...)
returns an empty slice, causing vacuous success on SNO/compact clusters; add an
early skip when pureWorkers is empty by checking len(pureWorkers)==0 and
invoking the test skip helper (e.g., o.Skip or t.Skip) with a clear message
before calling createDirectoriesOnNodes; apply the same guard at the other
occurrence around the block referencing
createDirectoriesOnNodes/cleanupDirectoriesOnNodes (the second instance around
lines 656-664) so both specs bail out when no pure workers exist.
- Around line 79-85: The test currently treats API acceptance of invalid
ContainerRuntimeConfig as success; change the Create() success branch (the block
around mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
the defer cleanupContainerRuntimeConfig call) to explicitly fail the test when
an invalid config is accepted (use the test framework's failure/assert method
instead of only logging), and likewise update the poll/validation checks that
compare ObservedGeneration to Generation to fail if
ObservedGeneration==Generation occurs without ever observing the
ContainerRuntimeConfigFailure=True condition (i.e., after your existing poll of
the resource state, assert failure when no failure condition was ever set).
Target the create/accept branches and the poll that checks
ObservedGeneration/Generation and ContainerRuntimeConfigFailure to make
negative-path specs fail on invalid-config acceptance.
- Around line 540-545: Replace the fragile "cat" failure check with an explicit
existence test: call ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "test -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores && echo exists") for
each node in pureWorkers, capture the stdout, and assert that the output is
empty (e.g. o.Expect(out).To(o.BeEmpty(), "CRI-O config file should be removed
after ContainerRuntimeConfig deletion on node %s", node.Name)); leave the
framework.Logf success message unchanged.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 460-462: Replace the hardcoded external image value assigned to
eStargzImage ("quay.io/bgudi/test-small:estargz") with a repository-managed
image lookup using image.LocationFor(...) (or another test helper that returns
an allowed image), and ensure the chosen image is registered in the allowed
image list used by the test suite; update the assignment where eStargzImage is
defined and any test harness references so they use the LocationFor result
rather than the personal quay URL.
- Around line 207-213: The test currently only logs a warning when an invalid
path is accepted, so add a failing assertion instead: after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
succeeds, call the test-failure helper (e.g. framework.Failf or t.Fatalf
depending on test harness) with a clear message including tc.char and
ctrcfg.Name rather than just logging; keep the existing
cleanupContainerRuntimeConfig defer and replace the framework.Logf warning
branch with a failure call to ensure the spec fails when invalid paths are
accepted.
- Around line 744-748: After waitForContainerRuntimeConfigSuccess(ctx, mcClient,
ctrcfg.Name, ...), ensure the worker MachineConfigPool leaves its previous
steady state before calling waitForMCP(...): insert a short wait that polls the
"worker" MCP until its status no longer reports the prior steady-state condition
(e.g., not "Updated"/not matching the old rendered configuration) — you can
reuse the existing waitForMCP helper with a flag/variant or add a small helper
(e.g., waitForMCPToLeaveSteadyState) to poll the "worker" pool and return only
when it has transitioned away from the old steady state, then proceed to the
existing waitForMCP(ctx, mcClient, "worker", 25*time.Minute); apply the same
insertion at the other occurrences noted (around the blocks at the other two
locations).

In `@test/extended/node/node_utils.go`:
- Around line 1085-1092: The poll in function waitForPodDeleted is swallowing
non-NotFound GET errors; modify the closure used by wait.PollUntilContextTimeout
so that after calling oc.AdminKubeClient().CoreV1().Pods(...).Get(...) you check
apierrors.IsNotFound(err) -> return true,nil, else if err != nil -> return
false, err (so the poll fails fast with the actual API error), and only return
false,nil when err == nil and the pod still exists; update the closure around
the Get call in waitForPodDeleted accordingly.
- Around line 950-958: The helper waitForDaemonSetReady currently always polls
oc.Namespace() and silences GET errors; change its signature to accept an
explicit namespace (e.g. waitForDaemonSetReady(ctx context.Context, oc
*exutil.CLI, namespace string, name string, timeout time.Duration)) then call
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) so it polls the
intended DaemonSet created by createTestDaemonSet, and do not swallow API
errors—return the actual err from the GET (return false, err) instead of
converting all errors into silent timeouts; update all call sites to pass the
correct namespace.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8bbb8226-1e43-42a4-8e65-d214abf75a43

📥 Commits

Reviewing files that changed from the base of the PR and between af7b257 and 990e5f3.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (2)
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/stargz_store_setup.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/node/additional_image_stores.go


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (18)
test/extended/node/additional_layer_stores.go (3)

427-429: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Defer CRC cleanup immediately after create.

If any later assertion fails before Phase 11, the worker pool stays mutated and can poison the rest of the suite.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 427 - 429, After
creating the ContainerRuntimeConfig with
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}), immediately schedule its cleanup with a defer that
deletes the resource (use
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Delete(ctx,
ctrcfg.Name, metav1.DeleteOptions{})) so the worker pool is restored if later
assertions fail; place the defer directly after the Create call and keep using
the existing ctx and ctrcfg.Name to identify the resource.
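
As an in-spec fragment, the create-then-defer ordering looks like this (cleanupContainerRuntimeConfig is this PR's helper; its signature is taken from the prompt above):

	_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
	o.Expect(err).NotTo(o.HaveOccurred())
	// Deferred immediately: later failed assertions still restore the worker pool.
	defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)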

632-635: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

Without the guard, directory setup and verification loops can no-op on SNO/compact clusters and still report success.

Also applies to: 730-733

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 632 - 635, After
computing pureWorkers with getPureWorkerNodes(workerNodes), add a guard that
checks if len(pureWorkers) == 0 and skips the spec (e.g., call ginkgo.Skip with
a short message) so the test does not proceed on SNO/compact clusters; apply the
same change at the other occurrence around lines 730-733 where
getPureWorkerNodes is used. Ensure you reference getNodesByLabel and
getPureWorkerNodes to locate the blocks to modify.

208-214: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail this negative test when create succeeds.

Right now an invalid-character regression only logs a warning and the spec still passes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 208 - 214, The
negative test currently only logs a warning when creating the
ContainerRuntimeConfig unexpectedly succeeds; change that to fail the test so
regressions break the suite. Replace the else branch body that currently defers
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and calls
framework.Logf("Warning: Path with '%s' accepted at API level", tc.char) with
code that still defers cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
but then calls framework.Failf (or the appropriate test-failing helper used in
this test suite) with a clear message referencing tc.char and ctrcfg.Name so the
test is marked failed when
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
unexpectedly succeeds.
test/extended/node/node_utils.go (1)

950-955: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stop turning polling API failures into timeouts.

These helpers currently return false, nil on GET errors, so real apiserver/RBAC/transport failures get masked until the poll times out. Return the error for non-NotFound cases and only retry on the states that are actually expected during polling.

Also applies to: 1052-1056, 1087-1091

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 950 - 955, The helper
waitForDaemonSetReady currently swallows GET errors by returning false, nil
which masks apiserver/RBAC/transport failures; change the
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) error handling to
return the error for all non-missing cases (i.e., if err != nil and
!apierrors.IsNotFound(err) return false, err) and only treat NotFound or
specific transient states as retryable; apply the same fix pattern to the other
similar helpers referenced (the GET calls at the blocks around lines 1052-1056
and 1087-1091) so polling only suppresses expected transient states and surfaces
real errors.
test/extended/node/additional_image_stores.go (5)

80-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative API tests still pass if invalid configs are accepted.

The success branches only log, so a validation regression can still produce a green suite. Each case should either assert create-time rejection or, if the API accepts the object, assert that MCO reports ContainerRuntimeConfigFailure=True.

Also applies to: 119-141, 169-175, 247-253, 284-290, 318-324

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 80 - 86, The
tests currently only log when Create succeeds, so invalid configs can slip
through; update each test block that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and the
other similar blocks at the noted ranges) to assert behavior: if Create returns
an error then fail the test only if the error is unexpected (or explicitly
assert rejection for negative tests), and if Create succeeds then defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and immediately
poll/assert that the Machine Config Operator reports
ContainerRuntimeConfigFailure=True for the created object (use the existing MCO
status-check helper or implement a short wait loop checking the MCO status
condition for the object referenced by ctrcfg.Name and tc.path), failing the
test if the MCO does not set ContainerRuntimeConfigFailure=True within the
timeout; replace the plain framework.Logf success branches with these assertions
for all mentioned blocks.

550-576: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll for the pull event instead of reading events once.

Event recording is asynchronous. Listing once immediately after waitForPodRunning(...) makes both the additional-store and fallback assertions flaky.

Also applies to: 607-629

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 550 - 576, The
test currently lists events once after waitForPodRunning and checks for a
"Pulled" event message using variables/events from
oc.AdminKubeClient().CoreV1().Events(...) and foundAlreadyPresentEvent, which is
flaky because events are recorded asynchronously; replace the single List() call
with a polling loop (e.g., wait.PollImmediate or a retry with timeout) that
repeatedly calls oc.AdminKubeClient().CoreV1().Events(...).List(ctx,
metav1.ListOptions{FieldSelector: fmt.Sprintf("involvedObject.name=%s",
testPod.Name)}) and scans events.Items for the Pulled reason and the "already
present on machine and can be accessed by the pod" message until found or
timeout, then assert foundAlreadyPresentEvent; apply the same polling change to
the similar block referenced at lines 607-629 so both the additional-store and
fallback assertions use polling and are no longer flaky (keep references to
testPod and additionalImageStorePath in the assertion message).
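
A sketch of the polling helper, assuming oc is the usual *exutil.CLI; the interval and timeout values are illustrative:

package node

import (
	"context"
	"fmt"
	"strings"
	"time"

	exutil "github.com/openshift/origin/test/extended/util"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPulledEvent re-lists the pod's events until a Pulled event carrying
// the expected message substring appears or the timeout expires.
func waitForPulledEvent(ctx context.Context, oc *exutil.CLI, podName, substring string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		events, err := oc.AdminKubeClient().CoreV1().Events(oc.Namespace()).List(ctx, metav1.ListOptions{
			FieldSelector: fmt.Sprintf("involvedObject.name=%s", podName),
		})
		if err != nil {
			return false, err
		}
		for _, ev := range events.Items {
			if ev.Reason == "Pulled" && strings.Contains(ev.Message, substring) {
				return true, nil
			}
		}
		return false, nil
	})
}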

718-720: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

Without a guard, the setup and verification loops can no-op on SNO/compact clusters and still report success.

Also applies to: 807-809

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 718 - 720, The
test assumes pure worker nodes exist but doesn't handle SNO/compact clusters;
after calling getPureWorkerNodes(workerNodes) add a guard that checks if
len(pureWorkers) == 0 and skip the spec (e.g., call framework.Skipf or
ginkgo.Skip with a clear message like "no pure worker nodes found, skipping") so
the setup and verification loops do not run and report false success; apply the
same guard to the other occurrence that mirrors this logic near the later block.

583-592: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the first pod to disappear before removing the image.

Deletion is asynchronous here. Running podman rmi immediately can race the terminating pod still using that image on the node, which makes the fallback phase nondeterministic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 583 - 592, After
calling AdminKubeClient().CoreV1().Pods(...).Delete for testPod, wait for the
pod to be fully removed before running the podman rmi on the node: poll the API
(e.g., using wait.PollImmediate or a simple loop calling
AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions())) until the Get returns a NotFound error (use
metav1.IsNotFound(err)) or the pod is clearly terminated, then proceed to
construct removeCmd and call ExecOnNodeWithChroot to run podman rmi against
additionalImageStorePath/testImageDefault. Ensure the wait uses a reasonable
timeout to avoid flakiness.
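
As an in-spec fragment, reusing the waitForPodDeleted helper node_utils.go already defines (per the earlier review). nodeName stands in for the node hosting the pod, and the exact podman rmi invocation would follow the existing test's store-path handling:

	err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
	o.Expect(err).NotTo(o.HaveOccurred())
	o.Expect(waitForPodDeleted(ctx, oc, testPod.Name, 2*time.Minute)).NotTo(o.HaveOccurred(),
		"pod %s did not terminate before image removal", testPod.Name)
	removeCmd := fmt.Sprintf("podman rmi %s", testImageDefault)
	_, err = ExecOnNodeWithChroot(oc, nodeName, "sh", "-c", removeCmd)
	o.Expect(err).NotTo(o.HaveOccurred())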

631-644: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The hard 2x startup-time assertion is too brittle for CI.

Registry latency and cluster load can dominate wall-clock time even when the event checks already proved the image source. This is likely to create false reds unrelated to the feature.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 631 - 644, The
hard 2x assertion using o.Expect(podStartupTime).To(o.BeNumerically("<",
pod2StartupTime/2)) is brittle; change it to a more tolerant, configurable check
on the computed speedup variable (speedup :=
float64(pod2StartupTime)/float64(podStartupTime)) instead of dividing times
directly. Replace the strict check with an assertion like
o.Expect(speedup).To(o.BeNumerically(">", minSpeedup)) where minSpeedup is a
constant (e.g., 1.2) or read from an env var (e.g.,
ADDITIONAL_IMAGE_STORES_MIN_SPEEDUP) so CI can relax the threshold, and keep the
framework.Logf output of speedup for diagnostics; update references to
podStartupTime, pod2StartupTime, speedup and the o.Expect call in
additional_image_stores.go.
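
A relaxed, configurable version of the check as an in-spec fragment; the env var name and the 1.2x default floor are illustrative (assumes os and strconv are imported):

	minSpeedup := 1.2
	if v := os.Getenv("ADDITIONAL_IMAGE_STORES_MIN_SPEEDUP"); v != "" {
		if parsed, parseErr := strconv.ParseFloat(v, 64); parseErr == nil {
			minSpeedup = parsed
		}
	}
	speedup := float64(pod2StartupTime) / float64(podStartupTime)
	framework.Logf("Speedup from additional image store: %.2fx (floor %.2fx)", speedup, minSpeedup)
	o.Expect(speedup).To(o.BeNumerically(">", minSpeedup),
		"additional-store pod startup was not measurably faster than the fallback pull")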
test/extended/node/stargz_store_setup.go (4)

49-84: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mark the setup as deployed before post-create verification starts.

If namespace/DaemonSet creation succeeds but readiness or node verification fails, callers skip Cleanup() because deployed is still false. That leaks partial cluster state from failed setups.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 49 - 84, The setup
currently sets s.deployed = true only after all post-create verifications, which
prevents Cleanup() from running on partial failures; move the deployment marker
so cleanup runs on failures during readiness/verification: after
createDaemonSet(ctx) returns nil (successful creation of the DaemonSet) set
s.deployed = true, then proceed to waitForDaemonSetReady(ctx, ...) and
verifyStargzStoreRunning(ctx). Update the code around createDaemonSet,
waitForDaemonSetReady, and verifyStargzStoreRunning to reflect this ordering
(set s.deployed = true immediately after createDaemonSet succeeds).

468-484: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail deployment when per-node verification commands fail.

Continuing after ExecOnNodeWithChroot(...) errors lets Deploy() report success without actually validating every worker.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 468 - 484, The loop
that verifies stargz-store currently ignores failures from ExecOnNodeWithChroot
and continues, allowing Deploy() to succeed despite missing per-node validation;
change the error handling in the checks that call ExecOnNodeWithChroot (both the
"systemctl is-active stargz-store" call and the "mount" call) to return a
descriptive error (e.g., fmt.Errorf with node.Name and the original err) instead
of logging a warning and continuing so that Deploy() fails when per-node
verification cannot be performed.

90-118: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Cleanup does not undo the host-level installation.

Deleting the namespace leaves the binary, systemd unit, config, mount, enabled service, and CRI-O side effects on the workers. Later specs will inherit modified nodes unless cleanup also stops/disables stargz-store, removes the host files, reloads systemd, and restarts CRI-O.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 90 - 118, The Cleanup
method currently only deletes the namespace and leaves host-level artifacts;
update Cleanup(ctx context.Context) error to also perform host cleanup on all
worker nodes where stargz-store was installed: for each node (discover via the
cluster client or stored node list on the setup struct) SSH/exec to stop and
disable the systemd unit (e.g., "stargz-store.service"), remove the unit file
and any drop-ins, remove the installed binary and config files, unmount and
remove any mountpoints created by stargz-store, reload systemd daemon, and
restart the CRI-O service to undo side effects; ensure all operations handle
NotFound errors and surface failures via framework.Logf, and keep s.deployed set
to false only after host cleanup completes.

275-284: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid downloading stargz-store from GitHub during test setup.

This makes the spec depend on external internet reachability and release availability instead of cluster behavior. A repo/CI-provided artifact or payload-owned image is much safer for extended tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 275 - 284, The test
setup currently downloads stargz-store from GitHub using the curl/tar commands
(the "Downloading stargz-snapshotter" block and subsequent "curl -L -o
/tmp/stargz.tar.gz", "tar -xzf /tmp/stargz.tar.gz -C /tmp/", and "cp
/tmp/stargz-store /host/usr/local/bin/" lines), which creates an external
dependency; replace that download sequence with a deterministic source: pull the
binary from a repo/CI-provided artifact or extract it from a payload-owned image
(for example unpack a test-assets tar in the repo or use a container image that
contains stargz-store), and change the script to copy from that local path
(e.g., ../test-assets/stargz-store or an image extraction step) into
/host/usr/local/bin/, adding a clear failure message if the artifact is missing.
Ensure the curl/tar lines are removed and that the script still sets +x on the
copied stargz-store.
test/extended/node/additional_stores_combined.go (2)

269-271: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

On SNO/compact topologies, getPureWorkerNodes(...) can be empty. In that case the directory setup and verification loops no-op, and the spec can pass without validating any worker node.

Also applies to: 367-369, 441-443, 546-548

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 269 - 271, The
spec must be skipped when there are no pure worker nodes: after calling
getNodesByLabel(...) and computing pureWorkers :=
getPureWorkerNodes(workerNodes) add a guard that if len(pureWorkers) == 0 you
call the test skip helper (e.g., Skip/Skipf from your test framework) and return
so the setup/verification loops are not treated as a passing test; apply the
same check at the other occurrences of getPureWorkerNodes (the blocks around
lines 367-369, 441-443, and 546-548) to consistently skip those specs when
pureWorkers is empty.

67-69: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the rejection reason, not just that some error happened.

This passes on unrelated create failures too. Checking that the error mentions the invalid image path under test keeps TC1 tied to the actual validation being exercised.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 67 - 69, The
test currently only asserts that Create returned some error; change it to assert
the rejection reason explicitly by checking the error message contains the
invalid image path string used in ctrcfg. After the Create call on
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create, keep the
existence check but add an assertion that err.Error() (or the returned Status
message) contains the specific bad image path value from the ctrcfg object (the
invalid image path constant/field you set for the test) using a substring
matcher (e.g. Gomega ContainSubstring) so the failure is tied to the intended
validation.
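
As an in-spec fragment; invalidPath stands in for whatever bad value the spec sets on ctrcfg:

	_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
	o.Expect(err).To(o.HaveOccurred(), "invalid combined-store config must be rejected")
	o.Expect(err.Error()).To(o.ContainSubstring(invalidPath),
		"rejection should mention the invalid image path, got: %v", err)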
test/extended/node/additional_artifact_stores.go (3)

567-570: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

If getPureWorkerNodes(...) returns empty, the setup and verification loops become no-ops and the spec can report success without touching a worker node.

Also applies to: 656-658

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 567 - 570, The
spec must early-skip when no pure worker nodes exist: after calling
getPureWorkerNodes (the pureWorkers variable) check if len(pureWorkers) == 0 and
call the test-skip helper (e.g., ginkgo.Skip / o.Skip with a clear message) so
the setup and verification loops become no-ops safely; apply the same guard in
the second occurrence around the later getPureWorkerNodes usage (the block at
lines ~656-658) to avoid falsely passing when no worker nodes are available.

79-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still go green when invalid configs are accepted.

Several cases only log on success, and TC2's fallback poll path never asserts that ContainerRuntimeConfigFailure=True was actually reached. If validation regresses, this suite will still pass. Each case should either require create-time rejection or explicitly assert the failure condition after creation.

Also applies to: 118-140, 168-174, 246-252, 283-289, 317-323

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 85, The
negative tests currently only log success and never fail the test when an
invalid ContainerRuntimeConfig is accepted; update each case that expects
rejection (e.g., the Create call via
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and the TC2
fallback poll) so they either assert create-time failure (t.Fatalf/t.Errorf)
when Create returns nil error, or, if creation is allowed, immediately poll and
assert the MCO reached ContainerRuntimeConfigFailure=True via the existing poll
helper (the same helper used in other checks) before deferring
cleanupContainerRuntimeConfig; apply the same change pattern to the other
similar blocks you flagged (around the ranges noted) so invalid configs cause
test failures instead of only logging.
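A sketch of the post-create assertion, assuming the machineconfigv1 client already used in these specs and that the vendored openshift/api exposes the ContainerRuntimeConfigFailure condition constant (worth verifying); the poll interval and timeout are illustrative:

created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
if err != nil {
	framework.Logf("invalid config rejected at create time: %v", err)
	return
}
defer cleanupContainerRuntimeConfig(ctx, mcClient, created.Name)
// Creation was admitted, so the MCO must flag the config as failed.
pollErr := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true,
	func(ctx context.Context) (bool, error) {
		cfg, getErr := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Get(ctx, created.Name, metav1.GetOptions{})
		if getErr != nil {
			return false, getErr
		}
		for _, cond := range cfg.Status.Conditions {
			if cond.Type == machineconfigv1.ContainerRuntimeConfigFailure && cond.Status == corev1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
o.Expect(pollErr).NotTo(o.HaveOccurred(), "expected ContainerRuntimeConfigFailure=True for %s", created.Name)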

540-545: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert that the config file is absent, not just that cat failed.

Any debug-session or node-command failure satisfies this expectation today. Use a command that succeeds and prints only when the file exists, then assert empty stdout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" failure check in the PHASE 6 loop: instead of
expecting ExecOnNodeWithChroot(oc, node.Name, "cat",
"/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores") to error, run
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "<command that only emits
content if the file exists>") (or equivalent) so the command always succeeds and
prints only when the file is present; then assert err is nil and that stdout is
empty for each node in pureWorkers. Use the existing ExecOnNodeWithChroot call
and the known path "/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores" to
locate the code and replace the Expect(err).To(o.HaveOccurred()) check with an
assertion that stdout == "" (file absent) and no error occurred.
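A sketch of the absence check; the trailing || true keeps the exit code zero so only real exec failures surface as errors:

out, err := ExecOnNodeWithChroot(oc, node.Name, "sh", "-c",
	"[ -f /etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores ] && cat /etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores || true")
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(strings.TrimSpace(out)).To(o.BeEmpty(), "drop-in should be gone on node %s", node.Name)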
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_layer_stores.go`:
- Around line 1044-1057: The test hardcodes a workstation-local manifest path in
the nydusSetupYAML constant which will not exist in CI; update the test to load
the nydus-store manifest from the repo or embed it and apply programmatically
instead of using "/home/bgudi/...". Replace the nydusSetupYAML constant with a
relative path within the repository (or read the YAML from a testdata resource
or an embedded string) and change the oc.AsAdmin().Run("apply").Args("-f",
nydusSetupYAML).Output() call to use that resource; ensure the test still sets
nydusNamespace/nydusStorePath/nydusTestPodImage as before and validate the file
is packaged in the repo or embedded so CI can access it.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 671-680: The test currently only copies files into imageStorePath
(skopeoCmd/ExecOnNodeWithChroot) and asserts directory contents, but never
verifies CRI-O actually used additionalImageStores; add a pod-level pull/launch
using the prepopulated image (create a Pod spec that references the
prepopulated-image used in skopeoCmd) and after scheduling ensure the pod
reaches Running (or ImagePullBackOff failure absence) and then on the node use
ExecOnNodeWithChroot to run the container runtime inspection (e.g., crictl
images or crictl inspecti) to assert the image is present and was resolved
locally from imageStorePath (check the image entry/ID and any path/source
metadata), and fail the test if the pod had to fetch the image from registry or
if the runtime shows no local image; reference skopeoCmd, imageStorePath,
ExecOnNodeWithChroot, and lsOutput to find where to insert the pod creation and
runtime-inspection assertions.

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 567-570: The spec must early-skip when no pure worker nodes exist:
after calling getPureWorkerNodes (the pureWorkers variable) check if
len(pureWorkers) == 0 and call the test-skip helper (e.g., ginkgo.Skip / o.Skip
with a clear message) so the setup and verification loops become no-ops safely;
apply the same guard in the second occurrence around the later
getPureWorkerNodes usage (the block at lines ~656-658) to avoid falsely passing
when no worker nodes are available.
- Around line 79-85: The negative tests currently only log success and never
fail the test when an invalid ContainerRuntimeConfig is accepted; update each
case that expects rejection (e.g., the Create call via
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and the TC2
fallback poll) so they either assert create-time failure (t.Fatalf/t.Errorf)
when Create returns nil error, or, if creation is allowed, immediately poll and
assert the MCO reached ContainerRuntimeConfigFailure=True via the existing poll
helper (the same helper used in other checks) before deferring
cleanupContainerRuntimeConfig; apply the same change pattern to the other
similar blocks you flagged (around the ranges noted) so invalid configs cause
test failures instead of only logging.
- Around line 540-545: Replace the fragile "cat" failure check in the PHASE 6
loop: instead of expecting ExecOnNodeWithChroot(oc, node.Name, "cat",
"/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores") to error, run
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "<command that only emits
content if the file exists>") (or equivalent) so the command always succeeds and
prints only when the file is present; then assert err is nil and that stdout is
empty for each node in pureWorkers. Use the existing ExecOnNodeWithChroot call
and the known path "/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores" to
locate the code and replace the Expect(err).To(o.HaveOccurred()) check with an
assertion that stdout == "" (file absent) and no error occurred.

In `@test/extended/node/additional_image_stores.go`:
- Around line 80-86: The tests currently only log when Create succeeds, so
invalid configs can slip through; update each test block that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and the
other similar blocks at the noted ranges) to assert behavior: if Create returns
an error then fail the test only if the error is unexpected (or explicitly
assert rejection for negative tests), and if Create succeeds then defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and immediately
poll/assert that the Machine Config Operator reports
ContainerRuntimeConfigFailure=True for the created object (use the existing MCO
status-check helper or implement a short wait loop checking the MCO status
condition for the object referenced by ctrcfg.Name and tc.path), failing the
test if the MCO does not set ContainerRuntimeConfigFailure=True within the
timeout; replace the plain framework.Logf success branches with these assertions
for all mentioned blocks.
- Around line 550-576: The test currently lists events once after
waitForPodRunning and checks for a "Pulled" event message using variables/events
from oc.AdminKubeClient().CoreV1().Events(...) and foundAlreadyPresentEvent,
which is flaky because events are recorded asynchronously; replace the single
List() call with a polling loop (e.g., wait.PollImmediate or a retry with
timeout) that repeatedly calls
oc.AdminKubeClient().CoreV1().Events(...).List(ctx,
metav1.ListOptions{FieldSelector: fmt.Sprintf("involvedObject.name=%s",
testPod.Name)}) and scans events.Items for the Pulled reason and the "already
present on machine and can be accessed by the pod" message until found or
timeout, then assert foundAlreadyPresentEvent; apply the same polling change to
the similar block referenced at lines 607-629 so both the additional-store and
fallback assertions use polling and are no longer flaky (keep references to
testPod and additionalImageStorePath in the assertion message).
- Around line 718-720: The test assumes pure worker nodes exist but doesn't
handle SNO/compact clusters; after calling getPureWorkerNodes(workerNodes) add a
guard that checks if len(pureWorkers) == 0 and skip the spec (e.g., call
framework.Skipf or ginkgo.Skip with a clear message like "no pure worker nodes
found, skipping") so the setup and verification loops do not run and report
false success; apply the same guard to the other occurrence that mirrors this
logic near the later block.
- Around line 583-592: After calling AdminKubeClient().CoreV1().Pods(...).Delete
for testPod, wait for the pod to be fully removed before running the podman rmi
on the node: poll the API (e.g., using wait.PollImmediate or a simple loop
calling AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions())) until the Get returns a NotFound error (use
metav1.IsNotFound(err)) or the pod is clearly terminated, then proceed to
construct removeCmd and call ExecOnNodeWithChroot to run podman rmi against
additionalImageStorePath/testImageDefault. Ensure the wait uses a reasonable
timeout to avoid flakiness.
- Around line 631-644: The hard 2x assertion using
o.Expect(podStartupTime).To(o.BeNumerically("<", pod2StartupTime/2)) is brittle;
change it to a more tolerant, configurable check on the computed speedup
variable (speedup := float64(pod2StartupTime)/float64(podStartupTime)) instead
of dividing times directly. Replace the strict check with an assertion like
o.Expect(speedup).To(o.BeNumerically(">", minSpeedup)) where minSpeedup is a
constant (e.g., 1.2) or read from an env var (e.g.,
ADDITIONAL_IMAGE_STORES_MIN_SPEEDUP) so CI can relax the threshold, and keep the
framework.Logf output of speedup for diagnostics; update references to
podStartupTime, pod2StartupTime, speedup and the o.Expect call in
additional_image_stores.go.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 427-429: After creating the ContainerRuntimeConfig with
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}), immediately schedule its cleanup with a defer that
deletes the resource (use
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Delete(ctx,
ctrcfg.Name, metav1.DeleteOptions{})) so the worker pool is restored if later
assertions fail; place the defer directly after the Create call and keep using
the existing ctx and ctrcfg.Name to identify the resource.
- Around line 632-635: After computing pureWorkers with
getPureWorkerNodes(workerNodes), add a guard that checks if len(pureWorkers) ==
0 and skips the spec (e.g., call ginkgo.Skip with a short message) so the test
does not proceed on SNO/compact clusters; apply the same change at the other
occurrence around lines 730-733 where getPureWorkerNodes is used. Ensure you
reference getNodesByLabel and getPureWorkerNodes to locate the blocks to modify.
- Around line 208-214: The negative test currently only logs a warning when
creating the ContainerRuntimeConfig unexpectedly succeeds; change that to fail
the test so regressions break the suite. Replace the else branch body that
currently defers cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and
calls framework.Logf("Warning: Path with '%s' accepted at API level", tc.char)
with code that still defers cleanupContainerRuntimeConfig(ctx, mcClient,
ctrcfg.Name) but then calls framework.Failf (or the appropriate test-failing
helper used in this test suite) with a clear message referencing tc.char and
ctrcfg.Name so the test is marked failed when
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
unexpectedly succeeds.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 269-271: The spec must be skipped when there are no pure worker
nodes: after calling getNodesByLabel(...) and computing pureWorkers :=
getPureWorkerNodes(workerNodes) add a guard that if len(pureWorkers) == 0 you
call the test skip helper (e.g., Skip/Skipf from your test framework) and return
so the setup/verification loops are not treated as a passing test; apply the
same check at the other occurrences of getPureWorkerNodes (the blocks around
lines 367-369, 441-443, and 546-548) to consistently skip those specs when
pureWorkers is empty.
- Around line 67-69: The test currently only asserts that Create returned some
error; change it to assert the rejection reason explicitly by checking the error
message contains the invalid image path string used in ctrcfg. After the Create
call on mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create, keep
the existence check but add an assertion that err.Error() (or the returned
Status message) contains the specific bad image path value from the ctrcfg
object (the invalid image path constant/field you set for the test) using a
substring matcher (e.g. Gomega ContainSubstring) so the failure is tied to the
intended validation.

In `@test/extended/node/node_utils.go`:
- Around line 950-955: The helper waitForDaemonSetReady currently swallows GET
errors by returning false, nil which masks apiserver/RBAC/transport failures;
change the oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) error
handling to return the error for all non-missing cases (i.e., if err != nil and
!apierrors.IsNotFound(err) return false, err) and only treat NotFound or
specific transient states as retryable; apply the same fix pattern to the other
similar helpers referenced (the GET calls at the blocks around lines 1052-1056
and 1087-1091) so polling only suppresses expected transient states and surfaces
real errors.
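A sketch of that error-handling fix, assuming apierrors is k8s.io/apimachinery/pkg/api/errors and the helper body runs inside a wait condition func:

ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
if err != nil {
	if apierrors.IsNotFound(err) {
		return false, nil // DaemonSet not created yet; keep polling
	}
	return false, err // surface apiserver/RBAC/transport failures immediately
}
return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled, nil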

In `@test/extended/node/stargz_store_setup.go`:
- Around line 49-84: The setup currently sets s.deployed = true only after all
post-create verifications, which prevents Cleanup() from running on partial
failures; move the deployment marker so cleanup runs on failures during
readiness/verification: after createDaemonSet(ctx) returns nil (successful
creation of the DaemonSet) set s.deployed = true, then proceed to
waitForDaemonSetReady(ctx, ...) and verifyStargzStoreRunning(ctx). Update the
code around createDaemonSet, waitForDaemonSetReady, and verifyStargzStoreRunning
to reflect this ordering (set s.deployed = true immediately after
createDaemonSet succeeds).
- Around line 468-484: The loop that verifies stargz-store currently ignores
failures from ExecOnNodeWithChroot and continues, allowing Deploy() to succeed
despite missing per-node validation; change the error handling in the checks
that call ExecOnNodeWithChroot (both the "systemctl is-active stargz-store" call
and the "mount" call) to return a descriptive error (e.g., fmt.Errorf with
node.Name and the original err) instead of logging a warning and continuing so
that Deploy() fails when per-node verification cannot be performed.
- Around line 90-118: The Cleanup method currently only deletes the namespace
and leaves host-level artifacts; update Cleanup(ctx context.Context) error to
also perform host cleanup on all worker nodes where stargz-store was installed:
for each node (discover via the cluster client or stored node list on the setup
struct) SSH/exec to stop and disable the systemd unit (e.g.,
"stargz-store.service"), remove the unit file and any drop-ins, remove the
installed binary and config files, unmount and remove any mountpoints created by
stargz-store, reload systemd daemon, and restart the CRI-O service to undo side
effects; ensure all operations handle NotFound errors and surface failures via
framework.Logf, and keep s.deployed set to false only after host cleanup
completes.
- Around line 275-284: The test setup currently downloads stargz-store from
GitHub using the curl/tar commands (the "Downloading stargz-snapshotter" block
and subsequent "curl -L -o /tmp/stargz.tar.gz", "tar -xzf /tmp/stargz.tar.gz -C
/tmp/", and "cp /tmp/stargz-store /host/usr/local/bin/" lines), which creates an
external dependency; replace that download sequence with a deterministic source:
pull the binary from a repo/CI-provided artifact or extract it from a
payload-owned image (for example unpack a test-assets tar in the repo or use a
container image that contains stargz-store), and change the script to copy from
that local path (e.g., ../test-assets/stargz-store or an image extraction step)
into /host/usr/local/bin/, adding a clear failure message if the artifact is
missing. Ensure the curl/tar lines are removed and that the script still sets +x
on the copied stargz-store.
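A sketch of the Deploy() reordering and per-node verification described above; the method names, s.deployed field, and waitForDaemonSetReady signature follow the prompts and are assumptions:

if err := s.createDaemonSet(ctx); err != nil {
	return err
}
s.deployed = true // mark before readiness checks so Cleanup() runs on partial failures
if err := s.waitForDaemonSetReady(ctx); err != nil {
	return err
}
for _, node := range workerNodes {
	if out, err := ExecOnNodeWithChroot(oc, node.Name, "systemctl", "is-active", "stargz-store"); err != nil {
		return fmt.Errorf("verifying stargz-store on node %s: %v (output: %s)", node.Name, err, out)
	}
}

And a sketch of the deterministic install step, where /test-assets/stargz-store is a hypothetical repo- or payload-provided path:

const installStargzStore = `
set -euo pipefail
if [ ! -x /test-assets/stargz-store ]; then
  echo "FATAL: stargz-store artifact missing from test assets" >&2
  exit 1
fi
cp /test-assets/stargz-store /host/usr/local/bin/stargz-store
chmod +x /host/usr/local/bin/stargz-store
`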

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 85d07e88-830c-48ae-9844-1ef372fa2007

📥 Commits

Reviewing files that changed from the base of the PR and between 990e5f3 and d98f6f9.

📒 Files selected for processing (6)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go

@BhargaviGudi (Contributor Author) commented:

/test all

@BhargaviGudi (Contributor Author) commented:

@coderabbitai resume


coderabbitai Bot commented May 12, 2026

✅ Actions performed

Reviews resumed.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (7)
test/extended/node/additional_layer_stores.go (4)

171-217: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

TC5 still treats an accepted invalid path as a non-failure.

Lines 209-214 log a warning when Create succeeds for a path containing @, !, #, $, %. The spec is supposed to verify that the API rejects these, so an API regression that lets them through would still report green. Replace the warning branch with o.Expect(err).To(o.HaveOccurred(), ...) (and defer cleanupContainerRuntimeConfig(...) only if a create somehow succeeds), so the test fails when the API stops enforcing the rule.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 171 - 217, The
test currently treats a successful Create for invalid AdditionalLayerStores
paths as a non-failure; change the branch that logs a warning (the else after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) to assert
failure using o.Expect(err).To(o.HaveOccurred(), ...) so the test fails if the
API accepts invalid characters; only call defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) inside that branch if
Create unexpectedly succeeds. Ensure you update the test inside the Ginkgo It
block that constructs ctrcfg and iterates invalidChars.
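A sketch of the tightened branch:

_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
if err == nil {
	// Clean up the unexpectedly admitted object before failing the spec.
	defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
}
o.Expect(err).To(o.HaveOccurred(), "path containing %q must be rejected at the API level", tc.char)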

628-639: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

TC11 and TC12 still skip the pureWorkers empty-slice guard.

TC13 and TC14 were updated with if len(pureWorkers) < 1 { e2eskipper.Skipf(...) }, but TC11 (line 634) and TC12 (line 732) still call createDirectoriesOnNodes(oc, pureWorkers, ...) and the verification loops without checking length. On SNO/compact clusters the directory setup and per-node verification become no-ops, so the specs pass without actually validating worker behavior. Add the same guard immediately after each pureWorkers := getPureWorkerNodes(workerNodes).

Also applies to: 725-733

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 628 - 639, After
calling getPureWorkerNodes in the TC11 and TC12 specs (the pureWorkers :=
getPureWorkerNodes(workerNodes) lines), add the same empty-slice guard used in
TC13/TC14: check if len(pureWorkers) < 1 and call e2eskipper.Skipf(...) to skip
the test on SNO/compact clusters; this prevents createDirectoriesOnNodes(oc,
pureWorkers, ...) and the subsequent per-node verification loops from running
with an empty slice and falsely passing. Ensure the guard is inserted
immediately after the pureWorkers assignment in both affected test blocks that
reference createDirectoriesOnNodes and the verification loops.

1040-1070: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

TC15 still pins the manifest to a workstation-local path and a personal image.

  • Line 1046: /home/bgudi/work/src/github.com/openshift/epic/additionalArtifactsStore/testfiles/nydus-store-setup.yaml will not exist in CI or on any other developer machine, so the oc apply -f on line 1056 (and the cleanup oc delete -f on line 1069) will hard-fail outside the author's environment. Move the YAML into the repo (e.g., test/extended/testdata/...) and reference it via exutil.FixturePath(...), or embed it with //go:embed and pipe through stdin.
  • Line 1048: quay.io/bgudi/test-small:oci is the same personal-Quay pattern flagged for the eStargz image. Same remediation applies (project-managed image + image.LocationFor).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 1040 - 1070, The
test currently hard-codes a workstation-local manifest path and a personal Quay
image (nydusSetupYAML and nydusTestPodImage) which will fail in CI; move the
YAML into the repo testdata (e.g., test/extended/testdata/) and load it via
exutil.FixturePath(...) or embed it with //go:embed and use
oc.AsAdmin().Run("apply").Args("-f", <fixture path, or "-" for stdin>) accordingly (also update
the defer cleanup that calls oc.AsAdmin().Run("delete").Args("-f",
nydusSetupYAML)), and replace the personal image string nydusTestPodImage with a
project-managed image resolved via image.LocationFor(...) so the test uses
repository-controlled artifacts.
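A sketch using origin's fixture helper; the testdata filename is an assumed placement:

nydusSetupYAML := exutil.FixturePath("testdata", "nydus-store-setup.yaml")
out, err := oc.AsAdmin().Run("apply").Args("-f", nydusSetupYAML).Output()
o.Expect(err).NotTo(o.HaveOccurred(), "applying nydus-store manifest: %s", out)
defer oc.AsAdmin().Run("delete").Args("-f", nydusSetupYAML, "--ignore-not-found").Execute()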

368-625: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

TC10 still has two previously flagged issues: personal Quay image and no immediate CRC cleanup defer.

  • Line 381: quay.io/bgudi/test-small:estargz is a personal external repo. The same image is reused at line 963. CI/other developers will break if the tag is removed or permissions change. Use a project-managed image via image.LocationFor(...) after adding it to the allowedImages registry, or move it to an org-controlled repo.
  • Lines 427-429: ContainerRuntimeConfig is created but only deleted in Phase 11. Any assertion failure between Phase 4 and Phase 10 leaves the CRC (and the worker pool's mutated rendered config) in place and poisons subsequent specs. Add defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) immediately after the successful create.
♻️ Proposed defer placement
 		_, err = mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
 		o.Expect(err).NotTo(o.HaveOccurred())
+		defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
 		framework.Logf("ContainerRuntimeConfig %s created with path: %s", ctrcfg.Name, stargzSetup.GetStorePath())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 368 - 625, The
test uses a personal Quay image and lacks immediate cleanup for the
ContainerRuntimeConfig: replace the hard-coded eStargzImage
("quay.io/bgudi/test-small:estargz") by resolving it via image.LocationFor(...)
(after adding that image tag to allowedImages or moving it to a
project/org-controlled repo) wherever eStargzImage is declared/used, and after
the successful creation call
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...) add a
defer call to cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) so the
CRC is removed on any early exit; keep the existing final explicit delete in
Phase 11 or make it idempotent if needed.
test/extended/node/additional_stores_combined.go (3)

67-69: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pin TC1 to the invalid-path rejection.

This currently passes on any create failure. Assert that the error references the bad image path so only the intended validation failure satisfies the spec.

Suggested tightening
 _, err = mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
 o.Expect(err).To(o.HaveOccurred())
+o.Expect(err.Error()).To(o.ContainSubstring("relative/invalid/path"))
 framework.Logf("Test PASSED: Combined config with invalid image path rejected: %v", err)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 67 - 69, The
test currently accepts any create failure; tighten it by asserting the returned
error message references the invalid image path string from the ctrcfg (the bad
path used in the ContainerRuntimeConfig). After calling
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}), replace the generic o.Expect(err).To(o.HaveOccurred())
with an assertion that err contains the specific invalid image path (e.g., using
the string used when building ctrcfg) so only the intended validation rejection
for the bad image path satisfies the test.

269-272: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip TC5-TC8 when pureWorkers is empty.

These specs can no-op on topologies without pure workers: directory setup touches no nodes and the verification loops never execute, so the test passes without proving anything.

Suggested guard
 workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
 o.Expect(err).NotTo(o.HaveOccurred())
 pureWorkers := getPureWorkerNodes(workerNodes)
+if len(pureWorkers) < 1 {
+	e2eskipper.Skipf("Need at least 1 pure worker node for this test")
+}

Also applies to: 367-370, 441-444, 546-549

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 269 - 272,
After computing pureWorkers from getPureWorkerNodes(workerNodes), guard the
subsequent directory-setup and verification blocks by checking if
len(pureWorkers) == 0 and skipping the test when empty; i.e., add an early
return/skip (using the test framework's Skip helper) with a clear message so the
spec no-ops don't silently pass. Apply the same pattern for the other
occurrences where pureWorkers is used (the other blocks that follow the
getPureWorkerNodes call).

792-804: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

TC9 still doesn't exercise additionalImageStores.

This phase only proves the directory exists and storage.conf references it. The spec still passes if CRI-O ignores additionalImageStores entirely. To match the test name, add a pod pull/run that depends on the pre-populated image and then verify on-node that the image resolved locally from the configured image store rather than from the registry.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 792 - 804, Add
a runtime check that actually uses the pre-populated image: create and run a pod
that specifies the prepopulated-image name (use the same image reference used
when prepopulating) via oc.Run/Create and wait for it to start on the node (use
testNode.Name to schedule if needed), then on the node use ExecOnNodeWithChroot
to inspect the local image store (e.g. run "crictl images" or "podman images"
and "crictl inspecti <image-id>" or "podman image inspect <image>") to assert
the image in use matches the prepopulated image ID/path
(imageStorePath+"/prepopulated-image") and that no registry pull occurred (check
crio/podman journal or absence of pull events). Ensure you reference
ExecOnNodeWithChroot, oc.Run/Create, testNode.Name, and imageStorePath when
adding these checks.
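A sketch of the missing runtime check; createTestPodSpec and waitForPodRunning follow the helpers this suite already uses, prepopulatedImage is the image reference used by skopeoCmd, and whether crictl inspecti output includes the store path depends on the CRI-O version, so treat that substring match as an assumption to verify:

pod := createTestPodSpec("prepopulated-image-pod", prepopulatedImage, testNode.Name)
_, err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Create(ctx, pod, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
waitForPodRunning(ctx, oc, oc.Namespace(), pod.Name)

inspectOut, err := ExecOnNodeWithChroot(oc, testNode.Name, "crictl", "inspecti", prepopulatedImage)
o.Expect(err).NotTo(o.HaveOccurred(), "image should be known to the runtime without a registry pull")
o.Expect(inspectOut).To(o.ContainSubstring(imageStorePath))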
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_layer_stores.go`:
- Around line 869-897: The test is creating pods directly in the "default"
namespace; update TC13–TC15 to use the test CLI's per-test namespace instead.
Replace hard-coded "default" string in calls like
oc.AdminKubeClient().CoreV1().Pods("default").Create/Get and
deletePodAndWait(..., "default", podName) with oc.Namespace() (or capture once:
testNamespace := oc.Namespace() and use that variable), and ensure
createTestPodSpec, waitForPodRunning, getStargzSnapshotCount usages consistently
reference that namespace variable so pods are created, looked up, and deleted in
the test's ephemeral namespace.
- Around line 1118-1125: The defer closure captures the spec's ctx which may be
canceled when the defer runs; replace use of the captured ctx in the cleanup
(the Delete call on
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Delete,
waitForMCPToStartUpdating, and waitForMCP) with a fresh bounded context created
from context.Background() (e.g., context.WithTimeout(context.Background(),
<sane-duration>)) and ensure you call cancel() in the defer; also surface/log
any errors returned by Delete, waitForMCPToStartUpdating, and waitForMCP instead
of dropping them; apply the same ctx-fix/logging approach to the nydus DaemonSet
cleanup that uses oc.AsAdmin().Run so the cleanup reliably runs even after the
spec context is canceled.
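A sketch of the cleanup-context fix; the 15-minute bound is illustrative:

defer func() {
	// The spec's ctx may already be canceled when this runs; use a fresh one.
	cleanupCtx, cancel := context.WithTimeout(context.Background(), 15*time.Minute)
	defer cancel()
	if err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Delete(cleanupCtx, ctrcfg.Name, metav1.DeleteOptions{}); err != nil {
		framework.Logf("failed to delete ContainerRuntimeConfig %s: %v", ctrcfg.Name, err)
	}
	// Then re-run waitForMCPToStartUpdating / waitForMCP with cleanupCtx,
	// logging any errors they return (signatures as used elsewhere in these specs).
}()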

In `@test/extended/node/additional_stores_combined.go`:
- Around line 206-208: TC4 sets ContainerRuntimeConfig.LogLevel and
ContainerRuntimeConfig.PidsLimit but the test never asserts their rendered CRI-O
values; add node-side assertions in the TC4 test to read the rendered CRI-O
configuration (the CRI-O config produced on the node) and verify that log_level
== "info" and pids_limit == 4096, tying these checks to the
ContainerRuntimeConfig fields (symbols: ContainerRuntimeConfig, LogLevel,
PidsLimit). Apply the same added assertions in the other TC block referenced
(the similar code at lines ~254-259) so regressions that drop those fields will
fail the spec.
- Around line 343-354: The test only checks the first and last store paths;
update the verification in test/extended/node/additional_stores_combined.go to
assert every configured store path instead of just endpoints: when iterating
over pureWorkers and using ExecOnNodeWithChroot to read storage.conf (variable
output), build loops that check all expected indices for each store type (e.g.,
"/var/lib/layer-store-%d", "/var/lib/image-store-%d",
"/var/lib/artifact-store-%d") up to the configured max counts (use the test's
max constants or derive from the config used to provision stores) and call
o.Expect(output).To(o.ContainSubstring(...)) for each generated path so the test
validates every configured path rather than only the first and last.

---

Duplicate comments:
In `@test/extended/node/additional_layer_stores.go`:
- Around line 171-217: The test currently treats a successful Create for invalid
AdditionalLayerStores paths as a non-failure; change the branch that logs a
warning (the else after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) to assert
failure using o.Expect(err).To(o.HaveOccurred(), ...) so the test fails if the
API accepts invalid characters; only call defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) inside that branch if
Create unexpectedly succeeds. Ensure you update the test inside the Ginkgo It
block that constructs ctrcfg and iterates invalidChars.
- Around line 628-639: After calling getPureWorkerNodes in the TC11 and TC12
specs (the pureWorkers := getPureWorkerNodes(workerNodes) lines), add the same
empty-slice guard used in TC13/TC14: check if len(pureWorkers) < 1 and call
e2eskipper.Skipf(...) to skip the test on SNO/compact clusters; this prevents
createDirectoriesOnNodes(oc, pureWorkers, ...) and the subsequent per-node
verification loops from running with an empty slice and falsely passing. Ensure
the guard is inserted immediately after the pureWorkers assignment in both
affected test blocks that reference createDirectoriesOnNodes and the
verification loops.
- Around line 1040-1070: The test currently hard-codes a workstation-local
manifest path and a personal Quay image (nydusSetupYAML and nydusTestPodImage)
which will fail in CI; move the YAML into the repo testdata (e.g.,
test/extended/testdata/) and load it via exutil.FixturePath(...) or embed it
with //go:embed and use oc.AsAdmin().Run("apply").Args("-f", <fixture path, or "-" for stdin>)
accordingly (also update the defer cleanup that calls
oc.AsAdmin().Run("delete").Args("-f", nydusSetupYAML)), and replace the personal
image string nydusTestPodImage with a project-managed image resolved via
image.LocationFor(...) so the test uses repository-controlled artifacts.
- Around line 368-625: The test uses a personal Quay image and lacks immediate
cleanup for the ContainerRuntimeConfig: replace the hard-coded eStargzImage
("quay.io/bgudi/test-small:estargz") by resolving it via image.LocationFor(...)
(after adding that image tag to allowedImages or moving it to a
project/org-controlled repo) wherever eStargzImage is declared/used, and after
the successful creation call
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...) add a
defer call to cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) so the
CRC is removed on any early exit; keep the existing final explicit delete in
Phase 11 or make it idempotent if needed.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 67-69: The test currently accepts any create failure; tighten it
by asserting the returned error message references the invalid image path string
from the ctrcfg (the bad path used in the ContainerRuntimeConfig). After calling
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}), replace the generic o.Expect(err).To(o.HaveOccurred())
with an assertion that err contains the specific invalid image path (e.g., using
the string used when building ctrcfg) so only the intended validation rejection
for the bad image path satisfies the test.
- Around line 269-272: After computing pureWorkers from
getPureWorkerNodes(workerNodes), guard the subsequent directory-setup and
verification blocks by checking if len(pureWorkers) == 0 and skipping the test
when empty; i.e., add an early return/skip (using the test framework's Skip
helper) with a clear message so the spec no-ops don't silently pass. Apply the
same pattern for the other occurrences where pureWorkers is used (the other
blocks that follow the getPureWorkerNodes call).
- Around line 792-804: Add a runtime check that actually uses the pre-populated
image: create and run a pod that specifies the prepopulated-image name (use the
same image reference used when prepopulating) via oc.Run/Create and wait for it
to start on the node (use testNode.Name to schedule if needed), then on the node
use ExecOnNodeWithChroot to inspect the local image store (e.g. run "crictl
images" or "podman images" and "crictl inspecti <image-id>" or "podman image
inspect <image>") to assert the image in use matches the prepopulated image
ID/path (imageStorePath+"/prepopulated-image") and that no registry pull
occurred (check crio/podman journal or absence of pull events). Ensure you
reference ExecOnNodeWithChroot, oc.Run/Create, testNode.Name, and imageStorePath
when adding these checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 31bc31c7-dc04-46e4-8eec-d83350cf1ffc

📥 Commits

Reviewing files that changed from the base of the PR and between 990e5f3 and d98f6f9.

📒 Files selected for processing (6)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • test/extended/node/stargz_store_setup.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/node_utils.go
  • test/extended/node/additional_artifact_stores.go

Comment on lines +206 to +208
ContainerRuntimeConfig: &machineconfigv1.ContainerRuntimeConfiguration{
LogLevel: "info",
PidsLimit: int64Ptr(4096),

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Actually verify LogLevel and PidsLimit in TC4.

TC4 sets these CRI-O fields, but the assertions only check storage.conf and that crio is active. A regression that drops either field would still pass this spec. Add a node-side assertion against the rendered CRI-O config for log_level = "info" and pids_limit = 4096.

Also applies to: 254-259

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 206 - 208, TC4
sets ContainerRuntimeConfig.LogLevel and ContainerRuntimeConfig.PidsLimit but
the test never asserts their rendered CRI-O values; add node-side assertions in
the TC4 test to read the rendered CRI-O configuration (the CRI-O config produced
on the node) and verify that log_level == "info" and pids_limit == 4096, tying
these checks to the ContainerRuntimeConfig fields (symbols:
ContainerRuntimeConfig, LogLevel, PidsLimit). Apply the same added assertions in
the other TC block referenced (the similar code at lines ~254-259) so
regressions that drop those fields will fail the spec.
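A sketch of the node-side assertion; the drop-in filename glob follows the 01-ctrcfg-* pattern seen elsewhere in these specs and should be confirmed against the rendered MachineConfig:

out, err := ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "cat /etc/crio/crio.conf.d/01-ctrcfg-*")
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(out).To(o.ContainSubstring(`log_level = "info"`))
o.Expect(out).To(o.ContainSubstring("pids_limit = 4096"))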

Comment on lines +343 to +354
g.By("Verifying storage.conf contains all stores")
for _, node := range pureWorkers {
output, err := ExecOnNodeWithChroot(oc, node.Name, "cat", "/etc/containers/storage.conf")
o.Expect(err).NotTo(o.HaveOccurred())

// Verify at least some of the stores are present
o.Expect(output).To(o.ContainSubstring("/var/lib/layer-store-0"))
o.Expect(output).To(o.ContainSubstring("/var/lib/layer-store-4"))
o.Expect(output).To(o.ContainSubstring("/var/lib/image-store-0"))
o.Expect(output).To(o.ContainSubstring("/var/lib/image-store-9"))
o.Expect(output).To(o.ContainSubstring("/var/lib/artifact-store-0"))
o.Expect(output).To(o.ContainSubstring("/var/lib/artifact-store-9"))

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

TC5 only proves a subset of the max-store entries.

Checking just the first and last path for each store type will miss truncated or partially rendered configs. Since this case exists to validate the max-cardinality setup, assert every configured path.

Suggested tightening
-			// Verify at least some of the stores are present
-			o.Expect(output).To(o.ContainSubstring("/var/lib/layer-store-0"))
-			o.Expect(output).To(o.ContainSubstring("/var/lib/layer-store-4"))
-			o.Expect(output).To(o.ContainSubstring("/var/lib/image-store-0"))
-			o.Expect(output).To(o.ContainSubstring("/var/lib/image-store-9"))
-			o.Expect(output).To(o.ContainSubstring("/var/lib/artifact-store-0"))
-			o.Expect(output).To(o.ContainSubstring("/var/lib/artifact-store-9"))
+			for _, dir := range allDirs {
+				o.Expect(output).To(o.ContainSubstring(dir),
+					"storage.conf should contain %s on node %s", dir, node.Name)
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 343 - 354, The
test only checks the first and last store paths; update the verification in
test/extended/node/additional_stores_combined.go to assert every configured
store path instead of just endpoints: when iterating over pureWorkers and using
ExecOnNodeWithChroot to read storage.conf (variable output), build loops that
check all expected indices for each store type (e.g., "/var/lib/layer-store-%d",
"/var/lib/image-store-%d", "/var/lib/artifact-store-%d") up to the configured
max counts (use the test's max constants or derive from the config used to
provision stores) and call o.Expect(output).To(o.ContainSubstring(...)) for each
generated path so the test validates every configured path rather than only the
first and last.

@BhargaviGudi force-pushed the additionalArtifactsAutomation branch from d98f6f9 to 59b9662 on May 12, 2026 09:58
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_image_stores.go`:
- Around line 309-312: This persists real ContainerRuntimeConfig objects; change
the Create call to use the API dry-run create by passing
metav1.CreateOptions{DryRun: []string{"All"}} to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create so the
admission is exercised but nothing is persisted, and remove or skip the deferred
cleanupContainerRuntimeConfig call for this acceptance-only validation (do the
same replacement for the other occurrence around lines 351-353 that uses ctrcfg
and metav1.CreateOptions{}).

In `@test/extended/node/additional_layer_stores.go`:
- Around line 535-548: Remove the brittle fixed wall-clock speedup assertion:
delete the o.Expect(...) that checks speedup >= 1.5 (the block referencing
speedup, pod1Duration and pod2Duration) so the test no longer fails on
cluster/regional latency; keep the performance Logf outputs
(speedup/pod1Duration/pod2Duration) and rely on the existing snapshot-count
layer-reuse checks instead of enforcing a numeric 1.5x threshold.
- Around line 70-73: This test is creating a real ContainerRuntimeConfig
unnecessarily; change the call to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to use a
dry-run Create (pass metav1.CreateOptions{DryRun: []string{"All"}}) so the API
validates and echoes the spec without persisting or triggering MCO rollouts, and
remove the associated defer cleanupContainerRuntimeConfig call since no real
resource is created.
- Around line 1008-1017: The test currently calls ExecOnNodeWithChroot(...,
"systemctl", "is-active", "stargz-store") and only asserts the output is not
"active", which hides failures when the command errors or returns empty; update
the check to explicitly assert the ExecOnNodeWithChroot call returned no error
and that strings.TrimSpace(statusOutput) equals an expected inactive state
(e.g., "inactive" or "failed") rather than merely Not(Equal("active")), and
ensure the deferred restart actually validates its ExecOnNodeWithChroot result
(capture its output/error and fail the test or log an error if starting
stargz-store fails) so failures to restart are not silently dropped; reference
ExecOnNodeWithChroot, statusOutput, err, testNode.Name and the deferred restart
block when making the changes.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 721-722: Replace the hardcoded user-owned image
"quay.io/bgudi/..." with a stable, publicly-hosted or internally-provisioned
test image and update any other occurrences (e.g., the occurrences referenced by
testImage at lines using framework.Logf, imageStorePath and testNode.Name);
locate the testImage variable and change it to a reliable image tag (for example
a maintained public test image or a fixture from your test registry), or use the
project’s helper/factory to provision a local test image, and ensure the same
replacement is applied at the other occurrence noted (around the block that logs
Pre-populating image and the second occurrence referenced by lines 801-802).
- Around line 849-850: The cleanup call deletes pod2 from the "default"
namespace but pod2 was created in testNamespace; update the deletePodAndWait
call that references pod2Name so it passes testNamespace instead of "default"
(locate the invocation of deletePodAndWait that uses pod2Name near the log
"Layer store (stargz) lazy pulling verified successfully"); ensure you
consistently use testNamespace wherever pod2Name is deleted to properly remove
the created pod.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: f233fc28-eeed-4e5f-a698-5b945cd4dc09

📥 Commits

Reviewing files that changed from the base of the PR and between d98f6f9 and 59b9662.

📒 Files selected for processing (7)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
  • test/extended/testdata/bindata.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/testdata/bindata.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/extended/node/node_utils.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/stargz_store_setup.go

Comment on lines +309 to +312
created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use API dry-run for these acceptance-only validation cases.

TC8/TC9 only assert admission success and the returned spec. Persisting real ContainerRuntimeConfig objects in the parallel “API Validation” suite can kick off worker MCP reconciliation and interfere with unrelated tests.

Suggested change
-		created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+		createOpts := metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}
+		created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, createOpts)
 		o.Expect(err).NotTo(o.HaveOccurred())
-		defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)

Also applies to: 351-353

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 309 - 312, This
persists real ContainerRuntimeConfig objects; change the Create call to use the
API dry-run create by passing metav1.CreateOptions{DryRun: []string{"All"}} to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create so the
admission is exercised but nothing is persisted, and remove or skip the deferred
cleanupContainerRuntimeConfig call for this acceptance-only validation (do the
same replacement for the other occurrence around lines 351-353 that uses ctrcfg
and metav1.CreateOptions{}).

Comment on lines +70 to +73
created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use API dry-run in this validation-only success case.

TC1 only verifies admission success and the echoed spec. Creating a real ContainerRuntimeConfig in the non-disruptive suite can still trigger an MCO rollout for no test benefit.

Suggested change
-		created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+		createOpts := metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}
+		created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, createOpts)
 		o.Expect(err).NotTo(o.HaveOccurred())
-		defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 70 - 73, This
test is creating a real ContainerRuntimeConfig unnecessarily; change the call to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to use a
dry-run Create (pass metav1.CreateOptions{DryRun: []string{"All"}}) so the API
validates and echoes the spec without persisting or triggering MCO rollouts, and
remove the associated defer cleanupContainerRuntimeConfig call since no real
resource is created.

Comment on lines +535 to +548
// Verify performance improvement with layer sharing
g.By("Verifying performance improvement with layer sharing")
speedup := float64(pod1Duration) / float64(pod2Duration)
framework.Logf("Performance comparison:")
framework.Logf(" - First pod (initial pull): %v", pod1Duration)
framework.Logf(" - Second pod (layer sharing): %v", pod2Duration)
framework.Logf(" - Performance improvement: %.2fx faster with layer sharing", speedup)

// For 6GB eStargz image, second pod should be significantly faster due to layer sharing
// Verify at least 1.5x faster (relaxed threshold since lazy pull is already fast)
o.Expect(speedup).To(o.BeNumerically(">=", 1.5),
"Second pod using shared layers should be significantly faster (at least 1.5x). "+
"First pod: %v, Second pod: %v, Speedup: %.2fx",
pod1Duration, pod2Duration, speedup)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop the fixed wall-clock speedup threshold.

The snapshot-count checks already validate layer reuse. Requiring >= 1.5x startup improvement makes this spec sensitive to cluster load and registry latency, which is likely to create CI flakes even when the feature works.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 535 - 548, Remove
the brittle fixed wall-clock speedup assertion: delete the o.Expect(...) that
checks speedup >= 1.5 (the block referencing speedup, pod1Duration and
pod2Duration) so the test no longer fails on cluster/regional latency; keep the
performance Logf outputs (speedup/pod1Duration/pod2Duration) and rely on the
existing snapshot-count layer-reuse checks instead of enforcing a numeric 1.5x
threshold.

Comment on lines +1008 to +1017
defer func() {
// Restart stargz-store at cleanup
framework.Logf("Restarting stargz-store service")
ExecOnNodeWithChroot(oc, testNode.Name, "systemctl", "start", "stargz-store")
}()

// Verify service is stopped
statusOutput, err := ExecOnNodeWithChroot(oc, testNode.Name, "systemctl", "is-active", "stargz-store")
framework.Logf("stargz-store status after stop: %s", strings.TrimSpace(statusOutput))
o.Expect(strings.TrimSpace(statusOutput)).NotTo(o.Equal("active"))

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Check the restart and status command results after stopping stargz-store.

If is-active fails here, statusOutput can be empty and still satisfy Not(Equal("active")), so the fallback assertion becomes meaningless. The deferred restart also drops failures that would leave later serial specs running with stargz-store still stopped.

Suggested change
 		defer func() {
 			// Restart stargz-store at cleanup
 			framework.Logf("Restarting stargz-store service")
-			ExecOnNodeWithChroot(oc, testNode.Name, "systemctl", "start", "stargz-store")
+			if output, err := ExecOnNodeWithChroot(oc, testNode.Name, "systemctl", "start", "stargz-store"); err != nil {
+				framework.Logf("failed to restart stargz-store: %v, output: %s", err, output)
+			}
 		}()
 
 		// Verify service is stopped
 		statusOutput, err := ExecOnNodeWithChroot(oc, testNode.Name, "systemctl", "is-active", "stargz-store")
+		o.Expect(err).NotTo(o.HaveOccurred())
 		framework.Logf("stargz-store status after stop: %s", strings.TrimSpace(statusOutput))
 		o.Expect(strings.TrimSpace(statusOutput)).NotTo(o.Equal("active"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 1008 - 1017, The
test currently calls ExecOnNodeWithChroot(..., "systemctl", "is-active",
"stargz-store") and only asserts the output is not "active", which hides
failures when the command errors or returns empty; update the check to
explicitly assert the ExecOnNodeWithChroot call returned no error and that
strings.TrimSpace(statusOutput) equals an expected inactive state (e.g.,
"inactive" or "failed") rather than merely Not(Equal("active")), and ensure the
deferred restart actually validates its ExecOnNodeWithChroot result (capture its
output/error and fail the test or log an error if starting stargz-store fails)
so failures to restart are not silently dropped; reference ExecOnNodeWithChroot,
statusOutput, err, testNode.Name and the deferred restart block when making the
changes.

Comment on lines +721 to +722
testImage := "quay.io/bgudi/test-image-6gb:v1.0"
framework.Logf("Pre-populating image %s to %s on node %s", testImage, imageStorePath, testNode.Name)
@coderabbitai coderabbitai Bot May 12, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid user-owned quay.io/bgudi/... test images here.

These tests now depend on personal external repos for core coverage. That is brittle for CI and for future maintenance; a repo/tag change breaks the suite for reasons unrelated to combined-store functionality.

Also applies to: 801-802

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 721 - 722,
Replace the hardcoded user-owned image "quay.io/bgudi/..." with a stable,
publicly-hosted or internally-provisioned test image and update any other
occurrences (e.g., the occurrences referenced by testImage at lines using
framework.Logf, imageStorePath and testNode.Name); locate the testImage variable
and change it to a reliable image tag (for example a maintained public test
image or a fixture from your test registry), or use the project’s helper/factory
to provision a local test image, and ensure the same replacement is applied at
the other occurrence noted (around the block that logs Pre-populating image and
the second occurrence referenced by lines 801-802).

Contributor Author

quay.io/bgudi/test-small:estargz acceptable as test image


@BhargaviGudi, understood — the image is smaller and eStargz-compatible, which is good. However, the core concern remains: quay.io/bgudi/test-small:estargz is still hosted under a personal account (bgudi). If that account's visibility changes, the repo is deleted, or the tag is overwritten, CI will break for reasons entirely unrelated to the feature under test, and no one outside your team can remediate it.

The recommended approach for OpenShift E2E tests is to use an image from:

  1. A team-/org-controlled quay.io org (e.g., quay.io/openshift-release-dev/..., quay.io/redhat-user-workloads/..., or a dedicated CI org), or
  2. An in-cluster image built by the test setup itself.

That said, I recognize this is your call. If the team has confirmed the image will be maintained and publicly accessible long-term, I'll defer to that decision.

Comment on lines +849 to +850
deletePodAndWait(ctx, oc, "default", pod2Name)
framework.Logf("Layer store (stargz) lazy pulling verified successfully")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Delete the second pod from testNamespace, not default.

pod2 is created in testNamespace, so this cleanup call misses the actual pod. That leaves the real pod behind and makes the later state of the node/namespace unreliable.

Suggested change
-		deletePodAndWait(ctx, oc, "default", pod2Name)
+		deletePodAndWait(ctx, oc, testNamespace, pod2Name)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 849 - 850, The
cleanup call deletes pod2 from the "default" namespace but pod2 was created in
testNamespace; update the deletePodAndWait call that references pod2Name so it
passes testNamespace instead of "default" (locate the invocation of
deletePodAndWait that uses pod2Name near the log "Layer store (stargz) lazy
pulling verified successfully"); ensure you consistently use testNamespace
wherever pod2Name is deleted to properly remove the created pod.

@BhargaviGudi BhargaviGudi marked this pull request as ready for review May 12, 2026 11:57
@BhargaviGudi BhargaviGudi changed the title WIP OCPNODE-4055: Add comprehensive test suite for Additional Storage Support feature OCPNODE-4055: Add comprehensive test suite for Additional Storage Support feature May 12, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026
@openshift-ci openshift-ci Bot requested review from p0lyn0mial and rphillips May 12, 2026 11:58
@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from 59b9662 to 9e4a7dd Compare May 12, 2026 12:06
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (4)
test/extended/node/additional_image_stores.go (4)

308-310: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use DryRun: ["All"] for acceptance-only TC8/TC9.

These specs only assert admission success + returned spec, but they persist real ContainerRuntimeConfig objects keyed to the worker MCP selector. Even with the deferred cleanup, the create triggers MCO reconciliation on the worker pool, which can race with other parallel API-validation tests and cause spurious worker rollouts in shared CI clusters. A dry-run create exercises the same validation path without persisting anything.

♻️ Proposed change (apply to both TC8 line 308 and TC9 line 350)
-		created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+		created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(
+			ctx, ctrcfg, metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}},
+		)
 		o.Expect(err).NotTo(o.HaveOccurred())
-		defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)

Also applies to: 350-352

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 308 - 310, The
test is persisting real ContainerRuntimeConfig objects; change the create to a
dry-run so it only exercises admission/validation: when calling
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}), pass metav1.CreateOptions{DryRun: []string{"All"}}
instead (do this for the create at ctrcfg usage in TC8 and the analogous create
in TC9); you can leave the deferred cleanup call (cleanupContainerRuntimeConfig)
but it will be harmless for dry-run creates.

595-601: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Strict 2x wall-clock assertion is still in place and will be noisy in CI.

The event-source checks above already prove the image came from the additional store vs. the registry, which is the actual functional claim. Tying the pass/fail of TC10 to podStartupTime < pod2StartupTime/2 couples the test to registry latency, network conditions, and cluster load — runs where the fallback pull happens to be fast (cached upstream, warm proxy, etc.) will fail despite the feature working. Either relax to o.BeNumerically("<", pod2StartupTime) (still proves additional store is faster) or drop the numeric assertion entirely and keep the speedup as a logged metric.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 595 - 601, The
strict 2x wall-clock assertion in the o.Expect call using podStartupTime and
pod2StartupTime is too brittle; update the expectation to only require
podStartupTime be less than pod2StartupTime (replace BeNumerically("<",
pod2StartupTime/2) with BeNumerically("<", pod2StartupTime)) or remove the
numeric assertion entirely and keep framework.Logf("Performance improvement
verified: Prepopulated image is %.2fx faster", speedup) as a metric; adjust the
o.Expect message accordingly and keep the existing event-source checks that
validate the image origin.

507-525: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Event listing is still a single snapshot — poll until the Pulled event appears or a timeout elapses.

Kubelet event recording is asynchronous; waitForPodRunning returning does not guarantee the Pulled event has been written to the API. Both the "already present on machine" check (Lines 507–525) and the "Successfully pulled" check (Lines 564–581) read events exactly once, so the assertions on Lines 529 and 583 can be flaky in CI even when the feature works correctly. Wrap each event scan in a wait.PollImmediate/wait.PollUntilContextTimeout loop and only fail after the timeout.

Also applies to: 564-581

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 507 - 525, The
test reads pod events only once which can be flaky because kubelet writes events
asynchronously; update the two event-scanning blocks that inspect Events for
testPod (the "already present on machine" Pulled check and the later
"Successfully pulled" Pulled check) to poll until the desired event appears or a
timeout elapses instead of a single listing. Wrap the existing loop over
oc.AdminKubeClient().CoreV1().Events(...).List(...) inside a wait.PollImmediate
or wait.PollUntilContextTimeout call (use the test context), re-listing events
each iteration and returning success when strings.Contains(event.Message,
"...already present on machine...") or the "Successfully pulled" message is
seen, and only fail the test after the poll times out; keep using testPod.Name
and the same FieldSelector and preserve logging behavior.

540-549: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Still no wait between Delete(testPod) and podman rmi — Phase 5 can race the terminating pod.

Delete on Line 541 is asynchronous. The pod (and its container holding the image open) may still be terminating when the podman --root … rmi runs on Line 547, which makes the rmi nondeterministic ("image is in use") and the subsequent fallback assertion flaky. Block on full pod removal before touching node-local storage.

🛠️ Proposed fix
 		g.By("Deleting first pod")
 		err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
 		o.Expect(err).NotTo(o.HaveOccurred())
+		err = waitForPodDeleted(ctx, oc, testPod.Name, 2*time.Minute)
+		o.Expect(err).NotTo(o.HaveOccurred())
 		framework.Logf("First pod deleted")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 540 - 549, The
Delete call for testPod is asynchronous so the rmi can race with a
still-terminating pod; update the flow to wait for the pod to be fully removed
before running ExecOnNodeWithChroot(removeCmd). After calling
oc.AdminKubeClient().CoreV1().Pods(...).Delete(...), poll or use a wait helper
to confirm the pod is gone (e.g., loop using
oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get for testPod.Name and
break when it returns NotFound or when pod phase is not Terminating) and only
then construct removeCmd and call ExecOnNodeWithChroot with testNode and
additionalImageStorePath so the pod’s container no longer holds the image open.
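
One possible shape for that helper, shown as a sketch (the name and signature follow the proposed fix above; the polling details and the apierrors import from k8s.io/apimachinery/pkg/api/errors are assumptions):

// waitForPodDeleted blocks until the pod is gone (NotFound), so node-local
// cleanup such as podman rmi cannot race a still-terminating container.
func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return true, nil
			}
			// Still present or a transient API error: keep polling.
			return false, nil
		})
}
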
🧹 Nitpick comments (3)
test/extended/testdata/bindata.go (2)

50327-50409: ⚡ Quick win

set -e is silently defeated by trailing || true / || echo.

The script opens with set -e but most failure-significant commands are wrapped with || echo "…skipped" or || true (ostree unlock, version checks, final systemctl status). The net effect is that real failures (e.g. ostree admin unlock fails because the node isn't ostree-based, nydus-store --version fails because the build silently produced no binary) get logged as "skipped" and the DaemonSet still enters sleep infinity, reporting success. The E2E will then proceed and fail much later with a confusing error.

Drop the || echo on commands that must succeed (version checks of just-installed binaries, the final systemctl status nydus-store), or replace them with explicit if … then fail blocks that exit non-zero so the pod CrashLoops and the failure is attributable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/testdata/bindata.go` around lines 50327 - 50409, The script
enables errexit with set -e but then masks real failures by appending || echo/||
true to critical commands; remove those fallbacks or replace them with explicit
failure handling so errors cause the pod to exit. Specifically, for ostree admin
unlock, the three nsenter version checks (/usr/local/bin/nydusd --version,
/usr/local/bin/nydus-image --version, /usr/local/bin/nydus-store --version) and
the final nsenter systemctl status nydus-store, either drop the trailing "||
echo …" / "|| true" or wrap each in an if-check that logs the error and exits
non-zero (so set -e triggers); ensure the build/install steps (git clone, make,
cp bin/nydus-store) also propagate failures instead of being silenced.

50319-50319: ⚡ Quick win

Builder image ubi9/go-toolset:latest is a floating tag.

Using :latest for a toolchain image means the Go version, base layers, and CVE surface can drift between test runs. Please pin to a specific tag (e.g. registry.access.redhat.com/ubi9/go-toolset:1.22) for reproducibility.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/testdata/bindata.go` at line 50319, Replace the floating
toolchain image tag "registry.access.redhat.com/ubi9/go-toolset:latest" with a
pinned, explicit version (for example
"registry.access.redhat.com/ubi9/go-toolset:1.22") to ensure reproducible
builds; locate the literal image string in test/extended/testdata/bindata.go
(the line containing image: registry.access.redhat.com/ubi9/go-toolset:latest)
and update it to the chosen fixed tag, then run the tests to verify there are no
compatibility regressions.
test/extended/node/additional_image_stores.go (1)

25-25: ⚡ Quick win

Use the maxImageStoresCount constant in TC2 instead of hardcoding 11.

The constant is declared but never referenced; TC2 hardcodes 11 and the literal "max of 10" in messages. If the API limit ever changes, this test will drift silently.

♻️ Proposed refactor
-		g.By("Creating ContainerRuntimeConfig with 11 image stores (exceeds max of 10)")
-		imageStores := make([]machineconfigv1.AdditionalImageStore, 11)
-		for i := 0; i < 11; i++ {
+		g.By(fmt.Sprintf("Creating ContainerRuntimeConfig with %d image stores (exceeds max of %d)", maxImageStoresCount+1, maxImageStoresCount))
+		imageStores := make([]machineconfigv1.AdditionalImageStore, maxImageStoresCount+1)
+		for i := 0; i < maxImageStoresCount+1; i++ {
 			imageStores[i] = machineconfigv1.AdditionalImageStore{Path: machineconfigv1.StorePath(fmt.Sprintf("/mnt/imagestore-%d", i))}
 		}
@@
-		o.Expect(err).To(o.HaveOccurred(), "Expected API to reject exceeding maximum of 10 image stores")
+		o.Expect(err).To(o.HaveOccurred(), "Expected API to reject exceeding maximum of %d image stores", maxImageStoresCount)

Also applies to: 86-116

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` at line 25, The test TC2
currently hardcodes the value 11 and literal "max of 10" instead of using the
declared constant maxImageStoresCount; update TC2 (and any other occurrences in
the same test block between the flagged lines) to compute the over-limit value
as maxImageStoresCount+1, replace the hardcoded 11 with that expression, and
update any assertion/error messages to reference maxImageStoresCount (e.g., "max
of %d") so the test uses the constant consistently; ensure all occurrences in
TC2 and the surrounding assertions are changed so the test will adapt if
maxImageStoresCount changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_image_stores.go`:
- Line 24: Replace the hard-coded personal Quay image reference in
testImageDefault with a stable, org-controlled test image and avoid relying on a
personal namespace; update the value of testImageDefault in
additional_image_stores.go to point to a project-owned registry (e.g.,
quay.io/openshifttest/... or an openshift/origin test-image) and, if feasible,
choose a smaller image that still exercises the 6GB-related behavior to reduce
flakiness and CI cost.
- Line 29: The Describe block starting with g.Describe whose description string
includes "[Suite:openshift/disruptive-longrunning] Additional Image Stores API
Validation" is inconsistent with the comment stating these are "Non-disruptive
API validation tests - can run in parallel"; update the Ginkgo tags to match
intent: either remove or replace "[Suite:openshift/disruptive-longrunning]" with
the appropriate non-disruptive suite tag (or simply omit the disruptive suite
tag) if tests are parallel-safe, or add a "[Serial]" (and keep disruptive
tagging) if they are actually disruptive/need serialization; locate the Describe
invocation (the var _ = g.Describe(...) line) and modify the tag substring in
the description accordingly so description and comment match.

In `@test/extended/testdata/bindata.go`:
- Around line 50212-50460: The bindata.go file contains a hand-edited YAML asset
(_testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYaml) but it should be
generated from the source YAML; fix by editing the original source file
test/extended/testdata/node/additional-storage/nydus-store-setup.yaml (not
bindata.go) to apply your substantive changes, then regenerate bindata.go using
the project's go-bindata generation step so the functions
testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYamlBytes and
testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYaml remain in sync with
the source.
- Around line 50346-50375: The install script uses an unverified remote tarball
and an unpinned git clone (NYDUS_VERSION variable, the curl -L
...nydus-static-${NYDUS_VERSION}-linux-amd64.tgz line, and git clone
https://github.com/containers/nydus-storage-plugin.git), which makes the test
non-hermetic and a supply-chain risk; update the script to (1) pin NYDUS_VERSION
to a specific release and replace the git clone with cloning a specific tag or
commit hash (or use a mirrored repository), (2) download a published checksum
(sha256) or signature for the nydus tarball and verify it before extracting
(fail the job if verification fails), and (3) prefer fetching both the tarball
and the nydus-store binary from a trusted internal mirror or CI artifact store
and fall back to the external source only with explicit verification; apply
these changes around the curl/tar/cp/chmod steps and the git clone/make/cp
bin/nydus-store steps so binaries are only installed after checksum/signature
and pin verification.
- Around line 50311-50316: The DaemonSet uses nodeSelector
node-role.kubernetes.io/worker combined with a wildcard toleration (operator:
Exists) which is unsafe for SNO/TNF/TNA topologies and may schedule onto arbiter
nodes; either gate this test to run only on multi-node topologies or restrict
scheduling: replace the broad nodeSelector/toleration approach by selecting
control-plane-and-worker nodes (or add an explicit anti-selector for
node-role.kubernetes.io/arbiter with DoesNotExist) and narrow the tolerations to
only the specific taints you intend to tolerate so that the privileged pod
(hostPID/hostNetwork, ostree unlock, crio restart) cannot land on arbiter or on
zero-worker SNO/TNF setups.
- Around line 50342-50345: The installer mutates nodes persistently (runs
"ostree admin unlock --hotfix", drops binaries into /usr/local/bin, installs a
systemd unit and runs "systemctl restart crio") with no rollback; update the
DaemonSet spec used by the installer to set
updateStrategy.rollingUpdate.maxUnavailable: 1 to serialize restarts, add a
matching uninstall path (either a separate DaemonSet or a pod preStop handler)
that stops/disables the installed systemd unit, removes the dropped binaries,
and attempts to re-lock/restore ostree state (reverse "ostree admin unlock
--hotfix") so nodes are returned to pre-install state, and update the header
docstring/README to document this asset is ephemeral for CI and list the
explicit cleanup steps. Ensure the code locations referencing the ostree unlock
command, the systemd unit installation, the binary install path (/usr/local/bin)
and the crio restart are updated to include teardown logic and the DaemonSet
updateStrategy change.

---

Duplicate comments:
In `@test/extended/node/additional_image_stores.go`:
- Around line 308-310: The test is persisting real ContainerRuntimeConfig
objects; change the create to a dry-run so it only exercises
admission/validation: when calling
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}), pass metav1.CreateOptions{DryRun: []string{"All"}}
instead (do this for the create at ctrcfg usage in TC8 and the analogous create
in TC9); you can leave the deferred cleanup call (cleanupContainerRuntimeConfig)
but it will be harmless for dry-run creates.
- Around line 595-601: The strict 2x wall-clock assertion in the o.Expect call
using podStartupTime and pod2StartupTime is too brittle; update the expectation
to only require podStartupTime be less than pod2StartupTime (replace
BeNumerically("<", pod2StartupTime/2) with BeNumerically("<", pod2StartupTime))
or remove the numeric assertion entirely and keep framework.Logf("Performance
improvement verified: Prepopulated image is %.2fx faster", speedup) as a metric;
adjust the o.Expect message accordingly and keep the existing event-source
checks that validate the image origin.
- Around line 507-525: The test reads pod events only once which can be flaky
because kubelet writes events asynchronously; update the two event-scanning
blocks that inspect Events for testPod (the "already present on machine" Pulled
check and the later "Successfully pulled" Pulled check) to poll until the
desired event appears or a timeout elapses instead of a single listing. Wrap the
existing loop over oc.AdminKubeClient().CoreV1().Events(...).List(...) inside a
wait.PollImmediate or wait.PollUntilContextTimeout call (use the test context),
re-listing events each iteration and returning success when
strings.Contains(event.Message, "...already present on machine...") or the
"Successfully pulled" message is seen, and only fail the test after the poll
times out; keep using testPod.Name and the same FieldSelector and preserve
logging behavior.
- Around line 540-549: The Delete call for testPod is asynchronous so the rmi
can race with a still-terminating pod; update the flow to wait for the pod to be
fully removed before running ExecOnNodeWithChroot(removeCmd). After calling
oc.AdminKubeClient().CoreV1().Pods(...).Delete(...), poll or use a wait helper
to confirm the pod is gone (e.g., loop using
oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get for testPod.Name and
break when it returns NotFound or when pod phase is not Terminating) and only
then construct removeCmd and call ExecOnNodeWithChroot with testNode and
additionalImageStorePath so the pod’s container no longer holds the image open.

---

Nitpick comments:
In `@test/extended/node/additional_image_stores.go`:
- Line 25: The test TC2 currently hardcodes the value 11 and literal "max of 10"
instead of using the declared constant maxImageStoresCount; update TC2 (and any
other occurrences in the same test block between the flagged lines) to compute
the over-limit value as maxImageStoresCount+1, replace the hardcoded 11 with
that expression, and update any assertion/error messages to reference
maxImageStoresCount (e.g., "max of %d") so the test uses the constant
consistently; ensure all occurrences in TC2 and the surrounding assertions are
changed so the test will adapt if maxImageStoresCount changes.

In `@test/extended/testdata/bindata.go`:
- Around line 50327-50409: The script enables errexit with set -e but then masks
real failures by appending || echo/|| true to critical commands; remove those
fallbacks or replace them with explicit failure handling so errors cause the pod
to exit. Specifically, for ostree admin unlock, the three nsenter version checks
(/usr/local/bin/nydusd --version, /usr/local/bin/nydus-image --version,
/usr/local/bin/nydus-store --version) and the final nsenter systemctl status
nydus-store, either drop the trailing "|| echo …" / "|| true" or wrap each in an
if-check that logs the error and exits non-zero (so set -e triggers); ensure the
build/install steps (git clone, make, cp bin/nydus-store) also propagate
failures instead of being silenced.
- Line 50319: Replace the floating toolchain image tag
"registry.access.redhat.com/ubi9/go-toolset:latest" with a pinned, explicit
version (for example "registry.access.redhat.com/ubi9/go-toolset:1.22") to
ensure reproducible builds; locate the literal image string in
test/extended/testdata/bindata.go (the line containing image:
registry.access.redhat.com/ubi9/go-toolset:latest) and update it to the chosen
fixed tag, then run the tests to verify there are no compatibility regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: aeba5da5-231a-4776-9439-5a82eb9e2139

📥 Commits

Reviewing files that changed from the base of the PR and between 59b9662 and 9e4a7dd.

📒 Files selected for processing (7)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
  • test/extended/testdata/bindata.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • test/extended/node/stargz_store_setup.go
  • test/extended/node/node_utils.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_stores_combined.go

const (
	additionalImageStorePath     = "/var/lib/additional-images"
	additionalImageStoreTestName = "additional-imagestore-test"
	testImageDefault             = "quay.io/bgudi/test-image-6gb:v1.0"

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid using a personal Quay registry namespace for upstream test images.

testImageDefault = "quay.io/bgudi/test-image-6gb:v1.0" points to a personal Quay account. Upstream e2e tests should not depend on an individual user's registry — the image can be deleted, rate-limited, made private, or the account removed, which would silently break this entire E2E suite (TC10) in CI. Move the 6GB test image to an OpenShift org-owned location (e.g., quay.io/openshifttest/...) or another stable, project-controlled registry, and consider hosting it under the existing openshift/origin test-image infra.

Also worth confirming a 6GB image is truly required — anything smaller that still demonstrates the speedup would make the test cheaper and more reliable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` at line 24, Replace the
hard-coded personal Quay image reference in testImageDefault with a stable,
org-controlled test image and avoid relying on a personal namespace; update the
value of testImageDefault in additional_image_stores.go to point to a
project-owned registry (e.g., quay.io/openshifttest/... or an openshift/origin
test-image) and, if feasible, choose a smaller image that still exercises the
6GB-related behavior to reduce flakiness and CI cost.

)

// Non-disruptive API validation tests - can run in parallel
var _ = g.Describe("[Jira:Node/CRI-O][sig-node][Feature:AdditionalStorageSupport][Suite:openshift/disruptive-longrunning] Additional Image Stores API Validation", func() {

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Describe-block label says "Suite:openshift/disruptive-longrunning" but the comment claims this suite is non-disruptive.

Line 28 documents this block as "Non-disruptive API validation tests - can run in parallel", yet the Ginkgo description tags it with [Suite:openshift/disruptive-longrunning] (same suite as the truly disruptive describe at Line 364). If these specs really are non-disruptive and parallel-safe, they should live in a non-disruptive suite (and likely not advertise disruptive-longrunning); if they aren't, the comment is misleading and they need a [Serial] tag. Please reconcile.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` at line 29, The Describe block
starting with g.Describe whose description string includes
"[Suite:openshift/disruptive-longrunning] Additional Image Stores API
Validation" is inconsistent with the comment stating these are "Non-disruptive
API validation tests - can run in parallel"; update the Ginkgo tags to match
intent: either remove or replace "[Suite:openshift/disruptive-longrunning]" with
the appropriate non-disruptive suite tag (or simply omit the disruptive suite
tag) if tests are parallel-safe, or add a "[Serial]" (and keep disruptive
tagging) if they are actually disruptive/need serialization; locate the Describe
invocation (the var _ = g.Describe(...) line) and modify the tag substring in
the description accordingly so description and comment match.
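
If the specs really are parallel-safe, the reconciled Describe line could simply drop the suite tag, for example (hypothetical final tag set, keeping the existing feature and Jira tags):

// Non-disruptive API validation tests - can run in parallel
var _ = g.Describe("[Jira:Node/CRI-O][sig-node][Feature:AdditionalStorageSupport] Additional Image Stores API Validation", func() {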

Comment thread on test/extended/testdata/bindata.go (outdated)
Comment on lines +50212 to +50460
var _testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYaml = []byte(`# Nydus-store DaemonSet-based setup for OpenShift
# This will install and configure nydus-store on all worker nodes
#
# Usage:
#   oc apply -f nydus-store-setup.yaml
#   oc adm policy add-scc-to-user privileged -z nydus-store-installer -n nydus-store
#   oc get pods -n nydus-store -w
#   # Wait for all pods to show "Setup complete!" in logs
#   oc logs -n nydus-store -l app=nydus-store-installer --tail=10
#
# Verify:
#   oc debug node/<worker-node> -- chroot /host systemctl status nydus-store
#
# Cleanup:
#   oc delete -f nydus-store-setup.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: nydus-store
  labels:
    app: nydus-store
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
    security.openshift.io/scc.podSecurityLabelSync: "false"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nydus-store-installer
  namespace: nydus-store
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nydus-store-config
  namespace: nydus-store
data:
  nydus-config.json: |
    {
      "device": {
        "backend": {
          "type": "registry",
          "config": {
            "scheme": "https",
            "timeout": 5,
            "connect_timeout": 5,
            "retry_limit": 2
          }
        },
        "cache": {
          "type": "blobcache",
          "config": {
            "work_dir": "/var/lib/nydus/cache"
          }
        }
      },
      "mode": "direct",
      "digest_validate": false,
      "iostats_files": false,
      "enable_xattr": true,
      "fs_prefetch": {
        "enable": true,
        "threads_count": 2
      }
    }

  nydus-store.service: |
    [Unit]
    Description=nydus store plugin for CRI-O
    After=network.target
    Before=crio.service

    [Service]
    Type=simple
    Environment=HOME=/root
    ExecStart=/usr/local/bin/nydus-store --log-to-stdout --log-level info --config-path /etc/nydus-config.json --root /var/lib/nydus-store
    Restart=always
    RestartSec=1

    [Install]
    WantedBy=multi-user.target
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nydus-store-installer
  namespace: nydus-store
spec:
  selector:
    matchLabels:
      app: nydus-store-installer
  template:
    metadata:
      labels:
        app: nydus-store-installer
    spec:
      serviceAccountName: nydus-store-installer
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      hostPID: true
      hostNetwork: true
      tolerations:
      - operator: Exists
      containers:
      - name: installer
        image: registry.access.redhat.com/ubi9/go-toolset:latest
        securityContext:
          privileged: true
          runAsUser: 0
        command:
        - /bin/bash
        - -c
        - |
          set -e

          echo "=== Nydus-store Installer ==="
          echo "Node: $(hostname)"

          # Check if already installed and running
          if nsenter -t 1 -m -u -i -n -p -- systemctl is-active nydus-store &>/dev/null; then
            echo "nydus-store already running on this node"
            nsenter -t 1 -m -u -i -n -p -- systemctl status nydus-store --no-pager || true
            echo "Sleeping to keep pod running..."
            sleep infinity
          fi

          echo "Installing nydus-store..."

          # Unlock ostree for modifications
          echo "Unlocking ostree..."
          nsenter -t 1 -m -u -i -n -p -- ostree admin unlock --hotfix || echo "ostree unlock failed or already unlocked"

          # Download Nydus binaries from dragonflyoss/image-service
          NYDUS_VERSION="v2.4.1"
          echo "Downloading Nydus binaries ${NYDUS_VERSION}..."
          curl -L -o /tmp/nydus.tgz \
            https://github.com/dragonflyoss/image-service/releases/download/${NYDUS_VERSION}/nydus-static-${NYDUS_VERSION}-linux-amd64.tgz

          # Extract binaries
          echo "Extracting Nydus binaries to /usr/local/bin..."
          tar -xzf /tmp/nydus.tgz -C /tmp/
          cp /tmp/nydus-static/nydusd /host/usr/local/bin/
          cp /tmp/nydus-static/nydus-image /host/usr/local/bin/
          chmod +x /host/usr/local/bin/nydusd
          chmod +x /host/usr/local/bin/nydus-image

          # Build nydus-store plugin from source
          echo "Building nydus-store plugin from source..."
          echo "This may take 1-2 minutes..."

          # Clone repository
          git clone https://github.com/containers/nydus-storage-plugin.git /tmp/nydus-plugin
          cd /tmp/nydus-plugin

          # Build
          echo "Running make to build nydus-store..."
          make

          # Install
          echo "Installing nydus-store binary..."
          cp bin/nydus-store /host/usr/local/bin/nydus-store
          chmod +x /host/usr/local/bin/nydus-store

          # Verify binaries
          echo "Verifying binaries..."
          nsenter -t 1 -m -u -i -n -p -- /usr/local/bin/nydusd --version || echo "nydusd version check skipped"
          nsenter -t 1 -m -u -i -n -p -- /usr/local/bin/nydus-image --version || echo "nydus-image version check skipped"
          nsenter -t 1 -m -u -i -n -p -- /usr/local/bin/nydus-store --version || echo "nydus-store version check skipped"

          # Create directories
          echo "Creating directories..."
          mkdir -p /host/etc
          mkdir -p /host/var/lib/nydus-store
          mkdir -p /host/var/lib/nydus/cache

          # Copy config file
          echo "Copying nydus-config.json..."
          cp /config/nydus-config.json /host/etc/nydus-config.json

          # Copy service file
          echo "Copying systemd service..."
          cp /config/nydus-store.service /host/etc/systemd/system/nydus-store.service

          # Reload systemd and enable service
          echo "Enabling nydus-store service..."
          nsenter -t 1 -m -u -i -n -p -- systemctl daemon-reload
          nsenter -t 1 -m -u -i -n -p -- systemctl enable nydus-store
          nsenter -t 1 -m -u -i -n -p -- systemctl start nydus-store

          # Wait for service to be ready
          echo "Waiting for nydus-store to be ready..."
          sleep 5

          # Verify service is running
          echo "Verifying nydus-store service..."
          nsenter -t 1 -m -u -i -n -p -- systemctl status nydus-store --no-pager || true

          # Restart CRI-O to pick up the new layer store
          echo "Restarting CRI-O..."
          nsenter -t 1 -m -u -i -n -p -- systemctl restart crio

          echo "=== Setup complete! ==="
          echo "nydus-store is now running on $(hostname)"
          echo ""
          echo "To use in ContainerRuntimeConfig, set additionalLayerStores path to:"
          echo " /var/lib/nydus-store"
          echo ""

          # Keep pod running
          sleep infinity
        volumeMounts:
        - name: host-root
          mountPath: /host
        - name: config
          mountPath: /config
          readOnly: true
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
      volumes:
      - name: host-root
        hostPath:
          path: /
          type: Directory
      - name: config
        configMap:
          name: nydus-store-config
`)

func testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYamlBytes() ([]byte, error) {
	return _testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYaml, nil
}

func testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYaml() (*asset, error) {
	bytes, err := testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYamlBytes()
	if err != nil {
		return nil, err
	}

	info := bindataFileInfo{name: "test/extended/testdata/node/additional-storage/nydus-store-setup.yaml", size: 0, mode: os.FileMode(0), modTime: time.Unix(0, 0)}
	a := &asset{bytes: bytes, info: info}
	return a, nil
}

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Edit the source YAML, not the generated bindata.

bindata.go is generated from test/extended/testdata/node/additional-storage/nydus-store-setup.yaml. Please make the substantive changes (called out in the comments below) in the source YAML and regenerate this file via the existing go-bindata step so the two stay in sync. Hand-edited bindata diverges on the next regeneration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/testdata/bindata.go` around lines 50212 - 50460, The bindata.go
file contains a hand-edited YAML asset
(_testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYaml) but it should be
generated from the source YAML; fix by editing the original source file
test/extended/testdata/node/additional-storage/nydus-store-setup.yaml (not
bindata.go) to apply your substantive changes, then regenerate bindata.go using
the project's go-bindata generation step so the functions
testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYamlBytes and
testExtendedTestdataNodeAdditionalStorageNydusStoreSetupYaml remain in sync with
the source.

Comment thread on test/extended/testdata/bindata.go (outdated)
Comment on lines +50311 to +50316
      nodeSelector:
        node-role.kubernetes.io/worker: ""
      hostPID: true
      hostNetwork: true
      tolerations:
      - operator: Exists

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Worker nodeSelector + wildcard toleration is unsafe on SNO/TNF/TNA topologies.

nodeSelector: node-role.kubernetes.io/worker: "" combined with tolerations: - operator: Exists has two problems for non-standard control-plane topologies that this test suite is expected to run on:

  • On SNO and Two-Node Fixed, no node carries a dedicated node-role.kubernetes.io/worker label distinct from control-plane; the DaemonSet will produce zero pods and the E2E will silently skip the install step (and the rest of the test will then fail for unrelated reasons).
  • The wildcard operator: Exists tolerates every taint, including node-role.kubernetes.io/arbiter:NoSchedule on TNA. If any node ever carries both worker and the arbiter taint, this privileged installer (hostPID/hostNetwork, ostree unlock, crio restart) will run on the resource-constrained arbiter — which is explicitly disallowed.

Please either gate this test on multi-node topologies, or scope the selector/tolerations explicitly (e.g. select control-plane-and-worker nodes for SNO/TNF and exclude arbiter via node-role.kubernetes.io/arbiter DoesNotExist or a narrow toleration list).

As per coding guidelines: "flag code that assumes dedicated worker nodes exist via node-role.kubernetes.io/worker nodeSelector. On SNO and TNF, all workloads run on control-plane nodes" and "broad or wildcard tolerations will inadvertently schedule to the resource-constrained arbiter".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/testdata/bindata.go` around lines 50311 - 50316, The DaemonSet
uses nodeSelector node-role.kubernetes.io/worker combined with a wildcard
toleration (operator: Exists) which is unsafe for SNO/TNF/TNA topologies and may
schedule onto arbiter nodes; either gate this test to run only on multi-node
topologies or restrict scheduling: replace the broad nodeSelector/toleration
approach by selecting control-plane-and-worker nodes (or add an explicit
anti-selector for node-role.kubernetes.io/arbiter with DoesNotExist) and narrow
the tolerations to only the specific taints you intend to tolerate so that the
privileged pod (hostPID/hostNetwork, ostree unlock, crio restart) cannot land on
arbiter or on zero-worker SNO/TNF setups.

Comment thread on test/extended/testdata/bindata.go (outdated)
Comment on lines +50342 to +50345
# Unlock ostree for modifications
echo "Unlocking ostree..."
nsenter -t 1 -m -u -i -n -p -- ostree admin unlock --hotfix || echo "ostree unlock failed or already unlocked"


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Persistent, destructive node mutation with no cleanup path.

This installer performs ostree admin unlock --hotfix, drops binaries into /usr/local/bin, installs a systemd unit, and runs systemctl restart crio — all of which persist after the DaemonSet is deleted. The header docstring's "Cleanup" section only does oc delete -f, which leaves the node permanently mutated (unlocked ostree, started service, restarted CRI-O on every worker simultaneously).

Two consequences worth addressing before merge:

  • Restarting crio on all workers in parallel (DaemonSet rollout with no maxUnavailable: 1) will momentarily disrupt every workload pod cluster-wide — that's a fairly large blast radius for an E2E install hook.
  • Subsequent test runs on the same cluster will see "already running" and skip install, hiding regressions in the install path itself.

Consider serializing the rollout (updateStrategy.rollingUpdate.maxUnavailable: 1) and adding a paired uninstall DaemonSet (or preStop) that disables the unit and restores ostree, and document that this asset is intended for ephemeral CI clusters only.

Also applies to: 50397-50413

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/testdata/bindata.go` around lines 50342 - 50345, The installer
mutates nodes persistently (runs "ostree admin unlock --hotfix", drops binaries
into /usr/local/bin, installs a systemd unit and runs "systemctl restart crio")
with no rollback; update the DaemonSet spec used by the installer to set
updateStrategy.rollingUpdate.maxUnavailable: 1 to serialize restarts, add a
matching uninstall path (either a separate DaemonSet or a pod preStop handler)
that stops/disables the installed systemd unit, removes the dropped binaries,
and attempts to re-lock/restore ostree state (reverse "ostree admin unlock
--hotfix") so nodes are returned to pre-install state, and update the header
docstring/README to document this asset is ephemeral for CI and list the
explicit cleanup steps. Ensure the code locations referencing the ostree unlock
command, the systemd unit installation, the binary install path (/usr/local/bin)
and the crio restart are updated to include teardown logic and the DaemonSet
updateStrategy change.

Comment thread on test/extended/testdata/bindata.go (outdated)
Comment on lines +50346 to +50375
# Download Nydus binaries from dragonflyoss/image-service
NYDUS_VERSION="v2.4.1"
echo "Downloading Nydus binaries ${NYDUS_VERSION}..."
curl -L -o /tmp/nydus.tgz \
https://github.com/dragonflyoss/image-service/releases/download/${NYDUS_VERSION}/nydus-static-${NYDUS_VERSION}-linux-amd64.tgz

# Extract binaries
echo "Extracting Nydus binaries to /usr/local/bin..."
tar -xzf /tmp/nydus.tgz -C /tmp/
cp /tmp/nydus-static/nydusd /host/usr/local/bin/
cp /tmp/nydus-static/nydus-image /host/usr/local/bin/
chmod +x /host/usr/local/bin/nydusd
chmod +x /host/usr/local/bin/nydus-image

# Build nydus-store plugin from source
echo "Building nydus-store plugin from source..."
echo "This may take 1-2 minutes..."

# Clone repository
git clone https://github.com/containers/nydus-storage-plugin.git /tmp/nydus-plugin
cd /tmp/nydus-plugin

# Build
echo "Running make to build nydus-store..."
make

# Install
echo "Installing nydus-store binary..."
cp bin/nydus-store /host/usr/local/bin/nydus-store
chmod +x /host/usr/local/bin/nydus-store

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Unpinned/unchecked downloads make this test non-hermetic and a supply-chain risk.

Two concerns in this install path:

  1. curl -L … nydus-static-${NYDUS_VERSION}-linux-amd64.tgz is fetched over the internet with no checksum/signature verification and then executed as root with privileged: true. A tampered/redirected response will be installed into /usr/local/bin on every worker.
  2. git clone https://github.com/containers/nydus-storage-plugin.git pulls HEAD of main — there is no pinned tag/commit, so the test result depends on whatever was last pushed upstream and is not reproducible across runs.

For a CI test asset, please pin a specific commit/tag, verify a sha256 of the downloaded tarball, and ideally mirror both artifacts to a trusted registry/repo rather than depending on github.com being reachable from the cluster.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/testdata/bindata.go` around lines 50346 - 50375, The install
script uses an unverified remote tarball and an unpinned git clone
(NYDUS_VERSION variable, the curl -L
...nydus-static-${NYDUS_VERSION}-linux-amd64.tgz line, and git clone
https://github.com/containers/nydus-storage-plugin.git), which makes the test
non-hermetic and a supply-chain risk; update the script to (1) pin NYDUS_VERSION
to a specific release and replace the git clone with cloning a specific tag or
commit hash (or use a mirrored repository), (2) download a published checksum
(sha256) or signature for the nydus tarball and verify it before extracting
(fail the job if verification fails), and (3) prefer fetching both the tarball
and the nydus-store binary from a trusted internal mirror or CI artifact store
and fall back to the external source only with explicit verification; apply
these changes around the curl/tar/cp/chmod steps and the git clone/make/cp
bin/nydus-store steps so binaries are only installed after checksum/signature
and pin verification.

@BhargaviGudi
Contributor Author

@cpmeadors @saschagrunert Could you please help to review the PR? Thanks

@BhargaviGudi
Contributor Author

/pipeline required

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from 9e4a7dd to 570dff8 Compare May 12, 2026 12:47
@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

@BhargaviGudi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name       Commit   Details  Required  Rerun command
ci/prow/verify  9e4a7dd  link     true      /test verify

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@BhargaviGudi
Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview

@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

@BhargaviGudi: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6baf4b20-4e01-11f1-94c0-65bfcdb810dd-0

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from 570dff8 to 6f9f7e0 Compare May 12, 2026 12:54
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
test/extended/node/additional_image_stores.go (2)

26-26: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Move this test image to an org-controlled registry.

Upstream e2e should not depend on a personal Quay namespace. If that repo is deleted, made private, or rate-limited, this entire suite starts failing for reasons unrelated to the feature under test.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` at line 26, The test image
constant testImageDefault in additional_image_stores.go points at a personal
Quay repo; replace it with an organization-controlled, long-lived registry image
(or make it configurable via an env var/flag) so e2e doesn't depend on a
personal namespace; update the value of testImageDefault (or the code that reads
it) to point to the org registry image and adjust any test fixtures or
documentation that reference the old quay.io/bgudi path.

79-80: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use API dry-run for the whole validation describe.

These specs only verify admission behavior, but they currently create real cluster-scoped ContainerRuntimeConfig objects. That makes the "non-disruptive API validation" suite capable of kicking off worker MCP reconciliation, and if validation regresses the negative cases can leave real resources behind before the test fails.

Proposed change
- _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+ createOpts := metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}
+ _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, createOpts)
- created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+ createOpts := metav1.CreateOptions{DryRun: []string{metav1.DryRunAll}}
+ created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, createOpts)
  o.Expect(err).NotTo(o.HaveOccurred())
- defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)

Also applies to: 114-117, 145-147, 174-177, 217-219, 250-253, 280-283, 310-312, 352-354

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 79 - 80, Replace
real API creates with dry-run creates so the tests only exercise admission
without persisting cluster-scoped ContainerRuntimeConfig resources: change calls
to mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx,
ctrcfg, metav1.CreateOptions{}) to include the dry-run option
(metav1.CreateOptions{DryRun: []string{"All"}}) for each occurrence in the file
(the Create invocations shown and the other similar groups flagged in the
review). Keep the same assertions and flow — only adjust the Create call's
metav1.CreateOptions to use DryRun so validation runs but no real resource is
created.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@test/extended/node/additional_image_stores.go`:
- Line 26: The test image constant testImageDefault in
additional_image_stores.go points at a personal Quay repo; replace it with an
organization-controlled, long-lived registry image (or make it configurable via
an env var/flag) so e2e doesn't depend on a personal namespace; update the value
of testImageDefault (or the code that reads it) to point to the org registry
image and adjust any test fixtures or documentation that reference the old
quay.io/bgudi path.
- Around line 79-80: Replace real API creates with dry-run creates so the tests
only exercise admission without persisting cluster-scoped ContainerRuntimeConfig
resources: change calls to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}) to include the dry-run option
(metav1.CreateOptions{DryRun: []string{"All"}}) for each occurrence in the file
(the Create invocations shown and the other similar groups flagged in the
review). Keep the same assertions and flow — only adjust the Create call's
metav1.CreateOptions to use DryRun so validation runs but no real resource is
created.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ca67ce0b-4e50-47e7-9e16-c06683156fb8

📥 Commits

Reviewing files that changed from the base of the PR and between 9e4a7dd and 570dff8.

📒 Files selected for processing (7)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
  • test/extended/testdata/bindata.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/testdata/bindata.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/stargz_store_setup.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/node_utils.go

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from 6f9f7e0 to e279b74 Compare May 12, 2026 13:14
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
test/extended/node/additional_layer_stores.go (1)

1023-1024: 💤 Low value

Duplicate TC18 placeholder comment.

The same // TC18: Default resolution unchanged (P1 Regression) line is repeated twice. Either fold these into one comment or replace one with the missing TC17/TC18 distinction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 1023 - 1024,
There are two identical placeholder comments "// TC18: Default resolution
unchanged (P1 Regression)"; remove the duplicate or correct the second one to
the intended test identifier (e.g., change to "// TC17: ..." if that was meant)
so only one TC18 comment remains and test identifiers are unique; locate the
duplicate by searching for the exact comment string in
additional_layer_stores.go and update the second occurrence accordingly.
test/extended/node/additional_image_stores.go (1)

423-423: ⚡ Quick win

Deferred Delete calls drop the returned error and don't wait for pod termination.

The two defer oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(...) at Lines 423 and 500 swallow errors and return immediately. If TC8 fails between Phase 4 and Phase 5, the test pod stays in the namespace while subsequent serial specs and the deferred cleanupContainerRuntimeConfig race the still-terminating pod against the next MCP rollout. Use the existing deletePodAndWait helper (already used elsewhere in this PR) inside the defer, or at minimum log/check the delete error.

Also applies to: 500-500

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` at line 423, The deferred
direct calls to oc.AdminKubeClient().CoreV1().Pods(...).Delete(...) drop
returned errors and don't wait for pod termination; replace those defers with a
defer that calls the existing deletePodAndWait helper (or at minimum capture and
log the error) so the pod is fully removed before subsequent specs and
cleanupContainerRuntimeConfig run; update the defer around testPod (the variable
created in this test) to call deletePodAndWait(ctx, oc, testPod.Name) and
log/check any returned error from that call instead of calling
Pods(...).Delete(...) directly.
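A minimal sketch of that cleanup, assuming deletePodAndWait has the signature the reviewer describes and returns an error; the wrapper name cleanupTestPod is hypothetical, and e2e.Logf is the upstream framework logger already vendored in origin:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	e2e "k8s.io/kubernetes/test/e2e/framework"

	exutil "github.com/openshift/origin/test/extended/util"
)

// cleanupTestPod wraps the PR's deletePodAndWait helper (signature assumed
// from the review comment) so tests can register it with
// `defer cleanupTestPod(ctx, oc, testPod)` instead of a bare
// Pods(...).Delete(...) defer that drops the error and returns before the
// pod is actually gone.
func cleanupTestPod(ctx context.Context, oc *exutil.CLI, testPod *corev1.Pod) {
	if err := deletePodAndWait(ctx, oc, testPod.Name); err != nil {
		e2e.Logf("failed to clean up test pod %s: %v", testPod.Name, err)
	}
}
```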

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fe9fb502-6d72-4d94-8f23-d628f3df585c

📥 Commits

Reviewing files that changed from the base of the PR and between 6f9f7e0 and e279b74.

📒 Files selected for processing (7)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
  • test/extended/testdata/bindata.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/testdata/bindata.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/extended/node/node_utils.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/stargz_store_setup.go

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from e279b74 to c7b60ef on May 12, 2026 13:26
This adds API validation and E2E tests for the Additional Storage
Configurations feature (TechPreviewNoUpgrade):
- additionalArtifactStores (max 10 stores)
- additionalImageStores (max 10 stores)
- additionalLayerStores (max 5 stores, supports :ref suffix)
- Combined storage configurations
Test suite: openshift/additional-storage-support

Signed-off-by: Bhargavi Gudi <BhargaviGudi@users.noreply.github.com>

Add comprehensive test cases for Additional Storage Support feature

Signed-off-by: Bhargavi Gudi <BhargaviGudi@users.noreply.github.com>

Updated testcases
@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from c7b60ef to 2630937 on May 12, 2026 13:32
@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BhargaviGudi
Once this PR has been reviewed and has the lgtm label, please assign cpmeadors for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@BhargaviGudi
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview

@openshift-ci
Contributor

openshift-ci Bot commented May 13, 2026

@BhargaviGudi: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-disruptive-longrunning-techpreview-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2c4d9300-4e8e-11f1-9adc-36a3f6421138-0


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
