Reconfigurator: race between planner and executor during disk expungement led to executor perpetually failing #10025

@jgallagher

On a customer system today, while performing a sled expungement, the blueprint_executor task started failing on every invocation, reporting this error:

execution:        enabled
status:           failed at: Ensure external networking resources (step 1/15)
error:            step failed: Ensure external networking resources
  caused by:      Internal Error: unexpected database error: Record not found

"Ensure external networking resources" is the step where we attempt to ensure that the external IP and service NIC rows in the non-reconfigurator CRDB tables are consistent with the blueprint. (We treat them as a form of rendezvous table, although they long predate that term and idea.) This is a two-step process, and each step has two substeps:

  1. For every expunged zone with an external IP:
    1. Soft delete its external IP by updating the relevant row (unless it's already been soft deleted, in which case do nothing)
    2. Soft delete its service NIC by updating the relevant row (again unless it's already been soft deleted)
  2. For every in-service zone with an external IP:
    1. Ensure its external IP exists, creating it if necessary
    2. Ensure its service NIC exists, creating it if necessary
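The two steps above can be sketched with an in-memory model. This is only an illustration, not the actual datastore code: `Zone`, `Tables`, `EnsureError`, and `ensure_external_networking` are hypothetical names, IDs are plain integers rather than UUIDs, and the sketch assumes that soft-deleting a row that was never created is reported as an error (which matches the "Record not found" failure seen here).

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for a blueprint zone with an external IP.
#[derive(Clone, Debug)]
pub struct Zone {
    pub zone_id: u32,
    pub external_ip_id: u32,
    pub nic_id: u32,
    pub expunged: bool,
}

/// Hypothetical tables: row ID -> "has this row been soft deleted?".
#[derive(Default, Debug)]
pub struct Tables {
    pub external_ip: BTreeMap<u32, bool>,
    pub service_nic: BTreeMap<u32, bool>,
}

#[derive(Debug, PartialEq)]
pub enum EnsureError {
    RecordNotFound { zone_id: u32 },
}

/// Soft delete one row. Ok(true) means it was deleted just now, Ok(false)
/// means it was already soft deleted, Err(()) means the row does not exist.
fn soft_delete(table: &mut BTreeMap<u32, bool>, id: u32) -> Result<bool, ()> {
    match table.get_mut(&id) {
        None => Err(()),
        Some(deleted) => {
            let newly_deleted = !*deleted;
            *deleted = true;
            Ok(newly_deleted)
        }
    }
}

pub fn ensure_external_networking(
    tables: &mut Tables,
    zones: &[Zone],
) -> Result<(), EnsureError> {
    // Step 1: soft delete the external IP and service NIC of expunged zones.
    for z in zones.iter().filter(|z| z.expunged) {
        let missing = |_| EnsureError::RecordNotFound { zone_id: z.zone_id };
        soft_delete(&mut tables.external_ip, z.external_ip_id).map_err(missing)?;
        soft_delete(&mut tables.service_nic, z.nic_id).map_err(missing)?;
    }
    // Step 2: create rows for in-service zones if they do not exist yet.
    // This is never reached if step 1 fails for any zone.
    for z in zones.iter().filter(|z| !z.expunged) {
        tables.external_ip.entry(z.external_ip_id).or_insert(false);
        tables.service_nic.entry(z.nic_id).or_insert(false);
    }
    Ok(())
}
```

In this model, repeated calls are idempotent for zones whose rows exist, but a single expunged zone whose rows were never created makes every call return `RecordNotFound` before step 2 runs.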

The datastore process for this is here.

The errors we were seeing were from attempting to soft delete external IPs. We did see the log messages emitted by this code for some zones:

if deleted_nic {
    info!(log, "successfully deleted Omicron zone vNIC");
} else {
    debug!(log, "Omicron zone vNIC already deleted");
}

but never made it to the point of ensuring new records existed, strongly implying we were failing to soft delete some particular zone's records.

Support grabbed both the Nexus logs and a Reconfigurator state file. The logs showed the zone ID of the last successful soft deletion (repeated on each iteration of the executor task). The loops here are ordered by sled ID first, zone ID second, so given the state file and the last successful soft delete, we were able to find both the zone ID of the zone we were presumably failing on and its associated external IP / service NIC details. We manually confirmed that the rows the blueprint claimed should exist did not:

> select * from external_ip where id = '$ZONE_EXTERNAL_IP_ID';
NO ROWS

> select * from service_network_interface where id = '$ZONE_NIC_ID';
NO ROWS
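The lookup used to identify the failing zone (take the expunged zones in sled-then-zone order and pick the one that follows the last successful soft delete) can be sketched as below. This is a hedged illustration only: `ZoneKey` and `next_after` are hypothetical names, and small integers stand in for the real sled and zone IDs.

```rust
/// Hypothetical (sled_id, zone_id) key for an expunged zone with an external IP.
pub type ZoneKey = (u32, u32);

/// Return the zone the executor would process right after `last_ok`,
/// assuming it walks zones ordered by sled ID first and zone ID second.
pub fn next_after(expunged: &[ZoneKey], last_ok: ZoneKey) -> Option<ZoneKey> {
    let mut sorted = expunged.to_vec();
    // Tuples compare lexicographically: sled ID first, then zone ID.
    sorted.sort();
    sorted.into_iter().find(|key| *key > last_ok)
}
```

The returned key is the candidate whose external IP and service NIC IDs are then looked up in the state file and checked against the database.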

The planner/executor believe this should be impossible, hence the executor getting stuck. (This is one of the few executor steps that stops the entire task, under the assumption that if we can't ensure the networking tables are correct it may not be safe to start up zones; e.g. could we have multiple zones believing they're associated with the same external IP?)

Looking at the Reconfigurator state file, I believe the sequence that got us here was:

  1. Before expunging the sled, we expunged each of that sled's disks, one at a time. (This is the documented procedure, and exists as a workaround for #8520, "Region replacement sagas apparently more prone to failed database connection claims".)
  2. The 9th disk being expunged hosted a Nexus.
  3. The planner run after expunging the 9th disk expunged the Nexus on that disk, then placed a new Nexus. It put it on the next disk that was going to be expunged. (At this time, both the disk and sled were still in-service, so this was a perfectly valid, albeit very unfortunate, placement.)
  4. The 10th disk was expunged.
  5. The planner run after expunging the 10th disk expunged the Nexus on that disk, which the planner run in step 3 had just created. It placed a new Nexus on another sled.

However, the executor did not run between the blueprint from step 3 being made the target and the blueprint from step 5 being made the target, so it never created the external IP / service NIC rows for the short-lived Nexus. All subsequent executor runs are therefore doomed: they're stuck trying to soft-delete rows that don't exist, because the only blueprint that would have caused them to be created was never executed.
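The sequence above can be replayed in a small model to show why skipping one executor run is enough to wedge every later run. This is a sketch under stated assumptions, not the real planner or executor: `Blueprint`, `execute`, and `replay` are hypothetical, the zone names `nexus-a`, `nexus-b`, and `nexus-c` are invented labels for the original, short-lived, and final Nexus zones, and soft-deleted rows are modeled simply as rows that still exist.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical blueprint: zone name -> "is this zone expunged?".
pub type Blueprint = BTreeMap<&'static str, bool>;

/// Hypothetical executor step. `rows` holds the zones whose networking rows
/// exist (live or soft deleted). Fails with the zone name if an expunged
/// zone has no rows to soft delete.
pub fn execute(
    rows: &mut BTreeSet<&'static str>,
    bp: &Blueprint,
) -> Result<(), &'static str> {
    // Step 1: every expunged zone must have rows to soft delete.
    for (zone, expunged) in bp {
        if *expunged && !rows.contains(zone) {
            return Err(*zone);
        }
    }
    // Step 2: create rows for in-service zones.
    for (zone, expunged) in bp {
        if !*expunged {
            rows.insert(*zone);
        }
    }
    Ok(())
}

/// Replay steps 3 and 5, optionally skipping execution of the step 3 blueprint.
pub fn replay(executor_ran_on_step3: bool) -> Result<(), &'static str> {
    // The original Nexus already has its rows.
    let mut rows = BTreeSet::from(["nexus-a"]);
    // Step 3: expunge the original Nexus, place a new one.
    let step3: Blueprint = BTreeMap::from([("nexus-a", true), ("nexus-b", false)]);
    // Step 5: expunge the new Nexus too, place another on a different sled.
    let step5: Blueprint =
        BTreeMap::from([("nexus-a", true), ("nexus-b", true), ("nexus-c", false)]);
    if executor_ran_on_step3 {
        execute(&mut rows, &step3)?;
    }
    execute(&mut rows, &step5)
}
```

In this model `replay(true)` succeeds, while `replay(false)` fails on `nexus-b`, and it would keep failing on every retry because nothing ever creates that zone's rows.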
