On a customer system today, while performing a sled expungement, the blueprint_executor task started failing on every invocation, reporting this error:
execution: enabled
status: failed at: Ensure external networking resources (step 1/15)
error: step failed: Ensure external networking resources
caused by: Internal Error: unexpected database error: Record not found
"Ensure external networking resources" is the step where we attempt to ensure that the external IP and service NIC rows in the non-reconfigurator CRDB tables are consistent with the blueprint. (We treat them as a form of rendezvous table, although they long predate that term and idea.) This is a two-step process with two substeps each:
1. For every expunged zone with an external IP:
   - Soft delete its external IP by updating the relevant row (unless it's already been soft deleted, in which case do nothing)
   - Soft delete its service NIC by updating the relevant row (again unless it's already been soft deleted)
2. For every in-service zone with an external IP:
   - Ensure its external IP exists, creating it if necessary
   - Ensure its service NIC exists, creating it if necessary
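The two steps above can be sketched with an in-memory model. This is only an illustration of the control flow as described here, not the actual datastore code: `Row`, `Zone`, `EnsureError`, `soft_delete`, `ensure_exists`, and `ensure_external_networking` are all hypothetical names, and plain `HashMap`s stand in for the CRDB tables. The sketch assumes, per the error reported above, that soft deleting a row that was never created fails with a "record not found" style error, and it does not model what happens if an in-service zone's row is already soft deleted.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table row that supports soft deletion.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    deleted: bool,
}

/// Hypothetical error type mirroring the "Record not found" failure.
#[derive(Debug, PartialEq)]
enum EnsureError {
    RecordNotFound(String),
}

/// Hypothetical view of one zone's external networking in a blueprint.
struct Zone {
    external_ip_id: String,
    nic_id: String,
    expunged: bool,
}

/// Soft delete a row. An already-deleted row is a no-op (returns `false`);
/// a row that does not exist at all is an error.
fn soft_delete(table: &mut HashMap<String, Row>, id: &str) -> Result<bool, EnsureError> {
    match table.get_mut(id) {
        None => Err(EnsureError::RecordNotFound(id.to_string())),
        Some(row) if row.deleted => Ok(false),
        Some(row) => {
            row.deleted = true;
            Ok(true)
        }
    }
}

/// Create the row if it does not exist yet; leave an existing row untouched.
fn ensure_exists(table: &mut HashMap<String, Row>, id: &str) {
    table.entry(id.to_string()).or_insert(Row { deleted: false });
}

fn ensure_external_networking(
    zones: &[Zone],
    external_ips: &mut HashMap<String, Row>,
    nics: &mut HashMap<String, Row>,
) -> Result<(), EnsureError> {
    // Step 1: soft delete the rows of every expunged zone.
    for zone in zones.iter().filter(|z| z.expunged) {
        soft_delete(external_ips, &zone.external_ip_id)?;
        soft_delete(nics, &zone.nic_id)?;
    }
    // Step 2: only reached if every soft delete above succeeded.
    for zone in zones.iter().filter(|z| !z.expunged) {
        ensure_exists(external_ips, &zone.external_ip_id);
        ensure_exists(nics, &zone.nic_id);
    }
    Ok(())
}
```

In this model, a single failed soft delete in step 1 returns early, so step 2 never runs for any zone.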
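The errors we were seeing were from attempting to soft delete external IPs. We did see these logs for some zones (omicron/nexus/db-queries/src/db/datastore/deployment/external_networking.rs, lines 178 to 182 in 3d2d0c1):
info!(log, "successfully deleted Omicron zone vNIC");
} else {
debug!(log, "Omicron zone vNIC already deleted");
}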
but never made it to the point of ensuring new records existed, strongly implying we were failing to soft delete some particular zone's records.
Support grabbed both the Nexus logs and a Reconfigurator state file. The logs showed the zone ID of the last successful soft deletion (repeated on each iteration of the executor task). The loops here are ordered by sled ID first, zone ID second, so given the state file and the last successful soft delete, we were able to find both the zone ID of the zone we were presumably failing on and its associated external IP / service NIC details. We manually confirmed that the rows the blueprint claimed should exist did not:
> select * from external_ip where id = '$ZONE_EXTERNAL_IP_ID';
NO ROWS
> select * from service_network_interface where id = '$ZONE_NIC_ID';
NO ROWS
The planner/executor believe this should be impossible, hence the executor getting stuck. (This is one of the few executor steps that stops the entire task, under the assumption that if we can't ensure the networking tables are correct it may not be safe to start up zones; e.g. could we have multiple zones believing they're associated with the same external IP?)
Looking at the Reconfigurator state file, I believe the sequence that got us here was:
3. The planner run after expunging the 9th disk expunged the Nexus on that disk, then placed a new Nexus. It put it on the next disk that was going to be expunged. (At this time, both the disk and sled were still in-service, so this was a perfectly valid, albeit very unfortunate, placement.)
4. The 10th disk was expunged.
5. The planner run after expunging the 10th disk expunged the Nexus on that disk, which the planner run in step 3 had just created. It placed a new Nexus on another sled.
However, the executor did not run in between the blueprint from step 3 being made the target and the blueprint from step 5 being made the target, so it never ensured the external IP / service NIC rows for the short-lived Nexus were created. As a result, all subsequent executor runs are doomed: they're stuck trying to soft-delete rows that don't exist, because the only blueprint that would have caused them to be created was never executed.
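The sequence above can be reproduced in a toy model. This is a minimal sketch under stated assumptions, not the real planner or executor: `Blueprint`, `execute`, `scenario`, and the zone names are hypothetical, each zone's external IP and service NIC are collapsed into a single entry, and a soft delete of a missing entry is assumed to fail as in the error reported above.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified blueprint: zone names only.
struct Blueprint {
    in_service: Vec<&'static str>,
    expunged: Vec<&'static str>,
}

/// Execute one blueprint against `rows`, which maps a zone name to
/// "have its networking rows been soft deleted?".
fn execute(bp: &Blueprint, rows: &mut HashMap<&'static str, bool>) -> Result<(), String> {
    // Soft delete rows for expunged zones; a missing row is fatal.
    for zone in &bp.expunged {
        match rows.get_mut(zone) {
            Some(deleted) => *deleted = true,
            None => return Err(format!("Record not found: {zone}")),
        }
    }
    // Create rows for in-service zones if they do not exist yet.
    for zone in &bp.in_service {
        rows.entry(*zone).or_insert(false);
    }
    Ok(())
}

/// Replay the sequence, optionally skipping execution of the step 3 blueprint.
fn scenario(execute_step3: bool) -> Result<(), String> {
    let mut rows = HashMap::new();

    // Starting point: the original Nexus is in service and was executed.
    let initial = Blueprint { in_service: vec!["nexus-old"], expunged: vec![] };
    execute(&initial, &mut rows)?;

    // Step 3: expunge the old Nexus, place a short-lived replacement.
    let step3 = Blueprint {
        in_service: vec!["nexus-short-lived"],
        expunged: vec!["nexus-old"],
    };
    if execute_step3 {
        execute(&step3, &mut rows)?;
    }

    // Step 5: expunge the short-lived Nexus too, place another replacement.
    let step5 = Blueprint {
        in_service: vec!["nexus-new"],
        expunged: vec!["nexus-old", "nexus-short-lived"],
    };
    execute(&step5, &mut rows)
}
```

In this model, `scenario(true)` succeeds, while `scenario(false)` fails on the short-lived zone on this and every later run, because no executed blueprint ever created its entry.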