Summary
Cancelling a context.Context passed to Client.Solve during a build causes Solve to return eventually, as expected, but buildkitd fails to kill the underlying build step, which keeps running until it completes. Buildkitd prints repeated error messages indicating that it cannot kill runc, and after buildkit-runc finally exits, a zombie process remains indefinitely. A new zombie process is created every time a build is cancelled this way. Another concerning aspect is that running the same operation while the previously cancelled build step is still running re-attaches to the same buildkit-runc process and starts streaming progress from that step. Is this expected behavior?
I ran this test on both macOS Monterey and Debian 10 to rule out a host issue, and the results were the same.
Environment
Invocation: Go client
Go version: 1.18.1
Buildkit mode: rootless
Buildkit version: v0.10.3
Buildkit environment: container
Details
Dockerfile
FROM python:3.9
RUN pip install pipenv waitress numpy flask dask awscli pandas
RUN echo "all done"
Buildkitd Launch
docker run -d \
-p 1234:1234 \
--security-opt seccomp=unconfined \
--security-opt apparmor=unconfined \
moby/buildkit:v0.10.3-rootless --oci-worker-no-process-sandbox --addr=tcp://0.0.0.0:1234
Test Code
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/containerd/console"
	bkclient "github.com/moby/buildkit/client"
	"github.com/moby/buildkit/util/progress/progressui"
	"golang.org/x/sync/errgroup"
)

// LogWriter forwards progress output to a standard logger.
type LogWriter struct {
	Logger *log.Logger
}

func (w *LogWriter) Write(msg []byte) (int, error) {
	w.Logger.Println(string(msg))
	return len(msg), nil
}

var testDir = "/path/to/docker/context-dir"

func main() {
	bk, err := bkclient.New(context.TODO(), "tcp://127.0.0.1:1234")
	if err != nil {
		log.Fatal(err)
	}

	solveOpts := bkclient.SolveOpt{
		Frontend:      "dockerfile.v0",
		FrontendAttrs: map[string]string{},
		LocalDirs: map[string]string{
			"context":    testDir,
			"dockerfile": testDir,
		},
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ch := make(chan *bkclient.SolveStatus)
	eg, ctx := errgroup.WithContext(ctx)

	// Render build progress until Solve closes the status channel.
	eg.Go(func() error {
		lw := &LogWriter{Logger: log.New(os.Stdout, "progress: ", log.Llongfile)}
		var c console.Console
		// Use the console only if stderr is a terminal; otherwise c stays nil
		// and progress is written to lw.
		if cn, err := console.ConsoleFromFile(os.Stderr); err == nil {
			c = cn
		}
		_, err := progressui.DisplaySolveStatus(context.TODO(), "", c, lw, ch)
		return err
	})

	// Run the build with the cancellable context.
	eg.Go(func() error {
		if _, err := bk.Solve(ctx, nil, solveOpts, ch); err != nil {
			return err
		}
		log.Println("Solve complete")
		return nil
	})

	// Cancel the build after 10 seconds, while the pip install step is running.
	go func() {
		time.Sleep(10 * time.Second)
		log.Println("Cancelling context")
		cancel()
	}()

	if err = eg.Wait(); err != nil {
		log.Fatal(fmt.Errorf("buildkit solve issue: %w", err))
	}
}
Buildkitd logs
These messages repeat for a very long time, until the underlying pip command completes.
buildkit_1 | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1 | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1 | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1 | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1 | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
Buildkitd
A [pip] process remains after the buildkit-runc process finally exits. Here is an example of the process list after cancelling three separate builds during the pip install step.
$ docker exec -it buildkitd ps -eo pid,ppid,time,args
PID PPID TIME COMMAND
1 0 0:00 rootlesskit buildkitd --addr=tcp://0.0.0.0:1234 --oci-worker-
11 1 0:00 /proc/self/exe buildkitd --addr=tcp://0.0.0.0:1234 --oci-work
26 11 0:55 buildkitd --addr=tcp://0.0.0.0:1234 --oci-worker-no-process-s
588 1 0:24 [pip]
13898 1 0:23 [pip]
24906 1 0:22 [pip]
34334 0 0:00 ps -eo pid,ppid,time,args