rootless: no-process-sandbox leaves zombie process #2855

@sonnysideup

Summary

Cancelling the context.Context passed to Client.Solve during a build causes Solve to eventually return, as expected, but buildkitd fails to kill the underlying build step, which keeps running until it completes. Buildkitd prints repeated error messages indicating that it cannot kill runc, and after buildkit-runc finally exits, a zombie process is left behind indefinitely. Every build cancelled this way leaves another zombie process.

Another concerning aspect of this behavior: running the same operation while the previously cancelled build step is still running re-attaches to the same buildkit-runc process and starts streaming progress from that step. Is this expected behavior?

I ran this test on both macOS Monterey and Debian 10 to rule out a host issue, and the results were the same.

Environment

Invocation: Go client
Go version: 1.18.1
Buildkit mode: rootless
Buildkit version: v0.10.3
Buildkit environment: container

Details

Dockerfile

FROM python:3.9
RUN pip install pipenv waitress numpy flask dask awscli pandas
RUN echo "all done"

Buildkitd Launch

docker run -d \
  -p 1234:1234 \
  --security-opt seccomp=unconfined \
  --security-opt apparmor=unconfined \
  moby/buildkit:v0.10.3-rootless --oci-worker-no-process-sandbox --addr=tcp://0.0.0.0:1234

Test Code

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"time"

	"github.com/containerd/console"
	bkclient "github.com/moby/buildkit/client"
	"github.com/moby/buildkit/util/progress/progressui"
	"golang.org/x/sync/errgroup"
)

type LogWriter struct {
	Logger *log.Logger
}

func (w *LogWriter) Write(msg []byte) (int, error) {
	w.Logger.Println(string(msg))
	return len(msg), nil
}

var testDir = "/path/to/docker/context-dir"

func main() {
	bk, err := bkclient.New(context.TODO(), "tcp://127.0.0.1:1234")
	if err != nil {
		log.Fatal(err)
	}

	solveOpts := bkclient.SolveOpt{
		Frontend:      "dockerfile.v0",
		FrontendAttrs: map[string]string{},
		LocalDirs: map[string]string{
			"context":    testDir,
			"dockerfile": testDir,
		},
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ch := make(chan *bkclient.SolveStatus)
	eg, ctx := errgroup.WithContext(ctx)

	eg.Go(func() error {
		lw := &LogWriter{Logger: log.New(os.Stdout, "progress: ", log.Llongfile)}

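		// Use a console only if stderr is a terminal; otherwise c stays nil
		// and progress is written as plain text to lw.
		var c console.Console
		if cn, err := console.ConsoleFromFile(os.Stderr); err == nil {
			c = cn
		}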

		_, err := progressui.DisplaySolveStatus(context.TODO(), "", c, lw, ch)
		return err
	})

	eg.Go(func() error {
		if _, err := bk.Solve(ctx, nil, solveOpts, ch); err != nil {
			return err
		}

		log.Println("Solve complete")
		return nil
	})

	go func() {
		time.Sleep(10 * time.Second)
		log.Println("Cancelling context")
		cancel()
	}()

	if err = eg.Wait(); err != nil {
		log.Fatal(fmt.Errorf("buildkit solve issue: %w", err))
	}
}

Buildkitd logs

These errors repeat for a very long time, until the underlying pip command completes.

buildkit_1  | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1  | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1  | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1  | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"
buildkit_1  | time="2022-05-09T14:37:45Z" level=error msg="failed to kill runc vourqg6ehysb3yz2a0mtjmbj9: buildkit-runc did not terminate successfully: exit status 1: container \"vourqg6ehysb3yz2a0mtjmbj9\" does not exist\n" span="[2/3] RUN pip install pipenv waitress numpy flask dask awscli pandas"

Buildkitd

A [pip] process remains after the buildkit-runc process finally exits. Here's an example of what the process list looks like after cancelling 3 separate builds during the pip install step.

$ docker exec -it buildkitd ps -eo pid,ppid,time,args
PID   PPID  TIME  COMMAND
    1     0  0:00 rootlesskit buildkitd --addr=tcp://0.0.0.0:1234 --oci-worker-
   11     1  0:00 /proc/self/exe buildkitd --addr=tcp://0.0.0.0:1234 --oci-work
   26    11  0:55 buildkitd --addr=tcp://0.0.0.0:1234 --oci-worker-no-process-s
  588     1  0:24 [pip]
13898     1  0:23 [pip]
24906     1  0:22 [pip]
34334     0  0:00 ps -eo pid,ppid,time,args
