Runtime v2 introduces a first class shim API for runtime authors to integrate with containerd. The shim API is minimal and scoped to the execution lifecycle of a container.
Users specify the runtime they wish to use when creating a container. The runtime can also be changed via a container update.
> ctr run --runtime io.containerd.runc.v2
When a user specifies a runtime name, such as io.containerd.runc.v2, they specify both the name and the version of the runtime.
This will be translated by containerd into a binary name for the shim.
io.containerd.runc.v2 -> containerd-shim-runc-v2
Since the 1.6 release, it is also possible to specify an absolute path to the runtime binary:
> ctr run --runtime /usr/local/bin/containerd-shim-runc-v2
containerd keeps the containerd-shim-* prefix so that users can run ps aux | grep containerd-shim to see running shims on their system.
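The name translation described above can be sketched in a few lines of Go. This is an illustration only, not containerd's own code: the function name shimBinaryName is hypothetical, and the rule it applies (use the last two dot-separated components as name and version, pass absolute paths through unchanged) is inferred from the examples in this section.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shimBinaryName maps a runtime name such as "io.containerd.runc.v2" to a
// shim binary name such as "containerd-shim-runc-v2". An absolute path is
// returned unchanged. An empty result means the name could not be translated.
func shimBinaryName(runtime string) string {
	if filepath.IsAbs(runtime) {
		return runtime
	}
	parts := strings.Split(runtime, ".")
	if len(parts) < 2 {
		return ""
	}
	// Assumption: the last two components are the runtime name and version.
	name, version := parts[len(parts)-2], parts[len(parts)-1]
	if name == "" || version == "" {
		return ""
	}
	return fmt.Sprintf("containerd-shim-%s-%s", name, version)
}

func main() {
	fmt.Println(shimBinaryName("io.containerd.runc.v2"))
}
```

Running the program prints containerd-shim-runc-v2, matching the mapping shown earlier.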
This section is dedicated to runtime authors wishing to build a shim. It details how the API works and the considerations to keep in mind when building a shim.
Container information is provided to a shim in two ways: through the OCI Runtime Bundle and on the Create rpc request.
Each shim MUST implement a start subcommand.
This command will launch new shims.
The start command MUST accept the following flags:
- -namespace: the namespace for the container
- -address: the address of the containerd's main grpc socket
- -publish-binary: the binary path to publish events back to containerd
- -id: the id of the container
The start command, as well as all binary calls to the shim, has the bundle for the container set as the cwd.
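The flags listed above could be handled with Go's standard flag package, as in the following minimal sketch. The type shimOptions and the function parseShimArgs are hypothetical names. The sketch also assumes that the flags appear before the subcommand on the command line, so that the first non-flag argument is the action; treat that ordering as an assumption rather than something this document specifies.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

// shimOptions holds the flags that the start command must accept.
type shimOptions struct {
	Namespace     string
	Address       string
	PublishBinary string
	ID            string
}

// parseShimArgs parses the shim flags and returns them together with the
// first non-flag argument, which this sketch treats as the subcommand.
func parseShimArgs(args []string) (shimOptions, string, error) {
	var o shimOptions
	fs := flag.NewFlagSet("shim", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // parse errors are returned to the caller instead
	fs.StringVar(&o.Namespace, "namespace", "", "namespace for the container")
	fs.StringVar(&o.Address, "address", "", "address of containerd's main grpc socket")
	fs.StringVar(&o.PublishBinary, "publish-binary", "", "binary used to publish events back to containerd")
	fs.StringVar(&o.ID, "id", "", "id of the container")
	if err := fs.Parse(args); err != nil {
		return o, "", err
	}
	return o, fs.Arg(0), nil
}

func main() {
	opts, action, err := parseShimArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The bundle is the current working directory for every binary call.
	bundle, err := os.Getwd()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Diagnostics go to stderr because stdout is reserved for the shim address.
	fmt.Fprintf(os.Stderr, "action=%q namespace=%q id=%q bundle=%q\n",
		action, opts.Namespace, opts.ID, bundle)
}
```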
The start command may have the following containerd specific environment variables set:
- TTRPC_ADDRESS: the address of containerd's ttrpc API socket
- GRPC_ADDRESS: the address of containerd's grpc API socket (1.7+)
- MAX_SHIM_VERSION: the maximum shim version supported by the client, always 2 for shim v2 (1.7+)
- SCHED_CORE: enable core scheduling if available (1.6+)
- NAMESPACE: an optional namespace the shim is operating in or inheriting (1.7+)
The start command MUST write to stdout either the ttrpc address that the shim is serving its API on, or (experimental) a JSON structure in the following format (where protocol can be either "ttrpc" or "grpc"):
{
  "version": 2,
  "address": "/address/of/task/service",
  "protocol": "grpc"
}
The address will be used by containerd to issue API requests for container operations.
The start command can either start a new shim or return an address to an existing shim based on the shim's logic.
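As an illustration of the experimental JSON form, the sketch below builds the structure shown above with encoding/json. The type bootstrapResult and the function marshalBootstrap are hypothetical names chosen for this example; only the three JSON keys and the two protocol values come from this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// bootstrapResult mirrors the JSON structure that start may write to stdout.
type bootstrapResult struct {
	Version  int    `json:"version"`
	Address  string `json:"address"`
	Protocol string `json:"protocol"`
}

// marshalBootstrap encodes the address and protocol for shim v2.
func marshalBootstrap(address, protocol string) ([]byte, error) {
	if protocol != "ttrpc" && protocol != "grpc" {
		return nil, fmt.Errorf("unsupported protocol %q", protocol)
	}
	return json.Marshal(bootstrapResult{Version: 2, Address: address, Protocol: protocol})
}

func main() {
	out, err := marshalBootstrap("/address/of/task/service", "grpc")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Only the result is written to stdout; errors go to stderr.
	fmt.Println(string(out))
}
```

Validating the protocol before encoding keeps a typo from reaching containerd as an address it cannot interpret.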
Each shim MUST implement a delete subcommand.
This command allows containerd to delete any container resources created, mounted, and/or run by a shim when containerd can no longer communicate over rpc.
This happens if a shim is SIGKILL'd with a running container.
These resources will need to be cleaned up when containerd loses the connection to a shim.
This is also used when containerd boots and reconnects to shims.
If a bundle is still on disk but containerd cannot connect to a shim, the delete command is invoked.
The delete command MUST accept the following flags:
- -namespace: the namespace for the container
- -address: the address of the containerd's main socket
- -publish-binary: the binary path to publish events back to containerd
- -id: the id of the container
- -bundle: the path to the bundle to delete. On non-Windows and non-FreeBSD platforms this will match cwd
The delete command will be executed with the container's bundle as its cwd, except on Windows and FreeBSD platforms.
containerd does not provide any host level configuration for shims via the API. If a shim needs configuration from the user with host level information across all instances, a shim specific configuration file can be set up.
On the create request, there is a generic *protobuf.Any that allows a user to specify container level configuration for the shim.
message CreateTaskRequest {
	string id = 1;
	...
	google.protobuf.Any options = 10;
}
A shim author can create their own protobuf message for configuration, and clients can import it and provide this information as needed.
I/O for a container is provided by the client to the shim via fifo on Linux, named pipes on Windows, or log files on disk.
The paths to these files are provided on the Create rpc for the initial creation and on the Exec rpc for additional processes.
message CreateTaskRequest {
	string id = 1;
	bool terminal = 4;
	string stdin = 5;
	string stdout = 6;
	string stderr = 7;
}

message ExecProcessRequest {
	string id = 1;
	string exec_id = 2;
	bool terminal = 3;
	string stdin = 4;
	string stdout = 5;
	string stderr = 6;
}
Containers that are to be launched with an interactive terminal will have the terminal field set to true. Data is still copied over the files (fifos, pipes) in the same way as for non-interactive containers.
The root filesystem for the container is provided on the Create rpc.
Shims are responsible for managing the lifecycle of the filesystem mount during the lifecycle of a container.
message CreateTaskRequest {
	string id = 1;
	string bundle = 2;
	repeated containerd.types.Mount rootfs = 3;
	...
}
The mount protobuf message is:
message Mount {
	// Type defines the nature of the mount.
	string type = 1;
	// Source specifies the name of the mount. Depending on mount type, this
	// may be a volume name or a host path, or even ignored.
	string source = 2;
	// Target path in container
	string target = 3;
	// Options specifies zero or more fstab style mount options.
	repeated string options = 4;
}
Shims are responsible for mounting the filesystem into the rootfs/ directory of the bundle.
Shims are also responsible for unmounting the filesystem.
During a delete binary call, the shim MUST ensure that the filesystem is also unmounted.
Filesystems are provided by the containerd snapshotters.
The Runtime v2 supports an async event model. In order for an upstream caller (such as Docker) to get these events in the correct order, a Runtime v2 shim MUST implement the following events where Compliance=MUST. This avoids race conditions between the shim and the shim client where, for example, a call to Start can signal a TaskExitEventTopic before even returning the results from the Start call. With these guarantees, a Runtime v2 shim's call to Start is required to have published the async event TaskStartEventTopic before the shim can publish the TaskExitEventTopic.
Topic | Compliance | Description |
---|---|---|
runtime.TaskCreateEventTopic | MUST | When a task is successfully created |
runtime.TaskStartEventTopic | MUST (follow TaskCreateEventTopic) | When a task is successfully started |
runtime.TaskExitEventTopic | MUST (follow TaskStartEventTopic) | When a task exits, expectedly or unexpectedly |
runtime.TaskDeleteEventTopic | MUST (follow TaskExitEventTopic or TaskCreateEventTopic if never started) | When a task is removed from a shim |
runtime.TaskPausedEventTopic | SHOULD | When a task is successfully paused |
runtime.TaskResumedEventTopic | SHOULD (follow TaskPausedEventTopic) | When a task is successfully resumed |
runtime.TaskCheckpointedEventTopic | SHOULD | When a task is checkpointed |
runtime.TaskOOMEventTopic | SHOULD | If the shim collects Out of Memory events |

Topic | Compliance | Description |
---|---|---|
runtime.TaskExecAddedEventTopic | MUST (follow TaskCreateEventTopic) | When an exec is successfully added |
runtime.TaskExecStartedEventTopic | MUST (follow TaskExecAddedEventTopic) | When an exec is successfully started |
runtime.TaskExitEventTopic | MUST (follow TaskExecStartedEventTopic) | When an exec (other than the init exec) exits, expectedly or unexpectedly |
runtime.TaskDeleteEventTopic | SHOULD (follow TaskExitEventTopic or TaskExecAddedEventTopic if never started) | When an exec is removed from a shim |
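The ordering rules for the init task can be expressed as a small check, which may be handy in a shim's tests. The sketch below is one possible reading of the task table: it looks only at the four MUST topics, ignores every other topic, and interprets "follow" as "the previous MUST event was one of the listed topics". The constants are local string stand-ins for the topic names, not the values defined by containerd, and checkOrder is a hypothetical helper.

```go
package main

import "fmt"

// Local stand-ins for the MUST topics of the task table.
const (
	create = "TaskCreateEventTopic"
	start  = "TaskStartEventTopic"
	exit   = "TaskExitEventTopic"
	del    = "TaskDeleteEventTopic"
)

// allowedPrev lists, for each MUST topic, the MUST topics that may come
// directly before it. The empty string means "no earlier event".
var allowedPrev = map[string][]string{
	create: {""},
	start:  {create},
	exit:   {start},
	del:    {exit, create},
}

// checkOrder reports the first ordering violation in a sequence of published
// topics for a single task. Topics outside allowedPrev are skipped.
func checkOrder(topics []string) error {
	prev := ""
	for i, t := range topics {
		allowed, tracked := allowedPrev[t]
		if !tracked {
			continue
		}
		ok := false
		for _, p := range allowed {
			if p == prev {
				ok = true
				break
			}
		}
		if !ok {
			return fmt.Errorf("event %d: %s may not follow %q", i, t, prev)
		}
		prev = t
	}
	return nil
}

func main() {
	fmt.Println(checkOrder([]string{create, start, exit, del}))
	fmt.Println(checkOrder([]string{create, exit}))
}
```

The first call in main reports no error, while the second reports a violation because an exit is published without a preceding start.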
The following sequence diagram shows the flow of actions when the ctr run command is executed.
sequenceDiagram
participant ctr
participant containerd
participant shim
autonumber
ctr->>containerd: Create container
Note right of containerd: Save container metadata
containerd-->>ctr: Container ID
ctr->>containerd: Create task
%% Start shim
containerd-->shim: Prepare bundle
containerd->>shim: Execute binary: containerd-shim-runc-v2 start
shim->shim: Start TTRPC server
shim-->>containerd: Respond with address: unix://containerd/container.sock
containerd-->>shim: Create TTRPC client
%% Schedule task
Note right of containerd: Schedule new task
containerd->>shim: TaskService.CreateTaskRequest
shim-->>containerd: Task PID
containerd-->>ctr: Task ID
%% Start task
ctr->>containerd: Start task
containerd->>shim: TaskService.StartRequest
shim-->>containerd: OK
%% Wait task
ctr->>containerd: Wait task
containerd->>shim: TaskService.WaitRequest
Note right of shim: Block until task exits
shim-->>containerd: Exit status
containerd-->>ctr: OK
Note over ctr,shim: Other task requests (Kill, Pause, Resume, CloseIO, Exec, etc)
%% Kill signal
opt Kill task
ctr->>containerd: Kill task
containerd->>shim: TaskService.KillRequest
shim-->>containerd: OK
containerd-->>ctr: OK
end
%% Delete task
ctr->>containerd: Task Delete
containerd->>shim: TaskService.DeleteRequest
shim-->>containerd: Exit information
containerd->>shim: TaskService.ShutdownRequest
shim-->>containerd: OK
containerd-->shim: Close client
containerd->>shim: Execute binary: containerd-shim-runc-v2 delete
containerd-->shim: Delete bundle
containerd-->>ctr: Exit code
Shims may support pluggable logging via STDIO URIs. Currently supported schemes for logging are:
- fifo - Linux
- binary - Linux & Windows
- file - Linux & Windows
- npipe - Windows
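A shim that accepts STDIO URIs has to work out which scheme it was given. The following minimal sketch does this with net/url. The function parseStdioURI is a hypothetical helper, the example URI is made up, and treating a value without a scheme as a fifo path is an assumption of this sketch. The per-platform restrictions in the list above are not checked.

```go
package main

import (
	"fmt"
	"net/url"
)

// supportedSchemes holds the logging schemes listed above.
var supportedSchemes = map[string]bool{
	"fifo":   true,
	"binary": true,
	"file":   true,
	"npipe":  true,
}

// parseStdioURI returns the scheme and path of a STDIO URI.
func parseStdioURI(raw string) (scheme, path string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	if u.Scheme == "" {
		// Assumption: a bare path is treated as a fifo.
		u.Scheme = "fifo"
	}
	if !supportedSchemes[u.Scheme] {
		return "", "", fmt.Errorf("unsupported stdio scheme %q", u.Scheme)
	}
	return u.Scheme, u.Path, nil
}

func main() {
	// Hypothetical logger binary path, for illustration only.
	scheme, path, err := parseStdioURI("binary:///usr/local/bin/my-logger?id=abc")
	fmt.Println(scheme, path, err)
}
```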
Binary logging has the ability to forward a container's STDIO to an external binary for consumption.
A sample logging driver that forwards the container's STDOUT and STDERR to journald is:
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"sync"

	"github.com/containerd/containerd/runtime/v2/logging"
	"github.com/coreos/go-systemd/journal"
)

func main() {
	logging.Run(log)
}

func log(ctx context.Context, config *logging.Config, ready func() error) error {
	// construct any log metadata for the container
	vars := map[string]string{
		"SYSLOG_IDENTIFIER": fmt.Sprintf("%s:%s", config.Namespace, config.ID),
	}
	var wg sync.WaitGroup
	wg.Add(2)
	// forward both stdout and stderr to the journal
	go copy(&wg, config.Stdout, journal.PriInfo, vars)
	go copy(&wg, config.Stderr, journal.PriErr, vars)

	// signal that we are ready and setup for the container to be started
	if err := ready(); err != nil {
		return err
	}
	wg.Wait()
	return nil
}

func copy(wg *sync.WaitGroup, r io.Reader, pri journal.Priority, vars map[string]string) {
	defer wg.Done()
	s := bufio.NewScanner(r)
	for s.Scan() {
		journal.Send(s.Text(), pri, vars)
	}
}
If a shim does not or cannot implement an rpc call, it MUST return a github.com/containerd/containerd/errdefs.ErrNotImplemented error.
A fifo on unix or a named pipe on Windows will be provided to the shim.
It is located inside the cwd of the shim and is named "log".
Shims can use the existing github.com/containerd/containerd/log package to log debug messages.
Messages will automatically be output in containerd's daemon logs with the correct fields and runtime set.
ttrpc is one of the supported protocols for shims. It works with standard protobufs and grpc services, and it can also generate clients. The only difference between grpc and ttrpc is the wire protocol. ttrpc removes the http stack in order to save memory and binary size and keep shims small. ttrpc is the recommended protocol for your shim; grpc support is currently an experimental feature.