diff --git a/README.md b/README.md index 21871c0..c0587d7 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ [![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/Privatehive/g-spot-runner-github-actions/main.yml?branch=master&style=flat&logo=github&label=Docker+build)](https://github.com/Privatehive/g-spot-runner-github-actions/actions?query=branch%3Amaster) -**This terraform module provides a ready to use solution for Google Cloud hosted [GitHub ephemeral runner](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling). To save cost preemtible spot compute instances will be used.** +**This terraform module provides a ready to use solution for Google Cloud hosted [GitHub ephemeral runner](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling).** > [!IMPORTANT] > I am not responsible if this Terraform module results in high costs on your billing account. Keep an eye on your billing account and activate alerts! @@ -20,7 +20,7 @@ module "spot-runner" { github_organization = "" github_runner_group = "Default" github_runner_prefix = "runner" - spot_machine_type = "c2d-highcpu-8" + machine_type = "c2d-highcpu-8" } provider "google" { @@ -34,10 +34,11 @@ output "runner_webhook_config" { } ``` -Authenticate with `gcloud` and apply the Terraform (apply twice if the first apply results in an error) +Authenticate with `gcloud` and apply the terraform module (apply twice if the first apply results in an error - wait some minutes in between) ``` bash -$ terraform init && terraform apply +$ gcloud auth application-default login --project +$ terraform init -upgrade && terraform apply ``` Have a look at the Terraform output `runner_webhook_config`. There you find the Cloud Run webhook url and secret. 
Now switch to your GitHub organization settings and create a new webhook: @@ -52,19 +53,26 @@ Have a look at the Terraform output `runner_webhook_config`. There you find the That's it. -As soon as you start a GitHub workflow, which contains a job with `runs-on: self-hosted`, a compute instance (with the specified `spot_machine_type` type) starts. The name of the compute instance starts with the `github_runner_prefix` which is followed by a random string. The name of the compute instance is also the name of the runner in GitHub. After the job completed, the compute instance will be deleted again. +As soon as you start a GitHub workflow, which contains a job with `runs-on: self-hosted`, a VM instance (with the specified `machine_type`) starts. The name of the VM instance starts with the `github_runner_prefix`, which is followed by a random string to make the name unique. The name of the VM instance is also the name of the runner in the GitHub runner group. After the workflow job completed, the VM instance will be deleted again. + +## Configuration + +Have a look at the variables.tf file how to configure the Terraform module. + +> [!TIP] +> To find the cheapest VM machine_type use this [table](https://gcloud-compute.com/instances.html) and sort by Spot instance cost. But remember that the price varies depending on the region. ## How it works 1. As soon as a new GitHub workflow job is queued, the GitHub webhook event "Workflow jobs" invokes the Cloud Run [container](https://github.com/Privatehive/g-spot-runner-github-actions/pkgs/container/runner-autoscaler) with path `/webhook` -2. The Cloud run enqueues a "create runner" Cloud Task. This is necessary, because the timeout of a GitHub webhook is only 10 seconds but to start a compute instance takes about 1 minute. -3. The Cloud task invokes the Cloud Run path `/create_runner`. -4. The Cloud Run creates the preemtible spot compute instance from the instance template -5. 
In the startup script the compute instance uses the PAT to generate a runner token. With the token it registers itself as an ephemeral runner in the runner group and immediately starts working on the workflow job. +2. The Cloud run enqueues a "create-vm" Cloud Task. This is necessary, because the timeout of a GitHub webhook is only 10 seconds but to start a VM instance takes about 1 minute. +3. The Cloud task invokes the Cloud Run path `/create_vm`. +4. The Cloud Run creates the VM instance from the instance template (preemtible spot VM instance by default) +5. In the startup script of the VM instance the PAT is used to generate a runner token. With the token the VM registers itself as an **ephemeral** runner in the runner group and immediately starts working on the workflow job. 6. As soon as the workflow job completed, the GitHub webhook event "Workflow jobs" invokes the Cloud Run again. -7. The Cloud run enqueues a "delete runner" Cloud Task. This is necessary, because the timeout of a GitHub webhook is only 10 seconds but to delete a compute instance takes about 1 minute. -8. The Cloud task invokes the Cloud Run path `/delete_runner`. -9. The Cloud Run deletes the compute instance. +7. The Cloud run enqueues a "delete-vm" Cloud Task. This is necessary, because the timeout of a GitHub webhook is only 10 seconds but to delete a VM instance takes about 1 minute. +8. The Cloud task invokes the Cloud Run path `/delete_vm`. +9. The Cloud Run deletes the VM instance. ## Troubleshooting @@ -79,6 +87,6 @@ Error applying IAM policy for cloudrun service "v1/projects/azure-pipelines-spot 2. Solution: Override the Organization Policy "Domain Restricted Sharing" in the project, by setting it to "Allow all". -#### New compute Instance not starting (but a lot of instances are already running) +#### New VM Instance not starting (but a lot of instances are already running) -You exceeded your projects vCPU limit for the machine type in the region. 
You may find an error log message in the Cloud Run logs stating `Machine Type vCPU quota exceeded for region`. Request a quota increase from google customer support for the project. +You exceeded your projects vCPU limit for the machine type in the region or for all regions. You may find an error log message in the Cloud Run logs stating `Machine Type vCPU quota exceeded for region`. Request a quota increase from google customer support for the project. diff --git a/cloudRun.tf b/cloudRun.tf index 0183f84..f801b6a 100644 --- a/cloudRun.tf +++ b/cloudRun.tf @@ -11,7 +11,7 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" { template { service_account = google_service_account.agent_autoscaler.email - max_instance_request_concurrency = 10 + max_instance_request_concurrency = 20 timeout = "120s" scaling { min_instance_count = 0 @@ -32,7 +32,7 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" { value = google_cloud_tasks_queue.agent_autoscaler_tasks.id } env { - name = "INSTANCE_TEMPLATE_URL" + name = "INSTANCE_TEMPLATE" value = google_compute_instance_template.spot_instance.id } env { @@ -43,6 +43,10 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" { name = "RUNNER_GROUP" value = var.github_runner_group } + env { + name = "RUNNER_LABELS" + value = local.runnerLabel + } env { name = "WEBHOOK_SECRET" value = random_password.webhook_secret.result @@ -51,14 +55,6 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" { name = "ROUTE_WEBHOOK" value = local.webhookUrl } - env { - name = "ROUTE_CREATE_RUNNER" - value = local.webhookCreateRunner - } - env { - name = "ROUTE_DELETE_RUNNER" - value = local.webhookDeleteRunner - } dynamic "env" { for_each = var.enable_debug ? 
[0] : [] content { diff --git a/compute.tf b/compute.tf index 043fe09..ece03dc 100644 --- a/compute.tf +++ b/compute.tf @@ -1,23 +1,23 @@ resource "google_compute_instance_template" "spot_instance" { - name = "ephemeral-runner" + name = "ephemeral-github-runner" region = local.region - machine_type = var.spot_machine_type - tags = ["http-egress", "ssh-ingress"] + machine_type = var.machine_type + tags = var.enable_ssh ? ["http-egress", "ssh-ingress"] : ["http-egress"] depends_on = [google_project_service.compute_api] scheduling { - preemptible = true + preemptible = var.machine_preemtible automatic_restart = false on_host_maintenance = "TERMINATE" instance_termination_action = "STOP" - provisioning_model = "SPOT" + provisioning_model = var.machine_preemtible ? "SPOT" : "STANDARD" } disk { auto_delete = true boot = true - source_image = var.spot_machine_image + source_image = var.machine_image disk_type = "pd-standard" disk_size_gb = 40 } @@ -49,7 +49,7 @@ chown -R agent:agent /home/agent pushd /home/agent sudo -u agent tar zxf /tmp/agent.tar.gz register_token=$(curl -s -L -X POST -H "Accept: application/vnd.github+json" -H "Authorization: Bearer ${var.github_pat_token}" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/orgs/${var.github_organization}/actions/runners/registration-token | jq -r .token) -sudo -u agent ./config.sh --unattended --disableupdate --ephemeral --name $(hostname) --url 'https://github.com/${var.github_organization}' --token $${register_token} --runnergroup '${var.github_runner_group}' || shutdown now +sudo -u agent ./config.sh --unattended --disableupdate --ephemeral --name $(hostname) ${local.runnerLabelInstanceTemplate} --url 'https://github.com/${var.github_organization}' --token $${register_token} --runnergroup '${var.github_runner_group}' || shutdown now ./bin/installdependencies.sh || shutdown now ./svc.sh install agent || shutdown now ./svc.sh start || shutdown now diff --git a/main.tf b/main.tf index 5941086..2ac9109 100644 
--- a/main.tf +++ b/main.tf @@ -14,13 +14,13 @@ data "google_project" "current" { } locals { - projectId = data.google_client_config.current.project - projectNumber = data.google_project.current.number - region = data.google_client_config.current.region - zone = data.google_client_config.current.zone - webhookUrl = "/webhook" - webhookCreateRunner = "/create_runner" - webhookDeleteRunner = "/delete_runner" + webhookUrl = "/webhook" + projectId = data.google_client_config.current.project + projectNumber = data.google_project.current.number + region = data.google_client_config.current.region + zone = data.google_client_config.current.zone + runnerLabel = join(",", var.github_runner_labels) + runnerLabelInstanceTemplate = length(var.github_runner_labels) == 0 ? "" : format("--no-default-labels --labels '%s'", local.runnerLabel) } resource "google_project_service" "compute_api" { @@ -35,14 +35,6 @@ resource "google_project_service" "artifactregistry_api" { service = "artifactregistry.googleapis.com" } -#resource "google_project_service" "cloudscheduler_api" { -# service = "cloudscheduler.googleapis.com" -#} - resource "google_project_service" "cloudtasks_api" { service = "cloudtasks.googleapis.com" } - -#resource "google_project_service" "eventarc_api" { -# service = "eventarc.googleapis.com" -#} diff --git a/runner-autoscaler/README.md b/runner-autoscaler/README.md index 12ad8dd..0ecc37e 100644 --- a/runner-autoscaler/README.md +++ b/runner-autoscaler/README.md @@ -1,18 +1,42 @@ # Autoscaler -This tiny webserver receives GitHub "Workflow jobs" webhook events. Depending on the workflow job state, a compute instance will be started or deleted. -The short timeout of the GitHub webhook (10 sec) has to be worked around (10 sec are not enough to start compute instance) by using a Clout Task queue that calls the webserver back with an increased timeout. 
- | Env | Default | Description | -| --------------------- | --------------- | ------------------------------------------------------------------------------------------- | -| ROUTE_WEBHOOK | /webhook | The Cloud Run path that is invoked by the GitHub webhook | -| ROUTE_DELETE_RUNNER | /delete_runner | The Cloud Run callback path invoked by Cloud Task when a compute instance should be deleted | -| ROUTE_CREATE_RUNNER | /create_runner | The Cloud Run callback path invoked by Cloud Task when a compute instance should be created | -| WEBHOOK_SECRET | arbitrarySecret | The GitHub webhook secret | -| PROJECT_ID | | The Google Cloud Project id | -| ZONE | | The Google Cloud zone where the spot compute instance should be created | -| TASK_QUEUE | | The URL of the Cloud Task queue | -| INSTANCE_TEMPLATE_URL | | The URL of the compute instance template | -| RUNNER_PREFIX | runner | Prefix of the compute instances (a random string will be added to the name) | -| RUNNER_GROUP | Default | The GitHub runner group | -| PORT | 8080 | On which port to bind the webserver | +#### Creates/Deletes VM instances depending on GitHub workflow jobs webhook events + +A webserver is listening for GitHub "Workflow jobs" webhook events. Depending on the workflow job, a VM instance will be either created or deleted. The [10 second timeout](https://docs.github.com/en/webhooks/using-webhooks/best-practices-for-using-webhooks#respond-within-10-seconds) of the GitHub webhook has to be worked around (10 sec are not enough to start a VM instance) by using a Cloud Task queue that calls the webserver back with an increased timeout of 120 seconds. + +### Scaling rules + +> [!IMPORTANT] +> If the scaler is configured incorrectly, this can lead to “dangling” computing instances, resulting in unnecessary costs. + +Following conditions of the workflow job webhook event have to be fulfilled, so a new VM instance will be **created**: + +* The webhook signature is valid (see WEBHOOK_SECRET env). 
+* The webhook `action` value equals `queued`. +* **All** labels of the workflow job match the configured RUNNER_LABELS. + +Following conditions of the workflow job webhook event have to be fulfilled, so an existing VM instance will be **deleted**: + +* The webhook signature is valid (see WEBHOOK_SECRET env). +* The webhook `action` value equals `completed`. +* The webhook `workflow_job.runner_group_name` value equals the configured RUNNER_GROUP. +* **All** labels of the workflow job match the configured RUNNER_LABELS. + +### Configuration + +The scaler is configured via the following environment variables: + +| Env | Default | Description | +| ----------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| ROUTE_WEBHOOK | /webhook | The Cloud Run path that is invoked by the GitHub webhook. Depending on the workflow job, a Cloud Task "delete runner" or "create runner" is enqueued. | +| ROUTE_DELETE_VM | /delete_vm | The Cloud Run callback path invoked by Cloud Task when a VM instance should be **deleted**. The payload contains the name of the "to be deleted" VM instance. | +| ROUTE_CREATE_VM | /create_vm | The Cloud Run callback path invoked by Cloud Task when a VM instance should be **created**. The payload contains the name of the "to be created" VM instance. | +| WEBHOOK_SECRET | | The GitHub webhook secret. This is the secret the webhook has been [configured](https://docs.github.com/en/webhooks/using-webhooks/validating-webhook-deliveries) with. | +| PROJECT_ID | | The Google Cloud Project Id. | +| ZONE | | The Google Cloud zone where the VM instance will be created. | +| TASK_QUEUE | | The relative resource name of the Cloud Task queue. | +| INSTANCE_TEMPLATE | | The relative resource name of the instance template from which the VM instance will be created. 
| +| RUNNER_PREFIX | runner | Prefix for the name of a new VM instance. A random string (10 random lower case characters) will be added to make the name unique: "-". | +| RUNNER_GROUP | Default | The GitHub runner group where the VM instance is expected to join as a self hosted runner. | +| RUNNER_LABELS | self-hosted *(comma separated list)* | Only workflow jobs whose labels match **all** the configured labels will be taken into account. If even one configured label is **not** found in the workflow job, the job is ignored. | +| PORT | 8080 | To which port the webserver is bound. | diff --git a/runner-autoscaler/main.go b/runner-autoscaler/main.go index 5e60f1c..1e176f7 100644 --- a/runner-autoscaler/main.go +++ b/runner-autoscaler/main.go @@ -3,6 +3,7 @@ package main import ( "os" "strconv" + "strings" "github.com/Tereius/g-spot-runner-github-actions/pkg" log "github.com/sirupsen/logrus" @@ -27,20 +28,27 @@ func mustGetEnv(name string) string { func main() { - log.Info("Starting poll server") - + labels := strings.Split(getEnvDefault("RUNNER_LABELS", "self-hosted"), ",") + runnerGroup := getEnvDefault("RUNNER_GROUP", "Default") scaler := pkg.NewAutoscaler(pkg.AutoscalerConfig{ - RouteCreateRunner: getEnvDefault("ROUTE_CREATE_RUNNER", "/create_runner"), - RouteDeleteRunner: getEnvDefault("ROUTE_DELETE_RUNNER", "/delete_runner"), - RouteWebhook: getEnvDefault("ROUTE_WEBHOOK", "/webhook"), - WebhookSecret: getEnvDefault("WEBHOOK_SECRET", "arbitrarySecret"), - ProjectId: mustGetEnv("PROJECT_ID"), - Zone: mustGetEnv("ZONE"), - TaskQueue: mustGetEnv("TASK_QUEUE"), - InstanceTemplateUrl: mustGetEnv("INSTANCE_TEMPLATE_URL"), - RunnerPrefix: getEnvDefault("RUNNER_PREFIX", "runner"), - RunnerGroup: getEnvDefault("RUNNER_GROUP", "Default"), + RouteWebhook: getEnvDefault("ROUTE_WEBHOOK", "/webhook"), + RouteCreateVm: getEnvDefault("ROUTE_CREATE_VM", "/create_vm"), + RouteDeleteVm: getEnvDefault("ROUTE_DELETE_VM", "/delete_vm"), + WebhookSecret: 
getEnvDefault("WEBHOOK_SECRET", ""), + ProjectId: mustGetEnv("PROJECT_ID"), + Zone: mustGetEnv("ZONE"), + TaskQueue: mustGetEnv("TASK_QUEUE"), + InstanceTemplate: mustGetEnv("INSTANCE_TEMPLATE"), + RunnerPrefix: getEnvDefault("RUNNER_PREFIX", "runner"), + RunnerGroup: runnerGroup, + RunnerLabels: labels, }) + + if len(labels) == 0 { + log.Warn("No workflow runner labels were provided. You should at least add the label \"self-hosted\"") + } + port, _ := strconv.Atoi(getEnvDefault("PORT", "8080")) + log.Infof("Starting autoscaler on port %d observing workflow jobs of runner group \"%s\" with labels \"%s\"", port, runnerGroup, strings.Join(labels, ", ")) scaler.Srv(port) } diff --git a/runner-autoscaler/pkg/srv.go b/runner-autoscaler/pkg/srv.go index 379df35..b427223 100644 --- a/runner-autoscaler/pkg/srv.go +++ b/runner-autoscaler/pkg/srv.go @@ -29,6 +29,9 @@ const SHA_PREFIX = "sha256=" const SHA_HEADER = "x-hub-signature-256" const EVENT_HEADER = "x-github-event" +const WEBHOOK_PING_EVENT = "ping" +const WEBHOOK_JOB_EVENT = "workflow_job" + type Job struct { Id int64 `json:"id"` Name string `json:"name"` @@ -53,6 +56,18 @@ func (j Job) hasLabel(label string) bool { return false } +// returns true if all labels were found. false otherwise. 
Returns also all labels that were missing +func (j Job) hasAllLabels(labels []string) (bool, []string) { + + missingLabels := []string{} + for _, label := range labels { + if !j.hasLabel(label) { + missingLabels = append(missingLabels, label) + } + } + return len(missingLabels) <= 0, missingLabels +} + type Action string const ( @@ -249,9 +264,9 @@ func (s *Autoscaler) createInstanceFromTemplate(ctx context.Context, instanceNam InstanceResource: &computepb.Instance{ Name: proto.String(instanceName), }, - SourceInstanceTemplate: &s.conf.InstanceTemplateUrl, + SourceInstanceTemplate: &s.conf.InstanceTemplate, }); err != nil { - log.Errorf("Could not create instance %s from template: %s - %s", instanceName, s.conf.InstanceTemplateUrl, err.Error()) + log.Errorf("Could not create instance %s from template: %s - %s", instanceName, s.conf.InstanceTemplate, err.Error()) return err } else { if err := res.Wait(ctx); err != nil { @@ -267,18 +282,15 @@ func (s *Autoscaler) createInstanceFromTemplate(ctx context.Context, instanceNam func (s *Autoscaler) createCallbackTaskWithToken(ctx context.Context, url, message string) (*taskspb.Task, error) { now := timestamppb.Now() - now.Seconds += 1 - // Build the Task payload. - // https://godoc.org/google.golang.org/genproto/googleapis/cloud/tasks/v2#CreateTaskRequest + now.Seconds += 1 // delay the callback a little bit req := &taskspb.CreateTaskRequest{ Parent: s.conf.TaskQueue, Task: &taskspb.Task{ DispatchDeadline: &durationpb.Duration{ - Seconds: 120, + Seconds: 120, // the timeout of the cloud task callback Nanos: 0, }, ScheduleTime: now, - // https://godoc.org/google.golang.org/genproto/googleapis/cloud/tasks/v2#HttpRequest MessageType: &taskspb.Task_HttpRequest{ HttpRequest: &taskspb.HttpRequest{ HttpMethod: taskspb.HttpMethod_POST, @@ -291,43 +303,33 @@ func (s *Autoscaler) createCallbackTaskWithToken(ctx context.Context, url, messa }, } - // Add a payload message if one is present. 
req.Task.GetHttpRequest().Body = []byte(message) createdTask, err := s.t.CreateTask(ctx, req) if err != nil { - return nil, fmt.Errorf("cloudtasks.CreateTask: %w", err) + return nil, fmt.Errorf("cloudtasks.CreateTask failed: %v", err) } else { - log.Info("Created callback task") + log.Infof("Created cloud task callback with url \"%s\" and payload \"%s\"", url, message) } return createdTask, nil } -func (s *Autoscaler) handleCreateRunner(ctx *gin.Context) { +func (s *Autoscaler) handleCreateVm(ctx *gin.Context) { - log.Info("Received handleCreateRunner call") + log.Info("Received create-vm cloud task callback") if data, err := s.verifySignature(ctx); err == nil { if err := s.createInstanceFromTemplate(ctx, string(data)); err != nil { ctx.AbortWithError(http.StatusInternalServerError, err) } else { ctx.Status(http.StatusOK) - /* - delteUrl := createCallbackUrl(ctx, s.conf.RouteCreateRunner) - if _, err := s.createCallbackTaskWithToken(ctx, delteUrl, runnerName); err != nil { - log.Errorf("Immediately delete instance \"%s\" again because callback could not be created", runnerName) - s.deleteInstance(context.Background(), runnerName) // Ignore timeous, make sure the spot instance gets destroyed - ctx.AbortWithError(http.StatusInternalServerError, err) - } else { - ctx.Status(http.StatusOK) - }*/ } } } -func (s *Autoscaler) handleDeleteRunner(ctx *gin.Context) { +func (s *Autoscaler) handleDeleteVm(ctx *gin.Context) { - log.Info("Received handleDeleteRunner call") + log.Info("Received delete-vm cloud task callback") if data, err := s.verifySignature(ctx); err == nil { if err := s.deleteInstance(ctx, string(data)); err != nil { ctx.AbortWithError(http.StatusInternalServerError, err) @@ -339,65 +341,67 @@ func (s *Autoscaler) handleDeleteRunner(ctx *gin.Context) { func (s *Autoscaler) handleWebhook(ctx *gin.Context) { - log.Info("Received webhook call") + log.Info("Received webhook") if data, err := s.verifySignature(ctx); err == nil { event := ctx.GetHeader(EVENT_HEADER) 
- log.Info(ctx.Request.Header) log.Info(string(data)) - if event == "ping" { - log.Info("Received ping") + if event == WEBHOOK_PING_EVENT { + log.Info("Webhook ping acknowledged") ctx.Status(http.StatusOK) - } else if event == "workflow_job" { + } else if event == WEBHOOK_JOB_EVENT { payload := Payload{} if err := json.Unmarshal(data, &payload); err != nil { - log.Errorf("Can not unmarshal payload: %s", err.Error()) + log.Errorf("Can not unmarshal payload - is the webhook content type set to \"application/json\"? %s", err.Error()) ctx.AbortWithError(http.StatusBadRequest, err) } else { if payload.Action == QUEUED { - if payload.Job.hasLabel("self-hosted") { - createUrl := createCallbackUrl(ctx, s.conf.RouteCreateRunner) - log.Infof("About to create new instance callback task with url: %s", createUrl) + if ok, missingLabels := payload.Job.hasAllLabels(s.conf.RunnerLabels); ok { + createUrl := createCallbackUrl(ctx, s.conf.RouteCreateVm) if _, err := s.createCallbackTaskWithToken(ctx, createUrl, fmt.Sprintf("%s-%s", s.conf.RunnerPrefix, randStringRunes(10))); err != nil { - log.Errorf("Can not create callback: %s", err.Error()) + log.Errorf("Can not enqueue create-vm cloud task callback: %s", err.Error()) + ctx.AbortWithError(http.StatusInternalServerError, err) + return } } else { - log.Info("Webhook requested to start a runner that is not self-hosted - ignoring") + log.Warnf("Webhook requested to start a runner that is missing the label(s) \"%s\" - ignoring", strings.Join(missingLabels, ", ")) } } else if payload.Action == COMPLETED { if payload.Job.RunnerGroupName == s.conf.RunnerGroup { - if strings.HasPrefix(payload.Job.RunnerName, s.conf.RunnerPrefix) { - deleteUrl := createCallbackUrl(ctx, s.conf.RouteDeleteRunner) - log.Infof("About to create delete callback task with url: %s", deleteUrl) + if ok, missingLabels := payload.Job.hasAllLabels(s.conf.RunnerLabels); ok { + deleteUrl := createCallbackUrl(ctx, s.conf.RouteDeleteVm) if _, err := 
s.createCallbackTaskWithToken(ctx, deleteUrl, payload.Job.RunnerName); err != nil { - log.Errorf("Can not create callback: %s", err.Error()) + log.Errorf("Can not enqueue delete-vm cloud task callback: %s", err.Error()) + ctx.AbortWithError(http.StatusInternalServerError, err) + return } } else { - log.Warnf("Webhook requested to stop a runner that does not start with the expected runner prefix (expected \"%s\" got \"%s\") - ignoring", s.conf.RunnerPrefix, payload.Job.RunnerName) + log.Warnf("Webhook signaled to delete a runner that is missing the label(s) \"%s\" - ignoring", strings.Join(missingLabels, ", ")) } } else { - log.Warnf("Webhook requested to stop a runner that does not belong to the expected runner group (expected \"%s\" got \"%s\") - ignoring", s.conf.RunnerGroup, payload.Job.RunnerGroupName) + log.Warnf("Webhook signaled to delete a runner that does not belong to the expected runner group (expected \"%s\" got \"%s\") - ignoring", s.conf.RunnerGroup, payload.Job.RunnerGroupName) } } ctx.Status(http.StatusOK) } } else { - log.Warnf("Unknown GitHub event \"%s\" received - ignored", event) + log.Infof("Unknown GitHub webhook event \"%s\" received - ignoring", event) ctx.Status(http.StatusOK) } } } type AutoscalerConfig struct { - RouteCreateRunner string - RouteDeleteRunner string - RouteWebhook string - WebhookSecret string - ProjectId string - Zone string - TaskQueue string - InstanceTemplateUrl string - RunnerPrefix string - RunnerGroup string + RouteWebhook string + RouteCreateVm string + RouteDeleteVm string + WebhookSecret string + ProjectId string + Zone string + TaskQueue string + InstanceTemplate string + RunnerPrefix string + RunnerGroup string + RunnerLabels []string } type Autoscaler struct { @@ -417,8 +421,8 @@ func NewAutoscaler(config AutoscalerConfig) Autoscaler { conf: config, } engine.Use(ginlogrus.Logger(log.WithFields(log.Fields{}))) - engine.POST(config.RouteCreateRunner, scaler.handleCreateRunner) - 
engine.POST(config.RouteDeleteRunner, scaler.handleDeleteRunner) + engine.POST(config.RouteCreateVm, scaler.handleCreateVm) + engine.POST(config.RouteDeleteVm, scaler.handleDeleteVm) engine.POST(config.RouteWebhook, scaler.handleWebhook) engine.GET("/healthcheck", func(ctx *gin.Context) { ctx.Status(http.StatusOK) }) return scaler diff --git a/runner-autoscaler/test/main_test.go b/runner-autoscaler/test/main_test.go index b563848..e06ef9f 100644 --- a/runner-autoscaler/test/main_test.go +++ b/runner-autoscaler/test/main_test.go @@ -17,20 +17,20 @@ var PORT = 9999 func init() { scaler := pkg.NewAutoscaler(pkg.AutoscalerConfig{ - RouteCreateRunner: "/create", - RouteDeleteRunner: "/delete", - RouteWebhook: "/webhook", - WebhookSecret: "It's a Secret to Everybody", - ProjectId: "1", - Zone: "z", - TaskQueue: "q", - InstanceTemplateUrl: "/", - RunnerPrefix: "runner", + RouteCreateVm: "/create", + RouteDeleteVm: "/delete", + RouteWebhook: "/webhook", + WebhookSecret: "It's a Secret to Everybody", + ProjectId: "1", + Zone: "z", + TaskQueue: "q", + InstanceTemplate: "/", + RunnerPrefix: "runner", }) go scaler.Srv(PORT) } -func Test(t *testing.T) { +func TestWebhookSignature(t *testing.T) { ctx, _ := context.WithTimeout(context.Background(), 5*time.Second) req, _ := http.NewRequestWithContext(ctx, "POST", fmt.Sprintf("http://localhost:%d/webhook", PORT), strings.NewReader("Hello, World!")) req.Header.Add("x-hub-signature-256", "sha256=757107ea0eb2509fc211221cce984b8a37570b6d7586c22c46f4379c8b043e17") diff --git a/variables.tf b/variables.tf index 3989393..23f6a28 100644 --- a/variables.tf +++ b/variables.tf @@ -1,24 +1,30 @@ -variable "spot_machine_type" { +variable "machine_type" { type = string - description = "The machine type that each spot agent will use" + description = "The VM instance machine type where the GitHub runner will run on" default = "e2-micro" } -variable "spot_machine_image" { +variable "machine_image" { type = string - description = "The machine Linux 
image to run (gcloud compute images list --filter ubuntu-os)" + description = "The VM instance boot image (gcloud compute images list --filter ubuntu-os). Only Linux is supported." default = "ubuntu-os-cloud/ubuntu-minimal-2004-lts" } +variable "machine_preemtible" { + type = bool + description = "The VM instance will be a preemptible spot instance that costs much less but may be stopped by GCP at any time (leading to a failed workflow job)." + default = true +} + variable "enable_ssh" { type = bool - description = "Enable SSH access" + description = "Enable SSH access to the VM instances" default = false } variable "use_cloud_nat" { type = bool - description = "Use a cloud nat and router instead of a public ip address for the compute instances" + description = "Use a cloud NAT and router instead of a public ip address for the VM instances" default = false } @@ -45,9 +51,19 @@ variable "github_runner_group" { default = "Default" } +variable "github_runner_labels" { + type = list(string) + description = "One or multiple labels the runner will be tagged with" + default = ["self-hosted"] + validation { + condition = length(var.github_runner_labels) > 0 + error_message = "The variable github_runner_labels must contain at least one non-empty value!" + } +} + variable "github_runner_prefix" { type = string - description = "The name prefix of the runner (a random string will be automatically added to make the name unique)." 
default = "runner" } diff --git a/vpc.tf b/vpc.tf index c2920e1..652f510 100644 --- a/vpc.tf +++ b/vpc.tf @@ -1,13 +1,13 @@ resource "google_compute_network" "vpc_network" { name = "spot-runner-network" - description = "The network the spot runner will join" + description = "The network the ephemeral GitHub runner instances will join" auto_create_subnetworks = false depends_on = [google_project_service.compute_api] } resource "google_compute_subnetwork" "subnetwork" { name = "spot-runner-subnetwork" - description = "The subnetwork the spot runner will join" + description = "The subnetwork the ephemeral GitHub runner instances will join" ip_cidr_range = "10.0.1.0/24" network = google_compute_network.vpc_network.name private_ip_google_access = true