Improve documentation, introduce 'machine_preemtible' variable
Tereius committed Jul 30, 2024
1 parent e387178 commit 2fb0151
Showing 10 changed files with 194 additions and 146 deletions.
36 changes: 22 additions & 14 deletions README.md
@@ -2,7 +2,7 @@

[![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/Privatehive/g-spot-runner-github-actions/main.yml?branch=master&style=flat&logo=github&label=Docker+build)](https://github.com/Privatehive/g-spot-runner-github-actions/actions?query=branch%3Amaster)

**This Terraform module provides a ready-to-use solution for Google Cloud hosted [GitHub ephemeral runners](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling). To save cost, preemptible spot compute instances will be used.**
**This Terraform module provides a ready-to-use solution for Google Cloud hosted [GitHub ephemeral runners](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/autoscaling-with-self-hosted-runners#using-ephemeral-runners-for-autoscaling).**

> [!IMPORTANT]
> I am not responsible if this Terraform module results in high costs on your billing account. Keep an eye on your billing account and activate alerts!
@@ -20,7 +20,7 @@ module "spot-runner" {
github_organization = "<the_organization>"
github_runner_group = "Default"
github_runner_prefix = "runner"
spot_machine_type = "c2d-highcpu-8"
machine_type = "c2d-highcpu-8"
}
provider "google" {
@@ -34,10 +34,11 @@ output "runner_webhook_config" {
}
```

Authenticate with `gcloud` and apply the Terraform module (apply twice if the first apply results in an error).
Authenticate with `gcloud` and apply the Terraform module. If the first apply results in an error, wait a few minutes and apply again.

``` bash
$ terraform init && terraform apply
$ gcloud auth application-default login --project <gcp_project>
$ terraform init -upgrade && terraform apply
```

Have a look at the Terraform output `runner_webhook_config`. There you will find the Cloud Run webhook URL and secret. Now switch to your GitHub organization settings and create a new webhook:
@@ -52,19 +53,26 @@ Have a look at the Terraform output `runner_webhook_config`. There you find the

That's it.

As soon as you start a GitHub workflow that contains a job with `runs-on: self-hosted`, a compute instance (with the specified `spot_machine_type`) starts. The name of the compute instance starts with the `github_runner_prefix`, followed by a random string. The name of the compute instance is also the name of the runner in GitHub. After the job has completed, the compute instance is deleted again.
As soon as you start a GitHub workflow that contains a job with `runs-on: self-hosted`, a VM instance (with the specified `machine_type`) starts. The name of the VM instance starts with the `github_runner_prefix`, followed by a random string that makes the name unique. The name of the VM instance is also the name of the runner in the GitHub runner group. After the workflow job has completed, the VM instance is deleted again.

## Configuration

See the `variables.tf` file for how to configure the Terraform module.

> [!TIP]
> To find the cheapest VM `machine_type`, use this [table](https://gcloud-compute.com/instances.html) and sort by Spot instance cost. Remember that the price varies by region.

## How it works

1. As soon as a new GitHub workflow job is queued, the GitHub webhook event "Workflow jobs" invokes the Cloud Run [container](https://github.com/Privatehive/g-spot-runner-github-actions/pkgs/container/runner-autoscaler) with path `/webhook`.
2. Cloud Run enqueues a "create runner" Cloud Task. This is necessary because a GitHub webhook times out after only 10 seconds, but starting a compute instance takes about 1 minute.
3. The Cloud Task invokes the Cloud Run path `/create_runner`.
4. Cloud Run creates the preemptible spot compute instance from the instance template.
5. In the startup script, the compute instance uses the PAT to generate a runner token. With the token, it registers itself as an ephemeral runner in the runner group and immediately starts working on the workflow job.
2. Cloud Run enqueues a "create-vm" Cloud Task. This is necessary because a GitHub webhook times out after only 10 seconds, but starting a VM instance takes about 1 minute.
3. The Cloud Task invokes the Cloud Run path `/create_vm`.
4. Cloud Run creates the VM instance from the instance template (a preemptible spot VM instance by default).
5. In the startup script of the VM instance, the PAT is used to generate a runner token. With the token, the VM registers itself as an **ephemeral** runner in the runner group and immediately starts working on the workflow job.
6. As soon as the workflow job has completed, the GitHub webhook event "Workflow jobs" invokes the Cloud Run container again.
7. Cloud Run enqueues a "delete runner" Cloud Task. This is necessary because a GitHub webhook times out after only 10 seconds, but deleting a compute instance takes about 1 minute.
8. The Cloud Task invokes the Cloud Run path `/delete_runner`.
9. Cloud Run deletes the compute instance.
7. Cloud Run enqueues a "delete-vm" Cloud Task. This is necessary because a GitHub webhook times out after only 10 seconds, but deleting a VM instance takes about 1 minute.
8. The Cloud Task invokes the Cloud Run path `/delete_vm`.
9. Cloud Run deletes the VM instance.
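
The hand-off between the webhook and the Cloud Task can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the autoscaler's actual code: the in-memory `deque` stands in for the Cloud Tasks queue, `handle_webhook` and `TASK_ROUTES` are hypothetical names, the HTTP status codes are illustrative, and reading the VM name from `workflow_job.runner_name` on completion is an assumption based on the runner and the VM sharing one name.

```python
from collections import deque

# Hypothetical mapping from the webhook "action" to the Cloud Run callback path.
TASK_ROUTES = {"queued": "/create_vm", "completed": "/delete_vm"}


def handle_webhook(event, task_queue):
    """Enqueue a task and return at once, well inside GitHub's 10 second limit."""
    route = TASK_ROUTES.get(event.get("action"))
    if route is None:
        return 204  # e.g. "in_progress": nothing to do
    # Assumption: on completion the runner name identifies the VM to delete.
    # For a queued job no runner exists yet, so the name stays None here.
    vm_name = event.get("workflow_job", {}).get("runner_name")
    task_queue.append({"route": route, "vm_name": vm_name})
    return 202  # the slow VM operation happens later, in the task callback


queue = deque()  # stand-in for the Cloud Tasks queue
status = handle_webhook({"action": "queued", "workflow_job": {}}, queue)
print(status, queue[0]["route"])
```

The point of the split is that the webhook handler only does cheap work, while the roughly one-minute VM operation runs in the task callback with its longer timeout.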

## Troubleshooting

@@ -79,6 +87,6 @@ Error applying IAM policy for cloudrun service "v1/projects/azure-pipelines-spot

2. Solution: Override the Organization Policy "Domain Restricted Sharing" in the project, by setting it to "Allow all".

#### New compute Instance not starting (but a lot of instances are already running)
#### New VM Instance not starting (but a lot of instances are already running)

You exceeded your project's vCPU limit for the machine type in the region. You may find an error log message in the Cloud Run logs stating `Machine Type vCPU quota exceeded for region`. Request a quota increase for the project from Google customer support.
You exceeded your project's vCPU limit for the machine type in the region or for all regions. You may find an error log message in the Cloud Run logs stating `Machine Type vCPU quota exceeded for region`. Request a quota increase for the project from Google customer support.
16 changes: 6 additions & 10 deletions cloudRun.tf
@@ -11,7 +11,7 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" {

template {
service_account = google_service_account.agent_autoscaler.email
max_instance_request_concurrency = 10
max_instance_request_concurrency = 20
timeout = "120s"
scaling {
min_instance_count = 0
@@ -32,7 +32,7 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" {
value = google_cloud_tasks_queue.agent_autoscaler_tasks.id
}
env {
name = "INSTANCE_TEMPLATE_URL"
name = "INSTANCE_TEMPLATE"
value = google_compute_instance_template.spot_instance.id
}
env {
@@ -43,6 +43,10 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" {
name = "RUNNER_GROUP"
value = var.github_runner_group
}
env {
name = "RUNNER_LABELS"
value = local.runnerLabel
}
env {
name = "WEBHOOK_SECRET"
value = random_password.webhook_secret.result
@@ -51,14 +55,6 @@ resource "google_cloud_run_v2_service" "agent_autoscaler" {
name = "ROUTE_WEBHOOK"
value = local.webhookUrl
}
env {
name = "ROUTE_CREATE_RUNNER"
value = local.webhookCreateRunner
}
env {
name = "ROUTE_DELETE_RUNNER"
value = local.webhookDeleteRunner
}
dynamic "env" {
for_each = var.enable_debug ? [0] : []
content {
14 changes: 7 additions & 7 deletions compute.tf
@@ -1,23 +1,23 @@
resource "google_compute_instance_template" "spot_instance" {

name = "ephemeral-runner"
name = "ephemeral-github-runner"
region = local.region
machine_type = var.spot_machine_type
tags = ["http-egress", "ssh-ingress"]
machine_type = var.machine_type
tags = var.enable_ssh ? ["http-egress", "ssh-ingress"] : ["http-egress"]
depends_on = [google_project_service.compute_api]

scheduling {
preemptible = true
preemptible = var.machine_preemtible
automatic_restart = false
on_host_maintenance = "TERMINATE"
instance_termination_action = "STOP"
provisioning_model = "SPOT"
provisioning_model = var.machine_preemtible ? "SPOT" : "STANDARD"
}

disk {
auto_delete = true
boot = true
source_image = var.spot_machine_image
source_image = var.machine_image
disk_type = "pd-standard"
disk_size_gb = 40
}
@@ -49,7 +49,7 @@ chown -R agent:agent /home/agent
pushd /home/agent
sudo -u agent tar zxf /tmp/agent.tar.gz
register_token=$(curl -s -L -X POST -H "Accept: application/vnd.github+json" -H "Authorization: Bearer ${var.github_pat_token}" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/orgs/${var.github_organization}/actions/runners/registration-token | jq -r .token)
sudo -u agent ./config.sh --unattended --disableupdate --ephemeral --name $(hostname) --url 'https://github.com/${var.github_organization}' --token $${register_token} --runnergroup '${var.github_runner_group}' || shutdown now
sudo -u agent ./config.sh --unattended --disableupdate --ephemeral --name $(hostname) ${local.runnerLabelInstanceTemplate} --url 'https://github.com/${var.github_organization}' --token $${register_token} --runnergroup '${var.github_runner_group}' || shutdown now
./bin/installdependencies.sh || shutdown now
./svc.sh install agent || shutdown now
./svc.sh start || shutdown now
22 changes: 7 additions & 15 deletions main.tf
@@ -14,13 +14,13 @@ data "google_project" "current" {
}

locals {
projectId = data.google_client_config.current.project
projectNumber = data.google_project.current.number
region = data.google_client_config.current.region
zone = data.google_client_config.current.zone
webhookUrl = "/webhook"
webhookCreateRunner = "/create_runner"
webhookDeleteRunner = "/delete_runner"
webhookUrl = "/webhook"
projectId = data.google_client_config.current.project
projectNumber = data.google_project.current.number
region = data.google_client_config.current.region
zone = data.google_client_config.current.zone
runnerLabel = join(",", var.github_runner_labels)
runnerLabelInstanceTemplate = length(var.github_runner_labels) == 0 ? "" : format("--no-default-labels --labels '%s'", local.runnerLabel)
}

resource "google_project_service" "compute_api" {
@@ -35,14 +35,6 @@ resource "google_project_service" "artifactregistry_api" {
service = "artifactregistry.googleapis.com"
}

#resource "google_project_service" "cloudscheduler_api" {
# service = "cloudscheduler.googleapis.com"
#}

resource "google_project_service" "cloudtasks_api" {
service = "cloudtasks.googleapis.com"
}

#resource "google_project_service" "eventarc_api" {
# service = "eventarc.googleapis.com"
#}
56 changes: 40 additions & 16 deletions runner-autoscaler/README.md
@@ -1,18 +1,42 @@
# Autoscaler

This tiny webserver receives GitHub "Workflow jobs" webhook events. Depending on the workflow job state, a compute instance will be started or deleted.
The short timeout of the GitHub webhook (10 seconds) is not enough to start a compute instance. It is worked around by using a Cloud Task queue that calls the webserver back with an increased timeout.

| Env | Default | Description |
| --------------------- | --------------- | ------------------------------------------------------------------------------------------- |
| ROUTE_WEBHOOK | /webhook | The Cloud Run path that is invoked by the GitHub webhook |
| ROUTE_DELETE_RUNNER | /delete_runner | The Cloud Run callback path invoked by Cloud Task when a compute instance should be deleted |
| ROUTE_CREATE_RUNNER | /create_runner | The Cloud Run callback path invoked by Cloud Task when a compute instance should be created |
| WEBHOOK_SECRET | arbitrarySecret | The GitHub webhook secret |
| PROJECT_ID | | The Google Cloud Project id |
| ZONE | | The Google Cloud zone where the spot compute instance should be created |
| TASK_QUEUE | | The URL of the Cloud Task queue |
| INSTANCE_TEMPLATE_URL | | The URL of the compute instance template |
| RUNNER_PREFIX | runner | Prefix of the compute instances (a random string will be added to the name) |
| RUNNER_GROUP | Default | The GitHub runner group |
| PORT | 8080 | On which port to bind the webserver |
#### Creates/Deletes VM instances depending on GitHub workflow jobs webhook events

A webserver listens for GitHub "Workflow jobs" webhook events. Depending on the workflow job, a VM instance is either created or deleted. The [10 second timeout](https://docs.github.com/en/webhooks/using-webhooks/best-practices-for-using-webhooks#respond-within-10-seconds) of the GitHub webhook is not enough to start a VM instance. It is worked around by using a Cloud Task queue that calls the webserver back with an increased timeout of 120 seconds.

### Scaling rules

> [!IMPORTANT]
> If the scaler is configured incorrectly, this can lead to “dangling” VM instances, resulting in unnecessary costs.

The following conditions of the workflow job webhook event must be fulfilled for a new VM instance to be **created**:

* The webhook signature is valid (see WEBHOOK_SECRET env).
* The webhook `action` value equals `queued`.
* **All** labels of the workflow job match the configured RUNNER_LABELS.

The following conditions of the workflow job webhook event must be fulfilled for an existing VM instance to be **deleted**:

* The webhook signature is valid (see WEBHOOK_SECRET env).
* The webhook `action` value equals `completed`.
* The webhook `workflow_job.runner_group_name` value equals the configured RUNNER_GROUP.
* **All** labels of the workflow job match the configured RUNNER_LABELS.
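
The two rule sets above can be sketched as a single decision function. This is a sketch under stated assumptions rather than the scaler's real implementation: `signature_is_valid`, `labels_match`, and `decide` are hypothetical names, the signature is assumed to arrive in GitHub's `X-Hub-Signature-256` header, and "all labels match" is read the way the configuration table words it, meaning every configured label must appear among the job's labels.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, signature_header):
    """Compare the X-Hub-Signature-256 header with an HMAC-SHA256 of the raw body."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest("sha256=" + digest, signature_header or "")


def labels_match(job_labels, runner_labels):
    """Assumed reading: every configured label must appear among the job's labels."""
    return set(runner_labels) <= set(job_labels)


def decide(secret, body, signature_header, runner_group, runner_labels):
    """Return "create", "delete", or None for one workflow job webhook delivery."""
    if not signature_is_valid(secret, body, signature_header):
        return None
    event = json.loads(body)
    job = event.get("workflow_job", {})
    if not labels_match(job.get("labels", []), runner_labels):
        return None
    action = event.get("action")
    if action == "queued":
        return "create"
    if action == "completed" and job.get("runner_group_name") == runner_group:
        return "delete"
    return None


# Usage: sign a sample payload the way GitHub would, then evaluate it.
body = json.dumps({"action": "queued", "workflow_job": {"labels": ["self-hosted"]}}).encode()
header = "sha256=" + hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
print(decide("s3cret", body, header, "Default", ["self-hosted"]))
```

Note that only the delete path checks the runner group, which mirrors the lists above: a queued job has not been assigned to a runner yet.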

### Configuration

The scaler is configured via the following environment variables:

| Env | Default | Description |
| ----------------- | ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ROUTE_WEBHOOK | /webhook | The Cloud Run path that is invoked by the GitHub webhook. Depending on the workflow job, a Cloud Task "delete runner" or "create runner" is enqueued. |
| ROUTE_DELETE_VM | /delete_vm | The Cloud Run callback path invoked by Cloud Task when a VM instance should be **deleted**. The payload contains the name of the "to be deleted" VM instance. |
| ROUTE_CREATE_VM | /create_vm | The Cloud Run callback path invoked by Cloud Task when a VM instance should be **created**. The payload contains the name of the "to be created" VM instance. |
| WEBHOOK_SECRET | | The GitHub webhook secret. This is the secret the webhook has been [configured](https://docs.github.com/en/webhooks/using-webhooks/validating-webhook-deliveries) with. |
| PROJECT_ID | | The Google Cloud Project Id. |
| ZONE | | The Google Cloud zone where the VM instance will be created. |
| TASK_QUEUE | | The relative resource name of the Cloud Task queue. |
| INSTANCE_TEMPLATE | | The relative resource name of the instance template from which the VM instance will be created. |
| RUNNER_PREFIX     | runner                               | Prefix for the name of a new VM instance. A random string (10 random lower case characters) will be added to make the name unique: "<prefix>-<random_string>".                         |
| RUNNER_GROUP      | Default                              | The GitHub runner group where the VM instance is expected to join as a self-hosted runner.                                                                                            |
| RUNNER_LABELS     | self-hosted *(comma separated list)* | Only workflow jobs whose labels match **all** the configured labels will be taken into account. If even one configured label is **not** found in the workflow job, the job is ignored. |
| PORT | 8080 | To which port the webserver is bound. |
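
How a process might consume some of these variables can be sketched as follows. This is an illustrative sketch, not the scaler's code: `load_config` and `new_vm_name` are hypothetical helpers, the defaults are copied from the table, trimming whitespace around labels is an assumption, and "lower case characters" is assumed to mean the letters a to z.

```python
import os
import secrets
import string


def load_config(env=os.environ):
    """Read a subset of the documented variables, applying the documented defaults."""
    labels = env.get("RUNNER_LABELS", "self-hosted")
    return {
        "runner_prefix": env.get("RUNNER_PREFIX", "runner"),
        "runner_group": env.get("RUNNER_GROUP", "Default"),
        # Comma separated list; surrounding whitespace is trimmed (assumption).
        "runner_labels": [part.strip() for part in labels.split(",") if part.strip()],
        "port": int(env.get("PORT", "8080")),
    }


def new_vm_name(prefix, length=10):
    """Build "<prefix>-<random_string>" from random lower case letters."""
    suffix = "".join(secrets.choice(string.ascii_lowercase) for _ in range(length))
    return f"{prefix}-{suffix}"


config = load_config({"RUNNER_LABELS": "self-hosted, spot"})
print(config["runner_labels"], new_vm_name(config["runner_prefix"]))
```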