Bugfix/simplify autoscaling #213

aidanrussell · 2025-01-15T14:53:08Z

This started as a quick bugfix but grew into a larger piece of work to simplify how the alarms and autoscaling work, because the prior implementation was confusing and hard to debug.

The original bug was that scale-down based on backlog was not working at all; scale-down by CPU was being applied instead, which is why the problem had not been noticed. This broke when Llama was added, because it uses ~7% CPU at rest against a 5% CPU threshold, so it failed to scale down even with no backlog.

In the new implementation, scaling is split into one set of logic for 0-1 instances and another for anything above 1 instance. This means the rules do not overlap: scale up/down on backlog applies to 0-1 only, and scale up/down on CPU/GPU applies to 1+ only. Furthermore, where possible, alarms are created that trigger one action in the alarm state and the opposing action in the non-alarm (OK) state. This helps ensure consistency, as there is only one rule to edit instead of two (in the end this could only be done for the backlog alarm and not the others).
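
To make the split easier to review, here is a rough Python sketch of the decision logic as described above. It is not the Terraform in this PR; the function name, the `Metrics` type, and the threshold constants are illustrative assumptions (the 5% figure echoes the CPU threshold mentioned above, while the other values are placeholders).

```python
from dataclasses import dataclass

# Illustrative values only; the real thresholds live in the Terraform config.
BACKLOG_THRESHOLD = 1   # assumed: queued requests that put the backlog alarm in ALARM
HIGH_UTIL_PCT = 70.0    # assumed scale-up threshold for CPU/GPU
LOW_UTIL_PCT = 5.0      # scale-down threshold for CPU/GPU
MAX_INSTANCES = 2       # assumed upper bound on instance count


@dataclass
class Metrics:
    backlog: int
    cpu_pct: float
    gpu_pct: float


def desired_instances(current: int, m: Metrics) -> int:
    """Return the target instance count after one evaluation."""
    backlog_alarm = m.backlog >= BACKLOG_THRESHOLD

    # 0-1 range: only the backlog alarm applies. ALARM scales 0 -> 1,
    # and the opposing OK state scales 1 -> 0. CPU/GPU are ignored here,
    # so a model idling above the CPU threshold can still reach zero.
    if current == 0:
        return 1 if backlog_alarm else 0
    if current == 1 and not backlog_alarm:
        return 0

    # 1+ range: only CPU/GPU utilisation applies.
    util = max(m.cpu_pct, m.gpu_pct)
    if util > HIGH_UTIL_PCT:
        return min(current + 1, MAX_INSTANCES)
    if current > 1 and util < LOW_UTIL_PCT:
        return current - 1  # never drops below 1; 1 -> 0 is the backlog rule
    return current
```

With this shape, the Llama case from the original bug (one instance, no backlog, ~7% CPU at rest) resolves to zero instances, because the CPU rule is never consulted in the 0-1 range.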

infra/sagemaker_llm_resources.tf

* sagemaker domain and two async endpoints * Sagemaker endpoints policies * Add sagemaker ECR repo * Sagemaker tools permissions * Variables * Fix * Fix * Fix * fix: tighten IAM permissions for sagemaker * fix:adjust autoscaling policies * fix: adjust sagemaker permissions to enable ECR access * fix: adjust permissions to enable SageMaker to retrieve ECR images * fix: adjust iam polciies to enable access to sagemaker bucket * Ensuring instance spins down to 0 * Modular changes working Modularisation complete; fixed some minor bugs too that led to CPU util alarm being defunct for utilization over 70%. Can start considering moving all models to this methodology going forwards. * Updated to modularise all of sagemaker * Further modularisation for models Making the models easier to redeploy with simplier blocks. Removed redudant code configs. * Updated Alarm params Updated to ensure longer time up and longer time to scale down for improving useability * Updated alarms and policies * Corrections * remove duplicated module * terraform fmt * Cost monitoring functions * tidy up * Added env var + fixed dashboard As above * recoverd * Compliance & alerting Comliance, plus unfinished work on S3 bucket for logs - outstanding one more compliance metric but PR for now. * feat: sagemaker outputs are moved to user's theia space (#180) * S3 bucket and alarms added * Updated with iteration & fixed policies Noted policies accidentally truncated. Made this iterative, updated code, too. * Feat/sagemaker llms update main (#171) * feat: allow multiple secrets for Airflow teams This gives Airflow teams access to a "_2" secret. 
This is to work around the limitation that an AWS Secret has a max size of 64KB * Added ability for dag processors to fetch from external buckets * Enable intelligent tiering on mirror bucket for objects > 128KB * add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * fixup! add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * remove policy change related to put object, as it will be dealt with in a different PR * chore(deps): bump cross-spawn from 7.0.3 to 7.0.6 Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6. - [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md) - [Commits](moxystudio/node-cross-spawn@v7.0.3...v7.0.6) --- updated-dependencies: - dependency-name: cross-spawn dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * feat: add lifecycle policies to all ECR repos, except visualisation-base images Because we only used tagged images in ECS, to reduce costs and to avoid alerts for vulnrabilities that have since been addressed, we should be able to safely delete untagged images. The exception are the various visualisation-base images which we do (for now) use untagged, although this is being changed. 
* Add sagemaker ECR repo * terraform fmt --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Michal Charemza <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: sekharpanja <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sophie Glinton <[email protected]> * chore: run terraform fmt with -recursive option (#184) * chore: run terraform fmt with -recursive option * check on each PR * WIP - slack integration Remainder task of updating the other model and then resolving changes with latest PRs * Bugfix/repair errors from recent merges (#186) * testing current version * latest: * minor fixes * Correcting mapping var names * Correction of lambda * Formatted * Minor update * Revert "Feat/sagemaker llms 407 slack" (#193) * Feat/use jumpstart streamlined for new llm options (#192) * chore: run terraform fmt with -recursive option * check on each PR * wip * local * latest wip * all working * formatting * modify * restore status after problematic merge * latest * formatting * Feat/reduced new llm options (#197) * chore: run terraform fmt with -recursive option * check on each PR * wip * local * latest wip * all working * formatting * modify * restore status after problematic merge * latest * formatting * add mistral 7b * latest models * formatting * latest * formatting * restore 407 slack alarm changes (#195) * restore 407 slack alarms following revert * terraform fmt * add mistral * Feat/further expand new llm options (#196) * add gemma 2 27b * latest: * latest work * latest work to get back closer to last working state (#205) * latest work to get back closer to last working state * latest * latest * latest * latest * allow terraform destroy and then apply to work as expected (#207) * ensure data sources are not destroyed * update all s3, ecr, db with explicit deletion prevention@ * successful 
destroy followed by apply * all working except llama models * Feat/rearrange models deployed (#208) * ensure data sources are not destroyed * update all s3, ecr, db with explicit deletion prevention@ * successful destroy followed by apply * all working except llama models * use llama 3.3 70b-instruct which is working * rearrange models all working except 180b * correct formatting * Chore/update to latest main branch (#206) * feat: allow multiple secrets for Airflow teams This gives Airflow teams access to a "_2" secret. This is to work around the limitation that an AWS Secret has a max size of 64KB * add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * fixup! add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * Added ability for dag processors to fetch from external buckets * remove policy change related to put object, as it will be dealt with in a different PR * Enable intelligent tiering on mirror bucket for objects > 128KB * chore(deps): bump cross-spawn from 7.0.3 to 7.0.6 Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6. - [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md) - [Commits](moxystudio/node-cross-spawn@v7.0.3...v7.0.6) --- updated-dependencies: - dependency-name: cross-spawn dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * feat: add lifecycle policies to all ECR repos, except visualisation-base images Because we only used tagged images in ECS, to reduce costs and to avoid alerts for vulnrabilities that have since been addressed, we should be able to safely delete untagged images. The exception are the various visualisation-base images which we do (for now) use untagged, although this is being changed. 
* this change allows ddat data science gitlab runner to list, get and put objects to their space in notebooks bucket for private python index * use instance profile in data science launch configuration * fixup! use instance profile in data science launch configuration * fixup! use instance profile in data science launch configuration * fixup! use instance profile in data science launch configuration * feat: expire preview visualisation (user provided) images This adds to the lifecycle rules for preview visualisation (user provided) images. It should now expire preview images one day after they have been pushed. In order to leave production images alone robustly, they now have a "--prod" suffix so they will match the rule with pattern "*--prod" that expires them in 1000 years. While odd, it seems to be the best way to make "*--prod" images _never_ expire. * fixup! use instance profile in data science launch configuration * fixing resource format for gitlab ds runner * incorrectly removed gitlab_runner user_provided actions, this commit corrects it * rename iam role for gitlab data science runner * feat: allow Airflow teams to use external KMS keys * add list object permissions to check for whl files in each package * feat/ add ecr lifecycle policy for admin, keep last five releases * lint * fix/must provide tag pattern when tag_status = 'tagged' in ecr lifecycle rule * Update ecr.tf * perf: delete expired object delete markers on notebooks bucket We have no need to keep the old delete markers, and apparently there is a performance benefit. * feat: move to single visualisation-base ECR repo, with lifecycle policy We've now moved all visualisation-base images to a single ECR repo, and also use tagged images, so we - Remove the language-specific visualisaiton base repos - And give the remaining repo a lifecycle policy to cleanup old images, both for cost reasons and to save us being alerted on vulnrabilities on old images. 
* fix: visualisation-base lifecycle rule This fixes the issue where "python" and "rv4" images were being expired from the visualisation-base repository. I incorrectly thought that if an image matches any of the patterns in the tag_pattern_list in a rule, it would match the rule and stop processing. However, it has to match _all_ of the patterns. So to fix the issue, the one rule that matches "python","rv4" is split into two rules. * expire theia images * applied to jupyterlab, theia, rv4-cran, rv4-rstudio, rv4-visualisation, pgadmin, remote-desktop, s3sync, and metrics --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Michal Charemza <[email protected]> Co-authored-by: sekharpanja <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: James Robinson <[email protected]> Co-authored-by: James Robinson <[email protected]> Co-authored-by: agalamatis <[email protected]> * bugfix: fix duplicated entries relating to merge conflict (#209) * Bugfix/stop llama deploying and correct one alarm (#211) * bugfix remove llama temporarily * correct alarm * Bugfix/simplify autoscaling (#213) * simplify autoscaling logic for easier debugging * correct linting * Feature/modify user theia permissions (#221) * simplify autoscaling process * alter user permissions for boto3 usage in theia@ * latest * all terraform names with underscores not dashes * correct sagemaker_llms * remove theia permissions adjustment * incorporate latest simplify_autoscaling branch * correct user permissions * shorten timeframes * latest * correct slack channels * correct errors in conflict resolution@ * new format with two alarms for backlog --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: 
Sophie Glinton <[email protected]> Co-authored-by: Isobel Daley <[email protected]> Co-authored-by: Isobel Daley <[email protected]> Co-authored-by: Joseph Hearnshaw <[email protected]> Co-authored-by: Joseph Hearnshaw <[email protected]> Co-authored-by: Michal Charemza <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: sekharpanja <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: James Robinson <[email protected]> Co-authored-by: James Robinson <[email protected]> Co-authored-by: agalamatis <[email protected]>

aidanrussell requested a review from a team as a code owner January 15, 2025 14:53

aidanrussell changed the base branch from main to feat/sagemaker-llms January 15, 2025 14:54

aidanrussell force-pushed the bugfix/simplify_autoscaling branch from 10e0c56 to 7267ac6 Compare January 20, 2025 11:57

aidanrussell requested review from joehearnshaw-6point6 and isobel-daley-6point6 January 20, 2025 12:00

joehearnshaw-6point6 reviewed Jan 20, 2025


Review comment on infra/sagemaker_llm_resources.tf (resolved)

simplify autoscaling logic for easier debugging

0726e75

aidanrussell force-pushed the bugfix/simplify_autoscaling branch from 6e58c14 to 0726e75 Compare January 21, 2025 12:18

correct linting

0b2293e

aidanrussell merged commit fa29f71 into feat/sagemaker-llms Jan 22, 2025
1 check passed

aidanrussell deleted the bugfix/simplify_autoscaling branch January 22, 2025 11:34
