Bugfix/simplify autoscaling #213
Merged: aidanrussell merged 2 commits into feat/sagemaker-llms from bugfix/simplify_autoscaling on Jan 22, 2025
aidanrussell force-pushed the bugfix/simplify_autoscaling branch from 10e0c56 to 7267ac6 on January 20, 2025 11:57
aidanrussell requested review from joehearnshaw-6point6 and isobel-daley-6point6 on January 20, 2025 12:00
aidanrussell force-pushed the bugfix/simplify_autoscaling branch from 6e58c14 to 0726e75 on January 21, 2025 12:18
aidanrussell added a commit that referenced this pull request on Jan 23, 2025
* sagemaker domain and two async endpoints * Sagemaker endpoints policies * Add sagemaker ECR repo * Sagemaker tools permissions * Variables * Fix * Fix * Fix * fix: tighten IAM permissions for sagemaker * fix:adjust autoscaling policies * fix: adjust sagemaker permissions to enable ECR access * fix: adjust permissions to enable SageMaker to retrieve ECR images * fix: adjust iam polciies to enable access to sagemaker bucket * Ensuring instance spins down to 0 * Modular changes working Modularisation complete; fixed some minor bugs too that led to CPU util alarm being defunct for utilization over 70%. Can start considering moving all models to this methodology going forwards. * Updated to modularise all of sagemaker * Further modularisation for models Making the models easier to redeploy with simplier blocks. Removed redudant code configs. * Updated Alarm params Updated to ensure longer time up and longer time to scale down for improving useability * Updated alarms and policies * Corrections * remove duplicated module * terraform fmt * Cost monitoring functions * tidy up * Added env var + fixed dashboard As above * recoverd * Compliance & alerting Comliance, plus unfinished work on S3 bucket for logs - outstanding one more compliance metric but PR for now. * feat: sagemaker outputs are moved to user's theia space (#180) * S3 bucket and alarms added * Updated with iteration & fixed policies Noted policies accidentally truncated. Made this iterative, updated code, too. * Feat/sagemaker llms update main (#171) * feat: allow multiple secrets for Airflow teams This gives Airflow teams access to a "_2" secret. 
This is to work around the limitation that an AWS Secret has a max size of 64KB * Added ability for dag processors to fetch from external buckets * Enable intelligent tiering on mirror bucket for objects > 128KB * add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * fixup! add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * remove policy change related to put object, as it will be dealt with in a different PR * chore(deps): bump cross-spawn from 7.0.3 to 7.0.6 Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6. - [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md) - [Commits](moxystudio/node-cross-spawn@v7.0.3...v7.0.6) --- updated-dependencies: - dependency-name: cross-spawn dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * feat: add lifecycle policies to all ECR repos, except visualisation-base images Because we only used tagged images in ECS, to reduce costs and to avoid alerts for vulnrabilities that have since been addressed, we should be able to safely delete untagged images. The exception are the various visualisation-base images which we do (for now) use untagged, although this is being changed. 
* Add sagemaker ECR repo * terraform fmt --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Michal Charemza <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: sekharpanja <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sophie Glinton <[email protected]> * chore: run terraform fmt with -recursive option (#184) * chore: run terraform fmt with -recursive option * check on each PR * WIP - slack integration Remainder task of updating the other model and then resolving changes with latest PRs * Bugfix/repair errors from recent merges (#186) * testing current version * latest: * minor fixes * Correcting mapping var names * Correction of lambda * Formatted * Minor update * Revert "Feat/sagemaker llms 407 slack" (#193) * Feat/use jumpstart streamlined for new llm options (#192) * chore: run terraform fmt with -recursive option * check on each PR * wip * local * latest wip * all working * formatting * modify * restore status after problematic merge * latest * formatting * Feat/reduced new llm options (#197) * chore: run terraform fmt with -recursive option * check on each PR * wip * local * latest wip * all working * formatting * modify * restore status after problematic merge * latest * formatting * add mistral 7b * latest models * formatting * latest * formatting * restore 407 slack alarm changes (#195) * restore 407 slack alarms following revert * terraform fmt * add mistral * Feat/further expand new llm options (#196) * add gemma 2 27b * latest: * latest work * latest work to get back closer to last working state (#205) * latest work to get back closer to last working state * latest * latest * latest * latest * allow terraform destroy and then apply to work as expected (#207) * ensure data sources are not destroyed * update all s3, ecr, db with explicit deletion prevention@ * successful 
destroy followed by apply * all working except llama models * Feat/rearrange models deployed (#208) * ensure data sources are not destroyed * update all s3, ecr, db with explicit deletion prevention@ * successful destroy followed by apply * all working except llama models * use llama 3.3 70b-instruct which is working * rearrange models all working except 180b * correct formatting * Chore/update to latest main branch (#206) * feat: allow multiple secrets for Airflow teams This gives Airflow teams access to a "_2" secret. This is to work around the limitation that an AWS Secret has a max size of 64KB * add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * fixup! add a policy change to allow data workspace users to get objects from notebooks S3 bucket add a policy change to allow gitlab runner to put objects into notebooks S3 bucket * Added ability for dag processors to fetch from external buckets * remove policy change related to put object, as it will be dealt with in a different PR * Enable intelligent tiering on mirror bucket for objects > 128KB * chore(deps): bump cross-spawn from 7.0.3 to 7.0.6 Bumps [cross-spawn](https://github.com/moxystudio/node-cross-spawn) from 7.0.3 to 7.0.6. - [Changelog](https://github.com/moxystudio/node-cross-spawn/blob/master/CHANGELOG.md) - [Commits](moxystudio/node-cross-spawn@v7.0.3...v7.0.6) --- updated-dependencies: - dependency-name: cross-spawn dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> * feat: add lifecycle policies to all ECR repos, except visualisation-base images Because we only used tagged images in ECS, to reduce costs and to avoid alerts for vulnrabilities that have since been addressed, we should be able to safely delete untagged images. The exception are the various visualisation-base images which we do (for now) use untagged, although this is being changed. 
* this change allows ddat data science gitlab runner to list, get and put objects to their space in notebooks bucket for private python index * use instance profile in data science launch configuration * fixup! use instance profile in data science launch configuration * fixup! use instance profile in data science launch configuration * fixup! use instance profile in data science launch configuration * feat: expire preview visualisation (user provided) images This adds to the lifecycle rules for preview visualisation (user provided) images. It should now expire preview images one day after they have been pushed. In order to leave production images alone robustly, they now have a "--prod" suffix so they will match the rule with pattern "*--prod" that expires them in 1000 years. While odd, it seems to be the best way to make "*--prod" images _never_ expire. * fixup! use instance profile in data science launch configuration * fixing resource format for gitlab ds runner * incorrectly removed gitlab_runner user_provided actions, this commit corrects it * rename iam role for gitlab data science runner * feat: allow Airflow teams to use external KMS keys * add list object permissions to check for whl files in each package * feat/ add ecr lifecycle policy for admin, keep last five releases * lint * fix/must provide tag pattern when tag_status = 'tagged' in ecr lifecycle rule * Update ecr.tf * perf: delete expired object delete markers on notebooks bucket We have no need to keep the old delete markers, and apparently there is a performance benefit. * feat: move to single visualisation-base ECR repo, with lifecycle policy We've now moved all visualisation-base images to a single ECR repo, and also use tagged images, so we - Remove the language-specific visualisaiton base repos - And give the remaining repo a lifecycle policy to cleanup old images, both for cost reasons and to save us being alerted on vulnrabilities on old images. 
* fix: visualisation-base lifecycle rule This fixes the issue where "python" and "rv4" images were being expired from the visualisation-base repository. I incorrectly thought that if an image matches any of the patterns in the tag_pattern_list in a rule, it would match the rule and stop processing. However, it has to match _all_ of the patterns. So to fix the issue, the one rule that matches "python","rv4" is split into two rules. * expire theia images * applied to jupyterlab, theia, rv4-cran, rv4-rstudio, rv4-visualisation, pgadmin, remote-desktop, s3sync, and metrics --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: Michal Charemza <[email protected]> Co-authored-by: sekharpanja <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: James Robinson <[email protected]> Co-authored-by: James Robinson <[email protected]> Co-authored-by: agalamatis <[email protected]> * bugfix: fix duplicated entries relating to merge conflict (#209) * Bugfix/stop llama deploying and correct one alarm (#211) * bugfix remove llama temporarily * correct alarm * Bugfix/simplify autoscaling (#213) * simplify autoscaling logic for easier debugging * correct linting * Feature/modify user theia permissions (#221) * simplify autoscaling process * alter user permissions for boto3 usage in theia@ * latest * all terraform names with underscores not dashes * correct sagemaker_llms * remove theia permissions adjustment * incorporate latest simplify_autoscaling branch * correct user permissions * shorten timeframes * latest * correct slack channels * correct errors in conflict resolution@ * new format with two alarms for backlog --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: 
Sophie Glinton <[email protected]> Co-authored-by: Isobel Daley <[email protected]> Co-authored-by: Isobel Daley <[email protected]> Co-authored-by: Joseph Hearnshaw <[email protected]> Co-authored-by: Joseph Hearnshaw <[email protected]> Co-authored-by: Michal Charemza <[email protected]> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: sekharpanja <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Sally Mohamed <[email protected]> Co-authored-by: Peter Woodcock <[email protected]> Co-authored-by: James Robinson <[email protected]> Co-authored-by: James Robinson <[email protected]> Co-authored-by: agalamatis <[email protected]>
This started as a quick bugfix but grew into a larger piece of work to simplify how the alarms and autoscaling work, as the prior implementation was confusing and hard to debug.
The original bug was that scale-down based on backlog was not working at all; scale-down by CPU was being applied instead, which is why the problem had not been noticed. This broke when Llama was added, because it used ~7% CPU at rest against a 5% CPU threshold, so it failed to scale down even with no backlog.
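To make the failure mode concrete, here is a minimal Python sketch of the old behaviour as described above. It is not the Terraform from this PR: the function name and the 2% figure for a "typical" idle model are hypothetical, and only the 5% threshold and ~7% idle Llama figure come from the description.

```python
# Threshold taken from the description above; everything else is illustrative.
CPU_SCALE_DOWN_THRESHOLD = 5.0  # percent


def old_should_scale_down(cpu_percent: float, backlog: int) -> bool:
    """Old behaviour: backlog is effectively ignored, only CPU decides."""
    # The backlog-based scale-down was not working, so it plays no part here.
    return cpu_percent < CPU_SCALE_DOWN_THRESHOLD


# A model idling at ~7% CPU never drops below the 5% threshold.
idle_llama = old_should_scale_down(cpu_percent=7.0, backlog=0)

# A hypothetical model idling at 2% CPU scaled down, which masked the bug.
idle_other = old_should_scale_down(cpu_percent=2.0, backlog=0)
```

With these inputs the idle Llama endpoint never qualifies for scale-down, while the lighter model does, which matches why the problem went unnoticed until Llama was added.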
In the new implementation, scaling is split into one set of rules for 0-1 instances and another for anything above 1 instance. The rules therefore do not overlap: scale up/down on backlog applies to 0-1 only, and scale up/down on CPU/GPU applies to 1+ only. Furthermore, where possible, alarms are created that trigger one action in the alarm state and the opposing action in the non-alarm (OK) state. This helps ensure consistency, as there is only one rule to edit instead of two (in the end this was only possible for the backlog alarm, not the others).
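The split described above can be sketched as two independent rules. This is a rough Python model of the decision logic only, not the actual Terraform alarms and policies: the function names, the thresholds and the maximum instance count are all assumptions chosen for illustration.

```python
# Illustrative values (assumptions, not the values used in the Terraform).
CPU_HIGH = 70.0    # percent; scale out above this
CPU_LOW = 5.0      # percent; scale in below this
MAX_INSTANCES = 2  # hypothetical upper bound


def backlog_rule(instances: int, backlog: int) -> int:
    """0-1 regime: one alarm with opposing actions in ALARM and OK states."""
    if instances > 1:
        return instances         # rule does not apply above 1 instance
    in_alarm = backlog > 0       # ALARM while requests are queued
    return 1 if in_alarm else 0  # ALARM -> 1 instance, OK -> 0 instances


def utilisation_rule(instances: int, cpu_percent: float) -> int:
    """1+ regime: utilisation moves capacity between 1 and the maximum."""
    if instances < 1:
        return instances         # rule does not apply at 0 instances
    if cpu_percent > CPU_HIGH:
        return min(instances + 1, MAX_INSTANCES)
    if cpu_percent < CPU_LOW:
        # Never below 1 here; going to 0 is the backlog rule's job.
        return max(instances - 1, 1)
    return instances
```

Under this model, an endpoint with one instance idling at 7% CPU and no backlog is left alone by `utilisation_rule(1, 7.0)` but taken to zero by `backlog_rule(1, 0)`, so idle CPU level no longer blocks scale-down.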