Feat/sagemaker llms #234

Draft
wants to merge 101 commits into base: main
Conversation

isobel-daley-6point6
Contributor

No description provided.

Sophie Glinton and others added 30 commits November 4, 2024 14:01
fix: tighten IAM permissions for sagemaker
fix: adjust IAM policies to enable access to sagemaker bucket
Modularisation complete; also fixed some minor bugs that left the CPU util alarm defunct for utilization over 70%. We can start considering moving all models to this methodology going forward.
Making the models easier to redeploy with simpler blocks. Removed redundant code configs.
Updated to ensure longer uptime and a longer time to scale down, to improve usability
joehearnshaw-6point6 and others added 30 commits January 23, 2025 19:26
Added SNS topic for lambda subscriptions to composite alarms in separate logic. N.B. We NEED to refactor this logic and abstract it into another module or something similar, as it's overused and not nice like this; it makes it harder to diagnose issues.
Added gpu composite to all models
Feature/tf 56 composite alarms - composite alarms
Instead of keeping files on the local filesystem (which is tricky to do
securely) that are then copied to S3, which GitLab pulls from on launch,
this stores GitLab secrets in Secrets Manager, which GitLab pulls from on
launch.
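
As a rough illustration of the "pull from Secrets Manager on launch" step, here is a minimal Python sketch. It is not the code in this PR: the function name `load_gitlab_secrets`, the secret ID `gitlab/secrets`, and the assumption that the secret is stored as a JSON object are all hypothetical. The only AWS call relied on is `get_secret_value(SecretId=...)`, which the boto3 Secrets Manager client provides.

```python
import json
from typing import Any, Dict


def load_gitlab_secrets(client: Any, secret_id: str) -> Dict[str, Any]:
    """Fetch one secret from Secrets Manager and decode it as a JSON object.

    `client` is assumed to behave like boto3.client("secretsmanager");
    only its get_secret_value(SecretId=...) method is used here.
    """
    response = client.get_secret_value(SecretId=secret_id)

    # String secrets arrive under "SecretString"; binary ones use "SecretBinary".
    secret_string = response.get("SecretString")
    if secret_string is None:
        raise ValueError(f"secret {secret_id!r} has no SecretString")

    # Assumption: the secret value is a JSON object of key/value pairs.
    secrets = json.loads(secret_string)
    if not isinstance(secrets, dict):
        raise ValueError(f"secret {secret_id!r} is not a JSON object")
    return secrets
```

On a real instance this would presumably be called as `load_gitlab_secrets(boto3.client("secretsmanager"), "gitlab/secrets")`, with the instance role granted `secretsmanager:GetSecretValue` on that secret. Passing the client in keeps the function testable without AWS access.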

This is part 1 of (probably) 2: it does not remove any existing object,
permissions or associated config, so environments can keep on accessing the
secrets as they were and we don't have to migrate them all at once. Later
parts will likely remove permissions and config.

This is part of our move away from having to have any secrets locally on the
filesystem.
This follows up from #223 by
making it possible to apply the terraform with GitLab enabled while not
having GitLab secrets on the local filesystem.
* update for all endpoints

* tweak so all at 5 minutes
* WIP: new experimental version numbers and formatting Makefile

* modifications for a readme

* update readme

* remove github workflow

* modify the way uv install works

* latest

* update path
* 900 seconds uptime for all

* extend alarms for scale down
* 900 seconds uptime for all

* extend alarms for scale down

* correct error
* feat/create new sagemaker vpc and switch sagemaker resources to run inside vpc

* feat: add security group rules for notebook endpoints in new sagemaker vpc

* fix: add new routes for sagemaker vpc to enable access to endpoints from Theia

* fix: add new security group rules to open up access to sagemaker endpoints from Theia

* fix: add new routes for sagemaker vpc to enable access to endpoints from Theia

* fix: adjust subnets and security groups to reflect new sagemaker vpc

* fix: remove duplicate lifecycle policies

* fix: adjust changes to test one model in new sagemaker vpc

* fix: move domain back into notebooks vpc to avoid unnecessary changes

* fix: modifications to security groups

* fix: removing sagemaker endpoints in main

* fix: modifications to VPC settings to address ongoing endpoint issues

* fix: add route 53 private DNS record to enable sagemaker endpoint to be called from Theia

* fix: remove unnecessary peering

* fix: add SageMaker API DNS record

* fix: move VPC endpoints to notebooks VPC

* fix: adjust subnets/security groups to switch models to Sagemaker vpc

* fix: adjustments to get SNS endpoint working

* fix: address SNS notification issue

* fix: sns endpoint

* fix: add route table association for S3 endpoint

* fix: move all endpoints to new sagemaker vpc

* fix: enable connection from SageMaker s3 endpoint to Notebooks bucket

* fix: add security group rules to enable access to s3 endpoint

* fix: modifications to security group naming

* fix: remove unnecessary route53 resources

* fix: reinstate alarms on gpt neo 125m

* fix: resolve merge conflict - remove falcon
4 participants