Increasing security posture of HTC Grid by enforcing and fixing relevant encryption, authentication and RBAC issues #78

Merged
merged 62 commits into from
Feb 6, 2024

Conversation

fgogolli
Contributor

@fgogolli fgogolli commented Dec 6, 2023

Description

This pull request fixes all the relevant security issues in the current code base, as detected by cfn_lint, trivy, checkov and ScoutSuite.

Terraform State:

  • Encrypt and secure init_grid state and Lambda buckets.
  • Limit the scope of KMS Key policy for State Buckets.
  • Remove AccessControls and use BucketPolicy to keep the bucket private.
  • Configure all Makefiles to use encrypted S3 Buckets for TF State, non-root Dockerfiles, fix HTCGRID_ECR_REPO, name CloudFormation stack outputs, and support updating existing init_grid stack.
  • Improve init_grid Makefile to handle initial and deletion cases better.
  • Add support for cleaning up S3 object versions and standardize bucket variable naming.
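
Cleaning up object versions matters here because a versioned bucket cannot be deleted until every version and delete marker is gone. A minimal sketch of the batching involved, assuming boto3's `list_object_versions` paginator and `delete_objects` call; the function names are hypothetical and this is not the Makefile's actual implementation:

```python
from typing import Dict, Iterable, Iterator, List

MAX_KEYS_PER_DELETE = 1000  # S3 DeleteObjects accepts at most 1000 keys per call


def batch_versions(pages: Iterable[dict]) -> Iterator[List[Dict[str, str]]]:
    """Yield delete batches built from list_object_versions response pages."""
    batch: List[Dict[str, str]] = []
    for page in pages:
        # Both real versions and delete markers have to be removed.
        for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
            batch.append({"Key": entry["Key"], "VersionId": entry["VersionId"]})
            if len(batch) == MAX_KEYS_PER_DELETE:
                yield batch
                batch = []
    if batch:
        yield batch


def empty_versioned_bucket(bucket: str) -> None:
    """Delete every object version in `bucket` (hypothetical helper)."""
    import boto3  # imported lazily so batch_versions can be tested offline

    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_object_versions").paginate(Bucket=bucket)
    for batch in batch_versions(pages):
        s3.delete_objects(Bucket=bucket, Delete={"Objects": batch, "Quiet": True})
```

Keeping the batching separate from the AWS calls makes the 1,000-key limit easy to check without credentials.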

HTC Grid Containers:

  • Configure all Dockerfiles to run non-root containers and fix builds.
  • Configure all HTC K8S resources to run with runAsNonRoot, default seccompProfile, and disabled allowPrivilegeEscalation.
  • Rename components, add readOnlyRootFilesystem and seccomp profile to HTC Agent, fix and clean up code.
  • Remove file system write dependencies for the agent.
  • Harden K8S manifests and enforce further checkov rules.
  • Configure Grafana Ingress to drop invalid HTTP Header fields.
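
The container settings listed above can be checked mechanically. A rough sketch, assuming a pod spec already parsed into a Python dict and assuming "default seccompProfile" means the `RuntimeDefault` type; `missing_hardening` is a hypothetical helper, not part of HTC Grid or checkov:

```python
def missing_hardening(pod_spec: dict) -> list:
    """Return findings for containers that lack the hardening settings."""
    findings = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        sc = container.get("securityContext", {})
        name = container.get("name", "<unnamed>")
        # runAsNonRoot and seccompProfile may be set at pod or container level.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not true")
        profile = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if profile.get("type") != "RuntimeDefault":
            findings.append(f"{name}: seccompProfile is not RuntimeDefault")
        # These two fields exist only at container level; unset is treated as unsafe.
        if sc.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}: allowPrivilegeEscalation is not disabled")
        if not sc.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: readOnlyRootFilesystem is not true")
    return findings
```

An empty list means every container in the spec carries all four settings.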

HTC Grid Control Plane:

  • Configure CMK KMS Key encryption for VPC Flow Logs, ECR Repositories, SQS, DynamoDB, S3, EKS Cluster, EKS MNG EBS Volumes, and all CloudWatch Logs.
  • Add encrypted CloudWatch Logging for API Gateway.
  • Create S3 via TF Module, add encryption support for S3 Data Plane in the agent, fix AWS partition, and DNS Suffix usage.
  • Simplify code and move all lambdas and auth to the control_plane.
  • Configure and consolidate least-privilege permissions on KMS, Lambda, and Agent IAM policies.
  • Add KMS Decrypt and GenerateDataKey permissions to Lambda and Agent permissions.
  • Move installation of jq onto lambda images and fix the bootstrap script.
  • Convert ElastiCache Redis to a single-replica cluster mode and add encryption.
  • Add AUTH for ElastiCache Redis Cluster.
  • Enable X-Ray tracing for Lambda functions and adjust Redis config.
  • Add an explicit ASG Service Linked Role declaration to enable KMS support for ASG EBS Volumes.
  • Handle cases where AWSServiceRoleForAutoScaling already exists.
  • Add S3 and SQS Resource Policies to enforce HTTPS and create separate CMK KMS Keys for DLQs per each SQS Queue.
  • Configure the DLQs to be used with the respective SQS Queues and fix naming/references.
  • Add security group and ACL controls where possible.
  • Configure securityContext for OpenAPI.
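
For the resource policies that enforce HTTPS, the usual pattern is a Deny statement conditioned on `aws:SecureTransport` being false. A sketch of such an S3 policy document, built in Python for illustration only; the PR defines its policies in Terraform, and the function name and statement ID here are assumptions:

```python
import json


def deny_insecure_transport_policy(bucket_name: str, partition: str = "aws") -> str:
    """Build a bucket policy that rejects any request not made over TLS."""
    bucket_arn = f"arn:{partition}:s3:::{bucket_name}"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # illustrative statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)
```

Passing the partition explicitly mirrors the "fix AWS partition" item above, so the same policy works outside the standard `aws` partition.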

General:

  • Add GitHub workflows for cfn_lint, trivy, and checkov.
  • Standardize, fix, and simplify tests.
  • Standardize the naming of TF resources.
  • Fix docs and random_password to align with pipelines.
  • Add auto deploy & destroy stages for images.

Cloud9:

  • Fix Cloud9 deployment script to target correct instances.
  • Fix Cloud9 bootstrap race condition and adjust to WS.
  • Force a reinstall at bootstrap time to fix virtualenv issues.
  • Add support for specifying a Git repo/branch for HTCGridSource.
  • Remove Admin role from KMS Admins as it doesn't exist in WS.

Checklist

  • Added tests that cover your change (if possible)
  • Added/modified documentation as required (such as the README.md, or the docs directory)
  • Manually tested
  • Made sure the title of the PR is a good description that can go into the release notes
  • (Core team) Added labels for change area (e.g. area/controlplane) and kind (e.g. kind/improvement)

BONUS POINTS checklist:

  • Backfilled missing tests for code in same general area
  • Refactored something and made the world a better place

@fgogolli fgogolli force-pushed the v043_fixes branch 2 times, most recently from d050e97 to 93a029d Compare December 12, 2023 12:00
policy_arn = "arn:${local.partition}:iam::aws:policy/service-role/AmazonAPIGatewayPushToCloudWatchLogs"
}

resource "aws_api_gateway_account" "apigateway_account" {

Check warning

Code scanning / tflint

Missing version constraint for provider "aws" in required_providers Warning

Missing version constraint for provider "aws" in required_providers

for i, r in enumerate(results):
# print(i, (abs(all_expected_results[i] - results[i])), all_expected_results[i], results[i])
assert (abs(all_expected_results[i] - results[i]) < 0.000001)
assert abs(all_expected_results[i] - results[i]) < 0.000001

Check notice

Code scanning / Bandit

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. Note

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
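
For context on this notice: `assert` statements are skipped when Python runs with `-O`, so a check written this way can silently disappear. In test code that is usually acceptable, but where the comparison must always run, an explicit exception is one option. A sketch with a hypothetical helper name, reusing the tolerance from the snippet above:

```python
def check_results(expected, actual, tolerance=0.000001):
    """Raise ValueError if any result differs from its expected value."""
    if len(expected) != len(actual):
        raise ValueError(f"expected {len(expected)} results, got {len(actual)}")
    for i, (want, got) in enumerate(zip(expected, actual)):
        # Same comparison as the assert, but not removed by `python -O`.
        if abs(want - got) >= tolerance:
            raise ValueError(f"result {i} differs: expected {want}, got {got}")
```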
stdout=f_stdout,
stderr=f_stderr,
shell=False)
proc = subprocess.Popen(command, stdout=f_stdout, stderr=f_stderr, shell=False)

Check notice

Code scanning / Bandit

subprocess call - check for execution of untrusted input. Note

subprocess call - check for execution of untrusted input.
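
Bandit raises this notice for any `subprocess` call, so the question is whether the arguments can carry untrusted input. Passing an argument list with `shell=False`, as the changed line does, avoids shell interpretation. A small sketch of that pattern, assuming a hypothetical `run_command` wrapper:

```python
import subprocess
import sys


def run_command(argv, timeout=30):
    """Run `argv` (a list of strings) without a shell; return (returncode, stdout)."""
    if isinstance(argv, str):
        # A single string would tempt callers into shell-style quoting.
        raise TypeError("pass a list of arguments, not a shell string")
    completed = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        shell=False,  # arguments reach the program verbatim
        timeout=timeout,
        check=False,
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    print(run_command([sys.executable, "-c", "print('ok')"]))
```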
logging.info("Could not acquire a task from the queue, backing off for {}".
format(timeout)
)
timeout = random.uniform(

Check notice

Code scanning / Bandit

Standard pseudo-random generators are not suitable for security/cryptographic purposes. Note

Standard pseudo-random generators are not suitable for security/cryptographic purposes.
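
This notice is about cryptographic use. A back-off delay is not a secret, so the `random` module is generally fine for jitter; `secrets` is the module to reach for when generating tokens or passwords. A sketch of jittered back-off, where the function name, bounds, and doubling rule are assumptions rather than the agent's actual values:

```python
import random


def next_backoff(previous: float, base: float = 0.5, cap: float = 30.0) -> float:
    """Pick the next sleep interval in seconds, with jitter."""
    # Double the previous interval, bounded below by `base` and above by `cap`.
    upper = min(cap, max(base, previous * 2))
    # Jitter only; nothing security-sensitive is derived from this value.
    return random.uniform(base, upper)  # nosec B311
```

The inline `# nosec B311` comment is Bandit's way of recording that the finding was reviewed and accepted.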

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

KUBE_FILEPATH = '/tmp/kubeconfig'
region = os.environ['AWS_REGION']
KUBE_FILEPATH = "/tmp/kubeconfig"

Check warning

Code scanning / Bandit

Probable insecure usage of temp file/directory. Warning

Probable insecure usage of temp file/directory.
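
The warning is triggered by the hardcoded, predictable path under `/tmp`. One common alternative is to let `tempfile` choose a unique name and create the file with owner-only permissions. A sketch, assuming a hypothetical `write_kubeconfig` helper rather than the function in this PR:

```python
import os
import tempfile


def write_kubeconfig(contents: str) -> str:
    """Write `contents` to a uniquely named private temp file; return its path."""
    # mkstemp creates the file atomically, readable and writable by the owner only.
    fd, path = tempfile.mkstemp(prefix="kubeconfig-", suffix=".yaml")
    with os.fdopen(fd, "w") as handle:
        handle.write(contents)
    return path
```

The caller is responsible for removing the file once it is no longer needed.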

else:
errlog.log("Unimplemented path, exiting")
assert(False)
assert False

Check notice

Code scanning / Bandit

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. Note

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.


def get_time_now_ms():
return int(round(time.time() * 1000))


def get_tasks_statuses_in_session(session_id):

assert(session_id is not None)
assert session_id is not None

Check notice

Code scanning / Bandit

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. Note

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

else:
errlog.log("Uniplemented path, exiting")
assert(False)
assert False

Check notice

Code scanning / Bandit

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code. Note

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
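
The two preceding patterns, argument validation and an unreachable branch, can also be written without `assert` so they survive `python -O`. A sketch with hypothetical function names; the real handlers in this PR keep the asserts:

```python
def require_session_id(session_id):
    """Validate the argument explicitly instead of asserting."""
    if session_id is None:
        raise ValueError("session_id must not be None")
    return session_id


def dispatch(path: str) -> str:
    """Handle known paths; fail loudly on anything else."""
    if path == "known":  # placeholder for the implemented branches
        return "handled"
    raise NotImplementedError(f"Unimplemented path: {path}")
```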
@fgogolli fgogolli force-pushed the v043_fixes branch 2 times, most recently from 395e14d to 4689e66 Compare December 13, 2023 14:18
Collaborator

@clementrey-dev clementrey-dev left a comment

Hello Flamur,

Thank you for your work.

I have been running the happy path section (from my laptop) and I got errors in multiple sections:

  • REGION is not set after running different make commands
  • Malformed policy during the deployment
  • Bucket not fully emptied during the cleaning process

Can you please take a look at this?

deployment/init_grid/cloudformation/Makefile (outdated, resolved)
@kirillsc
Collaborator

kirillsc commented Feb 6, 2024

lgtm

Collaborator

@kirillsc kirillsc left a comment

Looks good to me

Comment on lines +153 to +172
ImageTFStateBucket:
Type: 'AWS::S3::Bucket'
DeletionPolicy: Delete
Properties:
BucketName: !Sub
- '${BucketTag}-tfstate-htc-grid-${RANDOM}'
- '${BucketTag}-htc-grid-image-tfstate-${RANDOM}'
- RANDOM: !Select [0, !Split ['-', !Select [2, !Split ['/', !Ref 'AWS::StackId' ]]]]
DeletionPolicy: Delete
LambdaUnitHtcGrid:
PublicAccessBlockConfiguration:
BlockPublicAcls: true
BlockPublicPolicy: true
IgnorePublicAcls: true
RestrictPublicBuckets: true
VersioningConfiguration:
Status: Enabled
BucketEncryption:
ServerSideEncryptionConfiguration:
- BucketKeyEnabled: true
ServerSideEncryptionByDefault:
SSEAlgorithm: 'aws:kms'
KMSMasterKeyID: !Sub 'arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:${HTCStateS3KeyAlias}'

Check notice

Code scanning / Trivy

S3 Bucket Logging Low

Artifact: deployment/init_grid/cloudformation/grid_state.yaml
Type: cloudformation
Vulnerability AVD-AWS-0089
Severity: LOW
Message: Bucket has logging disabled
Link: AVD-AWS-0089
Comment on lines +239 to +258
LambdaLayerBucket:
Type: 'AWS::S3::Bucket'
DeletionPolicy: Delete
Properties:
BucketName: !Sub
- '${BucketTag}-lambda-unit-htc-grid-${RANDOM}'
- '${BucketTag}-htc-grid-lambda-layer-${RANDOM}'
- RANDOM: !Select [0, !Split ['-', !Select [2, !Split ['/', !Ref 'AWS::StackId' ]]]]
DeletionPolicy: Delete
PublicAccessBlockConfiguration:
BlockPublicAcls: true
BlockPublicPolicy: true
IgnorePublicAcls: true
RestrictPublicBuckets: true
VersioningConfiguration:
Status: Enabled
BucketEncryption:
ServerSideEncryptionConfiguration:
- BucketKeyEnabled: true
ServerSideEncryptionByDefault:
SSEAlgorithm: 'aws:kms'
KMSMasterKeyID: !Sub 'arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:${HTCStateS3KeyAlias}'

Check notice

Code scanning / Trivy

S3 Bucket Logging Low

Artifact: deployment/init_grid/cloudformation/grid_state.yaml
Type: cloudformation
Vulnerability AVD-AWS-0089
Severity: LOW
Message: Bucket has logging disabled
Link: AVD-AWS-0089
Comment on lines +67 to +86
GridTFStateBucket:
Type: 'AWS::S3::Bucket'
DeletionPolicy: Delete
Properties:
BucketName: !Sub
- '${BucketTag}-image-tfstate-htc-grid-${RANDOM}'
- '${BucketTag}-htc-grid-tfstate-${RANDOM}'
- RANDOM: !Select [0, !Split ['-', !Select [2, !Split ['/', !Ref 'AWS::StackId' ]]]]
DeletionPolicy: Delete
TfstateHtcGrid:
PublicAccessBlockConfiguration:
BlockPublicAcls: true
BlockPublicPolicy: true
IgnorePublicAcls: true
RestrictPublicBuckets: true
VersioningConfiguration:
Status: Enabled
BucketEncryption:
ServerSideEncryptionConfiguration:
- BucketKeyEnabled: true
ServerSideEncryptionByDefault:
SSEAlgorithm: 'aws:kms'
KMSMasterKeyID: !Sub 'arn:${AWS::Partition}:kms:${AWS::Region}:${AWS::AccountId}:${HTCStateS3KeyAlias}'

Check notice

Code scanning / Trivy

S3 Bucket Logging Low

Artifact: deployment/init_grid/cloudformation/grid_state.yaml
Type: cloudformation
Vulnerability AVD-AWS-0089
Severity: LOW
Message: Bucket has logging disabled
Link: AVD-AWS-0089
@@ -1,20 +1,22 @@
FROM python:3.7.7-slim-buster
ARG HTCGRID_ECR_REPO

Check notice

Code scanning / Trivy

No HEALTHCHECK defined Low

Artifact: examples/submissions/k8s_jobs/Dockerfile.Submitter
Type: dockerfile
Vulnerability DS026
Severity: LOW
Message: Add HEALTHCHECK instruction in your Dockerfile
Link: DS026
RUN yum install -d1 -y make gcc-c++ zip
RUN mkdir -p /app
WORKDIR /app
ARG HTCGRID_ECR_REPO

Check notice

Code scanning / Trivy

No HEALTHCHECK defined Low

Artifact: examples/workloads/c++/mock_computation/Dockerfile.Build
Type: dockerfile
Vulnerability DS026
Severity: LOW
Message: Add HEALTHCHECK instruction in your Dockerfile
Link: DS026
@@ -1,21 +1,26 @@
FROM python:3.7.7-slim-buster
ARG HTCGRID_ECR_REPO

Check notice

Code scanning / Trivy

No HEALTHCHECK defined Low

Artifact: source/compute_plane/python/agent/Dockerfile.Lambda
Type: dockerfile
Vulnerability DS026
Severity: LOW
Message: Add HEALTHCHECK instruction in your Dockerfile
Link: DS026
@@ -1,41 +1,54 @@
FROM python:3.7.7-slim-buster
#Builder Container

Check notice

Code scanning / Trivy

No HEALTHCHECK defined Low

Artifact: source/compute_plane/python/agent/Dockerfile.Local
Type: dockerfile
Vulnerability DS026
Severity: LOW
Message: Add HEALTHCHECK instruction in your Dockerfile
Link: DS026
ENV LAYER_NAME lambda
ENV LAYER_VERSION 1
ENV LAYER_ROOT .
ARG HTCGRID_ECR_REPO

Check notice

Code scanning / Trivy

No HEALTHCHECK defined Low

Artifact: source/compute_plane/shell/attach-layer/Dockerfile
Type: dockerfile
Vulnerability DS026
Severity: LOW
Message: Add HEALTHCHECK instruction in your Dockerfile
Link: DS026
Comment on lines +123 to +136
resource "aws_api_gateway_method_settings" "htc_public_api_method_settings" {
#checkov:skip=CKV_AWS_308: API Gateway method setting caching encryption wouldn't work for this API
#checkov:skip=CKV_AWS_225: API Gateway method setting caching wouldn't work for this API

rest_api_id = aws_api_gateway_rest_api.htc_public_api.id
stage_name = aws_api_gateway_stage.htc_public_api_stage.stage_name

method_path = "*/*"

settings {
metrics_enabled = true
logging_level = "ERROR"
}
}

Check notice

Code scanning / Trivy

Ensure that response caching is enabled for your Amazon API Gateway REST APIs. Low

Artifact: deployment/grid/terraform/control_plane/openapi_public.tf
Type: terraform
Vulnerability AVD-AWS-0190
Severity: LOW
Message: Cache data is not enabled.
Link: AVD-AWS-0190
Comment on lines +124 to +137
resource "aws_api_gateway_method_settings" "htc_private_api_method_settings" {
#checkov:skip=CKV_AWS_308: API Gateway method setting caching encryption wouldn't work for this API
#checkov:skip=CKV_AWS_225: API Gateway method setting caching wouldn't work for this API

rest_api_id = aws_api_gateway_rest_api.htc_private_api.id
stage_name = aws_api_gateway_stage.htc_private_api_stage.stage_name

method_path = "*/*"

settings {
metrics_enabled = true
logging_level = "ERROR"
}
}

Check notice

Code scanning / Trivy

Ensure that response caching is enabled for your Amazon API Gateway REST APIs. Low

Artifact: deployment/grid/terraform/control_plane/openapi_private.tf
Type: terraform
Vulnerability AVD-AWS-0190
Severity: LOW
Message: Cache data is not enabled.
Link: AVD-AWS-0190
@fgogolli fgogolli merged commit 4813f07 into finos:main Feb 6, 2024
16 checks passed