
19530 Move copy-local-path based PVs outside of hwn01 or make basemap stateless #1409

Draft · wants to merge 21 commits into main

Commits (21)
57c3dd6
feat(flux/db-event-api/dev): add new slave on hwn04 (#19530 step 01.01)
hleb-rubanau Oct 1, 2024
052c11c
feat(flux/db-event-api/dev): switchover master to hwn04 (#19530 step …
hleb-rubanau Oct 1, 2024
6a0d984
feat(flux/db-event-api/dev): add new slave on hwn03 (#19530 step 01.03)
hleb-rubanau Oct 1, 2024
c88ddea
feat(flux/db-event-api/dev): eliminate legacy instance and update ins…
hleb-rubanau Oct 1, 2024
0f3a3a1
feat(flux/db-event-api/prod): add new slave on hwn04 (#19530 step 02)
hleb-rubanau Oct 1, 2024
e74bd68
feat(helm/basemap): support switching to emptydir (#19530, step 03.01)
hleb-rubanau Sep 20, 2024
673f17a
fix(helm/basemap/tileserver/dev): switch to emptydir (#19530, step 03…
hleb-rubanau Sep 27, 2024
2009285
fix(helm+flux/basemap/tileserver/test): switch to emptyDir and move f…
hleb-rubanau Sep 23, 2024
94b9335
fix(helm+flux/basemap/tileserver/prod): switch to emptyDir and move f…
hleb-rubanau Sep 23, 2024
6af3399
feat(helm/raster-tiler): support emptyDir storage (#19530, step 04.01)
hleb-rubanau Sep 23, 2024
63bb8ee
fix(helm+flux/raster-tiler/dev): switch to emptyDir and move from hwn…
hleb-rubanau Sep 23, 2024
0ea219a
fix(helm+flux/raster-tiler/geocint): switch to emptyDir and move from…
hleb-rubanau Sep 23, 2024
ef25f69
fix(flux/basemap/generator/test): stick to hwn02 (#19530, step 05.01)
hleb-rubanau Sep 27, 2024
323982b
fix(helm/basemap/generator/test): switch to emptydir (#19530, step 05…
hleb-rubanau Sep 27, 2024
98310a3
feat(helm/basemap/generator): support suspended jobs (#19530, step 06…
hleb-rubanau Oct 1, 2024
06bddaf
fix(helm+flux/basemap/generator/dev): suspend cronjob (#19530 step 06…
hleb-rubanau Oct 1, 2024
2439614
feat(flux/db-event-api/prod): switchover master to hwn04 (#19530 step…
hleb-rubanau Oct 1, 2024
55759e8
feat(flux/db-event-api/prod): new named replica on hwn03 (#19530 step…
hleb-rubanau Oct 1, 2024
1948caf
feat(flux/db-event-api/prod): eliminate legacy instance and update in…
hleb-rubanau Oct 1, 2024
ca114b4
fix(helm+flux/basemap/generator/dev): switch to emptyDir, move hwn03-…
hleb-rubanau Sep 23, 2024
70bfa1b
fix(helm+flux/basemap/generator/prod): switch to emptydir and move hw…
hleb-rubanau Sep 23, 2024
18 changes: 17 additions & 1 deletion flux/clusters/k8s-01/basemap/dev/kustomization.yaml
@@ -16,4 +16,20 @@ patches:
value:
- ./helm/basemap/values.yaml
- ./helm/basemap/values/values-dev.yaml
---
- target:
kind: CronJob
name: 'dev-basemap'
group: batch
version: v1
patch: |-
- op: add
path: /spec/jobTemplate/spec/template/spec/affinity/nodeAffinity
value:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hwn02.k8s-01.kontur.io
Comment on lines +19 to +34
💡 Codebase verification

Inconsistent Node Affinity Strategy Detected

The addition of a hard-coded hostname (hwn02.k8s-01.kontur.io) for node affinity in the CronJob introduces an inconsistency with the repository's existing practices, which utilize labels for node selection. This deviation can lead to maintenance challenges and reduced flexibility.

Recommendations:

  1. Align with Existing Practices:

    • Use node labels instead of hard-coded hostnames for affinity. For example:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-type
                  operator: In
                  values:
                    - basemap-storage
  2. Document Rationale:

    • If using a specific hostname is necessary, include comments within the YAML file or in the PR description to explain the reasoning behind this choice.
  3. Consider Flexibility:

    • Implement a fallback mechanism or use preferredDuringSchedulingIgnoredDuringExecution to ensure the CronJob can run even if the specified node is unavailable (see the sketch below).
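As a rough sketch of the fallback in item 3, the same JSON patch could express a soft preference instead of a hard requirement. This assumes a hypothetical node-type=basemap-storage label that would first have to be applied to the candidate nodes; it is not present in this repository:

  - op: add
    path: /spec/jobTemplate/spec/template/spec/affinity/nodeAffinity
    value:
      # soft preference: the scheduler favors labeled nodes but can fall back to any node
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type            # hypothetical label
                operator: In
                values:
                  - basemap-storage

The trade-off is that the job keeps running when no labeled node is schedulable, at the cost of losing the guarantee of co-location with the local-path volume.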
🔗 Analysis chain

Consider using more flexible node selection criteria

The addition of node affinity to the CronJob aligns with the PR objective of moving copy local path based PVs outside of hwn01. However, there are some concerns and suggestions to consider:

  1. Hard-coding a specific node hostname (hwn02.k8s-01.kontur.io) reduces flexibility and can cause issues if the node is unavailable or renamed.
  2. This approach might not scale well in a dynamic cluster environment.

Consider the following recommendations:

  1. Use more flexible node selection criteria, such as node labels or zones, instead of a specific hostname. For example:

    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type
                operator: In
                values:
                  - basemap-storage
  2. Document the reason for this specific node requirement in a comment within the YAML file or in the PR description. This will help future maintainers understand the rationale behind this decision.

  3. Consider implementing a fallback mechanism or using preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution to ensure the job can still run if the preferred node is unavailable.

To ensure this change doesn't conflict with other scheduling policies, let's check for any existing affinity rules or node selectors:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for existing affinity rules or node selectors in the dev environment
rg --type yaml 'affinity:|nodeSelector:' flux/clusters/k8s-01/basemap/dev/

Length of output: 76


Script:

#!/bin/bash
# Comprehensive search for affinity rules or node selectors in the entire repository
rg --type yaml 'affinity:|nodeSelector:'

Length of output: 1176


Script:

#!/bin/bash
# Search for node affinity definitions using labels instead of hard-coded hostnames
rg --type yaml 'nodeSelector' flux/clusters/k8s-01/ | rg -v 'kubernetes\.io/hostname'

Length of output: 2624

---
34 changes: 33 additions & 1 deletion flux/clusters/k8s-01/basemap/prod/kustomization.yaml
@@ -16,4 +16,36 @@ patches:
value:
- ./helm/basemap/values.yaml
- ./helm/basemap/values/values-prod.yaml
---
- target:
kind: Deployment
name: 'prod-basemap-tileserver'
group: apps
version: v1
patch: |-
- op: add
path: /spec/template/spec/affinity/nodeAffinity
value:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: NotIn
values:
- hwn01.k8s-01.kontur.io
- target:
kind: CronJob
name: 'prod-basemap'
group: batch
version: v1
patch: |-
- op: add
path: /spec/jobTemplate/spec/template/spec/affinity/nodeAffinity
value:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hwn02.k8s-01.kontur.io
---
34 changes: 33 additions & 1 deletion flux/clusters/k8s-01/basemap/test/kustomization.yaml
@@ -16,4 +16,36 @@ patches:
value:
- ./helm/basemap/values.yaml
- ./helm/basemap/values/values-test.yaml
---
- target: # temporary stick to hwn02
kind: CronJob
name: 'test-basemap'
group: batch
version: v1
patch: |-
- op: add
path: /spec/jobTemplate/spec/template/spec/affinity/nodeAffinity
value:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hwn02.k8s-01.kontur.io
- target: # temporary move out from hwn01
kind: Deployment
name: 'test-basemap-tileserver'
group: apps
version: v1
patch: |-
- op: add
path: /spec/template/spec/affinity/nodeAffinity
value:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: NotIn
values:
- hwn01.k8s-01.kontur.io
---
@@ -21,18 +21,12 @@ patchesJson6902:
kind: PostgresCluster
name: db-event-api
path: postgrescluster-custom-image.yaml
# - target:
# group: postgres-operator.crunchydata.com
# version: v1beta1
# kind: PostgresCluster
# name: db-event-api
# path: postgrescluster-resource-adjustment.yaml
- target:
- target: # new config for instances management
group: postgres-operator.crunchydata.com
version: v1beta1
kind: PostgresCluster
name: db-event-api
path: postgrescluster-assigning.yaml
path: postgrescluster-instances.yaml
Comment on lines +24 to +29
💡 Codebase verification

Incomplete removal of patch files

The patch files postgrescluster-resource-adjustment.yaml and postgrescluster-assigning.yaml are still referenced in several kustomization.yaml files across different overlays. Please remove these references to ensure consistency and prevent configuration issues.

  • References to postgrescluster-resource-adjustment.yaml:

    • flux/clusters/k8s-01/user-profile-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/event-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/DEV/kustomization.yaml
  • References to postgrescluster-assigning.yaml:

    • flux/clusters/k8s-01/user-profile-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/DEV/kustomization.yaml
🔗 Analysis chain

New patch added for instances management

A new patch postgrescluster-instances.yaml has been added for the PostgresCluster resource db-event-api. This addition aligns with the PR objective of potentially making the basemap stateless or moving persistent volumes.

To ensure this change is consistent with the rest of the configuration, please run the following script:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the contents of the new patch file and its references

# Test 1: Check if the new patch file exists
if [ -f "flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml" ]; then
    echo "postgrescluster-instances.yaml exists."
    # Display the contents of the file
    cat "flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml"
else
    echo "Error: postgrescluster-instances.yaml not found."
fi

# Test 2: Check for any other references to the new patch file
rg "postgrescluster-instances.yaml" --type yaml

Length of output: 3449


Script:

#!/bin/bash
# Description: Verify the removal of postgrescluster-resource-adjustment.yaml and postgrescluster-assigning.yaml

# Files to check
FILES=("postgrescluster-resource-adjustment.yaml" "postgrescluster-assigning.yaml")
BASE_DIR="flux/clusters/k8s-01/event-api-db/overlays/DEV"

for FILE in "${FILES[@]}"; do
    # Test 1: Check if the patch file exists
    if [ -f "${BASE_DIR}/${FILE}" ]; then
        echo "Error: ${FILE} still exists in ${BASE_DIR}."
    else
        echo "Success: ${FILE} has been removed from ${BASE_DIR}."
    fi

    # Test 2: Check for any references to the removed patch file in all kustomization.yaml files
    MATCHES=$(rg "${FILE}" --type yaml)
    if [ -z "$MATCHES" ]; then
        echo "Success: No references to ${FILE} found in any kustomization.yaml files."
    else
        echo "Error: References to ${FILE} found in the following kustomization.yaml files:"
        echo "$MATCHES"
    fi
done

Length of output: 12927


generatorOptions:
disableNameSuffixHash: true

This file was deleted.

@@ -0,0 +1,71 @@
# Always replace placeholder instances from `base`
- op: replace
path: /spec/instances
value:
- name: hwn04
replicas: 1
resources: &resources
limits:
memory: 512Gi # let Postgres use the page cache
cpu: '64' # set as event-api pool size + autovacuum workers + a bit for parallel workers
# empirical recommendations below
# memory: 100Gi
# cpu: '5'
requests:
memory: 22Gi # roughly shared_buffers + pool size * work_mem
cpu: '2' # if we don't have CPU we can run on potato
# empirical recommendations below
# memory: 7Gi
# cpu: '1'
dataVolumeClaimSpec: &storage
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
# empirical recommendations below -- actual storage varies from 2300Gi to 6200Gi
# storage: 2500Gi
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hwn04.k8s-01.kontur.io
- name: hwn03
replicas: 1
resources:
<<: *resources
dataVolumeClaimSpec:
<<: *storage
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hwn03.k8s-01.kontur.io

# Switchover section
# https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/cluster-management/administrative-tasks#changing-the-primary
# https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/cluster-management/administrative-tasks#targeting-an-instance
# 1. Requires fully-qualified instance id of new master
# 2. New master must be in sync with the current master (check via `patronictl list`)
# 3. It's recommended to retain the switchover section, to keep track of the last desired master both in code and k8s
- op: add
path: /spec/patroni/switchover
value:
enabled: true
targetInstance: db-event-api-hwn04-OVERRIDEME
#type: Failover
# trigger-switchover annotation triggers the actual switchover whenever it's updated
# value is arbitrary, but it's recommended to reference the date of the switchover
# the sequence ~1 escapes the slash in the annotation key (JSON Pointer syntax)
# See https://windsock.io/json-pointer-syntax-in-json-patches-with-kustomize/
- op: add
path: /metadata/annotations/postgres-operator.crunchydata.com~1trigger-switchover
value: OVERRIDEME
@@ -9,12 +9,12 @@ patchesJson6902:
kind: PostgresCluster
name: db-event-api
path: postgrescluster-s3-backups.yaml
- target:
- target: # new config for instances management
group: postgres-operator.crunchydata.com
version: v1beta1
kind: PostgresCluster
name: db-event-api
path: postgrescluster-resource-adjustment.yaml
path: postgrescluster-instances.yaml
- target:
group: postgres-operator.crunchydata.com
version: v1beta1
@@ -0,0 +1,71 @@
# Always replace placeholder instances from `base`
- op: replace
path: /spec/instances
value:
- name: hwn04
replicas: 1
resources: &resources
limits:
memory: 512Gi # let Postgres use the page cache
cpu: '64' # set as event-api pool size + autovacuum workers + a bit for parallel workers
# empirical recommendations below
# memory: 100Gi
# cpu: '5'
requests:
memory: 22Gi # roughly shared_buffers + pool size * work_mem
cpu: '2' # if we don't have CPU we can run on potato
# empirical recommendations below
# memory: 7Gi
# cpu: '1'
dataVolumeClaimSpec: &storage
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
# empirical recommendations below -- actual storage varies from 2300Gi to 6200Gi
# storage: 2500Gi
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hwn04.k8s-01.kontur.io
- name: hwn03
replicas: 1
resources:
<<: *resources
dataVolumeClaimSpec:
<<: *storage
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- hwn03.k8s-01.kontur.io

# Switchover section
# https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/cluster-management/administrative-tasks#changing-the-primary
# https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/cluster-management/administrative-tasks#targeting-an-instance
# 1. Requires fully-qualified instance id of new master
# 2. New master must be in sync with the current master (check via `patronictl list`)
# 3. It's recommended to retain the switchover section, to keep track of the last desired master both in code and k8s
- op: add
path: /spec/patroni/switchover
value:
enabled: true
targetInstance: db-event-api-hwn04-OVERRIDEME
#type: Failover
# trigger-switchover annotation triggers the actual switchover whenever it's updated
# value is arbitrary, but it's recommended to reference the date of the switchover
# the sequence ~1 escapes the slash in the annotation key (JSON Pointer syntax)
# See https://windsock.io/json-pointer-syntax-in-json-patches-with-kustomize/
- op: add
path: /metadata/annotations/postgres-operator.crunchydata.com~1trigger-switchover
value: OVERRIDEME
Comment on lines +53 to +71
💡 Codebase verification

Action Required: Replace Placeholder Values Before Applying Configuration

The switchover configuration contains placeholder values that must be updated before it can be safely applied:

  1. Line 63: targetInstance: db-event-api-hwn04-OVERRIDEME
  2. Line 71: value: OVERRIDEME

Please replace these placeholders with the appropriate instance identifiers and values to ensure proper functionality.
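For illustration only, a filled-in version of the two patch operations might look like the sketch below; the instance suffix (abcd) and the annotation value are hypothetical and must be taken from the running cluster (for example, the instance set member shown by patronictl list):

  - op: add
    path: /spec/patroni/switchover
    value:
      enabled: true
      targetInstance: db-event-api-hwn04-abcd   # hypothetical suffix; use the real instance id
  - op: add
    path: /metadata/annotations/postgres-operator.crunchydata.com~1trigger-switchover
    value: '2024-10-01'   # arbitrary; referencing the switchover date follows the file's own convention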

🔗 Analysis chain

LGTM: Switchover configuration with a note on placeholders.

The switchover configuration is well-documented with helpful comments explaining the process and requirements. The use of an annotation to trigger the switchover is a good practice for Kubernetes-based operations.

However, please note that there are placeholders that need to be replaced before this configuration can be used:

  1. Line 63: db-event-api-hwn04-OVERRIDEME
  2. Line 71: OVERRIDEME

Ensure these are replaced with appropriate values before applying this configuration.

To verify the presence of these placeholders, you can run:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for OVERRIDEME placeholders in the file
grep -n "OVERRIDEME" flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml

Length of output: 175

This file was deleted.

16 changes: 16 additions & 0 deletions flux/clusters/k8s-01/raster-tiler/dev/kustomization.yaml
@@ -16,4 +16,20 @@ patches:
value:
- ./helm/raster-tiler/values.yaml
- ./helm/raster-tiler/values/values-dev.yaml
- target:
kind: Deployment
name: 'dev-raster-tiler'
group: apps
version: v1
patch: |-
- op: add
path: /spec/template/spec/affinity/nodeAffinity
value:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: NotIn
values:
- hwn01.k8s-01.kontur.io
---
16 changes: 16 additions & 0 deletions flux/clusters/k8s-01/raster-tiler/geocint/kustomization.yaml
@@ -16,4 +16,20 @@ patches:
value:
- ./helm/raster-tiler/values.yaml
- ./helm/raster-tiler/values/values-geocint.yaml
- target:
kind: Deployment
name: 'geocint-raster-tiler'
group: apps
version: v1
patch: |-
- op: add
path: /spec/template/spec/affinity/nodeAffinity
value:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: NotIn
values:
- hwn01.k8s-01.kontur.io
---
2 changes: 1 addition & 1 deletion helm/basemap/Chart.yaml
@@ -15,7 +15,7 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.0.37
version: 0.0.45

#Don't use appVersion, use {{ .Values.images...tag's. }} instead. That's required for Flux automation - so that different
# stages can have different versions within the same branch watched by Flux.