19530 move copy local path based pvs outside of hwn01 or make basemap stateless #1409

Draft
wants to merge 21 commits into base: main

Conversation

@Komzpa Komzpa commented Oct 1, 2024

No description provided.

Also, a new instances config is added.
For details about the new format, see https://kontur.fibery.io/favorites/Tasks/document/1258#Tasks/document/Managing-PGO-instances-with-kustomize-1388

Purpose: test new instance management schema with event-api database

Desired effect: new instance launched on hwn04, replicating from master
…01.02)

Prerequisites:
* hwn04 must catch up to current master
* instance ID and trigger-switchover must be actualized in config

See more:
https://kontur.fibery.io/Tasks/document/Managing-PGO-instances-with-kustomize-1388/anchor=Switchover--4153b0a2-c4ff-441a-8913-158a3caac361
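The prerequisites above revolve around a named-instance patch plus a switchover trigger. A minimal sketch of the relevant fields, assuming the CrunchyData PGO v5 API (instance names and the OVERRIDEME placeholders are illustrative, not the actual contents of this PR's postgrescluster-instances.yaml):

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  annotations:
    # changing this value asks the operator to perform the configured switchover
    postgres-operator.crunchydata.com/trigger-switchover: "OVERRIDEME"
spec:
  patroni:
    switchover:
      enabled: true
      targetInstance: OVERRIDEME   # fully-qualified id of the instance that should become master
  instances:
    - name: hwn04                  # named instance intended to replicate from the current master
      replicas: 1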
…tances config to final format (#19530 step 01.04)

Prerequisites: wait until hwn03 catches up with the master (so we have a healthy replicated setup after this patch is applied)
Desired effect: instance set '01' is eliminated; the cluster runs with the two named instances created in the previous steps (see the sketch below)
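A rough illustration of the "final format" with two named instances replacing the legacy set '01' (names and the storage size are placeholders, not the values from this PR):

spec:
  instances:
    - name: hwn04                     # master
      replicas: 1
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
    - name: hwn03                     # replica
      replicas: 1
      # dataVolumeClaimSpec analogous to the one above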
….02)

* For the sake of dev/test/prod parity, and as a least-disruptive test of the new config
…rom hwn01 (#19530 step 03.03)

Desired effect:
* Deployment rollout is triggered due to the spec.template change
* A new pod is provisioned outside of hwn01
…rom hwn01 (#19530 step 03.04)

Desired effect:
  * Deployment rollout is triggered due to the spec.template change
  * A new pod is provisioned outside of hwn01 (see the node-affinity sketch below)
Modifying the Helm chart in a backwards-compatible manner, without immediate effect (therefore the chart version is not updated)
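For reference, scheduling a Deployment away from hwn01 is typically done with a node-affinity patch in the overlay's kustomization.yaml. A minimal sketch (the target name and node hostname follow the patterns mentioned in the review below, but the exact patch text is illustrative, not quoted from the diff); since the patch touches spec.template, applying it triggers the rollout described above:

patches:
  - target:
      kind: Deployment
      name: dev-raster-tiler          # hypothetical target; each overlay patches its own Deployment
    patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: dev-raster-tiler
      spec:
        template:
          spec:
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: kubernetes.io/hostname
                          operator: NotIn
                          values:
                            - hwn01.k8s-01.kontur.io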
…01 (#19530, step 04.02)

Desired effect:
* Deployment rollout triggered by changes in spec.template
* New pod launched outside of hwn01
… hwn01 (#19530, step 04.03)

Desired effect:
* Deployment rollout triggered by changes in spec.template
* New pod launched outside of hwn01
* This step serves as a least-disruptive test of the basemap-generator node affinity, and is also required for dev/test/prod parity
….02)

* One more least-disruptive test change to the basemap-generator config, also needed for dev/test/prod parity
… 07.01)

Prerequisites:
  * hwn04 must catch up to current master
  * instance ID and trigger-switchover must be actualized in config

Documentation:
https://kontur.fibery.io/Tasks/document/Managing-PGO-instances-with-kustomize-1388/anchor=Switchover--4153b0a2-c4ff-441a-8913-158a3caac361

Desired effect: hwn04 becomes the new master
… 07.02)

Prerequisites: ensure the previously suspended heavy cronjob (basemap-generator-dev) is not running on the target instance hwn03 (kill it if it's still around)

Purpose: migrate to a multi-instance configuration
Desired effect: a new slave starts on hwn03, replicating from hwn04
Tradeoff: the legacy instance set '01' is still running on hwn01/hwn02
Watch out for: replication lag and WAL accumulation on the master (hwn04); a quick way to check is sketched below
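While hwn03 catches up, replication lag can be watched with the tooling that ships with PGO-managed clusters. A sketch, assuming the standard PGO pod layout (the "database" container name, pod and namespace names are placeholders):

#!/bin/bash
# Patroni's view of the cluster: roles, state and lag per member.
kubectl exec -n <namespace> <master-pod> -c database -- patronictl list

# Streaming replication status as seen by PostgreSQL itself.
kubectl exec -n <namespace> <master-pod> -c database -- \
  psql -c "SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;"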
…stances config to final format (#19530 step 07.03)

Purpose: migrate to a multi-instance configuration
Prerequisites: wait until hwn03 catches up with the master (so we have a healthy replicated setup after this patch is applied)
Desired effect: instance set '01' is eliminated; the cluster runs with two named instances on hwn04 (master) and hwn03 (replica)
…>hwn02, unsuspend (#19530 step 08.01)

* Context: hwn01 is being prepared for draining; workloads are being repacked onto the 3 remaining nodes
* Temporary solution: all basemap-generators are isolated on hwn02 to avoid interference with other workloads
* Prerequisites: event-api-db-prod has been eliminated from hwn02 (see previous steps)
…n01->hwn02 (#19530 step 08.02)

* Context: hwn01 is being prepared for draining; workloads are being repacked onto the 3 remaining nodes
* Temporary solution: all basemap-generators are isolated on hwn02 to avoid interference with other workloads
* Prerequisites: event-api-db-prod has been eliminated from hwn02 (see previous steps)

* Desired results:
  * the cronjob is scheduled to run on hwn02 and is no longer bound to a PVC (see the sketch below)
  * hwn01 can be safely drained!
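As an illustration, pinning a CronJob to hwn02 is again a node-affinity patch, this time with an In operator; the name and hostname mirror those mentioned in the review below, but the exact patch contents are assumptions, not quotes from the diff:

patches:
  - target:
      kind: CronJob
      name: prod-basemap               # hypothetical target name
    patch: |-
      apiVersion: batch/v1
      kind: CronJob
      metadata:
        name: prod-basemap
      spec:
        jobTemplate:
          spec:
            template:
              spec:
                affinity:
                  nodeAffinity:
                    requiredDuringSchedulingIgnoredDuringExecution:
                      nodeSelectorTerms:
                        - matchExpressions:
                            - key: kubernetes.io/hostname
                              operator: In
                              values:
                                - hwn02.k8s-01.kontur.io

Dropping the PVC binding itself would then be handled by the chart's storage toggle (e.g. basemap.storageType: emptydir in the environment values), as discussed in the review below.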

coderabbitai bot commented Oct 1, 2024

Walkthrough

The changes involve updates to various Kubernetes configurations, primarily focusing on the introduction of node affinity specifications for CronJob and Deployment resources across different environments (dev, test, prod). Additionally, modifications to Helm chart configurations include version increments, conditional storage type configurations, and the introduction of new PersistentVolumeClaims. The overall structure of the configurations remains intact while enhancing scheduling constraints and storage management.

Changes

File Path Change Summary
flux/clusters/k8s-01/basemap/dev/kustomization.yaml Added patch for CronJob with node affinity; modified HelmRelease patch to include additional values files.
flux/clusters/k8s-01/basemap/prod/kustomization.yaml Added patches for Deployment and CronJob with node affinity rules; modified HelmRelease patch.
flux/clusters/k8s-01/basemap/test/kustomization.yaml Added patches for CronJob and Deployment with node affinity specifications.
flux/clusters/k8s-01/event-api-db/overlays/DEV/kustomization.yaml Updated patches for PostgresCluster, added new instance management patch, removed obsolete patch.
flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml Introduced configuration for PostgreSQL instances with resource limits and switchover settings.
flux/clusters/k8s-01/event-api-db/overlays/PROD/kustomization.yaml Updated patch path for instance management; retained existing patches.
flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml Added configuration for PostgreSQL instances with resource specifications and switchover settings.
flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-resource-adjustment.yaml Removed configuration for replicas in PostgreSQL cluster instance.
flux/clusters/k8s-01/raster-tiler/dev/kustomization.yaml Added patch for Deployment with node affinity to exclude specific nodes.
flux/clusters/k8s-01/raster-tiler/geocint/kustomization.yaml Added patch for Deployment with node affinity; modified HelmRelease patch to include new values files.
helm/basemap/Chart.yaml Updated chart version from 0.0.37 to 0.0.45.
helm/basemap/templates/cronjobs.yaml Added conditional logic for suspending CronJob and introduced an initContainer.
helm/basemap/templates/deployment.yaml Modified volume naming to accommodate different storage types.
helm/basemap/templates/pvc.yaml Introduced conditional creation of PVCs based on storage type.
helm/basemap/values.yaml Added storageType fields for tileserver and basemap, set to pvc.
helm/basemap/values/values-dev.yaml Set storageType for tileserver and basemap to emptydir.
helm/basemap/values/values-prod.yaml Set storageType for tileserver and basemap to emptydir.
helm/basemap/values/values-test.yaml Set storageType for tileserver and basemap to emptydir.
helm/raster-tiler/Chart.yaml Updated chart version from 0.0.77 to 0.0.79.
helm/raster-tiler/templates/deployment.yaml Introduced conditional volume management based on storageType.
helm/raster-tiler/templates/pvc.yaml Added conditional PVC creation based on storageType.
helm/raster-tiler/values.yaml Added storageType set to pvc; corrected YAML formatting.
helm/raster-tiler/values/values-dev.yaml Added storageType set to emptydir.
helm/raster-tiler/values/values-geocint.yaml Added storageType set to emptydir.
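Taken together, the chart changes above introduce a per-component storage toggle. A condensed sketch of how such a toggle fits together (the keys mirror the summary above; the exact shape of the real files is not reproduced here):

# helm/basemap/values.yaml (chart defaults)
tileserver:
  storageType: pvc
basemap:
  storageType: pvc

# helm/basemap/values/values-dev.yaml (environment override)
tileserver:
  storageType: emptydir
basemap:
  storageType: emptydir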

Poem

🐇 In the fields where the bunnies play,
New patches hop in, brightening the day.
With affinities set, and storage anew,
Kubernetes dances, with changes so true.
From dev to prod, the updates unfold,
A tale of enhancements, in YAML retold! 🌼




Komzpa commented Oct 1, 2024

@coderabbitai review


coderabbitai bot commented Oct 1, 2024

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (25)
helm/raster-tiler/Chart.yaml (1)

10-10: Version increment looks good.

The chart version has been appropriately incremented from 0.0.77 to 0.0.79, which is a good practice to reflect changes in the chart or its templates.

Note: The version jumped from 0.0.77 to 0.0.79, skipping 0.0.78. While this is not an issue, it might be worth clarifying if this was intentional or if there was an intermediate version that wasn't committed.

helm/basemap/values/values-prod.yaml (1)

Line range hint 1-26: Clarify the overall strategy for stateless architecture

The changes to both the tileserver and basemap components to use emptyDir storage in the production environment represent a significant shift towards a stateless architecture. While this aligns with the PR objective, it's important to address the following points:

  1. Data Persistence: Ensure that all critical data is either stored in S3 or can be regenerated as needed. Any data that was previously persisted locally will now be lost on pod restarts.

  2. Performance Implications: Consider if there are any performance impacts from potentially increased network calls to S3, especially for frequently accessed data.

  3. Disaster Recovery: Verify that the disaster recovery strategy has been updated to account for the lack of local persistent storage.

  4. Scaling and High Availability: This change could potentially improve the ability to scale and provide high availability, as pods become more interchangeable.

Could you provide more context on the overall strategy behind this move to a stateless architecture? Specifically:

  1. Are there plans to implement caching mechanisms to mitigate potential performance impacts?
  2. How will this change affect the application's startup time and behavior during S3 outages?
  3. Are there any components that still require persistent storage, and if so, how will they be handled?

To get a broader view of the storage usage across the cluster, you can run:

#!/bin/bash
# List all PersistentVolumeClaims in the production namespace
kubectl get pvc -n $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep prod)

# List all pods with their volume information
kubectl get pods -n $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep prod) -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.volumes[*]}{.name}{": "}{.emptyDir.medium}{"\n"}{end}{"\n"}{end}'

This information will help ensure that the move to a stateless architecture is comprehensive and well-planned across all components.

helm/basemap/values/values-test.yaml (1)

Line range hint 1-26: Review overall system architecture and data flow

The changes to both tileserver and basemap components to use emptyDir storage represent a significant shift in the system's architecture. While this aligns with the PR objective of making the basemap stateless, it necessitates a comprehensive review of the entire system's data flow and storage strategy.

Consider the following architectural aspects:

  1. Data Flow: Review and document how data now flows through the system, from initial ingestion to final storage in S3.
  2. Caching Strategy: Evaluate if any caching mechanisms are needed to maintain performance, given the lack of persistent local storage.
  3. Failure Recovery: Ensure the system can recover gracefully if a pod restarts, potentially needing to re-download or regenerate data.
  4. S3 Integration: Verify that the S3 integration is robust enough to handle potentially increased read/write operations.
  5. Database Usage: Clarify the role of the database in this new architecture, ensuring it's being used effectively alongside S3 and emptyDir storage.
  6. Performance Implications: Conduct thorough performance testing to ensure the new architecture meets required performance standards.

I recommend creating or updating system architecture diagrams to reflect these changes, which will help the team understand and maintain the new stateless design.

helm/raster-tiler/values.yaml (1)

20-20: LGTM! Consider adding a comment for clarity.

The addition of storageType: pvc is appropriate and aligns with the PR objective. It's correctly placed within the storage configuration section and properly indented.

Consider adding a brief comment above this line to explain the purpose and possible values of storageType. For example:

# Storage type for the application (e.g., 'pvc' for PersistentVolumeClaim)
storageType: pvc

This would enhance the configuration's self-documentation and make it easier for other developers to understand and modify if needed.

helm/raster-tiler/templates/pvc.yaml (2)

Line range hint 4-15: LGTM: Comprehensive metadata with room for minor improvement.

The metadata section is well-structured and follows Kubernetes best practices:

  • Annotations for Helm release information are correctly set.
  • Labels provide comprehensive information for resource management.
  • Dynamic configuration using Helm variables is implemented effectively.

Consider adding a helm.sh/chart label to identify the specific chart version:

  labels:
    helm.sh/chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}

This can be helpful for tracking which version of the chart created this resource.

🧰 Tools
🪛 yamllint

[error] 2-2: syntax error: expected '', but found ''

(syntax)


Line range hint 1-25: Summary: Well-implemented PVC template aligning with PR objectives.

This new PVC template effectively addresses the PR objective of improving storage management:

  1. The conditional creation based on .Values.storageType allows for flexible deployment configurations.
  2. The use of Helm variables throughout the template enables easy customization across different environments.
  3. The structure follows Kubernetes best practices for PVC definitions.

These changes contribute to making the basemap more configurable and potentially stateless, depending on how the PVC is utilized in the broader context of the application.

To fully achieve the PR objectives:

  1. Ensure that other components in the helm chart are updated to use this PVC when appropriate.
  2. Update the documentation to reflect the new storage configuration options.
  3. Consider adding a section in the values.yaml file to clearly define the available storage options and their implications.

As you continue to work on making the basemap stateless, consider:

  1. Evaluating if any data currently stored in the PVC can be moved to a separate stateful service.
  2. Implementing caching mechanisms to reduce reliance on persistent storage.
  3. Designing the application to function with ephemeral storage where possible, using the PVC only for critical, non-reproducible data.
🧰 Tools
🪛 yamllint

[error] 2-2: syntax error: expected '', but found ''

(syntax)

helm/basemap/Chart.yaml (1)

20-21: Reminder: Importance of image tag usage for Flux automation.

The comment about not using appVersion and instead using {{ .Values.images...tag's }} is crucial for Flux automation. This approach allows different stages to have different versions within the same branch watched by Flux. Ensure that this practice is consistently followed across all relevant files and deployments.

flux/clusters/k8s-01/raster-tiler/dev/kustomization.yaml (1)

Line range hint 1-34: Summary: Node affinity patch added to dev environment configuration

The changes to this Kustomization file are focused and align with the PR objectives. The addition of the node affinity patch for the 'dev-raster-tiler' Deployment ensures that pods are not scheduled on the 'hwn01.k8s-01.kontur.io' node, effectively moving resources outside of hwn01 as intended.

The existing configuration, including the HelmRelease patch, remains unchanged, which helps maintain stability in other aspects of the deployment. This targeted approach to implementing the required changes is commendable.

Consider documenting this node affinity rule in your project's architecture or operations guide. This will help maintain consistency across environments and inform future scaling decisions.

flux/clusters/k8s-01/raster-tiler/geocint/kustomization.yaml (1)

Line range hint 1-34: Overall, the changes look good and align with the PR objectives.

The modifications to the HelmRelease and the addition of the nodeAffinity configuration for the Deployment are well-structured and follow Kubernetes best practices. These changes appear to support the goal of moving copy local path based PVs outside of hwn01 or making the basemap stateless.

However, to ensure these changes don't introduce any unforeseen issues:

  1. Verify the existence and content of the new Helm values files.
  2. Check the impact of the new nodeAffinity rules on service availability and performance.
  3. Review the overall node utilization after these changes are applied.

Consider documenting the rationale behind these changes in the project documentation or README. This will help future maintainers understand the design decisions made here.

helm/basemap/templates/pvc.yaml (4)

Line range hint 1-15: LGTM! Consider adding storage class specification.

The tileserver PVC definition looks good. The conditional creation based on storage type, environment-specific naming, and consistent resource allocation are all well implemented.

Consider adding a storageClassName field to the PVC specification. This allows for more control over the type of storage provisioned. For example:

spec:
  storageClassName: {{ .Values.tileserver.storageClass | default "standard" }}
  accessModes:
  - ReadWriteOnce
  # ... rest of the spec

This change would require adding a storageClass field to the tileserver section in your values file, with a default value to maintain backwards compatibility.

🧰 Tools
🪛 yamllint

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


Line range hint 17-31: LGTM! Consider adding storage class specification.

The basemap processor PVC definition is well-structured and consistent with the tileserver PVC. The conditional creation and environment-specific naming are appropriately implemented.

As with the tileserver PVC, consider adding a storageClassName field:

spec:
  storageClassName: {{ .Values.basemap.storageClass | default "standard" }}
  accessModes:
  - ReadWriteOnce
  # ... rest of the spec

This change would require adding a storageClass field to the basemap section in your values file, with a default value to maintain backwards compatibility.


Line range hint 32-45: LGTM! Consider separate conditional and storage class specification.

The Postgres PVC definition is well-structured and consistent with the other PVCs. However, there are two suggestions for improvement:

  1. Consider creating a separate conditional block for the Postgres PVC. This allows for more flexibility in configuring storage types independently for each component.

  2. As with the other PVCs, consider adding a storageClassName field:

Here's a suggested implementation incorporating both improvements:

{{- if eq .Values.postgres.storageType "pvc" }}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ .Values.envName }}-basemap-postgres-pvc
  namespace: {{ .Release.Namespace }}
spec:
  storageClassName: {{ .Values.postgres.storageClass | default "standard" }}
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: {{ .Values.pvc.postgresStorageSize }}
    limits:
      storage: {{ .Values.pvc.postgresStorageSize }}
{{- end }}

This change would require adding storageType and storageClass fields to the postgres section in your values file, with default values to maintain backwards compatibility.


Line range hint 1-46: Overall structure is good, but consider consistent conditionals and ignore YAML lint error.

The overall structure of the file is consistent and well-organized. However, there are two points to consider:

  1. For consistency and flexibility, consider using separate conditional blocks for each PVC. This allows for independent configuration of storage types for each component (tileserver, processor, and Postgres).

  2. The YAML syntax error reported by yamllint at the beginning of the file is likely a false positive due to the Helm template syntax. You can safely ignore this error, as it's common for YAML linters to misinterpret Helm template directives.

To improve consistency, consider restructuring the file like this:

{{- if eq .Values.tileserver.storageType "pvc" }}
# Tileserver PVC definition
{{- end }}

{{- if eq .Values.basemap.storageType "pvc" }}
# Basemap processor PVC definition
{{- end }}

{{- if eq .Values.postgres.storageType "pvc" }}
# Postgres PVC definition
{{- end }}

This structure allows for more granular control over each PVC's creation conditions.

flux/clusters/k8s-01/basemap/prod/kustomization.yaml (1)

Line range hint 1-51: Summary: Kustomization changes align with PR objectives, but consider overall cluster strategy.

The modifications to this Kustomization file successfully implement node affinity rules that align with the PR objective of moving copy local path based PVs outside of hwn01 or making basemap stateless. The changes include:

  1. Retaining the existing HelmRelease configuration.
  2. Adding a node anti-affinity rule for the 'prod-basemap-tileserver' Deployment to avoid 'hwn01'.
  3. Adding a node affinity rule for the 'prod-basemap' CronJob to require 'hwn02'.

These changes will affect how these resources are scheduled in the cluster. While they achieve the immediate goal, it's important to consider the following:

  1. The overall load distribution across the cluster nodes.
  2. The fault tolerance of the system, especially for the CronJob that's tied to a specific node.
  3. The long-term strategy for resource allocation and whether this approach scales well as the cluster grows or changes.

Consider documenting the reasons for these specific node affinity rules and how they fit into the broader cluster management strategy. This will help maintain clarity as the system evolves and new team members onboard.

Additionally, it might be beneficial to set up monitoring or alerts to ensure that these scheduling constraints don't lead to resource contention issues in the future.

flux/clusters/k8s-01/basemap/test/kustomization.yaml (1)

19-50: Address the temporary nature of the solution and consider long-term architectural improvements.

The changes implemented in this PR effectively move workloads away from hwn01, which aligns with one part of the PR objective. However, there are several architectural considerations to address:

  1. Both patches are marked as temporary solutions, indicating a need for a more permanent fix.
  2. The alternative objective of making basemap stateless is not addressed by these changes.
  3. The current approach of using node-specific affinity rules may not be the most scalable or maintainable solution in the long run.

Consider the following recommendations for improving the overall architecture:

  1. Develop a plan to transition from these temporary node affinity rules to a more robust, scalable solution.
  2. Investigate the possibility of making basemap stateless, as mentioned in the PR objective. This could potentially eliminate the need for node-specific scheduling altogether.
  3. If persistent volumes are the root cause of node-specific requirements, consider implementing a storage solution that allows for more flexible PV management across nodes.
  4. Document the current limitations and the reasoning behind these temporary changes to facilitate future improvements.

Would you like assistance in drafting a plan for transitioning to a more permanent solution or exploring the option of making basemap stateless?

helm/basemap/values.yaml (1)

26-30: Summary: Storage improvements and job control enhancements

The changes in this file collectively improve the storage configuration for both the tileserver and basemap components, aligning with the PR objective of enhancing persistent volume handling. The explicit setting of job suspension for the basemap adds clarity to its operational behavior.

Key points to consider:

  1. Ensure that the PVC configurations in the templates match these new settings.
  2. Verify that the storage sizes defined in the pvc section are appropriate for both components.
  3. Consider documenting these changes in the chart's README or CHANGELOG.
  4. Test the deployment in a non-production environment to ensure smooth transitions, especially regarding data persistence and job scheduling.

Given these changes, it might be beneficial to review the overall storage strategy for the application. Consider:

  1. Implementing a consistent approach to storage type configuration across all components.
  2. Evaluating the impact on backup and restore procedures, if any.
  3. Assessing whether these changes affect the application's scalability or cloud provider portability.
flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml (3)

2-2: Remove trailing spaces to improve code cleanliness.

There are trailing spaces on lines 2, 12, 13, and 19. While these don't affect functionality, removing them improves code cleanliness and adheres to best practices.

You can remove these trailing spaces to improve code cleanliness. Most modern text editors have settings to automatically trim trailing whitespace.

Also applies to: 12-13, 19-19

🧰 Tools
🪛 yamllint

[error] 2-2: trailing spaces

(trailing-spaces)


53-71: Approve switchover configuration with suggestions for improvement.

The addition of the switchover section is a good practice for enabling high availability in your PostgreSQL cluster. A few suggestions for improvement:

  1. Add a clear warning about the importance of updating the placeholder values (OVERRIDEME) before applying this configuration in any environment.

  2. Consider documenting the complete switchover process in a separate document and add a link to it in these comments. This would include steps like:

    • How to determine the fully-qualified instance id of the new master
    • How to check if the new master is in sync with the current master
    • The process of updating the annotation to trigger the switchover
  3. Add a comment explaining the significance of the postgres-operator.crunchydata.com/trigger-switchover annotation and how it should be used in practice.

Here's a suggested addition to the comments:

# WARNING: Ensure all OVERRIDEME placeholders are replaced with appropriate values before applying this configuration.
# For a detailed guide on performing a switchover, see: [Add link to your switchover documentation here]

Would you like assistance in creating a detailed switchover guide document?
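One way such documentation could illustrate the annotation's use (a sketch assuming the standard CrunchyData PGO workflow; cluster and namespace names are placeholders):

#!/bin/bash
# Setting or changing the annotation value prompts the operator to perform the
# switchover configured under spec.patroni.switchover.
kubectl annotate postgrescluster <cluster-name> -n <namespace> \
  postgres-operator.crunchydata.com/trigger-switchover="$(date --iso-8601=seconds)" \
  --overwrite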


1-71: Summary of review: Improvements with areas needing attention

This configuration file introduces significant changes to the PostgreSQL cluster setup, including detailed instance configurations and switchover functionality. While these changes generally improve the setup, there are several areas that require attention:

  1. Resource Allocation: The current CPU and memory limits seem exceptionally high, especially for a DEV environment. Consider reviewing and adjusting these based on actual usage patterns.

  2. Storage Configuration: There's a large discrepancy between the allocated storage (100Gi) and the commented empirical recommendation (2500Gi). This needs to be addressed to ensure appropriate storage provisioning.

  3. Node Affinity: The strict node affinity might limit flexibility. Consider using less strict affinity rules for better resilience and resource utilization.

  4. Switchover Functionality: The addition of switchover capability is positive, but ensure all placeholder values are replaced and the process is well-documented.

  5. Code Cleanliness: Minor issues like trailing spaces should be addressed for better maintainability.

Overall, these changes provide a good foundation for a robust PostgreSQL setup, but careful consideration of the highlighted areas will further improve the configuration's efficiency and reliability.

To ensure this configuration scales well across different environments (DEV, TEST, PROD), consider:

  1. Using environment variables or helm chart values for resource allocations and storage sizes.
  2. Implementing a strategy for gradually increasing resources and storage as you move from DEV to PROD.
  3. Developing a comprehensive guide for managing and maintaining this PostgreSQL setup across different environments.
🧰 Tools
🪛 yamllint

[error] 2-2: trailing spaces

(trailing-spaces)


[error] 12-12: trailing spaces

(trailing-spaces)


[error] 13-13: trailing spaces

(trailing-spaces)


[error] 19-19: trailing spaces

(trailing-spaces)

flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml (2)

8-19: Review resource allocations and consider removing outdated comments.

The resource allocations, particularly for memory (512Gi limit), seem quite high. While this may be necessary for your use case, it's important to monitor actual usage to ensure efficient resource utilization.

Consider removing or updating the commented out "empirical recommendations" if they are no longer relevant. This will improve code readability. If they are still relevant, consider adding a comment explaining their purpose and when they might be used.

The use of YAML anchors for resources and storage specifications is a good practice for maintaining consistency and reducing duplication.

🧰 Tools
🪛 yamllint

[error] 12-12: trailing spaces

(trailing-spaces)


[error] 13-13: trailing spaces

(trailing-spaces)


[error] 19-19: trailing spaces

(trailing-spaces)


2-2: Minor: Remove trailing spaces.

The static analysis tool detected trailing spaces on lines 2, 12, 13, and 19. While this doesn't affect functionality, removing them improves code cleanliness.

You can remove these trailing spaces to improve the overall code quality. Many text editors have options to automatically trim trailing whitespace, which can help prevent this issue in the future.

Also applies to: 12-13, 19-19

🧰 Tools
🪛 yamllint

[error] 2-2: trailing spaces

(trailing-spaces)

helm/basemap/templates/deployment.yaml (1)

46-53: Flexible volume configuration looks good!

The changes to the volume definition improve flexibility by allowing different storage types. The use of storageType in the name is a good practice for clarity and configuration management.

A minor suggestion for improvement:

Consider adding a comment explaining the purpose of the storageType variable and its possible values for better maintainability. For example:

volumes:
  # storageType can be either "pvc" for persistent volume claim or any other value for emptyDir
  - name: {{ .Values.envName }}-basemap-tileserver-{{ .Values.tileserver.storageType }}
  {{- if eq .Values.tileserver.storageType "pvc" }}
  ...
helm/raster-tiler/templates/deployment.yaml (1)

128-134: Flexible storage configuration looks good!

The introduction of conditional storage type configuration is a good improvement. It allows for more flexibility in different environments and use cases. The use of a PersistentVolumeClaim when storageType is "pvc" provides a persistent storage option, while the emptyDir with a size limit offers a controlled temporary storage solution.

Consider adding a comment explaining the purpose and implications of each storage type option. This can help other developers understand the reasoning behind this configuration. For example:

{{- if eq .Values.storageType "pvc" }}
# Use PersistentVolumeClaim for persistent storage across pod restarts
persistentVolumeClaim:
  claimName: {{ .Values.envName }}-raster-tiler
{{- else }}
# Use emptyDir for temporary storage, limited to specified size
emptyDir:
  sizeLimit: {{ .Values.storage_size }}
{{- end }}
helm/basemap/templates/cronjobs.yaml (2)

Line range hint 24-51: LGTM: Init container for PVC usage check added

The addition of the init container for checking PVC usage is a good practice. It ensures that the PVC has enough space before starting the main containers. A few observations:

  1. The condition for adding the init container is correct (only when storageType is "pvc").
  2. Environment variables are properly set, including the use of fieldRef for the current pod name.
  3. Resource limits and requests are correctly defined using values from the Helm chart.

Suggestion for improvement:
Consider adding a failureThreshold and successThreshold to the init container to control how many times it should retry before considering the job as failed. This can help in cases of temporary issues with PVC checks.

Here's an example of how you could add these fields:

initContainers:
- name: check-pvc-usage
  # ... (existing configuration) ...
  failureThreshold: 3
  successThreshold: 1
🧰 Tools
🪛 yamllint

[warning] 20-20: wrong indentation: expected 4 but found 6

(indentation)


133-148: LGTM: Flexible volume definitions added

The changes to the volume definitions are well-implemented and provide good flexibility:

  1. The configuration now supports both PVC and emptyDir storage types based on the storageType value.
  2. PersistentVolumeClaim is correctly used when storageType is "pvc", otherwise emptyDir is used.
  3. The emptyDir configuration includes a sizeLimit, which is a good practice to prevent unbounded growth.
  4. The changes are applied consistently for both processor and postgres volumes.

Suggestion for improvement:
Consider adding a validation for the storageType value in the Helm chart's values.yaml or in the templates to ensure it only accepts valid options (e.g., "pvc" or "emptyDir"). This can prevent potential issues if an invalid storage type is specified.

You could add a validation in the templates like this:

{{- if not (or (eq .Values.basemap.storageType "pvc") (eq .Values.basemap.storageType "emptyDir")) }}
{{- fail "storageType must be either 'pvc' or 'emptyDir'" }}
{{- end }}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between fe1f093 and 70bfa1b.

📒 Files selected for processing (25)
  • flux/clusters/k8s-01/basemap/dev/kustomization.yaml (1 hunks)
  • flux/clusters/k8s-01/basemap/prod/kustomization.yaml (1 hunks)
  • flux/clusters/k8s-01/basemap/test/kustomization.yaml (1 hunks)
  • flux/clusters/k8s-01/event-api-db/overlays/DEV/kustomization.yaml (1 hunks)
  • flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-assigning.yaml (0 hunks)
  • flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml (1 hunks)
  • flux/clusters/k8s-01/event-api-db/overlays/PROD/kustomization.yaml (1 hunks)
  • flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml (1 hunks)
  • flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-resource-adjustment.yaml (0 hunks)
  • flux/clusters/k8s-01/raster-tiler/dev/kustomization.yaml (1 hunks)
  • flux/clusters/k8s-01/raster-tiler/geocint/kustomization.yaml (1 hunks)
  • helm/basemap/Chart.yaml (1 hunks)
  • helm/basemap/templates/cronjobs.yaml (5 hunks)
  • helm/basemap/templates/deployment.yaml (3 hunks)
  • helm/basemap/templates/pvc.yaml (3 hunks)
  • helm/basemap/values.yaml (1 hunks)
  • helm/basemap/values/values-dev.yaml (2 hunks)
  • helm/basemap/values/values-prod.yaml (2 hunks)
  • helm/basemap/values/values-test.yaml (2 hunks)
  • helm/raster-tiler/Chart.yaml (1 hunks)
  • helm/raster-tiler/templates/deployment.yaml (1 hunks)
  • helm/raster-tiler/templates/pvc.yaml (2 hunks)
  • helm/raster-tiler/values.yaml (1 hunks)
  • helm/raster-tiler/values/values-dev.yaml (1 hunks)
  • helm/raster-tiler/values/values-geocint.yaml (1 hunks)
💤 Files with no reviewable changes (2)
  • flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-assigning.yaml
  • flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-resource-adjustment.yaml
🧰 Additional context used
🪛 yamllint
flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml

[error] 2-2: trailing spaces

(trailing-spaces)


[error] 12-12: trailing spaces

(trailing-spaces)


[error] 13-13: trailing spaces

(trailing-spaces)


[error] 19-19: trailing spaces

(trailing-spaces)

flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml

[error] 2-2: trailing spaces

(trailing-spaces)


[error] 12-12: trailing spaces

(trailing-spaces)


[error] 13-13: trailing spaces

(trailing-spaces)


[error] 19-19: trailing spaces

(trailing-spaces)

helm/basemap/templates/cronjobs.yaml

[warning] 20-20: wrong indentation: expected 4 but found 6

(indentation)

helm/basemap/templates/pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

🔇 Additional comments (42)
helm/basemap/values/values-dev.yaml (4)

Line range hint 1-26: Summary of review and next steps

The changes to values-dev.yaml represent a significant shift towards a stateless architecture for both the tileserver and basemap components. While this aligns with the PR objectives, several critical areas require further clarification and documentation:

  1. Data persistence strategies for both components
  2. Performance and scalability implications
  3. Integration between ephemeral storage and S3
  4. Impact on the overall system architecture and data flow

Next steps:

  1. Provide responses to the questions raised in the review comments
  2. Update or create architectural documentation to reflect these changes
  3. Consider conducting performance tests to validate the impact of these changes
  4. Ensure that operational procedures are updated to account for the new architecture

The changes are approved pending the clarifications and documentation updates requested above.


26-26: Evaluate the impact of emptyDir storage on basemap processing and data flow

The addition of storageType: emptydir for the basemap component raises several important considerations:

  1. Processing persistence: Given the processing schedule ("0 0 */2 * *"), how will the use of ephemeral storage affect the continuity of processing between runs?
  2. Data flow: With S3 storage configured (s3: path: s3://basemap-locker01/DEV), clarify the data flow between the ephemeral local storage and S3.
  3. Database implications: How does this change impact the PostgreSQL database configuration (db: image: "nexus.kontur.io:8084/postgis/postgis")?

Let's verify the related configurations to understand the full context:

#!/bin/bash
# Search for basemap-related configurations and S3 usage
rg --type yaml 'basemap:' -A 15
rg --type yaml 's3:'

Consider documenting the new data flow and processing strategy to ensure all team members understand the implications of this architectural change.


13-13: Document the new stateless architecture and data flow

The changes to both tileserver and basemap components to use emptydir storage represent a significant shift towards a stateless architecture, aligning with the PR objectives. However, this architectural change has far-reaching implications that should be carefully considered and documented:

  1. Overall system resilience: How does this change affect the system's ability to handle component failures or restarts?
  2. Data consistency: What mechanisms are in place to ensure data consistency between ephemeral storage, S3, and the PostgreSQL database?
  3. Scalability: How does this new architecture support horizontal scaling of the tileserver and basemap components?
  4. Performance: What is the expected impact on system performance, particularly for frequently accessed tiles or during basemap processing?
  5. Disaster recovery: How has the disaster recovery strategy been adjusted to account for the increased reliance on ephemeral storage?

I recommend creating or updating architectural documentation to reflect these changes. This should include:

  • A diagram illustrating the new data flow between components
  • Clear explanations of how data persistence is maintained despite the use of ephemeral storage
  • Updated operational procedures for managing and monitoring the system

Would you like assistance in creating a template for this architectural documentation?

Let's check for existing documentation that might need updating:

#!/bin/bash
# Search for documentation files related to architecture or data flow
fd -e md -e txt | xargs rg -i 'architecture|data flow|storage'

Also applies to: 26-26


13-13: Clarify the implications of using emptyDir for tileserver storage

The addition of storageType: emptydir for the tileserver implies that you're using ephemeral storage. This aligns with the PR objective of making components stateless, but it's important to consider the following:

  1. Data persistence: emptyDir volumes are deleted when a Pod is removed. How will this affect the tileserver's data between restarts or redeployments?
  2. Performance: Depending on the size and frequency of tile requests, how might this impact the tileserver's performance compared to persistent storage?
  3. Scalability: With ephemeral storage, how will you handle scaling the tileserver across multiple replicas?

Could you provide more context on the data persistence strategy for the tileserver? Let's verify if there are any related configurations:

✅ Verification successful

Verified the use of storageType: emptydir in helm/basemap/values/values-dev.yaml aligns with other environment configurations. No issues detected regarding the data persistence strategy.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for tileserver-related configurations
rg --type yaml 'tileserver:' -A 10

Length of output: 220


Script:

#!/bin/bash
# Search for storage or persistence configurations related to tileserver
rg --type yaml 'tileserver' -A 10 | rg 'storageType|persistence|volume'

Length of output: 1116

helm/basemap/values/values-prod.yaml (2)

26-26: Evaluate the use of emptyDir for basemap storage

The addition of storageType: emptydir for the basemap component raises some concerns:

  1. EmptyDir storage is ephemeral, which means any local data will be lost when the pod is terminated or rescheduled.
  2. The basemap configuration includes an S3 path and URL, suggesting that some data is stored externally. However, there might be local data or caches that could be affected by this change.
  3. The processingSchedule indicates that periodic tasks are run. The output or intermediate data from these tasks might need persistent storage.

Please confirm if:

  1. All necessary data for the basemap component is stored in S3, and local storage is only used for temporary processing.
  2. The periodic processing tasks do not require persistent local storage between runs.

To verify the current storage usage and configuration, you can run:

#!/bin/bash
# Check for any existing PersistentVolumeClaims or PersistentVolumes related to basemap
kubectl get pvc,pv -l app=basemap -n $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep prod)

# Check if there are any volumes mounted to the basemap pods
kubectl get pods -l app=basemap -n $(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep prod) -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{range .spec.volumes[*]}{.name}{": "}{.emptyDir.medium}{"\n"}{end}{"\n"}{end}'

This will help us understand the current storage configuration and ensure that the change to emptyDir doesn't negatively impact the basemap component's functionality.


13-13: Verify the implications of using emptyDir for tileserver storage

The addition of storageType: emptydir for the tileserver component means that the storage will be ephemeral and tied to the pod's lifecycle. This change could have significant implications:

  1. Any data stored by the tileserver will be lost when the pod is terminated or rescheduled.
  2. This might affect the persistence of cached tiles or other data that the tileserver might need to retain across restarts.

Could you please clarify the intent behind this change? If persistent storage is not required for the tileserver, this change is appropriate. However, if any data needs to persist across pod restarts, we might need to reconsider this approach.

To verify the current storage configuration, you can run:

This will help us understand if there are any existing persistent storage resources that might be affected by this change.

helm/basemap/values/values-test.yaml (2)

13-13: Verify the impact of using emptyDir storage for tileserver

The addition of storageType: emptydir for the tileserver aligns with the PR objective of making the basemap stateless. However, this change has important implications:

  1. Data stored by the tileserver will not persist across pod restarts or rescheduling.
  2. This could potentially impact the performance and functionality of the tileserver, depending on how it uses local storage.

Please confirm the following:

  1. The tileserver can function correctly without persistent storage.
  2. Any necessary data is either fetched from external sources or regenerated as needed.
  3. The performance impact of using emptyDir has been evaluated and is acceptable for the test environment.

You may want to run load tests to ensure the system performs adequately with this configuration.


26-26: Confirm the implications of using emptyDir storage for basemap

The addition of storageType: emptydir for the basemap component is consistent with the similar change made to the tileserver, further supporting the PR objective of making the basemap stateless. However, this change requires careful consideration:

  1. It ensures that no data persists locally across pod restarts or rescheduling.
  2. This could significantly impact the basemap processing workflow, especially considering the presence of a processing schedule and S3 storage configuration.

Please address the following points:

  1. Confirm that the basemap processing can function correctly without persistent local storage.
  2. Verify that all necessary data can be retrieved from S3 or other external sources as needed.
  3. Ensure that the processing results are properly stored in S3 or other persistent storage.
  4. Check if this change affects the processing schedule efficiency, as data might need to be re-downloaded or regenerated more frequently.

Additionally, consider documenting these architectural changes and their implications for the team's reference.

helm/raster-tiler/templates/pvc.yaml (4)

1-1: LGTM: Conditional PVC creation based on storage type.

The use of a conditional block to create the PVC only when .Values.storageType is "pvc" is a good practice. It allows for flexible deployment configurations and ensures resources are created only when needed.


2-3: LGTM: Correct API version and resource kind for PVC.

The API version "v1" and resource kind "PersistentVolumeClaim" are correctly specified for creating a PVC in Kubernetes.

🧰 Tools
🪛 yamllint

[error] 2-2: syntax error: expected '', but found ''

(syntax)


25-25: LGTM: Proper closure of conditional block.

The conditional block is correctly closed, ensuring that the entire PVC definition is only included when .Values.storageType is "pvc".


Line range hint 16-24: LGTM: Well-structured PVC spec with flexible configuration.

The spec section is correctly defined and uses Helm variables effectively:

  • Storage size and class are configurable via Helm values.
  • Volume mode is set to Filesystem, which is appropriate for most use cases.

Verify that ReadWriteOnce is the intended access mode for this PVC. If multiple pods need simultaneous access, consider using ReadWriteMany (if supported by the storage class).

To check the available access modes for your storage class, run:
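A possible check (the exact commands are a suggestion, assuming kubectl access to the cluster):

#!/bin/bash
# List storage classes, then inspect the access modes of the PVs bound to the relevant class.
kubectl get storageclass
kubectl get pv -o custom-columns=NAME:.metadata.name,STORAGECLASS:.spec.storageClassName,ACCESSMODES:.spec.accessModes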

Ensure that the access mode aligns with your application's requirements.

flux/clusters/k8s-01/event-api-db/overlays/DEV/kustomization.yaml (1)

Line range hint 1-33: Verify removal of previously existing patches

The AI summary mentions the removal of postgrescluster-resource-adjustment.yaml and postgrescluster-assigning.yaml patches, but these changes are not visible in the provided code snippet. Let's verify if these patches have indeed been removed.

Run the following script to check for the existence and references of these patches:

If the script confirms the removal of these patches and finds no remaining references, then the changes are consistent with the AI summary. Otherwise, we may need to investigate further or update the kustomization file accordingly.

✅ Verification successful

Patch removals in DEV overlay verified successfully.

All specified patches have been removed from flux/clusters/k8s-01/event-api-db/overlays/DEV/kustomization.yaml. No remaining references to these patches were found within this file.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the removal of previously existing patches

# Test 1: Check if the removed patch files still exist
for file in "postgrescluster-resource-adjustment.yaml" "postgrescluster-assigning.yaml"; do
    if [ -f "flux/clusters/k8s-01/event-api-db/overlays/DEV/$file" ]; then
        echo "Warning: $file still exists."
    else
        echo "$file has been removed."
    fi
done

# Test 2: Check for any remaining references to the removed patches
rg "postgrescluster-resource-adjustment.yaml|postgrescluster-assigning.yaml" --type yaml

Length of output: 3555

helm/basemap/Chart.yaml (2)

18-18: Version bump approved.

The chart version has been incremented from 0.0.37 to 0.0.45, which is a significant minor version increase. This aligns with the expectation of substantial changes related to the PR objectives.


18-18: Verify completeness of changes for PR objectives.

The version bump suggests significant changes, but it's important to ensure all necessary modifications for moving copy local path based PVs or making basemap stateless are included in this release.

Please confirm that all required changes for the PR objectives "move copy local path based pvs outside of hwn01 or make basemap stateless" are implemented and tested in this version. Consider running the following script to check for related changes:

✅ Verification successful

All required changes are present and correctly implemented for version 0.0.45.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for changes related to PV management or stateless modifications

# Test: Search for PV-related changes
echo "Checking for PV-related changes:"
rg --type yaml 'persistentVolume|pv:|storageClass' helm/

# Test: Search for stateless-related modifications
echo "Checking for stateless-related changes:"
rg --type yaml 'statefulset|deployment' helm/

# Test: Look for any new environment variables or config changes that might indicate stateless modifications
echo "Checking for potential stateless config changes:"
rg --type yaml 'env:|configMap|secret' helm/

Length of output: 17768

flux/clusters/k8s-01/raster-tiler/dev/kustomization.yaml (1)

19-34: LGTM! The node affinity patch aligns with the PR objectives.

The added patch correctly implements node affinity rules for the 'dev-raster-tiler' Deployment, preventing scheduling on the 'hwn01.k8s-01.kontur.io' node. This change aligns well with the PR objective of moving resources outside of hwn01.

To ensure this change doesn't inadvertently affect other environments, let's verify that similar patches aren't present in production configurations:
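A plausible way to check (the search pattern is a suggestion):

#!/bin/bash
# Look for hwn01-related affinity rules in the other raster-tiler overlays.
rg -l "hwn01.k8s-01.kontur.io" flux/clusters/k8s-01/raster-tiler --type yaml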

flux/clusters/k8s-01/raster-tiler/geocint/kustomization.yaml (2)

19-34: LGTM! Verify impact on service availability.

The addition of the nodeAffinity configuration to the Deployment 'geocint-raster-tiler' looks good. This change aligns with the PR objective of moving copy local path based PVs outside of hwn01.

To ensure this change doesn't negatively impact service availability, run the following script:

#!/bin/bash
# Description: Verify the impact of nodeAffinity on service availability

# Test 1: Check if there are enough nodes available that meet the new affinity requirements
echo "Checking available nodes:"
kubectl get nodes -o wide | grep -v hwn01.k8s-01.kontur.io

# Test 2: Check if the Deployment can be scheduled with the new affinity rules
echo "Checking Deployment schedulability:"
kubectl get deployment geocint-raster-tiler -o yaml | rg "Progressing|Available"

# Test 3: Check for any potential single point of failure
echo "Checking for potential single point of failure:"
kubectl get pods -l app=geocint-raster-tiler -o wide

Also, consider if this change might concentrate too many pods on a subset of nodes, potentially impacting performance. You may want to review the overall node utilization after this change.


Line range hint 14-17: LGTM! Verify the new values files.

The update to the valuesFiles field in the HelmRelease patch looks good. This change aligns with the PR objective of potentially refactoring the basemap or relocating PVs.

To ensure the new values files exist and contain the expected configurations, run the following script:
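A sketch of such a check; the file paths in the loop are placeholders and should be replaced with the actual valuesFiles entries from this kustomization:

#!/bin/bash
# Show the valuesFiles referenced by the geocint HelmRelease patch
rg -A 5 'valuesFiles' flux/clusters/k8s-01/raster-tiler/geocint/kustomization.yaml

# Confirm each referenced file exists in the chart
for f in helm/raster-tiler/values.yaml helm/raster-tiler/values-geocint.yaml; do
    if [ -f "$f" ]; then echo "$f exists"; else echo "Warning: $f is missing"; fi
done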

flux/clusters/k8s-01/event-api-db/overlays/PROD/kustomization.yaml (3)

17-17: Replaced resource adjustment patch with instances patch

The postgrescluster-resource-adjustment.yaml patch has been replaced with postgrescluster-instances.yaml. This change suggests a shift from general resource adjustments to more specific instance management.

Let's verify the removal of the old patch file and compare the contents of the old and new patch files:

#!/bin/bash
# Check if the old patch file still exists
if [ -f "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-resource-adjustment.yaml" ]; then
    echo "Warning: Old patch file postgrescluster-resource-adjustment.yaml still exists."
    echo "Contents of old patch file:"
    cat "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-resource-adjustment.yaml"
else
    echo "Old patch file postgrescluster-resource-adjustment.yaml has been removed as expected."
fi

# Compare the old and new patch files if both exist
if [ -f "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-resource-adjustment.yaml" ] && [ -f "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml" ]; then
    echo "Diff between old and new patch files:"
    diff "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-resource-adjustment.yaml" "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml"
fi

Line range hint 1-41: Overall impact aligns with PR objectives

The changes in this file, particularly the introduction of the new instances patch and the replacement of the resource adjustment patch, align well with the PR objective of potentially making the basemap stateless or moving persistent volumes outside of hwn01. These modifications suggest a more granular approach to instance management, which could facilitate better control over stateful components.

However, to fully understand the impact and ensure alignment with the PR objectives, it would be beneficial to review the contents of the new postgrescluster-instances.yaml file. This will help verify if the changes indeed contribute to making the basemap stateless or moving PVs as intended.

Let's verify the contents of the new patch file and check for any PV-related configurations:

#!/bin/bash
# Display contents of the new patch file
echo "Contents of postgrescluster-instances.yaml:"
cat "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml"

# Check for PV-related configurations
echo "Checking for PV-related configurations in the new patch file:"
grep -i "persistentvolume" "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml"

12-17: New patch added for instance management

A new patch has been introduced for managing PostgresCluster instances. This addition suggests an intent to have more granular control over the database instances.

To ensure this change is consistent with other environments and to understand its contents, let's verify the existence and contents of the new patch file:

✅ Verification successful

Verification Successful: No Issues Found

The new patch for instance management in postgrescluster-instances.yaml has been successfully added without removing or replacing any existing configurations. Similar patches are present in other environments, ensuring consistency.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the new patch file exists and display its contents
if [ -f "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml" ]; then
    echo "Contents of postgrescluster-instances.yaml:"
    cat "flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml"
else
    echo "postgrescluster-instances.yaml not found in the expected location."
fi

# Check if similar patches exist in other environments
echo "Checking for similar patches in other environments:"
find flux/clusters/k8s-01/event-api-db/overlays -name "postgrescluster-instances.yaml"

Length of output: 3322


Script:

#!/bin/bash
# Check if the new patch file exists and display its contents
PATCH_FILE="flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml"
if [ -f "$PATCH_FILE" ]; then
    echo "Contents of postgrescluster-instances.yaml:"
    cat "$PATCH_FILE"
else
    echo "postgrescluster-instances.yaml not found in the expected location."
fi

# Check if similar patches exist in other environments
echo "Checking for similar patches in other environments:"
find flux/clusters/k8s-01/event-api-db/overlays -name "postgrescluster-instances.yaml"

# Check for PV-related configurations in the new patch file
echo "Checking for PV-related configurations in the new patch file:"
grep -i "persistentvolume" "$PATCH_FILE"

Length of output: 3768

flux/clusters/k8s-01/basemap/prod/kustomization.yaml (3)

Line range hint 7-18: LGTM: HelmRelease patch is correctly configured.

The HelmRelease patch is properly set up to use both the general values.yaml and the environment-specific values-prod.yaml. This configuration allows for a good separation of common and production-specific settings.


19-34: LGTM: Deployment patch adds correct node affinity rule.

The new patch for the 'prod-basemap-tileserver' Deployment correctly adds a node affinity rule to avoid scheduling on 'hwn01.k8s-01.kontur.io'. This aligns with the PR objective of moving resources outside of hwn01.

To ensure this change doesn't negatively impact the cluster's load distribution, please verify the following:

  1. The cluster has sufficient resources on other nodes to accommodate this Deployment.
  2. There are no specific requirements for this Deployment that are only met by the excluded node.

You can use the following command to check the current node utilization:
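For instance (requires the metrics-server addon):

#!/bin/bash
# Show current CPU/memory usage per node
kubectl top nodes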

And to verify the number of available nodes:
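For example, counting the schedulable candidates left once hwn01 is excluded:

#!/bin/bash
# Count nodes other than hwn01
kubectl get nodes --no-headers | grep -v hwn01.k8s-01.kontur.io | wc -l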


35-50: LGTM: CronJob patch adds correct node affinity rule, but consider the implications.

The new patch for the 'prod-basemap' CronJob correctly adds a node affinity rule to require scheduling on 'hwn02.k8s-01.kontur.io'. This aligns with the PR objective of adjusting resource placement.

However, it's worth noting that this affinity rule is more restrictive than the one used for the Deployment. While the Deployment avoids a specific node, this CronJob is required to run on a specific node.

Please confirm the following:

  1. The 'hwn02.k8s-01.kontur.io' node has the necessary resources and configuration to run this CronJob.
  2. There's a specific reason for running this CronJob on this particular node (e.g., data locality, specialized hardware).
  3. There's a plan in place for what happens if the specified node is unavailable when the CronJob needs to run.

You can check the node's status and resources with:
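For example:

#!/bin/bash
# Inspect hwn02's conditions, capacity, and allocatable resources
kubectl describe node hwn02.k8s-01.kontur.io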

And verify the CronJob's configuration after applying these changes:
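For example (add -n with the appropriate namespace if the CronJob does not live in the current context's default):

#!/bin/bash
# Confirm the rendered CronJob carries the new node affinity rule
kubectl get cronjob prod-basemap -o yaml | rg -A 8 'nodeAffinity'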

flux/clusters/k8s-01/basemap/test/kustomization.yaml (2)

35-50: Evaluate the impact of excluding hwn01 for the Deployment.

The addition of a node affinity rule for the 'test-basemap-tileserver' Deployment to avoid 'hwn01.k8s-01.kontur.io' is in line with the PR objective. This approach offers more flexibility compared to the CronJob patch as it allows scheduling on any node except hwn01. However, there are still some points to consider:

  1. The comment indicates this is a temporary solution, which is good for addressing immediate concerns but may require a follow-up.
  2. Excluding a specific node might impact load distribution and resource utilization.

Consider the following suggestions:

  1. Document the specific reasons for excluding hwn01 to aid in future decision-making.
  2. Create a plan for a more permanent solution that doesn't rely on excluding specific nodes.
  3. Ensure that the cluster has sufficient resources to handle this deployment without hwn01.

To validate the impact of this change, run:

#!/bin/bash
# Check the current node distribution of pods for this deployment
kubectl get pods -l app=test-basemap-tileserver -o wide

# Verify if there are any other affinity rules or node selectors for this deployment
rg --type yaml 'kind:\s*Deployment' -A 20 | rg 'name:\s*test-basemap-tileserver' -A 20 | rg -e 'nodeSelector:' -e 'affinity:'

19-34: Consider the implications of node-specific scheduling for the CronJob.

The addition of a node affinity rule for the 'test-basemap' CronJob to run specifically on 'hwn02.k8s-01.kontur.io' is noted. While this achieves the immediate goal of moving the job outside of hwn01, there are some considerations:

  1. This approach might limit scalability and flexibility in the long term.
  2. The comment indicates this is a temporary solution, which aligns with the PR objective but may need revisiting.

Consider the following suggestions:

  1. If possible, use more generic labels (e.g., node roles or types) instead of specific hostnames for better scalability.
  2. Document the rationale behind this temporary change and create a follow-up task to implement a more permanent solution.
  3. Evaluate if this change addresses the root cause of why the job needs to be moved from hwn01.

To ensure this change doesn't conflict with existing configurations, run:

✅ Verification successful

Node affinity verification successful for 'test-basemap' CronJob.

No existing nodeAffinity or nodeSelector configurations found for the test-basemap CronJob. The added node affinity rule to schedule the job on hwn02.k8s-01.kontur.io is confirmed without conflicts.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any other affinity rules or node selectors for this CronJob
rg --type yaml 'kind:\s*CronJob' -A 20 | rg 'name:\s*test-basemap' -A 20 | rg -e 'nodeSelector:' -e 'affinity:'

Length of output: 256


Script:

#!/bin/bash
# Search for node affinity or node selectors in the 'test-basemap' CronJob configurations
rg --type yaml 'kind:\s*CronJob' -A 20 | rg 'name:\s*test-basemap' -A 20 | rg -e 'nodeAffinity:' -e 'nodeSelector:'

Length of output: 113

helm/basemap/values.yaml (3)

29-29: Approved: Basemap job suspension explicitly disabled

The addition of suspend: false for the basemap job ensures that it will run as scheduled and not be suspended.

Could you please clarify:

  1. Was the basemap job previously suspended by default?
  2. Is this change part of a broader strategy to ensure consistent job execution across environments?

To verify the impact of this change, run the following script:

#!/bin/bash
# Description: Check for usage of the 'suspend' field in CronJob templates

echo "Checking for 'suspend' field usage in CronJob templates:"
rg --type yaml 'kind:\s*CronJob' -A 20 helm/basemap/templates/ | rg 'suspend:'

This will help us understand how the suspend field is used across different CronJob templates in the chart.


30-30: Approved: Basemap storage type set to PVC

The addition of storageType: pvc for the basemap component aligns with the PR objective of improving persistent volume handling. This change should provide more robust and scalable storage for the basemap data.

To ensure proper configuration, please verify the following:

  1. The PVC for the basemap is correctly defined in the templates.
  2. The basemap deployment or job is updated to use this PVC.
  3. The processorStorageSize in the pvc section (currently 500Gi) is appropriate for the basemap's needs.

Run the following script to check the PVC configuration for the basemap:

#!/bin/bash
# Description: Verify PVC configuration for basemap

# Check for PVC template
echo "Checking for basemap PVC template:"
rg --type yaml 'kind:\s*PersistentVolumeClaim' -A 10 helm/basemap/templates/

# Check for PVC usage in basemap deployment or job
echo "Checking for PVC usage in basemap deployment or job:"
rg --type yaml 'kind:\s*(Deployment|Job|CronJob)' -A 50 helm/basemap/templates/ | rg -A 10 'volumes:' | rg 'persistentVolumeClaim'

Additionally, could you please clarify if this change makes the basemap stateless or if it's part of moving the local path-based PVs? Understanding the specific goal will help ensure the implementation aligns with the intended architecture.


26-26: Approved: Tileserver storage type set to PVC

The addition of storageType: pvc for the tileserver aligns with the PR objective of improving persistent volume handling. This change should provide more robust and scalable storage for the tileserver component.

To ensure proper configuration, please verify the following:

  1. The PVC for the tileserver is correctly defined in the templates.
  2. The tileserver deployment is updated to use this PVC.
  3. The tileStorageSize in the pvc section (currently 150Gi) is appropriate for the tileserver's needs.

Run the following script to check the PVC configuration:
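A sketch mirroring the basemap check above, pointed at the chart's templates directory:

#!/bin/bash
# Description: Verify PVC configuration for the tileserver

# Check for PVC templates in the chart
echo "Checking for PVC templates:"
rg --type yaml 'kind:\s*PersistentVolumeClaim' -A 10 helm/basemap/templates/

# Check that the tileserver deployment mounts a PVC
echo "Checking for PVC usage in the tileserver deployment:"
rg -A 3 'persistentVolumeClaim' helm/basemap/templates/deployment.yaml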

flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml (3)

28-51: Consider the trade-offs of strict node affinity.

The current configuration uses strict node affinity, binding each instance to a specific node:

  • hwn04 instance to hwn04.k8s-01.kontur.io
  • hwn03 instance to hwn03.k8s-01.kontur.io

While this ensures predictable scheduling, it may limit flexibility and resilience:

  1. If a specified node goes down, the pod cannot be rescheduled on another node.
  2. It may lead to uneven resource utilization across your cluster.

Consider using a less strict affinity rule (e.g., preferredDuringSchedulingIgnoredDuringExecution) or node labels that are less specific, allowing for more flexible scheduling while still maintaining some control over pod placement.

To check the current node status and labels, which can help in designing more flexible affinity rules, run:

#!/bin/bash
# Get node status and labels
kubectl get nodes -o wide --show-labels

This will provide an overview of your nodes and their labels, helping you design more flexible affinity rules if needed.


20-27: Review storage allocation and align with empirical recommendations.

The current storage allocation (100Gi) is significantly lower than the commented empirical recommendation (2500Gi). This large discrepancy warrants attention:

  1. If 100Gi is sufficient for your DEV environment, consider removing the commented recommendation to avoid confusion.
  2. If your data growth projections align more closely with the 2300Gi to 6200Gi range mentioned in the comment, consider increasing the allocated storage.

Remember that underprovisioning storage can lead to operational issues, while overprovisioning may result in unnecessary costs.

To check the current storage usage and help determine the appropriate size, you can run:

#!/bin/bash
# Get the current storage usage of PostgreSQL PVCs
kubectl get pvc -l app.kubernetes.io/name=postgres -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage,CAPACITY:.status.capacity.storage

This will show the requested and provisioned capacity of your PostgreSQL volumes; actual usage has to come from df inside the pods or from your monitoring stack.


8-19: Review resource allocations and consider adjusting them.

The current resource allocations seem quite high, especially for a DEV environment:

  • Memory limit of 512Gi is extremely high. Even the commented empirical recommendation of 100Gi seems excessive for most development scenarios.
  • CPU limit of 64 cores is also very high. The empirical recommendation of 5 cores might be more appropriate.

Consider adjusting these values based on actual usage patterns in your development environment. Over-allocation can lead to inefficient resource utilization.

To help determine appropriate resource allocations, you can run the following command to check the actual resource usage of your PostgreSQL pods:
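For example (requires metrics-server; the label selector assumes the Crunchy operator's standard cluster label, and you may need to add the appropriate namespace):

#!/bin/bash
# Show current CPU/memory usage of the event-api PostgreSQL pods
kubectl top pods -l postgres-operator.crunchydata.com/cluster=db-event-api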

This will provide insights into the actual CPU and memory usage, helping you set more accurate resource limits and requests.

🧰 Tools
🪛 yamllint

[error] 12-12: trailing spaces

(trailing-spaces)


[error] 13-13: trailing spaces

(trailing-spaces)


[error] 19-19: trailing spaces

(trailing-spaces)

flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml (2)

28-51: LGTM: Node affinity configuration.

The node affinity configuration for both instances (hwn04 and hwn03) is well-defined. This ensures that each instance is scheduled on a specific node, which is good for data locality and performance in a PostgreSQL cluster.


1-71: Overall: Configuration aligns well with PR objectives.

This configuration file for PostgreSQL cluster instances aligns well with the PR objectives of moving or making the basemap stateless. The use of node affinity ensures better control over instance placement, while the switchover configuration provides flexibility for managing the cluster state.

Key points:

  1. Well-defined resource allocations and storage specifications.
  2. Proper use of node affinity for instance placement.
  3. Inclusion of switchover capabilities for improved state management.

These changes contribute to a more robust and manageable PostgreSQL cluster configuration, which should help in achieving the goals of the PR.

🧰 Tools
🪛 yamllint

[error] 2-2: trailing spaces

(trailing-spaces)


[error] 12-12: trailing spaces

(trailing-spaces)


[error] 13-13: trailing spaces

(trailing-spaces)


[error] 19-19: trailing spaces

(trailing-spaces)

helm/basemap/templates/deployment.yaml (2)

71-71: Consistent volume mount naming in initContainer

The change in the volume mount name for the initContainer is consistent with the volume definition changes. This ensures that the correct volume is used based on the configured storage type.


Line range hint 46-113: Summary: Improved flexibility in storage configuration

The changes in this deployment template significantly enhance the flexibility of storage configuration for the basemap tileserver. By introducing a dynamic storageType variable, the template now supports both persistent volume claims and emptyDir volumes, allowing for easier adaptation to different environments or requirements.

Key improvements:

  1. Dynamic volume naming based on storageType
  2. Conditional logic to handle different storage types
  3. Consistent naming across volume definition and mounts in both initContainer and main container

These changes align well with the PR objective of making the basemap more flexible or stateless, depending on the chosen configuration. The implementation is clean and maintains good Helm practices.

helm/raster-tiler/templates/deployment.yaml (2)

Line range hint 1-134: Summary: Good improvement in storage flexibility

The changes to this deployment configuration file are focused and beneficial:

  1. The introduction of conditional storage type configuration enhances flexibility.
  2. The rest of the deployment configuration remains intact, minimizing potential side effects.
  3. This change allows for easier management of different storage requirements across various environments.

Overall, this is a positive change that improves the adaptability of the raster-tiler deployment without introducing unnecessary complexity.

To ensure a smooth transition and proper use of this new feature:

  1. Make sure to update the Helm chart version to reflect these changes.
  2. Consider adding a note to the changelog or release notes about this new storage configuration option.
  3. Ensure that the team is aware of this new configuration option and understands when to use each storage type.

Great job on improving the deployment configuration!


128-134: Verify Helm values and update documentation

The new storage configuration looks good, but it's important to ensure that the supporting elements are in place:

  1. Verify that the Helm values file includes the new storageType and storage_size variables, with appropriate default values.
  2. Update the README or other documentation to explain these new configuration options and their implications.
  3. Consider adding comments in the values file to guide users on when to use each storage type.

To check the Helm values file, you can run:
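A minimal check against the chart's default values file:

#!/bin/bash
# Confirm storageType and storage_size are defined with sensible defaults
grep -E 'storageType:|storage_size:' helm/raster-tiler/values.yaml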

This will help ensure that the new configuration options are properly supported and documented.

✅ Verification successful

Helm values and documentation are properly configured

The storageType and storage_size variables are correctly defined in the Helm values file and appropriately utilized across the deployment templates. Ensure that the documentation has been updated to reflect these configurations.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if storageType and storage_size are defined in values.yaml
grep -E "storageType:|storage_size:" helm/raster-tiler/values.yaml

# Check if there's any documentation update
grep -ri "storageType" helm/raster-tiler/

Length of output: 490

helm/basemap/templates/cronjobs.yaml (4)

19-21: LGTM: Suspend feature added

The addition of the suspend feature is a good improvement. It allows for easy temporary disabling of the CronJob when needed, which can be useful for maintenance or debugging purposes.

Note: The indentation warning from yamllint can be safely ignored as it's likely due to the Helm templating.

🧰 Tools
🪛 yamllint

[warning] 20-20: wrong indentation: expected 4 but found 6

(indentation)


63-63: LGTM: Volume mount name updated

The update to the volume mount name is correct and consistent with the changes in volume definitions. This modification allows for dynamic naming based on the storage type, which increases the flexibility of the configuration.


123-123: LGTM: Postgres volume mount name updated

The update to the postgres volume mount name is correct and consistent with the changes in volume definitions and the basemap-rendering container. This modification maintains consistency across the configuration and supports the dynamic storage type feature.


Line range hint 1-148: Overall assessment: Well-implemented improvements to CronJob flexibility

The changes in this file significantly enhance the flexibility and configurability of the CronJob:

  1. The new suspend feature allows for easy temporary disabling of the CronJob.
  2. The addition of an init container for PVC usage checks improves reliability when using persistent storage.
  3. The flexible volume definitions support both PVC and emptyDir storage types, allowing for different deployment scenarios.
  4. The changes are consistently applied throughout the file, maintaining coherence in the configuration.

These improvements will make the basemap CronJob more adaptable to different environments and use cases, while also enhancing its reliability and maintainability.

Comment on lines +35 to +36

storageType: emptydir

💡 Codebase verification

Conflict Between storageType and Persistent Volume Claims

The storageType is set to emptydir in the values files, but the deployment template references a persistentVolumeClaim. This inconsistency may lead to deployment issues.

  • File: helm/raster-tiler/templates/deployment.yaml contains persistentVolumeClaim:
🔗 Analysis chain

Verify the implications of using emptyDir for storage

The addition of storageType: emptydir suggests a shift in how storage is managed for this application. While emptyDir can be useful for temporary storage, it has some important considerations:

  1. Data persistence: emptyDir is ephemeral and does not persist data across pod restarts or rescheduling. Ensure this aligns with the application's data storage requirements.

  2. Performance: Depending on the emptyDir medium (memory or disk), performance characteristics may differ from the previous storage solution.

  3. Capacity: There's no explicit size limit for emptyDir (unless using memory medium). Ensure this doesn't conflict with the storage_size: 100Gi parameter defined earlier.

  4. Consistency with PR objectives: This change appears to align with the PR objective of making the basemap stateless, as emptyDir provides temporary, non-persistent storage.

To ensure this change is consistent across environments and doesn't conflict with existing configurations, please run the following script:

Consider documenting this change in the chart's README or values description, explaining the rationale behind using emptyDir and any implications for data persistence or application behavior.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for storageType configuration and potential conflicts

# Check if storageType is consistently set across all environment files
echo "Checking storageType across environments:"
rg "storageType:" helm/raster-tiler/values/

# Check for potential conflicts with persistent volume claims
echo "\nChecking for PVC configurations:"
rg "persistentVolumeClaim" helm/raster-tiler/templates/

Length of output: 478

@@ -32,3 +32,5 @@ resources:
memory: 4Gi
limits:
memory: 512Gi

storageType: emptydir

💡 Codebase verification

⚠️ Potential issue

Potential Conflict with Existing Storage Configurations

The addition of storageType: emptydir in values-dev.yaml may conflict with the existing storage_size: 100Gi and storageClass: local-path parameters. Consider the following:

  • Clarify Intent: Determine whether storageType: emptydir is meant to replace or complement the existing storage settings.
  • Adjust Configurations: If emptydir is intended to override, remove or adjust storage_size and storageClass accordingly to prevent conflicts.
  • Scoped Usage: If emptydir is for specific components, ensure that the configuration clearly reflects its scoped application.
🔗 Analysis chain

New storageType parameter added

The addition of storageType: emptydir introduces a new configuration option for storage in the dev environment. This change has potential implications:

  1. Using emptydir means that the storage is ephemeral and tied to the pod's lifecycle. This is suitable for temporary storage needs but may lead to data loss if the pod is rescheduled or restarted.

  2. This configuration might conflict with or override the existing storage_size: 100Gi and storageClass: local-path parameters (lines 17-18). It's unclear which storage configuration takes precedence.

To ensure consistency and understand the usage of these storage parameters, let's check other files in the project:

Consider the following recommendations:

  1. Clarify the relationship between storageType, storage_size, and storageClass in comments or documentation.
  2. If emptydir is intended for specific components only, consider using a more specific parameter name or structuring the configuration to clearly indicate which components use which storage type.
  3. Evaluate if using emptydir in the dev environment aligns with the application's data persistence requirements.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for storage-related configurations in other value files
echo "Searching for storage configurations in other value files:"
rg --type yaml 'storage(Type|Class|_size):' helm/raster-tiler/values/

Length of output: 985

Comment on lines +24 to +29
- target: # new config for instances management
group: postgres-operator.crunchydata.com
version: v1beta1
kind: PostgresCluster
name: db-event-api
path: postgrescluster-assigning.yaml
path: postgrescluster-instances.yaml

💡 Codebase verification

Incomplete removal of patch files

The patch files postgrescluster-resource-adjustment.yaml and postgrescluster-assigning.yaml are still referenced in several kustomization.yaml files across different overlays. Please remove these references to ensure consistency and prevent configuration issues.

  • References to postgrescluster-resource-adjustment.yaml:

    • flux/clusters/k8s-01/user-profile-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/event-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/DEV/kustomization.yaml
  • References to postgrescluster-assigning.yaml:

    • flux/clusters/k8s-01/user-profile-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/layers-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/user-profile-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/TEST/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/PROD/kustomization.yaml
    • flux/clusters/k8s-01/keycloak-db/overlays/DEV/kustomization.yaml
    • flux/clusters/k8s-01/insights-api-db/overlays/DEV/kustomization.yaml
🔗 Analysis chain

New patch added for instances management

A new patch postgrescluster-instances.yaml has been added for the PostgresCluster resource db-event-api. This addition aligns with the PR objective of potentially making the basemap stateless or moving persistent volumes.

To ensure this change is consistent with the rest of the configuration, please run the following script:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the contents of the new patch file and its references

# Test 1: Check if the new patch file exists
if [ -f "flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml" ]; then
    echo "postgrescluster-instances.yaml exists."
    # Display the contents of the file
    cat "flux/clusters/k8s-01/event-api-db/overlays/DEV/postgrescluster-instances.yaml"
else
    echo "Error: postgrescluster-instances.yaml not found."
fi

# Test 2: Check for any other references to the new patch file
rg "postgrescluster-instances.yaml" --type yaml

Length of output: 3449


Script:

#!/bin/bash
# Description: Verify the removal of postgrescluster-resource-adjustment.yaml and postgrescluster-assigning.yaml

# Files to check
FILES=("postgrescluster-resource-adjustment.yaml" "postgrescluster-assigning.yaml")
BASE_DIR="flux/clusters/k8s-01/event-api-db/overlays/DEV"

for FILE in "${FILES[@]}"; do
    # Test 1: Check if the patch file exists
    if [ -f "${BASE_DIR}/${FILE}" ]; then
        echo "Error: ${FILE} still exists in ${BASE_DIR}."
    else
        echo "Success: ${FILE} has been removed from ${BASE_DIR}."
    fi

    # Test 2: Check for any references to the removed patch file in all kustomization.yaml files
    MATCHES=$(rg "${FILE}" --type yaml)
    if [ -z "$MATCHES" ]; then
        echo "Success: No references to ${FILE} found in any kustomization.yaml files."
    else
        echo "Error: References to ${FILE} found in the following kustomization.yaml files:"
        echo "$MATCHES"
    fi
done

Length of output: 12927

Comment on lines +19 to +34
kind: CronJob
name: 'dev-basemap'
group: batch
version: v1
patch: |-
  - op: add
    path: /spec/jobTemplate/spec/template/spec/affinity/nodeAffinity
    value:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - hwn02.k8s-01.kontur.io

💡 Codebase verification

Inconsistent Node Affinity Strategy Detected

The addition of a hard-coded hostname (hwn02.k8s-01.kontur.io) for node affinity in the CronJob introduces an inconsistency with the repository's existing practices, which utilize labels for node selection. This deviation can lead to maintenance challenges and reduced flexibility.

Recommendations:

  1. Align with Existing Practices:

    • Use node labels instead of hard-coded hostnames for affinity. For example:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-type
                  operator: In
                  values:
                    - basemap-storage
  2. Document Rationale:

    • If using a specific hostname is necessary, include comments within the YAML file or in the PR description to explain the reasoning behind this choice.
  3. Consider Flexibility:

    • Implement a fallback mechanism or use preferredDuringSchedulingIgnoredDuringExecution to ensure the CronJob can run even if the specified node is unavailable.
🔗 Analysis chain

Consider using more flexible node selection criteria

The addition of node affinity to the CronJob aligns with the PR objective of moving copy local path based PVs outside of hwn01. However, there are some concerns and suggestions to consider:

  1. Hard-coding a specific node hostname (hwn02.k8s-01.kontur.io) reduces flexibility and can cause issues if the node is unavailable or renamed.
  2. This approach might not scale well in a dynamic cluster environment.

Consider the following recommendations:

  1. Use more flexible node selection criteria, such as node labels or zones, instead of a specific hostname. For example:

    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-type
                operator: In
                values:
                  - basemap-storage
  2. Document the reason for this specific node requirement in a comment within the YAML file or in the PR description. This will help future maintainers understand the rationale behind this decision.

  3. Consider implementing a fallback mechanism or using preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution to ensure the job can still run if the preferred node is unavailable.

To ensure this change doesn't conflict with other scheduling policies, let's check for any existing affinity rules or node selectors:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for existing affinity rules or node selectors in the dev environment
rg --type yaml 'affinity:|nodeSelector:' flux/clusters/k8s-01/basemap/dev/

Length of output: 76


Script:

#!/bin/bash
# Comprehensive search for affinity rules or node selectors in the entire repository
rg --type yaml 'affinity:|nodeSelector:'

Length of output: 1176


Script:

#!/bin/bash
# Search for node affinity definitions using labels instead of hard-coded hostnames
rg --type yaml 'nodeSelector' flux/clusters/k8s-01/ | rg -v 'kubernetes\.io/hostname'

Length of output: 2624

Comment on lines +53 to +71
# Switchover section
# https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/cluster-management/administrative-tasks#changing-the-primary
# https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/cluster-management/administrative-tasks#targeting-an-instance
# 1. Requires the fully-qualified instance id of the new master
# 2. The new master must be in sync with the current master (check via `patronictl list`)
# 3. It's recommended to retain the switchover section, to keep track of the last desired master both in code and in k8s
- op: add
  path: /spec/patroni/switchover
  value:
    enabled: true
    targetInstance: db-event-api-hwn04-OVERRIDEME
    #type: Failover
# The trigger-switchover annotation triggers the actual switchover whenever it is updated
# The value is arbitrary, but it's recommended to reference the date of the switchover
# Special syntax is used to escape the slash with the sequence ~1
# See https://windsock.io/json-pointer-syntax-in-json-patches-with-kustomize/
- op: add
  path: /metadata/annotations/postgres-operator.crunchydata.com~1trigger-switchover
  value: OVERRIDEME

💡 Codebase verification

Action Required: Replace Placeholder Values Before Applying Configuration

The switchover configuration contains placeholder values that must be updated before it can be safely applied:

  1. Line 63: targetInstance: db-event-api-hwn04-OVERRIDEME
  2. Line 71: value: OVERRIDEME

Please replace these placeholders with the appropriate instance identifiers and values to ensure proper functionality.

🔗 Analysis chain

LGTM: Switchover configuration with a note on placeholders.

The switchover configuration is well-documented with helpful comments explaining the process and requirements. The use of an annotation to trigger the switchover is a good practice for Kubernetes-based operations.

However, please note that there are placeholders that need to be replaced before this configuration can be used:

  1. Line 63: db-event-api-hwn04-OVERRIDEME
  2. Line 71: OVERRIDEME

Ensure these are replaced with appropriate values before applying this configuration.

To verify the presence of these placeholders, you can run:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for OVERRIDEME placeholders in the file
grep -n "OVERRIDEME" flux/clusters/k8s-01/event-api-db/overlays/PROD/postgrescluster-instances.yaml

Length of output: 175

@@ -105,7 +110,7 @@ spec:
protocol: TCP
volumeMounts:
- mountPath: /data
name: {{ .Values.envName }}-basemap-tileserver-pvc
name: {{ .Values.envName }}-basemap-tileserver-{{ .Values.tileserver.storageType }}

💡 Codebase verification

Inconsistent Volume Mount Naming Detected

The old volume naming convention is still present in helm/basemap/templates/pvc.yaml. Please update it to match the new naming convention used in the deployment.

🔗 Analysis chain

Consistent volume mount naming in main container

The change in the volume mount name for the main container is consistent with the volume definition and initContainer changes. This ensures that the correct volume is used based on the configured storage type.

To ensure consistency across the entire template, please run the following verification script:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistent use of the new volume naming convention

# Test: Check for any remaining instances of the old naming convention
# Expect: No output, indicating all instances have been updated
rg --type yaml 'name:\s+\{\{\s*\.Values\.envName\s*\}\}-basemap-tileserver-pvc' helm/basemap/templates/

# Test: Verify consistent use of the new naming convention
# Expect: Consistent output showing the new naming pattern
rg --type yaml 'name:\s+\{\{\s*\.Values\.envName\s*\}\}-basemap-tileserver-\{\{\s*\.Values\.tileserver\.storageType\s*\}\}' helm/basemap/templates/

Length of output: 737
