Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors adding environment variables and mounting secret volumes to services #793

Open
andrewbelu opened this issue Aug 29, 2024 · 15 comments

Comments

@andrewbelu
Copy link

andrewbelu commented Aug 29, 2024

The override examples don't seem to be working for me.

With the latest operator code, applying an environment variable override (to add an environment variable as per https://temporal-operator.pages.dev/features/overrides/#example-add-an-environment-variable-to-the-worker-pod

apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: prod
  namespace: demo
spec:
  version: 1.23.0
  numHistoryShards: 1
  persistence:
    defaultStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal
        connectAddr: postgres.demo.svc.cluster.local:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: postgres-password
        key: PASSWORD
    visibilityStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal_visibility
        connectAddr: postgres.demo.svc.cluster.local:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: postgres-password
        key: PASSWORD
  services:
    worker:
      overrides:
        deployment:
          spec:
            template:
              spec:
                containers:
                  - name: service
                    env:
                      - name: HTTP_PROXY
                        value: example.com

causes the following error:

2024-08-29T16:34:20.751752301Z 2024-08-29T16:34:20Z     DEBUG   events  failed to create resource prod-worker of Type *v1.Deployment{"type": "Warning", "object": {"kind":"TemporalCluster","namespace":"demo","name":"prod","uid":"c246e4c8-fe32-4fef-8845-03a3f42a6b14","apiVersion":"temporal.io/v1beta1","resourceVersion":"5760"}, "reason": "RessourceCreateError"}
2024-08-29T16:34:20.751760515Z 2024-08-29T16:34:20Z     DEBUG   events  Deployment.apps "prod-worker" is invalid: [spec.template.spec.containers[0].env[0].valueFrom: Invalid value: "": may not be specified when `value` is not empty, spec.template.spec.containers[0].env[1].valueFrom: Invalid value: "": may not be specified when `value` is not empty, spec.template.spec.containers[0].env[2].valueFrom: Invalid value: "": may not be specified when `value` is not empty]   {"type": "Warning", "object": {"kind":"TemporalCluster","namespace":"demo","name":"prod","uid":"c246e4c8-fe32-4fef-8845-03a3f42a6b14","apiVersion":"temporal.io/v1beta1","resourceVersion":"5760"}, "reason": "ProcessingError"}

Additionally, adding an additional volume causes an error as well (taken from https://temporal-operator.pages.dev/features/overrides/#example-mount-an-extra-volume-to-the-frontend-pod)

apiVersion: temporal.io/v1beta1
kind: TemporalCluster
metadata:
  name: prod
  namespace: demo
spec:
  version: 1.23.0
  numHistoryShards: 1
  persistence:
    defaultStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal
        connectAddr: postgres.demo.svc.cluster.local:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: postgres-password
        key: PASSWORD
    visibilityStore:
      sql:
        user: temporal
        pluginName: postgres
        databaseName: temporal_visibility
        connectAddr: postgres.demo.svc.cluster.local:5432
        connectProtocol: tcp
      passwordSecretRef:
        name: postgres-password
        key: PASSWORD
  services:
    frontend:
      overrides:
        deployment:
          spec:
            template:
              spec:
                containers:
                  - name: service
                    volumeMounts:
                      - name: extra-volume
                        mountPath: /etc/extra
                volumes:
                  - name: extra-volume
                    secret:
                      secretName: test-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: test-secret
  namespace: demo
type: Opaque
data:
  TEST: TEST

results in the following error in starting the pod:

2024-08-29T16:55:27.928266623Z 2024-08-29T16:55:27Z     DEBUG   events  Deployment.apps "prod-frontend" is invalid: [spec.template.spec.volumes[0].configMap: Forbidden: may not specify more than 1 volume type, spec.template.spec.containers[0].volumeMounts[0].name: Not found: "extra-volume"]       {"type": "Warning", "object": {"kind":"TemporalCluster","namespace":"demo","name":"prod","uid":"c246e4c8-fe32-4fef-8845-03a3f42a6b14","apiVersion":"temporal.io/v1beta1","resourceVersion":"8330"}, "reason": "ProcessingError"}

I think this is due to some of the limitations to StrategicMergePatch with the default object, where the override replaces the existing object at index 0 with something that conflicts on a key, so there may be other fields where the same limitations applies. For example, env can have either the value field or the valueFrom field. When the operator does the StrategicMergePatch call, the result has a value field in index 0 where there was previously a valueFrom field. When kubernetes tries to apply this, it fails, probably similar to the issue in: kubernetes-sigs/kustomize#2734 (comment). The volumes is similar story, it is an array with objects that cannot have both configMap and secret set, so when the first index is overridden with another type it causes this issue.

@alexandrevilain
Copy link
Owner

cc @uritig @yujunz
Looks like the issue has been introduced in #720

@yujunz
Copy link
Contributor

yujunz commented Aug 30, 2024

I thought it is covered by tests, isn't it?

Too busy recently. Not sure if I can find some free time to check details. @andrewbelu could you try adding a test case to fail this case?

@andrewbelu
Copy link
Author

@yujunz As requested, added some test cases: #795

@yujunz
Copy link
Contributor

yujunz commented Sep 1, 2024

Thank you @andrewbelu . We may reproduce it with the tests.

go test pkg/kubernetes/overrides_test.go
--- FAIL: TestApplyDeploymentOverrides (0.00s)
    --- FAIL: TestApplyDeploymentOverrides/add_env_var_to_existing_env (0.00s)
        overrides_test.go:515:
            	Error Trace:	/Users/yujunz/Code/github.com/alexandrevilain/temporal-operator/pkg/kubernetes/overrides_test.go:515
            	Error:      	Should be true
            	Test:       	TestApplyDeploymentOverrides/add_env_var_to_existing_env
    --- FAIL: TestApplyDeploymentOverrides/add_secret_volume_to_existing_volumes (0.00s)
        overrides_test.go:515:
            	Error Trace:	/Users/yujunz/Code/github.com/alexandrevilain/temporal-operator/pkg/kubernetes/overrides_test.go:515
            	Error:      	Should be true
            	Test:       	TestApplyDeploymentOverrides/add_secret_volume_to_existing_volumes
FAIL
FAIL	command-line-arguments	0.786s
FAIL

Let's see if there is a way to fix it. Do you already have some proposals?

yujunz added a commit to yujunz/temporal-operator that referenced this issue Sep 1, 2024
yujunz added a commit to yujunz/temporal-operator that referenced this issue Sep 1, 2024
@yujunz
Copy link
Contributor

yujunz commented Sep 1, 2024

A simple test of strategicMergePatch does work as expected. See https://github.com/yujunz/temporal-operator/tree/issues-793

@alexandrevilain
Copy link
Owner

@yujunz
Copy link
Contributor

yujunz commented Sep 3, 2024

The workaround is to set valuesFrom to null explictly.

But still not clear about the root case. We may need to dig the code of strategic merge.

@yujunz
Copy link
Contributor

yujunz commented Sep 16, 2024

Print some debug info.

        overrides_test.go:511: expected env: [{Name:b Value:c ValueFrom:nil} {Name:a Value: ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:&ConfigMapKeySelector{LocalObjectReference:LocalObjectReference{Name:test,},Key:,Optional:nil,},SecretKeyRef:nil,}}]
        overrides_test.go:512: actual env: [{Name:b Value:c ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:&ConfigMapKeySelector{LocalObjectReference:LocalObjectReference{Name:test,},Key:,Optional:nil,},SecretKeyRef:nil,}} {Name:a Value: ValueFrom:&EnvVarSource{FieldRef:nil,ResourceFieldRef:nil,ConfigMapKeyRef:&ConfigMapKeySelector{LocalObjectReference:LocalObjectReference{Name:test,},Key:,Optional:nil,},SecretKeyRef:nil,}}]

Need further look into the options

// Merge fields from a patch map into the original map. Note: This may modify
// both the original map and the patch because getting a deep copy of a map in
// golang is highly non-trivial.
// flag mergeOptions.MergeParallelList controls if using the parallel list to delete or keeping the list.
// If patch contains any null field (e.g. field_1: null) that is not
// present in original, then to propagate it to the end result use
// mergeOptions.IgnoreUnmatchedNulls == false.
func mergeMap(original, patch map[string]interface{}, schema LookupPatchMeta, mergeOptions MergeOptions) (map[string]interface{}, error) {

@yujunz
Copy link
Contributor

yujunz commented Sep 23, 2024

When kubernetes tries to apply this, it fails, probably similar to the issue in: kubernetes-sigs/kustomize#2734 (comment). The volumes is similar story, it is an array with objects that cannot have both configMap and secret set, so when the first index is overridden with another type it causes this issue.

@alexandrevilain , I think the root cause is the one mentioned by @andrewbelu above. While there can be a more graceful fix by patching null fields on the fly, it is difficult to implement a generic solution.

What do you think we add a similar workaround in examples?

@andrewbelu
Copy link
Author

andrewbelu commented Sep 25, 2024

When kubernetes tries to apply this, it fails, probably similar to the issue in: kubernetes-sigs/kustomize#2734 (comment). The volumes is similar story, it is an array with objects that cannot have both configMap and secret set, so when the first index is overridden with another type it causes this issue.

@alexandrevilain , I think the root cause is the one mentioned by @andrewbelu above. While there can be a more graceful fix by patching null fields on the fly, it is difficult to implement a generic solution.

What do you think we add a similar workaround in examples?

I don't believe the workaround works. If you swap the null on the secret and the configMap like so:

  services:
    frontend:
      overrides:
        deployment:
          spec:
            template:
              spec:
                containers:
                  - name: service
                    volumeMounts:
                      - name: extra-volume
                        mountPath: /etc/extra
                volumes:
                  - name: extra-volume
                    configMap: null
                    secret:
                      secretName: test-secret

It gives:

Deployment.apps "prod-frontend" is invalid: [spec.template.spec.volumes[0].configMap: Forbidden: may not specify more than 1 volume type, spec.template.spec.containers[0].volumeMounts[0].name: Not found: "extra-volume"]

Before the resource is even created by Kubernetes, so probably at some validation step after the StrategicPatchMerge step.

@yujunz
Copy link
Contributor

yujunz commented Sep 26, 2024

I did test this case and it looks ok in unit test.

Raw: []byte(`{"containers":[{"name":"test", "volumeMounts":[{"name":"b", "readOnly":true, "mountPath":"/b"}]}], "volumes":[{"name":"b", "secret": {"secretName": "test"}, "configMap": null}]}`),

@andrewbelu Are you encountering it in e2e tests? Could you try reproducing it with unit tests so we can narrow down the possible root case?

@andrewbelu
Copy link
Author

I did test this case and it looks ok in unit test.

Raw: []byte(`{"containers":[{"name":"test", "volumeMounts":[{"name":"b", "readOnly":true, "mountPath":"/b"}]}], "volumes":[{"name":"b", "secret": {"secretName": "test"}, "configMap": null}]}`),

@andrewbelu Are you encountering it in e2e tests? Could you try reproducing it with unit tests so we can narrow down the possible root case?

Yes, this is actually trying to apply the resource to a running cluster. I believe the unit test only tests that the StrategicPatchMerge works as expected. It doesn't actually validate the Kubernetes resource by adding it to a Kubernetes cluster. This would seem to be more like an E2E or integration test required (I don't think the scope of a unit test would cover this). I can take a look at adding the E2E test over the weekend.

@TheHiddenLayer
Copy link
Contributor

TheHiddenLayer commented Oct 1, 2024

does anyone happen to know if this issue persists in latest release? #760

@alexandrevilain
Copy link
Owner

Hi @TheHiddenLayer nothing has been merged to fix this for now.

@andrewbelu
Copy link
Author

Added a pull request to add functionality for JSON patching, which should provide override functionality in YAML arrays.

#875

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants