Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DataPlan-Trust implementation for Activator #13968

Closed
davidhadas opened this issue May 9, 2023 · 21 comments
Closed

Add DataPlan-Trust implementation for Activator #13968

davidhadas opened this issue May 9, 2023 · 21 comments
Labels
kind/feature Well-understood/specified features, ready for coding. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@davidhadas
Copy link
Contributor

davidhadas commented May 9, 2023

See serving #11906
See https://docs.google.com/document/d/1XE7UzgQlVVtAb7ULSqOyKCaIHtm8zMF35ainp1JmwyY/

This issue focuses on adding DataPlan-Trust support for Activator and Queue including options for:
dataplane-trust = "minimal" (common names for all namespaces)
dataplane-trust = "enabled" (per namespace)
dataplane-trust = "mutual" mTLS

It includes the necessary changes needed for:

  1. QP Server will present the DataPlane User Certificate with names "data-plane.knative.dev" and "kn-user-<namespace>"
  2. Activator Client will always present the data plane certificate with the name "kn-routing-0"
  3. If dataplane-trust = "minimal", Activator Client will verify server certificate has the name "data-plane.knative.dev"
    otherwise, Activator Client will verify server certificate has the name "kn-user-<namespace>"
  4. Activator Server will present the DataPlane Routing Certificate with the name "kn-routing-0"
  5. If dataplane-trust = "mutual", Activator Server will verify the Client certificate having the name "kn-routing-0"
    Until such time that all ingresses use the new DataPlane Routing certificate, we should also accept "data-plane.knative.dev"
@davidhadas davidhadas added the kind/feature Well-understood/specified features, ready for coding. label May 9, 2023
@dprotaso
Copy link
Member

dprotaso commented May 9, 2023

As an operator why would I choose minimal over enabled?

@davidhadas
Copy link
Contributor Author

As an operator why would I choose minimal over enabled?

Minimal equals what we have had as "internal-encryption" (which we deprecated). This is the first TLS support we will be able to release. Once we complete the work for Enabled (all ingress types etc.), enabled will be recommended to replace it.

@nak3
Copy link
Contributor

nak3 commented May 11, 2023

You will add dataplane-trust = "identity" as well, won't you?

@davidhadas
Copy link
Contributor Author

davidhadas commented May 11, 2023

You will add dataplane-trust = "identity" as well, won't you?

The code also seems to support what we need for a future dataplane-trust = "identity" since identity adds on top of mutual trust. The main difference between the two appears at the ingress. However, the focus now is on TLS and verifying identities at each hop.

The suggested wip #13969 treats:

  • anything that is not disabled, as requiring TLS and server certificate verification
  • anything that is not disabled or minimal, as requiring TLS and server certificate verification per namespace
  • anything that is not disabled or minimal or enabled as requiring mTLS

So identity and mutual are the same at this point when we make changes to the activator and queue

@dprotaso
Copy link
Member

Minimal equals what we have had as "internal-encryption" (which we deprecated).

Given internal encryption is 'alpha' I wonder if we skip supporting Minimal in our redux. It might be better to just remove it now.

@nak3 do you have input on this feature's usage

@nak3
Copy link
Contributor

nak3 commented May 12, 2023

Yeah, I think it is alright to replace current internal-encryption with enabled if the new configuration will be implemented smoothly.

@davidhadas davidhadas changed the title Add DataPlan-Trust implementation for Activator and Queue Add DataPlan-Trust implementation for Activator May 14, 2023
@davidhadas davidhadas moved this to In Progress in Serving Encryption May 23, 2023
@davidhadas
Copy link
Contributor Author

@nak3

Yeah, I think it is alright to replace current internal-encryption with enabled if the new configuration will be implemented smoothly.

I can't see a way to ensure that the new configuration will be implemented smoothly at this point.
Minimal is our only option for enabling support without mandating an activator on-route for all requests.

Enabled without mandating an activator on-route is not something I expect to be available any time soon.

Moving the original internal-encryption forward (now renamed as Minimal) will allow us to introduce end-to-end encryption and move it from Alpha to Beta as soon as the PRs for activator and queue are approved. The other trust levels will evolve at a later time.

@nak3
Copy link
Contributor

nak3 commented Jun 26, 2023

If the new configuration will not be implemented smoothly, I think we should keep Minimal.

With said, for moving it from Alpha to Beta, we will add Enabled level configuration as per #11906 ? We will need to re-discuss about Beta criteria if we change the criteria.

@davidhadas
Copy link
Contributor Author

With said, for moving it from Alpha to Beta, we will add Enabled level configuration as per #11906 ? We will need to re-discuss about Beta criteria if we change the criteria.

#11906 was initially planned to be implemented using internal-encryption which is now called Minimal.
#11906 specifically requires support for "Ingress -> Queue-Proxy" path, which we can only achieve using Minimal

Support for Minimal is therefore in line with the defined criteria, removing "Minimal" means we no longer meet the criteria indicated in #11906.

@nak3
Copy link
Contributor

nak3 commented Jun 27, 2023

Sorry I referred to this specific comment #11906 (comment)

@evankanderson
Copy link
Member

With said, for moving it from Alpha to Beta, we will add Enabled level configuration as per #11906 ? We will need to re-discuss about Beta criteria if we change the criteria.

When you say Enabled level configuration, do you mean enabling encryption by default? I think that's a good target for Beta, though some KIngress implementations (gateway API) may need to disable for now.

@nak3
Copy link
Contributor

nak3 commented Jun 27, 2023

Sorry I misunderstood about Implement SNI in activator to avoid activator always in path. David explained about it and it is clear now.

@evankanderson
Copy link
Member

We talked a bit about this in the security WG meeting. To summarize, we believe that all the following are true:

  1. The current implementation is the minimal version, which uses the same certificate across all namespacse. We may want to pick a different name that's slighly less confusing, something like shared or bootstrap.
    • This is weak against an attacker which has read access to Secrets in any Kubernetes namespace AND network-level write access.
  2. There are two feature additions we need to reach the "GA" security bar:
    • high-volume: Removing the activator from the request path when space capacity is greater than target-burst-capacity. Activator bypass is currently disabled when TLS is enabled.
    • enabled: Changing the expected SAN for connections between components, so that requests for a service in a given namespace expect that namespace in the subject of the next-hop certificate.
      These two options interplay with each other -- for enabled and high-volume, we either need ingresses to support requesting multiple subjects, or to implement SNI on the activator (or to generate a huge number of subjects on the activator cert).

Given that we're currently publishing support for the minimal level, we need a path forward with compatibility for any existing adopters. The security WG is recommending that we do the following upgrade path, probably in separate releases:

minimal -> enabled -> high-volume

Once enabled has been rolled out, we should be able to retire the minimal feature in the following release, possibly coordinated with the release of the high-volume feature.

Does that make sense? Is there other data that we're missing?

@davidhadas
Copy link
Contributor Author

To clarify, the recommendation for 2023, coming out of the recent Security WG meeting is:

  1. We should ship Trust=Minimal while supporting high-volume as our default GA - making this the default for Serving is a huge step forward in the security offered to the Knative community. Trust=Minimal is only slightly less secure than Trust=Enabled and on the other hand, offers a significant advantage over the current Non-TLS support.
  2. For community users that are not using high-volume, we recommend to also support Trust=Enabled as a configuration option. When used, this option should disable high-volume support.
    Both the recommended default (Trust=Minimal with high-volume), and the more secure option (Trust=Enabled without high-volume) are within reach in the near future and can be GAed in 2023.

As a future plan, we recommend to continue making an effort to support Trust=Enabled with high-volume support. Such work may depend on support from Ingress GW API and/or other external dependencies which are unlikely to be resolved at any near future. Hence such support may or may not be available in 2024/5/6/... Even without such work, the move to end to end encryption with Trust=Minimal is a significant step forward in the security offered to the community.

@dprotaso
Copy link
Member

dprotaso commented Jul 4, 2023

Given that we're currently publishing support for the minimal level, we need a path forward with compatibility for any existing adopters.

This is where I disagree. Only Kourier supports the internal encryption feature. Contour has code but we have zero test coverage. This feature is alpha and we have affordances to forgo the current implementation and skip this complex migration. Additionally, there is no documentation on the website about this feature.

high-volume

I'm confused is this a configuration option the user sets explicitly?

This seems unnecessary for Istio and Kourier because they can support multiple SANs for a backend the activator should just move in and out of the data path when Trust=enabled

For GatewayAPI & Contour we should default the activator to always be in the data path when trust=Enabled, document this clearly on the website, and then enable the activator movement transparently for the user when those implementations support multiple SAN in the future.

I'm not convinced that we should support trust=Minimal in order to have 'high-volume' support - we have two ingress implementations end-users can use to achieve their security & volume requirements.

@davidhadas
Copy link
Contributor Author

davidhadas commented Jul 25, 2023

As indicated above the Security WG opted to make TLS our default sooner rather than later (and for very good reasons). From our analysis, the path for getting to TLS as the Serving default is using Minimal. Also, we see the need to support Minimal for many years to come and do not see an option to have any other TLS default for Serving.

@dprotaso (and please correct me if I am misinterpreted your thoughts) clearly prefers to drop Minimal and have TLS supported with Enabled. @dprotaso sees Minimal as a transient phase that is confusing and need to be dropped.

I think we all agree on the facts (I listed everything I could collect below), and that we have two options:

Option A

  • Use non-TLS as our default in Serving for many years to come
  • Skip Minimal
  • Gradually develop Enabled

Option B

  • Merge Minimal
  • Use Minimal as our default in Serving for many years to come
  • Gradually develop Enabled

I believe we all agree on the many items below (please speak out if not):

  1. Minimal is dramatically more secure than non-TLS option. It offers data encryption in-transit and prevents MITM for an attacker that can manipulate the cluster network but have no access to any of the cluster secrets.
  2. Enabled is more secure than Minimal as it prevents a certain attack vector where the attacker has access to one secret in one namespace and can manipulate the the cluster network to perform MITM attack. (With Enabled the attacker needs access to the specific secret on the namespace being attacked).
  3. Closing the gaps to reach a GA with Minimal is doable in a relatively short time, and is a manageable effort - (seems as if we could have been there already if it was not for the disagreement as for the path forward).
  4. Minimal supports high-volume for all ingresses - The term high-volume refers to the case where we do not mandate activator in path for all requests, and instead support direct communication between the ingress and queue.
  5. We do not identify anything preventing us to move Minimal to GA and make it the default for Serving.
  6. Support for Enabled with high-volume is apparently doable at present in Kourier and Istio but we did not POC it and it represent another gap we need to close to reach GA with Enabled
  7. Support for Enabled with high-volume is apparently not doable in Contour, Ingress GW any time soon. Instead what we will be required to do is to enforce all requests to go via the activator for these ingresses.
  8. It will probably not make sense to use Enabled as our default before we can solve how we can support high-volume for all ingresses - (not an option I expect we will have in the next 1,2,3... years). This means we will continue to use non-TLS as our default.
  9. We started with the implementation of a number of trust levels that include Minimal, Enabled, Mutual, ... and merged a number of PRs in a couple of packages. We also created Queue multi-level trust #14063 and Dataplane-trust Adding mTLS and TLS To Activator #13969 for Serving for that purpose.

Still, although we seem to all agree on the above list, we disagree on the question of what should we do.

I am opting for a decision (one way or another) such that I can clean my desk form this effort as I am gradually focusing more and more on Zero-Trust aspects of K8s and Knative.

@evankanderson, @dprotaso, @nainaz @ReToCode @skonto @nak3, @rhuss, @psschwei

@dprotaso
Copy link
Member

We do not identify anything preventing us to move Minimal to GA and make it the default for Serving.
...
I am opting for a decision (one way or another) such that I can clean my desk form this effort as I am gradually focusing more and more on Zero-Trust aspects of K8s and Knative.

I've been pretty clear that we should drop Minimal for the past few months now. I don't understand why we're ignoring my feedback and revisiting this discussion every two weeks.

Closing the gaps to reach a GA with Minimal is doable in a relatively short time, and is a manageable effort - (seems as if we could have been there already if it was not for the disagreement as for the path forward).

Like I stated above - Enabled should be fairly easy to support today since multiple upstream SAN are supported by Kourier & Istio. Users would then have two options to run Knative securely and where the activator doesn't need to be in the data plane (aka. high-volume).

For Contour/Gateway API we can wait until their API can support such a change. It would be great if folks could engage/discuss/contribute in those respective projects and communities.

@davidhadas
Copy link
Contributor Author

davidhadas commented Jul 25, 2023

I've been pretty clear that we should drop Minimal for the past few months now. I don't understand why we're ignoring my feedback and revisiting this discussion every two weeks.

@dprotaso -
No-one ignores your feedback.
In parallel, no-one seem to be able to make sense as for why Option A is your choice over Option B - I have talked to almost anyone related to this effort to try to figure this out.

Since this is going nowhere, lets close the 2 respective PRs, consider it a POC that others can use as reference future Serving TLS support.

@dprotaso
Copy link
Member

dprotaso commented Jul 26, 2023

no-one seem to be able to make sense as for why Option A is your choice over Option B

What's not clear about my position?

I've stated previously:

  1. Supporting 'Enabled' using multiple SANs should be straight forward to do in Kourier and Istio
  2. Given 1) The extra work/testing/maintenance to support 'Minimal' seems unnecessary
  3. Given 1) Why introduce and default to 'Minimal' but have our docs recommend using 'Enabled' or higher? It's going to be confusing for end-users/admins.
  4. Please work upstream in the respective projects Contour/Gateway API to provide a path for 'Enabled' to be supported (via multiple SAN support)

@davidhadas
Copy link
Contributor Author

davidhadas commented Jul 27, 2023

Enabled is not suitable to be used as Serving default before all ingresses can support Enabled with high-volume.
It is very possible it will never be suitable as the default (e.g. if Gateway API will not support it). This is why Security WG chose after much deliberation to keep the Minimal option.

Minimal in many ways was designed as a subset of Enabled, same as Enabled being a subset of Mutual and Mutual being subset of Identity etc. - the claim about the extra work/testing/maintenance to support Minimal is not accurate and at the same time one can make similar arguments to remove Enabled and make Mutual the default, or maybe remove Mutual also and make Identity (which is not yet designed) to be the default.

It boils down to 'A Bird in the Hand is Worth Two in the Bush'.

@github-actions
Copy link

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Serving Encryption Nov 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Well-understood/specified features, ready for coding. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants