Zstd:chunked podman-side tests #25007
base: main
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mtrmac. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
Force-pushed 16d79da to 6d5c1d2
Ephemeral COPR build failed. @containers/packit-build please check.
Force-pushed fb12098 to fb18284
Force-pushed 0864767 to 1942947
Force-pushed 56c3692 to 62891ef
test/e2e/pull_chunked_test.go (Outdated)

```go
	imgspecv1 "github.com/opencontainers/image-spec/specs-go/v1"
)

func pullChunkedTests() { // included in pull_test.go
```
All of this is rather tedious.

Plausibly we could pre-build all the test images and ship them in the CI images… but we still need to `podman system reset` for each test case, and the pulls must run from a real registry (not from `oci:` nor using `podman load`), so I'm not sure that would be any better.

It's quite likely there is some infrastructure that could shorten these tests; I'd welcome pointers.
@containers/podman-maintainers PTAL, this is the last part of the tracked Zstd blocker work.
Hard blocking, as this will cause a lot of flakes due to `system reset`.
test/e2e/pull_chunked_test.go (Outdated)

```go
	// Do each test with a clean slate: no layer metadata known, no blob info cache.
	// Annoyingly, we have to re-start and re-populate the registry as well.
	podmanTest.PodmanExitCleanly("system", "reset", "-f")
```
This is a hard blocker; we cannot call `system reset` in a parallel context. `system reset` modifies global shared state, see 054154c.

In particular, looking through the CI logs here, I see you have rerun flakes that are almost certainly caused by this.
Now marked as a serial test.
Alternatively, for the purposes of this test, it should be enough to remove all images+layers and the BlobInfoCache. Do you think it would make sense to do that manually?
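(For the serial marking itself, Ginkgo v2's `Serial` decorator is enough to opt a container out of parallel execution. A minimal sketch, assuming the suite's usual `podmanTest` helper; the spec text and body are illustrative, not this PR's exact code:)

```go
// Sketch: Serial guarantees no other spec runs concurrently, so the global
// `podman system reset` cannot race other tests.
var _ = Describe("Podman pull chunked images", Serial, func() {
	It("pulls from a clean state", func() {
		// Safe only because of the Serial decorator above.
		podmanTest.PodmanExitCleanly("system", "reset", "-f")
		// … the actual pull assertions …
	})
})
```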
Currently BlobInfoCache’s location can’t be changed by a public option to Podman, so the test would have to delete the global one, and possibly affect other tests.
… and even if I added a Podman option for the purpose of the test, the test also needs Skopeo, which does not have the option either :|
Various options to move forward:

- Stay with the serial test (and pay the cost, about 15 seconds or so of non-concurrent execution).
- Add the BlobInfoCache option to both Podman (possibly local-only, in a hidden option) and Skopeo (~necessarily public as a cross-project dependency), wait with tests until a Skopeo RPM ships.
- Teach c/image to do chunked pulls from `dir:`. Then we might not need a running registry at all (and, hypothetically, the test image files could be included in the CI image and not created at runtime), probably avoiding the need for running Skopeo at all — we would `podman pull dir:…`, and we would still need a Podman option.
- Some shenanigans with running as non-root with a custom home directory, to trigger the use of per-user BlobInfoCache. That would not test running as root, and, at a guess, introduces way too much unexpected complexity.
- Don't try to work from an empty state; go with whatever the BlobInfoCache knows from other operations or from previous cases of this test. (To an extent, that can be helped by using custom, very small, images, not the traditional ALPINE. But the BIC would still play a role.) I think this would be sufficient to confirm that the tests correctly reject invalid images. We could no longer test the fallback behavior of images with unknown information, or the like — various test cases would have to be dropped. Worse, the relative test timing could influence which exact code path is invoked, making the tests non-deterministic even if they did reveal a bug.
> Stay with the serial test (and pay the cost, about 15 seconds or so of non-concurrent execution)

If zstd:chunked and partial pulls are important, then we should test them and accept whatever time they take. It certainly sounds like a very important feature, so I'd rather have it tested and not blocked because 15s was too long for us.

> Add the BlobInfoCache option to both Podman (possibly local-only, in a hidden option) and Skopeo (~necessarily public as a cross-project dependency), wait with tests until a Skopeo RPM ships

Why is this a global, non-configurable location to begin with? Should it not be part of the storage location instead? (I mean, I know it is for the default, because they are the same, but if other processes use a different storage, why should they still default to a global path?)

> Teach c/image to do chunked pulls from `dir:`. Then we might not need a running registry at all (and, hypothetically, the test image files could be included in the CI image and not created at runtime), probably avoiding the need for running Skopeo at all — we would `podman pull dir:…`, and we would still need a Podman option.

Well, users still pull from real registries, so not testing a real registry seems like not an option? I mean, sure, we could maybe get rid of some test cases, but ultimately it seems better to use the registry, as this is what most people would interact with.

> Some shenanigans with running as non-root with a custom home directory, to trigger the use of per-user BlobInfoCache. That would not test running as root, and, at a guess, introduces way too much unexpected complexity.

Yes, let's not go there; we do that for machine tests, but it is ugly and I would not recommend it.

> Don't try to work from an empty state; go with whatever the BlobInfoCache knows from other operations or from previous cases of this test. (To an extent, that can be helped by using custom, very small, images, not the traditional ALPINE. But the BIC would still play a role.) I think this would be sufficient to confirm that the tests correctly reject invalid images. We could no longer test the fallback behavior of images with unknown information, or the like — various test cases would have to be dropped. Worse, the relative test timing could influence which exact code path is invoked, making the tests non-deterministic even if they did reveal a bug.

Non-deterministic tests sound horrible. I don't know what the BlobInfoCache does, but I would guess in reality the code must work with an empty cache AND a fully/partially populated cache, possibly concurrently modified?

Only testing the empty cache case would not reflect reality either, I assume, so do we test both?
> Teach c/image to do chunked pulls from `dir:` …

> Well, users still pull from real registries, so not testing a real registry seems like not an option?

(There is at least `test/e2e/pull_test.go`: `@test "push and pull zstd chunked image"` to cover that.)
> Non-deterministic tests sound horrible. I don't know what the BlobInfoCache does, but I would guess in reality the code must work with an empty cache AND a fully/partially populated cache, possibly concurrently modified?

The code is structured so that:

- If it knows the DiffID (uncompressed layer digest), and the config contains a DiffID, they must match exactly (see the sketch after this list).
- (We always compute the DiffID of fully-pulled layers, as we decompress them.)
- During a chunked pull, we always compute the DiffID by default. (Users who don't care about the security impact and value performance way too much can opt out.)
- Together, the above resolve the signing ambiguity, unless the users opt out.
- For purposes of layer reuse without pulling anything, various information known with certainty about layers is stored in BlobInfoCache and in the c/storage store. So, even if the users opt out, we might know the DiffID. And in that case we do enforce the consistency, because we don't have any reason not to.
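(To make the DiffID rule concrete, a hedged sketch, assuming a gzip-compressed layer; the package and function names are illustrative, not the actual c/image code:)

```go
package layercheck

import (
	"compress/gzip"
	"crypto/sha256"
	"fmt"
	"io"
)

// computeDiffID digests the *uncompressed* layer stream; that digest is what
// the image config's rootfs.diff_ids entries refer to.
func computeDiffID(compressedLayer io.Reader) (string, error) {
	gz, err := gzip.NewReader(compressedLayer)
	if err != nil {
		return "", err
	}
	defer gz.Close()
	h := sha256.New()
	if _, err := io.Copy(h, gz); err != nil {
		return "", err
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil)), nil
}

// checkDiffID enforces the consistency rule described above: if both values
// are known, they must match exactly.
func checkDiffID(computed, fromConfig string) error {
	if fromConfig != "" && computed != fromConfig {
		return fmt.Errorf("DiffID mismatch: computed %s, config has %s", computed, fromConfig)
	}
	return nil
}
```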
> Only testing the empty cache case would not reflect reality either, I assume, so do we test both?
Each case:

- Resets storage.
- `fresh`: Tests pull of the test image from this empty state (no information should be recorded anywhere).
- Triggers a pull of `chunkedNormal`, which can record the full set of metadata for the relevant layer (if something was not recorded during `fresh`, which might have refused to complete the pull).
- `reuse`: Tests pull with all metadata available.
This should cover the “nothing known” and “everything known” extremes. I wouldn’t claim that it covers every single code path the metadata can travel — that can be determined for certain only with code coverage (branch coverage?) instrumentation — but I think it’s enough for the image ID / signing ambiguity logic.
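(In other words, the per-case flow looks roughly like this; `testCases`, `pullExpecting`, `chunkedNormal`, and `expectOK` are hypothetical names standing in for the discussion above, not this PR's exact API:)

```go
for _, c := range testCases {
	// Clean slate: no layer metadata known, no blob info cache.
	podmanTest.PodmanExitCleanly("system", "reset", "-f")
	pullExpecting(c.img, c.fresh)          // "fresh": nothing known yet
	pullExpecting(chunkedNormal, expectOK) // records full metadata for the layer
	pullExpecting(c.img, c.reuse)          // "reuse": all metadata available
}
```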
test/e2e/pull_chunked_test.go (Outdated)

```go
	// The actual test
	for _, c := range []struct {
		img *pullChunkedTestImage
```
This is not how to do table-driven tests with Ginkgo. In particular, all this output ends up under one It() block, and it exits on the first error without running the following test cases, which may hide whether this is a single issue or multiple ones.

Just look at a CI log for podman pull chunked images:
https://api.cirrus-ci.com/v1/artifact/task/6056101424136192/html/int-podman-fedora-41-root-host-sqlite.log.html

The output is basically unreadable; the println calls do not really help. Each of these must be their own test block for the log formatter to properly highlight all the cases.

Ginkgo does support table-like tests; see DescribeTable() in the quadlet test as an example.
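(For illustration, a minimal DescribeTable sketch in the Ginkgo v2 style; the image names and expected exit codes are placeholders, not this PR's actual cases:)

```go
var _ = DescribeTable("podman pull chunked images",
	func(image string, expectedExit int) {
		session := podmanTest.Podman([]string{"pull", image})
		session.WaitWithDefaultTimeout()
		Expect(session).Should(Exit(expectedExit))
	},
	// Each Entry becomes its own spec, so one failure does not hide the rest.
	Entry("normal chunked image", "localhost:5013/chunked-normal", 0),
	Entry("mismatching DiffID is rejected", "localhost:5013/chunked-mismatch", 125),
)
```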
The difficulty with using `DescribeTable` is that I need to create the various `pullChunkedTestImage` on-disk images only once, to decrease test runtime. AFAICS the two options are:

- Use `BeforeEach` within a `Describe`/`It` container; the setup will run repeatedly, every time, for each table entry.
- Add the code to generate the images to the top-level `SynchronizedBeforeSuite` (see the sketch after this list). Technically possible, but hard to maintain/scale if more tests added their custom setup to that top level.
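(The shape of the second option, as a sketch; `buildPullChunkedTestImages` and `pullChunkedImageDir` are hypothetical names:)

```go
// SynchronizedBeforeSuite: the first function runs exactly once, in parallel
// process #1; the returned bytes are handed to the second function, which
// runs in every parallel process.
var _ = SynchronizedBeforeSuite(func() []byte {
	dir := buildPullChunkedTestImages() // expensive setup, runs once
	return []byte(dir)
}, func(data []byte) {
	pullChunkedImageDir = string(data) // each process learns the shared path
})
```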
> The output is basically unreadable

I agree with that, but I think that's primarily because it includes the voluminous `--log-level=debug` output. Would this be more acceptable if that output were silenced and only printed on test failure?
You can set up a new Describe() and have the It() blocks in there. If the BeforeEach() is outside of the Describe, it should only be run once AFAICT; compare vs JustBeforeEach().
Unless you need --log-level debug for the test, I would prefer not to use it.

Consider that our logs are already quite large; it takes a fair bit to download and render them in my browser. So there is some value in avoiding unnecessary output.
> You can set up a new Describe() and have the It() blocks in there. If the BeforeEach() is outside of the Describe, it should only be run once AFAICT; compare vs JustBeforeEach().

That's not how I read https://onsi.github.io/ginkgo/#mental-model-how-ginkgo-traverses-the-spec-hierarchy, but I'll set up a test and confirm.
> Unless you need --log-level debug for the test, I would prefer not to use it.

The tests currently do rely on the debug level, e.g. to confirm that we don't do a partial pull and fall back if we can't resolve the ambiguities.
Ok, I took another look, and yeah, the BeforeEach() thing does not work like I thought in that context.

Also, more importantly, even if there were a way to do it, each parallel runner is its own process. That means the setup cannot be run once, and as the order of It() would be random, it would not be workable.

As far as logging each test case, I would say it should use By() instead of manually printing; I guess the end result is the same, but this is more Ginkgo style.
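(A sketch of what that would look like; `testImages` and `img.ref` are illustrative names, not this PR's code:)

```go
It("pulls chunked images", func() {
	for _, img := range testImages {
		// By() records a step in the spec report instead of a bare print.
		By("Testing " + img.ref)
		// … pull and assert …
	}
})
```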
Regarding not printing the log-level debug output: that is an option we can choose, but if such output is needed in the test, then it SHOULD be printed, IMO. When a test fails we want to see such output, hard errors or flakes. (Yes, we can make it so the output is dumped on failures, but I don't feel so strongly about special-casing this one test.)

My main concern was that individual test cases are hard to follow/recognize; we just see this between a lot of lines:

```
===
===Testing docker://localhost:5013/chunked-mismatch
===
```

If we cannot fix that with the By() ginkgo lines, then we should consider teaching logformatter to color them. Or maybe add more padding around it; adding an extra newline above/below would make it easier to spot.

I know this is very pedantic, but given almost everything flakes at one point, I would like to quickly spot the broken test case.
Updated:

- Now using `By()`. I don't think it's particularly more readable, but it is the standard and potentially visible to tools.
- The `--log-level=debug` output is only shown if a test fails, and only for that failing command. This is a very visible change — most noise in the test now comes from `WaitContainerReady` (re)starting the registry.

See https://api.cirrus-ci.com/v1/artifact/task/4684706555363328/html/int-podman-fedora-41-root-host-sqlite.log.html for an example failure.
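(One mechanism for the only-on-failure behavior, as a sketch: Ginkgo buffers anything written to GinkgoWriter and emits it only for failing specs, unless running with -v. The `session` variable is assumed from the e2e framework:)

```go
// Buffered by Ginkgo; shown only if this spec fails (or with -v).
fmt.Fprintln(GinkgoWriter, session.OutputToString())
```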
Force-pushed 4f09725 to 624bf91
... because (podman system reset) will delete all of it, interfering with the test storing other data in the directory. Signed-off-by: Miloslav Trmač <[email protected]>
Force-pushed 1355296 to c2c2b6f
Force-pushed c2c2b6f to 37a5534
Signed-off-by: Miloslav Trmač <[email protected]>
Force-pushed 37a5534 to 00db254
```go
	if expectation.success != nil {
		Expect(session).Should(Exit(0))
		for _, s := range expectation.success {
			Expect(regexp.MatchString(".*"+s+".*", log)).To(BeTrue(), s)
```
Note these Expect(false).To(BeTrue()) lead to poor error messages; you should use something like Expect(string).To(MatchRegexp("regex")) instead.

Sorry for not noticing this earlier.
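(To illustrate the difference, using the same `s` and `log` variables as the snippet above; a sketch, not the final code:)

```go
// Current pattern: the failure message is just "Expected false to be true".
matched, err := regexp.MatchString(".*"+s+".*", log)
Expect(err).ToNot(HaveOccurred())
Expect(matched).To(BeTrue(), s)

// Idiomatic matcher: the failure message shows the actual text and the
// regexp that did not match.
Expect(log).To(MatchRegexp(s))
```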
Yes; using the idiomatic way would trigger posting the whole debug log of that Podman command into that error structure. “Expected … $theWholeDebugLog … to match $regexp”. E.g. the stack trace would then be rather far from the failure header.
```go
		for _, s := range expectation.failure {
			Expect(regexp.MatchString(".*"+s+".*", log)).To(BeTrue(), s)
		}
```
Same here.
This adds Podman tests of chunked pulls.
Does this PR introduce a user-facing change?