
Zstd:chunked podman-side tests #25007

Draft: mtrmac wants to merge 2 commits into main from zstd-chunked-with-tests

Conversation

Collaborator

@mtrmac mtrmac commented Jan 13, 2025

This adds Podman tests of chunked pulls.

Does this PR introduce a user-facing change?

None

@openshift-ci bot added the release-note-none and do-not-merge/work-in-progress labels Jan 13, 2025
Contributor

openshift-ci bot commented Jan 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mtrmac

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label Jan 13, 2025
@mtrmac mtrmac marked this pull request as draft January 13, 2025 22:14
@mtrmac mtrmac force-pushed the zstd-chunked-with-tests branch 2 times, most recently from 16d79da to 6d5c1d2 Compare January 14, 2025 21:19

Ephemeral COPR build failed. @containers/packit-build please check.

@mtrmac mtrmac force-pushed the zstd-chunked-with-tests branch 5 times, most recently from fb12098 to fb18284 Compare January 15, 2025 17:20
@mtrmac mtrmac force-pushed the zstd-chunked-with-tests branch 2 times, most recently from 0864767 to 1942947 Compare January 16, 2025 20:11
@mtrmac mtrmac force-pushed the zstd-chunked-with-tests branch 2 times, most recently from 56c3692 to 62891ef Compare January 21, 2025 21:59
@mtrmac mtrmac changed the title WIP: Zstd:chunked + podman-side tests Zstd:chunked + podman-side tests Jan 21, 2025
imgspecv1 "github.com/opencontainers/image-spec/specs-go/v1"
)

func pullChunkedTests() { // included in pull_test.go
Collaborator Author

All of this is rather tedious.

Plausibly we could pre-build all the test images and ship them in the CI images… but we still need to podman system reset for each test case, and the pulls must run from a real registry (not from oci: nor using podman load), so I’m not sure that would be any better.

It’s quite likely there is some infrastructure that could shorten these tests; I’d welcome pointers.

Collaborator Author

mtrmac commented Jan 21, 2025

@containers/podman-maintainers PTAL, this is the last part of the tracked Zstd blocker work.

@mtrmac mtrmac marked this pull request as ready for review January 21, 2025 22:06
@openshift-ci bot removed the do-not-merge/work-in-progress label Jan 21, 2025
Member

@Luap99 Luap99 left a comment

Hard blocking, as this will cause a lot of flakes due to system reset.

Comment on lines 304 to 306
// Do each test with a clean slate: no layer metadata known, no blob info cache.
// Annoyingly, we have to re-start and re-populate the registry as well.
podmanTest.PodmanExitCleanly("system", "reset", "-f")
Member

This is a hard blocker; we cannot call system reset in a parallel context. system reset modifies global shared state, see
054154c

In particular, looking through the CI logs here I see you have rerun flakes that are almost certainly caused by this.

Collaborator Author

Now marked as a serial test.

Alternatively, for the purposes of this test, it should be enough to remove all images+layers and the BlobInfoCache. Do you think it would make sense to do that manually?

Collaborator Author

Currently BlobInfoCache’s location can’t be changed by a public option to Podman, so the test would have to delete the global one, and possibly affect other tests.

Collaborator Author

… and even if I added a Podman option for the purpose of the test, the test also needs Skopeo, which does not have the option either :| .

Collaborator Author

Various options to move forward:

  • Stay with the serial test (and pay the cost, about 15 seconds or so of non-concurrent execution; see the sketch after this list)
  • Add the BlobInfoCache option to both Podman (possibly local-only, in a hidden option) and Skopeo (~necessarily public as a cross-project dependency), wait with tests until a Skopeo RPM ships
  • Teach c/image to do chunked pulls from dir:. Then we might not need a running registry at all (and, hypothetically, the test image files could be included in the CI image and not created at runtime), probably avoiding the need for running Skopeo at all — we would podman pull dir:…, and we would still need a Podman option.
  • Some shenanigans with running as non-root with a custom home directory, to trigger the use of the per-user BlobInfoCache. That would not test running as root, and, at a guess, introduces way too much unexpected complexity.
  • Don’t try to work from an empty state; go with whatever the BlobInfoCache knows from other operations or from previous cases of this test. (To an extent, that can be helped by using custom, very small, images, not the traditional ALPINE. But the BIC would still play a role.) I think this would be sufficient to confirm that the tests correctly reject invalid images. We could no longer test the fallback behavior of images with unknown information, or the like — various test cases would have to be dropped. Worse, the relative test timing could influence which exact code path is invoked, making the tests not deterministic even if they did reveal a bug.
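
For the first option, a rough sketch of what a Serial spec could look like (assuming Ginkgo v2's Serial decorator and the e2e suite's usual dot-imports; the spec name and body are illustrative):

// Illustrative fragment: the Serial decorator keeps this spec out of the
// parallel runners, so the "system reset" below cannot race with specs
// running in other processes.
It("podman pull chunked images", Serial, func() {
    // Wipes images, layers, and the global BlobInfoCache; this is exactly
    // the shared state that makes parallel execution unsafe.
    podmanTest.PodmanExitCleanly("system", "reset", "-f")

    // ... build/push the test images and run the per-case pulls here ...
})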

Member

Stay with the serial test (and pay the cost, about 15 seconds or so of non-concurrent execution)

I mean, if zstd:chunked and partial pulls are important then we should test them and accept whatever time they take. It certainly sounds like a very important feature, so I'd rather have it tested and not blocked because 15s was too long for us.

Add the BlobInfoCache option to both Podman (possibly local-only, in a hidden option) and Skopeo (~necessarily public as a cross-project dependency), wait with tests until a Skopeo RPM ships

Why is this a global, non-configurable location to begin with? Should this not be part of the storage location instead? (I mean, I know it is for the default because they are the same, but if other processes use a different storage, why should they still default to a global path?)

Teach c/image to do chunked pulls from dir:. Then we might not need a running registry at all (and, hypothetically, the test image files could be included in the CI image and not created at runtime), probably avoiding the need for running Skopeo at all — we would podman pull dir:…, and we would still need a Podman option.

Well, users still pull from real registries, so not testing against a real registry doesn't seem like an option?
I mean, sure, we could maybe get rid of some test cases, but ultimately it seems better to use the registry, as this is what most people would interact with.

Some shenanigans with running as non-root with a custom home directory, to trigger the use of the per-user BlobInfoCache. That would not test running as root, and, at a guess, introduces way too much unexpected complexity.

Yes, let's not go there; we do that for machine tests, but it is ugly and I would not recommend it.

Don’t try to work from an empty state; go with whatever the BlobInfoCache knows from other operations or from previous cases of this test. (To an extent, that can be helped by using custom, very small, images, not the traditional ALPINE. But the BIC would still play a role.) I think this would be sufficient to confirm that the tests correctly reject invalid images. We could no longer test the fallback behavior of images with unknown information, or the like — various test cases would have to be dropped. Worse, the relative test timing could influence which exact code path is invoked, making the tests not deterministic even if they did reveal a bug.

Non-deterministic tests sound horrible. I don't know what the BlobInfoCache does, but I would guess that in reality the code must work with an empty cache AND a fully/partially populated cache, possibly concurrently modified?
Only testing the empty-cache case would not reflect reality either, I assume, so do we test both?

Collaborator Author

@mtrmac mtrmac Jan 24, 2025

Teach c/image to do chunked pulls from dir: …

Well, users still pull from real registries, so not testing against a real registry doesn't seem like an option?

(There is at least test/e2e/pull_test.go: @test "push and pull zstd chunked image" to cover that.)


Non-deterministic tests sound horrible. I don't know what the BlobInfoCache does, but I would guess that in reality the code must work with an empty cache AND a fully/partially populated cache, possibly concurrently modified?

The code is structured so that:

  • If it knows the DiffID (uncompressed layer digest), and the config contains a DiffID, they must match exactly
  • (We always compute the DiffID of fully-pulled layers, as we decompress them)
  • During a chunked pull, we always compute the DiffID by default. (Users who don’t care about the security impact and value performance way too much can opt out.)
  • Together, the above resolve the signing ambiguity, unless the users opt out.
  • For purposes of layer reuse without pulling anything, various information known with certainty about layers is stored in BlobInfoCache and in the c/storage store. So, even if the users opt out, we might know the DiffID. And in that case we do enforce the consistency, because we don’t have any reason not to.
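
Restated as pseudocode (the names below are invented for illustration; this is not the actual c/image code):

// Hypothetical sketch of the consistency rule described above. knownDiffID
// is whatever we know for certain (computed during the pull, or recorded in
// the BlobInfoCache / c/storage); configDiffID is what the image config claims.
func diffIDConsistent(knownDiffID, configDiffID string) bool {
    if knownDiffID == "" || configDiffID == "" {
        // Nothing to compare; ambiguity can remain only if the user opted
        // out of DiffID computation and nothing was recorded earlier.
        return true
    }
    // Otherwise the two must match exactly, or the pull is rejected.
    return knownDiffID == configDiffID
}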

Only testing the empty-cache case would not reflect reality either, I assume, so do we test both?

Each case:

  • Resets storage
  • fresh: Tests pull of the test image from this empty state (no information should be recorded anywhere)
  • Triggers a pull of chunkedNormal, which can record the full set of metadata for the relevant layer (if something was not recorded during fresh, which might have refused to complete the pull)
  • reuse: Tests pull with all metadata available.

This should cover the “nothing known” and “everything known” extremes. I wouldn’t claim that it covers every single code path the metadata can travel — that can be determined for certain only with code coverage (branch coverage?) instrumentation — but I think it’s enough for the image ID / signing ambiguity logic.
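
Schematically, one table entry then looks roughly like this (pullAndCheck and registryRef are illustrative stand-ins, not the actual helper names in this PR):

// Outline of one test case; each case starts from a clean slate.
podmanTest.PodmanExitCleanly("system", "reset", "-f")

pullAndCheck(c.img, c.fresh)                                      // "fresh": nothing recorded anywhere
podmanTest.PodmanExitCleanly("pull", chunkedNormal.registryRef()) // records the full layer metadata
pullAndCheck(c.img, c.reuse)                                      // "reuse": all metadata available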

Comment on lines 150 to 152
// The actual test
for _, c := range []struct {
img *pullChunkedTestImage
Member

This is not how to do table-driven tests with ginkgo.
In particular, all this output ends up under one It() block, and it exits on the first error without running the following test cases, which may hide whether this is a single issue or multiple ones.

Just look at a CI log for podman pull chunked images
https://api.cirrus-ci.com/v1/artifact/task/6056101424136192/html/int-podman-fedora-41-root-host-sqlite.log.html

The output is basically unreadable; the print calls do not really help. Each of these must be its own test block for the log formatter to properly highlight all the cases.
Ginkgo does support table-like tests; see DescribeTable() in the quadlet test as an example.
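
For illustration, a DescribeTable-based layout might look roughly like this (the image names, entries, and body below are made up, not the actual chunked-pull tests):

// Each Entry() becomes its own spec, so a failure in one case does not hide
// the others and the log formatter can highlight each case separately.
DescribeTable("podman pull chunked images",
    func(image string, expectSuccess bool) {
        session := podmanTest.Podman([]string{"pull", image})
        session.WaitWithDefaultTimeout()
        if expectSuccess {
            Expect(session).Should(Exit(0))
        } else {
            Expect(session.ExitCode()).ToNot(Equal(0))
        }
    },
    Entry("correct chunked image", "localhost:5013/chunked-normal", true),
    Entry("zstd:chunked DiffID mismatch", "localhost:5013/chunked-mismatch", false),
)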

Collaborator Author

The difficulty with using DescribeTable is that I need to create the various pullChunkedTestImage on-disk images only once, to decrease test runtime.

AFAICS the two options are

  • Use BeforeEach within a Describe/It container; the setup will run repeatedly, every time, for each table entry.
  • Add the code to generate the images to the top-level SynchronizedBeforeSuite. Technically possible, but hard to maintain/scale if more tests add their custom setup at that top level.

The output is basically unreadable

I agree with that, but I think that’s primarily because it includes the voluminous --log-level=debug output. Would this be more acceptable if that output were silenced and only printed on test failure?

Member

You can set up a new Describe() and have the It() blocks in there. If the BeforeEach() is outside of the Describe it should only be run once AFAICT; compare vs JustBeforeEach().

Member

Unless you need --log-level debug for the test, I would prefer not to use it.
Consider that our logs are already quite large; it takes a fair bit to download and render them in my browser. So there is some value in avoiding unnecessary output.

Collaborator Author

You can set up a new Describe() and have the It() blocks in there. If the BeforeEach() is outside of the Describe it should only be run once AFAICT; compare vs JustBeforeEach().

That’s not how I read https://onsi.github.io/ginkgo/#mental-model-how-ginkgo-traverses-the-spec-hierarchy , but I’ll set up a test and confirm.

Collaborator Author

Unless you need --log-level debug for the test I would prefer not to use it.

The tests currently do rely on the debug level, e.g. to confirm that we don’t do a partial pull and fall back if we can’t resolve the ambiguities.

Collaborator Author

I have filed #25096 as a subset of this PR, which only updates existing tests. That can be used to vendor the latest c/image, for now.

To actually resolve this comment, I propose, as the first step, #25097 . This will allow the chunked tests to add a “don’t print all logs to Ginkgo output” option.

Member

Ok I took another look and yeah the BeforeEach() thing does not work like I thought in that context.

Also, more importantly, even if there were a way to do it, each parallel runner is its own process. That means the setup cannot be run once, and as the order of the It() blocks would be random, it would not be workable.

As for logging each test case, I would say it should use By() instead of manually printing; I guess the end result is the same, but this is more ginkgo style.
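
For illustration (img and its fields are made-up names, not the PR's exact ones):

// By() records the step in the ginkgo report, so a failure points at the
// image being exercised without manual "===Testing …===" prints.
By("testing pull of " + img.ref)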

Member

Regarding not printing the --log-level debug output: that is an option we can choose, but if such output is needed in the test then it SHOULD be printed, IMO.
When a test fails we want to see such output, for hard errors or flakes. (Yes, we can make it so the output is dumped on failures, but I don't feel so strongly about special-casing this one test.)

My main concern was that individual test cases are hard to follow/recognize; we just see this between a lot of lines:

           ===
           ===Testing docker://localhost:5013/chunked-mismatch
           ===

If we cannot fix that with the By() ginkgo lines then we should consider teaching logformatter to color them. Or maybe add more padding around it; adding an extra newline above/below would make it easier to spot.
I know this is very pedantic, but given almost everything flakes at one point, I would like to quickly spot the broken test case.

Collaborator Author

@mtrmac mtrmac Jan 24, 2025

Updated:

  • Now using By(). I don’t think it’s particularly more readable, but it is the standard and potentially visible to tools.
  • The --log-level=debug output is only shown if a test fails, and only for that failing command. This is a very visible change — most noise in the test now comes from WaitContainerReady (re)starting the registry.

See https://api.cirrus-ci.com/v1/artifact/task/4684706555363328/html/int-podman-fedora-41-root-host-sqlite.log.html for an example failure.

... because (podman system reset) will delete all of it,
interfering with the test storing other data in the directory.

Signed-off-by: Miloslav Trmač <[email protected]>
@mtrmac mtrmac force-pushed the zstd-chunked-with-tests branch 2 times, most recently from 1355296 to c2c2b6f Compare January 24, 2025 00:12
@mtrmac mtrmac changed the title Zstd:chunked + podman-side tests Zstd:chunked podman-side tests Jan 24, 2025
@mtrmac mtrmac marked this pull request as draft January 24, 2025 00:15
@openshift-ci bot added the do-not-merge/work-in-progress label Jan 24, 2025
@mtrmac mtrmac force-pushed the zstd-chunked-with-tests branch from c2c2b6f to 37a5534 Compare January 24, 2025 00:53
Signed-off-by: Miloslav Trmač <[email protected]>
@mtrmac mtrmac force-pushed the zstd-chunked-with-tests branch from 37a5534 to 00db254 Compare January 24, 2025 01:03
if expectation.success != nil {
    Expect(session).Should(Exit(0))
    for _, s := range expectation.success {
        Expect(regexp.MatchString(".*"+s+".*", log)).To(BeTrue(), s)
Member

Note: these Expect(false).To(BeTrue()) failures lead to poor error messages.
You should use something like Expect(string).To(MatchRegexp("regex")) instead.

sorry for not noticing this earlier
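
For illustration, with log and s as in the snippet above:

// Gomega's MatchRegexp is unanchored, so the ".*" wrapping is not needed;
// on failure it reports the actual string together with the regexp.
Expect(log).To(MatchRegexp(s))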

Collaborator Author

Yes; using the idiomatic way would trigger posting the whole debug log of that Podman command into that error structure. “Expected … $theWholeDebugLog … to match $regexp”. E.g. the stack trace would then be rather far from the failure header.

Comment on lines +404 to +406
for _, s := range expectation.failure {
    Expect(regexp.MatchString(".*"+s+".*", log)).To(BeTrue(), s)
}
Member

same here

Labels
approved, do-not-merge/work-in-progress, release-note-none

2 participants