Add TEST_TROUBLESHOOTING.md initial version #518

Merged
merged 2 commits into from
Nov 22, 2023
2 changes: 1 addition & 1 deletion README.md
@@ -11,7 +11,7 @@ The data are structured in the [Flexible metadata format](https://fmf.readthedoc
Individual tests are supposed to be executed using the [Test management tool](https://tmt.readthedocs.io/en/stable/).

## Test execution and troubleshooting
Test execution and troubleshooting is described in detail in the [TESTING](TESTING.md) file.
Test execution and troubleshooting is described in detail in [TESTING](TESTING.md) and [TEST_TROUBLESHOOTING.md](TEST_TROUBLESHOOTING.md).

## Commit / merge policy

6 changes: 6 additions & 0 deletions TESTING.md
@@ -153,6 +153,12 @@ On every test system:
For the next test iteration make sure to change `XTRA` to a new value on each test system.
This variable is necessary for the proper functioning of a sync mechanism.

## Running tests against keylime bits from a fork or a different branch

There are two specific `/setup/` tasks that test plans from the `/plans/` directory use to install keylime bits: `/setup/install_upstream_keylime` and `/setup/install_upstream_rust_keylime` ([example](https://github.com/RedHat-SP-Security/keylime-tests/blob/main/plans/upstream-keylime-all-tests.fmf#L12)). By default, they point to the respective upstream repos and their default branches, but you can point them to a different repo or branch using environment variables. The easiest way is to change the values defined in the `/plans/main.fmf` file, because this file is inherited by all plans. Alternatively, you can define the respective environment variables directly on the `tmt` command line using the `--environment NAME=VALUE` parameter.

Note that there is also a task `/setup/install_rust_keylime_from_copr`, which some plans use to speed up test execution by installing the agent from RPMs. This task is not compatible with a custom agent repo, so you would have to replace it with `/setup/install_upstream_rust_keylime`.
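
The command-line alternative can be sketched as follows. This is only an illustration: the variable names `KEYLIME_UPSTREAM_URL` and `KEYLIME_UPSTREAM_BRANCH` are assumptions and should be checked against `/plans/main.fmf`, and the helper `tmt_fork_command` is hypothetical. It only prints the command so that it can be reviewed before being run.

```shell
# Hypothetical helper: print a tmt command line that overrides the keylime
# repo and branch. The two variable names are assumptions; verify them
# in /plans/main.fmf before use.
# $1 = repo URL of the fork, $2 = branch name
tmt_fork_command() {
    printf 'tmt run --environment KEYLIME_UPSTREAM_URL=%s --environment KEYLIME_UPSTREAM_BRANCH=%s plan --name upstream-keylime-all-tests\n' "$1" "$2"
}
```

For example, `tmt_fork_command https://github.com/YOUR_USER/keylime.git my-branch` prints the command for a fork, where the URL and branch are placeholders.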

## Running CI tests from the upstream keylime project

Clone the keylime source code from the upstream project (or your fork) and change the branch if necessary.
65 changes: 65 additions & 0 deletions TEST_TROUBLESHOOTING.md
@@ -0,0 +1,65 @@
# Test failure troubleshooting guide

This guide should help you troubleshoot keylime failures more efficiently by providing hints on how to read test logs.
Still, reading logs alone may not be sufficient to troubleshoot an issue, so it is highly recommended to also read [TESTING.md](TESTING.md)
and learn how to schedule keylime tests using the `tmt` tool.

## First steps

### Asking for help

Do not be afraid to ask for help on the #keylime Slack channel on cloud-native.slack.com.

### Searching for service logs

When a test fails, keylime service logs are typically printed during the CleanUp phase of the test when services are being stopped.

Another way to access service logs (and other test-related logs) is to click on the "log_dir" link at the end of the test and navigate from there to the "data" directory.

### Enable logging of DEBUG messages

To speed up testing and improve test stability, we have disabled logging at the DEBUG level, so you may be missing some useful messages. To enable DEBUG logging, modify the respective `tmt` test plan in your PR (the keylime project plan is stored in the `packit_ci.fmf` file) and include a new test `/setup/enable_keylime_debug_messages` right between the tasks installing keylime bits and the first `/functional` test.

### Checking if the failure is new

You may check other recently opened and closed keylime PRs to find out whether the failure is present there. If so, it is likely that the issue has already been discussed in the comments.

### Checking test log for the first failure

When investigating a test failure in a test log provided by Testing Farm, the quickest way to find errors is to search for the "[ FAIL ]" string using your browser, and also for preceding lines containing the string " ERROR ". These FAIL and ERROR lines should either point to the root cause or at least show how the root cause manifests itself in the test scenario. Once you know the error, you may find additional related hints in the text below.
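
If you have downloaded the test log, the same search can be done on the command line. The following is a minimal sketch, assuming the log has been saved to a local file; the function name `first_failures` is hypothetical.

```shell
# Hypothetical helper: print every line containing "[ FAIL ]" or " ERROR "
# together with its line number, in the order they appear in the log.
# $1 = path to a downloaded test log
first_failures() {
    # -F treats the patterns as fixed strings, so "[" and "]" are literal.
    grep -n -F -e '[ FAIL ]' -e ' ERROR ' "$1"
}
```

Piping the output through `head` shows only the earliest matches, which are usually the most relevant ones.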

### Looking for a package update causing a regression

Keylime tests are typically run using the upstream keylime code on multiple Fedora releases. Despite having the same keylime version, these releases differ in the other packages installed on a test system. If you spot an unknown test failure on a single Fedora release (typically Rawhide), it is very likely that the failure is caused by a package update. If you are able to reproduce the issue on your test system (e.g. a virtual one), you can check which packages have been updated recently.

To list packages installed on the system sorted by the build date (the most recent ones last) run:
```
$ rpm -qa --qf '%{BUILDTIME} %{NVR}\n' | sort -n
```
Then check whether the "usual suspects" are near the end of the list. These are: tpm2-tools, tpm2-tss, edk2-ovmf, swtpm, coreutils, kernel.

Also, you can check for specific package builds directly in [Koji](https://koji.fedoraproject.org/koji/search).
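
The check for the "usual suspects" described above can be narrowed down with a small filter. This is a minimal sketch, not part of the keylime-tests repository: the function names `filter_suspects` and `list_suspects` are hypothetical, and the pattern matches any package whose name starts with one of the listed names, so subpackages such as `kernel-core` are included as well.

```shell
# Hypothetical helper: keep only lines whose package name starts with
# one of the "usual suspects". Expects "BUILDTIME NVR" lines on stdin.
filter_suspects() {
    grep -E '^[0-9]+ (tpm2-tools|tpm2-tss|edk2-ovmf|swtpm|coreutils|kernel)'
}

# Hypothetical wrapper: the same rpm query as above, sorted by build time
# (most recent last) and reduced to the suspects.
list_suspects() {
    rpm -qa --qf '%{BUILDTIME} %{NVR}\n' | sort -n | filter_suspects
}
```

Running `list_suspects` on the test system prints only the matching packages, with the most recently built one last.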

## Agent registration failures

TBD

## Agent startup failures

TBD

## Measured boot related test scenarios

### keylime.tenant - ERROR - Failed key derivation for Agent

This is most likely caused by a package update (tpm2-tools, tpm2-tss, edk2-ovmf). Check the verifier log; typically it contains an agent status message with more details. For example:
```
keylime_verifier[76390]: 2023-11-15 12:39:16.726 - keylime.measured_boot - ERROR - Boot attestation failed for agent d432fbb3-d2f1-4a97-9ef7-75bd81c00000,
policy example, refstate={"has_secureboot": true, ....
... 'EventSize': 8, 'Event': {'String': 'MokList\x00'}} [Event String is not 'MokList', Event String is not 'MokListX', Event String is not 'MokListTrusted']
```
which points out the actual problem.
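
To pull such messages out of a saved verifier log, a fixed-string search for the logger name and level shown in the example above is usually enough. This is a minimal sketch, assuming the verifier log has been saved to a local file; the function name `boot_attestation_errors` is hypothetical.

```shell
# Hypothetical helper: print measured boot errors from a saved verifier log.
# $1 = path to the verifier log file
boot_attestation_errors() {
    # The search string is taken from the log line format shown above.
    grep -n -F 'keylime.measured_boot - ERROR' "$1"
}
```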

### libefivar.so.1: cannot open shared object file: No such file or directory

Here the efivar-libs package has been accidentally uninstalled. We believe that the issue has been addressed with [PR#515](https://github.com/RedHat-SP-Security/keylime-tests/pull/515) but if you were doing some system setup manually it is possible that the package is missing.
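
On a manually prepared system, you can verify that the package is present and reinstall it if needed. This is a minimal sketch, assuming a Fedora system with `rpm` and `dnf` available and root privileges; the function name `ensure_efivar_libs` is hypothetical.

```shell
# Hypothetical helper: install efivar-libs only when rpm reports it missing.
ensure_efivar_libs() {
    if rpm -q efivar-libs >/dev/null 2>&1; then
        echo "efivar-libs is installed"
    else
        echo "efivar-libs is missing, installing it"
        dnf install -y efivar-libs
    fi
}
```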