Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

errors: remove verifier truncation heuristic, rely on ENOSPC #756

Merged
merged 2 commits into from
Aug 23, 2022

Conversation

ti-mo
Copy link
Collaborator

@ti-mo ti-mo commented Aug 3, 2022

In order to capture full verifier errors of large programs, the caller can
choose to repeatedly retry loading the program with progressively larger log
buffers, since the important info is always at the end of the log.

In doing so, it's incredibly likely for the last character of the buffer to
be a newline, resulting in the truncated flag being set to false when it should
instead be set to true, regardless of the kernel's ENOSPC return code.

This commit removes the newline heuristic that currently aims to detect log
truncation. Instead, we now require ENOSPC to be sent by the kernel in order
for the Truncated flag to be set in the VerifierError.

The heuristic was initially introduced for parsing BTF verifier errors,
since there's a latent kernel bug that causes ENOSPC only being returned
on a successful BTF load. Unfortunately, it turns out that not all BTF
verifier errors end in a newline either (see calls to btf_verifier_log()),
making the newline heuristic ineffective for BTF logs as well.

ErrorWithLog now takes an ancillary error from a second load operation that
is used to determine truncation status. This second error will not be wrapped
into the VerifierError.

@lmb
Copy link
Collaborator

lmb commented Aug 4, 2022

There is no indication that kernels 5.18 and up don't return ENOSPC for BTF
loads.

Have you actually managed to get ENOSPC from loading BTF? In my experience an EINVAL or similar will override ENOSPC when loading invalid BTF. Maybe you're talking about ENOSPC from valid BTF with a small log?

Agreed re bringing the check for ENOSPC back, but I think we might still need the heuristic for BTF. We could fall back if the logErr is nil, or we could factor it out into a separate function and call that from the BTF loading code.

@ti-mo
Copy link
Collaborator Author

ti-mo commented Aug 16, 2022

Have you actually managed to get ENOSPC from loading BTF? In my experience an EINVAL or similar will override ENOSPC when loading invalid BTF. Maybe you're talking about ENOSPC from valid BTF with a small log?

I see, I misinterpreted your syscall will never return ENOSPC as of 5.18-rc4 comment to be 'as of 5.18-rc4...', implying something changed in 5.18. This should probably be ENOSPC until at least 6.0, as this is still the case on master. In btf.c, at the end of btf_parse(), ENOSPC is only returned if the log space check is hit, which requires a successful BTF parse.

I'll think of something else.

@ti-mo
Copy link
Collaborator Author

ti-mo commented Aug 16, 2022

We could fall back if the logErr is nil, or we could factor it out into a separate function and call that from the BTF loading code.

This feels clearer to me. BPF and BTF verifier logs have different styles and their APIs have different semantics.

@ti-mo ti-mo force-pushed the tb/verifier-error-truncate branch 2 times, most recently from 0076beb to 6147394 Compare August 19, 2022 13:29
@ti-mo ti-mo marked this pull request as ready for review August 19, 2022 13:51
ti-mo added a commit to ti-mo/cilium that referenced this pull request Aug 19, 2022
Just having seen a 4,6MB log from a bpf_host complexity issue,
I started work on automatically growing the log buffer instead of hardcoding
a size here, but hit a small bug in the lib. A possible solution is out, but
still needs review: cilium/ebpf#756.

Grow the buffer to 10MiB to at least get access to the full contents.

Fixes: cilium#20738

Signed-off-by: Timo Beckers <timo@isovalent.com>
Copy link
Collaborator

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to capture full verifier errors of large programs, the caller can
choose to repeatedly retry loading the program with progressively larger log
buffers, since the important info is always at the end of the log.

I think this is the first time I understand what you are trying to achieve, and why you need .Truncated to be accurate. I didn't consider that use case when introducing .Truncated, so I didn't mind that the heuristic isn't bomb proof.

You're trying to make .Truncated "reliable". Have you considered implementing resize-and-retry in the lib instead? Seems like a useful feature to have.

btf/btf_test.go Show resolved Hide resolved
internal/errors.go Show resolved Hide resolved
internal/errors.go Outdated Show resolved Hide resolved
internal/errors.go Outdated Show resolved Hide resolved
aditighag pushed a commit to cilium/cilium that referenced this pull request Aug 22, 2022
Just having seen a 4,6MB log from a bpf_host complexity issue,
I started work on automatically growing the log buffer instead of hardcoding
a size here, but hit a small bug in the lib. A possible solution is out, but
still needs review: cilium/ebpf#756.

Grow the buffer to 10MiB to at least get access to the full contents.

Fixes: #20738

Signed-off-by: Timo Beckers <timo@isovalent.com>
Add a test to make sure a default Spec can be instantiated if so desired.
This is convenient for exercising/experimenting with BTF verifier error
handling.

Signed-off-by: Timo Beckers <timo@isovalent.com>
@ti-mo ti-mo force-pushed the tb/verifier-error-truncate branch from 6147394 to 47302a3 Compare August 23, 2022 13:31
prog.go Outdated Show resolved Hide resolved
In order to capture full verifier errors of large programs, the caller can
choose to repeatedly retry loading the program with progressively larger log
buffers, since the important info is always at the end of the log.

In doing so, it's incredibly likely for the last character of the buffer to
be a newline, resulting in the truncated flag being set to false when it should
instead be set to true, regardless of the kernel's ENOSPC return code.

This commit removes the newline heuristic that currently aims to detect log
truncation. Instead, we now require ENOSPC to be sent by the kernel in order
for the Truncated flag to be set in the VerifierError.

The heuristic was initially introduced for parsing BTF verifier errors,
since there's a latent kernel bug that causes ENOSPC only being returned
on a successful BTF load. Unfortunately, it turns out that not all BTF
verifier errors end in a newline either (see calls to btf_verifier_log()),
making the newline heuristic ineffective for BTF logs as well.

ErrorWithLog now takes an ancillary error from a second load operation that
is used to determine truncation status. This second error will not be wrapped
into the VerifierError.

Signed-off-by: Timo Beckers <timo@isovalent.com>
@ti-mo ti-mo force-pushed the tb/verifier-error-truncate branch from 47302a3 to 3bad83a Compare August 23, 2022 18:29
@ti-mo ti-mo merged commit 22cc87b into cilium:master Aug 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants