Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

link: add maxactive for kretprobe #755

Merged
merged 9 commits into from
Dec 15, 2022

Conversation

alahaiyo
Copy link
Contributor

@alahaiyo alahaiyo commented Aug 3, 2022

In some case we want to missed as little as possible and
it is helpful to have this configuration option.

Copy link
Collaborator

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your PR!

Can you explain your use case for maxactive and why implementing your own sampling in BPF isn't sufficient?

As it stands, the PR is too special cased for me:

  • doesn't work with pmu probes
  • only supported for retprobes (?)

Kprobes have a large API, and supporting all of it is not possible. So we need a very compelling reason to add something like your proposed change.

@alahaiyo alahaiyo closed this Aug 3, 2022
@alahaiyo alahaiyo deleted the kretprobe_add_maxactive branch August 3, 2022 11:49
@alahaiyo alahaiyo restored the kretprobe_add_maxactive branch August 3, 2022 11:50
@alahaiyo alahaiyo reopened this Aug 3, 2022
@alahaiyo
Copy link
Contributor Author

alahaiyo commented Aug 3, 2022

Thank you for your PR!

Can you explain your use case for maxactive and why implementing your own sampling in BPF isn't sufficient?

As it stands, the PR is too special cased for me:

  • doesn't work with pmu probes
  • only supported for retprobes (?)

Kprobes have a large API, and supporting all of it is not possible. So we need a very compelling reason to add something like your proposed change.

If the traced function can sleep, then when the number of processes exceeds the number of CPUs, the kretprobe will not be triggered. Because the kernel defaults maxactive to NR_CPU. When we trace some kernel network functions (such as inet_accept), when the network connection pressure is very high, there will be a lot of kretprobe not triggered.

When writing kprobe_event through tracefs, you can set maxactive to a maximum of 2048, which is enough. The implementation of pmu probe in the kernel will call create_local_trace_kprobe, in which maxactive is hardcoded to 0.

@lmb
Copy link
Collaborator

lmb commented Aug 4, 2022

Thanks for the background, that's interesting.

So what is the behaviour of a pmu retprobe in the same scenario? It drops retprobe calls, without being able to work around the limitation? Asked another way, what does it mean that maxactive is 0 in the pmu case?

What problem does maxactive guard against in the first place? Maybe we could just hardcode it to 2048 in the tracefs case?

Cc @mmat11

@mmat11
Copy link
Contributor

mmat11 commented Aug 4, 2022

It drops retprobe calls, without being able to work around the limitation? Asked another way, what does it mean that maxactive is 0 in the pmu case?

I believe this is the case

What problem does maxactive guard against in the first place?

I found an explanation here iovisor/bcc#1072 (comment)

Maybe we could just hardcode it to 2048 in the tracefs case?

From this commit torvalds/linux@696ced4 it seems the maximum is 4096. I guess there's some overhead (see https://elixir.bootlin.com/linux/v5.19/source/kernel/kprobes.c#L2202) in having a big number, else it could have been hardcoded in the kernel itself (?).

I wonder why this parameter isn't configurable via perf_event_open.
IMO we can support the feature but we should raise an error when it's not possible to do so (kernel < 4.12 or maxactive > 0 on non-return probes) and clearly document that setting this option will force the probe to be installed via tracefs

@nashuiliang
Copy link

hi, @mmat11

I wonder why this parameter isn't configurable via perf_event_open.

Tow related commits(not merged):

Enable specifying maxactive for fd based kretprobe. This will be useful
for tracing tools like bcc and bpftrace. [1] discussed the need of this
in bpftrace. Use highest highest 12 bit (bit 52-63) to allow maximal
maxactive of 4095.

[1] https://github.com/iovisor/bpftrace/issues/835
From: Song Liu <songliubraving@fb.com>

Enable specifying maxactive for fd based kretprobe. This will be useful
for tracing tools like bcc and bpftrace (see for example discussion [1]).
Use highest 4 bit (bit 59-63) to allow specifying maxactive by log2.

The original patch [2] seems to be fallen through the cracks and wasn't
applied. I've merely rebased the work done by Song Liu, verififed it
still works, and modified to allow specifying maxactive by log2 per
suggestion from the discussion thread.

Note that changes in rethook implementation may render maxactive
obsolete.

[1]: https://github.com/iovisor/bpftrace/issues/835
[2]: https://lore.kernel.org/all/20191007223111.1142454-1-songliubraving@fb.com/

It seems to be discussing a more elegant way to implement.

IMO we can support the feature but we should raise an error when it's not possible to do so (kernel < 4.12 or maxactive > 0 on non-return probes) and clearly document that setting this option will force the probe to be installed via tracefs

Agree, I also expect this feature.

@alahaiyo
Copy link
Contributor Author

alahaiyo commented Aug 8, 2022

Thank you for your reply!

It drops retprobe calls, without being able to work around the limitation? Asked another way, what does it mean that maxactive is 0 in the pmu case?

I believe this is the case

Agree

What problem does maxactive guard against in the first place?

I found an explanation here iovisor/bcc#1072 (comment)

Maybe we could just hardcode it to 2048 in the tracefs case?

From this commit torvalds/linux@696ced4 it seems the maximum is 4096.

Oh my mistake, the maximum is 4096.

I guess there's some overhead (see https://elixir.bootlin.com/linux/v5.19/source/kernel/kprobes.c#L2202) in having a big number, else it could have been hardcoded in the kernel itself (?).

I wonder why this parameter isn't configurable via perf_event_open.

As @nashuiliang said, the feature is under discussion now.

IMO we can support the feature but we should raise an error when it's not possible to do so (kernel < 4.12 or maxactive > 0 on non-return probes) and clearly document that setting this option will force the probe to be installed via tracefs

Yes, I agree. I will add this and commit again.

@lmb
Copy link
Collaborator

lmb commented Aug 8, 2022

@alahaiyo I'm not sure yet what the best approach is here, so there is a risk that your work might be wasted. I'll read through all the info here and try to reach an opinion today or tomorrow.

@alahaiyo
Copy link
Contributor Author

alahaiyo commented Aug 8, 2022

@alahaiyo I'm not sure yet what the best approach is here, so there is a risk that your work might be wasted. I'll read through all the info here and try to reach an opinion today or tomorrow.

I'm really looking forward to your opinion.

@lmb
Copy link
Collaborator

lmb commented Aug 9, 2022

From the kernel side we have two separate APIs to create a kretprobe:

  1. via tracefs: supports setting maxactive and missed events are reported via
    /sys/kernel/debug/tracing/kprobe_profile.

  2. via perf_event_open: supports neither maxactive nor missed events. Peter Zijlstra wants to avoid adding maxactive to p_e_o if possible. A patch to expose missed events has gotten stuck, but the maintainer seems open to it.

On the user space side, maxactive is exposed in bcc and gobpf. bpftrace hardcodes maxactive to 0 and would like to print the number of missed events instead.

Let's consider the implications of the maxactive API as proposed in the PR:

  • On pre 4.12 kernels the user will receive an error from the kernel, since the feature is not supported. So users have to avoid setting maxactive on old kernels. This requires checking kernel version or similar, or feature probing from the library to avoid setting maxactive.
  • On >= 4.12 kernels, we will force use of tracefs if maxactive!=0 even if pmu is available. tracefs behaves poorly when the tracer crashes.
  • If maxactive becomes available on pmu after all, we have to do feature detection to
    fall back to tracefs for kernels that have pmu but don't have the maxactive API.
  • If maxactive never becomes available on pmu we are forever locked into tracefs for kretprobes, even if regular pmu kprobes don't require maxactive anymore.
  • It's completely unclear what users should set maxactive to, since it needs to account for all callers of a function on the whole system. So being able to set maxactive still doesn't guarantee that all retprobes fire.

All in all, maxactive seems like a bad API. I think bpftrace has it right with exposing the number of missed events instead. If we manage to resurrect the fd-based missed events patch we could provide a somewhat unified API across tracefs and pmu.

@alahaiyo in your PR description you say that you want to "miss as little as possible". Would it be acceptable to you if you instead had a way to see how many misses you had, and scaling your result accordingly?

If not, what value would you set maxactive to? How do you figure out whether the kretprobe needs maxactive in the first place? How do you plan on dealing with kernels < 4.12?

If just misses is not enough, we could add a higher level option like FunctionMaySleep bool or similar instead of Maxactive int. Behind the scenes the lib would try to figure out the best strategy, which for now would be setting maxactive=4096 and forcing tracefs. This is better than maxactive since it avoids exposing the low level implementation detail and gives us flexibility in the future. The downside is that it requires a lot more feature probing and magic in the library itself.

@lmb
Copy link
Collaborator

lmb commented Aug 10, 2022

@alahaiyo do you notice the missed events on specific machines, or is your goal to reduce misses outright? If there is a problem with a specific machine, is CONFIG_PREEMPT enabled? What is the value of /sys/devices/system/cpu/possible?

@lmb
Copy link
Collaborator

lmb commented Aug 11, 2022

From iovisor/bcc#2224 (comment):

For kernels without maxactive supports in debugfs, the error resolution is not straightforward. The kernel will still install a probe, but under a different name.

The issue is that < 4.12 kernels will treat the r50:... written to tracefs as the name of the probe. This will need to be handled if we end up adding maxactive after all.

@alahaiyo
Copy link
Contributor Author

@alahaiyo do you notice the missed events on specific machines, or is your goal to reduce misses outright? If there is a problem with a specific machine, is CONFIG_PREEMPT enabled? What is the value of /sys/devices/system/cpu/possible?

It's not on specific machines. CONFIG_PREEMPT is not set and the value of /sys/devices/system/cpu/possible is 0-95.

@alahaiyo
Copy link
Contributor Author

From iovisor/bcc#2224 (comment):

For kernels without maxactive supports in debugfs, the error resolution is not straightforward. The kernel will still install a probe, but under a different name.

The issue is that < 4.12 kernels will treat the r50:... written to tracefs as the name of the probe. This will need to be handled if we end up adding maxactive after all.

Yes, that is a good reference.

@alahaiyo
Copy link
Contributor Author

On pre 4.12 kernels the user will receive an error from the kernel, since the feature is not supported. So users have to avoid setting maxactive on old kernels. This requires checking kernel version or similar, or feature probing from the library to avoid setting maxactive.

We can handle with that.

On >= 4.12 kernels, we will force use of tracefs if maxactive!=0 even if pmu is available. tracefs behaves poorly when the tracer crashes.

In the interface of link.Kprobe, users can ignore maxactive, so the default is 0 and pmu will be used. If the user does not want to missed events, then they can set maxactive by themselves, especially when the misses is unbearable.

If maxactive becomes available on pmu after all, we have to do feature detection to
fall back to tracefs for kernels that have pmu but don't have the maxactive API.
If maxactive never becomes available on pmu we are forever locked into tracefs for kretprobes, even if regular pmu kprobes don't require maxactive anymore.
It's completely unclear what users should set maxactive to, since it needs to account for all callers of a function on the whole system. So being able to set maxactive still doesn't guarantee that all retprobes fire.

Even setting maxactive to 4096 does not guarantee that events will not be missed, and if so we have no other way. But I think the user needs to be given the ability to adjust the possibility of event missed.

If not, what value would you set maxactive to? How do you figure out whether the kretprobe needs maxactive in the first place? How do you plan on dealing with kernels < 4.12?

In fact, I don't know how to set maxactive. In our case it was only found that there were kretprobe events missing and this was not acceptable. The solution to this problem is to increase maxactive to a large enough value.

we could add a higher level option like FunctionMaySleep bool or similar instead of Maxactive int

But I still think exposing maxactive is better than making the user think about whether the function might sleep. Maxactive is considered as an attribute of kretprobe, which can make it easier for users to find out whether the function of trace will have the problem of loss of kretprobe events.

Copy link
Collaborator

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay, wrapping my head around this issue has been tricky. I was hoping to find a better solution that exposing maxactive, but I'm drawing a blank.

I've left some comments for what I'd like you to change, mostly documentation and making sure we're refusing maxactive in as many cases as possible. Please also add tests that cover the maxactive API.

link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
@alahaiyo alahaiyo force-pushed the kretprobe_add_maxactive branch 2 times, most recently from a2a73bf to 118ae4a Compare November 2, 2022 10:01
@alahaiyo alahaiyo requested a review from lmb November 2, 2022 10:10
@alahaiyo
Copy link
Contributor Author

alahaiyo commented Nov 2, 2022

I'm glad that this PR can continue to push forward. I have made changes to the code based on the comments. @lmb

link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
link/kprobe.go Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
@lmb
Copy link
Collaborator

lmb commented Dec 9, 2022

@alahaiyo I think the issue I pointed out in #755 (comment) is not addressed yet?

@alahaiyo alahaiyo force-pushed the kretprobe_add_maxactive branch 2 times, most recently from 2642713 to a253f5f Compare December 12, 2022 12:39
This commit exposes kretprobe's maxactive parameter to the user.
You can modify the limit on the number of concurrent events with
maxactive. This feature is currently supported from kernel version
4.12. Using maxactive on kernels lower than version 4.12 will result
in the creation of events with names that do not match expectations.

Signed-off-by: daikunhai <daikunhai@didiglobal.com>
@alahaiyo
Copy link
Contributor Author

I found that it already has a query for the event ID after the event is created. So I just need to determine if I need to recreate it. My new commit is pushed. @lmb

Copy link
Collaborator

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how you integrated the check for unsupported maxactive! In the interests of getting this merged soon I'll push a couple of commits that contain changes I would've asked you to make. There is only one functional change: I'm dropping the fallback to disable maxactive.

Please take a look at my commits and let me know if you are OK with the modifications. Please also update the tests as mentioned.

link/kprobe.go Outdated
pe := fmt.Sprintf("-:%s/%s", group, sanitizeSymbol(symbol))
var pe string
if lowKernel {
pe = fmt.Sprintf("-:kprobes/r_%s_0", symbol)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain how the transform from r4096:ebpf_XYZ/sym into kprobes/r_sym_0 works? I'd expect that the 4096 is interpreted as a part of the event name and so would show up in the mangled event name.

Why is this not sanitizing the symbol?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old kernel, if the second character is not a:(https://elixir.bootlin.com/linux/v4.10/source/kernel/trace/trace_kprobe.c#L638), the event will not be created temporarily. Then here(https://elixir.bootlin.com/linux/v4.10/source/kernel/trace/trace_kprobe.c#L713), the kernel will create an event named r_symbol_offset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not sanitizing the symbol?

So I think we can't sanitizing the symbol when kernel would use full symbol name to create event.
There is an example:
root@hecs-352404:~# echo "r1024:ebpf_abcdefg/_ib_umem_release __ib_umem_release" >> /sys/kernel/debug/tracing/kprobe_events
root@hecs-352404:~# cat /sys/kernel/debug/tracing/kprobe_events
r:kprobes/r___ib_umem_release_0 __ib_umem_release

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment to reflect this.

link/kprobe_test.go Outdated Show resolved Hide resolved
link/kprobe.go Outdated Show resolved Hide resolved
@alahaiyo alahaiyo force-pushed the kretprobe_add_maxactive branch 3 times, most recently from 6b48b81 to 9d835bb Compare December 15, 2022 11:36
@alahaiyo
Copy link
Contributor Author

I like how you integrated the check for unsupported maxactive! In the interests of getting this merged soon I'll push a couple of commits that contain changes I would've asked you to make. There is only one functional change: I'm dropping the fallback to disable maxactive.

Please take a look at my commits and let me know if you are OK with the modifications. Please also update the tests as mentioned.

I total OK with that. And I push my commits which deal with tests and a little bugfix.

lmb and others added 7 commits December 15, 2022 13:16
Move the code that deals with various kernel quirks such as not
getting ErrExist and maxactive not being supported into the
function used to create a probe event.
We modified logic in createTraceFSProbeEvent, so that in any kernel
can not create a event when it already exists.
When the kernel creates strange event names because it does not
support maxactive, you need to use the full symbol name to remove it.
// args.retprobeMaxActive is used on non kprobe types. Returns ErrNotSupported if
// the kernel is too old to support kretprobe maxactive.
func createTraceFSProbeEvent(typ probeType, args probeArgs) (uint64, error) {
// Before attempting to create a trace event through tracefs,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmat11 does my refactor here make sense? I think it's nicer if createTraceFSProbeEvent encapsulates all the ugly bits of the tracefs interface.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@lmb
Copy link
Collaborator

lmb commented Dec 15, 2022

deal with tests and a little bugfix.

A good catch. I've opted to return os.ErrExist from create... instead, that way we don't have to change the tests. I also included the symbol offset in the even name when removing, since that may be != 0.

Copy link
Collaborator

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alahaiyo thanks for pushing this through, sorry it took so long!

@lmb lmb merged commit 1e326d1 into cilium:master Dec 15, 2022
@alahaiyo alahaiyo deleted the kretprobe_add_maxactive branch December 16, 2022 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants