perf2bolt maps linux perf samples incorrectly for small binaries #109384
Comments
@llvm/issue-subscribers-bolt Author: Kristof Beyls (kbeyls)
I think it's possible to match multiple segments, but that's my speculation based on looking at these maps a long time ago. We can check that the base address (for the binary) matches regardless of what segment is being used for the calculation. Regarding the file name, we do use it when the buildid is not set. Maybe I misunderstood the comment.
When experimenting with `perf2bolt` on small programs, it looks like at least sometimes it does not map samples to the correct instructions.

The root cause seems to be a failing heuristic in `perf2bolt`; more precisely the heuristic used to map instruction addresses in perf samples (i.e. addresses as they appear in running processes) to addresses of instructions in the ELF binary.

Detailed description of experiment and observations
I've observed this on at least almabench from the llvm test-suite.
I observed this on an AWS Graviton4 machine running Ubuntu 22.04.1; building the test-suite with the following optimization flags
`-O3 -flto=thin -g -fuse-ld=lld -Wl,--emit-relocs`.

Perf samples were collected as follows:

`perf2bolt` was invoked as follows:

Comparing with what `perf script` shows

When inspecting the list of events seen manually, it does seem like most samples
are in `sin`, `cos`, `atan2` and similar functions in the `libm` library. It is
expected that `perf2bolt` wouldn't be able to map these samples back to the
`almabench` binary, as they are in a dynamically loaded library. However, there
are at least some samples on instructions that are located in the `almabench`
main code section, for example the 3rd line from the bottom above. It's a sample
seen on address `aaaaaaab2788`, and `perf script` says this is `0x934` bytes
after the start of function `planetpv` in the binary. This sample and other
samples like it should be mappable by `perf2bolt`, but somehow they're not.

After digging in using print-style debugging, it seems to me that the root
cause is the code in `BinaryContext::getBaseAddressForMapping`.

The gist of that code is as follows:
In other words, the heuristic used to try and map the start address of a segment
in a running process, `MMapAddress`, and the associated `FileOffset` for it,
seems to be checking whether a segment in the ElfFile has the following
properties:
In the AArch64 running example, `SegInfo.Alignment` is 0x10000. `FileOffset` is
1000. It seems this heuristic ends up working for all 4 segments in the
`almabench` binary, rather than just 1. Presumably because this is a small
rather than a very large binary, where multiple segments fit within the same
"alignment granule" of `0x10000`?

The code selects the first segment, which is not the text segment (the second
segment is the text segment).
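
The ambiguity can be illustrated with a small, self-contained sketch. It assumes, as a simplification, that the check compares file offsets after rounding both down to the segment alignment. The `Segment` struct, the `alignDown` helper, `matches` and `countMatchingSegments` are hypothetical names for illustration; this is not the code from the BOLT sources.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of a loadable ELF segment.
struct Segment {
  uint64_t FileOffset; // offset of the segment within the ELF file
  uint64_t Alignment;  // segment alignment, assumed to be a power of two
};

// Round Value down to a multiple of Alignment (power of two required).
inline uint64_t alignDown(uint64_t Value, uint64_t Alignment) {
  assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0);
  return Value & ~(Alignment - 1);
}

// Assumed form of the check: a segment "matches" when its file offset and
// the file offset of the mmap event fall into the same alignment granule.
inline bool matches(const Segment &Seg, uint64_t MMapFileOffset) {
  return alignDown(Seg.FileOffset, Seg.Alignment) ==
         alignDown(MMapFileOffset, Seg.Alignment);
}

// Count how many segments pass the check for a given mmap file offset.
inline int countMatchingSegments(const std::vector<Segment> &Segs,
                                 uint64_t MMapFileOffset) {
  int Count = 0;
  for (const Segment &Seg : Segs)
    if (matches(Seg, MMapFileOffset))
      ++Count;
  return Count;
}
```

With four made-up segments at file offsets `0x0`, `0x1000`, `0x2000` and `0x3000`, each aligned to `0x10000`, `countMatchingSegments(Segs, 0x1000)` returns 4: every offset rounds down to 0, so a check of this form cannot tell the segments apart. These offsets are invented for the example and are not the actual `almabench` layout.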
This heuristic should be improved. Even better, could this be made deterministic
rather than based on a heuristic? How does linux perf compute this (presumably
in a non-heuristic way)?
It seems perf computes this based on the name (and/or buildid?) of the binary,
see roughly:
https://github.com/torvalds/linux/blob/839c4f596f898edc424070dc8b517381572f8502/tools/perf/util/map.c#L166
Would it be OK to use filename and/or buildid too in perf2bolt? Presumably this
should work unless the binary file moves around between profiling and bolt
optimization?
documents that perf2bolt is expected to work when provided "a copy of the
binary that was running". So matching based on filename of the binary isn't
going to work well in general.
indicates that perf can print out the buildid on a PERF_RECORD_MMAP2 event,
but only when available. When not available, it seems it prints out the inode
of the binary file.
In my setup, the PERF_RECORD_MMAP2 records don't contain build ids. Therefore it
seems that just relying on build ids might not solve this issue "out of the box"
for most users?
Another idea to improve the heuristic is that segments in ELF files contain
flags to indicate whether they are (amongst others) executable. In the running
example, I see `objdump -a ./almabench` contains:

So, if we'd require the segment that matches to be an executable segment, in at
least this example, the correct segment would be picked.
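
A minimal sketch of that idea, assuming the offset check rounds both file offsets down to the segment alignment before comparing them. The `SegmentWithFlags` struct, its `IsExecutable` field, `alignDownPow2` and `findExecutableMatch` are hypothetical names for illustration and do not describe the actual PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical segment description that also records executability,
// e.g. derived from the PF_X bit of the program header's p_flags.
struct SegmentWithFlags {
  uint64_t FileOffset; // offset of the segment within the ELF file
  uint64_t Alignment;  // segment alignment, assumed to be a power of two
  bool IsExecutable;   // true if the segment is mapped executable
};

// Round Value down to a multiple of Alignment (power of two required).
inline uint64_t alignDownPow2(uint64_t Value, uint64_t Alignment) {
  assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0);
  return Value & ~(Alignment - 1);
}

// Return the first executable segment whose aligned file offset agrees with
// the aligned mmap file offset, or nullptr if there is none.
inline const SegmentWithFlags *
findExecutableMatch(const std::vector<SegmentWithFlags> &Segs,
                    uint64_t MMapFileOffset) {
  for (const SegmentWithFlags &Seg : Segs) {
    if (!Seg.IsExecutable)
      continue; // skip data and read-only segments entirely
    if (alignDownPow2(Seg.FileOffset, Seg.Alignment) ==
        alignDownPow2(MMapFileOffset, Seg.Alignment))
      return &Seg;
  }
  return nullptr;
}
```

With a made-up layout in which only the second segment is executable, the function returns that second segment even though the first segment also passes the offset comparison. A binary with several executable segments in the same alignment granule would still be ambiguous under this sketch.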
I'll be creating a PR that implements this shortly.
A further improvement that could be made is for `perf2bolt` to produce a warning when its heuristic works for multiple segments in the ELF file, rather than finding just one unique segment to match with the segment seen in process memory during profiling.
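
Such a warning could be sketched roughly as follows, again assuming the offset check compares file offsets rounded down to the segment alignment. `LoadSegment`, `roundDown`, `collectMatches` and `warnIfAmbiguous` are hypothetical names, and the message text is made up; none of this is existing `perf2bolt` behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical, simplified view of a loadable ELF segment.
struct LoadSegment {
  uint64_t FileOffset; // offset of the segment within the ELF file
  uint64_t Alignment;  // segment alignment, assumed to be a power of two
};

// Round Value down to a multiple of Alignment (power of two required).
inline uint64_t roundDown(uint64_t Value, uint64_t Alignment) {
  assert(Alignment != 0 && (Alignment & (Alignment - 1)) == 0);
  return Value & ~(Alignment - 1);
}

// Collect the indices of all segments that pass the assumed offset check,
// instead of stopping at the first one.
inline std::vector<std::size_t>
collectMatches(const std::vector<LoadSegment> &Segs, uint64_t MMapFileOffset) {
  std::vector<std::size_t> Matches;
  for (std::size_t I = 0; I < Segs.size(); ++I)
    if (roundDown(Segs[I].FileOffset, Segs[I].Alignment) ==
        roundDown(MMapFileOffset, Segs[I].Alignment))
      Matches.push_back(I);
  return Matches;
}

// Print a warning to Err when more than one segment matched.
// Returns true if the match was ambiguous.
inline bool warnIfAmbiguous(const std::vector<std::size_t> &Matches,
                            std::ostream &Err) {
  if (Matches.size() <= 1)
    return false;
  Err << "warning: " << Matches.size()
      << " segments match the mmap file offset; using the first one\n";
  return true;
}
```

A caller could run `warnIfAmbiguous(collectMatches(Segs, FileOffset), std::cerr)` before picking a segment, so that an ambiguous mapping is at least visible to the user.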