Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf2bolt maps linux perf samples incorrectly for small binaries #109384

Open
kbeyls opened this issue Sep 20, 2024 · 2 comments · May be fixed by #109397
Open

perf2bolt maps linux perf samples incorrectly for small binaries #109384

kbeyls opened this issue Sep 20, 2024 · 2 comments · May be fixed by #109397
Assignees
Labels

Comments

@kbeyls
Copy link
Collaborator

kbeyls commented Sep 20, 2024

When experimenting with perf2bolt on small programs, it looks like at least sometimes it does not map samples to the correct instructions.

The root cause seems to be a failing heuristic in perf2bolt; more precisely the heuristic used to map instruction addresses in perf samples (i.e. addresses as they appear in running processes) to addresses of instructions in the ELF binary.

Detailed description of experiment and observations

I've observed this on at least almabench from the llvm test-suite.

I observed this on an AWS Graviton4 machine running Ubuntu 22.04.1; building the test-suite with the following optimization flags -O3 -flto=thin -g -fuse-ld=lld -Wl,--emit-relocs.

Perf samples were collected as follows:

$ cd SingleSource/Benchmarks/CoyoteBench
$ perf record -e cycles:u  ./almabench

perf2bolt was invoked as follows:

$ ~/dev/llvm/build_rel/bin/perf2bolt -p perf.data -o almabench.perf.perf2bolt.fdata -nl ./almabench
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for perf.data
PERF2BOLT: spawning perf job to read events without LBR
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version: 86e5c5468ae3fcd65b23fd7b3cb0182e676829bd
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 1 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parsing basic events (without LBR)...
PERF2BOLT: waiting for perf mem events collection to finish...
PERF2BOLT: processing basic events (without LBR)...
PERF2BOLT: read 13429 samples
PERF2BOLT: out of range samples recorded in unknown regions: 13429 (100.0%)

 !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may b
e ineffective for improving performance.

PERF2BOLT: wrote 0 objects and 0 memory objects to almabench.perf.perf2bolt.fdata
BOLT-INFO: 0 out of 11 functions in the binary (0.0%) have non-empty execution profile

Comparing with what perf script shows

$ perf script --show-mmap-events | wc -l
13434
$ perf script --show-mmap-events
       almabench 1356276 446583.793818: PERF_RECORD_MMAP2 1356276/1356276: [0xaaaaaaab1000(0x2000) @ 0x1000 103:01 4733784 3684065051]: r-xp /home/kribey01/d
ev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench
       almabench 1356276 446583.793832: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7fc2000(0x3e000) @ 0 103:01 21002 578735524]: r-xp /usr/lib/aarch64-linux-
gnu/ld-linux-aarch64.so.1
       almabench 1356276 446583.793844: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7ffb000(0x1000) @ 0 00:00 0 0]: r-xp [vdso]
       almabench 1356276 446583.793873:          1 cycles:u:  ffffdb66ec538f08 [unknown] ([unknown])
       almabench 1356276 446583.793895:          1 cycles:u:      fffff7fd9c40 _start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793912:          1 cycles:u:      fffff7fda610 _dl_start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793929:          3 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.793947:          9 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.793966:         25 cycles:u:      fffff7fda648 _dl_start+0x38 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793985:         66 cycles:u:      fffff7fdaa5c _dl_start+0x44c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794002:        167 cycles:u:      fffff7fda6d0 _dl_start+0xc0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794020:        414 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.794040:       1145 cycles:u:      fffff7fda8f8 _dl_start+0x2e8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794056:       2965 cycles:u:      fffff7fd5a40 __GI___tunables_init+0xe0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794078:       7286 cycles:u:      fffff7fdb978 dl_main+0x9f8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794121: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7f10000(0x97000) @ 0 103:01 21020 4096449059]: r-xp /usr/lib/aarch64-linux-gnu/libm.so.6
       almabench 1356276 446583.794150:      19937 cycles:u:      fffff7fcd6cc _dl_name_match_p+0x2c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794167: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7d60000(0x1a9000) @ 0 103:01 21006 1342113607]: r-xp /usr/lib/aarch64-linux-gnu/libc.so.6
       almabench 1356276 446583.794228:      45966 cycles:u:      fffff7fd0e80 _dl_relocate_object+0x660 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794305:      60218 cycles:u:      fffff7f426a8 __atan2_finite@GLIBC_2.17+0x3d8 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794348:      76663 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794399:      98506 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794471:     157477 cycles:u:      fffff7f487d8 __cos+0x8 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794571:     234290 cycles:u:      fffff7f4249c __atan2_finite@GLIBC_2.17+0x1cc (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794696:     306140 cycles:u:      fffff7f1f5b0 __fmod+0x30 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794842:     364129 cycles:u:      fffff7f48ba0 __cos+0x3d0 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795004:     409635 cycles:u:      fffff7f48808 __cos+0x38 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795179:     446157 cycles:u:      fffff7f1f324 __atan2+0x24 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795365:     476437 cycles:u:      fffff7f48280 __sin+0x1e0 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795588:     501967 cycles:u:      fffff7f48a54 __cos+0x284 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795777:     483010 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795968:     490104 cycles:u:      aaaaaaab2788 planetpv+0x934 (/home/kribey01/dev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench)
       almabench 1356276 446583.796166:     510048 cycles:u:      fffff7f1f33c __atan2+0x3c (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.796371:     529819 cycles:u:      fffff7f4275c __atan2_finite@GLIBC_2.17+0x48c (/usr/lib/aarch64-linux-gnu/libm.so.6)
...

When inspecting the list of events seen manually, it does seem like most samples
are in sin, cos, atan2 and similar functions in the libm library. It is
expected that perf2bolt wouldn't be able to map these samples back to the
almabench binary, as they are in a dynamically loaded library. However, there
are at least some samples on instructions that are located in the almabench
main code section, for example the 3rd line from the bottom above. It's a sample
seen on address aaaaaaab2788, and perf script says this is 0x934 bytes
after the start of function planetpv in the binary. This sample and other
samples like it should be mappable by perf2bolt, but somehow they're not.

After digging in using print-style debugging, it seems to me that the root
cause is the code in BinaryContext::getBaseAddressForMapping.
The gist of that code is as follows:

BinaryContext::getBaseAddressForMapping(uint64_t MMapAddress,
                                        uint64_t FileOffset) const {
  // Find a segment with a matching file offset.
  for (auto &KV : SegmentMapInfo) {
    const SegmentInfo &SegInfo = KV.second;
    if (alignDown(SegInfo.FileOffset, SegInfo.Alignment) ==
        alignDown(FileOffset, SegInfo.Alignment)) {
      return MMapAddress - (SegInfo.Address - SegInfo.FileOffset + FileOffset);
    }
  }
  return std::nullopt;
}

In other words the heuristic used to try and map the start address of a segment
in a running process MMapAddress, and the associated FileOffset for it,
seems to be checking whether a segment in the ElfFile has the following
properties:

alignDown(SegInfo.FileOffset, SegInfo.Alignment) ==
alignDown(FileOffset, SegInfo.Alignment)

In the AArch64 running example, SegInfo.Alignment is 0x10000. FileOffset is
1000. It seems this heuristic ends up working for all 4 segments in the
almabench binary, rather than just 1. Presumably because this is a small
rather than a very large binary, where multiple segments fit within the same "alignment granule" of 0x10000?

gBAFM(MMapAddres aaaaaaab1000, FileOffset 1000):
.. MapStartAddress 0
.. alignDown(SegInfo.FileOffset 0, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 11d40
.. alignDown(SegInfo.FileOffset 1d40, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 22f20
.. alignDown(SegInfo.FileOffset 2f20, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 33110
.. alignDown(SegInfo.FileOffset 3110, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. Found 4 matching segments:
   0. SegmentInfo { Address: 0x0, Size: 0x1d1c, FileOffset: 0x0, FileSize: 0x1d1c, Alignment: 0x10000}
   1. SegmentInfo { Address: 0x11d40, Size: 0x11e0, FileOffset: 0x1d40, FileSize: 0x11e0, Alignment: 0x10000}
   2. SegmentInfo { Address: 0x22f20, Size: 0x10e0, FileOffset: 0x2f20, FileSize: 0x1f0, Alignment: 0x10000}
   3. SegmentInfo { Address: 0x33110, Size: 0x89, FileOffset: 0x3110, FileSize: 0x88, Alignment: 0x10000}
.. gBAFM(MMapAddress aaaaaaab1000, FileOffset 1000):
.... result = MMapAddress(aaaaaaab1000) - (SegInfo.Address(0) - SegInfo.FileOffset(0) + FileOffset(1000)) -> aaaaaaab0000

The code selects the first segment, which is not the text segment (the second
segment is the text segment).

This heuristic should be improved. Even better, could this be made deterministic
rather than based on a heuristic? How does linux perf compute this (presumably
in a non-heuristic way)?

It seems perf computes this based on the name (and/or buildid?) of the binary,
see roughly:
https://github.com/torvalds/linux/blob/839c4f596f898edc424070dc8b517381572f8502/tools/perf/util/map.c#L166

Would it be OK to use filename and/or buildid too in perf2bolt? Presumably this
should work unless the binary file moves around between profiling and bolt
optimization?

In my setup, the PERF_RECORD_MMAP2 records don't contain build ids. Therefore it
seems that just relying on build ids might not solve this issue "out of the box"
for most users?

Another idea to improve the heuristic is that segments in ELF files contain
flags to indicate whether they are (amongst others) executable. In the running
example, I see objdump -a ./almabench contains:

...
LOAD off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**16
     filesz 0x0000000000001d18 memsz 0x0000000000001d18 flags r--
LOAD off    0x0000000000001d40 vaddr 0x0000000000011d40 paddr 0x0000000000011d40 align 2**16
     filesz 0x00000000000011e0 memsz 0x00000000000011e0 flags r-x
LOAD off    0x0000000000002f20 vaddr 0x0000000000022f20 paddr 0x0000000000022f20 align 2**16
     filesz 0x00000000000001f0 memsz 0x00000000000010e0 flags rw-
LOAD off    0x0000000000003110 vaddr 0x0000000000033110 paddr 0x0000000000033110 align 2**16
     filesz 0x0000000000000088 memsz 0x0000000000000089 flags rw-
...

So, if we'd require the segment that matches to be an executable segment, in at
least this example, the correct segment would be picked.

I'll be creating a PR that implements this shortly.

A further improvement that could be made is for perf2bolt to produce a warning when its heuristic works for multiple segments in the ELF file, rather than finding just one unique segment to match with the segment seen in process memory during profiling.

@kbeyls kbeyls added the BOLT label Sep 20, 2024
@kbeyls kbeyls self-assigned this Sep 20, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented Sep 20, 2024

@llvm/issue-subscribers-bolt

Author: Kristof Beyls (kbeyls)

When experimenting with `perf2bolt` on small programs, it looks like at least sometimes it does not map samples to the correct instructions.

The root cause seems to be a failing heuristic in perf2bolt; more precisely the heuristic used to map instruction addresses in perf samples (i.e. addresses as they appear in running processes) to addresses of instructions in the ELF binary.

Detailed description of experiment and observations

I've observed this on at least almabench from the llvm test-suite.

I observed this on an AWS Graviton4 machine running Ubuntu 22.04.1; building the test-suite with the following optimization flags -O3 -flto=thin -g -fuse-ld=lld -Wl,--emit-relocs.

Perf samples were collected as follows:

$ cd SingleSource/Benchmarks/CoyoteBench
$ perf record -e cycles:u  ./almabench

perf2bolt was invoked as follows:

$ ~/dev/llvm/build_rel/bin/perf2bolt -p perf.data -o almabench.perf.perf2bolt.fdata -nl ./almabench
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for perf.data
PERF2BOLT: spawning perf job to read events without LBR
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: aarch64
BOLT-INFO: BOLT version: 86e5c5468ae3fcd65b23fd7b3cb0182e676829bd
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-WARNING: build-id will not be checked because we could not read one from input binary
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 1 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parsing basic events (without LBR)...
PERF2BOLT: waiting for perf mem events collection to finish...
PERF2BOLT: processing basic events (without LBR)...
PERF2BOLT: read 13429 samples
PERF2BOLT: out of range samples recorded in unknown regions: 13429 (100.0%)

 !! WARNING !! This high mismatch ratio indicates the input binary is probably not the same binary used during profiling collection. The generated data may b
e ineffective for improving performance.

PERF2BOLT: wrote 0 objects and 0 memory objects to almabench.perf.perf2bolt.fdata
BOLT-INFO: 0 out of 11 functions in the binary (0.0%) have non-empty execution profile

Comparing with what perf script shows

$ perf script --show-mmap-events | wc -l
13434
$ perf script --show-mmap-events
       almabench 1356276 446583.793818: PERF_RECORD_MMAP2 1356276/1356276: [0xaaaaaaab1000(0x2000) @ 0x1000 103:01 4733784 3684065051]: r-xp /home/kribey01/d
ev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench
       almabench 1356276 446583.793832: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7fc2000(0x3e000) @ 0 103:01 21002 578735524]: r-xp /usr/lib/aarch64-linux-
gnu/ld-linux-aarch64.so.1
       almabench 1356276 446583.793844: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7ffb000(0x1000) @ 0 00:00 0 0]: r-xp [vdso]
       almabench 1356276 446583.793873:          1 cycles:u:  ffffdb66ec538f08 [unknown] ([unknown])
       almabench 1356276 446583.793895:          1 cycles:u:      fffff7fd9c40 _start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793912:          1 cycles:u:      fffff7fda610 _dl_start+0x0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793929:          3 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.793947:          9 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.793966:         25 cycles:u:      fffff7fda648 _dl_start+0x38 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.793985:         66 cycles:u:      fffff7fdaa5c _dl_start+0x44c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794002:        167 cycles:u:      fffff7fda6d0 _dl_start+0xc0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794020:        414 cycles:u:  ffffdb66ec5389d8 [unknown] ([unknown])
       almabench 1356276 446583.794040:       1145 cycles:u:      fffff7fda8f8 _dl_start+0x2e8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794056:       2965 cycles:u:      fffff7fd5a40 __GI___tunables_init+0xe0 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794078:       7286 cycles:u:      fffff7fdb978 dl_main+0x9f8 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794121: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7f10000(0x97000) @ 0 103:01 21020 4096449059]: r-xp /usr/lib/aarch64-linux-gnu/libm.so.6
       almabench 1356276 446583.794150:      19937 cycles:u:      fffff7fcd6cc _dl_name_match_p+0x2c (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794167: PERF_RECORD_MMAP2 1356276/1356276: [0xfffff7d60000(0x1a9000) @ 0 103:01 21006 1342113607]: r-xp /usr/lib/aarch64-linux-gnu/libc.so.6
       almabench 1356276 446583.794228:      45966 cycles:u:      fffff7fd0e80 _dl_relocate_object+0x660 (/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1)
       almabench 1356276 446583.794305:      60218 cycles:u:      fffff7f426a8 __atan2_finite@<!-- -->GLIBC_2.17+0x3d8 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794348:      76663 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794399:      98506 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794471:     157477 cycles:u:      fffff7f487d8 __cos+0x8 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794571:     234290 cycles:u:      fffff7f4249c __atan2_finite@<!-- -->GLIBC_2.17+0x1cc (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794696:     306140 cycles:u:      fffff7f1f5b0 __fmod+0x30 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.794842:     364129 cycles:u:      fffff7f48ba0 __cos+0x3d0 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795004:     409635 cycles:u:      fffff7f48808 __cos+0x38 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795179:     446157 cycles:u:      fffff7f1f324 __atan2+0x24 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795365:     476437 cycles:u:      fffff7f48280 __sin+0x1e0 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795588:     501967 cycles:u:      fffff7f48a54 __cos+0x284 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795777:     483010 cycles:u:      fffff7f1f340 __atan2+0x40 (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.795968:     490104 cycles:u:      aaaaaaab2788 planetpv+0x934 (/home/kribey01/dev/llvm/build_test_suite-LTOthin-O3-bolt/SingleSource/Benchmarks/CoyoteBench/almabench)
       almabench 1356276 446583.796166:     510048 cycles:u:      fffff7f1f33c __atan2+0x3c (/usr/lib/aarch64-linux-gnu/libm.so.6)
       almabench 1356276 446583.796371:     529819 cycles:u:      fffff7f4275c __atan2_finite@<!-- -->GLIBC_2.17+0x48c (/usr/lib/aarch64-linux-gnu/libm.so.6)
...

When inspecting the list of events seen manually, it does seem like most samples
are in sin, cos, atan2 and similar functions in the libm library. It is
expected that perf2bolt wouldn't be able to map these samples back to the
almabench binary, as they are in a dynamically loaded library. However, there
are at least some samples on instructions that are located in the almabench
main code section, for example the 3rd line from the bottom above. It's a sample
seen on address aaaaaaab2788, and perf script says this is 0x934 bytes
after the start of function planetpv in the binary. This sample and other
samples like it should be mappable by perf2bolt, but somehow they're not.

After digging in using print-style debugging, it seems to me that the root
cause is the code in BinaryContext::getBaseAddressForMapping.
The gist of that code is as follows:

BinaryContext::getBaseAddressForMapping(uint64_t MMapAddress,
                                        uint64_t FileOffset) const {
  // Find a segment with a matching file offset.
  for (auto &amp;KV : SegmentMapInfo) {
    const SegmentInfo &amp;SegInfo = KV.second;
    if (alignDown(SegInfo.FileOffset, SegInfo.Alignment) ==
        alignDown(FileOffset, SegInfo.Alignment)) {
      return MMapAddress - (SegInfo.Address - SegInfo.FileOffset + FileOffset);
    }
  }
  return std::nullopt;
}

In other words the heuristic used to try and map the start address of a segment
in a running process MMapAddress, and the associated FileOffset for it,
seems to be checking whether a segment in the ElfFile has the following
properties:

alignDown(SegInfo.FileOffset, SegInfo.Alignment) ==
alignDown(FileOffset, SegInfo.Alignment)

In the AArch64 running example, SegInfo.Alignment is 0x10000. FileOffset is
1000. It seems this heuristic ends up working for all 4 segments in the
almabench binary, rather than just 1. Presumably because this is a small
rather than a very large binary, where multiple segments fit within the same "alignment granule" of 0x10000?

gBAFM(MMapAddres aaaaaaab1000, FileOffset 1000):
.. MapStartAddress 0
.. alignDown(SegInfo.FileOffset 0, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 11d40
.. alignDown(SegInfo.FileOffset 1d40, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 22f20
.. alignDown(SegInfo.FileOffset 2f20, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. MapStartAddress 33110
.. alignDown(SegInfo.FileOffset 3110, SegInfo.Alignment 10000) 0 == alignDown(FileOffset 1000, SegInfo.Alignment 10000) 0
.. Found 4 matching segments:
   0. SegmentInfo { Address: 0x0, Size: 0x1d1c, FileOffset: 0x0, FileSize: 0x1d1c, Alignment: 0x10000}
   1. SegmentInfo { Address: 0x11d40, Size: 0x11e0, FileOffset: 0x1d40, FileSize: 0x11e0, Alignment: 0x10000}
   2. SegmentInfo { Address: 0x22f20, Size: 0x10e0, FileOffset: 0x2f20, FileSize: 0x1f0, Alignment: 0x10000}
   3. SegmentInfo { Address: 0x33110, Size: 0x89, FileOffset: 0x3110, FileSize: 0x88, Alignment: 0x10000}
.. gBAFM(MMapAddress aaaaaaab1000, FileOffset 1000):
.... result = MMapAddress(aaaaaaab1000) - (SegInfo.Address(0) - SegInfo.FileOffset(0) + FileOffset(1000)) -&gt; aaaaaaab0000

The code selects the first segment, which is not the text segment (the second
segment is the text segment).

This heuristic should be improved. Even better, could this be made deterministic
rather than based on a heuristic? How does linux perf compute this (presumably
in a non-heuristic way)?

It seems perf computes this based on the name (and/or buildid?) of the binary,
see roughly:
https://github.com/torvalds/linux/blob/839c4f596f898edc424070dc8b517381572f8502/tools/perf/util/map.c#L166

Would it be OK to use filename and/or buildid too in perf2bolt? Presumably this
should work unless the binary file moves around between profiling and bolt
optimization?

In my setup, the PERF_RECORD_MMAP2 records don't contain build ids. Therefore it
seems that just relying on build ids might not solve this issue "out of the box"
for most users?

Another idea to improve the heuristic is that segments in ELF files contain
flags to indicate whether they are (amongst others) executable. In the running
example, I see objdump -a ./almabench contains:

...
LOAD off    0x0000000000000000 vaddr 0x0000000000000000 paddr 0x0000000000000000 align 2**16
     filesz 0x0000000000001d18 memsz 0x0000000000001d18 flags r--
LOAD off    0x0000000000001d40 vaddr 0x0000000000011d40 paddr 0x0000000000011d40 align 2**16
     filesz 0x00000000000011e0 memsz 0x00000000000011e0 flags r-x
LOAD off    0x0000000000002f20 vaddr 0x0000000000022f20 paddr 0x0000000000022f20 align 2**16
     filesz 0x00000000000001f0 memsz 0x00000000000010e0 flags rw-
LOAD off    0x0000000000003110 vaddr 0x0000000000033110 paddr 0x0000000000033110 align 2**16
     filesz 0x0000000000000088 memsz 0x0000000000000089 flags rw-
...

So, if we'd require the segment that matches to be an executable segment, in at
least this example, the correct segment would be picked.

I'll be creating a PR that implements this shortly.

A further improvement that could be made is for perf2bolt to produce a warning when its heuristic works for multiple segments in the ELF file, rather than finding just one unique segment to match with the segment seen in process memory during profiling.

@maksfb
Copy link
Contributor

maksfb commented Sep 21, 2024

A further improvement that could be made is for perf2bolt to produce a warning when its heuristic works for multiple segments in the ELF file, rather than finding just one unique segment to match with the segment seen in process memory during profiling.

I think it's possible to match multiple segments, but that's my speculation based on looking at these maps long time ago. We can check that the base address (for the binary) matches regardless of what segment is being used for the calculation.

Regarding the file name, we do use it when the buildid is not set. Maybe I misunderstood the comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants