Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance bug in buildLocationList #1

Open
wants to merge 10,000 commits into
base: main
Choose a base branch
from
Open

Fix performance bug in buildLocationList #1

wants to merge 10,000 commits into from

Conversation

tmsri
Copy link
Owner

@tmsri tmsri commented Sep 16, 2024

Fix performance bug in buildLocationList

In buildLocationList, with basic block sections, we iterate over
every basic block twice to detect section start and end.  This is
sub-optimal and shows up as significantly time consuming when
compiling large functions.

This patch uses the set of sections already stored in MBBSectionRanges
and iterates over sections rather than basic blocks.

When detecting if loclists can be merged, the end label of an entry is
matched with the beginning label of the next entry.  For the section
corresponding to the entry basic block, this is skipped.  This is
because the loc list uses the end label corresponding to the function
whereas the MBBSectionRanges map uses the function end label.

For example:

.Lfunc_begin0:
        .file   <blah>
        .loc    0 4 0                           # ex2.cc:4:0
        .cfi_startproc
.Ltmp0:
        #DEBUG_VALUE: test:i <- 7
        .loc    0 8 5 prologue_end              # ex2.cc:8:5
        ....
.LBB_END0_0:
        .cfi_endproc
        .section        .text._Z4testv,"ax",@progbits,unique,1
...
.Lfunc_end0:
        .size   _Z4testv, .Lfunc_end0-_Z4testv

The debug loc uses ".LBB_END0_0" for the end of the section whereas
MBBSectionRanges uses ".Lfunc_end0".

It is alright to skip this as we already check the section corresponding
to the debugloc entry.

jimingham and others added 30 commits September 13, 2024 10:18
…children provider (llvm#108414)

Our customers is reporting a serious performance issue (expanding a this
pointer takes 70 seconds in VSCode) in a specific execution context.

Profiling shows the hot path is triggered by an expression evaluation
from libStdC++ synthetic children provider for `std::vector<bool>` since
it uses `CreateValueFromExpression()`.

This PR added a new `SBValue::CreateBoolValue()` API and switch
`std::vector<bool>` synthetic children provider to use the new API
without performing expression evaluation.

Note: there might be other cases of `CreateValueFromExpression()` in our
summary/synthetic children providers which I will sweep through in later
PRs.

With this PR, the customer's scenario reduces from 70 seconds => 50
seconds. I will add other PRs to further optimize the remaining 50
seconds (mostly from type/namespace lookup).

Testing:

`test/API/functionalities/data-formatter/data-formatter-stl/libstdcpp/vbool/TestDataFormatterStdVBool.py`
passes with the PR

---------

Co-authored-by: jeffreytan81 <jeffreytan@fb.com>
… symbols are created (llvm#106791)

Summary:
This improves the performance of ObjectFileMacho::ParseSymtab by
removing eager and expensive work in favor of doing it later in a
less-expensive fashion.

Experiment:
My goal was to understand LLDB's startup time.
First, I produced a Debug build of LLDB (no dSYM) and a
Release+NoAsserts build of LLDB. The Release build debugged the Debug
build as it debugged a small C++ program. I found that
ObjectFileMachO::ParseSymtab accounted for somewhere between 1.2 and 1.3
seconds consistently. After applying this change, I consistently
measured a reduction of approximately 100ms, putting the time closer to
1.1s and 1.2s on average.

Background:
ObjectFileMachO::ParseSymtab will incrementally create symbols by
parsing nlist entries from the symtab section of a MachO binary. As it
does this, it eagerly tries to determine the size of symbols (e.g. how
long a function is) using LC_FUNCTION_STARTS data (or eh_frame if
LC_FUNCTION_STARTS is unavailable). Concretely, this is done by
performing a binary search on the function starts array and calculating
the distance to the next function or the end of the section (whichever
is smaller).

However, this work is unnecessary for 2 reasons:
1. If you have debug symbol entries (i.e. STABs), the size of a function
is usually stored right after the function's entry. Performing this work
right before parsing the next entry is unnecessary work.
2. Calculating symbol sizes for symbols of size 0 is already performed
in `Symtab::InitAddressIndexes` after all the symbols are added to the
Symtab. It also does this more efficiently by walking over a list of
symbols sorted by address, so the work to calculate the size per symbol
is constant instead of O(log n).
Like most other i128 operations, this adds scalarization for i128 vector
shifts. Which in turn allows a few other operations to legalize too.
Similar to other i128 bit operations, we scalarizer any icmps or selects larger
than 64bits.
Adds a python script to automatically take output from a failed clang
-verify test and update the test case(s) to expect the new behaviour.
* To create custom ABIs plugin libraries need access to CoroShape.
* As a step in enabling plugin libraries, move Shape into its own header
* The header will eventually be moved into include/llvm/Transforms/Coroutines

See RFC for more info:
https://discourse.llvm.org/t/rfc-abi-objects-for-coroutines/81057
…izationRewrite` (llvm#108359)

The dialect conversion maintains a set of unresolved materializations
(`UnrealizedConversionCastOp`). Turn that set into a `DenseMap` that
maps from ops to `UnresolvedMaterializationRewrite *`. This improves
efficiency a bit, because an iteration over
`ConversionPatternRewriterImpl::rewrites` can be avoided.

Also delete some dead code.
This patch implements sandboxir::GlobalObject mirroring
llvm::GlobalObject.
This patch implements a new empty pass for the Bottom-up vectorizer and
creates a pass pipeline that includes it.
The SandboxVectorizer LLVM pass runs the Sandbox IR pass pipeline.
Summary:
We can include `stdint.h` just fine as long as we don't allow it to find
system headers, passing `-nostdlibinc` and `-nogpuinc` suppresses these
extra paths so we will just use the clang resource headers for
`stdint.h` and `stddef.h`.
Should not return the original phi vector instruction, need to return
actual vectorized value as a result.
…iple `ucmp`/`scmp` operands and a constant with `phi` of individual comparisons of original intrinsic's arguments (llvm#107769)

When we have a `phi` instruction with more than one of its incoming
values being a call to `ucmp` or `scmp`, which is then compared with an
integer constant, we can move the comparison through the `phi` into the
incoming basic blocks because we know that a comparison of `ucmp`/`scmp`
with a constant will be simplified by the next iteration of InstCombine.

There's a high chance that other similar patterns can be identified, in
which case they can be easily handled by the same code by moving the
check for "simplifiable" instructions into a lambda.
Inline VPBuilder::createICmp in the header, in line with the other
VPBuilder functions.
Create `eh-assembly.s` that contains EH tests and remove EH tests from
`basic-assembly.s`, given that it's easier to manage. (We can have many
different tests, including the legacy EH and the new exnref, and with
nesting for readability)
)

This is a follow-up to llvm#92289 that adds lowering of the new
`@llvm.experimental.vector.compress` intrinsic on x86 with AVX512
instructions. This intrinsic maps directly to `vpcompress`.
This patch implements sandboxir::GlobalIFunc mirroring
llvm::GlobalIFunc.
This patch adds support for a user-defined pass-pipeline that overrides
the default pipeline of the vectorizer.
This will commonly be used by lit tests.
This allows the clang driver to know which tool is meant to be executed,
which allows the clang driver to load the right clang config files, and
allows clang to find colocated sysroots.

This makes sure that doing `clang-scan-deps -- <tool> ...` looks up
things in the same way as if one just would execute `<tool> ...`, when
`<tool>` isn't an absolute or relative path.
Emit signpost intervals for progress events so that when users report an
operation takes a long time, we can investigate the issue with
Instruments.app.
alexey-bataev and others added 26 commits September 16, 2024 09:53
The "clustered" nodes for buildvector nodes must be postponed in
accordance with the global flag, otherwise it may cause crash because of
the dependency between phi nodes.
This replaces the previous Code Owners section of our developer policy
with a new section for Maintainers. It also updates most of the places
we mention "code owner" in the documentation (it does not update the
files named `Code Owners.rst` or similar because those should be updated
when the subprojects add their `Maintainers.rst` file).

The wording was taken from what was proposed in the RFC (including all
suggested amendments from folks on the thread).

Please see the RFC for more details:

https://discourse.llvm.org/t/rfc-proposing-changes-to-the-community-code-ownership-policy/80714/
Fusion of reshapes by collapsing patterns were restricted to single
result operations, but the implementation supports multi result ops.
This PR removes the restriction, since it is not necessary.
As a side-effect, we start constant folding icmps.

Split out from llvm#105991.
Unlike scalar, where AArch64 prefers expanding scmp/ucmp with select,
under Neon we can use the arithmetic expansion to generate fewer
instructions. Notably it also prevents the scalarization of vselect
during vector-legalization.
As requested in follow-up comments on llvm#108592.
…08723)

We don't intercept this one, no reason to use RTLD_NEXT.

Co-authored-by: Sam Elliott <quic_aelliott@quicinc.com>
…cs section of RISCVUsage.rst. NFC (llvm#108718)

These are no longer experimental after
051054e. I left the section because we
will be adding intrinsics for Zvkgs and Zvbc32e.
…llvm#107905)

That missing space was causing the whole sentence to be rendered
incorrectly in the resulting HTML.
…st (llvm#108524)

This extends the fix in llvm#106242
for other derived class types.
…#108852)

In the process of adding scalarization support for DirectX target
intrinsics I found that intrinsics that weren't marked with `IntrNoMem`
did not get removed by
`RecursivelyDeleteTriviallyDeadInstructionsPermissive`. So this change
is to make it more clear that our intrinsics don't have side effects.

I only added `IntrNoMem` to the intrinics in `IntrinsicsDirectX.td` I
was involved with. There a potentially a few other cases that might
warrant this attribute, but will need input on the others.
There is possibility of
static_tls_begin is set and static_tls_end is not yet

The test reproduces the case.
Stack trace looks like this:
* `MsanThread::Init`
* `SetThreadStackAndTls`
* `GetThreadStackAndTls`
* `GetThreadStackTopAndBottom`
* `pthread_getattr_np`
* `realloc`
* `__sanitizer_malloc_hook`
* TLS access
* `___interceptor___tls_get_addr`
* `DTLS_on_tls_get_addr`

The issue is that `SetThreadStackAndTls` implementation
stores `tls_begin` before `GetThreadStackTopAndBottom`,
and `tls_end` after. So we have partially initialized
state in `DTLS_on_tls_get_addr`.
This is an implementation of `ctime` and includes `ctime_r`.

According to documentation, `ctime` and `ctime_r` are defined as the
following:

```c
char *ctime(const time_t *timep);
char *ctime_r(const time_t *restrict timep, char buf[restrict 26]);
```

closes llvm#86567
Given that the instructions here are all control flow instructions,
adding indentations seem to make it easier to read.
The plan was to make `eh-assembly.s` contain both the legacy and the new
tests, but the new tests require `--no-type-check` because the type
checker for the new EH is in progress. In case this drags on further
than expected, this renames the current file to `-legacy.s` in order to
follow the current naming scheme in `test/CodeGen/WebAssembly`.

After landing this first, `eh-assembly-new.s` in llvm#108668 will be renamed
to `eh-assembly.s`.
Secondary cache entries are now released to the OS from least recent to
most recent entries. This helps to avoid unnecessary scans of the cache
since entries ready to be released (specifically, entries that are
considered old relative to the configurable release interval) will
always be at the tail of the list of committed entries by the LRU
ordering. For this same reason, the `OldestTime` variable is no longer
needed to indicate when releases are necessary so it has been removed.
This PR is implementing `asfloat` for HLSL.

Fixes: llvm#70098

Co-authored-by: Joao Saffran <jderezende@microsoft.com>
This patch adds a large number of missing includes in the libc++ headers
and the test suite. Those were found as part of the effort to move
towards a mostly monolithic top-level std module.
This PR adds `f6E2M3FN` type to mlir.

`f6E2M3FN` type is proposed in [OpenCompute MX
Specification](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
It defines a 6-bit floating point number with bit layout S1E2M3. Unlike
IEEE-754 types, there are no infinity or NaN values.

```c
f6E2M3FN
- Exponent bias: 1
- Maximum stored exponent value: 3 (binary 11)
- Maximum unbiased exponent value: 3 - 1 = 2
- Minimum stored exponent value: 1 (binary 01)
- Minimum unbiased exponent value: 1 − 1 = 0
- Has Positive and Negative zero
- Doesn't have infinity
- Doesn't have NaNs

Additional details:
- Zeros (+/-): S.00.000
- Max normal number: S.11.111 = ±2^(2) x (1 + 0.875) = ±7.5
- Min normal number: S.01.000 = ±2^(0) = ±1.0
- Max subnormal number: S.00.111 = ±2^(0) x 0.875 = ±0.875
- Min subnormal number: S.00.001 = ±2^(0) x 0.125 = ±0.125
```

Related PRs:
- [PR-94735](llvm#94735) [APFloat]
Add APFloat support for FP6 data types
- [PR-105573](llvm#105573) [MLIR]
Add f6E3M2FN type - was used as a template for this PR
This reverts commit 69f3244.

Reason: buildbot breakage because Android doesn't have <gnu/libc-version.h>
https://lab.llvm.org/buildbot/#/builders/186/builds/2381

(It's probably easy to fix but I don't readily have an Android device to test.)
…cxx.rst` (llvm#108714)

This makes it easier for readers to locate how to build the library.
The paper was implemented by commit b0386a5
(https://reviews.llvm.org/D46845) in LLVM 7.0. But it would be nice to
have test coverage for desired properties of `insert_return_type`.

Closes llvm#99944
In buildLocationList, with basic block sections, we iterate over
every basic block twice to detect section start and end.  This is
sub-optimal and shows up as significantly time consuming when
compiling large functions.

This patch uses the set of sections already stored in MBBSectionRanges
and iterates over sections rather than basic blocks.

When detecting if loclists can be merged, the end label of an entry is
matched with the beginning label of the next entry.  For the section
corresponding to the entry basic block, this is skipped.  This is
because the loc list uses the end label corresponding to the function
whereas the MBBSectionRanges map uses the function end label.

For example:

.Lfunc_begin0:
	.file	<blah>
	.loc	0 4 0                           # ex2.cc:4:0
	.cfi_startproc
.Ltmp0:
	#DEBUG_VALUE: test:i <- 7
	.loc	0 8 5 prologue_end              # ex2.cc:8:5
	....
.LBB_END0_0:
	.cfi_endproc
	.section	.text._Z4testv,"ax",@progbits,unique,1
...
.Lfunc_end0:
	.size	_Z4testv, .Lfunc_end0-_Z4testv

The debug loc uses ".LBB_END0_0" for the end of the section whereas
MBBSectionRanges uses ".Lfunc_end0".

It is alright to skip this as we already check the section corresponding
to the debugloc entry.
tmsri pushed a commit that referenced this pull request Sep 19, 2024
…ap (llvm#108825)

This attempts to improve user-experience when LLDB stops on a
verbose_trap. Currently if a `__builtin_verbose_trap` triggers, we
display the first frame above the call to the verbose_trap. So in the
newly added test case, we would've previously stopped here:
```
(lldb) run
Process 28095 launched: '/Users/michaelbuch/a.out' (arm64)
Process 28095 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Bounds error: out-of-bounds access
    frame #1: 0x0000000100003f5c a.out`std::__1::vector<int>::operator[](this=0x000000016fdfebef size=0, (null)=10) at verbose_trap.cpp:6:9
   3    template <typename T>
   4    struct vector {
   5        void operator[](unsigned) {
-> 6            __builtin_verbose_trap("Bounds error", "out-of-bounds access");
   7        }
   8    };
```

After this patch, we would stop in the first non-`std` frame:
```
(lldb) run
Process 27843 launched: '/Users/michaelbuch/a.out' (arm64)
Process 27843 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Bounds error: out-of-bounds access
    frame #2: 0x0000000100003f44 a.out`g() at verbose_trap.cpp:14:5
   11  
   12   void g() {
   13       std::vector<int> v;
-> 14       v[10];
   15   }
   16  
```

rdar://134490328
tmsri pushed a commit that referenced this pull request Sep 19, 2024
Random testing found that the Z3 wrapper does not support UnarySymExpr,
which was added recently and not included in the original Z3 wrapper.
For now, just avoid submitting expressions to Z3 to avoid compiler
crashes.

Some crash context ...

clang -cc1 -analyze -analyzer-checker=core z3-unarysymexpr.c
-analyzer-constraints=z3

Unsupported expression to reason about!
UNREACHABLE executed at
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SMTConstraintManager.h:297!

Stack dump:
3. <root>/clang/test/Analysis/z3-unarysymexpr.c:13:7: Error evaluating
branch #0 <addr> llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) #1
<addr> llvm::sys::RunSignalHandlers() llvm#8 <addr>
clang::ento::SimpleConstraintManager::assumeAux(
llvm::IntrusiveRefCntPtr<clang::ento::ProgramState const>,
clang::ento::NonLoc, bool) llvm#9 <addr>
clang::ento::SimpleConstraintManager::assume(
llvm::IntrusiveRefCntPtr<clang::ento::ProgramState const>,
clang::ento::NonLoc, bool)

Co-authored-by: einvbri <vince.a.bridgers@ericsson.com>
tmsri pushed a commit that referenced this pull request Sep 19, 2024
…ap (llvm#108825)

This attempts to improve user-experience when LLDB stops on a
verbose_trap. Currently if a `__builtin_verbose_trap` triggers, we
display the first frame above the call to the verbose_trap. So in the
newly added test case, we would've previously stopped here:
```
(lldb) run
Process 28095 launched: '/Users/michaelbuch/a.out' (arm64)
Process 28095 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Bounds error: out-of-bounds access
    frame #1: 0x0000000100003f5c a.out`std::__1::vector<int>::operator[](this=0x000000016fdfebef size=0, (null)=10) at verbose_trap.cpp:6:9
   3    template <typename T>
   4    struct vector {
   5        void operator[](unsigned) {
-> 6            __builtin_verbose_trap("Bounds error", "out-of-bounds access");
   7        }
   8    };
```

After this patch, we would stop in the first non-`std` frame:
```
(lldb) run
Process 27843 launched: '/Users/michaelbuch/a.out' (arm64)
Process 27843 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Bounds error: out-of-bounds access
    frame #2: 0x0000000100003f44 a.out`g() at verbose_trap.cpp:14:5
   11  
   12   void g() {
   13       std::vector<int> v;
-> 14       v[10];
   15   }
   16  
```

rdar://134490328
tmsri pushed a commit that referenced this pull request Sep 19, 2024
Random testing found that the Z3 wrapper does not support UnarySymExpr,
which was added recently and not included in the original Z3 wrapper.
For now, just avoid submitting expressions to Z3 to avoid compiler
crashes.

Some crash context ...

clang -cc1 -analyze -analyzer-checker=core z3-unarysymexpr.c
-analyzer-constraints=z3

Unsupported expression to reason about!
UNREACHABLE executed at
clang/include/clang/StaticAnalyzer/Core/PathSensitive/SMTConstraintManager.h:297!

Stack dump:
3. <root>/clang/test/Analysis/z3-unarysymexpr.c:13:7: Error evaluating
branch #0 <addr> llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) #1
<addr> llvm::sys::RunSignalHandlers() llvm#8 <addr>
clang::ento::SimpleConstraintManager::assumeAux(
llvm::IntrusiveRefCntPtr<clang::ento::ProgramState const>,
clang::ento::NonLoc, bool) llvm#9 <addr>
clang::ento::SimpleConstraintManager::assume(
llvm::IntrusiveRefCntPtr<clang::ento::ProgramState const>,
clang::ento::NonLoc, bool)

Co-authored-by: einvbri <vince.a.bridgers@ericsson.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.