Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VM: Better strategy for TryReserveInitialMemory on arm64 (jump stubs) #70707

Merged
merged 16 commits into from
Jun 22, 2022

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Jun 14, 2022

In #63842 we agreed that 128mb size is too small for the initial memory and might not fit typical real world apps. In this attempt I managed to get a good rate of successful attempts to reserve 500-700mb close to coreclr locally (~50% of test runs).

Also, I introduced a Release-internal flag to disable that behavior in order to get more stable results in microbenchmarks.

@jkotas @janvorli

const int32_t MemoryProbingIncrement = 128 * 1024 * 1024;

// If we manage to reserve the initial memory close to coreclr we might get a better performance
// but it's better to turn it off when we run benchmarks for more stable results (always reserve far from coreclr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not sound right for the benchmarks to measure something else than what customers see.

Copy link
Member Author

@EgorBo EgorBo Jun 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem that currently on ARM we have quite shaky results and it's difficult to measure improvements from various changes, e.g. note this "ampere" line compared to x64 windows and Linux
image

I'll experiment on crank what my flags do to a series of measurements and how big RPS is when we're lucky to reserve a piece next to coreclr (and how often) - according to my previous measurements it's 5% larger in that case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and in fact, in like 50% cases apps won't be able to reserve such a big chunk next to coreclr so that flag simulates, probably, the most common case.

@EgorBo EgorBo marked this pull request as draft June 14, 2022 12:28
@EgorBo
Copy link
Member Author

EgorBo commented Jun 14, 2022

So here is what happens on our ampere machine (linux-arm64) we use for TE benchmarks:

app is started, we know the module base address:

coreclrLoadAddress = (UINT_PTR)PAL_GetSymbolModuleBase((void*)VirtualAlloc);

then we append "guessed" CoreClrLibrarySize size to it (100mb) assuming we try to allocate above the location of libcoreclr:

preferredStartAddress = coreclrLoadAddress + CoreClrLibrarySize;

then in a loop we try to reserve 2Gb of memory from preferredStartAddress by appending 128Mb to it every iteration (and decreasing "desired" allocation size by the same number):

do
{
    m_startAddress = ReserveVirtualMemory(pthrCurrent, (void*)preferredStartAddress, sizeOfAllocation);
    if (m_startAddress != nullptr)
    {
        break;
    }

    // Try to allocate a smaller region (and futher)
    sizeOfAllocation -= 128 * 1024 * 1024;
    preferredStartAddress += 128 * 1024 * 1024;
} while (sizeOfAllocation >= MemoryProbingIncrement);

// Fallback:
if (m_startAddress == nullptr)
{
    // We were not able to reserve any memory near libcoreclr. Try to reserve approximately 2 GB of address space somewhere
    // anyway...

it means that after just one of such iteration we already out of RELOC's reach, futhermore 100Mb for "guessed" coreclr size is too much, 16Mb works just fine for me locally.

Next problem that it seems to be very difficult to find spare 2Gb of memory nearby - I was never able to do so on my Apple M1 but it seems to be possible on our Ampere linux-arm64 but only after ~3 iterations of appending 128mb to the base address. For my Apple M1 I was able with some low probability to reserve 1GB nearby and some decent probability for 500-700Mb

src/coreclr/pal/src/map/virtual.cpp Outdated Show resolved Hide resolved
src/coreclr/pal/src/map/virtual.cpp Outdated Show resolved Hide resolved
src/coreclr/pal/src/include/pal/virtual.h Outdated Show resolved Hide resolved
@EgorBo EgorBo marked this pull request as ready for review June 14, 2022 17:38
@EgorBo
Copy link
Member Author

EgorBo commented Jun 14, 2022

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    [Benchmark]
    [Arguments(0.4f)]
    public float MathTest(float x) => 
        MathF.Cos(x) + MathF.Sin(x) + MathF.Tan(x) + MathF.Sin(x);
}

`

Method Job Toolchain x Mean
MathTest Job-FIVHQM /Core_Root/corerun 0.4 23.797 ns
MathTest Job-VDOWXN /Core_Root_PR/corerun 0.4 10.667 ns

in 70% cases the results are like this, in the rest 30% results are the same (24ns) but the baseline never managed to reserve 2Gb close to coreclr.

src/coreclr/pal/src/map/virtual.cpp Outdated Show resolved Hide resolved
src/coreclr/pal/src/map/virtual.cpp Outdated Show resolved Hide resolved
src/coreclr/pal/src/map/virtual.cpp Outdated Show resolved Hide resolved
// Smaller steps on ARM becuase we try hard finding a spare memory in a 128Mb
// distance from coreclr so e.g. all calls from corelib to coreclr could use relocs
const int32_t AddressProbingIncrement = 8 * 1024 * 1024;
const int32_t AllocSizeProbingDecrement = 64 * 1024 * 1024;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the increment and decrement differ for arm?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unified (and increased the initial size to 1Gb) - works pretty good on my Apple M1 and even better on linux-ampere (more hits of successful reservations)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the worst case it does 8 probes (decreasing 1Gb by 128Mb each iteration and increasing startAddress by 8mb) and bails out

EgorBo and others added 3 commits June 16, 2022 21:53
Co-authored-by: Jan Vorlicek <jan.vorlicek@volny.cz>
@EgorBo
Copy link
Member Author

EgorBo commented Jun 17, 2022

@janvorli @jkotas so does it look good? It now tries hard allocating 1Gb heap nearby and, according to logs, it manages to do so on our TE ampere linux machine with some good probability. It also does probing by increasing startAddress by 8mb each iteration and decreasing desired memory size by 128Mb (up to 8 iteration). On My M1 it usually manages to allocate around 500-700Mb heap and keeps relocs working.

I can remove the env.var I added if you don't think it brings value - I just wanted to experiment with it on dotnet/performance to reduce possible noise.

@janvorli
Copy link
Member

I'd suggest removing the env var, even though the results of benchmarks are going to be more noisy, it will allow us to see what users really get.

@EgorBo
Copy link
Member Author

EgorBo commented Jun 21, 2022

I'd suggest removing the env var, even though the results of benchmarks are going to be more noisy, it will allow us to see what users really get.

Removed


#ifdef TARGET_XARCH
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ifdef is incorrect, we don't define TARGET_XARCH out of JIT. Can you please use
#if defined(TARGET_ARM) || defined(TARGET_ARM64) here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, again I am using JIT's flags in the VM 😞 the incorrect define slightly regressed x64/x86 by decreasing preferable size from 2Gb to 1Gb. Thanks, fixed

@EgorBo
Copy link
Member Author

EgorBo commented Jun 22, 2022

@janvorli does it look good now?

Copy link
Member

@janvorli janvorli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@janvorli janvorli merged commit 4388cc5 into dotnet:main Jun 22, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants