Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significantly slower --emit=metadata #47267

Closed
leonardo-m opened this issue Jan 8, 2018 · 17 comments
Closed

Significantly slower --emit=metadata #47267

leonardo-m opened this issue Jan 8, 2018 · 17 comments
Labels
A-NLL Area: Non Lexical Lifetimes (NLL) C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. NLL-performant Working towards the "performance is good" goal T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@leonardo-m
Copy link

leonardo-m commented Jan 8, 2018

I have found a significant time increase when I use "rustc --emit=metadata". I have located the source of such difference between two Nightlies, that I've named A and B, the time is for my code:

A:
rustup default nightly-2017-12-14
rustc 1.24.0-nightly (f8af59d95 2017-12-13)
commit-hash: f8af59d95225d122e38bb5dceb1027844d4ce170
host: x86_64-pc-windows-gnu
--emit=metadata time: 4.35 seconds

B:
rustup default nightly-2017-12-15
rustc 1.24.0-nightly (0077d128d 2017-12-14)
commit-hash: 0077d128d33543290140763ce7d84d05eb57c40a
host: x86_64-pc-windows-gnu
--emit=metadata time: 5.22 seconds

If you perform a complete compilation instead of just --emit=metadata the timing difference between the A and B compiler is of course much smaller. But --emit=metadata time is important to verify the code.

To show an example I've done:

cargo check --all

On rust-ndarray version 0.11.0, with both A and B. Then I've cleaned the directory:

rust-ndarray-master\target\debug\deps

And I've done "cargo check --all" again for both both A and B. This is not the best test for this problem, but it shows a difference (25.76 seconds for A, and 26.75 seconds for B), the difference percentage is small because it's not equivalent to just a "--emit=metadata".

@eddyb
Copy link
Member

eddyb commented Jan 8, 2018

cc @michaelwoerister

@michaelwoerister
Copy link
Member

IIRC, metadata generation is the first point where we compute the MIR of most functions, so the slowdown does not have to be specific to metadata encoding.

@alexcrichton
Copy link
Member

According to the list of changes the candidates probably are:

@leonardo-m would you be able to share detailed instructions of how to reproduce? Testing the two nightlies in question on rust-ndarray/ndarray@bda4c0d doesn't show much differnce to me for cargo check --all and incremental changes

@leonardo-m
Copy link
Author

leonardo-m commented Jan 9, 2018

So far I haven't found a better way to reproduce the problem, my code is not public. But if no one else is having this problem, then it's just a problem of my code, and I'll close down this issue.

A simple thing that I didn't think about using is "-Ztime-passes --emit=metadata":

A:
  time: 0.058; rss: 32MB	parsing
  time: 0.000; rss: 32MB	recursion limit
  time: 0.000; rss: 32MB	crate injection
  time: 0.000; rss: 32MB	plugin loading
  time: 0.000; rss: 32MB	plugin registration
  time: 0.152; rss: 59MB	expansion
  time: 0.000; rss: 59MB	maybe building test harness
  time: 0.003; rss: 59MB	maybe creating a macro crate
  time: 0.008; rss: 59MB	creating allocators
  time: 0.000; rss: 59MB	checking for inline asm in case the target doesn't support it
  time: 0.003; rss: 59MB	AST validation
  time: 0.056; rss: 66MB	name resolution
  time: 0.008; rss: 67MB	complete gated feature checking
  time: 0.042; rss: 81MB	lowering ast -> hir
  time: 0.015; rss: 81MB	early lint checks
  time: 0.091; rss: 86MB	indexing hir
  time: 0.003; rss: 86MB	attribute checking
  time: 0.000; rss: 78MB	load query result cache
  time: 0.000; rss: 78MB	looking for entry point
  time: 0.000; rss: 78MB	looking for plugin registrar
  time: 0.003; rss: 78MB	loop checking
  time: 0.000; rss: 78MB	static item recursion checking
  time: 0.009; rss: 79MB	stability checking
  time: 0.052; rss: 99MB	type collecting
  time: 0.000; rss: 99MB	outlives testing
  time: 0.000; rss: 99MB	impl wf inference
  time: 0.008; rss: 101MB	coherence checking
  time: 0.000; rss: 101MB	variance testing
  time: 0.058; rss: 108MB	wf checking
  time: 0.017; rss: 108MB	item-types checking
  time: 2.134; rss: 134MB	item-bodies checking
  time: 0.124; rss: 139MB	const checking
  time: 0.038; rss: 139MB	privacy checking
  time: 0.005; rss: 139MB	intrinsic checking
  time: 0.020; rss: 139MB	match checking
  time: 0.062; rss: 140MB	liveness checking
  time: 1.079; rss: 186MB	borrow checking
  time: 0.001; rss: 186MB	MIR borrow checking
  time: 0.000; rss: 186MB	MIR effect checking
  time: 0.014; rss: 186MB	death checking
  time: 0.000; rss: 186MB	unused lib feature checking
  time: 0.075; rss: 188MB	lint checking
  time: 0.000; rss: 188MB	resolving dependency formats
    time: 0.000; rss: 189MB	write metadata
    time: 0.000; rss: 197MB	assert dep graph
    time: 0.000; rss: 197MB	serialize dep graph
  time: 0.053; rss: 197MB	translation
  time: 0.000; rss: 95MB	serialize work products
  time: 0.000; rss: 95MB	linking


B:
  time: 0.058; rss: 32MB	parsing
  time: 0.000; rss: 32MB	recursion limit
  time: 0.000; rss: 32MB	crate injection
  time: 0.000; rss: 32MB	plugin loading
  time: 0.000; rss: 32MB	plugin registration
  time: 0.118; rss: 59MB	expansion
  time: 0.000; rss: 59MB	maybe building test harness
  time: 0.003; rss: 59MB	maybe creating a macro crate
  time: 0.008; rss: 60MB	creating allocators
  time: 0.000; rss: 60MB	checking for inline asm in case the target doesn't support it
  time: 0.003; rss: 60MB	AST validation
  time: 0.056; rss: 67MB	name resolution
  time: 0.008; rss: 67MB	complete gated feature checking
  time: 0.042; rss: 82MB	lowering ast -> hir
  time: 0.015; rss: 82MB	early lint checks
  time: 0.050; rss: 85MB	indexing hir
  time: 0.003; rss: 85MB	attribute checking
  time: 0.000; rss: 79MB	load query result cache
  time: 0.000; rss: 79MB	looking for entry point
  time: 0.000; rss: 79MB	looking for plugin registrar
  time: 0.003; rss: 80MB	loop checking
  time: 0.000; rss: 80MB	static item recursion checking
  time: 0.009; rss: 81MB	stability checking
  time: 0.052; rss: 100MB	type collecting
  time: 0.000; rss: 100MB	outlives testing
  time: 0.000; rss: 100MB	impl wf inference
  time: 0.008; rss: 101MB	coherence checking
  time: 0.000; rss: 101MB	variance testing
  time: 0.058; rss: 108MB	wf checking
  time: 0.018; rss: 109MB	item-types checking
  time: 2.118; rss: 134MB	item-bodies checking
  time: 0.126; rss: 139MB	const checking
  time: 0.038; rss: 139MB	privacy checking
  time: 0.005; rss: 139MB	intrinsic checking
  time: 0.020; rss: 139MB	match checking
  time: 0.061; rss: 140MB	liveness checking
  time: 1.955; rss: 188MB	borrow checking
  time: 0.001; rss: 188MB	MIR borrow checking
  time: 0.000; rss: 188MB	MIR effect checking
  time: 0.014; rss: 188MB	death checking
  time: 0.000; rss: 188MB	unused lib feature checking
  time: 0.077; rss: 190MB	lint checking
  time: 0.000; rss: 190MB	resolving dependency formats
    time: 0.000; rss: 190MB	write metadata
    time: 0.000; rss: 198MB	assert dep graph
    time: 0.000; rss: 198MB	serialize dep graph
  time: 0.052; rss: 198MB	translation
  time: 0.000; rss: 95MB	serialize work products
  time: 0.000; rss: 95MB	linking

The only significant difference I am seeing in those two listings is for the borrow checking:

A: time: 1.079; rss: 186MB borrow checking

B: time: 1.955; rss: 188MB borrow checking

So the cause could be the NLL changes. I have an open issue about NLL ( #47022 ), so I can't compile my code with #![feature(nll)]. But I've seen that nll multiplies by 5.5X my "--emit=metadata" compilation time.

@eddyb
Copy link
Member

eddyb commented Jan 9, 2018

cc @nikomatsakis

@nikomatsakis
Copy link
Contributor

nikomatsakis commented Jan 11, 2018

Huh, that's interesting. I don't have much of an idea why that would be. It'd be really nice if we could try to reproduce this with some standalone file.

@nikomatsakis nikomatsakis added T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-nll I-compiletime Issue: Problems and improvements with respect to compile times. labels Jan 11, 2018
@nikomatsakis
Copy link
Contributor

I'm tagging it as WG-compiler-nll for now.

@leonardo-m
Copy link
Author

This is a problem related but different from the main one discussed above. The last Nigtly (1.25.0 3f92e8d 2018-01-14) is finally able to compile my code if I add the nll feature. But compiling it with "--emit=metadata" goes from 5.30 seconds to 31.84 seconds (about 6X slower). If I also use "-Ztime-passes" I see the differences:

Without NLL:
  time: 2.001; rss: 187MB	borrow checking
  time: 0.001; rss: 187MB	MIR borrow checking

With NLL:
  time: 2.184; rss: 193MB	borrow checking
  time: 26.063; rss: 199MB	MIR borrow checking

@michaelwoerister
Copy link
Member

Ouch :)

@nikomatsakis
Copy link
Contributor

@leonardo-m perhaps you can run with perf and give some kind of profiling? It's kind of hard to diagnose what the problem is without some better way to reproduce.

(How private is the code? If you are willing to send via e-mail, I'd of course promise not to redistribute.)

@leonardo-m
Copy link
Author

leonardo-m commented Jan 18, 2018

Yes, I could send you all the code. But first let's try with an example I've found. This small program shows some timings difference compiling with "--emit=metadata" (0.47 seconds with nll, and 0.27 seconds without):

#![feature(nll)]

use std::mem::size_of;

macro_rules! isqrt { ($e:expr, $t:ident) => ((($e) as f64).sqrt() as $t)  }

struct SieveBits { bits: Vec<usize> }

impl SieveBits {
    const BPC: usize = size_of::<usize>() * 8;

    fn is_bit_set(bits: &[usize], i: usize) -> bool {
        debug_assert!(i < bits.len() * Self::BPC);
        let offset = i / Self::BPC;
        let mask = 1 << (i % Self::BPC);
        unsafe { (*bits.get_unchecked(offset) & mask) != 0 }
    }

    fn reset_bit(bits: &mut [usize], i: usize) {
        debug_assert!(i < bits.len() * Self::BPC);
        let offset = i / Self::BPC;
        let mask = 1 << (i % Self::BPC);
        unsafe { *bits.get_unchecked_mut(offset) &= !mask; }
    }

    fn new(m: usize) -> Self {
        if m < 2 {
            return Self { bits: vec![] };
        }
        let mut primes = vec![std::usize::MAX; (m / 3) / Self::BPC + 1];
        Self::reset_bit(&mut primes, 0);

        let lim = isqrt!(m, usize) + 1;
        let mut i = 5;
        let mut step = 2;
        while i < lim {
            if Self::is_bit_set(&primes, i / 3) {
                let mut j = i * i;
                let mut step2 = step;
                while j < m {
                    Self::reset_bit(&mut primes, j / 3);
                    j += step2 * i;
                    step2 = 6 - step2;
                }
            }
            i += step;
            step = 6 - step;
        }

        Self { bits: primes }
    }

    fn is_prime(&self, i: usize) -> bool {
        if i == 2 || i == 3 { true }
        else if i % 2 == 0 || i % 3 == 0 { false }
        else { Self::is_bit_set(&self.bits, i / 3) }
    }
}

fn eratosthenes_sieve_u32(limit: usize) -> Vec<u32> {
    assert!(limit <= std::u32::MAX as usize);

    #[inline(always)]
    fn fill_u8(data: &mut [u8], value: u8) {
        unsafe {
            std::ptr::write_bytes(data.as_mut_ptr(), value, data.len());
        }
    }

    const L1D_CACHE_SIZE:u32 = 32_768;

    let mut result = vec![];
    if limit < 2 {
        return result;
    } else {
        result.push(2);
    }

    let lsqrt = isqrt!(limit, u32);

    let mut is_prime = vec![1u8; lsqrt as usize + 1];
    let mut i = 2;
    while i * i <= lsqrt {
        unsafe {
            if *is_prime.get_unchecked(i as usize) != 0 {
                let mut j = i * i;
                while j <= lsqrt {
                    *is_prime.get_unchecked_mut(j as usize) = 0;
                    j += i;
                }
            }
        }
        i += 1;
    }

    let segment_size = lsqrt.max(L1D_CACHE_SIZE);
    let mut s: usize = 3;
    let mut n: usize = 3;

    let mut sieve = vec![1u8; segment_size as usize];
    let mut primes: Vec<u32> = vec![];
    let mut next: Vec<u32> = vec![];
    let mut low: usize = 0;

    while low <= limit {
        fill_u8(&mut sieve, 1);

        let high = limit.min(low + segment_size as usize - 1);

        unsafe {
            while s * s <= high {
                if *is_prime.get_unchecked(s) != 0 {
                    primes.push(s as u32);
                    next.push((s * s - low) as u32);
                }
                s += 2;
            }

            for (i, &p) in primes.iter().enumerate() {
                let k = p * 2;
                let mut j = *next.get_unchecked(i);
                while j < segment_size {
                    *sieve.get_unchecked_mut(j as usize) = 0;
                    j += k;
                }
                *next.get_unchecked_mut(i) = j - segment_size;
            }

            while n <= high {
                if *sieve.get_unchecked(n - low) != 0 {
                    result.push(n as u32);
                }
                n += 2;
            }
        }

        low += segment_size as usize;
    }

    result
}

fn main() {
    fn e60() -> u32 {
        struct DPrime { n: u32, fac: u32 }

        impl DPrime {
            fn new(p: u32) -> Self {
                fn next_pow10(n: u32) -> u32 {
                    let mut p = 10;
                    while n > p { p *= 10; }
                    p
                }

                Self { n: p, fac: next_pow10(p) }
            }
        }

        const LIM: u32 = 10_000;

        let mut primes = vec![DPrime::new(3)];
        primes.extend(eratosthenes_sieve_u32(LIM as usize).iter().skip(3).map(|&p| DPrime::new(p)));

        let sieve = SieveBits::new((LIM * LIM) as usize);

        let are_coprimes = |a: &DPrime, b: &DPrime|
            b.n > a.n &&
            sieve.is_prime((a.n + a.fac * b.n) as usize) &&
            sieve.is_prime((b.n + b.fac * a.n) as usize);

        let mut best_sum = std::u32::MAX;

        for p1 in &primes {
            if p1.n >= best_sum { break; }
            let p2v: Vec<_> = primes.iter().filter(|p2| are_coprimes(p1, p2)).collect();
            for p2 in &p2v {
                if p1.n + p2.n >= best_sum { break; }
                let p3v: Vec<_> = p2v.iter().filter(|p3| are_coprimes(p2, p3)).collect();
                for p3 in &p3v {
                    if p1.n + p2.n + p3.n >= best_sum { break; }
                    let p4v: Vec<_> = p3v.iter().filter(|p4| are_coprimes(p3, p4)).collect();
                    for p4 in &p4v {
                        if p1.n + p2.n + p3.n + p4.n >= best_sum { break; }
                        if let Some(p5) = p4v.iter().find(|p5| are_coprimes(p4, p5)) {
                            let sum = p1.n + p2.n + p3.n + p4.n + p5.n;
                            if sum < best_sum {
                                best_sum = sum;
                            }
                        }
                    }
                }
            }
        }
        best_sum
    }
    assert_eq!(e60(), 26_033);
}


Without nll feature:
time: 0.000; rss: 69MB    MIR borrow checking

With nll feature:
time: 0.203; rss: 70MB    MIR borrow checking

@nikomatsakis
Copy link
Contributor

@leonardo-m thanks =)

@nikomatsakis nikomatsakis modified the milestone: NLL run-pass without ICEs Jan 18, 2018
@nikomatsakis nikomatsakis added this to the NLL: Performance is good milestone Jan 19, 2018
@nikomatsakis nikomatsakis added the A-NLL Area: Non Lexical Lifetimes (NLL) label Jan 19, 2018
@nikomatsakis
Copy link
Contributor

Some notes on the runtime of this test that I've extracted from profiling:

  • First, 12% of compilation time (just a regular build) is spent in librustc_mir::borrow_check
  • Of that, virtually all of it is spent in NLL, and NLL divides its time roughly equally between:
    • the MIR type check (~6%)
    • the region inference (~6%)

@nikomatsakis
Copy link
Contributor

Browsing the MIR for the main function is interesting (here is a gist), you can definitely see some of the density/sparsity of the region values.

@spastorino
Copy link
Member

Some numbers of @leonardo-m example, with and without this PR #47766
I'm showing the average numbers after running time rustc several times.

Without PR

[santiago@archlinux rust2 (master)]$ time ./build/x86_64-unknown-linux-gnu/stage1/bin/rustc main.rs

real	0m2.822s
user	0m3.043s
sys	0m0.064s

With PR

[santiago@archlinux rust1 (inference-dirty-list)]$ time ./build/x86_64-unknown-linux-gnu/stage1/bin/rustc main.rs

real	0m1.125s
user	0m1.262s
sys	0m0.086s

With PR, NLL turned off

[santiago@archlinux rust1 (inference-dirty-list)]$ time ./build/x86_64-unknown-linux-gnu/stage1/bin/rustc main.rs

real	0m0.869s
user	0m1.012s
sys	0m0.081s

So, there's still a lot to improve but getting closer 😄

@leonardo-m
Copy link
Author

The compile time with --emit=metadata and nll feature has improved a bit, but it's still about 5X slower. The compile time without the nll has improved in the last two nightlies and it's now almost OK.

@nikomatsakis nikomatsakis added the NLL-performant Working towards the "performance is good" goal label Mar 14, 2018
@XAMPPRocky XAMPPRocky added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Apr 10, 2018
@leonardo-m
Copy link
Author

Since the last Nightly the compiler improvements have essentially put the compiler back to the original performance discussed here. The performance of the NLL feature is discussed elsewhere, so I close down this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-NLL Area: Non Lexical Lifetimes (NLL) C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. NLL-performant Working towards the "performance is good" goal T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

7 participants