
parser uses a surprising amount of memory for simple but long patterns #1090

Closed

lilydjwg opened this issue Sep 15, 2023 · 13 comments

@lilydjwg

What version of regex are you using?

1.9.5

Describe the bug at a high level.

I compiled a regex with a lot of literals as alternatives (~700KiB) and it consumed way too much memory (256MiB).

What are the steps to reproduce the behavior?

Prepare a text file with content like abc|def|text|... but about 700KiB long and ~64K |s (it could be generated from a dict file).
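A quick sketch of generating such a file from a word list (the dictionary path below is an assumption; substitute any sufficiently large word list):

```rust
// Sketch: build one big alternation like "abc|def|..." from a
// newline-separated word list.
fn words_to_pattern(words: &str) -> String {
    words.lines().collect::<Vec<_>>().join("|")
}

fn main() {
    // The dict path is an assumption; skip gracefully if it's absent.
    if let Ok(words) = std::fs::read_to_string("/usr/share/dict/words") {
        std::fs::write("file", words_to_pattern(&words)).unwrap();
    }
}
```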

fn main() {
  let s = std::fs::read_to_string("file").unwrap();
  let r = regex::Regex::new(&s);
  std::thread::sleep(std::time::Duration::from_secs(100));
  drop(r);
}

And then look at the process's memory usage.

What is the actual behavior?

It took up about 256MiB of memory. Python only used 31MiB.

What is the expected behavior?

I expect it to use less memory than Python, not a lot more.

@BurntSushi (Member)

I need a reproduction please. The code you've provided is not runnable by me. You can upload data files to GitHub issues.

@lilydjwg (Author)

I generated one. It is not the same file I've been using, but it exhibits the same issue (gunzip it first).

file.gz

@BurntSushi (Member)

Okay, there are some knots to untangle here. First, you aren't checking whether regex compilation succeeds. In the case you've given me, it does not, because it exceeds the default size limit. The regex is compiled eagerly, and if compilation exceeds the limit, it is stopped and an error is returned. That means that if you increase the limit and let the regex actually compile, it is likely to use more memory than what you're observing. We should also actually use the regex to check that it works.

So let's do that first. I'll use my shell's built-in time command to measure peak memory usage. I've also inserted some print and timing statements to get a sense of where the program is spending time:

use std::time::{Duration, Instant};

fn main() {
    env_logger::init();
    let (s, elapsed) = timeit(|| std::fs::read_to_string("file").unwrap());
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    println!("regex compiled: {:?}", elapsed);

    let (m, elapsed) = timeit(|| {
        r.find("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
            .expect("a regex match")
    });
    println!("search finished: {:?}", elapsed);
    assert_eq!(45, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And to compile and run it:

$ cargo b --release
    Finished release [optimized] target(s) in 0.00s

$ time ./target/release/i1090
pattern read from file: 342.791µs
regex compiled: 220.353665ms
search finished: 96.698486ms

real    0.326
user    0.249
sys     0.076
maxmem  383 MB
faults  0

All right, now we need a baseline to compare it to. You made a claim about the memory usage being greater than Python, but you didn't provide an equivalent Python program. Thankfully I know Python and how to run Python programs, so I'll do that part for you:

import re
import time

def main():
    s, elapsed = timeit(lambda: open('file', 'r').read())
    print(f"pattern read from file: {elapsed}s")

    r, elapsed = timeit(lambda: re.compile(s))
    print(f"regex compiled: {elapsed}s")

    m, elapsed = timeit(
        lambda: r.search("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
    )
    print(f"search finished: {elapsed}s")
    assert m.start() == 45


def timeit(f):
    start = time.time()
    t = f()
    elapsed = time.time() - start
    return t, elapsed


if __name__ == '__main__':
    main()

And to run it:

$ time python main.py
pattern read from file: 0.0006263256072998047s
regex compiled: 0.7324204444885254s
search finished: 0.00032448768615722656s

real    0.744
user    0.734
sys     0.010
maxmem  131 MB
faults  0

So the Python program uses less memory than the Rust program, but the difference is not as great as you mentioned.

Generally speaking, this is not unexpected. Python uses a backtracking regex engine, and so I would expect it to use less memory overall. The regex crate uses finite automata to guarantee linear time, and this generally leads to greater memory usage.

So, actually looking at your input, you do not quite have an alternation of literals. Most of your alternates are literals, but some are regexes due to the presence of the . meta-character. So let's put each literal on its own line, read them one line at a time, escape them and then join them together. I'll also note that you had a trailing | in your file, which adds an empty alternate that matches the empty string. I would guess that was unintended, so I removed it. Here are the updated programs and timings. For Rust:

use std::time::{Duration, Instant};

fn main() {
    let (s, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(regex::escape)
            .collect::<Vec<String>>()
            .join("|")
    });
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    println!("regex compiled: {:?}", elapsed);

    let (m, elapsed) = timeit(|| {
        r.find("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
            .expect("a regex match")
    });
    println!("search finished: {:?}", elapsed);
    assert_eq!(45, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And Python:

import re
import time

def main():
    s, elapsed = timeit(
        lambda: '|'.join(map(re.escape, open('file', 'r').read().splitlines()))
    )
    print(f"pattern read from file: {elapsed}s")

    r, elapsed = timeit(lambda: re.compile(s))
    print(f"regex compiled: {elapsed}s")

    m, elapsed = timeit(
        lambda: r.search("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!zygote")
    )
    print(f"search finished: {elapsed}s")
    assert m.start() == 45


def timeit(f):
    start = time.time()
    t = f()
    elapsed = time.time() - start
    return t, elapsed


if __name__ == '__main__':
    main()

And now to run the programs:

$ cargo b --release

$ time ./target/release/i1090
pattern read from file: 7.74721ms
regex compiled: 203.093996ms
search finished: 3.709µs

real    0.222
user    0.152
sys     0.069
maxmem  322 MB
faults  0

$ time python main.py
pattern read from file: 0.0295107364654541s
regex compiled: 0.7328767776489258s
search finished: 0.0003745555877685547s

real    0.774
user    0.720
sys     0.053
maxmem  131 MB
faults  0

Memory usage for the Rust program dropped a little, but not by much. The search time, however, dropped dramatically. This is because the pattern is now a plain alternation of literals, so the search can run on an Aho-Corasick automaton rather than the general regex engine.

To check whether memory usage is as I would expect, I ran the Rust program above under heaptrack. It turns out that nearly all of the memory is being used... by the regex parser. Yikes. That's not so good. It means the memory usage has nothing to do with the compiled regex object itself, but rather with its Ast.

I'll leave this bug open for now to see if I can investigate it, but if you really only need to look for an alternation of literals, you can just use the aho-corasick crate directly:

use std::time::{Duration, Instant};

fn main() {
    env_logger::init();
    let (patterns, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(|line| line.to_string())
            .collect::<Vec<String>>()
    });
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        aho_corasick::AhoCorasick::builder()
            .match_kind(aho_corasick::MatchKind::LeftmostFirst)
            .build(&patterns)
            .unwrap()
    });
    println!("aho-corasick compiled: {:?}", elapsed);

    let haystack = format!("{}zygote", "!".repeat(10_000));
    let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
    println!("search finished: {:?}", elapsed);
    assert_eq!(10_000, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And now to compile and run it, we can see memory usage is way lower, including lower than Python:

$ cargo b --release
   Compiling i1090 v0.1.0 (/home/andrew/tmp/issues/regex/i1090)
    Finished release [optimized + debuginfo] target(s) in 0.65s

$ time ./target/release/i1090
pattern read from file: 4.236067ms
aho-corasick compiled: 19.070214ms
search finished: 12.952µs

real    0.025
user    0.025
sys     0.000
maxmem  12 MB
faults  0

@BurntSushi BurntSushi changed the title High memory usage for a lot of literal alternatives parser uses a surprising amount of memory for simple but long patterns Sep 15, 2023
@lilydjwg (Author)

I'm sorry I left some issues in the example program; they weren't present in my actual program (in which I escaped meta-characters, hit the size_limit error, raised the limit, and it worked correctly).

Btw, I didn't measure peak memory. Instead I looked at the resident memory over a long run.

I've switched to aho-corasick. It took me a few attempts before it worked as expected, and yes, memory usage is low now.

Glad to learn about the heaptrack tool. It seems very useful. Thank you.

@BurntSushi (Member)

Btw, I didn't measure peak memory. Instead I looked at the resident memory over a long run.

Aye. This is the sort of thing that should have been in the original issue. You need to provide the full context for not only how you're producing a problem but how you're measuring it. Notice how I did that in my comment and it enabled you to see a discrepancy between our measurement methods!

@BurntSushi (Member)

Sadly, this looks a little tricky to fix, even with a band-aid. One possible band-aid is to introduce some boxing inside the Ast types. Namely, an Ast is over 200 bytes in size on my x86-64 target. Since every single codepoint in the pattern gets its own Ast, it follows that every codepoint uses over 200 bytes of memory. (The pattern is about 1,000,000 codepoints. If each codepoint uses about 200 bytes, that explains where the vast majority of the memory comes from.) This could be reduced substantially with a few well-placed Boxes, but unfortunately, the Ast type is completely exposed in the public API of regex-syntax. Doing a breaking change of regex-syntax is not a huge deal, but I just did one and don't have the energy to do another just yet. So at some point, I should be able to mostly mitigate this in the regex-syntax 0.8.0 release, whenever that is.

It's not clear to me if more fixes will be needed beyond that. If so, it might require deeper changes to the parser or perhaps even a "fast" path. But if we can get the Ast down to 24 bytes or so via boxing, then that would result in approximately an order of magnitude improvement. That's probably good enough.
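The effect of boxing on enum size can be sketched with stand-in types (these are illustrative, not the real Ast definitions):

```rust
use std::mem::size_of;

// Stand-in for an AST node: one variant with a large inline payload
// drags the whole enum up to payload size plus discriminant...
enum Unboxed {
    Literal([u8; 200]),
    Alternation(Vec<Unboxed>),
}

// ...while boxing the large payloads shrinks every node (and therefore
// every element of a Vec of nodes) to pointer size plus discriminant.
enum Boxed {
    Literal(Box<[u8; 200]>),
    Alternation(Box<Vec<Boxed>>),
}

fn main() {
    println!("unboxed: {} bytes", size_of::<Unboxed>());
    println!("boxed:   {} bytes", size_of::<Boxed>());
}
```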

@lilydjwg (Author)

You need to provide the full context for not only how you're producing a problem but how you're measuring it.

I'll do that next time. (By memory usage I meant whatever a top-like process monitor reports. I probably shouldn't have written the issue right after rushing through coding and debugging.)

I have another question: why doesn't the parser free the memory after compilation is done? Such a large memory peak is still an issue on a router, though much less so on a desktop or server.

@BurntSushi (Member)

It does. The issue is that looking at memory usage via top (i.e., RSS as reported by the OS) is a flawed measurement. Just because this library frees the memory doesn't mean the allocator returns it to the OS. The allocator might keep it around for other uses.

You should be able to confirm this by inserting a shim via a global allocator to report total memory retained. (I have such a shim, and I will share it next time I have my hands on a keyboard. I'm on mobile at the moment.)

@BurntSushi (Member) commented Sep 16, 2023

Also, if someone is using this library in a constrained environment such as a router, I would be happy to hear more detailed reports of issues that one runs into.

@lilydjwg (Author)

Just because this library frees the memory doesn't mean the allocator frees it back to the OS.

I see. It's a pity though. This may be a good reason to have swap space, so this kind of never-to-be-used-again memory can be swapped out to make room for other uses.

if someone is using this library in a constrained environment such as a router.

Sorry, I don't have my own router, and I don't feel comfortable messing with the ISP's. People have asked about running my program on their routers, though, and that's one of the reasons I switched from Python to Rust.

@BurntSushi (Member)

I haven't put swap on any of my machines for over a decade.

And yes, Rust isn't magic. Python's regex engine is written in C, and it's a backtracker. So its memory usage is not, in principle, going to be obviously worse than that of any other regex engine written in Rust or C.

@BurntSushi (Member)

Here is a program with the allocator shim:

use std::time::{Duration, Instant};

fn main() {
    let (s, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(regex::escape)
            .collect::<Vec<String>>()
            .join("|")
    });
    println!("pattern read from file: {:?}", elapsed);

    allocator::reset();
    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    allocator::tell();
    println!("regex compiled: {:?}", elapsed);

    let haystack = format!("{}zygote", "!".repeat(10_000));
    let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
    println!("search finished: {:?}", elapsed);
    assert_eq!(10_000, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

mod allocator {
    use std::alloc::{GlobalAlloc, Layout, System};
    use std::sync::atomic::{AtomicUsize, Ordering};

    static TOTAL_ALLOC: AtomicUsize = AtomicUsize::new(0);
    static TOTAL_DEALLOC: AtomicUsize = AtomicUsize::new(0);

    #[global_allocator]
    static ALLOCATOR: TotalAllocator = TotalAllocator;

    struct TotalAllocator;

    unsafe impl GlobalAlloc for TotalAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            TOTAL_ALLOC.fetch_add(layout.size(), Ordering::Relaxed);
            System.alloc(layout)
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            TOTAL_DEALLOC.fetch_add(layout.size(), Ordering::Relaxed);
            System.dealloc(ptr, layout)
        }
    }

    pub(crate) fn reset() {
        TOTAL_ALLOC.store(0, Ordering::Relaxed);
        TOTAL_DEALLOC.store(0, Ordering::Relaxed);
    }

    pub(crate) fn tell() {
        let total_alloc = TOTAL_ALLOC.load(Ordering::Relaxed) as isize;
        let total_dealloc = TOTAL_DEALLOC.load(Ordering::Relaxed) as isize;
        eprintln!(
            "allocated: {}, retained: {}",
            total_alloc,
            total_alloc - total_dealloc,
        );
    }
}

And then compile and run it:

$ cargo b --release
   Compiling i1090 v0.1.0 (/home/andrew/tmp/issues/regex/i1090)
    Finished release [optimized + debuginfo] target(s) in 0.37s

$ time ./target/release/i1090
pattern read from file: 8.313858ms
allocated: 683401785, retained: 1451268
regex compiled: 205.86363ms
search finished: 25.806µs

real    0.221
user    0.147
sys     0.073
maxmem  333 MB
faults  0

So there is a lot being allocated overall, but only 1.45MB is retained.

@BurntSushi (Member)

OK, I've got memory usage down below the Python program. Specifically, this Rust program:

use std::time::{Duration, Instant};

fn main() {
    let (s, elapsed) = timeit(|| {
        std::fs::read_to_string("file")
            .unwrap()
            .lines()
            .map(regex::escape)
            .collect::<Vec<String>>()
            .join("|")
    });
    println!("pattern read from file: {:?}", elapsed);

    let (r, elapsed) = timeit(|| {
        regex::RegexBuilder::new(&s)
            .size_limit(100 * (1 << 20))
            .build()
            .unwrap()
    });
    println!("regex compiled: {:?}", elapsed);

    let haystack = format!("{}zygote", "!".repeat(10_000_000));
    let (m, elapsed) = timeit(|| r.find(&haystack).expect("a regex match"));
    println!("search finished: {:?}", elapsed);
    assert_eq!(10_000_000, m.start());
}

fn timeit<T>(mut f: impl FnMut() -> T) -> (T, Duration) {
    let start = Instant::now();
    let t = f();
    let elapsed = Instant::now().duration_since(start);
    (t, elapsed)
}

And this Python program:

import re
import time

def main():
    s, elapsed = timeit(
        lambda: '|'.join(map(re.escape, open('file', 'r').read().splitlines()))
    )
    print(f"pattern read from file: {elapsed}s")

    r, elapsed = timeit(lambda: re.compile(s))
    print(f"regex compiled: {elapsed}s")

    haystack = ('!' * 10_000_000) + 'zygote'
    m, elapsed = timeit(lambda: r.search(haystack))
    print(f"search finished: {elapsed}s")
    assert m.start() == 10_000_000


def timeit(f):
    start = time.time()
    t = f()
    elapsed = time.time() - start
    return t, elapsed


if __name__ == '__main__':
    main()

with the data given by the OP:

$ cargo b --release

$ time ./target/release/i1090
pattern read from file: 8.23772ms
allocated: 234110697, retained: 1451268
regex compiled: 151.386156ms
search finished: 12.125789ms

real    0.179
user    0.143
sys     0.036
maxmem  122 MB
faults  0

$ time python main.py
pattern read from file: 0.03054189682006836s
regex compiled: 0.7595720291137695s
search finished: 0.01816868782043457s

real    0.823
user    0.772
sys     0.050
maxmem  131 MB
faults  0

Note that I'm not 100% sure when this fix is going to be released since it will require putting out a regex-syntax 0.8 release, which is a breaking change from the current 0.7. I might do it soon depending on how some other experiments go.

BurntSushi added a commit that referenced this issue Oct 6, 2023
This puts every Ast value behind a box to conserve space. It makes
things like Vec<Ast> quite a bit smaller than what they would be
otherwise, which is especially beneficial for the representation of
concatenations and alternations.

This doesn't quite solve the memory usage problems though, since an
AstKind is still quite big (over 200 bytes). The next step will be
boxing each of the variants of an AstKind which should hopefully resolve
the issue.

Ref #1090
BurntSushi added a commit that referenced this issue Oct 8, 2023
The AstKind experiment proved unfruitful. I think the issue here is that
the savings on Vec<Ast> didn't prove to be enough to offset the extra
heap allocation that resulted from the indirection.

This seems to be a sweet spot. It would be nice to get Ast down below 16
bytes, but it's not clear how to do that (without much larger changes
that I don't feel inclined to pursue).

Fixes #1090