
Avoid Storing Large Strings in Memory #194

Merged
48 commits merged into master on Jan 5, 2024

Conversation

@ahirreddy (Collaborator) commented Dec 20, 2023

  • This PR adds a new ResolvedFile interface that encapsulates all file reads (import, importstr, etc.). Behind this interface, a file can be backed by a string in memory (old behavior) or by a file on disk.
  • We implement ParserInputs for each: the in-memory one is trivial, and the on-disk one is backed by a RandomAccessFile (with buffered reads around it for performance). A rough sketch of this shape is given after this list.
  • Testing against our internal workloads, this greatly reduced memory pressure, as there were cases where we would previously store single files ranging between 100MB and 1GB in memory, in addition to their parsed ASTs.
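For illustration, a minimal sketch of the shape described above, using hypothetical names (ResolvedFileSketch, InMemoryFile, OnDiskFile) rather than the PR's actual interface:

  import java.io.RandomAccessFile
  import java.nio.charset.StandardCharsets
  import java.nio.file.{Files, Paths}

  // Hypothetical shape only; class and method names are illustrative, not the PR's.
  trait ResolvedFileSketch {
    def readString(): String                              // materialize the whole content
    def readSlice(offset: Long, length: Int): Array[Byte] // pull bytes lazily for a parser input
  }

  // Old behavior: the file content lives in memory as a String.
  class InMemoryFile(content: String) extends ResolvedFileSketch {
    def readString(): String = content
    def readSlice(offset: Long, length: Int): Array[Byte] = {
      // treats offsets as character indices for simplicity
      val end = math.min(content.length.toLong, offset + length).toInt
      content.substring(offset.toInt, end).getBytes(StandardCharsets.UTF_8)
    }
  }

  // New option: the file stays on disk and is read on demand.
  class OnDiskFile(path: String) extends ResolvedFileSketch {
    def readString(): String =
      new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
    def readSlice(offset: Long, length: Int): Array[Byte] = {
      // A real implementation would keep a buffer around the current position for performance.
      val raf = new RandomAccessFile(path, "r")
      try {
        raf.seek(offset)
        val n = math.min(length.toLong, math.max(0L, raf.length() - offset)).toInt
        val buf = new Array[Byte](n)
        raf.readFully(buf)
        buf
      } finally raf.close()
    }
  }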

Misc

  • Add support for customizing compression levels in std.xz. The default level (6) is very expensive (in both time and space), so individual call sites can opt for lower levels.
  • Use (path, CRC32) as the ParseCache key. CRC32 should be sufficient to detect changes in a given file, and it is significantly more performant than easily available alternatives (~10x faster than MD5). A sketch of the checksum computation follows below.
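For the (path, CRC32) key, the checksum can be computed by streaming the file through java.util.zip.CRC32 in fixed-size chunks, so even a multi-hundred-MB file never needs to be held in memory just to derive a cache key. A sketch (the helper name is illustrative, not the PR's actual code):

  import java.io.{BufferedInputStream, FileInputStream}
  import java.util.zip.CRC32

  // Stream a file through a CRC32 checksum without loading it whole.
  def crc32OfFile(path: String, chunkSize: Int = 64 * 1024): Long = {
    val crc = new CRC32
    val in = new BufferedInputStream(new FileInputStream(path))
    try {
      val buf = new Array[Byte](chunkSize)
      var n = in.read(buf)
      while (n != -1) {
        crc.update(buf, 0, n)
        n = in.read(buf)
      }
    } finally in.close()
    crc.getValue
  }

  // The parse cache key then becomes (path, crc32OfFile(path)) rather than (path, fileContents).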

@ahirreddy changed the title from "[WIP] Avoid Storing Large Strings in Memory" to "Avoid Storing Large Strings in Memory" on Dec 29, 2023
@lihaoyi-databricks (Contributor)

Can we use a fastparse.ReaderParserInput or IteratorParserInput here? This kind of use case is exactly what those guys are for (see Streaming Parsing docs), and it would avoid us needing to maintain our own bespoke implementation

@ahirreddy (Collaborator, Author)

I tried using that one, but it doesn't support backtracking - which is used at some point when dealing with syntax errors

@lihaoyi-databricks (Contributor) commented Dec 31, 2023

ReaderParserInput is meant to support backtracking, in that it is supposed to buffer up the input text until it hits a cut that allows it to call dropBuffer. Is that not working for some reason, or is it failing with some error? Or is it simply keeping a larger buffer than is possible with the random-access interface that you are using?

On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing, or is the change of the ParseCache key to (path, CRC) sufficient? I would expect the number of files being parsed at any point in time (O(10s)) to be dwarfed by the number of files that were previously being kept in the parse cache (O(10000s)), so maybe reading individual files into memory during parsing is fine as long as we don't keep the file contents around afterwards. This is worth testing, because if true it would make the change trivial and avoid needing to maintain a bunch of fiddly ParserInput/BufferedRandomAccessFile code.

@ahirreddy (Collaborator, Author)

ReaderParserInput definitely doesn't support backtracking. It extends BufferedParserInput, for which checkTraceable throws:

  def checkTraceable() = throw new RuntimeException(
    s"Cannot perform `.traced` on an `${getClass.getName}`, as it needs to parse " +
      "the input a second time to collect traces, which is impossible after an " +
      "`IteratorParserInput` is used once and the underlying Iterator exhausted."
  )

I don't see any code that would make it work for Reader even if we changed checkTraceable. I think we'd need to call mark(position) on the underlying Reader and then reset() whenever we need to go backwards. That's basically what I did for our use case, just with a RandomAccessFile directly.
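As a tiny illustration of why a RandomAccessFile makes this easy (hypothetical code, not the PR's ParserInput): going backwards is just a seek to an earlier offset, with no requirement to buffer everything since the last mark().

  import java.io.RandomAccessFile

  // Reads a single byte at an arbitrary offset; a real ParserInput would decode
  // characters and keep a small read buffer around the current position.
  def byteAt(raf: RandomAccessFile, offset: Long): Int = {
    raf.seek(offset) // jump anywhere, forwards or backwards
    raf.read()       // returns -1 past the end of the file
  }

  // Usage sketch:
  //   val raf = new RandomAccessFile("large.jsonnet", "r")
  //   byteAt(raf, 0L)       // read near the start
  //   byteAt(raf, 100000L)  // jump far ahead
  //   byteAt(raf, 1L)       // backtrack by simply seeking back
  //   raf.close()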

> On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing,

On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory. Depending on the order in which Bazel requests a particular compile, you could easily OOM if you concurrently buffer and parse several of these, so I figured the easiest thing was just to avoid the in-memory buffering. There are further issues with 100+MB arrays: they cause heap fragmentation and large-object pinning, so even with plenty of free space the GC may still declare OOM because it can't move the huge objects and can lose the ability to allocate new ones.

@lihaoyi-databricks (Contributor)

Got it. The first parse should be fine, but the call to .traced that we do on failure does a second parse that doesn't work with java.io.Reader. I think the current approach is fine then

@lihaoyi-databricks (Contributor)

I think this looks fine. @szeiger leaving this open for a bit for you to take a look before we merge it

  private[this] def readPath(path: Path): Option[ResolvedFile] = {
    val osPath = path.asInstanceOf[OsPath].p
    if (os.exists(osPath) && os.isFile(osPath)) {
      Some(new CachedResolvedFile(path.asInstanceOf[OsPath], memoryLimitBytes = 2048L * 1024L * 1024L))
Contributor:

What's the significance of this number? Should it just be Int.MaxValue.toLong or Int.MaxValue.toLong + 1?

Collaborator (Author):

Changed to Int.MaxValue.toLong and added scaladoc:

  * @param memoryLimitBytes The maximum size of a file that we will resolve. This is not the size of
  * the buffer, but a mechanism to fail when being asked to resolve (and downstream parse) a file
  * that is beyond this limit.

Basically, we have some pathological imports (1GB+) which I eventually want to ban (all the ones I found could be trivially modified upstream to not produce such huge files). In a followup we can make this param configurable.
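For concreteness, a minimal illustration of what such a limit check amounts to (hypothetical helper, not the PR's actual code):

  import java.nio.file.{Files, Paths}

  // Refuse to resolve (and later parse) files larger than the configured limit.
  def checkWithinLimit(path: String, memoryLimitBytes: Long = Int.MaxValue.toLong): Unit = {
    val size = Files.size(Paths.get(path))
    require(size <= memoryLimitBytes,
      s"$path is $size bytes, which exceeds the resolve limit of $memoryLimitBytes bytes")
  }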


cc @carl-db 

@szeiger (Collaborator) commented Jan 2, 2024

Why are we keeping the source text of parsed files at all? AFAIR it's just used for cache invalidation. Maybe we could replace it with hashes. Parsing could be done with a standard streaming parse, or it could even load the entire file into memory temporarily.

@ahirreddy (Collaborator, Author)

@szeiger This PR does replace the source text in the cache key with a CRC hash (which was much faster than MD5 and made a huge difference for ~1GB files). Right now the PR will keep files of up to 1MB in memory, but I can change that so that we always stream-parse.

Before:

  parseCache.getOrElseUpdate((path, txt), {
    val parsed = fastparse.parse(txt, new Parser(path, strictImportSyntax).document(_)) match {

After:

  def parse(path: Path, content: ResolvedFile)(implicit ev: EvalErrorScope): Either[Error, (Expr, FileScope)] = {
    parseCache.getOrElseUpdate((path, content.contentHash.toString), {
Collaborator:

Oh, I see, we're storing the hash as a string which explains why the ParseCache is unchanged.

@szeiger (Collaborator) commented Jan 3, 2024

> On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory.

We may want to skip Jsonnet parsing for these files altogether. If the file name ends with .json we could simply use uJSON and have it generate an Sjsonnet AST consisting of static objects.
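A rough sketch of that idea, assuming a stand-in StaticValue tree rather than Sjsonnet's actual static AST nodes (only the ujson calls are real API):

  import java.nio.file.{Files, Paths}

  // Stand-in for whatever static AST nodes Sjsonnet would actually use.
  sealed trait StaticValue
  object StaticValue {
    case object Null extends StaticValue
    case class Bool(value: Boolean) extends StaticValue
    case class Num(value: Double) extends StaticValue
    case class Str(value: String) extends StaticValue
    case class Arr(items: Seq[StaticValue]) extends StaticValue
    case class Obj(fields: Map[String, StaticValue]) extends StaticValue
  }

  def loadJsonAsStatic(path: String): Option[StaticValue] =
    if (!path.endsWith(".json")) None // fall back to the normal Jsonnet parser
    else {
      def convert(v: ujson.Value): StaticValue = v match {
        case ujson.Null     => StaticValue.Null
        case ujson.Bool(b)  => StaticValue.Bool(b)
        case ujson.Num(d)   => StaticValue.Num(d)
        case ujson.Str(s)   => StaticValue.Str(s)
        case ujson.Arr(xs)  => StaticValue.Arr(xs.toSeq.map(convert))
        case ujson.Obj(kvs) => StaticValue.Obj(kvs.map { case (k, x) => k -> convert(x) }.toMap)
      }
      // Reads the bytes up front for simplicity; a streaming source could be used
      // if buffering the whole file is a concern.
      Some(convert(ujson.read(Files.readAllBytes(Paths.get(path)))))
    }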

@lihaoyi-databricks lihaoyi-databricks merged commit 74bbe26 into master Jan 5, 2024
1 check passed