
Avoid Storing Large Strings in Memory #194

Merged
48 commits merged into master on Jan 5, 2024

Conversation

@ahirreddy (Collaborator) commented Dec 20, 2023

  • This PR adds a new ResolvedFile interface that encapsulates all file reads (import, importstr, etc.). Behind this interface, a file can be backed by a string in memory (old behavior) or by a file on disk.
  • We implement ParserInputs for each: the in-memory one is trivial, and the on-disk one is backed by a RandomAccessFile (with buffered reads around it for performance). A rough sketch of this shape is given after this list.
  • Testing against our internal workloads, this greatly reduced memory pressure, as there were cases where we would previously store single files ranging between 100MB and 1GB in memory, in addition to their parsed ASTs.
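For illustration, a minimal sketch of the shape described above, using hypothetical names (ResolvedFileSketch, InMemoryFile, OnDiskFile) rather than the PR's actual interface:

  import java.io.RandomAccessFile
  import java.nio.charset.StandardCharsets
  import java.nio.file.{Files, Paths}

  // Hypothetical shape only; class and method names are illustrative, not the PR's.
  trait ResolvedFileSketch {
    def readString(): String                              // materialize the whole content
    def readSlice(offset: Long, length: Int): Array[Byte] // pull bytes lazily for a parser input
  }

  // Old behavior: the file content lives in memory as a String.
  class InMemoryFile(content: String) extends ResolvedFileSketch {
    def readString(): String = content
    def readSlice(offset: Long, length: Int): Array[Byte] = {
      // treats offsets as character indices for simplicity
      val end = math.min(content.length.toLong, offset + length).toInt
      content.substring(offset.toInt, end).getBytes(StandardCharsets.UTF_8)
    }
  }

  // New option: the file stays on disk and is read on demand.
  class OnDiskFile(path: String) extends ResolvedFileSketch {
    def readString(): String =
      new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)
    def readSlice(offset: Long, length: Int): Array[Byte] = {
      // A real implementation would keep a buffer around the current position for performance.
      val raf = new RandomAccessFile(path, "r")
      try {
        raf.seek(offset)
        val n = math.min(length.toLong, math.max(0L, raf.length() - offset)).toInt
        val buf = new Array[Byte](n)
        raf.readFully(buf)
        buf
      } finally raf.close()
    }
  }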

Misc

  • Add support for customizing compression levels in std.xz. The default level (6) is very expensive (in both time and space), so individual call sites can opt for lower levels.
  • Use (path, CRC32) as the ParseCache key. CRC32 should be sufficient to detect changes in a given file, and it is significantly more performant than easily available alternatives (~10x faster than MD5). A sketch of the checksum computation follows below.
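For the (path, CRC32) key, the checksum can be computed by streaming the file through java.util.zip.CRC32 in fixed-size chunks, so even a multi-hundred-MB file never needs to be held in memory just to derive a cache key. A sketch (the helper name is illustrative, not the PR's actual code):

  import java.io.{BufferedInputStream, FileInputStream}
  import java.util.zip.CRC32

  // Stream a file through a CRC32 checksum without loading it whole.
  def crc32OfFile(path: String, chunkSize: Int = 64 * 1024): Long = {
    val crc = new CRC32
    val in = new BufferedInputStream(new FileInputStream(path))
    try {
      val buf = new Array[Byte](chunkSize)
      var n = in.read(buf)
      while (n != -1) {
        crc.update(buf, 0, n)
        n = in.read(buf)
      }
    } finally in.close()
    crc.getValue
  }

  // The parse cache key then becomes (path, crc32OfFile(path)) rather than (path, fileContents).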

@ahirreddy changed the title from "[WIP] Avoid Storing Large Strings in Memory" to "Avoid Storing Large Strings in Memory" on Dec 29, 2023
@lihaoyi-databricks (Contributor)

Can we use a fastparse.ReaderParserInput or IteratorParserInput here? This kind of use case is exactly what those guys are for (see Streaming Parsing docs), and it would avoid us needing to maintain our own bespoke implementation

@ahirreddy (Collaborator, Author)

I tried using that one, but it doesn't support backtracking - which is used at some point when dealing with syntax errors

@lihaoyi-databricks (Contributor) commented Dec 31, 2023

ReaderParserInput is meant to support backtracking, in that it is supposed to buffer up the input text until it hits a cut that allows it to call dropBuffer. Is that not working for some reason, or is it failing with some error? Or is it simply keeping a larger buffer than is possible with the random-access interface that you are using?

On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing, or is the change of the ParseCache key to (path, CRC) sufficient? I would expect the number of files being parsed at any point in time (O(10s)) to be dwarfed by the number of files that were previously being kept in the parse cache (O(10000s)), so maybe reading individual files into memory during parsing is fine as long as we don't keep the file contents around afterwards. This is worth testing, because if true it would make the change trivial and avoid needing to maintain a bunch of fiddly ParserInput/BufferedRandomAccessFile code.

@ahirreddy (Collaborator, Author)

ReaderParserInput definitely doesn't support backtracking. It extends BufferedParserInput, for which checkTraceable throws:

  def checkTraceable() = throw new RuntimeException(
    s"Cannot perform `.traced` on an `${getClass.getName}`, as it needs to parse " +
      "the input a second time to collect traces, which is impossible after an " +
      "`IteratorParserInput` is used once and the underlying Iterator exhausted."
  )

I don't see any code that would make it work for Reader even if we changed checkTraceable. I think we'd need to call mark(position) on the underlying Reader and then reset() whenever we need to go backwards. That's basically what I did for our use case, just with a RandomAccessFile directly.
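As a tiny illustration of why a RandomAccessFile makes this easy (hypothetical code, not the PR's ParserInput): going backwards is just a seek to an earlier offset, with no requirement to buffer everything since the last mark().

  import java.io.RandomAccessFile

  // Reads a single byte at an arbitrary offset; a real ParserInput would decode
  // characters and keep a small read buffer around the current position.
  def byteAt(raf: RandomAccessFile, offset: Long): Int = {
    raf.seek(offset) // jump anywhere, forwards or backwards
    raf.read()       // returns -1 past the end of the file
  }

  // Usage sketch:
  //   val raf = new RandomAccessFile("large.jsonnet", "r")
  //   byteAt(raf, 0L)       // read near the start
  //   byteAt(raf, 100000L)  // jump far ahead
  //   byteAt(raf, 1L)       // backtrack by simply seeking back
  //   raf.close()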

> On further thought, I wonder if using a custom ParserInput is giving us any benefit here: do we actually need to avoid reading individual files into memory for parsing,

On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory. Depending on the order in which Bazel requests a particular compile, you could easily OOM if you concurrently buffer and parse several of these, so I figured the easiest thing was just to avoid the in-memory buffering. There are further issues with 100+MB arrays: they cause heap fragmentation and large-object pinning, so even with plenty of free space the GC may still declare OOM because it can't move the huge objects and can lose the ability to allocate new ones.

@lihaoyi-databricks (Contributor)

Got it. The first parse should be fine, but the call to .traced that we do on failure does a second parse that doesn't work with java.io.Reader. I think the current approach is fine then

@lihaoyi-databricks (Contributor)

I think this looks fine. @szeiger leaving this open for a bit for you to take a look before we merge it

  private[this] def readPath(path: Path): Option[ResolvedFile] = {
    val osPath = path.asInstanceOf[OsPath].p
    if (os.exists(osPath) && os.isFile(osPath)) {
      Some(new CachedResolvedFile(path.asInstanceOf[OsPath], memoryLimitBytes = 2048L * 1024L * 1024L))
Contributor:

What's the significance of this number? Should it just be Int.MaxValue.toLong or Int.MaxValue.toLong + 1?

Collaborator (Author):

Changed to Int.MaxValue.toLong and added scaladoc:

  * @param memoryLimitBytes The maximum size of a file that we will resolve. This is not the size of
  * the buffer, but a mechanism to fail when being asked to resolve (and downstream parse) a file
  * that is beyond this limit.

Basically, we have some pathological imports (1GB+) which I eventually want to ban (all the ones I found could be trivially modified upstream to not produce such huge files). In a followup we can make this param configurable.
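For concreteness, a minimal illustration of what such a limit check amounts to (hypothetical helper, not the PR's actual code):

  import java.nio.file.{Files, Paths}

  // Refuse to resolve (and later parse) files larger than the configured limit.
  def checkWithinLimit(path: String, memoryLimitBytes: Long = Int.MaxValue.toLong): Unit = {
    val size = Files.size(Paths.get(path))
    require(size <= memoryLimitBytes,
      s"$path is $size bytes, which exceeds the resolve limit of $memoryLimitBytes bytes")
  }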


cc @carl-db 

@szeiger (Collaborator) commented Jan 2, 2024

Why are we keeping the source text of parsed files at all? AFAIR it's just used for cache invalidation. Maybe we could replace it with hashes. Parsing could be done with a standard streaming parse, or it could even load the entire file into memory temporarily.

@ahirreddy (Collaborator, Author)

@szeiger This PR does replace the source text in the cache key with a CRC hash (which was much faster than MD5 and made a huge difference for ~1GB files). Right now the PR will keep files of up to 1MB in memory, but I can change that so that we always stream-parse.

Before:

  parseCache.getOrElseUpdate((path, txt), {
    val parsed = fastparse.parse(txt, new Parser(path, strictImportSyntax).document(_)) match {

After:

  def parse(path: Path, content: ResolvedFile)(implicit ev: EvalErrorScope): Either[Error, (Expr, FileScope)] = {
    parseCache.getOrElseUpdate((path, content.contentHash.toString), {
Collaborator:

Oh, I see, we're storing the hash as a string which explains why the ParseCache is unchanged.

@szeiger (Collaborator) commented Jan 3, 2024

> On this point, we actually have 100MB-1GB input JSONs that take up a huge amount of memory.

We may want to skip Jsonnet parsing for these files altogether. If the file name ends with .json we could simply use uJSON and have it generate an Sjsonnet AST consisting of static objects.
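A rough sketch of that idea, assuming a stand-in StaticValue tree rather than Sjsonnet's actual static AST nodes (only the ujson calls are real API):

  import java.nio.file.{Files, Paths}

  // Stand-in for whatever static AST nodes Sjsonnet would actually use.
  sealed trait StaticValue
  object StaticValue {
    case object Null extends StaticValue
    case class Bool(value: Boolean) extends StaticValue
    case class Num(value: Double) extends StaticValue
    case class Str(value: String) extends StaticValue
    case class Arr(items: Seq[StaticValue]) extends StaticValue
    case class Obj(fields: Map[String, StaticValue]) extends StaticValue
  }

  def loadJsonAsStatic(path: String): Option[StaticValue] =
    if (!path.endsWith(".json")) None // fall back to the normal Jsonnet parser
    else {
      def convert(v: ujson.Value): StaticValue = v match {
        case ujson.Null     => StaticValue.Null
        case ujson.Bool(b)  => StaticValue.Bool(b)
        case ujson.Num(d)   => StaticValue.Num(d)
        case ujson.Str(s)   => StaticValue.Str(s)
        case ujson.Arr(xs)  => StaticValue.Arr(xs.toSeq.map(convert))
        case ujson.Obj(kvs) => StaticValue.Obj(kvs.map { case (k, x) => k -> convert(x) }.toMap)
      }
      // Reads the bytes up front for simplicity; a streaming source could be used
      // if buffering the whole file is a concern.
      Some(convert(ujson.read(Files.readAllBytes(Paths.get(path)))))
    }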

@lihaoyi-databricks lihaoyi-databricks merged commit 74bbe26 into master Jan 5, 2024
1 check passed