
Project Blog Post #539


Open · wants to merge 1 commit into base: 2025sp

Conversation

@ananyagoenka ananyagoenka (Contributor) commented May 13, 2025

Closes #499

@sampsyo sampsyo (Owner) left a comment:

Looking good overall! The design and implementation both sound solid here. I have a few questions about your eval results that would be great to address before we publish!

title = "Sailing Bril’s IR into Concurrent Waters"
[extra]
bio = """
Ananya Goenka is an undergraduate studying CS at Cornell. When she isn’t nerding out over programming languages or weird ISA quirks, she’s writing for Creme de Cornell or getting way too invested in obscure books.
@sampsyo (Owner) commented:

The magazine looks cool! Would you be interested in adding a hyperlink? 😃


We introduced two new opcodes:

* { "op": "spawn", "dest": "t", "type": "thread", "funcs": \["worker"\], "args": \["x", "y"\] }
@sampsyo (Owner) commented:

This would probably be a little easier to read if it were in Markdown code backticks.


* **join**: An effect operation that takes a single thread handle argument and blocks until the corresponding thread completes.

To prevent clients from forging thread IDs, we defined a new primitive type thread in our TypeScript definitions (bril.ts). This opaque type ensures only genuine spawn instructions can produce valid thread handles.
@sampsyo (Owner) commented:

More code backticks are in order:

  • thread -> `thread`
  • bril.ts -> `bril.ts`

…and consider doing this for filenames and TypeScript symbol names throughout.
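For illustration, here is a minimal sketch of how the opaque `thread` type and the two opcodes might be declared in `bril.ts`; the type and field names below are assumptions for exposition, not the actual definitions:

```ts
// Hypothetical sketch only: names here are illustrative, not the real bril.ts.

// The primitive type set gains an opaque "thread" member, so only the
// interpreter can mint valid handles.
type PrimType = "int" | "bool" | "float" | "thread";

// spawn is a value operation: it names the function to run, forwards
// arguments, and produces a thread handle.
interface SpawnOperation {
  op: "spawn";
  dest: string;      // variable that receives the thread handle
  type: "thread";
  funcs: string[];   // the Bril function to run concurrently
  args: string[];    // arguments forwarded to that function
}

// join is an effect operation: one thread-handle argument, no result.
interface JoinOperation {
  op: "join";
  args: string[];    // exactly one thread handle
}
```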

2. **Shared heap**: All heap allocations (alloc, load, store) target a single global Heap instance, exposing potential data races—a faithful reflection of real-world concurrency.


### 2.3 Interpreter Implementation
@sampsyo (Owner) commented:

Maybe add a quick overview here to note that this is about an implementation in the reference interpreter, which is written in TS and runs on Deno.


#### Stubbed Concurrency (Option A)

Our first pass implemented spawn/join synchronously in-process: spawn would directly call evalFunc(...) and immediately resolve, making join a no-op. This stub served as a correctness check and allowed us to validate the grammar and TypeScript types without introducing asynchrony.
@sampsyo (Owner) commented:

Instead of referring to internal interpreter implementation details (the evalFunc function), it might be a little clearer to just say abstractly how this works: we recursively call the interpreter to run the function directly, just like a function call.
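For concreteness, the Option A stub behaves roughly like the following sketch; the names are illustrative and the interpreter's actual internals are more involved:

```ts
// Simplified sketch of the synchronous Option A stub; `run` stands in for
// whatever machinery the interpreter uses to evaluate a Bril function.

type Value = bigint | boolean | number | { threadId: number };
type RunFunction = (name: string, args: Value[]) => void;

// spawn: interpret the target function to completion right away, exactly like
// an ordinary call, then hand back a dummy handle.
function evalSpawnStub(run: RunFunction, funcName: string, args: Value[]): Value {
  run(funcName, args);     // recursive interpretation, no asynchrony
  return { threadId: 0 };  // nothing is actually pending
}

// join: the "thread" already finished inside spawn, so there is nothing to
// wait for.
function evalJoinStub(_handle: Value): void {}
```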


3. The main isolate is busy servicing heap requests from _both_ workers. The event loop context‐switching and message queue flooding create contention, so neither the “workers” nor the main thread run at full core capacity.

I'd expect that this implementation of concurrency would help, on moderately coarse workloads** (e.g. 100 k or splitting a 100 × 100 matrix) still see _some_ parallelism, because the computation per RPC is nontrivial (simple arithmetic plus pointer arithmetic inside V8). In our tests, the sequential run was ~4 s, the concurrent ~45 s—still slower, but less horrific than the 10 M sum case.
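To make the setup above concrete, here is a rough sketch of what spawning one Bril thread could look like on the main isolate, assuming the "workers" mentioned here are Deno web workers (the module name is hypothetical):

```ts
// Assumption: spawned Bril threads run as Deno web workers, as the "workers"
// and message traffic described above suggest. The module name is hypothetical.

const worker = new Worker(new URL("./brili_worker.ts", import.meta.url).href, {
  type: "module",
});

// Tell the worker which Bril function to run and with what arguments; all of
// its heap traffic then flows back to the main isolate as messages.
worker.postMessage({ func: "work", args: [1n, 2n] });

// The completion message is what a later `join` would wait on.
worker.onmessage = (e) => {
  console.log("thread finished:", e.data);
};
```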
@sampsyo (Owner) commented:

Stray **.


@sampsyo (Owner) commented:

Maybe this is because of the parenthetical, but I lost track of which benchmark you're talking about at which point. Also, this bit is confusing:

because the computation per RPC is nontrivial (simple arithmetic plus pointer arithmetic inside V8)

Is the computation per RPC simple or complex?
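For context on what one "RPC" costs in this design, here is an illustrative sketch of the worker-side proxy heap (names are made up, not the actual interpreter code): every load becomes a full message round trip to the main isolate, while the useful work per element is only an addition.

```ts
// Illustrative worker-side proxy heap: each load is one postMessage to the
// main isolate plus one awaited reply.

type Reply = { id: number; value: unknown };
type Port = {
  postMessage(msg: unknown): void;
  onmessage: ((e: { data: Reply }) => void) | null;
};

function makeRemoteHeap(port: Port) {
  let nextId = 0;
  const pending = new Map<number, (v: unknown) => void>();

  port.onmessage = (e) => {
    // A reply from the main isolate resolves the matching pending request.
    pending.get(e.data.id)?.(e.data.value);
    pending.delete(e.data.id);
  };

  return {
    load(ptr: unknown): Promise<unknown> {
      const id = nextId++;
      return new Promise((resolve) => {
        pending.set(id, resolve);
        port.postMessage({ kind: "load", id, ptr }); // one round trip per access
      });
    },
  };
}
```

The round trip itself is the expensive part; the arithmetic done per element is trivial, which is presumably why the 10 M-element sum loses so badly.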



Some lessons/thoughts.  To actually _win_ with real parallelism under this design, we must batch memory operations. For example, transform long loops into single RPC calls that process entire slices (e.g. “sum these 1 000 elements” in one go), amortizing the message‐passing cost.SharedArrayBuffers could eliminate RPC entirely by mapping our Bril heap into a typed array visible in all workers. Then each load/store is a direct memory access, and you’d see true multicore speedups on large-N benchmarks. For an intermediate step, we could group every 1 000 loads/stores into one batched message, cutting messaging overhead by two orders of magnitude, which should already push the breakeven point down toward the 10 M-element range.
@sampsyo (Owner) commented:

Missing a space before SharedArrayBuffers.
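To illustrate the batching idea, here is a small sketch under the post's assumptions (the request shape and handler name are hypothetical): instead of one message per load, the worker asks the main isolate to process a whole slice in a single RPC.

```ts
// Sketch of the "sum these 1,000 elements in one go" idea: one message, one
// loop on the main isolate, one reply, instead of one round trip per element.

type SliceRequest = { kind: "sum_slice"; base: number; length: number };

// Main-isolate handler: the cost of a single postMessage round trip is
// amortized over `length` loads.
function handleSliceRequest(heap: number[], req: SliceRequest): number {
  let total = 0;
  for (let i = 0; i < req.length; i++) {
    total += heap[req.base + i];
  }
  return total;
}
```

A SharedArrayBuffer-backed heap would go further, eliminating the message entirely and turning each load or store into a direct typed-array access, as the paragraph above suggests.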


```
[escape] main: 0/1 allocs are thread-local # spawn causes escape
[escape] mixed: 1/2 allocs are thread-local # one local, one escaping
```
@sampsyo (Owner) commented:

Instead of just pasting the output of your tool, can you explain what this is saying? What are "main" and "mixed"? Why are the numbers so small (1 and 2)?


* **Interprocedural Escape Analysis**: Extend escape.py to track pointers across function boundaries and calls, increasing precision and enabling stack-based allocation for truly local objects.

* **Robust Testing Harness**: Integrate with continuous integration (CI) to run our concurrency and escape-analysis suites on every commit, ensuring regressions are caught early.
@sampsyo (Owner) commented:

Hmm, aren’t your interpreter tests enabled by default already? It also seems pretty easy to add your escape analysis tests to the main test suite, assuming they already use Turnt.

@sampsyo sampsyo added the 2025sp label May 16, 2025
@sampsyo sampsyo (Owner) commented May 28, 2025

Hi, @ananyagoenka! I would love to publish your blog post. Can you please wrap up the revisions discussed above so I can hit the green button?

Successfully merging this pull request may close these issues.

Project Proposal: Concurrency Extension for Bril