Fix #23224: Optimize simple tuple extraction #23373

Open: wants to merge 9 commits into base: main

noti0na1
Member

@noti0na1 noti0na1 commented Jun 16, 2025

Fix #23224:

This PR optimizes simple tuple extraction by avoiding unnecessary tuple allocations and refines the typing of bind patterns for named tuples.

  • Optimise makePatDef to reduce tuple creation when a pattern uses only simple variables or wildcards.
  • If the selector of a match has a bottom type, use the type from the pattern for the bind variable.

For example:

def f1: (Int, Int, Int) = (1, 2, 3)
def test1 =
  val (a, b, c) = f1
  a + b + c

Before this PR:

val $1$: (Int, Int, Int) =
  this.f1:(Int, Int, Int) @unchecked match 
    {
      case Tuple3.unapply[Int, Int, Int](a @ _, b @ _, c @ _) =>
        Tuple3.apply[Int, Int, Int](a, b, c)
    }
val a: Int = $1$._1
val b: Int = $1$._2
val c: Int = $1$._3
a + b + c

After this PR:

val $2$: (Int, Int, Int) =
  this.f1:(Int, Int, Int) @unchecked match 
    {
      case $1$ @ Tuple3.unapply[Int, Int, Int](_, _, _) =>
        $1$:(Int, Int, Int)
    }
val a: Int = $2$._1
val b: Int = $2$._2
val c: Int = $2$._3
a + b + c

After this PR, the tree at genBCode is:

val $2$: Tuple3 =  
  matchResult1[Tuple3]:
    {
      case val x1: Tuple3 = this.f1():Tuple3
      if x1 ne null then
        {
          case val $1$: Tuple3 = x1
          return[matchResult1] $1$:Tuple3
        }
        else ()
      throw new MatchError(x1)
    }
val a: Int = Int.unbox($2$._1())
val b: Int = Int.unbox($2$._2())
val c: Int = Int.unbox($2$._3())
a + b + c
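
For readers who prefer source over printed trees, here is a rough hand-written analogue of the two desugarings. It is only a sketch: `sumOld`, `sumNew`, and `tmp` are illustrative names, not what the compiler emits, and the real rewrite happens inside the desugaring step rather than in user code.

```scala
// Stand-in for the method from the example above.
def f1: (Int, Int, Int) = (1, 2, 3)

// Old shape: bind each component, then rebuild an identical tuple.
def sumOld: Int =
  val tmp: (Int, Int, Int) = f1 match
    case (a, b, c) => (a, b, c)
  tmp._1 + tmp._2 + tmp._3

// New shape: bind the whole scrutinee and pass it through unchanged.
def sumNew: Int =
  val tmp: (Int, Int, Int) = f1 match
    case t @ (_, _, _) => t
  tmp._1 + tmp._2 + tmp._3
```

Both versions compute the same value; the difference is only whether a second tuple is constructed in the match arm.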

I searched the compiler sources with the regular expression (val\s*\(\s*[a-zA-Z_]\w*(\s*,\s*[a-zA-Z_]\w*)*\s*\)\s*=) and found 400+ places that use simple tuple extraction like this.
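
To make it clearer what that search does and does not catch, here is a small sketch that applies the same regular expression to individual lines. The helper name `isSimpleTupleExtraction` is made up for illustration; the check is purely textual, so it can also hit comments or strings.

```scala
// Same pattern as quoted above, wrapped in a Scala Regex.
val simpleTupleVal =
  """(val\s*\(\s*[a-zA-Z_]\w*(\s*,\s*[a-zA-Z_]\w*)*\s*\)\s*=)""".r

// Hypothetical helper: true if the line contains a `val (x, y, ...) =`
// whose pattern consists only of plain names or `_`.
def isSimpleTupleExtraction(line: String): Boolean =
  simpleTupleVal.findFirstIn(line).isDefined
```

For example, `val (a, b, c) = f1` and `val (a, _) = p` are matched, while `val (x, Some(y)) = q` and `val x = 1` are not.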

@noti0na1 noti0na1 requested a review from Copilot June 16, 2025 00:23
Copilot

This comment was marked as resolved.

@noti0na1
Member Author

Split some changes into a separate PR, #23380, to diagnose the errors.

@noti0na1 noti0na1 marked this pull request as ready for review June 18, 2025 11:36
@noti0na1 noti0na1 requested a review from odersky June 18, 2025 12:17
@noti0na1 noti0na1 force-pushed the optimize-tuple-extract branch from 423dbad to 545dd69 Compare June 26, 2025 08:27
@noti0na1
Member Author

noti0na1 commented Jun 26, 2025

Added a bytecode test to ensure there is no tuple creation in the generated code.

@noti0na1 noti0na1 requested review from sjrd and smarter June 26, 2025 08:33
Member

@sjrd sjrd left a comment

Is there any evidence that this is actually better? The tuple is a non-escaping allocation, which should be straightforward for the JVM to optimize away. The Scala.js optimizer, for example, routinely stack-allocates the produced tuple.

If this doesn't actually improve run time, we're introducing some complexity in the compiler for no reason.

If it does actually improve performance, why do it only for tuple extraction, and not for extraction of other case classes?

@@ -0,0 +1,43 @@

class Test:
Member

What is the purpose of this test? As a pos test, it doesn't test any rewrite rule.

Member Author

It just exercises and documents the different desugaring cases, which depend on the number of variables in the pattern.

""".stripMargin
checkBCode(code) { dir =>
val c = loadClassNode(dir.lookupName("C.class", directory = false).input)
assertNoInvoke(getMethod(c, "f1"), "scala/Tuple2$", "apply") // no Tuple2.apply call
Member

We should also make sure that there is no new Tuple2.

@@ -2816,6 +2816,7 @@ class Typer(@constructorOnly nestingLevel: Int = 0) extends Namer
if isStableIdentifierOrLiteral || isNamedTuplePattern then pt
else if isWildcardStarArg(body1)
|| pt == defn.ImplicitScrutineeTypeRef
|| pt.isBottomType
Member

Could you comment on why the bottom-related changes are necessary? I find them suspicious. This change should be nothing but an optimization, so it shouldn't change typing.

Member Author

We have tests like: val (a, b) = ???. Since we now bind the tuple pattern to a variable, I want to give the bind variable a tuple type instead of a bottom type.

Member

Is that necessary for correctness? There is really no point in improving the code for such an extraction. By definition, the code will throw before it gets to the tuple extraction.
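
A minimal sketch of the point about bottom-typed selectors: evaluating the right-hand side throws before any component is extracted. The name `extractFromBottom` and the explicit `(Int, Int)` ascription are additions for illustration; the tests mentioned above use the bare form `val (a, b) = ???`.

```scala
import scala.util.Try

// `???` throws NotImplementedError when evaluated, so the pattern
// is never actually matched and `a + b` is never reached.
def extractFromBottom: Try[Int] = Try {
  val (a, b): (Int, Int) = ???
  a + b
}
```

The result is a `Failure` wrapping a `NotImplementedError`, whatever type the bind variable is given.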

@noti0na1
Member Author

noti0na1 commented Jul 1, 2025

Absolutely, the JVM can often optimize such cases when it can prove that the tuple construction is pure and the temporary object doesn't escape. As a compiler, we should still try our best to generate code that is as efficient and non-redundant as possible.

Consider extracting values from a tuple that contains both an Int and a String. Previously, this required unboxing the Int twice, boxing it again, and creating a throwaway tuple, all for no benefit. This PR doesn’t add extra complexity; instead, it removes unnecessary work introduced by an earlier design decision.

I chose to optimize tuple handling because tuples are widely used for returning multiple values, including in the compiler itself (see the regular-expression search results for the codebase). Ideally, we would extend this optimization to other data structures, but that is hard to do before typing.

@sjrd
Member

sjrd commented Jul 1, 2025

As a compiler, we should still try our best to generate code that is as efficient and non-redundant as possible.

Not really. Our job is to generate code that the next compiler in line will be able to optimize. If you're generating assembly, you want to generate code that the processor will be happy about (no unpredictable branches, for example). If you're generating JVM bytecode, you want to generate code that the JVM will be happy about. There is a whole chain of compilers to think about.

This PR doesn’t add extra complexity; instead, it removes unnecessary work introduced by an earlier design decision.

It generates code that is simpler. But the compiler code is definitely more complex. The increase in lines of code in the compiler is clear. These are new code paths that behave in a special way in special situations.

If we are making the compiler more complex, but we are not measurably improving performance, it's a net loss.

@odersky
Contributor

odersky commented Jul 1, 2025

We could maybe benchmark on Scala Native? This might be easier to do. And if we get a net win there it would be a justification.

@noti0na1
Member Author

noti0na1 commented Jul 1, 2025

In a chain of compilers, I think each stage should make a best effort to generate code that makes sense. We should not generate a bunch of nonsensical allocations and boxing and then rely on a black box to "optimize" away every mistake.

So for me this is a matter of principle, and this PR partially fixes the problem; it's not about how many lines are added to the codebase.

@noti0na1
Member Author

noti0na1 commented Jul 1, 2025

We could benchmark this with JMH to monitor the number of tuple allocations and memory usage, if someone can help.

@odersky
Contributor

odersky commented Jul 1, 2025 via email

@noti0na1
Member Author

noti0na1 commented Jul 1, 2025

OK, I was able to write some recursive code that uses tuples to stress-test this, and my PR shows a measurable performance improvement.

// P(n) = P(n-2) + P(n-3)
def padovan(n: Int, v: (Int, Int, Int) = (1, 0, 0)): Int =
  val (v_n_3, v_n_2, v_n_1) = v
  if n == 0 then v_n_3
  else padovan(n - 1, (v_n_2, v_n_1, v_n_2 + v_n_3))


@main def Test =
  val n = 1000000000
  val i = padovan(n)
  println(s"padovan of $n is $i")

The computed value can be ignored because it overflows Int.

Without this PR: ~7.8s
With this PR: ~5.2s

The results are from many runs.
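
For anyone who wants to reproduce a comparison like this, here is a rough timing sketch built around the same `padovan` function. The helpers `timed` and `bench` are hypothetical names, not part of the PR, and a plain `System.nanoTime` loop is only a coarse measurement; a JMH benchmark would be more rigorous.

```scala
// P(n) = P(n-2) + P(n-3), same function as in the comment above.
def padovan(n: Int, v: (Int, Int, Int) = (1, 0, 0)): Int =
  val (v_n_3, v_n_2, v_n_1) = v
  if n == 0 then v_n_3
  else padovan(n - 1, (v_n_2, v_n_1, v_n_2 + v_n_3))

// Hypothetical helper: evaluate `body` once and return its result
// together with the elapsed wall-clock time in milliseconds.
def timed[A](body: => A): (A, Double) =
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1e6
  (result, elapsedMs)

// Hypothetical helper: time `runs` calls of padovan(n), so that
// early (cold) runs can be told apart from later (warm) ones.
def bench(n: Int, runs: Int): Seq[Double] =
  (1 to runs).map(_ => timed(padovan(n))._2)
```

Compiling the same file with and without the change and comparing the later entries of `bench(n, runs)` gives a first impression of whether the difference holds on a given machine.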

@nmichael44

@odersky @sjrd @noti0na1 With openjdk-23.0.1. Comparing the two functions below:

object JitTest:
  def f2(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val (y0, y1, y2, y3) = (x0 * 2, x1 * 3, x2 * 5, x3 * 7)

    y0 + y1 + y2 + y3

  def f4(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val y0 = x0 * 2
    val y1 = x1 * 3
    val y2 = x2 * 5
    val y3 = x3 * 7

    y0 + y1 + y2 + y3

HotSpot (C2) output (after millions of runs):

For 'f2':

[Instructions begin]
  0x000002261aedd580:   xchg   %ax,%ax
[Entry Point]
  # {method} {0x000002266e977ce0} 'f2' '(IIII)I' in 'app/JitTest$'
  # this:     rdx:rdx   = 'app/JitTest$'
  # parm0:    r8        = int
  # parm1:    r9        = int
  # parm2:    rdi       = int
  # parm3:    rsi       = int
  #           [sp+0x20]  (sp of caller)
  0x000002261aedd582:   mov    0x8(%rdx),%r10d
  0x000002261aedd586:   cmp    0x8(%rax),%r10d
  0x000002261aedd58a:   jne    0x000002261adee4e0           ;   {runtime_call ic_miss_stub}
[Verified Entry Point]
  0x000002261aedd590:   sub    $0x18,%rsp
  0x000002261aedd597:   mov    %rbp,0x10(%rsp)
  0x000002261aedd59c:   cmpl   $0x0,0x20(%r15)
  0x000002261aedd5a4:   jne    0x000002261aedd691           ;*synchronization entry
                                                            ; - app.JitTest$::f2@-1 (line 10)
  0x000002261aedd5aa:   shl    %r8d
  0x000002261aedd5ad:   lea    0x80(%r8),%r10d
  0x000002261aedd5b4:   movabs $0x601804ab0,%rcx            ;   {oop(a 'java/lang/Integer'[256] {0x0000000601804ab0})}
  0x000002261aedd5be:   cmp    $0x100,%r10d
  0x000002261aedd5c5:   jb     0x000002261aedd651           ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd5cb:   lea    (%r9,%r9,2),%eax             ;*imul {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - app.JitTest$::f2@11 (line 10)
  0x000002261aedd5cf:   lea    0x80(%rax),%r11d
  0x000002261aedd5d6:   cmp    $0x100,%r11d
  0x000002261aedd5dd:   jb     0x000002261aedd666           ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd5e3:   lea    (%rdi,%rdi,4),%r11d          ;*imul {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - app.JitTest$::f2@17 (line 10)
  0x000002261aedd5e7:   lea    0x80(%r11),%r10d
  0x000002261aedd5ee:   cmp    $0x100,%r10d
  0x000002261aedd5f5:   jb     0x000002261aedd63f
  0x000002261aedd5f7:   lea    0x0(,%rsi,8),%r10d
  0x000002261aedd5ff:   sub    %esi,%r10d
  0x000002261aedd602:   lea    0x80(%r10),%ebx
  0x000002261aedd609:   cmp    $0x100,%ebx
  0x000002261aedd60f:   jb     0x000002261aedd62d
  0x000002261aedd611:   add    %r8d,%eax
  0x000002261aedd614:   add    %r11d,%eax
  0x000002261aedd617:   add    %r10d,%eax
  0x000002261aedd61a:   add    $0x10,%rsp
  0x000002261aedd61e:   pop    %rbp
  0x000002261aedd61f:   cmp    0x448(%r15),%rsp             ;   {poll_return}
  0x000002261aedd626:   ja     0x000002261aedd67b
  0x000002261aedd62c:   ret    
  0x000002261aedd62d:   movslq %r10d,%r10                   ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd630:   mov    0x210(%rcx,%r10,4),%r10d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@26 (line 10)
  0x000002261aedd638:   mov    0xc(%r12,%r10,8),%r10d       ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@69 (line 10)
  0x000002261aedd63d:   jmp    0x000002261aedd611
  0x000002261aedd63f:   movslq %r11d,%r10                   ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd642:   mov    0x210(%rcx,%r10,4),%r11d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@18 (line 10)
  0x000002261aedd64a:   mov    0xc(%r12,%r11,8),%r11d       ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@59 (line 10)
  0x000002261aedd64f:   jmp    0x000002261aedd5f7
  0x000002261aedd651:   movslq %r8d,%r10                    ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd654:   mov    0x210(%rcx,%r10,4),%r11d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd65c:   mov    0xc(%r12,%r11,8),%r8d        ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@39 (line 10)
  0x000002261aedd661:   jmp    0x000002261aedd5cb
  0x000002261aedd666:   movslq %eax,%r10                    ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd669:   mov    0x210(%rcx,%r10,4),%r10d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@12 (line 10)
  0x000002261aedd671:   mov    0xc(%r12,%r10,8),%eax        ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@49 (line 10)
  0x000002261aedd676:   jmp    0x000002261aedd5e3           ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd67b:   movabs $0x2261aedd61f,%r10          ;   {internal_word}
  0x000002261aedd685:   mov    %r10,0x460(%r15)
  0x000002261aedd68c:   jmp    0x000002261adf53e0           ;   {runtime_call SafepointBlob}
  0x000002261aedd691:   call   Stub::nmethod_entry_barrier  ;   {runtime_call StubRoutines (final stubs)}
  0x000002261aedd696:   jmp    0x000002261aedd5aa
  0x000002261aedd69b:   hlt    
  0x000002261aedd69c:   hlt    
  0x000002261aedd69d:   hlt    
  0x000002261aedd69e:   hlt    
  0x000002261aedd69f:   hlt    
[Exception Handler]
  0x000002261aedd6a0:   jmp    0x000002261ae27160           ;   {no_reloc}
[Deopt Handler Code]
  0x000002261aedd6a5:   call   0x000002261aedd6aa
  0x000002261aedd6aa:   subq   $0x5,(%rsp)
  0x000002261aedd6af:   jmp    0x000002261adf4680           ;   {runtime_call DeoptimizationBlob}
  0x000002261aedd6b4:   hlt    
  0x000002261aedd6b5:   hlt    
  0x000002261aedd6b6:   hlt    
  0x000002261aedd6b7:   hlt    
--------------------------------------------------------------------------------
[/Disassembly]

And for f4:

[Instructions begin]
  0x000002261aef9a00:   xchg   %ax,%ax
[Entry Point]
  # {method} {0x000002266e977f40} 'f4' '(IIII)I' in 'app/JitTest$'
  # this:     rdx:rdx   = 'app/JitTest$'
  # parm0:    r8        = int
  # parm1:    r9        = int
  # parm2:    rdi       = int
  # parm3:    rsi       = int
  #           [sp+0x20]  (sp of caller)
  0x000002261aef9a02:   mov    0x8(%rdx),%r10d
  0x000002261aef9a06:   cmp    0x8(%rax),%r10d
  0x000002261aef9a0a:   jne    0x000002261adee4e0           ;   {runtime_call ic_miss_stub}
[Verified Entry Point]
  0x000002261aef9a10:   sub    $0x18,%rsp
  0x000002261aef9a17:   mov    %rbp,0x10(%rsp)
  0x000002261aef9a1c:   cmpl   $0x0,0x20(%r15)
  0x000002261aef9a24:   jne    0x000002261aef9a6e           ;*synchronization entry
                                                            ; - app.JitTest$::f4@-1 (line 23)
  0x000002261aef9a2a:   lea    (%r9,%r9,2),%r11d
  0x000002261aef9a2e:   lea    (%rdi,%rdi,4),%r10d
  0x000002261aef9a32:   lea    (%r11,%r8,2),%r8d
  0x000002261aef9a36:   add    %r8d,%r10d
  0x000002261aef9a39:   lea    0x0(,%rsi,8),%eax
  0x000002261aef9a40:   sub    %esi,%eax
  0x000002261aef9a42:   add    %r10d,%eax                   ;*iadd {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - app.JitTest$::f4@32 (line 28)
  0x000002261aef9a45:   add    $0x10,%rsp
  0x000002261aef9a49:   pop    %rbp
  0x000002261aef9a4a:   cmp    0x448(%r15),%rsp             ;   {poll_return}
  0x000002261aef9a51:   ja     0x000002261aef9a58
  0x000002261aef9a57:   ret    
  0x000002261aef9a58:   movabs $0x2261aef9a4a,%r10          ;   {internal_word}
  0x000002261aef9a62:   mov    %r10,0x460(%r15)
  0x000002261aef9a69:   jmp    0x000002261adf53e0           ;   {runtime_call SafepointBlob}
  0x000002261aef9a6e:   call   Stub::nmethod_entry_barrier  ;   {runtime_call StubRoutines (final stubs)}
  0x000002261aef9a73:   jmp    0x000002261aef9a2a
[Exception Handler]
  0x000002261aef9a78:   jmp    0x000002261ae27160           ;   {no_reloc}
[Deopt Handler Code]
  0x000002261aef9a7d:   call   0x000002261aef9a82
  0x000002261aef9a82:   subq   $0x5,(%rsp)
  0x000002261aef9a87:   jmp    0x000002261adf4680           ;   {runtime_call DeoptimizationBlob}
  0x000002261aef9a8c:   hlt    
  0x000002261aef9a8d:   hlt    
  0x000002261aef9a8e:   hlt    
  0x000002261aef9a8f:   hlt    
--------------------------------------------------------------------------------
[/Disassembly]

It's not even close. In f2 we allocate, box, and unbox; it's all over the place. You just can't rely on HotSpot to always clean up. My vote is to merge.

@He-Pin
Contributor

He-Pin commented Jul 2, 2025

Another case is val (a, b) = (b, a); I hope this kind of case can be optimized too.

Development

Successfully merging this pull request may close these issues.

Optimise simple tuple extraction
5 participants