Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First Class Structs: stop lying about underlying type. #1231

Closed
sandreenko opened this issue Dec 31, 2019 · 17 comments · Fixed by #37745
Closed

First Class Structs: stop lying about underlying type. #1231

sandreenko opened this issue Dec 31, 2019 · 17 comments · Fixed by #37745
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Milestone

Comments

@sandreenko
Copy link
Contributor

sandreenko commented Dec 31, 2019

I am trying to get rid of some pessimizations following first-class-structs design, cc @CarolEidt.

The main performance issue is struct returns that are wrapped as int or long types. In these cases even if we inline these calls we can't promote/enregister structs or their fields.

To stop retyping we need to stop calling impFixupCallStructReturn and fgFixupStructReturn, like this.

but then you will get errors in rationalize because it doesn't expect ASG struct trees.
I see two possible solutions:

  1. teach rationalize how to work with these types;
  2. start generating STORE_LCL/BLK/OBJ in importer, like in gtNewBlkOpNode gentree.cpp.

the second interferes with the work of deleting ASG from @mikedn, so that issue is created to discuss how we can make progress without creating merge conflicts. @mikedn how do you see ASG nodes replacement, are you planning to delete it completely, including importation phase?

An example how we keep struct type with the second approach 1. System.IO.FileSystem:FillAttributeInfo(System.String,byref,bool):int (MethodHash=333db0b0) (from VectorAdd_r.dll)

for such IL:

    [ 0]  10 (0x00a) call 0600018E
    [ 1]  15 (0x00f) stloc.1

the old generates:

STMT00004 (IL   ???...  ???)
    [000013] -AC---------              *  ASG       long  
    [000012] *-----------              +--*  IND       long  
    [000011] ------------              |  \--*  ADDR      byref 
    [000010] ------------              |     \--*  LCL_VAR   struct<System.IO.DisableMediaInsertionPrompt, 8> V04 loc1         
    [000009] --C---------              \--*  RET_EXPR  long  (inl return from call [000008])

then transforms it to:

N003 ( 18, 10) [000013] DACXG-------              *  STORE_LCL_FLD long   V04 loc1         [+0]
                                                  *    bool   V04._disableSuccess (offs=0x00) -> V15 tmp8         
                                                  *    int    V04._oldMode (offs=0x04) -> V16 tmp9  

and then generates:
mov qword ptr [V04 rbp-290H], rax

the new importer:

    [000012] -AC---------              *  STORE_LCL_VAR    struct<System.IO.DisableMediaInsertionPrompt, 8> V04 loc1         
    [000009] --C---------              \--*  RET_EXPR  struct(inl return from call [000008])

then:

N001 ( 14,  5) [000008] --CXG-------         t8 =    CALL      struct System.IO.DisableMediaInsertionPrompt.Create
                                                  /--*  t8     struct 
N003 ( 18,  8) [000012] DA-XG-------              *  STORE_LCL_VAR struct<System.IO.DisableMediaInsertionPrompt, 

then:
mov qword ptr [V04 rbp-29CH], rax

so we can keep STRUCT type and with changes in lowering can enregister them.

Other examples. 2. from VectorAddTest`1[Single][System.Single]:VectorAdd(float,float,float):int (MethodHash=051d27ab) (from VectorAdd_r.dll)

for such IL:

    [ 0]   1 (0x001) ldloca.s 0
    [ 1]   3 (0x003) ldarg.0
    [ 2]   4 (0x004) call 0A00002B

the old importer generates:

STMT00002 (IL 0x001...  ???)
               [000007] -A--G-------              *  ASG       simd32 (copy)
               [000006] ----G--N----              +--*  BLK       simd32<32>
               [000003] ------------              |  \--*  ADDR      byref 
               [000002] ------------              |     \--*  LCL_VAR   simd32<System.Numerics.Vector`1[Single]> V03 loc0         
               [000005] -------N----              \--*  SIMD      simd32 float init
               [000004] ------------                 \--*  LCL_VAR   float  V00 arg0 

then rationalize changes it to:

    N001 (  3,  4) [000004] ------------         t4 =    LCL_VAR   float  V00 arg0         
                                                      /--*  t4     float  
    N002 (  4,  5) [000005] -------N----         t5 = *  SIMD      simd32 float init
                                                      /--*  t5     simd32 
    N004 (  8,  8) [000007] DA--G-------              *  STORE_LCL_VAR simd32<System.Numerics.Vector`1[Single]> V03 

and the final result is:

    Generating: N030 (  3,  4) [000004] ------------         t4 =    LCL_VAR   float  V00 arg0          mm0 REG mm0
    IN0005:        vmovss   xmm0, dword ptr [V00 rbp+10H]
                                                                  /--*  t4     float  
    Generating: N032 (  4,  5) [000005] -------N----         t5 = *  SIMD      simd32 float init REG mm0
    IN0006:        vbroadcastss ymm0, ymm0
                                                                  /--*  t5     simd32 
    Generating: N034 (  8,  8) [000007] DA--G-------              *  STORE_LCL_VAR simd32<System.Numerics.Vector`1[Single]> V03 loc0          NA REG NA
    IN0007:        vmovupd  ymmword ptr[V03 rbp-30H], ymm0
    Added IP mapping: 0x0009 STACK_EMPTY (G_M55380_IG04,ins#4,ofs#19)

the new importer generates:

    [000006] ------------              *  STORE_LCL_VAR simd32<System.Numerics.Vector`1[Single]> V03 loc0         
    [000005] -------N----                 *  SIMD      simd32 float init
    [000004] ------------                 \--*  LCL_VAR   float  V00 arg0 

and doesn't do any tranformations later.
in order to keep STORE_LCL_VAR add DA flags.

  1. System.IO.FileSystem:FillAttributeInfo(System.String,byref,bool):int (MethodHash=333db0b0) (from VectorAdd_r.dll)

for such IL:

    [ 0]  98 (0x062) ldloca.s 2
    [ 1] 100 (0x064) initobj 0200000C

the old importer:

    [000102] IA----------              *  ASG       struct (init)
    [000099] D------N----              +--*  LCL_VAR   struct<WIN32_FIND_DATA, 592> V05 loc2         
    [000101] ------------              \--*  CNS_INT   int    0

then

N001 (  1,  1) [000101] ------------       t101 =    CNS_INT   int    0
N002 (  3,  2) [000099] D------N----        t99 =    LCL_VAR_ADDR byref  V05 loc2         
                                                  /--*  t99    byref  
                                                  +--*  t101   int    
               [000410] -A----------              *  STORE_BLK struct<592> (init)

the new importer generates:

    [000101] -A----------              *  STORE_BLK struct<592> (init)
    [000098] ------------              +--*  LCL_VAR_ADDR byref  V05 loc2         
    [000100] ------------              \--*  CNS_INT   int    0
  1. VectorMathTests.Program:Main(System.String[]):int (MethodHash=2c61fab2) (from AbsSqrt_r.dll)

IL:

    [ 0]   1 (0x001) ldloca.s 0
    [ 1]   3 (0x003) ldc.r4 11.000000000000000
    [ 2]   8 (0x008) ldc.r4 13.000000000000000
    [ 3]  13 (0x00d) ldc.r4 8.0000000000000000
    [ 4]  18 (0x012) ldc.r4 4.0000000000000000
    [ 5]  23 (0x017) call 0A000004

old import:

    [000014] -A--G-------              *  ASG       simd16 (copy)
    [000013] ----G--N----              +--*  BLK       simd16<16>
    [000003] ------------              |  \--*  ADDR      byref 
    [000002] ------------              |     \--*  LCL_VAR   simd16<System.Numerics.Vector4> V01 loc0         
    [000012] -------N----              \--*  SIMD      simd16 float initN
    [000011] ------------                 \--*  LIST      float 
    [000004] ------------                    +--*  CNS_DBL   float  11.000000000000000
    [000010] ------------                    \--*  LIST      float 
    [000005] ------------                       +--*  CNS_DBL   float  13.000000000000000
    [000009] ------------                       \--*  LIST      float 
    [000006] ------------                          +--*  CNS_DBL   float  8.0000000000000000
    [000008] ------------                          \--*  LIST      float 
    [000007] ------------                             \--*  CNS_DBL   float  4.0000000000000000

after rationalize:

    N001 (  3,  4) [000004] ------------         t4 =    CNS_DBL   float  11.000000000000000
    N002 (  3,  4) [000005] ------------         t5 =    CNS_DBL   float  13.000000000000000
    N003 (  3,  4) [000006] ------------         t6 =    CNS_DBL   float  8.0000000000000000
    N004 (  3,  4) [000007] ------------         t7 =    CNS_DBL   float  4.0000000000000000
                                                      /--*  t4     float  
                                                      +--*  t5     float  
                                                      +--*  t6     float  
                                                      +--*  t7     float  
    N009 ( 16, 20) [000012] -------N----        t12 = *  SIMD      simd16 float initN
                                                      /--*  t12    simd16 
    N011 ( 20, 23) [000014] DA--G-------              *  STORE_LCL_VAR simd16<System.Numerics.Vector4> V01 loc0 

new import:

   [000013] DA----------              *  STORE_LCL_VAR simd16<System.Numerics.Vector4> V01 loc0         
   [000012] -------N----                 *  SIMD      simd16 float initN
   [000011] ------------                 \--*  LIST      float 
   [000004] ------------                    +--*  CNS_DBL   float  11.000000000000000
   [000010] ------------                    \--*  LIST      float 
   [000005] ------------                       +--*  CNS_DBL   float  13.000000000000000
   [000009] ------------                       \--*  LIST      float 
   [000006] ------------                          +--*  CNS_DBL   float  8.0000000000000000
   [000008] ------------                          \--*  LIST      float 
   [000007] ------------                             \--*  CNS_DBL   float  4.0000000000000000

category:cq
theme:structs
skill-level:expert
cost:large

@sandreenko sandreenko added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization labels Dec 31, 2019
@sandreenko sandreenko self-assigned this Dec 31, 2019
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 31, 2019
@CarolEidt
Copy link
Contributor

I think the first approach (teach rationalize how to deal with struct ASG trees) is the right one. Are there advantages to doing the second? I know that eventually we want to import assignments as stores, but I guess I don't understand the complexity of handling struct assignments in rationalizer.

@mikedn
Copy link
Contributor

mikedn commented Jan 2, 2020

I am trying to get rid of some pessimizations following first-class-structs design,

Great!

but then you will get errors in rationalize because it doesn't expect ASG struct trees.

Presumably you mean ASG struct trees with a call as RHS? Otherwise ASG struct trees are definitely expected in morph.

the second interferes with the work of deleting ASG from @mikedn

Currently I'm working on ADDR, not ASG. Eventually I'll work on ASG removal as well but I need to finish ADDR first (well, at least the part related to local variables). In particular, I need struct typed LCL_FLDs and I need to get rid of IsLocalAddrExpr (or at least limit its use to non-tracked variables to avoid having to deal with indirect local stores in SSA). Also, I'm considering first creating STORE_LCL|FLD_VAR in LocalAddressVisitor first, just to limit the scope of change a bit. But this definitely has to be done in the importer sooner or later, it's a rather sad state of affairs that the IL has the correct model (load/store) and the importer has to replace that with the assignment model.

start generating STORE_LCL/BLK/OBJ in importer

That's tempting. But it may turn out to be a large amount of work, relative to the problem you're trying to solve. Nothing in the frontend understands these stores so you'll very likely need to patch up at least morph, assertion prop, SSA and VN. Granted, in the specific case of a call as RHS things should be simpler: fgMorphCopyBlock can't promote, VN will produce a "new" VN, assertion prop will just kill all existing assertions related to the LHS etc.

teach rationalize how to work with these types

I'm still not sure what exact problem you have found in rationalization. There's this bit of code:

if ((location->TypeGet() == TYP_STRUCT) && !assignment->IsPhiDefn() && !value->IsMultiRegCall())
{
if ((location->OperGet() == GT_LCL_VAR))
{
// We need to construct a block node for the location.
// Modify lcl to be the address form.
location->SetOper(addrForm(locationOp));
LclVarDsc* varDsc = &(comp->lvaTable[location->AsLclVarCommon()->GetLclNum()]);
location->gtType = TYP_BYREF;
GenTreeBlk* storeBlk = nullptr;
unsigned int size = varDsc->lvExactSize;
if (varDsc->HasGCPtr())
{
CORINFO_CLASS_HANDLE structHnd = varDsc->lvVerTypeInfo.GetClassHandle();
GenTreeObj* objNode = comp->gtNewObjNode(structHnd, location);
objNode->ChangeOper(GT_STORE_OBJ);
objNode->SetData(value);
storeBlk = objNode;
}
else
{
storeBlk = new (comp, GT_STORE_BLK)
GenTreeBlk(GT_STORE_BLK, TYP_STRUCT, location, value, comp->typGetBlkLayout(size));
}
storeBlk->gtFlags |= GTF_ASG;
storeBlk->gtFlags |= ((location->gtFlags | value->gtFlags) & GTF_ALL_EFFECT);
GenTree* insertionPoint = location->gtNext;
BlockRange().InsertBefore(insertionPoint, storeBlk);
use.ReplaceWith(comp, storeBlk);
BlockRange().Remove(assignment);
JITDUMP("After transforming local struct assignment into a block op:\n");
DISPTREERANGE(BlockRange(), use.Def());
JITDUMP("\n");
return;
}
else
{
assert(location->OperIsBlk());
}
}
that I'm not quite sure why it's in rationalization. Rationalization's role is to get rid of ASG. If codegen does not support struct typed STORE_LCL_VAR it's lowering's job to convert these to STORE_OBJ.
In any case, this rationalization code already ignores ASGs with a "multi reg call" as RHS so presumably you can easily extend this to "any reg call".

@mikedn
Copy link
Contributor

mikedn commented Jan 2, 2020

I happen to have a little change I've done a while ago that may be of some help. It's not related to calls but it tries to make STORE_LCL_VAR work with registers sized structs, without having to convert them to STORE_OBJ. It's only an experiment right now but I think that's going in the write direction - better handling of register sized structs.

mikedn@81c5d91

It only handles the STORE_LCL_VAR(LCL_VAR) case now, by allocating a suitable register (GPR or XMM) to the LCL_VAR and using it in STORE_LCL_VAR. It also disables reg size struct handling in fgMorphOneAsgBlockOp, another form of lying that we should fix eventually.

@sandreenko
Copy link
Contributor Author

I think the first approach (teach rationalize how to deal with struct ASG trees) is the right one. Are there advantages to doing the second? I know that eventually we want to import assignments as stores, but I guess I don't understand the complexity of handling struct assignments in rationalizer.

The current way of representation of struct ASG in importer involves many nodes like BLK, ADDR, OBJ:

               [000007] -A--G-------              *  ASG       simd32 (copy)
               [000006] ----G--N----              +--*  BLK       simd32<32>
               [000003] ------------              |  \--*  ADDR      byref 
               [000002] ------------              |     \--*  LCL_VAR   simd32<System.Numerics.Vector`1[Single]> V03 loc0  

we add conservative G flag on BLK in that method:

// At this point, we have a tree that we are going to store into a destination.
// TODO-1stClassStructs: This should be a simple store or assignment, and should not require
// GTF_ALL_EFFECT for the dest. This is currently emulating the previous behavior of
// block ops.
if (doCopyBlk)
{
GenTree* dest = new (this, GT_BLK)
GenTreeBlk(GT_BLK, simdType, copyBlkDst, typGetBlkLayout(getSIMDTypeSizeInBytes(clsHnd)));
dest->gtFlags |= GTF_GLOB_REF;

and it affects the later phases, probably, that can be fixed directly by deleting that setting as the comment says, but I think these nodes have negative impact on other phases before rationalize (like optOptimizeCSEs, optOptimizeValnumCSEs), am I wrong?

@CarolEidt
Copy link
Contributor

The conservative handling of struct copies is something that needs to change. Some of these pessimizations were removed, only to be re-added later as an expedient, rather than fixing CSE. I'm not certain that's the case in this instance, but I think we should pursue this avenue.

@sandreenko
Copy link
Contributor Author

but I guess I don't understand the complexity of handling struct assignments in rationalizer.

that I'm not quite sure why it's in rationalization. Rationalization's role is to get rid of ASG. If codegen does not support struct typed STORE_LCL_VAR it's lowering's job to convert these to STORE_OBJ.
In any case, this rationalization code already ignores ASGs with a "multi reg call" as RHS so presumably you can easily extend this to "any reg call".

I have checked my notes and found that asserts were not in the rationalization phase, sorry for the confusion.

But rationalize was transforming ASG struct into STORE_BLK struct when with fixup in importer we had STORE_LCL_FLD long.

IR without fixup after importation (new):
N003 ( 18,  8) [000012] -ACXG---R---              *  ASG       struct (copy) $VN.Void
N002 (  3,  2) [000010] D------N----              +--*  LCL_VAR   struct<System.IO.DisableMediaInsertionPrompt, 8>(P) V04 loc1         
                                                  +--*    bool   V04._disableSuccess (offs=0x00) -> V15 tmp8         
                                                  +--*    int    V04._oldMode (offs=0x04) -> V16 tmp9         
N001 ( 14,  5) [000008] --CXG-------              \--*  CALL      struct System.IO.DisableMediaInsertionPrompt.Create 

after rationalize:

N001 ( 14,  5) [000008] --CXG-------         t8 =    CALL      struct System.IO.DisableMediaInsertionPrompt.Create $240
N002 (  3,  2) [000010] D------N----        t10 =    LCL_VAR_ADDR byref  V04 loc1         
                                                  *    bool   V04._disableSuccess (offs=0x00) -> V15 tmp8         
                                                  *    int    V04._oldMode (offs=0x04) -> V16 tmp9         
                                                  /--*  t10    byref  
                                                  +--*  t8     struct 
               [000402] -ACXG-------              *  STORE_BLK struct<8> (copy)

IR with fixup after importation (old):

               [000013] -AC---------              *  ASG       long  
               [000012] *-----------              +--*  IND       long  
               [000011] ------------              |  \--*  ADDR      byref 
               [000010] ------------              |     \--*  LCL_VAR   struct<System.IO.DisableMediaInsertionPrompt, 8> V04 loc1         
               [000009] --C---------              \--*  RET_EXPR  long  (inl return from call [000008])

after rationalize:

N027 ( 14,  5) [000008] --CXG-------         t8 =    CALL      long   System.IO.DisableMediaInsertionPrompt.Create REG rax
                                                  /--*  t8     long   
N029 ( 18, 10) [000013] DA-XG-------              *  STORE_LCL_FLD long   V04 loc1         [+0] NA

Is STORE_BLK the right node here? As I see from Struct “objects” as lvalues it is acceptabe, but don't we want to have a simple STORE_LCL_VAR in this case?

@mikedn
Copy link
Contributor

mikedn commented Jan 2, 2020

But rationalize was transforming ASG struct into STORE_BLK struct when with fixup in importer we had STORE_LCL_FLD long.

Yes, because copy/init block codegen does not support struct typed STORE_LCL|FLD_VAR. That's IMO something that should be improved, but it's probably a bit complicated due to the complexity of the struct copying/initialization code. The copy/init block codegen refactoring I did recently should help but it's not clear to me what benefit we would get from making such a change at this time.

Is STORE_BLK the right node here? As I see from Struct “objects” as lvalues it is acceptabe, but don't we want to have a simple STORE_LCL_VAR in this case?

IMO yes, it should be STORE_LCL_VAR. That's exactly what I did in the commit mentioned a few posts above but only in some limited circumstances.

@CarolEidt
Copy link
Contributor

Agreed; we want to move toward using a simple LCL_VAR or STORE_LCL_VAR for struct typed locals in general, but for codegen, as @mikedn points out, this requires more changes for handling larger structs (i.e. those that require more than a single load or store).

@sandreenko
Copy link
Contributor Author

Thanks, that was a very helpful discussion for me. I will stop doing changes before rationalizing (except deletion of retyping), adopt @mikedn's change and come back with results.

@BruceForstall BruceForstall changed the title First Class Structs: stop lying about underlaing type. First Class Structs: stop lying about underlying type. Jan 3, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label Jan 3, 2020
@BruceForstall BruceForstall added this to the 5.0 milestone Jan 3, 2020
@sandreenko
Copy link
Contributor Author

a small update, I was able to compile and run ConsoleMandel.ScalarFloatSinglethreadADT benchmark with both mikedn's change for STORE_LCL_VAR and with Carol's change for enregistering SIMD8, I see the expected performance boost:

|                     Method |                   Toolchain |    Mean |    Error |   StdDev |  Median |     Min |     Max | Ratio | MannWhitney(3ms) | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------------------------- |---------------------------- |--------:|---------:|---------:|--------:|--------:|--------:|------:|----------------- |------:|------:|------:|----------:|
| ScalarFloatSinglethreadADT | \Core_Root_base\CoreRun.exe | 4.870 s | 0.0587 s | 0.0490 s | 4.889 s | 4.751 s | 4.917 s |  1.00 |             Base |     - |     - |     - |     272 B |
| ScalarFloatSinglethreadADT | \Core_Root_diff\CoreRun.exe | 1.492 s | 0.0165 s | 0.0146 s | 1.491 s | 1.471 s | 1.527 s |  0.31 |           Faster |     - |     - |     - |     272 B |

dumps are attached, the difference that V25 tmp11 now can be enregistered.
SIMDTestNew.txt
SIMDTestOld.txt

pri0 test results:

Total tests run    : 2709
Total passing tests: 2701
Total failed tests : 8

the changes are in that branch (but it is not ready for review).

The are many problems, the main are:

  1. struct8 lclVar= struct8 call, it creates awful trees like this:
               [000098] -AC-G-------              *  ASG       struct (copy)
               [000097] *-----------              +--*  BLK       struct<Microsoft.Cci.BlobIdx, 4>
               [000096] ------------              |  \--*  ADDR      byref 
               [000095] ------------              |     \--*  FIELD     struct PermissionSet
               [000082] ------------              |        \--*  ADDR      byref 
               [000081] ------------              |           \--*  LCL_VAR   struct<DeclSecurityRow, 16> V04 loc1         
               [000088] --C-G-------              \--*  CALL      struct Microsoft.Cci.MetadataWriter.GetPermissionSetIndex
               [000083] ------------ this in rcx     +--*  LCL_VAR   ref    V00 this         
               [000094] n----------- arg1            \--*  OBJ       struct<System.Collections.Immutable.ImmutableArray`1[[Microsoft.Cci.ICustomAttribute, Microsoft.CodeAnalysis, Version=1.1.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]], 8>
               [000093] ------------                    \--*  ADDR      byref 
               [000092] ------------                       \--*  LCL_VAR   struct<System.Collections.Immutable.ImmutableArray`1[[Microsoft.Cci.ICustomAttribute, Microsoft.CodeAnalysis, Version=1.1.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]], 8> V12 tmp3 

then there is interaction with LocalAddressVisitor that transforms it into:

Replacing the field in promoted struct with local var V20
LocalAddressVisitor modified statement:
STMT00023 (IL   ???...  ???)
               [000098] -AC-G-------              *  ASG       struct (copy)
               [000097] *-----------              +--*  BLK       struct<Microsoft.Cci.BlobIdx, 4>
               [000096] ------------              |  \--*  ADDR      byref 
               [000095] ------------              |     \--*  LCL_VAR   int    V20 tmp11        
               [000088] --C-G-------              \--*  CALL      struct Microsoft.Cci.MetadataWriter.GetPermissionSetIndex
               [000083] ------------ this in rcx     +--*  LCL_VAR   ref    V00 this         
               [000094] n----------- arg1            \--*  OBJ       struct<System.Collections.Immutable.ImmutableArray`1[[Microsoft.Cci.ICustomAttribute, Microsoft.CodeAnalysis, Version=1.1.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]], 8>
               [000093] ------------                    \--*  ADDR      byref 
               [000092] ------------                       \--*  LCL_VAR   struct<System.Collections.Immutable.ImmutableArray`1[[Microsoft.Cci.ICustomAttribute, Microsoft.CodeAnalysis, Version=1.1.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]], 8>(P) V12 tmp3         
                                                           \--*    ref    V12.array (offs=0x00) -> V24 tmp15 

and then we end up with

               [000098] -ACXG-------              *  ASG       int   
               [000095] D----+-N----              +--*  LCL_VAR   int    V20 tmp11        
               [000088] --CXG+------              \--*  CALL      struct Microsoft.Cci.MetadataWriter.GetPermissionSetIndex
               [000083] -----+------ this in rcx     +--*  LCL_VAR   ref    V00 this         
               [000092] -----+------ arg1 in rdx     \--*  LCL_VAR   ref    V24 tmp15  

so we have int = struct and problems with gtNewTempAssign, that can't determinate if it needs bitcast (from Allow TYP_SIMD8 to be enregistered change).

  1. I keep lying about the return type from the current method, it happens in impFixupStructReturnType and hits asserts when we inline, working on it now.

@CarolEidt
Copy link
Contributor

This looks like great progress!

problems with gtNewTempAssign, that can't determinate if it needs bitcast

I think the right answer is to delay the introduction of the bitcast to Lowering. There may be a reasonable interim solution, but it would require adding back ABI details into places like this, which would be nice to avoid.

I've made some additional changes to my SIMD8 PR (#811) mostly to improve performance. I haven't done much testing on it at this stage, so it may actually be worse than before correctness-wise.

@sandreenko
Copy link
Contributor Author

a small update from the offline discussion with Carol:
We do not want to do retyping for return struct case as well. It happens in impFixupStructReturnType and that was missed in my branch.

It creates many interesting cases, for example, System.ValueTuple:Create():System.ValueTuple from SPC x64 checked:
for such IL:

IL_0000  12 00             ldloca.s     0x0
IL_0002  fe 15 ac 01 00 02 initobj      0x20001AC
IL_0008  06                ldloc.0     
IL_0009  2a                ret 

we are importing:

STMT00000 (IL 0x000...  ???)
               [000003] IA----------              *  ASG       struct (init)
               [000000] D------N----              +--*  LCL_VAR   struct<System.ValueTuple, 1> V00 loc0         
               [000002] ------------              \--*  CNS_INT   int    0
STMT00001 (IL 0x008...  ???)
               [000005] ------------              *  RETURN    struct
               [000004] ------------              \--*  LCL_VAR   struct<System.ValueTuple, 1> V00 loc0

and then was retyped as:

STMT00001 (IL 0x008...  ???)
               [000005] ------------              *  RETURN    int   
               [000004] ------------              \--*  LCL_FLD   byte   V00 loc0         [+0]

and after lowering it looked like:

N001 (  1,  1) [000002] ------------         t2 =    CNS_INT   int    0 $41
N002 (  3,  2) [000000] D------N----         t0 =    LCL_VAR_ADDR byref  V00 loc0         d:1
                                                  /--*  t0     byref  
                                                  +--*  t2     int    
               [000009] -A----------              *  STOREIND  byte  
lowering GT_RETURN
N002 (  5,  6) [000005] ------------              *  RETURN    int    $180

in the new version:
we have LCL_VAR struct == CNS_INT int 0, so assertion propagation replaces [000004] with [000002], so we get:

N002 (  2,  2) [000005] ------------              *  RETURN    struct
N001 (  1,  1) [000004] ------------              \--*  CNS_INT   struct 0

that we don't know how to generate (no struct handle), so temporary I block such assertion propagation but it looks like we are missing an optimization opportunity here.

@mikedn
Copy link
Contributor

mikedn commented Jan 21, 2020

we have LCL_VAR struct == CNS_INT int 0, so assertion propagation replaces [000004] with [000002], so we get:

FWIW using CNS_INT like this has always been a bad idea. It really should be always wrapped in INIT_VAL (which could also have ClassLayout* added to it) or, because default initialized structs values are pretty common, a new oper be created just for this (e.g. GT_INITOBJ).

@CarolEidt
Copy link
Contributor

using CNS_INT like this has always been a bad idea.

Agreed. This was something that was actual made worse by the first round of first class structs changes and I debated but decided not to add the extra node where "not needed" (though I've come to believe that it is, in fact, needed).

because default initialized structs values are pretty common, a new oper be created just for this (e.g. GT_INITOBJ)

That seems like a promising idea; I presume that it would have store semantics, and look a lot like GT_STORE_OBJ but with no source operand. Given that there's an IL instruction called initobj that takes an arbitrary init value, I might call this GT_ZEROOBJ or something like that.

@mikedn
Copy link
Contributor

mikedn commented Jan 21, 2020

I presume that it would have store semantics, and look a lot like GT_STORE_OBJ but with no source operand

Not really. I think that IL's initobj semantics is quite unfortunate and should not be replicated:

C# code:

Foo(new BarStruct());

IL code:

IL_0000  12 00             ldloca.s     0x0
IL_0002  fe 15 03 00 00 02 initobj      0x2000003
IL_0008  06                ldloc.0     
IL_0009  28 01 00 00 06    call         0x6000001

ASM code:

C5F857C0             vxorps   xmm0, xmm0
C5FA7F442438         vmovdqu  xmmword ptr [rsp+38H], xmm0
C5FA6F442438         vmovdqu  xmm0, xmmword ptr [rsp+38H]
C5FA7F442428         vmovdqu  xmmword ptr [rsp+28H], xmm0
488D4C2428           lea      rcx, bword ptr [rsp+28H]
E8BDA3FFFF           call     Program:Foo(BarStruct)

While it's possible to improve the JIT to handle this better, things would have been easier if the IL was just:

initobj
call         0x6000001

Given that there's an IL instruction called initobj that takes an arbitrary init value, I might call this GT_ZEROOBJ or something like that.

That's initblk, initobj always zeroes out the destination location.

@sandreenko
Copy link
Contributor Author

we have LCL_VAR struct == CNS_INT int 0, so assertion propagation replaces [000004] with [000002], so we get:

FWIW using CNS_INT like this has always been a bad idea. It really should be always wrapped in INIT_VAL (which could also have ClassLayout* added to it) or, because default initialized structs values are pretty common, a new oper be created just for this (e.g. GT_INITOBJ).

INIT_VAL right now doesn't have ClassLayout*, does it? It can be easily added if we change its base type from GenTreeOp to something new with ClassLayout* field.

However, from what I saw, INIT_VAL nowadays doesn't have an important semantic role, all phases just skip it and access its childer, so as I understand the only purpose of it is to block some optimizations (like Value numbering), and that is why we don't create INIT_VAL for 0 to keep these optimizations. Am I right or does it have another purpose?

As I understand the proposed design is to create a new node for initobj IL.
What about IL like this:

ldloca.s     0x0
ldc.i4.0 <- init value
ldc.i4.8 <- the number of bytes to init
initblk

nowadays we generate ASG(BLOCK, 0), do we want to generate ZEROOBJ instead?

I like the idea, especially because it can be done in the master branch as a separate PR.
I don't fully understand procs and cons of it; are you expecting that we will be able to generate better code with this new node? Should it survive until lowering or later?

@mikedn
Copy link
Contributor

mikedn commented Jan 21, 2020

INIT_VAL right now doesn't have ClassLayout*, does it? It can be easily added if we change its base type from GenTreeOp to something new with ClassLayout* field.

Yes, it's probably easy to add layout to it but it's not very useful ATM because INIT_VAL is not used in the most common case - zero initialization. It's only used when the initialization value is non-constant and those cases are practically non existent, at least in the FX libraries.

However, from what I saw, INIT_VAL nowadays doesn't have an important semantic role, all phases just skip it and access its childer, so as I understand the only purpose of it is to block some optimizations (like Value numbering), and that is why we don't create INIT_VAL for 0 to keep these optimizations. Am I right or does it have another purpose?

Yes, that's pretty much why it exists. I think that the more precise reason is that without it the IR semantics might be a bit ambiguous - is ASG(IND<struct>, IND<int>) filling the LHS with the byte value taken from the RHS or is it copying 4 bytes from the RHS? You can see this in OperIsCopyBlkOp and OperIsInitBlkOp where the init/copy distinction is based on the RHS being constant or INIT_VAL.

nowadays we generate ASG(BLOCK, 0), do we want to generate ZEROOBJ instead?

Perhaps, I'm not sure how common initblk is. I think that the C# compiler doesn't use it but others (e.g. VC++) do. The initblk/cpblk situation is unfortunately complicated:

  • They end up a store(load) or store(init) which makes sense only when the number of bytes is constant. When it's not things this load/store model seems pretty weird and unnecessary - you want a load/store model for the sake SSA, VN and others but at the same time you can't really do anything if the number of bytes isn't constant.
  • When the number of bytes is constant we end up creating ClassLayout objects without a class handle. I've done that in order to be able to move things forward but it seems weird to me and I think of it as a temporary solution. It may be better if OBJ has layout and BLK only has size, though this might complicate a bit code like LowerBlockStore.
  • One way or another, you cannot avoid the generation of ASG at import time, at least not easily. You'd just generate ASG(LCL(0), ZEROOBJ) instead of ASG(LCL(0), CNS_INT(0)). Then some form of propagation would replace uses of LCL(0) with a ZEROOBJ, without worrying about generating some kind of invalid IR like RETURN(CNS_INT(0)).

I don't fully understand procs and cons of it;

I think that the main problem with adding something like ZEROOBJ is that it would be just an alternate representation of INIT_VAL(CNS_INT(0)). Having 2 ways to represent the same thing in a compiler IR isn't exactly a bright idea so it may be better to just use INIT_VAL(CNS_INT(0)). The trouble with it is that the IR is bigger. Not a great idea in a JIT, especially when 99.9% of initializations are zero init.

As for the main advantage - this is IMO pushing the JIR IR in the right direction by removing a case of context dependent node semantics. From LCL_VARs nodes that are loads or stores depending on which side of ASG they are, to STRUCT typed IND nodes that cannot be CSEd because you don't know the exact type and to CNS_INT nodes that aren't really integer constants, this kind of context dependency always causes problems.

are you expecting that we will be able to generate better code with this new node

Well, it should avoid the issue you're having with RETURN(CNS_INT(0)) and assertion propagation.

Maybe it could also be used to improve the codegen in the example I posted above but I haven't checked what's going on there. That's probably a rather different issue (e.g. it could be that because the temporary created for implicit byref args is address exposed assertion propagation doesn't touch it even though it could in this case).

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants