Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failed: System.Text.Tests.TranscodingStreamTests.Read_Span(bufferLength: 1) #47444

Closed
VincentBu opened this issue Jan 26, 2021 · 33 comments
Closed
Assignees
Labels
arch-arm32 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) test-run-core Test failures in .NET Core test runs
Milestone

Comments

@VincentBu
Copy link
Contributor

Run: runtime 20210125.57

Failed tests:

net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
        - System.Text.Tests.TranscodingStreamTests.Read_Span
                - System.Text.Tests.TranscodingStreamTests.Read_Span(bufferLength: 1)

Error message:

Assert.Equal() Failure
                                 ↓ (pos 3492055)
Expected: ···bcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghij···
Actual:   ···bcdefghijklmnopqrstuVwxyzabcdefghijklmnopqrstuvwxyzabcdefghij···
                                 ↑ (pos 3492055)

stack trace:
   at System.Text.Tests.TranscodingStreamTests.<RunReadTest>g__RunOneTestIteration|22_0(Int32 stringLength, <>c__DisplayClass22_0& ) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 372
   at System.Text.Tests.TranscodingStreamTests.RunReadTest(Func`3 callback) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 335
   at System.Text.Tests.TranscodingStreamTests.Read_Span(Int32 bufferLength) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 312
@VincentBu VincentBu added arch-arm32 os-linux Linux OS (any supported distro) labels Jan 26, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.Encoding untriaged New issue has not been triaged by the area owner labels Jan 26, 2021
@ghost
Copy link

ghost commented Jan 26, 2021

Tagging subscribers to this area: @tarekgh, @krwq
See info in area-owners.md if you want to be subscribed.

Issue Details

Run: runtime 20210125.57

Failed tests:

net6.0-Linux-Release-arm-CoreCLR_checked-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440
        - System.Text.Tests.TranscodingStreamTests.Read_Span
                - System.Text.Tests.TranscodingStreamTests.Read_Span(bufferLength: 1)

Error message:

Assert.Equal() Failure
                                 ↓ (pos 3492055)
Expected: ···bcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghij···
Actual:   ···bcdefghijklmnopqrstuVwxyzabcdefghijklmnopqrstuvwxyzabcdefghij···
                                 ↑ (pos 3492055)

stack trace:
   at System.Text.Tests.TranscodingStreamTests.<RunReadTest>g__RunOneTestIteration|22_0(Int32 stringLength, <>c__DisplayClass22_0& ) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 372
   at System.Text.Tests.TranscodingStreamTests.RunReadTest(Func`3 callback) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 335
   at System.Text.Tests.TranscodingStreamTests.Read_Span(Int32 bufferLength) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 312
Author: VincentBu
Assignees: -
Labels:

arch-arm32, area-System.Text.Encoding, os-linux, untriaged

Milestone: -

@GrabYourPitchforks
Copy link
Member

No case conversion occurs as part of these tests. Is this failure reproducible? Are bits of memory randomly being flipped as we succumb to the slow, silent entropy of the universe?

@stephentoub
Copy link
Member

Are bits of memory randomly being flipped as we succumb to the slow, silent entropy of the universe?

Having a morbid day, are we? 😄

@danmoseley
Copy link
Member

This test runs in this configuration (this queue, at least) on many different machines:

TestResults
| join kind=inner WorkItems on WorkItemId
| join kind=inner Jobs on JobId
| where Finished >= now(-120d)
| where Type == "System.Text.Tests.TranscodingStreamTests"
| where Method == "Read_Span"
| where QueueName1 == "ubuntu.1804.armarch.open"
| summarize dcount(MachineName)

dcount_MachineName

230

In that last 4 months it has failed twice -- both on the same box in much the same way

Bad memory on ddvsotx2l280 ? cc @MattGal

https://engsrvprod.kusto.windows.net/engineeringdata?query=H4sIAAAAAAAEAG2PQWvDMAyF74H8B9FLN3DLunsGHSOQwgZrAjsOzxaJm0QOlkIX2I%2bfk451h11s6b1PD6lCliPy2AlDmnzByTuC1pHNHBEGePOhLQR7Bk%2b%2fTWH%2fZQ%2f%2bY8HiX9gZODcYEHJHjhu08JABkD%2ffbHb3d%2fYWrkQ1DQhZBqty4hi%2frfBT4sPC2ypoYuOto7qUgLpf5NV19hml8XaZPqK27%2bWg6Y99uW2217l23Xp2huBPaARK0UHQKngagxbnSS2bqJ9MBXlwSLabXnQ%2fq9o0jvDSvI44LuVuxpl1HcUYaNq4sYn1PtRjjySs4DHeYJo0SZNvsaswnG4BAAA%3d&web=0

TestResults
| join kind=inner WorkItems on WorkItemId
| join kind=inner Jobs on JobId
| where Finished >= now(-120d)
| where Type == "System.Text.Tests.TranscodingStreamTests"
| where Method == "Read_Span"
| where Result == 'Fail'
| project Started, Duration, Type, Method, FriendlyName, MachineName, QueueName1, Message, StackTrace, Arguments, Branch

Started Duration Type Method FriendlyName MachineName QueueName1 Message StackTrace Arguments Branch
2020-12-23 21:07:36.0500000 3.757 System.Text.Tests.TranscodingStreamTests Read_Span System.Text.Encoding.Tests ddvsotx2l280 ubuntu.1804.armarch.open Assert.Equal() Failure\n ↓ (pos 4401399)\nExpected: ···vwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd···\nActual: ···vwxyzabcdefghijklmnoPqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd···\n ↑ (pos 4401399) at System.Text.Tests.TranscodingStreamTests.g__RunOneTestIteration|22_0(Int32 stringLength, <>c__DisplayClass22_0& ) in //src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 372 at System.Text.Tests.TranscodingStreamTests.RunReadTest(Func`3 callback) in //src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 335 at System.Text.Tests.TranscodingStreamTests.Read_Span(Int32 bufferLength) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 312 bufferLength: 1 refs/pull/46351/merge
2021-01-25 21:31:04.3900000 2.032 System.Text.Tests.TranscodingStreamTests Read_Span System.Text.Encoding.Tests ddvsotx2l280 ubuntu.1804.armarch.open Assert.Equal() Failure\n ↓ (pos 3492055)\nExpected: ···bcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghij···\nActual: ···bcdefghijklmnopqrstuVwxyzabcdefghijklmnopqrstuvwxyzabcdefghij···\n ↑ (pos 3492055) at System.Text.Tests.TranscodingStreamTests.g__RunOneTestIteration|22_0(Int32 stringLength, <>c__DisplayClass22_0& ) in //src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 372 at System.Text.Tests.TranscodingStreamTests.RunReadTest(Func`3 callback) in //src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 335 at System.Text.Tests.TranscodingStreamTests.Read_Span(Int32 bufferLength) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 312 bufferLength: 1 refs/heads/master

@danmoseley
Copy link
Member

It might be interesting to query failure % of tests in general on ddvsotx2l280 compared to other machines with the same configuration and OS.

@tarekgh tarekgh removed the untriaged New issue has not been triaged by the area owner label Jan 26, 2021
@tarekgh tarekgh added this to the 6.0.0 milestone Jan 26, 2021
@tarekgh tarekgh added the test-run-core Test failures in .NET Core test runs label Jan 26, 2021
@danmoseley
Copy link
Member

For the record, the machine seems middle of the pack by rate of test failures or segfaults, using this query

WorkItems
| where Finished >= now(-120d)
| where QueueName == "ubuntu.1804.armarch.open"
| summarize Pass=countif(ExitCode == 0), Fail=countif(ExitCode == 1), Abort=countif(ExitCode != 0 and ExitCode != 1), SegV= countif(ExitCode==139) by MachineName
| extend FailRate = round(toreal(Fail)/(Pass+Fail+Abort), 4) * 100, AbortRate = round(toreal(Abort)/(Pass+Fail+Abort), 4) * 100, SegVRate = round(toreal(SegV)/(Pass+Fail+Abort), 6) * 100

So... who knows? Twice could be a coincidence.

What is the chance that this single bit error would be in roughly the same place in the buffer ? Is this the only test that writes a large buffer and reads it back? @GrabYourPitchforks any other idea?

@danmoseley
Copy link
Member

Maybe @jkotas has an idea

@jkotas
Copy link
Member

jkotas commented Jan 27, 2021

I think we should wait and see whether it fails on some other machine too. I can be a bad memory on the specific machine.

If we see it failing on more than one machine, it can be dangling pointer bug or GC hole bug. We would need a process dump taken as close to the failure point as possible to diagnose it.

@danmoseley
Copy link
Member

OK, let's check back in a month...

@MattGal
Copy link
Member

MattGal commented Jan 27, 2021

OK, let's check back in a month...

Something easy to do and I don't mind doing, if you pick a correlation id of a specific run for me, is to send exactly hundreds of clones of the just the work item you suspect is a problem using the Helix SDK. We'd get pretty useful data about the general flakiness of the test and know whether to blame that one machine from this.

@danmoseley
Copy link
Member

@MattGal thanks for the offer. Can you extract the "correlation ID" from one of the results in the query above?

@MattGal
Copy link
Member

MattGal commented Jan 27, 2021

Sure, can you confirm somewhere that this log is the exact one to clone a bunch of times?

As part of the recent surge (I think you even contributed) we log the correlation id at the top of all these logs now like:
Console log: 'System.Text.Encoding.Tests' from job 8d51db18-6cb8-42d7-a68e-dfa7c178bbf5 ...

@danmoseley
Copy link
Member

Ah, I see, I didn't know what correlation ID is -- sounds like it's job ID.

Yes, that's one of the ones with the problem. It would be interesting to loop that a zillion times on this machine and grab the console output of any failures. thank you!

@MattGal
Copy link
Member

MattGal commented Jan 27, 2021

I was thinking a zillion times on all the machines, to isolate "we think this machine has failing hardware" thing?

Job Id is a long in the DB (in the case of the GUID above, "JobId" == 13132067)

@MattGal
Copy link
Member

MattGal commented Jan 27, 2021

I ran this test 800x in a row across 274/278 of these machines (see JobId: 13167068 Correlation: 7082a89-a973-405f-af71-9babae5ac23e)

It ended up passing 3x on that machine:

WorkItems
| where Queued > ago(1d)
| where JobName == "7082a892-a973-405f-af71-9babae5ac23e" 
| where Name != JobName
| where MachineName == "ddvsotx2l280"

I am set up to be able to do this over and over to any scale we like but I"m trying to keep it minimal as PRs still need to work. I may run a few hundred here and there today and add them to my checks.

@MattGal
Copy link
Member

MattGal commented Jan 28, 2021

Machine's back in the pool. I ran stress-ng loads suggested by Dan both on the machine for 2 hours and inside the container the test failed on, no luck :(

@VincentBu
Copy link
Contributor Author

Failed again in runtime-coreclr libraries-jitstress 20210705.1

Failed tests:

net6.0-Linux-Release-arm-CoreCLR_checked-tailcallstress-(Ubuntu.1804.Arm32.Open)Ubuntu.1804.Armarch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm32v7-bfcd90a-20200121150440

- System.Text.Tests.TranscodingStreamTests.Read_Span(bufferLength: 1)
- System.Text.Tests.TranscodingStreamTests.Read_ByteByByte

Error message:

Assert.Equal() Failure
↓ (pos 2488567)
Expected: ···jklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqr···
Actual:   ···jklmnopqrstuvwxyzabcDefghijklmnopqrstuvwxyzabcdefghijklmnopqr···
↑ (pos 2488567)


Stack trace
   at System.Text.Tests.TranscodingStreamTests.<RunReadTest>g__RunOneTestIteration|22_0(Int32 stringLength, <>c__DisplayClass22_0& ) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 390
   at System.Text.Tests.TranscodingStreamTests.RunReadTest(Func`3 callback) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 337
   at System.Text.Tests.TranscodingStreamTests.Read_ByteByByte() in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 293

@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Text.Encoding labels Jul 6, 2021
@kunalspathak
Copy link
Member

kunalspathak commented Jul 22, 2021

This is the failure on same machine ddvsotx2l280 but on different position 2488567. In prior run it was failing on position 3492055. @MattGal , @danmoseley

I tried reproducing it on the local Linux/arm machine and it doesn't reproduce.

@GrabYourPitchforks
Copy link
Member

Spitballing a few ideas.

One option is that we try to log the physical memory address where this failure occurs. From the managed side it'd be easy enough to record the process's virtual memory address, but I don't know how to reliably map that back to a physical memory address short of having some system-level service already running on the box and using IPC.

Another option is to query the machine name at the start of the test and to perform a different action if we're on the target box. That could include skipping the test, using absurdly-sized payloads to try to trigger the repro more rapidly, or whatever else we think appropriate. Risks here would be that: (a) there's an opportunity cost for doing this work if we already suspect a hardware problem and have already proposed a resolution of pulling the machine from rotation; and (b) we don't want to risk accidentally leaving this code checked in.

A big-hammer is to memtest86+ the physical box. 😈

@kunalspathak
Copy link
Member

This is a low hit non-deterministic issue, I am inclined to moving this out of .NET6.0. Let me know if anyone thinks that it should be investigated in .NET6.

@danmoseley
Copy link
Member

@MattGal can we ask for someone to run memtest? It's hard to understand otherwise why it's the same box.

@MattGal
Copy link
Member

MattGal commented Jul 23, 2021

@MattGal can we ask for someone to run memtest? It's hard to understand otherwise why it's the same box.

I took the machine offline and will file a ticket to get DDFUN to try this out. Due to my only access being over KVM, I can't seem to get to the boot loader screen to try memtest. Someone with physical access may have to attempt this.

@danmoseley
Copy link
Member

danmoseley commented Jul 23, 2021

I ran the query above again, and although this test has run on 34 distinct machines in the last 120 days, the only two failures (new ones since last time) have both been on ddvsotx2l280. Both failures had the same pattern.

I'm guessing this test is one of the most stressful on memory, but I don't know off the top of my head how to create a query like "which tests disproportionately fail on ddvsotx2l280". If that threw up other memory-intensive tests, it would be more evidence.

TestResults
| join kind=inner WorkItems on WorkItemId
| join kind=inner Jobs on JobId
| where Finished >= now(-365d)
| where Type == "System.Text.Tests.TranscodingStreamTests"
| where Method == "Read_Span"
| where Result == 'Fail'
| project Started, Duration, Type, Method, FriendlyName, MachineName, QueueName1, Message, StackTrace, Arguments, Branch
Started Duration Type Method FriendlyName MachineName QueueName1 Message StackTrace Arguments Branch
2021-07-05 09:11:45.7870000 2.156 System.Text.Tests.TranscodingStreamTests Read_Span System.Text.Encoding.Tests ddvsotx2l280 ubuntu.1804.armarch.open Assert.Equal() Failure
↓ (pos 6482135)
Expected: ···defghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijkl···
Actual: ···defghijklmnopqrstuvwXyzabcdefghijklmnopqrstuvwxyzabcdefghijkl···
↑ (pos 6482135)
at System.Text.Tests.TranscodingStreamTests.g__RunOneTestIteration 22_0(Int32 stringLength, <>c__DisplayClass22_0& ) in //src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 390
at System.Text.Tests.TranscodingStreamTests.RunReadTest(Func`3 callback) in /
/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 337
at System.Text.Tests.TranscodingStreamTests.Read_Span(Int32 bufferLength) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 314
bufferLength: 1
2021-07-06 21:41:48.9150000 2.884 System.Text.Tests.TranscodingStreamTests Read_Span System.Text.Encoding.Tests ddvsotx2l280 ubuntu.1804.armarch.open Assert.Equal() Failure
↓ (pos 4839671)
Expected: ···lmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrst···
Actual: ···lmnopqrstuvwxyzabcdeFghijklmnopqrstuvwxyzabcdefghijklmnopqrst···
↑ (pos 4839671)
at System.Text.Tests.TranscodingStreamTests.g__RunOneTestIteration 22_0(Int32 stringLength, <>c__DisplayClass22_0& ) in //src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 390
at System.Text.Tests.TranscodingStreamTests.RunReadTest(Func`3 callback) in /
/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 337
at System.Text.Tests.TranscodingStreamTests.Read_Span(Int32 bufferLength) in /_/src/libraries/System.Text.Encoding/tests/Encoding/TranscodingStreamTests.cs:line 314
bufferLength: 1

@kunalspathak kunalspathak assigned MattGal and unassigned kunalspathak Jul 26, 2021
@kunalspathak
Copy link
Member

Assigning to you @MattGal - Feel free to redirect to me once we get the result of memtest.

@JulieLeeMSFT
Copy link
Member

@MattGal, any update with memtest?

@danmoseley
Copy link
Member

@parose1 responded a few days ago on the issue linked.

This is failing ~1 time every 2 months, so it will be a long time before we have more data. I suggest to either close this (optimistically) or move out of 6.0

@danmoseley
Copy link
Member

The query above shows no new failures since 7/6 so I suggest to close optimistically. I still think it's important to re-flash or whatever the device.

@kunalspathak
Copy link
Member

I agree with @danmoseley .

@JulieLeeMSFT
Copy link
Member

Thanks @danmoseley and @kunalspathak.

@MattGal
Copy link
Member

MattGal commented Aug 10, 2021

@MattGal, any update with memtest?

It didn't seem to exist for this specific hardware setup, so it got reimaged instead.

@danmoseley
Copy link
Member

Let's meet back here in 2 months and see whether this test failed again and if so on which machine.

@danmoseley
Copy link
Member

Hopefully @MattGal the machine name didn't change? If not what is it now.

@MattGal
Copy link
Member

MattGal commented Aug 10, 2021

Hopefully @MattGal the machine name didn't change? If not what is it now.

@parose1 can confirm but in my experience, they do not change the machine name. That's why all our Macs still have their Jenkins names :)

@ghost ghost locked as resolved and limited conversation to collaborators Sep 9, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm32 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-linux Linux OS (any supported distro) test-run-core Test failures in .NET Core test runs
Projects
None yet
Development

No branches or pull requests

9 participants