fix(ci): flaky Fabric 2.x run tx endpoint test #718

petermetz · 2021-03-24T05:55:29Z

Describe the bug

The test below fails intermittently, and only on the GitHub Actions (GHA) CI runners. Unfortunately, it is not reproducible on development machines.
packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts

To Reproduce

Keep submitting PRs and you'll hit this issue every once in a while, forcing you to re-run the CI until it passes.

Expected behavior

Tests are as stable as possible.

Logs/Stack traces

ok 48 - packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v1-4-x/run-transaction-endpoint-v1.test.ts # time=239891.244ms

# Subtest: packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts
    # BEFORE runs tx on a Fabric v2.2.0 ledger
    Detected current process to be running inside a Github Action. Pruning all docker resources...
    [2021-03-24T04:43:56.536Z] DEBUG (Containers#pruneDockerResources()): Finished pruning all docker resources. Outcome: {
      containers: { ContainersDeleted: null, SpaceReclaimed: 0 },
      images: { ImagesDeleted: null, SpaceReclaimed: 0 },
      networks: { NetworksDeleted: null },
      volumes: {
        VolumesDeleted: [
          '73ea125010f87ec68098c611a17a9b2a68d6c0e7478b8d37508608cedb1cf310',
          [length]: 1
        ],
        SpaceReclaimed: 2638646768
      }
    }
    ok 1 Pruning didnt throw OK
    # runs tx on a Fabric v2.2.0 ledger
    # test count(1) != plan(null)
    # failed 1 test
not ok 49 - packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts # time=3600223.477ms
  ---
  env:
    TS_NODE_COMPILER_OPTIONS: '{"jsx":"react"}'
  file: packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts
  timeout: 3600000
  command: /opt/hostedtoolcache/node/14.15.1/x64/bin/node
  args:
    - -r
    - /home/runner/work/cactus/cactus/node_modules/ts-node/register/index.js
    - --max-old-space-size=4096
    - packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts
  stdio:
    - 0
    - pipe
    - 2
  cwd: /home/runner/work/cactus/cactus
  failures:
    - tapError: no plan
  exitCode: null
  signal: SIGTERM
  ...

Bail out! packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts

Screenshots

N/A

Cloud provider or hardware configuration:

GitHub Actions Runner

Operating system name, version, build:

Ubuntu 18/20 LTS

Hyperledger Cactus release version or commit (git rev-parse --short HEAD):

main @ fea547f

Hyperledger Cactus Plugins/Connectors Used

Fabric

Additional context

This seems to be the last bug standing after several others were fixed recently as part of the effort to make #656 possible.
Guess at what might be going wrong: the Fabric 2.x AIO container may be hanging at boot because of a race condition that it cannot recover from. This seems more likely than slow hardware simply running out of time, because usually 3 out of 4 runs in the CI matrix pass in 30 minutes, while the 4th one takes over an hour and still times out.

cc: @takeutak @sfuji822 @hartm @jonathan-m-hamilton @AzaharaC @jordigiam @kikoncuo @jagpreetsinghsasan


Epic facepalm once again. Turns out the default restart try count of supervisord is too low, which leads to race conditions. Increasing the retry count from 4 to 20 should do it: this way the fabric-network process (see the supervisord.conf file) should be 5 times as "patient" while waiting for the docker daemon to launch within the AIO container.

What was happening before is that the fabric-network script tried launching itself in parallel with the docker daemon, but it would time out before the docker daemon could come online.

Published these images as
ghcr.io/hyperledger/cactus-fabric2-all-in-one:2021-09-02--fix-876-supervisord-retries
and
ghcr.io/hyperledger/cactus-fabric-all-in-one:2021-09-02--fix-876-supervisord-retries

Fixes hyperledger#718
Fixes hyperledger#876
Fixes hyperledger#320
Fixes hyperledger#319

Signed-off-by: Peter Somogyvari <peter.somogyvari@accenture.com>
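To illustrate why a larger retry budget can remove the race described in the commit message, here is a minimal sketch in TypeScript. It is only a toy model, not the supervisord implementation or the contents of the real supervisord.conf: the function `simulateStart`, the `StartOutcome` type, and the value `slowBootAttempt = 9` are all hypothetical and chosen for illustration. Only the retry counts 4 and 20 come from the commit message.

```typescript
// Toy model of a supervisor that retries starting a program which depends
// on another service (here: the docker daemon) being available.
// Every name and number below is an illustrative assumption, except the
// retry counts 4 and 20, which are taken from the commit message.

interface StartOutcome {
  started: boolean; // true if the program came up within the retry budget
  attemptsUsed: number; // how many start attempts were consumed
}

/**
 * Simulates bounded start retries.
 * @param maxRetries number of start attempts the supervisor allows
 * @param dependencyReadyAtAttempt first attempt (1-based) at which the
 *        dependency is assumed to be up
 */
function simulateStart(
  maxRetries: number,
  dependencyReadyAtAttempt: number,
): StartOutcome {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    if (attempt >= dependencyReadyAtAttempt) {
      // The dependency is up, so this start attempt succeeds.
      return { started: true, attemptsUsed: attempt };
    }
    // Otherwise the attempt fails and the supervisor tries again.
  }
  // Retry budget exhausted: the program is never started again.
  return { started: false, attemptsUsed: maxRetries };
}

// Assumption: on a slow CI runner the daemon is only up by the 9th attempt.
const slowBootAttempt = 9;

const before = simulateStart(4, slowBootAttempt);
const after = simulateStart(20, slowBootAttempt);

console.log("retries=4 :", before);
console.log("retries=20:", after);
```

Under these assumed numbers, the run with 4 retries gives up before the dependency is ready, while the run with 20 retries succeeds on the 9th attempt. A program that exhausts its retries stays down, which would be consistent with the test hanging until the one-hour timeout seen in the log above.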


petermetz added the bug Something isn't working label Mar 24, 2021

petermetz self-assigned this Sep 2, 2021

petermetz mentioned this issue Sep 3, 2021

fix(test): flaky fabric AIO container boot #876 #1300

Merged

petermetz closed this as completed in #1300 Sep 7, 2021
