fix(ci): flaky Fabric 2.x run tx endpoint test #718

Closed
petermetz opened this issue Mar 24, 2021 · 0 comments · Fixed by #1300

@petermetz
Member

Describe the bug

The test below fails intermittently, and only on the GitHub Actions (GHA) CI runners. Unfortunately, it is not reproducible on development machines.
packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts

To Reproduce

Keep submitting PRs and you will hit this issue every once in a while, forcing you to re-run the CI, after which it passes.

Expected behavior

Tests are as stable as possible.

Logs/Stack traces

ok 48 - packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v1-4-x/run-transaction-endpoint-v1.test.ts # time=239891.244ms

# Subtest: packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts
    # BEFORE runs tx on a Fabric v2.2.0 ledger
    Detected current process to be running inside a Github Action. Pruning all docker resources...
    [2021-03-24T04:43:56.536Z] DEBUG (Containers#pruneDockerResources()): Finished pruning all docker resources. Outcome: {
      containers: { ContainersDeleted: null, SpaceReclaimed: 0 },
      images: { ImagesDeleted: null, SpaceReclaimed: 0 },
      networks: { NetworksDeleted: null },
      volumes: {
        VolumesDeleted: [
          '73ea125010f87ec68098c611a17a9b2a68d6c0e7478b8d37508608cedb1cf310',
          [length]: 1
        ],
        SpaceReclaimed: 2638646768
      }
    }
    ok 1 Pruning didnt throw OK
    # runs tx on a Fabric v2.2.0 ledger
    # test count(1) != plan(null)
    # failed 1 test
not ok 49 - packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts # time=3600223.477ms
  ---
  env:
    TS_NODE_COMPILER_OPTIONS: '{"jsx":"react"}'
  file: packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts
  timeout: 3600000
  command: /opt/hostedtoolcache/node/14.15.1/x64/bin/node
  args:
    - -r
    - /home/runner/work/cactus/cactus/node_modules/ts-node/register/index.js
    - --max-old-space-size=4096
    - packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts
  stdio:
    - 0
    - pipe
    - 2
  cwd: /home/runner/work/cactus/cactus
  failures:
    - tapError: no plan
  exitCode: null
  signal: SIGTERM
  ...

Bail out! packages/cactus-plugin-ledger-connector-fabric/src/test/typescript/integration/fabric-v2-2-x/run-transaction-endpoint-v1.test.ts

Screenshots

N/A

Cloud provider or hardware configuration:

GitHub Actions Runner

Operating system name, version, build:

Ubuntu 18/20 LTS

Hyperledger Cactus release version or commit (git rev-parse --short HEAD):

main @ fea547f

Hyperledger Cactus Plugins/Connectors Used

Fabric

Additional context

This seems to be the last bug standing after several others were fixed recently as part of the effort to make #656 possible.
My guess at what is going wrong: the Fabric 2.x AIO (all-in-one) container may be hanging at boot because of a race condition that it cannot recover from. This is more likely than the hardware simply being too slow, because usually 3 out of 4 runs in the CI matrix pass in 30 minutes, while the 4th one takes over an hour and still times out.

cc: @takeutak @sfuji822 @hartm @jonathan-m-hamilton @AzaharaC @jordigiam @kikoncuo @jagpreetsinghsasan

@petermetz petermetz added the bug Something isn't working label Mar 24, 2021
@petermetz petermetz self-assigned this Sep 2, 2021
petermetz added a commit to petermetz/cacti that referenced this issue Sep 3, 2021
Epic facepalm once again. Turns out the default restart try
count of supervisord is too low which leads to race conditions.
Increasing the retry count from 4 to 20 should do it, this way
the fabric-network process (see supervisord.conf file) should
be 5 times as "patient" waiting for the docker daemon to launch
within the AIO container.

What was happening before is that the fabric-network script
tried launching itself in parallel with the docker daemon, but
it would time out before the docker daemon could come online.

Published these images as
ghcr.io/hyperledger/cactus-fabric2-all-in-one:2021-09-02--fix-876-supervisord-retries
and
ghcr.io/hyperledger/cactus-fabric-all-in-one:2021-09-02--fix-876-supervisord-retries

Fixes hyperledger#718
Fixes hyperledger#876
Fixes hyperledger#320
Fixes hyperledger#319

Signed-off-by: Peter Somogyvari <peter.somogyvari@accenture.com>
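
The effect of the retry count described in the commit message above can be illustrated with a simplified, synchronous model of bounded retries. This is only a sketch and not the actual supervisord or Cactus code: the names `attemptUntilReady` and `makeProbeReadyOnCall` are hypothetical, and the assumption that the docker daemon becomes ready by the 6th attempt is an arbitrary number chosen for illustration.

```typescript
// Hypothetical model, not the actual supervisord or Cactus code.
// A probe returns true once the dependency (e.g. the docker daemon) is ready.
type ReadinessProbe = () => boolean;

/**
 * Calls `probe` up to `maxAttempts` times and returns the 1-based attempt
 * number on which it first succeeded, or null if every attempt failed.
 */
function attemptUntilReady(
  probe: ReadinessProbe,
  maxAttempts: number,
): number | null {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (probe()) {
      return attempt;
    }
  }
  return null;
}

/** Builds a probe that fails until it has been called `readyOnCall` times. */
function makeProbeReadyOnCall(readyOnCall: number): ReadinessProbe {
  let calls = 0;
  return () => {
    calls += 1;
    return calls >= readyOnCall;
  };
}

// Assumption for illustration: the daemon is only ready by the 6th attempt.
const withFourAttempts = attemptUntilReady(makeProbeReadyOnCall(6), 4);
const withTwentyAttempts = attemptUntilReady(makeProbeReadyOnCall(6), 20);

console.log(withFourAttempts); // null: gave up before the daemon was ready
console.log(withTwentyAttempts); // 6: succeeded on the sixth attempt
```

With a budget of 4 attempts the dependent process gives up and never recovers, which matches the hang seen in CI, while a budget of 20 leaves enough headroom for a slow daemon start.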
petermetz added a commit that referenced this issue Sep 7, 2021
RafaelAPB pushed a commit to RafaelAPB/blockchain-integration-framework that referenced this issue Mar 9, 2022