
Vector not recovering SQS source processing after network disruption #23227

Open
@jkaflik

Description


A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We observe the Vector process getting stuck processing data during a network disruption.
It's a similar issue to #20337; however, in our case the setup is an SQS source with a ClickHouse sink.

When the NAT machine drops connections, both the source and sink components produce the following:

connection error: Connection reset by peer

This is expected. However, the process doesn't recover.

What I noticed from the logs is that the SQS source eventually gave up on acknowledging messages when calling DeleteMessageBatch:

2025-06-18T08:33:13.350059Z DEBUG source{component_kind="source" component_id=sqs_input component_type=aws_sqs}:sqs.DeleteMessageBatch{rpc.service="sqs" rpc.method="DeleteMessageBatch" sdk_invocation_id=3012487 rpc.system="aws-api"}:try_op: aws_smithy_runtime::client::orchestrator: a retry is either unnecessary or not possible, exiting attempt loop

The ClickHouse sink made one successful HTTP request with a 200 response after Connection reset by peer was observed. The SQS source seems unable to recover.

Unfortunately, the settings introduced in #20120 are not available on the SQS source itself. Maybe increasing the timeouts could help. However, I found that only one attempt is expected:

2025-06-18T08:33:13.298263Z DEBUG source{component_kind="source" component_id=sqs_input component_type=aws_sqs}:sqs.DeleteMessageBatch{rpc.service="sqs" rpc.method="DeleteMessageBatch" sdk_invocation_id=3827355 rpc.system="aws-api"}:try_op: aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=1 max_attempts=1
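
The max_attempts=1 in that log suggests the source's SDK client is built without a retry strategy. For illustration only, here is a minimal sketch (not Vector's actual code) of how an aws_sdk_sqs client can be built with the AWS Rust SDK's standard retry strategy plus explicit timeouts, so a transient connection reset during DeleteMessageBatch would be retried instead of exiting the attempt loop; the crate APIs are real, but the attempt count and wiring are assumptions on my side:

use std::time::Duration;
use aws_config::{retry::RetryConfig, timeout::TimeoutConfig, BehaviorVersion};

#[tokio::main]
async fn main() {
    // Assumption: allow up to 3 attempts instead of the single attempt seen in the logs.
    let retries = RetryConfig::standard().with_max_attempts(3);

    // Mirrors the timeout values from the sqs_input config below.
    let timeouts = TimeoutConfig::builder()
        .connect_timeout(Duration::from_secs(20))
        .operation_timeout(Duration::from_secs(20))
        .read_timeout(Duration::from_secs(20))
        .build();

    let config = aws_config::defaults(BehaviorVersion::latest())
        .retry_config(retries)
        .timeout_config(timeouts)
        .load()
        .await;

    let _sqs = aws_sdk_sqs::Client::new(&config);
    // DeleteMessageBatch requests issued with this client would then be retried
    // on transient connection errors instead of giving up after one attempt.
}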

Configuration

...
sinks:
  clickhouse:
    acknowledgements:
      enabled: true
    auth:
      password: ${CLICKHOUSE_PASSWORD}
      strategy: basic
      user: ${CLICKHOUSE_USER}
    batch:
      max_events: 15000
      timeout_secs: 5
    buffer:
      max_events: 15000
    database: default
    endpoint: https://SERVER:8443
    inputs:
      - flatten_metrics
    skip_unknown_fields: true
    date_time_best_effort: true
    table: xxx
    type: clickhouse
  prometheus:
    address: 0.0.0.0:9090
    flush_period_secs: 60
    inputs:
      - component_id_to_name_backward_compatibility
    type: prometheus_exporter

sources:
  sqs_input:
    decoding:
      codec: json
    delete_message: true
    queue_url: https://...
    region: us-west-2
    type: aws_sqs
    visibility_timeout_secs: 600
    timeout:
      connect_timeout_seconds: 20
      operation_timeout_seconds: 20
      read_timeout_seconds: 20
  vector_logs:
    type: internal_logs
  vector_metrics:
    type: internal_metrics

transforms:
  component_id_to_name_backward_compatibility:
    inputs:
      - vector_metrics
    source: |
      if exists(.tags.component_id) {
        .tags.component_name = .tags.component_id
      }
    type: remap
  flatten_metrics:
    inputs:
      - sqs_input
    source: |
      ...
    type: remap

Version

0.47.0

Debug Output

https://pastila.nl/?0000e025/b082240738a9a148cf97bef74197470a#RPiacgawYfSaQ5YNol4b8w==

Example Data

No response

Additional Context

No response

References

#20337

Metadata


Assignees

No one assigned

    Labels

    source: aws_sqs — Anything `aws_sqs` source related
    type: bug — A code related bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
