
Vector not recovering SQS source processing after network disruption #23227

Open
@jkaflik

Description


A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We observe the Vector process getting stuck processing data during a network disruption.
It's a similar issue to #20337; however, in our case the setup is an SQS source with a ClickHouse sink.

When the NAT machine drops connections, both the source and sink components produce the following:

connection error: Connection reset by peer

This is expected. However, the process doesn't recover.

What I noticed from the logs is that the SQS source eventually gave up on acknowledging messages when calling DeleteMessageBatch:

2025-06-18T08:33:13.350059Z DEBUG source{component_kind="source" component_id=sqs_input component_type=aws_sqs}:sqs.DeleteMessageBatch{rpc.service="sqs" rpc.method="DeleteMessageBatch" sdk_invocation_id=3012487 rpc.system="aws-api"}:try_op: aws_smithy_runtime::client::orchestrator: a retry is either unnecessary or not possible, exiting attempt loop

The ClickHouse sink made one successful HTTP request with a 200 response after Connection reset by peer was observed. The SQS source seems unable to recover.

Unfortunately, the settings introduced in #20120 are not available on the SQS source itself. Maybe increasing the timeouts could help. However, I found that only one attempt is expected:

2025-06-18T08:33:13.298263Z DEBUG source{component_kind="source" component_id=sqs_input component_type=aws_sqs}:sqs.DeleteMessageBatch{rpc.service="sqs" rpc.method="DeleteMessageBatch" sdk_invocation_id=3827355 rpc.system="aws-api"}:try_op: aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=1 max_attempts=1
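
The max_attempts=1 in that log suggests the source's SDK client is built without a retry strategy. For illustration only, here is a minimal sketch (not Vector's actual code) of how an aws_sdk_sqs client can be built with the AWS Rust SDK's standard retry strategy plus explicit timeouts, so a transient connection reset during DeleteMessageBatch would be retried instead of exiting the attempt loop; the crate APIs are real, but the attempt count and wiring are assumptions on my side:

use std::time::Duration;
use aws_config::{retry::RetryConfig, timeout::TimeoutConfig, BehaviorVersion};

#[tokio::main]
async fn main() {
    // Assumption: allow up to 3 attempts instead of the single attempt seen in the logs.
    let retries = RetryConfig::standard().with_max_attempts(3);

    // Mirrors the timeout values from the sqs_input config below.
    let timeouts = TimeoutConfig::builder()
        .connect_timeout(Duration::from_secs(20))
        .operation_timeout(Duration::from_secs(20))
        .read_timeout(Duration::from_secs(20))
        .build();

    let config = aws_config::defaults(BehaviorVersion::latest())
        .retry_config(retries)
        .timeout_config(timeouts)
        .load()
        .await;

    let _sqs = aws_sdk_sqs::Client::new(&config);
    // DeleteMessageBatch requests issued with this client would then be retried
    // on transient connection errors instead of giving up after one attempt.
}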

Configuration

...
sinks:
  clickhouse:
    acknowledgements:
      enabled: true
    auth:
      password: ${CLICKHOUSE_PASSWORD}
      strategy: basic
      user: ${CLICKHOUSE_USER}
    batch:
      max_events: 15000
      timeout_secs: 5
    buffer:
      max_events: 15000
    database: default
    endpoint: https://SERVER:8443
    inputs:
      - flatten_metrics
    skip_unknown_fields: true
    date_time_best_effort: true
    table: xxx
    type: clickhouse
  prometheus:
    address: 0.0.0.0:9090
    flush_period_secs: 60
    inputs:
      - component_id_to_name_backward_compatibility
    type: prometheus_exporter

sources:
  sqs_input:
    decoding:
      codec: json
    delete_message: true
    queue_url: https://...
    region: us-west-2
    type: aws_sqs
    visibility_timeout_secs: 600
    timeout:
      connect_timeout_seconds: 20
      operation_timeout_seconds: 20
      read_timeout_seconds: 20
  vector_logs:
    type: internal_logs
  vector_metrics:
    type: internal_metrics

transforms:
  component_id_to_name_backward_compatibility:
    inputs:
      - vector_metrics
    source: |
      if exists(.tags.component_id) {
        .tags.component_name = .tags.component_id
      }
    type: remap
  flatten_metrics:
    inputs:
      - sqs_input
    source: |
      ...
    type: remap

Version

0.47.0

Debug Output

https://pastila.nl/?0000e025/b082240738a9a148cf97bef74197470a#RPiacgawYfSaQ5YNol4b8w==

Example Data

No response

Additional Context

No response

References

#20337

Metadata


Assignees

No one assigned

    Labels

    source: aws_sqs — Anything `aws_sqs` source related
    type: bug — A code related bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
