Description
Problem
We observe the Vector process getting stuck while processing data during a network disruption.
It's a similar issue to #20337; however, in our case the setup is an SQS source with a ClickHouse sink.
When the NAT machine drops connections, both the source and the sink components produce the following:
connection error: Connection reset by peer
It's expected. However, the process doesn't recover.
What I noticed from the logs is that the SQS source eventually gave up on acknowledging messages when calling DeleteMessageBatch:
2025-06-18T08:33:13.350059Z DEBUG source{component_kind="source" component_id=sqs_input component_type=aws_sqs}:sqs.DeleteMessageBatch{rpc.service="sqs" rpc.method="DeleteMessageBatch" sdk_invocation_id=3012487 rpc.system="aws-api"}:try_op: aws_smithy_runtime::client::orchestrator: a retry is either unnecessary or not possible, exiting attempt loop
The ClickHouse sink made one successful HTTP request (200 response) after the Connection reset by peer error was observed. The SQS source, however, does not seem to be able to recover.
Unfortunately, the settings introduced in #20120 are not available to the SQS source itself. Increasing the timeouts might help. However, the logs show that only one attempt is allowed:
2025-06-18T08:33:13.298263Z DEBUG source{component_kind="source" component_id=sqs_input component_type=aws_sqs}:sqs.DeleteMessageBatch{rpc.service="sqs" rpc.method="DeleteMessageBatch" sdk_invocation_id=3827355 rpc.system="aws-api"}:try_op: aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=1 max_attempts=1
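To make the consequence of `max_attempts=1` concrete, here is a minimal Python sketch of a generic attempt loop. It is not the AWS SDK's or Vector's code: `call_with_retries`, `make_flaky_delete`, `outcome`, and the backoff values are hypothetical names chosen for illustration only. With a budget of one attempt, the first connection reset is final; with a larger budget, the same transient failure is absorbed.

```python
import time


def call_with_retries(op, max_attempts, base_delay=0.0):
    """Run op() up to max_attempts times, retrying only on ConnectionResetError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except ConnectionResetError:
            if attempt == max_attempts:
                # Out of attempts: surface the error to the caller.
                raise
            # Exponential backoff between attempts (hypothetical values).
            time.sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_delete(failures):
    """Return a stand-in for a delete call that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def flaky_delete():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionResetError("Connection reset by peer")
        return "deleted"

    return flaky_delete


def outcome(max_attempts, failures=1):
    """Report whether the call succeeded within the given attempt budget."""
    try:
        return call_with_retries(make_flaky_delete(failures), max_attempts)
    except ConnectionResetError:
        return "gave up"
```

Under these assumptions, `outcome(1)` gives up after a single connection reset, while `outcome(3)` succeeds on the second attempt.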
Configuration
...
sinks:
  clickhouse:
    acknowledgements:
      enabled: true
    auth:
      password: ${CLICKHOUSE_PASSWORD}
      strategy: basic
      user: ${CLICKHOUSE_USER}
    batch:
      max_events: 15000
      timeout_secs: 5
    buffer:
      max_events: 15000
    database: default
    endpoint: https://SERVER:8443
    inputs:
      - flatten_metrics
    skip_unknown_fields: true
    date_time_best_effort: true
    table: xxx
    type: clickhouse
  prometheus:
    address: 0.0.0.0:9090
    flush_period_secs: 60
    inputs:
      - component_id_to_name_backward_compatibility
    type: prometheus_exporter
sources:
  sqs_input:
    decoding:
      codec: json
    delete_message: true
    queue_url: https://...
    region: us-west-2
    type: aws_sqs
    visibility_timeout_secs: 600
    timeout:
      connect_timeout_seconds: 20
      operation_timeout_seconds: 20
      read_timeout_seconds: 20
  vector_logs:
    type: internal_logs
  vector_metrics:
    type: internal_metrics
transforms:
  component_id_to_name_backward_compatibility:
    inputs:
      - vector_metrics
    source: |
      if exists(.tags.component_id) {
        .tags.component_name = .tags.component_id
      }
    type: remap
  flatten_metrics:
    inputs:
      - sqs_input
    source: |
      ...
    type: remap
Version
0.47.0
Debug Output
https://pastila.nl/?0000e025/b082240738a9a148cf97bef74197470a#RPiacgawYfSaQ5YNol4b8w==
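For anyone triaging a similar debug log, the following Python sketch scans log lines for the out-of-attempts message quoted earlier and reports the RPC method and attempt budget. The message text and field names are copied from the log lines in this report; the function name `find_exhausted_retries` is a hypothetical helper, and the pattern is an assumption that may need adjusting for other SDK versions.

```python
import re

# Pattern for the retry-strategy message, based on the log lines in this report.
OUT_OF_ATTEMPTS = re.compile(
    r"rpc\.method=\"(?P<method>\w+)\".*"
    r"not retrying because we are out of attempts "
    r"attempts=(?P<attempts>\d+) max_attempts=(?P<max_attempts>\d+)"
)


def find_exhausted_retries(lines):
    """Yield (method, attempts, max_attempts) for each out-of-attempts log line."""
    for line in lines:
        match = OUT_OF_ATTEMPTS.search(line)
        if match:
            yield (
                match["method"],
                int(match["attempts"]),
                int(match["max_attempts"]),
            )
```

It accepts any iterable of strings, so an open file handle for a saved copy of the debug output can be passed directly.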
Example Data
No response
Additional Context
No response