
Output Cloudwatch Logs not using Region #11963

Open
alec-medcrypt opened this issue Oct 7, 2022 · 23 comments
Labels: upstream (bug or issues that rely on dependency fixes)

Comments

@alec-medcrypt

I have a Telegraf config set up on a node without the AWS CLI installed. Below is the config:

# Generic, basic /usr/local/etc/telegraf.conf file for FreeBSD
# Gathers some basic metrics and transmits them to cloudwatch
# Be sure to set the region below

[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = false
logfile = "/var/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false

[[processors.aws_ec2]]
imds_tags = ["instanceId"]

[[inputs.tail]]
name_override = "app_logs"
files = ["/var/log/app/*.log"]
data_format = "grok"
grok_patterns = ['%{GREEDYDATA:message}']

[[outputs.cloudwatch_logs]]
region = "us-east-2"
log_group = "/namespace/env/app"
log_stream = "tag:instanceId"
log_data_metric_name = "app_logs"
log_data_source = "field:message"

I am getting the following errors:

2022-10-07T17:16:36Z E! [telegraf] Error running agent: connecting output outputs.cloudwatch_logs: Error connecting to output "outputs.cloudwatch_logs": operation error CloudWatch Logs: DescribeLogGroups, failed to resolve service endpoint, an AWS region is required, but was not found
2022-10-07T17:16:37Z E! [agent] Failed to connect to [outputs.cloudwatch_logs], retrying in 15s, error was 'operation error CloudWatch Logs: DescribeLogGroups, failed to resolve service endpoint, an AWS region is required, but was not found'

@powersj
Contributor

powersj commented Oct 11, 2022

Hi,

What are you using for credentials to log in to AWS?

The typical config stanza includes the following options:

  ## Amazon Credentials
  ## Credentials are loaded in the following order
  ## 1) Web identity provider credentials via STS if role_arn and
  ##    web_identity_token_file are specified
  ## 2) Assumed credentials via STS if role_arn is specified
  ## 3) explicit credentials from 'access_key' and 'secret_key'
  ## 4) shared profile from 'profile'
  ## 5) environment variables
  ## 6) shared credentials file
  ## 7) EC2 Instance Profile
  # access_key = ""
  # secret_key = ""
  # token = ""
  # role_arn = ""
  # web_identity_token_file = ""
  # role_session_name = ""
  # profile = ""
  # shared_credential_file = ""
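The documented order above is a first-match chain: the first source whose settings are present wins. The following stdlib-only Go sketch models that order, assuming each source is selected purely by which TOML keys are non-empty (the real plugin code may check more). `credentialConfig` and `credentialSource` are hypothetical names, not Telegraf's API, and sources 5 to 7 are assumed to be left to the AWS SDK's default chain.

```go
package main

import "fmt"

// credentialConfig is a hypothetical struct whose fields mirror the
// TOML keys listed above; it is not Telegraf's actual type.
type credentialConfig struct {
	AccessKey            string
	SecretKey            string
	RoleARN              string
	WebIdentityTokenFile string
	Profile              string
}

// credentialSource reports which source the documented order would pick.
func credentialSource(c credentialConfig) string {
	switch {
	case c.RoleARN != "" && c.WebIdentityTokenFile != "":
		return "web identity via STS" // 1)
	case c.RoleARN != "":
		return "assumed role via STS" // 2)
	case c.AccessKey != "" && c.SecretKey != "":
		return "explicit access_key/secret_key" // 3) assumed: both keys set
	case c.Profile != "":
		return "shared profile" // 4)
	default:
		// 5) environment variables, 6) shared credentials file and
		// 7) EC2 instance profile: assumed to be the SDK's default chain.
		return "SDK default chain"
	}
}

func main() {
	fmt.Println(credentialSource(credentialConfig{Profile: "default"}))
	// Placeholder ARN, for illustration only.
	fmt.Println(credentialSource(credentialConfig{RoleARN: "arn:aws:iam::123456789012:role/example"}))
}
```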

Are you certain whatever credentials you are using also have access to that region?

Thanks!

powersj added the "waiting for response" label Oct 11, 2022
@telegraf-tiger
Contributor

Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Page. Thank you!

@starek4

starek4 commented Jul 17, 2023

@alec-medcrypt did you solve this? I'm having the same issue.

I am using these options:

  • region
  • access_key
  • secret_key
  • log_group
  • log_stream
  • log_data_metric_name
  • log_data_source

The very same config works on my local desktop, but not on an IoT ARM device. It's exactly the same behaviour as you described. On my local desktop I have the AWS CLI; on the IoT device I don't.

telegraf-tiger bot removed the "waiting for response" label Jul 17, 2023
@starek4

starek4 commented Jul 17, 2023

Hmm, I figured it out. Even though I had correctly set the AWS region in the config file via the region key, it only started to work after I also set the AWS_REGION env variable.

@colinbut

Can we re-open this issue please?

I agree with @starek4: unless I'm missing something, I also found that setting the region key doesn't seem to work; it only works when the AWS_REGION env variable is set.

If it is required to set the env variable, that defeats the purpose of having the region key in the conf file at all. I feel we should fix this so that the region key in the conf file works without needing to set the env variable.

@powersj
Contributor

powersj commented Aug 17, 2023

@colinbut @starek4,

Can you find what the last working version of telegraf was so we can look at what changed?

Looking at the AWS credentials.go, we set the region based on the value in the toml if no RoleARN was set.

powersj reopened this Aug 17, 2023
powersj added the "waiting for response" label Aug 17, 2023
@colinbut

@powersj,

I'm using the latest version with profile credentials. Not sure about @starek4: which credentials method are you using?

The error we're seeing appears to come from the AWS SDK for Go:

operation error CloudWatch Logs: DescribeLogGroups, failed to resolve service endpoint, an AWS region is required,

Regardless, I had a look at the lines of code you referenced.

Tbh, I don't really understand Go and I'm not proficient in it, but doesn't the following code read as "try to load options into the options variable and, if there is an error, execute configV2.WithRegion(c.Region)"?

...which, if my reading of the code is correct, means the region will never get set as long as loading the options succeeds.

options := []func(*configV2.LoadOptions) error{
		configV2.WithRegion(c.Region),
}

Shouldn't it be something like this instead?

options,err := configV2.LoadOptions
if err != nil {
//... 
}

options = append(options, configV2.WithRegion(c.Region))

telegraf-tiger bot removed the "waiting for response" label Aug 18, 2023
@starek4

starek4 commented Aug 19, 2023

I am using the access key and secret key combination, also on the latest version, just the armhf (arm32v7) build.

@powersj
Contributor

powersj commented Aug 19, 2023

> Shouldn't it be something like this instead?

No, this is a common pattern, and the snippet you quoted does not try to load anything: it is a slice literal whose first element is the option returned by configV2.WithRegion(c.Region). Essentially, the client (Telegraf) builds an array of options, such as connection timeouts, credentials to load, etc., and passes that entire array to the connect call. The AWS library then loads all of those options, sets up the necessary settings, and connects.

You can see examples of this on the AWS SDK for Go v2 Configuration page. There they explain that the region can be set in one of two ways: either as we do with WithRegion(), or via the environment variable.
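The functional-options pattern described above can be modeled with the standard library alone. In this sketch, `loadOptions`, `withRegion`, and `loadConfig` are hypothetical stand-ins for `configV2.LoadOptions`, `configV2.WithRegion`, and the SDK's config loader, and the fallback to `AWS_REGION` is a simplification of the SDK's real resolution logic. It shows that `withRegion(...)` sits in the slice unconditionally and only takes effect when the loader applies it, and that in this model an explicitly set region takes precedence over the environment variable.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// loadOptions is a hypothetical stand-in for configV2.LoadOptions.
type loadOptions struct{ Region string }

// config is a hypothetical stand-in for the loaded AWS config.
type config struct{ Region string }

// withRegion mirrors the shape of configV2.WithRegion: it returns a
// function that sets the region when the loader calls it.
func withRegion(region string) func(*loadOptions) error {
	return func(o *loadOptions) error {
		o.Region = region
		return nil
	}
}

// loadConfig applies every option in order, then falls back to the
// environment (read through getenv) only if no region was set explicitly.
func loadConfig(getenv func(string) string, optFns ...func(*loadOptions) error) (config, error) {
	var o loadOptions
	for _, fn := range optFns {
		if err := fn(&o); err != nil {
			return config{}, err
		}
	}
	region := o.Region
	if region == "" {
		region = getenv("AWS_REGION")
	}
	if region == "" {
		return config{}, errors.New("an AWS region is required, but was not found")
	}
	return config{Region: region}, nil
}

func main() {
	// Building the slice executes nothing; withRegion is just its first element.
	options := []func(*loadOptions) error{
		withRegion("us-east-2"),
	}
	cfg, err := loadConfig(os.Getenv, options...)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("region:", cfg.Region)
}
```

Passing `getenv` as a parameter keeps the model testable without touching the real process environment; `os.Getenv` has the matching `func(string) string` signature.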

On Monday, I'll build a custom version of telegraf and we can see what your platforms are reporting.

@powersj
Contributor

powersj commented Aug 21, 2023

@starek4, @colinbut,

I have put up PR #13803, which includes some additional logging. Can you please download the artifacts and reproduce the issue? Those artifacts should be up in 5-10 minutes.

Please provide the full logs. I would like to see what was set, when and where.

powersj added the "waiting for response" label Aug 21, 2023
@colinbut

@powersj

2023-08-26T09:13:07Z D! [agent] Connecting outputs
2023-08-26T09:13:07Z D! [agent] Attempting connection to [outputs.cloudwatch_logs]
linux
arm64
env AWS_REGION is ""
setting region to: eu-west-1
loading with root credentials
loaded config is using region: eu-west-1
2023-08-26T09:13:07Z E! [agent] Failed to connect to [outputs.cloudwatch_logs], retrying in 15s, error was "operation error CloudWatch Logs: DescribeLogGroups, failed to resolve service endpoint, an AWS region is required, but was not found"
linux
arm64
env AWS_REGION is ""
setting region to: eu-west-1
loading with root credentials
loaded config is using region: eu-west-1
2023-08-26T09:13:22Z E! [telegraf] Error running agent: connecting output outputs.cloudwatch_logs: error connecting to output "outputs.cloudwatch_logs": operation error CloudWatch Logs: DescribeLogGroups, failed to resolve service endpoint, an AWS region is required, but was not found
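The unprefixed lines in each connection attempt above (`linux`, `arm64`, `env AWS_REGION is ""`, ...) come from the extra logging in PR #13803. That PR's code is not shown in this thread, so the following Go sketch is only a guess at how the first three lines could be produced with the standard library: `runtime.GOOS`, `runtime.GOARCH`, and a `%q`-quoted environment lookup. `debugLines` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// debugLines is a hypothetical helper reproducing the shape of the extra
// debug output: OS, architecture, and the quoted AWS_REGION value.
func debugLines(goos, goarch, envRegion string) []string {
	return []string{
		goos,
		goarch,
		fmt.Sprintf("env AWS_REGION is %q", envRegion),
	}
}

func main() {
	for _, line := range debugLines(runtime.GOOS, runtime.GOARCH, os.Getenv("AWS_REGION")) {
		fmt.Println(line)
	}
}
```

Because `os.Getenv` returns an empty string for an unset variable, `env AWS_REGION is ""` means the variable was either unset or empty in Telegraf's process environment.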

telegraf-tiger bot removed the "waiting for response" label Aug 27, 2023
@powersj
Contributor

powersj commented Aug 28, 2023

@colinbut - is eu-west-1 the expected region? If so, that does in fact look like we are reading it correctly and setting the value.

Can you show the same example but with the AWS region set via the environment?

powersj added the "waiting for response" label Aug 28, 2023
@colinbut

@powersj

Yes, I'm using the eu-west-1 region.

Here's the log with region set as env var:

[root@ip-172-31-34-228 bin]# AWS_REGION=eu-west-1 ./telegraf --config ../../etc/telegraf/telegraf.conf
2023-08-28T14:57:04Z I! Loading config: ../../etc/telegraf/telegraf.conf
2023-08-28T14:57:04Z I! Starting Telegraf 1.28.0-938ed112
2023-08-28T14:57:04Z I! Available plugins: 239 inputs, 9 aggregators, 28 processors, 24 parsers, 59 outputs, 5 secret-stores
2023-08-28T14:57:04Z I! Loaded inputs: cpu disk diskio docker_log kernel mem processes swap system
2023-08-28T14:57:04Z I! Loaded aggregators:
2023-08-28T14:57:04Z I! Loaded processors:
2023-08-28T14:57:04Z I! Loaded secretstores:
2023-08-28T14:57:04Z I! Loaded outputs: cloudwatch_logs
2023-08-28T14:57:04Z I! Tags enabled: host=###
2023-08-28T14:57:04Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"###", Flush Interval:10s
2023-08-28T14:57:04Z D! [agent] Initializing plugins
2023-08-28T14:57:04Z D! [outputs.cloudwatch_logs] Log data: key "field", source "message"...
2023-08-28T14:57:04Z D! [outputs.cloudwatch_logs] Log stream "/aws/ec2/####"...
2023-08-28T14:57:04Z D! [agent] Connecting outputs
2023-08-28T14:57:04Z D! [agent] Attempting connection to [outputs.cloudwatch_logs]
linux
arm64
env AWS_REGION is "eu-west-1"
setting region to: eu-west-1
loading with root credentials
loaded config is using region: eu-west-1
2023-08-28T14:57:04Z D! [outputs.cloudwatch_logs] Found log group "/aws/ec2/####"
2023-08-28T14:57:04Z D! [agent] Successfully connected to outputs.cloudwatch_logs
2023-08-28T14:57:04Z D! [agent] Starting service inputs
2023-08-28T14:57:14Z D! [outputs.cloudwatch_logs] Processing metric 'docker_log map[com.docker.

*Note: I've masked certain data to protect my company's IP, but the logs clearly show that the outputs.cloudwatch_logs output plugin can connect.

telegraf-tiger bot removed the "waiting for response" label Aug 28, 2023
@powersj
Contributor

powersj commented Aug 28, 2023

I'm thinking this is worth an upstream issue then. Both scenarios appear to load a config using the correct region and you appear to have permissions for the region. Nothing appears to be obviously wrong when we make the request.

Can you file an issue upstream at https://github.com/aws/aws-sdk-go-v2/issues?

powersj added the "waiting for response" and "upstream" labels Aug 28, 2023
@colinbut

I can certainly try...

Agreed, it appears the AWS SDK for Go could be the problem if the code does what you expect.

I have raised an issue on aws-sdk-go-v2: aws/aws-sdk-go-v2#2260

telegraf-tiger bot removed the "waiting for response" label Aug 30, 2023
@colinbut

colinbut commented Sep 2, 2023

For anyone else encountering this problem, the workaround for the time being is to invoke telegraf with the AWS region supplied as an environment variable, e.g.:

AWS_REGION=eu-west-1 ./telegraf --config /etc/telegraf/telegraf.conf

Alternatively, if you manage Telegraf via systemd, you can add an Environment line to the default service file, e.g.:

[Unit]
Description=Telegraf
Documentation=https://github.com/influxdata/telegraf
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=-/etc/default/telegraf
User=telegraf
Environment="AWS_REGION=eu-west-1"
ExecStart=/usr/bin/telegraf -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d $TELEGRAF_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartForceExitStatus=SIGPIPE
KillMode=control-group
LimitMEMLOCK=8M:8M

[Install]
WantedBy=multi-user.target

@powersj
Contributor

powersj commented Sep 5, 2023

Thanks for raising the issue with them. I have responded on the thread as well.

powersj added a commit to powersj/telegraf that referenced this issue Sep 5, 2023
@powersj
Contributor

powersj commented Sep 5, 2023

@colinbut,

After learning more about the different AWS config loading options, I have put up another PR for you to try. Looking at #10841, it does look like the way credentials are loaded changed.

Could you give #13868 a try once artifacts are posted?

powersj added the "waiting for response" label Sep 5, 2023
@colinbut

colinbut commented Sep 6, 2023

@powersj

I now get a different error:

2023-09-06T09:45:23Z D! [outputs.cloudwatch_logs] Log data: key "field", source "message"...
2023-09-06T09:45:23Z D! [outputs.cloudwatch_logs] Log stream "log-group"...
2023-09-06T09:45:23Z D! [agent] Connecting outputs
2023-09-06T09:45:23Z D! [agent] Attempting connection to [outputs.cloudwatch_logs]
linux
arm64
env AWS_REGION is ""
setting region to: eu-west-1
loading with root credentials
loaded config is using region: eu-west-1
2023-09-06T09:45:26Z E! [agent] Failed to connect to [outputs.cloudwatch_logs], retrying in 15s, error was "operation error CloudWatch Logs: DescribeLogGroups, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"/\": unsupported protocol scheme \"\""
linux
arm64
env AWS_REGION is ""
setting region to: eu-west-1
loading with root credentials
loaded config is using region: eu-west-1
2023-09-06T09:45:44Z E! [telegraf] Error running agent: connecting output outputs.cloudwatch_logs: error connecting to output "outputs.cloudwatch_logs": operation error CloudWatch Logs: DescribeLogGroups, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "/": unsupported protocol scheme ""
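The tail of this new error, `Post "/": unsupported protocol scheme ""`, is the message Go's `net/http` client returns when the request URL is a bare path like `/` with no scheme. The following stdlib-only sketch reproduces that message locally without any network access. Reading it as a sign that the CloudWatch Logs endpoint resolved to an empty base URL is an inference, not something the logs confirm; `postToEmptyEndpoint` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// postToEmptyEndpoint sends a POST to the bare path "/", which is what a
// request ends up with if its base endpoint URL is empty (an assumption
// about this issue, used here only for illustration).
func postToEmptyEndpoint() error {
	resp, err := http.Post("/", "application/json", strings.NewReader("{}"))
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	// The transport rejects the empty scheme before any network I/O.
	fmt.Println(postToEmptyEndpoint())
	// Prints: Post "/": unsupported protocol scheme ""
}
```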

telegraf-tiger bot removed the "waiting for response" label Sep 6, 2023
@powersj
Contributor

powersj commented Sep 6, 2023

@colinbut

Two questions, as I'm not sure I follow the limited logs:

  1. Can you provide your config with any secrets removed?
  2. The log messages mention log data and log stream, which means it did connect, but then I see "Connecting outputs". Are the log messages from two different runs or a single run?

powersj added the "waiting for response" label Sep 6, 2023
@colinbut

colinbut commented Sep 7, 2023

@powersj

Just like last time, it is a single run.

I'm running Telegraf on an EC2 instance that has a Docker container running on it, so I'm using the docker_log input plugin to capture the container's stdout logs and send them to AWS CloudWatch via the cloudwatch_logs output plugin.

My telegraf config:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = "0s"
  debug = true
  hostname = ""
  omit_hostname = false

[[outputs.cloudwatch_logs]]
  region = "eu-west-1"
  profile = "######"
  log_group = "#######"
  log_stream = "log-group"
  log_data_metric_name = "docker_log"
  log_data_source = "field:message"

[[inputs.docker_log]]
  endpoint = "unix:///var/run/docker.sock"
  from_beginning = true
  timeout = "5s"
  source_tag = false

telegraf-tiger bot removed the "waiting for response" label Sep 7, 2023
powersj added a commit to powersj/telegraf that referenced this issue Sep 7, 2023
@colinbut

colinbut commented Mar 1, 2024

@powersj Hi, just wondering whether there is any update on this, since nearly 6 months have elapsed.

I can see the last activity is your commit adding debug logs to an open branch of yours?

@powersj
Contributor

powersj commented Mar 1, 2024

Hi,

I have no update, as I have not looked into this further. There was a branch that you were testing to help me understand what was going on, but I am not sure anything was learned from it. Additionally, when we asked upstream for help, their suggestion was for us to use a debugger, which means the next step is for me or someone else to reproduce this and walk through what may or may not be going on.

There is a workaround via the environment variable, which is nice, but I understand it is not ideal. I can add this back to my list, as I had forgotten about it, but it could be faster if someone with more knowledge of AWS looked into it as well.

4 participants