-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vh dcos #3159
Vh dcos #3159
Conversation
- added possible to configure timeout - added app metrics gathering
- app metrics special treatment
- testing of slaves filtering
- added possible to configure timeout - added app metrics gathering - added units into metric name
- handling "NaN" values - more generic handling of app metrics
- up-to-date output example
plugins/inputs/dcos/README.md
Outdated
# Base URL of DC/OS cluster, e.g. http://dcos.example.com | ||
cluster_url="" | ||
# Authentication token, obtained by running: dcos config show core.dcos_acs_token | ||
auth_token="" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was reading the http auth documentation and found this:
Authentication tokens expire after 5 days. You can view the expiration time in the “exp” (Expiration Time) Claim of the JSON Web Token (JWT). You can refresh your token by re-logging in to DC/OS.
Does this mean then that the user would need to update this token every 5 days?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I missed that. It doesn't seem there is a way how to programmatically obtain the token.
I can add support to turn off the authentication but I doubt someone would do that.
I will update code to be able to work from inside the cluster (different URLs), as Mesos is written to do it this way.
plugins/inputs/dcos/README.md
Outdated
> dcos_cpu,hostname=10.0.1.178,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S5,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5,scope=node,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,host=GAMGEE user_percent=0.17,system_percent=0.13,idle_percent=99.58,wait_percent=0.05,cores_count=4,total_percent=0.30000000000000004 1503487657000000000 | ||
> dcos_cpus,framework_principal=cassandra-principal,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S3,framework_id=84259197-1dda-44b1-934b-63fdf2ab1d37-0002,hostname=10.0.3.245,scope=container,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,executor_name=node-2_executor,executor_id=node-2_executor__171b3056-0eda-4eb5-ba55-65b4a5f82c84,host=GAMGEE,framework_name=cassandra,framework_role=cassandra-role,container_id=2ded88f3-ecce-4dad-9718-c472a65dde27,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5 user_time_seconds=270.73,system_time_seconds=158.24,limit_count=1,throttled_time_seconds=697.690003783 1503487657000000000 | ||
> dcos_mem,source=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,executor_id=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,HAPROXY_0_VHOST=oinker.acme.org,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,framework_principal=dcos_marathon,executor_name=Command\ Executor\ (Task:\ oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3)\ (Command:\ sh\ -c\ 'export\ OINKE...'),host=GAMGEE,container_id=aa3e66b2-b77e-42c9-9e5f-86e6b1cfd1fe,hostname=10.0.3.245,HAPROXY_GROUP=external,framework_name=marathon,framework_id=84259197-1dda-44b1-934b-63fdf2ab1d37-0001,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S3,framework_role=slave_public,scope=container total_bytes=310349824,limit_bytes=167772160 1503487657000000000 | ||
> dcos_disk,framework_name=marathon,framework_principal=dcos_marathon,scope=container,source=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,HAPROXY_GROUP=external,framework_role=slave_public,host=GAMGEE,container_id=aa3e66b2-b77e-42c9-9e5f-86e6b1cfd1fe,executor_id=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,HAPROXY_0_VHOST=oinker.acme.org,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5,framework_id=84259197-1dda-44b1-934b-63fdf2ab1d37-0001,hostname=10.0.3.245,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S3,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,executor_name=Command\ Executor\ (Task:\ oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3)\ (Command:\ sh\ -c\ 'export\ OINKE...') limit_bytes=0,used_bytes=0 1503487657000000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
executer_name here is very messy, maybe we should whitelist the datapoint tags we want to collect, basically just path and interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
plugins/inputs/dcos/dcos.go
Outdated
const AgentContainersUrl = "/system/v1/agent/%s/metrics/v0/containers" | ||
const AgentContainerMetricsUrl = "/system/v1/agent/%s/metrics/v0/containers/%s" | ||
const AgentContainerAppMetricsUrl = "/system/v1/agent/%s/metrics/v0/containers/%s/app" | ||
const AgentNodeMetricsUrl = "/system/v1/agent/%s/metrics/v0/node" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call these something like MesosMasterStateSummaryPathTemplate
Style nitpick, use const block:
const (
)
plugins/inputs/dcos/dcos.go
Outdated
const AgentNodeMetricsUrl = "/system/v1/agent/%s/metrics/v0/node" | ||
|
||
var tr *http.Transport | ||
var client *http.Client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make these part of the main struct, otherwise you won't be able to have different settings if you define the plugin more than once. You will only need to have a field for client since it keeps a reference to the transport.
plugins/inputs/dcos/dcos.go
Outdated
|
||
if client == nil { | ||
tr = &http.Transport{ | ||
ResponseHeaderTimeout: timeout - time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this is done, just having a client timeout is probably good enough.
I think we will need to support TLS connections to dcos cluster. config example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timeout: I saw this pattern with two timeouts in other plugins.
TLS: This is one of the TODOs. We haven't tested this with Enterprise DC/OS, default installation don't have TLS enabled by default.
Now it depends how the authentication problem will be solve. If plugin will not work from outside, this doesn't need to be done.
plugins/inputs/dcos/dcos.go
Outdated
# DC/OS agent node file system mount for which related metrics should be gathered. Leave empty for all. | ||
file_system_mounts = [] | ||
# DC/OS agent node network interface names for which related metrics should be gathered. Leave empty for all. | ||
network_interfaces = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename interface_include
plugins/inputs/dcos/dcos.go
Outdated
# List of DC/OS agent hostnames from which the metrics should be gathered. Leave empty for all. | ||
agents = [] | ||
# DC/OS agent node file system mount for which related metrics should be gathered. Leave empty for all. | ||
file_system_mounts = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename path_include
plugins/inputs/dcos/dcos.go
Outdated
# Authentication token, obtained by running: dcos config show core.dcos_acs_token | ||
auth_token="" | ||
# List of DC/OS agent hostnames from which the metrics should be gathered. Leave empty for all. | ||
agents = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename agent_include
, this is in case we need to extend this to have an exclude version later.
plugins/inputs/dcos/dcos.go
Outdated
} | ||
|
||
func (m *Dcos) Gather(acc telegraf.Accumulator) error { | ||
err := m.validateConfiguration() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should only be ran once ever.
plugins/inputs/dcos/dcos.go
Outdated
continue | ||
} | ||
var measurementSuffix string | ||
if metricType != App { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we go to great lengths to create the measurement suffix, it might be cleaner to just leave the suffix as the field prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is done based on our previous design decision to 'group' metrics per feature.
- added support for gathering metric using internal URLs - changed names of filtering configuration - improved code and comments - improved HTTP connection pooling
- added support for TLS connection - added stream JSON decoding Minor improvements: - client_timeout renamed to more convenient response_timeout - added comments for main struct fields
- configuration validation returns error on each problem separately Minor improvement: - string values are silently skipped
Thanks a lot for the code review! |
@nhaugo can we close this PR? |
# Public URL of DC/OS cluster, e.g. http://dcos.example.com. Use of access from outside of the DC/OS cluster. master_hostname has higher priority, if set | ||
#cluster_url="" | ||
# Authentication token, obtained by running: dcos config show core.dcos_acs_token. Leave empty for no authentication. | ||
# Warning: authentication token is valid only 5 days in DC/OS 1.10. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify my understanding, if using master_hostname
then the auth_token
is not required, and if using cluster_url
then you would need to update this every 5 days?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using cluster_url auth_token should set, cause it by default required, and yes, it will have to be updated each 5 days.
Because of that I chose master_hostname as default, and left cluster_url for remote connection as second option.
Main reason is that I don't know how the plugin will be used.
For sure, updating telegraf configuration each 5 days, because of invalid token, seems not usable, I can leave it in code and remove it from configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, don't change it for now and I will figure out how we want to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I found the POST /auth/login endpoint in enterprise DC/OS (might need to scroll to the bottom). It seems like it could be used, or perhaps another endpoint that is specific to the enterprise version, but without a copy it will be difficult to figure out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. As I hadn't available Enterprise DC/OS I didn't checked its API.
So, for Enterprise DC/OS we can have configuration for the service token and we can perform login. But for free DC/OS, this will not work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need an Enterprise DC/OS before we can implement any of it's login methods. I did determine how to create a jwt without expiration.
I think the required auth_token_secret
is too sensitive to place in the telegraf config but we could document how to manually create the jwt and that could be added to the configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, who can arrange a trial access to an Enterprise DC/OS instance? Should I ask Tim?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way how to create new authentication token with prolonged expiration looks as a hack, depending on internals of DC/OS.
Are we sure, that such stuff should be documented or referenced in an official open-source application?
plugins/inputs/dcos/README.md
Outdated
# HTTP Response timeout, value must be more than a second | ||
#response_timeout = 30s | ||
# Set of default allowed tags. See readme.md for more tag keys. | ||
taginclude = ["cluster_url","path","interface","hostname","container_id","mesos_id","framework_name"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't add this to the example config since it is a global filter option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, you mean to remove it from the readme, but leave it in the plugin default config, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, remove it from both places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused. In a previous comment you suggested to filter out some ugly tags.
Did you mentioned that just for the plugin output example, not as an default option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find what I said about this, maybe you can paste a quote of it? We shouldn't have the taginclude option in the readme or default config since this is only something the plugin user would use for special filtering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh. We have a similar situation in the docker input which was setup to add labels as tags. This worked okay since the user generally sets the labels, until kubernetes starting setting it's own labels which resulted in many useless labels. Eventually we added the docker_label_include/docker_label_exclude filters specifically for labels.
It seems to me that some of these container tags are not very useful, and so I think we should only collect specific tags that we know are going to be helpful. In my opinion it is better to add too little than too much. While the taginclude
option could be used by the plugin user to further restrict the tags, I'd like to have the plugin behave well without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that the tag white-list should be hardcoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would be fine if they are hardcoded, any tags we add need to be explicit and purposeful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I will implement it this way: Current tags in taginclude will be filtered in the code, however, if taginclude will be specified by user, hardcoded filter will be skipped, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, have the hardcoded filter be independent of the taginclude, you won't actually have access to it from within the plugin anyway.
- hardcoded filtering of metrics
Input plugin for gathering DC/OS agent metrics