Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vh dcos #3159

Closed
wants to merge 30 commits into from
Closed

Vh dcos #3159

wants to merge 30 commits into from

Conversation

vlastahajek
Copy link
Contributor

@vlastahajek vlastahajek commented Aug 23, 2017

Input plugin for gathering DC/OS agent metrics

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

@danielnelson danielnelson added this to the 1.5.0 milestone Aug 23, 2017
@danielnelson danielnelson added the feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin label Aug 23, 2017
# Base URL of DC/OS cluster, e.g. http://dcos.example.com
cluster_url=""
# Authentication token, obtained by running: dcos config show core.dcos_acs_token
auth_token=""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was reading the http auth documentation and found this:

Authentication tokens expire after 5 days. You can view the expiration time in the “exp” (Expiration Time) Claim of the JSON Web Token (JWT). You can refresh your token by re-logging in to DC/OS.

Does this mean then that the user would need to update this token every 5 days?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I missed that. It doesn't seem there is a way how to programmatically obtain the token.
I can add support to turn off the authentication but I doubt someone would do that.
I will update code to be able to work from inside the cluster (different URLs), as Mesos is written to do it this way.

> dcos_cpu,hostname=10.0.1.178,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S5,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5,scope=node,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,host=GAMGEE user_percent=0.17,system_percent=0.13,idle_percent=99.58,wait_percent=0.05,cores_count=4,total_percent=0.30000000000000004 1503487657000000000
> dcos_cpus,framework_principal=cassandra-principal,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S3,framework_id=84259197-1dda-44b1-934b-63fdf2ab1d37-0002,hostname=10.0.3.245,scope=container,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,executor_name=node-2_executor,executor_id=node-2_executor__171b3056-0eda-4eb5-ba55-65b4a5f82c84,host=GAMGEE,framework_name=cassandra,framework_role=cassandra-role,container_id=2ded88f3-ecce-4dad-9718-c472a65dde27,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5 user_time_seconds=270.73,system_time_seconds=158.24,limit_count=1,throttled_time_seconds=697.690003783 1503487657000000000
> dcos_mem,source=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,executor_id=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,HAPROXY_0_VHOST=oinker.acme.org,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,framework_principal=dcos_marathon,executor_name=Command\ Executor\ (Task:\ oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3)\ (Command:\ sh\ -c\ 'export\ OINKE...'),host=GAMGEE,container_id=aa3e66b2-b77e-42c9-9e5f-86e6b1cfd1fe,hostname=10.0.3.245,HAPROXY_GROUP=external,framework_name=marathon,framework_id=84259197-1dda-44b1-934b-63fdf2ab1d37-0001,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S3,framework_role=slave_public,scope=container total_bytes=310349824,limit_bytes=167772160 1503487657000000000
> dcos_disk,framework_name=marathon,framework_principal=dcos_marathon,scope=container,source=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,HAPROXY_GROUP=external,framework_role=slave_public,host=GAMGEE,container_id=aa3e66b2-b77e-42c9-9e5f-86e6b1cfd1fe,executor_id=oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3,HAPROXY_0_VHOST=oinker.acme.org,cluster_id=5e3d4c26-6bc0-4ad3-8142-d104376ed2b5,framework_id=84259197-1dda-44b1-934b-63fdf2ab1d37-0001,hostname=10.0.3.245,mesos_id=84259197-1dda-44b1-934b-63fdf2ab1d37-S3,cluster_url=http://dcos-elasticloadba-ivcirgsfmba4-1406858042.us-west-1.elb.amazonaws.com,executor_name=Command\ Executor\ (Task:\ oinker.9bbc8ad4-87e3-11e7-86f2-b2ce0138a0e3)\ (Command:\ sh\ -c\ 'export\ OINKE...') limit_bytes=0,used_bytes=0 1503487657000000000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executer_name here is very messy, maybe we should whitelist the datapoint tags we want to collect, basically just path and interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

const AgentContainersUrl = "/system/v1/agent/%s/metrics/v0/containers"
const AgentContainerMetricsUrl = "/system/v1/agent/%s/metrics/v0/containers/%s"
const AgentContainerAppMetricsUrl = "/system/v1/agent/%s/metrics/v0/containers/%s/app"
const AgentNodeMetricsUrl = "/system/v1/agent/%s/metrics/v0/node"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call these something like MesosMasterStateSummaryPathTemplate

Style nitpick, use const block:

const (
)

const AgentNodeMetricsUrl = "/system/v1/agent/%s/metrics/v0/node"

var tr *http.Transport
var client *http.Client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make these part of the main struct, otherwise you won't be able to have different settings if you define the plugin more than once. You will only need to have a field for client since it keeps a reference to the transport.


if client == nil {
tr = &http.Transport{
ResponseHeaderTimeout: timeout - time.Second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this is done, just having a client timeout is probably good enough.

I think we will need to support TLS connections to dcos cluster. config example

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout: I saw this pattern with two timeouts in other plugins.

TLS: This is one of the TODOs. We haven't tested this with Enterprise DC/OS, default installation don't have TLS enabled by default.
Now it depends how the authentication problem will be solve. If plugin will not work from outside, this doesn't need to be done.

# DC/OS agent node file system mount for which related metrics should be gathered. Leave empty for all.
file_system_mounts = []
# DC/OS agent node network interface names for which related metrics should be gathered. Leave empty for all.
network_interfaces = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename interface_include

# List of DC/OS agent hostnames from which the metrics should be gathered. Leave empty for all.
agents = []
# DC/OS agent node file system mount for which related metrics should be gathered. Leave empty for all.
file_system_mounts = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename path_include

# Authentication token, obtained by running: dcos config show core.dcos_acs_token
auth_token=""
# List of DC/OS agent hostnames from which the metrics should be gathered. Leave empty for all.
agents = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename agent_include, this is in case we need to extend this to have an exclude version later.

}

func (m *Dcos) Gather(acc telegraf.Accumulator) error {
err := m.validateConfiguration()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only be ran once ever.

continue
}
var measurementSuffix string
if metricType != App {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we go to great lengths to create the measurement suffix, it might be cleaner to just leave the suffix as the field prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done based on our previous design decision to 'group' metrics per feature.

- added support for gathering metric using internal URLs
- changed names of filtering configuration
- improved code and comments
- improved HTTP connection pooling
 - added support for TLS connection
 - added stream JSON decoding
Minor improvements:
 - client_timeout renamed to more convenient response_timeout
 - added comments for main struct fields
 - configuration validation returns error on each problem separately
Minor improvement:
 - string values are silently skipped
@vlastahajek
Copy link
Contributor Author

Thanks a lot for the code review!

@ivankudibal
Copy link

@nhaugo can we close this PR?

# Public URL of DC/OS cluster, e.g. http://dcos.example.com. Use of access from outside of the DC/OS cluster. master_hostname has higher priority, if set
#cluster_url=""
# Authentication token, obtained by running: dcos config show core.dcos_acs_token. Leave empty for no authentication.
# Warning: authentication token is valid only 5 days in DC/OS 1.10.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify my understanding, if using master_hostname then the auth_token is not required, and if using cluster_url then you would need to update this every 5 days?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using cluster_url auth_token should set, cause it by default required, and yes, it will have to be updated each 5 days.
Because of that I chose master_hostname as default, and left cluster_url for remote connection as second option.
Main reason is that I don't know how the plugin will be used.

For sure, updating telegraf configuration each 5 days, because of invalid token, seems not usable, I can leave it in code and remove it from configuration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, don't change it for now and I will figure out how we want to handle it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I found the POST /auth/login endpoint in enterprise DC/OS (might need to scroll to the bottom). It seems like it could be used, or perhaps another endpoint that is specific to the enterprise version, but without a copy it will be difficult to figure out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. As I hadn't available Enterprise DC/OS I didn't checked its API.

So, for Enterprise DC/OS we can have configuration for the service token and we can perform login. But for free DC/OS, this will not work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need an Enterprise DC/OS before we can implement any of it's login methods. I did determine how to create a jwt without expiration.

I think the required auth_token_secret is too sensitive to place in the telegraf config but we could document how to manually create the jwt and that could be added to the configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, who can arrange a trial access to an Enterprise DC/OS instance? Should I ask Tim?

Copy link
Contributor Author

@vlastahajek vlastahajek Oct 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way how to create new authentication token with prolonged expiration looks as a hack, depending on internals of DC/OS.

Are we sure, that such stuff should be documented or referenced in an official open-source application?

# HTTP Response timeout, value must be more than a second
#response_timeout = 30s
# Set of default allowed tags. See readme.md for more tag keys.
taginclude = ["cluster_url","path","interface","hostname","container_id","mesos_id","framework_name"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add this to the example config since it is a global filter option.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, you mean to remove it from the readme, but leave it in the plugin default config, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, remove it from both places

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused. In a previous comment you suggested to filter out some ugly tags.

Did you mentioned that just for the plugin output example, not as an default option?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find what I said about this, maybe you can paste a quote of it? We shouldn't have the taginclude option in the readme or default config since this is only something the plugin user would use for special filtering.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh. We have a similar situation in the docker input which was setup to add labels as tags. This worked okay since the user generally sets the labels, until kubernetes starting setting it's own labels which resulted in many useless labels. Eventually we added the docker_label_include/docker_label_exclude filters specifically for labels.

It seems to me that some of these container tags are not very useful, and so I think we should only collect specific tags that we know are going to be helpful. In my opinion it is better to add too little than too much. While the taginclude option could be used by the plugin user to further restrict the tags, I'd like to have the plugin behave well without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that the tag white-list should be hardcoded?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would be fine if they are hardcoded, any tags we add need to be explicit and purposeful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will implement it this way: Current tags in taginclude will be filtered in the code, however, if taginclude will be specified by user, hardcoded filter will be skipped, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, have the hardcoded filter be independent of the taginclude, you won't actually have access to it from within the plugin anyway.

 - hardcoded filtering of metrics
@danielnelson
Copy link
Contributor

#3519

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin new plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants