Zipkin host collector (agent) #1778

Closed
codefromthecrypt opened this issue Nov 6, 2017 · 20 comments

@codefromthecrypt
Member

Especially because of limited runtimes like PHP, we should consider a host agent project. While we have a lot of tooling in Java (e.g. benchmarked codecs, the ability to estimate size, etc.), running another JVM as a sidecar in VMs and containers might be a difficult sell. Anything lightweight is good. Go could be preferable, since managing dependencies is easier.

Options include rolling our own, hopefully leveraging libraries from zipkin-go, or layering on a future open sourcing of the AWS X-Ray agent. Meanwhile, we should make sure at least 3 parties sign up to help maintain this, as there's significant long-term effort involved. We don't want to build interest in an agent and then drop it after people are using it.

cc @openzipkin/core

@basvanbeek
Member

I would be interested in being one of those parties moving this forward using Go, optionally taking pieces from the Zipkin-Go package.

Regardless of which language/ecosystem we end up with, I think we could already start specifying our requirements, wishlists, scope, etc.

@jcchavezs
Contributor

I am also interested in this. Go is a good option IMO, both for portability and for performance. I guess the outcome of this issue might be a list of requirements for the agent.

@connectwithnara

Brave is sufficient for services implemented in Java. However, for JS and other runtimes it may become tedious to keep reimplementing common aspects of the tracing client, like compression, batching and posting data to the backend. These aspects can be implemented in the agent instead. The sampling decision, though, should be taken in the library.

We should also think about compatibility between the tracing library version and the agent, because there will be scenarios where the agent is not updated but the client library is, and so on. This also means that we should avoid tight coupling between them.

We should probably start listing out what we want to implement in the library and the agent.
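
As a rough illustration of the split described above, here is a minimal Python sketch of the work an agent could take over from a client library: serializing a batch of spans, compressing it, and preparing the POST to the backend. It assumes the backend is a Zipkin server exposing its v2 HTTP endpoint (`POST /api/v2/spans`) on the default port 9411 and accepting gzip-encoded bodies. `build_span_post` is a hypothetical name, and the sampling decision is assumed to have been made in the library before the spans arrive.

```python
import gzip
import json
import urllib.request


def build_span_post(spans, base_url="http://localhost:9411"):
    """Build (but do not send) a gzip-compressed POST carrying a batch of spans.

    `spans` is a list of dicts in Zipkin v2 JSON form. The base URL is an
    assumption; point it at wherever the collector actually listens.
    """
    # One JSON array per request, so many spans share a single HTTP call.
    body = gzip.compress(json.dumps(spans).encode("utf-8"))
    return urllib.request.Request(
        url=base_url + "/api/v2/spans",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",
        },
        method="POST",
    )
```

Sending is then a matter of passing the result to `urllib.request.urlopen`. Keeping the request construction separate from the send makes it possible to check the payload without a running backend.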

@codefromthecrypt
Member Author

codefromthecrypt commented Nov 7, 2017 via email

@codefromthecrypt
Member Author

codefromthecrypt commented Nov 7, 2017 via email

@jcchavezs
Contributor

I just started working on an agent. Have a look at https://github.com/jcchavezs/zipkin-agent

@hexchain

hexchain commented Nov 18, 2017

I wrote an agent like this several months ago, mainly because we have lots of microservices written in Python and running in small containers, which connect to Kafka directly to send spans. And that's just too many connections for Kafka brokers.

All it does is receive spans on a UDP port and then send them in batches to Kafka.
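
The agent itself is not shown in this thread, so the following is only a sketch of the general shape of such a process, under stated assumptions: each UDP datagram carries exactly one JSON-encoded span, port 9999 is an arbitrary choice, and `send` is a placeholder callable standing in for whatever Kafka producer is used. `SpanBatcher` and `serve_udp` are hypothetical names.

```python
import json
import socket
import time


class SpanBatcher:
    """Collects spans and hands them to `send` as one JSON array per batch."""

    def __init__(self, send, max_spans=100, max_age_seconds=1.0, clock=time.monotonic):
        self._send = send  # placeholder for e.g. a Kafka producer's send call
        self._max_spans = max_spans
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending = []
        self._oldest = None

    def add(self, span):
        if not self._pending:
            self._oldest = self._clock()  # age is measured from the first span
        self._pending.append(span)
        self.flush_if_due()

    def flush_if_due(self):
        if not self._pending:
            return
        full = len(self._pending) >= self._max_spans
        stale = self._clock() - self._oldest >= self._max_age
        if full or stale:
            self.flush()

    def flush(self):
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(json.dumps(batch).encode("utf-8"))


def serve_udp(batcher, host="127.0.0.1", port=9999, poll_seconds=0.2):
    """Receive one JSON span per datagram and feed it to the batcher (runs forever)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((host, port))
        sock.settimeout(poll_seconds)  # wake up regularly to flush stale batches
        while True:
            try:
                datagram, _ = sock.recvfrom(65535)
            except socket.timeout:
                batcher.flush_if_due()
                continue
            try:
                span = json.loads(datagram)
            except ValueError:
                continue  # drop datagrams that are not valid JSON
            batcher.add(span)
```

With this shape, each container keeps only a local UDP socket, and the agent holds the single connection to the brokers, which is the connection-count problem described above.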

@eirslett
Contributor

eirslett commented Nov 18, 2017

There's already an existing open source project you can use for this, fluentd. I used it with Scribe (before Zipkin went to kafka town), it works pretty well. The main project is written in Ruby, but parts of it have been ported to Go. We might need to write some documentation about how you can use it with Zipkin efficiently.

@hexchain

@eirslett It seems that fluentd cannot combine multiple JSON documents into a list and send the list in one message. Reducing the number of Kafka messages is important to me.

@codefromthecrypt
Member Author

codefromthecrypt commented Nov 19, 2017 via email

@hexchain

We do have such a mechanism in our tracing library, but I still prefer to have this kind of batching in the agent, simply because the agent receives spans from all containers on one host, so batching can be done more efficiently.
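
To make the host-level batching point concrete, here is a small sketch of packing already-encoded JSON spans, regardless of which container produced them, into as few messages as possible under a byte limit. The limit and the name `pack_spans` are assumptions for illustration; the real limit would come from the broker or transport configuration.

```python
def pack_spans(encoded_spans, max_bytes):
    """Pack JSON-encoded spans (bytes) into JSON-array messages of at most max_bytes.

    A single span larger than the limit still gets a message of its own,
    so the caller has to decide whether to drop or truncate such spans.
    """
    messages = []
    current = []
    current_size = 2  # the enclosing "[" and "]"
    for span in encoded_spans:
        added = len(span) + (1 if current else 0)  # +1 for the separating comma
        if current and current_size + added > max_bytes:
            messages.append(b"[" + b",".join(current) + b"]")
            current, current_size = [], 2
            added = len(span)
        current.append(span)
        current_size += added
    if current:
        messages.append(b"[" + b",".join(current) + b"]")
    return messages
```

Because the spans are concatenated as bytes rather than parsed and re-serialized, the agent does not need to understand the span schema to batch them.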

@codefromthecrypt
Member Author

codefromthecrypt commented Nov 20, 2017 via email

@jcchavezs
Contributor

@hexchain regarding the combination of JSON and Kafka, @stakhiv has an idea on how to make it work without added overhead.

@devinsba
Member

devinsba commented May 9, 2019

I've been thinking about this one a lot. With the change to Armeria we could get a pretty small server image with just the HTTP collector and a passthrough storage shim. I'm wondering if there is still an appetite for this. I know we don't particularly want to go down the route of reimplementing all of this in Go, since that would duplicate work.
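
To show what "HTTP collector plus passthrough shim" means in the smallest terms, here is a Python stand-in using only the standard library. It is not the Armeria-based server discussed here, just a sketch of the same shape: accept `POST /api/v2/spans` and hand the untouched body to a `forward` callable, which is a placeholder for a real reporter (Kafka, Kinesis, SQS, and so on). `make_handler` and `make_server` are hypothetical names, and the 202 response mirrors what a Zipkin collector returns for accepted spans.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(forward):
    """Create a request handler that passes span payloads to `forward` unchanged."""

    class PassthroughHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/api/v2/spans":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            forward(body)  # no parsing or storage: the body is passed through as-is
            self.send_response(202)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep the sketch quiet; a real agent would log and emit metrics

    return PassthroughHandler


def make_server(forward, host="127.0.0.1", port=9411):
    """Build the server; call serve_forever() on the result to start it."""
    return ThreadingHTTPServer((host, port), make_handler(forward))
```

What this leaves out (codecs, size estimation, metrics, backpressure) is the part that already exists in the Java server, which is the argument for reusing it rather than rewriting it.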

Thoughts?

@anuraaga
Contributor

anuraaga commented May 9, 2019

Are you thinking of using Graal to remove the JVM? I think even with Armeria it would still be a fairly heavyweight sidecar unless using Graal. But then it'd be quite small (I have toyed with Graal and failed to create a server that supports TLS, but a sidecar wouldn't need it).

I wonder, though, if people can just use Envoy as a Zipkin sidecar.

@yurishkuro
Contributor

Have you folks considered the OpenCensus agent?

@jcchavezs
Contributor

jcchavezs commented May 9, 2019 via email

@devinsba
Member

devinsba commented May 9, 2019

So for me the answer to both "maybe try the OpenCensus agent" and "write it in something other than Java" is the same: if we do that, we lose the ability to leverage the fairly extensive library of reporters that are already written in Java. In my mind the power here would be in allowing the languages that do not support them to report spans over the ubiquitous HTTP, which would end up in Kafka/Kinesis/SQS/(whatever the future holds).

If we do feel that another language suits this better (smaller binary, faster startup, whatever), then we would want to write a compatibility test suite that could be run against both Zipkin and the separate application, to validate that both apps behave the same way for the same sets of input and follow the specifications of the API. Incidentally, this would allow third parties to also validate their implementations.
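
A very small sketch of what the core of such a compatibility suite could look like: post the same payloads to a reference Zipkin server and to the candidate application, and report every case where the responses differ. The function names, the case format, and the choice to compare only HTTP status codes are assumptions for illustration. A real suite would also read the spans back and compare what was stored.

```python
import urllib.error
import urllib.request


def post_spans(base_url, body):
    """POST a raw JSON body to the v2 span endpoint and return the HTTP status."""
    request = urllib.request.Request(
        base_url + "/api/v2/spans",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx are results to compare, not failures of the suite


def compare_collectors(reference_url, candidate_url, cases, post=post_spans):
    """Return (case name, reference status, candidate status) for each mismatch.

    `cases` maps a case name to a request body (bytes). `post` is injectable
    so the comparison logic can be exercised without a network.
    """
    mismatches = []
    for name, body in cases.items():
        expected = post(reference_url, body)
        actual = post(candidate_url, body)
        if expected != actual:
            mismatches.append((name, expected, actual))
    return mismatches
```

An empty result means the candidate agreed with the reference on every case. Including malformed payloads among the cases checks that both sides reject the same input, not just accept the same input.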

@codefromthecrypt
Member Author

I think this issue is pie in the sky and also wouldn't affect the codebase in this repo.

I think people close to the codebase here will know there's extensive work involved beyond ticking boxes: for example, how and which logs are written, which metrics are emitted, what can be supported or extended, how many people can help and how close they are to the project. I won't troll by citing numerous examples of projects either not prioritizing things like format parity or data size, or being abandoned. Suffice to say, neither a 3rd party nor a 1st party clone is going to replace this server. If someone wants to (as they always could have), they can write a contrib proxy, make it popular etc., or help other proxies like pitchfork or census.

Meanwhile, we largely undersell our own server. While we are focused on a lot of things, we've updated this to literally use the same infra as those who left Twitter with the experience of the first attempt (Finagle -> Armeria). We also have numerous works in progress to reduce memory overhead per request and to address things like rate limiting. Duplicating all of this in a new language for the sake of it is expensive. Again, folks can, but personally I see no advantage in intentionally not improving our server, especially after all the investments we've made.

So, basically I agree with @devinsba and @anuraaga: if there's a concrete concern about which JVM should be used, we can address that in the Docker image. If there are overhead improvements to make, there's nothing to stop them happening here. If someone wants to experiment with another agent, there are places to do that, including 3rd party repos, personal repos and contrib.

Meanwhile, this repo is in a different org now, Apache. If we did anything else, it would either not be in this org or be a new incubator entry. Suffice to say this issue is out-of-date, even if insightful, so closing.

Thanks to all for the feedback!