carbon-relay-ng dies consistently #146

Closed
KlavsKlavsen opened this issue Dec 12, 2016 · 19 comments

@KlavsKlavsen

I've just inserted carbon-relay-ng in front of my good old carbon backend - to relay counters to 2x InfluxDB as well.

It has now died / exited twice, after 1 to 2 days' time - without logging anything of relevance :(

It's running on CentOS 6.

Config is:

instance = "default"
max_procs = 2
listen_addr = "0.0.0.0:2003"
admin_addr = "0.0.0.0:3024"
http_addr = "0.0.0.0:8081"
spool_dir = "/var/spool/carbon-relay-ng"
carbon-relay-ng.pid"
pid_file = "carbon-relay-ng.pid"
log_level = "notice"
bad_metrics_max_age = "24h"
allowed
validation_level_legacy = "medium"
validation_level_m20 = "medium"
validate_order = false
init = [
'addRoute sendAllMatch default-route 172.16.62.47:2013 spool=true pickle=false 172.16.62.49:2013 spool=true pickle=false 172.16.62.46:2013 spool=true pickle=false'
]

[instrumentation]
graphite_addr = "" # localhost:2003 (how about feeding back into the relay itself? :)
graphite_interval = 1000 # in ms

Any ideas what it could be, or what I could try? (besides wrapping it in a restart-script)

@Dieterbe
Contributor

Which version is this? git master just received a bunch of refactorings which have only had limited testing so far (e.g. basic functionality tested over a timespan of minutes, not days).

@KlavsKlavsen
Author

I just built it from master - carbon-relay-ng-0.7_54_g78c140e

@nokernel

Is the line with the PID a bad copy-paste?

@Dieterbe
Contributor

@nokernel to be clear, you mean @KlavsKlavsen's config, right? Yeah, that's weird. There's also a stray "allowed" in there?

@KlavsKlavsen
Author

Bad paste.. here's the full config:

instance = "default"
max_procs = 2
listen_addr = "0.0.0.0:2003"
admin_addr = "0.0.0.0:3024"
http_addr = "0.0.0.0:8081"
spool_dir = "/var/spool/carbon-relay-ng"
#spool_dir = "spool"
#pid_file = "/var/run/carbon-relay-ng.pid"
pid_file = "carbon-relay-ng.pid"
#one of critical error warning notice info debug
log_level = "notice"
# How long to keep track of invalid metrics seen
# Useful time units are "s", "m", "h"
bad_metrics_max_age = "24h"
# Metric name validation strictness for legacy metrics. Valid values are:
# strict - Block anything that can upset graphite: valid characters are [A-Za-z0-9_-.]; consecutive dots are not allowed
# medium - Valid characters are ASCII; no embedded NULLs
# none   - No validation is performed
validation_level_legacy = "medium"
# Metric validation for carbon2.0 (metrics2.0) metrics.
# Metrics that contain = or _is_ are assumed carbon2.0.
# Valid values are:
# medium - checks for unit and mtype tag, presence of another tag, and consistency (use = or _is_, not both)
# none   - No validation is performed
validation_level_m20 = "medium"

# you can also validate that each series has increasing timestamps
validate_order = false

# put init commands here, in the same format as you'd use for the telnet interface
# here's some examples:
#init = [
#     'addBlack prefix collectd.localhost',  # ignore hosts that don't set their hostname properly (implicit substring match).
#     'addBlack regex ^foo\..*\.cpu+', # ignore foo..cpu.... (regex pattern match)
#     'addAgg sum ^stats\.timers\.(app|proxy|static)[0-9]+\.requests\.(.*) stats.timers._sum_$1.requests.$2 10 20',
#     'addAgg avg ^stats\.timers\.(app|proxy|static)[0-9]+\.requests\.(.*) stats.timers._avg_$1.requests.$2 5 10',
#     'addRoute sendAllMatch carbon-default  127.0.0.1:2005 spool=true pickle=false',
#     'addRoute sendAllMatch carbon-tagger sub==  127.0.0.1:2006',  # all metrics with '=' in them are metrics2.0 format for tagger
#     'addRoute sendFirstMatch analytics regex=(Err/s|wait_time|logger)  graphite.prod:2003 prefix=prod. spool=true pickle=true  graphite.staging:2003 prefix=staging. spool=true pickle=true'
#]
init = [
     'addRoute sendAllMatch default-route  172.16.62.47:2013 spool=true pickle=false  172.16.62.49:2013 spool=true pickle=false  172.16.62.46:2013 spool=true pickle=false'
]

[instrumentation]
# in addition to serving internal metrics via expvar, you can optionally send em to graphite
graphite_addr = ""  # localhost:2003 (how about feeding back into the relay itself? :)
graphite_interval = 1000  # in ms

@Dieterbe
Contributor

Try putting it in the most verbose log mode.
Are you sure it doesn't log anything or print to stdout/stderr?
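A minimal sketch of that change, using the most verbose level from the list in the config's own comment (one of critical, error, warning, notice, info, debug):

# most verbose mode; expect a lot of extra output at this traffic volume
log_level = "debug"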

@KlavsKlavsen
Author

It logs fine on start - and then runs for a day or more.
I capture output with nohup.. and the last lines after the first startup are:
08:36:53.201231 ▶ NOTI listening on 0.0.0.0:2003/tcp
08:36:53.201290 ▶ NOTI listening on 0.0.0.0:2003/udp
08:36:53.204124 ▶ NOTI admin TCP listener starting on 0.0.0.0:3024
08:36:53.206720 ▶ NOTI dest 192.168.32.47:2013 updating conn. online: true
08:36:53.211707 ▶ NOTI dest 192.168.32.46:2013 updating conn. online: true
08:36:53.211877 ▶ NOTI admin HTTP listener starting on 0.0.0.0:8081
08:36:53.212035 ▶ NOTI dest 172.16.62.49:2013 updating conn. online: true

and then nothing. It gets a lot of traffic (300 hosts * ~1500 average counters per host)

@KlavsKlavsen
Author

I see no memory leaks (no increase in memory usage, at least).

@KlavsKlavsen
Author

How much logging (and I/O) will that amount of traffic generate in your proposed logging mode over 1-2 days?

@KlavsKlavsen
Author

Ahh - after I wrapped the process in a bash script (that does a while : loop, restarting it.. again and again :) my nohup output catches more info..
Out of memory: Kill process 20721 (carbon-relay-ng) score 779 or sacrifice child..

so the OOM killer apparently figured it was using too much memory.
25283 root 20 0 705m 243m 3968 S 17.0 1.5 2:18.56 carbon-relay-ng

current footprint. I'll monitor it to see how that changes.
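For reference, a minimal sketch of the kind of while-loop restart wrapper described above; the binary, config and log paths here are illustrative, not taken from this thread:

#!/bin/bash
# keep restarting the relay whenever it exits, and keep its output so crashes leave a trace
while :; do
    /usr/bin/carbon-relay-ng /etc/carbon-relay-ng.ini >> /var/log/carbon-relay-ng.out 2>&1
    echo "$(date) carbon-relay-ng exited with status $?, restarting in 5s" >> /var/log/carbon-relay-ng.out
    sleep 5
done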

@KlavsKlavsen
Author

it does rise rather quickly in memory usage..
25283 root 20 0 1123m 662m 3976 S 11.3 4.2 7:42.81 carbon-relay-ng

Could it be the aggregation/spooling of counters it can't get to a receiver, or something? There are no errors in the logs though.

@KlavsKlavsen
Author

it seems to spool to file though - so that should not be an issue.

@Dieterbe
Contributor

Dieterbe commented Dec 13, 2016

Your setup looks sane. Am I reading this right that it got killed after consuming 705 MB of virtual memory? And now you pasted a ps output where it consumes 1.1 GB, which is indeed more than it should be, though at least not super crazy.

Anyway, once it consumes a good amount of memory (let's say >600 MB),
can you wget http://localhost:8081/debug/pprof/heap, send me the heap and a copy of your binary, and I'll see what's consuming the memory? (Upload it somewhere or email it to me; my domain is raintank.io and the piece before the at is dieter.)

(if you're interested in doing it yourself, see https://blog.golang.org/profiling-go-programs )
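For anyone following along, the steps being asked for look roughly like this; the binary path is whatever you actually deployed:

# grab a heap profile from the relay's pprof HTTP endpoint (same port as http_addr)
wget http://localhost:8081/debug/pprof/heap
# inspect it against the exact binary that was running when the profile was taken
go tool pprof /path/to/carbon-relay-ng heap
(pprof) top30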

@KlavsKlavsen
Author

It's now using 6.6 GB of memory.. so it's a memory leak.
heap.zip

25283 root 20 0 14.0g 6.6g 1820 S 24.9 42.3 250:38.96 carbon-relay-ng

heap is only 13k in clear text - see attached.

@KlavsKlavsen
Author

I'm guessing it could be something like faulty GC'ing by Go - since the heap still looks small?

@Dieterbe
Contributor

Dieterbe commented Dec 14, 2016

To be clear: the heap file is just a profile of the heap, not an actual heap dump. Please also send me a copy of the binary that was running when you got the heap profile. I tried to profile by compiling 0.7_54_g78c140e myself, but I'm not sure it matches the profile.

@KlavsKlavsen
Author

go version go1.7.3 linux/amd64
is what CentOS 6 has. Zip of the binary attached.
carbon-relay-ng-0.7_54_g78c140e-1.zip

@Dieterbe
Contributor

Dieterbe commented Dec 14, 2016

go tool pprof carbon-relay-ng-0.7_54_g78c140e-1/bin/carbon-relay-ng heap
Entering interactive mode (type "help" for commands)
(pprof) top30
5.39GB of 5.43GB total (99.38%)
Dropped 76 nodes (cum <= 0.03GB)
      flat  flat%   sum%        cum   cum%
    5.39GB 99.38% 99.38%     5.39GB 99.38%  github.com/graphite-ng/carbon-relay-ng/vendor/github.com/Dieterbe/go-metrics.(*WindowSample).Update
         0     0% 99.38%     5.39GB 99.38%  github.com/graphite-ng/carbon-relay-ng/vendor/github.com/Dieterbe/go-metrics.(*StandardHistogram).Update
         0     0% 99.38%     5.39GB 99.33%  github.com/graphite-ng/carbon-relay-ng/vendor/github.com/Dieterbe/go-metrics.(*StandardTimer).Update
         0     0% 99.38%     5.39GB 99.38%  main.(*Conn).HandleData
         0     0% 99.38%     5.43GB   100%  runtime.goexit

This looks like #50. Basically we're collecting internal performance stats, but if we don't send them anywhere, they keep piling up. Silly, but that's how it is for now.
For now you could set graphite_addr = "localhost:2003" so that it sends stats into itself and processes them.
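A sketch of that change against the [instrumentation] section from the config above (the interval is just the existing value carried over):

[instrumentation]
graphite_addr = "localhost:2003"  # feed internal stats back into the relay itself
graphite_interval = 1000          # in ms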

For the longer term, the good news is that my new employer (raintank / grafanalabs) will be starting to support this project, so I should be able to address this some time within the next few weeks.

@KlavsKlavsen
Author

That fixed it.. after 2 days of not dying, it's now using 788M of memory (still a bit leaky it seems.. but much better :)
