carbon-relay-ng dies consistently #146

Closed
KlavsKlavsen opened this issue Dec 12, 2016 · 19 comments

@KlavsKlavsen

I've just inserted carbon-relay-ng in front of my good old carbon backend - to relay counters to 2x InfluxDB as well.

It has now died / exited twice, after 1 to 2 days' time - without logging anything of relevance :(

It's running on CentOS 6.

Config is:

instance = "default"
max_procs = 2
listen_addr = "0.0.0.0:2003"
admin_addr = "0.0.0.0:3024"
http_addr = "0.0.0.0:8081"
spool_dir = "/var/spool/carbon-relay-ng"
carbon-relay-ng.pid"
pid_file = "carbon-relay-ng.pid"
log_level = "notice"
bad_metrics_max_age = "24h"
allowed
validation_level_legacy = "medium"
validation_level_m20 = "medium"
validate_order = false
init = [
'addRoute sendAllMatch default-route 172.16.62.47:2013 spool=true pickle=false 172.16.62.49:2013 spool=true pickle=false 172.16.62.46:2013 spool=true pickle=false'
]

[instrumentation]
graphite_addr = "" # localhost:2003 (how about feeding back into the relay itself? :)
graphite_interval = 1000 # in ms

Any ideas what it could be, or what I could try? (besides wrapping it in a restart-script)

@Dieterbe
Contributor

Which version is this? git master just received a bunch of refactorings which have only had limited testing so far (e.g. basic functionality tested over a timespan of minutes, not days).

@KlavsKlavsen
Author

I just built it from master - carbon-relay-ng-0.7_54_g78c140e

@nokernel

Is the line with the PID a bad copy-paste?

@Dieterbe
Contributor

@nokernel to be clear, you mean @KlavsKlavsen's config, right? Yeah, that's weird. There's also a stray "allowed" in there?

@KlavsKlavsen
Author

Bad paste.. here's the full config:

instance = "default"
max_procs = 2
listen_addr = "0.0.0.0:2003"
admin_addr = "0.0.0.0:3024"
http_addr = "0.0.0.0:8081"
spool_dir = "/var/spool/carbon-relay-ng"
#spool_dir = "spool"
#pid_file = "/var/run/carbon-relay-ng.pid"
pid_file = "carbon-relay-ng.pid"
#one of critical error warning notice info debug
log_level = "notice"
# How long to keep track of invalid metrics seen
# Useful time units are "s", "m", "h"
bad_metrics_max_age = "24h"
# Metric name validation strictness for legacy metrics. Valid values are:
# strict - Block anything that can upset graphite: valid characters are [A-Za-z0-9_-.]; consecutive dots are not allowed
# medium - Valid characters are ASCII; no embedded NULLs
# none   - No validation is performed
validation_level_legacy = "medium"
# Metric validation for carbon2.0 (metrics2.0) metrics.
# Metrics that contain = or _is_ are assumed carbon2.0.
# Valid values are:
# medium - checks for unit and mtype tag, presence of another tag, and consistency (use = or _is_, not both)
# none   - No validation is performed
validation_level_m20 = "medium"

# you can also validate that each series has increasing timestamps
validate_order = false

# put init commands here, in the same format as you'd use for the telnet interface
# here's some examples:
#init = [
#     'addBlack prefix collectd.localhost',  # ignore hosts that don't set their hostname properly (implicit substring match).
#     'addBlack regex ^foo\..*\.cpu+', # ignore foo..cpu.... (regex pattern match)
#     'addAgg sum ^stats\.timers\.(app|proxy|static)[0-9]+\.requests\.(.*) stats.timers._sum_$1.requests.$2 10 20',
#     'addAgg avg ^stats\.timers\.(app|proxy|static)[0-9]+\.requests\.(.*) stats.timers._avg_$1.requests.$2 5 10',
#     'addRoute sendAllMatch carbon-default  127.0.0.1:2005 spool=true pickle=false',
#     'addRoute sendAllMatch carbon-tagger sub==  127.0.0.1:2006',  # all metrics with '=' in them are metrics2.0 format for tagger
#     'addRoute sendFirstMatch analytics regex=(Err/s|wait_time|logger)  graphite.prod:2003 prefix=prod. spool=true pickle=true  graphite.staging:2003 prefix=staging. spool=true pickle=true'
#]
init = [
     'addRoute sendAllMatch default-route  172.16.62.47:2013 spool=true pickle=false  172.16.62.49:2013 spool=true pickle=false  172.16.62.46:2013 spool=true pickle=false'
]

[instrumentation]
# in addition to serving internal metrics via expvar, you can optionally send em to graphite
graphite_addr = ""  # localhost:2003 (how about feeding back into the relay itself? :)
graphite_interval = 1000  # in ms

@Dieterbe
Contributor

Try putting it in the most verbose log mode.
Are you sure it doesn't log anything or print to stdout/stderr?
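A minimal sketch of that change, using the most verbose level from the list in the config's own comment (one of critical, error, warning, notice, info, debug):

# most verbose mode; expect a lot of extra output at this traffic volume
log_level = "debug"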

@KlavsKlavsen
Author

It logs fine on start - and then runs for a day or more.
I capture output with nohup.. and the last lines after the first startup are:
08:36:53.201231 ▶ NOTI listening on 0.0.0.0:2003/tcp
08:36:53.201290 ▶ NOTI listening on 0.0.0.0:2003/udp
08:36:53.204124 ▶ NOTI admin TCP listener starting on 0.0.0.0:3024
08:36:53.206720 ▶ NOTI dest 192.168.32.47:2013 updating conn. online: true
08:36:53.211707 ▶ NOTI dest 192.168.32.46:2013 updating conn. online: true
08:36:53.211877 ▶ NOTI admin HTTP listener starting on 0.0.0.0:8081
08:36:53.212035 ▶ NOTI dest 172.16.62.49:2013 updating conn. online: true

and then nothing. It gets a lot of traffic (300 hosts * ~1500 average counters per host)

@KlavsKlavsen
Author

I see no memory leaks (no increase in memory usage, at least).

@KlavsKlavsen
Author

How much logging (and I/O) will that amount of traffic generate in your proposed logging mode over 1-2 days?

@KlavsKlavsen
Author

Ahh - after I wrapped the process in a bash script (that does a while : loop, restarting it.. again and again :) my nohup output catches more info..
Out of memory: Kill process 20721 (carbon-relay-ng) score 779 or sacrifice child..

so the OOM killer apparently figured it was using too much memory.
25283 root 20 0 705m 243m 3968 S 17.0 1.5 2:18.56 carbon-relay-ng

current footprint. I'll monitor it to see how that changes.
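For reference, a minimal sketch of the kind of while-loop restart wrapper described above; the binary, config and log paths here are illustrative, not taken from this thread:

#!/bin/bash
# keep restarting the relay whenever it exits, and keep its output so crashes leave a trace
while :; do
    /usr/bin/carbon-relay-ng /etc/carbon-relay-ng.ini >> /var/log/carbon-relay-ng.out 2>&1
    echo "$(date) carbon-relay-ng exited with status $?, restarting in 5s" >> /var/log/carbon-relay-ng.out
    sleep 5
done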

@KlavsKlavsen
Author

it does rise rather quickly in memory usage..
25283 root 20 0 1123m 662m 3976 S 11.3 4.2 7:42.81 carbon-relay-ng

Could it be the aggregation/spooling of counters it can't get to a receiver, or something? There are no errors in the logs though.

@KlavsKlavsen
Author

it seems to spool to file though - so that should not be an issue.

@Dieterbe
Contributor

Dieterbe commented Dec 13, 2016

Your setup looks sane. Am I reading this right that it got killed after consuming 705 MB of virtual memory? And now you pasted a ps output where it consumes 1.1 GB, which is indeed more than it should be, though at least not super crazy.

Anyway, once it consumes a good amount of memory (let's say >600 MB),
can you wget http://localhost:8081/debug/pprof/heap, send me the heap and a copy of your binary, and I'll see what's consuming the memory? (Upload it somewhere or email it to me; my domain is raintank.io and the piece before the at is dieter.)

(if you're interested in doing it yourself, see https://blog.golang.org/profiling-go-programs )
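For anyone following along, the steps being asked for look roughly like this; the binary path is whatever you actually deployed:

# grab a heap profile from the relay's pprof HTTP endpoint (same port as http_addr)
wget http://localhost:8081/debug/pprof/heap
# inspect it against the exact binary that was running when the profile was taken
go tool pprof /path/to/carbon-relay-ng heap
(pprof) top30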

@KlavsKlavsen
Author

It's now using 6.6 GB of memory.. so it's a memory leak.
heap.zip

25283 root 20 0 14.0g 6.6g 1820 S 24.9 42.3 250:38.96 carbon-relay-ng

heap is only 13k in clear text - see attached.

@KlavsKlavsen
Author

I'm guessing it could be something like faulty GC'ing by Go - since the heap still looks small?

@Dieterbe
Contributor

Dieterbe commented Dec 14, 2016

To be clear: the heap file is just a profile of the heap, not an actual heap dump. Please also send me a copy of the binary that was running when you got the heap profile. I tried to profile by compiling 0.7_54_g78c140e myself, but I'm not sure it matches the profile.

@KlavsKlavsen
Author

go version go1.7.3 linux/amd64
is what CentOS 6 has. Zip of the binary attached.
carbon-relay-ng-0.7_54_g78c140e-1.zip

@Dieterbe
Contributor

Dieterbe commented Dec 14, 2016

go tool pprof carbon-relay-ng-0.7_54_g78c140e-1/bin/carbon-relay-ng heap
Entering interactive mode (type "help" for commands)
(pprof) top30
5.39GB of 5.43GB total (99.38%)
Dropped 76 nodes (cum <= 0.03GB)
      flat  flat%   sum%        cum   cum%
    5.39GB 99.38% 99.38%     5.39GB 99.38%  github.com/graphite-ng/carbon-relay-ng/vendor/github.com/Dieterbe/go-metrics.(*WindowSample).Update
         0     0% 99.38%     5.39GB 99.38%  github.com/graphite-ng/carbon-relay-ng/vendor/github.com/Dieterbe/go-metrics.(*StandardHistogram).Update
         0     0% 99.38%     5.39GB 99.33%  github.com/graphite-ng/carbon-relay-ng/vendor/github.com/Dieterbe/go-metrics.(*StandardTimer).Update
         0     0% 99.38%     5.39GB 99.38%  main.(*Conn).HandleData
         0     0% 99.38%     5.43GB   100%  runtime.goexit

This looks like #50. Basically we're collecting internal performance stats, but if we don't send them anywhere, they keep piling up. Silly, but that's how it is for now.
For now you could set graphite_addr = "localhost:2003" so that it sends stats into itself and processes them.
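A sketch of that change against the [instrumentation] section from the config above (the interval is just the existing value carried over):

[instrumentation]
graphite_addr = "localhost:2003"  # feed internal stats back into the relay itself
graphite_interval = 1000          # in ms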

For the longer term, the good news is that my new employer (raintank / grafanalabs) will be starting to support this project, so I should be able to address this some time within the next few weeks.

@KlavsKlavsen
Author

That fixed it.. after 2 days of not dying, it's now using 788M of memory (still a bit leaky it seems.. but much better :)
