carbon-relay-ng dies consistently #146
Comments
Which version is this? git master just received a bunch of refactorings which have only received non-extensive testing (e.g. basic functionality tested over a timespan of minutes, not days).
I just built it from master - carbon-relay-ng-0.7_54_g78c140e
The line with the PID, is that a bad copy/paste?
@nokernel to be clear, you mean @KlavsKlavsen's config, right? Yeah, that's weird. There's also a stray "allowed" in there?
Bad paste.. here's the full config:

instance = "default"
max_procs = 2

listen_addr = "0.0.0.0:2003"
admin_addr = "0.0.0.0:3024"
http_addr = "0.0.0.0:8081"

spool_dir = "/var/spool/carbon-relay-ng"
#spool_dir = "spool"

#pid_file = "/var/run/carbon-relay-ng.pid"
pid_file = "carbon-relay-ng.pid"

#one of critical error warning notice info debug
log_level = "notice"

# How long to keep track of invalid metrics seen
# Useful time units are "s", "m", "h"
bad_metrics_max_age = "24h"

# Metric name validation strictness for legacy metrics. Valid values are:
# strict - Block anything that can upset graphite: valid characters are [A-Za-z0-9_-.]; consecutive dots are not allowed
# medium - Valid characters are ASCII; no embedded NULLs
# none   - No validation is performed
validation_level_legacy = "medium"

# Metric validation for carbon2.0 (metrics2.0) metrics.
# Metrics that contain = or _is_ are assumed carbon2.0.
# Valid values are:
# medium - checks for unit and mtype tag, presence of another tag, and consistency (use = or _is_, not both)
# none   - No validation is performed
validation_level_m20 = "medium"

# you can also validate that each series has increasing timestamps
validate_order = false

# put init commands here, in the same format as you'd use for the telnet interface
# here's some examples:
#init = [
#    'addBlack prefix collectd.localhost',  # ignore hosts that don't set their hostname properly (implicit substring match).
#    'addBlack regex ^foo\..*\.cpu+',       # ignore foo..cpu.... (regex pattern match)
#    'addAgg sum ^stats\.timers\.(app|proxy|static)[0-9]+\.requests\.(.*) stats.timers._sum_$1.requests.$2 10 20',
#    'addAgg avg ^stats\.timers\.(app|proxy|static)[0-9]+\.requests\.(.*) stats.timers._avg_$1.requests.$2 5 10',
#    'addRoute sendAllMatch carbon-default 127.0.0.1:2005 spool=true pickle=false',
#    'addRoute sendAllMatch carbon-tagger sub== 127.0.0.1:2006',  # all metrics with '=' in them are metrics2.0 format for tagger
#    'addRoute sendFirstMatch analytics regex=(Err/s|wait_time|logger) graphite.prod:2003 prefix=prod. spool=true pickle=true graphite.staging:2003 prefix=staging. spool=true pickle=true'
#]
init = [
    'addRoute sendAllMatch default-route 172.16.62.47:2013 spool=true pickle=false 172.16.62.49:2013 spool=true pickle=false 172.16.62.46:2013 spool=true pickle=false'
]

[instrumentation]
# in addition to serving internal metrics via expvar, you can optionally send em to graphite
graphite_addr = "" # localhost:2003 (how about feeding back into the relay itself? :)
graphite_interval = 1000 # in ms
Try putting it in the most verbose log mode.
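(For reference, going by the level list in the config's own comment, the most verbose mode would be the debug level; a minimal sketch of the change:)

#one of critical error warning notice info debug
log_level = "debug"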
It logs fine on start, then runs for a day or more, and then nothing. It gets a lot of traffic (300 hosts * ~1500 counters per host on average).
I see no memory leaks (no increase in memory usage, at least).
How much logging (and IO) will that amount of traffic generate in your proposed logging mode over 1-2 days?
Ahh - after I wrapped the process in a bash script (that does a while :, restarting it again and again :), my nohup output catches more info.. so the OOM killer apparently decided it was using too much memory. Current footprint pasted; I'll monitor it to see how that changes.
It does rise rather quickly in memory usage.. Could it be the aggregation/spooling of counters it can't deliver to the receiver, or something like that? There are no errors in the logs, though.
It seems to spool to file though, so that should not be an issue.
Your setup looks sane. Am I reading this right that it got killed after consuming 705 MB of virtual memory? And now you've pasted a ps output where it consumes 1.1 GB, which is indeed more than it should, though at least not super crazy. Anyway, once it consumes a good amount of memory (let's say >600 MB), please grab a heap profile (if you're interested in doing it yourself, see https://blog.golang.org/profiling-go-programs).
It's now at 6.6 GB of memory.. so it's a memory leak.
25283 root 20 0 14.0g 6.6g 1820 S 24.9 42.3 250:38.96 carbon-relay-ng
The heap is only 13k in clear text - see attached.
I'm guessing it could be something like faulty GC'ing by Go, since your heap still looks small?
To be clear: the heap file is just a profile of the heap, not an actual heap dump. Please also send me a copy of the binary that was running when you got the heap profile. I tried to analyze it by compiling 0.7_54_g78c140e myself, but I'm not sure that binary matches the profile.
go version go1.7.3 linux/amd64
This looks like #50. Basically we're collecting internal performance stats, but if we don't send them anywhere, they keep piling up. Silly, but that's how it is for now. For the longer term, the good news is that my new employer (raintank / grafanalabs) will be starting to support this project, so I should be able to address this some time within the next few weeks.
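(For reference, the workaround implied here is to give the internal stats somewhere to go. A minimal sketch of the [instrumentation] section; the localhost:2003 target is an assumption borrowed from the comment in the config above, and should point at whatever carbon/graphite listener you actually run:)

[instrumentation]
# send the relay's internal metrics to graphite so they get flushed instead of piling up in memory (see #50)
graphite_addr = "localhost:2003"
graphite_interval = 1000 # in ms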
That fixed it.. after 2 days of not dying, it's now using 788M of memory (still a bit leaky it seems.. but much better :)
I've just inserted carbon-relay-ng in front of my good old carbon backend, to relay counters to 2x influxdb as well.
It has now died / exited twice, after 1 to 2 days' time, without logging anything of relevance :(
It's running on CentOS 6.
Config is:
instance = "default"
max_procs = 2
listen_addr = "0.0.0.0:2003"
admin_addr = "0.0.0.0:3024"
http_addr = "0.0.0.0:8081"
spool_dir = "/var/spool/carbon-relay-ng"
carbon-relay-ng.pid"
pid_file = "carbon-relay-ng.pid"
log_level = "notice"
bad_metrics_max_age = "24h"
allowed
validation_level_legacy = "medium"
validation_level_m20 = "medium"
validate_order = false
init = [
'addRoute sendAllMatch default-route 172.16.62.47:2013 spool=true pickle=false 172.16.62.49:2013 spool=true pickle=false 172.16.62.46:2013 spool=true pickle=false'
]
[instrumentation]
graphite_addr = "" # localhost:2003 (how about feeding back into the relay itself? :)
graphite_interval = 1000 # in ms
Any ideas what it could be, or what I could try? (besides wrapping it in a restart-script)