
Inconsistent container metrics in prometheus route #1704

Closed
zeisss opened this issue Jul 25, 2017 · 51 comments

zeisss commented Jul 25, 2017

Our cAdvisor reports different containers each time we query the /metrics route. The problem is consistent across various environments and VMs. I initially found #1635 and thought this was the same issue, but the linked #1572 explains that cAdvisor seems to pick up two systemd slices for the container, which is not the case according to my logs. Hence a separate issue, just to be sure.

17:50 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l
      98
17:51 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l
      18
17:51 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l
      98
17:51 $ curl -s http://docker-012.<domain>:8701/metrics | fgrep container_cpu_usage_seconds_total| wc -l

:8701 is started as follows: $ sudo /opt/cadvisor/bin/cadvisor -port 8701 -logtostderr -v=10

Neither dockerd nor cAdvisor prints any logs during these requests.

Startup Logs

I0725 17:02:09.462596  109834 storagedriver.go:50] Caching stats in memory for 2m0s
I0725 17:02:09.462727  109834 manager.go:143] cAdvisor running in container: "/"
W0725 17:02:09.496040  109834 manager.go:151] unable to connect to Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
I0725 17:02:09.531430  109834 fs.go:117] Filesystem partitions: map[/dev/dm-0:{mountpoint:/ major:254 minor:0 fsType:ext4 blockSize:0} /dev/mapper/rs--pre--docker--012--vg-var:{mountpoint:/var/lib/docker/aufs major:254 minor:2 fsType:ext4 blockSize:0} /dev/mapper/rs--pre--docker--012--vg-varlog:{mountpoint:/var/log major:254 minor:3 fsType:ext4 blockSize:0}]
I0725 17:02:09.534803  109834 manager.go:198] Machine: {NumCores:8 CpuFrequency:2397223 MemoryCapacity:38034182144 MachineID:c63b565c3eea4c1bab8cc5d972595a51 SystemUUID:423B1F3E-804D-219F-8D0B-EECB74C81279 BootID:9b2c8857-539f-4adf-b2b5-c8e2672968b8 Filesystems:[{Device:/dev/mapper/rs--pre--docker--012--vg-var DeviceMajor:254 DeviceMinor:2 Capacity:40179982336 Type:vfs Inodes:2501856 HasInodes:true} {Device:/dev/mapper/rs--pre--docker--012--vg-varlog DeviceMajor:254 DeviceMinor:3 Capacity:20020748288 Type:vfs Inodes:1250928 HasInodes:true} {Device:/dev/dm-0 DeviceMajor:254 DeviceMinor:0 Capacity:12366823424 Type:vfs Inodes:775200 HasInodes:true}] DiskMap:map[254:1:{Name:dm-1 Major:254 Minor:1 Size:1023410176 Scheduler:none} 254:2:{Name:dm-2 Major:254 Minor:2 Size:40957378560 Scheduler:none} 254:3:{Name:dm-3 Major:254 Minor:3 Size:20476592128 Scheduler:none} 8:0:{Name:sda Major:8 Minor:0 Size:75161927680 Scheduler:cfq} 254:0:{Name:dm-0 Major:254 Minor:0 Size:12700352512 Scheduler:none}] NetworkDevices:[{Name:eth0 MacAddress:00:50:56:bb:37:43 Speed:10000 Mtu:1500}] Topology:[{Id:0 Memory:38034182144 Cores:[{Id:0 Threads:[0] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:2 Memory:0 Cores:[{Id:0 Threads:[1] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:4 Memory:0 Cores:[{Id:0 Threads:[2] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:6 Memory:0 Cores:[{Id:0 Threads:[3] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:8 Memory:0 Cores:[{Id:0 Threads:[4] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:10 Memory:0 Cores:[{Id:0 Threads:[5] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:12 Memory:0 Cores:[{Id:0 Threads:[6] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]} {Id:14 Memory:0 Cores:[{Id:0 Threads:[7] Caches:[{Size:32768 Type:Data Level:1} {Size:32768 Type:Instruction Level:1} {Size:262144 Type:Unified Level:2}]}] Caches:[{Size:15728640 Type:Unified Level:3}]}] CloudProvider:Unknown InstanceType:Unknown InstanceID:None}
I0725 17:02:09.535661  109834 manager.go:204] Version: {KernelVersion:3.16.0-4-amd64 ContainerOsVersion:Debian GNU/Linux 8 (jessie) DockerVersion:1.13.1 DockerAPIVersion:1.26 CadvisorVersion:v0.26.1 CadvisorRevision:d19cc94}
I0725 17:02:09.577920  109834 factory.go:351] Registering Docker factory
W0725 17:02:09.577951  109834 manager.go:247] Registration of the rkt container factory failed: unable to communicate with Rkt api service: rkt: cannot tcp Dial rkt api service: dial tcp 127.0.0.1:15441: getsockopt: connection refused
I0725 17:02:09.577957  109834 factory.go:54] Registering systemd factory
I0725 17:02:09.578235  109834 factory.go:86] Registering Raw factory
I0725 17:02:09.578542  109834 manager.go:1121] Started watching for new ooms in manager
I0725 17:02:09.579461  109834 oomparser.go:185] oomparser using systemd
I0725 17:02:09.579565  109834 factory.go:116] Factory "docker" was unable to handle container "/"
I0725 17:02:09.579582  109834 factory.go:105] Error trying to work out if we can handle /: / not handled by systemd handler
I0725 17:02:09.579586  109834 factory.go:116] Factory "systemd" was unable to handle container "/"
I0725 17:02:09.579592  109834 factory.go:112] Using factory "raw" for container "/"
I0725 17:02:09.579959  109834 manager.go:913] Added container: "/" (aliases: [], namespace: "")
I0725 17:02:09.580102  109834 handler.go:325] Added event &{/ 2017-07-22 16:40:48.746304841 +0200 CEST containerCreation {<nil>}}
I0725 17:02:09.580139  109834 manager.go:288] Starting recovery of all containers
I0725 17:02:09.580237  109834 container.go:407] Start housekeeping for container "/"

Logs for a container

Example: I am missing the metrics for f7ba91df74c8. cAdvisor mentions the container ID only once:

I0725 17:02:09.693203  109834 factory.go:112] Using factory "docker" for container "/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855"
I0725 17:02:09.695423  109834 manager.go:913] Added container: "/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855" (aliases: [containernameredacted f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855], namespace: "docker")
I0725 17:02:09.695640  109834 handler.go:325] Added event &{/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855 2017-07-25 16:20:00.930924661 +0200 CEST containerCreation {<nil>}}
I0725 17:02:09.695779  109834 container.go:407] Start housekeeping for container "/docker/f7ba91df74c8b923cf66ba2e0ef4190a2089f7dd258d7d57f7e92034192a1855"

System

cadvisor_version_info{cadvisorRevision="d19cc94",cadvisorVersion="v0.26.1",dockerVersion="1.13.1",kernelVersion="3.16.0-4-amd64",osVersion="Debian GNU/Linux 8 (jessie)"} 1

We are running an old docker swarm setup with consul, consul-template and nginx per host. No Kubernetes.

@zeisss changed the title from "Inconsistent container metrics" to "Inconsistent container metrics in prometheus route" on Jul 25, 2017
micahhausler (Contributor) commented Jul 27, 2017

We're observing the same behavior in the Kubernetes 1.7.0 kubelet (port 4194) and with the Docker image for v0.26.1.

Versions:

docker: 1.12.6
Kubelet: v1.7.0+coreos.0
OS: CoreOS Linux 1409.7.0
Kernel Version: 4.11.11-coreos

I ran cAdvisor on Kubernetes using the following DaemonSet:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: cadvisor
  namespace: default
  labels:
    app: "cadvisor"
spec:
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: "cadvisor"
      annotations:
        prometheus.io/scrape: 'true'
        prometheus.io/port: '4194'
        prometheus.io/path: '/metrics'
    spec:
      containers:
      - name: "cadvisor"
        image: "google/cadvisor:v0.26.1"
        args:
        - "-port=4194"
        - "-logtostderr"
        livenessProbe:
          httpGet:
            path: /api
            port: 4194
        volumeMounts:
        - name: root
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: var-lib-docker
          mountPath: /var/lib/docker
          readOnly: true
        - name: docker-socket
          mountPath: /var/run/docker.sock
        resources:
          limits:
            cpu: 500.0m
            memory: 256Mi
          requests:
            cpu: 250.0m
            memory: 128Mi
      restartPolicy: Always
      volumes:
      - name: "root"
        hostPath:
          path: /
      - name: "var-run"
        hostPath:
          path: /var/run
      - name: "sys"
        hostPath:
          path: /sys
      - name: "var-lib-docker"
        hostPath:
          path: /var/lib/docker
      - name: "docker-socket"
        hostPath:
          path: /var/run/docker.sock

And this is what it looked like in Prometheus:
[screenshot: Prometheus graph, 2017-07-27 15:47:45]

zeisss (Author) commented Jul 30, 2017

Running the binary without root permissions fixes the problem, but then the container labels are missing. Using the -docker-only flag or accessing Docker via TCP/IP leads to no change from the initial behavior.

fabxc commented Aug 3, 2017

@zeisss @micahhausler are you both running Prometheus 2.0? In 1.x versions the flapping metrics are not caught by the new staleness handling, so it should have no immediately visible effect there.

In general, though, it's definitely wrong behavior by cAdvisor that violates the /metrics contract.
This seems to be a recent regression. @derekwaynecarr @timothysc any idea what could have caused this?

idexter commented Aug 3, 2017

@fabxc I'm using Prometheus 1.5.2 with cAdvisor on the host machine, and I also have this problem.
As @zeisss said, running cAdvisor without root permissions fixes the problem, except that the container labels are then missing.

The worst part of this bug is that Prometheus sometimes loses metrics for some containers... In Grafana my graph of running containers looks like this:

[screenshot: Grafana graph of running containers]

And I see alerts from Alertmanager saying containers are down, while actually all containers are running the whole time.

zeisss (Author) commented Aug 4, 2017

We currently have a workaround: running cAdvisor as a non-root user. This is OK for us, as having the CPU and memory graphs is already a win. But AFAICT this mode is missing the Docker container labels as well as the network and disk I/O metrics.

zeisss (Author) commented Aug 4, 2017

@fabxc no, we are still running a 1.x Prometheus version, but having Prometheus work around this bug in cAdvisor is not a good solution IMO.

zeisss (Author) commented Aug 4, 2017

We are currently in the process of updating our DEV cluster to Docker 17.06-ce, where we are still seeing this behavior when cAdvisor runs as root (/opt/cadvisor/bin/cadvisor -port 8701 -logtostderr):

$ while true; do curl -sS docker-host:8701/metrics | fgrep container_cpu_system_seconds_total | wc -l; sleep 1; done
      28
      28
       9
       9
       5
       6
^C
# HELP cadvisor_version_info A metric with a constant '1' value labeled by kernel version, OS version, docker version, cadvisor version & cadvisor revision.
# TYPE cadvisor_version_info gauge
cadvisor_version_info{cadvisorRevision="057293a",cadvisorVersion="v0.26.0.20+057293a1796d6a-dirty",dockerVersion="17.06.0-ce",kernelVersion="3.16.0-4-amd64",osVersion="Debian GNU/Linux 8 (jessie)"} 1

sylr commented Aug 4, 2017

I have the same issue with Kubernetes 1.7.2 & 1.7.3.

kubernetes/kubernetes#50151

@bassebaba

I have the exact same problem as @dexterHD. It drives me crazy: my container-down alert spams me with false alerts all the time.

[screenshot, 2017-08-06 07:42:53]

igortg commented Aug 7, 2017

I just started to explore cAdvisor. It seems to have the same issue when using InfluxDB:

[screenshot]

Hermain commented Aug 11, 2017

Having the same issue with Docker 17.06, Prometheus, and Docker Swarm.
Running v0.24.1 solved it for me.

@matthiasr

cc @grobie

dixudx commented Aug 15, 2017

/cc

@roman-vynar

Same thing: 0.26 and 0.26.1 are unusable with Prometheus (1.7.x in our case).
They provide a random number of metrics: a different number is exposed on the /metrics path at any given moment. Had to go back to good old 0.25.
Docker 17.03/17.06.

bassebaba commented Aug 16, 2017

@Hermain @roman-vynar According to the 0.26 release notes: "Bug: Fix prometheus metrics."
So when reverting to 0.25, one misses out on whatever was fixed there (while 0.26 also broke something and introduced the gaps)? I can't find the Prometheus-related commit connected to v0.26, so I can't see what was "fixed".

Do we have an ETA on fixing this? No devs in this issue? And no assignee?

@matthiasr

According to #1690 (comment), the fix in 0.26.1 isn't working or is incomplete; maybe this is the same problem?

@matthiasr

Does this problem happen on a cAdvisor built from master, which includes #1679?

sylr commented Aug 16, 2017

If someone can show me how to build hyperkube with a custom cAdvisor commit, I'd like to run some tests. I think I found out how to do this.

Thanks.

Cas-pian commented Aug 21, 2017

I met the same problem using cAdvisor 0.26.1 and Prometheus 1.7.1, but it's OK when I change cAdvisor to v0.25.0, and it's also OK with cAdvisor 0.26.1 and Prometheus 1.5.3. I'm a little confused; it seems to be a compatibility issue.

@bboreham (Contributor)

Seeing the same high-level symptoms: for me it's the labels that are missing, not the containers. And when the labels are missing I get a lot more lines for other cgroups.

I'm running Kubernetes 1.7.3 on Ubuntu (Linux ip-172-20-3-76 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:04:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux).

Two examples from the same kubelet on the same machine, a few seconds apart:

Example 1:

# curl -s 127.0.0.1:10255/metrics/cadvisor | grep container_cpu_user_seconds_total
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{id="/"} 3.6788206e+06
container_cpu_user_seconds_total{id="/init.scope"} 69.43
container_cpu_user_seconds_total{id="/kubepods"} 3.49797001e+06
container_cpu_user_seconds_total{id="/kubepods/besteffort"} 162742.99
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod13eacef1-8342-11e7-9534-0a97ed59c75e"} 69.47
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod5f43c843-7db5-11e7-9534-0a97ed59c75e"} 703.82
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod6b2e45d7-7db5-11e7-9534-0a97ed59c75e"} 70.04
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e"} 363.18
container_cpu_user_seconds_total{id="/kubepods/besteffort/pod965b711b-8262-11e7-9534-0a97ed59c75e"} 5.9
container_cpu_user_seconds_total{id="/kubepods/besteffort/podd2b82b9c-8355-11e7-9534-0a97ed59c75e"} 35733.13
container_cpu_user_seconds_total{id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e"} 150.78
container_cpu_user_seconds_total{id="/kubepods/burstable"} 3.33525364e+06
container_cpu_user_seconds_total{id="/kubepods/burstable/pod53559243-7db5-11e7-9534-0a97ed59c75e"} 276743.3
container_cpu_user_seconds_total{id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e"} 105958.75
container_cpu_user_seconds_total{id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee"} 366.77
container_cpu_user_seconds_total{id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee/8d2eb34023eab40d08ba6e4be149e315c3844749f8321f44be2dcda024534757/\"\""} 366.65
container_cpu_user_seconds_total{id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e"} 434974.97
container_cpu_user_seconds_total{id="/kubepods/burstable/podcb5d3cc0-8364-11e7-9534-0a97ed59c75e"} 891563
container_cpu_user_seconds_total{id="/kubepods/burstable/podcf18531c-8365-11e7-9534-0a97ed59c75e"} 17225.18
container_cpu_user_seconds_total{id="/system.slice"} 151482.27
container_cpu_user_seconds_total{id="/system.slice/acpid.service"} 0
container_cpu_user_seconds_total{id="/system.slice/apparmor.service"} 0
container_cpu_user_seconds_total{id="/system.slice/apport.service"} 0
container_cpu_user_seconds_total{id="/system.slice/atd.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cgroupfs-mount.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cloud-config.service"} 0.32
container_cpu_user_seconds_total{id="/system.slice/cloud-final.service"} 0.37
container_cpu_user_seconds_total{id="/system.slice/cloud-init-local.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cloud-init.service"} 0.63
container_cpu_user_seconds_total{id="/system.slice/console-setup.service"} 0
container_cpu_user_seconds_total{id="/system.slice/cron.service"} 25.49
container_cpu_user_seconds_total{id="/system.slice/dbus.service"} 14.82
container_cpu_user_seconds_total{id="/system.slice/docker.service"} 94117.92
container_cpu_user_seconds_total{id="/system.slice/ebtables.service"} 0
container_cpu_user_seconds_total{id="/system.slice/grub-common.service"} 0
container_cpu_user_seconds_total{id="/system.slice/ifup@cbr0.service"} 0
container_cpu_user_seconds_total{id="/system.slice/ifup@ens3.service"} 0.79
container_cpu_user_seconds_total{id="/system.slice/irqbalance.service"} 40.56
container_cpu_user_seconds_total{id="/system.slice/iscsid.service"} 1.69
container_cpu_user_seconds_total{id="/system.slice/keyboard-setup.service"} 0
container_cpu_user_seconds_total{id="/system.slice/kmod-static-nodes.service"} 0
container_cpu_user_seconds_total{id="/system.slice/kubelet.service"} 21323.06
container_cpu_user_seconds_total{id="/system.slice/lvm2-lvmetad.service"} 8.94
container_cpu_user_seconds_total{id="/system.slice/lvm2-monitor.service"} 0
container_cpu_user_seconds_total{id="/system.slice/lxcfs.service"} 0.37
container_cpu_user_seconds_total{id="/system.slice/lxd-containers.service"} 0
container_cpu_user_seconds_total{id="/system.slice/mdadm.service"} 0.02
container_cpu_user_seconds_total{id="/system.slice/networking.service"} 0
container_cpu_user_seconds_total{id="/system.slice/ondemand.service"} 0
container_cpu_user_seconds_total{id="/system.slice/open-iscsi.service"} 0
container_cpu_user_seconds_total{id="/system.slice/polkitd.service"} 3.63
container_cpu_user_seconds_total{id="/system.slice/rc-local.service"} 0
container_cpu_user_seconds_total{id="/system.slice/resolvconf.service"} 0
container_cpu_user_seconds_total{id="/system.slice/rsyslog.service"} 100.82
container_cpu_user_seconds_total{id="/system.slice/setvtrgb.service"} 0
container_cpu_user_seconds_total{id="/system.slice/snapd.firstboot.service"} 0
container_cpu_user_seconds_total{id="/system.slice/snapd.service"} 0.04
container_cpu_user_seconds_total{id="/system.slice/ssh.service"} 51.39
container_cpu_user_seconds_total{id="/system.slice/system-getty.slice"} 0
container_cpu_user_seconds_total{id="/system.slice/system-serial\\x2dgetty.slice"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-journal-flush.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-journald.service"} 489.31
container_cpu_user_seconds_total{id="/system.slice/systemd-logind.service"} 3.02
container_cpu_user_seconds_total{id="/system.slice/systemd-modules-load.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-random-seed.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-remount-fs.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-sysctl.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-timesyncd.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-tmpfiles-setup-dev.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-tmpfiles-setup.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-udev-trigger.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-udevd.service"} 0.52
container_cpu_user_seconds_total{id="/system.slice/systemd-update-utmp.service"} 0
container_cpu_user_seconds_total{id="/system.slice/systemd-user-sessions.service"} 0
container_cpu_user_seconds_total{id="/system.slice/ufw.service"} 0
container_cpu_user_seconds_total{id="/user.slice"} 29270.98

Example 2:

# curl -s 127.0.0.1:10255/metrics/cadvisor | grep container_cpu_user_seconds_total
# HELP container_cpu_user_seconds_total Cumulative user cpu time consumed in seconds.
# TYPE container_cpu_user_seconds_total counter
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod5f43c843-7db5-11e7-9534-0a97ed59c75e/e49ec1309ec25475a7edd8c4dd6d7003fef3f7debd053b234716649d920ac15f",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_prom-node-exporter-w4nvq_monitoring_5f43c843-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="prom-node-exporter-w4nvq"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod6b2e45d7-7db5-11e7-9534-0a97ed59c75e/2bf50e4b99aaf24eb05a61b9808d9e60d4fd78ba47ac7669ce29bb3f8c862501",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_reboot-required-rn9h4_monitoring_6b2e45d7-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="reboot-required-rn9h4"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e/f3a1a656eabae83bb3a50206d7278b154fe1ddf2521e6a0bfd31667642867968",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_memcached-296817331-t3q5v_kube-system_94ad7fd4-8351-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="memcached-296817331-t3q5v"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pod965b711b-8262-11e7-9534-0a97ed59c75e/c6e3b1012a1e607e4d164233f96a4c2ef83f377fc9dfb82e0dab7fc218e4e72a",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_kured-wp23j_kube-system_965b711b-8262-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="kured-wp23j"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/podd2b82b9c-8355-11e7-9534-0a97ed59c75e/e62bf79dd1981e285df9138a057b481357a5be6e464b43235e1335ac33bcf00b",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluxd-3608285890-x4bz7_kube-system_d2b82b9c-8355-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="fluxd-3608285890-x4bz7"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e/e618f7cb1f3ec97f463ed9f97143890b80c730f53075a127d9f59714aab35163",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_nats-651776541-6vrk3_scope_e4c7eace-8352-11e7-9534-0a97ed59c75e_0",namespace="scope",pod_name="nats-651776541-6vrk3"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pod53559243-7db5-11e7-9534-0a97ed59c75e/44f9f0113185f75f827eca36a42a7d2f91e166594c63eb2efecc7155eda03a70",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_scope-probe-master-3cktj_kube-system_53559243-7db5-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="scope-probe-master-3cktj"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e/9f9696a06e93a617a4e606731a474966c139681eef1a66344f0d06c965c68e47",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_authfe-1607895901-bjnd9_default_55af46fe-834c-11e7-9534-0a97ed59c75e_0",namespace="default",pod_name="authfe-1607895901-bjnd9"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee/7c3dc6bb8bb540224ca1f6d121d5fe2c5df0606ce5d45e7a0c802c29765c6625",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_kube-proxy-ip-172-20-3-76.ec2.internal_kube-system_7964f3e653196edee64f6bad72589dee_1",namespace="kube-system",pod_name="kube-proxy-ip-172-20-3-76.ec2.internal"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e/3f5329bc7772496d70821ea9c9bc80045af6c29299c42d74e2d27baf8c3cc72a",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_prometheus-2177618048-kgczb_monitoring_c7af9dff-8364-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="prometheus-2177618048-kgczb"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/podcb5d3cc0-8364-11e7-9534-0a97ed59c75e/decb876fb0dad43964deed741609ee45d3cf9049ae9f3ac934aefd596695302c",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluxsvc-438909710-2jtz8_fluxy_cb5d3cc0-8364-11e7-9534-0a97ed59c75e_0",namespace="fluxy",pod_name="fluxsvc-438909710-2jtz8"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/podcf18531c-8365-11e7-9534-0a97ed59c75e/133181c676d51606d4fa3d7d5c7e7455535636d30c5629526f0ba0cac5fcb522",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_fluentd-loggly-z9jp4_monitoring_cf18531c-8365-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="fluentd-loggly-z9jp4"} 0
container_cpu_user_seconds_total{container_name="POD",id="/kubepods/burstable/pode84e93da-865d-11e7-940d-12467a080e24/dabbd2c12e2d2666dd818b0c44be54760a701bdaf850ee4804b32efd36c42754",image="gcr.io/google_containers/pause-amd64:3.0",name="k8s_POD_collection-3392593966-7nxpw_scope_e84e93da-865d-11e7-940d-12467a080e24_0",namespace="scope",pod_name="collection-3392593966-7nxpw"} 0
container_cpu_user_seconds_total{container_name="authfe",id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e/c796e0b2c3afc41e1ed6750c9dc9f5550e19efe25f0aa717fe4f9b2578c16c67",image="quay.io/weaveworks/authfe@sha256:c82cb113d15e20f65690aa3ca7f3374ae7ed2257dee2bc131bd61b1ac2bf180a",name="k8s_authfe_authfe-1607895901-bjnd9_default_55af46fe-834c-11e7-9534-0a97ed59c75e_0",namespace="default",pod_name="authfe-1607895901-bjnd9"} 64953.6
container_cpu_user_seconds_total{container_name="billing-ingester",id="/kubepods/burstable/pode84e93da-865d-11e7-940d-12467a080e24/50c82895bd84971bc6b8b9f5873512710ab06f754a0e0d3261bc20a2fddd4533",image="quay.io/weaveworks/billing-ingester@sha256:5fd857a96cac13e9f96678e63a07633af45de0e83a34e8ef28f627cf0589a042",name="k8s_billing-ingester_collection-3392593966-7nxpw_scope_e84e93da-865d-11e7-940d-12467a080e24_0",namespace="scope",pod_name="collection-3392593966-7nxpw"} 187.34
container_cpu_user_seconds_total{container_name="collection",id="/kubepods/burstable/pode84e93da-865d-11e7-940d-12467a080e24/f33f520aa2ed2c6f2277064fed34c4797ddf76a0a0bef25309348517cb1c4030",image="quay.io/weaveworks/scope@sha256:45be0490dba82f68a20faba8994cde307e9ace863a310196ba91401122bda4f8",name="k8s_collection_collection-3392593966-7nxpw_scope_e84e93da-865d-11e7-940d-12467a080e24_0",namespace="scope",pod_name="collection-3392593966-7nxpw"} 5411.72
container_cpu_user_seconds_total{container_name="exporter",id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e/cabd4c16d300232a8b823bd5a9553816ff7f0830c6d91634651b4f723035664f",image="prom/memcached-exporter@sha256:b814aa209e2d5969be2ab4c65b5eda547ba657fd81ba47f48b980d20b14befb7",name="k8s_exporter_memcached-296817331-t3q5v_kube-system_94ad7fd4-8351-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="memcached-296817331-t3q5v"} 142.5
container_cpu_user_seconds_total{container_name="exporter",id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e/81fd164c5cc91a483b73ada15ce13f19d3171fc6beddc940fc2b6e747141905d",image="tomwilkie/nats_exporter@sha256:189354d9c966f94d9685009250dc360582baf02f76ecbaa2233e15cff2bc8f7f",name="k8s_exporter_nats-651776541-6vrk3_scope_e4c7eace-8352-11e7-9534-0a97ed59c75e_0",namespace="scope",pod_name="nats-651776541-6vrk3"} 107.62
container_cpu_user_seconds_total{container_name="fluentd-loggly",id="/kubepods/burstable/podcf18531c-8365-11e7-9534-0a97ed59c75e/6fe6a67e02419f47a21854b73734042c0d457d42704be4302356180e4f357935",image="quay.io/weaveworks/fluentd-loggly@sha256:19a02a2f8627573572cc2ee3c706aa4ccdab0f59c3a04e577d28035681d30ddc",name="k8s_fluentd-loggly_fluentd-loggly-z9jp4_monitoring_cf18531c-8365-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="fluentd-loggly-z9jp4"} 17386.12
container_cpu_user_seconds_total{container_name="flux",id="/kubepods/besteffort/podd2b82b9c-8355-11e7-9534-0a97ed59c75e/d4ef6d20b97c7f0fefc9d13c0f4b94290eb661035bf21b7f07f38acdd18cb85d",image="quay.io/weaveworks/flux@sha256:e462c0a7c316f5986b3808360dc7c8c269466033c75a1b9553aa8175e02646f7",name="k8s_flux_fluxd-3608285890-x4bz7_kube-system_d2b82b9c-8355-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="fluxd-3608285890-x4bz7"} 36097.96
container_cpu_user_seconds_total{container_name="fluxsvc",id="/kubepods/burstable/podcb5d3cc0-8364-11e7-9534-0a97ed59c75e/aa00624319b1a96a18e0a4717f13e7456e558fea8b84e2694dc8d2b168a44d3d",image="quay.io/weaveworks/fluxsvc@sha256:8d91991f2f6894def54afda4b4afb858b0502ed841a7188db48210b94bfdae4a",name="k8s_fluxsvc_fluxsvc-438909710-2jtz8_fluxy_cb5d3cc0-8364-11e7-9534-0a97ed59c75e_0",namespace="fluxy",pod_name="fluxsvc-438909710-2jtz8"} 897247.03
container_cpu_user_seconds_total{container_name="kube-proxy",id="/kubepods/burstable/pod7964f3e653196edee64f6bad72589dee/8d2eb34023eab40d08ba6e4be149e315c3844749f8321f44be2dcda024534757",image="gcr.io/google_containers/kube-proxy-amd64@sha256:dba7121df9f74b40901fb655053af369f58c82c3636d8125986ce474a759be80",name="k8s_kube-proxy_kube-proxy-ip-172-20-3-76.ec2.internal_kube-system_7964f3e653196edee64f6bad72589dee_1",namespace="kube-system",pod_name="kube-proxy-ip-172-20-3-76.ec2.internal"} 368.98
container_cpu_user_seconds_total{container_name="kured",id="/kubepods/besteffort/pod965b711b-8262-11e7-9534-0a97ed59c75e/12b3c19d2f114a6a111fdc0375bb0c27fb9e108c166e6f674aeddcd5178faa0b",image="weaveworks/kured@sha256:305b073cd3fff9ba0f21a570ee8a9c018d30274fc35045134164c762f44828e0",name="k8s_kured_kured-wp23j_kube-system_965b711b-8262-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="kured-wp23j"} 5.91
container_cpu_user_seconds_total{container_name="logging",id="/kubepods/burstable/pod55af46fe-834c-11e7-9534-0a97ed59c75e/8d7e46f3d99d2f13b04b7e07a4f1062e82450f02f8f7f03c8fb33a83f0248857",image="quay.io/weaveworks/logging@sha256:63c4e6783884e6fcdd24026606756748e5913ab4978efa61ed09034074ddbe27",name="k8s_logging_authfe-1607895901-bjnd9_default_55af46fe-834c-11e7-9534-0a97ed59c75e_0",namespace="default",pod_name="authfe-1607895901-bjnd9"} 41780.76
container_cpu_user_seconds_total{container_name="memcached",id="/kubepods/besteffort/pod94ad7fd4-8351-11e7-9534-0a97ed59c75e/e5d81ddecc6a587e55491e837db3ed46f274e3b02c764f4d6d1ca2e6228fbe0c",image="memcached@sha256:00b68b00139155817a8b1d69d74865563def06b3af1e6fc79ac541a1b2f6b961",name="k8s_memcached_memcached-296817331-t3q5v_kube-system_94ad7fd4-8351-11e7-9534-0a97ed59c75e_0",namespace="kube-system",pod_name="memcached-296817331-t3q5v"} 222.96
container_cpu_user_seconds_total{container_name="nats",id="/kubepods/besteffort/pode4c7eace-8352-11e7-9534-0a97ed59c75e/511ce33319ecc50b928e3dda7025d643c310a5573d89596f89798496d9868342",image="nats@sha256:2dfb204c4d8ca4391dbe25028099535745b3a73d0cf443ca20a7e2504ba93b26",name="k8s_nats_nats-651776541-6vrk3_scope_e4c7eace-8352-11e7-9534-0a97ed59c75e_0",namespace="scope",pod_name="nats-651776541-6vrk3"} 44.25
container_cpu_user_seconds_total{container_name="prom-node-exporter",id="/kubepods/besteffort/pod5f43c843-7db5-11e7-9534-0a97ed59c75e/1ceb1514b5339c67c70ec37d609d361d5ba656ee3697a12de0918f9902d0a134",image="weaveworks/node_exporter@sha256:4f0c14e89da784857570185c4b9f57acb20f4331ef10e013731ac9274243a5a8",name="k8s_prom-node-exporter_prom-node-exporter-w4nvq_monitoring_5f43c843-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="prom-node-exporter-w4nvq"} 707.54
container_cpu_user_seconds_total{container_name="prom-run",id="/kubepods/besteffort/pod6b2e45d7-7db5-11e7-9534-0a97ed59c75e/75468eaf52cf3577dbb462d586fc5aa49a3f5a151fb668a734f8e99f825c1fc5",image="quay.io/weaveworks/docker-ansible@sha256:452d1249e40650249beb700349c7deee26c15da2621e8590f3d56033babb890b",name="k8s_prom-run_reboot-required-rn9h4_monitoring_6b2e45d7-7db5-11e7-9534-0a97ed59c75e_1",namespace="monitoring",pod_name="reboot-required-rn9h4"} 70.57
container_cpu_user_seconds_total{container_name="prometheus",id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e/e4e3b4f6285c9a12415f347aadbf150c6d782e6b881d2701d4257bf3a4de2651",image="prom/prometheus@sha256:4bf7ad89d607dd8de2f0cff1df554269bff19fe0f18ee482660f7a5dc685d549",name="k8s_prometheus_prometheus-2177618048-kgczb_monitoring_c7af9dff-8364-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="prometheus-2177618048-kgczb"} 438158.08
container_cpu_user_seconds_total{container_name="scope-probe",id="/kubepods/burstable/pod53559243-7db5-11e7-9534-0a97ed59c75e/e57413febbcc1c28321ccb99df3bf30b9d6555a1db62b743d1b4ee877f23346b",image="quay.io/weaveworks/scope@sha256:bc6ee4a4a568f8075573a8ac44c27759307fce355c22ad66acb1e944b6361b62",name="k8s_scope-probe_scope-probe-master-3cktj_kube-system_53559243-7db5-11e7-9534-0a97ed59c75e_1",namespace="kube-system",pod_name="scope-probe-master-3cktj"} 278471.28
container_cpu_user_seconds_total{container_name="watch",id="/kubepods/burstable/podc7af9dff-8364-11e7-9534-0a97ed59c75e/fe6cdaa2c542c90cbca951cd97952d35c8c42fcd5e8f452030369a98e27c9b3f",image="weaveworks/watch@sha256:bb113953e19fff158de017c447be337aa7a3709c3223aeeab4a5bae50ee6f159",name="k8s_watch_prometheus-2177618048-kgczb_monitoring_c7af9dff-8364-11e7-9534-0a97ed59c75e_0",namespace="monitoring",pod_name="prometheus-2177618048-kgczb"} 0.1

Other metric families in the same scrape can be fine, e.g. container_fs_inodes_free.

@bboreham (Contributor)

I think I figured out what is going wrong.

The function DefaultContainerLabels() conditionally adds various metric labels from the container - name, image, etc. When used inside the kubelet this function is containerPrometheusLabels(), but it is essentially the same.

However, when it receives the metrics, Prometheus checks that all metrics in the same family have the same label set, and rejects those that do not.

Since containers are collected in (somewhat) random order, depending on which kind is seen first you get one set of metrics or the other.

Changing the container labels function to always add the same set of labels, adding "" when it doesn't have a real value, eliminates the issue in my testing.
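To make that concrete, here is a minimal sketch (not cAdvisor's actual code; containerInfo, cpuCollector, and the sample values are made-up illustrations) of a client_golang collector that declares one fixed label-key set and supplies "" where a container has no real value, so every sample in the family carries the same label names:

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// containerInfo stands in for whatever the collector knows about a cgroup;
// Name and Image are empty for raw (non-Docker) cgroups.
type containerInfo struct {
	ID, Name, Image string
	CPUUserSeconds  float64
}

// One Desc with a fixed label-key set, shared by every container.
var cpuUserDesc = prometheus.NewDesc(
	"container_cpu_user_seconds_total",
	"Cumulative user cpu time consumed in seconds.",
	[]string{"id", "name", "image"},
	nil,
)

type cpuCollector struct{ containers []containerInfo }

func (cc cpuCollector) Describe(ch chan<- *prometheus.Desc) { ch <- cpuUserDesc }

func (cc cpuCollector) Collect(ch chan<- prometheus.Metric) {
	for _, c := range cc.containers {
		// Always pass a value for every key: "" instead of omitting the label.
		ch <- prometheus.MustNewConstMetric(cpuUserDesc, prometheus.CounterValue,
			c.CPUUserSeconds, c.ID, c.Name, c.Image)
	}
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(cpuCollector{containers: []containerInfo{
		{ID: "/system.slice/docker.service", CPUUserSeconds: 94117.92}, // raw cgroup
		{ID: "/docker/f7ba91df74c8", Name: "redacted", Image: "redacted:latest", CPUUserSeconds: 366.65},
	}})
	families, err := reg.Gather() // with a consistent label set, nothing in the family gets rejected here
	fmt.Println(len(families), err)
}

If the label keys instead varied per container kind (id only for raw cgroups; id, name, image for Docker containers), whichever variant is gathered first wins and the rest of the family is dropped, which is consistent with the flapping counts seen earlier in this thread.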

@dashpole (Collaborator)

Thanks @bboreham! Can you submit a PR with your fix? I will try and get this in the 1.8 release.

@mfournier

For those stuck on 0.25.0 because of this issue, I've cherry-picked (04fc089) the patch to kube-state-metrics mentioned above (#1704 (comment)) onto cAdvisor's local copy of client_golang/prometheus/registry.go. This simply voids the label consistency checking introduced in 0.26.0. I also pushed an image with the workaround to docker.io/camptocamp/cadvisor:v0.27.1_with-workaround-for-1704

NB: this is merely a workaround until a proper fix is available in a release!

sylr pushed a commit to sylr/kubernetes that referenced this issue Sep 11, 2017
Prometheus requires that all metrics in the same family have the same
labels, so we arrange to supply blank strings for missing labels

See google/cadvisor#1704
cofyc pushed a commit to cofyc/kubernetes that referenced this issue Sep 26, 2017
Prometheus requires that all metrics in the same family have the same
labels, so we arrange to supply blank strings for missing labels

See google/cadvisor#1704
ghost commented Sep 28, 2017

We're observing the same behavior with version 0.27.0 and Docker 17.06.1.
The metrics always contain cAdvisor, Alertmanager, and Prometheus, but every couple of minutes our applications' container metrics are missing.
Could you please let us know if (and when) a fix will be available?
@mfournier, the workaround URL is broken.
Thanks.

beorn7 commented Oct 4, 2017

After several discussions with various people, I came to the conclusion that we want to support "label filling" within the Prometheus Go client. You can track progress here: prometheus/client_golang#355

brian-brazil (Contributor) commented Nov 30, 2017

I've looked into this, and there looks to be a simpler solution.

I believe that using the approach at kubernetes/kubernetes#51473 in cAdvisor would be sufficient to resolve the issue here: that is, have DefaultContainerLabels produce an empty string for the missing labels.

Is there something I'm missing?

@brian-brazil (Contributor)

Ah, I see. It's the container.Spec.Labels and container.Spec.Envs which need extra handling.
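Those are trickier because the set of Docker labels and environment variables differs from container to container. As a hedged illustration only (the container_label_ prefix mirrors cAdvisor's exported label naming, but the union-and-pad approach shown here is just one possible way to handle it, not necessarily what the eventual fix does), a collector would have to emit the union of all keys and pad missing values with empty strings:

package main

import (
	"fmt"
	"sort"
	"strings"
)

// sanitize turns a Docker label key into a Prometheus-safe label name.
func sanitize(key string) string {
	return "container_label_" + strings.NewReplacer(".", "_", "-", "_", "/", "_").Replace(key)
}

// consistentLabelSet builds one label-name list covering every container and,
// for each container, a value list padded with "" for names it does not carry.
func consistentLabelSet(containers []map[string]string) (names []string, values [][]string) {
	seen := map[string]bool{}
	for _, labels := range containers {
		for k := range labels {
			if n := sanitize(k); !seen[n] {
				seen[n] = true
				names = append(names, n)
			}
		}
	}
	sort.Strings(names)
	for _, labels := range containers {
		row := make([]string, len(names)) // zero values are "", i.e. the padding
		for i, n := range names {
			for k, v := range labels {
				if sanitize(k) == n {
					row[i] = v
				}
			}
		}
		values = append(values, row)
	}
	return names, values
}

func main() {
	names, values := consistentLabelSet([]map[string]string{
		{"com.docker.stack.namespace": "web"}, // a Docker container with one label
		{},                                    // a raw cgroup with no labels at all
	})
	fmt.Println(names)
	for _, row := range values {
		fmt.Printf("%q\n", row) // second row prints [""]: padded, not dropped
	}
}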

@brian-brazil (Contributor)

I've put together #1831, which I believe will fix this.

dashpole (Collaborator) commented Dec 7, 2017

The fix is released in version v0.28.3.
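If you want to verify an endpoint after upgrading, here is a small sketch (not part of cAdvisor; the URL is an example you would need to adjust) that scrapes /metrics and reports any family whose samples do not all share the same label-name set:

package main

import (
	"fmt"
	"log"
	"net/http"
	"sort"
	"strings"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// labelNames returns the sorted label-name set of one sample, e.g. "id,image,name".
func labelNames(m *dto.Metric) string {
	names := make([]string, 0, len(m.GetLabel()))
	for _, lp := range m.GetLabel() {
		names = append(names, lp.GetName())
	}
	sort.Strings(names)
	return strings.Join(names, ",")
}

func main() {
	resp, err := http.Get("http://localhost:8080/metrics") // adjust to your cAdvisor/kubelet address
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	for name, mf := range families {
		sets := map[string]bool{}
		for _, m := range mf.GetMetric() {
			sets[labelNames(m)] = true
		}
		if len(sets) > 1 {
			fmt.Printf("family %s has %d different label sets\n", name, len(sets))
		}
	}
}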

marcbachmann commented Dec 8, 2017

Thank you all ♥️
