
Potential memory leak with the prometheus metrics exporter #3621

Closed
bruce-y opened this issue Feb 17, 2023 · 18 comments
Labels
bug: Something isn't working
information-requested: Bug is waiting on additional information from the user
needs:reproducer: This bug/feature is in need of a minimal reproducer
priority:p1: Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc.

Comments

@bruce-y

bruce-y commented Feb 17, 2023

What happened?

Steps to Reproduce

We've instrumented some metrics via the OTel metrics SDK and are exposing them with the Prometheus exporter.

Expected Result

Prometheus scrapes against this service should work without memory continuously increasing.

Actual Result

We are seeing a very steady increase in memory usage until it hits the heap limit, at which point the service terminates and restarts.

[image: graph of memory usage climbing steadily until the heap limit is reached]

Additional Details

OpenTelemetry Setup Code

import express, { Response, NextFunction } from 'express';
import { IncomingMessage } from 'http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { Counter as OtelCounter, Histogram as OtelHistogram, metrics } from '@opentelemetry/api';

const prometheusExporter = new PrometheusExporter({
  preventServerStart: true
});

const meterProvider = new MeterProvider();

metrics.setGlobalMeterProvider(meterProvider);

meterProvider.addMetricReader(prometheusExporter);

const meter = metrics.getMeter('controller');

export const fooCounter1: OtelCounter = meter.createCounter('foo_counter_1', {
  description: 'foo'
});

export const fooCounter2: OtelCounter = meter.createCounter('foo_counter_2', {
  description: 'foo.'
});

export const fooHistogram1: OtelHistogram = meter.createHistogram('foo_histogram', {
  description: 'foo'
});

export const getMetrics = (req: IncomingMessage, res: Response, next: NextFunction) => {
  try {
    // getMetricsRequestHandler writes the status code and the serialized
    // metrics to the response itself, so don't end the response beforehand
    // with res.sendStatus().
    prometheusExporter.getMetricsRequestHandler(req, res);
  } catch (err) {
    next(err);
  }
};

const router = express.Router();

router.get('/metrics', getMetrics);

export default router;

package.json

{
  "dependencies": {
    "@opentelemetry/api": "1.3.0",
    "@opentelemetry/core": "1.8.0",
    "@opentelemetry/exporter-jaeger": "1.7.0",
    "@opentelemetry/exporter-metrics-otlp-http": "0.34.0",
    "@opentelemetry/exporter-prometheus": "0.34.0",
    "@opentelemetry/exporter-trace-otlp-http": "0.32.0",
    "@opentelemetry/propagator-jaeger": "1.7.0",
    "@opentelemetry/resources": "1.6.0",
    "@opentelemetry/sdk-metrics": "1.8.0",
    "@opentelemetry/sdk-trace-base": "1.6.0",
    "@opentelemetry/sdk-trace-node": "1.6.0",
    "@opentelemetry/semantic-conventions": "1.6.0",
    ...
    }
}

Relevant log output

No response

@bruce-y added the bug and triage labels Feb 17, 2023
@dyladan added the priority:p1 label and removed the triage label Feb 22, 2023
@pichlermarc
Member

That does indeed look like a memory leak; however, I'm having some trouble replicating it on my end. 🤔

Could you provide some more info on the metrics that you're recording? A known attribute configuration that can cause memory leaks is attributes with many possible values (for instance, a timestamp or a user ID); see #2997. Could this be what's causing your issue?
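To illustrate the attribute pattern referenced above (#2997), here is a minimal sketch; the instrument and attribute names are made up for illustration. Under cumulative aggregation the SDK keeps one metric stream per distinct attribute set for the lifetime of the process, so an attribute with unbounded values grows memory without bound.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('example');
const requestCounter = meter.createCounter('http_requests_total');

// Problematic: 'userId' can take unbounded values, so every new user creates
// a brand-new attribute set (and therefore a new metric stream) that the SDK
// keeps around.
export function recordRequestLeaky(userId: string, route: string) {
  requestCounter.add(1, { userId, route });
}

// Safer: only record attributes drawn from a small, known set of values.
export function recordRequestBounded(route: string, statusCode: number) {
  requestCounter.add(1, { route, status_class: `${Math.floor(statusCode / 100)}xx` });
}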

@pichlermarc added the information-requested label Feb 27, 2023
@pichlermarc
Member

@bruce-y any update? 🤔

@jdmarshall

jdmarshall commented Mar 10, 2023

I'm currently tracking ever-increasing CPU usage by aws-otel-collector while using the OTLP exporter.

So I'm wondering if this is two bugs, the same bug in two implementations, or a continuous accumulation of data that's not being cleared during an export (and/or being sent multiple times). Spiralling data would result in memory increases, and spiralling data sets would result in increasing CPU.

@jdmarshall

@bruce-y what's your export interval?

Also, are you deduping metrics with the same name? One of the first problems I ran into with this ecosystem was treating stats like StatsD, where you can keep throwing the same name at the API over and over with no consequences.
In OpenTelemetry, each counter, histogram, and gauge ends up accumulating both memory and CPU time, so I had to introduce a lookup table (and be careful with names vs. attributes to keep the table from growing unbounded).
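For reference, a minimal sketch of the lookup-table approach described above (the helper and names are illustrative, not taken from any poster's code): each instrument is created once per name and reused, instead of calling createCounter repeatedly with the same name.

import { metrics, Counter } from '@opentelemetry/api';

const meter = metrics.getMeter('example');
const counters = new Map<string, Counter>();

// Return the existing counter for a name, creating it only on first use.
// Keep the set of names small and bounded; variable data belongs in
// attributes, not in the instrument name.
export function getCounter(name: string): Counter {
  let counter = counters.get(name);
  if (counter === undefined) {
    counter = meter.createCounter(name);
    counters.set(name, counter);
  }
  return counter;
}

getCounter('jobs_processed_total').add(1, { queue: 'default' });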

@bruce-y
Author

bruce-y commented May 3, 2023

Apologies for the late reply.

Could you provide some more info on the metrics that you're recording? A known attribute configuration that can cause memory leaks is attributes with many possible values (for instance, a timestamp or a user ID); see #2997. Could this be what's causing your issue?

We aren't dealing with super-high cardinality for a single metric, but we are exporting many histograms, each with a large number of buckets. I'd estimate there are roughly 80k lines of metrics per export.

We're exporting the same amount via https://github.com/siimon/prom-client without issue, though.
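As a side note for anyone profiling a similar setup: the exported line count is roughly the number of histograms times the attribute combinations times the buckets per histogram, since every attribute set keeps its own copy of all bucket counters. One way to keep that product down is to override the bucket boundaries with a View; a minimal sketch, assuming the instrument name and boundaries below (both are illustrative only):

import { MeterProvider, View, ExplicitBucketHistogramAggregation } from '@opentelemetry/sdk-metrics';

// Limit 'foo_histogram' to a few explicit buckets, which shrinks every
// exported series for that instrument.
const meterProvider = new MeterProvider({
  views: [
    new View({
      instrumentName: 'foo_histogram',
      aggregation: new ExplicitBucketHistogramAggregation([0.05, 0.1, 0.5, 1, 5]),
    }),
  ],
});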

@bruce-y what's your export interval?

The scrape interval we have set locally is 30 seconds.

Also, are you deduping metrics with the same name? One of the first problems I ran into with this ecosystem was treating stats like StatsD, where you can keep throwing the same name at the API over and over with no consequences.
In OpenTelemetry, each counter, histogram, and gauge ends up accumulating both memory and CPU time, so I had to introduce a lookup table (and be careful with names vs. attributes to keep the table from growing unbounded).

We're exporting the metric references as a singleton in our application, so anytime something is observed we're using the existing metric objects rather than instantiating new ones with the same name.

@pichlermarc added the needs:reproducer label Jun 21, 2023
@cyberw

cyberw commented Aug 9, 2023

I'm experiencing something similar, but only when I register TWO MetricReaders (PrometheusExporter and PeriodicExportingMetricReader). Either one by itself seems to work fine.
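For context, a minimal sketch of that two-reader setup (the OTLP endpoint and export interval are placeholders, not values taken from the commenter's configuration):

import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const meterProvider = new MeterProvider();

// Reader 1: pull-based Prometheus endpoint.
meterProvider.addMetricReader(new PrometheusExporter({ preventServerStart: true }));

// Reader 2: push-based periodic OTLP export. A short interval makes any
// growth in accumulation objects visible quickly.
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    exportIntervalMillis: 1000,
  })
);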

@pichlermarc
Member

I'm experiencing something similar, but only when I register TWO MetricReaders (PrometheusExporter and PeriodicExportingMetricReader). Either one by itself seems to work fine.

Hmm, interesting. Does it never run out of memory with just one reader, or does it just take longer? 🤔

Which exporter are you using with the PeriodicExportingMetricReader? What's its temporality preference? Maybe there's some combination of temporalities that's causing a memory leak in the SDK 🤔

@cyberw

cyberw commented Aug 9, 2023

It does look like it has completely stopped. The objects I usually see increasing in DevTools/Memory (when I have both readers running) are mainly HistogramAccumulation and LastValueAccumulation (though something else could be holding on to them). With only the PrometheusExporter running, I don't see a single new one of those after about 10 minutes, but with both there are typically thousands.

I don't think I've set a specific temporality preference for either, but let me check.

Here is an example of objects allocated during about 10 minutes with both Readers:
[image: heap profiler view of objects allocated over about 10 minutes with both readers]
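For anyone trying to reproduce this kind of comparison outside of DevTools, Node's built-in v8 module can write heap snapshots that can then be loaded and diffed in the Chrome DevTools Memory tab; a minimal sketch (the interval is arbitrary):

import { writeHeapSnapshot } from 'v8';

// Write a snapshot every 10 minutes; load the resulting .heapsnapshot files
// in Chrome DevTools (Memory tab) and compare them to see which objects
// (e.g. HistogramAccumulation) keep growing between snapshots.
setInterval(() => {
  const file = writeHeapSnapshot();
  console.log(`wrote heap snapshot: ${file}`);
}, 10 * 60 * 1000);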

@cyberw

cyberw commented Aug 9, 2023

PeriodicExportingMetricReader is using Cumulative temporality as well.

@pichlermarc
Member

Thanks for the additional info @cyberw, I'll try to reproduce.

@jdmarshall

jdmarshall commented Aug 29, 2023

For my own investigations, I've seen that every combination of labels is saved as a data structure inside a Metric, in perpetuity. So as our app warms up and the whole cluster eventually sees every variation of every combination of labels, the memory footprint climbs and climbs for hours to days.

From conversations with others (such as the one yesterday on Hacker News), some people imply that this may not be the case for OpenTelemetry implementations in other programming languages. If that's true, I'd very much like this implementation to reach feature parity.

I have some very unkind things to say about OTel, and most of them stem from the scalability implications of total persistent state tracking (vs., say, StatsD, which forgets everything the moment the stat is sent). The other is the antagonism toward GIL or single-thread-per-process languages, which end up running 16-40 copies of the data structures per machine. The aggregate stat size is a huge blind spot, and if that's an accident of implementation rather than a design goal, it definitely needs to be addressed. But from the code I don't see how it could be an accident.

@pichlermarc
Member

@cyberw I was able to create a minimal reproducer for your reported problem over at #4115; it looks like shortening the export interval makes it very easy to see that memory is leaking. Since OP's reported setup only has one exporter, I won't close #4115 as a duplicate, as they might be separate issues.

@pichlermarc
Member

For my own investigations, I've seen that every combination of labels is saved as a data structure inside a Metric, in perpetuity. So as our app warms up and the whole cluster eventually sees every variation of every combination of labels, the memory footprint climbs and climbs for hours to days.

From conversations with others (such as the one yesterday on Hacker News), some people imply that this may not be the case for OpenTelemetry implementations in other programming languages. If that's true, I'd very much like this implementation to reach feature parity.

@jdmarshall that's true, some language implementations do have an experimental feature to evict unused metric streams. Agreed, having it in the JS SDK would be great too (opened #4095 a few days ago to track it).

@jdmarshall

I fiddled with delta-based reporting, which seems like the most likely route to dealing with zebras, but WebStorm refuses to set breakpoints in opentelemetry-js. And the code is very... Spring-like, so single-stepping is an exercise in very deep patience, along with copious note-keeping or a truly cavernous short-term memory. I was unable to get to the bottom of why our collector went to 0 stats when I attempted to use it.
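For anyone wanting to try the same experiment, delta temporality can be requested per exporter; a minimal sketch, assuming the temporalityPreference option of the OTLP HTTP metric exporter (the PrometheusExporter itself always uses cumulative temporality):

import { MeterProvider, PeriodicExportingMetricReader, AggregationTemporality } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const meterProvider = new MeterProvider();

// Ask the OTLP exporter to report deltas between collections instead of
// cumulative totals since process start.
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      temporalityPreference: AggregationTemporality.DELTA,
    }),
  })
);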

@cyberw

cyberw commented Oct 17, 2023

Is this fixed by #4163?

@pichlermarc
Member

Is this fixed by #4163?

Unfortunately, no. It should solve the problem you had (#4115), but I was never able to reproduce this exact problem reported by @bruce-y (single Prometheus exporter/metric reader). The fix from #4163 strictly addresses problems with 2+ metric readers.

Any help with reproducing this issue here would be greatly appreciated.

@jdmarshall

I know that in my case I had set up the memory reader for some debugging output. Since I had some data points from before adding that code, I suspected it was involved, but it took some pretty deep stepping into the code to see how it mattered.

I don't know about @bruce-y, but I expect this fixed my issue.

@pichlermarc
Member

I'm closing this issue as cannot reproduce (single Prometheus exporter setup).
If the same problem persists, please feel free to open a new issue and link this one.

@pichlermarc closed this as not planned Feb 13, 2024