
Potential memory leak with the prometheus metrics exporter #3621

Closed
bruce-y opened this issue Feb 17, 2023 · 18 comments
Labels
bug: Something isn't working
information-requested: Bug is waiting on additional information from the user
needs:reproducer: This bug/feature is in need of a minimal reproducer
priority:p1: Bugs which cause problems in end-user applications such as crashes, data inconsistencies, etc.

Comments

@bruce-y

bruce-y commented Feb 17, 2023

What happened?

Steps to Reproduce

We've instrumented some metrics via the OTel metrics SDK and are exposing them with the Prometheus exporter.

Expected Result

Prometheus scrapes against this service should work without memory continuously increasing.

Actual Result

We are seeing a very steady increase in memory usage until it hits the heap limit, at which point the service terminates and restarts.

[image: graph of memory usage climbing steadily until the heap limit is reached]

Additional Details

OpenTelemetry Setup Code

import express, { Response, NextFunction } from 'express';
import { IncomingMessage } from 'http';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { MeterProvider } from '@opentelemetry/sdk-metrics';
import { Counter as OtelCounter, Histogram as OtelHistogram, metrics } from '@opentelemetry/api';

const prometheusExporter = new PrometheusExporter({
  preventServerStart: true
});

const meterProvider = new MeterProvider();

metrics.setGlobalMeterProvider(meterProvider);

meterProvider.addMetricReader(prometheusExporter);

const meter = metrics.getMeter('controller');

export const fooCounter1: OtelCounter = meter.createCounter('foo_counter_1', {
  description: 'foo'
});

export const fooCounter2: OtelCounter = meter.createCounter('foo_counter_2', {
  description: 'foo.'
});

export const fooHistogram1: OtelHistogram = meter.createHistogram('foo_histogram', {
  description: 'foo'
});

export const getMetrics = (req: IncomingMessage, res: Response, next: NextFunction) => {
  try {
    // getMetricsRequestHandler writes the status code and the serialized
    // metrics to the response itself, so don't end the response beforehand
    // with res.sendStatus().
    prometheusExporter.getMetricsRequestHandler(req, res);
  } catch (err) {
    next(err);
  }
};

const router = express.Router();

router.get('/metrics', getMetrics);

export default router;

package.json

{
  "dependencies": {
    "@opentelemetry/api": "1.3.0",
    "@opentelemetry/core": "1.8.0",
    "@opentelemetry/exporter-jaeger": "1.7.0",
    "@opentelemetry/exporter-metrics-otlp-http": "0.34.0",
    "@opentelemetry/exporter-prometheus": "0.34.0",
    "@opentelemetry/exporter-trace-otlp-http": "0.32.0",
    "@opentelemetry/propagator-jaeger": "1.7.0",
    "@opentelemetry/resources": "1.6.0",
    "@opentelemetry/sdk-metrics": "1.8.0",
    "@opentelemetry/sdk-trace-base": "1.6.0",
    "@opentelemetry/sdk-trace-node": "1.6.0",
    "@opentelemetry/semantic-conventions": "1.6.0",
    ...
    }
}

Relevant log output

No response

@bruce-y added the bug and triage labels Feb 17, 2023
@dyladan added the priority:p1 label and removed the triage label Feb 22, 2023
@pichlermarc
Member

That does indeed look like a memory leak; however, I'm having some trouble replicating it on my end. 🤔

Could you provide some more info on the metrics that you're recording? A known attribute configuration that can cause memory leaks is attributes with many possible values (for instance, a timestamp or a user ID); see #2997. Could this be what's causing your issue?
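To illustrate the attribute pattern referenced above (#2997), here is a minimal sketch; the instrument and attribute names are made up for illustration. Under cumulative aggregation the SDK keeps one metric stream per distinct attribute set for the lifetime of the process, so an attribute with unbounded values grows memory without bound.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('example');
const requestCounter = meter.createCounter('http_requests_total');

// Problematic: 'userId' can take unbounded values, so every new user creates
// a brand-new attribute set (and therefore a new metric stream) that the SDK
// keeps around.
export function recordRequestLeaky(userId: string, route: string) {
  requestCounter.add(1, { userId, route });
}

// Safer: only record attributes drawn from a small, known set of values.
export function recordRequestBounded(route: string, statusCode: number) {
  requestCounter.add(1, { route, status_class: `${Math.floor(statusCode / 100)}xx` });
}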

@pichlermarc added the information-requested label Feb 27, 2023
@pichlermarc
Member

@bruce-y any update? 🤔

@jdmarshall

jdmarshall commented Mar 10, 2023

I'm currently tracking ever-increasing CPU usage by aws-otel-collector while using the OTLP exporter.

So I'm wondering if this is two bugs, the same bug in two implementations, or a continuous accumulation of data that's not being cleared during an export (and/or being sent multiple times). Spiralling data would result in memory increases, and spiralling data sets would result in increasing CPU.

@jdmarshall

@bruce-y what's your export interval?

Also, are you deduping metrics with the same name? One of the first problems I ran into with this ecosystem was treating stats like StatsD, where you can keep throwing the same name at the API over and over with no consequences.
In OpenTelemetry, each counter, histogram, and gauge ends up accumulating both memory and CPU time, so I had to introduce a lookup table (and be careful with names vs. attributes to keep the table from growing unbounded).
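For reference, a minimal sketch of the lookup-table approach described above (the helper and names are illustrative, not taken from any poster's code): each instrument is created once per name and reused, instead of calling createCounter repeatedly with the same name.

import { metrics, Counter } from '@opentelemetry/api';

const meter = metrics.getMeter('example');
const counters = new Map<string, Counter>();

// Return the existing counter for a name, creating it only on first use.
// Keep the set of names small and bounded; variable data belongs in
// attributes, not in the instrument name.
export function getCounter(name: string): Counter {
  let counter = counters.get(name);
  if (counter === undefined) {
    counter = meter.createCounter(name);
    counters.set(name, counter);
  }
  return counter;
}

getCounter('jobs_processed_total').add(1, { queue: 'default' });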

@bruce-y
Author

bruce-y commented May 3, 2023

Apologies for the late reply.

Could you provide some more info on the metrics that you're recording? A known attribute configuration that can cause memory leaks is attributes with many possible values (for instance, a timestamp or a user ID); see #2997. Could this be what's causing your issue?

We aren't dealing with super-high cardinality for a single metric, but we are exporting many histograms, each with a large number of buckets. I'd estimate there are roughly 80k lines of metrics per export.

We're exporting the same amount via https://github.com/siimon/prom-client without issue, though.
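As a side note for anyone profiling a similar setup: the exported line count is roughly the number of histograms times the attribute combinations times the buckets per histogram, since every attribute set keeps its own copy of all bucket counters. One way to keep that product down is to override the bucket boundaries with a View; a minimal sketch, assuming the instrument name and boundaries below (both are illustrative only):

import { MeterProvider, View, ExplicitBucketHistogramAggregation } from '@opentelemetry/sdk-metrics';

// Limit 'foo_histogram' to a few explicit buckets, which shrinks every
// exported series for that instrument.
const meterProvider = new MeterProvider({
  views: [
    new View({
      instrumentName: 'foo_histogram',
      aggregation: new ExplicitBucketHistogramAggregation([0.05, 0.1, 0.5, 1, 5]),
    }),
  ],
});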

@bruce-y what's your export interval?

The scrape interval we have set locally is 30 seconds.

Also, are you deduping metrics with the same name? One of the first problems I ran into with this ecosystem was treating stats like StatsD, where you can keep throwing the same name at the API over and over with no consequences.
In OpenTelemetry, each counter, histogram, and gauge ends up accumulating both memory and CPU time, so I had to introduce a lookup table (and be careful with names vs. attributes to keep the table from growing unbounded).

We're exporting the metric references as a singleton in our application, so anytime something is observed we're using the existing metric objects rather than instantiating new ones with the same name.

@pichlermarc added the needs:reproducer label Jun 21, 2023
@cyberw

cyberw commented Aug 9, 2023

I'm experiencing something similar, but only when I register TWO MetricReaders (PrometheusExporter and PeriodicExportingMetricReader). Either one by itself seems to work fine.
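For context, a minimal sketch of that two-reader setup (the OTLP endpoint and export interval are placeholders, not values taken from the commenter's configuration):

import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const meterProvider = new MeterProvider();

// Reader 1: pull-based Prometheus endpoint.
meterProvider.addMetricReader(new PrometheusExporter({ preventServerStart: true }));

// Reader 2: push-based periodic OTLP export. A short interval makes any
// growth in accumulation objects visible quickly.
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: 'http://localhost:4318/v1/metrics' }),
    exportIntervalMillis: 1000,
  })
);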

@pichlermarc
Member

I'm experiencing something similar, but only when I register TWO MetricReaders (PrometheusExporter and PeriodicExportingMetricReader). Either one by itself seems to work fine.

Hmm, interesting. Does it never run out of memory with just one reader, or does it just take longer? 🤔

Which exporter are you using with the PeriodicExportingMetricReader? What's its temporality preference? Maybe there's some combination of temporalities that's causing a memory leak in the SDK 🤔

@cyberw

cyberw commented Aug 9, 2023

It does look like it has completely stopped. The objects I usually see increasing in DevTools/Memory (when I have both readers running) are mainly HistogramAccumulation and LastValueAccumulation (though something else could be holding on to them). With only the PrometheusExporter running, I don't see a single new one of those after about 10 minutes, but with both there are typically thousands.

I don't think I've set a specific temporality preference for either, but let me check.

Here is an example of objects allocated during about 10 minutes with both Readers:
[image: heap profiler view of objects allocated over about 10 minutes with both readers]
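For anyone trying to reproduce this kind of comparison outside of DevTools, Node's built-in v8 module can write heap snapshots that can then be loaded and diffed in the Chrome DevTools Memory tab; a minimal sketch (the interval is arbitrary):

import { writeHeapSnapshot } from 'v8';

// Write a snapshot every 10 minutes; load the resulting .heapsnapshot files
// in Chrome DevTools (Memory tab) and compare them to see which objects
// (e.g. HistogramAccumulation) keep growing between snapshots.
setInterval(() => {
  const file = writeHeapSnapshot();
  console.log(`wrote heap snapshot: ${file}`);
}, 10 * 60 * 1000);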

@cyberw

cyberw commented Aug 9, 2023

PeriodicExportingMetricReader is using Cumulative temporality as well.

@pichlermarc
Member

Thanks for the additional info @cyberw, I'll try to reproduce.

@jdmarshall

jdmarshall commented Aug 29, 2023

For my own investigations, I've seen that every combination of labels is saved as a data structure inside a Metric, in perpetuity. So as our app warms up and the whole cluster eventually sees every variation of every combination of labels, the memory footprint climbs and climbs for hours to days.

From conversations with others (such as the one yesterday on Hacker News), some people imply that this may not be the case for OpenTelemetry implementations in other programming languages. If that's true, I'd very much like this implementation to reach feature parity.

I have some very unkind things to say about OTel, and most of them stem from the scalability implications of total persistent state tracking (vs., say, StatsD, which forgets everything the moment the stat is sent). The other is the antagonism toward GIL or single-thread-per-process languages, which end up running 16-40 copies of the data structures per machine. The aggregate stat size is a huge blind spot, and if that's an accident of implementation rather than a design goal, it definitely needs to be addressed. But from the code I don't see how it could be an accident.

@pichlermarc
Member

@cyberw I was able to create a minimal reproducer for your reported problem over at #4115; it looks like shortening the export interval makes it very easy to see that memory is leaking. Since OP's reported setup only has one exporter, I won't close #4115 as a duplicate, as they might be separate issues.

@pichlermarc
Member

For my own investigations, I've seen that every combination of labels is saved as a data structure inside a Metric, in perpetuity. So as our app warms up and the whole cluster eventually sees every variation of every combination of labels, the memory footprint climbs and climbs for hours to days.

From conversations with others (such as the one yesterday on Hacker News), some people imply that this may not be the case for OpenTelemetry implementations in other programming languages. If that's true, I'd very much like this implementation to reach feature parity.

@jdmarshall that's true, some language implementations do have an experimental feature to evict unused metric streams. Agreed, having it in the JS SDK would be great too (opened #4095 a few days ago to track it).

@jdmarshall

I fiddled with delta-based reporting, which seems like the most likely route to dealing with zebras, but WebStorm refuses to set breakpoints in opentelemetry-js. And the code is very... Spring-like, so single-stepping is an exercise in very deep patience, along with copious note-keeping or a truly cavernous short-term memory. I was unable to get to the bottom of why our collector went to 0 stats when I attempted to use it.
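For anyone wanting to try the same experiment, delta temporality can be requested per exporter; a minimal sketch, assuming the temporalityPreference option of the OTLP HTTP metric exporter (the PrometheusExporter itself always uses cumulative temporality):

import { MeterProvider, PeriodicExportingMetricReader, AggregationTemporality } from '@opentelemetry/sdk-metrics';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

const meterProvider = new MeterProvider();

// Ask the OTLP exporter to report deltas between collections instead of
// cumulative totals since process start.
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      temporalityPreference: AggregationTemporality.DELTA,
    }),
  })
);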

@cyberw

cyberw commented Oct 17, 2023

Is this fixed by #4163?

@pichlermarc
Member

Is this fixed by #4163?

Unfortunately, no. It should solve the problem you had (#4115), but I was never able to reproduce this exact problem reported by @bruce-y (single Prometheus exporter/metric reader). The fix from #4163 strictly addresses problems with 2+ metric readers.

Any help with reproducing this issue here would be greatly appreciated.

@jdmarshall

I know that in my case I had set up the memory reader for some debugging output. Since I had some data points from before adding that code, I suspected it was involved, but it took some pretty deep stepping into the code to see how it mattered.

I don't know about @bruce-y, but I expect this fixed my issue.

@pichlermarc
Member

I'm closing this issue as cannot reproduce (single Prometheus exporter setup).
If the same problem persists, please feel free to open a new issue and link this one.

@pichlermarc closed this as not planned Feb 13, 2024