Periodically seeing fatal Timeout error messages #3740

Closed
laurencefass opened this issue Apr 18, 2023 · 8 comments
Labels
bug (Something isn't working) · information-requested (Bug is waiting on additional information from the user) · needs:reproducer (This bug/feature is in need of a minimal reproducer) · triage

Comments

@laurencefass

What happened?

Steps to Reproduce

I am running my OTel-enabled app and I am seeing fatal timeout errors during periods of no activity. See the error details below.

Expected Result

No fatal error.

Actual Result

This happens after a long but undefined period of inactivity; see the relevant log output below.


Additional Details

OpenTelemetry Setup Code

import { Span } from '@opentelemetry/api';
import { ExportResult } from '@opentelemetry/core';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { WinstonInstrumentation } from '@opentelemetry/instrumentation-winston';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { Resource } from '@opentelemetry/resources';
import {
	MetricExporter,
	MetricRecord,
	UngroupedProcessor,
} from '@opentelemetry/sdk-metrics-base';
import { NodeSDK } from '@opentelemetry/sdk-node';
import {
	BatchSpanProcessor,
	ReadableSpan,
	SpanExporter,
} from '@opentelemetry/sdk-trace-base';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { IncomingMessage, ServerResponse } from 'http';
import { TypeormInstrumentation } from 'opentelemetry-instrumentation-typeorm';
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
import { Strings } from 'src/common/strings';
import { DEFAULT_OPENTELEMETRY_INTERVAL } from 'src/common/constants';

if (process.env.NODE_ENV === Strings.DEVELOPMENT) {
	diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);
}

export const monitoringAttributes: Record<string, string> = {
	[SemanticResourceAttributes.SERVICE_NAMESPACE]:
		Strings.MICROSERVICE_NAMESPACE,
	[SemanticResourceAttributes.SERVICE_NAME]: Strings.EVENTS_SERVICE,
	[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]:
		process.env.NODE_ENV ?? Strings.PRODUCTION,
};

class NullSpanExporter implements SpanExporter {
	export(
		spans: ReadableSpan[],
		resultCallback: (result: ExportResult) => void,
	): void {
		resultCallback({
			code: 0,
		});
	}
	shutdown(): Promise<void> {
		return Promise.resolve();
	}
}

class NullMetricExporter implements MetricExporter {
	export(
		metrics: MetricRecord[],
		resultCallback: (result: ExportResult) => void,
	): void {
		resultCallback({
			code: 0,
		});
	}
	shutdown(): Promise<void> {
		return Promise.resolve();
	}
}

/**
 * Adding this processor will append static attributes to all metrics
 * This is necessary as observable counters/gauges used by the host
 * metrics package do not take attributes from the resource, while other
 * instrumentations do
 */
class AddStaticAttributesProcessor extends UngroupedProcessor {
	process(record: MetricRecord) {
		for (const [key, value] of Object.entries(monitoringAttributes)) {
			if (!record.attributes[key]) {
				record.attributes[key] = value;
			}
		}
		super.process(record);
	}
}

function getMetricInterval(): number | undefined {
	if (process.env.OPENTELEMETRY_METRIC_INTERVAL_MS) {
		return parseInt(process.env.OPENTELEMETRY_METRIC_INTERVAL_MS);
	}
	return DEFAULT_OPENTELEMETRY_INTERVAL;
}

function getTraceInterval(): number | undefined {
	if (process.env.OPENTELEMETRY_TRACE_INTERVAL_MS) {
		return parseInt(process.env.OPENTELEMETRY_TRACE_INTERVAL_MS);
	}
	return DEFAULT_OPENTELEMETRY_INTERVAL;
}

export async function startTracing() {
	let traceExporter: SpanExporter;
	let metricExporter: MetricExporter;

	if (!('OPENTELEMETRY_COLLECTOR_URL' in process.env)) {
		console.log('WARNING: OPENTELEMETRY_COLLECTOR_URL is not set.');
		traceExporter = new NullSpanExporter();
		metricExporter = new NullMetricExporter();
	} else {
		const collectorOptions = {
			url: `grpc://${process.env.OPENTELEMETRY_COLLECTOR_URL}`,
		};
		traceExporter = new OTLPTraceExporter(collectorOptions);
		metricExporter = new OTLPMetricExporter(collectorOptions);
	}

	const sdk = new NodeSDK({
		resource: new Resource(monitoringAttributes),
		traceExporter,
		metricInterval: getMetricInterval(),
		metricExporter,
		metricProcessor: new AddStaticAttributesProcessor(),
		spanProcessor: new BatchSpanProcessor(traceExporter, {
			scheduledDelayMillis: getTraceInterval(),
		}),
		instrumentations: [
			new WinstonInstrumentation(),
			new PgInstrumentation({
				enhancedDatabaseReporting: true,
			}),
			new HttpInstrumentation({
				responseHook(
					span: Span,
					response: IncomingMessage | ServerResponse,
				) {
					if ('setHeader' in response) {
						response.setHeader(
							'x-events-service-trace-id',
							span.spanContext().traceId,
						);
					}
				},
				ignoreIncomingPaths: [/\/docs\//],
			}),
			new ExpressInstrumentation(),
			new TypeormInstrumentation(),
		],
	});

	await sdk
		.start()
		.then(() => console.log('Tracing initialized'))
		.catch((error) => console.log('Error initializing tracing', error));

	// gracefully shut down the SDK on process exit
	process.on('SIGTERM', () => {
		sdk.shutdown()
			.then(() => console.log('Tracing terminated'))
			.catch((error) => console.log('Error terminating tracing', error))
			.finally(() => process.exit(0));
	});
}

package.json

"@opentelemetry/exporter-metrics-otlp-grpc": "^0.27.0",
		"@opentelemetry/exporter-trace-otlp-grpc": "^0.27.0",
		"@opentelemetry/instrumentation-express": "^0.27.0",
		"@opentelemetry/instrumentation-http": "^0.27.0",
		"@opentelemetry/instrumentation-pg": "^0.27.0",
		"@opentelemetry/instrumentation-winston": "^0.27.0",
		"@opentelemetry/sdk-node": "^0.27.0",
		"opentelemetry-instrumentation-typeorm": "^0.27.0",

Relevant log output

2023-04-18 12:11:55 {"stack":"Error: Timeout\n    at Timeout._onTimeout (/app/node_modules/@opentelemetry/sdk-trace-base/src/export/BatchSpanProcessorBase.ts:153:16)\n    at listOnTimeout (internal/timers.js:557:17)\n    at processTimers (internal/timers.js:500:7)","message":"Timeout","name":"Error"}
2023-04-18 12:12:49 
2023-04-18 12:12:49 <--- Last few GCs --->
2023-04-18 12:12:49 
2023-04-18 12:12:49 [5388:0xffff86fac380]  8046853 ms: Scavenge (reduce) 2038.5 (2044.7) -> 2038.3 (2046.7) MB, 5.8 / 0.0 ms  (average mu = 0.128, current mu = 0.113) allocation failure 
2023-04-18 12:12:49 [5388:0xffff86fac380]  8046863 ms: Scavenge (reduce) 2039.4 (2050.7) -> 2039.2 (2051.2) MB, 7.2 / 0.0 ms  (average mu = 0.128, current mu = 0.113) allocation failure 
2023-04-18 12:12:49 [5388:0xffff86fac380]  8046872 ms: Scavenge (reduce) 2040.2 (2045.2) -> 2040.0 (2048.2) MB, 7.3 / 0.0 ms  (average mu = 0.128, current mu = 0.113) allocation failure 
2023-04-18 12:12:49 
2023-04-18 12:12:49 
2023-04-18 12:12:49 <--- JS stacktrace --->
2023-04-18 12:12:49 
2023-04-18 12:12:49 FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
@laurencefass laurencefass added the bug and triage labels Apr 18, 2023
@Flarna
Member

Flarna commented Apr 18, 2023

This looks a bit like a memory leak; the error is just the last thing logged before the process dies, not the actual root cause.
I recommend creating a few heap dumps and comparing them in DevTools to find out what is leaking.
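
A minimal sketch of one way to do that with Node's built-in v8 module (the SIGUSR2 trigger is just an example; SIGUSR1 is reserved for the Node.js debugger):

import { writeHeapSnapshot } from 'v8';

// Send SIGUSR2 to the process (e.g. `kill -USR2 <pid>`) to write a
// .heapsnapshot file into the working directory. Take one snapshot early,
// another after memory has grown, and compare them in the Chrome DevTools
// Memory tab to see which objects accumulate.
process.on('SIGUSR2', () => {
	const file = writeHeapSnapshot();
	console.log(`Heap snapshot written to ${file}`);
});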

@laurencefass laurencefass changed the title from "Periodically seeing Timeout error messages" to "Periodically seeing fatal Timeout error messages" Apr 18, 2023
@pichlermarc
Member

I see that the versions provided in the thread are also relatively old (0.27.0 was released more than a year ago). Does this still happen with a more recent release? 🤔

@pichlermarc pichlermarc added the information-requested label Apr 19, 2023
@dyladan
Member

dyladan commented Apr 26, 2023

I agree that this looks like a memory leak, but I don't know if we're leaking or if we're just in the stack when the crash happens. Does disabling OpenTelemetry prevent the leak/process crash? A minimal reproduction would be extremely helpful.
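
A minimal sketch of gating the SDK behind an environment flag for such a test, assuming the startTracing() entry point from the setup code above (the OTEL_DISABLED variable name and the './tracing' path are made up for this example):

// The './tracing' path is hypothetical -- point it at wherever startTracing() lives.
import { startTracing } from './tracing';

if (process.env.OTEL_DISABLED === 'true') {
	// OTEL_DISABLED is a made-up flag for this test, not a standard variable.
	console.log('OpenTelemetry disabled for leak testing');
} else {
	startTracing().catch((error) =>
		console.log('Error initializing tracing', error),
	);
}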

@gblock0

gblock0 commented Jun 8, 2023

Adding some extra info if it's helpful.

We're seeing the same issue intermittently throughout the day with the following package versions:

        "@google-cloud/opentelemetry-cloud-trace-exporter": "2.0.0",
        "@opentelemetry/api": "1.4.1",
        "@opentelemetry/exporter-prometheus": "0.37.0",
        "@opentelemetry/instrumentation-express": "0.32.3",
        "@opentelemetry/instrumentation-http": "0.39.1",
        "@opentelemetry/sdk-node": "0.37.0",
        "@opentelemetry/sdk-trace-base": "1.11.0",

I haven't had time to set up a minimal repro yet, but if I get some time, I'll try to get one together.

We did not see these errors before setting up OpenTelemetry. They started the same day we turned it on.

Some example stack traces:

Error: batchWriteSpans error: 14 UNAVAILABLE: The service is currently unavailable.
    at DiagAPI.error (/opt/code/node_modules/@opentelemetry/api/build/src/api/diag.js:40:40)
    at TraceExporter._batchWriteSpans (/opt/code/node_modules/@google-cloud/opentelemetry-cloud-trace-exporter/build/src/trace.js:109:24)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async TraceExporter.export (/opt/code/node_modules/@google-cloud/opentelemetry-cloud-trace-exporter/build/src/trace.js:77:24)
Error: {"stack":"Error: Timeout\n    at Timeout._onTimeout (/opt/code/node_modules/<internal-tool>/node_modules/@opentelemetry/sdk-trace-base/build/src/export/BatchSpanProcessorBase.js:131:24)\n    at listOnTimeout (node:internal/timers:569:17)\n    at process.processTimers (node:internal/timers:512:7)","message":"Timeout","name":"Error"}
    at DiagAPI.error (/opt/code/node_modules/@opentelemetry/api/build/src/api/diag.js:40:40)
    at /opt/code/node_modules/<internal-tool>/node_modules/@opentelemetry/core/build/src/common/logging-error-handler.js:26:20
    at globalErrorHandler (/opt/code/node_modules/<internal-tool>/node_modules/@opentelemetry/core/build/src/common/global-error-handler.js:36:9)
    at /opt/code/node_modules/<internal-tool>/node_modules/@opentelemetry/sdk-trace-base/build/src/export/BatchSpanProcessorBase.js:177:47

@pichlermarc
Member

@gblock0 a minimal repro is always welcome. 🙂 Are you having the same problem as OP (with it actually running into the heap limit)? Otherwise, it may be better to open another issue to keep things separate. 🤔

If it is not running out of memory, the problem could be that the server/collector is sometimes unavailable for export, or that exports occasionally time out (due to network issues, for instance). 🤔
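
If slow or flaky exports turn out to be the culprit, one related knob is the batch span processor's buffer configuration. A sketch, reusing the OTLPTraceExporter from the setup code above (the numbers are illustrative; exportTimeoutMillis defaults to 30000):

import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

const traceExporter = new OTLPTraceExporter({
	url: `grpc://${process.env.OPENTELEMETRY_COLLECTOR_URL}`,
});

// Allow each export more time before the processor reports "Timeout", and
// keep the queue bounded so a long collector outage cannot grow memory
// without limit (spans beyond maxQueueSize are dropped).
const spanProcessor = new BatchSpanProcessor(traceExporter, {
	scheduledDelayMillis: 5000,
	exportTimeoutMillis: 60000,
	maxQueueSize: 2048,
	maxExportBatchSize: 512,
});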

If it is indeed a network issue and you'd just like to get rid of the logs, there are two options (both sketched after the list):

  • registering a diag logger that drops those logs (via diag.setLogger() from @opentelemetry/api)
  • registering an error handler that drops the error/does not log (via setGlobalErrorHandler() from @opentelemetry/core)
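
A rough sketch of both options; the "Timeout" string match is only an illustration of where your own filtering would go:

import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
import { setGlobalErrorHandler } from '@opentelemetry/core';

// Option 1: wrap the console diag logger and drop the messages you are
// not interested in (here: anything containing "Timeout").
const base = new DiagConsoleLogger();
diag.setLogger(
	{
		error: (msg, ...args) => {
			if (!String(msg).includes('Timeout')) base.error(msg, ...args);
		},
		warn: (msg, ...args) => base.warn(msg, ...args),
		info: (msg, ...args) => base.info(msg, ...args),
		debug: (msg, ...args) => base.debug(msg, ...args),
		verbose: (msg, ...args) => base.verbose(msg, ...args),
	},
	DiagLogLevel.INFO,
);

// Option 2: replace the global error handler (which logs via diag.error by
// default) with one that ignores the recurring export timeout error.
setGlobalErrorHandler((err) => {
	const message = err instanceof Error ? err.message : String(err);
	if (message !== 'Timeout') {
		console.error('OpenTelemetry error:', err);
	}
});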

Looking closer at the error messages, it looks like you're using an exporter that is not hosted in this repository but over at https://github.com/GoogleCloudPlatform/opentelemetry-operations-js.

@pichlermarc pichlermarc added the needs:reproducer label Jun 12, 2023
@gblock0

gblock0 commented Jun 12, 2023

Thanks for the quick reply @pichlermarc! I don't think we are running out of memory, but I'll double check. Sorry for making things confusing.

I'll dig into the diag logger some and take a look at the GCP exporter. Thanks for the pointers!

@pichlermarc
Member

@laurencefass are you still seeing the same problem with a more recent version? We fixed a memory leak a while ago that sounds similar to this: #4115

@pichlermarc
Member

@laurencefass closing this issue as we can't reproduce it on the current release. If you find that this is still happening with the most recent versions, please open a new issue and link this one.

@pichlermarc pichlermarc closed this as not planned Feb 22, 2024