Are user-land callback queues in uninstrumented modules the responsibility of the APM agent? #2498

trentm · 2021-12-04T01:20:25Z

(This stems from #2487 (comment) and the set of changes to instrumentations to change/fix some run-context handling.)

background

Many node.js packages, especially clients for databases, support internal queuing: The user-code makes one or more client requests, e.g.:

mysqlClient = new Client(...)
mysqlClient.connect()
// ...
mysqlClient.query('select foo', function onFoo (err, results) { ... })
mysqlClient.query('select bar', function onBar (err, results) { ... })

and the client internally queues those up, gets a connection to the database, dispatches the queued commands (serially or in parallel), and calls the given callbacks when results are in. Thomas dubbed these "user-land callback queues".

One of the jobs of the Node.js APM agent is to track asynchronous context, so that when the result of some action (say, reading a file via fs.readFile()) is ready sometime later, the user code that receives the results have a "current run context" the follows from the when the action was inititiated:

// If APM Transaction "t0" is current here...
fs.readFile('myfile.txt', function (err, results) {
    // ... then APM Transaction "t0" should be current here too.
})

The user-land callback queues mentioned above can break expectations. Using the MySQL example above, if the package creates its queue dispatcher when .connect()'ing, then the callbacks onFoo and onBar will be in the context of that .connect() call.

Using this "play-mysql2-run-context.js" as a demo:

const apm = require('./').start({ // elastic-apm-node
  captureSpanStackTraces: false,
  serviceName: 'play-mysql2-run-context',
  // disableInstrumentations: 'mysql2,mysql',
  // asyncHooks: false
})

var t0, t1, t2

// const mysql = require('mysql')
const mysql = require('mysql2')

t0 = apm.startTransaction('t0-connect')
const conn = mysql.createConnection({ user: 'root' })

t1 = apm.startTransaction('t1-first-select')
conn.query('SELECT 1 + 1 AS solution', function (err, results) {
  console.log('SELECT 1+1: err=%s results=%o %s', err, results, apm._instrumentation._runCtxMgr)
  var s = apm.startSpan('s1'); if (s) s.end() // manual span
})

t2 = apm.startTransaction('t2-second-select')
conn.query('SELECT 2 + 2 AS solution', function (err, results) {
  console.log('SELECT 2+2: err=%s results=%o %s', err, results, apm._instrumentation._runCtxMgr)
  var s = apm.startSpan('s2'); if (s) s.end() // manual span
})

// Lazily shutdown after everything above is finished.
setTimeout(() => {
  console.log('Done')
  t2.end()
  t1.end()
  t0.end()
  conn.end()
}, 1000)

and running with the APM agent disabled, the manual spans s1 and s2 end up on the transaction t0 that was active at the time the client connected, rather than the transactions active when the query calls were made:

transaction "t0-connect"
`- span "s1"
`- span "s2"
transaction "t1-first-select"
transaction "t2-second-select"

Normally the APM agent handles this when it instruments the module by "binding" the context to the callback, resulting in:

transaction "t0-connect"
transaction "t1-first-select"
`- span "SELECT"
`- span "s1"
transaction "t2-second-select"
`- span "SELECT"
`- span "s2"

the question

Now what should the APM agent do if the user disabled instrumentation for this mysql module?

disableInstrumentations: 'mysql2,mysql'

Should disableInstrumentations mean:

don't generate spans/errors for this module; or
don't monkey-patch or touch this module at all because I know or suspect the instrumentation is getting in my way (bugs, perf, lack-of-trust, binary-search debugging)?

The equivalent in OTel (not explicitly setting up a particular opentelemetry-plugin-<module>) is (2).

current state

Why this is coming up now, is that the current Elastic Node.js APM agent is doing user-land callback queue handling for some modules even when that module is listed in disableInstrumentations, but not for others. For example it does (or did) so for the mysql, pg (I think), ioredis, redis, and mysql2 packages (sometimes buggily); but not for express-queue even though its instrumentation was explicitly added for user-land callback queue handling in #373. I say "or did" because as part of #2430 work I had (somewhat unwittingly) been removing this handling because I did not fully understand it. There was a mention of "continuation patches" in the PR that added disableInstrumentations support, but no details. There are also no tests for these cases so I didn't notice breakage.

Hence this issue to discuss it and address it consistently.

At time of writing it has been removed for mysql, pg, ioredis, and redis; and PR #2487 is open which removes it for mysql2.

The text was updated successfully, but these errors were encountered:

AlexanderWert · 2021-12-06T06:53:41Z

My expectation would be option (2) as well. And this is also how for example the Java agent deals with certain instrumentation modules. For instance, if in Java a user disables Vert.x instrumentation this also breaks the context propagation, not only stops capturing spans.

felixbarny · 2021-12-07T17:29:04Z

+1 on option 2

felixbarny · 2021-12-07T18:37:59Z

If an important use-case for disabling just the mysql span creation but enabling the context propagation, we could think about splitting the instrumentation into multiple parts.

disable_instrumentations=mysql disables both context propagation and span creation.
disable_instrumentations=mysql-spans disables span creation, context propagation is enabled
disable_instrumentations=mysql-context disables context propagation, span creation is enabled (not really useful though)
disable_instrumentations=mysql-spans,mysql-context is equivalent to disable_instrumentations=mysql

trentm · 2021-12-07T18:45:26Z

or disabling just the mysql span creation but enabling the context propagation

For interest, Thomas thought out loud about something similar (adding more structure to disable_instrumentations) for the issue that led to the Node.js APM agent's instrumentIncomingHTTPRequests config var: #434

As an alternative we could consider if the config option instead should focus on span types, so instead of disabling pg, you would disable db.postgresql (the full type is db.postgresql.query, but I guess we can just match against the start of the string, which also means that we could write db to disable all db spans).

As a workaround for a user that was relying on "no mysql span creation, but mysql run context propagation", they could use a span filter to filter out the created MySQL spans after the fact. That works, but still implies the overhead of that span creation.

trentm · 2021-12-07T22:16:22Z

[Alan (on private chat)]

no strong opinions

trentm · 2021-12-07T22:43:45Z

To re-state:

Q: Are user-land callback queues in uninstrumented modules the responsibility of the APM agent?
A: No.

I will add a note to the changelog for the 3.26.0 release about this and resolve this issue.

Fixes: #2498

When disableInstrumentations support was added in #353 it only half-disabled these instrumentations, briefly mentioning "continuation patches". I believe this was about user-land callback queue handling that should no longer be necessary after run-context and #2430 work. In #2498 it was discussed and decided that if listed in disableInstrumentations the agent should not touch the module at all. Refs: #2498

trentm added context propagation 8.1-candidate labels Dec 4, 2021

trentm self-assigned this Dec 4, 2021

github-actions bot added the agent-nodejs Make available for APM Agents project planning. label Dec 4, 2021

trentm mentioned this issue Dec 4, 2021

fix: ensure correct run context for mysql2 instrumentation #2487

Merged

3 tasks

trentm changed the title ~~Are user-land callback queues in *uninstrumented* modules to responsibility of the APM agent?~~ Are user-land callback queues in *uninstrumented* modules the responsibility of the APM agent? Dec 4, 2021

trentm mentioned this issue Dec 6, 2021

docs: Fix broken link to 7.16 docs #2499

Merged

trentm added a commit that referenced this issue Dec 7, 2021

3.26.0

b631ee9

Fixes: #2498

trentm mentioned this issue Dec 7, 2021

3.26.0 #2505

Merged

3 tasks

trentm closed this as completed in #2505 Dec 7, 2021

trentm added a commit that referenced this issue Dec 7, 2021

3.26.0 (#2505)

16479c5

Fixes: #2498

trentm mentioned this issue Jan 28, 2022

fix: drop no-longer-necessary half-disabling of instrumentations #2444

Merged

3 tasks

trentm mentioned this issue Feb 24, 2023

Mongoose spans are getting attached to wrong parent transaction #3161

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Are user-land callback queues in uninstrumented modules the responsibility of the APM agent? #2498

Are user-land callback queues in uninstrumented modules the responsibility of the APM agent? #2498

trentm commented Dec 4, 2021 •

edited by zube bot

Loading

AlexanderWert commented Dec 6, 2021

felixbarny commented Dec 7, 2021

felixbarny commented Dec 7, 2021

trentm commented Dec 7, 2021 •

edited

Loading

trentm commented Dec 7, 2021

trentm commented Dec 7, 2021

Are user-land callback queues in *uninstrumented* modules the responsibility of the APM agent? #2498

Are user-land callback queues in *uninstrumented* modules the responsibility of the APM agent? #2498

Comments

trentm commented Dec 4, 2021 • edited by zube bot Loading

background

the question

current state

AlexanderWert commented Dec 6, 2021

felixbarny commented Dec 7, 2021

felixbarny commented Dec 7, 2021

trentm commented Dec 7, 2021 • edited Loading

trentm commented Dec 7, 2021

trentm commented Dec 7, 2021

Are user-land callback queues in uninstrumented modules the responsibility of the APM agent? #2498

Are user-land callback queues in uninstrumented modules the responsibility of the APM agent? #2498

trentm commented Dec 4, 2021 •

edited by zube bot

Loading

trentm commented Dec 7, 2021 •

edited

Loading