Allow file name substitutions when generating symcache files #422

mstange · 2021-09-07T22:23:30Z

I would like to request the capability to specify a file path substitution map when calling Object.make_symcache.

Feature motivation

At Mozilla, our symbol pipeline for the in-progress Tecken symbolication service rewrite is currently as follows:

Raw build artifact -> breakpad .sym file -> symbolic symcache

Unfortunately, the .sym file discards inline stack information, so we'd like to switch to the following instead:

Raw build artifact -> symbolic symcache

(and we'd also keep generating a .sym file, because we still need it for crash report stack unwinding)

However, there's one crucial bit in the .sym file generation pipeline that this direct approach would lose: filename substitution. Here are some examples of the file substitution we do during .sym file generation:

/builds/worker/workspace/build/src/view/nsViewManager.cpp
becomes
hg:hg.mozilla.org/mozilla-central:view/nsViewManager.cpp:92ff4d99f814bab4ddcb5ad7077e2c407b41c9d0

/builds/worker/workspace/obj-build/dist/include/mozilla/RefPtr.h
becomes
hg:hg.mozilla.org/mozilla-central:mfbt/RefPtr.h:f31a22da96294e7f31af4bf9a5a88c68a55c3705

/builds/worker/workspace/build/src/obj-firefox/dom/bindings/AddonManagerBinding.cpp
becomes
s3:gecko-generated-sources:d89a77356015cf19cbe2488905a8e878d105c1d2b8beaf233409a976cce65a065b8319b87b7bb6e2a1dd4c5b7c0c594b9e2c171c47b6dbfe17ec0b08dc453730/dom/bindings/AddonManagerBinding.cpp

/rustc/625451e376bb2e5283fc4741caa0a3e8a2ca4d54/src/liballoc/boxed.rs
becomes
git:github.com/rust-lang/rust:src/liballoc/boxed.rs:625451e376bb2e5283fc4741caa0a3e8a2ca4d54

This substitution effectively creates permalinks for all file paths. It allows us to look up the exact file contents long after the symbol file was processed.

It would be great if there was a way to create a symcache with these substitutions applied.

Specifics

Our current .sym file generation is controlled by a python script, so it would be best if we could use the symbolic python API for this. For example, make_symcache on Object seems like it would need to accept some kind of substitution map.

Something like the following could work:

[
  ["/builds/worker/workspace/build/src/(.*)", "hg:hg.mozilla.org/mozilla-central:{1}:92ff4d99f814bab4ddcb5ad7077e2c407b41c9d0"],
  ["/rustc/([^/]*)/(.*)", "git:github.com/rust-lang/rust:{2}:{1}"],
  ["/builds/worker/workspace/build/src/obj-firefox/(.*)", "s3:gecko-generated-sources:{DIGEST}/{1}"]
]

(This is based on a similar capability in mozilla/dump_syms which is currently unused.)

The trickiest part here is the {DIGEST} part: The digest is the sha-512 hash of the file contents. So, for any paths that match a substitution rule that uses {DIGEST}, we would need to read the file at that path from disk during symcache generation.

Alternatives

These questions explore whether the substitution really needs to happen during symcache generation.

Could the substitution be applied at a different time? Maybe before or after symcache generation?

I think it's not practical to do it before: You'd need to substitute inside the pdb file or inside the DWARF information of an ELF or mach-O file.

Maybe it could be done after, if there was an API to transform an unsubstituted symcache into a substituted symcache.

Could you use an unsubstituted symcache and fix up the paths after each address lookup?

This might be possible if we have an extra artifact on the side, which contains the following information:

the repo root path
the hg revision
the sha-512 hash for every generated file that is referred to by the symcache

But it would be better if the symcache file was self-sufficient and we wouldn't need to keep track of an extra file along with it.

Couldn't the substitution map just be a callback instead?

Making the user supply a callback function instead of a regex map might be simpler in some ways. But it could be annoying to wire up through the python API.
However, it would have the following advantages:

The user would be notified about all paths that required substitution. We could use this to generate the Microsoft source server sourcemap. Currently, this source map only maps paths that are encountered in the .sym file, so it misses any file that only occurs in inlined frames.
The digest generation code wouldn't need to live inside symbolic.

The text was updated successfully, but these errors were encountered:

mstange · 2021-09-08T00:56:11Z

Hmm, I just saw the following remark in issue 330:

SymCaches are only meant to be caches, unlike Breakpad symbols. We usually re-compute them on the fly when the version changes and never read old versions.

So maybe we need an auxiliary data file anyway, to store the hg revision and the sha-512 hashes.
Having such a file would also address issue 354.

Swatinem · 2021-09-08T07:39:27Z

I think a callback solution would be the most flexible here, and also completely outside symbolic.
And yes, symcaches are caches, and not the single source of truth. We tend to apply fixes that wildly change what is being written.

flub · 2021-09-08T08:09:57Z

+1 on the callback. though getting that across the Python API will be some extra fun but probably worth doing.

On whether this API should exist if you consider symcaches to be purely cached info. That's an interesting question, I think there's no problem adding it anyway it's up to you to re-generate the same symcache if you're regenerating it and it makes sense given the purpose of the symcache that it has all the info ready to use, needing a post-processing step there would not be great. That's also my view for #354, I see no problem with including the PE name into the symcache.

With a callback you'd still need a means to somehow store the transformations for that specific DWARF/PDB file if you want to be able to re-generate the symcache I guess, so it doesn't make the auxiliary file entirely obsolete for you if you need reproducability.

mstange · 2021-09-08T19:47:50Z

Thanks for the feedback!

So then the next step would be to decide on on API. I'd need help with that. For example, should the substitution callback only be specified when generating a symcache, or should the callback be a property of the Object, so that the substitution is also applied when function info is queried from the object without going through a symcache? (And then generating the symcache would probably respect the substitution automatically.)

flub · 2021-09-09T10:13:23Z

So there is precedent for doing it on the object for MachO BCSymbolMap files: https://docs.rs/symbolic/8.3.0/symbolic/debuginfo/macho/struct.MachObject.html#method.load_symbolmap. I think the same reasoning applies here and doing it on the object-level is probably preferred. I guess this one applies to all objects though, and not just to one kind.

luser · 2021-12-01T15:05:25Z

Could the substitution be applied at a different time? Maybe before or after symcache generation?

I think it's not practical to do it before: You'd need to substitute inside the pdb file or inside the DWARF information of an ELF or mach-O file.

As it happens, Firefox PDB files already include this information in the .srcsrv stream:
https://searchfox.org/mozilla-central/rev/e9cd2997be1071b9bb76fc14df0f01a2bd721c30/toolkit/crashreporter/tools/symbolstore.py#425

Given that you wrote a parser for this stream I suspect you know this already. :) Since the format is based around variable definitions and substitution, it seems like it would be straightforward to also define a variable that you could evaluate to get a URL to link directly to a specific line in an HTML view of a source file.

In the past I had pondered the idea of generating functionally-equivalent data for Linux/macOS binaries and inserting it into the debug info as an additional section, but never quite got around to it. For a simple proof of concept you could just generate .srcsrv-compatible data and stuff it into the ELF/Mach-O debug info files alongside the DWARF and use your crate to handle it. After writing that down it sounds less ridiculous than I expected, and maybe you should just do that? Then the only requirement here would be "copy .srcsrv data into symcache files", and either an API to just fetch the contents, or (fancier), a way to ask symbolic to use the data to remap filenames for you. The latter might be hard to generalize, since different consumers have different requirements: crash reporting systems likely want a URL to link to a specific line, whereas debuggers and other tools (like my disasm tool for printing source-interleaved disassembly) want the contents of the source file.

debuginfod aims to provide similar functionality, along with functionality that overlaps Microsoft-style symbol servers, but AFAICT it provides source file contents wholesale, not a canonical URL from which you could fetch the source. This is probably due to Linux distros' habit of patching upstreams and not keeping the sources they actually build packages from in source control anywhere. Symbolic has some support for debuginfod, but not support for its source file API.

The trickiest part here is the {DIGEST} part: The digest is the sha-512 hash of the file contents. So, for any paths that match a substitution rule that uses {DIGEST}, we would need to read the file at that path from disk during symcache generation.

Do note that I invented the generated-sources scheme out of whole cloth, so I don't think it's worthwhile to bake support for it directly into symbolic. It does require gathering the info at build-time, since that's the only time tools have access to the generated source files, but the more general solution would be something like the "include .srcsrv data in symcache files" idea I floated above—generate rules for mapping source filenames from the debug info to URLs or whatever using information known at build-time. The Firefox build does this while running dump_syms, since dump_syms does the heavy lifting of parsing the debug info and listing all the source files it references, but one could imagine writing a much smaller single-purpose tool that takes a PDB or DWARF-containing file + some substitution rules, iterates over all the source files from the debug info (which should be much faster than the work dump_syms does), and generates a .srcsrv stream (or equivalent).

mstange · 2021-12-01T19:41:16Z

debuginfod aims to provide similar functionality, along with functionality that overlaps Microsoft-style symbol servers, but AFAICT it provides source file contents wholesale, not a canonical URL from which you could fetch the source.

The canonical URL is <debuginfodserver>/buildid/<buildid>/source/<sourcepath>. Example for libpthread.so.0 from Fedora 34:
https://debuginfod.fedoraproject.org/buildid/7b0cdaf878ab4f99078439d864af70a5fd7b5a2c/source/usr/src/debug/glibc-2.33/nptl/pthread_cond_wait.c

mstange · 2021-12-01T20:02:25Z

Could the substitution be applied at a different time? Maybe before or after symcache generation?

I think it's not practical to do it before: You'd need to substitute inside the pdb file or inside the DWARF information of an ELF or mach-O file.

As it happens, Firefox PDB files already include this information in the .srcsrv stream: https://searchfox.org/mozilla-central/rev/e9cd2997be1071b9bb76fc14df0f01a2bd721c30/toolkit/crashreporter/tools/symbolstore.py#425

Given that you wrote a parser for this stream I suspect you know this already. :)

True! There's a small but surmountable problem here, though, and that's the fact that the Firefox PDBs don't contain mappings for generated source files in the srcsrv stream. The stream only has entries for hg.mozilla.org files, not for files stored in S3. ~~And since the srcsrv format doesn't support conditionals, we can probably only address this by getting rid of the fancy substitution and just putting the raw URL for every file directly in the list.~~ (Edit: Ted reminded me of the existence of %fnvar% which can indeed be used to apply different substitution templates for different entries.)

Since the format is based around variable definitions and substitution, it seems like it would be straightforward to also define a variable that you could evaluate to get a URL to link directly to a specific line in an HTML view of a source file.

I suppose! But the raw source URL would be very much sufficient for my purposes. We can still detect known URLs later on in the pipeline, and map them to known "web viewer" URLs.

In the past I had pondered the idea of generating functionally-equivalent data for Linux/macOS binaries and inserting it into the debug info as an additional section, but never quite got around to it. For a simple proof of concept you could just generate .srcsrv-compatible data and stuff it into the ELF/Mach-O debug info files alongside the DWARF and use your crate to handle it. After writing that down it sounds less ridiculous than I expected, and maybe you should just do that?

It actually doesn't sound terrible. It has a bit of an ad-hoc home-grown feel to it, so we'd need to document it in an easily accessible place.

[...] The Firefox build does this while running dump_syms, since dump_syms does the heavy lifting of parsing the debug info and listing all the source files it references, but one could imagine writing a much smaller single-purpose tool that takes a PDB or DWARF-containing file + some substitution rules, iterates over all the source files from the debug info (which should be much faster than the work dump_syms does), and generates a .srcsrv stream (or equivalent).

Yeah, that sounds mostly reasonable. It does mean that the file list needs to enumerated twice: Once when creating this mapping, and then another time when generating a symcache from the augmented debug info file. The "callback" proposal further up in this issue would only have one traversal. But that's probably not a big deal anyway.

luser · 2021-12-01T21:30:02Z

True! There's a small but surmountable problem here, though, and that's the fact that the Firefox PDBs don't contain mappings for generated source files in the srcsrv stream. The stream only has entries for hg.mozilla.org files, not for files stored in S3. And since the srcsrv format doesn't support conditionals, we can probably only address this by getting rid of the fancy substitution and just putting the raw URL for every file directly in the list.

You'll never guess who punted on implementing that, choosing to file a bug so someone could fix it later:
https://bugzilla.mozilla.org/show_bug.cgi?id=1389225

getsantry bot added the Status: Untriaged label Sep 7, 2021

Swatinem added enhancement and removed Status: Untriaged labels Sep 8, 2021

Swatinem mentioned this issue Feb 3, 2022

feat: Add SymCache Transformers (NATIVE-463) #496

Merged

Swatinem closed this as completed in #496 Feb 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow file name substitutions when generating symcache files #422

Allow file name substitutions when generating symcache files #422

mstange commented Sep 7, 2021 •

edited

Loading

mstange commented Sep 8, 2021 •

edited

Loading

Swatinem commented Sep 8, 2021

flub commented Sep 8, 2021

mstange commented Sep 8, 2021

flub commented Sep 9, 2021

luser commented Dec 1, 2021

Could the substitution be applied at a different time? Maybe before or after symcache generation?

mstange commented Dec 1, 2021 •

edited

Loading

mstange commented Dec 1, 2021 •

edited

Loading

Could the substitution be applied at a different time? Maybe before or after symcache generation?

luser commented Dec 1, 2021

Allow file name substitutions when generating symcache files #422

Allow file name substitutions when generating symcache files #422

Comments

mstange commented Sep 7, 2021 • edited Loading

Feature motivation

Specifics

Alternatives

Could the substitution be applied at a different time? Maybe before or after symcache generation?

Could you use an unsubstituted symcache and fix up the paths after each address lookup?

Couldn't the substitution map just be a callback instead?

mstange commented Sep 8, 2021 • edited Loading

Swatinem commented Sep 8, 2021

flub commented Sep 8, 2021

mstange commented Sep 8, 2021

flub commented Sep 9, 2021

luser commented Dec 1, 2021

Could the substitution be applied at a different time? Maybe before or after symcache generation?

mstange commented Dec 1, 2021 • edited Loading

mstange commented Dec 1, 2021 • edited Loading

Could the substitution be applied at a different time? Maybe before or after symcache generation?

luser commented Dec 1, 2021

mstange commented Sep 7, 2021 •

edited

Loading

mstange commented Sep 8, 2021 •

edited

Loading

mstange commented Dec 1, 2021 •

edited

Loading

mstange commented Dec 1, 2021 •

edited

Loading