Feature: Support a 'hermetic' compile option #1538

Closed
heisencoder opened this issue Jun 13, 2019 · 4 comments

@heisencoder
Contributor

Feature description

We would like to use the 'dbt compile' command to produce a hermetic manifest.json file. Specifically, the manifest.json file should be identical regardless of external factors, such as build time and the full path of the build inputs. This is useful when using 'dbt compile' as a step in a larger build process, particularly when the build can be distributed to different machines or cached between builds.

The troublesome fields in manifest.json include the following:

{
  "generated_at": "2019-06-13T16:00:30.824973Z",
  "metadata": {
    "user_id": "d16d937f-6e18-4c9e-b86f-5513ff2b2686",
    "project_id": "835a08e1ed4824386ca1c8b56bc6c0ae"
  },
  "nodes|docs": {
    "some_name....": {
      "root_path": "/myproject/951f87ac8debbf7415b201a9431d2027c9bd"
    }
  }
}

One proposal is to add a --hermetic flag to the compile command that would remove the "generated_at", "user_id", "project_id", and all "root_path" fields (and anything else that uses information other than the project files in relative paths).

If these fields are omitted, will anything downstream that reads manifest.json break?

Thanks!
-Matt

@drewbanin
Contributor

Hey @heisencoder - I think your best bet is going to be to post-process these artifacts, popping out any fields that are environment-dependent. If you can tell me a little bit more about why these fields are problematic in your build process, I might be able to make some other recommendations here.

I'll just say: I think fundamentally, dbt can't do this. You can make a model with the code:

-- {{ run_started_at }}

select ...

in which case, the compiled SQL for the model will always be dependent on environmental factors. There are other cases too, like operations using modules.datetime, or environment variables that might differ between machines.

What do you think about that?

@heisencoder
Contributor Author

I'm okay with things that change at runtime (like when the SQL templates are evaluated before being executed). I just don't want anything to vary in the compile step.

Post-processing is my current favorite solution. I just figured that before starting on this work, I could produce something that could be upstreamed. However, my current blocker is whether anything needs the four fields I mentioned in the manifest.json file, or whether they are just informational and could be safely omitted.

@drewbanin
Contributor

Oh, sure. The only consumer of these artifacts right now is the dbt docs website. I don't believe these fields are currently used, though we do have plans to use them, e.g. in this issue:
https://github.com/fishtown-analytics/dbt/issues/1397

These fields are part of the contract for the manifest.json file, so it's totally possible that we'll build things in the future that make use of them!

@heisencoder
Contributor Author

Sounds good -- I'll go ahead and proceed via post-processing for now. Thanks!
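
For reference, here is a minimal sketch of the post-processing approach discussed above. It is not part of dbt. It assumes the manifest has the shape shown in the issue description ("generated_at" at the top level, "user_id" and "project_id" under "metadata", and "root_path" on each entry under "nodes" and "docs"). The function names strip_environment_fields and stable_digest are hypothetical, and other fields may also turn out to be environment-dependent.

```python
import copy
import hashlib
import json

# Field names are taken from the issue description; this list is an
# assumption and may be incomplete for a given dbt version.
TOP_LEVEL_FIELDS = ("generated_at",)
METADATA_FIELDS = ("user_id", "project_id")
PER_ENTRY_FIELDS = ("root_path",)
ENTRY_SECTIONS = ("nodes", "docs")


def strip_environment_fields(manifest):
    """Return a copy of a manifest dict without the environment-dependent fields."""
    cleaned = copy.deepcopy(manifest)  # leave the caller's dict untouched
    for field in TOP_LEVEL_FIELDS:
        cleaned.pop(field, None)
    metadata = cleaned.get("metadata")
    if isinstance(metadata, dict):
        for field in METADATA_FIELDS:
            metadata.pop(field, None)
    for section in ENTRY_SECTIONS:
        entries = cleaned.get(section)
        if not isinstance(entries, dict):
            continue
        for entry in entries.values():
            if isinstance(entry, dict):
                for field in PER_ENTRY_FIELDS:
                    entry.pop(field, None)
    return cleaned


def stable_digest(manifest):
    """Hash the cleaned manifest, serialized with sorted keys for a stable byte order."""
    payload = json.dumps(strip_environment_fields(manifest), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A build step could load manifest.json with json.load, write out the result of strip_environment_fields, and use stable_digest as a cache key. Note the caveat raised earlier in the thread: these fields are part of the manifest.json contract, so a stripped file may not work with future consumers that rely on them.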
