feat: command to sync DBT to Superset #18098

Closed
betodealmeida wants to merge 1 commit

betodealmeida
Member

SUMMARY

This PR introduces a new command to sync metadata from DBT to Superset. The command reads the profile and manifest files, creating/updating databases and datasets in Superset based on them.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Given this ~/.dbt/profiles.yml:

superset_examples:
  outputs:

    dev:
      type: postgres
      threads: 1
      host: localhost
      port: 5432
      user: beto
      pass: ''
      dbname: examples_dev
      schema: public
      meta:
        superset:
          cache_timeout: 300  # arbitrary metadata for our DB

  target: dev

The file messages_channels.sql:

SELECT
  messages.ts,
  channels.name,
  messages.text
FROM
  {{ source ('public', 'messages') }} messages
  JOIN {{ source ('public', 'channels') }} channels ON messages.channel_id = channels.id

And schema.yaml:

version: 2

sources:
  - name: public
    tables:
      - name: messages
        description: 'Messages in the Slack channel'
      - name: channels
        description: 'Information about Slack channels'

metrics:
  - name: cnt 
    label: ''
    model: ref('messages_channels')
    description: ''
    type: count
    sql: '*'

We can run:

$ superset sync dbt \
> ~/Projects/dbt-examples/superset_examples/target/manifest.json \
> --project superset_examples \
> --target dev  # not needed, default is already "dev"

This will (1) create (or update) a database connection based on Postgres:

Screenshot 2022-01-19 at 14-46-25 Superset
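
To make step (1) concrete, here is a minimal sketch of how a Postgres target from profiles.yml could be turned into a SQLAlchemy URI for the Superset database connection. The helper name `build_postgres_uri` and the choice of the `postgresql+psycopg2` driver are assumptions for illustration, not necessarily what this PR does.

```python
from urllib.parse import quote_plus


def build_postgres_uri(target: dict) -> str:
    """Build a SQLAlchemy URI from a dbt Postgres target (hypothetical helper)."""
    profile_type = target.get("type")
    if profile_type != "postgres":
        # Only Postgres is covered, mirroring the limitation stated in the PR.
        raise NotImplementedError(f"Unsupported profile type: {profile_type}")

    user = quote_plus(str(target["user"]))
    # dbt accepts both "pass" and "password" for Postgres profiles.
    password = target.get("pass") or target.get("password") or ""
    credentials = f"{user}:{quote_plus(password)}" if password else user

    return (
        f"postgresql+psycopg2://{credentials}"
        f"@{target['host']}:{target['port']}/{target['dbname']}"
    )


# The "dev" output from the profiles.yml shown above.
dev_target = {
    "type": "postgres",
    "host": "localhost",
    "port": 5432,
    "user": "beto",
    "pass": "",
    "dbname": "examples_dev",
    "schema": "public",
}
uri = build_postgres_uri(dev_target)
print(uri)
```

Raising on unknown profile types keeps the Postgres-only limitation explicit; other profile types would each need their own branch.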

It will also (2) create/update three datasets owned by the admin:

Screen Shot 2022-01-19 at 2 47 47 PM

(Note that the dataset description comes from the DBT config.)
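
As a rough illustration of where the three datasets come from, the sketch below walks a manifest and collects one dataset candidate per source table and per model. The `sources` and `nodes` keys and the `resource_type`, `schema`, `name`, and `description` fields are part of the dbt manifest. The function name and the toy manifest are made up for this example and are far smaller than a real manifest.json.

```python
from typing import Dict, List


def dataset_candidates(manifest: dict) -> List[Dict[str, str]]:
    """Collect (schema, table, description) for sources and models (hypothetical helper)."""
    candidates = []
    for source in manifest.get("sources", {}).values():
        candidates.append(
            {
                "schema": source["schema"],
                "table": source["name"],
                "description": source.get("description", ""),
            }
        )
    for node in manifest.get("nodes", {}).values():
        # "nodes" also holds tests, seeds, etc.; only models become datasets here.
        if node.get("resource_type") == "model":
            candidates.append(
                {
                    "schema": node["schema"],
                    "table": node["name"],
                    "description": node.get("description", ""),
                }
            )
    return candidates


# Toy manifest mirroring the example project: two sources and one model.
toy_manifest = {
    "sources": {
        "source.superset_examples.public.messages": {
            "schema": "public",
            "name": "messages",
            "description": "Messages in the Slack channel",
        },
        "source.superset_examples.public.channels": {
            "schema": "public",
            "name": "channels",
            "description": "Information about Slack channels",
        },
    },
    "nodes": {
        "model.superset_examples.messages_channels": {
            "resource_type": "model",
            "schema": "public",
            "name": "messages_channels",
            "description": "",
        },
    },
}
candidates = dataset_candidates(toy_manifest)
print([c["table"] for c in candidates])
```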

It will also populate metrics:

Screenshot 2022-01-19 at 14-48-38 Superset
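
For reference, a dbt metric such as `cnt` (type `count`, sql `*`) has to become a SQL expression before Superset can use it. The sketch below shows one plausible mapping. The aggregation table, the function names, and the output field names (`metric_name`, `expression`, `description`) are assumptions for illustration and may differ from what the PR implements.

```python
# Assumed mapping from dbt metric types to SQL aggregation templates.
AGGREGATIONS = {
    "count": "COUNT({sql})",
    "count_distinct": "COUNT(DISTINCT {sql})",
    "sum": "SUM({sql})",
    "average": "AVG({sql})",
    "min": "MIN({sql})",
    "max": "MAX({sql})",
}


def metric_expression(metric: dict) -> str:
    """Translate a dbt metric definition into a SQL expression (hypothetical helper)."""
    template = AGGREGATIONS.get(metric["type"])
    if template is None:
        raise ValueError(f"Unsupported metric type: {metric['type']}")
    return template.format(sql=metric["sql"])


def to_superset_metric(metric: dict) -> dict:
    """Shape the metric as a payload for a dataset (field names are assumptions)."""
    return {
        "metric_name": metric["name"],
        "expression": metric_expression(metric),
        "description": metric.get("description", ""),
    }


# The metric from schema.yaml above.
cnt = {"name": "cnt", "label": "", "description": "", "type": "count", "sql": "*"}
payload = to_superset_metric(cnt)
print(payload["expression"])
```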

TESTING INSTRUCTIONS

  1. Create a DBT project.
  2. Run superset sync dbt /path/to/project/target/manifest.json --project PROJECT --target TARGET

Check that everything is imported correctly.

Currently, this only works for Postgres, but adding other profile types is straightforward.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov

codecov bot commented Jan 19, 2022

Codecov Report

Merging #18098 (4035ccb) into master (9e2bc72) will decrease coverage by 0.12%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master   #18098      +/-   ##
==========================================
- Coverage   65.85%   65.73%   -0.13%     
==========================================
  Files        1577     1581       +4     
  Lines       61828    61942     +114     
  Branches     6244     6244              
==========================================
  Hits        40719    40719              
- Misses      19509    19623     +114     
  Partials     1600     1600              
Flag       Coverage Δ
hive       51.95% <0.00%> (-0.20%) ⬇️
mysql      80.73% <0.00%> (-0.31%) ⬇️
postgres   80.78% <0.00%> (-0.31%) ⬇️
presto     51.79% <0.00%> (-0.20%) ⬇️
python     81.22% <0.00%> (-0.31%) ⬇️
sqlite     80.47% <0.00%> (-0.31%) ⬇️

Impacted Files                       Coverage Δ
superset/cli/celery.py               0.00% <ø> (ø)
superset/cli/main.py                 0.00% <0.00%> (ø)
superset/cli/sync/dbt/command.py     0.00% <0.00%> (ø)
superset/cli/sync/dbt/databases.py   0.00% <0.00%> (ø)
superset/cli/sync/dbt/datasets.py    0.00% <0.00%> (ø)
superset/cli/sync/main.py            0.00% <0.00%> (ø)
superset/cli/update.py               0.00% <ø> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

betodealmeida mentioned this pull request on Jan 20, 2022
@rumbin
Contributor

rumbin commented Jan 22, 2022

Beto, are you aware of this project?
https://github.com/slidoapp/dbt-superset-lineage
I haven't tried it, though...

Regarding your approach, I have some doubts on automatically updating the DB connection.
In many cases people may want to use a different user here, or different connection strings for the sake of things like user impersonation, proxy settings, roles, warehouse settings (Snowflake), etc.
I know that you are currently only covering Postgres, but for other databases it may become troublesome.
Maybe this step could be made optional?

@betodealmeida
Member Author

> Beto, are you aware of this project? https://github.com/slidoapp/dbt-superset-lineage I haven't tried it, though...

Yes, it was the inspiration! I should've mentioned it in the summary, I'll update it. I think both approaches are valid — this one here is simpler because you're using the CLI, while the other uses the API and is valuable for cases where you don't have direct access.

> Regarding your approach, I have some doubts on automatically updating the DB connection. In many cases people may want to use a different user here, or different connection strings for the sake of things like user impersonation, proxy settings, roles, warehouse settings (Snowflake), etc. I know that you are currently only covering Postgres, but for other databases it may become troublesome. Maybe this step could be made optional?

That's a great point! I'll make it optional, allowing the user to reuse an existing DB.

@rumbin
Contributor

rumbin commented Jan 22, 2022

Sounds great.
What I haven't understood so far is, how you envision the setup of this solution. Where is the superset sync dbt run ideally? I suppose that this would be part of a CI/CD pipeline. So, what components of Superset need to be installed in the container?
We should also consider what would be suitable scenarios for dbt Cloud users who would first need to fetch the dbt artifacts via API calls.

BTW, I am not associated with the dbt-superset-lineage project in any way. I was just planning to give it a try in the near future. However, now I am curious to wait for your solution.

@mrshu
Contributor

mrshu commented Jan 22, 2022

> BTW, I am not associated with the dbt-superset-lineage project in any way. I was just planning to give it a try in the near future. However, now I am curious to wait for your solution.

@rumbin Being at least somewhat associated with it, although not one of the authors, please do not hesitate to reach out with feedback!

@betodealmeida
Member Author

> Sounds great. What I haven't understood so far is, how you envision the setup of this solution. Where is the superset sync dbt run ideally? I suppose that this would be part of a CI/CD pipeline. So, what components of Superset need to be installed in the container? We should also consider what would be suitable scenarios for dbt Cloud users who would first need to fetch the dbt artifacts via API calls.

To run it in CI/CD you would need to pip install superset, set SUPERSET__SQLALCHEMY_DATABASE_URI, and run superset sync dbt.
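
A CI step along those lines could look roughly like the sketch below. It only assembles the command and environment described in this PR (`superset sync dbt`, `--project`, `--target`, and `SUPERSET__SQLALCHEMY_DATABASE_URI`). The helper names, the manifest path, and the URI are placeholders, and `run_sync` assumes a Superset build that includes this command is installed in the container.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_sync_command(
    manifest_path: str, project: str, target: Optional[str] = None
) -> List[str]:
    """Assemble the CLI call proposed in this PR (hypothetical helper)."""
    command = ["superset", "sync", "dbt", manifest_path, "--project", project]
    if target is not None:
        command += ["--target", target]
    return command


def build_environment(metadata_db_uri: str) -> Dict[str, str]:
    """Copy the current environment and point Superset at its metadata database."""
    env = dict(os.environ)
    env["SUPERSET__SQLALCHEMY_DATABASE_URI"] = metadata_db_uri
    return env


def run_sync(
    manifest_path: str,
    project: str,
    metadata_db_uri: str,
    target: Optional[str] = None,
) -> None:
    """Run the sync and fail the pipeline if the command exits non-zero."""
    subprocess.run(
        build_sync_command(manifest_path, project, target),
        env=build_environment(metadata_db_uri),
        check=True,
    )


# Only build and print the command here; run_sync() needs Superset installed.
command = build_sync_command("target/manifest.json", "superset_examples", "dev")
print(" ".join(command))
```

Passing the command as a list avoids shell quoting issues, and `check=True` makes a failed sync fail the pipeline step.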

@noel

noel commented Mar 3, 2022

Based on a conversation in the dbt Slack #tools-superset channel, it was suggested that I add a comment to this PR.
The idea is to create datasets automatically via a configuration in dbt_project.yml:

models:
  project:
    marts:
      +superset_export: true

The above would create datasets for all the models under marts.
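
To sketch how a sync command might honor such a flag: assuming the custom `superset_export` setting ends up in each model's `config` block in manifest.json (an assumption, not verified against this PR or dbt's behavior), selecting the models to export could look like this. The function name and toy manifest are hypothetical.

```python
from typing import List


def models_marked_for_export(manifest: dict, flag: str = "superset_export") -> List[str]:
    """Return names of models whose config enables the (assumed) export flag."""
    return sorted(
        node["name"]
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
        and node.get("config", {}).get(flag) is True
    )


# Toy manifest: one model under marts with the flag set, one staging model without.
toy_manifest = {
    "nodes": {
        "model.project.orders": {
            "resource_type": "model",
            "name": "orders",
            "config": {"superset_export": True},
        },
        "model.project.stg_orders": {
            "resource_type": "model",
            "name": "stg_orders",
            "config": {},
        },
    }
}
selected = models_marked_for_export(toy_manifest)
print(selected)
```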

@betodealmeida
Member Author

I'm working on a better solution for this.

@rumbin
Contributor

rumbin commented May 20, 2022

@betodealmeida Is there any resource for your new approach available?

@betodealmeida
Member Author

@rumbin take a look at https://github.com/preset-io/backend-sdk

@GeorgePearse

Really interested in any work to make DBT and Superset work better together.
