Change metadata_vars `if not` to `if ... is None` #9376

truls-p · 2024-01-15T10:50:34Z

Both None and {} evaluate to True when using if not metadata_vars. This caused us to loop over the environment variables unnecessarily. This provides a significant performance increase for large projects.

resolves #

Relates to #6073

Problem

Poor performance with large models.

I was trying to profile a large model and noticed get_metadata_vars is called many times and we were spending quite a bit of time in in a dictcomp. I then realised that whenever we do not have any environment variables with the "DBT_ENV_CUSTOM_ENV_" prefix the metadata_vars dictionary will be set to {} the first time get_metadata_vars is called. On subsequent calls, we continue to loop over the environment variables due to the if condition, if not metadata_vars, which evaluates to True for both None and {}.

Solution

Solution was to change if not metadata_vars to if metadata_vars is None. This ensures we only loop through the environment variables looking for the prefix once.

Here is a test that can be run and the pytest-benchmark results on my machine. Note that this test may take up to 30 min to complete. Please let me know if you think a shorter test (perhaps one that just ensures we do not unnecessarily loop through os.environ multiple times) should be included in this PR.

import pytest
import os

from dbt.tests.util import run_dbt
from dbt.tests.fixtures.project import write_project_files

import dbt.common.events.functions

from tests.fixtures.dbt_integration_project import dbt_integration_project  # noqa: F401
from tests.functional.graph_selection.fixtures import SelectionFixtures

NUM_TABLES = 500
metadata_vars = None
_METADATA_ENV_PREFIX = "DBT_ENV_CUSTOM_ENV_"


def run_schema_and_assert(project, include, exclude, expected_tests):
    # deps must run before seed
    run_dbt(["deps"])
    run_dbt(["seed"])
    results = run_dbt(["build", "--exclude", "never_selected"])
    assert len(results) == 18 + NUM_TABLES

    test_args = ["test"]
    if include:
        test_args += ["--select", include]
    if exclude:
        test_args += ["--exclude", exclude]
    test_results = run_dbt(test_args)

    ran_tests = sorted([test.node.name for test in test_results])
    expected_sorted = sorted(expected_tests)

    assert ran_tests == expected_sorted

def get_metadata_vars_is():
    global metadata_vars
    if metadata_vars is None:
        metadata_vars = {
            k[len(_METADATA_ENV_PREFIX) :]: v
            for k, v in os.environ.items()
            if k.startswith(_METADATA_ENV_PREFIX)
        }
    return metadata_vars

def get_metadata_vars_not():
    global metadata_vars
    if not metadata_vars:
        metadata_vars = {
            k[len(_METADATA_ENV_PREFIX) :]: v
            for k, v in os.environ.items()
            if k.startswith(_METADATA_ENV_PREFIX)
        }
    return metadata_vars

class TestPerformanceManyTables(SelectionFixtures):
    @pytest.fixture(scope="class", autouse=True)
    def setUp(self, project_root, dbt_integration_project):  # noqa: F811

        sql_string = "select * from {{ this.schema }}.seed"
        for i in range(NUM_TABLES):
            dbt_integration_project["models"][f"{i}.sql"] = sql_string
        write_project_files(project_root, "dbt_integration_project", dbt_integration_project)

    @pytest.fixture(scope="class")
    def packages(self):
        return {"packages": [{"local": "dbt_integration_project"}]}


    @pytest.mark.skipif(os.environ.get('DBT_PERF_TEST') is None, reason="DBT_PERF_TEST not set")
    def test_performance_many_tables_specify_tag_not(self, project, benchmark, monkeypatch):
        monkeypatch.setattr(dbt.common.events.functions, 'get_metadata_vars', get_metadata_vars_not)
        benchmark.pedantic(run_schema_and_assert, args=(project, "tag:bi", None, ["unique_users_id", "unique_users_rollup_gender"]), iterations=5)


    @pytest.mark.skipif(os.environ.get('DBT_PERF_TEST') is None, reason="DBT_PERF_TEST not set")
    def test_performance_many_tables_specify_tag_is(self, project, benchmark, monkeypatch):
        monkeypatch.setattr(dbt.common.events.functions, 'get_metadata_vars', get_metadata_vars_is)
        benchmark.pedantic(run_schema_and_assert, args=(project, "tag:bi", None, ["unique_users_id", "unique_users_rollup_gender"]), iterations=5)

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
This PR includes type annotations for new and modified functions

Both None and {} evaluates to True when using if not metadata_vars. This caused us to loop over the environment variables unnecessarily. This provides a significant performance increase for large projects.

github-actions · 2024-01-15T10:50:47Z

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

truls-p · 2024-01-23T19:53:37Z

Closing this in favour of dbt-labs/dbt-common#37

Change metadata_vars if not to if ... is None

b89c527

Both None and {} evaluates to True when using if not metadata_vars. This caused us to loop over the environment variables unnecessarily. This provides a significant performance increase for large projects.

truls-p requested a review from a team as a code owner January 15, 2024 10:50

truls-p requested a review from gshank January 15, 2024 10:50

cla-bot bot added the cla:yes label Jan 15, 2024

Add changelog entry

78a2693

truls-p mentioned this pull request Jan 23, 2024

Change metadata_vars if not to if ... is None dbt-labs/dbt-common#37

Merged

5 tasks

truls-p closed this Jan 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change metadata_vars `if not` to `if ... is None` #9376

Change metadata_vars `if not` to `if ... is None` #9376

truls-p commented Jan 15, 2024

github-actions bot commented Jan 15, 2024

truls-p commented Jan 23, 2024

Change metadata_vars if not to if ... is None #9376

Change metadata_vars if not to if ... is None #9376

Conversation

truls-p commented Jan 15, 2024

Problem

Poor performance with large models.

Solution

Checklist

github-actions bot commented Jan 15, 2024

truls-p commented Jan 23, 2024

Change metadata_vars `if not` to `if ... is None` #9376

Change metadata_vars `if not` to `if ... is None` #9376