Support Jinja2 templating in config #583

deepyaman · 2020-10-22T15:50:29Z

Description

With the advent of reusable modular pipelines and namespacing (but even previously with dynamic pipeline creation), it's common to need near-duplicate catalog entries. For example, with primary data models for COVID-19 data in Europe, Asia, and Africa, I may want to reuse the same feature generation and master table creation pipeline, resulting in the following (subset of the) data catalog:

europe.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/europe/cases.csv

europe.death_counts:
europe.demographic_data:
europe.master_table:

asia.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/asia/cases.csv

asia.death_counts:
asia.demographic_data:
asia.master_table:

africa.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/africa/cases.csv

africa.death_counts:
africa.demographic_data:
africa.master_table:

Having to write it out explicitly is inconvenient (borderline painful) and error-prone. Being forced to use the code API to define configuration is not ideal, either.

Context

Jinja2 is a widely used templating language, and it is already supported by anyconfig, the library Kedro uses for configuration parsing. Turning it on lets (power) users leverage templating without affecting existing functionality. The above config would become:

{% for continent in ['europe', 'asia', 'africa'] %}
{{ continent }}.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/{{ continent }}/cases.csv

{{ continent }}.death_counts:
{{ continent }}.demographic_data:
{{ continent }}.master_table:
{% endfor %}
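
To see what such a template expands to, it can be rendered directly with the jinja2 package. This is only a minimal sketch, not how Kedro or anyconfig does it internally: the template is held in a string, only the `case_counts` entry is included (the other entries are elided above), and `render_catalog` is a hypothetical helper name.

```python
from jinja2 import Template

# The templated catalog, held as a string for illustration.
# Only the case_counts entry is shown; the rest is elided in the issue.
CATALOG_TEMPLATE = """\
{% for continent in ['europe', 'asia', 'africa'] %}
{{ continent }}.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/{{ continent }}/cases.csv
{% endfor %}
"""


def render_catalog(template_text: str) -> str:
    """Expand the Jinja2 tags and return plain YAML text (hypothetical helper)."""
    return Template(template_text).render()


rendered = render_catalog(CATALOG_TEMPLATE)
print(rendered)
```

The rendered text contains one `case_counts` entry per continent and no remaining Jinja2 tags, so it can be handed to any YAML parser afterwards.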

Possible Implementation

#578

Possible Alternatives

Individual users can make this change themselves, but it's annoying. Since anyconfig is imported inside _load_config, it can't easily be monkeypatched; instead, the whole function has to be redefined and retested. For example, from my current project:

src/package_name/run.py:

...

def _load_config(config_files: List[Path]) -> Dict[str, Any]:
    """Recursively load all configuration files, which satisfy
    a given list of glob patterns from a specific path.

    Args:
        config_files: Configuration files sorted in the order of precedence.

    Raises:
        ValueError: If 2 or more configuration files contain the same key(s).

    Returns:
        Resulting configuration dictionary.

    """
    # for performance reasons
    import anyconfig  # pylint: disable=import-outside-toplevel

    config = {}
    keys_by_filepath = {}  # type: Dict[Path, AbstractSet[str]]

    def _check_dups(file1: Path, conf: Dict[str, Any]) -> None:
        dups = set()
        for file2, keys in keys_by_filepath.items():
            common = ", ".join(sorted(conf.keys() & keys))
            if common:
                if len(common) > 100:
                    common = common[:100] + "..."
                dups.add("{}: {}".format(str(file2), common))

        if dups:
            msg = "Duplicate keys found in {0} and:\n- {1}".format(
                file1, "\n- ".join(dups)
            )
            raise ValueError(msg)

    for config_file in config_files:
        cfg = {
            k: v
            for k, v in anyconfig.load(config_file, ac_template=True).items()
            if not k.startswith("_")
        }
        _check_dups(config_file, cfg)
        keys_by_filepath[config_file] = cfg.keys()
        config.update(cfg)
    return config


class ProjectContext(KedroContext):
    ...

    def _create_config_loader(self, conf_paths: Iterable[str]) -> TemplatedConfigLoader:
        import kedro.config.config

        kedro.config.config._load_config = _load_config
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")

src/tests/test_run.py:

@pytest.fixture
def project_context(mocker):
    # Don't configure the logging module. If it's configured, tests that
    # check logs using the ``caplog`` fixture depend on execution order.
    mocker.patch.object(ProjectContext, "_setup_logging")

    return ProjectContext(str(Path.cwd()))


def _write_yaml(filepath: Path, config: Dict, preamble: str = "", postamble: str = ""):
    filepath.parent.mkdir(parents=True, exist_ok=True)
    yaml_str = yaml.dump(config)
    filepath.write_text(preamble + yaml_str + postamble)


@pytest.fixture
def conf_paths(tmp_path):
    return [str(tmp_path / "base"), str(tmp_path / "local")]


@pytest.fixture
def param_config():
    return {
        "boats": {
            "type": "${boat_data_type}",
            "filepath": "${s3_bucket}/${raw_data_folder}/${boat_file_name}",
            "columns": {
                "id": "${string_type}",
                "name": "${string_type}",
                "top_speed": "${float_type}",
            },
            "rows": 5,
            "users": ["fred", "${write_only_user}"],
        }
    }


@pytest.fixture
def template_config():
    return {
        "s3_bucket": "s3a://boat-and-car-bucket",
        "raw_data_folder": "01_raw",
        "boat_file_name": "boats.csv",
        "boat_data_type": "SparkDataSet",
        "string_type": "VARCHAR",
        "float_type": "FLOAT",
        "write_only_user": "ron",
    }


@pytest.fixture
def proj_catalog_param(tmp_path, param_config):
    proj_catalog = tmp_path / "base" / "catalog.yml"
    _write_yaml(proj_catalog, param_config)


@pytest.fixture
def proj_catalog_param_w_jinja2_for(tmp_path, param_config):
    proj_catalog = tmp_path / "base" / "catalog.yml"
    _write_yaml(
        proj_catalog,
        param_config,
        "{% for boat_type in ['house', 'paddle'] %}\n{{ boat_type }}.",
        "{% endfor %}\n",
    )


@pytest.fixture
def proj_catalog_globals(tmp_path, template_config):
    global_yml = tmp_path / "base" / "globals.yml"
    _write_yaml(global_yml, template_config)


class TestProjectContext:
    def test_project_name(self, project_context):
        assert project_context.project_name == "Engineering Reimagined - Schedule"

    def test_project_version(self, project_context):
        assert project_context.project_version == "0.16.4"

    @pytest.mark.usefixtures("proj_catalog_param", "proj_catalog_globals")
    def test_create_config_loader(self, project_context, tmp_path, conf_paths):
        """Test parameterized config with globals yaml file"""
        (tmp_path / "local").mkdir(exist_ok=True)

        catalog = project_context._create_config_loader(conf_paths).get("catalog*.yml")

        assert catalog["boats"]["type"] == "SparkDataSet"
        assert (
            catalog["boats"]["filepath"] == "s3a://boat-and-car-bucket/01_raw/boats.csv"
        )
        assert catalog["boats"]["columns"]["id"] == "VARCHAR"
        assert catalog["boats"]["columns"]["name"] == "VARCHAR"
        assert catalog["boats"]["columns"]["top_speed"] == "FLOAT"
        assert catalog["boats"]["users"] == ["fred", "ron"]

    @pytest.mark.usefixtures("proj_catalog_param_w_jinja2_for", "proj_catalog_globals")
    def test_create_config_loader_w_jinja2_for(
        self, project_context, tmp_path, conf_paths
    ):
        """Test parameterized config with a Jinja2 for loop and globals yaml file"""
        (tmp_path / "local").mkdir(exist_ok=True)

        catalog = project_context._create_config_loader(conf_paths).get("catalog*.yml")

        for boat_type in ["house", "paddle"]:
            assert catalog[f"{boat_type}.boats"]["type"] == "SparkDataSet"
            assert (
                catalog[f"{boat_type}.boats"]["filepath"]
                == "s3a://boat-and-car-bucket/01_raw/boats.csv"
            )
            assert catalog[f"{boat_type}.boats"]["columns"]["id"] == "VARCHAR"
            assert catalog[f"{boat_type}.boats"]["columns"]["name"] == "VARCHAR"
            assert catalog[f"{boat_type}.boats"]["columns"]["top_speed"] == "FLOAT"
            assert catalog[f"{boat_type}.boats"]["users"] == ["fred", "ron"]

    @pytest.mark.usefixtures("proj_catalog_globals")
    def test_lots_of_duplicates(self, project_context, tmp_path, conf_paths):
        """Check that duplicate keys across config files raise a ValueError
        and that a long list of duplicates is truncated with an ellipsis"""
        data = {str(i): i for i in range(100)}
        _write_yaml(tmp_path / "base" / "catalog1.yml", data)
        _write_yaml(tmp_path / "base" / "catalog2.yml", data)
        (tmp_path / "local").mkdir(parents=True, exist_ok=True)

        conf = project_context._create_config_loader(conf_paths)
        pattern = r"^Duplicate keys found in .*catalog2\.yml and\:\n\- .*catalog1\.yml\: .*\.\.\.$"
        with pytest.raises(ValueError, match=pattern):
            conf.get("**/catalog*")


WaylonWalker · 2020-10-27T17:12:38Z

This is awesome! Very few of the pipelines I write lack a super repetitive pattern in the catalog, so I often create mine through a Python script. Pardon my lack of understanding of Jinja, but can you use things like itertools.product within the template?

I often need something along the lines of

import itertools

continents = ['europe', 'asia', 'africa']
layers = ['raw', 'pri', 'int']
for continent, layer in itertools.product(continents, layers):
    ...

I often have a very similar pattern that creates nodes. Rather than maintaining duplicate lists, the script that generates the catalog imports them from the nodes module. Is it possible to access the same lists from both Jinja and my nodes easily?

deepyaman · 2020-10-27T19:36:51Z

@WaylonWalker I think it's possible to use Python module code in Jinja2 templates (see https://stackoverflow.com/a/11856935/1093967), and I don't see any reason why anyconfig wouldn't allow it (looking at https://github.com/ssato/python-anyconfig/blob/9d1be86bed695c89fb8438c5c23262f6049a93d9/src/anyconfig/api.py#L298-L299 and https://github.com/ssato/python-anyconfig/blob/9d1be86bed695c89fb8438c5c23262f6049a93d9/src/anyconfig/template.py), but it would be worth actually trying it out to be sure.
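
As a rough check of that idea with plain jinja2 (not tested through anyconfig or Kedro), a Python callable and shared lists can be passed in as template context. The names `CONTINENTS`, `LAYERS` and the dataset naming scheme below are hypothetical; in a real project the lists could be imported from the nodes module instead of being defined inline. anyconfig appears to accept a context mapping through an `ac_context` argument, but whether Kedro's loader exposes it is an assumption that would need verifying.

```python
import itertools

from jinja2 import Template

# Hypothetical shared lists; these could live in the nodes module
# and be imported by both the pipeline code and the config loader.
CONTINENTS = ["europe", "asia", "africa"]
LAYERS = ["raw", "pri", "int"]

# Jinja2 for loops support tuple unpacking, so product() works directly.
TEMPLATE = """\
{% for continent, layer in product(continents, layers) %}
{{ continent }}.{{ layer }}.case_counts:
  type: pandas.CSVDataSet
  filepath: data/{{ layer }}/{{ continent }}/cases.csv
{% endfor %}
"""

# Expose itertools.product and the lists to the template as context.
rendered = Template(TEMPLATE).render(
    product=itertools.product,
    continents=CONTINENTS,
    layers=LAYERS,
)
print(rendered)
```

This produces one entry for each of the nine continent and layer combinations, with the lists defined only once in Python.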

kislerdm · 2020-11-12T17:23:58Z

Another proposal may be something like the following config template (somewhat inspired by terraform):

counts:
  variables:
    geo:
      - africa
      - asia
      - europe
    dataset:
      - cases.csv
      - demographic.csv
  for_each:
    lists:
      - var.geo
      - var.dataset
    function:
      - join:
          delimiter: "/"
  type: pandas.CSVDataSet
  filepath: data/04_feature/{each.key}
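
One possible reading of this proposal, sketched in plain Python: take the Cartesian product of the referenced variable lists, join each combination with the delimiter to form `each.key`, and substitute it into the filepath. The function `expand_for_each`, its parameters, and the choice to key the result by `each.key` are all assumptions made for illustration; the proposal itself does not specify them.

```python
import itertools
from typing import Dict, List


def expand_for_each(
    variables: Dict[str, List[str]],
    lists: List[str],
    delimiter: str,
    dataset_type: str,
    filepath_template: str,
) -> Dict[str, Dict[str, str]]:
    """Expand a for_each block into concrete entries (hypothetical semantics)."""
    # Resolve references such as "var.geo" to the list stored under "geo".
    value_lists = [variables[ref.split(".", 1)[1]] for ref in lists]

    entries = {}
    for combo in itertools.product(*value_lists):
        # The "join" function: combine one value from each list.
        key = delimiter.join(combo)
        entries[key] = {
            "type": dataset_type,
            "filepath": filepath_template.replace("{each.key}", key),
        }
    return entries


entries = expand_for_each(
    variables={
        "geo": ["africa", "asia", "europe"],
        "dataset": ["cases.csv", "demographic.csv"],
    },
    lists=["var.geo", "var.dataset"],
    delimiter="/",
    dataset_type="pandas.CSVDataSet",
    filepath_template="data/04_feature/{each.key}",
)
for key, entry in entries.items():
    print(key, entry["filepath"])
```

Under this reading, three regions and two files expand to six entries, each with its own filepath.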

lorenabalan · 2020-11-19T16:14:19Z

I believe this has now been addressed in c466c8a: TemplatedConfigLoader will support Jinja2 syntax starting with version 0.17.0, so I will close this issue. Thanks everyone for sharing your thoughts on this!

deepyaman added the "Issue: Feature Request" label (New feature or improvement to existing feature) Oct 22, 2020

deepyaman mentioned this issue Oct 22, 2020

[KED-2214] Enable loading config files with Jinja2 templating #578

Closed

6 tasks

deepyaman self-assigned this Oct 23, 2020

lorenabalan closed this as completed Nov 19, 2020

deepyaman mentioned this issue Sep 14, 2021

Synthesis of user research when using configuration in Kedro #891

Closed
