Support Jinja2 templating in config #583

Closed
deepyaman opened this issue Oct 22, 2020 · 4 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@deepyaman
Member

deepyaman commented Oct 22, 2020

Description

With the advent of reusable modular pipelines and namespacing (but even previously with dynamic pipeline creation), it's common to need near-duplicate catalog entries. For example, with primary data models for COVID-19 data in Europe, Asia, and Africa, I may want to reuse the same feature generation and master table creation pipeline, resulting in the following (subset of the) data catalog:

europe.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/europe/cases.csv

europe.death_counts:
europe.demographic_data:
europe.master_table:

asia.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/asia/cases.csv

asia.death_counts:
asia.demographic_data:
asia.master_table:

africa.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/africa/cases.csv

africa.death_counts:
africa.demographic_data:
africa.master_table:

Having to write it out explicitly is inconvenient (borderline painful) and error-prone. Being forced to use the code API to define configuration is not ideal, either.

Context

Jinja2 is a widely used templating language, and anyconfig, the library Kedro uses for configuration parsing, already supports it. Turning it on lets (power) users leverage templating without affecting existing functionality. The above config would become:

{% for continent in ['europe', 'asia', 'africa'] %}
{{ continent }}.case_counts:
  type: pandas.CSVDataSet
  filepath: data/04_feature/{{ continent }}/cases.csv

{{ continent }}.death_counts:
{{ continent }}.demographic_data:
{{ continent }}.master_table:
{% endfor %}
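
For readers less familiar with Jinja2, here is a rough plain-Python sketch of what the loop above expands to. The function name `expand_catalog` is made up for illustration, and the entries whose bodies are elided in this issue are represented as `None` (which is how an empty YAML value parses); only `case_counts` has fields shown above.

```python
from typing import Any, Dict, List, Optional

CONTINENTS = ["europe", "asia", "africa"]


def expand_catalog(continents: List[str]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Build the entries that the templated catalog would render to."""
    catalog: Dict[str, Optional[Dict[str, Any]]] = {}
    for continent in continents:
        # Mirrors the `{{ continent }}.case_counts` entry in the template.
        catalog[f"{continent}.case_counts"] = {
            "type": "pandas.CSVDataSet",
            "filepath": f"data/04_feature/{continent}/cases.csv",
        }
        # These entries have no body in the issue text, so they stay empty here.
        for name in ("death_counts", "demographic_data", "master_table"):
            catalog[f"{continent}.{name}"] = None
    return catalog


catalog = expand_catalog(CONTINENTS)
```

Each continent contributes four keys, so adding a fourth continent is a one-word change to the list rather than four more hand-written entries.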

Possible Implementation

#578

Possible Alternatives

Individual users can make this change themselves, but it's annoying. Since anyconfig is imported inside _load_config, it can't be easily monkeypatched; instead, the whole function has to be replaced, which requires redefining/retesting a lot of functionality. For example, from my current project:

src/package_name/run.py:

...

def _load_config(config_files: List[Path]) -> Dict[str, Any]:
    """Recursively load all configuration files, which satisfy
    a given list of glob patterns from a specific path.

    Args:
        config_files: Configuration files sorted in the order of precedence.

    Raises:
        ValueError: If 2 or more configuration files contain the same key(s).

    Returns:
        Resulting configuration dictionary.

    """
    # for performance reasons
    import anyconfig  # pylint: disable=import-outside-toplevel

    config = {}
    keys_by_filepath = {}  # type: Dict[Path, AbstractSet[str]]

    def _check_dups(file1: Path, conf: Dict[str, Any]) -> None:
        dups = set()
        for file2, keys in keys_by_filepath.items():
            common = ", ".join(sorted(conf.keys() & keys))
            if common:
                if len(common) > 100:
                    common = common[:100] + "..."
                dups.add("{}: {}".format(str(file2), common))

        if dups:
            msg = "Duplicate keys found in {0} and:\n- {1}".format(
                file1, "\n- ".join(dups)
            )
            raise ValueError(msg)

    for config_file in config_files:
        # ac_template=True asks anyconfig to render the file as a Jinja2
        # template before parsing it.
        cfg = {
            k: v
            for k, v in anyconfig.load(config_file, ac_template=True).items()
            if not k.startswith("_")
        }
        _check_dups(config_file, cfg)
        keys_by_filepath[config_file] = cfg.keys()
        config.update(cfg)
    return config


class ProjectContext(KedroContext):
    ...

    def _create_config_loader(self, conf_paths: Iterable[str]) -> TemplatedConfigLoader:
        import kedro.config.config

        kedro.config.config._load_config = _load_config
        return TemplatedConfigLoader(conf_paths, globals_pattern="*globals.yml")

src/tests/test_run.py:

@pytest.fixture
def project_context(mocker):
    # Don't configure the logging module. If it's configured, tests that
    # check logs using the ``caplog`` fixture depend on execution order.
    mocker.patch.object(ProjectContext, "_setup_logging")

    return ProjectContext(str(Path.cwd()))


def _write_yaml(filepath: Path, config: Dict, preamble: str = "", postamble: str = ""):
    filepath.parent.mkdir(parents=True, exist_ok=True)
    yaml_str = yaml.dump(config)
    filepath.write_text(preamble + yaml_str + postamble)


@pytest.fixture
def conf_paths(tmp_path):
    return [str(tmp_path / "base"), str(tmp_path / "local")]


@pytest.fixture
def param_config():
    return {
        "boats": {
            "type": "${boat_data_type}",
            "filepath": "${s3_bucket}/${raw_data_folder}/${boat_file_name}",
            "columns": {
                "id": "${string_type}",
                "name": "${string_type}",
                "top_speed": "${float_type}",
            },
            "rows": 5,
            "users": ["fred", "${write_only_user}"],
        }
    }


@pytest.fixture
def template_config():
    return {
        "s3_bucket": "s3a://boat-and-car-bucket",
        "raw_data_folder": "01_raw",
        "boat_file_name": "boats.csv",
        "boat_data_type": "SparkDataSet",
        "string_type": "VARCHAR",
        "float_type": "FLOAT",
        "write_only_user": "ron",
    }


@pytest.fixture
def proj_catalog_param(tmp_path, param_config):
    proj_catalog = tmp_path / "base" / "catalog.yml"
    _write_yaml(proj_catalog, param_config)


@pytest.fixture
def proj_catalog_param_w_jinja2_for(tmp_path, param_config):
    proj_catalog = tmp_path / "base" / "catalog.yml"
    _write_yaml(
        proj_catalog,
        param_config,
        "{% for boat_type in ['house', 'paddle'] %}\n{{ boat_type }}.",
        "{% endfor %}\n",
    )


@pytest.fixture
def proj_catalog_globals(tmp_path, template_config):
    global_yml = tmp_path / "base" / "globals.yml"
    _write_yaml(global_yml, template_config)


class TestProjectContext:
    def test_project_name(self, project_context):
        assert project_context.project_name == "Engineering Reimagined - Schedule"

    def test_project_version(self, project_context):
        assert project_context.project_version == "0.16.4"

    @pytest.mark.usefixtures("proj_catalog_param", "proj_catalog_globals")
    def test_create_config_loader(self, project_context, tmp_path, conf_paths):
        """Test parameterized config with globals yaml file"""
        (tmp_path / "local").mkdir(exist_ok=True)

        catalog = project_context._create_config_loader(conf_paths).get("catalog*.yml")

        assert catalog["boats"]["type"] == "SparkDataSet"
        assert (
            catalog["boats"]["filepath"] == "s3a://boat-and-car-bucket/01_raw/boats.csv"
        )
        assert catalog["boats"]["columns"]["id"] == "VARCHAR"
        assert catalog["boats"]["columns"]["name"] == "VARCHAR"
        assert catalog["boats"]["columns"]["top_speed"] == "FLOAT"
        assert catalog["boats"]["users"] == ["fred", "ron"]

    @pytest.mark.usefixtures("proj_catalog_param_w_jinja2_for", "proj_catalog_globals")
    def test_create_config_loader_w_jinja2_for(
        self, project_context, tmp_path, conf_paths
    ):
        """Test parameterized config with a Jinja2 for loop and globals yaml file"""
        (tmp_path / "local").mkdir(exist_ok=True)

        catalog = project_context._create_config_loader(conf_paths).get("catalog*.yml")

        for boat_type in ["house", "paddle"]:
            assert catalog[f"{boat_type}.boats"]["type"] == "SparkDataSet"
            assert (
                catalog[f"{boat_type}.boats"]["filepath"]
                == "s3a://boat-and-car-bucket/01_raw/boats.csv"
            )
            assert catalog[f"{boat_type}.boats"]["columns"]["id"] == "VARCHAR"
            assert catalog[f"{boat_type}.boats"]["columns"]["name"] == "VARCHAR"
            assert catalog[f"{boat_type}.boats"]["columns"]["top_speed"] == "FLOAT"
            assert catalog[f"{boat_type}.boats"]["users"] == ["fred", "ron"]

    @pytest.mark.usefixtures("proj_catalog_globals")
    def test_lots_of_duplicates(self, project_context, tmp_path, conf_paths):
        """Check that many duplicate keys across config files raise an error
        and that the list of duplicates in the message is truncated"""
        data = {str(i): i for i in range(100)}
        _write_yaml(tmp_path / "base" / "catalog1.yml", data)
        _write_yaml(tmp_path / "base" / "catalog2.yml", data)
        (tmp_path / "local").mkdir(parents=True, exist_ok=True)

        conf = project_context._create_config_loader(conf_paths)
        pattern = r"^Duplicate keys found in .*catalog2\.yml and\:\n\- .*catalog1\.yml\: .*\.\.\.$"
        with pytest.raises(ValueError, match=pattern):
            conf.get("**/catalog*")
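
The fixtures above rely on `${...}` placeholders being filled in from `globals.yml`. As a rough idea of what that substitution amounts to, here is a stdlib-only sketch. It is not Kedro's TemplatedConfigLoader implementation; `substitute` is a hypothetical helper, and raising `KeyError` on an unknown placeholder is an assumption made to keep the example short.

```python
import re
from typing import Any, Dict

# Matches placeholders of the form ${name}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def substitute(value: Any, globals_: Dict[str, Any]) -> Any:
    """Recursively replace ${name} placeholders in dicts, lists and strings."""
    if isinstance(value, dict):
        return {key: substitute(item, globals_) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute(item, globals_) for item in value]
    if isinstance(value, str):
        # An unknown placeholder name raises KeyError in this sketch.
        return _PLACEHOLDER.sub(lambda m: str(globals_[m.group(1)]), value)
    # Non-string scalars (such as the integer row count) pass through unchanged.
    return value
```

Jinja2 rendering and this kind of substitution act at different stages: the Jinja2 loop generates the entries as text, and the placeholders inside each generated entry are resolved afterwards, which is what test_create_config_loader_w_jinja2_for checks.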
@deepyaman deepyaman added the Issue: Feature Request New feature or improvement to existing feature label Oct 22, 2020
@deepyaman deepyaman self-assigned this Oct 23, 2020
@WaylonWalker
Contributor

WaylonWalker commented Oct 27, 2020

This is awesome! There are very few pipelines that I write that do not have a super repetitive pattern in the catalog. I often create mine through a Python script. Pardon my lack of understanding of Jinja. Can you use things like itertools.product within the template?

I often need something along the lines of

continents = ['europe', 'asia', 'africa']
layers = ['raw', 'pri', 'int']
for continent, layer in itertools.product(continents, layers):
     ...

I often have a very similar pattern that creates nodes. Rather than maintaining duplicate lists, the script that generates the catalog actually imports them from the nodes module. Is it possible to access the same lists from both Jinja and my nodes easily?
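
For reference, a minimal plain-Python sketch of the pattern described in this comment. The `continent.layer` key format and the `catalog_keys` helper are assumptions for illustration; the two lists are the ones from the snippet above and would live in a shared module that both the node-building code and the catalog generator import.

```python
import itertools
from typing import List

# Shared lists; in a real project these would be defined once in a module
# imported by both the nodes and the catalog generator.
CONTINENTS = ["europe", "asia", "africa"]
LAYERS = ["raw", "pri", "int"]


def catalog_keys(continents: List[str], layers: List[str]) -> List[str]:
    """Return one catalog key per (continent, layer) combination."""
    return [
        f"{continent}.{layer}"
        for continent, layer in itertools.product(continents, layers)
    ]


keys = catalog_keys(CONTINENTS, LAYERS)
```

Within a template itself, nesting one Jinja2 `{% for %}` block inside another walks the same combinations without needing itertools.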

@deepyaman
Member Author

@WaylonWalker I think it's possible to use Python module code in Jinja2 templates (see https://stackoverflow.com/a/11856935/1093967), and I don't see any reason why anyconfig wouldn't allow it (looking at https://github.com/ssato/python-anyconfig/blob/9d1be86bed695c89fb8438c5c23262f6049a93d9/src/anyconfig/api.py#L298-L299 and https://github.com/ssato/python-anyconfig/blob/9d1be86bed695c89fb8438c5c23262f6049a93d9/src/anyconfig/template.py), but it would be worth actually trying it out to be sure.

@kislerdm
Contributor

kislerdm commented Nov 12, 2020

Another proposal may be something like the following config template (somewhat inspired by terraform):

counts:
  variables:
    geo:
      - africa
      - asia
      - europe
    dataset:
      - cases.csv
      - demographic.csv
  for_each:
    lists:
      - var.geo
      - var.dataset
    function: 
      - join: 
          delimiter: "/"   
  type: pandas.CSVDataSet
  filepath: data/04_feature/{each.key}
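
One possible reading of this proposal, sketched in Python. The semantics are guesses, not anything specified in this thread: `for_each.lists` is taken to mean a Cartesian product of the referenced variables, `join` to mean joining each combination with the delimiter, and the generated entry names (`counts.<key>`) are invented for the example.

```python
import itertools
from typing import Any, Dict


def expand_for_each(name: str, entry: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Expand one templated entry into one concrete entry per combination."""
    variables = entry["variables"]
    for_each = entry["for_each"]
    # "var.geo" -> variables["geo"]
    lists = [variables[ref.split(".", 1)[1]] for ref in for_each["lists"]]
    delimiter = for_each["function"][0]["join"]["delimiter"]
    # Everything except the templating keys is copied into each result.
    body = {k: v for k, v in entry.items() if k not in ("variables", "for_each")}

    expanded: Dict[str, Dict[str, Any]] = {}
    for combo in itertools.product(*lists):
        key = delimiter.join(combo)
        expanded[f"{name}.{key}"] = {
            k: v.replace("{each.key}", key) if isinstance(v, str) else v
            for k, v in body.items()
        }
    return expanded


counts = {
    "variables": {
        "geo": ["africa", "asia", "europe"],
        "dataset": ["cases.csv", "demographic.csv"],
    },
    "for_each": {
        "lists": ["var.geo", "var.dataset"],
        "function": [{"join": {"delimiter": "/"}}],
    },
    "type": "pandas.CSVDataSet",
    "filepath": "data/04_feature/{each.key}",
}
expanded = expand_for_each("counts", counts)
```

Under these assumptions the three regions and two datasets yield six entries, each with its own filepath.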

@lorenabalan
Contributor

I believe this has now been addressed in c466c8a - TemplatedConfigLoader will support Jinja2 syntax starting with version 0.17.0, so I will close this issue. Thanks everyone for sharing your thoughts on this!
