
[Feature Request] Support for Hydra in Kedro #1303

Closed
bergalli opened this issue Mar 1, 2022 · 14 comments
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature

Comments

@bergalli

bergalli commented Mar 1, 2022

Description

Hydra is a framework for elegantly configuring complex applications. It creates a hierarchical configuration by splitting it across different YAML files, making it easier to organise.
Project description: https://github.com/facebookresearch/hydra

When trying to use Hydra via the hydra.main() decorator applied to register_pipelines(), an error occurs.

Context

Having Kedro and Hydra working together would make it easier to maintain complex pipelines.

Reproducing issue

python version: 3.8.12
kedro version: 0.17.7
hydra version: 1.1.1

The bug appears when applying the hydra.main() decorator to register_pipelines(). This decorator is used to build an OmegaConf config from the /conf directory. Steps to reproduce:

  • Setup the iris_dataset toy project
  • Add the files required by Hydra in the conf folder (config.yaml and base/master.yaml):
    • conf/config.yaml:
      defaults:
        - base: master
    • conf/base/master.yaml:
       defaults:
         - ./catalog
         - ./logging
         - ./parameters
  • Rename existing config files from the .yml extension to .yaml (Hydra only accepts .yaml)
  • Add the hydra.main decorator in src/[package_name]/pipeline_registry.py :
from typing import Dict

import hydra
from kedro.pipeline import Pipeline
from omegaconf import DictConfig

# "de" and "ds" are the project's pipeline modules, e.g.
# from get_started.pipelines import data_engineering as de, data_science as ds


@hydra.main(config_path="../../conf", config_name="config")
def register_pipelines(cfg: DictConfig) -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from a pipeline name to a ``Pipeline`` object.

    """
    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()

    return {
        "de": data_engineering_pipeline,
        "ds": data_science_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline,
    }

This will result in the following error:

Primary config module 'get_started.conf' not found.
Check that it's correct and contains an __init__.py file

note: get_started is the name of the package in /src

Cause of the issue

After some digging, it appears that the configuration path resolved by hydra.main does not exist. The following info was obtained by running in debug mode with a breakpoint on the first line of ConfigLoaderImpl.ensure_main_config_source_available() (in hydra/_internal/config_loader_impl.py).

  • When the bug appears, calling self.get_sources() inside ConfigLoaderImpl.ensure_main_config_source_available returns this: [provider=hydra, path=pkg://hydra.conf, provider=main, path=pkg://conf, provider=schema, path=structured://]
  • It should actually be this: [provider=hydra, path=pkg://hydra.conf, provider=main, path=file:///PATH_TO_PROJECT//conf, provider=schema, path=structured://]
    It appears that Hydra cannot resolve file:///PATH_TO_PROJECT and replaces it with pkg://

Possible Implementation

I'm not really sure how to solve this, or which library should be adapted to correct this bug, so I wrote a similar post on Hydra's issue tracker.
Hydra expects the decorated script to be launched directly from the terminal; I don't know exactly what happens when kedro run executes, but I guess the problem comes from somewhere around there.

Possible Alternatives

Right now I'm using a workaround that generates the config via initialize() and compose():

from typing import Dict

from hydra import compose, initialize
from kedro.pipeline import Pipeline

def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from a pipeline name to a ``Pipeline`` object.

    """
    initialize(config_path="../../conf")
    cfg = compose(config_name="config")

    data_engineering_pipeline = de.create_pipeline()
    data_science_pipeline = ds.create_pipeline()

    return {
        "de": data_engineering_pipeline,
        "ds": data_science_pipeline,
        "__default__": data_engineering_pipeline + data_science_pipeline,
    }

Follow up question

This also raises the question of how to do config overrides from the command line, a Hydra feature available when the user invokes the script directly. I guess it would be possible via the --config argument of kedro run, but I haven't tested it yet.

tl;dr: hydra.main() is called in an unusual way, so Hydra cannot find the config folder.

@bergalli
Author

bergalli commented Mar 1, 2022

To save some time, here is the repository containing the example I described: https://github.com/neltacigreb/kedroXhydra

@datajoely
Contributor

datajoely commented Mar 1, 2022

Hi @neltacigreb thanks for raising this! This is something we'd love to support in Kedro. As we progress to 1.0.0, this is exactly the sort of thing that should be simple; that is to say, the core parts of the Kedro framework should be entirely hot-swappable with alternatives.

In terms of the implementation above, the issues you post feel working-directory related, so I would consider using a breakpoint() to identify where your code is running and which relative paths work.

That being said, I do think there are two slightly more 'Kedrific' approaches at our disposal:

(1) Using a lifecycle hook

Our lifecycle hooks are the simplest way to extend parts of Kedro's run lifecycle.

I'm pretty sure the before_pipeline_run hook in particular exposes all the live objects you need to mutate in order to pass Hydra configuration before a run starts. You can inspect the specification here.

(2) Defining your own config loader

Here you would define your own HydraConfigLoader class and register it in hooks.py. That would be more involved than the hook approach, but you would have complete control over how things work. In the current version of Kedro (0.17.7) you would need to adapt an existing config loader, like the basic one here.

In the next release of Kedro 0.18.0 we will introduce an AbstractConfigLoader class which will make this specific exercise simpler in the future.

Closing thoughts

Kedro is going through a long running exercise where we're researching and thinking about potential solutions to config overhead, complexity and mental model. This is being tracked on issue #891 so any thoughts you have on this topic will steer our future direction.

@bergalli
Author

bergalli commented Mar 3, 2022

Thank you for the directions. I tried different ways to implement the loading of a Hydra config, and here are some notes I collected. Disclaimer: I am a fresh Kedro user and a fairly experienced Hydra user.

A solution to use Hydra in Kedro

I ended up using a hook, which will load the config before a pipeline execution and store it in the data catalog. A repo with a working example of using Hydra in the iris toy project is here, and the source code for the hook specifically is here. When looking at projects from the community, I stumbled on this package from @Minyus, which is used in this implementation.

  • Pros:

    • Makes the Hydra config available to every node via the catalog under the following keys:

      • config: OmegaConf object containing the config
      • cfg:path>to>parameter: individual parameters from the config stored under their path
      • cfg:path>to>config_group: OmegaConf object containing a subset of the config
    • Can overwrite parts of the config from the command line using the Hydra override syntax. Overrides must be provided as additional params at launch under the key hydra_overrides:.... E.g. kedro run --params "hydra_overrides:+path.to.param1=true ~path.to.param2=0.3"

  • Cons:

    • Doesn't make use of the @hydra.main decorator, which is the recommended way to use Hydra. From the documentation :

      Please avoid using the Compose API in cases where @hydra.main() can be used. Doing so forfeits many of the benefits of Hydra (e.g., Tab completion, Multirun, Working directory management, Logging management and more)

    • May create a LOT of entry keys in the catalog, to account for every config group at every depth plus the individual parameters at the leaves.

  • Limitations:

    • The config.yaml file must be placed in the root config folder, and cannot have another name.

    • Config is reloaded from the root folder before each pipeline run, implying that any modification to config parameters made during a pipeline run will be overwritten in the next one. However, in my tests the hook was never called twice, even though the default pipeline is composed of 2 pipelines. Is that expected behaviour?
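The key scheme above can be sketched with a plain flattening helper (names are mine; the real implementation lives in the linked repo, and stores OmegaConf objects rather than plain dicts):

```python
def catalog_entries(cfg: dict, prefix: str = "") -> dict:
    """Flatten a nested config into catalog keys like 'cfg:path>to>param'.

    Each config group is stored both as a whole and as its individual
    leaf parameters, mirroring the key scheme described above.
    """
    entries = {}
    for key, value in cfg.items():
        path = f"{prefix}>{key}" if prefix else key
        entries[f"cfg:{path}"] = value
        if isinstance(value, dict):
            entries.update(catalog_entries(value, path))
    return entries
```

For example, catalog_entries({"model": {"lr": 0.1}}) yields both "cfg:model" and "cfg:model>lr", which also makes the "LOT of entry keys" drawback concrete: the entry count grows with every group at every depth.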

Limitations encountered when trying different methods

  • I tried to make a hook running after kedro run, with the method before_command_run and the decorator kedro.framework.cli.hooks.markers.cli_hook_spec, but the code was never reached. Did I do something wrong, or is it not possible to use before_command_run in a hook?

  • I ended up creating a plugin that did run before the command. However, if I understood correctly, you need to install a plugin before using it? I found that too complicated for this use case, and switched to another method.

  • settings.project_root returns None while inside before_pipeline_run() in a hook. Is that a bug, or is the following the correct way?

     from kedro.framework.session import get_current_session
     kedro_session = get_current_session()
     project_root = kedro_session.load_context().project_path
  • In ProjectHooks().register_config_loader, the parameter conf_paths can be a list (and it is one in the default iris project). Hydra, on the other hand, only loads the config from a single string pointing to the config root, relative to the current script path. So using a custom config loader in ProjectHooks was not possible.

Thoughts on the config in Kedro, and how to improve compatibility with Hydra

Again, I'm very new to Kedro, so I don't have a clear view of what these changes would impact down the line.

  • With my approach I skip the parameters.yml file entirely. The problem is that this results in a warning about a missing parameters.yml file when doing a kedro run.

  • When calling catalog.add_feed_dict() with replace=True, a log line is emitted for every entry, producing a huge chunk of unwanted warnings in the console (even though, from my tests, the before_pipeline_run hook is executed only once per kedro run)

  • Hydra doesn't allow the .yml extension for config files, only .yaml. Since default Kedro config files have a .yml extension, it could be worth changing to .yaml. The official recommendation is .yaml, even though there is no clear consensus.

  • I don't think a native integration of Hydra is desirable, because the package has a steep-at-first learning curve. Even though the Kedro docs are very extensive, learning both at once is not something I'd like to experience.

    However, I think it would be good to harmonise the config-loading process (i.e. just one string containing the root path), because then it would be possible to create custom config loaders.

Thanks for this cool framework :)

datajoely pushed a commit that referenced this issue Mar 3, 2022
[AUTO-MERGE] Merge master into develop via merge-master-to-develop
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 7, 2022
@bergalli
Author

bergalli commented Mar 12, 2022

Some update on this: I managed to make a custom decorator to be used on the register_pipelines() function, which reproduces most of the behaviour of the hydra.main() decorator.
It works, but it is very hacky, so I don't know whether it will always behave exactly like the original decorator.

Documentation and source can be found in this repo

The more I use both packages, the more I feel like a native integration of Hydra in Kedro would make sense. I'm thinking about logging management, environments switching, classes instantiation from yaml files, dynamic pipelines generation at runtime, ...
I'll make a post with more details on the dedicated Configuration issue soon.

@datajoely
Contributor

@neltacigreb thank you for the update on this and your continued work - if helpful I could organise a discussion call with the maintainer group to help think through some of this? I'm not sure any of us are familiar with Hydra - but we absolutely see the user value here.

@datajoely
Contributor

(I also think your snippet is actually pretty elegant)

@bergalli
Author

@datajoely absolutely, I'd be happy to explain globally what the package does and what parallels I see with the current config management in Kedro. I'll PM you on LinkedIn.

Obviously the Hydra docs are good to check, and out of curiosity the OmegaConf ones too; Hydra is basically an extension of OmegaConf.

Also this repo is a good example, since it uses most of the functionalities provided by the package.

@noklam
Contributor

noklam commented Apr 14, 2022

Thank you for summarizing this. This makes perfect sense, and I have seen internal teams trying to achieve the same thing by writing their own multi-runner (one command to fire up multiple Kedro pipelines with different parameters).

As far as I can remember, Hydra is heavily based on its CLI, and the compose API only has a subset of its features (is this still true? The last time I used it was two years ago).

Hydra probably has a different way of looking at how configurations should be structured, and this could be related to #891.

@bergalli
Author

bergalli commented Apr 29, 2022

Hi @noklam sorry for the delay.

hydra is heavily based on its CLI and the compose API only has a subset of features

This is true: using Hydra via the CLI allows overriding parameters at runtime, or launching it in multirun mode (one command fires all configs). The CLI mode also creates a new output folder for each run, which proves useful in multirun.
The compose API can only be used to create the config from the YAML files.
In both cases, the config is accessible directly in the code, for example from the register_pipelines() function.
Since Hydra does not have direct access to the CLI here, I made some adapters (repo: https://github.com/neltacigreb/kedroXhydra) to be able to test the two packages together.

Hydra probably has a different way of looking at how configurations should be structure and this could be related to #891.

Correct me if needed, but I feel the main difference is that Kedro aims at simplicity in the config directory, while Hydra encourages more complex config folder structures in order to make use of the override mechanism. They're similar on some subjects too (multirun, dynamic pipelines, overrides), some of which are already provided in the Kedro config.

As I continue using the two packages, I'll focus on a few features that could be a match, in my opinion:

  • multirun/sweep with optuna
    • run multiple pipelines in parallel and search optimal parameters
  • function/class instantiation from the config
    • Using any kind of dataloader as provided
    • Building pipelines from the config
  • Access to the config in the register_pipelines function:
    • Dynamic building of pipelines
  • Config overrides from cli:
    • mixing a general project config with run-specific configs

When I find some time I'll package my findings in a plugin :) Until then, if you think of other features that could be useful in Kedro, I'd be glad to try them as well.

@noklam
Contributor

noklam commented Apr 29, 2022

This is true, using hydra via CLI allows to override parameters at runtime, or launch it in multirun mode (one command fires all configs). Also the CLI mode creates a new output folder for each run, which proves useful in multirun.
The compose API can only be used to create the config from the yaml files.
In both cases, the config is accessible in the code directly, for example accessible from the register_pipeline() function.

Thanks for the explanation. As I understand it, you are currently using the Compose API. So the main benefits of using Hydra are:

  • Composable configuration (I suspect this mainly comes from OmegaConf, but I am not sure)
  • Advanced YAML usage like class instantiation.
  • ❌ multi-run (only available in the Hydra CLI); you have your own implementation to mimic multi-runs via hooks

@bergalli
Author

For the compose API, that's exactly it. In the repo I mentioned, there are two Hydra decorator adapters.

  • One uses only the compose API and makes the config available in register_pipelines()
  • The other wraps the hydra.main decorator and allows config overrides from the CLI; I have never tested the multirun part, and it will certainly not work due to the way Kedro handles the running of pipelines.

My plan to make multirun usable is to generate many Kedro pipelines with different configurations, namespace them, and assemble them into one big final pipeline.

I didn't know about the multirun hooks; I'll look into that first to see if it fits my app.

@merelcht merelcht added the Help Wanted 🙏 Contribution task, outside help would be appreciated! label Jun 7, 2022
@antonymilne antonymilne pinned this issue Jun 27, 2022
@antonymilne antonymilne added the pinned Issue shouldn't be closed by stale bot label Jun 28, 2022
@antonymilne antonymilne unpinned this issue Jun 28, 2022
@MatthiasRoels

Did some digging into this issue myself. It seems the problem of integrating Hydra with Kedro stems from the fact that Hydra tries to be a "replacement" for, e.g., a click application. Hence Kedro and Hydra will never be natively compatible. That said, as already shown above, it is still possible to create a custom decorator to make them compatible. The only real question is: where in the code base would we use it (somewhere in the CLI part seems most reasonable)?

In my opinion, Hydra fits best as an alternative ConfigLoader, potentially alongside an OmegaConfigLoader. In that case, using Hydra's compose API would already suffice, provided we take care of the missing features with the Kedro CLI.

The biggest problem I see with using hydra is that omegaconf (and hence hydra) only supports “soft-merges” of config, which is not always desirable! So, to me it feels like adopting hydra would also require a change in how we manage configuration in a project…
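To make the "soft-merge" point concrete, here is an illustrative pure-Python sketch (OmegaConf.merge behaves like soft_merge below; the function names are mine, not from any library):

```python
def soft_merge(base: dict, override: dict) -> dict:
    """OmegaConf-style merge: nested sections are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = soft_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def hard_merge(base: dict, override: dict) -> dict:
    """Destructive merge: an overriding section replaces the base section."""
    merged = dict(base)
    merged.update(override)
    return merged
```

With base {"db": {"host": "local", "port": 5432}} and override {"db": {"port": 6543}}, soft_merge keeps the host key while hard_merge drops the whole base section, which is the behaviour difference at stake here.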

@bergalli
Author

bergalli commented Feb 9, 2023

I've turned to using Hydra in a CLI script that runs Kedro in a subprocess, which makes it possible to use Hydra plugins (hyperparameter search, ...) but makes it difficult to debug a pipeline.

A custom OmegaConfigLoader with the compose API would integrate best into Kedro's workflow, but loses the ability to override parameters with a whole YAML file of overrides at once.

Not sure what soft merge refers to, but having two different ways of reading the config could indeed be problematic.

So far this script has worked for me, as long as the config structure stays simple:

import os
import shlex
import subprocess
import sys
from typing import List

import hydra
from flatten_dict import flatten
from omegaconf import DictConfig


def run_subprocess(command: str):
    print(f"Running command: \n\n{command}\n")
    process = subprocess.Popen(
        shlex.split(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        bufsize=2,
    )
    with process.stdout as p_out:
        for line in iter(p_out.readline, b""):  # b'\n'-separated lines
            print(line.decode().strip())

    process.wait()  # to fetch returncode
    return process.returncode


def get_all_params_overrides(cfg: DictConfig) -> List[str]:
    config_flat = flatten(cfg)
    params_overrides = [
        ".".join(param_keys) + ":" + str(param_value)
        for param_keys, param_value in config_flat.items()
    ]
    return params_overrides


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig):

    pipeline = cfg.pipeline if "pipeline" in cfg.keys() else "__default__"
    params_overrides = get_all_params_overrides(cfg)
    kedro_bin = os.path.join(os.path.split(sys.executable)[0], "kedro")
    command = " ".join(
        [
            kedro_bin,
            "run",
            f"--pipeline={pipeline}",
            f'--params="{",".join(params_overrides)}"',
        ]
    )

    returncode = run_subprocess(command)

    if returncode:
        raise RuntimeError(f"kedro run failed with exit code {returncode}")


if __name__ == "__main__":
    main()

@astrojuanlu
Member

astrojuanlu commented Jul 18, 2023

Hello folks, since this issue was opened we've made great progress and now we intend to make OmegaConfigLoader the default and only configuration loader in Kedro (see #2693 for the upcoming deprecations, and #1657 for the research that ended up leading us there).

This means though that we don't plan to support Hydra natively in the near future. This doesn't mean that Hydra can't work with Kedro - in fact, some internal teams in McKinsey have created a HydraConfigLoader with great success.

This is not something that we'd like to maintain for the open source community though, so we'll look into blogging about our approach and giving away some bits of code, and if someone else wants to take over and publish it as a plugin, we'll be more than happy to promote it in https://github.com/kedro-org/awesome-kedro.

For now I'm closing this feature request as "won't fix". If you have more thoughts, please feel free to share them in this thread.

Thanks everyone who contributed to the conversation!

@astrojuanlu astrojuanlu closed this as not planned Won't fix, can't repro, duplicate, stale Jul 18, 2023
@astrojuanlu astrojuanlu removed Help Wanted 🙏 Contribution task, outside help would be appreciated! pinned Issue shouldn't be closed by stale bot labels Jul 18, 2023