
Synthesis of user research when using configuration in Kedro #891

Closed
hamzaoza opened this issue Sep 13, 2021 · 21 comments
Labels
Component: Framework (Issue/PR that addresses core framework functionality) · Type: User Research Synthesis ✍️ (Issues to document results from user research)

Comments

@hamzaoza (Contributor)

Summary

Configuration overhead is an issue that has arisen time and time again in user feedback, particularly as Kedro projects scale in complexity. From user interviews, we observed that the three main configuration areas used were kedro run, the Data Catalog and parameters. The remaining options were seen as "set up once and forgotten" for the remainder of the project. Overall, configuration in Kedro is well received and liked by users, who appreciate the approach Kedro has taken so far.

During this research, it became clear that configuration scaling impacts a small set of use cases where you have multiple environments (e.g. dev, staging and prod) and multiple use cases, for example running the same or a similar pipeline across different products or countries. To gather deeper insights, participants were presented with two existing options for the Data Catalog and two possible solutions: pattern matching and Jinja templating, with participants favouring the former of the two. Users were also asked how they felt about moving the Data Catalog entirely to Python. Participants were universally against the idea, as it would fundamentally go against the principles of Kedro.

Table of Contents

  1. Introduction
  2. Background
  3. Research Approach
  4. User Interview Matrix
  5. Configuration Synthesis
  6. GitHub Analysis
  7. Data Catalog Generator
  8. Solution Criteria

1. Introduction

Configuration overhead is an issue that has arisen time and time again in user feedback, particularly as Kedro projects scale in complexity. It's also an issue for new users who have never been exposed to this concept, i.e. Data Scientists using software engineering principles for the first time. This research aims to understand the key pain points users face when using configuration, test possible solutions for the Data Catalog, and develop specification criteria for any solution.

2. Background

Kedro is influenced by the 12 Factor App, but this results in a lot of duplication of configuration. From users, we have heard that YAML files can become unwieldy, with each entry written manually, making them error prone. Users also want to apply runtime parameters and to parameterise runs in complex ways that Kedro doesn't currently support.

As a result, some teams have tried to solve this independently, most notably by using Jinja2 templating through the TemplatedConfigLoader, though this has not become widespread across other teams. However, as we continue to grow, it is likely that more users will encounter similar issues and will need a Kedro-native solution to support growth.

Finally, this is not a problem unique to Kedro. Google SREs have faced a similar issue in the past and have outlined their thoughts and experiences here.

3. Research Approach

To develop a holistic overview of configuration in Kedro, a journalistic approach was used, looking to answer the following questions:

  • Who is using configuration in Kedro?
  • What are they configuring in Kedro?
  • When are they using configuration in Kedro?
  • Why are they using Kedro configuration?
  • Where are they configuring Kedro?
  • How are they configuring Kedro?

Note: There is some overlap in the last two questions.

Research Scope

To keep things manageable, the primary focus of this research was the Data Catalog and how users interact with it; elements like parameters, credentials, etc. were not explicitly user-tested. Nonetheless, pain points for other forms of configuration in Kedro were also captured and are discussed later. Furthermore, custom solutions created by teams may be referenced, but they were not considered in the overall solution as they are not Kedro-native features.

4. User Interview Matrix

In total, 19 one-hour interviews were conducted across personas and experience levels to capture a spectrum of views. The user matrix breakdown is shown below.

|              | Data Sci. | Data Eng. | Verticals | External | Total |
|--------------|-----------|-----------|-----------|----------|-------|
| Beginner     | 2         | 0         | 1         | 0        | 3     |
| Intermediate | 3         | 1         | 0         | 1        | 5     |
| Advanced     | 3         | 3         | 3         | 2        | 11    |
| Total        | 8         | 4         | 4         | 3        | 19    |

Note: External users were sourced from Kedro Discord

5. Configuration Synthesis

The matrix below covers seven configuration areas: Kedro itself, kedro run, the Template Config Loader, credentials, config environments, parameters and the Data Catalog. For each area, we captured the technology used to support the configuration, the touchpoints (where in the Kedro project the user makes the configuration), ownership (the lead user responsible for it), user sentiment, benefits, pain points and feature requests.

Kedro

  • Technology: Python
  • Touchpoints: src/<project-package>/settings.py; pyproject.toml
  • Ownership: DE (50%) – TD (50%)
  • Sentiment: 😀
  • Benefits: It's open source; Standard project structure; Easy to collaborate with others; Provides great defaults out of the box; Easy to ramp up a Kedro project
  • Pain points: Breaking changes between 0.16 and 0.17; Issues with kedro install on Windows; Changes to hooks and pipeline registry between versions
  • Feature requests: Include CI/CD defaults out of the box; More documentation for migrations with breaking changes

kedro run

  • Technology: YAML
  • Touchpoints: kedro run --config **.yml; export KEDRO_ENV=xyz
  • Ownership: DE (50%) – DS (50%)
  • Sentiment: 😐
  • Benefits: Single point of entry to run code; Can use the --pipeline flag to run specific branches of code; Can git commit a config.yml file to reduce run errors
  • Pain points: Can be difficult to run a single node; Arguments in the terminal are not version controlled; --nodes on the CLI is node_names in the yml file
  • Feature requests: none captured

Template Config Loader

  • Technology: Python, Jinja
  • Touchpoints: src/<project-name>/hooks.py
  • Ownership: DE (80%) – DS (20%)
  • Sentiment: 🙂–😐
  • Benefits: Easy to set up; Overall, one of the easiest things to work with; Enables automation and scaling of Kedro; Easy to collaborate with others; A properly written hook can save lots of time
  • Pain points: Depending on what you use it for, can mix code and config to an extent and lose traceability; Needs some knowledge to set up, so not easy for beginners; Can reduce transparency of code; Users might have the idea but don't always find it easy to implement; Jinja was not well received by clients
  • Feature requests: Common hooks templated by default; Hooks for when a model starts and ends; Nested dependencies in globals.yml

Credentials

  • Technology: YAML
  • Touchpoints: conf/**/credentials.yml
  • Ownership: DE (80%) – DS (20%)
  • Sentiment: 😀
  • Benefits: Enforces best practices around managing credentials; Works as it should and is seamless; Can handle a variety of credentials out of the box; Each person can have their own setup to access data
  • Pain points: Cannot inject credentials at runtime; For beginners, it can be a little hard to grasp why credentials are separated from the Data Catalog or code; Feels misaligned with CI/CD tooling
  • Feature requests: Easy way to sync these with environment variables

Config Environments

  • Technology: YAML
  • Touchpoints: conf/base/**.yml; conf/local/**.yml; conf/**/**.yml; export KEDRO_ENV=**
  • Ownership: DS (100%)
  • Sentiment: 🙂–😐
  • Benefits: Fairly simple to use; Enables a structured approach to dev/qa/prod; globals.yml can be different for each environment; Decouples code and config; Helps teams test and prototype in environments in a risk-free way; Creates a structured way of working
  • Pain points: Can be easily abused by teams for other purposes; The inheritance pattern of local/custom/base can be hard for new users to pick up; Only the top-level key is supported
  • Feature requests: Enable flexible inheritance across environments; Greater understanding of where config ends and environments begin; Provision to separate use cases and environments

Parameters

  • Technology: YAML
  • Touchpoints: conf/**/parameters.yml; kedro run --params param_key1:value1,param_key2:2.0; kedro run --config **.yml
  • Ownership: DE (20%) – DS (80%)
  • Sentiment: 😀
  • Benefits: Easy and straightforward to use; Easy to read and maintain; The "params:" prefix makes parameters quick to identify in code
  • Pain points: Parameters do not inherit base keys, so you need to overwrite the entire entry; Repetition and duplication of files; Can grow into large files, leading to a very nested dictionary; Cannot have ranges or step increments; Little IDE support means you need to follow the logic yourself
  • Feature requests: A parameter.load similar to the catalog; Namespaces for parameters; More dynamic entries, e.g. ranges

Data Catalog

  • Technology: YAML
  • Touchpoints: conf/**/catalog.yml
  • Ownership: DE (50%) – DS (50%)
  • Sentiment: 😀
  • Benefits: Viewed as the best feature of Kedro; Declarative syntax makes it easy to use, read and debug; Simplification of I/O; Decouples code and I/O; Already has many data connectors built in; Transcoding datasets
  • Pain points: Repetition of entries; Duplication of files; Minor changes to entries need to be applied everywhere and can be difficult to sync; Not easy to write a custom class for unsupported datasets; For some teams, YAML anchors are beyond their skillset; Very long catalog files
  • Feature requests: A default YAML that includes a link to the docs clearly showing which datasets are supported; Address the repetition and duplication of catalogs; More guidance on picking the best datatype for an entry; Support more upcoming datasets, e.g. TensorFlow

Overall, configuration in Kedro is well received and liked by users. No area had a particularly negative response, and users largely understood and appreciated the approach Kedro has taken so far. During this exercise, it became clear that configuration scaling impacts a small set of use cases, summarised in the table below.

[table: use cases broken down by environment (single vs multiple), country (single vs multiple) and use case (single vs multiple); the combinations affected by configuration scaling were marked in the original]

This would indicate that large configuration files are mostly seen internally, often on large analytics projects. This stems from Kedro not supporting multiple use cases in a monorepo, forcing users to use config environments as a stop-gap solution. This, however, prevents teams from using environments for their intended purpose of separating development environments.

6. GitHub Analysis

To support the qualitative insights from user research, a custom GitHub query was created to gather quantitative data on the Data Catalog.

At the time of running (18 Aug 2021), this returned 411 results, of which 138 were real Kedro Data Catalog files. Note: empty Data Catalogs, spaceflights or iris examples, and non-Kedro projects were manually filtered out. The query assumes that these files are representative of open-source users and that Data Catalogs follow the /conf/ folder structure. Furthermore, it's impossible to determine whether these are complete files from finished projects or still under development.

From this, it was found that only 9% of users were using YAML anchors and only 2% were using globals.yml. However, 89% of users were using some type of namespacing in their catalog entries. The number of Data Catalog entries per file was also counted; from the histogram below, Data Catalog entries peak around 10.

[histogram: GH Catalog Search - Data Catalog entries per file]
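To make the two practices counted above concrete, the snippet below illustrates them; the dataset names and paths are invented for illustration.

# Namespacing (89% of catalogs): a prefix groups related entries
france.model_input:
  type: pandas.ParquetDataSet
  filepath: data/france/model_input.pq

# YAML anchor (9% of catalogs): an anchor (&) plus merge key (<<:) reuses shared fields
_csv_defaults: &csv_defaults
  type: pandas.CSVDataSet
  load_args:
    sep: ","

reviews:
  <<: *csv_defaults
  filepath: data/01_raw/reviews.csv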

7. Data Catalog Generator

To better understand what users need from the Data Catalog, participants were shown possible options using prototype code: two existing options for the Data Catalog, and two possible solutions, pattern matching and Jinja templating, with users favouring the former of the two. Users were also asked how they felt about moving the Data Catalog entirely to Python. Here, participants were universally against the idea, as it would fundamentally go against the principles of Kedro.

Vanilla Kedro

  • Positives: Viewed as the best feature of Kedro; Declarative syntax makes it easy to use, read and debug; Simplification of I/O; Decouples code and I/O; Already has many data connectors built in; Transcoding datasets
  • Negatives: Repetition of entries; Duplication of files; Minor changes to entries need to be applied everywhere and can be difficult to sync; Not easy to write a custom class for unsupported datasets; For some teams, YAML anchors are beyond their skillset; Very long Data Catalog files

YAML Anchors

  • Positives: Reduces the level of repetition in a file; Still easy to read and debug; Built-in YAML feature, so also used in other tools that use YAML
  • Negatives: Users were using it without knowing they were using it; Getting accustomed to the notation can take a while to learn and fully understand; Sub-keys are declared elsewhere, which impacts readability

Pattern Matching

  • Positives: Fairly easy to understand compared to Jinja and YAML anchors; Still somewhat declarative; Drastically reduces the number of lines; Viewed as beginner friendly; Takes away the additional step of having to declare new files in the Data Catalog
  • Negatives: Masks the true number of datasets; Concern about the order of operations; Doesn't work for raw datasets; Breaks when files have different schema definitions in the Data Catalog entries; Concern about unintended consequences; Doesn't solve the file duplication problem; The same naming structure doesn't mean files have the same structure

Jinja Templating

  • Positives: Can see the actual entries through the syntax; Somewhat established in the Python world, so users may have already used it elsewhere; Reduces the number of lines, but not as much as pattern matching
  • Negatives: Multiple points of failure, which also makes it difficult to debug; Doesn't work for raw datasets; User experience suggests beginners struggle to use and understand it, and some teams have even removed it completely from their work; Can overcomplicate the Data Catalog with logic; Breaks when files have different schema definitions in the Data Catalog entries; Doesn't solve the file duplication problem; Bigger learning curve compared to previous options; Whitespace control can be difficult to manage

Python

  • Positives: Greater control between memory and file datasets; Access to StackOverflow to help debug issues
  • Negatives: Users were universally against the idea of moving the Data Catalog to Python; Mixes code and I/O, which goes against Kedro principles; Considered very unfriendly, especially for non-technical users; Huge concerns about giving too much freedom to users who might abuse this flexibility
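The prototype code shown to participants is not reproduced in this issue, but the sketch below gives a rough flavour of the two candidate solutions; the {name} placeholder convention stands in for the actual prototype DSL and is illustrative only.

# Pattern matching: one entry serves every dataset whose name fits the pattern
"{name}_csv":
  type: pandas.CSVDataSet
  filepath: data/01_raw/{name}.csv

# Jinja templating: a loop generates one explicit entry per dataset
{% for name in ["reviews", "companies", "shuttles"] %}
{{ name }}_csv:
  type: pandas.CSVDataSet
  filepath: data/01_raw/{{ name }}.csv
{% endfor %}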

8. Solution Criteria

While it was important to test the ideas, it was even more important to understand the criteria for a successful solution that would improve the experience of using the Data Catalog. Users identified the following seven components:

  1. Readability
  2. Declarative Syntax
  3. Beginner Friendly
  4. Client Friendly
  5. Reduce Repetition
  6. Reduce Duplication
  7. Backwards Compatibility
@deepyaman (Member) commented:

> From the histogram below, Data Catalog entries peak around 10.

I find this very interesting. A lot of the teams I've seen love, or have gotten into the habit of, creating a physical dataset for everything. I've gotten feedback when I've implicitly left something as a MemoryDataSet, including that it's complicated to debug (it's not; don't rely on literally everything getting written to disk to debug). What would also be interesting to see is the number of datasets (are some projects bigger? are the nodes more granular? are other projects leveraging MemoryDataSet more?).

Is it possible to also see how often people read back datasets (after a pipeline run), especially intermediate datasets? We don't care at all about storage/cost (and apparently performance), so we turn on versioning and write every dataset to disk. How often do we look at them? I'm guessing at the end of a bigger client project you've got 100K-1M+ datasets on disk, and people have only ever looked at <1K. This need to write every dataset to disk manifests itself when you're reusing modular pipelines and want to generate namespace.catalog_entry_X for each intermediate catalog entry in the modular pipeline. That's when I started using Jinja templating (see #583 for the syntax I needed), and that's the one place where I still feel templating is most useful. Was it really even necessary to physically manifest each internal catalog entry?
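For concreteness, that kind of generation looks roughly like this (hypothetical namespaces; not the exact #583 syntax):

{% for namespace in ["france", "germany", "spain"] %}
{{ namespace }}.model_input:
  type: pandas.ParquetDataSet
  filepath: data/02_intermediate/{{ namespace }}/model_input.pq
  versioned: true
{% endfor %}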

I guess what I'm getting at is, maybe the configuration is fine, and a lot of the teams I've seen are way overdoing their catalogs. 😂

@deepyaman (Member) commented:

Not a fan of pattern matching; will copy my comments to @hamzaoza here:

  • Too much "magic": With pattern matching, your catalog is defined by the pipeline catalog entries; you can't just look at the catalog and know what the different catalog entries are.
  • Custom DSL bad: Jinja is standard, and familiar to people who've done stuff like web dev. Even if most Kedro users don't come from that background, a custom DSL isn't necessarily better in that regard. Jinja is new to unfamiliar users, just like a DSL would be, but at least you can find plenty of other resources on Jinja.
  • Pattern matching in other languages isn't a good analog: To be perfectly honest, I wasn't really familiar with pattern matching in Scala or Haskell. Haskell's looks more similar to what I saw in the demo for Kedro. However, for both of these, pattern matching works more like an else statement, which is not really how I think about the default state of my catalog entry.

Sidebar: Just give me autoformatting with prettier on YAML templated using Jinja, and I'm happy. That's my main gripe with templating with Jinja (that prettier no longer works).

@Isy89 (Contributor) commented Sep 15, 2021:

Interesting analysis! I personally like the Jinja2 template system. The only problem I see with it regards readability.

I generally have to run the same pipeline with different inputs and store the results in different locations, so Data Catalogs and config files became an important tool for me to keep track of which data was analysed and where it was stored. This is a problem if the only thing I have is the template, because this information is lost. Furthermore, an explicit catalog like the vanilla one (or one using YAML anchors) becomes really important for reproducibility, for example when I have to regenerate the same results. In my case, I use the template system on one side to make the process of using different inputs and outputs easier, and on the other side I store the generated catalog and config files together with the data to ensure readability and reproducibility. It would also be nice if there were a way to programmatically run the pipeline by passing the dictionaries of variables to the TemplatedConfigLoader through the CLI, instead of having to manually set them in the hooks or having to override the run command and subclass the KedroSession to achieve it.

@datajoely (Contributor) commented:

Thanks for this amazing piece of work @hamzaoza - I'm also quite impressed with how dbt works with Jinja, where they have concise SQL models at rest, but the compiled, fully materialised SQL is available for debugging. Perhaps the same approach could be used to allow people to write concise, complex catalogs, but allow users to materialise them in the format that Kedro sees at runtime?

@datajoely (Contributor) commented:

I've tried to consolidate my thinking and would like to present four prototypes for ways we can take this research forward and turn it into features. Please comment, interact and react to the below:

@datajoely (Contributor) commented:

[Prototype 1] Robust support for environment variables

Abstract

We have multiple examples of people making a trivial change to ConfigLoader and TemplatedConfigLoader to automatically include certain environment variables within their configuration at runtime. This can be critical in a deployment setting, where an orchestrator such as Argo or Airflow may inject certain information via environment variables.

The following order of precedence would apply here:
[diagram: order of precedence for configuration resolution]

User need justification

Credentials - Easy way to sync these with environment variables
Environments - Several internal examples of projects making this tweak to their config loader

Implementation today

Today, one can introduce environment variables into their globals_dict using this trivial change to the TemplatedConfigLoader:

import os
from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        # .items() is needed so the comprehension sees key/value pairs
        globals_dict={k: v for k, v in os.environ.items() if k.startswith("KEDRO_")},
    )

Proposal 1.1 - Environment variable pattern matching

In this proposal, we introduce a new keyword argument that allows the user to specify a regular expression that matches environment variables and includes them in scope. Many of our users do similar things; this simply provides a convenience for doing so.

@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        env_var_key_pattern=["^KEDRO.+"],  # proposed new kwarg: regex pattern(s) to match
    )

Proposal 1.2 - Extending Proposal 1 with specific features for credential management

A related area that people bring up is the way Kedro handles credential management, particularly since the enterprise world has become more sophisticated in the last couple of years with vendor-led solutions such as HashiCorp Vault and Kerberos.

Environment variables have a part to play and perhaps the following change could help in the same sort of way:

@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        env_var_credential_pattern=["^AWS.+ACCESS_KEY.*"],  # proposed new kwarg: regex pattern(s)
        env_var_mapping={
            # optional mechanism to rename certain keys
            "AWS_ACCESS_KEY_ID": "s3_token",
            "AWS_SECRET_ACCESS_KEY": "secret_mapping",
        },
    )

Closing thoughts

  • This feels like a relatively minor change that would simplify things for users and make their workflow more succinct.
  • Proposal 1.2 is possibly a premature optimisation.

@datajoely (Contributor) commented:

[Prototype 2] Support dynamically generated configuration

Abstract

It is clear from supporting the product that users want to define their pipelines dynamically. This is partly because we as programmers want to follow the DRY ("don't repeat yourself") principle, and because large Kedro projects end with the user duplicating lots of configuration (configuration environments are one example where this is unavoidable).

Our thinking on this has been heavily influenced by the Google SRE cookbook, which outlines the exact journey we as a Kedro team have gone on:

  1. Failing to see configuration as a programming problem (ConfigLoader)
  2. Introducing a string formatting DSL as a temporary fix (TemplatedConfigLoader)
  3. Discussing whether a more robust custom DSL is useful (Like our pattern matching proposal)
  4. Thinking about whether an industry standard DSL like Jinja2 is the best bet
  5. Contemplating giving users the full power of a Turing complete language like Python
  6. Evaluating configuration specific languages like jsonnet, dhall, cue etc.

The solution that we land on will ultimately fall into category 3, 4 or 6. Traditionally, the Kedro team has been resistant to going in this direction because it inevitably makes configuration less readable and more difficult for newcomers to understand. Perhaps this proposal, combined with the compilation step in [Prototype 3], mitigates this readability point.

User need and justification

Since introducing Jinja2 support to the TemplatedConfigLoader in 0.17.0, it has become a common pattern discussed in the open-source community:
1, 2, 3, 4, 5, 6, 7

Proposal 2.1 - Jinja2

  • Kedro has had Jinja2 support since 0.17.0 was released, and we have a great deal of user evidence showing that it is being used. Jinja gives users basic programming structures like loops, conditionals and variables.
  • This is by far the simplest thing for us to double down on, since it is already supported by many Python libraries and even anyconfig, which Kedro uses behind the scenes.

[example image: Jinja2-templated catalog]

Currently, we do not support one key Jinja2 feature that would improve the developer experience: importing and reusing macros via the include command (a sketch of this follows the points below).

  • Importing centrally defined templates could simplify a lot of the configuration currently done in YAML.
  • This could even open up the ability to avoid using globals.yml and the DSL unique to TemplatedConfigLoader
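A minimal sketch of what macro reuse could look like if import were supported; the macro file name and macro signature here are hypothetical:

{# conf/base/macros.j2 — a hypothetical, centrally defined macro #}
{% macro csv(name, layer) -%}
{{ name }}:
  type: pandas.CSVDataSet
  filepath: data/{{ layer }}/{{ name }}.csv
{%- endmacro %}

{# catalog.yml — import the macro instead of repeating entries #}
{% from "macros.j2" import csv %}
{{ csv("reviews", "01_raw") }}
{{ csv("companies", "01_raw") }}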

In general the main arguments against going down this route are:

  • Jinja is not a whitespaced language, so working against a YAML target is dangerous
  • It can be intimidating for users to pick up, somewhat hard to read at rest
  • Can be hard to debug at scale

Proposal 2.2 - Pattern matching

  • The thinking behind this proposal hinges on not taking an imperative approach to generating declarative config.
  • It would force users to think very carefully about the naming convention of their catalog entries, and ultimately result in them writing a great deal fewer catalog entries.

[example image: pattern-matching catalog]

Negative feedback from this proposal focused on:

  • The risk of invisible side effects
  • That it doesn't solve the duplication problem across config environments.

Proposal 2.3 - Jsonnet

  • Jsonnet can be thought of as a Jinja-like language that is better suited to a whitespaced, semi-structured target like YAML or JSON.
  • Like Jinja2 the ability to import and reuse centrally defined templates can be a real accelerator.

[example image: Jsonnet catalog]

  • This would require users to learn yet another syntax, even if it is pretty intuitive
  • The tooling around Jsonnet is not as mature as Jinja2's, but there is IDE support for both VS Code and PyCharm.
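As a flavour of the language, a minimal hypothetical catalog written in Jsonnet; the csv helper is illustrative:

// A local function acts as a reusable template for catalog entries
local csv(name) = {
  type: 'pandas.CSVDataSet',
  filepath: 'data/01_raw/' + name + '.csv',
};

{
  reviews: csv('reviews'),
  companies: csv('companies'),
}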

Closing Thoughts

A core principle in Kedro is that there should be only one obvious way of doing things; this means that all of the proposals above are mutually exclusive.

This is a high priority for Kedro - but it's a big decision choosing which horse to back.

@datajoely (Contributor) commented:

[Prototype 3] Introduce a kedro compile command so that users can materialise what Kedro sees at run-time

Abstract

Materialise a human-readable version of the configuration Kedro sees at run time. The compiled YAML would live in a gitignore-d directory structure that fully resolves namespaces, hierarchical overrides, templating and other optimisations made in the name of conciseness at the expense of comprehensibility.

User need justification

  • Pattern matching - Masks the true number of datasets; Concern about the order of operations; Concern about unintended consequences
  • Jinja2 - Multiple points of failure, which also makes it difficult to debug
  • Config environments - The inheritance pattern of local / custom / base can be hard for new users to pick up

Example CLI command output

❯ kedro compile
2021-09-28 15:12 - test.cli - INFO - Compiled _conf_compiled/conf/local/catalog.yml
2021-09-28 15:12 - test.cli - INFO - Compiled _conf_compiled/conf/base/parameters.yml
2021-09-28 15:12 - test.cli - INFO - Compiled _conf_compiled/conf/base/logging.yml
2021-09-28 15:12 - test.cli - INFO - Compiled _conf_compiled/conf/base/catalog.yml

Proposal 3.1 - Environment order of precedence resolved (lineage included in a comment)

The original conf/base/catalog.yml entry will not be seen by Kedro, since it has a local override; this record will not exist in the _conf_compiled/base/catalog.yml file.

[example image: original conf/base/catalog.yml entry]

Compiled local/catalog.yml wins, but there is a comment explaining the lineage to the user.

[example image: compiled local/catalog.yml with lineage comment]

Proposal 3.2 - Complex templating / anchoring present at rest is readable once compiled

YAML anchors reduce the amount of repetition present in the file; however, readability suffers as a result:
[example image: catalog at rest using YAML anchors]

However, the fully resolved equivalent can be reviewed in the compiled directory:

[example image: compiled, fully resolved catalog]
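As an illustration (invented entries), an anchored catalog at rest and the fully resolved form that kedro compile would emit:

# At rest: conf/base/catalog.yml, using an anchor
_pandas_csv: &pandas_csv
  type: pandas.CSVDataSet
  load_args:
    sep: ","

reviews:
  <<: *pandas_csv
  filepath: data/01_raw/reviews.csv

# Compiled: _conf_compiled/conf/base/catalog.yml, fully resolved
reviews:
  type: pandas.CSVDataSet
  load_args:
    sep: ","
  filepath: data/01_raw/reviews.csv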

Closing thoughts

  • This mechanism would allow users to see the results of sophisticated templating without using an interactive session. This feels like it would be particularly useful when a new user is onboarded to a project, or when an original developer needs to get back up to speed after a period away from the project.
  • There is a risk of users editing the compiled directory structure by accident; it is gitignore-d, so your IDE should help, but I've made this mistake when using dbt before.
  • This method also fits nicely with whatever the user is using to template their configuration, be it Jinja2, pattern matching or even something like Jsonnet [Prototype 2].

@datajoely (Contributor) commented:

[Prototype 4] A more consistent and robust mechanism for providing configuration overrides via the CLI

Abstract

The run arguments available via the Kedro CLI have evolved organically to date. Today there are three mechanisms for injecting configuration into vanilla Kedro via the CLI:

| # | Command | Comment |
|---|---------|---------|
| 1 | kedro run --env=production | Tweaks the configuration order of precedence so that the configuration within the conf/production directory takes precedence. This technique is influenced by the thinking laid out in the 12 factor app. |
| 2 | kedro run --params param_key1:1,param_key2:4 | The only way Kedro currently supports explicit, specific CLI configuration overrides, but only for parameters, NOT credentials or catalog entries. To achieve this we have introduced a DSL of sorts, and we know from telemetry and from supporting users day to day that this is a very popular feature. |
| 3 | kedro run --config config.yml | A superset of all runtime configuration, as it allows the user to lay out complex CLI commands as a file which can be maintained in a text editor and version controlled. There is an argument that this should be called --kwargs, as that would be more specific and less overloaded than 'config'. |

User need justification

The request to provide CLI overrides comes up relatively frequently (examples 1, 2, 3), as well as in multiple references on internally facing channels.

Much of this stems from a desire to separate the business logic (nodes, pipelines, models) from the inputs and outputs (catalog + credentials, parameters). Kedro provides a separation of these concerns, but they are still situated within the codebase on the user's file system.

This separation becomes a higher priority in production deployments for several key reasons:

  1. The pipeline is often deployed/maintained by a different person from the one who developed it; this second person only cares about the inputs/outputs and the success of the pipeline rather than the business logic itself.
  2. When packaging a modular pipeline, Kedro doesn't package the catalog, because we already incorporate this business-logic distinction. The current workflow suggests you should run kedro catalog generate on the other side.
  3. Initially, configuration environments were introduced to allow users to maintain staging/qa/prod pipelines. However, over time we have observed users starting to use this pattern to deploy slightly different 'flavours' of a similar use case. This is a great use of the functionality, but it speaks to the fact that as projects grow, the configuration overhead grows, especially when the various hierarchical overrides start coming into play.

The proposals in this post follow the order of precedence below, with CLI overrides taking the highest priority before dropping down to the other levels.

[diagram: order of precedence for configuration resolution]

Proposal 4.1 - Allow users to point to a zipped version of the conf structure via the CLI:

The idea here is that the user could package up a version of catalog, parameters (and credentials if so inclined) so that they have a mechanism of injecting lots of configuration at once, independently of the codebase or packaged pipeline.

kedro run override --kind=zip "path/to/catalogs.zip"

The --kind argument could allow us to point to folder directories, or glob paths as well.

Proposal 4.2 - Allow users to inject JSON overrides for complex configuration

In this example, we allow the user to inject specific overrides as JSON. YAML isn't appropriate here, since whitespace in the terminal is a pain, so it makes sense to work with JSON equivalents.

The --kind=json flag should be self-explanatory, but the explicit --catalog and --params flags allow the user to be specific about what they are trying to override.

kedro run override --kind=json --catalog='{"car":{"type":"pandas.CSVDataSet","filepath":"...'
kedro run override --kind=json --params='{"a":{"x":{"value":1},"y":{"value":2}}}'
kedro run override --kind=json --params="$(cat my_params.json)"
  • The third example highlights that this pattern can be extended by all sorts of terminal tricks that don't require the CLI to be extended; things like jq could be very useful here!
  • In addition to the catalog and parameter arguments we could also provide --credentials and even ways to inject --globals for use in TemplatedConfigLoader.

Closing thoughts

  • One of the unwritten principles of Kedro is that things should be readable at rest; if users opt to move everything to the CLI, we lose that.
  • CLI arguments are a pain to type, but skilled engineers can do wonders with command-line utilities and pipes, so this could open up more opportunities for power users.
  • CLI commands are ephemeral and hard to retrieve after the fact without some additional infrastructure. Something like AWS CloudWatch or an ELK stack would allow users to retrieve a previous run's configuration, which today would be persisted on the user's file system. In time, we're expecting some of this to be tracked by the KedroSession store as well.

@sheldontsen-qb (Contributor) commented:

> Thanks for this amazing piece of work @hamzaoza - I'm also quite impressed with how dbt works with Jinja, where they have concise SQL models at rest, but the compiled, fully materialised SQL is available for debugging. Perhaps the same approach could be used to allow people to write concise, complex catalogs, but allow users to materialise them in the format that Kedro sees at runtime?

Actually @datajoely, we have a kedro pmpx render-conf command that compiles and renders everything after templating in full explicit form (including folder structures) into a separate location (set to the log/rendered/ folder IIRC). It was a requested feature, but I think once people know what they are doing, they don't really use it.

@mzjp2 (Contributor) commented Sep 29, 2021:

> Jinja is not a whitespaced language, so working against a YAML target is dangerous

Not Jinja, but Helm charts use templated configuration on YAML targets and are considered an industry standard (https://helm.sh/docs/chart_template_guide/functions_and_pipelines/); you often see things like:

  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}

@datajoely (Contributor) commented:

@mzjp2 - Yeah, spotted that. In general, the whitespacing issue with Jinja isn't a dealbreaker; it just highlights that you need a series of hacks (or, in this case, yet another DSL) to make it work well for YAML.

@idanov (Member) commented Sep 30, 2021:

@datajoely I am not sure I am a big fan of this syntax, it looks quite strange and not very user-friendly... When I was saying that we can specify where to get the config from, I meant something as simple as providing the folder manually. E.g. currently you can run your project in three ways:

  • kedro run ...
  • <project_name> ... (if you have packaged and installed your project)
  • Session.create(...) (if you want to import your package programmatically)

In all of those, there's a hard assumption that your current directory contains the config. What I would like is to make that assumption rather a soft one, i.e. the user should be able to specify where the config is located by pointing to a folder or a .tar.gz file. Something like this:

  • kedro run ... will assume your config is in conf/ in your current directory (or the name you have provided in settings.py)
  • kedro run --conf=/home/ivan/my_new_conf/ will load the config from an entirely different place
  • kedro run --conf=/home/server/configuration/conf.tar.gz will load the config from the tar archive

--conf is obviously not the final name.

Why is that functionality useful? This can help a lot with deploying configuration and packaging. E.g. when you do kedro package, we can not only have the .whl file there, but also the conf.tar.gz, and people can deploy them separately. Moreover, we can very easily get the conf.tar.gz packaged alongside the rest of the code (although we do not recommend that), and when people run their code, they can simply point to their site-packages folder and the tar archive in there (there was such a need in a recent call, where a Kedro user had to deploy through a very strict deployment pipeline they had no control over).

@datajoely (Contributor) commented:

> @datajoely I am not sure I am a big fan of this syntax, it looks quite strange and not very user-friendly... When I was saying that we can specify where to get the config from, I meant something as simple as providing the folder manually. E.g. currently you can run your project in three ways:
>
>   • kedro run ...
>   • <project_name> ... (if you have packaged and installed your project)

@idanov what would you imagine the syntax looking like instead? For reference, proposal 4.1 would make it look like this:

[kedro|$package_name] run override --kind=zip "path/to/catalogs.zip"

@datajoely (Contributor) commented:

Community PR #927 further suggests that more complex CLI override facilities (Proposal 4) are desired

@Galileo-Galilei (Contributor) commented Oct 11, 2021:

> In all of those, there's a hard assumption that your current directory contains the config. What I would like is to make that assumption rather a soft one, i.e. the user should be able to specify where the config is located by pointing to a folder or a .tar.gz file. Something like this:
>
>   • kedro run ... will assume your config is in conf/ in your current directory (or the name you have provided in settings.py)
>   • kedro run --conf=/home/ivan/my_new_conf/ will load the config from an entirely different place
>   • kedro run --conf=/home/server/configuration/conf.tar.gz will load the config from the tar archive
>
> --conf is obviously not the final name.

For the record, my team has overloaded the CLI to add this option, and this is exactly how we deploy our applications. The option is called --conf-root (which refers to the CONF_ROOT global variable where the ConfigLoader looks for the configuration; I think it has been renamed CONF_SOURCE recently). We also append the src/<your-package>/conf folder path to the conf_paths list to be able to have default values in the project itself (as discussed in #770).


@shaunc commented May 16, 2022:

Was led here after trying to solve my own problem via the current jinja2 implementation in #1532.

My thought is: declarative is good, but a fairly sophisticated solution is needed to cover all the use cases. I'm a big fan of terraform, even if it has some idiosyncrasies. It does have variables, loops, etc. and goes well beyond simple pattern matching. Pattern matching can be stretched, but when you stretch it, it also gets complex and hard to read.

  1. Suggestion: have a declarative solution that supports both YAML syntax and its own syntax. For simple projects, YAML will be best/transparent. For more sophisticated uses (variables, etc., which can be coded in separate sections, supporting substitutions and so on), a bespoke syntax will be easier to read. You will need some sort of looping construct IMO. (See devspace for a YAML solution that is extremely flexible - variables, profiles with pattern replacement, and use of Helm templates if need be. No loops though. Or Helm itself.)

  2. As a practical matter, be aware that a declarative solution that naturally meets all use cases and is really DRY will be quite sophisticated. I suggest providing a separate jinja2 loader, with documentation stating "use the other solution if possible". Then incrementally add features to your declarative solution to incorporate reasonable uses of jinja2 found in the wild. (From my POV using Helm charts is OK as well, as suggested above, but... eh... jinja2 is easier to read.)

  3. (More parochial concern) I am trying to maintain a mapping to DVC. In this regard, if/when you implement looping in your declarative solution, assign (force assignment of) real identities to unrolled nodes (not just sequence numbers) to help with lifecycle management. This same issue comes up in Terraform.

  4. IMO generating intermediate files is OK. Perhaps put them in a hidden .kedro/ directory? An explicit compilation step only for checking & optimization.

Afterthought: you could probably write a Terraform extension that allows declaration in Terraform syntax and maintains correspondence with local YAML files. This requires a complex outside dependency, but it might be the simplest path to a sophisticated declarative way to do configuration.

@yetudada yetudada changed the title [KED-2724] Synthesis of user research when using configuration in Kedro Synthesis of user research when using configuration in Kedro Jul 27, 2022
@merelcht (Member) commented:

Configuration has had a complete overhaul with the new OmegaConfigLoader. The research gathered in this issue was used extensively for that work, but if configuration needs to be developed further, new insights will need to be collected.

@noklam (Contributor) commented Jan 12, 2024:

For reference, in 0.19 we have kedro catalog resolve, which is very close to what was proposed here as kedro compile. It allows users to see the materialised version of the catalog.

@datajoely (Contributor) commented:

I think we could go further here @noklam, but there is some overlap, yes.
