Integrate clusters into the `DoubleMLData` class, Refactor data generators #338

JanTeichertKluge · 2025-06-17T14:38:01Z

Description

This pull request introduces updates to the doubleml library, focusing on refactoring the support for cluster data, improving modularity, and deprecating unused features. Key changes include the addition of cluster-related functionality, deprecation of time (t_col) and score/selection (s_col) variables, and updates to documentation and examples to reflect these changes.

Refactoring for Cluster Data:

Added support for cluster variables (cluster_cols) in the DoubleMLData class, including a new is_cluster_data flag to indicate cluster data usage.
Moved the methods and properties that handle cluster variables, such as _set_cluster_vars and cluster_vars, into DoubleMLData.
Deprecated the DoubleMLClusterData class, replacing it with DoubleMLData using is_cluster_data=True. Warnings are added to inform users about the planned removal in version 0.12.0.
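
The deprecation pattern described above can be sketched roughly as follows. This is a minimal illustration with hypothetical stand-in classes (`SketchData`, `SketchClusterData`), not the actual doubleml implementation; the constructor arguments, the validation rule, and the warning text are all assumptions.

```python
import warnings


class SketchData:
    """Hypothetical stand-in for a data backend with optional cluster columns."""

    def __init__(self, columns, y_col, d_cols, cluster_cols=None, is_cluster_data=False):
        self.columns = list(columns)
        self.y_col = y_col
        self.d_cols = list(d_cols)
        self.is_cluster_data = is_cluster_data
        # Cluster columns are only stored when the flag is set (assumed behavior).
        self.cluster_cols = list(cluster_cols) if (is_cluster_data and cluster_cols) else None
        if is_cluster_data and not self.cluster_cols:
            raise ValueError("is_cluster_data=True requires at least one cluster column.")


class SketchClusterData(SketchData):
    """Deprecated wrapper: emits a warning, then delegates to the unified class."""

    def __init__(self, columns, y_col, d_cols, cluster_cols):
        warnings.warn(
            "SketchClusterData is deprecated; use SketchData(..., is_cluster_data=True).",
            FutureWarning,
            stacklevel=2,
        )
        super().__init__(columns, y_col, d_cols, cluster_cols=cluster_cols, is_cluster_data=True)
```

Keeping the old class as a thin subclass means existing code continues to work until the planned removal, while new code only needs the single unified class.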

Refactoring for Model Specific Data Backends:

Removed t_col (time variable) and s_col (score/selection variable) from DoubleMLData and related methods; they remain relevant only for the model-specific data backends used by, e.g., DoubleMLDID or DoubleMLSSM.
Updated the _data_summary_str method and other internal logic to exclude references to these deprecated variables.
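
One way to picture the split between a generic backend and model-specific backends is the sketch below. The classes `BaseData` and `ScoreData` and the exact summary wording are hypothetical and only illustrate the idea of a subclass adding its own column and extending the summary string; they are not the doubleml classes.

```python
class BaseData:
    """Hypothetical base backend: outcome, treatment, and covariates only."""

    def __init__(self, y_col, d_cols, x_cols):
        self.y_col = y_col
        self.d_cols = list(d_cols)
        self.x_cols = list(x_cols)

    def _data_summary_str(self):
        # The base summary does not mention time or score columns.
        return (
            f"Outcome variable: {self.y_col}\n"
            f"Treatment variable(s): {self.d_cols}\n"
            f"Covariates: {self.x_cols}\n"
        )


class ScoreData(BaseData):
    """Hypothetical model-specific backend that adds a score column."""

    def __init__(self, y_col, d_cols, x_cols, score_col):
        super().__init__(y_col, d_cols, x_cols)
        self.score_col = score_col

    def _data_summary_str(self):
        # Extend the base summary with the model-specific column.
        return super()._data_summary_str() + f"Score variable: {self.score_col}\n"
```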

Codebase Modularity:

Updated __init__.py files to include new data classes (DoubleMLDIDData, DoubleMLPanelData, DoubleMLRDDData, DoubleMLSSMData) and removed unused imports.
Refactored disjoint set checks to accommodate the new cluster_cols logic.
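
A disjoint set check that also covers cluster columns might look like the sketch below. The function name `check_disjoint_roles` and its keyword-based interface are assumptions made for illustration; the actual checks in doubleml may be structured differently.

```python
from itertools import combinations


def check_disjoint_roles(**roles):
    """Raise ValueError if any column name is assigned to more than one role.

    `roles` maps a role name (e.g. "x_cols") to a list of column names.
    Roles whose value is None are skipped.
    """
    present = {name: set(cols) for name, cols in roles.items() if cols is not None}
    # Compare every pair of roles and report the first overlap found.
    for (name_a, cols_a), (name_b, cols_b) in combinations(present.items(), 2):
        overlap = cols_a & cols_b
        if overlap:
            raise ValueError(
                f"Columns {sorted(overlap)} appear in both {name_a} and {name_b}."
            )
```

Passing `cluster_cols=None` for non-clustered data lets the same check serve both cases.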

Refactoring of Data Generators / Fetch Methods:

Moved model-specific data generators into the model-specific submodules.
Adjusted the imports and the documentation examples.

Documentation and Examples:

Updated examples in the documentation to use the new submodule import paths for the data generators, e.g. doubleml.plm.datasets.
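
For downstream code that has to run against both the old and the new module layout, a small import fallback can help. The helper below is a generic sketch using only the standard library; the name `import_first` is hypothetical and not part of doubleml.

```python
import importlib


def import_first(module_paths, attr_name):
    """Return `attr_name` from the first module in `module_paths` that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module path does not exist in this installation; try the next one.
            continue
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    raise ImportError(f"{attr_name!r} not found in any of {list(module_paths)}")
```

Assuming a generator such as `make_plr_CCDDHNR2018` is among those that moved, a call like `import_first(["doubleml.plm.datasets", "doubleml.datasets"], "make_plr_CCDDHNR2018")` would prefer the new location and fall back to the old one.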

PR Checklist

The title of the pull request summarizes the changes made.
The PR contains a detailed description of all changes and additions.
References to related issues or PRs are added.
The code passes all (unit) tests.
Enhancements or new features are equipped with unit tests.
The changes adhere to the PEP8 standards.

doubleml/data/did_data.py

doubleml/datasets/fetch_401K.py

doubleml/datasets/fetch_bonus.py

doubleml/did/datasets/dgp_did_SZ2020.py

doubleml/rdd/tests/test_rdd_exceptions.py

doubleml/rdd/tests/test_rdd_return_types.py

Copilot

Pull Request Overview

This PR refactors the core DoubleMLData class to integrate multi‐way clustering support, deprecates the old DoubleMLClusterData, and overhauls how data backends and generators are organized. It also introduces first‐class support for DID, SSM, and RDD backends, updates all affected imports, and realigns the core DoubleML logic (string formatting, sample splitting, score dimensionality) to accommodate the new patterns.

Integrate cluster_cols into DoubleMLData, deprecate DoubleMLClusterData, add is_cluster_data flag
Move all model‐specific data generators into doubleml.plm.datasets and doubleml.irm.datasets, update return types
Extend DoubleML and related mixins to recognize new backends (DIDData, SSMData, RDDData), update __str__, sample splitting, and score shapes

Reviewed Changes

Copilot reviewed 137 out of 140 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| doubleml/utils/_check_return_types.py | Switched cluster‐data checks to use `is_cluster_data` and made score shapes parameterizable |
| doubleml/utils/_aliases.py | Added aliases for `DoubleMLDIDData`, `DoubleMLPanelData`, `DoubleMLRDDData`, `DoubleMLSSMData` |
| doubleml/rdd/rdd.py | Changed to require `DoubleMLRDDData`, renamed `s_col` to `score_col`, adjusted internal checks |
| doubleml/double_ml.py | Extended core `DoubleML` to detect new data backends, refactored `__str__`, sample splitting and array initialization |
| multiple `doubleml/plm/datasets` & `doubleml/irm/datasets` | Relocated and updated all dataset generators to new submodules with consistent import paths |

Comments suppressed due to low confidence (2)

doubleml/rdd/rdd.py:85

The example still uses parameter s for score; the new signature expects score_col or the keyword score. Please update the docstring to match the actual API and avoid confusion.

    >>> obj_dml_data = dml.DoubleMLRDDData.from_arrays(x=data_dict["X"], y=data_dict["Y"], d=data_dict["D"], s=data_dict["score"])

doubleml/utils/_check_return_types.py:17

The original assertion ensured the underlying data object was specifically the cluster type. Now only the flag is checked, which could mask misuse (e.g. a noncluster object with is_cluster_data=True). Consider combining this with an isinstance(..., DoubleMLData) check to maintain the original safety net.

        assert dml_obj._dml_data.is_cluster_data
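
The combined check suggested in this comment could be sketched as follows. `SketchBackend` and `check_cluster_backend` are hypothetical names used only for illustration; in the real test helper the expected type would be the library's own data class.

```python
class SketchBackend:
    """Hypothetical data backend exposing an `is_cluster_data` flag."""

    def __init__(self, is_cluster_data=False):
        self.is_cluster_data = is_cluster_data


def check_cluster_backend(data, expected_type=SketchBackend):
    """Return True only if `data` has the expected type and the cluster flag is set."""
    # The isinstance check guards against unrelated objects that merely carry the flag.
    return isinstance(data, expected_type) and bool(getattr(data, "is_cluster_data", False))
```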

SvenKlaassen and others added 30 commits June 2, 2025 14:20

add a cross-sectional dgp

ac858cd

add simple test cases for cross sectional dgp

10e532e

reset index for in panel data

c96605d

add basic did_cs_binary version with simple tests

61dbf11

add internal attribute _score_dim to DoubleML class

ceebc6e

check prediction size based on internal n_obs

ade3b9a

update score dimensions init in the cs object

f113e61

Refactor Data Generators #306

d65edf8

update tests acc. to Refactor Data Generators #306

56d832c

update docstrings acc. to Refactor Data Generators #306

02adb24

update docstrings acc. to Refactor Data Generators #306

39d4e7e

update irm submod tests acc. to Refactor Data Generators #306

83cfe9c

update irm submod tests acc. to Refactor Data Generators #306

3ff0edb

update irm submod tests acc. to Refactor Data Generators #306

caa530e

update docstrings acc. to Refactor Data Generators #306

4cb9148

update docstrings acc. to Refactor Data Generators #306

312f601

update docstrings acc. to Refactor Data Generators #306

0d07790

update documentations acc. to Refactor Data Generators #306

8b4f4bc

update tests acc. to Refactor Data Generators #306

5c44395

Merge pull request #331 from DoubleML/306-refactor-data-generators

6fa737c

306 refactor data generators

Merge pull request #332 from DoubleML/JanTeichertKluge/issue272

cada753

Jan teichert kluge/issue272

upd

a9f4284

upd

a2566cb

update lambda and p calculation in did_cs

9ef4e53

add _score_dim property to doubleml class

e90441b

upd 305

eb19efe

update data backends

97abdd8

add _n_obs_sample_splitting property to doubleml class

9f6f5d4

some progress on refactoring the data backends.

b96a839

update check_resampling input

eb951c4

github-advanced-security bot found potential problems Jun 17, 2025

JanTeichertKluge added 9 commits June 17, 2025 16:46

minor changes in high lvl unit tests

cbb3818

minor changes in high lvl unit tests

fb4f440

fix rdd unit tests

cc5a110

fix exception unit test

80a890e

fix unit tests for cluster variables (kwd arg instead of positional arg)

0207b67

update checks for correct data backend type

987f8b3

adjust unit tests

45b1c35

adjust unit tests

7c27750

adjust unit tests

025b75e

JanTeichertKluge requested a review from Copilot July 2, 2025 13:14

JanTeichertKluge self-assigned this Jul 2, 2025

JanTeichertKluge linked an issue Jul 2, 2025 that may be closed by this pull request

Rework the datasets module #272

Open

Potential fix for code scanning alert no. 419: Unused import

270ed20

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

JanTeichertKluge and others added 4 commits July 2, 2025 15:17

Potential fix for code scanning alert no. 414: Unused local variable

c129395

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Potential fix for code scanning alert no. 415: Unused local variable

a76d4a7

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Potential fix for code scanning alert no. 421: Explicit returns mixed…

4b9a81b

… with implicit (fall through) returns Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Update doubleml/utils/_check_return_types.py

1ffcbc6

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

github-advanced-security bot found potential problems Jul 2, 2025

doubleml/data/did_data.py: Fixed

JanTeichertKluge and others added 2 commits July 2, 2025 15:20

Potential fix for code scanning alert no. 424: Unused import

7a531bf

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

formatting issues

ca8377c

JanTeichertKluge changed the title ~~305 feature request integrate clusters into the doublemldata class~~ Integrate clusters into the DoubleMLData class, Refactor data generators Jul 2, 2025

JanTeichertKluge marked this pull request as ready for review July 3, 2025 07:46

JanTeichertKluge requested a review from Copilot July 3, 2025 07:46

Copilot AI reviewed Jul 3, 2025

JanTeichertKluge added the refactoring label Jul 3, 2025

JanTeichertKluge modified the milestones: Release 0.12.0, Release 0.11.0 Jul 3, 2025
