Integrate clusters into the DoubleMLData class, Refactor data generators #338

Open
wants to merge 105 commits into
base: main

Conversation

JanTeichertKluge
Member

@JanTeichertKluge JanTeichertKluge commented Jun 17, 2025

Description

This pull request introduces updates to the doubleml library, focusing on refactoring the support for cluster data, improving modularity, and deprecating unused features. Key changes include the addition of cluster-related functionality, deprecation of time (t_col) and score/selection (s_col) variables, and updates to documentation and examples to reflect these changes.

Refactoring for Cluster Data:

  • Added support for cluster variables (cluster_cols) in the DoubleMLData class, including a new is_cluster_data flag to indicate cluster data usage.
  • Moved the methods and properties that handle cluster variables, such as _set_cluster_vars and cluster_vars, into DoubleMLData.
  • Deprecated the DoubleMLClusterData class, replacing it with DoubleMLData using is_cluster_data=True. Warnings are added to inform users about the planned removal in version 0.12.0.
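
The deprecation path described in this list can be sketched roughly as follows. This is a minimal, self-contained illustration and not the library's implementation: `SketchData` and `SketchClusterData` are hypothetical stand-ins for `DoubleMLData` and `DoubleMLClusterData`, and deriving the flag from whether `cluster_cols` is passed is an assumption made for the example.

```python
import warnings


class SketchData:
    """Hypothetical stand-in for DoubleMLData (not the real class)."""

    def __init__(self, columns, y_col, d_cols, cluster_cols=None):
        self.columns = list(columns)
        self.y_col = y_col
        self.d_cols = list(d_cols)
        # Assumption: passing cluster_cols is what marks the data as clustered.
        self.cluster_cols = list(cluster_cols) if cluster_cols is not None else None
        self.is_cluster_data = self.cluster_cols is not None


class SketchClusterData(SketchData):
    """Hypothetical stand-in for the deprecated DoubleMLClusterData."""

    def __init__(self, columns, y_col, d_cols, cluster_cols):
        # Warn on construction so existing code keeps working until removal.
        warnings.warn(
            "SketchClusterData is deprecated and scheduled for removal in "
            "version 0.12.0; use SketchData with cluster_cols instead.",
            FutureWarning,
            stacklevel=2,
        )
        super().__init__(columns, y_col, d_cols, cluster_cols=cluster_cols)


cols = ["y", "d", "x1", "firm", "year"]
plain = SketchData(cols, "y", ["d"])
clustered = SketchData(cols, "y", ["d"], cluster_cols=["firm", "year"])
```

Keeping the old class as a thin subclass that only warns and delegates means that both entry points share one code path for cluster handling during the deprecation window.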

Refactoring for Model Specific Data Backends:

  • Removed t_col (time variable) and s_col (score/selection variable) from DoubleMLData and related methods. They are now only relevant for the model-specific data backends used in, e.g., DoubleMLDID or DoubleMLSSM.
  • Updated the _data_summary_str method and other internal logic to exclude references to these deprecated variables.
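
One way to picture the split between a generic backend and model-specific backends is the sketch below. `BaseData` and `DIDLikeData` are hypothetical classes invented for this illustration; only the method name `_data_summary_str` and the column name `t_col` come from the description above, and the summary wording is made up.

```python
class BaseData:
    """Hypothetical generic backend: knows nothing about time or score columns."""

    def __init__(self, y_col, d_cols, x_cols):
        self.y_col = y_col
        self.d_cols = list(d_cols)
        self.x_cols = list(x_cols)

    def _data_summary_str(self):
        return (
            f"Outcome variable: {self.y_col}\n"
            f"Treatment variable(s): {self.d_cols}\n"
            f"Covariates: {self.x_cols}\n"
        )


class DIDLikeData(BaseData):
    """Hypothetical model-specific backend that owns the time column."""

    def __init__(self, y_col, d_cols, x_cols, t_col):
        super().__init__(y_col, d_cols, x_cols)
        self.t_col = t_col

    def _data_summary_str(self):
        # Extend the generic summary instead of special-casing it in the base class.
        return super()._data_summary_str() + f"Time variable: {self.t_col}\n"


base = BaseData("y", ["d"], ["x1", "x2"])
did = DIDLikeData("y", ["d"], ["x1", "x2"], t_col="t")
```

With this layout the base class never needs to check whether a time or score column is present; each specialized backend adds only what its model requires.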

Codebase Modularity:

  • Updated __init__.py files to include new data classes (DoubleMLDIDData, DoubleMLPanelData, DoubleMLRDDData, DoubleMLSSMData) and removed unused imports.
  • Refactored disjoint set checks to accommodate the new cluster_cols logic.
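
A disjoint-set check that also covers cluster columns could look roughly like this. The helper `check_disjoint_roles` and its error message are hypothetical and written only for illustration; the actual checks in the library may be structured differently.

```python
from itertools import combinations


def check_disjoint_roles(roles):
    """Raise ValueError if any column is assigned to more than one role.

    `roles` maps a role name (e.g. "x_cols") to a list of column names;
    roles set to None are skipped.
    """
    active = {name: set(cols) for name, cols in roles.items() if cols is not None}
    # Compare every pair of roles once and report the first overlap found.
    for (name_a, cols_a), (name_b, cols_b) in combinations(active.items(), 2):
        overlap = cols_a & cols_b
        if overlap:
            raise ValueError(
                f"Columns {sorted(overlap)} appear in both {name_a} and {name_b}."
            )


# Passes silently: no column is used in two roles.
check_disjoint_roles(
    {"y_col": ["y"], "d_cols": ["d"], "x_cols": ["x1", "x2"], "cluster_cols": ["firm"]}
)
```

Treating `cluster_cols` as just another entry in the role mapping keeps the check generic, so a further role only needs one more key.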

Refactoring of Data Generators / Fetch Methods

  • Moved model-specific data generators into the model-specific submodules.
  • Adjusted the imports and the documentation examples.

Documentation and Examples:

  • Updated examples in the documentation to use the new submodules, e.g. doubleml.plm.datasets dgp path.
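
For downstream code that has to run against versions before and after this move, a small import fallback is one option. The helper `import_first` below is hypothetical and not part of the library, and the generator name `make_plr_CCDDHNR2018` as well as the older path `doubleml.datasets` are assumptions used for illustration; only `doubleml.plm.datasets` is named in this description.

```python
import importlib


def import_first(name, module_paths):
    """Return attribute `name` from the first module in `module_paths` that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Module (or package) not available under this path; try the next one.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of {list(module_paths)}")


if __name__ == "__main__":
    try:
        # Assumed names: new submodule path first, then the pre-refactoring location.
        make_plr = import_first(
            "make_plr_CCDDHNR2018", ["doubleml.plm.datasets", "doubleml.datasets"]
        )
    except ImportError:
        make_plr = None  # doubleml is not installed in this environment
```

Listing the new path first means the fallback is only exercised on older installations.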

PR Checklist

  • The title of the pull request summarizes the changes made.
  • The PR contains a detailed description of all changes and additions.
  • References to related issues or PRs are added.
  • The code passes all (unit) tests.
  • Enhancements or new features are equipped with unit tests.
  • The changes adhere to the PEP8 standards.

SvenKlaassen and others added 30 commits June 2, 2025 14:20
@JanTeichertKluge JanTeichertKluge requested a review from Copilot July 2, 2025 13:14
@JanTeichertKluge JanTeichertKluge self-assigned this Jul 2, 2025
@JanTeichertKluge JanTeichertKluge linked an issue Jul 2, 2025 that may be closed by this pull request
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Copilot

This comment was marked as outdated.

JanTeichertKluge and others added 4 commits July 2, 2025 15:17
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
… with implicit (fall through) returns

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
JanTeichertKluge and others added 2 commits July 2, 2025 15:20
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
@JanTeichertKluge JanTeichertKluge changed the title from "305 feature request integrate clusters into the doublemldata class" to "Integrate clusters into the DoubleMLData class, Refactor data generators" Jul 2, 2025
@JanTeichertKluge JanTeichertKluge marked this pull request as ready for review July 3, 2025 07:46
@JanTeichertKluge JanTeichertKluge requested a review from Copilot July 3, 2025 07:46
@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the core DoubleMLData class to integrate multi‐way clustering support, deprecates the old DoubleMLClusterData, and overhauls how data backends and generators are organized. It also introduces first‐class support for DID, SSM, and RDD backends, updates all affected imports, and realigns the core DoubleML logic (string formatting, sample splitting, score dimensionality) to accommodate the new patterns.

  • Integrate cluster_cols into DoubleMLData, deprecate DoubleMLClusterData, add is_cluster_data flag
  • Move all model‐specific data generators into doubleml.plm.datasets and doubleml.irm.datasets, update return types
  • Extend DoubleML and related mixins to recognize new backends (DIDData, SSMData, RDDData), update __str__, sample splitting, and score shapes

Reviewed Changes

Copilot reviewed 137 out of 140 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| doubleml/utils/_check_return_types.py | Switched cluster‐data checks to use is_cluster_data and made score shapes parameterizable |
| doubleml/utils/_aliases.py | Added aliases for DoubleMLDIDData, DoubleMLPanelData, DoubleMLRDDData, DoubleMLSSMData |
| doubleml/rdd/rdd.py | Changed to require DoubleMLRDDData, renamed s_col to score_col, adjusted internal checks |
| doubleml/double_ml.py | Extended core DoubleML to detect new data backends, refactored __str__, sample splitting and array initialization |
| multiple doubleml/plm/datasets & doubleml/irm/datasets | Relocated and updated all dataset generators to new submodules with consistent import paths |

Comments suppressed due to low confidence (2)

doubleml/rdd/rdd.py:85

  • The example still uses parameter s for score; the new signature expects score_col or the keyword score. Please update the docstring to match the actual API and avoid confusion.
    >>> obj_dml_data = dml.DoubleMLRDDData.from_arrays(x=data_dict["X"], y=data_dict["Y"], d=data_dict["D"], s=data_dict["score"])

doubleml/utils/_check_return_types.py:17

  • The original assertion ensured the underlying data object was specifically the cluster type. Now only the flag is checked, which could mask misuse (e.g. a noncluster object with is_cluster_data=True). Consider combining this with an isinstance(..., DoubleMLData) check to maintain the original safety net.
        assert dml_obj._dml_data.is_cluster_data
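
The combined check suggested in this comment might be sketched as follows. `ClusterAwareData` is a hypothetical stand-in for `DoubleMLData`, and `assert_cluster_backend` is an illustrative helper name; neither is taken from the code base.

```python
class ClusterAwareData:
    """Hypothetical stand-in for DoubleMLData."""

    def __init__(self, cluster_cols=None):
        self.cluster_cols = cluster_cols
        self.is_cluster_data = cluster_cols is not None


def assert_cluster_backend(data):
    """Check both the type and the flag, as the review comment suggests."""
    # The type check guards against arbitrary objects that merely expose the flag.
    assert isinstance(data, ClusterAwareData), (
        f"unexpected data type: {type(data).__name__}"
    )
    assert data.is_cluster_data, "data object carries no cluster variables"


# Passes: correct type and cluster variables are present.
assert_cluster_backend(ClusterAwareData(cluster_cols=["firm"]))
```

Checking the type before the flag keeps the failure message specific: a wrong object type is reported as such and not as missing cluster variables.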

Development

Successfully merging this pull request may close these issues.

  • [Feature Request]: Integrate Clusters into the DoubleMLData Class
  • Rework the datasets module
2 participants