Integrate clusters into the `DoubleMLData` class, Refactor data generators #338
306 refactor data generators
Jan teichert kluge/issue272
Pull Request Overview
This PR refactors the core `DoubleMLData` class to integrate multi-way clustering support, deprecates the old `DoubleMLClusterData`, and overhauls how data backends and generators are organized. It also introduces first-class support for DID, SSM, and RDD backends, updates all affected imports, and realigns the core `DoubleML` logic (string formatting, sample splitting, score dimensionality) to accommodate the new patterns.
- Integrate `cluster_cols` into `DoubleMLData`, remove `DoubleMLClusterData`, add `is_cluster_data` flag
- Move all model-specific data generators into `doubleml.plm.datasets` and `doubleml.irm.datasets`, update return types
- Extend `DoubleML` and related mixins to recognize new backends (`DIDData`, `SSMData`, `RDDData`), update `__str__`, sample splitting, and score shapes
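The first bullet can be illustrated with a minimal, self-contained sketch. It deliberately does not import `doubleml`: `SketchData` and `SketchClusterData` are hypothetical stand-ins, and only the names `cluster_cols` and `is_cluster_data` are taken from this PR. The sketch assumes that the flag is derived from whether cluster columns were passed and that the legacy class survives as a thin subclass that emits a deprecation warning; the actual implementation may differ.

```python
import warnings


class SketchData:
    """Hypothetical stand-in for a data backend with optional clustering."""

    def __init__(self, columns, y_col, d_cols, cluster_cols=None):
        # Accept a single column name or a list of names.
        if isinstance(d_cols, str):
            d_cols = [d_cols]
        if isinstance(cluster_cols, str):
            cluster_cols = [cluster_cols]

        self.y_col = y_col
        self.d_cols = list(d_cols)
        self.cluster_cols = list(cluster_cols) if cluster_cols else None

        used = [y_col, *self.d_cols, *(self.cluster_cols or [])]
        missing = [c for c in used if c not in columns]
        if missing:
            raise ValueError(f"Columns not found: {missing}")

        # Assumption: every column not used elsewhere is a covariate.
        self.x_cols = [c for c in columns if c not in used]

    @property
    def is_cluster_data(self):
        # The flag is derived from the presence of cluster columns.
        return self.cluster_cols is not None


class SketchClusterData(SketchData):
    """Hypothetical legacy wrapper kept only for backwards compatibility."""

    def __init__(self, columns, y_col, d_cols, cluster_cols):
        warnings.warn(
            "SketchClusterData is deprecated; pass cluster_cols to SketchData.",
            FutureWarning,
            stacklevel=2,
        )
        super().__init__(columns, y_col, d_cols, cluster_cols=cluster_cols)


columns = ["y", "d", "x1", "x2", "firm", "year"]
clustered = SketchData(columns, "y", "d", cluster_cols=["firm", "year"])
print(clustered.is_cluster_data, clustered.x_cols)
```

With this layout, downstream code only needs to branch on `is_cluster_data` instead of checking for a dedicated cluster class.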
Reviewed Changes
Copilot reviewed 137 out of 140 changed files in this pull request and generated no comments.
File | Description |
---|---|
`doubleml/utils/_check_return_types.py` | Switched cluster-data checks to use `is_cluster_data` and made score shapes parameterizable |
`doubleml/utils/_aliases.py` | Added aliases for `DoubleMLDIDData`, `DoubleMLPanelData`, `DoubleMLRDDData`, `DoubleMLSSMData` |
`doubleml/rdd/rdd.py` | Changed to require `DoubleMLRDDData`, renamed `s_col` to `score_col`, adjusted internal checks |
`doubleml/double_ml.py` | Extended core `DoubleML` to detect new data backends, refactored `__str__`, sample splitting and array initialization |
multiple `doubleml/plm/datasets` & `doubleml/irm/datasets` | Relocated and updated all dataset generators to new submodules with consistent import paths |
Comments suppressed due to low confidence (2)
doubleml/rdd/rdd.py:85
- The example still uses parameter `s` for score; the new signature expects `score_col` or the keyword `score`. Please update the docstring to match the actual API and avoid confusion.
>>> obj_dml_data = dml.DoubleMLRDDData.from_arrays(x=data_dict["X"], y=data_dict["Y"], d=data_dict["D"], s=data_dict["score"])
doubleml/utils/_check_return_types.py:17
- The original assertion ensured the underlying data object was specifically the cluster type. Now only the flag is checked, which could mask misuse (e.g. a non-cluster object with `is_cluster_data=True`). Consider combining this with an `isinstance(..., DoubleMLData)` check to maintain the original safety net.
assert dml_obj._dml_data.is_cluster_data
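The combined check suggested in this comment could look roughly like the sketch below. `FakeData`, `FakeModel`, and `check_cluster_backend` are hypothetical names chosen so the block runs without `doubleml`; in the real test helper the expected type would be `DoubleMLData`, and the attribute `_dml_data` is taken from the assertion quoted above.

```python
class FakeData:
    """Hypothetical stand-in for the data backend class."""

    def __init__(self, is_cluster_data=False):
        self.is_cluster_data = is_cluster_data


class FakeModel:
    """Hypothetical stand-in for a model object holding a data backend."""

    def __init__(self, dml_data):
        self._dml_data = dml_data


def check_cluster_backend(dml_obj, expected_type=FakeData):
    """Check both the backend type and the cluster flag."""
    data = dml_obj._dml_data
    # Type check first, so an unrelated object carrying the flag is rejected.
    assert isinstance(data, expected_type), (
        f"expected {expected_type.__name__}, got {type(data).__name__}"
    )
    # Then check the flag itself.
    assert data.is_cluster_data, "data backend is not flagged as cluster data"


check_cluster_backend(FakeModel(FakeData(is_cluster_data=True)))
```

Keeping the two assertions separate makes a failure message say whether the type or the flag was wrong.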
Description
This pull request introduces updates to the `doubleml` library, focusing on refactoring the support for cluster data, improving modularity, and deprecating unused features. Key changes include the addition of cluster-related functionality, deprecation of time (`t_col`) and score/selection (`s_col`) variables, and updates to documentation and examples to reflect these changes.

Refactoring for Cluster Data:
- … (`cluster_cols`) in the `DoubleMLData` class, including a new `is_cluster_data` flag to indicate cluster data usage.
- … `_set_cluster_vars` and `cluster_vars`.
- … `DoubleMLClusterData` class, replacing it with `DoubleMLData` using `is_cluster_data=True`. Warnings are added to inform users about the planned removal in version 0.12.0.

Refactoring for Model Specific Data Backends:
- … `t_col` (time variable) and `s_col` (score/selection variable) from `DoubleMLData` and related methods, as they are no longer relevant except for the data backends used in e.g. `DoubleMLDID` or `DoubleMLSSM`.
- … `_data_summary_str` method and other internal logic to exclude references to these deprecated variables.

Codebase Modularity:
- … `__init__.py` files to include new data classes (`DoubleMLDIDData`, `DoubleMLPanelData`, `DoubleMLRDDData`, `DoubleMLSSMData`) and removed unused imports.
- … `cluster_cols` logic.

Refactoring of Data Generators / Fetch Methods

Documentation and Examples:
- … `doubleml.plm.datasets` dgp path.

PR Checklist