Seeding DoubleML sampling #335
Hi, is it possible to set the random state of the entire DoubleML sampling in order to get reproducible estimates?
Hi @chiara-fb,

Thank you for your question about the DoubleML package! Here’s an example of how to ensure reproducibility:

import numpy as np
import pandas as pd
import doubleml as dml
from doubleml.rdd.datasets import make_simple_rdd_data
from sklearn.linear_model import LinearRegression
np.random.seed(42)
data_dict = make_simple_rdd_data(n_obs=50, fuzzy=False)
cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]
df = pd.DataFrame(np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])), columns=['y', 'd', 'score'] + cov_names)
dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')
ml_g = LinearRegression()
np.random.seed(2025)
rdflex_obj1 = dml.rdd.RDFlex(dml_data, ml_g, fuzzy=False, n_folds=2)
np.random.seed(2025)
rdflex_obj2 = dml.rdd.RDFlex(dml_data, ml_g, fuzzy=False, n_folds=2)
print(rdflex_obj1._smpls)
print(rdflex_obj2._smpls) # should be the same as rdflex_obj1._smpls
arrays_are_equal = all(
    all(np.array_equal(arr1, arr2) for arr1, arr2 in zip(tuple1, tuple2))
    for tuple1, tuple2 in zip(rdflex_obj1._smpls, rdflex_obj2._smpls)
)
print(f"Arrays are equal?: {arrays_are_equal}")

Outputs:

It should therefore be sufficient to set a seed before creating the model instance. With the same sample splitting and a deterministic learner (such as linear regression), the estimates are also reproducible:

res1 = rdflex_obj1.fit()
res2 = rdflex_obj2.fit()
print(res1)
print(res2)
print(f"Coefficients are equal?: {np.array_equal(res1.coef, res2.coef)}")

Outputs:

Best regards

Yes, it is possible to set a seed for reproducible sampling in DoubleML, including when using RDFlex. The key is to set the random seed before creating the RDFlex instance; the seed is not an argument of the data backend or of the model itself. DoubleML relies on the NumPy seed for reproducibility: DoubleML models such as RDFlex use the DoubleMLResampling class for sample splitting.
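
The reason a global NumPy seed is enough can be illustrated without DoubleML. The sketch below assumes that DoubleMLResampling hands the splitting to scikit-learn's K-fold splitters without passing a `random_state`, so the shuffle draws from NumPy's global random state. It uses plain `KFold` as a stand-in and is not DoubleML's actual code; `draw_splits` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold


def draw_splits(n_obs, n_folds=2):
    """Hypothetical helper: shuffled K-fold splits drawn from NumPy's global RNG."""
    # random_state is left unset on purpose (assumption about how the
    # resampling is configured), so the shuffle uses np.random's global state.
    kfold = KFold(n_splits=n_folds, shuffle=True)
    return [(train, test) for train, test in kfold.split(np.zeros(n_obs))]


np.random.seed(2025)
splits_a = draw_splits(n_obs=50)

np.random.seed(2025)  # same seed again -> same position in the random stream
splits_b = draw_splits(n_obs=50)

splits_c = draw_splits(n_obs=50)  # no re-seeding: continues the random stream

same_ab = all(
    np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
    for (tr_a, te_a), (tr_b, te_b) in zip(splits_a, splits_b)
)
same_bc = all(
    np.array_equal(te_b, te_c)
    for (_, te_b), (_, te_c) in zip(splits_b, splits_c)
)
print(f"Re-seeded splits are equal?: {same_ab}")
print(f"Splits without re-seeding are equal?: {same_bc}")
```

Under this assumption the splits only match when the seed is set immediately before they are drawn, which is why the seed call belongs directly in front of the model constructor rather than once at the top of a script.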