Seeding DoubleML sampling #335

chiara-fb · 2025-06-13T08:59:53Z

Hi,

Is it possible to set the random state of the entire DoubleML sampling, in order to get reproducible estimates?
I am using RDFlex and cannot find an obvious way to seed it.

Answered by JanTeichertKluge

JanTeichertKluge · 2025-06-13T12:16:35Z

Collaborator

Hi @chiara-fb,

Thank you for your question about the DoubleML package!
Yes, it is possible to set a seed for reproducible sampling in DoubleML, including when using RDFlex. The key is to set the random seed before creating the RDFlex instance; the seed is not an argument of the data backend or of the model itself. DoubleML relies on the NumPy seed for reproducibility, and DoubleML models such as RDFlex use the DoubleMLResampling class to draw their sample splits.

Here’s an example of how to ensure reproducibility:

import numpy as np
import pandas as pd
import doubleml as dml
from doubleml.rdd.datasets import make_simple_rdd_data
from sklearn.linear_model import LinearRegression

# Seed NumPy before generating the data so the simulated sample is reproducible
np.random.seed(42)
data_dict = make_simple_rdd_data(n_obs=50, fuzzy=False)
cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]
df = pd.DataFrame(
    np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])),
    columns=['y', 'd', 'score'] + cov_names,
)
dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')

ml_g = LinearRegression()

# The sample splits are drawn when the instance is created,
# so re-seed immediately before each constructor call
np.random.seed(2025)
rdflex_obj1 = dml.rdd.RDFlex(dml_data, ml_g, fuzzy=False, n_folds=2)
np.random.seed(2025)
rdflex_obj2 = dml.rdd.RDFlex(dml_data, ml_g, fuzzy=False, n_folds=2)

print(rdflex_obj1._smpls)
print(rdflex_obj2._smpls)  # should be the same as rdflex_obj1._smpls

# Compare the train and test index arrays of every fold in every repetition
arrays_are_equal = all(
    all(np.array_equal(arr1, arr2) for arr1, arr2 in zip(tuple1, tuple2))
    for tuple1, tuple2 in zip(rdflex_obj1._smpls, rdflex_obj2._smpls)
)

print(f"Arrays are equal?: {arrays_are_equal}")

Outputs:

[[(array([ 0,  3,  4,  6,  7, 10, 13, 16, 18, 20, 21, 22, 24, 26, 27, 28, 30,
       32, 40, 43, 44, 45, 46, 47, 48]), array([ 1,  2,  5,  8,  9, 11, 12, 14, 15, 17, 19, 23, 25, 29, 31, 33, 34,
       35, 36, 37, 38, 39, 41, 42, 49])), (array([ 1,  2,  5,  8,  9, 11, 12, 14, 15, 17, 19, 23, 25, 29, 31, 33, 34,
       35, 36, 37, 38, 39, 41, 42, 49]), array([ 0,  3,  4,  6,  7, 10, 13, 16, 18, 20, 21, 22, 24, 26, 27, 28, 30,
       32, 40, 43, 44, 45, 46, 47, 48]))]]
[[(array([ 0,  3,  4,  6,  7, 10, 13, 16, 18, 20, 21, 22, 24, 26, 27, 28, 30,
       32, 40, 43, 44, 45, 46, 47, 48]), array([ 1,  2,  5,  8,  9, 11, 12, 14, 15, 17, 19, 23, 25, 29, 31, 33, 34,
       35, 36, 37, 38, 39, 41, 42, 49])), (array([ 1,  2,  5,  8,  9, 11, 12, 14, 15, 17, 19, 23, 25, 29, 31, 33, 34,
       35, 36, 37, 38, 39, 41, 42, 49]), array([ 0,  3,  4,  6,  7, 10, 13, 16, 18, 20, 21, 22, 24, 26, 27, 28, 30,
       32, 40, 43, 44, 45, 46, 47, 48]))]]
Arrays are equal?: True
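
Calling np.random.seed changes the global NumPy state for everything that runs afterwards. If that is a concern, the seeding can be scoped to the constructor call. The following is only a sketch: `temporary_seed` and `draw_split` are hypothetical helpers (not part of DoubleML), and `draw_split` merely stands in for any code that draws from the global NumPy RNG, such as creating a model that samples its folds on construction.

```python
from contextlib import contextmanager

import numpy as np


@contextmanager
def temporary_seed(seed):
    """Seed NumPy's global RNG inside the block, then restore the previous state."""
    state = np.random.get_state()  # remember the caller's RNG state
    np.random.seed(seed)
    try:
        yield
    finally:
        np.random.set_state(state)  # undo the seeding for the code that follows


def draw_split(n_obs):
    # Hypothetical stand-in for code that draws from the global NumPy RNG,
    # e.g. constructing a model that samples its folds on creation.
    return np.random.permutation(n_obs)


with temporary_seed(2025):
    split1 = draw_split(50)
with temporary_seed(2025):
    split2 = draw_split(50)

print(f"Splits are equal?: {np.array_equal(split1, split2)}")
```

In the example above, the `dml.rdd.RDFlex(...)` call would take the place of `draw_split(50)` inside the `with` block, assuming the splits are drawn from the global NumPy RNG at construction as described.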

Setting a seed immediately before creating the model instance should therefore be sufficient. With the same splitting and a deterministic learner (such as linear regression), the estimates are also reproducible:

res1 = rdflex_obj1.fit()
res2 = rdflex_obj2.fit()
print(res1)
print(res2)
print(f"Coefficients are equal?: {np.array_equal(res1.coef, res2.coef)}")

Outputs:

Method             Coef.     S.E.     t-stat       P>|t|           95% CI
-------------------------------------------------------------------------
Conventional      -2.957    5.113     -0.578   5.630e-01  [-12.978, 7.064]
Robust                 -        -     -0.518   6.043e-01  [-16.181, 9.413]
Design Type:        Sharp
Cutoff:             0
First Stage Kernel: triangular
Final Bandwidth:    [0.56280546]
Method             Coef.     S.E.     t-stat       P>|t|           95% CI
-------------------------------------------------------------------------
Conventional      -2.957    5.113     -0.578   5.630e-01  [-12.978, 7.064]
Robust                 -        -     -0.518   6.043e-01  [-16.181, 9.413]
Design Type:        Sharp
Cutoff:             0
First Stage Kernel: triangular
Final Bandwidth:    [0.56280546]
Coefficients are equal?: True
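
The result above depends on the learner being deterministic. For a learner with internal randomness, one option is to fix the learner's own `random_state` as well. The sketch below assumes scikit-learn's RandomForestRegressor as an illustrative learner on synthetic data; it is not run through RDFlex here, and `fit_predict` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data from a local generator (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=100)


def fit_predict(random_state):
    # Fixing random_state makes the forest's bootstrap and feature sampling repeatable
    model = RandomForestRegressor(n_estimators=20, random_state=random_state)
    return model.fit(X, y).predict(X)


# Two fits with the same random_state give identical predictions
same = np.array_equal(fit_predict(0), fit_predict(0))
print(f"Predictions are equal?: {same}")
```

Such a learner, created with a fixed `random_state`, could then be passed in place of `ml_g` in the example above.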

Best regards
Jan

1 reply

chiara-fb Jun 13, 2025
Author

Thank you for the support!
