Seeding DoubleML sampling #335
Hi, is it possible to set the random state of the entire DoubleML sampling in order to get reproducible estimates?
Replies: 1 comment 1 reply
Hi @chiara-fb,

Thank you for your question about the DoubleML package! Here’s an example of how to ensure reproducibility:

import numpy as np
import pandas as pd
import doubleml as dml
from doubleml.rdd.datasets import make_simple_rdd_data
from sklearn.linear_model import LinearRegression
np.random.seed(42)
data_dict = make_simple_rdd_data(n_obs=50, fuzzy=False)
cov_names = ['x' + str(i) for i in range(data_dict['X'].shape[1])]
df = pd.DataFrame(np.column_stack((data_dict['Y'], data_dict['D'], data_dict['score'], data_dict['X'])), columns=['y', 'd', 'score'] + cov_names)
dml_data = dml.DoubleMLData(df, y_col='y', d_cols='d', x_cols=cov_names, s_col='score')
ml_g = LinearRegression()
np.random.seed(2025)
rdflex_obj1 = dml.rdd.RDFlex(dml_data, ml_g, fuzzy=False, n_folds=2)
np.random.seed(2025)
rdflex_obj2 = dml.rdd.RDFlex(dml_data, ml_g, fuzzy=False, n_folds=2)
print(rdflex_obj1._smpls)
print(rdflex_obj2._smpls) # should be the same as rdflex_obj1._smpls
arrays_are_equal = all(
all(np.array_equal(arr1, arr2) for arr1, arr2 in zip(tuple1, tuple2))
for tuple1, tuple2 in zip(rdflex_obj1._smpls, rdflex_obj2._smpls)
)
print(f"Arrays are equal?: {arrays_are_equal}")

It should therefore be sufficient to set a seed before creating the model instance. With the same sample splitting and a deterministic learner (such as linear regression), the estimates are also reproducible:

res1 = rdflex_obj1.fit()
res2 = rdflex_obj2.fit()
print(res1)
print(res2)
print(f"Coefficients are equal?: {np.array_equal(res1.coef, res2.coef)}")

Best regards
Yes, it is possible to set a seed for reproducible sampling in DoubleML, including when using RDFlex. The key is to set the random seed before creating the RDFlex instance; the seed is not an argument of the data backend or of the model itself. DoubleML relies on the NumPy seed for reproducibility: DoubleML models such as RDFlex use the DoubleMLResampling class for sample splitting.
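The "seed first, then create the object" pattern can be wrapped in a small helper. The sketch below is only an illustration: `seeded` and `toy_split` are hypothetical names, and `toy_split` is a toy stand-in for sample splitting that draws from NumPy's global random state, not DoubleML's actual resampling code.

```python
import numpy as np


def toy_split(n_obs, n_folds):
    """Toy stand-in for sample splitting (NOT DoubleML's implementation).

    Shuffles the indices 0..n_obs-1 with NumPy's global random state
    and cuts them into n_folds groups.
    """
    shuffled = np.random.permutation(n_obs)
    return np.array_split(shuffled, n_folds)


def seeded(seed, factory):
    """Seed NumPy's global RNG, then call factory() and return its result."""
    np.random.seed(seed)
    return factory()


# Two calls with the same seed produce the same folds.
folds_a = seeded(2025, lambda: toy_split(10, 2))
folds_b = seeded(2025, lambda: toy_split(10, 2))

same_ab = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
print(f"Same seed, same folds?: {same_ab}")
```

Assuming the model draws its splits from the global NumPy state at construction time, as described above, the same helper could wrap the model creation, for example `seeded(2025, lambda: dml.rdd.RDFlex(dml_data, ml_g, fuzzy=False, n_folds=2))`.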