Double Machine Learning
Notation
We use capital letters for matrices and small letters for vectors. The treatment is denoted by \(x\), the outcome is denoted by \(y\), the covariate is denoted by \(v\), and other adjustment set variables are \(w\). Greek letters are for error terms.
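The double machine learning (DML) model [Chern2016] can be applied when all confounders of the treatment and outcome, i.e., the variables that simultaneously influence both, are observed. Let \(y\) be the outcome and \(x\) be the treatment. A DML model then solves the following causal effect estimation problem (CATE estimation):
\[\begin{split}y & = F(v) x + g(v, w) + \epsilon, \\ x & = h(v, w) + \eta,\end{split}\]
where \(F(v)\) is the CATE conditional on the covariate \(v\), \(g\) and \(h\) are nuisance functions, and \(\epsilon\) and \(\eta\) are error terms. Furthermore, to estimate \(F(v)\), we note that
\[y - \mathbb{E}[y|w, v] = F(v) (x - \mathbb{E}[x|w, v]) + \epsilon.\]
Thus by first estimating \(\mathbb{E}[y|w, v]\) and \(\mathbb{E}[x|w,v]\) as
\[\begin{split}\hat{y}(v, w) & = \mathbb{E}[y|w, v], \\ \hat{x}(v, w) & = \mathbb{E}[x|w, v],\end{split}\]
we can get a new dataset \((\tilde{y}, \tilde{x})\) where
\[\begin{split}\tilde{y} & = y - \hat{y}(v, w), \\ \tilde{x} & = x - \hat{x}(v, w),\end{split}\]
such that the relation between \(\tilde{y}\) and \(\tilde{x}\) is linear
\[\tilde{y} = F(v) \tilde{x} + \epsilon,\]
which can be modeled by a linear regression model.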
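As a minimal sketch of this residualization idea, the following simulates data with a constant effect and recovers it by regressing outcome residuals on treatment residuals. Everything here is illustrative and not part of YLearn: the data-generating process, the polynomial least-squares nuisance fits (standing in for arbitrary machine learning models), and the helper names `poly_features` and `fit_predict` are assumptions. No cross-fitting is used in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
theta_true = 2.0  # assumed constant effect, i.e. F(v) = 2 for every v

# Simulated data: w confounds both the treatment x and the outcome y.
w = rng.uniform(-2.0, 2.0, size=n)
x = w ** 2 + rng.normal(size=n)
y = theta_true * x + np.sin(w) + 0.5 * rng.normal(size=n)


def poly_features(w, degree=5):
    """Polynomial basis 1, w, ..., w**degree used by the nuisance fits."""
    return np.vander(w, degree + 1, increasing=True)


def fit_predict(features, target):
    """In-sample least-squares estimate of E[target | w]."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return features @ coef


features = poly_features(w)
y_tilde = y - fit_predict(features, y)  # outcome residual
x_tilde = x - fit_predict(features, x)  # treatment residual

# Linear regression of y_tilde on x_tilde without an intercept.
theta_hat = float(x_tilde @ y_tilde / (x_tilde @ x_tilde))
print(f"estimated effect: {theta_hat:.3f}")
```

Regressing `y` on `x` directly would be biased here because `w` drives both; removing the part of each variable explained by `w` leaves only the variation that identifies the effect.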
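On the other hand, in the current version, \(F(v)\) takes the form
\[F_{ij}(v) = \sum_k H_{ijk} \rho_k(v),\]
where \(H\) can be seen as a rank-3 tensor and \(\rho_k\) is a function of the covariate \(v\), e.g., \(\rho(v) = v\) in the simplest case. Therefore, writing \(g_i(v, w)\) for the part of the outcome that does not depend on the treatment, the outcome \(y\) can now be represented as
\[y_i = \sum_j F_{ij}(v) x_j + g_i(v, w) + \epsilon_i = \sum_j \sum_k H_{ijk} \rho_k(v) x_j + g_i(v, w) + \epsilon_i.\]
In this sense, the linear regression problem between \(\tilde{y}\) and \(\tilde{x}\) now becomes
\[\tilde{y}_i = \sum_j \sum_k H_{ijk} \rho_k(v) \tilde{x}_j + \epsilon_i.\]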
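The tensor form can be checked numerically. The sketch below, with made-up dimensions and random values (all names and shapes are assumptions for illustration), shows that contracting \(H\) with \(\rho(v)\) and the treatment residual gives the same result as an ordinary linear model whose features are the Kronecker product of the treatment residual and \(\rho(v)\).

```python
import numpy as np

rng = np.random.default_rng(0)
y_d, x_d, rho_d = 2, 3, 4  # assumed outcome, treatment and rho(v) dimensions

H = rng.normal(size=(y_d, x_d, rho_d))  # rank-3 coefficient tensor
rho_v = rng.normal(size=rho_d)          # rho(v) for one sample, e.g. rho(v) = v
x_tilde = rng.normal(size=x_d)          # treatment residual for the same sample

# F_ij(v) = sum_k H_ijk * rho_k(v)
F_v = np.einsum("ijk,k->ij", H, rho_v)

# Direct evaluation: sum_j F_ij(v) * x_tilde_j
y_direct = F_v @ x_tilde

# Same quantity as a linear model in the features kron(x_tilde, rho(v)):
# flatten H over its last two axes (j, k) to obtain the coefficient matrix.
features = np.kron(x_tilde, rho_v)  # shape (x_d * rho_d,)
y_linear = H.reshape(y_d, x_d * rho_d) @ features

print(np.allclose(y_direct, y_linear))
```

This is why a plain linear regression on the product features is enough to estimate every entry of \(H\).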
Implementation
In YLearn, we implement double machine learning following the algorithm described in [Chern2016]:
1. Let k (cf_fold in our class) be an int. Form a k-fold random partition {…, (train_data_i, test_data_i), …, (train_data_k, test_data_k)}.
2. For each i, train y_model and x_model on train_data_i, then use them to predict on test_data_i; the predictions are saved as \((\hat{y}_i, \hat{x}_i)\). All \((\hat{y}_i, \hat{x}_i)\) are combined to give the new dataset \((\hat{y}, \hat{x})\).
3. Define the differences
\[\begin{split}\tilde{y}& = y - \hat{y}, \\ \tilde{x}&= (x - \hat{x}) \otimes v.\end{split}\]
Then form the new dataset \((\tilde{y}, \tilde{x})\).
4. Perform linear regression on the dataset \((\tilde{y}, \tilde{x})\) and save the coefficients in a vector \(f\). The estimated CATE given \(v\) is then
\[f \cdot v.\]
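The four steps above can be sketched with NumPy alone. This is a simplified illustration under stated assumptions, not YLearn's implementation: the simulated data, the quadratic least-squares nuisance models (standing in for y_model and x_model), and the helpers `poly2` and `cross_fit_predict` are all hypothetical, and \(\rho(v) = v\) with a single continuous treatment is assumed.

```python
import numpy as np


def poly2(z):
    """Degree-2 polynomial features: intercept, linear and pairwise product terms."""
    n, d = z.shape
    cols = [np.ones(n)] + [z[:, a] for a in range(d)]
    cols += [z[:, a] * z[:, b] for a in range(d) for b in range(a, d)]
    return np.column_stack(cols)


def cross_fit_predict(features, target, folds):
    """Out-of-fold predictions: fit on train_data_i, predict on test_data_i."""
    pred = np.empty_like(target)
    for test_idx in folds:
        train_mask = np.ones(len(target), dtype=bool)
        train_mask[test_idx] = False
        coef, *_ = np.linalg.lstsq(features[train_mask], target[train_mask], rcond=None)
        pred[test_idx] = features[test_idx] @ coef
    return pred


rng = np.random.default_rng(0)
n, cf_fold = 3000, 3
w = rng.normal(size=(n, 1))          # adjustment variable
v = rng.normal(size=(n, 2))          # covariates
f_true = np.array([1.5, -0.5])       # assumed true coefficients, CATE = f . v
x = w[:, 0] + v[:, 0] + rng.normal(size=n)
y = (v @ f_true) * x + w[:, 0] - v[:, 1] + 0.5 * rng.normal(size=n)

# Step 1: random k-fold partition of the row indices.
folds = np.array_split(rng.permutation(n), cf_fold)

# Step 2: cross-fitted first-stage predictions of y and x from (w, v).
features = poly2(np.hstack([w, v]))
y_hat = cross_fit_predict(features, y, folds)
x_hat = cross_fit_predict(features, x, folds)

# Step 3: residuals; for a scalar treatment, (x - x_hat) (x) v is a row-wise product.
y_tilde = y - y_hat
x_tilde = (x - x_hat)[:, None] * v

# Step 4: linear regression without intercept; the CATE for each row is f . v.
f_hat, *_ = np.linalg.lstsq(x_tilde, y_tilde, rcond=None)
cate = v @ f_hat
print(f_hat)
```

Predicting each fold with models trained on the other folds keeps the first-stage errors independent of the rows they are subtracted from, which is the purpose of the cross-fitting step.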
Example
from sklearn.ensemble import RandomForestRegressor
from ylearn.exp_dataset.exp_data import single_continuous_treatment
from ylearn.estimator_model.double_ml import DoubleML
# build the dataset
train, val, treatment_effect = single_continuous_treatment()
# all columns except the last four form the adjustment set
adjustment = train.columns[:-4]
covariate = 'c_0'
outcome = 'outcome'
treatment = 'treatment'
# first-stage models for the treatment (x_model) and the outcome (y_model), cross-fitted with 3 folds
dml = DoubleML(x_model=RandomForestRegressor(), y_model=RandomForestRegressor(), cf_fold=3)
dml.fit(train, outcome, treatment, adjustment, covariate)
>>> 06-23 14:02:36 I ylearn.e.double_ml.py 684 - _fit_1st_stage: fitting x_model RandomForestRegressor
>>> 06-23 14:02:39 I ylearn.e.double_ml.py 690 - _fit_1st_stage: fitting y_model RandomForestRegressor
>>> DoubleML(x_model=RandomForestRegressor(), y_model=RandomForestRegressor(), yx_model=LinearRegression(), cf_fold=3)
Class Structures
- class ylearn.estimator_model.double_ml.DoubleML(x_model, y_model, yx_model=None, cf_fold=1, adjustment_transformer=None, covariate_transformer=None, random_state=2022, is_discrete_treatment=False, categories='auto')
- Parameters
x_model (estimator, optional) – Machine learning model for fitting x. Any such model should implement the fit() and predict() (also predict_proba() if x is discrete) methods.
y_model (estimator, optional) – The machine learning model trained to model the outcome. Any valid y_model should implement the fit() and predict() methods.
yx_model (estimator, optional) – Machine learning model for fitting the residual of y on the residual of x. Only linear regression models are supported in the current version.
cf_fold (int, default=1) – The number of folds for performing cross fitting in the first stage.
adjustment_transformer (transformer, optional, default=None) – Transformer for adjustment variables which can be used to generate new features of adjustment variables.
covariate_transformer (transformer, optional, default=None) – Transformer for covariate variables which can be used to generate new features of covariate variables.
random_state (int, default=2022) –
is_discrete_treatment (bool, default=False) – If the treatment variables are discrete, set this to True.
categories (str, optional, default='auto') –
- fit(data, outcome, treatment, adjustment=None, covariate=None, **kwargs)
Fit the DoubleML estimator model. Note that the training of a DML model has two stages, which are implemented in _fit_1st_stage() and _fit_2nd_stage().
- Parameters
data (pandas.DataFrame) – Training dataset for training the estimator.
outcome (list of str, optional) – Names of the outcome.
treatment (list of str, optional) – Names of the treatment.
adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness.
covariate (list of str, optional, default=None) – Names of the covariate.
- Returns
The fitted model
- Return type
an instance of DoubleML
- estimate(data=None, treat=None, control=None, quantity=None)
Estimate the causal effect of the type specified by quantity.
- Parameters
data (pandas.DataFrame, optional, default=None) – The test data on which the estimator evaluates the causal effect. If data is None, the estimator evaluates all quantities on the training data.
treat (float or numpy.ndarray, optional, default=None) – The value of the intended treatment. For a single discrete treatment, treat should be an int or str equal to one of the possible treatment values. For multiple discrete treatments, treat should be a list or an ndarray where treat[i] is the value of the i-th intended treatment; for example, array(['run', 'read']) means the first treatment takes the value 'run' and the second takes the value 'read'. For a continuous treatment, treat should be a float or an ndarray.
quantity (str, optional, default=None) –
Option for returned estimation result. The possible values of quantity include:
'CATE' : the estimator will evaluate the CATE;
'ATE' : the estimator will evaluate the ATE;
None : the estimator will evaluate the ITE or CITE.
control (float or numpy.ndarray, optional, default=None) – Specified in the same way as treat.
- Returns
The estimated causal effects
- Return type
ndarray
- effect_nji(data=None)
Calculate causal effects with different treatment values.
- Parameters
data (pandas.DataFrame, optional, default=None) – The test data on which the estimator evaluates the causal effect. If data is None, the estimator uses the training data.
- Returns
Causal effects with different treatment values.
- Return type
ndarray
- comp_transormer(x, categories='auto')
Transform the discrete treatment into one-hot vectors properly.
- Parameters
x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.
categories (str or list, optional, default='auto') –
- Returns
The transformed one-hot vectors.
- Return type
numpy.ndarray
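To illustrate what a one-hot transformation of a discrete treatment looks like, here is a minimal NumPy sketch. It is not the implementation of comp_transormer: the helper `one_hot_sketch` is hypothetical, it handles a single treatment column only, and it infers the categories from the data in sorted order, which is only an assumption about what categories='auto' might mean.

```python
import numpy as np


def one_hot_sketch(x):
    """One-hot encode a single discrete treatment column (illustrative helper)."""
    values = np.asarray(x).ravel()
    # categories are the sorted unique values; codes index into them per row
    categories, codes = np.unique(values, return_inverse=True)
    return categories, np.eye(len(categories))[codes]


# treatment array of shape (n, x_d) with n = 3 and x_d = 1
x = np.array([["run"], ["read"], ["run"]])
categories, encoded = one_hot_sketch(x)
print(categories)
print(encoded)
```

Each row of `encoded` has a single 1 in the column of its category, so a discrete treatment can enter the regression in the same way as a continuous one.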