Causal Forest

In [Athey2018], the authors argue that the performance of the Generalized Random Forest (GRF) can be further improved by applying the local centering technique, i.e., by first regressing out the outcome and the treatment respectively, in the spirit of the so-called double machine learning framework. In YLearn, we implement the class CausalForest to support this technique. We illustrate its usage in the following example.
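
A minimal usage sketch follows. The CSV path, the column names, and the scikit-learn estimators chosen for x_model and y_model are illustrative assumptions rather than part of the YLearn API; the import assumes the class is available from ylearn.estimator_model, as its documented location above suggests.

    # A minimal sketch; the data file, column names, and model choices are
    # illustrative assumptions.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestClassifier

    from ylearn.estimator_model import CausalForest

    # hypothetical training data with outcome 'y', discrete treatment 't',
    # and covariates 'c0', 'c1', 'c2'
    data = pd.read_csv('train.csv')

    cf = CausalForest(
        x_model=RandomForestClassifier(),     # models the discrete treatment
        y_model=GradientBoostingRegressor(),  # models the outcome
        cf_fold=2,                            # cross-fitting folds in the first stage
        n_estimators=200,
        sub_sample_num=0.8,                   # fraction of samples drawn per tree
        honest_subsample_num=0.5,             # honest split of each tree's subsample
        random_state=2022,
    )

    cf.fit(data, outcome=['y'], treatment=['t'], covariate=['c0', 'c1', 'c2'])
    effect = cf.estimate()                    # causal effect on the training data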

Class Structures

class ylearn.estimator_model.CausalForest(x_model, y_model, n_estimators=100, *, cf_fold=1, sub_sample_num=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, n_jobs=None, random_state=None, ccp_alpha=0.0, is_discrete_treatment=True, is_discrete_outcome=False, verbose=0, warm_start=False, honest_subsample_num=None, adjustment_transformer=None, covariate_transformer=None, proba_output=False)
Parameters:
  • x_model (estimator, optional) – The machine learning model used to model the treatment x. Any such model should implement the fit() and predict() methods (and predict_proba() if x is discrete).

  • cf_fold (int, default=1) – The number of folds used for cross fitting in the first stage.

  • y_model (estimator, optional) – The machine learning model trained to model the outcome. Any valid y_model should implement the fit() and predict() methods.

  • n_estimators (int, default=100) – The number of trees for growing the GRF.

  • sub_sample_num (int or float, default=None) –

    The number of samples used to train each individual tree.

    • If a float is given, then sub_sample_num*n_samples samples are drawn to train a single tree.

    • If an int is given, then sub_sample_num samples are drawn to train a single tree.

  • max_depth (int, default=None) – The maximum depth that a single tree can reach. If None, there is no limit on the depth of a single tree.

  • min_samples_split (int or float, default=2) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=1.0) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

  • min_impurity_decrease (float, default=0.0) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

  • n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), estimate(), and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • verbose (int, default=0) – Controls the verbosity when fitting and predicting.

  • honest_subsample_num (int or float, default=None) –

    The number of samples used to train each individual tree in an honest manner. Setting this value typically gives better performance (the resulting sample counts are worked through in the sketch after this parameter list).

    • If None, all sub_sample_num samples are used.

    • If a float is given, then honest_subsample_num*sub_sample_num samples are used to train a single tree while the remaining (1 - honest_subsample_num)*sub_sample_num samples are used to label the trained tree.

    • If an int is given, then honest_subsample_num samples are used to train a single tree while the remaining sub_sample_num - honest_subsample_num samples are used to label the trained tree.

  • adjustment_transformer (transformer, default=None) – Transformer of the adjustment variables. This can be used to generate new features.

  • covariate_transformer (transformer, default=None) – Transformer of the covariate variables. This can be used to generate new features.

  • proba_output (bool, default=False) – Whether to estimate the probability of the outcome when it is discrete. If True, the given y_model must implement the predict_proba() method.
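
To make the float and int forms of sub_sample_num and honest_subsample_num concrete, the arithmetic below works through the resulting sample counts. The numbers are purely illustrative and no YLearn internals are called.

    # Illustrative arithmetic only, mirroring the parameter descriptions above.
    n_samples = 10_000

    # float form: fractions
    sub_sample_num = 0.8              # 0.8 * 10_000 = 8_000 samples drawn per tree
    honest_subsample_num = 0.5
    per_tree = int(sub_sample_num * n_samples)        # 8000
    grow = int(honest_subsample_num * per_tree)       # 4000 used to grow the tree
    label = per_tree - grow                           # 4000 used to label its leaves

    # int form: absolute counts
    sub_sample_num = 5_000            # 5_000 samples drawn per tree
    honest_subsample_num = 3_000      # 3_000 grow the tree, 2_000 label it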

fit(data, outcome, treatment, adjustment=None, covariate=None, control=None)

Fit the model on data to estimate the causal effect. Note that for a discrete treatment, if control is not specified explicitly, the first value of the treatment is automatically taken as the control while the remaining values are treated as different treatment assignments.

Parameters:
  • data (pandas.DataFrame) – The input samples on which the model is fitted and the causal effects are estimated.

  • outcome (list of str, optional) – Names of the outcomes.

  • treatment (list of str, optional) – Names of the treatments.

  • covariate (list of str, optional, default=None) – Names of the covariate vectors.

  • adjustment (list of str, optional, default=None) – Names of the adjustment variables. These are treated the same as the covariate.

  • sample_weight (ndarray, optional, default=None) – Weight of each sample of the training set.

  • control (str, optional, default=None) – The value of the treatment that will be regarded as the control group when estimating the causal effect. If None, the first value of the treatment is used as the control.

Returns:

Fitted GRForest

Return type:

instance of GRForest
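
When the treatment is discrete and control is omitted, the first treatment value serves as the control; the snippet below, which assumes hypothetical treatment labels, pins the control group explicitly.

    # Hypothetical treatment column 't' taking the values 'placebo' and 'drug'.
    cf.fit(
        data,
        outcome=['y'],
        treatment=['t'],
        covariate=['c0', 'c1', 'c2'],
        control='placebo',   # 'drug' then forms the treated group
    )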

estimate(data=None)

Estimate the causal effect of the treatment on the outcome in data.

Parameters:

data (pandas.DataFrame, optional, default=None) – The test data used to estimate the causal effect. If None, the training data is used.

Returns:

The estimated causal effect.

Return type:

ndarray or float, optional
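
For illustration, effects can be estimated either on the training data or on a held-out DataFrame containing the same covariate columns; test_data below is hypothetical.

    train_effect = cf.estimate()               # data=None: use the training data
    test_effect = cf.estimate(data=test_data)  # hypothetical held-out DataFrame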

apply(*, v)

Apply trees in the forest to v, return leaf indices.

Parameters:

v (numpy.ndarray) – The input samples. Internally, its dtype will be converted to dtype=np.float32.

Returns:

For each datapoint v_i in v and for each tree in the forest, return the index of the leaf v_i ends up in.

Return type:

v_leaves : array-like of shape (n_samples, n_estimators)
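
A short sketch of apply(); treating v as the matrix of covariate columns used during fitting is an assumption made here for illustration.

    import numpy as np

    # Assumption: v holds the covariate columns used in fit().
    v = data[['c0', 'c1', 'c2']].to_numpy(dtype=np.float32)
    leaves = cf.apply(v=v)   # leaf index of every sample in every tree
    print(leaves.shape)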

property feature_importance
Returns:

Normalized total reduction of criteria by feature (Gini importance).

Return type:

ndarray of shape (n_features,)
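
The importances can be read off the fitted forest; pairing them with the covariate names passed to fit() is an assumption made for readability.

    # feature_importance is a property, so no parentheses are needed.
    importances = cf.feature_importance
    for name, score in zip(['c0', 'c1', 'c2'], importances):
        print(f'{name}: {score:.3f}')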