Generalized Random Forest

To adapt random forests to causal effect estimation, [Athey2018] proposed a generalized version named Generalized Random Forest (GRF), which alters the splitting criterion used to build each tree and introduces a new ensemble scheme to combine the trained trees. GRF can also be used for tasks such as quantile regression; in YLearn, we focus on its ability to perform highly flexible non-parametric causal effect estimation.

We now consider such estimation with GRF. Suppose that we observe samples \((X_i, Y_i, V_i) \in \mathbb{R}^{d_x} \times \mathbb{R} \times \mathbb{R}^{d_v}\), where \(Y\) is the outcome, \(X\) is the treatment, and \(V\) is the covariate vector that ensures the unconfoundedness condition. The forest weights \(\alpha_i(v)\) are defined by

\[\begin{split}\alpha_i^b(v) = \frac{\mathbb{I}\left( \left\{ V_i \in L^b(v) \right\} \right)}{|L^b(v)|},\\ \alpha_i(v) = \frac{1}{B} \sum_{b = 1}^B \alpha_i^b(v),\end{split}\]

where the superscript \(b\) refers to the \(b\)-th tree out of a total of \(B\) trees, \(L^b(v)\) is the leaf of the \(b\)-th tree that a sample with covariate \(v\) falls into, and \(|L^b(v)|\) denotes the number of training samples that fall into the same leaf as \(v\) in the \(b\)-th tree. The estimated causal effect can then be expressed as

\[\left( \sum_{i=1}^n \alpha_i(v)(X_i - \bar{X}_\alpha)(X_i - \bar{X}_\alpha)^T\right)^{-1} \sum_{i = 1}^n \alpha_i(v) (X_i - \bar{X}_\alpha)(Y_i - \bar{Y}_\alpha)\]

where \(\bar{X}_\alpha = \sum_i \alpha_i(v) X_i\) and \(\bar{Y}_\alpha = \sum_i \alpha_i(v) Y_i\).
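
To make the two formulas concrete, here is a minimal NumPy sketch (not YLearn's implementation) that computes the forest weights and the resulting local estimate. The leaf assignments train_leaves and target_leaves are hypothetical stand-ins for what a fitted forest would provide (e.g. via an apply-style method), and all data are randomly generated:

    import numpy as np

    rng = np.random.default_rng(0)
    n, B, d_x = 200, 10, 1                       # training samples, trees, treatment dimension
    # Hypothetical leaf assignments: train_leaves[i, b] is the leaf that training
    # sample i falls into in tree b; target_leaves[b] is the leaf of the query
    # point v in tree b.
    train_leaves = rng.integers(0, 8, size=(n, B))
    target_leaves = rng.integers(0, 8, size=B)
    X = rng.normal(size=(n, d_x))                # treatments X_i
    Y = rng.normal(size=n)                       # outcomes Y_i

    # alpha_i^b(v) = 1{V_i in L^b(v)} / |L^b(v)|, averaged over the B trees
    same_leaf = train_leaves == target_leaves    # shape (n, B)
    leaf_sizes = same_leaf.sum(axis=0)           # |L^b(v)| for each tree
    alpha = (same_leaf / np.maximum(leaf_sizes, 1)).mean(axis=1)   # alpha_i(v)

    # Weighted least squares solve for the local causal effect at v
    X_bar = alpha @ X                            # \bar{X}_alpha
    Y_bar = alpha @ Y                            # \bar{Y}_alpha
    Xc = X - X_bar
    A = (alpha[:, None] * Xc).T @ Xc             # sum_i alpha_i (X_i - Xbar)(X_i - Xbar)^T
    b = (alpha * (Y - Y_bar)) @ Xc               # sum_i alpha_i (X_i - Xbar)(Y_i - Ybar)
    theta_v = np.linalg.solve(A, b)              # estimated effect at v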

We now provide an example of applying the GRForest.
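
The sketch below shows one plausible way to use GRForest, based only on the constructor and the fit/estimate signatures documented later in this section; the synthetic data, column names, and particular parameter values are illustrative assumptions rather than a prescribed recipe:

    import numpy as np
    import pandas as pd
    from ylearn.estimator_model import GRForest

    # Synthetic data: a binary treatment x whose effect on y varies with v0.
    rng = np.random.default_rng(42)
    n = 2000
    v = rng.normal(size=(n, 5))                          # covariates
    x = rng.binomial(1, 0.5, size=n)                     # binary treatment
    y = (1 + 2 * v[:, 0]) * x + v[:, 1] + rng.normal(scale=0.5, size=n)
    data = pd.DataFrame(v, columns=[f"v{i}" for i in range(5)])
    data["x"], data["y"] = x, y

    grf = GRForest(
        n_estimators=100,
        sub_sample_num=0.5,              # each tree is grown on half of the rows
        min_samples_leaf=10,
        honest_subsample_num=0.5,        # honest splitting within each subsample
        is_discrete_treatment=True,
        random_state=42,
        n_jobs=-1,
    )
    grf.fit(
        data,
        outcome="y",
        treatment="x",
        covariate=[f"v{i}" for i in range(5)],
    )
    effect = grf.estimate(data)          # per-sample causal effect estimates
    print(effect[:5])

Since estimate(data=None) falls back to the training data, grf.estimate() would return the same estimates here.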

Besides GRForest, YLearn also implements a naive version of GRF in pure Python, written to be easy to follow so that users can see how GRF works at the code level. Note, however, that this naive version is very slow (roughly 5 minutes to fit 100 trees on a dataset with 2000 samples and 10 features). It can be found in the folder ylearn/estimator_model/_naive_forest/.

The full version of GRF is summarized as follows.

Class Structures

class ylearn.estimator_model.GRForest(n_estimators=100, *, sub_sample_num=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, n_jobs=None, random_state=None, ccp_alpha=0.0, is_discrete_treatment=True, is_discrete_outcome=False, verbose=0, warm_start=False, honest_subsample_num=None)
Parameters:
  • n_estimators (int, default=100) – The number of trees for growing the GRF.

  • sub_sample_num (int or float, default=None) –

    The number of samples to train each individual tree.

    • If a float is given, then sub_sample_num * n_samples samples are drawn to train each tree.

    • If an int is given, then sub_sample_num samples are drawn to train each tree.

  • max_depth (int, default=None) – The maximum depth that a single tree can reach. If None, there is no limit on the depth of a single tree.

  • min_samples_split (int or float, default=2) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=1.0) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int, default=None) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

  • min_impurity_decrease (float, default=0.0) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

  • n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), estimate(), and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • verbose (int, default=0) – Controls the verbosity when fitting and predicting.

  • honest_subsample_num (int or float, default=None) –

    The number of samples used to train each individual tree in an honest manner. Setting this value typically yields better performance; see the sketch after this parameter list for how the float forms resolve into sample counts.

    • If None, all sub_sample_num samples are used to train each tree.

    • If a float is given, then honest_subsample_num*sub_sample_num samples are used to grow a single tree while the remaining (1 - honest_subsample_num)*sub_sample_num samples are used to label the trained tree.

    • If an int is given, then honest_subsample_num samples are used to grow a single tree while the remaining sub_sample_num - honest_subsample_num samples are used to label the trained tree.
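
As referenced above, the following small arithmetic sketch (made-up numbers, not YLearn code) shows how the float forms of sub_sample_num and honest_subsample_num are described to resolve into per-tree sample counts:

    n_samples = 2000                      # size of the training set
    sub_sample_num = 0.5                  # float: fraction of all training samples
    n_sub = int(sub_sample_num * n_samples)          # 1000 rows drawn per tree

    honest_subsample_num = 0.45           # float: fraction of each tree's subsample
    n_grow = int(honest_subsample_num * n_sub)       # 450 rows grow the tree structure
    n_label = n_sub - n_grow                         # 550 rows label the trained tree
    print(n_sub, n_grow, n_label)                    # 1000 450 550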

fit(data, outcome, treatment, adjustment=None, covariate=None)

Fit the model on data to estimate the causal effect.

Parameters:
  • data (pandas.DataFrame) – The training data on which the model is fitted and the causal effects are estimated.

  • outcome (list of str, optional) – Names of the outcomes.

  • treatment (list of str, optional) – Names of the treatments.

  • covariate (list of str, optional, default=None) – Names of the covariate vectors.

  • adjustment (list of str, optional, default=None) – Names of the adjustment variables. These are treated the same as the covariate.

  • sample_weight (ndarray, optional, default=None) – Weight of each sample of the training set.

Returns:

Fitted GRForest

Return type:

instance of GRForest

estimate(data=None)

Estimate the causal effect of the treatment on the outcome in data.

Parameters:

data (pandas.DataFrame, optional, default=None) – If None, data will be set as the training data.

Returns:

The estimated causal effect.

Return type:

ndarray or float, optional

apply(*, v)

Apply trees in the forest to v, return leaf indices.

Parameters:

v (numpy.ndarray) – The input samples. Internally, its dtype will be converted to dtype=np.float32.

Returns:

For each datapoint v_i in v and for each tree in the forest, return the index of the leaf that v_i ends up in.

Return type:

v_leaves : array-like of shape (n_samples, n_estimators)

property feature_importance
Returns:

Normalized total reduction of criteria by feature (Gini importance).

Return type:

ndarray of shape (n_features,)