Causal Forest
In [Athey2018], the authors argued that applying the local centering technique, i.e., first regressing out the outcome and the treatment respectively (the so-called double machine learning framework), can further improve the performance of the Generalized Random Forest (GRF). In YLearn, we implement the class CausalForest to support this technique. We illustrate its usage in the following example.
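The snippet below is a minimal usage sketch rather than an excerpt from YLearn's own examples: the synthetic data, the column names, and the choice of scikit-learn estimators for x_model and y_model are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from ylearn.estimator_model import CausalForest

# Illustrative synthetic data: binary treatment 'x', outcome 'y', covariates 'c0' and 'c1'.
rng = np.random.default_rng(2022)
n = 2000
c0, c1 = rng.normal(size=n), rng.normal(size=n)
x = rng.binomial(1, 1 / (1 + np.exp(-c0)))              # treatment probability depends on c0
y = 2.0 * x * c1 + c0 + rng.normal(scale=0.1, size=n)   # effect of x is heterogeneous in c1
data = pd.DataFrame({"c0": c0, "c1": c1, "x": x, "y": y})

# The treatment is discrete, so x_model must implement predict_proba();
# y_model only needs fit() and predict().
cf = CausalForest(
    x_model=RandomForestClassifier(),
    y_model=RandomForestRegressor(),
    cf_fold=1,
    n_estimators=100,
    random_state=2022,
)
cf.fit(data, outcome=["y"], treatment=["x"], covariate=["c0", "c1"])
effect = cf.estimate(data)  # estimated effect of 'x' on 'y' for every sample in data
```

Because is_discrete_treatment=True by default, the x_model above is a classifier exposing predict_proba(); for a continuous treatment a regressor would be the natural choice.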
Class Structures
- class ylearn.estimator_model.CausalForest(x_model, y_model, n_estimators=100, *, cf_fold=1, sub_sample_num=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, n_jobs=None, random_state=None, ccp_alpha=0.0, is_discrete_treatment=True, is_discrete_outcome=False, verbose=0, warm_start=False, honest_subsample_num=None, adjustment_transformer=None, covariate_transformer=None, proba_output=False)
- Parameters:
x_model (estimator, optional) – Machine learning model for fitting x. Any such model should implement the fit() and predict() (also predict_proba() if x is discrete) methods.
cf_fold (int, default=1) – The number of folds for performing cross fit in the first stage.
y_model (estimator, optional) – The machine learning model trained to model the outcome. Any valid y_model should implement the fit() and predict() methods.
n_estimators (int, default=100) – The number of trees to grow for the GRF.
sub_sample_num (int or float, default=None) –
The number of samples used to train each individual tree.
If a float is given, then sub_sample_num*n_samples samples will be sampled to train a single tree.
If an int is given, then sub_sample_num samples will be sampled to train a single tree.
max_depth (int, default=None) – The maximum depth that a single tree can reach. If None is given, then there is no limit on the depth of a single tree.
min_samples_split (int or float, default=2) –
The minimum number of samples required to split an internal node:
If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
min_samples_leaf (int or float, default=1) –
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features (int, float or {"sqrt", "log2"}, default=None) –
The number of features to consider when looking for the best split:
If int, then consider max_features features at each split.
If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
If “sqrt”, then max_features=sqrt(n_features).
If “log2”, then max_features=log2(n_features).
If None, then max_features=n_features.
random_state (int) – Controls the randomness of the estimator.
max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then there is an unlimited number of leaf nodes.
min_impurity_decrease (float, default=0.0) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), estimate(), and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
verbose (int, default=0) – Controls the verbosity when fitting and predicting.
honest_subsample_num (int or float, default=None) –
The number of samples used to train each individual tree in an honest manner. Typically, setting this value gives better performance.
If None is given, all sub_sample_num samples are used.
If a float is given, then honest_subsample_num*sub_sample_num samples will be used to train a single tree while the remaining (1 - honest_subsample_num)*sub_sample_num samples will be used to label the trained tree.
If an int is given, then honest_subsample_num samples will be sampled to train a single tree while the remaining sub_sample_num - honest_subsample_num samples will be used to label the trained tree.
adjustment_transformer (transformer, default=None) – Transformer of adjustment variables. This can be used to generate new features.
covariate_transformer (transformer, default=None) – Transformer of covariate variables. This can be used to generate new features.
proba_output (bool, default=False) – Whether to estimate the probability of the outcome if it is a discrete one. If True, then the given y_model must have the method predict_proba().
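As a further illustration of the subsampling and honesty parameters described above, a hedged configuration sketch might look like the following; the specific values are arbitrary, not recommended defaults.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from ylearn.estimator_model import CausalForest

# Arbitrary illustrative values: each tree is grown on 50% of the data
# (sub_sample_num=0.5); within that subsample, 50% (honest_subsample_num=0.5)
# is used to place the splits and the remaining 50% only labels the leaves.
cf = CausalForest(
    x_model=RandomForestClassifier(),
    y_model=RandomForestRegressor(),
    n_estimators=500,
    sub_sample_num=0.5,
    honest_subsample_num=0.5,
    min_samples_leaf=10,
    max_features="sqrt",
    n_jobs=-1,
    random_state=2022,
)
```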
- fit(data, outcome, treatment, adjustment=None, covariate=None, control=None)
Fit the model on data to estimate the causal effect. Note that when a discrete treatment is given and control is not specified explicitly, the first column of the treatment is automatically taken as the control while the other columns are treated as different treatment assignments. A usage sketch is given after the estimate() entry below.
- Parameters:
data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the CEInterpreter to fit.
outcome (list of str, optional) – Names of the outcomes.
treatment (list of str, optional) – Names of the treatments.
covariate (list of str, optional, default=None) – Names of the covariate vectors.
adjustment (list of str, optional, default=None) – This will be the same as the covariate.
sample_weight (ndarray, optional, default=None) – Weight of each sample of the training set.
control (str, optional, default=None) – The value of the treatment which will be taken as the control group when estimating the causal effect. If None is given, then the first column of the treatment will be the control.
- Returns:
Fitted GRForest
- Return type:
instance of GRForest
- estimate(data=None)
Estimate the causal effect of the treatment on the outcome in data.
- Parameters:
data (pandas.DataFrame, optional, default=None) – If None, data will be set as the training data.
- Returns:
The estimated causal effect.
- Return type:
ndarray or float, optional
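Below is a sketch of fit() and estimate() with a multi-valued discrete treatment. The data are synthetic, and passing the raw treatment value as control is an assumption based on the parameter description above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

from ylearn.estimator_model import CausalForest

# Hypothetical data with a three-valued discrete treatment coded 0 / 1 / 2.
rng = np.random.default_rng(0)
n = 3000
c0, c1 = rng.normal(size=n), rng.normal(size=n)
x = rng.integers(0, 3, size=n)
y = x * (1.0 + c1) + c0 + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({"c0": c0, "c1": c1, "x": x, "y": y})
train, test = train_test_split(data, test_size=0.3, random_state=0)

cf = CausalForest(
    x_model=RandomForestClassifier(),
    y_model=RandomForestRegressor(),
    random_state=0,
)
# Name the control level explicitly instead of relying on the default choice.
cf.fit(train, outcome=["y"], treatment=["x"], covariate=["c0", "c1"], control=0)

effect_train = cf.estimate()      # data=None -> effects for the training data
effect_test = cf.estimate(test)   # effects for previously unseen samples
```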
- apply(*, v)
Apply trees in the forest to v, return leaf indices.
- Parameters:
v (numpy.ndarray) – The input samples. Internally, its dtype will be converted to dtype=np.float32.
- Returns:
For each datapoint v_i in v and for each tree in the forest, return the index of the leaf v_i ends up in. A usage sketch appears at the end of this section.
- Return type:
v_leaves : array-like of shape (n_samples, )
- property feature_importance
- Returns:
Normalized total reduction of criteria by feature (Gini importance).
- Return type:
ndarray of shape (n_features,)
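Finally, a sketch of apply() and the feature_importance property on a fitted forest. The data are synthetic, and the assumption that the reported features correspond to the covariates passed to fit() is made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from ylearn.estimator_model import CausalForest

# Illustrative synthetic data: binary treatment, two covariates.
rng = np.random.default_rng(7)
n = 1000
c0, c1 = rng.normal(size=n), rng.normal(size=n)
x = rng.integers(0, 2, size=n)
y = x * c1 + c0 + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({"c0": c0, "c1": c1, "x": x, "y": y})

cf = CausalForest(
    x_model=RandomForestClassifier(),
    y_model=RandomForestRegressor(),
    random_state=7,
)
cf.fit(data, outcome=["y"], treatment=["x"], covariate=["c0", "c1"])

# Leaf indices of each covariate row; v is keyword-only and cast to float32 internally.
leaves = cf.apply(v=data[["c0", "c1"]].to_numpy())

# Normalized (Gini-style) importance of each covariate across the forest's splits.
importances = cf.feature_importance
print(dict(zip(["c0", "c1"], importances)))
```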