Why: An All-in-One Causal Learning API
Want to use YLearn in a much easier way? Try the all-in-one API Why!
Why is an API that encapsulates almost everything in YLearn, such as identifying causal effects and scoring a trained estimator model. It gives users a simple and efficient way to use the package: you can pass the only thing you have, the data, directly into Why and call its methods, rather than first learning concepts such as the adjustment set before finding the interesting information hidden in your data. Why is designed to cover the full pipeline of causal inference. Given data, it first tries to discover the causal graph if none is provided. It then attempts to find candidate treatment variables and identify the causal effects, trains a suitable estimator model to estimate those effects, and finally evaluates the policy to suggest the best option for each individual.
Example usages
In this chapter, we use the california_housing dataset to show how to use Why. We prepare the dataset with the code below:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
data = housing.frame
outcome = housing.target_names[0]
data[outcome] = housing.target
The variable data is our prepared dataset.
Fit Why with default settings
The simplest way to use Why is to create a Why instance with default settings and fit it with only the training data and the outcome name.
from ylearn import Why
why = Why()
why.fit(data, outcome)
print('identified treatment:',why.treatment_)
print('identified adjustment:',why.adjustment_)
print('identified covariate:',why.covariate_)
print('identified instrument:',why.instrument_)
print(why.causal_effect())
Outputs:
identified treatment: ['MedInc', 'HouseAge']
identified adjustment: None
identified covariate: ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
identified instrument: None
              mean       min       max       std
MedInc    0.411121 -0.198831  1.093134  0.064856
HouseAge -0.000385 -0.039162  0.114263  0.005845
Fit Why with customized treatments
We can pass the treatment argument to fit to specify which features to use as treatments.
from ylearn import Why
why = Why()
why.fit(data, outcome, treatment=['AveBedrms', ])
print('identified treatment:',why.treatment_)
print('identified adjustment:',why.adjustment_)
print('identified covariate:',why.covariate_)
print('identified instrument:',why.instrument_)
print(why.causal_effect())
Outputs:
identified treatment: ['AveBedrms']
identified adjustment: None
identified covariate: ['MedInc', 'HouseAge', 'AveRooms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
identified instrument: None
               mean       min        max       std
AveBedrms  0.197422 -0.748971  10.857963  0.169682
Identify treatment without fitting Why
We can call Why's identify method to identify the treatment, adjustment, covariate, and instrument without fitting it.
why = Why()
r=why.identify(data, outcome)
print('identified treatment:',r[0])
print('identified adjustment:',r[1])
print('identified covariate:',r[2])
print('identified instrument:',r[3])
Outputs:
identified treatment: ['MedInc', 'HouseAge']
identified adjustment: None
identified covariate: ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
identified instrument: None
Class Structures
- class ylearn._why.Why(discrete_outcome=None, discrete_treatment=None, identifier='auto', identifier_options=None, estimator='auto', estimator_options=None, random_state=None)
An all-in-one API for causal learning.
- Parameters
discrete_outcome (bool, default=None) – If True, force the outcome as discrete; if False, force the outcome as continuous; if None, inferred from the outcome.
discrete_treatment (bool, default=None) – If True, force the treatment variables as discrete; if False, force the treatment variables as continuous; if None, inferred from the first treatment.
identifier (str or Identifier, default='auto') – If str, available options: ‘auto’, ‘discovery’, ‘gcastle’, or ‘pgm’.
identifier_options (dict, optional, default=None) – Parameters (key-values) to initialize the identifier
estimator (str, optional, default='auto') – Name of a valid EstimatorModel. One can also pass an instance of a valid estimator model.
estimator_options (dict, optional, default=None) – Parameters (key-values) to initialize the estimator model
fn_cost (callable, optional, default=None) – Cost function, used to readjust the causal effect based on cost.
effect_name (str, default='effect') – The column name in the argument DataFrame passed to fn_cost. Effective when fn_cost is not None.
random_state (int, optional, default=None) – Random state seed
- feature_names_in_
list of feature names seen during fit
- outcome_
name of outcome
- treatment_
list of treatment names identified during fit
- adjustment_
list of adjustment names identified during fit
- covariate_
list of covariate names identified during fit
- instrument_
list of instrument names identified during fit
- identifier_
identifier object or None. Used to identify treatment/adjustment/covariate/instrument if they were not specified during fit
- y_encoder_
LabelEncoder object or None. Used to encode outcome if it is discrete.
- preprocessor_
Pipeline object to preprocess data during fit
- estimators_
estimators dict for each treatment where key is the treatment name and value is the EstimatorModel object
- fit(data, outcome, *, treatment=None, adjustment=None, covariate=None, instrument=None, treatment_count_limit=None, copy=True, **kwargs)
Fit the Why object, steps:
encode outcome if its dtype is not numeric
identify treatment and adjustment/covariate/instrument
encode treatment if discrete_treatment is True
preprocess data
fit causal estimators
- Parameters
data (pandas.DataFrame, required) – Training dataset.
outcome (str, required) – Name of the outcome.
treatment (str or list of str, optional) – Names of the treatments. If a str, it is split into a list on commas; if None, the treatments are identified by the identifier.
adjustment (list of str, optional, default=None) – Names of the adjustment. Identified by identifier if adjustment/covariate/instrument are all None.
covariate (list of str, optional, default=None) – Names of the covariate. Identified by identifier if adjustment/covariate/instrument are all None.
instrument (list of str, optional, default=None) – Names of the instrument. Identified by identifier if adjustment/covariate/instrument are all None.
treatment_count_limit (int, optional) – maximum treatment number, default min(5, 10% of total feature number).
copy (bool, default=True) – Set False to perform inplace transforming and avoid a copy of data.
- Returns
The fitted Why.
- Return type
instance of Why
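The treatment argument of fit accepts either a list of names or a single comma-separated string. The sketch below illustrates that normalization in plain Python. The helper normalize_names is hypothetical and is not part of YLearn; stripping whitespace around each name and dropping empty items are assumptions made for the illustration.

```python
def normalize_names(names):
    """Return a list of column names from a list, a comma-separated str, or None.

    Hypothetical helper for illustration only; not part of YLearn.
    """
    if names is None:
        return None  # None means "let the identifier decide"
    if isinstance(names, str):
        # Assumption: surrounding whitespace is ignored and empty items are dropped.
        return [n.strip() for n in names.split(",") if n.strip()]
    return list(names)


print(normalize_names("MedInc, HouseAge"))  # ['MedInc', 'HouseAge']
print(normalize_names(["AveBedrms"]))       # ['AveBedrms']
```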
- identify(data, outcome, *, treatment=None, adjustment=None, covariate=None, instrument=None, treatment_count_limit=None)
Identify treatment and adjustment/covariate/instrument without fitting Why.
- Parameters
data (pandas.DataFrame, required) – Training dataset.
outcome (str, required) – Name of the outcome.
treatment (str or list of str, optional) – Names of the treatments. If a str, it is split into a list on commas; if None, the treatments are identified by the identifier.
adjustment (list of str, optional, default=None) – Names of the adjustment. Identified by identifier if adjustment/covariate/instrument are all None.
covariate (list of str, optional, default=None) – Names of the covariate. Identified by identifier if adjustment/covariate/instrument are all None.
instrument (list of str, optional, default=None) – Names of the instrument. Identified by identifier if adjustment/covariate/instrument are all None.
treatment_count_limit (int, optional) – maximum treatment number, default min(5, 10% of the number of features).
- Returns
tuple of identified treatment, adjustment, covariate, instrument
- Return type
tuple
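Because identify returns a plain tuple, positional indexing such as r[0] can be hard to read. One option is to wrap the result in a named tuple. In this sketch, Identified is a hypothetical container, not a YLearn class, and the tuple r is a stand-in copied from the example output earlier in this chapter instead of a live call.

```python
from collections import namedtuple

# Hypothetical container; the field order follows the documented return value.
Identified = namedtuple(
    "Identified", ["treatment", "adjustment", "covariate", "instrument"]
)

# Stand-in for `why.identify(data, outcome)`, copied from the example output.
r = (
    ["MedInc", "HouseAge"],
    None,
    ["AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude"],
    None,
)

identified = Identified(*r)
print(identified.treatment)   # ['MedInc', 'HouseAge']
print(identified.instrument)  # None
```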
- causal_graph()
Get identified causal graph.
- Returns
Identified causal graph
- Return type
instance of CausalGraph
- causal_effect(test_data=None, treatment=None, treat=None, control=None, target_outcome=None, quantity='ATE', return_detail=False, **kwargs)
Estimate the causal effect.
- Parameters
test_data (pandas.DataFrame, optional) – The test data to evaluate the causal effect. If None, the training data is used.
treatment (str or list, optional) – Treatment names, should be subset of attribute treatment_, default all elements in attribute treatment_
treat (treatment value or list or ndarray or pandas.Series, default None) – In the case of single discrete treatment, treat should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, treat should be a list where treat[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, treat should be a float or a ndarray or pandas.Series, by default None
control (treatment value or list or ndarray or pandas.Series, default None) – This is similar to the cases of treat, by default None
target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.
quantity (str, optional, default 'ATE') – ‘ATE’ or ‘ITE’.
return_detail (bool, default False) – If True, return effect details in result.
kwargs (dict, optional) – Other options to call estimator.estimate().
- Returns
causal effect of each treatment. When quantity=’ATE’, the result DataFrame columns are:
mean: mean of causal effect,
min: minimum of causal effect,
max: maximum of causal effect,
std: standard deviation of causal effect,
detail (if return_detail is True): causal effect ndarray;
in the case of discrete treatment, the result DataFrame indices are a multiindex of (treatment name, treat_vs_control); in the case of continuous treatment, the result DataFrame indices are treatment names.
When quantity=’ITE’, the result DataFrame holds the individual causal effect of each treatment; in the case of discrete treatment, the result DataFrame columns are a multiindex of (treatment name, treat_vs_control); in the case of continuous treatment, the result DataFrame columns are treatment names.
- Return type
pandas.DataFrame
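The example outputs earlier in this chapter report mean, min, max, and std for each treatment. As a rough illustration of how such a row relates to per-individual effects, the sketch below summarizes a small list of made-up effect values. The helper summarize_effects is hypothetical, and the use of the population standard deviation is an assumption; YLearn may use a different convention.

```python
import statistics


def summarize_effects(effects):
    """Summarize per-individual effects into mean/min/max/std.

    Hypothetical helper for illustration only; not part of YLearn.
    """
    effects = list(effects)
    return {
        "mean": statistics.fmean(effects),
        "min": min(effects),
        "max": max(effects),
        # Assumption: population standard deviation (divide by n).
        "std": statistics.pstdev(effects),
    }


# Made-up individual effects, for illustration only.
summary = summarize_effects([0.2, 0.4, 0.6])
print(summary)
```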
- individual_causal_effect(test_data, control=None, target_outcome=None)
Estimate the causal effect for each individual.
- Parameters
test_data (pandas.DataFrame, required) – The test data to evaluate the causal effect.
control (treatment value or list or ndarray or pandas.Series, default None) – In the case of single discrete treatment, control should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, control should be a list where control[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the control value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, control should be a float or a ndarray or pandas.Series, by default None
target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.
- Returns
individual causal effect of each treatment. The result DataFrame columns are the treatment names; in the case of discrete treatment, the result DataFrame indices are a multiindex of (individual index in test_data, treatment name, treat_vs_control); in the case of continuous treatment, the result DataFrame indices are a multiindex of (individual index in test_data, treatment name).
- Return type
pandas.DataFrame
- whatif(test_data, new_value, treatment=None)
Get counterfactual predictions when treatment is changed to new_value from its observational counterpart.
- Parameters
test_data (pandas.DataFrame, required) – The test data to predict.
new_value (ndarray or pd.Series, required) – It should have the same length as test_data.
treatment (str, default None) – Treatment name. If str, it should be one of the fitted attribute treatment_. If None, the first element in the attribute treatment_ is used.
- Returns
The counterfactual prediction
- Return type
pandas.Series
- score(test_data=None, treat=None, control=None, scorer='auto')
Score the fitted estimator models.
- Parameters
test_data (pandas.DataFrame, required) – The test data to score.
treat (treatment value or list or ndarray or pandas.Series, default None) – In the case of single discrete treatment, treat should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, treat should be a list where treat[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, treat should be a float or a ndarray or pandas.Series, by default None
control (treatment value or list or ndarray or pandas.Series) – This is similar to the cases of treat, by default None
scorer (str, default 'auto') – Reserved.
- Returns
Score of the estimator models
- Return type
float
- policy_interpreter(test_data, treatment=None, control=None, target_outcome=None, **kwargs)
Get the policy interpreter.
- Parameters
test_data (pandas.DataFrame, required) – The test data to evaluate.
treatment (str or list, optional) – Treatment names, should contain one or two elements. Defaults to the first two elements in attribute treatment_.
control (treatment value or list or ndarray or pandas.Series) – In the case of single discrete treatment, control should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, control should be a list where control[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the control value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, control should be a float or a ndarray or pandas.Series, by default None
target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.
kwargs (dict) – options to initialize the PolicyInterpreter.
- Returns
The fitted instance of PolicyInterpreter.
- Return type
instance of PolicyInterpreter
- uplift_model(test_data, treatment=None, treat=None, control=None, target_outcome=None, name=None, random=None)
Get uplift model over one treatment.
- Parameters
test_data (pandas.DataFrame, required) – The test data to evaluate.
treatment (str or list, optional) – Treatment name. If str, it should be one of the fitted attribute treatment_. If None, the first element in the attribute treatment_ is used.
treat (treatment value, optional) – If None, the last element in the treatment encoder’s attribute classes_ is used.
control (treatment value, optional) – If None, the first element in the treatment encoder’s attribute classes_ is used.
target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.
name (str) – Lift name. If None, treat value is used.
random (str, default None) – Lift name for randomly generated data. If None, no random lift is generated.
- Returns
The fitted instance of UpliftModel.
- Return type
instance of UpliftModel
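The defaults for treat and control in uplift_model are taken from the last and first elements of the treatment encoder's classes_. The sketch below shows what that would mean if the encoder orders its classes the way scikit-learn's LabelEncoder does (sorted unique values). That ordering is an assumption here, and default_treat_control is a hypothetical helper, not a YLearn function.

```python
def default_treat_control(treatment_values):
    """Pick default (treat, control) from observed discrete treatment values.

    Hypothetical helper for illustration only; not part of YLearn.
    Assumption: classes are the sorted unique values, as with
    scikit-learn's LabelEncoder.classes_.
    """
    classes = sorted(set(treatment_values))
    if len(classes) < 2:
        raise ValueError("need at least two distinct treatment values")
    control = classes[0]   # first class is the default control
    treat = classes[-1]    # last class is the default treat
    return treat, control


print(default_treat_control(["b", "a", "c", "a"]))  # ('c', 'a')
```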
- plot_causal_graph()
Plot the causal graph.
- plot_policy_interpreter(test_data, treatment=None, control=None, **kwargs)
Plot the interpreter.
- Returns
The fitted instance of PolicyInterpreter.
- Return type
instance of PolicyInterpreter