Welcome to YLearn’s documentation!

YLearn, a pun on “learn why”, is a Python package for causal learning that supports various aspects of causal inference, including causal effect identification, estimation, and causal graph discovery.

User Guide

Overview of YLearn and Causal Inference

Machine learning has made great achievements in recent years. The areas in which machine learning succeeds are mainly those of prediction, e.g., the classification of pictures of cats and dogs. However, machine learning is incapable of answering some questions that naturally arise in many scenarios. One example is the counterfactual question in policy evaluation: what would have happened if the policy had changed? Because such counterfactuals can never be observed, machine learning models, being prediction tools, cannot be used to answer them. These limitations of machine learning partly explain the growing application of causal inference today.

Causal inference directly models the outcomes of interventions and formalizes counterfactual reasoning. With the aid of machine learning, causal inference can now draw causal conclusions from observational data in various ways, rather than relying on carefully designed experiments.

A typical, complete causal inference procedure is composed of three parts. First, it learns causal relationships using a technique called causal discovery. These relationships are then expressed either as Structural Causal Models or as Directed Acyclic Graphs (DAGs). Second, it expresses the causal estimands, which are determined by the causal questions of interest such as the average treatment effect, in terms of the observed data. This process is known as identification. Finally, once the causal estimand is identified, causal inference proceeds to estimate it from observational data. Policy evaluation problems and counterfactual questions can then also be answered.

YLearn, equipped with many techniques developed in the recent literature, supports the whole causal inference pipeline from causal discovery to causal estimand estimation with the help of machine learning. This is especially promising when abundant observational data are available.

Quick Start

In this part, we first show several simple example usages of YLearn, covering the most common functionalities. Then we present a case study with Why to unveil the hidden causal relations in data.

Example usages

We present several necessary example usages of YLearn in this section, covering the definition of a causal graph, the identification of causal effects, and the training of an estimator model, etc. Please see their specific documentation for more details.

  1. Representation of causal graph

    Given a set of variables, representing their causal graph in YLearn requires a python dict that encodes the causal relations of the variables: each key of the dict is a child of every element in the corresponding value, which should usually be a list of names of variables. For instance, in the simplest case, for the causal graph \(X \leftarrow W \rightarrow Y\), we first define a python dict of the causal relations, which is then passed to CausalGraph as a parameter:

    from ylearn.causal_model.graph import CausalGraph

    causation = {'X': ['W'], 'W': [], 'Y': ['W']}
    cg = CausalGraph(causation=causation)
    

    cg will be the causal graph encoding the causal relation \(X \leftarrow W \rightarrow Y\) in YLearn. If there exist unobserved confounders in the causal graph, then, aside from the observed variables, we should also pass a python list containing these latent confounding relations.
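    For example, an unobserved confounder between X and Y can be encoded with the latent_confounding_arcs parameter of CausalGraph (a minimal sketch; the arc ('X', 'Y') is illustrative):

    # X <-> Y denotes a latent confounding arc between X and Y
    cg_unob = CausalGraph(
        causation=causation,
        latent_confounding_arcs=[('X', 'Y')],
    )

    See Causal Graph for more details.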

  2. Identification of causal effect

    It is crucial to identify the causal effect before we estimate it from data. The first step in identifying the causal effect is specifying the causal estimand. This can be easily done in YLearn. For instance, suppose that we are interested in identifying the causal estimand \(P(Y|do(X=x))\) in the causal graph cg; then we should first define an instance of CausalModel and call the identify() method:

    from ylearn.causal_model import CausalModel

    cm = CausalModel(causal_graph=cg)
    cm.identify(treatment={'X'}, outcome={'Y'}, identify_method=('backdoor', 'simple'))
    

    where we use the backdoor adjustment method here. YLearn also supports front-door adjustment, finding instrumental variables, and, most importantly, the general identification method developed in [Pearl], which is able to identify any causal effect whenever it is identifiable.

  3. Estimation of causal effect

    The estimation of causal effects in YLearn is also fairly easy. It follows the common approach of deploying a machine learning model, since YLearn focuses on the intersection of machine learning and causal inference in this part. Given a dataset, one can apply any EstimatorModel in YLearn with a procedure composed of 3 steps (a minimal sketch follows the list):

    • Given data in the form of pandas.DataFrame, find the names of the treatment, outcome, adjustment, and covariate.

    • Call fit() method of EstimatorModel to train the model.

    • Call estimate() method of EstimatorModel to estimate causal effects in test data.

    See Estimator Model: Estimating the Causal Effects for more details.
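    A minimal sketch of this procedure, using DoubleML as the estimator model (the synthetic data-generating process and column names below are illustrative assumptions):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from ylearn.estimator_model.double_ml import DoubleML

    # Synthetic data: v1 confounds the continuous treatment x and the outcome y
    rng = np.random.default_rng(2022)
    n = 1000
    v = rng.normal(size=(n, 2))
    x = v[:, 0] + rng.normal(size=n)
    y = 2.0 * x + v[:, 0] + rng.normal(size=n)  # true effect of x on y is 2
    data = pd.DataFrame({'x': x, 'y': y, 'v1': v[:, 0], 'v2': v[:, 1]})

    # Step 1: name the treatment, outcome, and covariate columns
    # Step 2: call fit() to train the estimator model
    dml = DoubleML(x_model=RandomForestRegressor(), y_model=RandomForestRegressor())
    dml.fit(data, outcome='y', treatment='x', covariate=['v1', 'v2'])

    # Step 3: call estimate() to estimate the causal effect (here the ATE)
    print(dml.estimate(data=data, quantity='ATE'))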

  4. Using the all-in-one API: Why

    For the purpose of applying YLearn in a unified and easier manner, YLearn provides the API Why. Why is an API that encapsulates almost everything in YLearn, such as identifying causal effects and scoring a trained estimator model. To use Why, one should first create an instance of Why, which needs to be trained by calling its method fit(); after that, other utilities, such as causal_effect(), score(), and whatif(), can be used. This procedure is illustrated in the following code example:

    from sklearn.datasets import fetch_california_housing

    from ylearn import Why

    # Load the California housing dataset as a pandas.DataFrame
    housing = fetch_california_housing(as_frame=True)
    data = housing.frame
    outcome = housing.target_names[0]
    data[outcome] = housing.target

    # Fit Why on the data; the average numbers of bedrooms and rooms
    # are taken as the treatments
    why = Why()
    why.fit(data, outcome, treatment=['AveBedrms', 'AveRooms'])

    print(why.causal_effect())
    

API: Interacting with YLearn

All-in-one API

  • Why – An API that encapsulates almost everything in YLearn, such as identifying causal effects and scoring a trained estimator model. It provides users a simple and efficient way to use YLearn.

Causal Structures Discovery

  • CausalDiscovery – Find causal structures in observational data.

Causal Model

  • CausalGraph – Expresses causal structures and supports related operations on the causal graph, e.g., adding and deleting edges.

  • CausalModel – Encodes causations represented by a CausalGraph. Mainly supports causal effect identification, e.g., backdoor adjustment.

  • Prob – Represents a probability distribution.

Estimator Models

  • GRForest – A highly flexible nonparametric estimator (Generalized Random Forest, GRF) model which supports both discrete and continuous treatments. The unconfoundedness condition is required.

  • CausalForest – A generalized random forest combined with the local centering technique (i.e., the double machine learning framework). The unconfoundedness condition is required.

  • CTCausalForest – A causal forest built as an ensemble of CausalTrees. As with CausalTree, the treatment should be binary. The unconfoundedness condition is required.

  • ApproxBound – A model used for estimating the upper and lower bounds of the causal effects. This model does not need the unconfoundedness condition.

  • CausalTree – A class for estimating causal effects with a decision tree. The unconfoundedness condition is required.

  • DeepIV – Instrumental variables with deep neural networks. The names of the instrumental variables must be provided.

  • NP2SLS – Nonparametric instrumental variables. The names of the instrumental variables must be provided.

  • DoubleML – Double machine learning model for the estimation of CATE. The unconfoundedness condition is required.

  • DoublyRobust and PermutedDoublyRobust – Doubly robust method for the estimation of CATE. The permuted version considers all possible treatment-control pairs. The unconfoundedness condition is required and the treatment must be discrete.

  • SLearner and PermutedSLearner – SLearner with a single machine learning model. The permuted version considers all possible treatment-control pairs. The unconfoundedness condition is required and the treatment must be discrete.

  • TLearner and PermutedTLearner – TLearner with multiple machine learning models. The permuted version considers all possible treatment-control pairs. The unconfoundedness condition is required and the treatment must be discrete.

  • XLearner and PermutedXLearner – XLearner with multiple machine learning models. The permuted version considers all possible treatment-control pairs. The unconfoundedness condition is required and the treatment must be discrete.

  • RLoss – Effect score for measuring the performance of estimator models. The unconfoundedness condition is required.

Policy

  • PolicyTree – A class for finding the optimal policy for maximizing the causal effect with the tree model.

Interpreter

  • CEInterpreter – An object used to interpret the estimated CATE using the decision tree model.

  • PolicyInterpreter – An object used to interpret the policy given by some PolicyModel.

Causal Model: The Representation of Causal Structures

Causal Graph

This is a class for representing DAGs of causal structures.

Generally, for a set of variables \(V\), a variable \(V_i\) is said to be a cause of a variable \(V_j\) if \(V_j\) can change in response to changes in \(V_i\). In a DAG for causal structures, every parent is a direct cause of all its children. We refer to these DAGs for causal structures as causal graphs. For graph terminology, see, for example, Chapter 1.2 of [Pearl].

There are five basic structures composed of two or three nodes for building causal graphs. Besides these structures, there are flows of association and causation in causal graphs, described in the language of probability. Any two nodes \(X\) and \(Y\) connected by a flow of association are statistically dependent, i.e., \(P(X, Y) \neq P(X)P(Y)\). Let \(X, Y\), and \(W\) be three distinct nodes; the five basic structures are:

  1. chains:

\[X \rightarrow W \rightarrow Y,\]

\(X\) and \(Y\) are statistically dependent;

  2. forks:

\[X \leftarrow W \rightarrow Y,\]

\(X\) and \(Y\) are statistically dependent;

  3. colliders:

\[X \rightarrow W \leftarrow Y,\]

\(X\) and \(Y\) are statistically independent;

  4. two unconnected nodes:

\[X \quad Y,\]

\(X\) and \(Y\) are statistically independent;

  5. two connected nodes:

\[X \rightarrow Y,\]

\(X\) and \(Y\) are statistically dependent.

In YLearn, one can use CausalGraph to represent causal structures by giving a python dict where each key is a child of all elements in the corresponding value, which should usually be a list of str.
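For instance, a brief sketch combining the chain and collider structures above (the node names are illustrative; the parents() and ancestors() methods are documented below):

    from ylearn.causal_model.graph import CausalGraph

    # Each key is a child of every node in its value list
    causation = {
        'X': [],
        'W': ['X'],       # chain X -> W
        'Y': ['W'],       # chain W -> Y
        'C': ['X', 'Y'],  # collider X -> C <- Y
    }
    cg = CausalGraph(causation=causation)

    print(cg.parents('W'))      # direct parents of 'W'
    print(cg.ancestors({'Y'}))  # ancestors of 'Y' in the graph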

Class Structures

class ylearn.causal_model.graph.CausalGraph(causation, dag=None, latent_confounding_arcs=None)
Parameters:
  • causation (dict) – Descriptions of the causal structures where values are parents of the corresponding keys.

  • dag (networkx.MultiGraph, optional, default=None) – A known graph structure. If provided, dag must represent the causal structures stored in causation.

  • latent_confounding_arcs (set or list of tuple of two str, optional, default=None) – The two elements in each tuple are names of nodes in the graph between which there exists a latent confounding arc. Semi-Markovian graphs with unobserved confounders can be converted to graphs without unobserved variables, where one adds bi-directed latent confounding arcs to represent these relations. For example, the causal graph X <- U -> Y, where U is an unobserved confounder of X and Y, can be converted equivalently to X <-> Y, where <-> denotes a latent confounding arc.

ancestors(x)

Return the ancestors of all nodes in x.

Parameters:

x (set of str) – A set of nodes in the graph.

Returns:

Ancestors of nodes in x in the graph.

Return type:

set of str

descendents(x)

Return the descendents of all nodes in x.

Parameters:

x (set of str) – A set of nodes in the graph.

Returns:

Descendents of nodes in x in the graph.

Return type:

set of str

parents(x, only_observed=True)

Return the direct parents of the node x in the graph.

Parameters:
  • x (str) – Name of the node x.

  • only_observed (bool, default=True) – If True, only find the observed parents in the causal graph; otherwise, also include the unobserved variables.

Returns:

Parents of the node x in the graph

Return type:

list

add_nodes(nodes, new=False)

If not new, add all the nodes in nodes to the current CausalGraph; otherwise, create a new graph, add the nodes to it, and return it.

Parameters:
  • nodes (set or list) – Nodes to be added to the current causal graph.

  • new (bool, default=False) – If True, create and return a new graph.

Returns:

Modified causal graph

Return type:

instance of CausalGraph

add_edges_from(edge_list, new=False, observed=True)

Add edges to the causal graph.

Parameters:
  • edge_list (list) – Every element of the list contains two elements, the first for the parent (source) and the second for the child (target) of the edge.

  • new (bool, default=False) – If True, create and return a new graph.

  • observed (bool, default=True) – Add unobserved bidirected confounding arcs if not observed.

Returns:

Modified causal graph

Return type:

instance of CausalGraph

add_edge(edge_list, s, t, observed=True)

Add an edge from the node s to the node t in the causal graph.

Parameters:
  • s (str) – Source of the edge.

  • t (str) – Target of the edge.

  • observed (bool, default=True) – Add unobserved bidirected confounding arcs if not observed.

remove_nodes(nodes, new=True)

Remove all the nodes in nodes from the graph.

Parameters:
  • nodes (set or list) – Nodes to be removed.

  • new (bool, default=True) – If True, create a new graph, remove the nodes in that graph, and return it.

Returns:

Modified causal graph

Return type:

instance of CausalGraph

remove_edge(edge, observed=True)

Remove the edge in the CausalGraph. If not observed, remove the unobserved latent confounding arcs.

Parameters:
  • edge (tuple) – 2 elements denote the start and end of the edge, respectively.

  • observed (bool, default=True) – If not observed, remove the unobserved latent confounding arcs.

remove_edges_from(edge_list, new=False, observed=True)

Remove all the edges in edge_list from the graph.

Parameters:
  • edge_list (list) – list of edges to be removed.

  • new (bool, default=False) – If new, create a new CausalGraph and remove edges.

  • observed (bool, default=True) – Remove unobserved latent confounding arcs if not observed.

Returns:

Modified causal graph

Return type:

instance of CausalGraph

build_sub_graph(subset)

Return a new CausalGraph as the subgraph of the graph with nodes in the subset.

Parameters:

subset (set) – The set of nodes of the subgraph.

Returns:

Modified causal graph

Return type:

instance of CausalGraph

remove_incoming_edges(x, new=False)

Remove incoming edges of all nodes of x. If new, do this in the new CausalGraph.

Parameters:
  • x (set or list) – Nodes whose incoming edges are to be removed.

  • new (bool, default=False) – Return a new graph if set as True.

Returns:

Modified causal graph

Return type:

instance of CausalGraph

remove_outgoing_edges(x, new=False)

Remove outgoing edges of all nodes of x. If new, do this in the new CausalGraph.

Parameters:
  • x (set or list) – Nodes whose outgoing edges are to be removed.

  • new (bool, default=False) – Return a new graph if set as True.

Returns:

Modified causal graph

Return type:

instance of CausalGraph

property c_components

The C-components set of the graph.

Returns:

The C-components set of the graph.

Return type:

set of str

property observed_dag

Return the observed part of the graph, including observed nodes and edges between them.

Returns:

The observed part of the graph

Return type:

networkx.MultiGraph

property explicit_unob_var_dag

Build a new dag where all unobserved confounding arcs are replaced by explicit unobserved variables.

Returns:

Dag with explicit unobserved nodes

Return type:

networkx.MultiGraph

property topo_order

Return the topological order of the nodes in the observed graph.

Returns:

Nodes in the topological order

Return type:

generator

Causal Model

CausalModel is a core object for performing Identification and finding Instrumental Variables.

Before introducing the causal model, we should first clarify the definition of interventions. An intervention takes the whole population and applies some operation to every member. [Pearl] defined the \(do\)-operator to describe such operations. Probabilistic models cannot predict the effects of interventions, which leads to the need for causal models.

The formal definition of causal model is due to [Pearl]. A causal model is a triple

\[M = \left< U, V, F\right>\]

where

  • \(U\) is a set of exogenous variables, which are determined by factors outside the model;

  • \(V\) is a set of endogenous variables, each determined by variables in \(U \cup V\), and \(F\) is a set of functions such that

\[V_i = F_i(pa_i, U_i)\]

with \(pa_i \subset V \backslash V_i\).

For example, \(M = \left< U, V, F\right>\) is a causal model where

\[V = \{V_1, V_2\}, \quad U = \{U_1, U_2, I, J\}, \quad F = \{F_1, F_2\}\]

such that

\[\begin{split}V_1 & = F_1(I, U_1) = \theta_1 I + U_1\\ V_2 & = F_2(V_1, J, U_2) = \phi V_1 + \theta_2 J + U_2.\end{split}\]

Note that every causal model can be associated with a DAG and encodes the necessary information about the causal relationships between variables. YLearn uses CausalModel to represent a causal model and supports many operations related to it, such as Identification.

Identification

To characterize the effect of an intervention, one needs to consider the causal effect, a causal estimand that includes the \(do\)-operator. The procedure that converts causal effects into the corresponding statistical estimands is called identification and is implemented in CausalModel in YLearn. Note that not all causal effects can be converted into statistical estimands; we refer to such causal effects as not identifiable. We list several identification methods supported by CausalModel, with a short usage sketch below.
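A sketch of identification on the graph W -> X -> Y with W -> Y, so that W is a backdoor confounder of the effect of X on Y (the graph and names are illustrative):

    from ylearn.causal_model import CausalModel
    from ylearn.causal_model.graph import CausalGraph

    causation = {'W': [], 'X': ['W'], 'Y': ['W', 'X']}
    cm = CausalModel(causal_graph=CausalGraph(causation=causation))

    # Backdoor identification: returns the adjustment set and the encoded Prob
    adjustment, prob = cm.get_backdoor_set({'X'}, {'Y'}, adjust='simple')
    print(adjustment)

    # The general identification method for P(Y|do(X))
    print(cm.identify(treatment={'X'}, outcome={'Y'}, identify_method='general'))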

Class Structures

class ylearn.causal_model.CausalModel(causal_graph=None, data=None)
Parameters:
  • causal_graph (CausalGraph, optional, default=None) – An instance of CausalGraph which encodes the causal structures.

  • data (pandas.DataFrame, optional, default=None) – The data used to discover the causal structures if causal_graph is not provided.

id(y, x, prob=None, graph=None)

Identify the causal quantity \(P(y|do(x))\) if it is identifiable; otherwise raise an IdentificationError. Note that here we only consider semi-Markovian causal models, where each unobserved variable is a parent of exactly two nodes. This is because any causal model with unobserved variables can be converted to a semi-Markovian causal model encoding the same set of conditional independences.

Parameters:
  • y (set of str) – Set of names of outcomes.

  • x (set of str) – Set of names of treatments.

  • prob (Prob, optional, default=None) – Probability distribution encoded in the graph.

  • graph (CausalGraph) – CausalGraph encodes the information of corresponding causal structures.

Returns:

The probability distribution of the converted causal effect.

Return type:

Prob

Raises:

IdentificationError – If the causal effect of interest is not identifiable, then raise an IdentificationError.

is_valid_backdoor_set(set_, treatment, outcome)

Determine if a given set is a valid backdoor adjustment set for causal effect of treatments on the outcomes.

Parameters:
  • set (set) – The adjustment set.

  • treatment (set or list of str) – Names of the treatment. str is also acceptable for single treatment.

  • outcome (set or list of str) – Names of the outcome. str is also acceptable for single outcome.

Returns:

True if the given set is a valid backdoor adjustment set for the causal effect of treatment on outcome in the current causal graph.

Return type:

bool

get_backdoor_set(treatment, outcome, adjust='simple', print_info=False)

Return the backdoor adjustment set for the given treatment and outcome.

Parameters:
  • treatment (set or list of str) – Names of the treatment. str is also acceptable for single treatment.

  • outcome (set or list of str) – Names of the outcome. str is also acceptable for single outcome.

  • adjust (str) –

    Set style of the backdoor set. Available options are

    simple: directly return the parent set of treatment

    minimal: return the minimal backdoor adjustment set

    all: return all valid backdoor adjustment sets.

  • print_info (bool, default=False) – If True, print the identified results.

Returns:

The first element is the adjustment list, while the second is the encoded Prob.

Return type:

tuple of two element

Raises:

IdentificationError – Raised if the adjust style is not one of simple, minimal, or all, or if no set satisfies the backdoor criterion.

get_backdoor_path(treatment, outcome)

Return all backdoor paths connecting treatment and outcome.

Parameters:
  • treatment (str) – Name of the treatment.

  • outcome (str) – Name of the outcome

Returns:

A list containing all valid backdoor paths between the treatment and outcome in the graph.

Return type:

list

has_collider(path, backdoor_path=True)

If the path in the current graph has a collider, return True, else return False.

Parameters:
  • path (list of str) – A list containing nodes in the path.

  • backdoor_path (bool, default=True) – Whether the path is a backdoor path.

Returns:

True if the path has a collider.

Return type:

bool

is_connected_backdoor_path(path)

Test whether a backdoor path is connected.

Parameters:

path (list of str) – A list describing the path.

Returns:

True if path is a d-connected backdoor path and False otherwise.

Return type:

bool

is_frontdoor_set(set_, treatment, outcome)

Determine if the given set is a valid frontdoor adjustment set for the causal effect of treatment on outcome.

Parameters:
  • set (set) – The set to be tested as a valid front-door adjustment set.

  • treatment (str) – Name of the treatment.

  • outcome (str) – Name of the outcome.

Returns:

True if the given set is a valid frontdoor adjustment set for causal effects of treatments on outcomes.

Return type:

bool

get_frontdoor_set(treatment, outcome, adjust='simple')

Return the frontdoor set for adjusting the causal effect between treatment and outcome.

Parameters:
  • treatment (set of str or str) – Name of the treatment. Should contain only one element.

  • outcome (set of str or str) – Name of the outcome. Should contain only one element.

  • adjust (str, default='simple') –

    Available options include:

    ’simple’: Return the frontdoor set with the minimal number of elements.

    ’minimal’: Return the frontdoor set with the minimal number of elements.

    ’all’: Return all possible frontdoor sets.

Returns:

2 elements (adjustment_set, Prob)

Return type:

tuple

Raises:

IdentificationError – Raised if the adjust style is not one of simple, minimal, or all, or if no set satisfies the frontdoor criterion.

get_iv(treatment, outcome)

Find the instrumental variables for the causal effect of the treatment on the outcome.

Parameters:
  • treatment (iterable) – Name(s) of the treatment.

  • outcome (iterable) – Name(s) of the outcome.

Returns:

A valid instrumental variable set that will be an empty one if there is no such set.

Return type:

set

is_valid_iv(treatment, outcome, set_)

Determine whether a given set is a valid instrumental variable set.

Parameters:
  • treatment (iterable) – Name(s) of the treatment.

  • outcome (iterable) – Name(s) of the outcome.

  • set (set) – The set to be tested.

Returns:

True if the set is a valid instrumental variable set and False otherwise.

Return type:

bool

identify(treatment, outcome, identify_method='auto')

Identify the causal effect expression. Identification is an operation that converts any causal effect quantity, e.g., quantities with the do operator, into the corresponding statistical quantity such that it is then possible to estimate the causal effect in some given data. However, note that not all causal quantities are identifiable, in which case an IdentificationError will be raised.

Parameters:
  • treatment (set or list of str) – Set of names of treatments.

  • outcome (set or list of str) – Set of names of outcomes.

  • identify_method (tuple of str or str, optional, default='auto') –

    If the passed value is a tuple or list, then it should have two elements where the first one is for the identification methods and the second is for the returned set style.

    Available options:

    ’auto’ : Perform identification with all possible methods

    ’general’: The general identification method, see id()

    (‘backdoor’, ‘simple’): Return the set of all direct confounders of both treatments and outcomes as a backdoor adjustment set.

    (‘backdoor’, ‘minimal’): Return all possible backdoor adjustment sets with minimal number of elements.

    (‘backdoor’, ‘all’): Return all possible backdoor adjustment sets.

    (‘frontdoor’, ‘simple’): Return all possible frontdoor adjustment sets with minimal number of elements.

    (‘frontdoor’, ‘minimal’): Return all possible frontdoor adjustment sets with minimal number of elements.

    (‘frontdoor’, ‘all’): Return all possible frontdoor adjustment sets.

Returns:

A python dict where keys of the dict are identify methods while the values are the corresponding results.

Return type:

dict

Raises:

IdentificationError – If the causal effect is not identifiable or if the identify_method was not given properly.

estimate(estimator_model, data=None, *, treatment=None, outcome=None, adjustment=None, covariate=None, quantity=None, **kwargs)

Estimate the identified causal effect in a new dataset.

Parameters:
  • estimator_model (EstimatorModel) – Any suitable estimator models implemented in the EstimatorModel can be applied here.

  • data (pandas.DataFrame, optional, default=None) – The data set for causal effect to be estimated. If None, use the data which is used for discovering causal graph.

  • treatment (set or list, optional, default=None) – Names of the treatment. If None, the treatment used for backdoor adjustment will be taken as the treatment.

  • outcome (set or list, optional, default=None) – Names of the outcome. If None, the outcome used for backdoor adjustment will be taken as the outcome.

  • adjustment (set or list, optional, default=None) – Names of the adjustment set. If None, the adjustment set is given by the simplest backdoor set found by CausalModel.

  • covariate (set or list, optional, default=None) – Names of covariate set. Ignored if set as None.

  • quantity (str, optional, default=None) – The quantity of interest when evaluating causal effects.

Returns:

The estimated causal effect in data.

Return type:

np.ndarray or float

identify_estimate(data, outcome, treatment, estimator_model=None, quantity=None, identify_method='auto', **kwargs)

Combination of the identify method and the estimate method. However, since the currently implemented estimator models automatically assume (conditional) unconfoundedness (except for methods related to IV), we only consider using backdoor set adjustment to fulfill the unconfoundedness condition.

Parameters:
  • treatment (set or list of str, optional) – Set of names of treatments.

  • outcome (set or list of str, optional) – Set of names of outcome.

  • identify_method (tuple of str or str, optional, default='auto') –

    If the passed value is a tuple or list, then it should have two elements where the first one is for the identification methods and the second is for the returned set style.

    Available options:

    ’auto’ : Perform identification with all possible methods

    ’general’: The general identification method, see id()

    (‘backdoor’, ‘simple’): Return the set of all direct confounders of both treatments and outcomes as a backdoor adjustment set.

    (‘backdoor’, ‘minimal’): Return all possible backdoor adjustment sets with minimal number of elements.

    (‘backdoor’, ‘all’): Return all possible backdoor adjustment sets.

    (‘frontdoor’, ‘simple’): Return all possible frontdoor adjustment sets with minimal number of elements.

    (‘frontdoor’, ‘minimal’): Return all possible frontdoor adjustment sets with minimal number of elements.

    (‘frontdoor’, ‘all’): Return all possible frontdoor adjustment sets.

  • quantity (str, optional, default=None) – The quantity of interest when evaluating causal effects.

Returns:

The estimated causal effect in data.

Return type:

np.ndarray or float

Representation of Probability

To represent and modify probabilities such as

\[P(x, y|z),\]

one can define an instance of Prob and change its attributes.

class ylearn.causal_model.prob.Prob(variables=set(), conditional=set(), divisor=set(), marginal=set(), product=set())

Probability distribution, e.g., the probability expression

\[\sum_{w}P(v|y)[P(w|z)P(x|y)P(u)].\]

We will clarify below the meanings of our variables with this example.

Parameters:
  • variables (set, default=set()) – The variables (\(v\) in the above example) of the probability.

  • conditional (set, default=set()) – The conditional set (\(y\) in the above example) of the probability.

  • marginal (set, default=set()) – The sum set (\(w\) in the above example) for marginalizing the probability.

  • product (set, default=set()) – If not set(), then the probability is composed of the first probability object \((P(v|y))\) and several other probability objects that are all saved in the set product, e.g., product = {P1, P2, P3} where P1 stands for \(P(w|z)\), P2 for \(P(x|y)\), and P3 for \(P(u)\) in the above example.

parse()

Return the expression of the probability distribution.

Returns:

Expression of the encoded probability

Return type:

str

show_latex_expression()

Show the latex expression.
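As a short sketch, the probability \(P(x, y|z)\) introduced above can be encoded as follows:

    from ylearn.causal_model.prob import Prob

    # Encode P(x, y|z): the variables are x and y, the conditional set is {z}
    p = Prob(variables={'x', 'y'}, conditional={'z'})
    print(p.parse())            # text expression of the probability
    p.show_latex_expression()   # renders the latex expression, e.g., in a notebook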

For a set of variables \(V\), its causal structure can be represented by a directed acyclic graph (DAG), where each node corresponds to an element of \(V\) and each direct functional relationship among the corresponding variables is represented by a link in the DAG. A causal structure guides the precise specification of how each variable is influenced by its parents in the DAG. For instance, \(X \leftarrow W \rightarrow Y\) denotes that \(W\) is a parent, and thus also a common cause, of \(X\) and \(Y\). More specifically, for two distinct variables \(V_i\) and \(V_j\), if their functional relationship is

\[V_j = f(V_i, \eta_{ij})\]

for some function \(f\) and noise \(\eta_{ij}\), then in the DAG representing the causal structure of the set of variables \(V\), there is an arrow pointing from \(V_i\) to \(V_j\). A detailed introduction to such DAGs for causal structures can be found in [Pearl].

A causal effect, also called a causal estimand, can be expressed with the \(do\)-operator according to [Pearl]. As an example,

\[P(y|do(x))\]

denotes the probability function of \(y\) after imposing the intervention \(x\). Causal structures are crucial to expressing and estimating the causal estimands of interest. YLearn implements an object, CausalGraph, to support the representation of causal structures and related operations on them. Please see Causal Graph for details.

YLearn concerns the intersection of causal inference and machine learning. Therefore we assume that we have abundant observational data rather than the ability to conduct randomized experiments. Given a DAG for some causal structure, the causal estimands, e.g., average treatment effects (ATEs), usually cannot be directly estimated from the data, because the counterfactuals can never be observed. Thus it is necessary to convert these causal estimands into other quantities, called statistical estimands, which can be estimated from data, before proceeding to any estimation. The procedure of converting a causal estimand into the corresponding statistical estimand is called identification.

The object for supporting identification and other related operations of causal structures is CausalModel. More details can be found in Causal Model.

In the language of Pearl’s causal inference, it is also necessary to represent the results in the language of probability. For this purpose, YLearn implements an object, Prob, which is introduced in Representation of Probability.

Estimator Model: Estimating the Causal Effects

For a causal effect with the \(do\)-operator, after converting it into the corresponding statistical estimand via identification, the task of causal inference becomes estimating that statistical estimand, i.e., the converted causal effect. Before diving into any specific estimation methods for causal effects, we briefly introduce the problem setting for the estimation of causal effects.

Problem Setting

It was introduced in Causal Model that every causal structure has a corresponding DAG called a causal graph. Furthermore, each child-parent family in a DAG \(G\) represents a deterministic function

\[X_i = F_i (pa_i, \eta_i), i = 1, \dots, n,\]

where \(pa_i\) denotes the parents of \(X_i\) in \(G\) and \(\eta_i\) are random disturbances representing exogenous factors not present in the analysis. We call these functions the structural equation model associated with the causal structure. For a set of variables \(W\) that satisfies the back-door criterion (see Identification), the causal effect of \(X\) on \(Y\) is given by the formula

\[P(y|do(x)) = \sum_w P(y| x, w)P(w).\]

In this case, variables \(X\) for which the above equality is valid are also called “conditionally ignorable given \(W\)” in the potential outcome framework. The set of variables \(W\) satisfying this condition is called an adjustment set. In the language of structural equation models, these relations are encoded by

\[\begin{split}X & = F_1 (W, \epsilon),\\ Y & = F_2 (W, X, \eta).\end{split}\]

Our estimation problems can then be expressed with this structural equation model.
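A small numeric illustration of the adjustment formula with binary variables (the probability values are illustrative):

    import numpy as np

    # P(W=w) and P(Y=1 | X=x, W=w) for binary X and W
    p_w = np.array([0.6, 0.4])
    p_y1_given_xw = np.array([[0.2, 0.5],
                              [0.4, 0.8]])

    # P(Y=1 | do(X=x)) = sum_w P(Y=1 | x, w) P(w)
    p_y1_do_x = p_y1_given_xw @ p_w
    print(p_y1_do_x)  # [0.32 0.56]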

Estimator Models

YLearn implements several estimator models for the estimation of causal effects:

Approximation Bound for Causal Effects

Many estimator models require the unconfoundedness condition, which is usually untestable. One applicable approach is to bound our causal effects from above and below before diving into specific estimations.

There are four different bounds in YLearn; they correspond to the options of the assump parameter of estimate() below. One can see [Neal2020] for details. A usage sketch follows.
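A minimal usage sketch of ApproxBound (the synthetic data and column names are illustrative; the outcome bounds are passed explicitly here):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
    from ylearn.estimator_model.approximation_bound import ApproxBound

    rng = np.random.default_rng(2022)
    n = 1000
    w = rng.normal(size=n)
    x = (w + rng.normal(size=n) > 0).astype(int)  # binary treatment
    y = x + w + rng.normal(size=n)
    data = pd.DataFrame({'x': x, 'y': y, 'w': w})

    bound = ApproxBound(y_model=RandomForestRegressor(), x_model=RandomForestClassifier())
    bound.fit(data, outcome='y', treatment='x', covariate=['w'])

    # Bounds under the no-assumption option; pass the outcome bounds explicitly
    lower, upper = bound.estimate(y_upper=y.max(), y_lower=y.min(), assump='no-assump')
    print(lower, upper)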

Class Structures
class ylearn.estimator_model.approximation_bound.ApproxBound(y_model, x_prob=None, x_model=None, random_state=2022, is_discrete_treatment=True, categories='auto')

A model used for estimating the upper and lower bounds of the causal effects.

Parameters:
  • y_model (estimator, optional) – Any valid y_model should implement the fit() and predict() methods

  • x_prob (ndarray of shape (c,), optional, default=None) – An array of probabilities assigned to the corresponding values of x, where c is the number of different treatment classes. All elements in the array are positive and sum to 1. For example, x_prob = array([0.5, 0.5]) means both x = 0 and x = 1 have probability 0.5. Set this to None if you are using multiple treatments.

  • x_model (estimator, optional, default=None) – Models for predicting the probabilities of treatment. Any valid x_model should implement the fit() and predict_proba() methods.

  • random_state (int, optional, default=2022) –

  • is_discrete_treatment (bool, optional, default=True) – True if the treatment is discrete.

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, covariate=None, is_discrete_covariate=False, **kwargs)

Fit x_model and y_model.

Parameters:
  • data (pandas.DataFrame) – Training data.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • is_discrete_covariate (bool, optional, default=False) –

Returns:

The fitted instance of ApproxBound.

Return type:

instance of ApproxBound

Raises:

ValueError – Raise error when the treatment is not discrete.

estimate(data=None, treat=None, control=None, y_upper=None, y_lower=None, assump=None)

Estimate the approximation bound of the causal effect of the treatment on the outcome.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – Test data. The model will use the training data if set as None.

  • treat (ndarray of str, optional, default=None) – Values of the treatment group. For example, when there are multiple discrete treatments, array([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’.

  • control (ndarray of str, optional, default=None) – Values of the control group.

  • y_upper (float, defaults=None) – The upper bound of the outcome.

  • y_lower (float, defaults=None) – The lower bound of the outcome.

  • assump (str, optional, default='no-assump') –

    Options for the returned bounds. Should be one of

    1. no-assump: calculate the no-assumption bound, whose result will always contain 0.

    2. non-negative: the treatment effect is assumed to be always non-negative.

    3. non-positive: the treatment effect is assumed to be always non-positive.

    4. optimal: the treatment is taken if and only if its effect is positive.

Returns:

The first element is the lower bound while the second element is the upper bound. Note that if covariate is provided, all elements are ndarrays of shapes (n, ) indicating the lower and upper bounds of corresponding examples where n is the number of examples.

Return type:

tuple

Raises:

Exception – Raise Exception if the model is not fitted or if the assump is not given correctly.

comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray

Meta-Learner

Meta-Learners [Kunzel2019] are estimator models that aim to estimate the CATE by taking advantage of machine learning models when the treatment is discrete, e.g., when the treatment takes only the two values 0 and 1, and when the unconfoundedness condition is satisfied. Generally speaking, a Meta-Learner employs multiple machine learning models with flexibility in the choice of models.

YLearn implements 3 Meta-Learners: S-Learner, T-Learner, and X-Learner. We provide several useful examples below before introducing their class structures.

S-Learner

SLearner uses one machine learning model to estimate the causal effects. Specifically, we fit a model to predict outcome \(y\) from treatment \(x\) and adjustment set (or covariate) \(w\) with a machine learning model \(f\):

\[y = f(x, w).\]

The causal effect \(\tau(w)\) is then calculated as

\[\tau(w) = f(x=1, w) - f(x=0, w).\]
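A minimal usage sketch of SLearner with synthetic data (the data-generating process below is an illustrative assumption; its true CATE is \(1 + w\)):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from ylearn.estimator_model.meta_learner import SLearner

    rng = np.random.default_rng(2022)
    n = 1000
    w = rng.normal(size=n)
    x = rng.integers(0, 2, size=n)            # binary treatment
    y = x * (1 + w) + w + rng.normal(size=n)  # true CATE is 1 + w
    data = pd.DataFrame({'x': x, 'y': y, 'w': w})

    s = SLearner(model=RandomForestRegressor())
    s.fit(data, outcome='y', treatment='x', covariate=['w'])
    print(s.estimate(data=data, quantity='ATE'))  # should be close to 1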
class ylearn.estimator_model.meta_learner.SLearner(model, random_state=2022, is_discrete_treatment=True, categories='auto', *args, **kwargs)
Parameters:
  • model (estimator, optional) – The base machine learning model for training SLearner. Any model should be some valid machine learning model with fit() and predict() functions.

  • random_state (int, default=2022) –

  • is_discrete_treatment (bool, default=True) – Treatment must be discrete for SLearner.

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, adjustment=None, covariate=None, treat=None, control=None, combined_treatment=True, **kwargs)

Fit the SLearner in the dataset.

Parameters:
  • data (pandas.DataFrame) – Training dataset for training the estimator.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • treat (int, optional) – Label of the intended treatment group

  • control (int, optional) – Label of the intended control group

  • combined_treatment (bool, optional, default=True) –

    Only modify this parameter for multiple treatments, where multiple discrete treatments are combined to give a single new group of discrete treatments if set as True. When combined_treatment is set to True and there are multiple treatments, we can use the combined_treatment technique to convert multiple discrete classification tasks into a single discrete classification task. For example, if there are two different binary treatments:

    1. treatment_1: \(x_1 | x_1 \in \{'sleep', 'run'\}\),

    2. treatment_2: \(x_2 | x_2 \in \{'study', 'work'\}\),

    then we can convert these two binary classification tasks into a single classification task with 4 different classes:

    treatment: \(x | x \in \{0, 1, 2, 3\}\),

    where, for example, 1 stands for (‘sleep’ and ‘study’).

Returns:

The fitted instance of SLearner.

Return type:

instance of SLearner

estimate(data=None, quantity=None)

Estimate the causal effect with the type specified by quantity.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – Test data. The model will use the training data if set as None.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

Returns:

The estimated causal effects

Return type:

ndarray

effect_nji(data=None)

Calculate causal effects with different treatment values.

Returns:

Causal effects with different treatment values.

Return type:

ndarray

_comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray

T-Learner

The problem with SLearner is that the treatment vector is only 1-dimensional while the adjustment vector can be multi-dimensional. Thus, if the dimension of the adjustment is much larger than 1, the estimated effects will tend to be close to 0. TLearner uses two machine learning models to estimate the causal effect. Specifically, letting \(w\) denote the adjustment set (or covariate), we

  1. Fit two models \(f_t(w)\) for the treatment group (\(x=\) treat) and \(f_0(w)\) for the control group (\(x=\) control), respectively:

    \[y_t = f_t(w)\]

with data where \(x=\) treat and

\[y_0 = f_0(w)\]

with data where \(x=\) control.

  2. Compute the causal effect \(\tau(w)\) as the difference between predicted results of these two models:

    \[\tau(w) = f_t(w) - f_0(w).\]
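A corresponding TLearner sketch, reusing the synthetic data (the DataFrame `data`) from the SLearner example above:

    from sklearn.ensemble import RandomForestRegressor
    from ylearn.estimator_model.meta_learner import TLearner

    # `data` is the synthetic DataFrame from the SLearner sketch
    t = TLearner(model=RandomForestRegressor())
    t.fit(data, outcome='y', treatment='x', covariate=['w'])
    print(t.estimate(data=data, quantity='CATE'))  # estimates of 1 + w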
class ylearn.estimator_model.meta_learner.TLearner(model, random_state=2022, is_discrete_treatment=True, categories='auto', *args, **kwargs)
Parameters:
  • model (estimator, optional) – The base machine learning model for training TLearner. Any model should be some valid machine learning model with fit() and predict() functions.

  • random_state (int, default=2022) –

  • is_discrete_treatment (bool, default=True) – Treatment must be discrete for TLearner.

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, adjustment=None, covariate=None, treat=None, control=None, combined_treatment=True, **kwargs)

Fit the TLearner in the dataset.

Parameters:
  • data (pandas.DataFrame) – Training dataset for training the estimator.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • treat (int, optional) – Label of the intended treatment group

  • control (int, optional) – Label of the intended control group

  • combined_treatment (bool, optional, default=True) –

    Only modify this parameter for multiple treatments, where multiple discrete treatments are combined to give a single new group of discrete treatments if set as True. When combined_treatment is set to True and there are multiple treatments, we can use the combined_treatment technique to convert the multiple discrete classification tasks into a single discrete classification task. For example, if there are two different binary treatments:

    1. treatment_1: \(x_1 | x_1 \in \{'sleep', 'run'\}\),

    2. treatment_2: \(x_2 | x_2 \in \{'study', 'work'\}\),

    then we can convert these two binary classification tasks into a single classification task with 4 different classes:

    treatment: \(x | x \in \{0, 1, 2, 3\}\),

    where, for example, 1 stands for (‘sleep’ and ‘study’).

Returns:

The fitted instance of TLearner.

Return type:

instance of TLearner

estimate(data=None, quantity=None)

Estimate the causal effect with the type specified by quantity.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – Test data. The model will use the training data if set as None.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

Returns:

The estimated causal effects

Return type:

ndarray

effect_nji(data=None)

Calculate causal effects with different treatment values.

Returns:

Causal effects with different treatment values.

Return type:

ndarray

_comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray

X-Learner

TLearner does not use all of the data efficiently. This issue can be addressed by XLearner, which utilizes all of the data to train several models. Training an XLearner is composed of 3 steps:

  1. As in the case of TLearner, we first train two different models for the control group and treated group, respectively:

    \[\begin{split}& f_0(w) \text{ for the control group},\\ & f_t(w) \text{ for the treatment group}.\end{split}\]
  2. Generate two new datasets \(\{(h_0, w)\}\) using the control group and \(\{(h_t, w)\}\) using the treated group where

    \[\begin{split}h_0 & = f_t(w) - y_0,\\ h_t & = y_t - f_0(w).\end{split}\]

    Then train two new machine learning models \(k_0(w)\) and \(k_t(w)\) on these datasets such that

    \[\begin{split}h_0 & = k_0(w) \\ h_t & = k_t(w).\end{split}\]
  3. Get the final model by combining the above two models:

    \[g(w) = k_0(w)a(w) + k_t(w)(1 - a(w))\]

    where \(a(w)\) is a coefficient adjusting the weight of \(k_0\) and \(k_t\).

Finally, the causal effect \(\tau(w)\) can be estimated as follows:

\[\tau(w) = g(w).\]
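A corresponding XLearner sketch, reusing the synthetic data (the DataFrame `data`) from the SLearner example above:

    from sklearn.ensemble import RandomForestRegressor
    from ylearn.estimator_model.meta_learner import XLearner

    # `data` is the synthetic DataFrame from the SLearner sketch
    xl = XLearner(model=RandomForestRegressor())
    xl.fit(data, outcome='y', treatment='x', covariate=['w'])
    print(xl.estimate(data=data, quantity='CATE'))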
class ylearn.estimator_model.meta_learner.XLearner(model, random_state=2022, is_discrete_treatment=True, categories='auto', *args, **kwargs)
Parameters:
  • model (estimator, optional) – The base machine learning model for training XLearner. Any model should be some valid machine learning model with fit() and predict() functions.

  • random_state (int, default=2022) –

  • is_discrete_treatment (bool, default=True) – Treatment must be discrete for XLearner.

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, adjustment=None, covariate=None, treat=None, control=None, combined_treatment=True, **kwargs)

Fit the XLearner in the dataset.

Parameters:
  • data (pandas.DataFrame) – Training dataset for training the estimator.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • treat (int, optional) – Label of the intended treatment group

  • control (int, optional) – Label of the intended control group

  • combined_treatment (bool, optional, default=True) –

    Only modify this parameter for multiple treatments, where multiple discrete treatments are combined to give a single new group of discrete treatments if set as True. When combined_treatment is set to True and there are multiple treatments, we can use the combined_treatment technique to convert the multiple discrete classification tasks into a single discrete classification task. For example, if there are two different binary treatments:

    1. treatment_1: \(x_1 | x_1 \in \{'sleep', 'run'\}\),

    2. treatment_2: \(x_2 | x_2 \in \{'study', 'work'\}\),

    then we can convert these two binary classification tasks into a single classification task with 4 different classes:

    treatment: \(x | x \in \{0, 1, 2, 3\}\),

    where, for example, 1 stands for (‘sleep’ and ‘study’).

Returns:

The fitted instance of XLearner.

Return type:

instance of XLearner

estimate(data=None, quantity=None)

Estimate the causal effect with the type specified by quantity.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – Test data. The model will use the training data if set as None.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

Returns:

The estimated causal effects

Return type:

ndarray

effect_nji(data=None)

Calculate causal effects with different treatment values.

Returns:

Causal effects with different treatment values.

Return type:

ndarray

_comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray

Double Machine Learning

The double machine learning (DML) model [Chern2016] can be applied when all confounders of the treatment and outcome, i.e., variables that simultaneously influence both the treatment and the outcome, are observed. Let \(y\) be the outcome and \(x\) the treatment; a DML model solves the following causal effect estimation (CATE estimation) problem:

\[\begin{split}y & = F(v) x + g(v, w) + \epsilon \\ x & = h(v, w) + \eta\end{split}\]

where \(F(v)\) is the CATE conditional on the condition \(v\). Furthermore, to estimate \(F(v)\), we note that

\[y - \mathbb{E}[y|w, v] = F(v) (x - \mathbb{E}[x|w, v]) + \epsilon.\]

Thus by first estimating \(\mathbb{E}[y|w, v]\) and \(\mathbb{E}[x|w,v]\) as

\[\begin{split}m(v, w) & = \mathbb{E}[y|w, v]\\ h(v, w) & = \mathbb{E}[x|w,v],\end{split}\]

we can get a new dataset \((\tilde{y}, \tilde{x})\) where

\[\begin{split}\tilde{y} & = y - m(v, w) \\ \tilde{x} & = x - h(v, w)\end{split}\]

such that the relation between \(\tilde{y}\) and \(\tilde{x}\) is linear

\[\tilde{y} = F(v) \tilde{x} + \epsilon\]

which can be simply modeled by the linear regression model.

On the other hand, in the current version, \(F(v)\) takes the form

\[F_{ij}(v) = \sum_k H_{ijk} \rho_k(v).\]

where \(H\) can be seen as a rank-3 tensor and \(\rho_k\) is a function of the covariate \(v\), e.g., \(\rho(v) = v\) in the simplest case. Therefore, the outcome \(y\) can now be represented as

\[\begin{split}y_i & = \sum_j F_{ij}x_j + g(v, w)_i + \epsilon \\ & = \sum_j \sum_k H_{ijk}\rho_k(v)x_j + g(v, w)_i + \epsilon\end{split}\]

In this sense, the linear regression problem between \(\tilde{y}\) and \(\tilde{x}\) now becomes

\[\tilde{y}_i = \sum_j \sum_k H_{ijk}\rho_k(v) \tilde{x}_j + \epsilon.\]
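The two-stage procedure itself can be sketched directly with scikit-learn under a constant-effect assumption (this illustrates the residual-on-residual regression, not YLearn's exact implementation; the data-generating process is illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(2022)
    n = 2000
    wv = rng.normal(size=(n, 3))                 # confounders/covariates
    x = wv[:, 0] + rng.normal(size=n)            # treatment
    y = 1.5 * x + wv[:, 1] + rng.normal(size=n)  # constant CATE F(v) = 1.5

    # Stage 1: estimate E[y|w,v] and E[x|w,v] with cross fitting
    y_res = y - cross_val_predict(RandomForestRegressor(), wv, y, cv=2)
    x_res = x - cross_val_predict(RandomForestRegressor(), wv, x, cv=2)

    # Stage 2: regress the residual of y on the residual of x to recover F(v)
    print(LinearRegression().fit(x_res.reshape(-1, 1), y_res).coef_)  # about 1.5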
Class Structures
class ylearn.estimator_model.double_ml.DoubleML(x_model, y_model, yx_model=None, cf_fold=1, adjustment_transformer=None, covariate_transformer=None, random_state=2022, is_discrete_treatment=False, categories='auto')
Parameters:
  • x_model (estimator, optional) – Machine learning models for fitting x. Any such model should implement the fit() and predict() (also predict_proba() if x is discrete) methods.

  • y_model (estimator, optional) – The machine learning model which is trained to model the outcome. Any valid y_model should implement the fit() and predict() methods.

  • yx_model (estimator, optional) – Machine learning model for fitting the residual of y on the residual of x. Only linear regression models are supported in the current version.

  • cf_fold (int, default=1) – The number of folds for performing cross fit in the first stage.

  • adjustment_transformer (transformer, optional, default=None) – Transformer for adjustment variables, which can be used to generate new features of adjustment variables.

  • covariate_transformer (transformer, optional, default=None) – Transformer for covariate variables, which can be used to generate new features of covariate variables.

  • random_state (int, default=2022) –

  • is_discrete_treatment (bool, default=False) – If the treatment variables are discrete, set this to True.

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, adjustment=None, covariate=None, **kwargs)

Fit the DoubleML estimator model. Note that the training of a DML model has two stages, implemented in _fit_1st_stage() and _fit_2nd_stage().

Parameters:
  • data (pandas.DataFrame) – Training dataset for training the estimator.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness.

  • covariate (list of str, optional, default=None) – Names of the covariate.

Returns:

The fitted model

Return type:

an instance of DoubleML

estimate(data=None, treat=None, control=None, quantity=None)

Estimate the causal effect with the type specified by quantity.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – The test data for the estimator to evaluate the causal effect; note that the estimator directly evaluates all quantities in the training data if data is None.

  • treat (float or numpy.ndarray, optional, default=None) – In the case of single discrete treatment, treat should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, treat should be a list or an ndarray where treat[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, array([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, treat should be a float or a ndarray.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

  • control (float or numpy.ndarray, optional, default=None) – This is similar to the cases of treat.

Returns:

The estimated causal effects

Return type:

ndarray

effect_nji(data=None)

Calculate causal effects with different treatment values.

Parameters:

data (pandas.DataFrame, optional, default=None) – The test data for the estimator to evaluate the causal effect; note that the estimator will use the training data if data is None.

Returns:

Causal effects with different treatment values.

Return type:

ndarray

comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray
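
To make the API above concrete, here is a minimal usage sketch of DoubleML on synthetic data. The import path, the synthetic dataset, and the column names are illustrative assumptions for this example, not part of the documented API.

# An illustrative sketch of using DoubleML; the import path, dataset,
# and column names below are assumptions for this example.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

from ylearn.estimator_model.double_ml import DoubleML  # assumed import path

# Synthetic data: both the treatment x and the outcome y depend on v0, v1,
# and the true causal effect of x on y is 0.5.
rng = np.random.default_rng(2022)
n = 2000
v = rng.normal(size=(n, 2))
x = v[:, 0] + rng.normal(size=n)
y = 0.5 * x + v[:, 1] + rng.normal(size=n)
data = pd.DataFrame({'x': x, 'y': y, 'v0': v[:, 0], 'v1': v[:, 1]})

dml = DoubleML(
    x_model=RandomForestRegressor(),  # fits the treatment on the covariates
    y_model=RandomForestRegressor(),  # fits the outcome on the covariates
    cf_fold=3,                        # 3-fold cross fitting in the first stage
)
dml.fit(data, outcome='y', treatment='x', covariate=['v0', 'v1'])
ate = dml.estimate(data, quantity='ATE')  # should be close to 0.5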

Doubly Robust

The doubly robust method (see [Funk2010]) estimates the causal effects when the treatment is discrete and the unconfoundedness condition is satisfied. Training a doubly robust model is composed of three steps.

  1. Let \(K\) be an int. Form a \(K\)-fold random partition of the data \(\{(X_i, W_i, V_i, Y_i)\}_{i = 1}^n\) such that

    \[\{(x_i, w_i, v_i, y_i)\}_{i = 1}^n = D_k \cup T_k\]

    where \(D_k\) stands for the training data while \(T_k\) stands for the test data and \(\cup_{k = 1}^K T_k = \{(X_i, W_i, V_i, Y_i)\}_{i = 1}^n\).

  2. For each \(k\), train two models \(f(X, W, V)\) and \(g(W, V)\) on \(D_k\) to predict \(y\) and \(x\), respectively. Then evaluate their performances on \(T_k\), whose results will be saved as \(\{(\hat{X}, \hat{Y})\}_k\). All \(\{(\hat{X}, \hat{Y})\}_k\) will be combined to give the new dataset \(\{(\hat{X}_i, \hat{Y}_i(X, W, V))\}_{i = 1}^n\).

  3. For any given pair of a treat group where \(X=x\) and a control group where \(X = x_0\), we build the final dataset \(\{(V, \tilde{Y}_x - \tilde{Y}_0)\}\) where \(\tilde{Y}_x\) is defined as

    \[\begin{split}\tilde{Y}_x & = \hat{Y}(X=x, W, V) + \frac{(Y - \hat{Y}(X=x, W, V)) * \mathbb{I}(X=x)}{P[X=x| W, V]} \\ \tilde{Y}_0 & = \hat{Y}(X=x_0, W, V) + \frac{(Y - \hat{Y}(X=x_0, W, V)) * \mathbb{I}(X=x_0)}{P[X=x_0| W, V]}\end{split}\]

    and train the final machine learning model \(h(V)\) on this dataset to predict the causal effect \(\tau(V)\)

    \[\tau(V) = \tilde{Y}_x - \tilde{Y}_0 = h(V).\]

    Then we can directly estimate the causal effects by passing the covariate \(V\) to the model \(h(V)\).
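
To make step 2 and the formula above concrete, the following sketch (illustrative, not YLearn's internal implementation) computes the pseudo-outcome \(\tilde{Y}_x\) on held-out data from a fitted outcome model and a fitted propensity model. Every name in it is an assumption for this example, and the outcome model is assumed to take the treatment as its first input column.

# An illustrative sketch of the doubly robust pseudo-outcome; not YLearn's
# internal code. y_model and x_model are assumed to be already fitted.
import numpy as np

def pseudo_outcome(y, x, wv, y_model, x_model, treat_value):
    # \hat{Y}(X=treat_value, W, V): predict the outcome with the treatment
    # fixed to treat_value (assumed to be the first input column of y_model).
    wv_treat = np.hstack([np.full((len(y), 1), treat_value), wv])
    y_hat = y_model.predict(wv_treat)
    # P[X=treat_value | W, V]: the propensity score, assuming treat_value
    # is the class index of x_model.predict_proba.
    propensity = x_model.predict_proba(wv)[:, treat_value]
    # \mathbb{I}(X=treat_value): indicator of actually receiving treat_value.
    indicator = (x == treat_value).astype(float)
    return y_hat + (y - y_hat) * indicator / propensity

The difference pseudo_outcome(..., treat_value=x_t) - pseudo_outcome(..., treat_value=x_0) then serves as the regression target for the final model \(h(V)\).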

Class Structures
class ylearn.estimator_model.doubly_robust.DoublyRobust(x_model, y_model, yx_model, cf_fold=1, random_state=2022, categories='auto')
Parameters:
  • x_model (estimator, optional) – The machine learning model which is trained to modeling the treatment. Any valid x_model should implement the fit() and predict_proba() methods.

  • y_model (estimator, optional) – The machine learning model which is trained to modeling the outcome with covariates (possibly adjustment) and the treatment. Any valid y_model should implement the fit() and predict() methods.

  • yx_model (estimator, optional) – The machine learning model which is trained in the final stage of doubly robust method to modeling the causal effects with covariates (possibly adjustment). Any valid yx_model should implement the fit() and predict() methods.

  • cf_fold (int, default=1) – The number of folds for performing cross fit in the first stage.

  • random_state (int, default=2022) –

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, adjustment=None, covariate=None, treat=None, control=None, combined_treatment=True, **kwargs)

Fit the DoublyRobust estimator model. Note that the training of a doubly robust model has three stages, which we implement in _fit_1st_stage() and _fit_2nd_stage().

Parameters:
  • data (pandas.DataFrame) – Training dataset for training the estimator.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • treat (int, optional) – Label of the intended treatment group. If None, then treat will be set as 1. In the case of a single discrete treatment, treat should be an int or str of one of the possible treatment values, indicating the value of the intended treatment; in the case of multiple discrete treatments, treat should be a list or an ndarray where treat[i] indicates the value of the i-th intended treatment. For example, when there are multiple discrete treatments, array([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’.

  • control (int, optional) – Label of the intended control group. This is similar to the cases of treat. If None, then control will be set as 0.

Returns:

The fitted instance of DoublyRobust.

Return type:

instance of DoublyRobust

estimate(data=None, quantity=None, treat=None, all_tr_effects=False)

Estimate the causal effect with the type of the quantity.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – Test data. The model will use the training data if set as None.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

  • treat (float or numpy.ndarray, optional, default=None) – In the case of a single discrete treatment, treat should be an int or str of one of the possible treatment values, indicating the value of the intended treatment; in the case of multiple discrete treatments, treat should be a list or an ndarray where treat[i] indicates the value of the i-th intended treatment. For example, when there are multiple discrete treatments, array([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’.

  • all_tr_effects (bool, default=False) – If True, return the causal effects for all values of the treatment; otherwise, only return the causal effect for the value of treat if it is provided. If treat is not provided, the value of the treatment used when fitting the estimator model is taken instead.

Returns:

The estimated causal effects

Return type:

ndarray

effect_nji(data=None)

Calculate causal effects with different treatment values. Note that this method will convert any problem with a discrete treatment into one with a binary treatment. One can use _effect_nji_all() to get causal effects with all values of treat taken by the treatment.

Returns:

Causal effects with different treatment values.

Return type:

ndarray

comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray
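
A minimal usage sketch of DoublyRobust with a binary treatment follows; the import path, synthetic dataset, and column names are illustrative assumptions.

# An illustrative sketch of using DoublyRobust; the dataset and column
# names are assumptions for this example.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression

from ylearn.estimator_model.doubly_robust import DoublyRobust  # assumed path

# Synthetic data: a binary treatment x whose effect depends on v0.
rng = np.random.default_rng(2022)
n = 2000
v = rng.normal(size=(n, 2))
x = rng.binomial(1, 0.5, size=n)
y = x * (v[:, 0] > 0) + 0.3 * v[:, 1] + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({'x': x, 'y': y, 'v0': v[:, 0], 'v1': v[:, 1]})

dr = DoublyRobust(
    x_model=RandomForestClassifier(),  # g(W, V); must implement predict_proba()
    y_model=RandomForestRegressor(),   # f(X, W, V)
    yx_model=LinearRegression(),       # the final-stage model h(V)
    cf_fold=1,
)
dr.fit(data, outcome='y', treatment='x', covariate=['v0', 'v1'], treat=1, control=0)
cate = dr.estimate(data, quantity='CATE')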

Causal Tree

Causal Tree is a data-driven approach to partition the data into subpopulations which differ in the magnitude of their causal effects [Athey2015]. This method is applicable when unconfoundedness is satisfied given the adjustment set (covariate) \(V\). The causal effect of interest is the CATE:

\[\tau(v) := \mathbb{E}[Y_i(do(X=x_t)) - Y_i(do(X=x_0)) | V_i = v]\]

Due to the fact that the counterfactuals can never be observed, [Athey2015] developed an honest approach where the loss function (criterion for building a tree) is designed as

\[e (S_{tr}, \Pi) := \frac{1}{N_{tr}} \sum_{i \in S_{tr}} \hat{\tau}^2 (V_i; S_{tr}, \Pi) - \frac{2}{N_{tr}} \cdot \sum_{\ell \in \Pi} \left( \frac{\Sigma^2_{S_{tr}^{treat}}(\ell)}{p} + \frac{\Sigma^2_{S_{tr}^{control}}(\ell)}{1 - p}\right)\]

where \(N_{tr}\) is the number of samples in the training set \(S_{tr}\), \(p\) is the proportion of samples belonging to the treat group in the training set, and

\[\begin{split}\hat{\tau}(v) = \frac{1}{\#(\{i\in S_{treat}: V_i \in \ell(v; \Pi)\})} \sum_{ \{i\in S_{treat}: V_i \in \ell(v; \Pi)\}} Y_i \\ - \frac{1}{\#(\{i\in S_{control}: V_i \in \ell(v; \Pi)\})} \sum_{ \{i\in S_{control}: V_i \in \ell(v; \Pi)\}} Y_i.\end{split}\]
Class Structures
class ylearn.estimator_model.causal_tree.CausalTree(*, splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=2022, max_leaf_nodes=None, max_features=None, min_impurity_decrease=0.0, min_weight_fraction_leaf=0.0, ccp_alpha=0.0, categories='auto')
Parameters:
  • splitter ({"best", "random"}, default="best") – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

  • max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split (int or float, default=2) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=None) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default to None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

  • min_impurity_decrease (float, default=0.0) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, adjustment=None, covariate=None, treat=None, control=None)

Fit the model on data to estimate the causal effect.

Parameters:
  • data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the CEInterpreter to fit.

  • outcome (list of str, optional) – Names of the outcomes.

  • treatment (list of str, optional) – Names of the treatments.

  • covariate (list of str, optional, default=None) – Names of the covariate vectors.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set. Note that we may only need the covariate set, which usually is a subset of the adjustment set.

  • treat (int or list, optional, default=None) –

    If there is only one discrete treatment, then treat indicates the treatment group. If there are multiple treatment groups, then treat should be a list of str with length equal to the number of treatments. For example, when there are multiple discrete treatments,

    array([‘run’, ‘read’])

    means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’.

  • control (int or list, optional, default=None) – See treat.

Returns:

Fitted CausalTree

Return type:

instance of CausalTree

estimate(data=None, quantity=None)

Estimate the causal effect of the treatment on the outcome in data.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – If None, data will be set as the training data.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

Returns:

The estimated causal effect with the type of the quantity.

Return type:

ndarray or float, optional

plot_causal_tree(feature_names=None, max_depth=None, class_names=None, label='all', filled=False, node_ids=False, proportion=False, rounded=False, precision=3, ax=None, fontsize=None)

Plot the fitted causal tree. The sample counts that are shown are weighted with any sample_weights that might be present. The visualization is fit automatically to the size of the axis. Use the figsize or dpi arguments of plt.figure to control the size of the rendering.

Returns:

List containing the artists for the annotation boxes making up the tree.

Return type:

annotations : list of artists

decision_path(*, data=None, wv=None)

Return the decision path.

Parameters:
  • wv (numpy.ndarray, default=None) – The input samples as an ndarray. If None, then the DataFrame data will be used as the input samples.

  • data (pandas.DataFrame, default=None) – The input samples. The data must contain columns of the covariates used for training the model. If None, the training data will be passed as input samples.

Returns:

Return a node indicator CSR matrix where nonzero elements indicate that the sample goes through the corresponding nodes.

Return type:

indicator : sparse matrix of shape (n_samples, n_nodes)

apply(*, data=None, wv=None)

Return the index of the leaf that each sample is predicted as.

Parameters:
  • wv (numpy.ndarray, default=None) – The input samples as an ndarray. If None, then the DataFrame data will be used as the input samples.

  • data (pandas.DataFrame, default=None) – The input samples. The data must contain columns of the covariates used for training the model. If None, the training data will be passed as input samples.

Returns:

For each datapoint v_i in v, return the index of the leaf v_i ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.

Return type:

v_leaves : array-like of shape (n_samples, )

property feature_importance
Returns:

Normalized total reduction of criteria by feature (Gini importance).

Return type:

ndarray of shape (n_features,)
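
A minimal usage sketch of CausalTree follows; the import path, synthetic dataset, column names, and hyperparameters are illustrative assumptions.

# An illustrative sketch of using CausalTree on a binary treatment.
import numpy as np
import pandas as pd

from ylearn.estimator_model.causal_tree import CausalTree  # assumed path

# Synthetic data: the effect of the binary treatment x flips with the sign of v0.
rng = np.random.default_rng(2022)
n = 2000
v0, v1 = rng.normal(size=n), rng.normal(size=n)
x = rng.binomial(1, 0.5, size=n)
y = x * np.sign(v0) + 0.2 * v1 + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({'x': x, 'y': y, 'v0': v0, 'v1': v1})

ct = CausalTree(min_samples_leaf=10, max_depth=5)
ct.fit(data, outcome='y', treatment='x', covariate=['v0', 'v1'], treat=1, control=0)
cate = ct.estimate(data, quantity='CATE')
ct.plot_causal_tree(feature_names=['v0', 'v1'], max_depth=2)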

Forest Estimator Models

Random forest is a widely used algorithm in machine learning. Many empirical properties of random forests, including their stability and their ability to adapt flexibly to complicated forms, have made random forests and their variants popular and reliable choices in a lot of tasks. It is then a natural and crucial idea to extend tree-based models for causal effect estimation, such as the causal tree, to forest-based ones. These works are pioneered by [Athey2018]. Similar to the case of machine learning, forest estimator models typically have better performance for causal effect estimation than tree models while sharing equivalent interpretability and other advantages. Thus it is always recommended to try these estimator models first.

In YLearn, we currently cover three types of forest estimator models for causal effect estimation under the unconfoundedness assumption:

Generalized Random Forest

To adapt random forests to causal effect estimation, [Athey2018] proposed a generalized version, named Generalized Random Forest (GRF), by altering the criterion for building a single tree and designing a new kind of ensemble method to combine the trained trees. GRF can be used for, e.g., quantile regression, while in YLearn we focus on its ability to perform highly flexible non-parametric causal effect estimation.

We now consider such estimation with GRF. Suppose that we observe samples \((X_i, Y_i, V_i) \in \mathbb{R}^{d_x} \times \mathbb{R} \times \mathbb{R}^{d_v}\) where \(Y\) is the outcome, \(X\) is the treatment and \(V\) is the covariate which ensures the unconfoundedness condition. The forest weights \(\alpha_i(v)\) are defined by

\[\begin{split}\alpha_i^b(v) = \frac{\mathbb{I}\left( \left\{ V_i \in L^b(v) \right\} \right)}{|L^b(v)|},\\ \alpha_i(v) = \frac{1}{B} \sum_{b = 1}^B \alpha_i^b(v),\end{split}\]

where the subscript \(b\) refers to the \(b\)-th tree with a total number of \(B\) such trees, \(L^b(v)\) is the leaf of the \(b\)-th tree into which a sample with covariate \(v\) falls, and \(|L^b(v)|\) denotes the total number of training samples falling into the same leaf as that sample in the \(b\)-th tree. Then the estimated causal effect can be expressed by

\[\left( \sum_{i=1}^n \alpha_i(v)(X_i - \bar{X}_\alpha)(X_i - \bar{X}_\alpha)^T\right)^{-1} \sum_{i = 1}^n \alpha_i(v) (X_i - \bar{X}_\alpha)(Y_i - \bar{Y}_\alpha)\]

where \(\bar{X}_\alpha = \sum \alpha_i X_i\) and \(\bar{Y}_\alpha = \sum \alpha_i Y_i\).

We now provide an example usage of applying the GRForest below.
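
This is a minimal sketch under the class signature documented below; the synthetic dataset and column names are illustrative assumptions.

# An illustrative sketch of using GRForest; the dataset and column names
# are assumptions for this example.
import numpy as np
import pandas as pd

from ylearn.estimator_model import GRForest

# Synthetic data: a binary treatment with a heterogeneous effect 1 + v0.
rng = np.random.default_rng(2022)
n = 2000
v0, v1 = rng.normal(size=n), rng.normal(size=n)
x = rng.binomial(1, 0.5, size=n)
y = x * (1 + v0) + 0.2 * v1 + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({'x': x, 'y': y, 'v0': v0, 'v1': v1})

grf = GRForest(
    n_estimators=100,
    min_samples_leaf=10,
    is_discrete_treatment=True,
    n_jobs=-1,
)
grf.fit(data, outcome='y', treatment='x', covariate=['v0', 'v1'])
effect = grf.estimate(data)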

Besides this GRForest, YLearn also implements a naive version of GRF in pure Python, written in an easy-to-understand manner to help users gain some insight into how GRF works at the code level. It is worth mentioning, however, that this naive version of GRF is very slow (roughly 5 minutes for fitting 100 trees on a dataset with 2000 samples and 10 features). One can find this naive GRF in the folder ylearn/estimator_model/_naive_forest/.

The formal version of GRF is summarized as follows.

Class Structures
class ylearn.estimator_model.GRForest(n_estimators=100, *, sub_sample_num=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, n_jobs=None, random_state=None, ccp_alpha=0.0, is_discrete_treatment=True, is_discrete_outcome=False, verbose=0, warm_start=False, honest_subsample_num=None)
Parameters:
  • n_estimators (int, default=100) – The number of trees for growing the GRF.

  • sub_sample_num (int or float, default=None) –

    The number of samples to train each individual tree.

    • If a float is given, then sub_sample_num*n_samples samples will be sampled to train a single tree.

    • If an int is given, then sub_sample_num samples will be sampled to train a single tree.

  • max_depth (int, default=None) – The max depth that a single tree can reach. If None is given, then there is no limit of the depth of a single tree.

  • min_samples_split (int, default=2) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=1.0) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

  • min_impurity_decrease (float, default=0.0) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

  • n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), estimate(), and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • verbose (int, default=0) – Controls the verbosity when fitting and predicting

  • honest_subsample_num (int or float, default=None) –

    The number of samples used to train each individual tree in an honest manner. Typically, setting this value leads to better performance.

    • If None is given, all sub_sample_num samples are used.

    • If a float is given, then honest_subsample_num*sub_sample_num samples will be used to train a single tree while the rest (1 - honest_subsample_num)*sub_sample_num samples will be used to label the trained tree.

    • If an int is given, then honest_subsample_num samples will be sampled to train a single tree while the rest sub_sample_num - honest_subsample_num samples will be used to label the trained tree.

fit(data, outcome, treatment, adjustment=None, covariate=None)

Fit the model on data to estimate the causal effect.

Parameters:
  • data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the CEInterpreter to fit.

  • outcome (list of str, optional) – Names of the outcomes.

  • treatment (list of str, optional) – Names of the treatments.

  • covariate (list of str, optional, default=None) – Names of the covariate vectors.

  • adjustment (list of str, optional, default=None) – This will be the same as the covariate.

  • sample_weight (ndarray, optional, default=None) – Weight of each sample of the training set.

Returns:

Fitted GRForest

Return type:

instance of GRForest

estimate(data=None)

Estimate the causal effect of the treatment on the outcome in data.

Parameters:

data (pandas.DataFrame, optional, default=None) – If None, data will be set as the training data.

Returns:

The estimated causal effect.

Return type:

ndarray or float, optional

apply(*, v)

Apply trees in the forest to v, return leaf indices.

Parameters:

v (numpy.ndarray,) – The input samples. Internally, its dtype will be converted to dtype=np.float32.

Returns:

For each datapoint v_i in v and for each tree in the forest, return the index of the leaf v_i ends up in.

Return type:

v_leaves : array-like of shape (n_samples, )

property feature_importance
Returns:

Normalized total reduction of criteria by feature (Gini importance).

Return type:

ndarray of shape (n_features,)

Causal Forest

In [Athey2018], the authors argued that by imposing the local centering technique, i.e., by first regressing out the outcome and the treatment respectively (the so-called double machine learning framework), the performance of the Generalized Random Forest (GRF) can be further improved. In YLearn, we implement the class CausalForest to support this technique. We illustrate its usage in the following example.
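
This is a minimal sketch under the class signature documented below; the synthetic dataset and column names are illustrative assumptions.

# An illustrative sketch of using CausalForest; the dataset and column
# names are assumptions for this example.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from ylearn.estimator_model import CausalForest

# Synthetic data: a binary treatment with a heterogeneous effect 1 + v0.
rng = np.random.default_rng(2022)
n = 2000
v0, v1 = rng.normal(size=n), rng.normal(size=n)
x = rng.binomial(1, 0.5, size=n)
y = x * (1 + v0) + 0.2 * v1 + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({'x': x, 'y': y, 'v0': v0, 'v1': v1})

cforest = CausalForest(
    x_model=RandomForestClassifier(),  # local centering model for the treatment
    y_model=RandomForestRegressor(),   # local centering model for the outcome
    cf_fold=1,
    n_estimators=100,
    is_discrete_treatment=True,
)
cforest.fit(data, outcome='y', treatment='x', covariate=['v0', 'v1'])
effect = cforest.estimate(data)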

Class Structures
class ylearn.estimator_model.CausalForest(x_model, y_model, n_estimators=100, *, cf_fold=1, sub_sample_num=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, n_jobs=None, random_state=None, ccp_alpha=0.0, is_discrete_treatment=True, is_discrete_outcome=False, verbose=0, warm_start=False, honest_subsample_num=None, adjustment_transformer=None, covariate_transformer=None, proba_output=False)
Parameters:
  • x_model (estimator, optional) – Machine learning models for fitting x. Any such models should implement the fit() and predict() (also predict_proba() if x is discrete) methods.

  • cf_fold (int, default=1) – The number of folds for performing cross fit in the first stage.

  • y_model (estimator, optional) – The machine learning model which is trained to modeling the outcome. Any valid y_model should implement the fit() and predict() methods.

  • n_estimators (int, default=100) – The number of trees for growing the GRF.

  • sub_sample_num (int or float, default=None) –

    The number of samples to train each individual tree.

    • If a float is given, then sub_sample_num*n_samples samples will be sampled to train a single tree.

    • If an int is given, then sub_sample_num samples will be sampled to train a single tree.

  • max_depth (int, default=None) – The max depth that a single tree can reach. If None is given, then there is no limit of the depth of a single tree.

  • min_samples_split (int, default=2) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=1.0) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

  • min_impurity_decrease (float, default=0.0) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

  • n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), estimate(), and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • verbose (int, default=0) – Controls the verbosity when fitting and predicting

  • honest_subsample_num (int or float, default=None) –

    The number of samples used to train each individual tree in an honest manner. Typically, setting this value leads to better performance.

    • If None is given, all sub_sample_num samples are used.

    • If a float is given, then honest_subsample_num*sub_sample_num samples will be used to train a single tree while the rest (1 - honest_subsample_num)*sub_sample_num samples will be used to label the trained tree.

    • If an int is given, then honest_subsample_num samples will be sampled to train a single tree while the rest sub_sample_num - honest_subsample_num samples will be used to label the trained tree.

  • adjustment_transformer (transformer, default=None) – Transformer of adjustment variables. This can be used to generate new features.

  • covariate_transformer (transformer, default=None) – Transformer of covariate variables. This can be used to generate new features.

  • proba_output (bool, default=False) – Whether to estimate probability of the outcome if it is a discrete one. If True, then the given y_model must have the method predict_proba().

fit(data, outcome, treatment, adjustment=None, covariate=None, control=None)

Fit the model on data to estimate the causal effect. Note that when a discrete treatment is given and control is not specified explicitly, the first value of the treatment will automatically be taken as the control while the other values are taken as different treat assignments.

Parameters:
  • data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the CEInterpreter to fit.

  • outcome (list of str, optional) – Names of the outcomes.

  • treatment (list of str, optional) – Names of the treatments.

  • covariate (list of str, optional, default=None) – Names of the covariate vectors.

  • adjustment (list of str, optional, default=None) – This will be the same as the covariate.

  • sample_weight (ndarray, optional, default=None) – Weight of each sample of the training set.

  • control (str, optional, default=None) – The value of the treatment which will serve as the control group when estimating the causal effect. If None is given, then the first value of the treatment will be taken as the control.

Returns:

Fitted CausalForest

Return type:

instance of CausalForest

estimate(data=None)

Estimate the causal effect of the treatment on the outcome in data.

Parameters:

data (pandas.DataFrame, optional, default=None) – If None, data will be set as the training data.

Returns:

The estimated causal effect.

Return type:

ndarray or float, optional

apply(*, v)

Apply trees in the forest to v, return leaf indices.

Parameters:

v (numpy.ndarray,) – The input samples. Internally, its dtype will be converted to dtype=np.float32.

Returns:

For each datapoint v_i in v and for each tree in the forest, return the index of the leaf v_i ends up in.

Return type:

v_leaves : array-like of shape (n_samples, )

property feature_importance
Returns:

Normalized total reduction of criteria by feature (Gini importance).

Return type:

ndarray of shape (n_features,)

Ensemble of Causal Trees

An efficient and useful technique for growing a random forest is simply averaging the result of each individual tree. Consequently, we can also apply this technique to grow a causal forest by combining many single causal trees. In YLearn, we implement this idea in the class CTCausalForest (referring to Causal-Tree Causal Forest).

Since it is an ensemble of a bunch of CausalTree models, it currently only supports binary treatment. One may need to specify the treat and control groups before applying the CTCausalForest. This will be improved in a future version.

We provide below an example of it.
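
This is a minimal sketch under the class signature documented below; the synthetic dataset and column names are illustrative assumptions.

# An illustrative sketch of using CTCausalForest on a binary treatment;
# the dataset and column names are assumptions for this example.
import numpy as np
import pandas as pd

from ylearn.estimator_model import CTCausalForest

# Synthetic data: a binary treatment with a heterogeneous effect 1 + v0.
rng = np.random.default_rng(2022)
n = 2000
v0, v1 = rng.normal(size=n), rng.normal(size=n)
x = rng.binomial(1, 0.5, size=n)
y = x * (1 + v0) + 0.2 * v1 + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({'x': x, 'y': y, 'v0': v0, 'v1': v1})

ctcf = CTCausalForest(n_estimators=100, n_jobs=-1)
ctcf.fit(data, outcome='y', treatment='x', covariate=['v0', 'v1'], treat=1, control=0)
effect = ctcf.estimate(data)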

Class Structures
class ylearn.estimator_model.CTCausalForest(n_estimators=100, *, sub_sample_num=None, max_depth=None, min_samples_split=2, min_samples_leaf=1, max_features=1.0, min_impurity_decrease=0.0, n_jobs=None, random_state=None, ccp_alpha=0.0, is_discrete_treatment=True, is_discrete_outcome=False, verbose=0, warm_start=False, honest_subsample_num=None)
Parameters:
  • n_estimators (int, default=100) – The number of trees for growing the CTCausalForest.

  • sub_sample_num (int or float, default=None) –

    The number of samples to train each individual tree.

    • If a float is given, then sub_sample_num*n_samples samples will be sampled to train a single tree.

    • If an int is given, then sub_sample_num samples will be sampled to train a single tree.

  • max_depth (int, default=None) – The max depth that a single tree can reach. If None is given, then there is no limit of the depth of a single tree.

  • min_samples_split (int, default=2) –

    The minimum number of samples required to split an internal node:

    • If int, then consider min_samples_split as the minimum number.

    • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • max_features (int, float or {"sqrt", "log2"}, default=1.0) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • min_impurity_decrease (float, default=0.0) – A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

  • n_jobs (int, default=None) – The number of jobs to run in parallel. fit(), estimate(), and apply() are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • verbose (int, default=0) – Controls the verbosity when fitting and predicting

  • honest_subsample_num (int or float, default=None) –

    The number of samples used to train each individual tree in an honest manner. Typically, setting this value leads to better performance.

    • If None is given, all sub_sample_num samples are used.

    • If a float is given, then honest_subsample_num*sub_sample_num samples will be used to train a single tree while the rest (1 - honest_subsample_num)*sub_sample_num samples will be used to label the trained tree.

    • If an int is given, then honest_subsample_num samples will be sampled to train a single tree while the rest sub_sample_num - honest_subsample_num samples will be used to label the trained tree.

fit(data, outcome, treatment, adjustment=None, covariate=None, treat=None, control=None)

Fit the model on data to estimate the causal effect. Note that, similar to CausalTree, CTCausalForest currently assumes a binary treatment where the values of treat and control are controlled by the corresponding parameters.

Parameters:
  • data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the CEInterpreter to fit.

  • outcome (list of str, optional) – Names of the outcomes.

  • treatment (list of str, optional) – Names of the treatments.

  • covariate (list of str, optional, default=None) – Names of the covariate vectors.

  • adjustment (list of str, optional, default=None) – This will be the same as the covariate.

  • sample_weight (ndarray, optional, default=None) – Weight of each sample of the training set.

  • treat (int or list, optional, default=None) –

    If there is only one discrete treatment, then treat indicates the treatment group. If there are multiple treatment groups, then treat should be a list of str with length equal to the number of treatments. For example, when there are multiple discrete treatments,

    array([‘run’, ‘read’])

    means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’.

  • control (int or list, optional, default=None) – See treat.

Returns:

Fitted CTCausalForest

Return type:

instance of CTCausalForest

estimate(data=None)

Estimate the causal effect of the treatment on the outcome in data.

Parameters:

data (pandas.DataFrame, optional, default=None) – If None, data will be set as the training data.

Returns:

The estimated causal effect.

Return type:

ndarray or float, optional

apply(*, v)

Apply trees in the forest to v, return leaf indices.

Parameters:

v (numpy.ndarray,) – The input samples. Internally, its dtype will be converted to dtype=np.float32.

Returns:

For each datapoint v_i in v and for each tree in the forest, return the index of the leaf v_i ends up in.

Return type:

v_leaves : array-like of shape (n_samples, )

property feature_importance
Returns:

Normalized total reduction of criteria by feature (Gini importance).

Return type:

ndarray of shape (n_features,)

Instrumental Variables

Instrumental Variables (IV) deal with the problem of estimating causal effects in the presence of unobserved confounders that simultaneously affect the treatment \(X\) and the outcome \(Y\). A set of variables \(Z\) is said to be a set of instrumental variables if for any \(z\) in \(Z\):

  1. \(z\) has a causal effect on \(X\).

  2. The causal effect of \(z\) on \(Y\) is fully mediated by \(X\).

  3. There are no back-door paths from \(z\) to \(Y\).

In such cases, we must first find the IV (which can be done by using the CausalModel, see Identification). For instance, the variable \(Z\) in the following figure can serve as a valid IV for estimating the causal effects of \(X\) on \(Y\) in the presence of the unobserved confounder \(U\).

Figure: Causal graph with IV

YLearn implements two different methods related to IV: DeepIV [Hartford], which applies deep learning models to the IV framework, and IV with nonparametric models [Newey2002].

The IV Framework and Problem Setting

The IV framework aims to predict the value of the outcome \(y\) when the treatment \(x\) is given. Besides, there also exist covariate vectors \(v\) that simultaneously affect both \(y\) and \(x\), and unobserved confounders \(e\) that potentially affect \(y\), \(x\), and \(v\). The core of the causal question lies in estimating the causal quantity

\[\mathbb{E}[y| do(x)]\]

in the following causal graph, where the set of causal relationships is determined by the set of functions

\[\begin{split}y & = f(x, v) + e\\ x & = h(v, z) + \eta\\ \mathbb{E}[e] & = 0.\end{split}\]
Figure: Causal graph with IV and both observed and unobserved confounders

The IV framework solves this problem by doing a two-stage estimation:

  1. Estimate \(\hat{H}(z, v)\) that captures the relationship between \(x\) and the variables \((z, v)\).

  2. Replace \(x\) with the predicted result of \(\hat{H}(z, v)\) given \((v, z)\). Then estimate \(\hat{G}(x, v)\) to build the relationship between \(y\) and \((x, v)\).

The final causal effects can then be calculated.

IV Classes
Nonparametric Instrumental Variables
Two-stage Least Squares

When the relationships between the outcome \(y\), treatment \(x\) and covariate \(v\) are assumed to be linear, e.g., [Angrist1996],

\[\begin{split}y & = \alpha x + \beta v + e \\ x & = \gamma z + \lambda v + \eta,\end{split}\]

then the IV framework becomes straightforward: it first trains a linear model for \(x\) given \(z\) and \(v\); then, in the second stage, it replaces \(x\) with the predicted values \(\hat{x}\) to train a linear model for \(y\). This procedure is called two-stage least-squares (2SLS). A plain scikit-learn sketch of this procedure is given below.
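
The following sketch (illustrative, not YLearn's implementation) runs the two stages on synthetic data; all names in it are assumptions for the example.

# A plain-sklearn sketch of 2SLS under the linear model above; not
# YLearn's implementation.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2022)
n = 5000
z = rng.normal(size=(n, 1))                    # instrument
v = rng.normal(size=(n, 1))                    # observed covariate
u = rng.normal(size=(n, 1))                    # unobserved confounder
x = 0.8 * z + 0.5 * v + u + rng.normal(size=(n, 1))
y = 2.0 * x + 0.3 * v - u + rng.normal(size=(n, 1))

# Stage 1: regress x on (z, v) and form the predicted treatment x_hat.
stage1 = LinearRegression().fit(np.hstack([z, v]), x)
x_hat = stage1.predict(np.hstack([z, v]))

# Stage 2: regress y on (x_hat, v); the coefficient of x_hat estimates the
# causal effect (close to the true value 2.0), whereas a naive OLS of y on
# (x, v) would be biased by the unobserved confounder u.
stage2 = LinearRegression().fit(np.hstack([x_hat, v]), y)
print(stage2.coef_[0, 0])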

Nonparametric IV

Removing the linear assumptions regarding the relationships between variables, the nonparametric IV can replace the linear regression with a linear projection onto a series of known basis functions [Newey2002].

This method is similar to the conventional 2SLS and is also composed of two stages after finding new features of \(x\), \(v\), and \(z\),

\[\begin{split}\tilde{z}_d & = f_d(z)\\ \tilde{v}_{\mu} & = g_{\mu}(v),\end{split}\]

which are represented by some non-linear functions (basis functions) \(f_d\) and \(g_{\mu}\). After transforming into the new spaces, we then

  1. Fit the treatment model:

\[\hat{x}(z, v, w) = \sum_{d, \mu} A_{d, \mu} \tilde{z}_d \tilde{v}_{\mu} + h(v, w) + \eta\]
  2. Generate the new treatments \(\hat{x}\), and then fit the outcome model

\[y(\hat{x}, v, w) = \sum_{m, \mu} B_{m, \mu} \psi_m(\hat{x}) \tilde{v}_{\mu} + k(v, w) + \epsilon.\]

The final causal effect can then be estimated. For example, the CATE given \(v\) is estimated as

\[y(\hat{x_t}, v, w) - y(\hat{x_0}, v, w) = \sum_{m, \mu} B_{m, \mu} (\psi_m(\hat{x_t}) - \psi_m(\hat{x_0})) \tilde{v}_{\mu}.\]

YLearn implements this procedure in the class NP2SLS; a usage sketch follows the class documentation below.

Class Structures
class ylearn.estimator_model.iv.NP2SLS(x_model=None, y_model=None, random_state=2022, is_discrete_treatment=False, is_discrete_outcome=False, categories='auto')
Parameters:
  • x_model (estimator, optional, default=None) – The machine learning model to model the treatment. Any valid x_model should implement the fit and predict methods, by default None

  • y_model (estimator, optional, default=None) – The machine learning model to model the outcome. Any valid y_model should implement the fit and predict methods, by default None

  • random_state (int, default=2022) –

  • is_discrete_treatment (bool, default=False) –

  • is_discrete_outcome (bool, default=False) –

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, instrument, is_discrete_instrument=False, treatment_basis=('Poly', 2), instrument_basis=('Poly', 2), covar_basis=('Poly', 2), adjustment=None, covariate=None, **kwargs)

Fit an NP2SLS model. Note that when both treatment_basis and instrument_basis have degree 1, we are actually doing 2SLS.

Parameters:
  • data (DataFrame) – Training data for the model.

  • outcome (str or list of str, optional) – Names of the outcomes.

  • treatment (str or list of str, optional) – Names of the treatment.

  • covariate (str or list of str, optional, default=None) – Names of the covariate vectors.

  • instrument (str or list of str, optional) – Names of the instrument variables.

  • adjustment (str or list of str, optional, default=None) – Names of the adjustment variables.

  • treatment_basis (tuple of 2 elements, optional, default=('Poly', 2)) – Option for transforming the original treatment vectors. The first element indicates the transformation basis function while the second one denotes the degree. Currently only ‘Poly’ is supported as the first element.

  • instrument_basis (tuple of 2 elements, optional, default=('Poly', 2)) – Option for transforming the original instrument vectors. The first element indicates the transformation basis function while the second one denotes the degree. Currently only ‘Poly’ is supported as the first element.

  • covar_basis (tuple of 2 elements, optional, default=('Poly', 2)) – Option for transforming the original covariate vectors. The first element indicates the transformation basis function while the second one denotes the degree. Currently only ‘Poly’ is supported as the first element.

  • is_discrete_instrument (bool, default=False) –

estimate(data=None, treat=None, control=None, quantity=None)

Estimate the causal effect of the treatment on the outcome in data.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – If None, data will be set as the training data.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

  • treat (float, optional, default=None) – Value of the treatment when imposing the intervention. If None, then treat will be set to 1.

  • control (float, optional, default=None) – Value of the treatment such that the treatment effect is \(y(do(x=treat)) - y (do(x = control))\).

Returns:

The estimated causal effect with the type of the quantity.

Return type:

ndarray or float, optional

effect_nji(data=None)

Calculate causal effects with different treatment values.

Parameters:

data (pandas.DataFrame, optional, default=None) – The test data for the estimator to evaluate the causal effect. Note that the estimator will use the training data if data is None.

Returns:

Causal effects with different treatment values.

Return type:

ndarray
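
A minimal usage sketch of NP2SLS follows; the synthetic IV data and column names are illustrative assumptions.

# An illustrative sketch of using NP2SLS with polynomial bases.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

from ylearn.estimator_model.iv import NP2SLS

# Synthetic IV data: z instruments x; u is an unobserved confounder.
rng = np.random.default_rng(2022)
n = 2000
z = rng.normal(size=n)
v0 = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.5 * v0 + u + rng.normal(size=n)
y = 0.1 * x ** 2 + 0.3 * v0 - u + rng.normal(size=n)
data = pd.DataFrame({'z': z, 'v0': v0, 'x': x, 'y': y})

iv = NP2SLS(x_model=LinearRegression(), y_model=LinearRegression())
iv.fit(
    data,
    outcome='y',
    treatment='x',
    instrument='z',
    covariate='v0',
    treatment_basis=('Poly', 2),
    instrument_basis=('Poly', 2),
    covar_basis=('Poly', 2),
)
effect = iv.estimate(data, treat=1, control=0, quantity='CATE')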

DeepIV

DeepIV, developed in [Hartford], is a method for estimating causal effects in the presence of unobserved confounders between the treatment and outcome variables. It applies deep learning methods to accurately characterize the causal relationships between the treatment and outcome when instrumental variables (IV) are present. Due to the representational power of deep learning models, it does not assume any parametric form for the causal relationships.

Training a DeepIV has two steps and resembles the estimation procedure of a normal IV method. Specifically, we

  1. train a neural network, which we refer to as the treatment network \(F(Z, V)\), to estimate the distribution of the treatment \(X\) given the IV \(Z\) and covariate variables \(V\)

  2. train another neural network, which we refer to as the outcome network \(H(X, V)\), to estimate the outcome \(Y\) given treatment \(X\) and covariate variables \(V\).

The final causal effect can then be estimated by the outcome network \(H(X, V)\). For instance, the CATE \(\tau(v)\) is estimated as

\[\tau(v) = H(X=x_t, V=v) - H(X=x_0, V=v).\]
Class Structures
class ylearn.estimator_model.deepiv.DeepIV(x_net=None, y_net=None, x_hidden_d=None, y_hidden_d=None, num_gaussian=5, is_discrete_treatment=False, is_discrete_outcome=False, is_discrete_instrument=False, categories='auto', random_state=2022)
Parameters:
  • x_net (ylearn.estimator_model.deepiv.Net, optional, default=None) – Representation of the mixture density network for continuous treatment or a usual classification net for discrete treatment. If None, the default neural network will be used. See ylearn.estimator_model.deepiv.Net for reference.

  • y_net (ylearn.estimator_model.deepiv.Net, optional, default=None) – Representation of the outcome network. If None, the default neural network will be used.

  • x_hidden_d (int, optional, default=None) – Dimension of the hidden layer of the default x_net of DeepIV.

  • y_hidden_d (int, optional, default=None) – Dimension of the hidden layer of the default y_net of DeepIV.

  • is_discrete_treatment (bool, default=False) –

  • is_discrete_instrument (bool, default=False) –

  • is_discrete_outcome (bool, default=False) –

  • num_gaussian (int, default=5) – Number of Gaussians used in the mixture density network, which will be ignored when the treatment is discrete.

  • random_state (int, default=2022) –

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, instrument=None, adjustment=None, approx_grad=True, sample_n=None, y_net_config=None, x_net_config=None, **kwargs)

Train the DeepIV model.

Parameters:
  • data (pandas.DataFrame) – Training dataset for training the estimator.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • instrument (list of str, optional) – Names of the IV. Must be provided for DeepIV.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness, which can also be seen as the covariates in the current version.

  • approx_grad (bool, default=True) – Whether to use the approximated gradient as in [Hartford].

  • sample_n (int, optional, default=None) – Number of new samples drawn when using the approx_grad technique.

  • x_net_config (dict, optional, default=None) – Configuration of the x_net.

  • y_net_config (dict, optional, default=None) – Configuration of the y_net.

Returns:

The trained DeepIV model

Return type:

instance of DeepIV

estimate(data=None, treat=None, control=None, quantity=None, marginal_effect=False, *args, **kwargs)

Estimate the causal effect with the type of the quantity.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – Test data. The model will use the training data if set as None.

  • quantity (str, optional, default=None) –

    Option for returned estimation result. The possible values of quantity include:

    1. ’CATE’ : the estimator will evaluate the CATE;

    2. ’ATE’ : the estimator will evaluate the ATE;

    3. None : the estimator will evaluate the ITE or CITE.

  • treat (int, optional, default=None) – Value of the treatment, by default None. If None, then the model will set treat=1.

  • control (int, optional, default=None) – Value of the control, by default None. If None, then the model will set control=0.

Returns:

Estimated causal effects

Return type:

torch.tensor

effect_nji(data=None)

Calculate causal effects with different treatment values.

Returns:

Causal effects with different treatment values.

Return type:

ndarray

comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray
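
A minimal usage sketch of DeepIV follows; the synthetic IV data, column names, and configuration are illustrative assumptions (training a DeepIV requires torch).

# An illustrative sketch of using DeepIV with a binary treatment and a
# binary instrument; all names are assumptions for this example.
import numpy as np
import pandas as pd

from ylearn.estimator_model.deepiv import DeepIV

# Synthetic IV data: z instruments the binary treatment x; u is an
# unobserved confounder of x and y.
rng = np.random.default_rng(2022)
n = 2000
z = rng.binomial(1, 0.5, size=n)
v0 = rng.normal(size=n)
u = rng.normal(size=n)
x = rng.binomial(1, 1 / (1 + np.exp(-(2 * z + u))))
y = x * (1 + v0) - 0.5 * u + rng.normal(scale=0.1, size=n)
data = pd.DataFrame({'z': z, 'v0': v0, 'x': x, 'y': y})

div = DeepIV(is_discrete_treatment=True, is_discrete_instrument=True)
div.fit(
    data,
    outcome='y',
    treatment='x',
    instrument='z',
    adjustment=['v0'],
    sample_n=2,
)
effect = div.estimate(data, treat=1, control=0)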

Scoring Estimated Causal Effects

Estimator models for estimating causal effects cannot be easily evaluated due to the fact that the true effects are not directly observed. This differs from usual machine learning tasks, whose results can be easily evaluated by, e.g., the value of a loss function.

The authors of [Schuler] proposed a framework, following a schema suggested by [Nie], to evaluate causal effects estimated by different estimator models. Roughly speaking, this framework is a direct application of the double machine learning method. Specifically, for a causal effect model ce_model() (trained on a training set) that is to be evaluated, we

  1. Train a model y_model() to estimate the outcome \(y\) and an x_model() to estimate the treatment \(x\) on a validation set, which is usually not the same as the training set;

  2. In the validation set \(D_{val}\), let \(\tilde{y}\) and \(\tilde{x}\) denote the differences

    \[\begin{split}\tilde{y} & = y - \hat{y}(v), \\ \tilde{x} & = x - \hat{x}(v)\end{split}\]

    where \(\hat{y}\) and \(\hat{x}\) are the estimated outcome and treatment on covariates \(v\) in \(D_{val}\). Furthermore, let

    \[\tau(v)\]

    denote the causal effects estimated by the ce_model() in \(D_{val}\), then the metric of the causal effect for the ce_model is calculated as

    \[E_{V}[(\tilde{y} - \tilde{x} \tau(v))^2].\]
Class Structures
class ylearn.estimator_model.effect_score.RLoss(x_model, y_model, yx_model=None, cf_fold=1, adjustment_transformer=None, covariate_transformer=None, random_state=2022, is_discrete_treatment=False, categories='auto')
Parameters:
  • x_model (estimator, optional) – Machine learning models for fitting x. Any such models should implement the fit() and predict() (also predict_proba() if x is discrete) methods.

  • y_model (estimator, optional) – The machine learning model which is trained to modeling the outcome. Any valid y_model should implement the fit() and predict() methods.

  • yx_model (estimator, optional) – Machine learning model for fitting the residual of y on the residual of x. Only linear regression models are supported in the current version.

  • cf_fold (int, default=1) – The number of folds for performing cross fit in the first stage.

  • adjustment_transformer (transformer, optional, default=None) – Transformer for adjustment variables which can be used to generate new features of adjustment variables.

  • covariate_transformer (transformer, optional, default=None) – Transformer for covariate variables which can be used to generate new features of covariate variables.

  • random_state (int, default=2022) –

  • is_discrete_treatment (bool, default=False) – If the treatment variables are discrete, set this to True.

  • categories (str, optional, default='auto') –

fit(data, outcome, treatment, adjustment=None, covariate=None, combined_treatment=True, **kwargs)

Fit the RLoss estimator model. Note that the training of an RLoss model has two stages, which we implement in _fit_1st_stage() and _fit_2nd_stage().

Parameters:
  • data (pandas.DataFrame) – Training dataset for training the estimator.

  • outcome (list of str, optional) – Names of the outcome.

  • treatment (list of str, optional) – Names of the treatment.

  • adjustment (list of str, optional, default=None) – Names of the adjustment set ensuring unconfoundedness.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • combined_treatment (bool, default=True) –

When combined_treatment is set to True, then if there are multiple treatments, we can use the combined_treatment technique to convert the multiple discrete classification tasks into a single discrete classification task. For example, if there are two different binary treatments:

    \[\begin{split}treatment_1 &: x_1 | x_1 \in \{'sleep', 'run'\}, \\ treatment_2 &: x_2 | x_2 \in \{'study', 'work'\},\end{split}\]

    then we can convert to these two binary classification tasks into a single classification with 4 different classes:

    \[treatment: x | x \in \{0, 1, 2, 3\},\]

    where, for example, 1 stands for (‘sleep’ and ‘study’).

Returns:

The fitted RLoss model for evaluating other estimator models in the validation set.

Return type:

instance of RLoss

score(test_estimator, treat=None, control=None)

Calculate the score of the given test_estimator using the metric defined above.

Parameters:
  • data (pandas.DataFrame, optional, default=None) – The test data for the estimator to evaluate the causal effect. Note that the estimator directly evaluates all quantities in the training data if data is None.

  • treat (float or numpy.ndarray, optional, default=None) – In the case of a single discrete treatment, treat should be an int or str of one of the possible treatment values, indicating the value of the intended treatment; in the case of multiple discrete treatments, treat should be a list or an ndarray where treat[i] indicates the value of the i-th intended treatment. For example, when there are multiple discrete treatments, array([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’. In the case of continuous treatment, treat should be a float or an ndarray.

  • control (float or numpy.ndarray, optional, default=None) – This is similar to the cases of treat.

Returns:

The score for the test_estimator

Return type:

float

effect_nji(data=None)

Calculate causal effects with different treatment values.

Parameters:

data (pandas.DataFrame, optional, default=None) – The test data for the estimator to evaluate the causal effect. Note that the estimator will use the training data if data is None.

Returns:

Causal effects with different treatment values.

Return type:

ndarray

comp_transormer(x, categories='auto')

Transform the discrete treatment into one-hot vectors properly.

Parameters:
  • x (numpy.ndarray, shape (n, x_d)) – An array containing the information of the treatment variables.

  • categories (str or list, optional, default='auto') –

Returns:

The transformed one-hot vectors.

Return type:

numpy.ndarray
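
A usage sketch of RLoss follows: fit it on a validation set, then score a fitted estimator model. The names val_data and est below are placeholders (a validation pandas.DataFrame and an estimator model fitted on a separate training set), and the column names are illustrative.

# An illustrative sketch of scoring a fitted estimator model with RLoss.
# val_data (a validation pandas.DataFrame with columns 'x', 'y', 'v0', 'v1')
# and est (an estimator model fitted on a separate training set) are
# placeholders for this example.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

from ylearn.estimator_model.effect_score import RLoss

rloss = RLoss(
    x_model=RandomForestClassifier(),
    y_model=RandomForestRegressor(),
    cf_fold=1,
    is_discrete_treatment=True,
)
rloss.fit(val_data, outcome='y', treatment='x', covariate=['v0', 'v1'])
score = rloss.score(est)  # a smaller score indicates better estimated effects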

The evaluations of

\[\mathbb{E}[F_2(x_1, W, \eta) - F_2(x_0, W, \eta)]\]

in ATE and

\[\mathbb{E}[F_2(x_1, W, V, \eta) - F_2(x_0, W, V, \eta)]\]

in CATE will be the tasks of various suitable estimator models in YLearn. The concept EstimatorModel in YLearn is designed for this purpose.

A typical EstimatorModel should have the following structure:

class BaseEstModel:
    """
    Base class for various estimator model.

    Parameters
    ----------
    random_state : int, default=2022
    is_discrete_treatment : bool, default=False
        Set this to True if the treatment is discrete.
    is_discrete_outcome : bool, default=False
        Set this to True if the outcome is discrete.
    categories : str, optional, default='auto'

    """
    def fit(
        self,
        data,
        outcome,
        treatment,
        **kwargs,
    ):
        """Fit the estimator model.

        Parameters
        ----------
        data : pandas.DataFrame
            The dataset used for training the model

        outcome : str or list of str, optional
            Names of the outcome variables

        treatment : str or list of str
            Names of the treatment variables

        Returns
        -------
        instance of BaseEstModel
            The fitted estimator model.
        """

    def estimate(
        self,
        data=None,
        quantity=None,
        **kwargs
    ):
        """Estimate the causal effect.

        Parameters
        ----------
        data : pd.DataFrame, optional
            The test data for the estimator to evaluate the causal effect.
            Note that the estimator directly evaluates all quantities in the
            training data if data is None. By default None.

        quantity : str, optional
            The possible values of quantity include:
                'CATE' : the estimator will evaluate the CATE;
                'ATE' : the estimator will evaluate the ATE;
                None : the estimator will evaluate the ITE or CITE, by default None

        Returns
        -------
        ndarray
            The estimated causal effect with the type of the quantity.
        """

    def effect_nji(self, data=None, *args, **kwargs):
        """Return causal effects for all possible values of treatments.

        Parameters
        ----------
        data : pd.DataFrame, optional
            The test data for the estimator to evaluate the causal effect.
            Note that the estimator directly evaluates all quantities in the
            training data if data is None. By default None.
        """

Causal Discovery: Exploring the Causal Structures in Data

A fundamental task in causal learning is to find the underlying causal relationships, the so-called “causal structures”, and apply them. Traditionally, these relationships could be revealed by designing randomized experiments or imposing interventions; however, such methods can be too expensive or even infeasible. Therefore, many techniques that analyze causal structures directly from observational data, e.g., the PC algorithm (see [Spirtes2001]), have been proposed in recent years. These techniques are collectively known as causal discovery.

The current version of YLearn implements a score-based method for causal discovery [Zheng2018]. More methods will be added in later versions.

No-Tears

The problem of revealing the structures of directed acyclic graphs (DAGs) can be solved by formulating a continuous optimization problem over real matrices with a constraint enforcing acyclicity [Zheng2018]. Specifically, for a given vector \(x \in \mathbb{R}^d\) such that there exists a matrix \(W\) which satisfies \(x = Wx + \eta\) for some noise vector \(\eta \in \mathbb{R}^d\), the optimization problem can be summarized as follows:

\[\begin{split}\min_{W \in \mathbb{R}^{d\times d}} & F(W) \\ s.t. \quad & h(W) = 0,\end{split}\]

where \(F(W)\) is a continuous function measuring \(\|x - Wx\|\) and

\[h(W) = tr\left( e^{W \circ W} \right) - d,\]

where \(\circ\) denotes the Hadamard product, equals zero if and only if \(W\) describes a DAG. This optimization problem can then be solved with standard techniques such as gradient descent.

The YLearn class for the NO-TEARS algorithm is CausalDiscovery.
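As a quick numerical illustration (a sketch, not YLearn internals), the acyclicity function \(h(W)\) vanishes for a DAG adjacency matrix and becomes strictly positive once a cycle appears:

import numpy as np
from scipy.linalg import expm

def notears_h(W):
    """h(W) = tr(exp(W ∘ W)) - d, zero iff W describes a DAG."""
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # W * W is the Hadamard product

dag = np.array([[0.0, 1.5],     # single edge 0 -> 1: acyclic
                [0.0, 0.0]])
cyclic = np.array([[0.0, 1.5],  # edges 0 -> 1 and 1 -> 0: a 2-cycle
                   [0.8, 0.0]])

print(notears_h(dag))     # 0.0
print(notears_h(cyclic))  # strictly positive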

Policy: Selecting the Best Option

In tasks such as policy evaluation (see, e.g., [Athey2020]), besides the causal effects we may also be interested in questions such as whether an example should be assigned a treatment and, if so, which treatment option is the best among all possible values. YLearn implements PolicyTree for this purpose. Given a trained estimator model or estimated causal effects, it finds the optimal policy for each example by building a decision tree that aims to maximize the causal effect of each example.

The criterion for training the tree is

\[S = \sum_i\sum_k g_{ik}e_{ki}\]

where \(g_{ik} = \phi(v_i)_k\) with \(\phi: \mathbb{R}^D \to \mathbb{R}^K\) being a map from \(v_i\in \mathbb{R}^D\) to a basis vector with only one nonzero element in \(\mathbb{R}^K\) and \(e_{ki}\) denotes the causal effect of taking the \(k\)-th value of the treatment for example \(i\).
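For intuition (a sketch, not the tree-building algorithm itself), maximizing \(S\) amounts to assigning each example the treatment index with the largest estimated effect; the policy tree then approximates this assignment with interpretable decision rules over the covariates.

import numpy as np

# Hypothetical estimated effects e[i, k] for n = 2 examples, K = 3 treatments.
effects = np.array([[ 0.2, -0.1,  0.5],
                    [-0.3,  0.4,  0.1]])

optimal_index = effects.argmax(axis=1)   # best treatment per example: [2, 1]
optimal_value = effects.max(axis=1)      # effect achieved under this policy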

See also

BaseDecisionTree in sklearn.

Note that one can use the PolicyInterpreter to interpret the result of a policy model.

Class Structures

class ylearn.policy.policy_model.PolicyTree(*, criterion='policy_reg', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=2022, max_leaf_nodes=None, max_features=None, min_impurity_decrease=0.0, ccp_alpha=0.0, min_weight_fraction_leaf=0.0)
Parameters:
  • criterion ({'policy_reg'}, default='policy_reg') –

    The function to measure the quality of a split. The criterion for training the tree is (in the Einstein notation)

    \[S = \sum_i g_{ik} e^k_{i},\]

    where \(g_{ik} = \phi(v_i)_k\) is a map from the covariates \(v_i\) to a basis vector with exactly one nonzero element in \(\mathbb{R}^K\). By using this criterion, the model aims to find the index of the treatment that yields the maximum causal effect, i.e., the optimal policy.

  • splitter ({"best", "random"}, default="best") – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

  • max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split (int or float, default=2) – The minimum number of samples required to split an internal node: - If int, then consider min_samples_split as the minimum number. - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=None) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • min_impurity_decrease (float, default=0.0) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

fit(data, covariate, *, effect=None, effect_array=None, est_model=None, sample_weight=None)

Fit the PolicyTree model to find the optimal policy for the causal effect estimated by the est_model on data. There are several options for passing the causal effects, which usually form an array of shape (n, j, i), where n is the number of examples, j is the dimension of the outcome, and i is the number of possible treatment values or the dimension of the treatment:

  1. Only pass est_model. Then est_model will be used to generate the causal effects.

  2. Only pass effect_array, which will be used as the causal effects; effect and est_model will be ignored.

  3. Only pass effect. This usually is a list of names of columns in data, which will then be used as the causal effects for training the model.

Parameters:
  • data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the PolicyTree to fit.

  • est_model (estimator_model) – est_model should be any valid estimator model of ylearn which was already fitted and can estimate the CATE. If effect=None and effect_array=None, then est_model can not be None and the causal effect will be estimated by the est_model.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • effect (list of str, optional, default=None) – Names of the causal effect in data. If effect_array is not None, then effect will be ignored.

  • effect_array (numpy.ndarray, default=None) – The causal effect array to be fitted by the PolicyTree. If this is not provided and est_model is None, then effect can not be None.

Returns:

Fitted PolicyTree

Return type:

instance of PolicyTree

predict_ind(data=None)

Estimate the optimal policy for the causal effects of the treatment on the outcome in the data, i.e., return the index of the optimal treatment.

Parameters:

data (pandas.DataFrame, optional, default=None) – The test data in the form of a DataFrame. If None, the data used for training will be used.

Returns:

The index of the optimal treatment for each example.

Return type:

ndarray or int, optional

predict_opt_effect(data=None)

Estimate the value of the optimal policy for the causal effects of the treatment on the outcome in the data, i.e., return the value of the causal effects when taking the optimal treatment.

Parameters:

data (pandas.DataFrame, optional, default=None) – The test data in the form of a DataFrame. If None, the data used for training will be used.

Returns:

The estimated causal effect with the optimal treatment value.

Return type:

ndarray or float, optional

apply(*, v=None, data=None)

Return the index of the leaf that each sample is predicted as.

Parameters:
  • v (numpy.ndarray, default=None) – The input samples as an ndarray. If None, then the DataFrame data will be used as the input samples.

  • data (pandas.DataFrame, default=None) – The input samples. The data must contain columns of the covariates used for training the model. If None, the training data will be passed as input samples.

Returns:

For each datapoint v_i in v, return the index of the leaf v_i ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.

Return type:

v_leaves : array-like of shape (n_samples, )

decision_path(*, v=None, data=None)

Return the decision path.

Parameters:
  • v (numpy.ndarray, default=None) – The input samples as an ndarray. If None, then the DataFrame data will be used as the input samples.

  • data (pandas.DataFrame, default=None) – The input samples. The data must contain columns of the covariates used for training the model. If None, the training data will be passed as input samples.

Returns:

Return a node indicator CSR matrix, where nonzero elements indicate that the sample goes through the corresponding nodes.

Return type:

indicator : sparse matrix of shape (n_samples, n_nodes)

get_depth()

Return the depth of the policy tree. The depth of a tree is the maximum distance between the root and any leaf.

Returns:

The maximum depth of the tree.

Return type:

int

get_n_leaves()

Return the number of leaves of the policy tree.

Returns:

Number of leaves

Return type:

int

property feature_importance

Return the feature importances. The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance. Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance() as an alternative.

Returns:

Normalized total reduction of criteria by feature (Gini importance).

Return type:

ndarray of shape (n_features,)

property n_features_
Returns:

number of features

Return type:

int

plot(*, feature_names=None, max_depth=None, class_names=None, label='all', filled=False, node_ids=False, proportion=False, rounded=False, precision=3, ax=None, fontsize=None)

Plot the PolicyTree. The sample counts that are shown are weighted with any sample_weights that might be present. The visualization is fit automatically to the size of the axis. Use the figsize or dpi arguments of plt.figure to control the size of the rendering.

Returns:

List containing the artists for the annotation boxes making up the tree.

Return type:

annotations : list of artists
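Putting the pieces together, here is a hedged usage sketch based on the API above, where df is assumed to be a DataFrame containing covariate columns 'v1' and 'v2', and effects a causal-effect array of shape (n, j, i):

from ylearn.policy.policy_model import PolicyTree

ptree = PolicyTree(max_depth=3)
ptree.fit(data=df, covariate=['v1', 'v2'], effect_array=effects)

best_index = ptree.predict_ind(df)           # optimal treatment per example
best_effect = ptree.predict_opt_effect(df)   # effect under the optimal policy
ptree.plot(feature_names=['v1', 'v2'])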

Interpreter: Explaining the Causal Effects

To interpret the causal effects estimated by various estimator models, the current version of YLearn implements two tree-based models: CEInterpreter for interpreting causal effects and PolicyInterpreter for interpreting policy evaluations.

CEInterpreter

For the CATE \(\tau(v)\) estimated by an estimator model, e.g., double machine learning model, CEInterpreter interprets the results by building a decision tree to model the relationships between \(\tau(v)\) and the covariates \(v\). Then one can use the decision rules of the fitted tree model to analyze \(\tau(v)\).
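A brief usage sketch following the API documented below, assuming est is an already fitted YLearn estimator model that can estimate the CATE, and test_data a DataFrame containing the covariates:

from ylearn.effect_interpreter.ce_interpreter import CEInterpreter

cei = CEInterpreter(max_depth=2)
cei.fit(data=test_data, est_model=est)

rules = cei.interpret(data=test_data)  # interpreted results for all examples
cei.plot()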

Class Structures

class ylearn.effect_interpreter.ce_interpreter.CEInterpreter(*, criterion='squared_error', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=2022, max_leaf_nodes=None, max_features=None, min_impurity_decrease=0.0, min_weight_fraction_leaf=0.0, ccp_alpha=0.0, categories='auto')
Parameters:
  • criterion ({"squared_error", "friedman_mse", "absolute_error", "poisson"}, default="squared_error") – The function to measure the quality of a split. Supported criteria are “squared_error” for the mean squared error, which is equal to variance reduction as feature selection criterion and minimizes the L2 loss using the mean of each terminal node, “friedman_mse”, which uses mean squared error with Friedman’s improvement score for potential splits, “absolute_error” for the mean absolute error, which minimizes the L1 loss using the median of each terminal node, and “poisson” which uses reduction in Poisson deviance to find splits.

  • splitter ({"best", "random"}, default="best") – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

  • max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split (int or float, default=2) – The minimum number of samples required to split an internal node: - If int, then consider min_samples_split as the minimum number. - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=None) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • min_impurity_decrease (float, default=0.0) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

fit(data, est_model, **kwargs)

Fit the CEInterpreter model to interpret the causal effect estimated by the est_model on data.

Parameters:
  • data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the CEInterpreter to fit.

  • est_model (estimator_model) – est_model should be any valid estimator model of ylearn which was already fitted and can estimate the CATE.

Returns:

Fitted CEInterpreter

Return type:

instance of CEInterpreter

interpret(*, v=None, data=None)

Interpret the fitted model in the test data.

Parameters:
  • v (numpy.ndarray, optional, default=None) – The test covariates in the form of ndarray. If this is given, then data will be ignored and the model will use this as the test data.

  • data (pandas.DataFrame, optional, default=None) – The test data in the form of the DataFrame. The model will only use this if v is set as None. In this case, if data is also None, then the data used for training will be used.

Returns:

The interpreted results for all examples.

Return type:

dict

plot(*, feature_names=None, max_depth=None, class_names=None, label='all', filled=False, node_ids=False, proportion=False, rounded=False, precision=3, ax=None, fontsize=None)

Plot the fitted tree model. The sample counts that are shown are weighted with any sample_weights that might be present. The visualization is fit automatically to the size of the axis. Use the figsize or dpi arguments of plt.figure to control the size of the rendering.

Returns:

List containing the artists for the annotation boxes making up the tree.

Return type:

annotations : list of artists

PolicyInterpreter

PolicyInterpreter can be used to interpret the policy returned by an instance of PolicyTree. By assigning different strategies to different examples, it aims to maximize the causal effects of a subgroup and separate them from those with negative causal effects.
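A usage sketch following the API documented below, assuming est is an already fitted YLearn estimator model and test_data a DataFrame containing the covariates used by est:

from ylearn.interpreter.policy_interpreter import PolicyInterpreter

pit = PolicyInterpreter(max_depth=2)
pit.fit(data=test_data, est_model=est)

result = pit.interpret(data=test_data)  # interpreted policy for all examples
pit.plot()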

Class Structures

class ylearn.interpreter.policy_interpreter.PolicyInterpreter(*, criterion='policy_reg', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, random_state=2022, max_leaf_nodes=None, max_features=None, min_impurity_decrease=0.0, ccp_alpha=0.0, min_weight_fraction_leaf=0.0)
Parameters:
  • criterion ({'policy_reg'}, default='policy_reg') –

    The function to measure the quality of a split. The criterion for training the tree is (in the Einstein notation)

    \[S = \sum_i g_{ik} e^k_{i},\]

    where \(g_{ik} = \phi(v_i)_k\) is a map from the covariates \(v_i\) to a basis vector with exactly one nonzero element in \(\mathbb{R}^K\). By using this criterion, the model aims to find the index of the treatment that yields the maximum causal effect, i.e., the optimal policy.

  • splitter ({"best", "random"}, default="best") – The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

  • max_depth (int, default=None) – The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

  • min_samples_split (int or float, default=2) – The minimum number of samples required to split an internal node: - If int, then consider min_samples_split as the minimum number. - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

  • min_samples_leaf (int or float, default=1) –

    The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

    • If int, then consider min_samples_leaf as the minimum number.

    • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

  • min_weight_fraction_leaf (float, default=0.0) – The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

  • max_features (int, float or {"sqrt", "log2"}, default=None) –

    The number of features to consider when looking for the best split:

    • If int, then consider max_features features at each split.

    • If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

    • If “sqrt”, then max_features=sqrt(n_features).

    • If “log2”, then max_features=log2(n_features).

    • If None, then max_features=n_features.

  • random_state (int) – Controls the randomness of the estimator.

  • max_leaf_nodes (int, default=None) – Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, the number of leaf nodes is unlimited.

  • min_impurity_decrease (float, default=0.0) –

    A node will be split if this split induces a decrease of the impurity greater than or equal to this value. The weighted impurity decrease equation is the following

    N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)

    where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child. N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

fit(data, est_model, *, covariate=None, effect=None, effect_array=None)

Fit the PolicyInterpreter model to interpret the policy for the causal effect estimated by the est_model on data.

Parameters:
  • data (pandas.DataFrame) – The input samples for the est_model to estimate the causal effects and for the PolicyInterpreter to fit.

  • est_model (estimator_model) – est_model should be any valid estimator model of ylearn which was already fitted and can estimate the CATE.

  • covariate (list of str, optional, default=None) – Names of the covariate.

  • effect (list of str, optional, default=None) – Names of the causal effect in data. If effect_array is not None, then effect will be ignored.

  • effect_array (numpy.ndarray, default=None) – The causal effect array to be interpreted by the PolicyInterpreter. If this is not provided, then effect can not be None.

Returns:

Fitted PolicyInterpreter

Return type:

instance of PolicyInterpreter

interpret(*, data=None)

Interpret the fitted model in the test data.

Parameters:

data (pandas.DataFrame, optional, default=None) – The test data in the form of a DataFrame. If None, the data used for training will be used.

Returns:

The interpreted results for all examples.

Return type:

dict

plot(*, feature_names=None, max_depth=None, class_names=None, label='all', filled=False, node_ids=False, proportion=False, rounded=False, precision=3, ax=None, fontsize=None)

Plot the tree model. The sample counts that are shown are weighted with any sample_weights that might be present. The visualization is fit automatically to the size of the axis. Use the figsize or dpi arguments of plt.figure to control the size of the rendering.

Returns:

List containing the artists for the annotation boxes making up the tree.

Return type:

annotations : list of artists

Why: An All-in-One Causal Learning API

Want to use YLearn in a much easier way? Try the all-in-one API Why!

Why is an API which encapsulates almost everything in YLearn, such as identifying causal effects and scoring a trained estimator model. It provides users with a simple and efficient way to use our package: one can directly pass the data, the only thing needed, into Why and call its various methods, rather than learning multiple concepts such as adjustment sets before being able to find the interesting information hidden in the data. Why is designed to support the full pipeline of causal inference: given data, it first tries to discover the causal graph if one is not provided, then attempts to find possible variables as treatments and identify the causal effects, after which a suitable estimator model is trained to estimate the causal effects, and, finally, a policy is evaluated to suggest the best option for each individual.

[Figure: flow.png]

Why can help almost every part of the whole pipeline of causal inference.

Example usages

In this chapter, we use the california_housing dataset to show how to use Why. We prepare the dataset with the code below:

from sklearn.datasets import fetch_california_housing

# Load the dataset as a DataFrame and name the outcome after the target.
housing = fetch_california_housing(as_frame=True)
data = housing.frame
outcome = housing.target_names[0]
data[outcome] = housing.target

The variable data is our prepared dataset.

Fit Why with default settings

The simplest way to use Why is to create a Why instance with default settings and fit it with only the training data and the outcome name.

from ylearn import Why

why = Why()
why.fit(data, outcome)

print('identified treatment:',why.treatment_)
print('identified adjustment:',why.adjustment_)
print('identified covariate:',why.covariate_)
print('identified instrument:',why.instrument_)

print(why.causal_effect())

Outputs:

identified treatment: ['MedInc', 'HouseAge']
identified adjustment: None
identified covariate: ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
identified instrument: None
              mean       min       max       std
MedInc    0.411121 -0.198831  1.093134  0.064856
HouseAge -0.000385 -0.039162  0.114263  0.005845

Fit Why with customized treatments

We can fit Why with the treatment argument to specify the desired features as treatments.

from ylearn import Why

why = Why()
why.fit(data, outcome, treatment=['AveBedrms', ])

print('identified treatment:',why.treatment_)
print('identified adjustment:',why.adjustment_)
print('identified covariate:',why.covariate_)
print('identified instrument:',why.instrument_)

print(why.causal_effect())

Outputs:

identified treatment: ['AveBedrms']
identified adjustment: None
identified covariate: ['MedInc', 'HouseAge', 'AveRooms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
identified instrument: None
               mean       min        max       std
AveBedrms  0.197422 -0.748971  10.857963  0.169682

Identify treatment without fitting Why

We can call Why’s identify method to identify the treatment, adjustment, covariate and instrument without fitting it.

why = Why()
r=why.identify(data, outcome)

print('identified treatment:',r[0])
print('identified adjustment:',r[1])
print('identified covariate:',r[2])
print('identified instrument:',r[3])

Outputs:

identified treatment: ['MedInc', 'HouseAge']
identified adjustment: None
identified covariate: ['AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
identified instrument: None

Class Structures

class ylearn._why.Why(discrete_outcome=None, discrete_treatment=None, identifier='auto', identifier_options=None, estimator='auto', estimator_options=None, fn_cost=None, effect_name='effect', random_state=None)

An all-in-one API for causal learning.

Parameters:
  • discrete_outcome (bool, default=None) – If True, force the outcome as discrete; If False, force the outcome as continuous; If None, inferred from outcome.

  • discrete_treatment (bool, default=None) – If True, force the treatment variables as discrete; if False, force the treatment variables as continuous; if None, inferred from the first treatment.

  • identifier (str or Identifier, default='auto') – If str, available options: ‘auto’, ‘discovery’, ‘gcastle’ or ‘pgm’.

  • identifier_options (dict, optional, default=None) – Parameters (key-values) to initialize the identifier

  • estimator (str, optional, default='auto') – Name of a valid EstimatorModel. One can also pass an instance of a valid estimator model.

  • estimator_options (dict, optional, default=None) – Parameters (key-values) to initialize the estimator model

  • fn_cost (callable, optional, default=None) – Cost function, used to readjust the causal effect based on cost.

  • effect_name (str, default='effect') – The column name in the argument DataFrame passed to fn_cost. Effective when fn_cost is not None.

  • random_state (int, optional, default=None) – Random state seed

feature_names_in_

list of feature names seen during fit

outcome_

name of outcome

treatment_

list of treatment names identified during fit

adjustment_

list of adjustment names identified during fit

covariate_

list of covariate names identified during fit

instrument_

list of instrument names identified during fit

identifier_

identifier object or None. Used to identify treatment/adjustment/covariate/instrument if they were not specified during fit

y_encoder_

LabelEncoder object or None. Used to encode outcome if it is discrete.

preprocessor_

Pipeline object to preprocess data during fit

estimators_

estimators dict for each treatment where key is the treatment name and value is the EstimatorModel object

fit(data, outcome, *, treatment=None, adjustment=None, covariate=None, instrument=None, treatment_count_limit=None, copy=True, **kwargs)

Fit the Why object, steps:

  1. encode outcome if its dtype is not numeric

  2. identify treatment and adjustment/covariate/instrument

  3. encode treatment if discrete_treatment is True

  4. preprocess data

  5. fit causal estimators

Parameters:
  • data (pandas.DataFrame, required) – Training dataset.

  • outcome (str, required) – Name of the outcome.

  • treatment (str or list of str, optional) – Names of the treatment. If str, it will be split into a list by comma; if None, identified by the identifier.

  • adjustment (list of str, optional, default=None) – Names of the adjustment. Identified by identifier if adjustment/covariate/instrument are all None.

  • covariate (list of str, optional, default=None) – Names of the covariate. Identified by identifier if adjustment/covariate/instrument are all None.

  • instrument (list of str, optional, default=None) – Names of the instrument. Identified by identifier if adjustment/covariate/instrument are all None.

  • treatment_count_limit (int, optional) – Maximum number of treatments; default min(5, 10% of the number of features).

  • copy (bool, default=True) – Set False to perform inplace transforming and avoid a copy of data.

Returns:

The fitted Why.

Return type:

instance of Why

identify(data, outcome, *, treatment=None, adjustment=None, covariate=None, instrument=None, treatment_count_limit=None)

Identify treatment and adjustment/covariate/instrument without fitting Why.

Parameters:
  • data (pandas.DataFrame, required) – Training dataset.

  • outcome (str, required) – Name of the outcome.

  • treatment (str or list of str, optional) – Names of the treatment. If str, it will be split into a list by comma; if None, identified by the identifier.

  • adjustment (list of str, optional, default=None) – Names of the adjustment. Identified by identifier if adjustment/covariate/instrument are all None.

  • covariate (list of str, optional, default=None) – Names of the covariate. Identified by identifier if adjustment/covariate/instrument are all None.

  • instrument (list of str, optional, default=None) – Names of the instrument. Identified by identifier if adjustment/covariate/instrument are all None.

  • treatment_count_limit (int, optional) – Maximum number of treatments; default min(5, 10% of the number of features).

Returns:

tuple of identified treatment, adjustment, covariate, instrument

Return type:

tuple

causal_graph()

Get identified causal graph.

Returns:

Identified causal graph

Return type:

instance of CausalGraph

causal_effect(test_data=None, treatment=None, treat=None, control=None, target_outcome=None, quantity='ATE', return_detail=False, **kwargs)

Estimate the causal effect.

Parameters:
  • test_data (pandas.DataFrame, optional) – The test data to evaluate the causal effect. If None, the training data is used.

  • treatment (str or list, optional) – Treatment names; should be a subset of the attribute treatment_. Default: all elements in treatment_.

  • treat (treatment value or list or ndarray or pandas.Series, default None) – In the case of single discrete treatment, treat should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, treat should be a list where treat[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, treat should be a float or a ndarray or pandas.Series, by default None

  • control (treatment value or list or ndarray or pandas.Series, default None) – This is similar to the cases of treat, by default None

  • target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.

  • quantity (str, optional, default='ATE') – ‘ATE’ or ‘ITE’.

  • return_detail (bool, default False) – If True, return effect details in result.

  • kwargs (dict, optional) – Other options to call estimator.estimate().

Returns:

causal effect of each treatment. When quantity=’ATE’, the result DataFrame columns are:
  • mean: mean of causal effect,

  • min: minimum of causal effect,

  • max: maximum of causal effect,

  • detail (if return_detail is True): causal effect ndarray;

in the case of discrete treatment, the result DataFrame indices are a multiindex of (treatment name, treat_vs_control); in the case of continuous treatment, the indices are treatment names. When quantity=’ITE’, the result DataFrame contains the individual causal effects of each treatment: in the case of discrete treatment, the columns are a multiindex of (treatment name, treat_vs_control); in the case of continuous treatment, the columns are treatment names.

Return type:

pandas.DataFrame
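For example, with the why instance fitted on the housing data above, the per-individual effects of a single treatment can be requested as follows (a sketch; the exact values depend on the fitted estimators):

ite = why.causal_effect(test_data=data, treatment='MedInc', quantity='ITE')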

individual_causal_effect(test_data, control=None, target_outcome=None)

Estimate the causal effect for each individual.

Parameters:
  • test_data (pandas.DataFrame, required) – The test data to evaluate the causal effect.

  • control (treatment value or list or ndarray or pandas.Series, default None) – In the case of single discrete treatment, control should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatments, control should be a list where control[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the control value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, control should be a float or an ndarray or pandas.Series, by default None

  • target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.

Returns:

individual causal effect of each treatment. The result DataFrame columns are the treatment names; In the case of discrete treatment, the result DataFrame indices are multiindex of (individual index in test_data, treatment name and treat_vs_control); in the case of continuous treatment, the result DataFrame indices are multiindex of (individual index in test_data, treatment name).

Return type:

pandas.DataFrame

whatif(test_data, new_value, treatment=None)

Get counterfactual predictions when treatment is changed to new_value from its observational counterpart.

Parameters:
  • test_data (pandas.DataFrame, required) – The test data to predict.

  • new_value (ndarray or pd.Series, required) – It should have the same length as test_data.

  • treatment (str, default=None) – Treatment name. It should be one of the elements of the fitted attribute treatment_. If None, the first element in treatment_ is used.

Returns:

The counterfactual prediction

Return type:

pandas.Series
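For instance, on the housing data fitted above, one could ask what the predicted outcome would be if every district’s HouseAge were larger by 5 years (a sketch; HouseAge must be among the identified treatments):

new_age = data['HouseAge'] + 5
prediction = why.whatif(data, new_age, treatment='HouseAge')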

score(test_data=None, treat=None, control=None, scorer='auto')

Score the fitted estimator models.

Parameters:
  • test_data (pandas.DataFrame, optional, default=None) – The test data to score.

  • treat (treatment value or list or ndarray or pandas.Series, default None) – In the case of single discrete treatment, treat should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, treat should be a list where treat[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the treat value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, treat should be a float or a ndarray or pandas.Series, by default None

  • control (treatment value or list or ndarray or pandas.Series) – This is similar to the cases of treat, by default None

  • scorer (str, default 'auto') – Reserved.

Returns:

Score of the estimator models

Return type:

float
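A minimal call sketch, assuming why was fitted as in the earlier examples (treat and control keep their defaults here since the housing treatments are continuous):

s = why.score(test_data=data)
print(s)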

policy_interpreter(test_data, treatment=None, control=None, target_outcome=None, **kwargs)

Get the policy interpreter

Parameters:
  • test_data (pandas.DataFrame, required) – The test data to evaluate.

  • treatment (str or list, optional) – Treatment names; should contain one or two elements. Default: the first two elements in the attribute treatment_.

  • control (treatment value or list or ndarray or pandas.Series) – In the case of single discrete treatment, control should be an int or str of one of all possible treatment values which indicates the value of the intended treatment; in the case of multiple discrete treatment, control should be a list where control[i] indicates the value of the i-th intended treatment, for example, when there are multiple discrete treatments, list([‘run’, ‘read’]) means the control value of the first treatment is taken as ‘run’ and that of the second treatment is taken as ‘read’; in the case of continuous treatment, control should be a float or a ndarray or pandas.Series, by default None

  • target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.

  • kwargs (dict) – options to initialize the PolicyInterpreter.

Returns:

The fitted instance of PolicyInterpreter.

Return type:

instance of PolicyInterpreter

uplift_model(test_data, treatment=None, treat=None, control=None, target_outcome=None, name=None, random=None)

Get uplift model over one treatment.

Parameters:
  • test_data (pandas.DataFrame, required) – The test data to evaluate.

  • treatment (str or list, optional) – Treatment name. If str, it should be one of the fitted attribute treatment_. If None, the first element in the attribute treatment_ is used.

  • treat (treatment value, optional) – If None, the last element in the treatment encoder’s attribute classes_ is used.

  • control (treatment value, optional) – If None, the first element in the treatment encoder’s attribute classes_ is used.

  • target_outcome (outcome value, optional) – Only effective when the outcome is discrete. Default the last one in attribute y_encoder_.classes_.

  • name (str) – Lift name. If None, treat value is used.

  • random (str, default=None) – Lift name for the randomly generated data; if None, no random lift is generated.

Returns:

The fitted instance of UpliftModel.

Return type:

instance of UpliftModel

plot_causal_graph()

Plot the causal graph.

plot_policy_interpreter(test_data, treatment=None, control=None, **kwargs)

Plot the interpreter.

Returns:

The fitted instance of PolicyInterpreter.

Return type:

instance of PolicyInterpreter

References

[Pearl]
  J. Pearl. Causality: Models, Reasoning, and Inference.

[Shpitser2006]
  I. Shpitser and J. Pearl. Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models.

[Neal2020]
  B. Neal. Introduction to Causal Inference.

[Funk2010]
  M. J. Funk, et al. Doubly Robust Estimation of Causal Effects.

[Chern2016]
  V. Chernozhukov, et al. Double Machine Learning for Treatment and Causal Parameters. arXiv:1608.00060.

[Athey2015]
  S. Athey and G. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. arXiv:1504.01132.

[Schuler]
  A. Schuler, et al. A comparison of methods for model selection when estimating individual treatment effects. arXiv:1804.05146.

[Nie]
  X. Nie, et al. Quasi-Oracle estimation of heterogeneous treatment effects. arXiv:1712.04912.

[Hartford]
  J. Hartford, et al. Deep IV: A Flexible Approach for Counterfactual Prediction. ICML 2017.

[Newey2002]
  W. Newey and J. Powell. Instrumental Variable Estimation of Nonparametric Models. Econometrica 71, no. 5 (2003): 1565–78.

[Kunzel2019]
  S. Künzel, et al. Meta-Learners for Estimating Heterogeneous Treatment Effects using Machine Learning.

[Angrist1996]
  J. Angrist, et al. Identification of causal effects using instrumental variables. Journal of the American Statistical Association.

[Athey2020]
  S. Athey and S. Wager. Policy Learning with Observational Data. arXiv:1702.02896.

[Spirtes2001]
  P. Spirtes, et al. Causation, Prediction, and Search.

[Zheng2018]
  X. Zheng, et al. DAGs with NO TEARS: Continuous Optimization for Structure Learning. arXiv:1803.01422.

[Athey2018]
  S. Athey, et al. Generalized Random Forests. arXiv:1610.01271.
