causal_agent.synthetic package

class causal_agent.synthetic.PSMGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: ObservationalDataGenerator

Generate synthetic data for Propensity Score Matching (PSM)

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

test_data(print_=False)[source]: Test the generated data

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

generate_data()

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.PSWGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: ObservationalDataGenerator

Generate synthetic data for Propensity Score Weighting (PSW)

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

test_data(print_=False)[source]: Test the generated data

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

generate_data()

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.IVGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]

Bases: DataGenerator

Generate synthetic data for Instrumental Variables (IV) analysis. Two designs are supported:

Encouragement Design:
Z -> D -> Y
In this setting, the encouragement (Z) is randomized. For instance, consider vaccine administration. We cannot force people to take a vaccine, but we can encourage them to take it. We could run a vaccine awareness campaign in which we randomly pick participants and inform them about the benefits of the vaccine. An encouraged participant can either comply (take the vaccine) or not comply (not take the vaccine). Likewise, in the control group, a participant can comply (not take the vaccine) or defy (take the vaccine).

Classical Setting:
       U
     /   \
Z -> D -> Y
This is the classical setting, in which an unobserved confounder (U) affects both treatment (D) and outcome (Y).

Additional Attributes:

alpha (float): the effect of the instrument on the treatment (Z on D)
encouragement (bool): whether or not this is an encouragement design
beta_d (float): effect of the unobserved confounder (U) on treatment (D)
beta_y (float): effect of the unobserved confounders (U) on outcome (Y)
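The classical setting can be illustrated with a minimal sketch. It assumes a linear model with a continuous treatment and standard normal noise, which may differ from what IVGenerator actually does; simulate_iv and covariance are hypothetical helpers, not part of causal_agent. The parameter names mirror the attributes above.

```python
import random


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def simulate_iv(n, alpha=0.5, beta_d=1.0, beta_y=1.5, true_effect=1.0, seed=111):
    """Draw (Z, D, Y) from an assumed linear model with an unobserved confounder U."""
    rng = random.Random(seed)
    z, d, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)               # unobserved confounder U
        z_i = 1 if rng.random() < 0.5 else 0  # randomized instrument Z
        d_i = alpha * z_i + beta_d * u + rng.gauss(0.0, 1.0)
        y_i = true_effect * d_i + beta_y * u + rng.gauss(0.0, 1.0)
        z.append(z_i)
        d.append(d_i)
        y.append(y_i)
    return z, d, y


z, d, y = simulate_iv(20000)
# Wald / IV ratio: uses only the variation in D that is induced by Z.
iv_estimate = covariance(z, y) / covariance(z, d)
# Naive regression slope of Y on D: biased because U drives both D and Y.
ols_estimate = covariance(d, y) / covariance(d, d)
print(f"IV estimate: {iv_estimate:.2f}, naive OLS estimate: {ols_estimate:.2f}")
```

Under these assumptions the IV ratio should land near true_effect, while the naive slope is pulled upward by the confounder.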

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.RDDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]

Bases: DataGenerator

Generate synthetic data for (sharp) Regression Discontinuity Design (RDD).

Additional Attributes:

cutoff (float): the cutoff for treatment assignment
bandwidth (float): the bandwidth for the running variable we consider when estimating the treatment effects
plot (bool): whether we plot the data or not
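How cutoff and bandwidth interact can be shown with a minimal sketch of a sharp design. It assumes a uniform running variable, a linear outcome, and a local linear fit on each side of the cutoff; none of this is confirmed to match RDDGenerator, and all function names are hypothetical. The bandwidth used here (1.0) is chosen for the illustration and is not the class default.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def simulate_sharp_rdd(n, cutoff=10.0, true_effect=1.0, seed=111):
    """Treatment is a deterministic function of the running variable."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(cutoff - 5.0, cutoff + 5.0)  # running variable
        treated = 1 if x >= cutoff else 0            # sharp assignment rule
        y = 0.5 * x + true_effect * treated + rng.gauss(0.0, 0.5)
        rows.append((x, treated, y))
    return rows


def estimate_rdd(rows, cutoff, bandwidth):
    """Difference of the two local linear fits, evaluated at the cutoff."""
    left = [(x, y) for x, _, y in rows if cutoff - bandwidth <= x < cutoff]
    right = [(x, y) for x, _, y in rows if cutoff <= x <= cutoff + bandwidth]
    a_left, b_left = fit_line(*zip(*left))
    a_right, b_right = fit_line(*zip(*right))
    return (a_right + b_right * cutoff) - (a_left + b_left * cutoff)


rdd_rows = simulate_sharp_rdd(20000)
rdd_estimate = estimate_rdd(rdd_rows, cutoff=10.0, bandwidth=1.0)
print(f"Estimated jump at the cutoff: {rdd_estimate:.2f}")
```

Only observations within bandwidth of the cutoff enter the estimate, so a smaller bandwidth trades bias for variance.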

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.RCTGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: DataGenerator

Generate synthetic data for Randomized Controlled Trials (RCT)

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.DiDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: DataGenerator

Generate synthetic data for Difference-in-Differences (DiD) analysis

Additional Attributes:

n_periods (int): number of time-periods

__init__(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

canonical_did_model()[source]: This is the classical DiD setting with two periods (pre and post treatment) and two groups (treatment and control)
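The two-period, two-group setting can be sketched as follows. The outcome equation (a group gap, a common time trend, and an additive treatment effect in the post period) is an assumption made for illustration and is not taken from canonical_did_model; the function names are hypothetical.

```python
import random


def simulate_canonical_did(n_units, true_effect=1.0, seed=111):
    """Two rows per unit: period 0 (pre) and period 1 (post)."""
    rng = random.Random(seed)
    rows = []
    for unit in range(n_units):
        group = 1 if rng.random() < 0.5 else 0  # 1 = treatment group
        for period in (0, 1):
            y = (3.0 + 2.0 * group              # baseline gap between groups
                 + 1.0 * period                 # common time trend
                 + true_effect * group * period # effect only for treated, post
                 + rng.gauss(0.0, 1.0))
            rows.append((unit, group, period, y))
    return rows


def did_estimate(rows):
    """(treated post - treated pre) - (control post - control pre)."""
    def cell_mean(group, period):
        values = [y for _, g, p, y in rows if g == group and p == period]
        return sum(values) / len(values)

    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))


did_rows = simulate_canonical_did(5000)
did_value = did_estimate(did_rows)
print(f"DiD estimate: {did_value:.2f}")
```

The baseline gap and the common trend both cancel in the double difference, leaving the treatment effect.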

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

twfe_model()[source]: Generate panel data for Two-Way Fixed Effects DiD model. This is a generalization of 2-period DiD for multi-year treatments

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

class causal_agent.synthetic.MultiTreatRCTGenerator(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]

Bases: DataGenerator

Base class for generating synthetic data for multi-treatment RCTs

Additional Attributes:

true_effect_vec (np.ndarray): the treatment effect for different treatments.
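As a minimal sketch of what a per-treatment effect vector means, the example below assigns each unit uniformly at random to a control arm or one of several treatment arms and recovers each effect as a difference in means against control. The assignment scheme and outcome model are assumptions, and the function names are hypothetical rather than part of MultiTreatRCTGenerator.

```python
import random


def simulate_multi_arm_rct(n, true_effect_vec, seed=111):
    """Arm 0 is control; arm k (k >= 1) has effect true_effect_vec[k - 1]."""
    rng = random.Random(seed)
    n_arms = len(true_effect_vec) + 1
    rows = []
    for _ in range(n):
        arm = rng.randrange(n_arms)  # uniform random assignment
        effect = true_effect_vec[arm - 1] if arm > 0 else 0.0
        y = 2.0 + effect + rng.gauss(0.0, 1.0)
        rows.append((arm, y))
    return rows


def arm_effects(rows, n_treatments):
    """Mean outcome of each treatment arm minus the control mean."""
    means = []
    for arm in range(n_treatments + 1):
        outcomes = [y for a, y in rows if a == arm]
        means.append(sum(outcomes) / len(outcomes))
    return [means[arm] - means[0] for arm in range(1, n_treatments + 1)]


effect_vec = [0.5, 1.0, 2.0]
rct_rows = simulate_multi_arm_rct(20000, effect_vec)
estimated_effects = arm_effects(rct_rows, n_treatments=len(effect_vec))
print([round(e, 2) for e in estimated_effects])
```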

__init__(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

Submodules

causal_agent.synthetic.generator module

class causal_agent.synthetic.generator.DataGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, n_treatments=1, true_effect=0, seed=111, heterogeneity=0)[source]

Bases: object

Base class for generating synthetic data

n_observations

Number of observations

Type: int

n_continuous_covars

Number of continuous covariates

Type: int

n_covars

Total number of covariates (continuous + binary)

Type: int

n_treatments

Number of treatments

Type: int

true_effect

True effect size

Type: float

seed

Random seed for reproducibility

Type: int

data

Generated data

Type: pd.DataFrame

info

Dictionary to store additional information about the data

Type: dict

method

The causal inference method associated with the synthetic data

Type: str

mean

Mean of the covariates

Type: np.ndarray

covar

Covariance matrix for the covariates

Type: np.ndarray

heterogeneity

Whether or not the treatment effects are heterogeneous

Type: bool

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, n_treatments=1, true_effect=0, seed=111, heterogeneity=0)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

save_data(folder, filename)[source]

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file
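A minimal sketch of the folder/filename convention, using only the standard library. The real method presumably writes the pd.DataFrame held in data; save_rows below is a hypothetical stand-in, and creating the folder when it is missing is an assumption, not documented behaviour.

```python
import csv
import os
import tempfile


def save_rows(rows, header, folder, filename):
    """Write rows as CSV to <folder>/<filename> and return the full path."""
    os.makedirs(folder, exist_ok=True)  # assumption: create the folder if needed
    path = os.path.join(folder, filename)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_rows([[1, 0, 2.5], [3, 1, 4.0]], ["X1", "D", "Y"], tmp, "demo.csv")
    # Read the file back to confirm the header and rows were written.
    with open(saved_path, newline="") as handle:
        loaded = list(csv.reader(handle))
```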

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates()[source]: Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.
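The description above can be sketched in a few lines. This version is a simplification: it draws each continuous covariate from an independent normal distribution rather than a multivariate normal with a full covariance matrix, and it treats each binary covariate as a single-trial binomial draw with probability p. The function and its parameters are hypothetical and only illustrate the flooring step.

```python
import math
import random


def sketch_covariates(n_observations, n_continuous, n_binary,
                      mean=0.0, sd=1.0, p=0.5, seed=111):
    """Return one row per observation: floored normal draws, then 0/1 draws."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_observations):
        # Non-binary covariates: normal draw, discretized to its floor integer.
        continuous = [math.floor(rng.gauss(mean, sd)) for _ in range(n_continuous)]
        # Binary covariates: binomial draw with a single trial.
        binary = [1 if rng.random() < p else 0 for _ in range(n_binary)]
        rows.append(continuous + binary)
    return rows


covariate_rows = sketch_covariates(5, n_continuous=2, n_binary=2, mean=10.0, sd=2.0)
for row in covariate_rows:
    print(row)
```

Because the generator is seeded, repeated calls with the same arguments return the same rows.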

class causal_agent.synthetic.generator.MultiTreatRCTGenerator(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]

Bases: DataGenerator

Base class for generating synthetic data for multi-treatment RCTs

Additional Attributes:

true_effect_vec (np.ndarray): the treatment effect for different treatments.

__init__(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.generator.FrontDoorGenerator(n_observations, n_continuous_covars=2, n_binary_covars=2, mean=None, covar=None, seed=111, true_effect=2.0, heterogeneity=0)[source]

Bases: DataGenerator

Generates synthetic data satisfying the front-door criterion. D → M → Y, D ← U → Y
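The graph D → M → Y, D ← U → Y can be illustrated with a minimal linear sketch. The coefficients a, b and c, the noise model, and the function names are all assumptions made for this example and are not taken from FrontDoorGenerator. In a linear model, the front-door estimate is the product of the D → M slope and the M → Y coefficient after adjusting for D.

```python
import random


def cov(xs, ys):
    """Population covariance of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def simulate_front_door(n, a=0.8, b=2.5, c=1.5, seed=111):
    """Total effect of D on Y is a * b; U confounds D and Y but not M."""
    rng = random.Random(seed)
    d, m, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)                      # unobserved confounder
        d_i = u + rng.gauss(0.0, 1.0)                # D <- U
        m_i = a * d_i + rng.gauss(0.0, 1.0)          # D -> M
        y_i = b * m_i + c * u + rng.gauss(0.0, 1.0)  # M -> Y, U -> Y
        d.append(d_i)
        m.append(m_i)
        y.append(y_i)
    return d, m, y


def front_door_estimate(d, m, y):
    """(slope of M on D) times (coefficient on M when regressing Y on M and D)."""
    a_hat = cov(d, m) / cov(d, d)
    b_hat = ((cov(m, y) * cov(d, d) - cov(d, y) * cov(d, m))
             / (cov(m, m) * cov(d, d) - cov(d, m) ** 2))
    return a_hat * b_hat


d, m, y = simulate_front_door(20000)
fd_estimate = front_door_estimate(d, m, y)
naive_estimate = cov(d, y) / cov(d, d)  # ignores the confounder U
print(f"Front-door: {fd_estimate:.2f}, naive: {naive_estimate:.2f}")
```

With a = 0.8 and b = 2.5 the total effect is 2.0, which the front-door estimate should approximate even though U is never observed.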

__init__(n_observations, n_continuous_covars=2, n_binary_covars=2, mean=None, covar=None, seed=111, true_effect=2.0, heterogeneity=0)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.generator.ObservationalDataGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: DataGenerator

Generate synthetic data for observational studies.

Additional Attributes:

weights (np.ndarray): the propensity score weights for each observation
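A minimal sketch of how propensity score weights can be used, under the assumption that they are inverse-probability weights (1/e for treated units, 1/(1 - e) for controls). The logistic propensity model, the outcome equation, and the function names are hypothetical. The true propensity is used because it is known in a simulation; in practice it would be estimated.

```python
import math
import random


def simulate_observational(n, true_effect=1.0, seed=111):
    """Treatment probability depends on the covariate X, which also drives Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        propensity = 1.0 / (1.0 + math.exp(-x))  # P(D = 1 | X)
        d = 1 if rng.random() < propensity else 0
        y = 2.0 * x + true_effect * d + rng.gauss(0.0, 1.0)
        weight = 1.0 / propensity if d == 1 else 1.0 / (1.0 - propensity)
        rows.append((d, y, weight))
    return rows


def weighted_effect(rows):
    """Difference of weight-normalized outcome means (treated minus control)."""
    def weighted_mean(group):
        total = sum(w * y for d, y, w in rows if d == group)
        return total / sum(w for d, _, w in rows if d == group)

    return weighted_mean(1) - weighted_mean(0)


def naive_effect(rows):
    """Unweighted difference in means; biased when X affects both D and Y."""
    treated = [y for d, y, _ in rows if d == 1]
    control = [y for d, y, _ in rows if d == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


obs_rows = simulate_observational(20000)
ipw_estimate = weighted_effect(obs_rows)
naive_estimate = naive_effect(obs_rows)
print(f"Weighted: {ipw_estimate:.2f}, naive: {naive_estimate:.2f}")
```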

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

test_data(print_=False): Test the generated data, using the appropriate method.

class causal_agent.synthetic.generator.PSMGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: ObservationalDataGenerator

Generate synthetic data for Propensity Score Matching (PSM)

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

test_data(print_=False)[source]: Test the generated data

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

generate_data()

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.generator.PSWGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: ObservationalDataGenerator

Generate synthetic data for Propensity Score Weighting (PSW)

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

test_data(print_=False)[source]: Test the generated data

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

generate_data()

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.generator.RCTGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: DataGenerator

Generate synthetic data for Randomized Controlled Trials (RCT)

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.generator.IVGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]

Bases: DataGenerator

Generate synthetic data for Instrumental Variables (IV) analysis. Two designs are supported:

Encouragement Design:
Z -> D -> Y
In this setting, the encouragement (Z) is randomized. For instance, consider vaccine administration. We cannot force people to take a vaccine, but we can encourage them to take it. We could run a vaccine awareness campaign in which we randomly pick participants and inform them about the benefits of the vaccine. An encouraged participant can either comply (take the vaccine) or not comply (not take the vaccine). Likewise, in the control group, a participant can comply (not take the vaccine) or defy (take the vaccine).

Classical Setting:
       U
     /   \
Z -> D -> Y
This is the classical setting, in which an unobserved confounder (U) affects both treatment (D) and outcome (Y).

Additional Attributes:

alpha (float): the effect of the instrument on the treatment (Z on D)
encouragement (bool): whether or not this is an encouragement design
beta_d (float): effect of the unobserved confounder (U) on treatment (D)
beta_y (float): effect of the unobserved confounders (U) on outcome (Y)

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.generator.RDDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]

Bases: DataGenerator

Generate synthetic data for (sharp) Regression Discontinuity Design (RDD).

Additional Attributes:

cutoff (float): the cutoff for treatment assignment
bandwidth (float): the bandwidth for the running variable we consider when estimating the treatment effects
plot (bool): whether we plot the data or not

__init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

class causal_agent.synthetic.generator.DiDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

Bases: DataGenerator

Generate synthetic data for Difference-in-Differences (DiD) analysis

Additional Attributes:

n_periods (int): number of time-periods

__init__(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]

canonical_did_model()[source]: This is the classical DiD setting with two periods (pre and post treatment) and two groups (treatment and control)

generate_covariates(): Generate covariates. For continuous covariates, we use multivariate normal distribution, and for binary covars, we use binomial distribution. The non-binary covariates are discretized to their floor integer.

save_data(folder, filename)

Saves the generated data as a CSV file

Parameters:

folder (str) – path to the folder where the data is saved
filename (str) – name of the file

twfe_model()[source]: Generate panel data for Two-Way Fixed Effects DiD model. This is a generalization of 2-period DiD for multi-year treatments

generate_data()[source]

Generates the synthetic data

Returns: The generated data
Return type: pd.DataFrame

test_data(print_=False)[source]: Test the generated data, using the appropriate method.

causal_agent.synthetic.io module

causal_agent.synthetic.prompts module

causal_agent.synthetic.prompts.generate_data_summary(df, n_cont_vars, n_bin_vars, method, cutoff=None)[source]

Generate a summary of the input dataset. The summary lists the column headings for the continuous, binary, treatment, and outcome variables. It also states the method used to generate the dataset and gives a basic statistical summary.

Parameters:

df (pd.DataFrame) – The input dataset.
n_cont_vars (int) – Number of continuous variables in the dataset
n_bin_vars (int) – Number of binary variables in the dataset
method (str) – The method used to generate the dataset
cutoff (float, None) – The cutoff value for RDD data

Returns:

Summary of the (raw) dataset.

Return type:

str

causal_agent.synthetic.prompts.create_prompt(summary, method, domain, history)[source]

Creates a prompt for the OpenAI API to generate a context for the given dataset

Parameters:

summary (str) – Summary of the dataset
method (str) – The method used to generate the dataset
domain (str) – The domain of the dataset
history (str) – Previous contexts that have been used. We use this to avoid overlap in contexts

causal_agent.synthetic.prompts.filter_question(question)[source]

Filter the question to remove explicit mentions of variables.

Parameters: question (str) – The original causal query
Returns: The filtered causal query
Return type: str

causal_agent.synthetic.util module

causal_agent.synthetic.util.export_info(info, folder, name)[source]