causal_agent.synthetic package
- class causal_agent.synthetic.PSMGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
ObservationalDataGenerator
Generate synthetic data for Propensity Score Matching (PSM)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- generate_data()
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
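As a rough illustration of the kind of data PSMGenerator is meant for, the following standalone sketch simulates one confounder that drives both treatment and outcome, then recovers the effect by nearest-neighbour matching on the propensity score. It does not call causal_agent. The data-generating process, the coefficient 2.0, and the use of the true (rather than estimated) propensity score are all assumptions made for the example.

```python
import bisect
import math
import random

rng = random.Random(111)
true_effect = 1.0
n = 5000

units = []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)                      # observed confounder
    score = 1.0 / (1.0 + math.exp(-x))           # true propensity score (assumed known)
    d = 1 if rng.random() < score else 0         # treatment depends on x
    y = true_effect * d + 2.0 * x + rng.gauss(0.0, 1.0)
    units.append((score, d, y))

treated = [(s, y) for s, d, y in units if d == 1]
controls = sorted((s, y) for s, d, y in units if d == 0)
control_scores = [s for s, _ in controls]

def nearest_control_outcome(score):
    """Outcome of the control unit whose propensity score is closest to `score`."""
    i = bisect.bisect_left(control_scores, score)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(controls)]
    best = min(candidates, key=lambda j: abs(control_scores[j] - score))
    return controls[best][1]

# Matched estimate of the effect on the treated (matching with replacement).
att = sum(y - nearest_control_outcome(s) for s, y in treated) / len(treated)

# Unadjusted difference in means, which absorbs the confounding through x.
naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in controls) / len(controls))
```

In this setup the naive difference overstates the effect because treated units tend to have larger x, while the matched estimate should land near true_effect.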
- class causal_agent.synthetic.PSWGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
ObservationalDataGenerator
Generate synthetic data for Propensity Score Weighting (PSW)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- generate_data()
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
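To make the weighting idea concrete, here is a minimal standalone sketch of inverse propensity weighting on simulated confounded data. It does not use PSWGenerator. The data-generating process, the coefficient 2.0, and the use of the true propensity score are assumptions for illustration; the estimator shown is the normalized (Hajek) form.

```python
import math
import random

rng = random.Random(111)
true_effect = 1.0
n = 20000

# Running sums for the normalized inverse-probability-weighted group means.
weight_sum = {0: 0.0, 1: 0.0}
weighted_outcome_sum = {0: 0.0, 1: 0.0}
for _ in range(n):
    x = rng.gauss(0.0, 1.0)                      # observed confounder
    score = 1.0 / (1.0 + math.exp(-x))           # true propensity score (assumed known)
    d = 1 if rng.random() < score else 0
    y = true_effect * d + 2.0 * x + rng.gauss(0.0, 1.0)
    # Treated units get weight 1/e(x); control units get 1/(1 - e(x)).
    weight = 1.0 / score if d == 1 else 1.0 / (1.0 - score)
    weight_sum[d] += weight
    weighted_outcome_sum[d] += weight * y

ate = (weighted_outcome_sum[1] / weight_sum[1]
       - weighted_outcome_sum[0] / weight_sum[0])
```

Normalizing by the sum of weights in each group keeps the estimate stable when a few units receive large weights.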
- class causal_agent.synthetic.IVGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]
Bases:
DataGenerator
Generate synthetic data for Instrumental Variables (IV) analysis. We assume two forms:
- Encouragement Design:
Z -> D -> Y
In this setting, the encouragement (Z) is randomized. For instance, consider vaccine administration. We cannot force people to take a vaccine, but we can encourage them to take it. We could run a vaccine awareness campaign in which we randomly pick participants and inform them about the benefits of the vaccine. An encouraged participant can either comply (take the vaccine) or not comply (not take the vaccine). Likewise, in the control group, a participant can comply (not take the vaccine) or defy (take the vaccine).
- Classical Setting:
Z -> D -> Y, D <- U -> Y
This is the classical setting where we have an unobserved confounder (U) affecting both treatment (D) and outcome (Y).
- Additional Attributes:
alpha (float): the effect of the instrument on the treatment (Z on D)
encouragement (bool): whether or not this is an encouragement design
beta_d (float): effect of the unobserved confounder (U) on treatment (D)
beta_y (float): effect of the unobserved confounders (U) on outcome (Y)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
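The roles of alpha, beta_d, and beta_y can be sketched with a small linear simulation of the classical setting. This is not IVGenerator's implementation: the linear structural equations, the continuous treatment, and the Bernoulli(0.5) instrument are assumptions chosen so that the Wald (ratio-of-covariances) estimator is easy to check.

```python
import random

rng = random.Random(111)
alpha, beta_d, beta_y, true_effect = 0.5, 1.0, 1.5, 1.0
n = 40000

z_vals, d_vals, y_vals = [], [], []
for _ in range(n):
    u = rng.gauss(0.0, 1.0)                      # unobserved confounder U
    z = 1.0 if rng.random() < 0.5 else 0.0       # randomized instrument Z
    d = alpha * z + beta_d * u + rng.gauss(0.0, 1.0)          # Z and U both move D
    y = true_effect * d + beta_y * u + rng.gauss(0.0, 1.0)    # U also moves Y
    z_vals.append(z)
    d_vals.append(d)
    y_vals.append(y)

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)

# Wald / IV estimate: effect of Z on Y divided by effect of Z on D.
iv_estimate = covariance(z_vals, y_vals) / covariance(z_vals, d_vals)

# Ordinary regression of Y on D, which is biased because U is omitted.
ols_estimate = covariance(d_vals, y_vals) / covariance(d_vals, d_vals)
```

Because Z is independent of U, the ratio isolates the D -> Y effect, whereas the regression slope also picks up the U path.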
- class causal_agent.synthetic.RDDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]
Bases:
DataGenerator
Generate synthetic data for (sharp) Regression Discontinuity Design (RDD).
- Additional Attributes:
cutoff (float): the cutoff for treatment assignment
bandwidth (float): the bandwidth for the running variable we consider when estimating the treatment effects
plot (bool): whether we plot the data or not
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
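How cutoff and bandwidth interact in a sharp design can be sketched as follows. The simulation is an assumption made for the example (uniform running variable, linear trend with slope 0.5, noise scale 0.5) and does not reproduce RDDGenerator; the estimator is a local linear fit on each side of the cutoff within the bandwidth.

```python
import random

rng = random.Random(111)
cutoff, bandwidth, true_effect = 10.0, 0.1, 1.0
n = 20000

points = []
for _ in range(n):
    x = rng.uniform(cutoff - 1.0, cutoff + 1.0)  # running variable
    d = 1 if x >= cutoff else 0                  # sharp rule: treated iff x >= cutoff
    y = 0.5 * x + true_effect * d + rng.gauss(0.0, 0.5)
    points.append((x, d, y))

def fit_at_cutoff(side):
    """Least-squares fit of y on (x - cutoff); returns the fitted value at the cutoff."""
    xs = [x - cutoff for x, _ in side]
    ys = [y for _, y in side]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x

# Keep only observations within the bandwidth around the cutoff.
window = [(x, d, y) for x, d, y in points if abs(x - cutoff) <= bandwidth]
right = [(x, y) for x, d, y in window if d == 1]
left = [(x, y) for x, d, y in window if d == 0]

# The jump between the two fitted lines at the cutoff estimates the effect.
rdd_estimate = fit_at_cutoff(right) - fit_at_cutoff(left)
```

A narrower bandwidth reduces bias from the trend in the running variable but leaves fewer observations on each side.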
- class causal_agent.synthetic.RCTGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
DataGenerator
Generate synthetic data for Randomized Controlled Trials (RCT)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
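For a randomized trial, a plain difference in means is enough to recover the effect. The sketch below is a standalone simulation, not RCTGenerator itself; the coin-flip assignment probability, the covariate coefficient 0.5, and the noise scale are assumptions.

```python
import random

rng = random.Random(111)
true_effect = 1.0
n = 10000

treated, control = [], []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)                      # covariate that affects the outcome
    d = 1 if rng.random() < 0.5 else 0           # assignment is independent of x
    y = true_effect * d + 0.5 * x + rng.gauss(0.0, 1.0)
    (treated if d == 1 else control).append(y)

# With randomized assignment, the difference in means is unbiased for the effect.
ate = sum(treated) / len(treated) - sum(control) / len(control)
```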
- class causal_agent.synthetic.DiDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
DataGenerator
Generate synthetic data for Difference-in-Differences (DiD) analysis
- Additional Attributes:
n_periods (int): number of time-periods
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- canonical_did_model()[source]
This is the classical DiD setting with two periods (pre- and post-treatment) and two groups (treatment and control).
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- save_data(folder, filename)
Saves the generated data as a CSV file
- twfe_model()[source]
Generate panel data for a Two-Way Fixed Effects (TWFE) DiD model. This is a generalization of the two-period DiD to multi-year treatments.
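The canonical two-period, two-group case can be sketched with a small simulation. This is an assumed data-generating process (group gap 2.0, common time trend 0.5, unit-level random effects), not the output of canonical_did_model(); it only shows how the double difference removes the group gap and the common trend.

```python
import random

rng = random.Random(111)
true_effect = 1.0
n_units = 4000

# Outcomes collected by (group, post) cell.
cells = {(g, t): [] for g in (0, 1) for t in (0, 1)}
for _ in range(n_units):
    group = 1 if rng.random() < 0.5 else 0       # 1 = treatment group
    unit_effect = rng.gauss(0.0, 1.0)            # time-invariant unit heterogeneity
    for post in (0, 1):
        y = (2.0 * group + 0.5 * post + true_effect * group * post
             + unit_effect + rng.gauss(0.0, 1.0))
        cells[(group, post)].append(y)

def mean(values):
    return sum(values) / len(values)

# Change over time in the treatment group minus the change in the control group.
did = ((mean(cells[(1, 1)]) - mean(cells[(1, 0)]))
       - (mean(cells[(0, 1)]) - mean(cells[(0, 0)])))

# Post-period comparison alone, which still contains the pre-existing group gap.
naive_post = mean(cells[(1, 1)]) - mean(cells[(0, 1)])
```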
- class causal_agent.synthetic.MultiTreatRCTGenerator(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]
Bases:
DataGenerator
Base class for generating synthetic data for multi-treatment RCTs
- Additional Attributes:
true_effect_vec (np.ndarray): the treatment effect for different treatments.
- __init__(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
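One way to picture true_effect_vec is as one effect per treatment arm, each measured against a shared control arm. The following sketch makes that reading explicit; it is an assumption about the intended meaning, the effect values are illustrative, and the code does not call MultiTreatRCTGenerator.

```python
import random

rng = random.Random(111)
true_effect_vec = [0.5, 1.0, 2.0]                # one effect per treatment arm (illustrative)
n_treatments = len(true_effect_vec)
n = 30000

outcomes = {arm: [] for arm in range(n_treatments + 1)}
for _ in range(n):
    arm = rng.randrange(n_treatments + 1)        # arm 0 is control, chosen uniformly
    effect = 0.0 if arm == 0 else true_effect_vec[arm - 1]
    outcomes[arm].append(effect + rng.gauss(0.0, 1.0))

control_mean = sum(outcomes[0]) / len(outcomes[0])

# Difference in means of each treatment arm against the control arm.
estimates = [sum(outcomes[arm]) / len(outcomes[arm]) - control_mean
             for arm in range(1, n_treatments + 1)]
```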
Submodules
causal_agent.synthetic.generator module
- class causal_agent.synthetic.generator.DataGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, n_treatments=1, true_effect=0, seed=111, heterogeneity=0)[source]
Bases:
object
Base class for generating synthetic data
- data
Generated data
- Type:
pd.DataFrame
- mean
mean of the covariates
- Type:
np.ndarray
- covar
covariance matrix for the covariates
- Type:
np.ndarray
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, n_treatments=1, true_effect=0, seed=111, heterogeneity=0)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
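The mean and covar attributes parameterize the continuous covariates, which the subclasses' generate_covariates() is documented as drawing from a multivariate normal distribution, flooring, and combining with binomial draws for the binary covariates. A minimal standard-library sketch of that recipe follows. The function names, the success probability p=0.5, and the column order are hypothetical; the package itself may use NumPy and different defaults.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower

def sketch_covariates(n_observations, mean, covar, n_binary_covars=2, p=0.5, seed=111):
    """Return rows of [floored continuous covariates..., binary covariates...]."""
    rng = random.Random(seed)
    lower = cholesky(covar)
    rows = []
    for _ in range(n_observations):
        # Multivariate normal draw: mean + L @ z, with z standard normal.
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        continuous = [mean[i] + sum(lower[i][k] * z[k] for k in range(i + 1))
                      for i in range(len(mean))]
        floored = [math.floor(value) for value in continuous]   # discretize by floor
        binary = [1 if rng.random() < p else 0 for _ in range(n_binary_covars)]
        rows.append(floored + binary)
    return rows

rows = sketch_covariates(1000, mean=[5.0, 10.0], covar=[[4.0, 1.0], [1.0, 9.0]])
```

Flooring shifts each continuous covariate's average down by roughly 0.5 relative to the requested mean, which is worth keeping in mind when checking generated data against mean.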
- class causal_agent.synthetic.generator.MultiTreatRCTGenerator(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]
Bases:
DataGenerator
Base class for generating synthetic data for multi-treatment RCTs
- Additional Attributes:
true_effect_vec (np.ndarray): the treatment effect for different treatments.
- __init__(n_observations, n_continuous_covars, n_treatments, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, true_effect_vec=None, seed=111, heterogeneity=0)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- class causal_agent.synthetic.generator.FrontDoorGenerator(n_observations, n_continuous_covars=2, n_binary_covars=2, mean=None, covar=None, seed=111, true_effect=2.0, heterogeneity=0)[source]
Bases:
DataGenerator
Generates synthetic data satisfying the front-door criterion. D → M → Y, D ← U → Y
- __init__(n_observations, n_continuous_covars=2, n_binary_covars=2, mean=None, covar=None, seed=111, true_effect=2.0, heterogeneity=0)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
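The graph D → M → Y, D ← U → Y can be illustrated with a linear simulation in which the front-door estimate is the product of two regression coefficients. This sketch assumes linear structural equations and illustrative coefficients (0.8 and 2.5, whose product is 2.0); it is not FrontDoorGenerator's implementation.

```python
import random

rng = random.Random(111)
effect_d_on_m, effect_m_on_y = 0.8, 2.5
true_effect = effect_d_on_m * effect_m_on_y      # total effect of D on Y through M
n = 20000

d_vals, m_vals, y_vals = [], [], []
for _ in range(n):
    u = rng.gauss(0.0, 1.0)                      # unobserved confounder of D and Y
    d = u + rng.gauss(0.0, 1.0)
    m = effect_d_on_m * d + rng.gauss(0.0, 1.0)  # mediator depends on D only
    y = effect_m_on_y * m + 1.5 * u + rng.gauss(0.0, 1.0)
    d_vals.append(d)
    m_vals.append(m)
    y_vals.append(y)

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / len(a)

# Step 1: D -> M is unconfounded, so regress M on D.
a_hat = cov(d_vals, m_vals) / cov(d_vals, d_vals)

# Step 2: coefficient on M in a regression of Y on M and D. Adjusting for D
# blocks the back-door path M <- D <- U -> Y.
s_mm, s_dd, s_md = cov(m_vals, m_vals), cov(d_vals, d_vals), cov(m_vals, d_vals)
s_my, s_dy = cov(m_vals, y_vals), cov(d_vals, y_vals)
b_hat = (s_dd * s_my - s_md * s_dy) / (s_mm * s_dd - s_md ** 2)

front_door = a_hat * b_hat

# Regressing Y on D directly is biased because U is unobserved.
naive = s_dy / s_dd
```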
- class causal_agent.synthetic.generator.ObservationalDataGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
DataGenerator
Generate synthetic data for observational studies.
- Additional Attributes:
self.weights (np.ndarray): the propensity score weights for each observation
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- save_data(folder, filename)
Saves the generated data as a CSV file
- test_data(print_=False)
Test the generated data, using the appropriate method.
- class causal_agent.synthetic.generator.PSMGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
ObservationalDataGenerator
Generate synthetic data for Propensity Score Matching (PSM)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- generate_data()
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- class causal_agent.synthetic.generator.PSWGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
ObservationalDataGenerator
Generate synthetic data for Propensity Score Weighting (PSW)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- generate_data()
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- class causal_agent.synthetic.generator.RCTGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
DataGenerator
Generate synthetic data for Randomized Controlled Trials (RCT)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- class causal_agent.synthetic.generator.IVGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]
Bases:
DataGenerator
Generate synthetic data for Instrumental Variables (IV) analysis. We assume two forms:
- Encouragement Design:
Z -> D -> Y
In this setting, the encouragement (Z) is randomized. For instance, consider vaccine administration. We cannot force people to take a vaccine, but we can encourage them to take it. We could run a vaccine awareness campaign in which we randomly pick participants and inform them about the benefits of the vaccine. An encouraged participant can either comply (take the vaccine) or not comply (not take the vaccine). Likewise, in the control group, a participant can comply (not take the vaccine) or defy (take the vaccine).
- Classical Setting:
Z -> D -> Y, D <- U -> Y
This is the classical setting where we have an unobserved confounder (U) affecting both treatment (D) and outcome (Y).
- Additional Attributes:
alpha (float): the effect of the instrument on the treatment (Z on D)
encouragement (bool): whether or not this is an encouragement design
beta_d (float): effect of the unobserved confounder (U) on treatment (D)
beta_y (float): effect of the unobserved confounders (U) on outcome (Y)
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, beta_d=1.0, beta_y=1.5, covar=None, true_effect=1.0, seed=111, heterogeneity=0, alpha=0.5, encouragement=False)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- class causal_agent.synthetic.generator.RDDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]
Bases:
DataGenerator
Generate synthetic data for (sharp) Regression Discontinuity Design (RDD).
- Additional Attributes:
cutoff (float): the cutoff for treatment assignment
bandwidth (float): the bandwidth for the running variable we consider when estimating the treatment effects
plot (bool): whether we plot the data or not
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, mean=None, plot=False, covar=None, true_effect=1.0, seed=111, heterogeneity=0, cutoff=10, bandwidth=0.1)[source]
- generate_data()[source]
Generates the synthetic data
- Returns:
The generated data
- Return type:
pd.DataFrame
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- class causal_agent.synthetic.generator.DiDGenerator(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
Bases:
DataGenerator
Generate synthetic data for Difference-in-Differences (DiD) analysis
- Additional Attributes:
n_periods (int): number of time-periods
- __init__(n_observations, n_continuous_covars, n_binary_covars=2, n_periods=2, mean=None, covar=None, true_effect=1.0, seed=111, heterogeneity=0)[source]
- canonical_did_model()[source]
This is the classical DiD setting with two periods (pre- and post-treatment) and two groups (treatment and control).
- generate_covariates()
Generate covariates. Continuous covariates are drawn from a multivariate normal distribution and binary covariates from a binomial distribution. The non-binary covariates are then rounded down to the nearest integer (floor).
- save_data(folder, filename)
Saves the generated data as a CSV file
- twfe_model()[source]
Generate panel data for a Two-Way Fixed Effects (TWFE) DiD model. This is a generalization of the two-period DiD to multi-year treatments.
causal_agent.synthetic.io module
causal_agent.synthetic.prompts module
- causal_agent.synthetic.prompts.generate_data_summary(df, n_cont_vars, n_bin_vars, method, cutoff=None)[source]
Generate a summary of the input dataset. The summary includes the column headings for the continuous, binary, treatment, and outcome variables. It also includes the method used to generate the dataset and a basic statistical summary.
- Parameters:
- Returns:
Summary of the (raw) dataset.
- Return type: