Synthetic Data Generation System

This document describes the synthetic data generation system used to test, validate, and benchmark causal inference methods in CAIS. The system supports rigorous testing of the autonomous agent’s decision-making and method selection logic.

Overview

The synthetic data generation system is a foundational component of CAIS that enables:

  • Decision Tree Validation: Test the agent’s method selection logic with known ground truth scenarios

  • Method Performance Testing: Validate causal inference methods with controlled data generation parameters

  • Assumption Violation Testing: Generate data that violates specific method assumptions to test robustness

  • Agent Workflow Testing: Create comprehensive test scenarios for the complete autonomous analysis pipeline

  • Educational Examples: Provide clear examples for tutorials and documentation with known causal relationships

  • Benchmarking: Create standardized datasets for comparing method performance across different scenarios

The system generates realistic datasets that mirror real-world causal inference challenges while maintaining known causal relationships, enabling validation of both individual methods and the agent’s decision-making process.

System Architecture and Decision Tree Integration

The synthetic data generation system is tightly integrated with CAIS’s decision tree logic, enabling comprehensive testing of the autonomous agent’s method selection capabilities.

        graph TB
    subgraph "Decision Tree Testing Framework"
        SCENARIOS[Scenario Definitions]
        GENERATORS[Method-Specific Generators]
        VALIDATION[Ground Truth Validation]
    end

    subgraph "Agent Decision Points"
        EXPERIMENTAL[Experimental Design Detection]
        TEMPORAL[Temporal Structure Analysis]
        CONFOUNDING[Confounding Assessment]
        INSTRUMENTS[Instrument Validation]
    end

    subgraph "Method Generators"
        RCT[RCT Generator]
        DID[DiD Generator]
        IV[IV Generator]
        RDD[RDD Generator]
        PS[Propensity Score Generator]
        MULTI[Multi-Treatment RCT]
        FRONT[Front-Door Generator]
    end

    subgraph "Testing Scenarios"
        CANONICAL[Canonical Scenarios]
        VIOLATIONS[Assumption Violations]
        EDGE[Edge Cases]
        MIXED[Mixed Method Scenarios]
    end

    SCENARIOS --> GENERATORS
    GENERATORS --> RCT
    GENERATORS --> DID
    GENERATORS --> IV
    GENERATORS --> RDD
    GENERATORS --> PS
    GENERATORS --> MULTI
    GENERATORS --> FRONT

    GENERATORS --> EXPERIMENTAL
    GENERATORS --> TEMPORAL
    GENERATORS --> CONFOUNDING
    GENERATORS --> INSTRUMENTS

    VALIDATION --> CANONICAL
    VALIDATION --> VIOLATIONS
    VALIDATION --> EDGE
    VALIDATION --> MIXED
    

Decision Tree Validation Through Synthetic Data

The synthetic data system validates the agent’s decision tree logic by generating datasets with specific characteristics that should trigger particular method selections:

Experimental Design Detection:
  • RCT data with random treatment assignment tests the agent’s ability to detect experimental designs

  • Multi-treatment RCT data validates handling of complex experimental structures

  • Quasi-experimental data tests the distinction between experimental and observational studies

Temporal Structure Recognition:
  • Panel data with treatment timing variation tests DiD method selection

  • Cross-sectional data ensures DiD is not incorrectly selected

  • Time-series data with interventions validates temporal analysis capabilities

Confounding Assessment:
  • Observational data with measured confounders tests propensity score method selection

  • Data with unmeasured confounding validates the agent’s caution in method selection

  • Instrumental variable scenarios test the agent’s ability to leverage instruments

Method Exclusion Logic:
  • Weak instrument scenarios test first-stage F-statistic thresholds

  • Assumption violation scenarios validate the agent’s diagnostic capabilities

  • Edge cases test fallback method selection when primary methods fail
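
As an illustration of the weak-instrument check mentioned above, the following minimal sketch computes a first-stage F-statistic for a single instrument using only NumPy. The function name first_stage_f and the threshold of 10 (a common rule of thumb) are assumptions for illustration; the threshold CAIS actually applies is not stated in this document.

```python
import numpy as np

def first_stage_f(z: np.ndarray, d: np.ndarray) -> float:
    """F-statistic for the slope in the first-stage regression d ~ 1 + z."""
    n = len(d)
    Z = np.column_stack([np.ones(n), z])
    beta, *_ = np.linalg.lstsq(Z, d, rcond=None)
    resid = d - Z @ beta
    rss_full = resid @ resid                  # residual sum of squares with z
    rss_null = np.sum((d - d.mean()) ** 2)    # residual sum of squares without z
    # One restriction (the slope on z), n - 2 residual degrees of freedom
    return float((rss_null - rss_full) / (rss_full / (n - 2)))

rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
d_strong = 1.0 * z + rng.normal(size=n)   # instrument strongly shifts treatment
d_weak = 0.02 * z + rng.normal(size=n)    # instrument barely shifts treatment

f_strong = first_stage_f(z, d_strong)
f_weak = first_stage_f(z, d_weak)

WEAK_THRESHOLD = 10.0  # assumed rule-of-thumb cutoff, not taken from CAIS
print(f"strong: F={f_strong:.1f}, weak: F={f_weak:.1f}, threshold={WEAK_THRESHOLD}")
```

A weak-instrument scenario would then be expected to steer the agent away from IV estimation whenever the statistic falls below the chosen threshold.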

Data Generation Framework

Core Components

The synthetic data generation framework consists of several interconnected components that work together to create comprehensive test scenarios for the CAIS autonomous agent.

        graph TB
    subgraph "Generation Pipeline"
        CONFIG[Configuration System]
        BASE[Base Data Generator]
        METHODS[Method-Specific Generators]
        CONTEXT[Context Generation]
    end

    subgraph "Method Generators"
        RCT[RCT Generator]
        MULTI[Multi-Treatment RCT]
        DID_CAN[Canonical DiD]
        DID_TWFE[TWFE DiD]
        IV[IV Generator]
        IV_ENC[Encouragement Design]
        RDD[RDD Generator]
        PSM[PSM Generator]
        PSW[PSW Generator]
        FRONT[Front-Door Generator]
    end

    subgraph "Output Processing"
        STORAGE[Data Storage]
        METADATA[Metadata Management]
        CONTEXT_GEN[Context Generation]
        VALIDATION[Ground Truth Validation]
    end

    subgraph "Testing Integration"
        DECISION[Decision Tree Testing]
        AGENT[Agent Workflow Testing]
        BENCHMARK[Performance Benchmarking]
    end

    CONFIG --> BASE
    BASE --> METHODS
    METHODS --> RCT
    METHODS --> MULTI
    METHODS --> DID_CAN
    METHODS --> DID_TWFE
    METHODS --> IV
    METHODS --> IV_ENC
    METHODS --> RDD
    METHODS --> PSM
    METHODS --> PSW
    METHODS --> FRONT

    METHODS --> STORAGE
    STORAGE --> METADATA
    STORAGE --> CONTEXT_GEN
    STORAGE --> VALIDATION

    VALIDATION --> DECISION
    VALIDATION --> AGENT
    VALIDATION --> BENCHMARK
    

Configuration System

The configuration system (data_generation/settings.sh) provides centralized parameter management for all data generation processes:

# Dataset sizes for different methods
export RCT_SIZE=10
export MULTI_RCT_SIZE=5
export CANONICAL_DID_SIZE=5
export TWFE_DID_SIZE=5
export OBSERVATIONAL_SIZE=5
export IV_SIZE=5
export ENCOURAGEMENT_SIZE=5
export RDD_SIZE=5

# Observation counts
export MIN_OBS=300
export MAX_OBS=500
export DEFAULT_OBS=1000

# Covariate specifications
export N_CONTINUOUS=5
export N_BINARY=4

# Method-specific parameters
export MAX_TREATMENTS=5      # Multi-treatment RCT
export MAX_PERIODS=10        # TWFE DiD
export CUTOFF=25            # RDD cutoff range

This configuration system enables:

  • Consistent Parameter Management: Centralized control over data generation parameters

  • Scalable Testing: Easy adjustment of dataset sizes for different testing scenarios

  • Method-Specific Tuning: Tailored parameters for each causal inference method

  • Reproducible Results: Fixed parameters ensure consistent test outcomes
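
The document does not show how the Python generators consume these exported variables. One plausible approach, sketched below under that assumption, reads them from the environment with fallbacks that mirror the values listed above. The helper name load_settings and the DEFAULTS dictionary are hypothetical.

```python
import os

# Fallback values mirror a subset of the settings listed above.
DEFAULTS = {
    "RCT_SIZE": 10,
    "MIN_OBS": 300,
    "MAX_OBS": 500,
    "N_CONTINUOUS": 5,
    "N_BINARY": 4,
    "MAX_TREATMENTS": 5,
    "MAX_PERIODS": 10,
    "CUTOFF": 25,
}

def load_settings(env=None) -> dict:
    """Read integer settings from an environment mapping, using DEFAULTS when unset."""
    env = os.environ if env is None else env
    return {key: int(env.get(key, default)) for key, default in DEFAULTS.items()}

# Passing an explicit mapping keeps the example independent of the real shell environment.
settings = load_settings({"RCT_SIZE": "3"})
print(settings["RCT_SIZE"], settings["MIN_OBS"])
```

Accepting the environment as a parameter makes the loader easy to exercise in tests without sourcing settings.sh.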

Base Data Generator Architecture

The DataGenerator base class provides common functionality for all method-specific generators:

import numpy as np
from pathlib import Path


class DataGenerator:
    """Base class for generating synthetic data with common functionality"""

    def __init__(self, n_observations, n_continuous_covars, n_binary_covars=2,
                 mean=None, covar=None, n_treatments=1, true_effect=0,
                 seed=111, heterogeneity=0):
        # Initialize parameters and random state
        np.random.seed(seed)
        self.n_observations = n_observations
        self.n_continuous_covars = n_continuous_covars
        self.n_binary_covars = n_binary_covars  # Used by generate_covariates()
        self.n_covars = n_continuous_covars + n_binary_covars
        self.n_treatments = n_treatments
        self.true_effect = true_effect
        self.heterogeneity = heterogeneity
        self.method = None  # Set by subclasses
        self.data = None    # Populated by generate_data(); checked by save_data()

        # Covariate parameters: use caller-supplied values, otherwise draw defaults
        if mean is None:
            mean = np.random.randint(3, 20, size=self.n_continuous_covars)
        if covar is None:
            covar = np.identity(self.n_continuous_covars)
        self.mean = mean
        self.covar = covar

    def generate_covariates(self):
        """Generate correlated continuous and binary covariates"""
        # Continuous covariates from multivariate normal
        X_c = np.random.multivariate_normal(
            mean=self.mean,
            cov=self.covar,
            size=self.n_observations
        )

        # Binary covariates from binomial
        p = np.random.uniform(0.3, 0.7)
        X_b = np.random.binomial(
            1, p,
            size=(self.n_observations, self.n_binary_covars)
        ).astype(int)

        # Combine and discretize
        covariates = np.hstack((X_c, X_b))
        return covariates.astype(int)

    def generate_data(self):
        """Generate complete synthetic dataset (implemented by subclasses)"""
        raise NotImplementedError("Invoke the method in the subclass")

    def test_data(self, print_=False):
        """Test generated data using appropriate method"""
        raise NotImplementedError("This method should be overridden by subclasses")

    def save_data(self, folder, filename):
        """Save generated data as CSV file"""
        if self.data is None:
            raise ValueError("Data not generated yet. Please generate data first.")

        path = Path(folder)
        path.mkdir(parents=True, exist_ok=True)
        if not filename.endswith('.csv'):
            filename += '.csv'
        self.data.to_csv(path / filename, index=False)

Key Features:

  • Reproducible Generation: Seed-based random number generation ensures consistent results

  • Flexible Covariate Structure: Configurable continuous and binary covariates; continuous covariates can be correlated through the covar matrix (identity, i.e. uncorrelated, by default)

  • Method-Agnostic Base: Common functionality shared across all causal inference methods

  • Validation Integration: Built-in testing capabilities for generated data

  • Standardized Output: Consistent data format and storage mechanisms
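
The reproducibility property can be illustrated with a small standalone sketch. It mirrors the covariate logic of generate_covariates but uses a local np.random.default_rng generator instead of the global np.random.seed call in the class, so that two generators created side by side cannot disturb each other's random streams. The function name draw_covariates is hypothetical.

```python
import numpy as np

def draw_covariates(seed: int, n_obs: int = 5, n_cont: int = 2, n_bin: int = 2) -> np.ndarray:
    """Draw discretized continuous and binary covariates from a seeded local generator."""
    rng = np.random.default_rng(seed)
    mean = rng.integers(3, 20, size=n_cont)
    x_cont = rng.multivariate_normal(mean=mean, cov=np.identity(n_cont), size=n_obs)
    p = rng.uniform(0.3, 0.7)
    x_bin = rng.binomial(1, p, size=(n_obs, n_bin))
    return np.hstack((x_cont, x_bin)).astype(int)

a = draw_covariates(seed=111)
b = draw_covariates(seed=111)

# The same seed reproduces the same covariate matrix
print(np.array_equal(a, b))
```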

Base Data Generator

The configuration-driven abstract class BaseDataGenerator is the foundation of the synthetic data system:

# causal_agent/synthetic/generator.py

from abc import ABC, abstractmethod
import numpy as np
import pandas as pd
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass

@dataclass
class DataGenerationConfig:
    """Configuration for synthetic data generation"""
    n_observations: int = 1000
    n_continuous_covars: int = 3
    n_binary_covars: int = 2
    true_effect: float = 1.0
    noise_level: float = 1.0
    seed: int = 42
    heterogeneity: bool = False

class BaseDataGenerator(ABC):
    """
    Base class for synthetic data generation with common functionality.

    This class provides the foundation for all method-specific data generators,
    including covariate generation, noise modeling, and metadata management.
    """

    def __init__(self, config: DataGenerationConfig):
        self.config = config
        self.data = None
        self.metadata = {}
        self.true_parameters = {}

        # Set random seed for reproducibility
        np.random.seed(config.seed)

        # Initialize covariate parameters
        self.covariate_means = np.random.uniform(-2, 2, config.n_continuous_covars)
        self.covariate_cov = self._generate_covariance_matrix()

    def _generate_covariance_matrix(self) -> np.ndarray:
        """Generate realistic covariance matrix for covariates"""
        n_vars = self.config.n_continuous_covars

        # Generate correlation matrix
        correlations = np.random.uniform(-0.5, 0.5, size=(n_vars, n_vars))
        correlations = (correlations + correlations.T) / 2  # Make symmetric
        np.fill_diagonal(correlations, 1.0)

        # Ensure positive definite
        eigenvals, eigenvecs = np.linalg.eigh(correlations)
        eigenvals = np.maximum(eigenvals, 0.1)  # Ensure positive eigenvalues
        correlations = eigenvecs @ np.diag(eigenvals) @ eigenvecs.T

        # Convert to covariance matrix
        std_devs = np.random.uniform(0.5, 2.0, n_vars)
        covariance = np.outer(std_devs, std_devs) * correlations

        return covariance

    def generate_covariates(self) -> np.ndarray:
        """Generate correlated continuous covariates"""
        return np.random.multivariate_normal(
            mean=self.covariate_means,
            cov=self.covariate_cov,
            size=self.config.n_observations
        )

    def generate_binary_covariates(self) -> np.ndarray:
        """Generate binary covariates"""
        return np.random.binomial(
            1, 0.5,
            size=(self.config.n_observations, self.config.n_binary_covars)
        )

    def add_noise(self, signal: np.ndarray) -> np.ndarray:
        """Add noise to signal with specified noise level"""
        noise = np.random.normal(0, self.config.noise_level, len(signal))
        return signal + noise

    @abstractmethod
    def generate_treatment(self, covariates: np.ndarray) -> np.ndarray:
        """Generate treatment assignment (method-specific)"""
        pass

    @abstractmethod
    def generate_outcome(
        self,
        treatment: np.ndarray,
        covariates: np.ndarray
    ) -> np.ndarray:
        """Generate outcome variable (method-specific)"""
        pass

    @abstractmethod
    def get_method_name(self) -> str:
        """Return the causal method this generator is designed for"""
        pass

    def generate_data(self) -> pd.DataFrame:
        """Generate complete synthetic dataset"""
        # Generate covariates
        continuous_covars = self.generate_covariates()
        binary_covars = self.generate_binary_covariates()

        # Generate treatment
        treatment = self.generate_treatment(continuous_covars)

        # Generate outcome
        outcome = self.generate_outcome(treatment, continuous_covars)

        # Create DataFrame
        data = pd.DataFrame()

        # Add continuous covariates
        for i in range(self.config.n_continuous_covars):
            data[f'X{i+1}'] = continuous_covars[:, i]

        # Add binary covariates
        for i in range(self.config.n_binary_covars):
            data[f'B{i+1}'] = binary_covars[:, i]

        # Add treatment and outcome
        data['treatment'] = treatment
        data['outcome'] = outcome

        # Store metadata
        self.metadata = {
            'method': self.get_method_name(),
            'n_observations': self.config.n_observations,
            'n_continuous_covars': self.config.n_continuous_covars,
            'n_binary_covars': self.config.n_binary_covars,
            'true_effect': self.config.true_effect,
            'noise_level': self.config.noise_level,
            'seed': self.config.seed,
            'heterogeneity': self.config.heterogeneity
        }

        self.data = data
        return data

    def get_true_parameters(self) -> Dict[str, Any]:
        """Return true parameters for validation"""
        return {
            'true_effect': self.config.true_effect,
            'treatment_variable': 'treatment',
            'outcome_variable': 'outcome',
            'covariates': [f'X{i+1}' for i in range(self.config.n_continuous_covars)] +
                        [f'B{i+1}' for i in range(self.config.n_binary_covars)],
            'method': self.get_method_name(),
            **self.true_parameters
        }

    def save_data(self, filepath: str, include_metadata: bool = True):
        """Save generated data and metadata"""
        if self.data is None:
            raise ValueError("No data generated. Call generate_data() first.")

        # Save data
        self.data.to_csv(filepath, index=False)

        # Save metadata
        if include_metadata:
            metadata_path = filepath.replace('.csv', '_metadata.json')
            import json
            with open(metadata_path, 'w') as f:
                json.dump({
                    'metadata': self.metadata,
                    'true_parameters': self.get_true_parameters()
                }, f, indent=2)
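
The eigenvalue clipping in _generate_covariance_matrix is what guarantees a valid covariance matrix. The standalone sketch below repeats those steps with a local random generator and checks the result with a Cholesky factorization, which fails for matrices that are not positive definite. The function name random_covariance is hypothetical. One side effect worth knowing: after clipping, the diagonal of the intermediate matrix is no longer exactly 1, so it is only approximately a correlation matrix.

```python
import numpy as np

def random_covariance(n_vars: int, rng: np.random.Generator) -> np.ndarray:
    """Random symmetric positive-definite covariance matrix via eigenvalue clipping."""
    corr = rng.uniform(-0.5, 0.5, size=(n_vars, n_vars))
    corr = (corr + corr.T) / 2                      # make symmetric
    np.fill_diagonal(corr, 1.0)
    eigenvals, eigenvecs = np.linalg.eigh(corr)
    clipped = np.maximum(eigenvals, 0.1)            # force strictly positive eigenvalues
    corr = eigenvecs @ np.diag(clipped) @ eigenvecs.T
    std = rng.uniform(0.5, 2.0, n_vars)
    return np.outer(std, std) * corr                # scale by standard deviations

cov = random_covariance(6, np.random.default_rng(42))
np.linalg.cholesky(cov)                             # raises LinAlgError if not positive definite
min_eig = np.linalg.eigvalsh(cov).min()
print(f"smallest eigenvalue: {min_eig:.4f}")
```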

Method-Specific Generators and Decision Tree Testing

Each generator is designed to create data that tests specific aspects of the CAIS decision tree logic and method selection capabilities.

Randomized Controlled Trial (RCT) Generator

The RCT generator creates data with random treatment assignment, testing the agent’s ability to detect experimental designs and select appropriate analysis methods.

Decision Tree Testing:
  • Tests detection of random treatment assignment

  • Validates selection of simple difference-in-means analysis

  • Confirms rejection of more complex methods when randomization is present

class RCTGenerator(DataGenerator):
    """Generate synthetic data for Randomized Controlled Trials"""

    def generate_data(self):
        X = self.generate_covariates()
        cols = [f"X{i+1}" for i in range(self.n_covars)]
        df = pd.DataFrame(X, columns=cols)

        # Pure random assignment - key for decision tree testing
        df['D'] = np.random.binomial(1, 0.5, size=self.n_observations)

        # Outcome generation with treatment effect
        vec = np.random.uniform(0, 1, size=self.n_covars)
        intercept = np.random.normal(50, 3)
        noise = np.random.normal(0, 1, size=self.n_observations)
        df['Y'] = (intercept + X.dot(vec) +
                  self.true_effect * df['D'] + noise)

        self.data = df
        return df

    def test_data(self, print_=False):
        """Validate using simple OLS regression"""
        # Assumes `import statsmodels.formula.api as smf` at module level
        model = smf.ols('Y ~ D', data=self.data).fit()
        est = model.params['D']
        conf_int = model.conf_int().loc['D']

        result = f"TRUE ATE: {self.true_effect:.3f}, ESTIMATED ATE: {est:.3f}, " \
                f"95% CI: [{conf_int[0]:.3f}, {conf_int[1]:.3f}]"
        if print_:
            print(result)
        return result

Agent Testing Scenarios:
  • Random Assignment Detection: Agent should identify random treatment assignment

  • Method Selection: Should select difference-in-means or simple regression

  • Covariate Handling: Should recognize that covariate adjustment is optional but can improve precision
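
The point about covariate adjustment can be made concrete with a short simulation. The sketch below is not part of CAIS; it generates RCT-style data with a single prognostic covariate and compares the standard error of the treatment coefficient with and without adjustment. The helper ols_se and all coefficient values are assumptions chosen for illustration.

```python
import numpy as np
from typing import Tuple

def ols_se(X: np.ndarray, y: np.ndarray, col: int) -> Tuple[float, float]:
    """Return (coefficient, classical standard error) for column `col` of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[col]), float(np.sqrt(cov[col, col]))

rng = np.random.default_rng(1)
n, true_effect = 1000, 2.0
x = rng.normal(size=n)                     # prognostic covariate
d = rng.binomial(1, 0.5, size=n)           # randomized treatment
y = 50 + 3.0 * x + true_effect * d + rng.normal(size=n)

ones = np.ones(n)
est_raw, se_raw = ols_se(np.column_stack([ones, d]), y, col=1)      # Y ~ D
est_adj, se_adj = ols_se(np.column_stack([ones, d, x]), y, col=1)   # Y ~ D + X

print(f"unadjusted: {est_raw:.3f} (SE {se_raw:.3f})")
print(f"adjusted:   {est_adj:.3f} (SE {se_adj:.3f})")
```

Both estimators are unbiased under randomization; the adjusted one is more precise because the covariate absorbs outcome variance.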

Multi-Treatment RCT Generator

Tests the agent’s handling of complex experimental designs with multiple treatment arms.

Decision Tree Testing:
  • Tests detection of multi-arm experimental designs

  • Validates handling of multiple treatment comparisons

  • Confirms appropriate statistical adjustments for multiple comparisons

class MultiTreatRCTGenerator(DataGenerator):
    """Generate synthetic data for multi-treatment RCTs"""

    def __init__(self, n_observations, n_continuous_covars, n_treatments,
                 true_effect_vec=None, **kwargs):
        super().__init__(n_observations, n_continuous_covars, **kwargs)
        self.n_treatments = n_treatments
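        # `x or default` raises ValueError for multi-element NumPy arrays, so test for None
        self.true_effect_vec = (np.zeros(n_treatments) if true_effect_vec is None
                                else np.asarray(true_effect_vec, dtype=float))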

    def generate_data(self):
        X = self.generate_covariates()
        cols = [f"X{i+1}" for i in range(self.n_covars)]
        df = pd.DataFrame(X, columns=cols)

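        # Multi-arm randomization: arm 0 is control, arms 1..n_treatments are treated
        df['D'] = np.random.randint(0, self.n_treatments+1,
                                   size=self.n_observations)

        # Treatment effects vary by arm; prepend 0 for the control arm so that
        # every value of D (0..n_treatments) indexes a valid effect
        treatment_effects = np.concatenate(([0.0], self.true_effect_vec))
        df['treat_effect'] = treatment_effects[df['D'].to_numpy()]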

        # Outcome generation
        vec = np.random.uniform(0, 1, size=self.n_covars)
        intercept = np.random.normal(50, 3)
        noise = np.random.normal(0, 1, size=self.n_observations)
        df['Y'] = intercept + X.dot(vec) + df['treat_effect'] + noise

        df.drop(columns='treat_effect', inplace=True)
        self.data = df
        return df

Agent Testing Scenarios:
  • Multi-Arm Recognition: Agent should detect multiple treatment groups

  • Comparison Strategy: Should handle pairwise comparisons appropriately

  • Statistical Power: Should account for reduced power in multi-arm designs
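
The document does not say which multiple-comparison adjustment the agent is expected to apply. As one common option, the sketch below implements the Holm step-down procedure in plain Python; the function name holm_adjust and the example p-values are illustrative assumptions.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p upward
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)        # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical p-values for three pairwise arm-vs-control comparisons
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
print(adj)
```

Holm controls the family-wise error rate like Bonferroni but is never less powerful, which matters when power is already reduced by splitting the sample across arms.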

Difference-in-Differences (DiD) Generators

Two DiD generators test different aspects of temporal analysis and panel data handling.

Canonical DiD Generator

Tests the agent’s ability to detect and analyze the classic two-group, two-period (2x2) DiD design.

Decision Tree Testing:
  • Tests detection of panel structure with treatment timing

  • Validates parallel trends assumption checking

  • Confirms selection of DiD over other methods when appropriate

class DiDGenerator(DataGenerator):
    """Generate synthetic data for Difference-in-Differences analysis"""

    def canonical_did_model(self):
        """Classical 2x2 DiD with pre/post and treatment/control"""
        # Treatment assignment
        frac_treated = np.random.uniform(0.35, 0.65)
        n_treated = int(frac_treated * self.n_observations)
        treatment_status = np.zeros(self.n_observations, dtype=int)
        treatment_status[:n_treated] = 1
        np.random.shuffle(treatment_status)

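        # Generate unit-level covariates (shared by the pre and post periods)
        X = self.generate_covariates()
        cols = [f"X{i+1}" for i in range(self.n_covars)]
        covar_df = pd.DataFrame(X, columns=cols)
        unit_ids = np.arange(self.n_observations)  # matches covar_df's index for the merge

        # Time-invariant group difference (treated vs. control) and common time effect
        treat_effect = np.random.normal(0, 1)
        time_effect = np.random.normal(0, 1)

        # Terms not defined in the original excerpt; the values below are
        # assumptions chosen to mirror the RCT generator
        intercept = np.random.normal(50, 3)
        covar_term = X @ np.random.uniform(0, 1, size=self.n_covars)
        pre_noise = np.random.normal(0, 1, size=self.n_observations)
        post_noise = np.random.normal(0, 1, size=self.n_observations)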

        # Pre-period data
        pre_outcome = (intercept + covar_term + pre_noise +
                      treat_effect * treatment_status)
        pre_data = pd.DataFrame({
            'unit_id': unit_ids, 'post': 0, 'D': treatment_status,
            'Y': pre_outcome
        })

        # Post-period data with treatment effect
        post_outcome = (intercept + time_effect + covar_term +
                       self.true_effect * treatment_status +
                       treat_effect * treatment_status + post_noise)
        post_data = pd.DataFrame({
            'unit_id': unit_ids, 'post': 1, 'D': treatment_status,
            'Y': post_outcome
        })

        # Combine periods
        df = pd.concat([pre_data, post_data], ignore_index=True)
        return df.merge(covar_df, left_on="unit_id", right_index=True)
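
For the 2x2 design generated above, the DiD estimate is the change in the treated group's mean outcome minus the change in the control group's mean outcome. The following standalone sketch, with assumed parameter values and a hypothetical did_estimate helper, shows that this difference removes both the time-invariant group gap and the common time effect.

```python
import numpy as np

def did_estimate(y: np.ndarray, d: np.ndarray, post: np.ndarray) -> float:
    """2x2 DiD: (treated post - treated pre) - (control post - control pre)."""
    def cell_mean(group: int, period: int) -> float:
        return y[(d == group) & (post == period)].mean()
    return float((cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0)))

rng = np.random.default_rng(2)
n, true_effect = 2000, 1.5
group_gap, time_effect = 0.7, -0.4          # assumed nuisance parameters
treated = rng.binomial(1, 0.5, size=n)

pre_y = 50 + group_gap * treated + rng.normal(size=n)
post_y = 50 + time_effect + group_gap * treated + true_effect * treated + rng.normal(size=n)

# Stack the two periods into long format
y = np.concatenate([pre_y, post_y])
d = np.tile(treated, 2)
post = np.repeat([0, 1], n)

estimate = did_estimate(y, d, post)
print(f"DiD estimate: {estimate:.3f} (true effect {true_effect})")
```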

Two-Way Fixed Effects (TWFE) DiD Generator

Tests the agent’s handling of staggered treatment adoption and complex panel structures.

Decision Tree Testing:
  • Tests detection of staggered treatment timing

  • Validates handling of multiple time periods

  • Confirms appropriate use of fixed effects

def twfe_model(self):
    """Generate panel data for Two-Way Fixed Effects DiD"""
    # Create panel structure
    unit_ids = np.arange(1, self.n_observations + 1)
    time_periods = np.arange(0, self.n_periods)

    df = pd.DataFrame([(i, t) for i in unit_ids for t in time_periods],
                     columns=["unit", "time"])

    # Staggered treatment adoption
    frac_treated = np.random.uniform(0.35, 0.65)
    n_treated = int(frac_treated * self.n_observations)
    treated_units = np.random.choice(unit_ids, size=n_treated, replace=False)
    treatment_start = {unit: np.random.randint(1, self.n_periods)
                      for unit in treated_units}

    # Treatment indicator
    df["treat_post"] = df.apply(
        lambda row: int(row["unit"] in treatment_start and
                       row["time"] >= treatment_start[row["unit"]]),
        axis=1
    )

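    # Fixed effects and outcome generation
    unit_effects = dict(zip(unit_ids, np.random.normal(0, 1.0, self.n_observations)))
    time_effects = dict(zip(time_periods, np.random.normal(0, 1, len(time_periods))))

    # Terms not defined in the original excerpt; the values below are assumptions
    X = self.generate_covariates()  # one row of covariates per unit
    unit_covar = dict(zip(unit_ids, X @ np.random.uniform(0, 1, size=self.n_covars)))
    covar_term = df["unit"].map(unit_covar)
    intercept = np.random.normal(50, 3)
    noise = np.random.normal(0, 1, size=len(df))

    df["Y"] = (intercept + covar_term +
              df["unit"].map(unit_effects) +
              df["time"].map(time_effects) +
              self.true_effect * df["treat_post"] + noise)

    return df

Agent Testing Scenarios: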
  • Panel Detection: Agent should identify panel data structure

  • Treatment Timing: Should detect staggered vs. simultaneous treatment

  • Fixed Effects: Should include appropriate fixed effects in analysis

  • Parallel Trends: Should test parallel trends assumption when possible
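
To show what "appropriate fixed effects" means in practice, the sketch below estimates the treatment coefficient on a balanced panel with the two-way within transformation, which is equivalent to including unit and time dummies. It is a standalone illustration with assumed parameter values; the helper twfe_estimate is hypothetical and expects arrays shaped (n_units, n_periods).

```python
import numpy as np

def twfe_estimate(y: np.ndarray, treat: np.ndarray) -> float:
    """Two-way fixed effects estimate for a balanced panel via double demeaning."""
    def demean(a: np.ndarray) -> np.ndarray:
        # Remove unit means and period means, then add back the grand mean
        return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    y_t, d_t = demean(y), demean(treat)
    return float((d_t * y_t).sum() / (d_t ** 2).sum())

rng = np.random.default_rng(3)
n_units, n_periods, true_effect = 200, 6, 2.0

# Roughly half the units adopt treatment at a random period; the rest are never treated
start = np.where(rng.random(n_units) < 0.5,
                 rng.integers(1, n_periods, size=n_units),
                 n_periods)
periods = np.arange(n_periods)
treat = (periods[None, :] >= start[:, None]).astype(float)

unit_fe = rng.normal(size=(n_units, 1))
time_fe = rng.normal(size=(1, n_periods))
y = unit_fe + time_fe + true_effect * treat + rng.normal(scale=0.5, size=(n_units, n_periods))

estimate = twfe_estimate(y, treat)
print(f"TWFE estimate: {estimate:.3f} (true effect {true_effect})")
```

The simulated effect is homogeneous, as in twfe_model above. With staggered adoption and effects that differ across cohorts or over time, the TWFE coefficient can be a misleading average, which is one reason staggered scenarios are useful test cases.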

Instrumental Variables (IV) Generators

Two IV generators test different aspects of instrumental variable analysis and endogeneity handling.

Standard IV Generator

Tests the agent’s ability to detect and utilize instrumental variables for endogenous treatments.

Decision Tree Testing:
  • Tests detection of potential endogeneity

  • Validates instrument strength assessment (first-stage F-statistic)

  • Confirms appropriate use of 2SLS estimation

class IVGenerator(DataGenerator):
    """Generate synthetic data for Instrumental Variables analysis"""

    def generate_data(self):
        X = self.generate_covariates()

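        # Instrument (exogenous). The original excerpt references an undefined
        # `mean`; z_mean is an assumed placeholder for the instrument's mean.
        # self.alpha, self.beta_d, self.beta_y and self.encouragement are
        # expected to be set in __init__ (not shown).
        z_mean = 0
        Z = np.random.normal(z_mean, 2, size=self.n_observations).astype(int)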

        # Unobserved confounder (creates endogeneity)
        U = np.random.normal(0, 1, size=self.n_observations)

        # Endogenous treatment
        vec1 = np.random.normal(0, 0.5, size=self.n_covars)
        intercept1 = np.random.normal(30, 2)
        D = (self.alpha * Z + X @ vec1 +
            np.random.normal(size=self.n_observations) +
            intercept1)

        if not self.encouragement:
            D = D + self.beta_d * U  # Add endogeneity

        # Outcome with confounding
        intercept2 = np.random.normal(50, 3)
        vec2 = np.random.normal(0, 0.5, size=self.n_covars)
        Y = (self.true_effect * D + X @ vec2 +
            np.random.normal(size=self.n_observations) + intercept2)

        if not self.encouragement:
            Y = Y + self.beta_y * U  # Add confounding

        df = pd.DataFrame(X, columns=[f"X{i+1}" for i in range(self.n_covars)])
        df['Z'] = Z
        df['D'] = D.astype(int)
        df['Y'] = Y

        self.data = df
        return df

Encouragement Design Generator

Tests the agent’s handling of encouragement designs and compliance issues.

Decision Tree Testing:
  • Tests detection of encouragement design structure

  • Validates handling of partial compliance

  • Confirms appropriate LATE (Local Average Treatment Effect) interpretation

Agent Testing Scenarios:
  • Instrument Detection: Agent should identify potential instruments (Z variable)

  • Strength Assessment: Should calculate and evaluate first-stage F-statistic

  • Endogeneity Testing: Should test for endogeneity when possible

  • Method Selection: Should choose IV over OLS when endogeneity is detected
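
The reason the agent should prefer IV over OLS under endogeneity can be shown with a short simulation. The sketch below is a standalone illustration with assumed coefficients, not the CAIS estimator: an unobserved confounder biases OLS, while a manual two-stage least squares estimate recovers the true effect. The helper names are hypothetical.

```python
import numpy as np

def ols_slope(y: np.ndarray, d: np.ndarray) -> float:
    """Slope from the regression y ~ 1 + d."""
    X = np.column_stack([np.ones(len(y)), d])
    return float(np.linalg.lstsq(X, y, rcond=None)[0][1])

def two_stage_least_squares(y: np.ndarray, d: np.ndarray, z: np.ndarray) -> float:
    """2SLS point estimate with one instrument and no covariates."""
    Z = np.column_stack([np.ones(len(y)), z])
    d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]   # first stage: fitted treatment
    return ols_slope(y, d_hat)                         # second stage: y on fitted treatment

rng = np.random.default_rng(4)
n, true_effect = 5000, 2.0
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # unobserved confounder
d = 1.0 * z + 1.0 * u + rng.normal(size=n)  # endogenous treatment
y = true_effect * d + 2.0 * u + rng.normal(size=n)

ols_est = ols_slope(y, d)
iv_est = two_stage_least_squares(y, d, z)
print(f"OLS: {ols_est:.3f}, 2SLS: {iv_est:.3f}, true effect: {true_effect}")
```

The second-stage regression gives the correct point estimate but not correct standard errors; a dedicated IV routine should be used for inference.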

Regression Discontinuity (RDD) Generator

Tests the agent’s ability to detect and analyze regression discontinuity designs.

Decision Tree Testing:
  • Tests detection of running variable and cutoff

  • Validates bandwidth selection for local analysis

  • Confirms appropriate polynomial specification

class RDDGenerator(DataGenerator):
    """Generate synthetic data for Regression Discontinuity Design"""

    def generate_data(self):
        X = self.generate_covariates()
        cols = [f"X{i+1}" for i in range(self.n_covars)]
        df = pd.DataFrame(X, columns=cols)

        # Running variable around cutoff
        df['running_X'] = (np.random.normal(0, 2, size=self.n_observations) +
                          self.cutoff)

        # Sharp discontinuity in treatment
        df['D'] = (df['running_X'] >= self.cutoff).astype(int)

        # Outcome with smooth function and discontinuity
        df['running_centered'] = df['running_X'] - self.cutoff

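        # Different slopes above and below cutoff
        m_below = 1.5
        m_above = 0.8

        # Terms not defined in the original excerpt; the values below are assumptions
        intercept = np.random.normal(50, 3)
        coeffs = np.random.uniform(0, 1, size=self.n_covars)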

        df["Y"] = (intercept + self.true_effect * df['D'] +
                  m_below * df['running_centered'] * (1 - df['D']) +
                  m_above * df['running_centered'] * df['D'] +
                  X @ coeffs +
                  np.random.normal(0, 0.5, size=self.n_observations))

        self.data = df.drop(columns='running_centered')
        return self.data

Agent Testing Scenarios:
  • Discontinuity Detection: Agent should identify running variable and cutoff

  • Bandwidth Selection: Should choose appropriate bandwidth for analysis

  • Specification Testing: Should test for appropriate polynomial order

  • Validity Checks: Should perform density and covariate balance tests
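
A minimal sketch of the local analysis the agent is expected to perform is shown below. It fits a local linear regression with separate slopes on each side of the cutoff, using a uniform kernel and a fixed bandwidth. The bandwidth of 2.0, the helper rdd_local_linear, and the simulation parameters are assumptions; a real analysis would typically choose the bandwidth with a data-driven procedure.

```python
import numpy as np

def rdd_local_linear(y: np.ndarray, running: np.ndarray,
                     cutoff: float, bandwidth: float) -> float:
    """Sharp RDD estimate: jump at the cutoff from a local linear fit."""
    r = running - cutoff
    keep = np.abs(r) <= bandwidth            # uniform kernel: keep points near the cutoff
    r, y = r[keep], y[keep]
    d = (r >= 0).astype(float)
    # Intercept, jump at the cutoff, slope below, slope above
    X = np.column_stack([np.ones(len(r)), d, r * (1 - d), r * d])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])

rng = np.random.default_rng(5)
n, cutoff, true_effect = 4000, 25.0, 3.0
running = rng.normal(0, 2, size=n) + cutoff
d = (running >= cutoff).astype(int)
centered = running - cutoff
y = (10 + true_effect * d
     + 1.5 * centered * (1 - d)
     + 0.8 * centered * d
     + rng.normal(0, 0.5, size=n))

estimate = rdd_local_linear(y, running, cutoff, bandwidth=2.0)
print(f"RDD estimate: {estimate:.3f} (true effect {true_effect})")
```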

Propensity Score Generators

Two generators test different propensity score methods and observational data analysis.

Propensity Score Matching (PSM) Generator

Tests the agent’s ability to handle selection bias through matching methods.

Decision Tree Testing:
  • Tests detection of observational data with selection bias

  • Validates propensity score estimation and matching procedures

  • Confirms appropriate balance checking

class PSMGenerator(ObservationalDataGenerator):
    """Generate synthetic data for Propensity Score Matching"""

    def test_data(self, print_=False):
        """Test using propensity score matching"""
        lr = LogisticRegression(solver='lbfgs')
        X = self.data[[f"X{i+1}" for i in range(self.n_covars)]]
        lr.fit(X, self.data['D'])
        ps_hat = lr.predict_proba(X)[:, 1]

        # Perform 1:1 nearest neighbor matching (with replacement)
        treated = self.data[self.data['D'] == 1]
        control = self.data[self.data['D'] == 0]

        # Indexing ps_hat by DataFrame index assumes the default RangeIndex
        control_ps = ps_hat[control.index]
        match_idxs = [np.abs(control_ps - ps_hat[i]).argmin()
                     for i in treated.index]
        matches = control.iloc[match_idxs]

        # Calculate ATT
        att = treated['Y'].mean() - matches['Y'].mean()

        result = f"Estimated ATT (matching): {att:.3f} | True: {self.true_effect}"
        return result

Propensity Score Weighting (PSW) Generator

Tests the agent’s ability to use inverse probability weighting for causal inference.

Decision Tree Testing:
  • Tests detection of observational data requiring reweighting

  • Validates inverse probability weighting procedures

  • Confirms appropriate weight calculation and trimming

Agent Testing Scenarios:
  • Selection Bias Detection: Agent should identify potential confounding

  • Propensity Score Estimation: Should estimate propensity scores appropriately

  • Method Choice: Should choose between matching and weighting based on data characteristics

  • Balance Assessment: Should check covariate balance after adjustment
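
No code is shown for the weighting generator, so the following is only a sketch of the technique it targets: inverse probability weighting with clipped propensity scores. It uses a small NumPy logistic regression fitted by Newton-Raphson to stay self-contained. The helper names, the clipping level of 0.05, and the simulation parameters are all assumptions.

```python
import numpy as np

def fit_logistic(X: np.ndarray, d: np.ndarray, n_iter: int = 25) -> np.ndarray:
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (d - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

def ipw_ate(y: np.ndarray, d: np.ndarray, ps: np.ndarray, clip: float = 0.05) -> float:
    """Normalized (Hajek) IPW estimate of the ATE with clipped propensity scores."""
    ps = np.clip(ps, clip, 1 - clip)         # trimming guards against extreme weights
    w1, w0 = d / ps, (1 - d) / (1 - ps)
    return float((w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum())

rng = np.random.default_rng(6)
n, true_effect = 5000, 2.0
x = rng.normal(size=n)                                   # measured confounder
d = rng.binomial(1, 1 / (1 + np.exp(-0.8 * x)))          # selection on x
y = 1.0 + 1.5 * x + true_effect * d + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
ps_hat = 1 / (1 + np.exp(-X @ fit_logistic(X, d)))

naive = float(y[d == 1].mean() - y[d == 0].mean())
ipw = ipw_ate(y, d, ps_hat)
print(f"naive: {naive:.3f}, IPW: {ipw:.3f}, true effect: {true_effect}")
```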

Front-Door Criterion Generator

Tests the agent’s ability to handle mediation analysis and front-door identification.

Decision Tree Testing:
  • Tests detection of mediation structure (D → M → Y)

  • Validates front-door criterion application

  • Confirms appropriate sequential regression approach

class FrontDoorGenerator(DataGenerator):
    """Generate synthetic data satisfying the front-door criterion"""

    def generate_data(self):
        X = self.generate_covariates()
        cols = [f"X{i+1}" for i in range(self.n_covars)]
        df = pd.DataFrame(X, columns=cols)

        # Latent confounder U affects both D and Y
        U = np.random.normal(0, 1, self.n_observations)

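        # Treatment depends on U and X (confounded). The covariate index is
        # centered so that D is not almost always 1: the default covariate
        # means and the weights in vec_d are all positive
        vec_d = np.random.uniform(0.5, 1.5, size=self.n_covars)
        index_d = X @ vec_d
        df['D'] = (index_d - index_d.mean() + 0.8 * U +
                  np.random.normal(0, 1, self.n_observations)) > 0
        df['D'] = df['D'].astype(int)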

        # Mediator depends on D and X (front-door path)
        vec_m = np.random.uniform(0.5, 1.5, size=self.n_covars)
        df['M'] = X @ vec_m + df['D'] * 1.5 + np.random.normal(0, 1, self.n_observations)

        # Outcome depends on M, U, and X (not directly on D). The implied effect
        # of D on Y is 1.5 * 2.0 = 3.0, independent of self.true_effect
        vec_y = np.random.uniform(0.5, 1.5, size=self.n_covars)
        df['Y'] = (50 + 2.0 * df['M'] + 1.0 * U + X @ vec_y +
                  np.random.normal(0, 1, self.n_observations))

        self.data = df
        return df

Agent Testing Scenarios:
  • Mediation Detection: Agent should identify mediator variables

  • Front-Door Validity: Should assess front-door criterion assumptions

  • Sequential Analysis: Should perform appropriate two-stage analysis
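
The sequential analysis can be sketched for the linear case as follows. With a linear data-generating process like the one above, the front-door estimate is the product of two regression coefficients: the effect of D on M, and the effect of M on Y holding D fixed. This is a standalone illustration with assumed coefficients and no covariates, not the CAIS implementation.

```python
import numpy as np

def ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares coefficients for y ~ X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(7)
n = 5000
u = rng.normal(size=n)                                    # latent confounder of D and Y
d = ((0.8 * u + rng.normal(size=n)) > 0).astype(float)    # confounded treatment
m = 1.5 * d + rng.normal(size=n)                          # mediator: depends on D only
y = 50 + 2.0 * m + 1.0 * u + rng.normal(size=n)           # outcome: depends on M and U

ones = np.ones(n)
effect_d_on_m = ols(np.column_stack([ones, d]), m)[1]     # stage 1: M ~ D
effect_m_on_y = ols(np.column_stack([ones, m, d]), y)[1]  # stage 2: Y ~ M + D
front_door = float(effect_d_on_m * effect_m_on_y)

naive = float(ols(np.column_stack([ones, d]), y)[1])      # Y ~ D, biased by U
print(f"front-door: {front_door:.3f}, naive: {naive:.3f}, true effect: 3.0")
```

Conditioning on D in the second stage blocks the back-door path from M to Y through U, which is what allows the product of the two coefficients to identify the effect.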

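RCT Generator (BaseDataGenerator Interface)

Generate data from a randomized controlled trial using the BaseDataGenerator interface:

class RCTDataGenerator(BaseDataGenerator):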
    """Generate data from randomized controlled trials"""

    def get_method_name(self) -> str:
        return "randomized_controlled_trial"

    def generate_treatment(self, covariates: np.ndarray) -> np.ndarray:
        """Generate randomly assigned treatment"""
        # Pure randomization - independent of covariates
        return np.random.binomial(1, 0.5, self.config.n_observations)

    def generate_outcome(
        self,
        treatment: np.ndarray,
        covariates: np.ndarray
    ) -> np.ndarray:
        """Generate outcome with treatment effect"""
        # Base outcome from covariates
        base_outcome = (
            2.0 +  # Intercept
            0.5 * covariates[:, 0] +  # Effect of X1
            0.3 * covariates[:, 1] +  # Effect of X2
            -0.2 * covariates[:, 2]   # Effect of X3
        )

        # Add treatment effect
        if self.config.heterogeneity:
            # Heterogeneous treatment effects
            treatment_effect = (
                self.config.true_effect *
                (1 + 0.5 * covariates[:, 0])  # Effect varies with X1
            )
        else:
            # Homogeneous treatment effect
            treatment_effect = self.config.true_effect

        outcome = base_outcome + treatment_effect * treatment

        # Add noise
        return self.add_noise(outcome)
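
Because treatment here is independent of the covariates, a plain difference in means should recover `true_effect`. The sketch below reproduces the outcome equation from `generate_outcome` in a standalone simulation; the helper `difference_in_means` and the chosen sample size are illustrative assumptions, not part of the CAIS code base.

```python
import numpy as np

def difference_in_means(treatment, outcome):
    """Difference-in-means estimate and its standard error (illustrative helper)."""
    treated = outcome[treatment == 1]
    control = outcome[treatment == 0]
    estimate = treated.mean() - control.mean()
    std_error = np.sqrt(treated.var(ddof=1) / len(treated)
                        + control.var(ddof=1) / len(control))
    return estimate, std_error

rng = np.random.default_rng(42)
n, true_effect = 5_000, 1.5
covariates = rng.normal(size=(n, 3))
treatment = rng.binomial(1, 0.5, size=n)  # pure randomization
outcome = (2.0 + 0.5 * covariates[:, 0] + 0.3 * covariates[:, 1]
           - 0.2 * covariates[:, 2] + true_effect * treatment
           + rng.normal(0, 1.0, size=n))

rct_estimate, rct_se = difference_in_means(treatment, outcome)
print(f"estimate={rct_estimate:.3f}, se={rct_se:.3f}, truth={true_effect}")
```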

Difference-in-Differences Generator

Generate panel data suitable for DiD analysis:

class DifferenceInDifferencesGenerator(BaseDataGenerator):
    """Generate panel data for Difference-in-Differences analysis"""

    def __init__(self, config: DataGenerationConfig, n_periods: int = 4, n_units: int = 50):
        super().__init__(config)
        self.n_periods = n_periods
        self.n_units = n_units
        self.config.n_observations = n_units * n_periods

    def get_method_name(self) -> str:
        return "difference_in_differences"

    def generate_data(self) -> pd.DataFrame:
        """Generate panel data with treatment timing variation"""
        data_list = []

        # Generate unit-specific effects
        unit_effects = np.random.normal(0, 1, self.n_units)

        # Generate time effects
        time_effects = np.random.normal(0, 0.5, self.n_periods)

        # Determine treatment timing (some units treated in period 3)
        treatment_units = np.random.choice(
            self.n_units,
            size=self.n_units // 2,
            replace=False
        )
        treatment_start_period = 2  # Third period (index 2; periods are 0-indexed)

        for unit in range(self.n_units):
            for period in range(self.n_periods):
                # Generate covariates (time-varying)
                covariates = np.random.multivariate_normal(
                    self.covariate_means,
                    self.covariate_cov
                )

                # Treatment assignment
                is_treated_unit = unit in treatment_units
                is_post_treatment = period >= treatment_start_period
                treatment = 1 if (is_treated_unit and is_post_treatment) else 0

                # Outcome generation
                outcome = (
                    unit_effects[unit] +  # Unit fixed effect
                    time_effects[period] +  # Time fixed effect
                    0.5 * covariates[0] +  # Covariate effects
                    0.3 * covariates[1] +
                    self.config.true_effect * treatment +  # Treatment effect
                    np.random.normal(0, self.config.noise_level)  # Noise
                )

                # Create row
                row = {
                    'unit_id': unit,
                    'time_period': period,
                    'treatment': treatment,
                    'outcome': outcome,
                    'treated_unit': int(is_treated_unit),
                    'post_treatment': int(is_post_treatment)
                }

                # Add covariates
                for i, covar in enumerate(covariates):
                    row[f'X{i+1}'] = covar

                data_list.append(row)

        self.data = pd.DataFrame(data_list)

        # Update metadata
        self.metadata.update({
            'n_units': self.n_units,
            'n_periods': self.n_periods,
            'treatment_start_period': treatment_start_period,
            'n_treated_units': len(treatment_units)
        })

        return self.data
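
With a single treatment date, the panel above reduces to a 2x2 comparison, so the ground truth can be checked with the classic difference of differences. The sketch below builds a small panel with the same column roles (`treated_unit`, `post_treatment`, `outcome`) and omits covariates for brevity; `did_estimate` is a hypothetical helper, not the estimator CAIS uses.

```python
import numpy as np

def did_estimate(outcome, treated_unit, post_treatment):
    """2x2 DiD: (treated post - treated pre) - (control post - control pre)."""
    def cell_mean(group, post):
        return outcome[(treated_unit == group) & (post_treatment == post)].mean()
    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

rng = np.random.default_rng(7)
n_units, n_periods, start, true_effect = 400, 4, 2, 2.0
unit_effects = rng.normal(0, 1, n_units)
time_effects = rng.normal(0, 0.5, n_periods)
treated_ids = rng.choice(n_units, size=n_units // 2, replace=False)

# Long-format panel: one row per (unit, period)
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
treated_unit = np.isin(unit, treated_ids).astype(int)
post_treatment = (period >= start).astype(int)
outcome = (unit_effects[unit] + time_effects[period]
           + true_effect * treated_unit * post_treatment
           + rng.normal(0, 1.0, size=unit.size))

did_effect = did_estimate(outcome, treated_unit, post_treatment)
print(f"DiD estimate: {did_effect:.2f} (truth: {true_effect})")
```

Unit and time fixed effects cancel in the double difference because the panel is balanced, which is why no regression is needed for this check.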

Instrumental Variables Generator

Generate data with instrumental variables:

class InstrumentalVariableGenerator(BaseDataGenerator):
    """Generate data with instrumental variables for endogeneity"""

    def __init__(self, config: DataGenerationConfig, instrument_strength: float = 0.5):
        super().__init__(config)
        self.instrument_strength = instrument_strength

    def get_method_name(self) -> str:
        return "instrumental_variable"

    def generate_data(self) -> pd.DataFrame:
        """Generate data with endogenous treatment and valid instrument"""
        # Generate covariates
        covariates = self.generate_covariates()

        # Generate unobserved confounder
        unobserved_confounder = np.random.normal(0, 1, self.config.n_observations)

        # Generate instrument (exogenous)
        instrument = np.random.normal(0, 1, self.config.n_observations)

        # Generate endogenous treatment
        # Treatment depends on instrument, covariates, and unobserved confounder
        treatment_propensity = (
            self.instrument_strength * instrument +  # Instrument effect
            0.3 * covariates[:, 0] +  # Covariate effects
            0.2 * covariates[:, 1] +
            0.4 * unobserved_confounder  # Endogeneity source
        )

        treatment_prob = 1 / (1 + np.exp(-treatment_propensity))
        treatment = np.random.binomial(1, treatment_prob)

        # Generate outcome
        # Outcome depends on treatment, covariates, and unobserved confounder
        outcome = (
            2.0 +  # Intercept
            self.config.true_effect * treatment +  # Treatment effect
            0.5 * covariates[:, 0] +  # Covariate effects
            0.3 * covariates[:, 1] +
            -0.2 * covariates[:, 2] +
            0.6 * unobserved_confounder +  # Confounding
            np.random.normal(0, self.config.noise_level, self.config.n_observations)  # Per-observation noise
        )

        # Create DataFrame
        data = pd.DataFrame({
            'treatment': treatment,
            'outcome': outcome,
            'instrument': instrument,
            'unobserved_confounder': unobserved_confounder  # For validation only
        })

        # Add covariates
        for i in range(self.config.n_continuous_covars):
            data[f'X{i+1}'] = covariates[:, i]

        # Store additional parameters
        self.true_parameters.update({
            'instrument_strength': self.instrument_strength,
            'instrument_variable': 'instrument',
            'first_stage_f_stat': self._calculate_first_stage_f_stat(instrument, treatment)
        })

        self.data = data
        return data

    def _calculate_first_stage_f_stat(self, instrument: np.ndarray, treatment: np.ndarray) -> float:
        """Calculate first-stage F-statistic for instrument strength"""
        from sklearn.linear_model import LinearRegression

        # First stage regression: treatment ~ instrument
        X = instrument.reshape(-1, 1)
        reg = LinearRegression().fit(X, treatment)

        # Residual variance with n - 2 degrees of freedom (intercept + slope)
        residuals = treatment - reg.predict(X)
        n = len(treatment)
        residual_variance = np.sum(residuals**2) / (n - 2)

        # Standard error of the slope
        coefficient = reg.coef_[0]
        se = np.sqrt(residual_variance / np.sum((instrument - np.mean(instrument))**2))

        # With a single instrument, F equals the squared t-statistic
        return float((coefficient / se)**2)
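
To see why the instrument matters, the following sketch simulates the same data-generating process and compares a naive regression slope with the single-instrument (Wald) IV estimate, cov(Z, Y) / cov(Z, D). It is a standalone illustration under the assumption of a homogeneous effect; `wald_iv_estimate` is a hypothetical helper rather than the estimator CAIS runs.

```python
import numpy as np

def wald_iv_estimate(instrument, treatment, outcome):
    """Single-instrument IV estimate: cov(Z, Y) / cov(Z, D)."""
    cov_zy = np.cov(instrument, outcome)[0, 1]
    cov_zd = np.cov(instrument, treatment)[0, 1]
    return cov_zy / cov_zd

rng = np.random.default_rng(1)
n, true_effect, strength = 50_000, 2.0, 0.5
X = rng.normal(size=(n, 3))
U = rng.normal(size=n)  # unobserved confounder
Z = rng.normal(size=n)  # exogenous instrument
logit = strength * Z + 0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.4 * U
D = rng.binomial(1, 1 / (1 + np.exp(-logit)))
Y = (2.0 + true_effect * D + 0.5 * X[:, 0] + 0.3 * X[:, 1] - 0.2 * X[:, 2]
     + 0.6 * U + rng.normal(0, 1.0, size=n))

iv_effect = wald_iv_estimate(Z, D, Y)
naive_effect = np.cov(D, Y)[0, 1] / np.var(D, ddof=1)  # slope of Y on D alone
print(f"naive={naive_effect:.2f}, iv={iv_effect:.2f}, truth={true_effect}")
```

The naive slope is pushed upward because U and the covariates raise both treatment uptake and the outcome, while the IV ratio only uses variation induced by Z.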

Regression Discontinuity Generator

Generate data with discontinuous treatment assignment:

class RegressionDiscontinuityGenerator(BaseDataGenerator):
    """Generate data for Regression Discontinuity Design"""

    def __init__(self, config: DataGenerationConfig, cutoff: float = 0.0, bandwidth: float = 2.0):
        super().__init__(config)
        self.cutoff = cutoff
        self.bandwidth = bandwidth

    def get_method_name(self) -> str:
        return "regression_discontinuity"

    def generate_data(self) -> pd.DataFrame:
        """Generate data with discontinuous treatment assignment"""
        # Generate running variable (forcing variable)
        running_variable = np.random.uniform(
            self.cutoff - self.bandwidth,
            self.cutoff + self.bandwidth,
            self.config.n_observations
        )

        # Generate covariates
        covariates = self.generate_covariates()

        # Treatment assignment based on cutoff
        treatment = (running_variable >= self.cutoff).astype(int)

        # Generate outcome with discontinuity at cutoff
        # Smooth function of running variable
        smooth_outcome = (
            2.0 +  # Intercept
            0.5 * running_variable +  # Smooth trend
            -0.1 * running_variable**2 +  # Quadratic trend
            0.3 * covariates[:, 0] +  # Covariate effects
            0.2 * covariates[:, 1]
        )

        # Add treatment effect (discontinuity)
        outcome = smooth_outcome + self.config.true_effect * treatment

        # Add noise
        outcome = self.add_noise(outcome)

        # Create DataFrame
        data = pd.DataFrame({
            'treatment': treatment,
            'outcome': outcome,
            'running_variable': running_variable
        })

        # Add covariates
        for i in range(self.config.n_continuous_covars):
            data[f'X{i+1}'] = covariates[:, i]

        # Store additional parameters
        self.true_parameters.update({
            'cutoff': self.cutoff,
            'bandwidth': self.bandwidth,
            'running_variable': 'running_variable'
        })

        self.data = data
        return data
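
The discontinuity can be recovered by fitting a separate line on each side of the cutoff within a narrow window and comparing the two fitted values at the cutoff. The sketch below uses `numpy.polyfit` with a uniform window; the helper `local_linear_rdd` and the window width of 0.5 are assumptions for illustration, and a production analysis would normally use a data-driven bandwidth.

```python
import numpy as np

def local_linear_rdd(running, outcome, cutoff=0.0, window=0.5):
    """Difference of the two side-specific linear fits evaluated at the cutoff."""
    centered = running - cutoff

    def intercept(mask):
        # polyfit returns coefficients from highest degree down: [slope, intercept]
        _, const = np.polyfit(centered[mask], outcome[mask], deg=1)
        return const

    left = (centered < 0) & (centered >= -window)
    right = (centered >= 0) & (centered <= window)
    return intercept(right) - intercept(left)

rng = np.random.default_rng(3)
n, true_effect, cutoff, bandwidth = 20_000, 2.0, 0.0, 2.0
running = rng.uniform(cutoff - bandwidth, cutoff + bandwidth, size=n)
X = rng.normal(size=(n, 2))
treatment = (running >= cutoff).astype(int)
outcome = (2.0 + 0.5 * running - 0.1 * running**2
           + 0.3 * X[:, 0] + 0.2 * X[:, 1]
           + true_effect * treatment + rng.normal(0, 1.0, size=n))

rdd_effect = local_linear_rdd(running, outcome, cutoff=cutoff, window=0.5)
print(f"RDD estimate: {rdd_effect:.2f} (truth: {true_effect})")
```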

Propensity Score Generator

Generate observational data suitable for propensity score methods:

class PropensityScoreGenerator(BaseDataGenerator):
    """Generate observational data for propensity score methods"""

    def __init__(self, config: DataGenerationConfig, selection_strength: float = 1.0):
        super().__init__(config)
        self.selection_strength = selection_strength

    def get_method_name(self) -> str:
        return "propensity_score_matching"

    def generate_treatment(self, covariates: np.ndarray) -> np.ndarray:
        """Generate treatment with selection on observables"""
        # Treatment propensity depends on covariates
        propensity_logit = (
            -0.5 +  # Intercept (affects overall treatment rate)
            self.selection_strength * 0.8 * covariates[:, 0] +  # Strong selection
            self.selection_strength * 0.6 * covariates[:, 1] +  # Moderate selection
            self.selection_strength * 0.4 * covariates[:, 2]    # Weak selection
        )

        propensity_prob = 1 / (1 + np.exp(-propensity_logit))
        treatment = np.random.binomial(1, propensity_prob)

        # Store true propensity scores for validation
        self.true_parameters['true_propensity_scores'] = propensity_prob

        return treatment

    def generate_outcome(
        self,
        treatment: np.ndarray,
        covariates: np.ndarray
    ) -> np.ndarray:
        """Generate outcome with confounding"""
        # Base outcome depends on same covariates that affect treatment
        base_outcome = (
            3.0 +  # Intercept
            0.7 * covariates[:, 0] +  # Confounding variable
            0.5 * covariates[:, 1] +  # Confounding variable
            0.3 * covariates[:, 2] +  # Confounding variable
            -0.2 * covariates[:, 0] * covariates[:, 1]  # Interaction
        )

        # Add treatment effect
        if self.config.heterogeneity:
            # Heterogeneous effects based on covariates
            treatment_effect = (
                self.config.true_effect *
                (1 + 0.3 * covariates[:, 0])
            )
        else:
            treatment_effect = self.config.true_effect

        outcome = base_outcome + treatment_effect * treatment

        return self.add_noise(outcome)
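
Since the generator stores the true propensity scores, one way to validate a dataset is to confirm that inverse-probability weighting with those scores removes the selection bias. The sketch below repeats the treatment and outcome equations in a standalone simulation and uses normalized (Hajek) weights; `ipw_ate` is a hypothetical helper, and in practice the agent would have to estimate the propensities.

```python
import numpy as np

def ipw_ate(treatment, outcome, propensity):
    """Normalized (Hajek) inverse-probability-weighted ATE."""
    w1 = treatment / propensity
    w0 = (1 - treatment) / (1 - propensity)
    return np.sum(w1 * outcome) / np.sum(w1) - np.sum(w0 * outcome) / np.sum(w0)

rng = np.random.default_rng(5)
n, true_effect, selection_strength = 50_000, 1.5, 1.0
X = rng.normal(size=(n, 3))
logit = -0.5 + selection_strength * (0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.4 * X[:, 2])
propensity = 1 / (1 + np.exp(-logit))
treatment = rng.binomial(1, propensity)
outcome = (3.0 + 0.7 * X[:, 0] + 0.5 * X[:, 1] + 0.3 * X[:, 2]
           - 0.2 * X[:, 0] * X[:, 1]
           + true_effect * treatment + rng.normal(0, 1.0, size=n))

ipw_effect = ipw_ate(treatment, outcome, propensity)
naive_effect = outcome[treatment == 1].mean() - outcome[treatment == 0].mean()
print(f"naive={naive_effect:.2f}, ipw={ipw_effect:.2f}, truth={true_effect}")
```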

Data Generation Workflow and Scripts

The synthetic data generation system includes a comprehensive workflow for creating, contextualizing, and validating synthetic datasets. This section documents the complete process from configuration to final dataset preparation.

Generation Pipeline Overview

The data generation process follows a structured pipeline:

        graph LR
    subgraph "Configuration"
        CONFIG[settings.sh]
        PARAMS[Parameter Setup]
    end

    subgraph "Data Generation"
        SCRIPTS[Generation Scripts]
        GENERATORS[Method Generators]
        DATA[Raw Datasets]
    end

    subgraph "Context Creation"
        LLM[LLM Context Generation]
        LABELS[Variable Labels]
        STORIES[Background Stories]
        QUERIES[Causal Queries]
    end

    subgraph "Finalization"
        RENAME[Column Renaming]
        METADATA[Metadata Creation]
        VALIDATION[Ground Truth Files]
    end

    CONFIG --> PARAMS
    PARAMS --> SCRIPTS
    SCRIPTS --> GENERATORS
    GENERATORS --> DATA
    DATA --> LLM
    LLM --> LABELS
    LLM --> STORIES
    LLM --> QUERIES
    LABELS --> RENAME
    STORIES --> METADATA
    QUERIES --> VALIDATION
    

Step 1: Configuration and Parameter Setup

The generation process begins with configuration in data_generation/settings.sh:

# Base directory for all synthetic data
export BASE_FOLDER="data_generation/samples/synthetic"

# Dataset sizes for each method
export RCT_SIZE=10
export MULTI_RCT_SIZE=5
export CANONICAL_DID_SIZE=5
export TWFE_DID_SIZE=5
export OBSERVATIONAL_SIZE=5
export IV_SIZE=5
export ENCOURAGEMENT_SIZE=5
export RDD_SIZE=5

# Observation count ranges
export MIN_OBS=300
export MAX_OBS=500
export DEFAULT_OBS=1000

# Special parameters for TWFE (smaller for computational efficiency)
export DEFAULT_OBS_TWFE=100
export MIN_OBS_TWFE=50
export MAX_OBS_TWFE=100

# Covariate specifications
export N_CONTINUOUS=5        # Maximum continuous covariates
export N_BINARY=4           # Maximum binary covariates

# Method-specific parameters
export MAX_TREATMENTS=5      # Multi-treatment RCT arms
export MAX_PERIODS=10        # TWFE time periods
export CUTOFF=25            # RDD cutoff range

Configuration Features:
  • Scalable Testing: Easily adjust dataset sizes for different testing needs

  • Method-Specific Tuning: Tailored parameters for each causal method

  • Resource Management: Smaller datasets for computationally intensive methods

  • Reproducible Setup: Consistent parameters across all generation runs
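
A Python entry point can pick up these exported variables from the environment. The sketch below is one possible way to do that, with the documented values as fallbacks and a basic range check; the `GenerationSettings` dataclass and `load_settings` function are hypothetical and are not claimed to exist in the repository.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class GenerationSettings:
    """Subset of the values exported by settings.sh (hypothetical container)."""
    base_folder: str
    min_obs: int
    max_obs: int
    n_continuous: int
    n_binary: int

def load_settings(env=None):
    """Read settings from the environment, falling back to the documented values."""
    env = os.environ if env is None else env
    settings = GenerationSettings(
        base_folder=env.get("BASE_FOLDER", "data_generation/samples/synthetic"),
        min_obs=int(env.get("MIN_OBS", 300)),
        max_obs=int(env.get("MAX_OBS", 500)),
        n_continuous=int(env.get("N_CONTINUOUS", 5)),
        n_binary=int(env.get("N_BINARY", 4)),
    )
    if settings.min_obs > settings.max_obs:
        raise ValueError("MIN_OBS must not exceed MAX_OBS")
    return settings

# Passing a dict instead of os.environ keeps the example deterministic
settings = load_settings({"MIN_OBS": "350", "MAX_OBS": "450"})
print(settings)
```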

Step 2: Raw Data Generation

Individual method scripts generate raw synthetic datasets:

Single Method Generation:

# Generate RCT data
bash data_generation/create_data/create_rct_data.sh

# Generate DiD data
bash data_generation/create_data/create_did_canonical_data.sh

# Generate IV data
bash data_generation/create_data/create_iv_data.sh

Batch Generation:

# Generate all methods at once
bash data_generation/create_synthetic_data_all.sh

Each generation script follows this pattern:

#!/bin/bash
source data_generation/settings.sh

METHOD="rct"
METADATA_FOLDER="${BASE_FOLDER}/${METHOD}/metadata"
DATA_FOLDER="${BASE_FOLDER}/${METHOD}/data"

python main/generate_synthetic.py \
    -md ${METADATA_FOLDER} \
    -d ${DATA_FOLDER} \
    -m ${METHOD} \
    -s ${RCT_SIZE} \
    -mb ${N_BINARY} \
    -mc ${N_CONTINUOUS} \
    -o ${DEFAULT_OBS}

Output Structure:

data_generation/samples/synthetic/
├── rct/
│   ├── data/
│   │   ├── rct_data_0.csv
│   │   ├── rct_data_1.csv
│   │   └── ...
│   └── metadata/
│       └── rct.json
├── did_canonical/
│   ├── data/
│   └── metadata/
└── ...

Step 3: Context Generation with LLM Integration

The system uses LLM integration to generate realistic contexts for synthetic datasets, making them suitable for testing the complete CAIS workflow.

Context Generation Process:

# Generate context for single method
bash data_generation/create_context/create_context_rct.sh

# Generate contexts for all methods
bash data_generation/create_context_all.sh

LLM Prompt Engineering:

The prompt is built from the method name, the target domain, a summary of the dataset, and the history of previously used contexts:

def create_prompt(summary, method, domain, history):
    """Creates a prompt for generating realistic dataset contexts"""

    method_names = {
        "rct": "Randomized Control Trial",
        "did_canonical": "Canonical Difference in Differences",
        "iv": "Instrumental Variable",
        "rdd": "Regression Discontinuity Design",
        # ... other methods
    }

    domain_guides = {
        "education": "Education data often includes student performance, "
                    "school-level features, socioeconomic background...",
        "healthcare": "Healthcare data may include treatments, diagnoses, "
                     "hospital visits, recovery outcomes...",
        "labor": "Labor datasets typically include income, education, "
                "job type, employment history...",
        "policy": "Policy evaluation data may track program participation, "
                 "regional differences, economic impact..."
    }

    prompt = f"""
    You are generating realistic contexts for synthetic datasets.

    Dataset: {method_names[method]} study in the {domain} domain.

    Dataset Summary: {summary}

    Previously Used Contexts (avoid duplication): {history}

    Tasks:
    1. Propose a realistic real-world scenario
    2. Assign realistic variable names in snake_case
    3. Provide one-line descriptions for each variable
    4. Write background paragraph about data collection
    5. Create a natural language causal question
    6. Write a 1-2 sentence summary

    Return as JSON with keys: variable_labels, description, question, summary, domain
    """

    return prompt

Context Output Example:

{
  "variable_labels": {
    "X1": "years_education",
    "X2": "household_income",
    "X3": "urban_residence",
    "D": "job_training_program",
    "Y": "monthly_earnings"
  },
  "description": "This dataset was collected from a randomized evaluation of a job training program conducted by the Department of Labor in 2019-2020. Participants were randomly assigned to receive either intensive job training or standard employment services.",
  "question": "What is the impact of the job training program on participants' monthly earnings?",
  "summary": "Randomized trial data measuring the effect of job training on employment outcomes.",
  "domain": "labor"
}
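
Before a context like this is used for renaming, it is worth checking that every expected key is present and that each label is usable as a column name. The sketch below shows one such check; `validate_context`, the required-key set (taken from the prompt's "Return as JSON with keys" line), and the snake_case pattern are illustrative assumptions rather than the validation CAIS performs.

```python
import json
import re

REQUIRED_KEYS = {"variable_labels", "description", "question", "summary", "domain"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def validate_context(raw_json, expected_columns):
    """Return a list of problems found in an LLM-generated context (empty if none)."""
    context = json.loads(raw_json)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(context))]
    labels = context.get("variable_labels", {})
    problems += [f"no label for column: {col}" for col in expected_columns
                 if col not in labels]
    problems += [f"not snake_case: {name}" for name in labels.values()
                 if not SNAKE_CASE.match(name)]
    if len(set(labels.values())) != len(labels):
        problems.append("duplicate variable labels")
    return problems

good_context = json.dumps({
    "variable_labels": {"X1": "years_education", "D": "job_training_program",
                        "Y": "monthly_earnings"},
    "description": "Randomized evaluation of a job training program.",
    "question": "What is the impact of the program on monthly earnings?",
    "summary": "Randomized trial data.",
    "domain": "labor",
})
bad_context = json.dumps({
    "variable_labels": {"X1": "years_education", "Y": "Monthly Earnings"},
    "description": "Incomplete context.",
})

good_problems = validate_context(good_context, ["X1", "D", "Y"])
bad_problems = validate_context(bad_context, ["X1", "D", "Y"])
print(good_problems)
print(bad_problems)
```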

Step 4: Data Finalization and Integration

The final step combines raw data with generated contexts to create analysis-ready datasets:

# Finalize all synthetic datasets
bash data_generation/finalize_synthetic_dataset.sh

Finalization Process:

  1. Column Renaming: Replace generic names (X1, X2, D, Y) with realistic variable names

  2. Metadata Integration: Combine generation parameters with contextual information

  3. Ground Truth Files: Create files with known causal effects for validation

  4. Analysis-Ready Format: Prepare datasets for CAIS agent testing

Final Output Structure:

data_generation/samples/synthetic/
├── synthetic_data/           # Renamed datasets
│   ├── rct_data_0.csv
│   ├── did_canonical_data_0.csv
│   └── ...
├── data_info/               # Ground truth files
│   ├── rct_info.csv
│   ├── did_canonical_info.csv
│   └── ...
└── [method]/
    ├── data/                # Original datasets
    ├── metadata/            # Generation metadata
    └── description/         # LLM-generated contexts

Ground Truth File Format:

data_files,natural_language_query,data_description,method,answer,keywords
rct_data_0.csv,"What is the impact of job training on earnings?","Randomized trial of job training program...","rct","1.23","Causality, Treatment effect"
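
Because the query, description, and keyword fields can contain commas, these rows are best written and read with a CSV library rather than by string concatenation. The sketch below round-trips one row using Python's standard `csv` module; the helper name `ground_truth_csv` is hypothetical, and the column list is copied from the header line above.

```python
import csv
import io

FIELDS = ["data_files", "natural_language_query", "data_description",
          "method", "answer", "keywords"]

def ground_truth_csv(rows):
    """Serialize ground-truth rows to CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = [{
    "data_files": "rct_data_0.csv",
    "natural_language_query": "What is the impact of job training on earnings?",
    "data_description": "Randomized trial of job training program...",
    "method": "rct",
    "answer": "1.23",
    "keywords": "Causality, Treatment effect",
}]

csv_text = ground_truth_csv(rows)
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```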

Logging and Quality Control

The generation system includes comprehensive logging for quality control and debugging:

Logging Configuration (data_generation/log_config.ini):

[loggers]
keys=root,observational_data_logger,did_data_logger,iv_data_logger,rct_data_logger

[handlers]
keys=consoleHandler,obsHandler,didHandler,ivHandler,rctHandler

[formatters]
keys=simpleFormatter,complexFormatter

[logger_rct_data_logger]
level=DEBUG
handlers=consoleHandler,rctHandler
qualname=rct_data_logger
propagate=0

Quality Control Features:
  • Generation Validation: Each generator tests its output against known ground truth

  • Statistical Verification: Automated checks of treatment effects and method assumptions

  • Context Quality: LLM-generated contexts are validated for realism and consistency

  • Reproducibility: All generation steps are logged with parameters and random seeds

Batch Processing and Agent Testing

The system supports batch processing for comprehensive agent testing:

Agent Testing Script (data_generation/run_agent.py):

def run_caia(desc, question, df):
    """Run CAIS agent on synthetic dataset"""
    return run_causal_analysis(
        query=question,
        dataset_path=df,
        dataset_description=desc
    )

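def main():
    """Process multiple datasets and collect results"""
    # `args` holds the parsed command-line arguments (csv_meta, data_dir,
    # output_json); the argument parser is omitted from this excerpt
    meta_df = pd.read_csv(args.csv_meta)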
    results = {}

    for idx, row in meta_df.iterrows():
        data_path = os.path.join(args.data_dir, str(row["data_files"]))

        try:
            res = run_caia(
                desc=row["data_description"],
                question=row["natural_language_query"],
                df=data_path,
            )

            # Format results for validation
            formatted_result = {
                "query": row["natural_language_query"],
                "method": row["method"],
                "true_answer": row["answer"],
                "agent_result": res['results']['results'],
                "explanation": res.get("explanation", ""),
                "method_selected": res['results']['results'].get("method_used")
            }

            results[idx] = formatted_result

        except Exception as e:
            results[idx] = {"error": str(e)}

    # Save comprehensive results
    with open(args.output_json, "w") as f:
        json.dump(results, f, indent=2)

Testing Capabilities:
  • Method Selection Validation: Compare agent’s method choice with expected method

  • Effect Estimation Accuracy: Compare estimated effects with known ground truth

  • Decision Tree Logic: Validate decision tree paths for different data types

  • Error Handling: Test agent behavior with edge cases and assumption violations
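
The first two checks can be computed directly from the saved results. The sketch below summarizes a results dictionary shaped like the `formatted_result` entries above; `score_results`, the tolerance of 0.25, and the assumption that `method_selected` uses the same labels as `method` are all illustrative choices, not documented behavior.

```python
def score_results(results, tolerance=0.25):
    """Summarize agent runs shaped like the formatted_result dicts above."""
    completed = [r for r in results.values() if "error" not in r]
    summary = {"n_runs": len(results), "n_failed": len(results) - len(completed)}
    if not completed:
        return summary
    abs_errors = [
        abs(float(r["agent_result"]["causal_effect"]) - float(r["true_answer"]))
        for r in completed
    ]
    hits = sum(r["method_selected"] == r["method"] for r in completed)
    summary["method_accuracy"] = hits / len(completed)
    summary["mean_abs_error"] = sum(abs_errors) / len(abs_errors)
    summary["share_within_tolerance"] = (
        sum(err <= tolerance for err in abs_errors) / len(abs_errors)
    )
    return summary

example_results = {
    "0": {"method": "rct", "true_answer": "1.23", "method_selected": "rct",
          "agent_result": {"causal_effect": 1.30, "method_used": "rct"}},
    "1": {"method": "iv", "true_answer": "2.00", "method_selected": "rdd",
          "agent_result": {"causal_effect": 2.60, "method_used": "rdd"}},
    "2": {"error": "dataset not found"},
}

summary = score_results(example_results)
print(summary)
```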

Scenario Generation and Testing

The synthetic data system supports various testing scenarios to validate different aspects of the CAIS agent.

Assumption Violation Scenarios

Generate data that violates specific method assumptions to test agent robustness:

Parallel Trends Violation (DiD):

Tests the agent’s ability to detect and handle violations of the parallel trends assumption in difference-in-differences analysis.

def generate_parallel_trends_violation(base_generator, violation_strength=0.5):
    """Generate DiD data with differential pre-trends"""
    data = base_generator.generate_data()

    # Add differential time trends for treated units
    treated_units = data['treated_unit'] == 1
    time_trend_violation = (
        violation_strength *
        data['time_period'] *
        treated_units.astype(int)
    )

    data['outcome'] += time_trend_violation
    return data

Agent Testing: Should detect trend violations through pre-treatment trend tests and either warn users or suggest alternative methods.
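
A simple version of such a pre-trend check tracks the gap between treated and control means across the pre-treatment periods and looks at its slope, which should be near zero when trends are parallel. The sketch below is a simplified diagnostic under that assumption; `pre_trend_gap_slope` is a hypothetical helper, it needs at least two pre-treatment periods, and it does not replace a formal event-study test.

```python
import numpy as np

def pre_trend_gap_slope(time_period, treated_unit, outcome, treatment_start):
    """Slope of the treated-minus-control mean gap over pre-treatment periods."""
    periods = np.arange(treatment_start)
    gaps = [
        outcome[(time_period == t) & (treated_unit == 1)].mean()
        - outcome[(time_period == t) & (treated_unit == 0)].mean()
        for t in periods
    ]
    return np.polyfit(periods, gaps, deg=1)[0]

rng = np.random.default_rng(11)
n_units, n_periods, start = 400, 6, 4
unit = np.repeat(np.arange(n_units), n_periods)
time_period = np.tile(np.arange(n_periods), n_units)
treated_unit = (unit < n_units // 2).astype(int)

# Parallel-trends outcome: unit effects plus noise, no differential trend
parallel = rng.normal(0, 1, n_units)[unit] + rng.normal(0, 1.0, size=unit.size)
# Violation as in generate_parallel_trends_violation with violation_strength=0.5
violated = parallel + 0.5 * time_period * treated_unit

slope_parallel = pre_trend_gap_slope(time_period, treated_unit, parallel, start)
slope_violated = pre_trend_gap_slope(time_period, treated_unit, violated, start)
print(f"parallel slope={slope_parallel:.3f}, violated slope={slope_violated:.3f}")
```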

Weak Instrument (IV):

Tests the agent’s handling of weak instruments that violate the relevance assumption.

def generate_weak_instrument(base_generator, weak_strength=0.1):
    """Generate IV data with weak first-stage relationship"""
    base_generator.instrument_strength = weak_strength
    data = base_generator.generate_data()

    # First-stage F-statistic for validation (the helper expects NumPy arrays)
    first_stage_f = base_generator._calculate_first_stage_f_stat(
        data['instrument'].to_numpy(),
        data['treatment'].to_numpy()
    )

    return data, first_stage_f

Agent Testing: Should calculate first-stage F-statistic and warn when F < 10, potentially suggesting alternative methods.

Unmeasured Confounding (Propensity Score):

Tests the agent’s behavior when key confounders are unmeasured, violating the unconfoundedness assumption.

def generate_unmeasured_confounding(base_generator, confounding_strength=0.8):
    """Generate data with unmeasured confounding"""
    data = base_generator.generate_data()

    # Add unmeasured confounder affecting both treatment and outcome
    n_obs = len(data)
    unmeasured_confounder = np.random.normal(0, 1, n_obs)

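    # Re-draw treatment so that it depends on the unmeasured confounder.
    # Note: this replaces the generator's original covariate-based assignment.
    treatment_adjustment = confounding_strength * unmeasured_confounder
    adjusted_probs = 1 / (1 + np.exp(-treatment_adjustment))
    new_treatment = np.random.binomial(1, adjusted_probs)

    # Keep the outcome consistent with the re-drawn treatment
    # (assumes a homogeneous effect equal to config.true_effect)
    true_effect = base_generator.config.true_effect
    data['outcome'] += true_effect * (new_treatment - data['treatment'])
    data['treatment'] = new_treatment

    # Add confounding to outcome
    data['outcome'] += confounding_strength * unmeasured_confounder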

    return data

Agent Testing: Should perform sensitivity analyses and warn about potential unmeasured confounding when balance tests fail.

Manipulation of Running Variable (RDD):

Tests the agent’s ability to detect manipulation around the cutoff in regression discontinuity designs.

def generate_rdd_manipulation(base_generator, manipulation_strength=0.3):
    """Generate RDD data with running variable manipulation"""
    data = base_generator.generate_data()

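    # Manipulation: units just below the cutoff nudge their score upward,
    # so some of them cross the threshold (creates bunching above the cutoff)
    cutoff = base_generator.cutoff
    just_below = (data['running_variable'] < cutoff) & (data['running_variable'] > cutoff - 0.5)
    manipulation_effect = (
        manipulation_strength *
        np.abs(np.random.normal(0, 1, len(data))) *
        just_below
    )

    data['running_variable'] += manipulation_effect

    # Recalculate treatment based on manipulated running variable
    new_treatment = (data['running_variable'] >= cutoff).astype(int)

    # Units that crossed the cutoff now receive the treatment effect
    # (assumes a homogeneous effect equal to config.true_effect)
    crossed = (new_treatment == 1) & (data['treatment'] == 0)
    data.loc[crossed, 'outcome'] += base_generator.config.true_effect
    data['treatment'] = new_treatment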

    return data

Agent Testing: Should perform McCrary density tests and detect discontinuities in the running variable distribution.
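
As a rough stand-in for a density test, one can compare how many observations fall just below and just above the cutoff: with a smooth density the two counts should be similar. The sketch below implements that comparison with a normal approximation to the binomial. It is a deliberately simplified check and not the McCrary test itself; `cutoff_density_check` and the window of 0.25 are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

def cutoff_density_check(running, cutoff=0.0, window=0.25):
    """Compare counts just below vs. just above the cutoff (simplified check)."""
    below = int(np.sum((running >= cutoff - window) & (running < cutoff)))
    above = int(np.sum((running >= cutoff) & (running < cutoff + window)))
    total = below + above
    # Under a smooth density, a nearby unit is about equally likely to be on either side
    z = (above - below) / np.sqrt(total)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return below, above, p_value

rng = np.random.default_rng(13)
n, cutoff = 4_000, 0.0
clean = rng.uniform(-2, 2, size=n)

# Move roughly 30% of the units just below the cutoff to just above it
manipulated = clean.copy()
movers = (clean >= cutoff - 0.25) & (clean < cutoff) & (rng.uniform(size=n) < 0.3)
manipulated[movers] = cutoff + rng.uniform(0, 0.25, size=movers.sum())

clean_below, clean_above, clean_p = cutoff_density_check(clean, cutoff)
manip_below, manip_above, manip_p = cutoff_density_check(manipulated, cutoff)
print(f"clean: below={clean_below}, above={clean_above}, p={clean_p:.3f}")
print(f"manipulated: below={manip_below}, above={manip_above}, p={manip_p:.2e}")
```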

Edge Case and Robustness Testing

The system generates challenging edge cases to test agent robustness:

Small Sample Sizes:

def generate_small_sample_data(method="rct", n_obs=30):
    """Generate small sample data to test statistical power"""
    config = DataGenerationConfig(n_observations=n_obs)
    generator = get_generator_class(method)(config)

    data = generator.generate_data()

    # Calculate expected statistical power
    effect_size = config.true_effect / config.noise_level
    power = calculate_statistical_power(n_obs, effect_size)

    return data, power

Agent Testing: Should warn about low statistical power and suggest larger samples or alternative methods.
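
The example above calls `calculate_statistical_power` without showing it. One possible implementation, assuming a two-sided two-sample z-test with equal arm sizes and a standardized effect size, is sketched below; the actual helper in the code base may use a different test or approximation.

```python
from math import sqrt
from statistics import NormalDist

def calculate_statistical_power(n_obs, effect_size, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test with equal arm sizes."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # With n_obs / 2 units per arm, the standardized difference has
    # standard error sqrt(2 / (n_obs / 2)) = 2 / sqrt(n_obs)
    shift = effect_size * sqrt(n_obs) / 2
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)

power_small = calculate_statistical_power(30, 0.5)
power_large = calculate_statistical_power(1000, 0.5)
power_null = calculate_statistical_power(100, 0.0)
print(f"n=30: {power_small:.2f}, n=1000: {power_large:.2f}, no effect: {power_null:.2f}")
```

With no true effect the function returns the significance level, which is a quick sanity check on the formula.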

High-Dimensional Data:

def generate_high_dimensional_data(method="observational", n_covariates=50):
    """Generate data with many covariates to test curse of dimensionality"""
    config = DataGenerationConfig(
        n_continuous_covars=n_covariates,
        n_observations=200  # Relatively small sample
    )

    generator = PropensityScoreGenerator(config)
    data = generator.generate_data()

    return data

Agent Testing: Should detect high-dimensional settings and suggest regularization or dimension reduction.

Extreme Outliers:

def generate_outlier_data(base_generator, outlier_fraction=0.05):
    """Generate data with extreme outliers"""
    data = base_generator.generate_data()

    n_outliers = int(outlier_fraction * len(data))
    outlier_indices = np.random.choice(len(data), n_outliers, replace=False)

    # Add extreme values to outcome
    outlier_values = np.random.choice([-1, 1], n_outliers) * np.random.uniform(5, 10, n_outliers)
    data.loc[outlier_indices, 'outcome'] += outlier_values

    return data

Agent Testing: Should detect outliers and suggest robust estimation methods or outlier removal.

Missing Data Patterns:

def generate_missing_data(base_generator, missing_pattern="random", missing_rate=0.15):
    """Generate data with various missing data patterns"""
    data = base_generator.generate_data()

    if missing_pattern == "random":
        # Missing completely at random
        for col in data.columns:
            if col not in ['treatment', 'outcome']:
                n_missing = int(missing_rate * len(data))
                missing_indices = np.random.choice(len(data), n_missing, replace=False)
                data.loc[missing_indices, col] = np.nan

    elif missing_pattern == "informative":
        # Missingness depends on observed treatment status (MAR rather than MCAR):
        # only treated units lose values in this branch
        treated_indices = data[data['treatment'] == 1].index
        for col in data.columns:
            if col not in ['treatment', 'outcome']:
                # Higher missing rate for treated units
                treated_missing = np.random.choice(
                    treated_indices,
                    int(missing_rate * 1.5 * len(treated_indices)),
                    replace=False
                )
                data.loc[treated_missing, col] = np.nan

    return data

Agent Testing: Should detect missing data patterns and suggest appropriate handling methods (imputation, complete case analysis, etc.).

Usage Examples and Best Practices

Complete Workflow Example

Here’s a complete example of generating and testing synthetic data:

# Step 1: Configure and generate base data
# RCTDataGenerator is the RCT generator class shown earlier in this document
from causal_agent.synthetic import RCTDataGenerator, DataGenerationConfig

config = DataGenerationConfig(
    n_observations=1000,
    n_continuous_covars=3,
    n_binary_covars=2,
    true_effect=1.5,
    noise_level=1.0,
    seed=42
)

generator = RCTDataGenerator(config)
data = generator.generate_data()

# Step 2: Generate realistic context
from causal_agent.synthetic.prompts import create_prompt, generate_data_summary

summary = generate_data_summary(
    data,
    n_cont_vars=3,
    n_bin_vars=2,
    method="rct"
)

prompt = create_prompt(summary, "rct", "education", "")
# Use LLM to generate context (implementation depends on LLM provider)
context = generate_context_with_llm(prompt)

# Step 3: Rename columns with realistic names
data_renamed = data.rename(columns=context['variable_labels'])

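# Step 4: Test with CAIS agent
from causal_agent.agent import run_causal_analysis

# run_causal_analysis takes a file path, so persist the renamed data first
dataset_path = "rct_example.csv"  # illustrative file name
data_renamed.to_csv(dataset_path, index=False)

result = run_causal_analysis(
    query=context['question'],
    dataset_path=dataset_path,
    dataset_description=context['description']
)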

# Step 5: Validate results
true_effect = config.true_effect
estimated_effect = result['results']['results']['causal_effect']

print(f"True effect: {true_effect}")
print(f"Estimated effect: {estimated_effect}")
print(f"Method selected: {result['results']['results']['method_used']}")
print("Expected method: RCT/Difference-in-means")

Batch Testing Example

For comprehensive testing across multiple methods and scenarios:

def run_comprehensive_test_suite():
    """Run comprehensive test suite across all methods and scenarios"""

    methods = ['rct', 'did_canonical', 'iv', 'rdd', 'observational']
    scenarios = ['canonical', 'assumption_violation', 'small_sample', 'outliers']

    results = {}

    for method in methods:
        for scenario in scenarios:
            print(f"Testing {method} with {scenario} scenario...")

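            # Generate appropriate data. The helpers called here are assumed to be
            # scenario-level wrappers that take a method name and return
            # (data, true_params); they differ from the generator-based
            # functions shown in the edge-case section.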
            if scenario == 'canonical':
                data, true_params = generate_canonical_data(method)
            elif scenario == 'assumption_violation':
                data, true_params = generate_violation_data(method)
            elif scenario == 'small_sample':
                data, true_params = generate_small_sample_data(method)
            elif scenario == 'outliers':
                data, true_params = generate_outlier_data(method)

            # Test with agent
            try:
                result = test_with_agent(data, true_params)
                results[f"{method}_{scenario}"] = {
                    'success': True,
                    'method_correct': result['method_used'] == true_params['expected_method'],
                    'effect_accuracy': abs(result['effect'] - true_params['true_effect']),
                    'explanation_quality': evaluate_explanation(result['explanation'])
                }
            except Exception as e:
                results[f"{method}_{scenario}"] = {
                    'success': False,
                    'error': str(e)
                }

    return results
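
The dictionary returned by the suite can be condensed into a few headline numbers. The following is a minimal sketch that assumes the result layout shown above; `summarize_suite_results` and the sample entries (including the error message) are hypothetical and not part of CAIS.

```python
def summarize_suite_results(results):
    """Aggregate the per-run dictionaries produced by a batch test run."""
    total = len(results)
    succeeded = [r for r in results.values() if r.get('success')]
    n_ok = len(succeeded)
    n_correct = sum(1 for r in succeeded if r.get('method_correct'))
    errors = [r['effect_accuracy'] for r in succeeded if 'effect_accuracy' in r]

    return {
        'n_runs': total,
        # Share of runs that completed without raising an exception
        'success_rate': n_ok / total if total else 0.0,
        # Share of completed runs where the expected method was selected
        'method_accuracy': n_correct / n_ok if n_ok else 0.0,
        # Mean absolute error of the effect estimate over completed runs
        'mean_abs_error': sum(errors) / len(errors) if errors else None,
        'failed_runs': sorted(k for k, r in results.items() if not r.get('success')),
    }


# Made-up entries in the same shape as the suite output
example_results = {
    'rct_canonical': {'success': True, 'method_correct': True, 'effect_accuracy': 0.1},
    'iv_small_sample': {'success': True, 'method_correct': False, 'effect_accuracy': 0.5},
    'rdd_outliers': {'success': False, 'error': 'example failure message'},
}
summary = summarize_suite_results(example_results)
print(summary)
```

Reporting failed runs separately keeps crashes from being counted as wrong method selections.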

Best Practices for Synthetic Data Generation

Parameter Selection:
  • Use realistic effect sizes (typically 0.1 to 2.0 standard deviations)

  • Vary sample sizes to test statistical power considerations

  • Include appropriate noise levels to simulate real-world data

  • Use correlated covariates to reflect realistic data structures
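
As a rough illustration of these parameter choices, the sketch below generates a toy randomized dataset with two correlated covariates and an effect expressed in noise standard deviations. It uses only the standard library; `generate_rct_rows`, the covariate coefficients, and the column names are assumptions made for this example and do not reflect the CAIS generators.

```python
import random


def generate_rct_rows(n, effect_in_sd, noise_sd=1.0, rho=0.5, seed=42):
    """Toy RCT generator with two correlated covariates (illustrative only)."""
    rng = random.Random(seed)  # local RNG so the dataset is reproducible
    true_effect = effect_in_sd * noise_sd  # effect in units of the noise SD
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        # Mix x1 with fresh noise so that corr(x1, x2) is approximately rho
        x2 = rho * x1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        treatment = 1 if rng.random() < 0.5 else 0  # fair coin flip
        outcome = (true_effect * treatment + 0.5 * x1 + 0.3 * x2
                   + rng.gauss(0, noise_sd))
        rows.append({'X1': x1, 'X2': x2, 'treatment': treatment, 'outcome': outcome})
    return rows, true_effect


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rows, true_effect = generate_rct_rows(n=5000, effect_in_sd=0.5, noise_sd=2.0)
corr = pearson([r['X1'] for r in rows], [r['X2'] for r in rows])
print(f"true effect: {true_effect}, corr(X1, X2): {corr:.2f}")
```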

Validation Procedures:
  • Always analyze generated data with an established statistical method suited to its design

  • Verify that true parameters can be recovered under ideal conditions

  • Check that assumption violations produce expected biases

  • Validate that edge cases trigger appropriate agent responses
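
The second and third checks can be sketched with a small simulation: under random assignment a naive difference in means should recover the true effect, while confounded assignment should bias it. This is a standard-library sketch under simple assumptions (one unobserved confounder, logistic assignment); `naive_effect_estimate` is a hypothetical helper, not a CAIS function.

```python
import math
import random


def naive_effect_estimate(n, true_effect, confounding, seed):
    """Difference in means on simulated data with one confounder u.

    confounding = 0 gives a fair coin flip (ideal RCT conditions);
    larger values make units with high u more likely to be treated.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        p_treat = 1 / (1 + math.exp(-confounding * u))
        t = 1 if rng.random() < p_treat else 0
        y = true_effect * t + u + rng.gauss(0, 1)  # u also drives the outcome
        (treated if t else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


ideal = naive_effect_estimate(20000, true_effect=1.0, confounding=0.0, seed=1)
violated = naive_effect_estimate(20000, true_effect=1.0, confounding=2.0, seed=1)
print(f"ideal: {ideal:.2f}, confounded: {violated:.2f}")
```

In the confounded case the estimate is pushed upward because treated units have systematically higher values of `u`.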

Context Generation:
  • Use domain-specific terminology and scenarios

  • Ensure variable names are realistic and interpretable

  • Create plausible data collection stories

  • Generate natural language questions that avoid statistical jargon

Testing Integration:
  • Test complete agent workflow, not just individual methods

  • Validate decision tree logic with appropriate data characteristics

  • Check error handling and edge case responses

  • Ensure explanations are accurate and helpful

Documentation and Reproducibility:
  • Document all generation parameters and random seeds

  • Save metadata alongside generated datasets

  • Include ground truth information for validation

  • Maintain version control for generation scripts and parameters
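
One way to follow these practices is to write a small metadata file next to every dataset. The sketch below is an assumption about how this could look; the file naming scheme, the metadata keys, and `save_with_metadata` are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_with_metadata(rows, generation_params, true_parameters, out_dir, name):
    """Write rows to CSV and a sidecar JSON file with seed and ground truth."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data_path = out_dir / f"{name}.csv"
    meta_path = out_dir / f"{name}.meta.json"

    with open(data_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    metadata = {
        'dataset_file': data_path.name,
        'n_rows': len(rows),
        'generation_parameters': generation_params,  # includes the random seed
        'true_parameters': true_parameters,          # ground truth for validation
    }
    meta_path.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return data_path, meta_path


rows = [{'treatment': 1, 'outcome': 2.5}, {'treatment': 0, 'outcome': 1.0}]
params = {'n_observations': 2, 'noise_level': 1.0, 'seed': 42}
truth = {'method': 'rct', 'true_effect': 1.5}

with tempfile.TemporaryDirectory() as tmp:
    data_path, meta_path = save_with_metadata(rows, params, truth, tmp, 'rct_example')
    reloaded = json.loads(meta_path.read_text())

print(reloaded['generation_parameters'])
```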

Integration with CAIS Testing Framework

The synthetic data generation system is fully integrated with the CAIS testing and validation framework, enabling comprehensive evaluation of the autonomous agent’s capabilities.

Continuous Integration Testing

The synthetic data system supports automated testing in CI/CD pipelines:

# .github/workflows/synthetic_data_tests.yml
name: Synthetic Data Validation

on: [push, pull_request]

jobs:
  test-synthetic-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Generate synthetic datasets
        run: |
          bash data_generation/create_synthetic_data_all.sh

      - name: Test agent on synthetic data
        run: |
          python tests/test_synthetic_data_integration.py

      - name: Validate decision tree logic
        run: |
          python tests/test_decision_tree_validation.py

Performance Benchmarking

The system enables systematic performance benchmarking across different data characteristics:

import time

class SyntheticDataBenchmark:
    """Benchmark CAIS performance on synthetic data"""

    def __init__(self):
        self.results = {}
        self.benchmark_configs = self._generate_benchmark_configs()

    def _generate_benchmark_configs(self):
        """Generate configurations for systematic benchmarking"""
        configs = []

        # Vary sample sizes
        for n_obs in [100, 500, 1000, 5000]:
            # Vary effect sizes
            for effect_size in [0.1, 0.5, 1.0, 2.0]:
                # Vary noise levels
                for noise in [0.5, 1.0, 2.0]:
                    configs.append({
                        'n_observations': n_obs,
                        'true_effect': effect_size,
                        'noise_level': noise
                    })

        return configs

    def run_benchmark_suite(self):
        """Run comprehensive benchmark across all configurations"""
        methods = ['rct', 'did_canonical', 'iv', 'rdd', 'observational']

        for method in methods:
            method_results = []

            for config in self.benchmark_configs:
                # Generate data
                generator = self._get_generator(method, config)
                data = generator.generate_data()

                # Test with agent
                start_time = time.time()
                result = self._test_with_agent(data, generator.get_true_parameters())
                execution_time = time.time() - start_time

                # Record results
                method_results.append({
                    'config': config,
                    'execution_time': execution_time,
                    'method_correct': result['method_used'] == method,
                    'effect_accuracy': abs(result['effect'] - config['true_effect']),
                    'confidence_interval_coverage': self._check_ci_coverage(result, config),
                    'explanation_quality': self._evaluate_explanation(result['explanation'])
                })

            self.results[method] = method_results

        return self.results
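
The benchmark records `confidence_interval_coverage` through `_check_ci_coverage`, whose implementation is not shown here. As a hedged illustration of what such a check measures, the sketch below tests whether an interval contains the true effect and estimates empirical coverage over repeated simulated experiments. All function names and the normal-approximation interval are assumptions for this example.

```python
import random


def diff_in_means_ci(treated, control, z=1.96):
    """Normal-approximation 95% interval for a difference in means."""
    def mean(v):
        return sum(v) / len(v)

    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)

    estimate = mean(treated) - mean(control)
    se = (var(treated) / len(treated) + var(control) / len(control)) ** 0.5
    return estimate - z * se, estimate + z * se


def ci_covers(ci, true_effect):
    """True when the interval contains the known true effect."""
    low, high = ci
    return low <= true_effect <= high


def empirical_coverage(true_effect, n_per_arm=100, n_reps=400, seed=0):
    """Share of simulated experiments whose interval covers the truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        treated = [true_effect + rng.gauss(0, 1) for _ in range(n_per_arm)]
        hits += ci_covers(diff_in_means_ci(treated, control), true_effect)
    return hits / n_reps


coverage = empirical_coverage(true_effect=1.0)
print(f"empirical coverage: {coverage:.3f}")
```

For a well-calibrated 95% interval the empirical coverage should land close to 0.95; values far below that suggest understated uncertainty.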

Quality Assurance and Validation

The system includes comprehensive quality assurance measures:

Statistical Validation:

def validate_synthetic_data_quality(data, true_parameters):
    """Comprehensive validation of synthetic data quality"""
    validation_results = {}

    # Check basic statistical properties
    validation_results['sample_size'] = len(data)
    validation_results['missing_data_rate'] = data.isnull().sum().sum() / data.size

    # Validate treatment assignment
    if 'treatment' in data.columns:
        treatment_rate = data['treatment'].mean()
        validation_results['treatment_rate'] = treatment_rate
        validation_results['treatment_balance'] = abs(treatment_rate - 0.5) < 0.1

    # Validate covariate balance (for observational data)
    if true_parameters.get('method') in ['propensity_score_matching', 'propensity_score_weighting']:
        balance_stats = calculate_covariate_balance(data)
        validation_results['covariate_balance'] = balance_stats

    # Validate known relationships
    if 'instrument' in data.columns:
        first_stage_f = calculate_first_stage_f_stat(data['instrument'], data['treatment'])
        validation_results['instrument_strength'] = first_stage_f
        validation_results['weak_instrument'] = first_stage_f < 10

    # Validate effect recovery
    estimated_effect = estimate_treatment_effect(data, true_parameters['method'])
    true_effect = true_parameters['true_effect']
    validation_results['effect_bias'] = abs(estimated_effect - true_effect)
    validation_results['effect_recovery_success'] = validation_results['effect_bias'] < 0.2

    return validation_results
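
The instrument check above relies on `calculate_first_stage_f_stat`, which is not defined in this section. One possible implementation for a single instrument is sketched below, assuming a first-stage regression of treatment on the instrument with an intercept, where the F-statistic equals R² (n − 2) / (1 − R²). The function name `first_stage_f_stat` and the simulated data are illustrative only.

```python
import random


def first_stage_f_stat(instrument, treatment):
    """F-statistic of a one-regressor first stage (treatment ~ instrument)."""
    n = len(instrument)
    mean_z = sum(instrument) / n
    mean_t = sum(treatment) / n
    s_zz = sum((z - mean_z) ** 2 for z in instrument)
    s_tt = sum((t - mean_t) ** 2 for t in treatment)
    s_zt = sum((z - mean_z) * (t - mean_t) for z, t in zip(instrument, treatment))
    r_squared = s_zt ** 2 / (s_zz * s_tt)
    # With one regressor and an intercept: F = R^2 / (1 - R^2) * (n - 2)
    return r_squared / (1 - r_squared) * (n - 2)


rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(1000)]
strong_t = [0.6 * zi + rng.gauss(0, 1) for zi in z]   # relevant instrument
weak_t = [0.01 * zi + rng.gauss(0, 1) for zi in z]    # nearly irrelevant instrument

f_strong = first_stage_f_stat(z, strong_t)
f_weak = first_stage_f_stat(z, weak_t)
print(f"strong: {f_strong:.1f}, weak: {f_weak:.1f}")
```

The threshold of 10 used in the validation code is a common rule of thumb for flagging weak instruments.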

Decision Tree Logic Validation:

def validate_decision_tree_logic(synthetic_datasets):
    """Validate that agent makes correct method selections"""
    validation_results = {}

    for dataset_name, (data, true_params) in synthetic_datasets.items():
        # Run agent analysis
        agent_result = run_causal_analysis(
            query=true_params['query'],
            dataset_path=data,
            dataset_description=true_params['description']
        )

        # Check method selection
        expected_method = true_params['expected_method']
        selected_method = agent_result['results']['results']['method_used']

        validation_results[dataset_name] = {
            'method_selection_correct': selected_method == expected_method,
            'expected_method': expected_method,
            'selected_method': selected_method,
            'decision_explanation': agent_result.get('explanation', ''),
            'effect_estimate': agent_result['results']['results']['causal_effect'],
            'true_effect': true_params['true_effect']
        }

    return validation_results
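
The per-dataset results can then be rolled up into an overall accuracy and a tally of which methods get confused with which. A minimal sketch, assuming the result layout returned above; `summarize_method_selection` and the example entries are hypothetical.

```python
from collections import Counter


def summarize_method_selection(validation_results):
    """Overall selection accuracy plus counts of (expected, selected) mistakes."""
    records = list(validation_results.values())
    n_correct = sum(1 for r in records if r['method_selection_correct'])
    confusions = Counter(
        (r['expected_method'], r['selected_method'])
        for r in records
        if not r['method_selection_correct']
    )
    return {
        'accuracy': n_correct / len(records) if records else 0.0,
        'confusions': dict(confusions),
    }


# Made-up entries in the same shape as the validation output
example = {
    'rct_01': {'method_selection_correct': True,
               'expected_method': 'rct', 'selected_method': 'rct'},
    'did_01': {'method_selection_correct': True,
               'expected_method': 'difference_in_differences',
               'selected_method': 'difference_in_differences'},
    'iv_01': {'method_selection_correct': False,
              'expected_method': 'instrumental_variable',
              'selected_method': 'propensity_score_matching'},
    'iv_02': {'method_selection_correct': False,
              'expected_method': 'instrumental_variable',
              'selected_method': 'propensity_score_matching'},
}
selection_summary = summarize_method_selection(example)
print(selection_summary)
```

A repeated (expected, selected) pair points to a specific branch of the decision tree worth inspecting.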

Future Enhancements and Extensions

Planned Improvements

The synthetic data generation system continues to evolve with planned enhancements:

Advanced Scenario Generation:
  • Mediation Analysis: More sophisticated front-door and mediation scenarios

  • Network Effects: Data with spillover effects and network structures

  • Time-Varying Treatments: Complex temporal treatment patterns

  • Survival Analysis: Time-to-event outcomes with censoring

Enhanced Realism:
  • Real Data Mimicking: Generate synthetic data that closely mimics real dataset characteristics

  • Domain-Specific Generators: Specialized generators for healthcare, education, economics

  • Complex Confounding: More realistic confounding structures based on real-world patterns

Improved Testing Capabilities:
  • Adversarial Testing: Generate data specifically designed to challenge the agent

  • Robustness Testing: Systematic testing of agent behavior under various assumption violations

  • Scalability Testing: Large-scale datasets for performance evaluation

Contributing to the Synthetic Data System

Researchers and developers can contribute to the synthetic data system:

Adding New Generators:

class NewMethodGenerator(DataGenerator):
    """Template for adding new method generators"""

    def __init__(self, config, method_specific_params):
        super().__init__(config)
        self.method_specific_params = method_specific_params
        self.method = "new_method"

    def generate_data(self):
        """Implement method-specific data generation logic"""
        # 1. Generate covariates using base class
        X = self.generate_covariates()

        # 2. Generate treatment using method-specific logic
        treatment = self._generate_treatment(X)

        # 3. Generate outcome with known causal effect
        outcome = self._generate_outcome(treatment, X)

        # 4. Create DataFrame and return
        data = self._create_dataframe(X, treatment, outcome)
        self.data = data
        return data

    def test_data(self, print_=False):
        """Implement validation using appropriate statistical method"""
        # Test that true effect can be recovered
        pass

Testing New Scenarios:

def test_new_scenario():
    """Template for testing new scenarios"""
    # 1. Generate data with specific characteristics
    data = generate_scenario_data()
    query = "..."        # placeholder: natural-language question for this scenario
    description = "..."  # placeholder: dataset description for this scenario

    # 2. Define expected agent behavior
    expected_method = "expected_method_name"
    expected_warnings = ["assumption_violation", "low_power"]

    # 3. Test with agent
    result = run_causal_analysis(
        query=query,
        dataset_path=data,
        dataset_description=description
    )

    # 4. Validate results
    assert result['method_used'] == expected_method
    assert all(warning in result['warnings'] for warning in expected_warnings)

Documentation Standards:
  • Document all generation parameters and their effects

  • Provide clear examples of when to use each generator

  • Include validation procedures for new methods

  • Explain integration with decision tree logic

Conclusion

The synthetic data generation system is a cornerstone of the CAIS testing and validation framework. It enables:

  • Comprehensive Testing: Systematic evaluation of agent decision-making across diverse scenarios

  • Method Validation: Rigorous testing of causal inference methods with known ground truth

  • Decision Tree Validation: Verification that the agent selects appropriate methods for different data characteristics

  • Robustness Assessment: Testing agent behavior under assumption violations and edge cases

  • Performance Benchmarking: Systematic evaluation of computational performance and statistical accuracy

The system’s integration with LLM-based context generation creates realistic testing scenarios that closely mirror real-world causal inference challenges, ensuring that CAIS performs reliably across diverse applications and domains.

For researchers and practitioners using CAIS, the synthetic data system provides confidence in the agent’s capabilities and helps identify appropriate use cases and limitations. For developers contributing to CAIS, it provides a comprehensive testing framework that ensures new features and methods integrate properly with the existing decision tree logic and maintain high standards of statistical accuracy and reliability.

            # ...ut_dir / filename
            generator.save_data(str(filepath))

            datasets.append({
                'filepath': str(filepath),
                'config': config,
                'true_parameters': generator.get_true_parameters()
            })

        return datasets

    def generate_comprehensive_suite(self):
        """Generate comprehensive test suite for all methods"""
        methods = [
            'rct',
            'difference_in_differences',
            'instrumental_variable',
            'regression_discontinuity',
            'propensity_score_matching'
        ]

        all_datasets = {}

        for method in methods:
            print(f"Generating datasets for {method}...")
            datasets = self.generate_method_suite(method)
            all_datasets[method] = datasets

        # Save master index
        self._save_dataset_index(all_datasets)

        return all_datasets

    def _get_generator_class(self, method_name: str):
        """Get generator class for method"""
        generators = {
            'rct': RCTDataGenerator,
            'difference_in_differences': DifferenceInDifferencesGenerator,
            'instrumental_variable': InstrumentalVariableGenerator,
            'regression_discontinuity': RegressionDiscontinuityGenerator,
            'propensity_score_matching': PropensityScoreGenerator
        }
        return generators[method_name]

    def _vary_config(self, base_config: DataGenerationConfig, seed: int):
        """Create varied configuration for diversity"""
        config = DataGenerationConfig(
            n_observations=base_config.n_observations + np.random.randint(-200, 200),
            n_continuous_covars=max(2, base_config.n_continuous_covars + np.random.randint(-1, 2)),
            n_binary_covars=max(1, base_config.n_binary_covars + np.random.randint(-1, 2)),
            true_effect=base_config.true_effect + np.random.normal(0, 0.2),
            noise_level=max(0.1, base_config.noise_level + np.random.normal(0, 0.1)),
            seed=base_config.seed + seed,
            heterogeneity=np.random.choice([True, False])
        )
        return config

    def _save_dataset_index(self, all_datasets: Dict):
        """Save index of all generated datasets"""
        index_path = self.output_dir / "dataset_index.json"

        # Convert to serializable format
        serializable_index = {}
        for method, datasets in all_datasets.items():
            serializable_index[method] = []
            for dataset in datasets:
                serializable_index[method].append({
                    'filepath': dataset['filepath'],
                    'config': dataset['config'].__dict__,
                    'true_parameters': dataset['true_parameters']
                })

        import json
        with open(index_path, 'w') as f:
            json.dump(serializable_index, f, indent=2)
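
The `_vary_config` method draws from NumPy's global random state, so the varied configurations repeat across runs only if that state is seeded elsewhere. A possible alternative, sketched here with the standard library, derives every variation from a local generator seeded with the base seed plus an offset. `ToyConfig` is a hypothetical stand-in for `DataGenerationConfig` with a reduced set of fields.

```python
import random
from dataclasses import dataclass, replace


@dataclass
class ToyConfig:
    """Hypothetical, reduced stand-in for DataGenerationConfig."""
    n_observations: int = 1000
    true_effect: float = 1.0
    noise_level: float = 1.0
    seed: int = 42


def vary_config(base, offset):
    """Return a varied copy of base that depends only on base.seed + offset."""
    rng = random.Random(base.seed + offset)  # local RNG, no global state
    return replace(
        base,
        # randint is inclusive on both ends, so this matches the range [-200, 200)
        n_observations=base.n_observations + rng.randint(-200, 199),
        true_effect=base.true_effect + rng.gauss(0, 0.2),
        noise_level=max(0.1, base.noise_level + rng.gauss(0, 0.1)),
        seed=base.seed + offset,
    )


base = ToyConfig()
first = vary_config(base, 1)
again = vary_config(base, 1)
print(first == again)
```

Because each variation is a pure function of the base seed and the offset, regenerating a single dataset does not require replaying the whole batch.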

Data Validation

Validate generated synthetic data:

from typing import Any, Dict

import numpy as np
import pandas as pd

class SyntheticDataValidator:
    """Validate synthetic data quality and properties"""

    def __init__(self):
        self.validation_results = {}

    def validate_dataset(
        self,
        data: pd.DataFrame,
        true_parameters: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Comprehensive validation of synthetic dataset"""

        results = {
            'basic_properties': self._validate_basic_properties(data),
            'statistical_properties': self._validate_statistical_properties(data),
            'causal_structure': self._validate_causal_structure(data, true_parameters),
            'method_specific': self._validate_method_specific(data, true_parameters)
        }

        results['overall_quality'] = self._assess_overall_quality(results)

        return results

    def _validate_basic_properties(self, data: pd.DataFrame) -> Dict[str, Any]:
        """Validate basic data properties"""
        return {
            'shape': data.shape,
            'missing_values': data.isnull().sum().to_dict(),
            'data_types': data.dtypes.to_dict(),
            'duplicates': data.duplicated().sum(),
            'treatment_balance': data['treatment'].value_counts().to_dict() if 'treatment' in data.columns else None
        }

    def _validate_statistical_properties(self, data: pd.DataFrame) -> Dict[str, Any]:
        """Validate statistical properties"""
        numeric_cols = data.select_dtypes(include=[np.number]).columns

        return {
            'means': data[numeric_cols].mean().to_dict(),
            'std_devs': data[numeric_cols].std().to_dict(),
            'correlations': data[numeric_cols].corr().to_dict(),
            'outliers': self._detect_outliers(data[numeric_cols])
        }

    def _validate_causal_structure(
        self,
        data: pd.DataFrame,
        true_parameters: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Validate causal structure matches intended design"""

        # Estimate the effect with a naive difference in means, which is
        # unbiased only when treatment assignment is random
        if 'treatment' in data.columns and 'outcome' in data.columns:
            treated = data[data['treatment'] == 1]['outcome']
            control = data[data['treatment'] == 0]['outcome']

            estimated_effect = treated.mean() - control.mean()
            true_effect = true_parameters.get('true_effect', 0)

            return {
                'estimated_effect': estimated_effect,
                'true_effect': true_effect,
                'effect_bias': abs(estimated_effect - true_effect),
                'effect_recovery_ratio': estimated_effect / true_effect if true_effect != 0 else None
            }

        return {}

    def _validate_method_specific(
        self,
        data: pd.DataFrame,
        true_parameters: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Method-specific validation"""
        method = true_parameters.get('method', '')

        if method == 'instrumental_variable':
            return self._validate_iv_properties(data, true_parameters)
        elif method == 'regression_discontinuity':
            return self._validate_rdd_properties(data, true_parameters)
        elif method == 'difference_in_differences':
            return self._validate_did_properties(data, true_parameters)

        return {}

    def _detect_outliers(self, data: pd.DataFrame) -> Dict[str, int]:
        """Detect outliers using IQR method"""
        outliers = {}

        for col in data.columns:
            Q1 = data[col].quantile(0.25)
            Q3 = data[col].quantile(0.75)
            IQR = Q3 - Q1

            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            outliers[col] = ((data[col] < lower_bound) | (data[col] > upper_bound)).sum()

        return outliers

    def _assess_overall_quality(self, results: Dict[str, Any]) -> str:
        """Assess overall data quality"""
        issues = []

        # Check for basic issues
        if results['basic_properties']['duplicates'] > 0:
            issues.append("duplicates")

        if any(v > 0 for v in results['basic_properties']['missing_values'].values()):
            issues.append("missing_values")

        # Check causal structure
        if 'effect_bias' in results['causal_structure']:
            if results['causal_structure']['effect_bias'] > 0.5:
                issues.append("high_effect_bias")

        if len(issues) == 0:
            return "excellent"
        elif len(issues) <= 2:
            return "good"
        else:
            return "needs_improvement"
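
For readers unfamiliar with the IQR rule used in `_detect_outliers`, the following standard-library sketch applies the same 1.5 × IQR fences to a short list of values. `count_iqr_outliers` is a hypothetical helper; `method='inclusive'` is chosen because it interpolates linearly between order statistics, like the pandas default for `quantile`.

```python
import statistics


def count_iqr_outliers(values):
    """Count values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(1 for v in values if v < lower or v > upper)


values = [1, 2, 3, 4, 5, 100]
# Q1 = 2.25, Q3 = 4.75, IQR = 2.5, so the fences are -1.5 and 8.5
n_outliers = count_iqr_outliers(values)
print(n_outliers)  # 1 (only the value 100 lies outside the fences)
```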

Testing Integration

Using Synthetic Data in Tests

Integrate synthetic data generation with the testing framework:

# tests/fixtures/synthetic_data.py

import pytest
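from causal_agent.synthetic.generator import (
    DataGenerationConfig,
    DifferenceInDifferencesGenerator,
    InstrumentalVariableGenerator,
    RCTDataGenerator,
)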

@pytest.fixture
def rct_data():
    """Generate RCT data for testing"""
    config = DataGenerationConfig(n_observations=500, true_effect=1.5)
    generator = RCTDataGenerator(config)
    return generator.generate_data(), generator.get_true_parameters()

@pytest.fixture
def did_data():
    """Generate DiD data for testing"""
    config = DataGenerationConfig(n_observations=1000, true_effect=2.0)
    generator = DifferenceInDifferencesGenerator(config, n_periods=4, n_units=50)
    return generator.generate_data(), generator.get_true_parameters()

@pytest.fixture
def iv_data():
    """Generate IV data for testing"""
    config = DataGenerationConfig(n_observations=800, true_effect=1.2)
    generator = InstrumentalVariableGenerator(config, instrument_strength=0.6)
    return generator.generate_data(), generator.get_true_parameters()

# Example test using synthetic data
def test_method_with_synthetic_data(rct_data):
    """Test causal method with synthetic RCT data"""
    data, true_params = rct_data

    # Run method
    from causal_agent.methods.experimental.diff_in_means.estimator import estimate_diff_in_means

    variables = Variables(
        treatment_variable='treatment',
        outcome_variable='outcome',
        covariates=[col for col in data.columns if col.startswith('X')],
        is_rct=True
    )

    results = estimate_diff_in_means(data, variables)

    # Validate against true parameters
    true_effect = true_params['true_effect']
    estimated_effect = results['effect_estimate']

    # Allow for sampling variation
    assert abs(estimated_effect - true_effect) < 0.5
    assert results['p_value'] < 0.05  # Should be significant

Continuous Integration

Integrate synthetic data testing into CI/CD:

# .github/workflows/synthetic_data_tests.yml

name: Synthetic Data Tests

on: [push, pull_request]

jobs:
  synthetic-data-tests:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2

    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.9'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install -e .

    - name: Generate synthetic test data
      run: |
        python -c "
        from causal_agent.synthetic.generator import BatchDataGenerator
        generator = BatchDataGenerator('test_synthetic_data')
        generator.generate_comprehensive_suite()
        "

    - name: Run synthetic data validation
      run: |
        pytest tests/synthetic/ -v --cov=causal_agent.synthetic

    - name: Run method tests with synthetic data
      run: |
        pytest tests/unit/methods/ -v -k "synthetic"

Best Practices

Data Generation Guidelines

  • Realistic Parameters: Use parameter values that reflect real-world scenarios

  • Known Ground Truth: Always maintain known causal relationships for validation

  • Diverse Scenarios: Generate data covering various conditions and edge cases

  • Reproducibility: Use fixed seeds for reproducible test datasets

  • Documentation: Clearly document the causal structure and assumptions

Validation Standards

  • Effect Recovery: Validate that methods recover true effects within reasonable bounds

  • Assumption Testing: Generate data that both satisfies and violates method assumptions

  • Statistical Properties: Ensure generated data has realistic statistical properties

  • Edge Case Coverage: Test with small samples, outliers, and missing data

  • Performance Benchmarking: Use large datasets to test scalability

Testing Integration

  • Automated Generation: Integrate data generation into CI/CD pipelines

  • Comprehensive Coverage: Test all methods with appropriate synthetic data

  • Performance Monitoring: Track method performance across different data scenarios

  • Regression Testing: Use synthetic data to detect performance regressions

  • Documentation Examples: Use synthetic data for clear, reproducible examples

The synthetic data generation system provides a robust foundation for testing, validating, and benchmarking causal inference methods in CAIS, ensuring reliability and accuracy across diverse real-world scenarios.