Basic Usage

This guide covers the fundamental workflows for conducting causal analysis with CAIS. Whether you’re new to causal inference or experienced with other tools, this section will help you get started with common analysis patterns.

Core Workflow

CAIS follows a structured workflow that guides you through the causal analysis process:

  1. Input Parsing: Understanding your research question and dataset

  2. Dataset Analysis: Examining data structure and variable types

  3. Query Interpretation: Identifying treatment, outcome, and control variables

  4. Method Selection: Choosing the appropriate causal inference method

  5. Method Validation: Checking assumptions and prerequisites

  6. Method Execution: Running the analysis with diagnostics

  7. Result Interpretation: Generating explanations and insights

  8. Output Formatting: Presenting results in a structured format

Python API Usage

Single Analysis

The most common use case is analyzing a single dataset with a specific causal question:

from causal_agent import run_causal_analysis

# Basic analysis
result = run_causal_analysis(
    query="What is the effect of job training on earnings?",
    dataset_path="data/lalonde_data.csv",
    dataset_description="LaLonde job training experiment data"
)

# Access key results
effect_estimate = result['results']['results']['effect_estimate']
method_used = result['results']['results']['method_used']
treatment_var = result['results']['variables']['treatment_variable']
outcome_var = result['results']['variables']['outcome_variable']

print(f"Method: {method_used}")
print(f"Effect of {treatment_var} on {outcome_var}: {effect_estimate}")

Understanding Results

CAIS returns a structured dictionary containing the analysis results:

# Example result structure
{
    'results': {
        'results': {
            'effect_estimate': 1794.34,
            'standard_error': 632.85,
            'confidence_interval': [554.95, 3033.73],
            'p_value': 0.0045,
            'method_used': 'propensity_score_matching'
        },
        'variables': {
            'treatment_variable': 'treat',
            'outcome_variable': 're78',
            'covariates': ['age', 'education', 'black', 'hispanic', 'married']
        },
        'diagnostics': {
            'balance_statistics': {...},
            'assumption_checks': {...}
        },
        'explanation': "The analysis found a significant positive effect..."
    }
}
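A small helper can flatten this nested dictionary into a readable one-line summary. The sketch below is illustrative only: `summarize_result` is a hypothetical name, not part of CAIS, and it assumes the result has exactly the shape shown above, including a two-element `confidence_interval`.

```python
def summarize_result(result: dict) -> str:
    """Build a one-line summary from a result dictionary shaped like the example."""
    inner = result["results"]
    stats = inner["results"]
    variables = inner["variables"]
    low, high = stats["confidence_interval"]
    return (
        f"{stats['method_used']}: effect of {variables['treatment_variable']} "
        f"on {variables['outcome_variable']} = {stats['effect_estimate']:.2f} "
        f"(CI {low:.2f} to {high:.2f}, p = {stats['p_value']:.4f})"
    )


# Values copied from the example result structure.
example = {
    "results": {
        "results": {
            "effect_estimate": 1794.34,
            "standard_error": 632.85,
            "confidence_interval": [554.95, 3033.73],
            "p_value": 0.0045,
            "method_used": "propensity_score_matching",
        },
        "variables": {
            "treatment_variable": "treat",
            "outcome_variable": "re78",
        },
    }
}

summary = summarize_result(example)
print(summary)
```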

Command Line Interface

For quick analyses or integration into scripts, use the CLI:

Single Analysis

# Basic command
causal_agent run data/lalonde_data.csv "What is the effect of job training on earnings?"

# With dataset description
causal_agent run data/lalonde_data.csv \
    "What is the effect of job training on earnings?" \
    --desc "LaLonde job training experiment with treatment and control groups"

# Specify LLM provider and model
causal_agent run data/lalonde_data.csv \
    "What is the effect of job training on earnings?" \
    --llm-provider anthropic \
    --llm-name claude-3-5-sonnet-latest
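When calling the CLI from a Python script, building the argument list programmatically avoids shell-quoting problems with free-text queries. The following is a sketch, assuming the `causal_agent` executable is on your `PATH`; `build_command` is a hypothetical helper, and it uses only the flags shown above.

```python
import shlex


def build_command(dataset_path, query, desc=None, llm_provider=None, llm_name=None):
    """Return the CLI invocation as an argument list suitable for subprocess.run."""
    command = ["causal_agent", "run", dataset_path, query]
    if desc:
        command += ["--desc", desc]
    if llm_provider:
        command += ["--llm-provider", llm_provider]
    if llm_name:
        command += ["--llm-name", llm_name]
    return command


command = build_command(
    "data/lalonde_data.csv",
    "What is the effect of job training on earnings?",
    llm_provider="anthropic",
    llm_name="claude-3-5-sonnet-latest",
)

# shlex.join shows the shell-safe form; the list itself can be passed to
# subprocess.run(command, check=True) without involving a shell.
print(shlex.join(command))
```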

Common Analysis Patterns

Experimental Data (RCT)

When you have randomized controlled trial data:

result = run_causal_analysis(
    query="What is the treatment effect in this randomized experiment?",
    dataset_path="data/rct_data.csv",
    dataset_description="Randomized controlled trial with treatment and control groups"
)

Observational Data

For observational studies where you need to control for confounders:

result = run_causal_analysis(
    query="What is the effect of education on income, controlling for background factors?",
    dataset_path="data/observational_data.csv",
    dataset_description="Survey data with education, income, and demographic variables"
)

Time Series / Panel Data

For difference-in-differences or other temporal analyses:

result = run_causal_analysis(
    query="What was the effect of the policy change on outcomes over time?",
    dataset_path="data/panel_data.csv",
    dataset_description="Panel data with pre/post policy implementation periods"
)

Instrumental Variables

When you have an instrument for causal identification:

result = run_causal_analysis(
    query="What is the effect of education on wages using distance to college as an instrument?",
    dataset_path="data/iv_data.csv",
    dataset_description="Data with education, wages, and distance to college as instrument"
)

Regression Discontinuity

For sharp cutoff designs:

result = run_causal_analysis(
    query="What is the effect of the scholarship program on test scores?",
    dataset_path="data/rdd_data.csv",
    dataset_description="Student data with test scores and scholarship eligibility cutoff"
)

Working with Results

Extracting Key Information

# Get the main causal effect estimate
effect = result['results']['results']['effect_estimate']
se = result['results']['results']['standard_error']
ci = result['results']['results'].get('confidence_interval', None)

# Check statistical significance
p_value = result['results']['results']['p_value']
is_significant = p_value < 0.05

# Get variable information
variables = result['results']['variables']
treatment = variables['treatment_variable']
outcome = variables['outcome_variable']
covariates = variables.get('covariates', [])
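Because `confidence_interval` is read with `.get()` above, it may be absent. One possible fallback is a normal-approximation interval computed from the estimate and its standard error. This is a sketch under that assumption: `normal_ci` is a hypothetical helper, and the interval it produces may differ slightly from one CAIS reports, since the method actually used may rely on a different reference distribution.

```python
from statistics import NormalDist


def normal_ci(effect, se, level=0.95):
    """Two-sided normal-approximation confidence interval: effect +/- z * se."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (effect - z * se, effect + z * se)


# Values taken from the example result; no confidence_interval key present.
stats = {"effect_estimate": 1794.34, "standard_error": 632.85}

ci = stats.get("confidence_interval")
if ci is None:
    # Fall back to the approximation only when no interval was returned.
    ci = normal_ci(stats["effect_estimate"], stats["standard_error"])

print(f"Interval: {ci[0]:.2f} to {ci[1]:.2f}")
```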

Interpreting Diagnostics

# Access diagnostic information
diagnostics = result['results']['diagnostics']

# For propensity score methods
if 'balance_statistics' in diagnostics:
    balance = diagnostics['balance_statistics']
    print("Covariate balance after matching:")
    for var, stats in balance.items():
        print(f"  {var}: standardized difference = {stats['std_diff']:.3f}")

# For IV methods
if 'first_stage_f_stat' in diagnostics:
    f_stat = diagnostics['first_stage_f_stat']
    print(f"First stage F-statistic: {f_stat:.2f}")
    if f_stat < 10:
        print("Warning: Weak instrument (F < 10)")
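Balance statistics are easier to act on when imbalanced covariates are flagged automatically. The sketch below assumes each entry has a `std_diff` key, as in the loop above, and uses an absolute standardized difference of 0.1 as the cutoff, which is a common rule of thumb rather than a CAIS setting. `flag_imbalanced` is a hypothetical helper and the balance values are made up for illustration.

```python
def flag_imbalanced(balance, threshold=0.1):
    """Return covariates whose absolute standardized difference exceeds threshold."""
    return sorted(
        var for var, stats in balance.items()
        if abs(stats["std_diff"]) > threshold
    )


# Made-up values for illustration only.
example_balance = {
    "age": {"std_diff": 0.04},
    "education": {"std_diff": -0.15},
    "married": {"std_diff": 0.22},
}

flagged = flag_imbalanced(example_balance)
if flagged:
    print("Covariates still imbalanced after matching:", ", ".join(flagged))
```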

Error Handling

CAIS provides informative error messages when issues occur:

result = run_causal_analysis(
    query="What is the effect of X on Y?",
    dataset_path="data/problematic_data.csv"
)

# Check for errors
if 'error' in result:
    print(f"Analysis failed: {result['error']}")
else:
    # Process successful results
    effect = result['results']['results']['effect_estimate']
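A failed or partial analysis may also lack some of the nested keys, in which case chained indexing raises `KeyError`. A defensive accessor is one way to handle both cases. This is a sketch: `get_effect` is a hypothetical helper, and it assumes errors are reported through a top-level `error` key as shown above.

```python
def get_effect(result, default=None):
    """Return the effect estimate, or default if the analysis failed or keys are missing."""
    if "error" in result:
        return default
    node = result
    # Walk result['results']['results']['effect_estimate'] one level at a time.
    for key in ("results", "results", "effect_estimate"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


failed = {"error": "Could not identify treatment variable"}
succeeded = {"results": {"results": {"effect_estimate": 1794.34}}}

print(get_effect(failed))
print(get_effect(succeeded))
```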

Common Issues and Solutions

Missing Variables

If CAIS can’t identify treatment or outcome variables, be more specific in your query:

# Instead of: "What causes what?"
# Use: "What is the effect of education on income?"

Data Format Issues

Ensure your CSV has proper headers and numeric variables are correctly formatted:

import pandas as pd
df = pd.read_csv("data.csv")
print(df.dtypes)  # Check variable types
print(df.head())  # Check data format

Method Selection Issues

If the automatically selected method seems inappropriate for your data, review the explanation in the result, which indicates why methods were chosen or rejected.

Best Practices

Data Preparation

  1. Clean Variable Names: Use descriptive, consistent variable names

  2. Handle Missing Data: Address missing values before analysis

  3. Check Data Types: Ensure treatment variables are properly coded (0/1 for binary)

  4. Document Your Data: Provide clear dataset descriptions
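Some of these checks can be automated before running an analysis. The following is a dependency-free sketch using the standard library `csv` module; `preflight_check` is a hypothetical helper, not part of CAIS, and the sample rows are made up. It reports empty cells and treatment values that are not coded 0 or 1.

```python
import csv
import io


def preflight_check(csv_file, treatment_column):
    """Return a list of human-readable problems found in an open CSV file."""
    reader = csv.DictReader(csv_file)
    if treatment_column not in (reader.fieldnames or []):
        return [f"missing treatment column: {treatment_column}"]

    missing_cells = 0
    bad_treatment_rows = 0
    for row in reader:
        # Short rows yield None; empty cells yield an empty string.
        missing_cells += sum(
            1 for value in row.values()
            if value is None or (isinstance(value, str) and not value.strip())
        )
        # A binary treatment is expected to be coded exactly 0 or 1.
        if (row[treatment_column] or "").strip() not in {"0", "1"}:
            bad_treatment_rows += 1

    problems = []
    if missing_cells:
        problems.append(f"{missing_cells} missing cell(s)")
    if bad_treatment_rows:
        problems.append(f"{bad_treatment_rows} row(s) where {treatment_column} is not 0/1")
    return problems


# Made-up rows: one empty outcome cell and one miscoded treatment value.
sample = io.StringIO("treat,age,re78\n1,25,9930.05\n0,31,\nyes,40,1200.00\n")
problems = preflight_check(sample, "treat")
for problem in problems:
    print(problem)
```

For a file on disk, pass `open("data.csv", newline="")` in place of the `StringIO` object.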

Query Formulation

  1. Be Specific: Clearly state treatment and outcome variables

  2. Use Causal Language: Frame questions in terms of effects and causation

  3. Provide Context: Include relevant background information in dataset descriptions

Result Interpretation

  1. Check Assumptions: Review diagnostic tests and assumption checks

  2. Consider Effect Size: Look beyond statistical significance to practical significance

  3. Validate Results: Compare with domain knowledge and alternative methods

  4. Document Decisions: Keep track of analysis choices for reproducibility
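One lightweight way to document decisions is to store a short record of each run alongside your analysis. The sketch below builds such a record as a JSON line; `build_analysis_record` and the record fields are hypothetical, and the nested keys assume the result structure described earlier in this guide.

```python
import json
from datetime import datetime, timezone


def build_analysis_record(query, dataset_path, result):
    """Collect the inputs and headline outputs of one analysis run."""
    stats = result.get("results", {}).get("results", {})
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "dataset_path": dataset_path,
        "method_used": stats.get("method_used"),
        "effect_estimate": stats.get("effect_estimate"),
    }


# Minimal stand-in for a result dictionary returned by an analysis.
result = {
    "results": {
        "results": {
            "effect_estimate": 1794.34,
            "method_used": "propensity_score_matching",
        }
    }
}

record = build_analysis_record(
    "What is the effect of job training on earnings?",
    "data/lalonde_data.csv",
    result,
)
line = json.dumps(record, sort_keys=True)
print(line)
```

Appending each `line` to a log file (one JSON object per line) gives a simple, diff-friendly history of analysis choices.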

Next Steps