Your First Causal Analysis

This guide provides a comprehensive walkthrough of conducting your first causal analysis with CAIS, helping you understand both the technical steps and the causal inference concepts behind them.

Understanding Causal Questions 

Before diving into the analysis, it’s important to understand what makes a good causal question and how CAIS approaches causal inference.

What is Causal Inference?

Causal inference goes beyond correlation to answer questions like:

Does X cause Y? (e.g., “Does education cause higher income?”)
What would happen if…? (e.g., “What would happen to crime rates if we increased police presence?”)
How much of an effect? (e.g., “By how much does job training increase earnings?”)

Key Concept: Correlation ≠ Causation

# Example: Ice cream sales and drowning deaths are correlated
# But ice cream doesn't cause drowning - both increase in summer!

# CAIS helps identify true causal relationships by:
# 1. Selecting appropriate methods
# 2. Controlling for confounding variables
# 3. Testing assumptions
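
The ice cream example can be reproduced with simulated data. The sketch below is illustrative only: the numbers are made up, and `residualize` is a hypothetical helper, not part of CAIS. Temperature drives both series, so they correlate strongly until temperature is adjusted for.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 5_000

# Made-up data: temperature drives both quantities; neither affects the other.
temperature = rng.normal(loc=20.0, scale=8.0, size=n)
ice_cream_sales = 50.0 + 3.0 * temperature + rng.normal(scale=10.0, size=n)
drownings = 2.0 + 0.1 * temperature + rng.normal(scale=0.5, size=n)


def residualize(y, x):
    """Return the part of y not explained by a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]
adjusted_corr = np.corrcoef(
    residualize(ice_cream_sales, temperature),
    residualize(drownings, temperature),
)[0, 1]

print(f"Raw correlation: {raw_corr:.2f}")
print(f"Correlation after adjusting for temperature: {adjusted_corr:.2f}")
```

The raw correlation is large, while the adjusted one is close to zero, which is what "controlling for a confounder" means in the simplest linear case.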

Good vs. Poor Causal Questions 

Good Causal Questions:

✓ "Does participating in job training increase income?"
✓ "What is the effect of class size on student performance?"
✓ "How does a new drug affect patient recovery time?"

Poor Causal Questions:

✗ "What factors are associated with income?" (descriptive, not causal)
✗ "Is there a relationship between X and Y?" (correlation question)
✗ "What predicts outcome Z?" (prediction, not causation)

Step-by-Step Analysis Walkthrough 

Let’s work through a complete analysis using a worked example on a sample dataset: the effect of education on earnings.

Step 1: Problem Setup 

import pandas as pd
from causal_agent import run_causal_analysis
import os

# Set up your environment
os.environ['OPENAI_API_KEY'] = 'your-api-key-here'

# Define our causal question
causal_question = "What is the effect of college education on annual income?"

print(f"Research Question: {causal_question}")

Step 2: Data Preparation 

# Load sample education-income dataset
# (In practice, you'd load your own data)
data = pd.read_csv('data/all_data/education_income.csv')

# Examine the data structure
print("Dataset Overview:")
print(f"Shape: {data.shape}")
print(f"Columns: {list(data.columns)}")
print("\nFirst few rows:")
print(data.head())

# Check for missing values
print("\nMissing values:")
print(data.isnull().sum())

Expected Output:

Dataset Overview:
Shape: (2500, 12)
Columns: ['person_id', 'college_degree', 'annual_income', 'age', 'gender',
         'work_experience', 'industry', 'region', 'family_background',
         'cognitive_ability', 'motivation', 'health_status']
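
Before running any causal method, it can help to compute the naive difference in mean income between the two groups as a baseline. The sketch below uses a tiny made-up sample with the column names from the overview, and assumes `college_degree` is coded 0/1; `naive_difference` is a hypothetical helper. This number is not a causal estimate, because it ignores confounders such as `family_background`.

```python
import pandas as pd

# Tiny made-up sample using the column names from the dataset overview.
sample = pd.DataFrame({
    "college_degree": [1, 1, 1, 0, 0, 0],
    "annual_income": [72000, 65000, 80000, 48000, 52000, 41000],
})


def naive_difference(df, treatment="college_degree", outcome="annual_income"):
    """Mean outcome of the treated minus mean outcome of the untreated.

    Assumes the treatment column is coded as 1 (treated) and 0 (untreated).
    """
    group_means = df.groupby(treatment)[outcome].mean()
    return group_means.loc[1] - group_means.loc[0]


print(f"Naive difference in means: ${naive_difference(sample):,.2f}")
```

Comparing this baseline with the adjusted estimate later on gives a rough sense of how much confounding the chosen method removed.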

Step 3: Running the Analysis 

# Provide a detailed dataset description
dataset_description = """
This dataset contains information about working adults and their educational and
economic outcomes. Each row represents one individual with the following key variables:

- college_degree: Whether the person has a college degree (treatment variable)
- annual_income: Person's annual income in dollars (outcome variable)
- age, gender, work_experience: Demographic controls
- industry, region: Economic context variables
- family_background: Socioeconomic background (potential confounder)
- cognitive_ability, motivation: Individual characteristics (potential confounders)
- health_status: Health-related factors

The data comes from a longitudinal survey of working adults aged 25-65.
"""

# Run the causal analysis
print("Running causal analysis...")
result = run_causal_analysis(
    query=causal_question,
    dataset_path='data/all_data/education_income.csv',  # same file loaded in Step 2
    dataset_description=dataset_description
)

print("Analysis completed!")
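
The next step reads values from nested dictionary keys such as `result['results']['results']['effect_estimate']`. If a key is missing, plain indexing raises `KeyError`. A small helper can make the lookups more forgiving. This is a minimal sketch: `get_nested` is a hypothetical helper, and `example_result` only imitates the structure used in this guide.

```python
def get_nested(mapping, *keys, default=None):
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Stand-in for a real result; the layout mirrors the keys used in this guide.
example_result = {"results": {"results": {"effect_estimate": 18450.0}}}

effect = get_nested(example_result, "results", "results", "effect_estimate")
p_value = get_nested(example_result, "results", "results", "p_value")
print(effect, p_value)
```

A missing field then shows up as `None` (or whatever default you pass) rather than stopping the script.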

Step 4: Understanding the Results 

# Extract key results
method_used = result['results']['results']['method_used']
treatment_var = result['results']['variables']['treatment_variable']
outcome_var = result['results']['variables']['outcome_variable']
effect_estimate = result['results']['results']['effect_estimate']
std_error = result['results']['results']['standard_error']
p_value = result['results']['results']['p_value']
confidence_interval = result['results']['results'].get('confidence_interval', 'Not available')


print("=== ANALYSIS RESULTS ===")
print(f"Method Selected: {method_used}")
print(f"Treatment Variable: {treatment_var}")
print(f"Outcome Variable: {outcome_var}")
print(f"Causal Effect Estimate: ${effect_estimate:,.2f}")
print(f"Standard Error: ${std_error:,.2f}")
print(f"P-value: {p_value:.4f}")
print(f"95% Confidence Interval: {confidence_interval}")

# Statistical significance
is_significant = p_value < 0.05
print(f"Statistically Significant: {'Yes' if is_significant else 'No'}")

Sample Output:

=== ANALYSIS RESULTS ===
Method Selected: Propensity Score Matching
Treatment Variable: college_degree
Outcome Variable: annual_income
Causal Effect Estimate: $18,450.00
Standard Error: $2,340.00
P-value: 0.0001
95% Confidence Interval: [$13,864, $23,036]
Statistically Significant: Yes
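
The interval in the sample output is consistent with a normal approximation: estimate ± z × standard error. The sketch below reproduces it from the estimate and standard error shown above. Whether CAIS computes its intervals this way is an assumption; `normal_ci` is a hypothetical helper.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, level=0.95):
    """Two-sided confidence interval under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    return estimate - z * std_error, estimate + z * std_error


lower, upper = normal_ci(18450.00, 2340.00)
print(f"95% CI: [${lower:,.0f}, ${upper:,.0f}]")  # prints: 95% CI: [$13,864, $23,036]
```

Recomputing the interval this way is a quick sanity check when a result does not include a `confidence_interval` field.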

Step 5: Interpreting the Results 

# Get the AI interpretation
interpretation = result['explanation']['final_explanation_text']
print("\n=== AI INTERPRETATION ===")
print(interpretation)

# Manual interpretation guide
print("\n=== INTERPRETATION GUIDE ===")

if is_significant:
    print("✓ Under the assumptions of the selected method, the analysis suggests that")
    print(f"  having a college degree increases annual income by approximately ${effect_estimate:,.0f}.")
    print(f"✓ This effect is statistically significant (p = {p_value:.4f} < 0.05).")
    print(f"✓ 95% confidence interval for the effect: {confidence_interval}.")
else:
    print(f"✗ No statistically significant causal effect was found.")
    print(f"✗ The estimated effect of ${effect_estimate:,.0f} could be due to chance.")

print(f"\n📊 Method Used: {method_used}")
print("   This method was selected based on the characteristics of your data.")

Step 6: Examining Method Selection 

Understanding why CAIS selected a particular method helps you trust and validate the results:

# Check method selection reasoning (if available)
method_reasoning = result['results'].get('method_reasoning', 'Not provided')
print(f"\n=== WHY THIS METHOD? ===")
print(f"Selected Method: {method_used}")
print(f"Reasoning: {method_reasoning}")

# Common methods and when they're used:
method_guide = {
    'Randomized Controlled Trial': 'Data appears to be from a randomized experiment',
    'Propensity Score Matching': 'Observational data with good covariate balance possible',
    'Difference-in-Differences': 'Panel data with treatment timing variation',
    'Instrumental Variables': 'Strong instrumental variable detected',
    'Regression Discontinuity': 'Sharp cutoff/threshold detected in treatment assignment',
    'Linear Regression': 'Simple observational data with controls'
}

if method_used in method_guide:
    print(f"Typical use case: {method_guide[method_used]}")

Step 7: Validating Results 

Always validate your causal analysis results:

# Check for potential issues
print("\n=== VALIDATION CHECKLIST ===")

# 1. Effect size reasonableness
print(f"1. Effect Size: ${effect_estimate:,.0f}")
if abs(effect_estimate) > 100000:
    print("   ⚠️  Large effect - double-check data and units")
else:
    print("   ✓ Reasonable effect size")

# 2. Statistical significance
print(f"2. P-value: {p_value:.4f}")
if p_value < 0.05:
    print("   ✓ Statistically significant")
else:
    print("   ⚠️  Not statistically significant")

# 3. Sample size
sample_size = len(data)
print(f"3. Sample Size: {sample_size:,}")
if sample_size < 100:
    print("   ⚠️  Small sample size - results may be unreliable")
else:
    print("   ✓ Adequate sample size")

# 4. Missing data
missing_pct = (data.isnull().sum().sum() / (len(data) * len(data.columns))) * 100
print(f"4. Missing Data: {missing_pct:.1f}%")
if missing_pct > 10:
    print("   ⚠️  High missing data percentage")
else:
    print("   ✓ Low missing data")
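
Total sample size can hide a very small treated or untreated group. As an additional check, the sketch below counts observations per treatment group. The helper name and the threshold of 30 are assumptions for illustration, not CAIS defaults.

```python
import pandas as pd


def find_small_groups(df, treatment="college_degree", min_per_group=30):
    """Return {group: count} for treatment groups below `min_per_group`."""
    counts = df[treatment].value_counts()
    return {group: int(n) for group, n in counts.items() if n < min_per_group}


# Made-up sample: 40 treated, only 5 untreated.
sample = pd.DataFrame({"college_degree": [1] * 40 + [0] * 5})
small = find_small_groups(sample)
print(small if small else "All groups meet the minimum size")
```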

Common Patterns and What They Mean 

Different Types of Results 

Strong Positive Effect:

# Example result interpretation
if effect_estimate > 0 and p_value < 0.01:
    print("Strong evidence that treatment INCREASES the outcome")
    print("Effect is positive and highly statistically significant; judge its size separately")

No Effect:

if abs(effect_estimate) < std_error and p_value > 0.05:
    print("No statistically significant evidence of a causal effect")
    print("The data are consistent with little or no impact, but a small effect cannot be ruled out")

Negative Effect:

if effect_estimate < 0 and p_value < 0.05:
    print("Evidence that treatment DECREASES the outcome")
    print("This could be beneficial or harmful depending on context")
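
The three patterns above can be folded into one function so that every result gets exactly one label. This is a minimal sketch; `classify_result` and its labels are hypothetical, and the 0.05 threshold simply follows the convention used in this guide.

```python
def classify_result(effect_estimate, p_value, alpha=0.05):
    """Label a result by the sign of the estimate and its significance."""
    if p_value >= alpha:
        return "no statistically significant effect"
    if effect_estimate > 0:
        return "significant positive effect"
    if effect_estimate < 0:
        return "significant negative effect"
    return "significant but zero estimate (check inputs)"


print(classify_result(18450.0, 0.0001))
print(classify_result(-320.0, 0.03))
print(classify_result(150.0, 0.40))
```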

Understanding Confidence Intervals 

# Interpreting confidence intervals
def interpret_confidence_interval(ci_lower, ci_upper, effect):
    print(f"95% Confidence Interval: [{ci_lower:,.0f}, {ci_upper:,.0f}]")

    if ci_lower > 0:
        print("✓ We're confident the effect is positive")
    elif ci_upper < 0:
        print("✓ We're confident the effect is negative")
    else:
        print("⚠️  Confidence interval includes zero - effect uncertain")

    # Width of the interval relative to the size of the estimate
    width = ci_upper - ci_lower
    relative_width = width / abs(effect) if effect != 0 else float('inf')

    if relative_width < 0.5:
        print("✓ Precise estimate (narrow confidence interval)")
    else:
        print("⚠️  Imprecise estimate (wide confidence interval)")
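
`interpret_confidence_interval` expects numeric bounds, while the sample output earlier shows the interval as text such as `[$13,864, $23,036]`. If your result holds the interval as a string in that style, it needs parsing first. The sketch below assumes that text format; the actual type CAIS returns may differ, and `parse_confidence_interval` is a hypothetical helper.

```python
import re


def parse_confidence_interval(ci_text):
    """Extract (lower, upper) floats from text like '[$13,864, $23,036]'."""
    # Each match is (sign, digits); the optional '$' sits between them.
    matches = re.findall(r"(-?)\$?(\d+(?:,\d{3})*(?:\.\d+)?)", ci_text)
    if len(matches) != 2:
        raise ValueError(f"Expected two numbers in {ci_text!r}")
    lower, upper = (float(sign + digits.replace(",", "")) for sign, digits in matches)
    return lower, upper


ci_lower, ci_upper = parse_confidence_interval("[$13,864, $23,036]")
print(ci_lower, ci_upper)  # prints: 13864.0 23036.0
```

The parsed bounds can then be passed to `interpret_confidence_interval` together with the effect estimate.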

Troubleshooting Common Issues 

Data Quality Issues 

# Check for common data problems
def diagnose_data_issues(data):
    issues = []

    # Check for constant variables
    for col in data.columns:
        if data[col].nunique() <= 1:
            issues.append(f"Variable '{col}' has no variation")

    # Check for extreme outliers
    numeric_cols = data.select_dtypes(include=['number']).columns
    for col in numeric_cols:
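        q99 = data[col].quantile(0.99)
        q01 = data[col].quantile(0.01)
        # The ratio is only meaningful for strictly positive values;
        # skipping q01 <= 0 avoids division by zero (e.g. 0/1 indicator columns)
        if q01 > 0 and (q99 / q01) > 1000:  # Very large range
            issues.append(f"Variable '{col}' has extreme outliers")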

    # Check sample size
    if len(data) < 50:
        issues.append("Sample size very small (< 50 observations)")

    return issues

# Run diagnostics
issues = diagnose_data_issues(data)
if issues:
    print("⚠️  Data Quality Issues Found:")
    for issue in issues:
        print(f"   - {issue}")
else:
    print("✓ No major data quality issues detected")
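
A quantile-ratio check is only informative for strictly positive columns. For columns that contain zeros or negative values, an interquartile-range (IQR) rule is a common alternative. The sketch below is a generic illustration with made-up data, not part of CAIS; `iqr_outlier_share` is a hypothetical helper and `k=1.5` is the conventional multiplier.

```python
import pandas as pd


def iqr_outlier_share(series, k=1.5):
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return float(((series < lower) | (series > upper)).mean())


# Made-up incomes (in thousands) with one extreme value.
incomes = pd.Series([40, 42, 45, 47, 50, 52, 55, 58, 60, 1000])
binary = pd.Series([0, 1] * 50)

print(f"Outlier share (incomes): {iqr_outlier_share(incomes):.0%}")
print(f"Outlier share (binary):  {iqr_outlier_share(binary):.0%}")
```

Unlike a ratio of quantiles, this rule does not flag a balanced 0/1 indicator column as having outliers.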

Method Selection Issues 

# If you disagree with method selection
print("\n=== IF YOU DISAGREE WITH METHOD SELECTION ===")
print("CAIS selected:", method_used)
print("\nConsider these factors:")
print("1. Is your data from a randomized experiment? → RCT methods")
print("2. Do you have before/after data for treated and untreated groups? → Difference-in-Differences")
print("3. Is there a sharp cutoff in treatment? → Regression Discontinuity")
print("4. Do you have a valid instrument? → Instrumental Variables")
print("5. Is this observational data? → Matching or Regression methods")
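
The checklist above can be written as a small decision helper for your own notes. This is a sketch only: `suggest_method` is hypothetical, the order in which the questions are checked is an assumption, and it does not reflect how CAIS selects methods internally.

```python
def suggest_method(randomized=False, before_after=False,
                   sharp_cutoff=False, valid_instrument=False):
    """Map answers to the checklist questions onto a method family."""
    if randomized:
        return "RCT methods"
    if before_after:
        return "Difference-in-Differences"
    if sharp_cutoff:
        return "Regression Discontinuity"
    if valid_instrument:
        return "Instrumental Variables"
    return "Matching or Regression methods"


print(suggest_method(before_after=True))
print(suggest_method())
```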

Result Interpretation Issues 

# Common interpretation mistakes
print("\n=== COMMON INTERPRETATION MISTAKES ===")
print("❌ 'X predicts Y' → ✅ 'X causes Y'")
print("❌ 'X is associated with Y' → ✅ 'X has a causal effect on Y'")
print("❌ 'The model shows...' → ✅ 'The causal analysis suggests...'")
print("❌ Ignoring confidence intervals → ✅ Report uncertainty")
print("❌ Over-interpreting small effects → ✅ Consider practical significance")
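
One way to think about practical significance is to express the effect relative to a baseline, such as the mean outcome in the untreated group. In the sketch below the baseline of $50,000 is made up for illustration, and `relative_effect` is a hypothetical helper.

```python
def relative_effect(effect_estimate, baseline_mean):
    """Effect as a fraction of the baseline mean outcome."""
    if baseline_mean == 0:
        raise ValueError("baseline_mean must be non-zero")
    return effect_estimate / baseline_mean


baseline = 50_000.0  # hypothetical mean income of the untreated group
share = relative_effect(18_450.0, baseline)
print(f"Effect is {share:.1%} of the baseline mean")  # prints: Effect is 36.9% of the baseline mean
```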

Next Steps and Advanced Topics 

Now that you’ve completed your first analysis, here are suggested next steps:

Immediate Next Steps:

Try different datasets: Experiment with various domains and data types
Compare methods: Run the same analysis with different method specifications
Explore batch processing: Analyze multiple datasets systematically
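
For batch processing, a simple loop that records failures without stopping is often enough. The sketch below takes the analysis function as a parameter so it can be tried without an API key; in practice you would pass `run_causal_analysis`. `run_batch` and `fake_analyze` are hypothetical, and only the keyword arguments `query`, `dataset_path` and `dataset_description` come from this guide.

```python
def run_batch(jobs, analyze):
    """Run `analyze` on each job and collect per-dataset outcomes."""
    summaries = []
    for job in jobs:
        entry = {"dataset_path": job["dataset_path"]}
        try:
            entry["result"] = analyze(
                query=job["query"],
                dataset_path=job["dataset_path"],
                dataset_description=job["dataset_description"],
            )
            entry["status"] = "ok"
        except Exception as exc:  # keep going if one dataset fails
            entry["status"] = "failed"
            entry["error"] = str(exc)
        summaries.append(entry)
    return summaries


def fake_analyze(query, dataset_path, dataset_description):
    """Stand-in for the real analysis call, used only for this demo."""
    if dataset_path.endswith("broken.csv"):
        raise FileNotFoundError(dataset_path)
    return {"query": query}


jobs = [
    {"query": "Effect of A on B?", "dataset_path": "a.csv", "dataset_description": "demo"},
    {"query": "Effect of C on D?", "dataset_path": "broken.csv", "dataset_description": "demo"},
]
summaries = run_batch(jobs, fake_analyze)
print([s["status"] for s in summaries])  # prints: ['ok', 'failed']
```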

Advanced Learning:

Method deep-dives: Causal Inference Methods - Learn about each causal inference method
Domain tutorials: Tutorials & Examples - See examples in your field
Configuration options: Advanced Usage - Customize CAIS behavior

Congratulations! You’ve completed your first comprehensive causal analysis with CAIS. You’re now ready to tackle more complex analyses and explore advanced features.

Continue to User Guide for advanced usage patterns, or explore Tutorials & Examples for domain-specific examples!