Dataset Properties and Method Selection Gallery

This gallery demonstrates how different dataset characteristics lead CAIS to select different causal inference methods. Each example shows the decision tree path and explains why specific methods are chosen or excluded.

Overview

CAIS uses a systematic decision tree to select a causal inference method from your data characteristics. The examples below trace that tree for six typical datasets, then show three cases where a method is ruled out.

Key Decision Factors:
- Randomization status
- Data structure (cross-sectional, panel, etc.)
- Treatment variable type (binary, continuous, categorical)
- Available instruments
- Covariate richness and overlap

Gallery Examples

Example 1: Perfect Randomized Experiment

Dataset Characteristics:
- Randomized controlled trial
- Binary treatment assignment
- Rich baseline covariates
- Perfect compliance

    flowchart TD
    A[RCT Dataset] --> B{Is this randomized?}
    B -->|Yes ✓| C{Are covariates available?}
    C -->|Yes ✓| D[Linear Regression<br/>with Covariates]

    style A fill:#e3f2fd
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#e8f5e8

Agent Decision Process:

🎯 Method Selection: Linear Regression with Covariates

Decision Path:
1. Randomization check: ✅ PASSED (balanced assignment)
2. Covariate assessment: ✅ AVAILABLE (baseline measures)
3. Selected method: Linear regression with covariates

Why this method?
✓ Randomization ensures causal identification
✓ Covariates improve precision (reduce standard errors)
✓ Transparent and interpretable results
✓ Well suited to experimental data

Example Datasets: Learning mindset intervention, A/B tests, clinical trials
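
To illustrate the precision claim above, the following sketch simulates a randomized experiment and compares a plain difference in means with a covariate-adjusted regression. The data-generating process, the variable names, and the helper `ols_with_se` are assumptions made for this example and are not part of CAIS.

```python
import numpy as np


def ols_with_se(X, y):
    """Ordinary least squares with classical (homoskedastic) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))


# Simulated RCT (hypothetical data): the true treatment effect is 1.0.
rng = np.random.default_rng(42)
n = 2000
x = rng.normal(size=n)                          # baseline covariate
t = rng.integers(0, 2, size=n).astype(float)    # randomized binary assignment
y = 1.0 * t + 2.0 * x + rng.normal(size=n)

ones = np.ones(n)
b_unadj, se_unadj = ols_with_se(np.column_stack([ones, t]), y)
b_adj, se_adj = ols_with_se(np.column_stack([ones, t, x]), y)

print(f"difference in means: {b_unadj[1]:.3f} (SE {se_unadj[1]:.3f})")
print(f"covariate-adjusted:  {b_adj[1]:.3f} (SE {se_adj[1]:.3f})")
```

Both estimates target the same effect because assignment is randomized. In this simulation the adjusted model reports the smaller standard error, since the covariate absorbs part of the outcome variance.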

—

Example 2: Observational Data with Rich Covariates

Dataset Characteristics:
- Non-randomized observational study
- Binary treatment
- Rich set of confounding variables
- Good covariate overlap

    flowchart TD
    A[Observational Dataset] --> B{Is this randomized?}
    B -->|No ✗| C{Panel data available?}
    C -->|No ✗| D{Running variable?}
    D -->|No ✗| E{Binary treatment?}
    E -->|Yes ✓| F{Instrumental variable?}
    F -->|No ✗| G{Rich covariates?}
    G -->|Yes ✓| H{Good overlap?}
    H -->|Yes ✓| I[Propensity Score<br/>Matching]

    style A fill:#e3f2fd
    style B fill:#ffebee
    style C fill:#ffebee
    style D fill:#ffebee
    style E fill:#fff3e0
    style F fill:#ffebee
    style G fill:#fff3e0
    style H fill:#fff3e0
    style I fill:#e8f5e8

Agent Decision Process:

🎯 Method Selection: Propensity Score Matching

Decision Path:
1. Randomization check: ❌ FAILED (selection bias detected)
2. Panel data check: ❌ NOT AVAILABLE
3. Running variable check: ❌ NOT AVAILABLE
4. Treatment type: ✅ BINARY
5. Instrumental variable: ❌ NOT AVAILABLE
6. Covariate richness: ✅ RICH COVARIATES
7. Overlap assessment: ✅ GOOD OVERLAP
8. Selected method: Propensity score matching

Why this method?
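✓ Adjusts for selection on observed covariates through matching
✓ Rich covariates enable credible matching
✓ Good overlap ensures valid comparisons
✓ Transparent balance assessment

Key assumption: No unmeasured confounding once the observed covariates are accounted for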

Example Datasets: Hospital treatment effects, job training programs, educational interventions
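
The role of overlap can be made concrete with a minimal sketch of nearest-neighbor matching on the propensity score. It assumes the scores have already been estimated elsewhere. The function name `match_on_propensity`, the caliper value, and the toy numbers are hypothetical and do not describe how CAIS implements matching.

```python
def match_on_propensity(treated, controls, caliper=0.05):
    """Match each treated unit to its nearest control (with replacement).

    treated, controls: lists of (propensity_score, outcome) pairs.
    Treated units with no control within `caliper` are dropped.
    Returns (mean treated-minus-control difference, number of matched units).
    """
    diffs = []
    for ps_t, y_t in treated:
        ps_c, y_c = min(controls, key=lambda c: abs(c[0] - ps_t))
        if abs(ps_c - ps_t) <= caliper:
            diffs.append(y_t - y_c)
    if not diffs:
        raise ValueError("no treated unit has a control within the caliper")
    return sum(diffs) / len(diffs), len(diffs)


# Hypothetical units: (estimated propensity score, outcome).
treated = [(0.30, 12.0), (0.55, 15.0), (0.90, 20.0)]
controls = [(0.10, 7.0), (0.28, 10.0), (0.52, 12.0), (0.60, 14.0)]

att, n_matched = match_on_propensity(treated, controls)
print(att, n_matched)
```

The treated unit with score 0.90 has no comparable control and is dropped, which is the situation the overlap check is meant to detect. With poor overlap, many treated units would be discarded and the estimate would describe only a narrow subgroup.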

—

Example 3: Panel Data with Treatment Timing

Dataset Characteristics:
- Panel data (multiple time periods)
- Treatment timing varies across units
- Clear before/after periods
- Parallel trends plausible

    flowchart TD
    A[Panel Dataset] --> B{Is this randomized?}
    B -->|No ✗| C{Panel data available?}
    C -->|Yes ✓| D{Treatment timing varies?}
    D -->|Yes ✓| E[Difference-in-Differences]

    style A fill:#e3f2fd
    style B fill:#ffebee
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#e8f5e8

Agent Decision Process:

🎯 Method Selection: Difference-in-Differences

Decision Path:
1. Randomization check: ❌ FAILED
2. Panel data check: ✅ AVAILABLE (multiple time periods)
3. Treatment timing: ✅ VARIES across units
4. Selected method: Difference-in-differences

Why this method?
✓ Exploits timing variation for identification
✓ Controls for time-invariant confounders
✓ Handles unobserved heterogeneity that is constant over time
✓ Tolerates selection on observables and on time-invariant unobservables, provided trends are parallel

Key assumption: Parallel trends between treated and control groups in the absence of treatment

Example Datasets: Policy evaluations, minimum wage studies, healthcare reforms
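
The core arithmetic of difference-in-differences can be sketched for the simplest case of two groups and two periods. The helper `did_estimate` and the toy outcomes are assumptions for illustration. Settings where treatment timing varies across many units generally need more careful estimators than this.

```python
from statistics import mean


def did_estimate(rows):
    """Two-group, two-period difference-in-differences.

    rows: iterable of (treated: bool, post: bool, outcome: float).
    Returns (treated post - pre) minus (control post - pre).
    """
    cells = {(t, p): [] for t in (False, True) for p in (False, True)}
    for treated, post, y in rows:
        cells[(treated, post)].append(y)
    m = {key: mean(values) for key, values in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change


# Hypothetical outcomes: (treated, post, outcome).
rows = [
    (False, False, 10.0), (False, False, 12.0),
    (False, True, 13.0), (False, True, 15.0),
    (True, False, 20.0), (True, False, 22.0),
    (True, True, 28.0), (True, True, 30.0),
]
estimate = did_estimate(rows)
print(estimate)
```

Here the control group rises by 3 and the treated group by 8, so the estimate is 5. The level gap between the groups (about 10 before treatment) drops out, which is how the method removes time-invariant differences.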

—

Example 4: Sharp Regression Discontinuity

Dataset Characteristics:
- Continuous running variable
- Sharp cutoff for treatment assignment
- Treatment probability jumps from 0 to 1 at the cutoff
- No manipulation of running variable

    flowchart TD
    A[RDD Dataset] --> B{Is this randomized?}
    B -->|No ✗| C{Panel data available?}
    C -->|No ✗| D{Running variable with cutoff?}
    D -->|Yes ✓| E{Sharp discontinuity?}
    E -->|Yes ✓| F[Regression Discontinuity<br/>Design]

    style A fill:#e3f2fd
    style B fill:#ffebee
    style C fill:#ffebee
    style D fill:#fff3e0
    style E fill:#fff3e0
    style F fill:#e8f5e8

Agent Decision Process:

🎯 Method Selection: Regression Discontinuity Design

Decision Path:
1. Randomization check: ❌ FAILED
2. Panel data check: ❌ NOT AVAILABLE
3. Running variable: ✅ DETECTED (continuous assignment variable)
4. Discontinuity: ✅ SHARP (treatment probability jumps from 0 to 1)
5. Selected method: Regression discontinuity design

Why this method?
✓ Exploits discontinuous assignment rule
✓ Assignment is as good as random close to the cutoff
✓ Credible identification strategy
✓ Transparent assumptions

Key assumption: Continuity of potential outcomes at cutoff

Example Datasets: Scholarship eligibility, policy thresholds, age-based programs
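
One common way to estimate a sharp discontinuity is a local linear regression with separate slopes on each side of the cutoff. The sketch below assumes a fixed bandwidth and simulated data. The function `sharp_rdd_estimate` and its defaults are hypothetical, and bandwidth selection and inference are left out.

```python
import numpy as np


def sharp_rdd_estimate(x, y, cutoff=0.0, bandwidth=0.5):
    """Estimate the jump in y at the cutoff with a local linear fit.

    Uses only observations within `bandwidth` of the cutoff and allows
    a different slope on each side.
    """
    xc = x - cutoff
    keep = np.abs(xc) <= bandwidth
    xc, yk = xc[keep], y[keep]
    d = (xc >= 0).astype(float)                      # treated at or above cutoff
    X = np.column_stack([np.ones_like(xc), d, xc, d * xc])
    beta, *_ = np.linalg.lstsq(X, yk, rcond=None)
    return float(beta[1])                            # coefficient on d = jump


# Simulated data (hypothetical): the true jump at the cutoff is 2.0.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 4000)
y = 1.0 + 0.5 * x + 2.0 * (x >= 0) + rng.normal(0, 0.2, 4000)

jump = sharp_rdd_estimate(x, y)
print(f"estimated jump at cutoff: {jump:.3f}")
```

The estimate is local: it describes units near the cutoff and relies on potential outcomes being continuous there.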

—

Example 5: Instrumental Variables

Dataset Characteristics:
- Endogenous treatment assignment
- Valid instrumental variable available
- Strong first-stage relationship
- Credible exclusion restriction

    flowchart TD
    A[IV Dataset] --> B{Is this randomized?}
    B -->|No ✗| C{Panel data available?}
    C -->|No ✗| D{Running variable?}
    D -->|No ✗| E{Binary treatment?}
    E -->|Yes ✓| F{Instrumental variable?}
    F -->|Yes ✓| G{Valid instrument?}
    G -->|Yes ✓| H[Instrumental Variables]

    style A fill:#e3f2fd
    style B fill:#ffebee
    style C fill:#ffebee
    style D fill:#ffebee
    style E fill:#fff3e0
    style F fill:#fff3e0
    style G fill:#fff3e0
    style H fill:#e8f5e8

Agent Decision Process:

🎯 Method Selection: Instrumental Variables

Decision Path:
1. Randomization check: ❌ FAILED
2. Panel data check: ❌ NOT AVAILABLE
3. Running variable check: ❌ NOT AVAILABLE
4. Treatment type: ✅ BINARY
5. Instrumental variable: ✅ DETECTED
6. Instrument validation: ✅ VALID (relevance + exogeneity)
7. Selected method: Instrumental variables

Why this method?
✓ Handles unmeasured confounding
✓ Valid instrument provides exogenous variation
✓ Strong first-stage relationship
✓ Credible exclusion restriction

Key assumptions: Relevance, exogeneity, exclusion restriction

Example Datasets: Marketing campaigns with server downtime, education with distance instruments
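
With a binary instrument and a binary treatment, the IV estimate reduces to the Wald ratio: the effect of the instrument on the outcome divided by its effect on treatment uptake. The sketch below uses toy data, and the function name `wald_estimate` is an assumption for illustration.

```python
from statistics import mean


def wald_estimate(z, d, y):
    """Wald estimator for a binary instrument.

    z: instrument (0/1), d: treatment (0/1), y: outcome; equal-length lists.
    Returns reduced form divided by first stage.
    """
    def group_mean(values, level):
        return mean(v for zi, v in zip(z, values) if zi == level)

    first_stage = group_mean(d, 1) - group_mean(d, 0)
    reduced_form = group_mean(y, 1) - group_mean(y, 0)
    if first_stage == 0:
        raise ValueError("instrument does not shift treatment (no first stage)")
    return reduced_form / first_stage


# Hypothetical data in which the outcome equals 1 + 4 * treatment.
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 0, 1, 1, 1, 1, 0]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 1.0]

effect = wald_estimate(z, d, y)
print(effect)
```

The instrument raises treatment uptake by 0.5 and the mean outcome by 2, giving an estimate of 4. A first stage close to zero would make the ratio unstable, which is why the relevance assumption matters.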

—

Example 6: Continuous Treatment with IV

Dataset Characteristics:
- Continuous treatment variable
- Endogeneity concerns
- Valid instrumental variable
- No clear cutoff or panel structure

    flowchart TD
    A[Continuous Treatment] --> B{Is this randomized?}
    B -->|No ✗| C{Panel data available?}
    C -->|No ✗| D{Running variable?}
    D -->|No ✗| E{Binary treatment?}
    E -->|No ✗| F{Continuous treatment?}
    F -->|Yes ✓| G{Instrumental variable?}
    G -->|Yes ✓| H[Instrumental Variables<br/>Continuous Treatment]

    style A fill:#e3f2fd
    style B fill:#ffebee
    style C fill:#ffebee
    style D fill:#ffebee
    style E fill:#ffebee
    style F fill:#fff3e0
    style G fill:#fff3e0
    style H fill:#e8f5e8

Agent Decision Process:

🎯 Method Selection: IV with Continuous Treatment

Decision Path:
1. Randomization check: ❌ FAILED
2. Panel data check: ❌ NOT AVAILABLE
3. Running variable check: ❌ NOT AVAILABLE
4. Treatment type: ✅ CONTINUOUS
5. Instrumental variable: ✅ AVAILABLE
6. Selected method: IV with continuous treatment

Why this method?
✓ Handles continuous endogenous treatment
✓ Valid instrument provides identification
✓ Can estimate dose-response relationships
✓ Flexible functional form specification

Example Datasets: Advertising intensity, education years, healthcare dosage
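
For a continuous treatment, the same idea is usually carried out as two-stage least squares. The sketch below simulates an unobserved confounder and compares a naive regression slope with a 2SLS point estimate. The data-generating process and the function `two_stage_least_squares` are assumptions for this example. It returns a point estimate only, because standard errors taken from a manually run second stage are not valid.

```python
import numpy as np


def two_stage_least_squares(z, d, y):
    """2SLS point estimate with one instrument and one endogenous treatment."""
    Z = np.column_stack([np.ones_like(z), z])
    gamma, *_ = np.linalg.lstsq(Z, d, rcond=None)    # stage 1: d on z
    d_hat = Z @ gamma
    X_hat = np.column_stack([np.ones_like(d_hat), d_hat])
    beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)  # stage 2: y on fitted d
    return float(beta[1])


# Simulated data (hypothetical): the true effect of d on y is 1.5.
rng = np.random.default_rng(1)
n = 5000
u = rng.normal(size=n)                    # unobserved confounder
z = rng.normal(size=n)                    # instrument, independent of u
d = 0.8 * z + u + rng.normal(size=n)      # continuous treatment
y = 1.5 * d + 2.0 * u + rng.normal(size=n)

iv = two_stage_least_squares(z, d, y)
ols = float(np.polyfit(d, y, 1)[0])       # naive slope, confounded by u
print(f"OLS slope: {ols:.2f}, 2SLS estimate: {iv:.2f}")
```

In this simulation the naive slope is pulled upward by the confounder, while the 2SLS estimate lands near the true value because it uses only the variation in treatment that comes from the instrument.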

Method Exclusion Examples

Understanding why methods are excluded is as important as understanding why they’re selected.

Example 7: Why Not Difference-in-Differences?

Dataset: Cross-sectional observational data with rich covariates

❌ Difference-in-Differences: EXCLUDED

Data Requirements Not Met:
- Requires: Panel data with multiple time periods
- Available: Cross-sectional data (single time point)
- Missing: Pre-treatment outcome measurements
- Missing: Variation in treatment timing

Alternative Selected: Propensity Score Matching
- Uses available rich covariates
- Adjusts for selection on observed covariates through matching
- Appropriate for cross-sectional data

—

Example 8: Why Not Regression Discontinuity?

Dataset: Observational data without clear assignment rule

❌ Regression Discontinuity: EXCLUDED

Data Requirements Not Met:
- Requires: Continuous running variable with sharp cutoff
- Available: Discretionary treatment assignment
- Missing: Clear assignment rule or threshold
- Problem: No discontinuous treatment probability

Alternative Selected: Propensity Score Methods
- Handles discretionary assignment
- Uses observed characteristics for matching
- Appropriate for non-rule-based assignment

—

Example 9: Why Not Instrumental Variables?

Dataset: Randomized experiment with perfect compliance

❌ Instrumental Variables: EXCLUDED

Not Needed:
- Randomization already provides identification
- No endogeneity concerns in experimental data
- IV would be less efficient than direct analysis
- Perfect compliance eliminates need for instruments

Selected Method: Linear Regression with Covariates
- Leverages randomization for identification
- More efficient than IV approach
- Simpler interpretation and implementation

Dataset Property Decision Matrix

This matrix shows how different combinations of data characteristics lead to method selection:

Method Selection Matrix
Randomized	Panel Data	Running Var	Instrument	Treatment Type	Selected Method
✅ Yes	Any	Any	Any	Binary	Linear Regression + Covariates
✅ Yes	Any	Any	Any	Continuous	Linear Regression + Covariates
❌ No	✅ Yes	Any	Any	Any	Difference-in-Differences
❌ No	❌ No	✅ Yes	Any	Any	Regression Discontinuity
❌ No	❌ No	❌ No	✅ Yes	Binary	Instrumental Variables
❌ No	❌ No	❌ No	✅ Yes	Continuous	IV Continuous Treatment
❌ No	❌ No	❌ No	❌ No	Binary	Propensity Score Methods
❌ No	❌ No	❌ No	❌ No	Continuous	Linear Regression + Controls
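
Read top to bottom, the matrix behaves like a first-match rule list. The sketch below encodes it that way. `DatasetProperties` and `select_method` are hypothetical names chosen for this illustration and are not the CAIS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProperties:
    """Hypothetical container for the five matrix columns."""
    randomized: bool
    panel_data: bool
    running_variable: bool
    instrument: bool
    treatment_type: str  # "binary" or "continuous"


def select_method(props: DatasetProperties) -> str:
    """Return the method from the first matrix row that matches."""
    # Rows 1-4 accept any treatment type.
    if props.randomized:
        return "Linear Regression + Covariates"
    if props.panel_data:
        return "Difference-in-Differences"
    if props.running_variable:
        return "Regression Discontinuity"
    # Rows 5-8 depend on the treatment type.
    if props.treatment_type not in ("binary", "continuous"):
        raise ValueError(f"matrix has no row for {props.treatment_type!r}")
    continuous = props.treatment_type == "continuous"
    if props.instrument:
        return "IV Continuous Treatment" if continuous else "Instrumental Variables"
    return "Linear Regression + Controls" if continuous else "Propensity Score Methods"


example = DatasetProperties(
    randomized=False, panel_data=False, running_variable=False,
    instrument=True, treatment_type="continuous",
)
print(select_method(example))
```

Because earlier rows take precedence, a randomized dataset maps to linear regression with covariates even when an instrument or panel structure is also present.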

Common Decision Patterns

Pattern 1: Experimental Data Priority

Rule: Randomized experiments are always preferred when available

Priority Hierarchy:
Randomized Controlled Trial → Linear Regression + Covariates
Natural Experiment (RDD/IV) → RDD or IV
Quasi-Experiment (DiD) → Difference-in-Differences
Observational (Matching) → Propensity Score Methods
Observational (Regression) → Linear Regression + Controls

Pattern 2: Data Structure Drives Method

Rule: Method selection follows the data availability hierarchy

Data Structure Priority:
Randomization → Experimental methods
Panel + Timing → Difference-in-Differences
Running Variable → Regression Discontinuity
Valid Instrument → Instrumental Variables
Rich Covariates → Propensity Score Methods
Limited Data → Linear Regression

Pattern 3: Treatment Type Considerations

Rule: Treatment variable type affects method choice within categories

Treatment Type Adaptations:
- Binary Treatment: Standard methods (matching, IV, etc.)
- Continuous Treatment: Specialized versions (generalized PS, IV)
- Categorical Treatment: Multinomial approaches
- Time-Varying Treatment: Dynamic methods

Next Steps

Apply to Your Data: Use the decision framework with your datasets
Explore Case Studies: See detailed examples in Case Studies
Read Method Documentation: Deep dive into specific methods in Causal Inference Methods

Related Resources:
- Method Selection Decision Tree: Complete decision tree documentation
- Case Studies: Detailed case studies by domain
- Quickstart Tutorial: Quick start guide for CAIS