Dataset Properties and Method Selection Gallery
===============================================

This gallery demonstrates how different dataset characteristics lead CAIS to select different causal inference methods. Each example shows the decision tree path and explains why specific methods are chosen or excluded.

Overview
--------

CAIS uses a systematic decision tree to select the most appropriate causal inference method based on your data characteristics. The examples below trace that tree for representative datasets.

**Key Decision Factors**:

- Randomization status
- Data structure (cross-sectional, panel, etc.)
- Treatment variable type (binary, continuous, categorical)
- Available instruments
- Covariate richness and overlap

Gallery Examples
----------------

Example 1: Perfect Randomized Experiment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset Characteristics**:

- Randomized controlled trial
- Binary treatment assignment
- Rich baseline covariates
- Perfect compliance

.. mermaid::

   flowchart TD
       A[RCT Dataset] --> B{Is this randomized?}
       B -->|Yes ✓| C{Are covariates available?}
       C -->|Yes ✓| D[Linear Regression<br/>with Covariates]

       style A fill:#e3f2fd
       style B fill:#e8f5e8
       style C fill:#fff3e0
       style D fill:#e8f5e8

**Agent Decision Process**:

.. code-block:: text

   🎯 Method Selection: Linear Regression with Covariates

   Decision Path:
   1. Randomization check: ✅ PASSED (balanced assignment)
   2. Covariate assessment: ✅ AVAILABLE (baseline measures)
   3. Selected method: Linear regression with covariates

   Why this method?
   ✓ Randomization ensures causal identification
   ✓ Covariates improve precision (reduce standard errors)
   ✓ Transparent and interpretable results
   ✓ Well suited to experimental data

**Example Datasets**: Learning mindset intervention, A/B tests, clinical trials

----

Example 2: Observational Data with Rich Covariates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset Characteristics**:

- Non-randomized observational study
- Binary treatment
- Rich set of confounding variables
- Good covariate overlap

.. mermaid::

   flowchart TD
       A[Observational Dataset] --> B{Is this randomized?}
       B -->|No ✗| C{Panel data available?}
       C -->|No ✗| D{Running variable?}
       D -->|No ✗| E{Binary treatment?}
       E -->|Yes ✓| F{Instrumental variable?}
       F -->|No ✗| G{Rich covariates?}
       G -->|Yes ✓| H{Good overlap?}
       H -->|Yes ✓| I[Propensity Score<br/>Matching]

       style A fill:#e3f2fd
       style B fill:#ffebee
       style C fill:#ffebee
       style D fill:#ffebee
       style E fill:#fff3e0
       style F fill:#ffebee
       style G fill:#fff3e0
       style H fill:#fff3e0
       style I fill:#e8f5e8

**Agent Decision Process**:

.. code-block:: text

   🎯 Method Selection: Propensity Score Matching

   Decision Path:
   1. Randomization check: ❌ FAILED (selection bias detected)
   2. Panel data check: ❌ NOT AVAILABLE
   3. Running variable check: ❌ NOT AVAILABLE
   4. Treatment type: ✅ BINARY
   5. Instrumental variable: ❌ NOT AVAILABLE
   6. Covariate richness: ✅ RICH COVARIATES
   7. Overlap assessment: ✅ GOOD OVERLAP
   8. Selected method: Propensity score matching

   Why this method?
   ✓ Handles selection bias through matching
   ✓ Rich covariates enable credible matching
   ✓ Good overlap ensures valid comparisons
   ✓ Transparent balance assessment

**Example Datasets**: Hospital treatment effects, job training programs, educational interventions

----

Example 3: Panel Data with Treatment Timing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset Characteristics**:

- Panel data (multiple time periods)
- Treatment timing varies across units
- Clear before/after periods
- Parallel trends plausible

.. mermaid::

   flowchart TD
       A[Panel Dataset] --> B{Is this randomized?}
       B -->|No ✗| C{Panel data available?}
       C -->|Yes ✓| D{Treatment timing varies?}
       D -->|Yes ✓| E[Difference-in-Differences]

       style A fill:#e3f2fd
       style B fill:#ffebee
       style C fill:#fff3e0
       style D fill:#fff3e0
       style E fill:#e8f5e8

**Agent Decision Process**:

.. code-block:: text

   🎯 Method Selection: Difference-in-Differences

   Decision Path:
   1. Randomization check: ❌ FAILED
   2. Panel data check: ✅ AVAILABLE (multiple time periods)
   3. Treatment timing: ✅ VARIES across units
   4. Selected method: Difference-in-differences

   Why this method?
   ✓ Exploits timing variation for identification
   ✓ Controls for time-invariant confounders
   ✓ Handles unobserved heterogeneity
   ✓ Robust to selection on time-invariant observables and unobservables

   Key assumption: Parallel trends between treatment and control

**Example Datasets**: Policy evaluations, minimum wage studies, healthcare reforms

----

Example 4: Sharp Regression Discontinuity
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset Characteristics**:

- Continuous running variable
- Sharp cutoff for treatment assignment
- Treatment probability jumps discontinuously
- No manipulation of running variable

.. mermaid::

   flowchart TD
       A[RDD Dataset] --> B{Is this randomized?}
       B -->|No ✗| C{Panel data available?}
       C -->|No ✗| D{Running variable with cutoff?}
       D -->|Yes ✓| E{Sharp discontinuity?}
       E -->|Yes ✓| F[Regression Discontinuity<br/>Design]

       style A fill:#e3f2fd
       style B fill:#ffebee
       style C fill:#ffebee
       style D fill:#fff3e0
       style E fill:#fff3e0
       style F fill:#e8f5e8

**Agent Decision Process**:

.. code-block:: text

   🎯 Method Selection: Regression Discontinuity Design

   Decision Path:
   1. Randomization check: ❌ FAILED
   2. Panel data check: ❌ NOT AVAILABLE
   3. Running variable: ✅ DETECTED (continuous assignment variable)
   4. Discontinuity: ✅ SHARP (treatment probability jumps)
   5. Selected method: Regression discontinuity design

   Why this method?
   ✓ Exploits discontinuous assignment rule
   ✓ Local randomization around cutoff
   ✓ Credible identification strategy
   ✓ Transparent assumptions

   Key assumption: Continuity of potential outcomes at cutoff

**Example Datasets**: Scholarship eligibility, policy thresholds, age-based programs

----

Example 5: Instrumental Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset Characteristics**:

- Endogenous treatment assignment
- Valid instrumental variable available
- Strong first-stage relationship
- Credible exclusion restriction

.. mermaid::

   flowchart TD
       A[IV Dataset] --> B{Is this randomized?}
       B -->|No ✗| C{Panel data available?}
       C -->|No ✗| D{Running variable?}
       D -->|No ✗| E{Binary treatment?}
       E -->|Yes ✓| F{Instrumental variable?}
       F -->|Yes ✓| G{Valid instrument?}
       G -->|Yes ✓| H[Instrumental Variables]

       style A fill:#e3f2fd
       style B fill:#ffebee
       style C fill:#ffebee
       style D fill:#ffebee
       style E fill:#fff3e0
       style F fill:#fff3e0
       style G fill:#fff3e0
       style H fill:#e8f5e8

**Agent Decision Process**:

.. code-block:: text

   🎯 Method Selection: Instrumental Variables

   Decision Path:
   1. Randomization check: ❌ FAILED
   2. Panel data check: ❌ NOT AVAILABLE
   3. Running variable check: ❌ NOT AVAILABLE
   4. Treatment type: ✅ BINARY
   5. Instrumental variable: ✅ DETECTED
   6. Instrument validation: ✅ VALID (relevance + exogeneity)
   7. Selected method: Instrumental variables

   Why this method?
   ✓ Handles unmeasured confounding
   ✓ Valid instrument provides exogenous variation
   ✓ Strong first-stage relationship
   ✓ Credible exclusion restriction

   Key assumptions: Relevance, exogeneity, exclusion restriction

**Example Datasets**: Marketing campaigns with server downtime as an instrument, education with distance-based instruments

----

Example 6: Continuous Treatment with IV
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset Characteristics**:

- Continuous treatment variable
- Endogeneity concerns
- Valid instrumental variable
- No clear cutoff or panel structure

.. mermaid::

   flowchart TD
       A[Continuous Treatment] --> B{Is this randomized?}
       B -->|No ✗| C{Panel data available?}
       C -->|No ✗| D{Running variable?}
       D -->|No ✗| E{Binary treatment?}
       E -->|No ✗| F{Continuous treatment}
       F --> G{Instrumental variable?}
       G -->|Yes ✓| H[Instrumental Variables<br/>Continuous Treatment]

       style A fill:#e3f2fd
       style B fill:#ffebee
       style C fill:#ffebee
       style D fill:#ffebee
       style E fill:#ffebee
       style F fill:#fff3e0
       style G fill:#fff3e0
       style H fill:#e8f5e8

**Agent Decision Process**:

.. code-block:: text

   🎯 Method Selection: IV with Continuous Treatment

   Decision Path:
   1. Randomization check: ❌ FAILED
   2. Panel data check: ❌ NOT AVAILABLE
   3. Running variable check: ❌ NOT AVAILABLE
   4. Treatment type: ✅ CONTINUOUS
   5. Instrumental variable: ✅ AVAILABLE
   6. Selected method: IV with continuous treatment

   Why this method?
   ✓ Handles continuous endogenous treatment
   ✓ Valid instrument provides identification
   ✓ Can estimate dose-response relationships
   ✓ Flexible functional form specification

**Example Datasets**: Advertising intensity, education years, healthcare dosage

Method Exclusion Examples
-------------------------

Understanding why methods are excluded is as important as understanding why they're selected.

Example 7: Why Not Difference-in-Differences?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset**: Cross-sectional observational data with rich covariates

.. code-block:: text

   ❌ Difference-in-Differences: EXCLUDED

   Data Requirements Not Met:
   - Requires: Panel data with multiple time periods
   - Available: Cross-sectional data (single time point)
   - Missing: Pre-treatment outcome measurements
   - Missing: Variation in treatment timing

   Alternative Selected: Propensity Score Matching
   - Uses available rich covariates
   - Handles selection bias through matching
   - Appropriate for cross-sectional data

----

Example 8: Why Not Regression Discontinuity?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset**: Observational data without clear assignment rule

.. code-block:: text

   ❌ Regression Discontinuity: EXCLUDED

   Data Requirements Not Met:
   - Requires: Continuous running variable with sharp cutoff
   - Available: Discretionary treatment assignment
   - Missing: Clear assignment rule or threshold
   - Problem: No discontinuous treatment probability

   Alternative Selected: Propensity Score Methods
   - Handles discretionary assignment
   - Uses observed characteristics for matching
   - Appropriate for non-rule-based assignment

----

Example 9: Why Not Instrumental Variables?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Dataset**: Randomized experiment with perfect compliance

.. code-block:: text

   ❌ Instrumental Variables: EXCLUDED

   Not Needed:
   - Randomization already provides identification
   - No endogeneity concerns in experimental data
   - IV would be less efficient than direct analysis
   - Perfect compliance eliminates need for instruments

   Selected Method: Linear Regression with Covariates
   - Leverages randomization for identification
   - More efficient than IV approach
   - Simpler interpretation and implementation

Dataset Property Decision Matrix
--------------------------------

This matrix shows how different combinations of data characteristics lead to method selection:

.. list-table:: Method Selection Matrix
   :header-rows: 1
   :widths: 15 15 15 15 15 25

   * - Randomized
     - Panel Data
     - Running Var
     - Instrument
     - Treatment Type
     - Selected Method
   * - ✅ Yes
     - Any
     - Any
     - Any
     - Binary
     - Linear Regression + Covariates
   * - ✅ Yes
     - Any
     - Any
     - Any
     - Continuous
     - Linear Regression + Covariates
   * - ❌ No
     - ✅ Yes
     - Any
     - Any
     - Any
     - Difference-in-Differences
   * - ❌ No
     - ❌ No
     - ✅ Yes
     - Any
     - Any
     - Regression Discontinuity
   * - ❌ No
     - ❌ No
     - ❌ No
     - ✅ Yes
     - Binary
     - Instrumental Variables
   * - ❌ No
     - ❌ No
     - ❌ No
     - ✅ Yes
     - Continuous
     - IV Continuous Treatment
   * - ❌ No
     - ❌ No
     - ❌ No
     - ❌ No
     - Binary
     - Propensity Score Methods
   * - ❌ No
     - ❌ No
     - ❌ No
     - ❌ No
     - Continuous
     - Linear Regression + Controls

Common Decision Patterns
------------------------

Pattern 1: Experimental Data Priority
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Rule**: Randomized experiments are always preferred when available

.. code-block:: text

   Priority Hierarchy:
   1. Randomized Controlled Trial → Linear Regression + Covariates
   2. Natural Experiment (RDD/IV) → RDD or IV
   3. Quasi-Experiment (DiD) → Difference-in-Differences
   4. Observational (Matching) → Propensity Score Methods
   5. Observational (Regression) → Linear Regression + Controls

Pattern 2: Data Structure Drives Method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Rule**: Method selection follows the data availability hierarchy

.. code-block:: text

   Data Structure Priority:
   1. Randomization → Experimental methods
   2. Panel + Timing → Difference-in-Differences
   3. Running Variable → Regression Discontinuity
   4. Valid Instrument → Instrumental Variables
   5. Rich Covariates → Propensity Score Methods
   6. Limited Data → Linear Regression

Pattern 3: Treatment Type Considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Rule**: Treatment variable type affects method choice within categories

.. code-block:: text

   Treatment Type Adaptations:
   - Binary Treatment: Standard methods (matching, IV, etc.)
   - Continuous Treatment: Specialized versions (generalized PS, IV)
   - Categorical Treatment: Multinomial approaches
   - Time-Varying Treatment: Dynamic methods

Next Steps
----------

1. **Apply to Your Data**: Use the decision framework with your datasets
2. **Explore Case Studies**: See detailed examples in :doc:`../case_studies/index`
3. **Read Method Documentation**: Deep dive into specific methods in :doc:`../../methods/index`

**Related Resources**:

- :doc:`../../methods/decision_tree` - Complete decision tree documentation
- :doc:`../case_studies/index` - Detailed case studies by domain
- :doc:`../../getting_started/quickstart` - Quick start guide for CAIS
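
The Method Selection Matrix earlier in this gallery can be read as a first-match rule list: rows are checked top to bottom, and the first row whose conditions hold determines the method. The following is a minimal Python sketch under that reading. ``DatasetProperties`` and ``select_method`` are hypothetical names introduced for illustration and are not part of the CAIS API; the actual agent may apply further checks (for example overlap or instrument validity) that the matrix does not encode.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class DatasetProperties:
       """Hypothetical container mirroring the matrix columns (not a CAIS class)."""

       randomized: bool
       panel_data: bool
       running_variable: bool
       instrument: bool
       treatment_type: str  # "binary" or "continuous", as in the matrix


   def select_method(props: DatasetProperties) -> str:
       """Return the method from the first matrix row that matches ``props``."""
       if props.treatment_type not in ("binary", "continuous"):
           # The matrix only lists binary and continuous treatments.
           raise ValueError(f"unsupported treatment type: {props.treatment_type!r}")

       # Rows 1-2: randomization takes priority over every other property.
       if props.randomized:
           return "Linear Regression + Covariates"
       # Row 3: panel data, regardless of the remaining columns.
       if props.panel_data:
           return "Difference-in-Differences"
       # Row 4: a running variable with a cutoff.
       if props.running_variable:
           return "Regression Discontinuity"

       continuous = props.treatment_type == "continuous"
       # Rows 5-6: an instrument is available; the variant depends on treatment type.
       if props.instrument:
           return "IV Continuous Treatment" if continuous else "Instrumental Variables"
       # Rows 7-8: no design-based source of identification remains.
       return "Linear Regression + Controls" if continuous else "Propensity Score Methods"

For instance, ``select_method(DatasetProperties(False, False, False, True, "continuous"))`` returns ``"IV Continuous Treatment"``, which is the matrix row for an instrumented continuous treatment.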