Regression Discontinuity Design (RDD)

Regression Discontinuity Design (RDD) is a quasi-experimental method that exploits arbitrary cutoffs in treatment assignment rules to identify causal effects. RDD compares units just above and below a threshold to estimate local treatment effects.

When to Use RDD

Ideal Conditions:
- Treatment assignment is determined by a continuous variable (the running variable) crossing a threshold
- The assignment rule is strictly enforced and known
- Units cannot precisely manipulate the running variable around the cutoff
- There are sufficient observations near the cutoff

Common Applications:
- Educational interventions (test score cutoffs for remedial programs)
- Financial aid eligibility (income thresholds)
- Policy interventions (age cutoffs, geographic boundaries)
- Medical treatments (clinical thresholds for treatment)
- Electoral systems (vote share thresholds)

Not Suitable When:
- Treatment assignment is not based on a clear cutoff
- The running variable can be easily manipulated
- There are insufficient observations near the cutoff
- Multiple simultaneous cutoffs exist

Theoretical Background

The RDD Framework

Basic Setup:
- Running Variable ($X$): Continuous variable determining treatment assignment
- Cutoff ($c$): Threshold value where treatment assignment changes
- Treatment Assignment: $D_i = 1$ if $X_i \geq c$, $D_i = 0$ if $X_i < c$

Sharp RDD: Treatment assignment is a deterministic function of the running variable:

\[D_i = \begin{cases} 1 & \text{if } X_i \geq c \\ 0 & \text{if } X_i < c \end{cases}\]

Fuzzy RDD: Treatment probability changes discontinuously at the cutoff, but assignment is not deterministic:

\[P(D_i = 1 \mid X_i) = \begin{cases} g_1(X_i) & \text{if } X_i \geq c \\ g_0(X_i) & \text{if } X_i < c \end{cases}\]

where $g_1(c) \neq g_0(c)$.

RDD Estimand: The treatment effect at the cutoff:

\[\tau_{RDD} = E[Y_i(1) - Y_i(0) \mid X_i = c]\]

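Sharp RDD Estimation: Under the continuity assumption described under Key Assumptions, the effect is identified by the jump in the conditional mean of the outcome at the cutoff. In practice, each limit is estimated with a local regression on its own side of the cutoff:

\[\tau_{RDD} = \lim_{x \to c^+} E[Y_i \mid X_i = x] - \lim_{x \to c^-} E[Y_i \mid X_i = x]\]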
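The difference of limits above can be approximated by fitting a separate line on each side of the cutoff and comparing the two fitted values at the cutoff. The following is a minimal standard-library sketch on simulated data. The helper names `fit_line` and `sharp_rdd`, the bandwidth of 10, and the data-generating process are illustrative assumptions and are not part of Causal Agent.

```python
import random

def fit_line(us, ys):
    """Ordinary least squares of y on u; returns (intercept, slope)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def sharp_rdd(x, y, cutoff, bandwidth):
    """Difference between the right and left linear fits evaluated at the cutoff."""
    # Centering the running variable makes each intercept the fitted value at the cutoff
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff <= xi <= cutoff + bandwidth]
    left_at_cutoff, _ = fit_line(*zip(*left))
    right_at_cutoff, _ = fit_line(*zip(*right))
    return right_at_cutoff - left_at_cutoff

# Simulated data (illustrative): the true jump at the cutoff of 70 is 5 points
rng = random.Random(0)
x = [rng.uniform(40, 100) for _ in range(6000)]
y = [20 + 0.5 * xi + 5.0 * (xi >= 70) + rng.gauss(0, 2) for xi in x]

tau_hat = sharp_rdd(x, y, cutoff=70, bandwidth=10)
print(f"Estimated jump at the cutoff: {tau_hat:.2f}")
```

Because the data are simulated with a known jump, the printed estimate should land near 5, with sampling noise that shrinks as more observations fall inside the bandwidth.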

Key Assumptions

Continuity of Potential Outcomes

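Definition: The expected potential outcomes are continuous in the running variable at the cutoff, so any jump in the observed outcome can be attributed to treatment.

Mathematical: $\lim_{x \to c^+} E[Y_i(0) \mid X_i = x] = \lim_{x \to c^-} E[Y_i(0) \mid X_i = x]$, and analogously for $Y_i(1)$.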

Why it matters: This is the core identifying assumption that allows causal interpretation.

Testing: The assumption cannot be tested directly. As indirect evidence, check for discontinuities in predetermined covariates at the cutoff.

No Precise Manipulation

Definition: Units cannot precisely control their value of the running variable around the cutoff.

Why it matters: If units can manipulate assignment, selection bias is reintroduced.

Testing: McCrary density test for discontinuities in the density of the running variable.

No Other Discontinuities

Definition: No other treatments or interventions change discontinuously at the same cutoff.

Why it matters: Other discontinuous changes would confound the treatment effect.

Testing: Examine institutional rules and policy changes around the cutoff.

Types of RDD

Sharp RDD

Characteristics:
- Treatment assignment is deterministic based on the running variable
- All units above the cutoff receive treatment, all below do not
- Simpler analysis and interpretation

Estimation:
- Compare outcomes just above and below the cutoff
- Use local linear regression or other nonparametric methods
- Focus on observations within the optimal bandwidth
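Local linear regression is often combined with kernel weights so that observations closer to the cutoff count more. The sketch below uses a triangular kernel, a common choice, with a hand-picked bandwidth of 8. The function names, the bandwidth, and the simulated data are assumptions for illustration, and no data-driven bandwidth selection is attempted here.

```python
import random

def weighted_line(us, ys, ws):
    """Weighted least squares of y on u; returns (intercept, slope)."""
    total = sum(ws)
    mean_u = sum(w * u for w, u in zip(ws, us)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    suu = sum(w * (u - mean_u) ** 2 for w, u in zip(ws, us))
    suy = sum(w * (u - mean_u) * (y - mean_y) for w, u, y in zip(ws, us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def kernel_rdd(x, y, cutoff, bandwidth):
    """Local linear jump estimate with triangular kernel weights."""
    sides = {"left": ([], [], []), "right": ([], [], [])}
    for xi, yi in zip(x, y):
        u = xi - cutoff
        w = 1.0 - abs(u) / bandwidth  # weight falls linearly to 0 at the bandwidth edge
        if w <= 0:
            continue  # observation lies outside the bandwidth
        us, ys, ws = sides["right" if u >= 0 else "left"]
        us.append(u)
        ys.append(yi)
        ws.append(w)
    left_at_cutoff, _ = weighted_line(*sides["left"])
    right_at_cutoff, _ = weighted_line(*sides["right"])
    return right_at_cutoff - left_at_cutoff

# Simulated data (illustrative): true jump of 3 at the cutoff of 70
rng = random.Random(3)
x = [rng.uniform(40, 100) for _ in range(6000)]
y = [50 + 0.4 * (xi - 70) + 3.0 * (xi >= 70) + rng.gauss(0, 1) for xi in x]

kernel_estimate = kernel_rdd(x, y, cutoff=70, bandwidth=8)
print(f"Triangular-kernel estimate: {kernel_estimate:.2f}")
```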

Fuzzy RDD

Characteristics:
- Treatment probability changes at the cutoff but assignment is not deterministic
- Some units above the cutoff don’t receive treatment (non-compliance)
- Some units below the cutoff receive treatment (always-takers)

Estimation:
- Use an instrumental variables approach
- Use the indicator for being above the cutoff as the instrument for treatment
- Estimates the Local Average Treatment Effect (LATE) for compliers

Two-Stage Approach:
First Stage: $D_i = \alpha_0 + \alpha_1 \mathbf{1}(X_i \geq c) + f(X_i) + \epsilon_i$
Second Stage: $Y_i = \beta_0 + \tau \hat{D}_i + g(X_i) + u_i$
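With a single binary instrument, the two-stage estimate can be read as a ratio: the jump in the outcome at the cutoff divided by the jump in treatment take-up. The sketch below computes both jumps with local linear fits and the same bandwidth. The helper names, the compliance rates (0.2 below, 0.8 above), and the simulated effect of 4 are assumptions chosen for illustration.

```python
import random

def fit_line(us, ys):
    """Ordinary least squares of y on u; returns (intercept, slope)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def jump_at_cutoff(x, v, cutoff, bandwidth):
    """Discontinuity in E[v | X] at the cutoff from linear fits on each side."""
    left = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff <= xi <= cutoff + bandwidth]
    return fit_line(*zip(*right))[0] - fit_line(*zip(*left))[0]

rng = random.Random(1)
x = [rng.uniform(40, 100) for _ in range(8000)]
# Imperfect compliance: take-up probability jumps from 0.2 to 0.8 at the cutoff
d = [1 if rng.random() < (0.8 if xi >= 70 else 0.2) else 0 for xi in x]
# Outcome with a true treatment effect of 4 for everyone (illustrative)
y = [10 + 0.3 * (xi - 70) + 4.0 * di + rng.gauss(0, 1) for xi, di in zip(x, d)]

first_stage = jump_at_cutoff(x, d, 70, 10)   # jump in treatment probability
reduced_form = jump_at_cutoff(x, y, 70, 10)  # jump in the outcome
late = reduced_form / first_stage            # Wald-type fuzzy RDD estimate

print(f"First-stage jump: {first_stage:.2f}")
print(f"LATE estimate: {late:.2f}")
```

A first-stage jump close to zero makes the ratio unstable, which is the weak-instrument problem in this setting.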

Implementation in Causal Agent

Sharp RDD Analysis

from causal_agent import CausalAgent

# Causal Agent automatically detects RDD design
agent = CausalAgent()
result = agent.analyze(
    data=rdd_data,
    treatment='above_cutoff',
    outcome='test_score',
    running_var='prior_score',
    cutoff_value=70
)

print(f"RDD Treatment Effect: {result.ate}")
print(f"95% Confidence Interval: {result.confidence_interval}")
print(f"Bandwidth used: {result.bandwidth}")

Fuzzy RDD Analysis

# Fuzzy RDD with imperfect compliance
result = agent.analyze(
    data=rdd_data,
    treatment='actually_treated',  # actual treatment received
    outcome='test_score',
    running_var='prior_score',
    cutoff_value=70,
    method='fuzzy_rdd'
)

print(f"LATE Estimate: {result.late}")
print(f"First-stage jump: {result.first_stage_jump}")

Bandwidth Selection

# Custom bandwidth selection
result = agent.analyze(
    data=rdd_data,
    treatment='above_cutoff',
    outcome='test_score',
    running_var='prior_score',
    cutoff_value=70,
    bandwidth_method='optimal',  # or 'cross_validation', 'rule_of_thumb'
    bandwidth_value=5.0  # manual bandwidth
)

Diagnostic Tests and Validation

Manipulation Testing

Test whether units can precisely manipulate the running variable:

# McCrary density test
manipulation_test = agent.mccrary_test(
    data=rdd_data,
    running_var='prior_score',
    cutoff_value=70
)

print(f"McCrary test p-value: {manipulation_test.p_value}")
print(f"Density discontinuity: {manipulation_test.discontinuity}")

What to look for:
- Non-significant p-value (no evidence of manipulation)
- Smooth density around the cutoff
- No unusual bunching just above or below the cutoff
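To build intuition for what a density test detects, the sketch below runs a much simpler check than the McCrary test: it counts observations in equal-width windows on either side of the cutoff and compares the counts with a normal approximation to the binomial. This is a stand-in for illustration, not the McCrary procedure; the function name, the window width, and the constructed data are assumptions.

```python
import math

def density_jump_test(x, cutoff, window):
    """Compare counts just below and just above the cutoff.

    If the density is smooth, an observation in the combined window is about
    equally likely to fall on either side, so (n_right - n_left) is roughly
    normal with variance n_left + n_right.
    """
    n_left = sum(1 for xi in x if cutoff - window <= xi < cutoff)
    n_right = sum(1 for xi in x if cutoff <= xi < cutoff + window)
    z = (n_right - n_left) / math.sqrt(n_left + n_right)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return n_left, n_right, p_value

# Evenly spaced scores from 40.00 to 99.99: no bunching by construction
smooth = [i / 100 for i in range(4000, 10000)]
# Manipulated version: everyone scoring in [69, 70) is bumped up to exactly 70
manipulated = [70.0 if 69 <= xi < 70 else xi for xi in smooth]

_, _, p_smooth = density_jump_test(smooth, cutoff=70, window=2)
n_left, n_right, p_manip = density_jump_test(manipulated, cutoff=70, window=2)

print(f"Smooth scores: p = {p_smooth:.3f}")
print(f"Manipulated scores: {n_left} below vs {n_right} above, p = {p_manip:.2e}")
```

The count comparison ignores any slope in the density near the cutoff, which is one reason to prefer a local polynomial density test in practice.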

Covariate Balance Testing

Check for discontinuities in predetermined characteristics:

# Test balance of covariates at cutoff
balance_test = agent.covariate_balance_rdd(
    data=rdd_data,
    covariates=['age', 'gender', 'socioeconomic_status'],
    running_var='prior_score',
    cutoff_value=70
)

print("Covariate balance results:")
for var, test in balance_test.items():  # 'test' avoids overwriting the earlier 'result'
    print(f"{var}: discontinuity = {test.discontinuity:.3f}, p = {test.p_value:.3f}")

Interpretation:
- Non-significant discontinuities support validity
- Significant jumps suggest potential confounding
- A pattern of imbalances may indicate manipulation
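Conceptually, a covariate balance check is the same RDD estimate with a predetermined covariate in place of the outcome. The sketch below applies a local linear jump estimate to two simulated covariates that are smooth at the cutoff by construction. The helper names and the simulated covariates are assumptions and do not reflect how `covariate_balance_rdd` is implemented.

```python
import random

def fit_line(us, ys):
    """Ordinary least squares of y on u; returns (intercept, slope)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def jump_at_cutoff(x, v, cutoff, bandwidth):
    """Discontinuity in E[v | X] at the cutoff from linear fits on each side."""
    left = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff <= xi <= cutoff + bandwidth]
    return fit_line(*zip(*right))[0] - fit_line(*zip(*left))[0]

rng = random.Random(2)
x = [rng.uniform(40, 100) for _ in range(6000)]
# Predetermined covariates (illustrative): neither changes at the cutoff
covariates = {
    "age": [15 + 0.02 * (xi - 70) + rng.gauss(0, 0.5) for xi in x],
    "free_lunch": [1 if rng.random() < 0.4 else 0 for _ in x],
}

jumps = {name: jump_at_cutoff(x, values, 70, 10) for name, values in covariates.items()}
for name, jump in jumps.items():
    print(f"{name}: estimated discontinuity = {jump:.3f}")
```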

Bandwidth Sensitivity

Test robustness to bandwidth choice:

# Sensitivity to bandwidth selection
bandwidth_sensitivity = agent.bandwidth_sensitivity(
    data=rdd_data,
    treatment='above_cutoff',
    outcome='test_score',
    running_var='prior_score',
    cutoff_value=70,
    bandwidth_range=[2, 3, 4, 5, 6, 7, 8]
)

print("Bandwidth sensitivity results:")
for bw, estimate in bandwidth_sensitivity.items():
    print(f"Bandwidth {bw}: Effect = {estimate.effect:.3f} (SE = {estimate.se:.3f})")
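The idea behind a bandwidth sensitivity check can be sketched without the library: re-estimate the jump over a grid of bandwidths and see whether the estimates stay close to each other. The helper names and the simulated data (true jump of 3) below are illustrative assumptions.

```python
import random

def fit_line(us, ys):
    """Ordinary least squares of y on u; returns (intercept, slope)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def jump_at_cutoff(x, v, cutoff, bandwidth):
    """Discontinuity in E[v | X] at the cutoff from linear fits on each side."""
    left = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff <= xi <= cutoff + bandwidth]
    return fit_line(*zip(*right))[0] - fit_line(*zip(*left))[0]

rng = random.Random(4)
x = [rng.uniform(40, 100) for _ in range(12000)]
y = [50 + 0.4 * (xi - 70) + 3.0 * (xi >= 70) + rng.gauss(0, 1) for xi in x]

# Narrow bandwidths use fewer observations (noisier); wide ones lean more on linearity
estimates = {bw: jump_at_cutoff(x, y, 70, bw) for bw in [2, 3, 4, 5, 6, 7, 8]}
for bw, estimate in estimates.items():
    print(f"Bandwidth {bw}: Effect = {estimate:.3f}")
```

Estimates that swing widely or change sign across reasonable bandwidths are a warning that the result depends on the specification.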

Placebo Cutoff Tests

Test for treatment effects at fake cutoffs:

# Placebo tests at alternative cutoffs
placebo_tests = agent.placebo_cutoff_tests(
    data=rdd_data,
    treatment='above_cutoff',
    outcome='test_score',
    running_var='prior_score',
    true_cutoff=70,
    placebo_cutoffs=[65, 67.5, 72.5, 75]
)

print("Placebo test results:")
for cutoff, test in placebo_tests.items():  # 'test' avoids overwriting the earlier 'result'
    print(f"Cutoff {cutoff}: Effect = {test.effect:.3f}, p = {test.p_value:.3f}")

Interpretation:
- Non-significant effects at placebo cutoffs support validity
- Significant effects suggest confounding or model misspecification
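One detail worth making explicit is that a placebo estimate should use only observations on one side of the true cutoff, so the real jump cannot contaminate it. The sketch below does this for the placebo cutoffs used above. The helper names, the bandwidth of 2.5, and the simulated data (true jump of 4 at 70) are assumptions for illustration.

```python
import random

def fit_line(us, ys):
    """Ordinary least squares of y on u; returns (intercept, slope)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def jump_at_cutoff(x, v, cutoff, bandwidth):
    """Discontinuity in E[v | X] at the cutoff from linear fits on each side."""
    left = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff <= xi <= cutoff + bandwidth]
    return fit_line(*zip(*right))[0] - fit_line(*zip(*left))[0]

def placebo_jump(x, y, true_cutoff, placebo, bandwidth):
    """Jump at a fake cutoff, using only data on the placebo's side of the true cutoff."""
    if placebo < true_cutoff:
        pairs = [(xi, yi) for xi, yi in zip(x, y) if xi < true_cutoff]
    else:
        pairs = [(xi, yi) for xi, yi in zip(x, y) if xi >= true_cutoff]
    xs, ys = zip(*pairs)
    return jump_at_cutoff(xs, ys, placebo, bandwidth)

rng = random.Random(5)
x = [rng.uniform(40, 100) for _ in range(12000)]
y = [50 + 0.4 * (xi - 70) + 4.0 * (xi >= 70) + rng.gauss(0, 1) for xi in x]

true_jump = jump_at_cutoff(x, y, 70, 2.5)
placebo_jumps = {c: placebo_jump(x, y, 70, c, 2.5) for c in [65, 67.5, 72.5, 75]}

print(f"True cutoff 70: Effect = {true_jump:.3f}")
for cutoff, jump in placebo_jumps.items():
    print(f"Placebo cutoff {cutoff}: Effect = {jump:.3f}")
```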

Functional Form Testing

Test sensitivity to polynomial order and functional form:

# Test different polynomial orders
functional_form_test = agent.functional_form_sensitivity(
    data=rdd_data,
    treatment='above_cutoff',
    outcome='test_score',
    running_var='prior_score',
    cutoff_value=70,
    polynomial_orders=[1, 2, 3, 4]
)
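As a rough picture of what a functional form check involves, the sketch below fits polynomials of order 1 to 4 on each side of the cutoff and reports the implied jump for each order. The small least-squares solver, the function names, the bandwidth of 15, and the simulated data (true jump of 5 with mild curvature) are all assumptions for illustration.

```python
import random

def poly_intercept(us, ys, order):
    """Least-squares polynomial fit of y on u; returns the fitted value at u = 0."""
    k = order + 1
    # Normal equations A b = r, with A[i][j] = sum(u^(i+j)) and r[i] = sum(u^i * y)
    A = [[sum(u ** (i + j) for u in us) for j in range(k)] for i in range(k)]
    r = [sum((u ** i) * y for u, y in zip(us, ys)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        r[col], r[pivot] = r[pivot], r[col]
        for row in range(col + 1, k):
            factor = A[row][col] / A[col][col]
            for j in range(col, k):
                A[row][j] -= factor * A[col][j]
            r[row] -= factor * r[col]
    # Back substitution
    coef = [0.0] * k
    for i in reversed(range(k)):
        coef[i] = (r[i] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    return coef[0]

def poly_rdd(x, y, cutoff, bandwidth, order):
    """Jump at the cutoff from separate polynomial fits on each side."""
    left_u, left_y, right_u, right_y = [], [], [], []
    for xi, yi in zip(x, y):
        u = (xi - cutoff) / bandwidth  # rescale to [-1, 1] for numerical stability
        if -1 <= u < 0:
            left_u.append(u)
            left_y.append(yi)
        elif 0 <= u <= 1:
            right_u.append(u)
            right_y.append(yi)
    return poly_intercept(right_u, right_y, order) - poly_intercept(left_u, left_y, order)

rng = random.Random(6)
x = [rng.uniform(40, 100) for _ in range(8000)]
y = [30 + 0.5 * (xi - 70) + 0.01 * (xi - 70) ** 2 + 5.0 * (xi >= 70) + rng.gauss(0, 1)
     for xi in x]

order_estimates = {p: poly_rdd(x, y, 70, 15, p) for p in [1, 2, 3, 4]}
for p, estimate in order_estimates.items():
    print(f"Polynomial order {p}: Effect = {estimate:.3f}")
```

Higher orders typically give noisier estimates at the boundary, which is part of the reason low-order local fits are usually preferred.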

Best Practices

Design and Data Collection

Running Variable Selection:
- Choose variables that determine treatment assignment
- Ensure precise measurement around the cutoff
- Document assignment rules clearly
- Consider multiple running variables if relevant

Sample Size Planning:
- Focus observations near the cutoff
- Ensure adequate power for local effects
- Consider optimal sample allocation
- Plan for potential manipulation

Data Quality:
- Verify assignment rule implementation
- Check for measurement error in the running variable
- Document any exceptions or overrides
- Collect rich covariate data for validation

Analysis Implementation

Bandwidth Selection:
- Use data-driven optimal bandwidth methods
- Report sensitivity to bandwidth choice
- Consider different bandwidths for different outcomes
- Balance the bias-variance tradeoff

Functional Form:
- Start with local linear regression
- Test sensitivity to polynomial order
- Consider nonparametric methods
- Avoid overfitting with high-order polynomials

Standard Errors:
- Use robust standard errors
- Consider clustering if appropriate
- Account for bandwidth selection uncertainty
- Report confidence intervals
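As one way to make "robust standard errors" concrete, the sketch below computes a heteroskedasticity-robust (HC0) variance for each side's intercept and combines them into a standard error and a 95% interval for the jump. It treats the bandwidth as fixed, so it does not account for bandwidth selection uncertainty or smoothing bias. The function names and the simulated data, whose noise grows with distance from the cutoff, are assumptions.

```python
import math
import random

def intercept_with_hc_variance(us, ys):
    """OLS intercept of y on u and its heteroskedasticity-robust (HC0) variance."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    slope = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys)) / suu
    intercept = mean_y - slope * mean_u
    variance = 0.0
    for u, y in zip(us, ys):
        weight = 1.0 / n - mean_u * (u - mean_u) / suu  # intercept = sum(weight * y)
        residual = y - intercept - slope * u
        variance += (weight * residual) ** 2
    return intercept, variance

def rdd_with_ci(x, y, cutoff, bandwidth):
    """Local linear jump estimate, robust standard error, and 95% interval."""
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff <= xi <= cutoff + bandwidth]
    a_left, var_left = intercept_with_hc_variance(*zip(*left))
    a_right, var_right = intercept_with_hc_variance(*zip(*right))
    estimate = a_right - a_left
    se = math.sqrt(var_left + var_right)  # the two sides use disjoint observations
    return estimate, se, (estimate - 1.96 * se, estimate + 1.96 * se)

rng = random.Random(7)
x = [rng.uniform(40, 100) for _ in range(6000)]
# Noise standard deviation grows with distance from the cutoff (heteroskedastic)
y = [20 + 0.5 * xi + 5.0 * (xi >= 70) + rng.gauss(0, 1 + 0.1 * abs(xi - 70)) for xi in x]

tau_hat, se, (ci_low, ci_high) = rdd_with_ci(x, y, cutoff=70, bandwidth=10)
print(f"Effect = {tau_hat:.2f}, robust SE = {se:.3f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```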

Validation and Robustness

Assumption Testing:
- Always conduct manipulation tests
- Check covariate balance at the cutoff
- Test for other discontinuities
- Examine institutional details

Sensitivity Analysis:
- Vary bandwidth systematically
- Test different functional forms
- Exclude observations very close to the cutoff
- Use alternative estimation methods

Transparency:
- Report all diagnostic tests
- Show graphical evidence
- Discuss institutional context
- Acknowledge limitations

Common Pitfalls and Solutions

Pitfall: Using an inappropriate bandwidth
Solution: Use optimal bandwidth methods and test sensitivity

Pitfall: Ignoring manipulation possibilities
Solution: Always conduct McCrary tests and examine institutional incentives

Pitfall: Overfitting with high-order polynomials
Solution: Use local linear regression and test functional form sensitivity

Pitfall: Misinterpreting local effects as global
Solution: Clearly state that RDD estimates local effects at the cutoff

Pitfall: Inadequate sample size near the cutoff
Solution: Focus data collection near the cutoff and conduct power analysis

Example: Educational Remediation Program

Research Question: What is the effect of mandatory tutoring on student achievement?

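Setting: Students with test scores below 70 are required to attend tutoring. Treatment is therefore assigned below the cutoff, the reverse of the $X_i \geq c$ convention used in the theoretical setup, so the effect is the outcome just below the cutoff minus the outcome just above it.
- Running Variable: Prior test score (0-100)
- Cutoff: Score of 70
- Treatment: Mandatory tutoring participation
- Outcome: End-of-year test score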

Analysis:

# Sharp RDD analysis
result = agent.analyze(
    data=education_rdd,
    treatment='mandatory_tutoring',
    outcome='end_year_score',
    running_var='prior_test_score',
    cutoff_value=70
)

# Validation tests
manipulation_test = agent.mccrary_test(
    data=education_rdd,
    running_var='prior_test_score',
    cutoff_value=70
)

balance_test = agent.covariate_balance_rdd(
    data=education_rdd,
    covariates=['age', 'gender', 'free_lunch'],
    running_var='prior_test_score',
    cutoff_value=70
)

print(f"RDD Treatment Effect: {result.ate:.2f} points")
print(f"95% CI: [{result.ci_lower:.2f}, {result.ci_upper:.2f}]")
print(f"McCrary test p-value: {manipulation_test.p_value:.3f}")

Results Interpretation: Students just below the cutoff (required to attend tutoring) scored X points higher on the end-of-year test compared to students just above the cutoff. The McCrary test shows no evidence of score manipulation (p = 0.XX).
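Because tutoring is assigned to students below the cutoff, the sign convention is easy to get wrong. The sketch below simulates this setting with an assumed tutoring effect of 3 points and computes the effect as the fitted value just below 70 minus the fitted value just above. All names and numbers are hypothetical and are not results from real data.

```python
import random

def fit_line(us, ys):
    """Ordinary least squares of y on u; returns (intercept, slope)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def effect_of_treatment_below(x, y, cutoff, bandwidth):
    """Treated (below cutoff) minus untreated (above cutoff) fitted value at the cutoff."""
    below = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff - bandwidth <= xi < cutoff]
    above = [(xi - cutoff, yi) for xi, yi in zip(x, y) if cutoff <= xi <= cutoff + bandwidth]
    treated_at_cutoff, _ = fit_line(*zip(*below))
    untreated_at_cutoff, _ = fit_line(*zip(*above))
    return treated_at_cutoff - untreated_at_cutoff

rng = random.Random(8)
prior_score = [rng.uniform(40, 100) for _ in range(6000)]
tutoring = [1 if s < 70 else 0 for s in prior_score]  # sharp rule: required below 70
end_year_score = [
    25 + 0.6 * s + 3.0 * t + rng.gauss(0, 2)  # assumed tutoring effect: 3 points
    for s, t in zip(prior_score, tutoring)
]

tutoring_effect = effect_of_treatment_below(prior_score, end_year_score, cutoff=70, bandwidth=10)
print(f"Estimated tutoring effect: {tutoring_effect:.2f} points")
```

Taking the difference in the other direction (above minus below) would report the same magnitude with the opposite sign.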

Advanced RDD Methods

Multi-Cutoff RDD

When multiple cutoffs exist:

# Multiple cutoffs analysis
result = agent.analyze(
    data=multi_cutoff_data,
    treatment='treatment_intensity',
    outcome='outcome_var',
    running_var='score',
    cutoff_values=[50, 70, 85],
    method='multi_cutoff_rdd'
)

Geographic RDD

Using geographic boundaries as cutoffs:

# Geographic discontinuity
result = agent.analyze(
    data=geographic_data,
    treatment='policy_exposure',
    outcome='outcome_var',
    running_var='distance_to_boundary',
    cutoff_value=0,
    method='geographic_rdd'
)

Regression Kink Design

When the slope of treatment intensity (rather than the probability of treatment) changes at the cutoff:

# Regression kink design
result = agent.analyze(
    data=kink_data,
    treatment='treatment_intensity',
    outcome='outcome_var',
    running_var='eligibility_score',
    cutoff_value=75,
    method='regression_kink'
)
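In a regression kink design the estimate is a ratio of slope changes: the change in the slope of the outcome at the kink divided by the change in the slope of treatment intensity. The sketch below illustrates this with a simulated benefit schedule that is flat below 75 and rises by 0.5 per point above it. The names, the schedule, and the assumed effect of 2 per unit of benefit are illustrative and do not describe how `method='regression_kink'` is implemented.

```python
import random

def fit_line(us, ys):
    """Ordinary least squares of y on u; returns (intercept, slope)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_y = sum(ys) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    slope = suy / suu
    return mean_y - slope * mean_u, slope

def slope_change(x, v, cutoff, bandwidth):
    """Change in the slope of E[v | X] at the cutoff (right slope minus left slope)."""
    left = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, vi) for xi, vi in zip(x, v) if cutoff <= xi <= cutoff + bandwidth]
    return fit_line(*zip(*right))[1] - fit_line(*zip(*left))[1]

rng = random.Random(9)
x = [rng.uniform(40, 100) for _ in range(6000)]
# Treatment intensity: zero below 75, then increasing by 0.5 per point (a kink, no jump)
benefit = [0.5 * max(0.0, xi - 75) for xi in x]
# Outcome: each unit of benefit raises the outcome by 2 (illustrative)
y = [10 + 0.2 * (xi - 75) + 2.0 * b + rng.gauss(0, 1) for xi, b in zip(x, benefit)]

kink_in_treatment = slope_change(x, benefit, 75, 10)
kink_in_outcome = slope_change(x, y, 75, 10)
kink_effect = kink_in_outcome / kink_in_treatment

print(f"Slope change in treatment: {kink_in_treatment:.3f}")
print(f"Estimated effect per unit of treatment: {kink_effect:.2f}")
```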