Your First Causal Analysis
==========================

This guide walks you through your first causal analysis with CAIS and explains both the technical steps and the causal inference concepts behind them.

.. contents:: Contents
   :local:
   :depth: 3

Understanding Causal Questions
------------------------------

Before diving into the analysis, it's important to understand what makes a good causal question and how CAIS approaches causal inference.

What is Causal Inference?
~~~~~~~~~~~~~~~~~~~~~~~~~~

Causal inference goes beyond correlation to answer questions like:

* **Does X cause Y?** (e.g., "Does education cause higher income?")
* **What would happen if...?** (e.g., "What would happen to crime rates if we increased police presence?")
* **How much of an effect?** (e.g., "By how much does job training increase earnings?")

**Key Concept:** Correlation ≠ Causation

.. code-block:: python

   # Example: Ice cream sales and drowning deaths are correlated
   # But ice cream doesn't cause drowning - both increase in summer!
   
   # CAIS helps identify true causal relationships by:
   # 1. Selecting appropriate methods
   # 2. Controlling for confounding variables
   # 3. Testing assumptions
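
The ice cream example can be made concrete with a small simulation. The sketch below uses only the standard library and invented numbers (the coefficients, noise levels, and variable names are illustrative assumptions, not CAIS output). Temperature drives both series, so they correlate strongly until temperature is adjusted for.

.. code-block:: python

   import random

   def pearson(xs, ys):
       """Pearson correlation coefficient of two equal-length sequences."""
       n = len(xs)
       mean_x, mean_y = sum(xs) / n, sum(ys) / n
       cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
       var_x = sum((x - mean_x) ** 2 for x in xs)
       var_y = sum((y - mean_y) ** 2 for y in ys)
       return cov / (var_x * var_y) ** 0.5

   def residuals(ys, xs):
       """Residuals from a least-squares line of ys on xs (removes the linear effect of xs)."""
       n = len(xs)
       mean_x, mean_y = sum(xs) / n, sum(ys) / n
       slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                / sum((x - mean_x) ** 2 for x in xs))
       return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]

   rng = random.Random(0)  # fixed seed so the run is reproducible
   temperature = [rng.uniform(0, 35) for _ in range(2000)]
   # Both series depend on temperature; neither depends on the other (simulated data).
   ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]
   drownings = [0.3 * t + rng.gauss(0, 2) for t in temperature]

   raw_corr = pearson(ice_cream_sales, drownings)
   adjusted_corr = pearson(residuals(ice_cream_sales, temperature),
                           residuals(drownings, temperature))
   print(f"Raw correlation: {raw_corr:.2f}")
   print(f"Correlation after adjusting for temperature: {adjusted_corr:.2f}")

The raw correlation is large while the adjusted one sits close to zero, which is the behavior that controlling for a confounder is meant to produce.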

Good vs. Poor Causal Questions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Good Causal Questions:**

.. code-block:: text

   ✓ "Does participating in job training increase income?"
   ✓ "What is the effect of class size on student performance?"
   ✓ "How does a new drug affect patient recovery time?"

**Poor Causal Questions:**

.. code-block:: text

   ✗ "What factors are associated with income?" (descriptive, not causal)
   ✗ "Is there a relationship between X and Y?" (correlation question)
   ✗ "What predicts outcome Z?" (prediction, not causation)

Step-by-Step Analysis Walkthrough
----------------------------------

Let's work through a complete analysis using a real-world example: the effect of education on earnings.

Step 1: Problem Setup
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import pandas as pd
   from causal_agent import run_causal_analysis
   import os
   
   # Set up your environment
   os.environ['OPENAI_API_KEY'] = 'your-api-key-here'
   
   # Define our causal question
   causal_question = "What is the effect of college education on annual income?"
   
   print(f"Research Question: {causal_question}")
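
Hard-coding a key in a script is convenient for a first run but easy to leak. A minimal sketch of an alternative, assuming you export ``OPENAI_API_KEY`` in your shell beforehand; ``require_env`` is a hypothetical helper written for this guide, not part of CAIS.

.. code-block:: python

   import os

   def require_env(name, placeholder="your-api-key-here"):
       """Return an environment variable's value, or raise if it is unset or a placeholder.

       Hypothetical helper for illustration; not a CAIS function.
       """
       value = os.environ.get(name, "").strip()
       if not value or value == placeholder:
           raise RuntimeError(f"Set the {name} environment variable before running the analysis")
       return value

Calling ``require_env('OPENAI_API_KEY')`` at the top of a script fails early with a clear message instead of partway through an analysis.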

Step 2: Data Preparation
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Load sample education-income dataset
   # (In practice, you'd load your own data)
   data = pd.read_csv('data/all_data/education_income.csv')
   
   # Examine the data structure
   print("Dataset Overview:")
   print(f"Shape: {data.shape}")
   print(f"Columns: {list(data.columns)}")
   print("\nFirst few rows:")
   print(data.head())
   
   # Check for missing values
   print("\nMissing values:")
   print(data.isnull().sum())

**Expected Output:**

.. code-block:: text

   Dataset Overview:
   Shape: (2500, 12)
   Columns: ['person_id', 'college_degree', 'annual_income', 'age', 'gender', 
            'work_experience', 'industry', 'region', 'family_background', 
            'cognitive_ability', 'motivation', 'health_status']
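
Before handing the data to CAIS, it can help to compute the naive difference in mean outcome between the two groups as a baseline to compare against later. The sketch below works on plain lists and assumes the treatment is coded 0/1; ``naive_difference`` is a hypothetical helper, and with the DataFrame above you might pass ``data['college_degree'].tolist()`` and ``data['annual_income'].tolist()``.

.. code-block:: python

   from statistics import mean

   def naive_difference(treatment, outcome):
       """Mean outcome of treated units (1) minus mean outcome of control units (0).

       Hypothetical helper; assumes a 0/1 treatment coding.
       """
       treated = [y for t, y in zip(treatment, outcome) if t == 1]
       control = [y for t, y in zip(treatment, outcome) if t == 0]
       if not treated or not control:
           raise ValueError("Both treated and control groups must be non-empty")
       return mean(treated) - mean(control)

   # Toy values for illustration only
   toy_treatment = [1, 1, 0, 0]
   toy_income = [60000, 70000, 40000, 50000]
   print(naive_difference(toy_treatment, toy_income))  # 65000 - 45000

This baseline ignores confounders such as ``family_background``, so a gap between it and an adjusted estimate is expected rather than alarming.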

Step 3: Running the Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Provide a detailed dataset description
   dataset_description = """
   This dataset contains information about working adults and their educational and 
   economic outcomes. Each row represents one individual with the following key variables:
   
   - college_degree: Whether the person has a college degree (treatment variable)
   - annual_income: Person's annual income in dollars (outcome variable)
   - age, gender, work_experience: Demographic controls
   - industry, region: Economic context variables
   - family_background: Socioeconomic background (potential confounder)
   - cognitive_ability, motivation: Individual characteristics (potential confounders)
   - health_status: Health-related factors
   
   The data comes from a longitudinal survey of working adults aged 25-65.
   """
   
   # Run the causal analysis
   print("Running causal analysis...")
   result = run_causal_analysis(
       query=causal_question,
       dataset_path='data/all_data/education_income.csv',
       dataset_description=dataset_description
   )
   
   print("Analysis completed!")

Step 4: Understanding the Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Extract key results
   method_used = result['results']['results']['method_used']
   treatment_var = result['results']['variables']['treatment_variable']
   outcome_var = result['results']['variables']['outcome_variable']
   effect_estimate = result['results']['results']['effect_estimate']
   std_error = result['results']['results']['standard_error']
   p_value = result['results']['results']['p_value']
   confidence_interval = result['results']['results'].get('confidence_interval', 'Not available')

   
   print("=== ANALYSIS RESULTS ===")
   print(f"Method Selected: {method_used}")
   print(f"Treatment Variable: {treatment_var}")
   print(f"Outcome Variable: {outcome_var}")
   print(f"Causal Effect Estimate: ${effect_estimate:,.2f}")
   print(f"Standard Error: ${std_error:,.2f}")
   print(f"P-value: {p_value:.4f}")
   print(f"95% Confidence Interval: {confidence_interval}")
   
   # Statistical significance
   is_significant = p_value < 0.05
   print(f"Statistically Significant: {'Yes' if is_significant else 'No'}")
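
The chained lookups above raise ``KeyError`` as soon as one level is missing. A minimal sketch of a more tolerant accessor follows; ``get_nested`` is a hypothetical helper, and the example dictionary only mirrors the key layout used in this guide rather than a guaranteed CAIS schema.

.. code-block:: python

   def get_nested(mapping, *keys, default=None):
       """Walk nested dictionaries key by key, returning default if any key is missing.

       Hypothetical helper for illustration; not a CAIS function.
       """
       current = mapping
       for key in keys:
           if not isinstance(current, dict) or key not in current:
               return default
           current = current[key]
       return current

   # Example dictionary shaped like the lookups in this guide (values are made up)
   example_result = {"results": {"results": {"p_value": 0.03}}}
   print(get_nested(example_result, "results", "results", "p_value"))
   print(get_nested(example_result, "results", "results", "standard_error",
                    default="Not available"))

With the real ``result`` object you could write, for instance, ``get_nested(result, 'results', 'results', 'standard_error')`` and handle a ``None`` return explicitly.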

**Sample Output:**

.. code-block:: text

   === ANALYSIS RESULTS ===
   Method Selected: Propensity Score Matching
   Treatment Variable: college_degree
   Outcome Variable: annual_income
   Causal Effect Estimate: $18,450.00
   Standard Error: $2,340.00
   P-value: 0.0001
   95% Confidence Interval: [$13,864, $23,036]
   Statistically Significant: Yes
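
The interval in the sample output is consistent with a normal approximation, that is, the estimate plus or minus a critical value times the standard error. Whether CAIS computes its intervals this way for every method is an assumption; the sketch below only shows how to sanity-check reported numbers with the standard library.

.. code-block:: python

   import math
   from statistics import NormalDist

   def normal_ci(estimate, std_error, level=0.95):
       """Two-sided confidence interval under a normal approximation."""
       z = NormalDist().inv_cdf(0.5 + level / 2)
       margin = z * std_error
       return estimate - margin, estimate + margin

   def two_sided_p_value(estimate, std_error):
       """Two-sided p-value for H0: effect = 0 under a normal approximation."""
       z = abs(estimate / std_error)
       return math.erfc(z / math.sqrt(2))

   # Numbers taken from the sample output above
   lower, upper = normal_ci(18450, 2340)
   print(f"95% CI: [{lower:,.0f}, {upper:,.0f}]")
   print(f"z statistic: {18450 / 2340:.2f}")

Rounded to whole dollars, the bounds match the interval shown in the sample output. Matching-based estimators may use other variance formulas, so small discrepancies in real runs are not necessarily errors.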

Step 5: Interpreting the Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Get the AI interpretation
   interpretation = result['explanation']['final_explanation_text']
   print("\n=== AI INTERPRETATION ===")
   print(interpretation)
   
   # Manual interpretation guide
   print("\n=== INTERPRETATION GUIDE ===")
   
   if is_significant:
       print("✓ The analysis suggests that having a college degree increases")
       print(f"  annual income by approximately ${effect_estimate:,.0f},")
       print("  provided the assumptions of the selected method hold.")
       print(f"✓ This effect is statistically significant (p = {p_value:.4f} < 0.05).")
       print(f"✓ 95% confidence interval for the effect: {confidence_interval}.")
   else:
       print("✗ No statistically significant causal effect was found.")
       print(f"✗ The estimated effect of ${effect_estimate:,.0f} could be due to chance.")
   
   print(f"\n📊 Method Used: {method_used}")
   print("   This method was selected based on the characteristics of your data.")

Step 6: Examining Method Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Understanding why CAIS selected a particular method helps you trust and validate the results:

.. code-block:: python

   # Check method selection reasoning (if available)
   method_reasoning = result['results'].get('method_reasoning', 'Not provided')
   print(f"\n=== WHY THIS METHOD? ===")
   print(f"Selected Method: {method_used}")
   print(f"Reasoning: {method_reasoning}")
   
   # Common methods and when they're used:
   method_guide = {
       'Randomized Controlled Trial': 'Data appears to be from a randomized experiment',
       'Propensity Score Matching': 'Observational data where good covariate balance is achievable',
       'Difference-in-Differences': 'Panel data with treatment timing variation',
       'Instrumental Variables': 'Strong instrumental variable detected',
       'Regression Discontinuity': 'Sharp cutoff/threshold detected in treatment assignment',
       'Linear Regression': 'Simple observational data with controls'
   }
   
   if method_used in method_guide:
       print(f"Typical use case: {method_guide[method_used]}")
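
Since matching methods depend on covariate balance, one way to probe the selection yourself is the standardized mean difference (SMD) of a covariate between treated and control units. This is a minimal sketch with toy numbers; ``standardized_mean_difference`` is a hypothetical helper and not a CAIS diagnostic.

.. code-block:: python

   from statistics import mean, variance

   def standardized_mean_difference(treated_values, control_values):
       """Difference in group means divided by the pooled standard deviation.

       Hypothetical helper for illustration; not a CAIS function.
       """
       pooled_sd = ((variance(treated_values) + variance(control_values)) / 2) ** 0.5
       if pooled_sd == 0:
           raise ValueError("No variation in either group; SMD is undefined")
       return (mean(treated_values) - mean(control_values)) / pooled_sd

   # Toy ages for illustration only
   treated_age = [30, 35, 40, 45, 50]
   control_age = [32, 36, 41, 44, 52]
   smd = standardized_mean_difference(treated_age, control_age)
   print(f"SMD for age: {smd:.3f}")

A common rule of thumb treats absolute values below roughly 0.1 as well balanced; larger values suggest the groups differ on that covariate before any adjustment.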

Step 7: Validating Results
~~~~~~~~~~~~~~~~~~~~~~~~~~

Always validate your causal analysis results:

.. code-block:: python

   # Check for potential issues
   print("\n=== VALIDATION CHECKLIST ===")
   
   # 1. Effect size reasonableness
   print(f"1. Effect Size: ${effect_estimate:,.0f}")
   if abs(effect_estimate) > 100000:
       print("   ⚠️  Large effect - double-check data and units")
   else:
       print("   ✓ Reasonable effect size")
   
   # 2. Statistical significance
   print(f"2. P-value: {p_value:.4f}")
   if p_value < 0.05:
       print("   ✓ Statistically significant")
   else:
       print("   ⚠️  Not statistically significant")
   
   # 3. Sample size
   sample_size = len(data)
   print(f"3. Sample Size: {sample_size:,}")
   if sample_size < 100:
       print("   ⚠️  Small sample size - results may be unreliable")
   else:
       print("   ✓ Adequate sample size")
   
   # 4. Missing data
   missing_pct = (data.isnull().sum().sum() / (len(data) * len(data.columns))) * 100
   print(f"4. Missing Data: {missing_pct:.1f}%")
   if missing_pct > 10:
       print("   ⚠️  High missing data percentage")
   else:
       print("   ✓ Low missing data")
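
Total sample size can hide a very small treated or control group. The sketch below counts units per treatment value; ``check_group_sizes`` is a hypothetical helper and the threshold of 30 is an arbitrary illustrative choice, not a CAIS default.

.. code-block:: python

   from collections import Counter

   def check_group_sizes(treatment, min_per_group=30):
       """Count units per treatment value and collect groups smaller than min_per_group.

       Hypothetical helper; the default threshold is an illustrative assumption.
       """
       counts = Counter(treatment)
       small = {group: n for group, n in counts.items() if n < min_per_group}
       return dict(counts), small

   # Toy treatment column: 12 treated units, 188 controls
   counts, small_groups = check_group_sizes([1] * 12 + [0] * 188)
   print(f"Group sizes: {counts}")
   if small_groups:
       print(f"   ⚠️  Small groups: {small_groups}")

With the dataset from Step 2 you might pass ``data['college_degree'].tolist()``.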

Common Patterns and What They Mean
-----------------------------------

Different Types of Results
~~~~~~~~~~~~~~~~~~~~~~~~~~

**Strong Positive Effect:**

.. code-block:: python

   # Example result interpretation
   if effect_estimate > 0 and p_value < 0.01:
       print("Strong evidence that treatment INCREASES the outcome")
       print("Effect is positive and highly statistically significant (p < 0.01)")

**No Effect:**

.. code-block:: python

   if abs(effect_estimate) < std_error and p_value > 0.05:
       print("No statistically significant evidence of a causal effect")
       print("The data cannot distinguish the treatment effect from zero")

**Negative Effect:**

.. code-block:: python

   if effect_estimate < 0 and p_value < 0.05:
       print("Evidence that treatment DECREASES the outcome")
       print("This could be beneficial or harmful depending on context")
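
The three patterns above can be folded into one function so that every result gets exactly one label. This is a sketch under simplified assumptions: ``classify_result`` is a hypothetical helper, and it treats any result with a p-value at or above ``alpha`` as inconclusive rather than as proof of no effect.

.. code-block:: python

   def classify_result(effect_estimate, p_value, alpha=0.05):
       """Label a result by sign and statistical significance.

       Hypothetical helper for illustration; not a CAIS function.
       """
       if p_value >= alpha:
           return "inconclusive"
       if effect_estimate > 0:
           return "positive"
       if effect_estimate < 0:
           return "negative"
       return "inconclusive"

   print(classify_result(18450, 0.0001))
   print(classify_result(-2.5, 0.03))
   print(classify_result(120, 0.40))

Statistical significance says nothing about practical importance, so the label is a starting point for interpretation, not a conclusion.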

Understanding Confidence Intervals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Interpreting confidence intervals
   def interpret_confidence_interval(ci_lower, ci_upper, effect):
       print(f"95% Confidence Interval: [{ci_lower:,.0f}, {ci_upper:,.0f}]")
       
       if ci_lower > 0:
           print("✓ We're confident the effect is positive")
       elif ci_upper < 0:
           print("✓ We're confident the effect is negative")
       else:
           print("⚠️  Confidence interval includes zero - effect uncertain")
       
       width = ci_upper - ci_lower
       precision = width / abs(effect) if effect != 0 else float('inf')
       
       if precision < 0.5:
           print("✓ Precise estimate (narrow confidence interval)")
       else:
           print("⚠️  Imprecise estimate (wide confidence interval)")
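
As a worked example, the same checks can be applied to the interval from the sample output in Step 4. The sketch returns values instead of printing messages so they can be reused; ``summarize_interval`` is a hypothetical helper.

.. code-block:: python

   def summarize_interval(ci_lower, ci_upper, effect):
       """Return whether the interval excludes zero and its width relative to the effect.

       Hypothetical helper for illustration; not a CAIS function.
       """
       excludes_zero = ci_lower > 0 or ci_upper < 0
       width = ci_upper - ci_lower
       relative_width = width / abs(effect) if effect != 0 else float("inf")
       return {"excludes_zero": excludes_zero, "relative_width": relative_width}

   # Bounds and estimate from the sample output in Step 4
   summary = summarize_interval(13864, 23036, 18450)
   print(summary["excludes_zero"])
   print(f"{summary['relative_width']:.3f}")

The relative width lands just under the 0.5 cut-off used above, which shows how sensitive a "precise" label is to where that threshold is set.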

Troubleshooting Common Issues
-----------------------------

Data Quality Issues
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Check for common data problems
   def diagnose_data_issues(data):
       issues = []
       
       # Check for constant variables
       for col in data.columns:
           if data[col].nunique() <= 1:
               issues.append(f"Variable '{col}' has no variation")
       
       # Check for extreme outliers via the ratio of the 99th to the 1st percentile
       numeric_cols = data.select_dtypes(include=['number']).columns
       for col in numeric_cols:
           q99 = data[col].quantile(0.99)
           q01 = data[col].quantile(0.01)
           # The ratio is only meaningful when q01 is positive (skips 0/1 and signed columns)
           if q01 > 0 and (q99 / q01) > 1000:
               issues.append(f"Variable '{col}' has extreme outliers")
       
       # Check sample size
       if len(data) < 50:
           issues.append("Sample size very small (< 50 observations)")
       
       return issues
   
   # Run diagnostics
   issues = diagnose_data_issues(data)
   if issues:
       print("⚠️  Data Quality Issues Found:")
       for issue in issues:
           print(f"   - {issue}")
   else:
       print("✓ No major data quality issues detected")
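
A percentile ratio cannot be used for columns that contain zeros or negative values. One alternative, sketched here with the standard library, flags values far outside the interquartile range; ``iqr_outliers`` is a hypothetical helper and the multiplier ``k=3.0`` is an illustrative choice.

.. code-block:: python

   from statistics import quantiles

   def iqr_outliers(values, k=3.0):
       """Return values lying more than k interquartile ranges outside the quartiles.

       Hypothetical helper; k is an illustrative assumption.
       """
       q1, _, q3 = quantiles(values, n=4)
       iqr = q3 - q1
       lower, upper = q1 - k * iqr, q3 + k * iqr
       return [v for v in values if v < lower or v > upper]

   # Toy column: the integers 1 to 20 plus one extreme value
   toy_values = list(range(1, 21)) + [1000]
   print(iqr_outliers(toy_values))

For a DataFrame column you might pass ``data['annual_income'].dropna().tolist()``.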

Method Selection Issues
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # If you disagree with method selection
   print("\n=== IF YOU DISAGREE WITH METHOD SELECTION ===")
   print("CAIS selected:", method_used)
   print("\nConsider these factors:")
   print("1. Is your data from a randomized experiment? → RCT methods")
   print("2. Do you have before/after data for treated and comparison groups? → Difference-in-Differences")
   print("3. Is there a sharp cutoff in treatment? → Regression Discontinuity")
   print("4. Do you have a valid instrument? → Instrumental Variables")
   print("5. Is this observational data? → Matching or Regression methods")
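
The questions above can be written as a small decision function for reasoning about your own data. This is a deliberately simplified sketch: ``suggest_method`` is a hypothetical helper, the priority order is an assumption made for illustration, and it does not reproduce how CAIS actually selects a method.

.. code-block:: python

   def suggest_method(randomized=False, sharp_cutoff=False, valid_instrument=False,
                      before_after_data=False):
       """Map yes/no answers about the study design onto a candidate method family.

       Hypothetical helper; the ordering of checks is an illustrative assumption.
       """
       if randomized:
           return "Randomized Controlled Trial"
       if sharp_cutoff:
           return "Regression Discontinuity"
       if valid_instrument:
           return "Instrumental Variables"
       if before_after_data:
           return "Difference-in-Differences"
       return "Matching or Regression"

   print(suggest_method(before_after_data=True))
   print(suggest_method())

If your answer differs from the method CAIS reported, compare the assumptions each method needs before deciding which estimate to trust.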

Result Interpretation Issues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Common interpretation mistakes
   print("\n=== COMMON INTERPRETATION MISTAKES ===")
   print("❌ 'X predicts Y' → ✅ 'X causes Y'")
   print("❌ 'X is associated with Y' → ✅ 'X has a causal effect on Y'")
   print("❌ 'The model shows...' → ✅ 'The causal analysis suggests...'")
   print("❌ Ignoring confidence intervals → ✅ Report uncertainty")
   print("❌ Over-interpreting small effects → ✅ Consider practical significance")

Next Steps and Advanced Topics
-------------------------------

Now that you've completed your first analysis, here are suggested next steps:

**Immediate Next Steps:**

1. **Try different datasets:** Experiment with various domains and data types
2. **Compare methods:** Run the same analysis with different method specifications
3. **Explore batch processing:** Analyze multiple datasets systematically
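
For batch processing, one possible pattern is a loop that records failures without stopping the whole run. In the sketch below, ``run_batch`` is a hypothetical helper and ``fake_analyze`` is a stand-in so the example runs without CAIS installed; in practice you might pass ``run_causal_analysis``, assuming it accepts the keyword arguments used in Step 3.

.. code-block:: python

   def run_batch(jobs, analyze):
       """Call analyze(**job) for each job, collecting results and errors separately.

       Hypothetical helper for illustration; not a CAIS function.
       """
       results, errors = {}, {}
       for name, job in jobs.items():
           try:
               results[name] = analyze(**job)
           except Exception as exc:  # keep going so one failure does not stop the batch
               errors[name] = str(exc)
       return results, errors

   def fake_analyze(query, dataset_path, dataset_description):
       """Stand-in for a real analysis call; only checks the file extension."""
       if not dataset_path.endswith(".csv"):
           raise ValueError(f"Unsupported file: {dataset_path}")
       return {"query": query, "dataset_path": dataset_path}

   jobs = {
       "education": {"query": "Effect of college education on income?",
                     "dataset_path": "education_income.csv",
                     "dataset_description": "Working adults survey"},
       "broken": {"query": "Placeholder question",
                  "dataset_path": "notes.txt",
                  "dataset_description": "Not a dataset"},
   }
   results, errors = run_batch(jobs, fake_analyze)
   print(sorted(results), sorted(errors))

Keeping errors in their own dictionary makes it easy to rerun only the jobs that failed.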

**Advanced Learning:**

1. **Method deep-dives:** :doc:`../methods/index` - Learn about each causal inference method
2. **Domain tutorials:** :doc:`../tutorials/index` - See examples in your field
3. **Configuration options:** :doc:`../user_guide/advanced_usage` - Customize CAIS behavior


Congratulations! You've completed your first comprehensive causal analysis with CAIS. You're now ready to tackle more complex analyses and explore advanced features.

Continue to :doc:`../user_guide/index` for advanced usage patterns, or explore :doc:`../tutorials/index` for domain-specific examples!