Basic Usage
This guide covers the fundamental workflows for conducting causal analysis with CAIS. Whether you’re new to causal inference or experienced with other tools, this section will help you get started with common analysis patterns.
Core Workflow
CAIS follows a structured workflow that guides you through the causal analysis process:
Input Parsing: Understanding your research question and dataset
Dataset Analysis: Examining data structure and variable types
Query Interpretation: Identifying treatment, outcome, and control variables
Method Selection: Choosing the appropriate causal inference method
Method Validation: Checking assumptions and prerequisites
Method Execution: Running the analysis with diagnostics
Result Interpretation: Generating explanations and insights
Output Formatting: Presenting results in a structured format
Python API Usage
Single Analysis
The most common use case is analyzing a single dataset with a specific causal question:
from causal_agent import run_causal_analysis
# Basic analysis
result = run_causal_analysis(
    query="What is the effect of job training on earnings?",
    dataset_path="data/lalonde_data.csv",
    dataset_description="LaLonde job training experiment data"
)
# Access key results
effect_estimate = result['results']['results']['effect_estimate']
method_used = result['results']['results']['method_used']
treatment_var = result['results']['variables']['treatment_variable']
outcome_var = result['results']['variables']['outcome_variable']
print(f"Method: {method_used}")
print(f"Effect of {treatment_var} on {outcome_var}: {effect_estimate}")
Understanding Results
CAIS returns a structured dictionary containing the estimate, the variables it identified, diagnostics, and a plain-language explanation:
# Example result structure
{
    'results': {
        'results': {
            'effect_estimate': 1794.34,
            'standard_error': 632.85,
            'confidence_interval': [554.95, 3033.73],
            'p_value': 0.0045,
            'method_used': 'propensity_score_matching'
        },
        'variables': {
            'treatment_variable': 'treat',
            'outcome_variable': 're78',
            'covariates': ['age', 'education', 'black', 'hispanic', 'married']
        },
        'diagnostics': {
            'balance_statistics': {...},
            'assumption_checks': {...}
        },
        'explanation': "The analysis found a significant positive effect..."
    }
}
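Because the values of interest sit two levels deep, a small helper can make scripts less brittle. The following is a minimal sketch, assuming the result has the shape shown above; `summarize_result` is a hypothetical helper and not part of CAIS, and the example dictionary simply reuses the values from the structure above.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> str:
    """Build a one-line summary from a result dictionary shaped like the example above.

    Hypothetical helper, not part of CAIS. Missing keys fall back to placeholders
    instead of raising KeyError.
    """
    payload = result.get("results", {})
    estimates = payload.get("results", {})
    variables = payload.get("variables", {})

    treatment = variables.get("treatment_variable", "<unknown treatment>")
    outcome = variables.get("outcome_variable", "<unknown outcome>")
    method = estimates.get("method_used", "<unknown method>")
    effect = estimates.get("effect_estimate")
    ci = estimates.get("confidence_interval")

    if effect is None:
        return f"No effect estimate available for {treatment} on {outcome}."

    summary = f"{method}: effect of {treatment} on {outcome} = {effect:.2f}"
    if ci is not None and len(ci) == 2:
        summary += f" (CI: {ci[0]:.2f} to {ci[1]:.2f})"
    return summary


# Example dictionary copied from the structure shown above (subset of keys).
example = {
    "results": {
        "results": {
            "effect_estimate": 1794.34,
            "confidence_interval": [554.95, 3033.73],
            "method_used": "propensity_score_matching",
        },
        "variables": {"treatment_variable": "treat", "outcome_variable": "re78"},
    }
}
print(summarize_result(example))
```

Using `.get()` with defaults keeps the summary usable even when an optional field such as the confidence interval is absent.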
Command Line Interface
For quick analyses or integration into scripts, use the CLI:
Single Analysis
# Basic command
causal_agent run data/lalonde_data.csv "What is the effect of job training on earnings?"
# With dataset description
causal_agent run data/lalonde_data.csv \
    "What is the effect of job training on earnings?" \
    --desc "LaLonde job training experiment with treatment and control groups"
# Specify LLM provider and model
causal_agent run data/lalonde_data.csv \
    "What is the effect of job training on earnings?" \
    --llm-provider anthropic \
    --llm-name claude-3-5-sonnet-latest
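When the CLI is driven from a Python script, it can help to assemble the argument list in one place. The sketch below uses only the subcommand and flags shown above (`run`, `--desc`, `--llm-provider`, `--llm-name`); `build_cli_command` is a hypothetical helper, and the block only prints the command rather than executing it.

```python
import shlex
from typing import List, Optional


def build_cli_command(
    dataset_path: str,
    query: str,
    desc: Optional[str] = None,
    llm_provider: Optional[str] = None,
    llm_name: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for `causal_agent run` (hypothetical helper)."""
    cmd = ["causal_agent", "run", dataset_path, query]
    if desc:
        cmd += ["--desc", desc]
    if llm_provider:
        cmd += ["--llm-provider", llm_provider]
    if llm_name:
        cmd += ["--llm-name", llm_name]
    return cmd


cmd = build_cli_command(
    "data/lalonde_data.csv",
    "What is the effect of job training on earnings?",
    llm_provider="anthropic",
    llm_name="claude-3-5-sonnet-latest",
)
# shlex.join quotes the query so the printed line can be pasted into a shell.
print(shlex.join(cmd))
# To actually run it, assuming the CLI is installed:
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with queries that contain spaces or question marks.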
Common Analysis Patterns
Experimental Data (RCT)
When you have randomized controlled trial data:
result = run_causal_analysis(
    query="What is the treatment effect in this randomized experiment?",
    dataset_path="data/rct_data.csv",
    dataset_description="Randomized controlled trial with treatment and control groups"
)
Observational Data
For observational studies where you need to control for confounders:
result = run_causal_analysis(
    query="What is the effect of education on income, controlling for background factors?",
    dataset_path="data/observational_data.csv",
    dataset_description="Survey data with education, income, and demographic variables"
)
Time Series / Panel Data
For difference-in-differences or other temporal analyses:
result = run_causal_analysis(
    query="What was the effect of the policy change on outcomes over time?",
    dataset_path="data/panel_data.csv",
    dataset_description="Panel data with pre/post policy implementation periods"
)
Instrumental Variables
When you have an instrument for causal identification:
result = run_causal_analysis(
    query="What is the effect of education on wages using distance to college as an instrument?",
    dataset_path="data/iv_data.csv",
    dataset_description="Data with education, wages, and distance to college as instrument"
)
Regression Discontinuity
For sharp cutoff designs:
result = run_causal_analysis(
    query="What is the effect of the scholarship program on test scores?",
    dataset_path="data/rdd_data.csv",
    dataset_description="Student data with test scores and scholarship eligibility cutoff"
)
Working with Results
Extracting Key Information
# Get the main causal effect estimate
effect = result['results']['results']['effect_estimate']
se = result['results']['results']['standard_error']
ci = result['results']['results'].get('confidence_interval', None)
# Check statistical significance
p_value = result['results']['results']['p_value']
is_significant = p_value < 0.05
# Get variable information
variables = result['results']['variables']
treatment = variables['treatment_variable']
outcome = variables['outcome_variable']
covariates = variables.get('covariates', [])
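As a quick sanity check, the estimate and standard error can be turned into a z-statistic, a two-sided p-value, and an approximate 95% interval. This is a sketch under a normal approximation; `wald_summary` is a hypothetical helper, and CAIS may use a different reference distribution or interval method, so small discrepancies from the reported values are expected.

```python
import math
from typing import Dict, Tuple, Union


def wald_summary(
    effect: float, se: float, z_crit: float = 1.96
) -> Dict[str, Union[float, Tuple[float, float]]]:
    """Normal-approximation z-statistic, two-sided p-value, and interval.

    Hypothetical helper for cross-checking reported numbers; z_crit=1.96
    corresponds to an approximate 95% interval.
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    z = effect / se
    # Two-sided p-value under N(0, 1): P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    ci = (effect - z_crit * se, effect + z_crit * se)
    return {"z": z, "p_value": p_value, "ci": ci}


# Values taken from the example result structure in this guide.
summary = wald_summary(effect=1794.34, se=632.85)
print(f"z = {summary['z']:.2f}, p = {summary['p_value']:.4f}")
print(f"approx. 95% CI: {summary['ci'][0]:.2f} to {summary['ci'][1]:.2f}")
```

If the recomputed values differ substantially from the reported `p_value` or `confidence_interval`, that is a cue to check which inference procedure the selected method used.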
Interpreting Diagnostics
# Access diagnostic information
# Diagnostics sit alongside 'results' and 'variables' in the result structure
diagnostics = result['results']['diagnostics']
# For propensity score methods
if 'balance_statistics' in diagnostics:
    balance = diagnostics['balance_statistics']
    print("Covariate balance after matching:")
    for var, stats in balance.items():
        print(f"  {var}: standardized difference = {stats['std_diff']:.3f}")
# For IV methods
if 'first_stage_f_stat' in diagnostics:
    f_stat = diagnostics['first_stage_f_stat']
    print(f"First stage F-statistic: {f_stat:.2f}")
    if f_stat < 10:
        print("Warning: Weak instrument (F < 10)")
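Printing every balance statistic can be noisy when there are many covariates. The sketch below flags only covariates whose absolute standardized difference exceeds a threshold. The 0.1 cutoff is a common rule of thumb and an assumption here, not a CAIS default; `flag_imbalanced` is a hypothetical helper, and the sample values are made up for illustration.

```python
from typing import Dict


def flag_imbalanced(
    balance: Dict[str, Dict[str, float]], threshold: float = 0.1
) -> Dict[str, float]:
    """Return covariates with |std_diff| above the threshold, worst first.

    Hypothetical helper. Expects the same shape as 'balance_statistics':
    {covariate_name: {'std_diff': value, ...}}.
    """
    flagged = {
        var: stats["std_diff"]
        for var, stats in balance.items()
        if abs(stats["std_diff"]) > threshold
    }
    # Sort by magnitude so the most imbalanced covariate comes first.
    return dict(sorted(flagged.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Illustrative values only; not output from a real analysis.
sample_balance = {
    "age": {"std_diff": 0.04},
    "education": {"std_diff": -0.18},
    "married": {"std_diff": 0.12},
}
for var, diff in flag_imbalanced(sample_balance).items():
    print(f"Imbalanced after matching: {var} (std. diff = {diff:.2f})")
```

An empty result means every covariate is within the chosen threshold; a non-empty one suggests the matching may need revisiting.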
Error Handling
When an analysis fails, CAIS reports the problem in the returned dictionary under an 'error' key:
result = run_causal_analysis(
    query="What is the effect of X on Y?",
    dataset_path="data/problematic_data.csv"
)
# Check for errors
if 'error' in result:
    print(f"Analysis failed: {result['error']}")
else:
    # Process successful results
    effect = result['results']['results']['effect_estimate']
Common Issues and Solutions
- Missing Variables
If CAIS can’t identify treatment or outcome variables, be more specific in your query:
# Instead of: "What causes what?"
# Use: "What is the effect of education on income?"
- Data Format Issues
Ensure your CSV has proper headers and numeric variables are correctly formatted:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.dtypes)  # Check variable types
print(df.head())  # Check data format
- Method Selection Issues
If the automatically selected method seems inappropriate, review the explanation in the result, which indicates why certain methods were chosen or rejected.
Best Practices
Data Preparation
Clean Variable Names: Use descriptive, consistent variable names
Handle Missing Data: Address missing values before analysis
Check Data Types: Ensure treatment variables are properly coded (0/1 for binary)
Document Your Data: Provide clear dataset descriptions
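Some of these checks can be automated before a dataset is handed to CAIS. The sketch below uses only the standard library's `csv` module; `preflight_check` is a hypothetical helper, the sample data is made up, and the strict string comparison against "0" and "1" is an assumption (a value such as "1.0" would be flagged).

```python
import csv
import io
from typing import List, TextIO


def preflight_check(csv_file: TextIO, treatment_column: str) -> List[str]:
    """Return a list of human-readable problems found in a CSV (hypothetical helper).

    Checks for blank headers, missing values per column, and whether the
    treatment column is coded strictly as 0/1.
    """
    reader = csv.DictReader(csv_file)
    headers = list(reader.fieldnames or [])
    problems: List[str] = []

    if any(not (h or "").strip() for h in headers):
        problems.append("blank column header found")
    if treatment_column not in headers:
        problems.append(f"treatment column '{treatment_column}' not found")
        return problems

    missing = {h: 0 for h in headers}
    treatment_values = set()
    for row in reader:
        for h in headers:
            value = (row.get(h) or "").strip()
            if value == "":
                missing[h] += 1
            elif h == treatment_column:
                treatment_values.add(value)

    for h, count in missing.items():
        if count:
            problems.append(f"{h}: {count} missing value(s)")
    if not treatment_values <= {"0", "1"}:
        problems.append(
            f"{treatment_column}: expected 0/1 coding, found {sorted(treatment_values)}"
        )
    return problems


# Made-up sample with one missing age and one badly coded treatment value.
sample = io.StringIO("treat,age,re78\n1,25,5000\n0,,3000\nyes,31,4100\n")
for problem in preflight_check(sample, treatment_column="treat"):
    print(problem)
```

For a file on disk, pass an open handle instead, for example `open("data.csv", newline="")`.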
Query Formulation
Be Specific: Clearly state treatment and outcome variables
Use Causal Language: Frame questions in terms of effects and causation
Provide Context: Include relevant background information in dataset descriptions
Result Interpretation
Check Assumptions: Review diagnostic tests and assumption checks
Consider Effect Size: Look beyond statistical significance to practical significance
Validate Results: Compare with domain knowledge and alternative methods
Document Decisions: Keep track of analysis choices for reproducibility
Next Steps
For more advanced features and customization options, see Advanced Usage
To process multiple datasets efficiently, see Batch Processing
For LLM provider setup and configuration, see Configuration
For detailed method documentation, see Causal Inference Methods