Advanced Usage
This guide covers advanced features and customization options for power users who need more control over the causal analysis process. Learn how to fine-tune method selection, customize analysis parameters, and integrate CAIS into complex workflows.
Advanced Configuration
Environment Variables
CAIS can be configured through environment variables for consistent behavior across analyses:
# LLM Configuration
export LLM_PROVIDER="anthropic"
export LLM_MODEL="claude-3-5-sonnet-latest"
export ANTHROPIC_API_KEY="your-api-key"
# Analysis Configuration
export CAIS_DEFAULT_CONFIDENCE_LEVEL="0.95"
export CAIS_VERBOSE_LOGGING="true"
You can also set these in a .env file in your project directory:
# .env file
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o
OPENAI_API_KEY=your-api-key-here
CAIS_DEFAULT_CONFIDENCE_LEVEL=0.95
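Environment variables always arrive as strings, so values such as the confidence level need parsing before they can be used numerically. The sketch below shows one way to read and validate the variables listed above. `read_cais_settings` is a hypothetical helper, not part of CAIS, and the fallback defaults in it are assumptions chosen for illustration.

```python
import os
from typing import Any, Dict, Mapping, Optional


def read_cais_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Parse CAIS-related variables from a mapping (defaults to os.environ).

    Hypothetical helper: the fallback defaults are illustrative assumptions.
    """
    env = os.environ if env is None else env

    # Convert the confidence level from text and reject impossible values.
    level = float(env.get("CAIS_DEFAULT_CONFIDENCE_LEVEL", "0.95"))
    if not 0.0 < level < 1.0:
        raise ValueError(f"Confidence level must be in (0, 1), got {level}")

    # Treat a few common spellings as "true"; everything else is false.
    raw_verbose = env.get("CAIS_VERBOSE_LOGGING", "false")
    verbose = raw_verbose.strip().lower() in {"1", "true", "yes"}

    return {
        "provider": env.get("LLM_PROVIDER"),
        "model": env.get("LLM_MODEL"),
        "confidence_level": level,
        "verbose": verbose,
    }


settings = read_cais_settings({
    "LLM_PROVIDER": "openai",
    "LLM_MODEL": "gpt-4o",
    "CAIS_DEFAULT_CONFIDENCE_LEVEL": "0.95",
})
print(settings)
```

Passing a plain dictionary, as in the usage lines, makes the parsing easy to check without touching the real environment.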
Programmatic Configuration
For more dynamic configuration, you can set environment variables programmatically:
import os
from causal_agent import run_causal_analysis
# Configure for specific analysis
os.environ["LLM_PROVIDER"] = "anthropic"
os.environ["LLM_MODEL"] = "claude-3-5-sonnet-latest"
result = run_causal_analysis(
    query="What is the effect of treatment on outcome?",
    dataset_path="data.csv"
)
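Writing to `os.environ` changes the setting for the rest of the process. When only one analysis should use a different provider, a context manager can restore the previous values afterwards. This is a minimal sketch; `temporary_env` is a hypothetical helper, and it assumes CAIS reads these variables at call time rather than at import.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(**overrides: str) -> Iterator[None]:
    """Set environment variables inside a with-block, then restore them."""
    previous = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old_value in previous.items():
            if old_value is None:
                os.environ.pop(name, None)  # variable did not exist before
            else:
                os.environ[name] = old_value


with temporary_env(LLM_PROVIDER="anthropic", LLM_MODEL="claude-3-5-sonnet-latest"):
    # A call to run_causal_analysis(...) placed here would run with the overrides.
    provider_inside = os.environ["LLM_PROVIDER"]

print(provider_inside)
```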
Method Selection Control
Understanding Automatic Selection
CAIS uses a decision tree to automatically select appropriate causal inference methods based on:
Data structure (cross-sectional, panel, time series)
Variable types (binary, continuous, categorical)
Available instruments or discontinuities
Experimental design indicators
You can examine the method selection reasoning:
result = run_causal_analysis(
    query="What is the effect of education on income?",
    dataset_path="data.csv"
)
# Check which method was selected and why
method_info = result['results']['method_info']
print(f"Selected method: {method_info['method_name']}")
print(f"Reasoning: {method_info['selection_reasoning']}")
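Direct indexing like the lines above raises `KeyError` if a run fails or a field is absent. A defensive lookup avoids that. The sketch below uses a hypothetical `get_nested` helper and a hand-written stand-in dictionary; the stand-in is not real CAIS output, and its method name is a placeholder.

```python
from collections.abc import Mapping
from typing import Any


def get_nested(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested mappings key by key; return `default` if any level is missing."""
    current = data
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


# Stand-in for a run_causal_analysis return value (illustrative only).
example_result = {"results": {"method_info": {"method_name": "example_method"}}}

name = get_nested(example_result, "results", "method_info", "method_name",
                  default="unknown")
reasoning = get_nested(example_result, "results", "method_info",
                       "selection_reasoning", default="not provided")
print(f"Selected method: {name}")
print(f"Reasoning: {reasoning}")
```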
Influencing Method Selection
While CAIS automatically selects methods, you can influence selection through query phrasing:
# Suggest instrumental variable approach
result = run_causal_analysis(
    query="What is the effect of education on wages using distance to college as an instrument?",
    dataset_path="data.csv"
)
# Suggest difference-in-differences
result = run_causal_analysis(
    query="What was the effect of the policy change over time comparing treated and control regions?",
    dataset_path="panel_data.csv"
)
# Suggest regression discontinuity
result = run_causal_analysis(
    query="What is the effect of the program for those just above the eligibility cutoff?",
    dataset_path="rdd_data.csv"
)
Custom Analysis Workflows
Multi-Method Comparison
Compare results across different causal inference methods:
import pandas as pd
from causal_agent import run_causal_analysis

def compare_methods(dataset_path, base_query, method_hints):
    """Compare causal estimates across different methods."""
    results = {}
    for method_name, query_hint in method_hints.items():
        # Append the method hint to the base question
        full_query = f"{base_query} {query_hint}"
        result = run_causal_analysis(
            query=full_query,
            dataset_path=dataset_path
        )
        # Skip runs that reported an error
        if 'error' not in result:
            results[method_name] = {
                'effect': result['results']['results']['effect_estimate'],
                'se': result['results']['results']['standard_error'],
                'method': result['results']['results']['method_used']
            }
    # One row per hinted method
    return pd.DataFrame(results).T

# Example usage
method_hints = {
    'matching': 'using propensity score matching',
    'regression': 'using regression adjustment',
    'weighting': 'using inverse probability weighting'
}
comparison = compare_methods(
    "data/observational_data.csv",
    "What is the effect of treatment on outcome",
    method_hints
)
print(comparison)
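Once estimates from several methods are collected, a quick check is whether their confidence intervals share a common region. The sketch below uses a normal approximation (effect ± 1.96 × standard error), which is an assumption and may differ from the intervals CAIS reports. The function names and the numbers are made up for illustration, and interval overlap is a rough heuristic rather than a formal test of agreement.

```python
from typing import Dict, Tuple

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def normal_ci(effect: float, se: float, z: float = Z_95) -> Tuple[float, float]:
    """Normal-approximation confidence interval: effect +/- z * se."""
    return effect - z * se, effect + z * se


def summarize_agreement(estimates: Dict[str, Dict[str, float]]) -> Dict[str, object]:
    """Report each method's interval and whether all intervals overlap."""
    intervals = {
        name: normal_ci(est["effect"], est["se"]) for name, est in estimates.items()
    }
    # Intervals on a line share a common point iff max(lower) <= min(upper).
    highest_lower = max(low for low, _ in intervals.values())
    lowest_upper = min(high for _, high in intervals.values())
    effects = [est["effect"] for est in estimates.values()]
    return {
        "intervals": intervals,
        "all_overlap": highest_lower <= lowest_upper,
        "effect_range": max(effects) - min(effects),
    }


# Made-up numbers for illustration; not output from CAIS.
toy_estimates = {
    "matching": {"effect": 2.0, "se": 0.5},
    "regression": {"effect": 2.4, "se": 0.4},
    "weighting": {"effect": 1.8, "se": 0.6},
}
summary = summarize_agreement(toy_estimates)
print(summary["all_overlap"], round(summary["effect_range"], 2))
```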
Sensitivity Analysis
Test the robustness of your causal conclusions:
import os
import numpy as np
import pandas as pd
from causal_agent import run_causal_analysis

def sensitivity_analysis(dataset_path, query, perturbations):
    """Run sensitivity analysis with different data perturbations."""
    base_result = run_causal_analysis(query=query, dataset_path=dataset_path)
    base_effect = base_result['results']['results']['effect_estimate']
    results = {'base': base_effect}
    # Load and modify data for sensitivity tests
    df = pd.read_csv(dataset_path)
    for name, perturbation_func in perturbations.items():
        # Apply perturbation and write the modified data to a temporary file
        df_modified = perturbation_func(df.copy())
        temp_path = f"temp_{name}.csv"
        df_modified.to_csv(temp_path, index=False)
        try:
            # Run analysis on modified data
            result = run_causal_analysis(query=query, dataset_path=temp_path)
            if 'error' not in result:
                results[name] = result['results']['results']['effect_estimate']
        finally:
            # Clean up even if the analysis raises
            os.remove(temp_path)
    return results

# Example perturbations (fixed seeds keep the perturbed datasets reproducible)
rng = np.random.default_rng(42)
perturbations = {
    'drop_5pct': lambda df: df.sample(frac=0.95, random_state=42),
    'add_noise': lambda df: df.assign(
        outcome=df['outcome'] + rng.normal(0, df['outcome'].std() * 0.1, len(df))
    )
}
sensitivity_results = sensitivity_analysis(
    "data.csv",
    "What is the effect of treatment on outcome?",
    perturbations
)
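The dictionary returned by `sensitivity_analysis` is easier to read as changes relative to the base estimate. The sketch below shows one way to compute that; `relative_changes` is a hypothetical helper, the numbers are made up, and the 10% flagging threshold is an arbitrary choice rather than a recommended cutoff.

```python
from typing import Dict


def relative_changes(results: Dict[str, float], base_key: str = "base") -> Dict[str, float]:
    """Return each perturbed estimate's change relative to the base estimate."""
    base = results[base_key]
    if base == 0:
        raise ValueError("Base effect is zero; relative change is undefined")
    return {
        name: (value - base) / abs(base)
        for name, value in results.items()
        if name != base_key
    }


# Made-up estimates for illustration; not output from CAIS.
toy_results = {"base": 2.0, "drop_5pct": 2.1, "add_noise": 1.7}
changes = relative_changes(toy_results)
for name, change in changes.items():
    flag = "CHECK" if abs(change) > 0.10 else "ok"  # arbitrary 10% threshold
    print(f"{name}: {change:+.1%} ({flag})")
```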
Integration Patterns
Jupyter Notebook Integration
For interactive analysis and visualization:
import matplotlib.pyplot as plt
import seaborn as sns
from causal_agent import run_causal_analysis
# Run analysis
result = run_causal_analysis(
query="What is the effect of education on income?",
dataset_path="data.csv"
)
# Extract and visualize results
effect = result['results']['results']['effect_estimate']
ci = result['results']['results']['confidence_interval']
# Create visualization
fig, ax = plt.subplots(figsize=(8, 6))
ax.errorbar([0], [effect], yerr=[[effect - ci[0]], [ci[1] - effect]],
            fmt='o', capsize=5, capthick=2)
ax.axhline(y=0, color='r', linestyle='--', alpha=0.5)
ax.set_ylabel('Causal Effect Estimate')
ax.set_title('Treatment Effect with 95% Confidence Interval')
plt.show()
Pipeline Integration
Integrate CAIS into data processing pipelines:
import os
from typing import Dict, Any, List
import pandas as pd
from causal_agent import run_causal_analysis

class CausalAnalysisPipeline:
    """Pipeline for automated causal analysis."""

    def __init__(self, llm_provider: str = "openai"):
        # Applies to the whole process, not only to this pipeline instance
        os.environ["LLM_PROVIDER"] = llm_provider

    def analyze_dataset(self, dataset_info: Dict[str, Any]) -> Dict[str, Any]:
        """Analyze a single dataset."""
        result = run_causal_analysis(
            query=dataset_info['query'],
            dataset_path=dataset_info['path'],
            dataset_description=dataset_info.get('description')
        )
        return {
            'dataset_id': dataset_info['id'],
            'query': dataset_info['query'],
            'effect_estimate': result['results']['results']['effect_estimate'],
            'method_used': result['results']['results']['method_used'],
            'significant': result['results']['results']['p_value'] < 0.05
        }

    def batch_analyze(self, datasets: List[Dict[str, Any]]) -> pd.DataFrame:
        """Analyze multiple datasets."""
        results = []
        for dataset_info in datasets:
            try:
                result = self.analyze_dataset(dataset_info)
                results.append(result)
            except Exception as e:
                # Record the failure and continue with the remaining datasets
                results.append({
                    'dataset_id': dataset_info['id'],
                    'error': str(e)
                })
        return pd.DataFrame(results)

# Usage
pipeline = CausalAnalysisPipeline(llm_provider="anthropic")
datasets = [
    {
        'id': 'study1',
        'path': 'data/study1.csv',
        'query': 'What is the effect of treatment A on outcome Y?',
        'description': 'RCT data from study 1'
    },
    {
        'id': 'study2',
        'path': 'data/study2.csv',
        'query': 'What is the effect of intervention B on metric Z?',
        'description': 'Observational data from study 2'
    }
]
results_df = pipeline.batch_analyze(datasets)
Custom Data Preprocessing
Data Validation and Cleaning
Implement custom data validation before analysis:
import pandas as pd
import numpy as np
from typing import Tuple, List
from causal_agent import run_causal_analysis

def validate_and_clean_data(df: pd.DataFrame,
                            treatment_col: str,
                            outcome_col: str) -> Tuple[pd.DataFrame, List[str]]:
    """Validate and clean data for causal analysis."""
    warnings = []
    df_clean = df.copy()
    # Check for missing values
    missing_treatment = df_clean[treatment_col].isna().sum()
    missing_outcome = df_clean[outcome_col].isna().sum()
    if missing_treatment > 0:
        warnings.append(f"Dropping {missing_treatment} rows with missing treatment")
        df_clean = df_clean.dropna(subset=[treatment_col])
    if missing_outcome > 0:
        warnings.append(f"Dropping {missing_outcome} rows with missing outcome")
        df_clean = df_clean.dropna(subset=[outcome_col])
    # Validate treatment variable (this check assumes a binary treatment)
    unique_treatments = df_clean[treatment_col].nunique()
    if unique_treatments != 2:
        warnings.append(f"Treatment variable has {unique_treatments} unique values, expected 2")
    # Check for outliers in outcome using the 1.5 * IQR rule
    Q1 = df_clean[outcome_col].quantile(0.25)
    Q3 = df_clean[outcome_col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = ((df_clean[outcome_col] < (Q1 - 1.5 * IQR)) |
                (df_clean[outcome_col] > (Q3 + 1.5 * IQR))).sum()
    if outliers > 0:
        warnings.append(f"Found {outliers} potential outliers in outcome variable")
    return df_clean, warnings

# Usage before analysis
df = pd.read_csv("data.csv")
df_clean, warnings = validate_and_clean_data(df, 'treatment', 'outcome')
for warning in warnings:
    print(f"Warning: {warning}")
# Save cleaned data
df_clean.to_csv("data_cleaned.csv", index=False)
# Run analysis on cleaned data
result = run_causal_analysis(
    query="What is the effect of treatment on outcome?",
    dataset_path="data_cleaned.csv"
)
Performance Optimization
Caching Results
Cache analysis results to avoid recomputation:
import hashlib
import json
import os
from functools import wraps
from causal_agent import run_causal_analysis

def cache_analysis(cache_dir: str = "analysis_cache"):
    """Decorator to cache analysis results."""
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(func):
        @wraps(func)
        def wrapper(query: str, dataset_path: str, **kwargs):
            # Create cache key from the query, dataset path, and extra arguments
            cache_key = hashlib.md5(
                f"{query}_{dataset_path}_{json.dumps(kwargs, sort_keys=True)}".encode()
            ).hexdigest()
            cache_file = os.path.join(cache_dir, f"{cache_key}.json")
            # Check if cached result exists
            if os.path.exists(cache_file):
                with open(cache_file, 'r') as f:
                    return json.load(f)
            # Run analysis and cache result (the result must be JSON-serializable)
            result = func(query, dataset_path, **kwargs)
            with open(cache_file, 'w') as f:
                json.dump(result, f, indent=2)
            return result
        return wrapper
    return decorator

# Apply caching to analysis function
@cache_analysis()
def cached_causal_analysis(query: str, dataset_path: str, **kwargs):
    return run_causal_analysis(query=query, dataset_path=dataset_path, **kwargs)

# Usage - subsequent calls with same parameters will use cache
result1 = cached_causal_analysis(
    "What is the effect of treatment on outcome?",
    "data.csv"
)
result2 = cached_causal_analysis(  # This will use cached result
    "What is the effect of treatment on outcome?",
    "data.csv"
)
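The cache key above depends on the dataset path, not on the file's contents, so editing `data.csv` in place would still return the old cached result. One possible variation hashes the file bytes instead. This is a sketch under that assumption; `content_cache_key` is a hypothetical helper, and the temporary CSV exists only to demonstrate the behavior.

```python
import hashlib
import json
import os
import tempfile
from typing import Any


def content_cache_key(query: str, dataset_path: str, **kwargs: Any) -> str:
    """Build a cache key from the query, the dataset's bytes, and extra arguments."""
    digest = hashlib.sha256()
    digest.update(query.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    with open(dataset_path, "rb") as f:
        # Read in 1 MiB chunks so large datasets are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    digest.update(b"\0")
    # default=str keeps values that are not JSON types from raising.
    digest.update(json.dumps(kwargs, sort_keys=True, default=str).encode("utf-8"))
    return digest.hexdigest()


# Demonstration: the key changes when the file contents change.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "data.csv")
    with open(demo_path, "w") as f:
        f.write("treatment,outcome\n1,2.0\n")
    key_before = content_cache_key("What is the effect?", demo_path)
    with open(demo_path, "w") as f:
        f.write("treatment,outcome\n1,3.0\n")
    key_after = content_cache_key("What is the effect?", demo_path)

print(key_before != key_after)
```

Hashing the file costs one extra read per call, which is usually small next to an LLM-driven analysis.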
Best Practices for Advanced Usage
Version Control: Track analysis configurations and results
Documentation: Document custom workflows and parameter choices
Testing: Validate analyses on known datasets before production use
Monitoring: Log analysis performance and error rates
Reproducibility: Use fixed random seeds and version pinning
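As a sketch of the monitoring practice, a small decorator can log how long each analysis takes and whether it failed. The names here (`log_analysis`, the `cais.monitoring` logger, `fake_analysis`) are illustrative assumptions, and `fake_analysis` stands in for `run_causal_analysis` so the example runs without an API key.

```python
import functools
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("cais.monitoring")  # logger name is an arbitrary choice


def log_analysis(func: Callable[..., Any]) -> Callable[..., Any]:
    """Log the duration and outcome (success or exception) of each call."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed = time.perf_counter() - start
            logger.exception("%s failed after %.2fs", func.__name__, elapsed)
            raise  # re-raise so callers still see the original error
        elapsed = time.perf_counter() - start
        logger.info("%s succeeded in %.2fs", func.__name__, elapsed)
        return result
    return wrapper


@log_analysis
def fake_analysis(query: str) -> dict:
    """Stand-in for run_causal_analysis (illustrative only)."""
    return {"query": query}


logging.basicConfig(level=logging.INFO)
outcome = fake_analysis("What is the effect of treatment on outcome?")
```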
Next Steps
For batch processing workflows, see Batch Processing
For LLM provider configuration, see Configuration
For method-specific details, see Causal Inference Methods
For developer documentation, see Development