Advanced Usage
==============

This guide covers advanced features and customization options for power users who need more control over the causal analysis process. It explains how to fine-tune method selection, customize analysis parameters, and integrate CAIS into complex workflows.

Advanced Configuration
----------------------

Environment Variables
~~~~~~~~~~~~~~~~~~~~~

CAIS can be configured through environment variables for consistent behavior across analyses:

.. code-block:: bash

    # LLM Configuration
    export LLM_PROVIDER="anthropic"
    export LLM_MODEL="claude-3-5-sonnet-latest"
    export ANTHROPIC_API_KEY="your-api-key"

    # Analysis Configuration
    export CAIS_DEFAULT_CONFIDENCE_LEVEL="0.95"
    export CAIS_VERBOSE_LOGGING="true"

You can also set these in a ``.env`` file in your project directory:

.. code-block:: bash

    # .env file
    LLM_PROVIDER=openai
    LLM_MODEL=gpt-4o
    OPENAI_API_KEY=your-api-key-here
    CAIS_DEFAULT_CONFIDENCE_LEVEL=0.95

Programmatic Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~

For more dynamic configuration, you can set the same environment variables from Python before running an analysis:

.. code-block:: python

    import os
    from causal_agent import run_causal_analysis

    # Configure for specific analysis
    os.environ["LLM_PROVIDER"] = "anthropic"
    os.environ["LLM_MODEL"] = "claude-3-5-sonnet-latest"

    result = run_causal_analysis(
        query="What is the effect of treatment on outcome?",
        dataset_path="data.csv"
    )

Method Selection Control
------------------------

Understanding Automatic Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CAIS uses a decision tree to automatically select appropriate causal inference methods based on:

- Data structure (cross-sectional, panel, time series)
- Variable types (binary, continuous, categorical)
- Available instruments or discontinuities
- Experimental design indicators

You can examine the method selection reasoning:

.. code-block:: python

    from causal_agent import run_causal_analysis

    result = run_causal_analysis(
        query="What is the effect of education on income?",
        dataset_path="data.csv"
    )

    # Check which method was selected and why
    method_info = result['results']['method_info']
    print(f"Selected method: {method_info['method_name']}")
    print(f"Reasoning: {method_info['selection_reasoning']}")

Influencing Method Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Although CAIS selects methods automatically, you can steer the selection through query phrasing:

.. code-block:: python

    # Suggest instrumental variable approach
    result = run_causal_analysis(
        query="What is the effect of education on wages using distance to college as an instrument?",
        dataset_path="data.csv"
    )

    # Suggest difference-in-differences
    result = run_causal_analysis(
        query="What was the effect of the policy change over time comparing treated and control regions?",
        dataset_path="panel_data.csv"
    )

    # Suggest regression discontinuity
    result = run_causal_analysis(
        query="What is the effect of the program for those just above the eligibility cutoff?",
        dataset_path="rdd_data.csv"
    )

Custom Analysis Workflows
-------------------------

Multi-Method Comparison
~~~~~~~~~~~~~~~~~~~~~~~

Compare results across different causal inference methods:

.. code-block:: python

    import pandas as pd
    from causal_agent import run_causal_analysis

    def compare_methods(dataset_path, base_query, method_hints):
        """Compare causal estimates across different methods."""
        results = {}

        for method_name, query_hint in method_hints.items():
            # Append the method hint to the base question
            full_query = f"{base_query} {query_hint}"
            result = run_causal_analysis(
                query=full_query,
                dataset_path=dataset_path
            )

            # Skip runs that reported an error
            if 'error' not in result:
                results[method_name] = {
                    'effect': result['results']['results']['effect_estimate'],
                    'se': result['results']['results']['standard_error'],
                    'method': result['results']['results']['method_used']
                }

        # One row per method hint
        return pd.DataFrame(results).T

    # Example usage
    method_hints = {
        'matching': 'using propensity score matching',
        'regression': 'using regression adjustment',
        'weighting': 'using inverse probability weighting'
    }

    comparison = compare_methods(
        "data/observational_data.csv",
        "What is the effect of treatment on outcome",
        method_hints
    )
    print(comparison)

Sensitivity Analysis
~~~~~~~~~~~~~~~~~~~~

Test the robustness of your causal conclusions:

.. code-block:: python

    import os

    import numpy as np
    import pandas as pd
    from causal_agent import run_causal_analysis

    def sensitivity_analysis(dataset_path, query, perturbations):
        """Run sensitivity analysis with different data perturbations."""
        base_result = run_causal_analysis(query=query, dataset_path=dataset_path)
        base_effect = base_result['results']['results']['effect_estimate']
        results = {'base': base_effect}

        # Load the data once; each perturbation works on its own copy
        df = pd.read_csv(dataset_path)

        for name, perturbation_func in perturbations.items():
            # Apply perturbation and write the modified data to a temporary file
            df_modified = perturbation_func(df.copy())
            temp_path = f"temp_{name}.csv"
            df_modified.to_csv(temp_path, index=False)

            try:
                # Run analysis on modified data
                result = run_causal_analysis(query=query, dataset_path=temp_path)
                if 'error' not in result:
                    results[name] = result['results']['results']['effect_estimate']
            finally:
                # Clean up the temporary file even if the analysis raises
                os.remove(temp_path)

        return results

    # Example perturbations (fixed seeds keep the runs reproducible)
    rng = np.random.default_rng(42)
    perturbations = {
        'drop_5pct': lambda df: df.sample(frac=0.95, random_state=42),
        'add_noise': lambda df: df.assign(
            outcome=df['outcome'] + rng.normal(0, df['outcome'].std() * 0.1, len(df))
        )
    }

    sensitivity_results = sensitivity_analysis(
        "data.csv",
        "What is the effect of treatment on outcome?",
        perturbations
    )

Integration Patterns
--------------------

Jupyter Notebook Integration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For interactive analysis and visualization:

.. code-block:: python

    import matplotlib.pyplot as plt
    from causal_agent import run_causal_analysis

    # Run analysis
    result = run_causal_analysis(
        query="What is the effect of education on income?",
        dataset_path="data.csv"
    )

    # Extract and visualize results
    effect = result['results']['results']['effect_estimate']
    ci = result['results']['results']['confidence_interval']

    # Plot the point estimate with asymmetric error bars from the interval bounds
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.errorbar([0], [effect],
                yerr=[[effect - ci[0]], [ci[1] - effect]],
                fmt='o', capsize=5, capthick=2)
    ax.axhline(y=0, color='r', linestyle='--', alpha=0.5)
    ax.set_ylabel('Causal Effect Estimate')
    ax.set_title('Treatment Effect with 95% Confidence Interval')
    plt.show()

Pipeline Integration
~~~~~~~~~~~~~~~~~~~~

Integrate CAIS into data processing pipelines:

.. code-block:: python

    import os
    from typing import Dict, Any, List

    import pandas as pd
    from causal_agent import run_causal_analysis

    class CausalAnalysisPipeline:
        """Pipeline for automated causal analysis."""

        def __init__(self, llm_provider: str = "openai"):
            # The provider is passed to CAIS through the environment
            os.environ["LLM_PROVIDER"] = llm_provider

        def analyze_dataset(self, dataset_info: Dict[str, Any]) -> Dict[str, Any]:
            """Analyze a single dataset."""
            result = run_causal_analysis(
                query=dataset_info['query'],
                dataset_path=dataset_info['path'],
                dataset_description=dataset_info.get('description')
            )

            return {
                'dataset_id': dataset_info['id'],
                'query': dataset_info['query'],
                'effect_estimate': result['results']['results']['effect_estimate'],
                'method_used': result['results']['results']['method_used'],
                'significant': result['results']['results']['p_value'] < 0.05
            }

        def batch_analyze(self, datasets: List[Dict[str, Any]]) -> pd.DataFrame:
            """Analyze multiple datasets."""
            results = []
            for dataset_info in datasets:
                try:
                    result = self.analyze_dataset(dataset_info)
                    results.append(result)
                except Exception as e:
                    # Record the failure and continue with the next dataset
                    results.append({
                        'dataset_id': dataset_info['id'],
                        'error': str(e)
                    })

            return pd.DataFrame(results)

    # Usage
    pipeline = CausalAnalysisPipeline(llm_provider="anthropic")

    datasets = [
        {
            'id': 'study1',
            'path': 'data/study1.csv',
            'query': 'What is the effect of treatment A on outcome Y?',
            'description': 'RCT data from study 1'
        },
        {
            'id': 'study2',
            'path': 'data/study2.csv',
            'query': 'What is the effect of intervention B on metric Z?',
            'description': 'Observational data from study 2'
        }
    ]

    results_df = pipeline.batch_analyze(datasets)

Custom Data Preprocessing
-------------------------

Data Validation and Cleaning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Implement custom data validation before analysis:

.. code-block:: python

    import pandas as pd
    import numpy as np
    from typing import Tuple, List
    from causal_agent import run_causal_analysis

    def validate_and_clean_data(df: pd.DataFrame,
                                treatment_col: str,
                                outcome_col: str) -> Tuple[pd.DataFrame, List[str]]:
        """Validate and clean data for causal analysis."""
        warnings = []
        df_clean = df.copy()

        # Drop rows with missing treatment values
        missing_treatment = df_clean[treatment_col].isna().sum()
        if missing_treatment > 0:
            warnings.append(f"Dropping {missing_treatment} rows with missing treatment")
            df_clean = df_clean.dropna(subset=[treatment_col])

        # Drop rows with missing outcome values (counted after the first drop)
        missing_outcome = df_clean[outcome_col].isna().sum()
        if missing_outcome > 0:
            warnings.append(f"Dropping {missing_outcome} rows with missing outcome")
            df_clean = df_clean.dropna(subset=[outcome_col])

        # Validate treatment variable
        unique_treatments = df_clean[treatment_col].nunique()
        if unique_treatments != 2:
            warnings.append(f"Treatment variable has {unique_treatments} unique values, expected 2")

        # Check for outliers in outcome using the 1.5 * IQR rule
        Q1 = df_clean[outcome_col].quantile(0.25)
        Q3 = df_clean[outcome_col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((df_clean[outcome_col] < (Q1 - 1.5 * IQR)) |
                    (df_clean[outcome_col] > (Q3 + 1.5 * IQR))).sum()

        if outliers > 0:
            warnings.append(f"Found {outliers} potential outliers in outcome variable")

        return df_clean, warnings

    # Usage before analysis
    df = pd.read_csv("data.csv")
    df_clean, warnings = validate_and_clean_data(df, 'treatment', 'outcome')

    for warning in warnings:
        print(f"Warning: {warning}")

    # Save cleaned data
    df_clean.to_csv("data_cleaned.csv", index=False)

    # Run analysis on cleaned data
    result = run_causal_analysis(
        query="What is the effect of treatment on outcome?",
        dataset_path="data_cleaned.csv"
    )

Performance Optimization
------------------------

Caching Results
~~~~~~~~~~~~~~~

Cache analysis results to avoid recomputation:

.. code-block:: python

    import hashlib
    import json
    import os
    from functools import wraps

    from causal_agent import run_causal_analysis

    def cache_analysis(cache_dir: str = "analysis_cache"):
        """Decorator to cache analysis results."""
        os.makedirs(cache_dir, exist_ok=True)

        def decorator(func):
            @wraps(func)
            def wrapper(query: str, dataset_path: str, **kwargs):
                # Create cache key. It uses the dataset path, not the file
                # contents, so clear the cache if the file changes.
                cache_key = hashlib.md5(
                    f"{query}_{dataset_path}_{json.dumps(kwargs, sort_keys=True)}".encode()
                ).hexdigest()
                cache_file = os.path.join(cache_dir, f"{cache_key}.json")

                # Check if cached result exists
                if os.path.exists(cache_file):
                    with open(cache_file, 'r') as f:
                        return json.load(f)

                # Run analysis and cache result
                result = func(query, dataset_path, **kwargs)
                with open(cache_file, 'w') as f:
                    json.dump(result, f, indent=2)

                return result
            return wrapper
        return decorator

    # Apply caching to analysis function
    @cache_analysis()
    def cached_causal_analysis(query: str, dataset_path: str, **kwargs):
        return run_causal_analysis(query=query, dataset_path=dataset_path, **kwargs)

    # Usage - subsequent calls with same parameters will use cache
    result1 = cached_causal_analysis(
        "What is the effect of treatment on outcome?",
        "data.csv"
    )
    result2 = cached_causal_analysis(  # This will use cached result
        "What is the effect of treatment on outcome?",
        "data.csv"
    )

Best Practices for Advanced Usage
---------------------------------

1. **Version Control**: Track analysis configurations and results
2. **Documentation**: Document custom workflows and parameter choices
3. **Testing**: Validate analyses on known datasets before production use
4. **Monitoring**: Log analysis performance and error rates
5. **Reproducibility**: Use fixed random seeds and version pinning

Next Steps
----------

- For batch processing workflows, see :doc:`batch_processing`
- For LLM provider configuration, see :doc:`configuration`
- For method-specific details, see :doc:`../methods/index`
- For developer documentation, see :doc:`../development/index`
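
The monitoring practice listed under Best Practices for Advanced Usage can be sketched with a small decorator. This is a minimal illustration built on the standard library only, not part of CAIS: the names ``monitor_analysis``, ``AnalysisStats``, and ``fake_analysis`` are hypothetical, and it assumes, as the examples in this guide do, that a failed analysis either raises or returns a dictionary containing an ``error`` key.

.. code-block:: python

    import functools
    import logging
    import time
    from dataclasses import dataclass

    logger = logging.getLogger("cais.monitoring")  # hypothetical logger name


    @dataclass
    class AnalysisStats:
        """Running totals for monitored analysis calls."""
        calls: int = 0
        errors: int = 0
        total_seconds: float = 0.0

        @property
        def error_rate(self) -> float:
            return self.errors / self.calls if self.calls else 0.0


    def monitor_analysis(stats: AnalysisStats):
        """Decorator that records duration and failures of an analysis function."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                stats.calls += 1
                try:
                    result = func(*args, **kwargs)
                except Exception:
                    stats.errors += 1
                    logger.exception("Analysis raised an exception")
                    raise
                finally:
                    # Runs on both success and failure
                    stats.total_seconds += time.perf_counter() - start
                # Assumption: failed runs may also report an 'error' key
                if isinstance(result, dict) and 'error' in result:
                    stats.errors += 1
                    logger.warning("Analysis returned an error: %s", result['error'])
                return result
            return wrapper
        return decorator


    # Stand-in for run_causal_analysis so the sketch runs without CAIS installed
    stats = AnalysisStats()

    @monitor_analysis(stats)
    def fake_analysis(query, dataset_path):
        if not dataset_path.endswith(".csv"):
            return {'error': 'unsupported file type'}
        return {'results': {'results': {'effect_estimate': 1.0}}}

    fake_analysis("effect of X on Y?", "data.csv")
    fake_analysis("effect of X on Y?", "data.parquet")
    print(f"calls={stats.calls} errors={stats.errors} error_rate={stats.error_rate:.2f}")

In a real workflow the decorator would wrap a thin function that calls ``run_causal_analysis``, and the collected totals could be written to a log at the end of a batch.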
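
For the version control and reproducibility practices, one option is to store a small manifest next to each result. The sketch below is an assumption about how such a record might look, not a CAIS feature: ``build_run_manifest`` and the manifest field names are hypothetical, and only the ``LLM_PROVIDER`` and ``LLM_MODEL`` variables come from this guide. It hashes the dataset contents so that a changed file is detectable even when its path stays the same.

.. code-block:: python

    import hashlib
    import json
    import os
    import platform
    import tempfile


    def build_run_manifest(query: str, dataset_path: str, seed: int) -> dict:
        """Collect the inputs needed to re-run an analysis later (hypothetical helper)."""
        with open(dataset_path, "rb") as f:
            dataset_hash = hashlib.sha256(f.read()).hexdigest()
        return {
            "query": query,
            "dataset_path": dataset_path,
            "dataset_sha256": dataset_hash,
            "seed": seed,
            "python_version": platform.python_version(),
            "llm_provider": os.environ.get("LLM_PROVIDER"),
            "llm_model": os.environ.get("LLM_MODEL"),
        }


    # Demonstration on a throwaway CSV file
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "data.csv")
        with open(csv_path, "wb") as f:
            f.write(b"treatment,outcome\n1,2.5\n0,1.0\n")
        manifest = build_run_manifest(
            "What is the effect of treatment on outcome?", csv_path, seed=42
        )
        print(json.dumps(manifest, indent=2))

The cache decorator under Caching Results keys on the dataset path only; adding a content hash such as ``dataset_sha256`` to the key is one way to avoid stale cache hits.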