LLM Integration

This document provides comprehensive guidance on LLM integration patterns, prompt engineering strategies, and response processing techniques used throughout the CAIS system.

Overview 

CAIS leverages Large Language Models (LLMs) at multiple stages of the causal analysis workflow to provide intelligent reasoning, variable identification, method selection, and result interpretation. The system is designed to work with multiple LLM providers while maintaining consistent behavior and reliability.

Key Integration Points:

Variable Identification: Extract causal variables from natural language queries
Method Selection: Reason about appropriate causal inference methods
Assumption Checking: Validate method assumptions using domain knowledge
Result Interpretation: Generate human-readable explanations of statistical results
Error Recovery: Provide intelligent fallback strategies when methods fail

LLM Provider Architecture 

Supported Providers 

CAIS supports multiple LLM providers through a unified interface:

# causal_agent/config.py

SUPPORTED_PROVIDERS = {
    "openai": {
        "models": ["gpt-3.5-turbo", "gpt-4", "gpt-4-turbo", "gpt-4o"],
        "client_class": "ChatOpenAI"
    },
    "anthropic": {
        "models": ["claude-3-haiku", "claude-3-sonnet", "claude-3-opus"],
        "client_class": "ChatAnthropic"
    },
    "google": {
        "models": ["gemini-pro", "gemini-pro-vision"],
        "client_class": "ChatGoogleGenerativeAI"
    },
    "ollama": {
        "models": ["llama2", "mistral", "codellama"],
        "client_class": "ChatOllama"
    }
}

Configuration Management 

The LLM client factory provides consistent configuration across providers:

def get_llm_client(
    provider: Optional[str] = None,
    model: Optional[str] = None,
    temperature: float = 0.0,
    max_tokens: Optional[int] = None,
    **kwargs
) -> BaseChatModel:
    """
    Factory function for creating LLM clients with consistent configuration.

    Args:
        provider: LLM provider name (openai, anthropic, google, ollama)
        model: Specific model name within provider
        temperature: Sampling temperature (0.0 for deterministic)
        max_tokens: Maximum tokens in response
        **kwargs: Provider-specific configuration options

    Returns:
        Configured LLM client instance
    """
    # Environment variable fallbacks
    provider = provider or os.getenv("LLM_PROVIDER", "openai")
    model = model or os.getenv("LLM_MODEL", "gpt-4")

    # Provider-specific client creation
    if provider == "openai":
        return ChatOpenAI(
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            api_key=os.getenv("OPENAI_API_KEY"),
            **kwargs
        )
    elif provider == "anthropic":
        return ChatAnthropic(
            model=model,
            temperature=temperature,
            max_tokens=max_tokens,
            api_key=os.getenv("ANTHROPIC_API_KEY"),
            **kwargs
        )
    # ... additional providers

Environment Configuration 

LLM configuration is managed through environment variables:

# Basic configuration
export LLM_PROVIDER=openai
export LLM_MODEL=gpt-4
export OPENAI_API_KEY=your_api_key_here

# Advanced configuration
export LLM_TEMPERATURE=0.0
export LLM_MAX_TOKENS=2000
export LLM_TIMEOUT=30

# Provider-specific settings
export ANTHROPIC_API_KEY=your_anthropic_key
export GOOGLE_API_KEY=your_google_key

Prompt Engineering Patterns 

Core Prompt Structure 

All CAIS prompts follow a consistent structure for reliability and maintainability:

PROMPT_TEMPLATE = """
You are an expert in {domain}. Your task is to {task_description}.

Context:
{context_information}

Input Data:
{input_data}

Instructions:
{specific_instructions}

Output Format:
{output_format_specification}

Examples:
{examples_if_applicable}
"""

Template Components:

Role Definition: Establish expertise and context
Task Description: Clear statement of what needs to be accomplished
Context Information: Relevant background and constraints
Input Data: Structured data for analysis
Specific Instructions: Detailed guidance for the task
Output Format: Exact specification of expected response format
Examples: Concrete examples when helpful

Variable Identification Prompts 

Treatment Variable Identification:

TREATMENT_VAR_IDENTIFICATION_PROMPT = """
You are an expert in causal inference. Your task is to identify the **treatment variable**
in a dataset to perform causal analysis that answers the user's query.

User Query:
{query}

Dataset Description:
{description}

Available Variables:
{column_info}

The treatment variable is the intervention, policy, or exposure whose causal effect
we want to estimate. It should be:
- Clearly mentioned or implied in the user's query
- Present in the available variables
- Conceptually meaningful as a treatment/intervention

If multiple variables could serve as treatment, select the one most directly
related to the user's causal question.

If no clear treatment variable can be identified, return null.

Return your response as a valid JSON object:
{{ "treatment_variable": "COLUMN_NAME_OR_NULL" }}
"""

Outcome Variable Identification:

OUTCOME_VAR_IDENTIFICATION_PROMPT = """
You are an expert in causal inference. Your task is to identify the **outcome variable**
in a dataset to perform causal analysis that answers the user's query.

User Query:
{query}

Dataset Description:
{description}

Available Variables:
{column_info}

The outcome variable is the dependent variable whose value we believe is causally
affected by the treatment. It should be:
- The main outcome of interest mentioned in the query
- Present in the available variables
- Measured after or contemporaneously with the treatment

Common outcome patterns in queries:
- "effect of X on Y" → Y is the outcome
- "impact of X on Y" → Y is the outcome
- "does X cause Y" → Y is the outcome

Return your response as a valid JSON object:
{{ "outcome_variable": "COLUMN_NAME_OR_NULL" }}
"""

Method Selection Prompts 

Decision Tree Reasoning:

METHOD_SELECTION_REASONING_PROMPT = """
You are an expert in causal inference method selection. Analyze the dataset and
variables to recommend the most appropriate causal inference method.

Dataset Analysis:
{dataset_analysis}

Identified Variables:
- Treatment: {treatment_variable}
- Outcome: {outcome_variable}
- Covariates: {covariates}
- Time Variable: {time_variable}
- Instrument: {instrument_variable}
- Running Variable: {running_variable}
- Is RCT: {is_rct}

Available Methods:
{available_methods}

Selection Criteria:
1. **Experimental Methods** (RCT, Difference in Means):
   - Use when is_rct=true or treatment is randomly assigned
   - Strongest causal identification

2. **Quasi-Experimental Methods**:
   - **Difference-in-Differences**: Time variation + treatment timing variation
   - **Instrumental Variables**: Valid instrument available
   - **Regression Discontinuity**: Running variable with cutoff

3. **Observational Methods**:
   - **Propensity Score Methods**: Rich set of covariates
   - **Backdoor Adjustment**: Sufficient covariates to block confounding
   - **Linear Regression**: Simple baseline method

Consider:
- Data structure and available variables
- Method assumptions and their plausibility
- Strength of causal identification
- Sample size and statistical power

Return your analysis as JSON:
{{
    "recommended_method": "method_name",
    "confidence": 0.0-1.0,
    "reasoning": "detailed explanation",
    "assumptions": ["list of key assumptions"],
    "alternatives": ["alternative methods"],
    "concerns": ["potential issues"]
}}
"""

Result Interpretation Prompts 

Statistical Results Interpretation:

RESULT_INTERPRETATION_PROMPT = """
You are an expert in causal inference and statistical interpretation.
Provide a clear, comprehensive interpretation of causal analysis results.

Analysis Details:
- Method Used: {method_name}
- Treatment Variable: {treatment_variable}
- Outcome Variable: {outcome_variable}
- Sample Size: {sample_size}

Statistical Results:
- Effect Estimate: {effect_estimate}
- Standard Error: {standard_error}
- 95% Confidence Interval: {confidence_interval}
- P-value: {p_value}

Diagnostic Tests:
{diagnostic_results}

Method Assumptions:
{method_assumptions}

Provide interpretation covering:

1. **Effect Size and Direction**:
   - Magnitude and practical significance
   - Direction of causal effect
   - Units and scale interpretation

2. **Statistical Significance**:
   - P-value interpretation
   - Confidence interval meaning
   - Statistical vs practical significance

3. **Assumption Assessment**:
   - How well assumptions are satisfied
   - Diagnostic test results
   - Reliability of causal interpretation

4. **Limitations and Caveats**:
   - Method-specific limitations
   - Potential sources of bias
   - Generalizability concerns

5. **Practical Implications**:
   - Real-world meaning of results
   - Policy or decision implications
   - Recommendations for action

Format as clear, accessible explanation suitable for non-experts while
maintaining statistical rigor.
"""

Response Processing Architecture 

Structured Output Parsing 

CAIS uses structured output parsing to ensure reliable LLM responses:

from typing import Dict, Any, Optional
import json
import re
from pydantic import BaseModel, ValidationError

class LLMResponseParser:
    """Parser for structured LLM responses with validation and error handling"""

    def __init__(self, expected_schema: Optional[BaseModel] = None):
        self.expected_schema = expected_schema

    def parse_json_response(self, response: str) -> Dict[str, Any]:
        """
        Parse JSON response from LLM with error handling and validation.

        Args:
            response: Raw LLM response string

        Returns:
            Parsed and validated JSON object

        Raises:
            ValueError: If response cannot be parsed or validated
        """
        try:
            # Extract JSON from response (handle markdown formatting)
            json_str = self._extract_json(response)

            # Parse JSON
            parsed = json.loads(json_str)

            # Validate against schema if provided
            if self.expected_schema:
                validated = self.expected_schema(**parsed)
                return validated.dict()

            return parsed

        except (json.JSONDecodeError, ValidationError) as e:
            raise ValueError(f"Failed to parse LLM response: {e}")

    def _extract_json(self, response: str) -> str:
        """Extract JSON from potentially formatted response"""
        # Remove markdown code blocks
        response = re.sub(r'```json\s*', '', response)
        response = re.sub(r'```\s*$', '', response)

        # Find JSON object
        json_match = re.search(r'\{.*\}', response, re.DOTALL)
        if json_match:
            return json_match.group(0)

        # If no JSON found, try the entire response
        return response.strip()

Response Validation Schemas 

Define Pydantic schemas for structured validation:

from pydantic import BaseModel, Field
from typing import List, Optional

class VariableIdentificationResponse(BaseModel):
    """Schema for variable identification responses"""
    treatment_variable: Optional[str] = Field(None, description="Identified treatment variable")
    outcome_variable: Optional[str] = Field(None, description="Identified outcome variable")
    covariates: List[str] = Field(default_factory=list, description="Identified covariates")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in identification")
    reasoning: str = Field(description="Explanation of identification logic")

class MethodSelectionResponse(BaseModel):
    """Schema for method selection responses"""
    recommended_method: str = Field(description="Recommended causal method")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in recommendation")
    reasoning: str = Field(description="Detailed reasoning for selection")
    assumptions: List[str] = Field(description="Key method assumptions")
    alternatives: List[str] = Field(default_factory=list, description="Alternative methods")
    concerns: List[str] = Field(default_factory=list, description="Potential concerns")

class ResultInterpretationResponse(BaseModel):
    """Schema for result interpretation responses"""
    effect_interpretation: str = Field(description="Interpretation of effect size")
    significance_assessment: str = Field(description="Statistical significance assessment")
    assumption_evaluation: str = Field(description="Method assumption evaluation")
    limitations: List[str] = Field(description="Analysis limitations")
    practical_implications: str = Field(description="Practical implications")

Error Handling and Retry Logic 

Implement robust error handling for LLM interactions:

import time
import logging
from typing import Dict, Any, Callable
from functools import wraps

logger = logging.getLogger(__name__)

def llm_retry(max_retries: int = 3, backoff_factor: float = 2.0):
    """Decorator for LLM calls with exponential backoff retry logic"""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None

            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)

                except Exception as e:
                    last_exception = e

                    if attempt < max_retries - 1:
                        wait_time = backoff_factor ** attempt
                        logger.warning(
                            f"LLM call failed (attempt {attempt + 1}/{max_retries}): {e}. "
                            f"Retrying in {wait_time} seconds..."
                        )
                        time.sleep(wait_time)
                    else:
                        logger.error(f"LLM call failed after {max_retries} attempts: {e}")

            raise last_exception

        return wrapper
    return decorator

class LLMClient:
    """Wrapper for LLM clients with error handling and validation"""

    def __init__(self, llm_client, parser: LLMResponseParser):
        self.llm = llm_client
        self.parser = parser

    @llm_retry(max_retries=3)
    def call_with_validation(
        self,
        prompt: str,
        expected_schema: Optional[BaseModel] = None
    ) -> Dict[str, Any]:
        """
        Call LLM with automatic retry and response validation.

        Args:
            prompt: Formatted prompt string
            expected_schema: Pydantic schema for response validation

        Returns:
            Validated response dictionary
        """
        try:
            # Call LLM
            response = self.llm.invoke(prompt)
            response_text = response.content if hasattr(response, 'content') else str(response)

            # Parse and validate response
            if expected_schema:
                self.parser.expected_schema = expected_schema

            parsed_response = self.parser.parse_json_response(response_text)

            logger.info(f"LLM call successful: {len(response_text)} characters")
            return parsed_response

        except Exception as e:
            logger.error(f"LLM call failed: {e}")
            raise

Prompt Optimization Strategies 

Few-Shot Learning 

Use examples to improve LLM performance on specific tasks:

FEW_SHOT_VARIABLE_IDENTIFICATION = """
You are an expert in causal inference variable identification.

Here are examples of correct variable identification:

Example 1:
Query: "What is the effect of education on income?"
Variables: education_years, annual_income, age, gender, experience
Response: {{"treatment_variable": "education_years", "outcome_variable": "annual_income"}}

Example 2:
Query: "Does smoking cause lung cancer?"
Variables: smoking_status, cancer_diagnosis, age, gender, family_history
Response: {{"treatment_variable": "smoking_status", "outcome_variable": "cancer_diagnosis"}}

Example 3:
Query: "Impact of minimum wage on employment"
Variables: min_wage_policy, employment_rate, state, year, population
Response: {{"treatment_variable": "min_wage_policy", "outcome_variable": "employment_rate"}}

Now identify variables for this query:
Query: {query}
Variables: {variables}
Response:
"""

Chain-of-Thought Reasoning 

Encourage step-by-step reasoning for complex decisions:

CHAIN_OF_THOUGHT_METHOD_SELECTION = """
You are selecting a causal inference method. Think through this step-by-step:

Step 1: Analyze the data structure
- Is this experimental or observational data?
- What variables are available?
- What is the sample size?

Step 2: Consider identification strategies
- Is there random assignment?
- Are there instruments available?
- Is there time/policy variation?
- Are there sufficient covariates?

Step 3: Evaluate method assumptions
- Which methods have plausible assumptions?
- What are the key threats to identification?
- How can assumptions be tested?

Step 4: Select the best method
- Which method provides strongest identification?
- What are the trade-offs?
- Are there good alternatives?

Dataset: {dataset_info}
Variables: {variables}

Work through each step and provide your reasoning:
"""

Prompt Versioning and A/B Testing 

Implement systematic prompt improvement:

class PromptManager:
    """Manager for prompt versioning and A/B testing"""

    def __init__(self):
        self.prompts = {}
        self.active_versions = {}
        self.performance_metrics = {}

    def register_prompt(
        self,
        prompt_name: str,
        version: str,
        template: str,
        metadata: Dict[str, Any] = None
    ):
        """Register a prompt version"""
        if prompt_name not in self.prompts:
            self.prompts[prompt_name] = {}

        self.prompts[prompt_name][version] = {
            'template': template,
            'metadata': metadata or {},
            'created_at': time.time()
        }

    def get_prompt(self, prompt_name: str, version: str = None) -> str:
        """Get prompt template by name and version"""
        if version is None:
            version = self.active_versions.get(prompt_name, 'latest')

        return self.prompts[prompt_name][version]['template']

    def set_active_version(self, prompt_name: str, version: str):
        """Set active version for a prompt"""
        self.active_versions[prompt_name] = version

    def record_performance(
        self,
        prompt_name: str,
        version: str,
        success: bool,
        metrics: Dict[str, Any]
    ):
        """Record performance metrics for prompt version"""
        key = f"{prompt_name}:{version}"
        if key not in self.performance_metrics:
            self.performance_metrics[key] = []

        self.performance_metrics[key].append({
            'success': success,
            'metrics': metrics,
            'timestamp': time.time()
        })

Integration with Decision Tree 

LLM-Enhanced Decision Logic 

Combine rule-based logic with LLM reasoning:

class DecisionTreeLLMEngine:
    """LLM-enhanced decision tree for method selection"""

    def __init__(self, llm_client: LLMClient):
        self.llm = llm_client
        self.rule_based_engine = RuleBasedDecisionTree()

    def select_method(
        self,
        variables: Variables,
        dataset_analysis: DatasetAnalysis,
        context: Dict[str, Any] = None
    ) -> Dict[str, Any]:
        """
        Select method using combined rule-based and LLM reasoning.

        Args:
            variables: Identified causal variables
            dataset_analysis: Dataset characteristics
            context: Additional context for decision

        Returns:
            Method selection with reasoning
        """
        # First, get rule-based recommendation
        rule_based_result = self.rule_based_engine.select_method(
            variables, dataset_analysis
        )

        # If rule-based selection is confident, use it
        if rule_based_result['confidence'] > 0.8:
            return rule_based_result

        # Otherwise, use LLM for enhanced reasoning
        llm_result = self._llm_method_selection(
            variables, dataset_analysis, rule_based_result, context
        )

        # Combine results
        return self._combine_recommendations(rule_based_result, llm_result)

    def _llm_method_selection(
        self,
        variables: Variables,
        dataset_analysis: DatasetAnalysis,
        rule_based_result: Dict[str, Any],
        context: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Use LLM for method selection reasoning"""

        prompt = self._build_method_selection_prompt(
            variables, dataset_analysis, rule_based_result, context
        )

        response = self.llm.call_with_validation(
            prompt, MethodSelectionResponse
        )

        return response

    def _combine_recommendations(
        self,
        rule_based: Dict[str, Any],
        llm_based: Dict[str, Any]
    ) -> Dict[str, Any]:
        """Combine rule-based and LLM recommendations"""

        # If both agree, high confidence
        if rule_based['method'] == llm_based['recommended_method']:
            return {
                'method': rule_based['method'],
                'confidence': min(rule_based['confidence'] + 0.2, 1.0),
                'reasoning': f"Both rule-based and LLM reasoning agree: {llm_based['reasoning']}",
                'assumptions': rule_based['assumptions'],
                'alternatives': llm_based['alternatives']
            }

        # If they disagree, use LLM with lower confidence
        else:
            return {
                'method': llm_based['recommended_method'],
                'confidence': llm_based['confidence'] * 0.8,
                'reasoning': f"LLM override of rule-based selection: {llm_based['reasoning']}",
                'assumptions': llm_based['assumptions'],
                'alternatives': [rule_based['method']] + llm_based['alternatives']
            }

Performance Optimization 

Caching Strategies 

Implement intelligent caching for LLM responses:

import hashlib
from typing import Dict, Any, Optional

class LLMResponseCache:
    """Cache for LLM responses to reduce API calls and improve performance"""

    def __init__(self, max_size: int = 1000):
        self.cache = {}
        self.max_size = max_size
        self.access_times = {}

    def _generate_key(self, prompt: str, model: str, temperature: float) -> str:
        """Generate cache key from prompt and parameters"""
        content = f"{prompt}:{model}:{temperature}"
        return hashlib.md5(content.encode()).hexdigest()

    def get(
        self,
        prompt: str,
        model: str,
        temperature: float
    ) -> Optional[Dict[str, Any]]:
        """Get cached response if available"""
        key = self._generate_key(prompt, model, temperature)

        if key in self.cache:
            self.access_times[key] = time.time()
            return self.cache[key]

        return None

    def set(
        self,
        prompt: str,
        model: str,
        temperature: float,
        response: Dict[str, Any]
    ):
        """Cache response with LRU eviction"""
        key = self._generate_key(prompt, model, temperature)

        # Evict oldest if at capacity
        if len(self.cache) >= self.max_size:
            oldest_key = min(self.access_times.keys(), key=self.access_times.get)
            del self.cache[oldest_key]
            del self.access_times[oldest_key]

        self.cache[key] = response
        self.access_times[key] = time.time()

Batch Processing 

Optimize for multiple queries:

class BatchLLMProcessor:
    """Process multiple LLM requests efficiently"""

    def __init__(self, llm_client: LLMClient, batch_size: int = 5):
        self.llm = llm_client
        self.batch_size = batch_size

    def process_batch(
        self,
        prompts: List[str],
        schemas: List[BaseModel] = None
    ) -> List[Dict[str, Any]]:
        """Process multiple prompts in batches"""
        results = []

        for i in range(0, len(prompts), self.batch_size):
            batch = prompts[i:i + self.batch_size]
            batch_schemas = schemas[i:i + self.batch_size] if schemas else [None] * len(batch)

            # Process batch concurrently
            batch_results = self._process_concurrent_batch(batch, batch_schemas)
            results.extend(batch_results)

        return results

    def _process_concurrent_batch(
        self,
        prompts: List[str],
        schemas: List[BaseModel]
    ) -> List[Dict[str, Any]]:
        """Process batch of prompts concurrently"""
        import concurrent.futures

        with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as executor:
            futures = [
                executor.submit(self.llm.call_with_validation, prompt, schema)
                for prompt, schema in zip(prompts, schemas)
            ]

            results = []
            for future in concurrent.futures.as_completed(futures):
                try:
                    result = future.result()
                    results.append(result)
                except Exception as e:
                    logger.error(f"Batch processing error: {e}")
                    results.append({"error": str(e)})

            return results

Monitoring and Debugging 

LLM Call Logging 

Comprehensive logging for debugging and monitoring:

class LLMCallLogger:
    """Logger for LLM interactions with detailed metrics"""

    def __init__(self, log_level: str = "INFO"):
        self.logger = logging.getLogger("llm_calls")
        self.logger.setLevel(getattr(logging, log_level))

        # Metrics tracking
        self.call_count = 0
        self.total_tokens = 0
        self.total_cost = 0.0
        self.error_count = 0

    def log_call(
        self,
        prompt: str,
        response: str,
        model: str,
        tokens_used: int = None,
        cost: float = None,
        duration: float = None,
        success: bool = True
    ):
        """Log LLM call with metrics"""
        self.call_count += 1

        if success:
            self.logger.info(
                f"LLM Call #{self.call_count} - Model: {model}, "
                f"Tokens: {tokens_used}, Duration: {duration:.2f}s"
            )
        else:
            self.error_count += 1
            self.logger.error(
                f"LLM Call #{self.call_count} FAILED - Model: {model}, "
                f"Error in response processing"
            )

        # Update metrics
        if tokens_used:
            self.total_tokens += tokens_used
        if cost:
            self.total_cost += cost

        # Log detailed information at debug level
        self.logger.debug(f"Prompt: {prompt[:200]}...")
        self.logger.debug(f"Response: {response[:200]}...")

    def get_metrics(self) -> Dict[str, Any]:
        """Get aggregated metrics"""
        return {
            "total_calls": self.call_count,
            "successful_calls": self.call_count - self.error_count,
            "error_rate": self.error_count / max(self.call_count, 1),
            "total_tokens": self.total_tokens,
            "total_cost": self.total_cost,
            "average_tokens_per_call": self.total_tokens / max(self.call_count, 1)
        }

Testing LLM Integration 

Mock LLM Responses 

Create deterministic tests using mock responses:

class MockLLMClient:
    """Mock LLM client for testing with predefined responses"""

    def __init__(self, responses: Dict[str, str]):
        self.responses = responses
        self.call_count = 0

    def invoke(self, prompt: str) -> str:
        """Return predefined response based on prompt pattern"""
        self.call_count += 1

        # Match prompt to predefined response
        for pattern, response in self.responses.items():
            if pattern in prompt:
                return response

        # Default response if no pattern matches
        return '{"error": "No mock response defined for this prompt"}'

# Example usage in tests
mock_responses = {
    "identify the treatment variable": '{"treatment_variable": "education"}',
    "identify the outcome variable": '{"outcome_variable": "income"}',
    "select causal method": '{"recommended_method": "linear_regression", "confidence": 0.8}'
}

mock_llm = MockLLMClient(mock_responses)

Integration Testing 

Test LLM integration within the full workflow:

def test_llm_integration_workflow():
    """Test complete workflow with LLM integration"""

    # Use mock LLM for deterministic testing
    mock_llm = MockLLMClient(STANDARD_MOCK_RESPONSES)

    # Create agent with mock LLM
    agent = CausalAgent(llm=mock_llm)

    # Run analysis
    result = agent.run_analysis(
        query="What is the effect of education on income?",
        dataset_path="test_data.csv"
    )

    # Verify LLM was called appropriately
    assert mock_llm.call_count > 0
    assert "effect_estimate" in result
    assert result["method_used"] in EXPECTED_METHODS

Best Practices 

Prompt Design 

Be Specific: Provide clear, unambiguous instructions
Use Examples: Include few-shot examples for complex tasks
Structure Output: Specify exact output format (JSON, etc.)
Handle Edge Cases: Address potential ambiguities and edge cases
Validate Assumptions: Make domain assumptions explicit

Error Handling 

Graceful Degradation: Provide fallback strategies when LLM fails
Retry Logic: Implement exponential backoff for transient failures
Input Validation: Validate inputs before sending to LLM
Output Validation: Validate LLM outputs against expected schemas
Logging: Comprehensive logging for debugging and monitoring

Performance 

Caching: Cache responses for repeated queries
Batch Processing: Process multiple requests efficiently
Model Selection: Use appropriate model size for task complexity
Temperature Control: Use low temperature for deterministic tasks
Token Management: Optimize prompts for token efficiency

Security 

Input Sanitization: Sanitize user inputs to prevent prompt injection
API Key Management: Secure handling of API credentials
Data Privacy: Avoid sending sensitive data to external LLMs
Rate Limiting: Respect provider rate limits and quotas
Error Messages: Avoid exposing sensitive information in error messages

This comprehensive LLM integration framework enables CAIS to leverage the power of large language models while maintaining reliability, performance, and security standards required for production causal inference systems.