# AI Integration Documentation
## Overview
Code of Conquest uses AI models for narrative generation through a unified Replicate API integration. This document covers the AI client architecture, model selection, and usage patterns.

**Key Components:**
- **ReplicateClient** - Low-level API client for all AI models
- **ModelSelector** - Tier-based model routing and configuration
- **NarrativeGenerator** - High-level wrapper for game-specific generation
---
## Architecture
```
┌─────────────────────┐
│ NarrativeGenerator  │ ← High-level game API
├─────────────────────┤
│ ModelSelector       │ ← Tier/context routing
├─────────────────────┤
│ ReplicateClient     │ ← Unified API client
├─────────────────────┤
│ Replicate API       │ ← All models (Llama, Claude)
└─────────────────────┘
```
All AI models are accessed through Replicate API for unified billing and management.
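
The layering can be illustrated with a hypothetical, heavily simplified sketch (real class internals differ; only the delegation pattern is shown):

```python
# Hypothetical sketch of the layering: each layer delegates downward, so
# game code only ever touches NarrativeGenerator.
class ReplicateClient:
    def generate(self, prompt: str) -> str:
        # Stands in for the real Replicate API call.
        return f"[model output for: {prompt}]"

class ModelSelector:
    def select_model(self, user_tier: str) -> str:
        # Illustrative tier -> model routing only.
        return {"free": "llama-3-8b", "premium": "claude-3.5-sonnet"}.get(
            user_tier, "llama-3-8b"
        )

class NarrativeGenerator:
    def __init__(self):
        self.selector = ModelSelector()
        self.client = ReplicateClient()

    def generate_story(self, action: str, user_tier: str) -> str:
        model = self.selector.select_model(user_tier)
        return self.client.generate(f"({model}) {action}")

print(NarrativeGenerator().generate_story("search the room", "premium"))
```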
---
## Replicate Client
**File:** `app/ai/replicate_client.py`
### Supported Models
| Model Type | Identifier | Tier | Use Case |
|------------|-----------|------|----------|
| `LLAMA_3_8B` | `meta/meta-llama-3-8b-instruct` | Free | Cost-effective, good quality |
| `CLAUDE_HAIKU` | `anthropic/claude-3.5-haiku` | Basic | Fast, high quality |
| `CLAUDE_SONNET` | `anthropic/claude-3.5-sonnet` | Premium | Excellent quality |
| `CLAUDE_SONNET_4` | `anthropic/claude-4.5-sonnet` | Elite | Best quality |
### Basic Usage
```python
from app.ai.replicate_client import ReplicateClient, ModelType

# Free tier - Llama (default)
client = ReplicateClient()
response = client.generate(
    prompt="You are a dungeon master...",
    max_tokens=256,
    temperature=0.7
)
print(response.text)
print(f"Tokens: {response.tokens_used}")

# Paid tier - Claude models
client = ReplicateClient(model=ModelType.CLAUDE_HAIKU)
response = client.generate(
    prompt="Describe the tavern",
    system_prompt="You are a dungeon master"
)

# Override model per-call
response = client.generate("Test", model=ModelType.CLAUDE_SONNET)
```
### Response Object
```python
@dataclass
class ReplicateResponse:
    text: str               # Generated text
    tokens_used: int        # Approximate token count
    model: str              # Model identifier
    generation_time: float  # Generation time in seconds
```
### Configuration
```python
# Default parameters
DEFAULT_MAX_TOKENS = 256
DEFAULT_TEMPERATURE = 0.7
DEFAULT_TOP_P = 0.9
DEFAULT_TIMEOUT = 30  # seconds

# Model-specific defaults
MODEL_DEFAULTS = {
    ModelType.LLAMA_3_8B: {"max_tokens": 256, "temperature": 0.7},
    ModelType.CLAUDE_HAIKU: {"max_tokens": 512, "temperature": 0.8},
    ModelType.CLAUDE_SONNET: {"max_tokens": 1024, "temperature": 0.9},
    ModelType.CLAUDE_SONNET_4: {"max_tokens": 2048, "temperature": 0.9},
}
```
### Error Handling
```python
from app.ai.replicate_client import (
    ReplicateClientError,     # Base error
    ReplicateAPIError,        # API errors
    ReplicateRateLimitError,  # Rate limiting
    ReplicateTimeoutError     # Timeouts
)

try:
    response = client.generate(prompt)
except ReplicateRateLimitError:
    # Handle rate limiting (client retries automatically 3 times)
    pass
except ReplicateTimeoutError:
    # Handle timeout
    pass
except ReplicateAPIError as e:
    # Handle other API errors
    logger.error(f"API error: {e}")
```
### Features
- **Retry Logic**: Exponential backoff (3 retries) for rate limits
- **Model-specific Formatting**: Llama special tokens, Claude system prompts
- **API Key Validation**: `client.validate_api_key()` method
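
The retry behaviour can be sketched as follows (illustrative only: `with_retries` and `flaky_call` are hypothetical names, and `RuntimeError` stands in for `ReplicateRateLimitError`):

```python
import time

def with_retries(call, max_retries=3, base_delay=1.0):
    """Retry `call` with exponential backoff, mirroring the documented
    3-retry behaviour for rate-limit errors. Sketch only."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RuntimeError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s by default

# Usage: a call that fails twice with a rate-limit error, then succeeds.
attempts = {"n": 0}

def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky_call, base_delay=0.01))  # ok
```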
---
## Model Selector
**File:** `app/ai/model_selector.py`
### User Tiers
```python
class UserTier(str, Enum):
    FREE = "free"        # Llama 3 8B
    BASIC = "basic"      # Claude Haiku
    PREMIUM = "premium"  # Claude Sonnet
    ELITE = "elite"      # Claude Sonnet 4
```
### Context Types
```python
class ContextType(str, Enum):
    STORY_PROGRESSION = "story_progression"  # Creative narratives
    COMBAT_NARRATION = "combat_narration"    # Action descriptions
    QUEST_SELECTION = "quest_selection"      # Quest picking
    NPC_DIALOGUE = "npc_dialogue"            # Character conversations
    SIMPLE_RESPONSE = "simple_response"      # Quick responses
```
### Usage
```python
from app.ai.model_selector import ModelSelector, UserTier, ContextType

selector = ModelSelector()

# Select model configuration
config = selector.select_model(
    user_tier=UserTier.PREMIUM,
    context_type=ContextType.STORY_PROGRESSION
)
print(config.model_type)   # ModelType.CLAUDE_SONNET
print(config.max_tokens)   # 1024
print(config.temperature)  # 0.9
```
### Token Limits by Tier
| Tier | Base Tokens | Model |
|------|-------------|-------|
| FREE | 256 | Llama 3 8B |
| BASIC | 512 | Claude Haiku |
| PREMIUM | 1024 | Claude Sonnet |
| ELITE | 2048 | Claude Sonnet 4 |
### Context Adjustments
**Temperature by Context:**
- Story Progression: 0.9 (creative)
- Combat Narration: 0.8 (exciting)
- Quest Selection: 0.5 (deterministic)
- NPC Dialogue: 0.85 (natural)
- Simple Response: 0.7 (balanced)

**Token Multipliers:**
- Story Progression: 1.0× (full allocation)
- Combat Narration: 0.75× (shorter)
- Quest Selection: 0.5× (brief)
- NPC Dialogue: 0.75× (conversational)
- Simple Response: 0.5× (quick)
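
Combining the two tables, the effective generation parameters might be computed like this (a sketch with hypothetical lookup tables mirroring the values above; the real `ModelSelector` internals may differ):

```python
# Hypothetical lookup tables: tier base tokens, per-context temperature,
# and per-context token multipliers, matching the documented values.
BASE_TOKENS = {"free": 256, "basic": 512, "premium": 1024, "elite": 2048}
TEMPERATURE = {
    "story_progression": 0.9, "combat_narration": 0.8,
    "quest_selection": 0.5, "npc_dialogue": 0.85, "simple_response": 0.7,
}
TOKEN_MULTIPLIER = {
    "story_progression": 1.0, "combat_narration": 0.75,
    "quest_selection": 0.5, "npc_dialogue": 0.75, "simple_response": 0.5,
}

def effective_params(tier: str, context: str) -> dict:
    # Effective max_tokens = tier base allocation x context multiplier.
    return {
        "max_tokens": int(BASE_TOKENS[tier] * TOKEN_MULTIPLIER[context]),
        "temperature": TEMPERATURE[context],
    }

print(effective_params("premium", "combat_narration"))
# {'max_tokens': 768, 'temperature': 0.8}
```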
### Cost Estimation
```python
# Get tier information
info = selector.get_tier_info(UserTier.PREMIUM)
# {
#     "tier": "premium",
#     "model": "anthropic/claude-3.5-sonnet",
#     "model_name": "Claude 3.5 Sonnet",
#     "base_tokens": 1024,
#     "quality": "Excellent quality, detailed narratives"
# }

# Estimate cost per request
cost = selector.estimate_cost_per_request(UserTier.PREMIUM)
# ~$0.009 per request
```
---
## Narrative Generator
**File:** `app/ai/narrative_generator.py`

High-level wrapper that coordinates model selection, prompt templates, and AI generation.
### Initialization
```python
from app.ai.narrative_generator import NarrativeGenerator
from app.ai.model_selector import UserTier

generator = NarrativeGenerator()
```
### Story Response Generation
```python
response = generator.generate_story_response(
    character={
        "name": "Aldric",
        "level": 3,
        "player_class": "Fighter",
        "stats": {"strength": 16, "dexterity": 14, ...}
    },
    action="I search the room for hidden doors",
    game_state={
        "current_location": "Ancient Library",
        "location_type": "DUNGEON",
        "active_quests": ["find_artifact"]
    },
    user_tier=UserTier.PREMIUM,
    conversation_history=[
        {"turn": 1, "action": "entered library", "dm_response": "..."},
        {"turn": 2, "action": "examined shelves", "dm_response": "..."}
    ],
    action_instructions="""
    The player searches for supplies. This means:
    - Describe what they FIND, not auto-purchase
    - List items with PRICES if applicable
    - Ask what they want to do with findings
    """  # Optional: from action_prompts.yaml dm_prompt_template
)

print(response.narrative)
print(f"Tokens: {response.tokens_used}")
print(f"Model: {response.model}")
print(f"Time: {response.generation_time:.2f}s")
```
### Action Instructions
The `action_instructions` parameter passes action-specific guidance from `action_prompts.yaml` to the AI. This ensures:
1. **Player agency** - AI presents options rather than making decisions
2. **Action semantics** - "Search" means find options, not auto-buy
3. **Context-aware responses** - Different instructions for different actions

The instructions are injected into the prompt template and include critical player agency rules:
- Never auto-purchase items
- Never complete transactions without consent
- Present choices and ask what they want to do
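
As a hypothetical sketch of the injection (the real implementation renders the Jinja2 templates described later in this document; the names here are illustrative):

```python
# Hypothetical sketch: interpolate action-specific instructions into the
# DM prompt. The real code renders Jinja2 templates instead.
PROMPT_SKELETON = """You are the dungeon master.
Player action: {action}

Action guidance:
{action_instructions}

Present choices to the player; never complete a transaction without consent."""

def build_prompt(action: str, action_instructions: str = "") -> str:
    return PROMPT_SKELETON.format(
        action=action,
        action_instructions=action_instructions.strip() or "(none)",
    )

prompt = build_prompt(
    "I search the shop for supplies",
    "Describe what they FIND, with prices; do not auto-purchase.",
)
print(prompt)
```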
### Combat Narration
```python
response = generator.generate_combat_narration(
    character={"name": "Aldric", ...},
    combat_state={
        "round_number": 3,
        "enemies": [{"name": "Goblin", "hp": 5, "max_hp": 10}],
        "terrain": "cave"
    },
    action="swings their sword at the goblin",
    action_result={
        "hit": True,
        "damage": 12,
        "effects": ["bleeding"]
    },
    user_tier=UserTier.BASIC,
    is_critical=True,
    is_finishing_blow=True
)
```
### Quest Selection
```python
quest_id = generator.generate_quest_selection(
    character={"name": "Aldric", "level": 3, ...},
    eligible_quests=[
        {"quest_id": "goblin_cave", "name": "Clear the Cave", ...},
        {"quest_id": "herb_gathering", "name": "Gather Herbs", ...}
    ],
    game_context={
        "current_location": "Tavern",
        "recent_events": ["talked to locals"]
    },
    user_tier=UserTier.FREE,
    recent_actions=["asked about rumors", "ordered ale"]
)
print(quest_id)  # "goblin_cave"
```
### NPC Dialogue
```python
response = generator.generate_npc_dialogue(
    character={"name": "Aldric", ...},
    npc={
        "name": "Old Barkeep",
        "role": "Tavern Owner",
        "personality": "gruff but kind"
    },
    conversation_topic="What rumors have you heard lately?",
    game_state={"current_location": "The Rusty Anchor", ...},
    user_tier=UserTier.PREMIUM,
    npc_knowledge=["goblin attacks", "missing merchant"]
)
```
### Response Object
```python
@dataclass
class NarrativeResponse:
    narrative: str     # Generated text
    tokens_used: int   # Token count
    model: str         # Model used
    context_type: str  # Type of generation
    generation_time: float
```
### Error Handling
```python
from app.ai.narrative_generator import NarrativeGeneratorError

try:
    response = generator.generate_story_response(...)
except NarrativeGeneratorError as e:
    logger.error(f"Generation failed: {e}")
    # Handle gracefully (show error to user, use fallback, etc.)
```
---
## Prompt Templates
**File:** `app/ai/prompt_templates.py`
**Templates:** `app/ai/templates/*.j2`
### Available Templates
1. **story_action.j2** - Story progression turns
2. **combat_action.j2** - Combat narration
3. **quest_offering.j2** - Context-aware quest selection
4. **npc_dialogue.j2** - NPC conversations
### Template Filters
- `format_inventory` - Format item lists
- `format_stats` - Format character stats
- `format_skills` - Format skill lists
- `format_effects` - Format active effects
- `truncate_text` - Truncate with ellipsis
- `format_gold` - Format currency
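
As an illustration of what two of these filters might do (behaviour assumed from the filter names; the real implementations may differ):

```python
# Assumed behaviour of the `format_gold` and `truncate_text` filters,
# written as plain functions. The real filters are registered on the
# Jinja2 environment used by the templates.
def format_gold(amount: int) -> str:
    return f"{amount:,} gold"

def truncate_text(text: str, length: int = 50) -> str:
    if len(text) <= length:
        return text
    return text[: length - 1].rstrip() + "…"

print(format_gold(12500))           # 12,500 gold
print(truncate_text("a" * 60, 10))  # aaaaaaaaa…
```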
### Direct Template Usage
```python
from app.ai.prompt_templates import get_prompt_templates

templates = get_prompt_templates()
prompt = templates.render(
    "story_action.j2",
    character={"name": "Aldric", ...},
    action="search for traps",
    game_state={...},
    conversation_history=[...]
)
```
---
## Configuration
### Environment Variables
```bash
# Required
REPLICATE_API_TOKEN=r8_...
# Optional (defaults shown)
REPLICATE_MODEL=meta/meta-llama-3-8b-instruct
```
### Cost Management
Approximate costs per 1K tokens:

| Model | Input | Output |
|-------|-------|--------|
| Llama 3 8B | Free | Free |
| Claude Haiku | $0.001 | $0.005 |
| Claude Sonnet | $0.003 | $0.015 |
| Claude Sonnet 4 | $0.015 | $0.075 |
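
Given those rates, a per-request estimate might be computed like this (illustrative only: `estimate_cost` and the model keys are hypothetical names, and real token counts vary per request):

```python
# Hypothetical per-1K-token rates (input, output) in USD, matching the
# table above.
RATES_PER_1K = {
    "llama-3-8b": (0.0, 0.0),
    "claude-haiku": (0.001, 0.005),
    "claude-sonnet": (0.003, 0.015),
    "claude-sonnet-4": (0.015, 0.075),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Cost = input tokens at the input rate + output tokens at the output rate.
    rate_in, rate_out = RATES_PER_1K[model]
    return (input_tokens / 1000) * rate_in + (output_tokens / 1000) * rate_out

# A 2K-token prompt with a 1K-token response on Claude Sonnet:
print(f"${estimate_cost('claude-sonnet', 2000, 1000):.4f}")  # $0.0210
```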
---
## Integration with Background Jobs
AI generation runs asynchronously via RQ jobs. See `app/tasks/ai_tasks.py`.
```python
from app.tasks.ai_tasks import enqueue_ai_task

# Queue a story action
job = enqueue_ai_task(
    task_type="narrative",
    user_id="user_123",
    context={
        "session_id": "sess_789",
        "character_id": "char_456",
        "action": "I explore the tavern"
    }
)
# Returns: {"job_id": "abc-123", "status": "queued"}
```
---
## Usage Tracking
All AI calls are automatically logged for cost monitoring. See `app/services/usage_tracking_service.py`.
```python
from datetime import date

from app.services.usage_tracking_service import UsageTrackingService

tracker = UsageTrackingService()

# Get daily usage
usage = tracker.get_daily_usage("user_123", date.today())
print(f"Requests: {usage.total_requests}")
print(f"Cost: ${usage.estimated_cost:.4f}")

# Get monthly cost
monthly = tracker.get_monthly_cost("user_123", 2025, 11)
```
---
## Rate Limiting
Tier-based daily limits enforced via `app/services/rate_limiter_service.py`.
Limits are loaded from config (`rate_limiting.tiers.{tier}.ai_calls_per_day`).
### AI Calls (Turns)
| Tier | Daily Limit |
|------|------------|
| FREE | 50 turns |
| BASIC | 200 turns |
| PREMIUM | 1000 turns |
| ELITE | Unlimited |
A value of `-1` in config means unlimited.
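
The sentinel can be handled with a guard like this (illustrative only; the real check lives in `RateLimiterService`):

```python
# Sketch of the "-1 means unlimited" convention when checking a daily limit.
def is_within_limit(calls_today: int, daily_limit: int) -> bool:
    if daily_limit == -1:  # unlimited (e.g. ELITE tier)
        return True
    return calls_today < daily_limit

print(is_within_limit(5000, -1))  # True  (unlimited)
print(is_within_limit(49, 50))    # True  (one call left)
print(is_within_limit(50, 50))    # False (limit reached)
```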
### Custom Actions
Free-text player actions (beyond preset buttons) have separate limits:

| Tier | Custom Actions/Day | Max Characters |
|------|-------------------|----------------|
| FREE | 10 | 150 |
| BASIC | 50 | 300 |
| PREMIUM | Unlimited | 500 |
| ELITE | Unlimited | 500 |
These are configurable in `config/*.yaml` under `rate_limiting.tiers.{tier}.custom_actions_per_day` and `custom_action_char_limit`.
```python
from app.services.rate_limiter_service import RateLimiterService

limiter = RateLimiterService()

try:
    limiter.check_rate_limit("user_123", UserTier.PREMIUM)
    # Process request...
    limiter.increment_usage("user_123")
except RateLimitExceeded:
    # Return error to user
    pass
```
---
## Best Practices
1. **Always specify context type** - Helps optimize token usage and temperature
2. **Provide conversation history** - Improves narrative coherence
3. **Handle errors gracefully** - Show user-friendly messages
4. **Monitor costs** - Use usage tracking service
5. **Test with mocks first** - Use mocked clients during development
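
For the last point, a minimal sketch of development-time testing with `unittest.mock` (a stand-in object rather than the real `ReplicateClient`, so no API key or network access is needed; the canned response values are illustrative):

```python
from unittest.mock import MagicMock

# Stand-in for ReplicateClient with a canned response object.
mock_client = MagicMock()
mock_client.generate.return_value = MagicMock(
    text="You find a hidden door behind the bookshelf.",
    tokens_used=12,
)

# Code under test would receive mock_client via injection or patching.
response = mock_client.generate(prompt="I search the room for hidden doors")
print(response.text)
print(f"Tokens: {response.tokens_used}")
mock_client.generate.assert_called_once()
```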
---
## Verification Scripts
- `scripts/verify_ai_models.py` - Test model routing and API connectivity
- `scripts/verify_e2e_ai_generation.py` - End-to-end generation flow tests
```bash
# Test model routing (no API key needed)
python scripts/verify_ai_models.py

# Test with real API calls
python scripts/verify_ai_models.py --llama --haiku --sonnet

# Full E2E test
python scripts/verify_e2e_ai_generation.py --real --tier premium
```