# Deployment & Operations

## Local Development Setup

### Prerequisites

| Tool | Version | Purpose |
|------|---------|---------|
| Python | 3.11+ | Backend runtime |
| Docker | Latest | Local services |
| Redis | 7.0+ | Job queue & caching |
| Git | Latest | Version control |

### Setup Steps

```bash
# 1. Clone repository
git clone <repo-url>
cd code_of_conquest

# 2. Create virtual environment
python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Configure environment
cp .env.example .env
# Edit .env with your API keys and settings

# 5. Start local services (Redis and a containerized RQ worker)
docker-compose up -d

# 6. Start additional RQ workers on the host (optional; Compose already runs one)
rq worker ai_tasks combat_tasks marketplace_tasks &

# 7. Run Flask development server
flask run --debug
```

### Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
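| `FLASK_ENV` | Environment name: `development` or `production` (Flask 2.3+ no longer reads this variable, so the app must read it itself) | Yes |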
| `SECRET_KEY` | Flask secret key | Yes |
| `REPLICATE_API_KEY` | Replicate API key | Yes |
| `ANTHROPIC_API_KEY` | Anthropic API key | Yes |
| `APPWRITE_ENDPOINT` | Appwrite server URL | Yes |
| `APPWRITE_PROJECT_ID` | Appwrite project ID | Yes |
| `APPWRITE_API_KEY` | Appwrite API key | Yes |
| `REDIS_URL` | Redis connection URL | Yes |
| `LOG_LEVEL` | Logging level (DEBUG/INFO/WARNING/ERROR) | No |
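
A startup check can catch a missing variable before the first request fails. The following is a minimal sketch, assuming the variable names in the table above; the helper names `missing_env_vars` and `load_log_level`, and the `INFO` default, are illustrative and not part of the project.

```python
import os

# Required variable names, taken from the table above.
REQUIRED_VARS = (
    "FLASK_ENV",
    "SECRET_KEY",
    "REPLICATE_API_KEY",
    "ANTHROPIC_API_KEY",
    "APPWRITE_ENDPOINT",
    "APPWRITE_PROJECT_ID",
    "APPWRITE_API_KEY",
    "REDIS_URL",
)


def missing_env_vars(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


def load_log_level(environ=None, default="INFO"):
    """Read the optional LOG_LEVEL; the INFO fallback is an assumption."""
    env = os.environ if environ is None else environ
    return env.get("LOG_LEVEL", default).upper()
```

Calling `missing_env_vars()` at application startup and raising an error when the list is non-empty is one way to fail fast on an incomplete `.env` file.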

---

## Docker Compose (Local Development)

**docker-compose.yml:**

```yaml
version: '3.8'
services:
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  rq-worker:
    build: .
    command: rq worker ai_tasks combat_tasks marketplace_tasks --url redis://redis:6379
    depends_on:
      - redis
    env_file:
      - .env
    environment:
      - REDIS_URL=redis://redis:6379

volumes:
  redis_data:
```

---

## Testing Strategy

### Manual Testing (Preferred)

**API Testing Document:** `docs/API_TESTING.md`

The document contains:
- Endpoint examples
- Sample curl/httpie commands
- Expected responses
- Authentication setup

**Example API Test:**

```bash
# Login
curl -X POST http://localhost:5000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"email": "test@example.com", "password": "password123"}'

# Create character (with auth token)
curl -X POST http://localhost:5000/api/v1/characters \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"name": "Aragorn", "class_id": "vanguard"}'
```

### Unit Tests (Optional)

**Framework:** pytest

**Test Categories:**

| Category | Location | Focus |
|----------|----------|-------|
| Combat | `tests/test_combat.py` | Damage calculations, effect processing |
| Skills | `tests/test_skills.py` | Skill unlock logic, prerequisites |
| Marketplace | `tests/test_marketplace.py` | Bidding logic, auction processing |
| Character | `tests/test_character.py` | Character creation, stats |

**Run Tests:**
```bash
# All tests
pytest

# Specific test file
pytest tests/test_combat.py

# With coverage
pytest --cov=app tests/
```

### Load Testing

**Tool:** Locust or Apache Bench

**Test Scenarios:**

| Scenario | Target | Success Criteria |
|----------|--------|------------------|
| Concurrent AI requests | 50 concurrent users | < 5s response time |
| Marketplace browsing | 100 concurrent users | < 1s response time |
| Session realtime updates | 10 players per session | < 100ms update latency |
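
Before setting up Locust, a quick concurrency check can be scripted with the standard library. This is a rough sketch and not a replacement for a real load-testing tool: `measure_latencies` and `meets_target` are hypothetical helpers, and treating "every request under the target" as the pass condition is an assumption, since the table does not say whether the criteria are averages or percentiles.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_latencies(call, concurrent_users, requests_per_user=1):
    """Run `call` from several threads and return each call's latency in seconds."""
    def one_user(_):
        samples = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            call()
            samples.append(time.perf_counter() - start)
        return samples

    # One thread per simulated user, all started together.
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        per_user = list(pool.map(one_user, range(concurrent_users)))
    return [latency for samples in per_user for latency in samples]


def meets_target(latencies, max_seconds):
    """True if at least one sample exists and all samples are under the target."""
    return bool(latencies) and max(latencies) < max_seconds
```

Here `call` would be any zero-argument function that issues one HTTP request against the local API, for example with `concurrent_users=100` and `max_seconds=1.0` for the marketplace browsing scenario.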

---

## Production Deployment

### Deployment Checklist

**Pre-Deployment:**
- [ ] All environment variables configured
- [ ] Appwrite collections created with proper permissions
- [ ] Redis configured and accessible
- [ ] RQ workers running
- [ ] SSL certificates installed
- [ ] Rate limiting configured
- [ ] Error logging/monitoring set up (Sentry recommended)
- [ ] Backup strategy for Appwrite data

**Production Configuration:**
- [ ] `DEBUG = False` in Flask
- [ ] Secure session keys (random, long)
- [ ] CORS restricted to production domain
- [ ] Rate limits appropriate for production
- [ ] AI cost alerts configured
- [ ] CDN for static assets (optional)

### Dockerfile

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Create non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

# Expose port
EXPOSE 5000

# Run application
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "4", "wsgi:app"]
```

### Build & Push Script

**scripts/build_and_push.sh:**

```bash
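#!/bin/bash
# Abort on errors, unset variables, and failed pipeline stages
set -euo pipefail

# Get current git branch; replace "/" because Docker tags cannot contain it
BRANCH=$(git rev-parse --abbrev-ref HEAD)
TAG=${BRANCH//\//-}

# Ask for tag options
read -r -p "Tag as :latest? (y/n) " TAG_LATEST
read -r -p "Push to registry? (y/n) " PUSH_IMAGE

# Build image
docker build -t "ai-dungeon-master:$TAG" .

if [ "$TAG_LATEST" = "y" ]; then
    docker tag "ai-dungeon-master:$TAG" ai-dungeon-master:latest
fi

# Note: a bare image name targets Docker Hub's default namespace.
# Prefix it with your registry/namespace as needed and run `docker login` first.
if [ "$PUSH_IMAGE" = "y" ]; then
    docker push "ai-dungeon-master:$TAG"
    if [ "$TAG_LATEST" = "y" ]; then
        docker push ai-dungeon-master:latest
    fi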
```

### Production Environment

**Recommended Stack:**
- **Web Server:** Nginx (reverse proxy)
- **WSGI Server:** Gunicorn (4+ workers)
- **Process Manager:** Supervisor or systemd
- **Redis:** Standalone or Redis Cluster
- **RQ Workers:** Separate instances for each queue

**Scaling Strategy:**

| Component | Scaling Method | Trigger |
|-----------|----------------|---------|
| Flask API | Horizontal (add workers) | CPU > 70% |
| RQ Workers | Horizontal (add workers) | Queue length > 100 |
| Redis | Vertical (upgrade instance) | Memory > 80% |
| Appwrite | Managed by Appwrite | N/A |

---

## Monitoring & Logging

### Application Logging

**Logging Configuration:**

| Level | Use Case | Examples |
|-------|----------|----------|
| DEBUG | Development only | Variable values, function calls |
| INFO | Normal operations | User actions, API calls |
| WARNING | Potential issues | Rate limit approaching, slow queries |
| ERROR | Errors (recoverable) | Failed AI calls, validation errors |
| CRITICAL | Critical failures | Database connection lost, service down |

**Structured Logging with Structlog:**

```python
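import structlog

logger = structlog.get_logger(__name__)

# Example identifiers; in the application these come from the request context
session_id = "session-123"
character_id = "character-456"

# Keyword arguments are attached to the event as structured fields
logger.info(
    "Combat action executed",
    session_id=session_id,
    character_id=character_id,
    action_type="attack",
    damage=15,
)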
```

### Monitoring Tools

**Recommended Tools:**

| Tool | Purpose | Priority |
|------|---------|----------|
| **Sentry** | Error tracking and alerting | High |
| **Prometheus** | Metrics collection | Medium |
| **Grafana** | Metrics visualization | Medium |
| **Uptime Robot** | Uptime monitoring | High |
| **CloudWatch** | AWS logs/metrics (if using AWS) | Medium |

### Key Metrics to Monitor

| Metric | Alert Threshold | Action |
|--------|----------------|--------|
| API response time | > 3s average | Scale workers |
| Error rate | > 5% | Investigate logs |
| AI API errors | > 10% | Check API status |
| Queue length | > 500 | Add workers |
| Redis memory | > 80% | Upgrade instance |
| CPU usage | > 80% | Scale horizontally |
| AI cost per day | > budget × 1.2 | Investigate usage |
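
The thresholds above can be encoded as data so that an alerting job evaluates them in one place. This is a sketch under assumptions: the metric key names and the `triggered_alerts` helper are invented for illustration, and percentages are expressed as ratios between 0 and 1.

```python
# Thresholds and actions from the table above; the key names are illustrative.
THRESHOLDS = {
    "api_response_time_s": (3.0, "Scale workers"),
    "error_rate": (0.05, "Investigate logs"),
    "ai_api_error_rate": (0.10, "Check API status"),
    "queue_length": (500, "Add workers"),
    "redis_memory_ratio": (0.80, "Upgrade instance"),
    "cpu_usage_ratio": (0.80, "Scale horizontally"),
}


def triggered_alerts(metrics, daily_ai_cost=None, daily_ai_budget=None):
    """Return (metric, action) pairs for every metric above its threshold."""
    alerts = [
        (name, action)
        for name, (limit, action) in THRESHOLDS.items()
        if name in metrics and metrics[name] > limit
    ]
    # AI cost is relative to a budget, so it is checked separately.
    if daily_ai_cost is not None and daily_ai_budget is not None:
        if daily_ai_cost > daily_ai_budget * 1.2:
            alerts.append(("ai_cost_per_day", "Investigate usage"))
    return alerts
```

Metrics missing from the input are skipped rather than treated as failures, which keeps the check usable when only some collectors are running.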

### AI Cost Tracking

**Log Structure:**

| Field | Type | Purpose |
|-------|------|---------|
| `user_id` | str | Track per-user usage |
| `model` | str | Which model used |
| `tier` | str | FREE/STANDARD/PREMIUM |
| `tokens_used` | int | Token count |
| `cost_estimate` | float | Estimated cost |
| `timestamp` | datetime | When called |
| `context_type` | str | What prompted the call |

**Daily Report:**
- Total AI calls per tier
- Total tokens used
- Estimated cost
- Top users by usage
- Anomaly detection (unusual spikes)
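
The log structure and the first four report items could be modelled roughly as follows. This is a sketch, not the project's implementation: `AICallRecord` and `daily_report` are hypothetical names, ranking top users by token count is an assumption, and anomaly detection is left out because the source does not define what counts as a spike.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AICallRecord:
    """One logged AI call, with the fields from the table above."""
    user_id: str
    model: str
    tier: str
    tokens_used: int
    cost_estimate: float
    timestamp: datetime
    context_type: str


def daily_report(records, top_n=3):
    """Aggregate one day's records into the report items listed above."""
    calls_per_tier = Counter(r.tier for r in records)
    tokens_per_user = defaultdict(int)
    for r in records:
        tokens_per_user[r.user_id] += r.tokens_used
    # Rank users by total tokens, highest first.
    top_users = sorted(tokens_per_user.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "calls_per_tier": dict(calls_per_tier),
        "total_tokens": sum(r.tokens_used for r in records),
        "estimated_cost": sum(r.cost_estimate for r in records),
        "top_users": top_users[:top_n],
    }
```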

---

## Security

### Authentication & Authorization

**Implementation:**

| Layer | Method | Details |
|-------|--------|---------|
| **User Auth** | Appwrite Auth | Email/password, OAuth providers |
| **API Auth** | JWT tokens | Bearer token in Authorization header |
| **Session Validation** | Every API call | Verify token, check expiry |
| **Resource Access** | User ID check | Users can only access their own data |

### Input Validation

**Validation Strategy:**

| Input Type | Validation | Tools |
|------------|------------|-------|
| JSON payloads | Schema validation | Marshmallow or Pydantic |
| Character names | Sanitize, length limits | Bleach library |
| Chat messages | Sanitize, profanity filter | Custom validators |
| AI prompts | Template-based only | Jinja2 (no direct user input) |

**Example Validation:**

| Field | Rules |
|-------|-------|
| Character name | 3-20 chars, alphanumeric + spaces only |
| Gold amount | Positive integer, max 999,999,999 |
| Action text | Max 500 chars, sanitized HTML |
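
The three rules above can be expressed with the standard library alone. This is a minimal sketch with assumptions labeled: "alphanumeric" is read as ASCII letters and digits, `html.escape` stands in for the Bleach sanitizer named earlier, and the function names are illustrative.

```python
import html
import re

# 3-20 characters: ASCII letters, digits, and spaces only (assumed reading).
NAME_PATTERN = re.compile(r"[A-Za-z0-9 ]{3,20}")
MAX_GOLD = 999_999_999
MAX_ACTION_LENGTH = 500


def is_valid_character_name(name):
    """True if the whole name matches the allowed characters and length."""
    return isinstance(name, str) and NAME_PATTERN.fullmatch(name) is not None


def is_valid_gold_amount(amount):
    """True for a positive integer no larger than MAX_GOLD."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int):
        return False
    return 0 < amount <= MAX_GOLD


def clean_action_text(text):
    """Enforce the length limit, then escape HTML special characters."""
    if len(text) > MAX_ACTION_LENGTH:
        raise ValueError("action text exceeds 500 characters")
    return html.escape(text)
```

Note that a name made only of spaces passes this pattern; rejecting it would need an extra rule that the table does not state.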

### Rate Limiting

**Implementation:** Flask-Limiter with Redis backend

**Limits by Tier:**

| Tier | API Calls/Min | AI Calls/Day | Marketplace Actions/Day |
|------|---------------|--------------|------------------------|
| FREE | 30 | 50 | N/A |
| BASIC | 60 | 200 | N/A |
| PREMIUM | 120 | 1000 | 50 |
| ELITE | 300 | Unlimited | 100 |

**Rate Limit Bypass:**
- Admin accounts
- Health check endpoints
- Static assets
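
One way to keep the per-tier numbers in a single place is a lookup table that both the limiter and the AI quota check read from. The sketch below covers only the API and AI columns; `TIER_LIMITS`, `api_limit_string`, and `ai_call_allowed` are hypothetical names, and the `"N per minute"` format follows the limit-string notation that Flask-Limiter accepts.

```python
UNLIMITED = None  # sentinel for the ELITE tier's AI quota

# Values from the table above.
TIER_LIMITS = {
    "FREE": {"api_per_min": 30, "ai_per_day": 50},
    "BASIC": {"api_per_min": 60, "ai_per_day": 200},
    "PREMIUM": {"api_per_min": 120, "ai_per_day": 1000},
    "ELITE": {"api_per_min": 300, "ai_per_day": UNLIMITED},
}


def api_limit_string(tier):
    """Format the per-minute API limit as a rate-limit string."""
    return f"{TIER_LIMITS[tier]['api_per_min']} per minute"


def ai_call_allowed(tier, calls_today):
    """True if a user on `tier` may make another AI call today."""
    quota = TIER_LIMITS[tier]["ai_per_day"]
    return quota is UNLIMITED or calls_today < quota
```

An unknown tier raises `KeyError` here; whether to fall back to the FREE limits instead is a design decision the source leaves open.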

### API Security

**Configuration:**

| Setting | Value | Reason |
|---------|-------|--------|
| **CORS** | Production domain only | Prevent unauthorized access |
| **HTTPS** | Required | Encrypt data in transit |
| **API Keys** | Environment variables | Never in code |
| **Appwrite Permissions** | Least privilege | Collection-level security |
| **SQL Injection** | N/A | No raw SQL; all data access goes through the Appwrite API |
| **XSS** | Sanitize all inputs | Prevent script injection |
| **CSRF** | CSRF tokens | For form submissions |

### Data Protection

**Access Control Matrix:**

| Resource | Owner | Party Member | Public | System |
|----------|-------|--------------|--------|--------|
| Characters | RW | R | - | RW |
| Sessions | R | RW (turn) | - | RW |
| Marketplace Listings | RW (own) | - | R | RW |
| Transactions | R (own) | - | - | RW |

**RW = Read/Write, R = Read only, - = No access**
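
The matrix translates directly into a lookup with deny-by-default behaviour. This is an illustrative sketch: the resource and role keys and the `can_access` helper are assumptions, and the "(own)" and "(turn)" qualifiers are not modelled, so the caller still has to verify ownership and turn order.

```python
# Access control matrix from the table above; "" means no access.
ACCESS_MATRIX = {
    "characters": {"owner": "RW", "party_member": "R", "public": "", "system": "RW"},
    "sessions": {"owner": "R", "party_member": "RW", "public": "", "system": "RW"},
    "marketplace_listings": {"owner": "RW", "party_member": "", "public": "R", "system": "RW"},
    "transactions": {"owner": "R", "party_member": "", "public": "", "system": "RW"},
}


def can_access(resource, role, operation):
    """Return True if `role` may perform `operation` ("R" or "W") on `resource`."""
    if operation not in ("R", "W"):
        raise ValueError("operation must be 'R' or 'W'")
    # Unknown resources or roles are denied by default.
    return operation in ACCESS_MATRIX.get(resource, {}).get(role, "")
```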

### Secrets Management

**Never Commit:**
- API keys
- Database credentials
- Secret keys
- Tokens

**Best Practices:**
- Use `.env` for local development
- Use environment variables in production
- Use secrets manager (AWS Secrets Manager, HashiCorp Vault) in production
- Rotate keys regularly
- Different keys for dev/staging/prod

---

## Backup & Recovery

### Appwrite Data Backup

**Strategy:**

| Data Type | Backup Frequency | Retention | Method |
|-----------|------------------|-----------|--------|
| Characters | Daily | 30 days | Appwrite export |
| Sessions (active) | Hourly | 7 days | Appwrite export |
| Marketplace | Daily | 30 days | Appwrite export |
| Transactions | Daily | 90 days | Appwrite export |

**Backup Script:**
- Export collections to JSON
- Compress and encrypt
- Upload to S3 or object storage
- Verify backup integrity
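
The export, compress, and verify steps could be sketched as below. Encryption and the upload to object storage are intentionally omitted because they depend on tooling the source does not name. The functions `write_backup` and `verify_backup` are hypothetical, and `documents` is assumed to be a list of dictionaries already fetched from Appwrite.

```python
import gzip
import hashlib
import json
import zlib
from pathlib import Path


def write_backup(documents, path):
    """Write documents as gzip-compressed JSON and return the file's SHA-256."""
    payload = json.dumps(documents, sort_keys=True).encode("utf-8")
    path = Path(path)
    with gzip.open(path, "wb") as fh:
        fh.write(payload)
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_backup(path, expected_sha256):
    """Check the checksum, then confirm the archive decompresses to valid JSON."""
    path = Path(path)
    if hashlib.sha256(path.read_bytes()).hexdigest() != expected_sha256:
        return False
    try:
        with gzip.open(path, "rb") as fh:
            json.loads(fh.read().decode("utf-8"))
    except (OSError, EOFError, ValueError, zlib.error):
        return False
    return True
```

Storing the returned checksum next to the archive lets a restore job detect a truncated or modified file before it tries to import it.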

### Disaster Recovery Plan

| Scenario | RTO | RPO | Steps |
|----------|-----|-----|-------|
| **Database corruption** | 4 hours | 24 hours | Restore from latest backup |
| **API server down** | 15 minutes | 0 | Restart/failover to standby |
| **Redis failure** | 5 minutes | Session data loss | Restart, users re-login |
| **Complete infrastructure loss** | 24 hours | 24 hours | Restore from backups to new infrastructure |

**RTO = Recovery Time Objective, RPO = Recovery Point Objective**

---

## CI/CD Pipeline

### Recommended Workflow

| Stage | Actions | Tools |
|-------|---------|-------|
| **1. Commit** | Developer pushes to `dev` branch | Git |
| **2. Build** | Run tests, lint code | GitHub Actions, pytest, flake8 |
| **3. Test** | Unit tests, integration tests | pytest |
| **4. Build Image** | Create Docker image | Docker |
| **5. Deploy to Staging** | Deploy to staging environment | Docker, SSH |
| **6. Manual Test** | QA testing on staging | Manual |
| **7. Merge to Beta** | Promote to beta branch | Git |
| **8. Deploy to Beta** | Deploy to beta environment | Docker, SSH |
| **9. Merge to Master** | Production promotion | Git |
| **10. Deploy to Prod** | Deploy to production | Docker, SSH |
| **11. Tag Release** | Create version tag | Git |

### GitHub Actions Example

```yaml
name: CI/CD

on:
  push:
    branches: [ dev, beta, master ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run tests
        run: pytest
      - name: Lint
        run: flake8 app/

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t ai-dungeon-master:${{ github.ref_name }} .
      # Assumes an earlier registry login step and a registry-qualified image name
      - name: Push to registry
        run: docker push ai-dungeon-master:${{ github.ref_name }}
```

---

## Performance Optimization

### Caching Strategy

| Cache Type | What to Cache | TTL |
|------------|---------------|-----|
| **Redis Cache** | Session data | 30 minutes |
| | Character data (read-heavy) | 5 minutes |
| | Marketplace listings | 1 minute |
| | NPC shop items | 1 hour |
| **Browser Cache** | Static assets | 1 year |
| | API responses (GET) | 30 seconds |
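
The Redis rows of the table map naturally onto a TTL lookup used by small get/set helpers. The sketch below assumes a redis-py style client exposing `setex(name, time, value)` and `get(name)`; the key prefixes and the helper names `cache_set` and `cache_get` are illustrative.

```python
import json

# TTLs in seconds for the Redis rows above; the key prefixes are illustrative.
CACHE_TTL_SECONDS = {
    "session": 30 * 60,
    "character": 5 * 60,
    "marketplace": 60,
    "npc_shop": 60 * 60,
}


def cache_set(client, kind, key, value):
    """Store a JSON-serialisable value under `kind:key` with that kind's TTL."""
    client.setex(f"{kind}:{key}", CACHE_TTL_SECONDS[kind], json.dumps(value))


def cache_get(client, kind, key):
    """Return the cached value, or None on a miss or after expiry."""
    raw = client.get(f"{kind}:{key}")
    return None if raw is None else json.loads(raw)
```

Because the client is passed in, the helpers can be exercised in tests with a small in-memory fake instead of a running Redis instance.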

### Database Optimization

**Appwrite Indexing:**
- Index `userId` on characters collection
- Index `status` on game_sessions collection
- Index `listing_type` + `status` on marketplace_listings
- Index `created_at` for time-based queries

### AI Call Optimization

**Strategies:**

| Strategy | Impact | Implementation |
|----------|--------|----------------|
| **Batch requests** | Reduce API calls | Combine multiple actions |
| **Cache common responses** | Reduce cost | Cache item descriptions |
| **Prompt optimization** | Reduce tokens | Shorter, more efficient prompts |
| **Model selection** | Reduce cost | Use cheaper models when appropriate |
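
As an illustration of the "cache common responses" strategy, repeated prompts can be served from a cache keyed by a hash of the prompt text. This is a sketch only: `DescriptionCache` is a hypothetical class, the in-memory dictionary stands in for Redis, and `generate` is whatever callable performs the real AI request.

```python
import hashlib


class DescriptionCache:
    """Cache AI-generated text by prompt hash (in-memory stand-in for Redis)."""

    def __init__(self, generate):
        self._generate = generate  # callable that performs the real AI request
        self._store = {}
        self.misses = 0  # number of times the AI backend was actually called

    def get(self, prompt):
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._generate(prompt)
        return self._store[key]
```

Comparing `misses` with the total number of lookups gives a rough hit rate, which helps judge whether caching a given kind of response is worth it.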

---

## Troubleshooting

### Common Issues

| Issue | Symptoms | Solution |
|-------|----------|----------|
| **RQ workers not processing** | Jobs stuck in queue | Check Redis connection, restart workers |
| **AI calls failing** | 401/403 errors | Verify API keys, check rate limits |
| **Appwrite connection errors** | Database errors | Check Appwrite status, verify credentials |
| **Session not updating** | Stale data in UI | Check Appwrite Realtime connection |
| **High latency** | Slow API responses | Check RQ queue length, scale workers |

### Debug Mode

**Enable Debug Logging:**

```bash
export LOG_LEVEL=DEBUG
flask run --debug
```

**Debug Endpoints (development only):**
- `GET /debug/health` - Health check
- `GET /debug/redis` - Redis connection status
- `GET /debug/queues` - RQ queue status

---

## Resources

| Resource | URL |
|----------|-----|
| **Appwrite Docs** | https://appwrite.io/docs |
| **RQ Docs** | https://python-rq.org/ |
| **Flask Docs** | https://flask.palletsprojects.com/ |
| **Structlog Docs** | https://www.structlog.org/ |
| **HTMX Docs** | https://htmx.org/docs/ |
| **Anthropic API** | https://docs.anthropic.com/ |
| **Replicate API** | https://replicate.com/docs |