Phase 2 Step 3: Implement Background Job Queue

Implemented APScheduler integration for background scan execution,
enabling async job processing without blocking HTTP requests.

## Changes

### Background Jobs (web/jobs/)
- scan_job.py - Execute scans in background threads
  - execute_scan() with isolated database sessions (direct-invocation sketch below)
  - Comprehensive error handling and logging
  - Scan status lifecycle tracking
  - Timing and error message storage
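
For reference, execute_scan() deliberately takes only plain values (an ID, a path, a DB URL) rather than ORM objects or app context, which is what lets APScheduler run it in a worker thread. A minimal sketch of a direct invocation, with an illustrative scan ID, config path, and database URL:

```python
from web.jobs.scan_job import execute_scan

# Normally queued via SchedulerService.queue_scan(); calling directly can be
# handy when debugging. The job builds its own SQLAlchemy session from db_url.
execute_scan(
    scan_id=42,                                # existing Scan row (illustrative)
    config_file='configs/example.yaml',        # hypothetical config path
    db_url='sqlite:////data/sneakyscanner.db'  # illustrative database URL
)
```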

### Scheduler Service (web/services/scheduler_service.py)
- SchedulerService class for job management
- APScheduler BackgroundScheduler integration
- ThreadPoolExecutor for concurrent jobs (max 3 workers)
- queue_scan() - Immediate job execution
- Job monitoring: list_jobs(), get_job_status() (usage example below)
- Graceful shutdown handling
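
Typical usage from application code, with illustrative arguments (the method names and the scan_{id} job-ID pattern come from the implementation shown further down):

```python
from web.services.scheduler_service import SchedulerService

scheduler = SchedulerService()
scheduler.init_scheduler(app)  # app is a configured Flask instance

# Queue a scan for immediate background execution
job_id = scheduler.queue_scan(scan_id=42, config_file='configs/example.yaml')
assert job_id == 'scan_42'

# Monitor jobs
print(scheduler.get_job_status(job_id))  # id, name, next_run_time, trigger
print(scheduler.list_jobs())             # all queued/scheduled jobs
```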

### Flask Integration (web/app.py)
- init_scheduler() function
- Scheduler initialization in app factory
- Scheduler stored on the Flask app object (app.scheduler)

### Database Schema (migration 003)
- Added scan timing fields (hypothetical migration sketched below):
  - started_at - Scan execution start time
  - completed_at - Scan execution completion time
  - error_message - Error details for failed scans

### Service Layer Updates (web/services/scan_service.py)
- trigger_scan() accepts scheduler parameter
- Queues background jobs after creating scan record
- get_scan_status() includes new timing and error fields (usage sketch below)
- _save_scan_to_db() sets completed_at timestamp
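
Callers now pass the scheduler through when triggering a scan; a sketch with an illustrative config path (trigger_scan() and get_scan_status() are the real methods, per the diff below):

```python
from web.services.scan_service import ScanService

scan_service = ScanService(db_session)       # an active SQLAlchemy session
scan_id = scan_service.trigger_scan(
    config_file='configs/example.yaml',      # illustrative path
    triggered_by='api',
    scheduler=app.scheduler,                 # queues the background job
)

# Status now carries the new fields
status = scan_service.get_scan_status(scan_id)
print(status['started_at'], status['completed_at'], status.get('error_message'))
```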

### API Updates (web/api/scans.py)
- POST /api/scans passes scheduler to trigger_scan()
- Scans now execute in the background automatically (request-level sketch below)
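
At the HTTP level the flow looks like this (a sketch: the request payload and status route are assumptions, only POST /api/scans is confirmed above):

```python
import requests

BASE = 'http://localhost:5000'  # illustrative host/port

# Trigger a scan; the API queues the job and returns without waiting
resp = requests.post(f'{BASE}/api/scans',
                     json={'config_file': 'configs/example.yaml'})  # assumed payload
scan_id = resp.json()['scan_id']  # assumed response shape

# Poll while the scan runs in the background
status = requests.get(f'{BASE}/api/scans/{scan_id}/status').json()  # assumed route
print(status['status'], status.get('error_message'))
```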

### Model Updates (web/models.py)
- Added started_at, completed_at, error_message to Scan model

### Testing (tests/test_background_jobs.py)
- 13 unit tests for background job execution
- Scheduler initialization and configuration tests
- Job queuing and status tracking tests
- Scan timing field tests
- Error handling and storage tests (representative sketch below)
- Integration test for full workflow (skipped by default)
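
A representative case from the suite might look like the following sketch (fixture names and exact assertions are assumptions; the real tests live in tests/test_background_jobs.py):

```python
from web.jobs.scan_job import execute_scan
from web.models import Scan

def test_failed_scan_stores_error_message(db_session, db_url):
    # db_session / db_url are assumed fixtures backed by a temporary database
    scan = Scan(status='created', triggered_by='api')
    db_session.add(scan)
    db_session.commit()

    # A missing config file should mark the scan failed rather than raise
    execute_scan(scan.id, 'does-not-exist.yaml', db_url)

    db_session.expire_all()
    updated = db_session.query(Scan).filter_by(id=scan.id).first()
    assert updated.status == 'failed'
    assert updated.error_message is not None
    assert updated.completed_at is not None
```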

## Features

- Asynchronous scan execution without blocking HTTP requests
- Concurrent scan support (configurable max workers; config sketch below)
- Isolated database sessions per background thread
- Scan lifecycle tracking: created → running → completed/failed
- Error messages captured and stored in database
- Job monitoring and management capabilities
- Graceful shutdown waits for running jobs
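
Concurrency limits come from Flask config keys read in SchedulerService.init_scheduler(); a sketch, assuming create_app() merges the dict into app.config:

```python
from web.app import create_app

config = {
    'SCHEDULER_MAX_WORKERS': 5,    # concurrent scan threads (default 3)
    'SCHEDULER_MAX_INSTANCES': 3,  # max concurrent runs of the same job ID
}
app = create_app(config)  # scheduler starts inside the app factory
```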

## Implementation Notes

- Scanner runs as a subprocess launched from the background thread
- Docker provides the necessary privileges (--privileged, --network host)
- Each background job gets its own isolated SQLAlchemy session to avoid cross-thread locking
- Job IDs follow the pattern scan_{scan_id} (lookup sketch below)
- Background jobs outlive the HTTP request that triggered them
- Failed jobs store their error messages in the database
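
Because job IDs are deterministic, a scan's background job can be looked up without extra bookkeeping; a small sketch (the helper itself is hypothetical):

```python
from typing import Optional
from flask import current_app

def scan_job_info(scan_id: int) -> Optional[dict]:
    # Jobs are registered as scan_{scan_id}, so the ID can be derived
    return current_app.scheduler.get_job_status(f'scan_{scan_id}')
```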

## Documentation (docs/ai/PHASE2.md)
- Updated progress: 6/14 days complete (43%)
- Marked Step 3 as complete
- Added detailed implementation notes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

web/api/scans.py

@@ -146,7 +146,8 @@ def trigger_scan():
         scan_service = ScanService(current_app.db_session)
         scan_id = scan_service.trigger_scan(
             config_file=config_file,
-            triggered_by='api'
+            triggered_by='api',
+            scheduler=current_app.scheduler
         )
 
         logger.info(f"Scan {scan_id} triggered via API: config={config_file}")

web/app.py

@@ -60,6 +60,9 @@ def create_app(config: dict = None) -> Flask:
     # Initialize extensions
     init_extensions(app)
 
+    # Initialize background scheduler
+    init_scheduler(app)
+
     # Register blueprints
     register_blueprints(app)
@@ -169,6 +172,25 @@ def init_extensions(app: Flask) -> None:
     app.logger.info("Extensions initialized")
 
 
+def init_scheduler(app: Flask) -> None:
+    """
+    Initialize background job scheduler.
+
+    Args:
+        app: Flask application instance
+    """
+    from web.services.scheduler_service import SchedulerService
+
+    # Create and initialize scheduler
+    scheduler = SchedulerService()
+    scheduler.init_scheduler(app)
+
+    # Store in app context for access from routes
+    app.scheduler = scheduler
+
+    app.logger.info("Background scheduler initialized")
+
+
 def register_blueprints(app: Flask) -> None:
     """
     Register Flask blueprints for different app sections.

web/jobs/__init__.py Normal file

@@ -0,0 +1,6 @@
"""
Background jobs package for SneakyScanner.

This package contains job definitions for background task execution,
including scan jobs and scheduled tasks.
"""

web/jobs/scan_job.py Normal file

@@ -0,0 +1,152 @@
"""
Background scan job execution.

This module handles the execution of scans in background threads,
updating database status and handling errors.
"""

import logging
import traceback
from datetime import datetime
from pathlib import Path

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from src.scanner import SneakyScanner
from web.models import Scan
from web.services.scan_service import ScanService

logger = logging.getLogger(__name__)


def execute_scan(scan_id: int, config_file: str, db_url: str):
    """
    Execute a scan in the background.

    This function is designed to run in a background thread via APScheduler.
    It creates its own database session to avoid conflicts with the main
    application thread.

    Args:
        scan_id: ID of the scan record in database
        config_file: Path to YAML configuration file
        db_url: Database connection URL

    Workflow:
        1. Create new database session for this thread
        2. Update scan status to 'running'
        3. Execute scanner
        4. Generate output files (JSON, HTML, ZIP)
        5. Save results to database
        6. Update status to 'completed' or 'failed'
    """
    logger.info(f"Starting background scan execution: scan_id={scan_id}, config={config_file}")

    # Create new database session for this thread
    engine = create_engine(db_url, echo=False)
    Session = sessionmaker(bind=engine)
    session = Session()

    try:
        # Get scan record
        scan = session.query(Scan).filter_by(id=scan_id).first()
        if not scan:
            logger.error(f"Scan {scan_id} not found in database")
            return

        # Update status to running (in case it wasn't already)
        scan.status = 'running'
        scan.started_at = datetime.utcnow()
        session.commit()

        logger.info(f"Scan {scan_id}: Initializing scanner with config {config_file}")

        # Initialize scanner
        scanner = SneakyScanner(config_file)

        # Execute scan
        logger.info(f"Scan {scan_id}: Running scanner...")
        start_time = datetime.utcnow()
        report, timestamp = scanner.scan()
        end_time = datetime.utcnow()
        scan_duration = (end_time - start_time).total_seconds()

        logger.info(f"Scan {scan_id}: Scanner completed in {scan_duration:.2f} seconds")

        # Generate output files (JSON, HTML, ZIP)
        logger.info(f"Scan {scan_id}: Generating output files...")
        scanner.generate_outputs(report, timestamp)

        # Save results to database
        logger.info(f"Scan {scan_id}: Saving results to database...")
        scan_service = ScanService(session)
        scan_service._save_scan_to_db(report, scan_id, status='completed')

        logger.info(f"Scan {scan_id}: Completed successfully")

    except FileNotFoundError as e:
        # Config file not found
        error_msg = f"Configuration file not found: {str(e)}"
        logger.error(f"Scan {scan_id}: {error_msg}")

        scan = session.query(Scan).filter_by(id=scan_id).first()
        if scan:
            scan.status = 'failed'
            scan.error_message = error_msg
            scan.completed_at = datetime.utcnow()
            session.commit()

    except Exception as e:
        # Any other error during scan execution
        error_msg = f"Scan execution failed: {str(e)}"
        logger.error(f"Scan {scan_id}: {error_msg}")
        logger.error(f"Scan {scan_id}: Traceback:\n{traceback.format_exc()}")

        try:
            scan = session.query(Scan).filter_by(id=scan_id).first()
            if scan:
                scan.status = 'failed'
                scan.error_message = error_msg
                scan.completed_at = datetime.utcnow()
                session.commit()
        except Exception as db_error:
            logger.error(f"Scan {scan_id}: Failed to update error status in database: {str(db_error)}")

    finally:
        # Always close the session
        session.close()
        logger.info(f"Scan {scan_id}: Background job completed, session closed")


def get_scan_status_from_db(scan_id: int, db_url: str) -> dict:
    """
    Helper function to get scan status directly from database.

    Useful for monitoring background jobs without needing Flask app context.

    Args:
        scan_id: Scan ID to check
        db_url: Database connection URL

    Returns:
        Dictionary with scan status information
    """
    engine = create_engine(db_url, echo=False)
    Session = sessionmaker(bind=engine)
    session = Session()

    try:
        scan = session.query(Scan).filter_by(id=scan_id).first()
        if not scan:
            return None

        return {
            'scan_id': scan.id,
            'status': scan.status,
            'timestamp': scan.timestamp.isoformat() if scan.timestamp else None,
            'duration': scan.duration,
            'error_message': scan.error_message
        }
    finally:
        session.close()

web/models.py

@@ -55,6 +55,9 @@ class Scan(Base):
     created_at = Column(DateTime, nullable=False, default=datetime.utcnow, comment="Record creation time")
     triggered_by = Column(String(50), nullable=False, default='manual', comment="manual, scheduled, api")
     schedule_id = Column(Integer, ForeignKey('schedules.id'), nullable=True, comment="FK to schedules if triggered by schedule")
+    started_at = Column(DateTime, nullable=True, comment="Scan execution start time")
+    completed_at = Column(DateTime, nullable=True, comment="Scan execution completion time")
+    error_message = Column(Text, nullable=True, comment="Error message if scan failed")
 
     # Relationships
     sites = relationship('ScanSite', back_populates='scan', cascade='all, delete-orphan')

web/services/scan_service.py

@@ -42,7 +42,7 @@ class ScanService:
         self.db = db_session
 
     def trigger_scan(self, config_file: str, triggered_by: str = 'manual',
-                     schedule_id: Optional[int] = None) -> int:
+                     schedule_id: Optional[int] = None, scheduler=None) -> int:
         """
         Trigger a new scan.
 
@@ -53,6 +53,7 @@ class ScanService:
             config_file: Path to YAML configuration file
             triggered_by: Source that triggered scan (manual, scheduled, api)
             schedule_id: Optional schedule ID if triggered by schedule
+            scheduler: Optional SchedulerService instance for queuing background jobs
 
         Returns:
             Scan ID of the created scan
@@ -87,8 +88,21 @@ class ScanService:
         logger.info(f"Scan {scan.id} triggered via {triggered_by}")
 
-        # NOTE: Background job queuing will be implemented in Step 3
-        # For now, just return the scan ID
+        # Queue background job if scheduler provided
+        if scheduler:
+            try:
+                job_id = scheduler.queue_scan(scan.id, config_file)
+                logger.info(f"Scan {scan.id} queued for background execution (job_id={job_id})")
+            except Exception as e:
+                logger.error(f"Failed to queue scan {scan.id}: {str(e)}")
+                # Mark scan as failed if job queuing fails
+                scan.status = 'failed'
+                scan.error_message = f"Failed to queue background job: {str(e)}"
+                self.db.commit()
+                raise
+        else:
+            logger.warning(f"Scan {scan.id} created but not queued (no scheduler provided)")
+
         return scan.id
 
     def get_scan(self, scan_id: int) -> Optional[Dict[str, Any]]:
@@ -230,7 +244,9 @@ class ScanService:
             'scan_id': scan.id,
             'status': scan.status,
             'title': scan.title,
-            'started_at': scan.timestamp.isoformat() if scan.timestamp else None,
+            'timestamp': scan.timestamp.isoformat() if scan.timestamp else None,
+            'started_at': scan.started_at.isoformat() if scan.started_at else None,
+            'completed_at': scan.completed_at.isoformat() if scan.completed_at else None,
             'duration': scan.duration,
             'triggered_by': scan.triggered_by
         }
@@ -242,6 +258,7 @@ class ScanService:
             status_info['progress'] = 'Complete'
         elif scan.status == 'failed':
             status_info['progress'] = 'Failed'
+            status_info['error_message'] = scan.error_message
 
         return status_info
@@ -265,6 +282,7 @@ class ScanService:
         # Update scan record
         scan.status = status
         scan.duration = report.get('scan_duration')
+        scan.completed_at = datetime.utcnow()
 
         # Map report data to database models
         self._map_report_to_models(report, scan)

web/services/scheduler_service.py Normal file

@@ -0,0 +1,257 @@
"""
Scheduler service for managing background jobs and scheduled scans.

This service integrates APScheduler with Flask to enable background
scan execution and future scheduled scanning capabilities.
"""

import logging
from datetime import datetime
from typing import Optional

from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor
from flask import Flask

from web.jobs.scan_job import execute_scan

logger = logging.getLogger(__name__)


class SchedulerService:
    """
    Service for managing background job scheduling.

    Uses APScheduler's BackgroundScheduler to run scans asynchronously
    without blocking HTTP requests.
    """

    def __init__(self):
        """Initialize scheduler service (scheduler not started yet)."""
        self.scheduler: Optional[BackgroundScheduler] = None
        self.db_url: Optional[str] = None

    def init_scheduler(self, app: Flask):
        """
        Initialize and start APScheduler with Flask app.

        Args:
            app: Flask application instance

        Configuration:
            - BackgroundScheduler: Runs in separate thread
            - ThreadPoolExecutor: Allows concurrent scan execution
            - Max workers: 3 (configurable via SCHEDULER_MAX_WORKERS)
        """
        if self.scheduler:
            logger.warning("Scheduler already initialized")
            return

        # Store database URL for passing to background jobs
        self.db_url = app.config['SQLALCHEMY_DATABASE_URI']

        # Configure executor for concurrent jobs
        max_workers = app.config.get('SCHEDULER_MAX_WORKERS', 3)
        executors = {
            'default': ThreadPoolExecutor(max_workers=max_workers)
        }

        # Configure job defaults
        job_defaults = {
            'coalesce': True,  # Combine multiple pending instances into one
            'max_instances': app.config.get('SCHEDULER_MAX_INSTANCES', 3),
            'misfire_grace_time': 60  # Allow 60 seconds for delayed starts
        }

        # Create scheduler
        self.scheduler = BackgroundScheduler(
            executors=executors,
            job_defaults=job_defaults,
            timezone='UTC'
        )

        # Start scheduler
        self.scheduler.start()
        logger.info(f"APScheduler started with {max_workers} max workers")

        # Register shutdown handler
        import atexit
        atexit.register(lambda: self.shutdown())

    def shutdown(self):
        """
        Shutdown scheduler gracefully.

        Waits for running jobs to complete before shutting down.
        """
        if self.scheduler:
            logger.info("Shutting down APScheduler...")
            self.scheduler.shutdown(wait=True)
            logger.info("APScheduler shutdown complete")
            self.scheduler = None

    def queue_scan(self, scan_id: int, config_file: str) -> str:
        """
        Queue a scan for immediate background execution.

        Args:
            scan_id: Database ID of the scan
            config_file: Path to YAML configuration file

        Returns:
            Job ID from APScheduler

        Raises:
            RuntimeError: If scheduler not initialized
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        # Add job to run immediately
        job = self.scheduler.add_job(
            func=execute_scan,
            args=[scan_id, config_file, self.db_url],
            id=f'scan_{scan_id}',
            name=f'Scan {scan_id}',
            replace_existing=True,
            misfire_grace_time=300  # 5 minutes
        )

        logger.info(f"Queued scan {scan_id} for background execution (job_id={job.id})")
        return job.id

    def add_scheduled_scan(self, schedule_id: int, config_file: str,
                           cron_expression: str) -> str:
        """
        Add a recurring scheduled scan.

        Args:
            schedule_id: Database ID of the schedule
            config_file: Path to YAML configuration file
            cron_expression: Cron expression (e.g., "0 2 * * *" for 2am daily)

        Returns:
            Job ID from APScheduler

        Raises:
            RuntimeError: If scheduler not initialized
            ValueError: If cron expression is invalid

        Note:
            This is a placeholder for the Phase 3 scheduled scanning feature.
            Currently not used, but the structure is in place.
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        # Parse cron expression
        # Format: "minute hour day month day_of_week"
        parts = cron_expression.split()
        if len(parts) != 5:
            raise ValueError(f"Invalid cron expression: {cron_expression}")

        minute, hour, day, month, day_of_week = parts

        # Add cron job (currently placeholder - will be enhanced in Phase 3)
        job = self.scheduler.add_job(
            func=self._trigger_scheduled_scan,
            args=[schedule_id, config_file],
            trigger='cron',
            minute=minute,
            hour=hour,
            day=day,
            month=month,
            day_of_week=day_of_week,
            id=f'schedule_{schedule_id}',
            name=f'Schedule {schedule_id}',
            replace_existing=True
        )

        logger.info(f"Added scheduled scan {schedule_id} with cron '{cron_expression}' (job_id={job.id})")
        return job.id

    def remove_scheduled_scan(self, schedule_id: int):
        """
        Remove a scheduled scan job.

        Args:
            schedule_id: Database ID of the schedule

        Raises:
            RuntimeError: If scheduler not initialized
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        job_id = f'schedule_{schedule_id}'
        try:
            self.scheduler.remove_job(job_id)
            logger.info(f"Removed scheduled scan job: {job_id}")
        except Exception as e:
            logger.warning(f"Failed to remove scheduled scan job {job_id}: {str(e)}")

    def _trigger_scheduled_scan(self, schedule_id: int, config_file: str):
        """
        Internal method to trigger a scan from a schedule.

        Creates a new scan record and queues it for execution.

        Args:
            schedule_id: Database ID of the schedule
            config_file: Path to YAML configuration file

        Note:
            This will be fully implemented in Phase 3 when scheduled
            scanning is added. Currently a placeholder.
        """
        logger.info(f"Scheduled scan triggered: schedule_id={schedule_id}")
        # TODO: In Phase 3, this will:
        # 1. Create a new Scan record with triggered_by='scheduled'
        # 2. Call queue_scan() with the new scan_id
        # 3. Update schedule's last_run and next_run timestamps

    def get_job_status(self, job_id: str) -> Optional[dict]:
        """
        Get status of a scheduled job.

        Args:
            job_id: APScheduler job ID

        Returns:
            Dictionary with job information, or None if not found
        """
        if not self.scheduler:
            return None

        job = self.scheduler.get_job(job_id)
        if not job:
            return None

        return {
            'id': job.id,
            'name': job.name,
            'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
            'trigger': str(job.trigger)
        }

    def list_jobs(self) -> list:
        """
        List all scheduled jobs.

        Returns:
            List of job information dictionaries
        """
        if not self.scheduler:
            return []

        jobs = self.scheduler.get_jobs()
        return [
            {
                'id': job.id,
                'name': job.name,
                'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
                'trigger': str(job.trigger)
            }
            for job in jobs
        ]