Phase 2 Step 3: Implement Background Job Queue

Implemented APScheduler integration for background scan execution, enabling async job processing without blocking HTTP requests. ## Changes ### Background Jobs (web/jobs/) - scan_job.py - Execute scans in background threads - execute_scan() with isolated database sessions - Comprehensive error handling and logging - Scan status lifecycle tracking - Timing and error message storage ### Scheduler Service (web/services/scheduler_service.py) - SchedulerService class for job management - APScheduler BackgroundScheduler integration - ThreadPoolExecutor for concurrent jobs (max 3 workers) - queue_scan() - Immediate job execution - Job monitoring: list_jobs(), get_job_status() - Graceful shutdown handling ### Flask Integration (web/app.py) - init_scheduler() function - Scheduler initialization in app factory - Stored scheduler in app context (app.scheduler) ### Database Schema (migration 003) - Added scan timing fields: - started_at - Scan execution start time - completed_at - Scan execution completion time - error_message - Error details for failed scans ### Service Layer Updates (web/services/scan_service.py) - trigger_scan() accepts scheduler parameter - Queues background jobs after creating scan record - get_scan_status() includes new timing and error fields - _save_scan_to_db() sets completed_at timestamp ### API Updates (web/api/scans.py) - POST /api/scans passes scheduler to trigger_scan() - Scans now execute in background automatically ### Model Updates (web/models.py) - Added started_at, completed_at, error_message to Scan model ### Testing (tests/test_background_jobs.py) - 13 unit tests for background job execution - Scheduler initialization and configuration tests - Job queuing and status tracking tests - Scan timing field tests - Error handling and storage tests - Integration test for full workflow (skipped by default) ## Features - Async scan execution without blocking HTTP requests - Concurrent scan support (configurable max workers) - Isolated database sessions per background thread - Scan lifecycle tracking: created → running → completed/failed - Error messages captured and stored in database - Job monitoring and management capabilities - Graceful shutdown waits for running jobs ## Implementation Notes - Scanner runs in subprocess from background thread - Docker provides necessary privileges (--privileged, --network host) - Each job gets isolated SQLAlchemy session (avoid locking) - Job IDs follow pattern: scan_{scan_id} - Background jobs survive across requests - Failed jobs store error messages in database ## Documentation (docs/ai/PHASE2.md) - Updated progress: 6/14 days complete (43%) - Marked Step 3 as complete - Added detailed implementation notes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-14 09:24:00 -06:00
parent 6c4905d6c1
commit ee0c5a2c3c
10 changed files with 810 additions and 33 deletions
--- a/web/services/scan_service.py
+++ b/web/services/scan_service.py
@@ -42,7 +42,7 @@ class ScanService:
        self.db = db_session

    def trigger_scan(self, config_file: str, triggered_by: str = 'manual',
-                    schedule_id: Optional[int] = None) -> int:
+                    schedule_id: Optional[int] = None, scheduler=None) -> int:
        """
        Trigger a new scan.

@@ -53,6 +53,7 @@ class ScanService:
            config_file: Path to YAML configuration file
            triggered_by: Source that triggered scan (manual, scheduled, api)
            schedule_id: Optional schedule ID if triggered by schedule
+            scheduler: Optional SchedulerService instance for queuing background jobs

        Returns:
            Scan ID of the created scan
@@ -87,8 +88,21 @@ class ScanService:

        logger.info(f"Scan {scan.id} triggered via {triggered_by}")

-        # NOTE: Background job queuing will be implemented in Step 3
-        # For now, just return the scan ID
+        # Queue background job if scheduler provided
+        if scheduler:
+            try:
+                job_id = scheduler.queue_scan(scan.id, config_file)
+                logger.info(f"Scan {scan.id} queued for background execution (job_id={job_id})")
+            except Exception as e:
+                logger.error(f"Failed to queue scan {scan.id}: {str(e)}")
+                # Mark scan as failed if job queuing fails
+                scan.status = 'failed'
+                scan.error_message = f"Failed to queue background job: {str(e)}"
+                self.db.commit()
+                raise
+        else:
+            logger.warning(f"Scan {scan.id} created but not queued (no scheduler provided)")
+
        return scan.id

    def get_scan(self, scan_id: int) -> Optional[Dict[str, Any]]:
@@ -230,7 +244,9 @@ class ScanService:
            'scan_id': scan.id,
            'status': scan.status,
            'title': scan.title,
-            'started_at': scan.timestamp.isoformat() if scan.timestamp else None,
+            'timestamp': scan.timestamp.isoformat() if scan.timestamp else None,
+            'started_at': scan.started_at.isoformat() if scan.started_at else None,
+            'completed_at': scan.completed_at.isoformat() if scan.completed_at else None,
            'duration': scan.duration,
            'triggered_by': scan.triggered_by
        }
@@ -242,6 +258,7 @@ class ScanService:
            status_info['progress'] = 'Complete'
        elif scan.status == 'failed':
            status_info['progress'] = 'Failed'
+            status_info['error_message'] = scan.error_message

        return status_info

@@ -265,6 +282,7 @@ class ScanService:
        # Update scan record
        scan.status = status
        scan.duration = report.get('scan_duration')
+        scan.completed_at = datetime.utcnow()

        # Map report data to database models
        self._map_report_to_models(report, scan)
--- a/web/services/scheduler_service.py
+++ b/web/services/scheduler_service.py
@@ -0,0 +1,257 @@
+"""
+Scheduler service for managing background jobs and scheduled scans.
+
+This service integrates APScheduler with Flask to enable background
+scan execution and future scheduled scanning capabilities.
+"""
+
+import logging
+from datetime import datetime
+from typing import Optional
+
+from apscheduler.schedulers.background import BackgroundScheduler
+from apscheduler.executors.pool import ThreadPoolExecutor
+from flask import Flask
+
+from web.jobs.scan_job import execute_scan
+
+logger = logging.getLogger(__name__)
+
+
+class SchedulerService:
+    """
+    Service for managing background job scheduling.
+
+    Uses APScheduler's BackgroundScheduler to run scans asynchronously
+    without blocking HTTP requests.
+    """
+
+    def __init__(self):
+        """Initialize scheduler service (scheduler not started yet)."""
+        self.scheduler: Optional[BackgroundScheduler] = None
+        self.db_url: Optional[str] = None
+
+    def init_scheduler(self, app: Flask):
+        """
+        Initialize and start APScheduler with Flask app.
+
+        Args:
+            app: Flask application instance
+
+        Configuration:
+            - BackgroundScheduler: Runs in separate thread
+            - ThreadPoolExecutor: Allows concurrent scan execution
+            - Max workers: 3 (configurable via SCHEDULER_MAX_WORKERS)
+        """
+        if self.scheduler:
+            logger.warning("Scheduler already initialized")
+            return
+
+        # Store database URL for passing to background jobs
+        self.db_url = app.config['SQLALCHEMY_DATABASE_URI']
+
+        # Configure executor for concurrent jobs
+        max_workers = app.config.get('SCHEDULER_MAX_WORKERS', 3)
+        executors = {
+            'default': ThreadPoolExecutor(max_workers=max_workers)
+        }
+
+        # Configure job defaults
+        job_defaults = {
+            'coalesce': True,  # Combine multiple pending instances into one
+            'max_instances': app.config.get('SCHEDULER_MAX_INSTANCES', 3),
+            'misfire_grace_time': 60  # Allow 60 seconds for delayed starts
+        }
+
+        # Create scheduler
+        self.scheduler = BackgroundScheduler(
+            executors=executors,
+            job_defaults=job_defaults,
+            timezone='UTC'
+        )
+
+        # Start scheduler
+        self.scheduler.start()
+        logger.info(f"APScheduler started with {max_workers} max workers")
+
+        # Register shutdown handler
+        import atexit
+        atexit.register(lambda: self.shutdown())
+
+    def shutdown(self):
+        """
+        Shutdown scheduler gracefully.
+
+        Waits for running jobs to complete before shutting down.
+        """
+        if self.scheduler:
+            logger.info("Shutting down APScheduler...")
+            self.scheduler.shutdown(wait=True)
+            logger.info("APScheduler shutdown complete")
+            self.scheduler = None
+
+    def queue_scan(self, scan_id: int, config_file: str) -> str:
+        """
+        Queue a scan for immediate background execution.
+
+        Args:
+            scan_id: Database ID of the scan
+            config_file: Path to YAML configuration file
+
+        Returns:
+            Job ID from APScheduler
+
+        Raises:
+            RuntimeError: If scheduler not initialized
+        """
+        if not self.scheduler:
+            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")
+
+        # Add job to run immediately
+        job = self.scheduler.add_job(
+            func=execute_scan,
+            args=[scan_id, config_file, self.db_url],
+            id=f'scan_{scan_id}',
+            name=f'Scan {scan_id}',
+            replace_existing=True,
+            misfire_grace_time=300  # 5 minutes
+        )
+
+        logger.info(f"Queued scan {scan_id} for background execution (job_id={job.id})")
+        return job.id
+
+    def add_scheduled_scan(self, schedule_id: int, config_file: str,
+                          cron_expression: str) -> str:
+        """
+        Add a recurring scheduled scan.
+
+        Args:
+            schedule_id: Database ID of the schedule
+            config_file: Path to YAML configuration file
+            cron_expression: Cron expression (e.g., "0 2 * * *" for 2am daily)
+
+        Returns:
+            Job ID from APScheduler
+
+        Raises:
+            RuntimeError: If scheduler not initialized
+            ValueError: If cron expression is invalid
+
+        Note:
+            This is a placeholder for Phase 3 scheduled scanning feature.
+            Currently not used, but structure is in place.
+        """
+        if not self.scheduler:
+            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")
+
+        # Parse cron expression
+        # Format: "minute hour day month day_of_week"
+        parts = cron_expression.split()
+        if len(parts) != 5:
+            raise ValueError(f"Invalid cron expression: {cron_expression}")
+
+        minute, hour, day, month, day_of_week = parts
+
+        # Add cron job (currently placeholder - will be enhanced in Phase 3)
+        job = self.scheduler.add_job(
+            func=self._trigger_scheduled_scan,
+            args=[schedule_id, config_file],
+            trigger='cron',
+            minute=minute,
+            hour=hour,
+            day=day,
+            month=month,
+            day_of_week=day_of_week,
+            id=f'schedule_{schedule_id}',
+            name=f'Schedule {schedule_id}',
+            replace_existing=True
+        )
+
+        logger.info(f"Added scheduled scan {schedule_id} with cron '{cron_expression}' (job_id={job.id})")
+        return job.id
+
+    def remove_scheduled_scan(self, schedule_id: int):
+        """
+        Remove a scheduled scan job.
+
+        Args:
+            schedule_id: Database ID of the schedule
+
+        Raises:
+            RuntimeError: If scheduler not initialized
+        """
+        if not self.scheduler:
+            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")
+
+        job_id = f'schedule_{schedule_id}'
+
+        try:
+            self.scheduler.remove_job(job_id)
+            logger.info(f"Removed scheduled scan job: {job_id}")
+        except Exception as e:
+            logger.warning(f"Failed to remove scheduled scan job {job_id}: {str(e)}")
+
+    def _trigger_scheduled_scan(self, schedule_id: int, config_file: str):
+        """
+        Internal method to trigger a scan from a schedule.
+
+        Creates a new scan record and queues it for execution.
+
+        Args:
+            schedule_id: Database ID of the schedule
+            config_file: Path to YAML configuration file
+
+        Note:
+            This will be fully implemented in Phase 3 when scheduled
+            scanning is added. Currently a placeholder.
+        """
+        logger.info(f"Scheduled scan triggered: schedule_id={schedule_id}")
+        # TODO: In Phase 3, this will:
+        # 1. Create a new Scan record with triggered_by='scheduled'
+        # 2. Call queue_scan() with the new scan_id
+        # 3. Update schedule's last_run and next_run timestamps
+
+    def get_job_status(self, job_id: str) -> Optional[dict]:
+        """
+        Get status of a scheduled job.
+
+        Args:
+            job_id: APScheduler job ID
+
+        Returns:
+            Dictionary with job information, or None if not found
+        """
+        if not self.scheduler:
+            return None
+
+        job = self.scheduler.get_job(job_id)
+        if not job:
+            return None
+
+        return {
+            'id': job.id,
+            'name': job.name,
+            'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
+            'trigger': str(job.trigger)
+        }
+
+    def list_jobs(self) -> list:
+        """
+        List all scheduled jobs.
+
+        Returns:
+            List of job information dictionaries
+        """
+        if not self.scheduler:
+            return []
+
+        jobs = self.scheduler.get_jobs()
+        return [
+            {
+                'id': job.id,
+                'name': job.name,
+                'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
+                'trigger': str(job.trigger)
+            }
+            for job in jobs
+        ]