Phase 2 Step 3: Implement Background Job Queue

Implemented APScheduler integration for background scan execution,
enabling async job processing without blocking HTTP requests.

## Changes

### Background Jobs (web/jobs/)
- scan_job.py - Execute scans in background threads
  - execute_scan() with isolated database sessions (direct-invocation sketch below)
  - Comprehensive error handling and logging
  - Scan status lifecycle tracking
  - Timing and error message storage
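
For reference, execute_scan() deliberately takes only plain values (an ID, a path, a DB URL) rather than ORM objects or app context, which is what lets APScheduler run it in a worker thread. A minimal sketch of a direct invocation, with an illustrative scan ID, config path, and database URL:

```python
from web.jobs.scan_job import execute_scan

# Normally queued via SchedulerService.queue_scan(); calling directly can be
# handy when debugging. The job builds its own SQLAlchemy session from db_url.
execute_scan(
    scan_id=42,                                # existing Scan row (illustrative)
    config_file='configs/example.yaml',        # hypothetical config path
    db_url='sqlite:////data/sneakyscanner.db'  # illustrative database URL
)
```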

### Scheduler Service (web/services/scheduler_service.py)
- SchedulerService class for job management
- APScheduler BackgroundScheduler integration
- ThreadPoolExecutor for concurrent jobs (max 3 workers)
- queue_scan() - Immediate job execution
- Job monitoring: list_jobs(), get_job_status() (usage example below)
- Graceful shutdown handling
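
Typical usage from application code, with illustrative arguments (the method names and the scan_{id} job-ID pattern come from the implementation shown further down):

```python
from web.services.scheduler_service import SchedulerService

scheduler = SchedulerService()
scheduler.init_scheduler(app)  # app is a configured Flask instance

# Queue a scan for immediate background execution
job_id = scheduler.queue_scan(scan_id=42, config_file='configs/example.yaml')
assert job_id == 'scan_42'

# Monitor jobs
print(scheduler.get_job_status(job_id))  # id, name, next_run_time, trigger
print(scheduler.list_jobs())             # all queued/scheduled jobs
```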

### Flask Integration (web/app.py)
- init_scheduler() function
- Scheduler initialization in app factory
- Scheduler stored on the Flask app object (app.scheduler)

### Database Schema (migration 003)
- Added scan timing fields (hypothetical migration sketched below):
  - started_at - Scan execution start time
  - completed_at - Scan execution completion time
  - error_message - Error details for failed scans

### Service Layer Updates (web/services/scan_service.py)
- trigger_scan() accepts scheduler parameter
- Queues background jobs after creating scan record
- get_scan_status() includes new timing and error fields (usage sketch below)
- _save_scan_to_db() sets completed_at timestamp
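
Callers now pass the scheduler through when triggering a scan; a sketch with an illustrative config path (trigger_scan() and get_scan_status() are the real methods, per the diff below):

```python
from web.services.scan_service import ScanService

scan_service = ScanService(db_session)       # an active SQLAlchemy session
scan_id = scan_service.trigger_scan(
    config_file='configs/example.yaml',      # illustrative path
    triggered_by='api',
    scheduler=app.scheduler,                 # queues the background job
)

# Status now carries the new fields
status = scan_service.get_scan_status(scan_id)
print(status['started_at'], status['completed_at'], status.get('error_message'))
```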

### API Updates (web/api/scans.py)
- POST /api/scans passes scheduler to trigger_scan()
- Scans now execute in the background automatically (request-level sketch below)
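
At the HTTP level the flow looks like this (a sketch: the request payload and status route are assumptions, only POST /api/scans is confirmed above):

```python
import requests

BASE = 'http://localhost:5000'  # illustrative host/port

# Trigger a scan; the API queues the job and returns without waiting
resp = requests.post(f'{BASE}/api/scans',
                     json={'config_file': 'configs/example.yaml'})  # assumed payload
scan_id = resp.json()['scan_id']  # assumed response shape

# Poll while the scan runs in the background
status = requests.get(f'{BASE}/api/scans/{scan_id}/status').json()  # assumed route
print(status['status'], status.get('error_message'))
```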

### Model Updates (web/models.py)
- Added started_at, completed_at, error_message to Scan model

### Testing (tests/test_background_jobs.py)
- 13 unit tests for background job execution
- Scheduler initialization and configuration tests
- Job queuing and status tracking tests
- Scan timing field tests
- Error handling and storage tests (representative sketch below)
- Integration test for full workflow (skipped by default)
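
A representative case from the suite might look like the following sketch (fixture names and exact assertions are assumptions; the real tests live in tests/test_background_jobs.py):

```python
from web.jobs.scan_job import execute_scan
from web.models import Scan

def test_failed_scan_stores_error_message(db_session, db_url):
    # db_session / db_url are assumed fixtures backed by a temporary database
    scan = Scan(status='created', triggered_by='api')
    db_session.add(scan)
    db_session.commit()

    # A missing config file should mark the scan failed rather than raise
    execute_scan(scan.id, 'does-not-exist.yaml', db_url)

    db_session.expire_all()
    updated = db_session.query(Scan).filter_by(id=scan.id).first()
    assert updated.status == 'failed'
    assert updated.error_message is not None
    assert updated.completed_at is not None
```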

## Features

- Asynchronous scan execution without blocking HTTP requests
- Concurrent scan support (configurable max workers; config sketch below)
- Isolated database sessions per background thread
- Scan lifecycle tracking: created → running → completed/failed
- Error messages captured and stored in database
- Job monitoring and management capabilities
- Graceful shutdown waits for running jobs
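
Concurrency limits come from Flask config keys read in SchedulerService.init_scheduler(); a sketch, assuming create_app() merges the dict into app.config:

```python
from web.app import create_app

config = {
    'SCHEDULER_MAX_WORKERS': 5,    # concurrent scan threads (default 3)
    'SCHEDULER_MAX_INSTANCES': 3,  # max concurrent runs of the same job ID
}
app = create_app(config)  # scheduler starts inside the app factory
```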

## Implementation Notes

- Scanner runs as a subprocess launched from the background thread
- Docker provides the necessary privileges (--privileged, --network host)
- Each background job gets its own isolated SQLAlchemy session to avoid cross-thread locking
- Job IDs follow the pattern scan_{scan_id} (lookup sketch below)
- Background jobs outlive the HTTP request that triggered them
- Failed jobs store their error messages in the database
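
Because job IDs are deterministic, a scan's background job can be looked up without extra bookkeeping; a small sketch (the helper itself is hypothetical):

```python
from typing import Optional
from flask import current_app

def scan_job_info(scan_id: int) -> Optional[dict]:
    # Jobs are registered as scan_{scan_id}, so the ID can be derived
    return current_app.scheduler.get_job_status(f'scan_{scan_id}')
```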

## Documentation (docs/ai/PHASE2.md)
- Updated progress: 6/14 days complete (43%)
- Marked Step 3 as complete
- Added detailed implementation notes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

web/api/scans.py

@@ -146,7 +146,8 @@ def trigger_scan():
         scan_service = ScanService(current_app.db_session)
         scan_id = scan_service.trigger_scan(
             config_file=config_file,
-            triggered_by='api'
+            triggered_by='api',
+            scheduler=current_app.scheduler
         )
 
         logger.info(f"Scan {scan_id} triggered via API: config={config_file}")

web/app.py

@@ -60,6 +60,9 @@ def create_app(config: dict = None) -> Flask:
     # Initialize extensions
     init_extensions(app)
 
+    # Initialize background scheduler
+    init_scheduler(app)
+
     # Register blueprints
     register_blueprints(app)
@@ -169,6 +172,25 @@ def init_extensions(app: Flask) -> None:
     app.logger.info("Extensions initialized")
 
 
+def init_scheduler(app: Flask) -> None:
+    """
+    Initialize background job scheduler.
+
+    Args:
+        app: Flask application instance
+    """
+    from web.services.scheduler_service import SchedulerService
+
+    # Create and initialize scheduler
+    scheduler = SchedulerService()
+    scheduler.init_scheduler(app)
+
+    # Store in app context for access from routes
+    app.scheduler = scheduler
+
+    app.logger.info("Background scheduler initialized")
+
+
 def register_blueprints(app: Flask) -> None:
     """
     Register Flask blueprints for different app sections.

web/jobs/__init__.py Normal file

@@ -0,0 +1,6 @@
"""
Background jobs package for SneakyScanner.

This package contains job definitions for background task execution,
including scan jobs and scheduled tasks.
"""

web/jobs/scan_job.py Normal file

@@ -0,0 +1,152 @@
"""
Background scan job execution.

This module handles the execution of scans in background threads,
updating database status and handling errors.
"""

import logging
import traceback
from datetime import datetime
from pathlib import Path

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from src.scanner import SneakyScanner
from web.models import Scan
from web.services.scan_service import ScanService

logger = logging.getLogger(__name__)


def execute_scan(scan_id: int, config_file: str, db_url: str):
    """
    Execute a scan in the background.

    This function is designed to run in a background thread via APScheduler.
    It creates its own database session to avoid conflicts with the main
    application thread.

    Args:
        scan_id: ID of the scan record in database
        config_file: Path to YAML configuration file
        db_url: Database connection URL

    Workflow:
        1. Create new database session for this thread
        2. Update scan status to 'running'
        3. Execute scanner
        4. Generate output files (JSON, HTML, ZIP)
        5. Save results to database
        6. Update status to 'completed' or 'failed'
    """
    logger.info(f"Starting background scan execution: scan_id={scan_id}, config={config_file}")

    # Create new database session for this thread
    engine = create_engine(db_url, echo=False)
    Session = sessionmaker(bind=engine)
    session = Session()

    try:
        # Get scan record
        scan = session.query(Scan).filter_by(id=scan_id).first()
        if not scan:
            logger.error(f"Scan {scan_id} not found in database")
            return

        # Update status to running (in case it wasn't already)
        scan.status = 'running'
        scan.started_at = datetime.utcnow()
        session.commit()

        logger.info(f"Scan {scan_id}: Initializing scanner with config {config_file}")

        # Initialize scanner
        scanner = SneakyScanner(config_file)

        # Execute scan
        logger.info(f"Scan {scan_id}: Running scanner...")
        start_time = datetime.utcnow()
        report, timestamp = scanner.scan()
        end_time = datetime.utcnow()
        scan_duration = (end_time - start_time).total_seconds()

        logger.info(f"Scan {scan_id}: Scanner completed in {scan_duration:.2f} seconds")

        # Generate output files (JSON, HTML, ZIP)
        logger.info(f"Scan {scan_id}: Generating output files...")
        scanner.generate_outputs(report, timestamp)

        # Save results to database
        logger.info(f"Scan {scan_id}: Saving results to database...")
        scan_service = ScanService(session)
        scan_service._save_scan_to_db(report, scan_id, status='completed')

        logger.info(f"Scan {scan_id}: Completed successfully")

    except FileNotFoundError as e:
        # Config file not found
        error_msg = f"Configuration file not found: {str(e)}"
        logger.error(f"Scan {scan_id}: {error_msg}")

        scan = session.query(Scan).filter_by(id=scan_id).first()
        if scan:
            scan.status = 'failed'
            scan.error_message = error_msg
            scan.completed_at = datetime.utcnow()
            session.commit()

    except Exception as e:
        # Any other error during scan execution
        error_msg = f"Scan execution failed: {str(e)}"
        logger.error(f"Scan {scan_id}: {error_msg}")
        logger.error(f"Scan {scan_id}: Traceback:\n{traceback.format_exc()}")

        try:
            scan = session.query(Scan).filter_by(id=scan_id).first()
            if scan:
                scan.status = 'failed'
                scan.error_message = error_msg
                scan.completed_at = datetime.utcnow()
                session.commit()
        except Exception as db_error:
            logger.error(f"Scan {scan_id}: Failed to update error status in database: {str(db_error)}")

    finally:
        # Always close the session
        session.close()
        logger.info(f"Scan {scan_id}: Background job completed, session closed")


def get_scan_status_from_db(scan_id: int, db_url: str) -> dict:
    """
    Helper function to get scan status directly from database.

    Useful for monitoring background jobs without needing Flask app context.

    Args:
        scan_id: Scan ID to check
        db_url: Database connection URL

    Returns:
        Dictionary with scan status information
    """
    engine = create_engine(db_url, echo=False)
    Session = sessionmaker(bind=engine)
    session = Session()

    try:
        scan = session.query(Scan).filter_by(id=scan_id).first()
        if not scan:
            return None

        return {
            'scan_id': scan.id,
            'status': scan.status,
            'timestamp': scan.timestamp.isoformat() if scan.timestamp else None,
            'duration': scan.duration,
            'error_message': scan.error_message
        }
    finally:
        session.close()

web/models.py

@@ -55,6 +55,9 @@ class Scan(Base):
     created_at = Column(DateTime, nullable=False, default=datetime.utcnow, comment="Record creation time")
     triggered_by = Column(String(50), nullable=False, default='manual', comment="manual, scheduled, api")
     schedule_id = Column(Integer, ForeignKey('schedules.id'), nullable=True, comment="FK to schedules if triggered by schedule")
+    started_at = Column(DateTime, nullable=True, comment="Scan execution start time")
+    completed_at = Column(DateTime, nullable=True, comment="Scan execution completion time")
+    error_message = Column(Text, nullable=True, comment="Error message if scan failed")
 
     # Relationships
     sites = relationship('ScanSite', back_populates='scan', cascade='all, delete-orphan')

web/services/scan_service.py

@@ -42,7 +42,7 @@ class ScanService:
         self.db = db_session
 
     def trigger_scan(self, config_file: str, triggered_by: str = 'manual',
-                     schedule_id: Optional[int] = None) -> int:
+                     schedule_id: Optional[int] = None, scheduler=None) -> int:
         """
         Trigger a new scan.
 
@@ -53,6 +53,7 @@ class ScanService:
             config_file: Path to YAML configuration file
             triggered_by: Source that triggered scan (manual, scheduled, api)
             schedule_id: Optional schedule ID if triggered by schedule
+            scheduler: Optional SchedulerService instance for queuing background jobs
 
         Returns:
             Scan ID of the created scan
@@ -87,8 +88,21 @@ class ScanService:
         logger.info(f"Scan {scan.id} triggered via {triggered_by}")
 
-        # NOTE: Background job queuing will be implemented in Step 3
-        # For now, just return the scan ID
+        # Queue background job if scheduler provided
+        if scheduler:
+            try:
+                job_id = scheduler.queue_scan(scan.id, config_file)
+                logger.info(f"Scan {scan.id} queued for background execution (job_id={job_id})")
+            except Exception as e:
+                logger.error(f"Failed to queue scan {scan.id}: {str(e)}")
+                # Mark scan as failed if job queuing fails
+                scan.status = 'failed'
+                scan.error_message = f"Failed to queue background job: {str(e)}"
+                self.db.commit()
+                raise
+        else:
+            logger.warning(f"Scan {scan.id} created but not queued (no scheduler provided)")
+
         return scan.id
 
     def get_scan(self, scan_id: int) -> Optional[Dict[str, Any]]:
@@ -230,7 +244,9 @@ class ScanService:
             'scan_id': scan.id,
             'status': scan.status,
             'title': scan.title,
-            'started_at': scan.timestamp.isoformat() if scan.timestamp else None,
+            'timestamp': scan.timestamp.isoformat() if scan.timestamp else None,
+            'started_at': scan.started_at.isoformat() if scan.started_at else None,
+            'completed_at': scan.completed_at.isoformat() if scan.completed_at else None,
             'duration': scan.duration,
             'triggered_by': scan.triggered_by
         }
@@ -242,6 +258,7 @@ class ScanService:
             status_info['progress'] = 'Complete'
         elif scan.status == 'failed':
             status_info['progress'] = 'Failed'
+            status_info['error_message'] = scan.error_message
 
         return status_info
@@ -265,6 +282,7 @@ class ScanService:
         # Update scan record
         scan.status = status
         scan.duration = report.get('scan_duration')
+        scan.completed_at = datetime.utcnow()
 
         # Map report data to database models
         self._map_report_to_models(report, scan)

web/services/scheduler_service.py Normal file

@@ -0,0 +1,257 @@
"""
Scheduler service for managing background jobs and scheduled scans.

This service integrates APScheduler with Flask to enable background
scan execution and future scheduled scanning capabilities.
"""

import logging
from datetime import datetime
from typing import Optional

from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor
from flask import Flask

from web.jobs.scan_job import execute_scan

logger = logging.getLogger(__name__)


class SchedulerService:
    """
    Service for managing background job scheduling.

    Uses APScheduler's BackgroundScheduler to run scans asynchronously
    without blocking HTTP requests.
    """

    def __init__(self):
        """Initialize scheduler service (scheduler not started yet)."""
        self.scheduler: Optional[BackgroundScheduler] = None
        self.db_url: Optional[str] = None

    def init_scheduler(self, app: Flask):
        """
        Initialize and start APScheduler with Flask app.

        Args:
            app: Flask application instance

        Configuration:
            - BackgroundScheduler: Runs in separate thread
            - ThreadPoolExecutor: Allows concurrent scan execution
            - Max workers: 3 (configurable via SCHEDULER_MAX_WORKERS)
        """
        if self.scheduler:
            logger.warning("Scheduler already initialized")
            return

        # Store database URL for passing to background jobs
        self.db_url = app.config['SQLALCHEMY_DATABASE_URI']

        # Configure executor for concurrent jobs
        max_workers = app.config.get('SCHEDULER_MAX_WORKERS', 3)
        executors = {
            'default': ThreadPoolExecutor(max_workers=max_workers)
        }

        # Configure job defaults
        job_defaults = {
            'coalesce': True,  # Combine multiple pending instances into one
            'max_instances': app.config.get('SCHEDULER_MAX_INSTANCES', 3),
            'misfire_grace_time': 60  # Allow 60 seconds for delayed starts
        }

        # Create scheduler
        self.scheduler = BackgroundScheduler(
            executors=executors,
            job_defaults=job_defaults,
            timezone='UTC'
        )

        # Start scheduler
        self.scheduler.start()
        logger.info(f"APScheduler started with {max_workers} max workers")

        # Register shutdown handler
        import atexit
        atexit.register(lambda: self.shutdown())

    def shutdown(self):
        """
        Shutdown scheduler gracefully.

        Waits for running jobs to complete before shutting down.
        """
        if self.scheduler:
            logger.info("Shutting down APScheduler...")
            self.scheduler.shutdown(wait=True)
            logger.info("APScheduler shutdown complete")
            self.scheduler = None

    def queue_scan(self, scan_id: int, config_file: str) -> str:
        """
        Queue a scan for immediate background execution.

        Args:
            scan_id: Database ID of the scan
            config_file: Path to YAML configuration file

        Returns:
            Job ID from APScheduler

        Raises:
            RuntimeError: If scheduler not initialized
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        # Add job to run immediately
        job = self.scheduler.add_job(
            func=execute_scan,
            args=[scan_id, config_file, self.db_url],
            id=f'scan_{scan_id}',
            name=f'Scan {scan_id}',
            replace_existing=True,
            misfire_grace_time=300  # 5 minutes
        )

        logger.info(f"Queued scan {scan_id} for background execution (job_id={job.id})")
        return job.id

    def add_scheduled_scan(self, schedule_id: int, config_file: str,
                           cron_expression: str) -> str:
        """
        Add a recurring scheduled scan.

        Args:
            schedule_id: Database ID of the schedule
            config_file: Path to YAML configuration file
            cron_expression: Cron expression (e.g., "0 2 * * *" for 2am daily)

        Returns:
            Job ID from APScheduler

        Raises:
            RuntimeError: If scheduler not initialized
            ValueError: If cron expression is invalid

        Note:
            This is a placeholder for the Phase 3 scheduled scanning feature.
            Currently not used, but the structure is in place.
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        # Parse cron expression
        # Format: "minute hour day month day_of_week"
        parts = cron_expression.split()
        if len(parts) != 5:
            raise ValueError(f"Invalid cron expression: {cron_expression}")

        minute, hour, day, month, day_of_week = parts

        # Add cron job (currently placeholder - will be enhanced in Phase 3)
        job = self.scheduler.add_job(
            func=self._trigger_scheduled_scan,
            args=[schedule_id, config_file],
            trigger='cron',
            minute=minute,
            hour=hour,
            day=day,
            month=month,
            day_of_week=day_of_week,
            id=f'schedule_{schedule_id}',
            name=f'Schedule {schedule_id}',
            replace_existing=True
        )

        logger.info(f"Added scheduled scan {schedule_id} with cron '{cron_expression}' (job_id={job.id})")
        return job.id

    def remove_scheduled_scan(self, schedule_id: int):
        """
        Remove a scheduled scan job.

        Args:
            schedule_id: Database ID of the schedule

        Raises:
            RuntimeError: If scheduler not initialized
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        job_id = f'schedule_{schedule_id}'
        try:
            self.scheduler.remove_job(job_id)
            logger.info(f"Removed scheduled scan job: {job_id}")
        except Exception as e:
            logger.warning(f"Failed to remove scheduled scan job {job_id}: {str(e)}")

    def _trigger_scheduled_scan(self, schedule_id: int, config_file: str):
        """
        Internal method to trigger a scan from a schedule.

        Creates a new scan record and queues it for execution.

        Args:
            schedule_id: Database ID of the schedule
            config_file: Path to YAML configuration file

        Note:
            This will be fully implemented in Phase 3 when scheduled
            scanning is added. Currently a placeholder.
        """
        logger.info(f"Scheduled scan triggered: schedule_id={schedule_id}")
        # TODO: In Phase 3, this will:
        # 1. Create a new Scan record with triggered_by='scheduled'
        # 2. Call queue_scan() with the new scan_id
        # 3. Update schedule's last_run and next_run timestamps

    def get_job_status(self, job_id: str) -> Optional[dict]:
        """
        Get status of a scheduled job.

        Args:
            job_id: APScheduler job ID

        Returns:
            Dictionary with job information, or None if not found
        """
        if not self.scheduler:
            return None

        job = self.scheduler.get_job(job_id)
        if not job:
            return None

        return {
            'id': job.id,
            'name': job.name,
            'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
            'trigger': str(job.trigger)
        }

    def list_jobs(self) -> list:
        """
        List all scheduled jobs.

        Returns:
            List of job information dictionaries
        """
        if not self.scheduler:
            return []

        jobs = self.scheduler.get_jobs()
        return [
            {
                'id': job.id,
                'name': job.name,
                'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
                'trigger': str(job.trigger)
            }
            for job in jobs
        ]