Phase 2 Step 3: Implement Background Job Queue
Implemented APScheduler integration for background scan execution,
enabling async job processing without blocking HTTP requests.
## Changes
### Background Jobs (web/jobs/)
- scan_job.py - Execute scans in background threads (usage sketch below)
  - execute_scan() with isolated database sessions
  - Comprehensive error handling and logging
  - Scan status lifecycle tracking
  - Timing and error message storage
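
For orientation, a minimal sketch of calling the job function directly, the way APScheduler's worker thread would. The config path and database URL here are illustrative placeholders, not values from this commit:

```python
# Sketch: invoking the background job function directly. Each call builds its
# own SQLAlchemy session from db_url, so no Flask app context is required.
from web.jobs.scan_job import execute_scan

execute_scan(
    scan_id=1,                              # ID of an existing Scan row
    config_file='configs/example.yaml',     # hypothetical config path
    db_url='sqlite:////app/data/scans.db',  # hypothetical database URL
)
```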
### Scheduler Service (web/services/scheduler_service.py)
- SchedulerService class for job management
- APScheduler BackgroundScheduler integration
- ThreadPoolExecutor for concurrent jobs (max 3 workers)
- queue_scan() - Immediate job execution (usage sketch after this list)
- Job monitoring: list_jobs(), get_job_status()
- Graceful shutdown handling
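
Putting those pieces together, a usage sketch of the service lifecycle, assuming a Flask `app` instance already exists and using a placeholder config path:

```python
# Sketch of the SchedulerService lifecycle. SCHEDULER_MAX_WORKERS is read
# from app.config (default 3); queue_scan() runs the job immediately.
from web.services.scheduler_service import SchedulerService

scheduler = SchedulerService()
scheduler.init_scheduler(app)            # starts APScheduler's thread pool

job_id = scheduler.queue_scan(scan_id=1, config_file='configs/example.yaml')
print(job_id)                            # -> 'scan_1'

print(scheduler.list_jobs())             # metadata for pending/running jobs
print(scheduler.get_job_status(job_id))  # None once the one-shot job has run

scheduler.shutdown()                     # waits for running jobs to finish
```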
### Flask Integration (web/app.py)
- init_scheduler() function
- Scheduler initialization in app factory
- Scheduler stored in app context (app.scheduler); access sketch below
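
A sketch of how a request handler reaches the scheduler stored on the app; the route and response shape here are illustrative, not part of this commit's API:

```python
# Hypothetical route showing app-context access to the scheduler.
from flask import Blueprint, current_app, jsonify

bp = Blueprint('jobs', __name__)

@bp.route('/api/jobs')
def list_jobs():
    # app.scheduler was attached by init_scheduler() in the app factory
    return jsonify(current_app.scheduler.list_jobs())
```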
### Database Schema (migration 003)
- Added scan timing fields (hypothetical migration sketch below):
  - started_at - Scan execution start time
  - completed_at - Scan execution completion time
  - error_message - Error details for failed scans
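
The migration file itself is not included in this diff; what follows is a hypothetical Alembic-style sketch of what migration 003 adds, assuming the Scan model maps to a `scans` table (the project's actual migration mechanism may differ):

```python
# Hypothetical sketch of migration 003; table name 'scans' is assumed.
from alembic import op
import sqlalchemy as sa

def upgrade():
    op.add_column('scans', sa.Column('started_at', sa.DateTime(), nullable=True))
    op.add_column('scans', sa.Column('completed_at', sa.DateTime(), nullable=True))
    op.add_column('scans', sa.Column('error_message', sa.Text(), nullable=True))

def downgrade():
    op.drop_column('scans', 'error_message')
    op.drop_column('scans', 'completed_at')
    op.drop_column('scans', 'started_at')
```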
### Service Layer Updates (web/services/scan_service.py)
- trigger_scan() accepts scheduler parameter (call sketch after this list)
- Queues background jobs after creating scan record
- get_scan_status() includes new timing and error fields
- _save_scan_to_db() sets completed_at timestamp
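
A call sketch mirroring what the API layer now does; `db_session` and `app` are assumed to come from the surrounding Flask context, and the config path is a placeholder:

```python
# Service-layer call: create the scan record, then queue the background job.
from web.services.scan_service import ScanService

scan_service = ScanService(db_session)
scan_id = scan_service.trigger_scan(
    config_file='configs/example.yaml',  # placeholder path
    triggered_by='api',
    scheduler=app.scheduler,  # omit to create the record without queuing
)
```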
### API Updates (web/api/scans.py)
- POST /api/scans passes scheduler to trigger_scan()
- Scans now execute in the background automatically (hypothetical client sketch below)
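
A hypothetical client-side sketch: POST /api/scans is the endpoint touched by this commit, but the JSON field name, response shape, and status route are assumptions:

```python
# Sketch: trigger a scan over HTTP and poll for completion.
import requests

resp = requests.post('http://localhost:5000/api/scans',
                     json={'config_file': 'configs/example.yaml'})
scan_id = resp.json()['scan_id']

# The POST returns as soon as the record is created and the job is queued;
# progress is observed by polling a status endpoint (path assumed here).
status = requests.get(f'http://localhost:5000/api/scans/{scan_id}/status').json()
print(status['status'])  # created -> running -> completed / failed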
### Model Updates (web/models.py)
- Added started_at, completed_at, error_message to Scan model
### Testing (tests/test_background_jobs.py)
- 13 unit tests for background job execution
- Scheduler initialization and configuration tests
- Job queuing and status tracking tests
- Scan timing field tests
- Error handling and storage tests
- Integration test for full workflow (skipped by default); a representative test sketch follows
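
A representative test in the spirit of tests/test_background_jobs.py (the actual test file is not shown in this diff; the test name is illustrative):

```python
# queue_scan() must fail fast when init_scheduler() was never called.
import pytest

from web.services.scheduler_service import SchedulerService

def test_queue_scan_requires_initialized_scheduler():
    service = SchedulerService()
    with pytest.raises(RuntimeError):
        service.queue_scan(scan_id=1, config_file='configs/example.yaml')
```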
## Features
- Async scan execution without blocking HTTP requests
- Concurrent scan support (configurable max workers)
- Isolated database sessions per background thread
- Scan lifecycle tracking: created → running → completed/failed
- Error messages captured and stored in database
- Job monitoring and management capabilities
- Graceful shutdown waits for running jobs
## Implementation Notes
- Scanner runs in a subprocess from the background thread
- Docker provides necessary privileges (--privileged, --network host)
- Each job gets an isolated SQLAlchemy session (avoids locking)
- Job IDs follow the pattern: scan_{scan_id}
- Background jobs survive across requests
- Failed jobs store error messages in the database (monitoring sketch below)
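
Jobs can be monitored without a Flask app context using the helper from web/jobs/scan_job.py; the db_url value here is an assumed example:

```python
# Check a background job's outcome straight from the database.
from web.jobs.scan_job import get_scan_status_from_db

status = get_scan_status_from_db(scan_id=42, db_url='sqlite:////app/data/scans.db')
if status is None:
    print('scan not found')
elif status['status'] == 'failed':
    print(status['error_message'])
```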
## Documentation (docs/ai/PHASE2.md)
- Updated progress: 6/14 days complete (43%)
- Marked Step 3 as complete
- Added detailed implementation notes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
## Diff

web/api/scans.py:

```diff
@@ -146,7 +146,8 @@ def trigger_scan():
     scan_service = ScanService(current_app.db_session)
     scan_id = scan_service.trigger_scan(
         config_file=config_file,
-        triggered_by='api'
+        triggered_by='api',
+        scheduler=current_app.scheduler
     )
 
     logger.info(f"Scan {scan_id} triggered via API: config={config_file}")
```
web/app.py (+22):

```diff
@@ -60,6 +60,9 @@ def create_app(config: dict = None) -> Flask:
     # Initialize extensions
     init_extensions(app)
 
+    # Initialize background scheduler
+    init_scheduler(app)
+
     # Register blueprints
     register_blueprints(app)
 
@@ -169,6 +172,25 @@ def init_extensions(app: Flask) -> None:
     app.logger.info("Extensions initialized")
 
 
+def init_scheduler(app: Flask) -> None:
+    """
+    Initialize background job scheduler.
+
+    Args:
+        app: Flask application instance
+    """
+    from web.services.scheduler_service import SchedulerService
+
+    # Create and initialize scheduler
+    scheduler = SchedulerService()
+    scheduler.init_scheduler(app)
+
+    # Store in app context for access from routes
+    app.scheduler = scheduler
+
+    app.logger.info("Background scheduler initialized")
+
+
 def register_blueprints(app: Flask) -> None:
     """
     Register Flask blueprints for different app sections.
```
web/jobs/__init__.py (new file, 6 lines):

```python
"""
Background jobs package for SneakyScanner.

This package contains job definitions for background task execution,
including scan jobs and scheduled tasks.
"""
```
web/jobs/scan_job.py (new file, 152 lines):

```python
"""
Background scan job execution.

This module handles the execution of scans in background threads,
updating database status and handling errors.
"""

import logging
import traceback
from datetime import datetime
from pathlib import Path

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from src.scanner import SneakyScanner
from web.models import Scan
from web.services.scan_service import ScanService

logger = logging.getLogger(__name__)


def execute_scan(scan_id: int, config_file: str, db_url: str):
    """
    Execute a scan in the background.

    This function is designed to run in a background thread via APScheduler.
    It creates its own database session to avoid conflicts with the main
    application thread.

    Args:
        scan_id: ID of the scan record in database
        config_file: Path to YAML configuration file
        db_url: Database connection URL

    Workflow:
        1. Create new database session for this thread
        2. Update scan status to 'running'
        3. Execute scanner
        4. Generate output files (JSON, HTML, ZIP)
        5. Save results to database
        6. Update status to 'completed' or 'failed'
    """
    logger.info(f"Starting background scan execution: scan_id={scan_id}, config={config_file}")

    # Create new database session for this thread
    engine = create_engine(db_url, echo=False)
    Session = sessionmaker(bind=engine)
    session = Session()

    try:
        # Get scan record
        scan = session.query(Scan).filter_by(id=scan_id).first()
        if not scan:
            logger.error(f"Scan {scan_id} not found in database")
            return

        # Update status to running (in case it wasn't already)
        scan.status = 'running'
        scan.started_at = datetime.utcnow()
        session.commit()

        logger.info(f"Scan {scan_id}: Initializing scanner with config {config_file}")

        # Initialize scanner
        scanner = SneakyScanner(config_file)

        # Execute scan
        logger.info(f"Scan {scan_id}: Running scanner...")
        start_time = datetime.utcnow()
        report, timestamp = scanner.scan()
        end_time = datetime.utcnow()

        scan_duration = (end_time - start_time).total_seconds()
        logger.info(f"Scan {scan_id}: Scanner completed in {scan_duration:.2f} seconds")

        # Generate output files (JSON, HTML, ZIP)
        logger.info(f"Scan {scan_id}: Generating output files...")
        scanner.generate_outputs(report, timestamp)

        # Save results to database
        logger.info(f"Scan {scan_id}: Saving results to database...")
        scan_service = ScanService(session)
        scan_service._save_scan_to_db(report, scan_id, status='completed')

        logger.info(f"Scan {scan_id}: Completed successfully")

    except FileNotFoundError as e:
        # Config file not found
        error_msg = f"Configuration file not found: {str(e)}"
        logger.error(f"Scan {scan_id}: {error_msg}")

        scan = session.query(Scan).filter_by(id=scan_id).first()
        if scan:
            scan.status = 'failed'
            scan.error_message = error_msg
            scan.completed_at = datetime.utcnow()
            session.commit()

    except Exception as e:
        # Any other error during scan execution
        error_msg = f"Scan execution failed: {str(e)}"
        logger.error(f"Scan {scan_id}: {error_msg}")
        logger.error(f"Scan {scan_id}: Traceback:\n{traceback.format_exc()}")

        try:
            scan = session.query(Scan).filter_by(id=scan_id).first()
            if scan:
                scan.status = 'failed'
                scan.error_message = error_msg
                scan.completed_at = datetime.utcnow()
                session.commit()
        except Exception as db_error:
            logger.error(f"Scan {scan_id}: Failed to update error status in database: {str(db_error)}")

    finally:
        # Always close the session
        session.close()
        logger.info(f"Scan {scan_id}: Background job completed, session closed")


def get_scan_status_from_db(scan_id: int, db_url: str) -> dict:
    """
    Helper function to get scan status directly from database.

    Useful for monitoring background jobs without needing Flask app context.

    Args:
        scan_id: Scan ID to check
        db_url: Database connection URL

    Returns:
        Dictionary with scan status information
    """
    engine = create_engine(db_url, echo=False)
    Session = sessionmaker(bind=engine)
    session = Session()

    try:
        scan = session.query(Scan).filter_by(id=scan_id).first()
        if not scan:
            return None

        return {
            'scan_id': scan.id,
            'status': scan.status,
            'timestamp': scan.timestamp.isoformat() if scan.timestamp else None,
            'duration': scan.duration,
            'error_message': scan.error_message
        }
    finally:
        session.close()
```
web/models.py:

```diff
@@ -55,6 +55,9 @@ class Scan(Base):
     created_at = Column(DateTime, nullable=False, default=datetime.utcnow, comment="Record creation time")
     triggered_by = Column(String(50), nullable=False, default='manual', comment="manual, scheduled, api")
     schedule_id = Column(Integer, ForeignKey('schedules.id'), nullable=True, comment="FK to schedules if triggered by schedule")
+    started_at = Column(DateTime, nullable=True, comment="Scan execution start time")
+    completed_at = Column(DateTime, nullable=True, comment="Scan execution completion time")
+    error_message = Column(Text, nullable=True, comment="Error message if scan failed")
 
     # Relationships
     sites = relationship('ScanSite', back_populates='scan', cascade='all, delete-orphan')
```
web/services/scan_service.py:

```diff
@@ -42,7 +42,7 @@ class ScanService:
         self.db = db_session
 
     def trigger_scan(self, config_file: str, triggered_by: str = 'manual',
-                     schedule_id: Optional[int] = None) -> int:
+                     schedule_id: Optional[int] = None, scheduler=None) -> int:
         """
         Trigger a new scan.
 
@@ -53,6 +53,7 @@ class ScanService:
             config_file: Path to YAML configuration file
             triggered_by: Source that triggered scan (manual, scheduled, api)
             schedule_id: Optional schedule ID if triggered by schedule
+            scheduler: Optional SchedulerService instance for queuing background jobs
 
         Returns:
             Scan ID of the created scan
@@ -87,8 +88,21 @@ class ScanService:
 
         logger.info(f"Scan {scan.id} triggered via {triggered_by}")
 
-        # NOTE: Background job queuing will be implemented in Step 3
-        # For now, just return the scan ID
+        # Queue background job if scheduler provided
+        if scheduler:
+            try:
+                job_id = scheduler.queue_scan(scan.id, config_file)
+                logger.info(f"Scan {scan.id} queued for background execution (job_id={job_id})")
+            except Exception as e:
+                logger.error(f"Failed to queue scan {scan.id}: {str(e)}")
+                # Mark scan as failed if job queuing fails
+                scan.status = 'failed'
+                scan.error_message = f"Failed to queue background job: {str(e)}"
+                self.db.commit()
+                raise
+        else:
+            logger.warning(f"Scan {scan.id} created but not queued (no scheduler provided)")
 
         return scan.id
 
     def get_scan(self, scan_id: int) -> Optional[Dict[str, Any]]:
@@ -230,7 +244,9 @@ class ScanService:
             'scan_id': scan.id,
             'status': scan.status,
             'title': scan.title,
-            'started_at': scan.timestamp.isoformat() if scan.timestamp else None,
+            'timestamp': scan.timestamp.isoformat() if scan.timestamp else None,
+            'started_at': scan.started_at.isoformat() if scan.started_at else None,
+            'completed_at': scan.completed_at.isoformat() if scan.completed_at else None,
             'duration': scan.duration,
             'triggered_by': scan.triggered_by
         }
@@ -242,6 +258,7 @@ class ScanService:
             status_info['progress'] = 'Complete'
         elif scan.status == 'failed':
             status_info['progress'] = 'Failed'
+            status_info['error_message'] = scan.error_message
 
         return status_info
 
@@ -265,6 +282,7 @@ class ScanService:
         # Update scan record
         scan.status = status
         scan.duration = report.get('scan_duration')
+        scan.completed_at = datetime.utcnow()
 
         # Map report data to database models
         self._map_report_to_models(report, scan)
```
web/services/scheduler_service.py (new file, 257 lines):

```python
"""
Scheduler service for managing background jobs and scheduled scans.

This service integrates APScheduler with Flask to enable background
scan execution and future scheduled scanning capabilities.
"""

import logging
from datetime import datetime
from typing import Optional

from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor
from flask import Flask

from web.jobs.scan_job import execute_scan

logger = logging.getLogger(__name__)


class SchedulerService:
    """
    Service for managing background job scheduling.

    Uses APScheduler's BackgroundScheduler to run scans asynchronously
    without blocking HTTP requests.
    """

    def __init__(self):
        """Initialize scheduler service (scheduler not started yet)."""
        self.scheduler: Optional[BackgroundScheduler] = None
        self.db_url: Optional[str] = None

    def init_scheduler(self, app: Flask):
        """
        Initialize and start APScheduler with Flask app.

        Args:
            app: Flask application instance

        Configuration:
            - BackgroundScheduler: Runs in separate thread
            - ThreadPoolExecutor: Allows concurrent scan execution
            - Max workers: 3 (configurable via SCHEDULER_MAX_WORKERS)
        """
        if self.scheduler:
            logger.warning("Scheduler already initialized")
            return

        # Store database URL for passing to background jobs
        self.db_url = app.config['SQLALCHEMY_DATABASE_URI']

        # Configure executor for concurrent jobs
        max_workers = app.config.get('SCHEDULER_MAX_WORKERS', 3)
        executors = {
            'default': ThreadPoolExecutor(max_workers=max_workers)
        }

        # Configure job defaults
        job_defaults = {
            'coalesce': True,  # Combine multiple pending instances into one
            'max_instances': app.config.get('SCHEDULER_MAX_INSTANCES', 3),
            'misfire_grace_time': 60  # Allow 60 seconds for delayed starts
        }

        # Create scheduler
        self.scheduler = BackgroundScheduler(
            executors=executors,
            job_defaults=job_defaults,
            timezone='UTC'
        )

        # Start scheduler
        self.scheduler.start()
        logger.info(f"APScheduler started with {max_workers} max workers")

        # Register shutdown handler
        import atexit
        atexit.register(lambda: self.shutdown())

    def shutdown(self):
        """
        Shutdown scheduler gracefully.

        Waits for running jobs to complete before shutting down.
        """
        if self.scheduler:
            logger.info("Shutting down APScheduler...")
            self.scheduler.shutdown(wait=True)
            logger.info("APScheduler shutdown complete")
            self.scheduler = None

    def queue_scan(self, scan_id: int, config_file: str) -> str:
        """
        Queue a scan for immediate background execution.

        Args:
            scan_id: Database ID of the scan
            config_file: Path to YAML configuration file

        Returns:
            Job ID from APScheduler

        Raises:
            RuntimeError: If scheduler not initialized
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        # Add job to run immediately
        job = self.scheduler.add_job(
            func=execute_scan,
            args=[scan_id, config_file, self.db_url],
            id=f'scan_{scan_id}',
            name=f'Scan {scan_id}',
            replace_existing=True,
            misfire_grace_time=300  # 5 minutes
        )

        logger.info(f"Queued scan {scan_id} for background execution (job_id={job.id})")
        return job.id

    def add_scheduled_scan(self, schedule_id: int, config_file: str,
                           cron_expression: str) -> str:
        """
        Add a recurring scheduled scan.

        Args:
            schedule_id: Database ID of the schedule
            config_file: Path to YAML configuration file
            cron_expression: Cron expression (e.g., "0 2 * * *" for 2am daily)

        Returns:
            Job ID from APScheduler

        Raises:
            RuntimeError: If scheduler not initialized
            ValueError: If cron expression is invalid

        Note:
            This is a placeholder for Phase 3 scheduled scanning feature.
            Currently not used, but structure is in place.
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        # Parse cron expression
        # Format: "minute hour day month day_of_week"
        parts = cron_expression.split()
        if len(parts) != 5:
            raise ValueError(f"Invalid cron expression: {cron_expression}")

        minute, hour, day, month, day_of_week = parts

        # Add cron job (currently placeholder - will be enhanced in Phase 3)
        job = self.scheduler.add_job(
            func=self._trigger_scheduled_scan,
            args=[schedule_id, config_file],
            trigger='cron',
            minute=minute,
            hour=hour,
            day=day,
            month=month,
            day_of_week=day_of_week,
            id=f'schedule_{schedule_id}',
            name=f'Schedule {schedule_id}',
            replace_existing=True
        )

        logger.info(f"Added scheduled scan {schedule_id} with cron '{cron_expression}' (job_id={job.id})")
        return job.id

    def remove_scheduled_scan(self, schedule_id: int):
        """
        Remove a scheduled scan job.

        Args:
            schedule_id: Database ID of the schedule

        Raises:
            RuntimeError: If scheduler not initialized
        """
        if not self.scheduler:
            raise RuntimeError("Scheduler not initialized. Call init_scheduler() first.")

        job_id = f'schedule_{schedule_id}'

        try:
            self.scheduler.remove_job(job_id)
            logger.info(f"Removed scheduled scan job: {job_id}")
        except Exception as e:
            logger.warning(f"Failed to remove scheduled scan job {job_id}: {str(e)}")

    def _trigger_scheduled_scan(self, schedule_id: int, config_file: str):
        """
        Internal method to trigger a scan from a schedule.

        Creates a new scan record and queues it for execution.

        Args:
            schedule_id: Database ID of the schedule
            config_file: Path to YAML configuration file

        Note:
            This will be fully implemented in Phase 3 when scheduled
            scanning is added. Currently a placeholder.
        """
        logger.info(f"Scheduled scan triggered: schedule_id={schedule_id}")
        # TODO: In Phase 3, this will:
        # 1. Create a new Scan record with triggered_by='scheduled'
        # 2. Call queue_scan() with the new scan_id
        # 3. Update schedule's last_run and next_run timestamps

    def get_job_status(self, job_id: str) -> Optional[dict]:
        """
        Get status of a scheduled job.

        Args:
            job_id: APScheduler job ID

        Returns:
            Dictionary with job information, or None if not found
        """
        if not self.scheduler:
            return None

        job = self.scheduler.get_job(job_id)
        if not job:
            return None

        return {
            'id': job.id,
            'name': job.name,
            'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
            'trigger': str(job.trigger)
        }

    def list_jobs(self) -> list:
        """
        List all scheduled jobs.

        Returns:
            List of job information dictionaries
        """
        if not self.scheduler:
            return []

        jobs = self.scheduler.get_jobs()
        return [
            {
                'id': job.id,
                'name': job.name,
                'next_run_time': job.next_run_time.isoformat() if job.next_run_time else None,
                'trigger': str(job.trigger)
            }
            for job in jobs
        ]
```