Add webpage screenshot capture with Playwright

Implements automated screenshot capture for all discovered HTTP/HTTPS services using Playwright with headless Chromium. Screenshots are saved as PNG files and referenced in JSON reports.

Features:
- Separate ScreenshotCapture module for code organization
- Viewport screenshots (1280x720) with 15-second timeout
- Graceful handling of self-signed certificates
- Browser reuse for optimal performance
- Screenshots stored in timestamped directories
- Comprehensive documentation in README.md and new CLAUDE.md

Technical changes:
- Added src/screenshot_capture.py: Screenshot capture module with context manager pattern
- Updated src/scanner.py: Integrated screenshot capture into HTTP/HTTPS analysis phase
- Updated Dockerfile: Added Chromium and Playwright browser installation
- Updated requirements.txt: Added playwright==1.40.0
- Added CLAUDE.md: Developer documentation and implementation guide
- Updated README.md: Enhanced features section, added screenshot details and troubleshooting
- Updated .gitignore: Ignore entire output/ directory including screenshots

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-14 00:57:36 +00:00
parent 48755a8539
commit 61cc24f8d2
7 changed files with 822 additions and 25 deletions

CLAUDE.md (new file, 492 lines):
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
SneakyScanner is a dockerized network scanning tool that uses a five-phase approach: masscan for fast ping and port discovery, nmap for service detection, socket probing plus sslyze for HTTP/HTTPS detection and SSL/TLS analysis, and Playwright for webpage screenshots. It accepts YAML configuration files defining scan targets and expected network behavior, then produces comprehensive JSON reports with service information, SSL certificates, TLS versions, cipher suites, and webpage screenshots, comparing expected vs. actual results.
## Essential Commands
### Building and Running
```bash
# Build the Docker image
docker build -t sneakyscanner .
# Run with docker-compose (easiest method)
docker-compose build
docker-compose up
# Run directly with Docker
docker run --rm --privileged --network host \
-v $(pwd)/configs:/app/configs:ro \
-v $(pwd)/output:/app/output \
sneakyscanner /app/configs/your-config.yaml
```
### Development
```bash
# Test the Python script locally (requires masscan and nmap installed)
python3 src/scanner.py configs/example-site.yaml -o ./output
# Validate YAML config
python3 -c "import yaml; yaml.safe_load(open('configs/example-site.yaml'))"
```
## Architecture
### Core Components
1. **src/scanner.py** - Main application
- `SneakyScanner` class: Orchestrates scanning workflow
- `_load_config()`: Parses and validates YAML config
- `_run_masscan()`: Executes masscan for TCP/UDP scanning
- `_run_ping_scan()`: Executes masscan ICMP ping scanning
- `_run_nmap_service_detection()`: Executes nmap service detection on discovered TCP ports
- `_parse_nmap_xml()`: Parses nmap XML output to extract service information
- `_is_likely_web_service()`: Identifies web services based on nmap results
- `_detect_http_https()`: Detects HTTP vs HTTPS using socket connections
- `_analyze_ssl_tls()`: Analyzes SSL/TLS certificates and supported versions using sslyze
- `_run_http_analysis()`: Orchestrates HTTP/HTTPS and SSL/TLS analysis phase
- `scan()`: Main workflow - collects IPs, runs scans, performs service detection, HTTP/HTTPS analysis, compiles results
- `save_report()`: Writes JSON output with timestamp and scan duration
2. **src/screenshot_capture.py** - Screenshot capture module
- `ScreenshotCapture` class: Handles webpage screenshot capture
- `capture()`: Captures screenshot of a web service (HTTP/HTTPS)
- `_launch_browser()`: Initializes Playwright with Chromium in headless mode
- `_close_browser()`: Cleanup browser resources
- `_get_screenshot_dir()`: Creates screenshots subdirectory
- `_generate_filename()`: Generates filename for screenshot (IP_PORT.png)
3. **configs/** - YAML configuration files
- Define scan title, sites, IPs, and expected network behavior
- Each IP includes expected ping response and TCP/UDP ports
4. **output/** - JSON scan reports and screenshots
- Timestamped JSON files: `scan_report_YYYYMMDD_HHMMSS.json`
- Screenshot directory: `scan_report_YYYYMMDD_HHMMSS_screenshots/`
- Contains actual vs. expected comparison for each IP
### Scan Workflow
1. Parse YAML config and extract all unique IPs
2. Run ping scan on all IPs using `masscan --ping`
3. Run TCP scan on all IPs for ports 0-65535
4. Run UDP scan on all IPs for ports 0-65535
5. Run service detection on discovered TCP ports using `nmap -sV`
6. Run HTTP/HTTPS analysis on web services identified by nmap:
- Detect HTTP vs HTTPS using socket connections
- Capture webpage screenshot using Playwright (viewport 1280x720, 15s timeout)
- For HTTPS: Extract certificate details (subject, issuer, expiry, SANs)
- Test TLS version support (TLS 1.0, 1.1, 1.2, 1.3)
- List accepted cipher suites for each TLS version
7. Aggregate results by IP and site
8. Generate JSON report with timestamp, scan duration, screenshot references, and complete service details
### Why Dockerized
- Masscan and nmap require raw socket access (root/CAP_NET_RAW)
- Isolates privileged operations in container
- Ensures consistent masscan and nmap versions and dependencies
- Uses `--privileged` and `--network host` for network access
### Masscan Integration
- Masscan is built from source in Dockerfile
- Writes output to temporary JSON files
- Results parsed line-by-line (masscan emits one JSON object per line, with trailing commas)
- Temporary files cleaned up after each scan
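The line-by-line parsing described above can be sketched as follows (`parse_masscan_json` is an illustrative helper, not the actual code in `src/scanner.py`):

```python
import json

def parse_masscan_json(raw_text):
    """Parse masscan -oJ output: one JSON object per line, with trailing
    commas, wrapped in '[' and ']' lines. Returns {ip: [(proto, port), ...]}."""
    results = {}
    for line in raw_text.splitlines():
        line = line.strip().rstrip(',')
        if not line.startswith('{'):
            continue  # skip the surrounding '[' / ']' array brackets
        entry = json.loads(line)
        ip = entry.get('ip')
        for port in entry.get('ports', []):
            results.setdefault(ip, []).append((port['proto'], port['port']))
    return results
```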
### Nmap Integration
- Nmap installed via apt package in Dockerfile
- Runs service detection (`-sV`) with intensity level 5 (balanced speed/accuracy)
- Outputs XML format for structured parsing
- XML parsed using Python's ElementTree library (xml.etree.ElementTree)
- Extracts service name, product, version, extrainfo, and ostype
- Runs sequentially per IP to avoid overwhelming the target
- 5-minute nmap `--host-timeout`, 10-minute overall subprocess timeout per host
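In sketch form, the ElementTree extraction looks like this (illustrative helper; the real `_parse_nmap_xml()` also pulls `extrainfo` and `ostype`):

```python
import xml.etree.ElementTree as ET

def parse_nmap_services(xml_text):
    """Extract per-port service details from nmap -oX output."""
    services = []
    root = ET.fromstring(xml_text)
    for host in root.findall('host'):
        addr_el = host.find("address[@addrtype='ipv4']")
        if addr_el is None:
            continue
        for port in host.findall('.//port'):
            svc = port.find('service')
            get = svc.get if svc is not None else (lambda key: None)
            services.append({
                'ip': addr_el.get('addr'),
                'port': int(port.get('portid')),
                'service': get('name'),
                'product': get('product'),
                'version': get('version'),
            })
    return services
```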
### HTTP/HTTPS and SSL/TLS Analysis
- Uses sslyze library for comprehensive SSL/TLS scanning
- HTTP/HTTPS detection using Python's built-in socket and ssl modules
- Analyzes services based on:
- Nmap service identification (http, https, ssl, http-proxy, etc.)
- Common web ports (80, 443, 8000, 8006, 8008, 8080, 8081, 8443, 8888, 9443)
- This ensures non-standard ports (like Proxmox 8006) are analyzed even if nmap misidentifies them
- For HTTPS services:
- Extracts certificate information using cryptography library
- Tests TLS versions: 1.0, 1.1, 1.2, 1.3
- Lists all accepted cipher suites for each supported TLS version
- Calculates days until certificate expiration
- Extracts SANs (Subject Alternative Names) from certificate
- Graceful error handling: if SSL analysis fails, still reports HTTP/HTTPS detection
- 5-second timeout per HTTP/HTTPS detection
- Results merged into service data structure under `http_info` key
- **Note**: Uses sslyze 6.0 API which accesses scan results as attributes (e.g., `certificate_info`, `tls_1_2_cipher_suites`) rather than through `.scan_commands_results.get()`
### Webpage Screenshot Capture
**Implementation**: `src/screenshot_capture.py` - Separate module for code organization
**Technology Stack**:
- Playwright 1.40.0 with Chromium in headless mode
- System Chromium and chromium-driver installed via apt (Dockerfile)
- Python's pathlib for cross-platform file path handling
**Screenshot Process**:
1. Screenshots captured for all successfully detected HTTP/HTTPS services
2. Services identified by:
- Nmap service names: http, https, ssl, http-proxy, http-alt, etc.
- Common web ports: 80, 443, 8000, 8006, 8008, 8080, 8081, 8443, 8888, 9443
3. Browser lifecycle managed via context manager pattern (`__enter__`, `__exit__`)
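Step 3's context-manager pattern can be skeletonized as follows (a dummy stand-in: the string `'launched'` replaces the real Playwright launch so only the lifecycle shape is shown):

```python
class ScreenshotSession:
    """Skeleton of ScreenshotCapture's browser lifecycle. __enter__ would
    launch headless Chromium via Playwright; __exit__ guarantees cleanup
    even if a capture raises."""
    def __init__(self):
        self.browser = None

    def __enter__(self):
        self.browser = 'launched'  # real code: sync_playwright().start() + chromium.launch()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.browser = None        # real code: browser.close(); playwright.stop()
        return False               # never swallow exceptions from the scan
```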
**Configuration** (default values):
- **Viewport size**: 1280x720 pixels (viewport only, not full page)
- **Timeout**: 15 seconds per screenshot (15000ms in Playwright)
- **Wait strategy**: `wait_until='networkidle'` - waits for network activity to settle
- **SSL handling**: `ignore_https_errors=True` - handles self-signed certs
- **User agent**: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
- **Browser args**: `--no-sandbox`, `--disable-setuid-sandbox`, `--disable-dev-shm-usage`, `--disable-gpu`
**Storage Architecture**:
- Screenshots saved as PNG files in subdirectory: `scan_report_YYYYMMDD_HHMMSS_screenshots/`
- Filename format: `{ip}_{port}.png` (dots in IP replaced with underscores)
- Example: `192_168_1_10_443.png` for 192.168.1.10:443
- Path stored in JSON as relative reference: `http_info.screenshot` field
- Relative paths keep the output directory portable (it can be moved or archived as a unit)
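The naming scheme above is simple enough to express directly (`screenshot_filename` is a hypothetical helper; the real logic lives in `_generate_filename()`):

```python
def screenshot_filename(ip, port):
    """Build the documented PNG name: dots in the IP become underscores."""
    return f"{ip.replace('.', '_')}_{port}.png"
```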
**Error Handling** (graceful degradation):
- If screenshot fails (timeout, connection error, etc.), scan continues
- Failed screenshots logged as warnings, not errors
- Services without screenshots simply omit the `screenshot` field in JSON output
- Browser launch failure disables all screenshots for the scan
**Browser Lifecycle** (optimized for performance):
1. Browser launched once at scan start (in `scan()` method)
2. Reused for all screenshots via single browser instance
3. New context + page created per screenshot (isolated state)
4. Context and page closed after each screenshot
5. Browser closed at scan completion (cleanup in `scan()` method)
**Integration Points**:
- Initialized in `scanner.py:scan()` with scan timestamp
- Called from `scanner.py:_run_http_analysis()` after protocol detection
- Cleanup called in `scanner.py:scan()` after all analysis complete
**Code Reference Locations**:
- `src/screenshot_capture.py`: Complete screenshot module (lines 1-202)
- `src/scanner.py:scan()`: Browser initialization and cleanup
- `src/scanner.py:_run_http_analysis()`: Screenshot capture invocation
## Configuration Schema
```yaml
title: string # Report title (required)
sites: # List of sites (required)
- name: string # Site name
ips: # List of IPs for this site
- address: string # IP address (IPv4)
expected: # Expected network behavior
ping: boolean # Should respond to ping
tcp_ports: [int] # Expected TCP ports
udp_ports: [int] # Expected UDP ports
services: [string] # Expected services (optional)
```
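A light structural check against this schema, operating on the dict produced by `yaml.safe_load()`, might look like this (illustrative only; the validation inside `_load_config()` may differ):

```python
def validate_config(config):
    """Return a list of schema problems (empty list means valid).
    Checks only the required fields from the schema above."""
    errors = []
    if not isinstance(config.get('title'), str):
        errors.append('title: required string')
    for i, site in enumerate(config.get('sites') or []):
        if not site.get('name'):
            errors.append(f'sites[{i}].name: required')
        for j, ip in enumerate(site.get('ips') or []):
            if not ip.get('address'):
                errors.append(f'sites[{i}].ips[{j}].address: required')
            expected = ip.get('expected') or {}
            if not isinstance(expected.get('ping'), bool):
                errors.append(f'sites[{i}].ips[{j}].expected.ping: required boolean')
    return errors
```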
## Key Design Decisions
1. **Five-phase scanning**: Masscan for fast port discovery (10,000 pps), nmap for service detection, then HTTP/HTTPS detection, SSL/TLS analysis, and screenshot capture for web services
2. **All-port scanning**: TCP and UDP scans cover entire port range (0-65535) to detect unexpected services
3. **Selective web analysis**: Only analyze services identified by nmap as web-related to optimize scan time
4. **Machine-readable output**: JSON format enables automated report generation and comparison
5. **Expected vs. Actual**: Config includes expected behavior to identify infrastructure drift
6. **Site grouping**: IPs organized by logical site for better reporting
7. **Temporary files**: Masscan and nmap output written to temp files to avoid conflicts in parallel scans
8. **Service details**: Extract product name, version, and additional info for each discovered service
9. **SSL/TLS security**: Comprehensive certificate analysis and TLS version testing with cipher suite enumeration
## Testing Strategy
When testing changes:
1. Use a controlled test environment with known services (including HTTP/HTTPS)
2. Create a test config with 1-2 IPs
3. Verify JSON output structure matches schema
4. Check that ping, TCP, and UDP results are captured
5. Verify service detection results include service name, product, and version
6. For web services, verify http_info includes:
- Correct protocol detection (http vs https)
- Screenshot path reference (relative to output directory)
- Verify screenshot PNG file exists at the referenced path
- Certificate details for HTTPS (subject, issuer, expiry, SANs)
- TLS version support (1.0-1.3) with cipher suites
7. Ensure temp files are cleaned up (masscan JSON, nmap XML)
8. Verify screenshot directory created with correct naming convention
9. Test screenshot capture with HTTP, HTTPS, and self-signed certificate services
## Common Tasks
### Modifying Scan Parameters
**Masscan rate limiting:**
- `--rate`: Currently set to 10000 packets/second in src/scanner.py:80, 132
- `--wait`: Set to 0 (don't wait for late responses)
- Adjust these in `_run_masscan()` and `_run_ping_scan()` methods
**Nmap service detection intensity:**
- `--version-intensity`: Currently set to 5 (balanced) in src/scanner.py:201
- Range: 0-9 (0=light, 9=comprehensive)
- Lower values are faster but less accurate
- Adjust in `_run_nmap_service_detection()` method
**Nmap timeouts:**
- `--host-timeout`: Currently 5 minutes in src/scanner.py:204
- Overall subprocess timeout: 600 seconds (10 minutes) in src/scanner.py:208
- Adjust based on network conditions and number of ports
### Adding New Scan Types
To add additional scan functionality (e.g., OS detection, vulnerability scanning):
1. Add new method to `SneakyScanner` class (follow pattern of `_run_nmap_service_detection()`)
2. Update `scan()` workflow to call new method
3. Add results to `actual` section of output JSON
4. Update YAML schema if expected values needed
5. Update documentation (README.md, CLAUDE.md)
### Changing Output Format
JSON structure defined in src/scanner.py:365+. To modify:
1. Update the report dictionary structure
2. Ensure backward compatibility or version the schema
3. Update README.md output format documentation
4. Update example output in both README.md and CLAUDE.md
### Customizing Screenshot Capture
**Change viewport size** (src/screenshot_capture.py:35):
```python
self.viewport = viewport or {'width': 1920, 'height': 1080} # Full HD
```
**Change timeout** (src/screenshot_capture.py:34):
```python
self.timeout = timeout * 1000 # Default is 15 seconds
# Pass different value when initializing: ScreenshotCapture(..., timeout=30)
```
**Capture full-page screenshots** (src/screenshot_capture.py:173):
```python
page.screenshot(path=str(screenshot_path), type='png', full_page=True)
```
**Change wait strategy** (src/screenshot_capture.py:170):
```python
# Options: 'load', 'domcontentloaded', 'networkidle', 'commit'
page.goto(url, wait_until='load', timeout=self.timeout)
```
**Add custom request headers** (src/screenshot_capture.py:157-161):
```python
context = self.browser.new_context(
viewport=self.viewport,
ignore_https_errors=True,
user_agent='CustomUserAgent/1.0',
extra_http_headers={'Authorization': 'Bearer token'}
)
```
**Disable screenshot capture entirely**:
In src/scanner.py:scan(), comment out or skip initialization:
```python
# self.screenshot_capture = ScreenshotCapture(...)
self.screenshot_capture = None # This disables all screenshots
```
**Add authentication** (for services requiring login):
In src/screenshot_capture.py:capture(), before taking screenshot:
```python
# Navigate to login page first
page.goto(f"{protocol}://{ip}:{port}/login")
page.fill('#username', 'admin')
page.fill('#password', 'password')
page.click('#login-button')
page.wait_for_url(f"{protocol}://{ip}:{port}/dashboard")
# Then take screenshot
page.screenshot(path=str(screenshot_path), type='png')
```
### Performance Optimization
Current bottlenecks:
1. **Port scanning**: ~30 seconds for 2 IPs (all 65,536 ports each at 10k pps)
2. **Service detection**: ~20-60 seconds per IP with open ports
3. **HTTP/HTTPS analysis**: ~5-10 seconds per web service (includes SSL/TLS analysis)
4. **Screenshot capture**: ~5-15 seconds per web service (depends on page load time)
Optimization strategies:
- Parallelize nmap scans across IPs (currently sequential)
- Parallelize HTTP/HTTPS analysis and screenshot capture across services using ThreadPoolExecutor
- Reduce port range for faster scanning (if full range not needed)
- Lower nmap intensity (trade accuracy for speed)
- Skip service detection on high ports (>1024) if desired
- Reduce SSL/TLS analysis scope (e.g., test only TLS 1.2+ if legacy support not needed)
- Adjust HTTP/HTTPS detection timeout (currently 5 seconds in src/scanner.py:510)
- Adjust screenshot timeout (currently 15 seconds in src/screenshot_capture.py:34)
- Disable screenshot capture for faster scans (set screenshot_capture to None)
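The ThreadPoolExecutor suggestion above can be sketched like this (illustrative; `analyze_fn` stands in for the per-service HTTP/SSL analysis call):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_services_parallel(services, analyze_fn, max_workers=4):
    """Fan per-service analysis out across worker threads; results come
    back in the same order as the input list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_fn, services))
```

Threads suit this workload because the analysis is I/O-bound (network waits), so the GIL is not a bottleneck; `max_workers` should stay small to avoid overwhelming targets.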
## Planned Features (Future Development)
The following features are planned for future implementation:
### 1. HTML Report Generation
Build comprehensive HTML reports from JSON scan data with interactive visualizations.
**Report Features:**
- Service details and SSL/TLS information tables
- Visual comparison of expected vs. actual results (red/green highlighting)
- Certificate expiration warnings with countdown timers
- TLS version compliance reports (highlight weak configurations)
- Embedded webpage screenshots
- Sortable/filterable tables
- Timeline view of scan history
- Export to PDF capability
**Implementation Considerations:**
- Template engine: Jinja2 or similar
- CSS framework: Bootstrap or Tailwind for responsive design
- Charts/graphs: Chart.js or Plotly for visualizations
- Store templates in `templates/` directory
- Generate static HTML that can be opened without server
**Architecture:**
```python
class HTMLReportGenerator:
def __init__(self, json_report_path, template_dir='templates'):
pass
def generate_report(self, output_path):
# Parse JSON
# Render template with data
# Include screenshots
# Write HTML file
pass
def _compare_expected_actual(self, expected, actual):
# Generate diff/comparison data
pass
def _generate_cert_warnings(self, services):
# Identify expiring certs, weak TLS, etc.
pass
```
### 2. Comparison Reports (Scan Diffs)
Generate reports showing changes between scans over time.
**Features:**
- Compare two scan reports
- Highlight new/removed services
- Track certificate changes
- Detect TLS configuration drift
- Show port changes
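At its core, the port diff above reduces to set arithmetic over two reports (`diff_ports` is a hypothetical helper name):

```python
def diff_ports(old_ports, new_ports):
    """Compare the open-port lists for one IP across two scan reports.
    Returns (added, removed), each sorted."""
    old_set, new_set = set(old_ports), set(new_ports)
    return sorted(new_set - old_set), sorted(old_set - new_set)
```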
### 3. Additional Enhancements
- **Email Notifications**: Alert on unexpected changes or certificate expirations
- **Scheduled Scanning**: Automated periodic scans with cron integration
- **Vulnerability Detection**: Integration with CVE databases for known vulnerabilities
- **API Mode**: REST API for triggering scans and retrieving results
- **Multi-threading**: Parallel scanning of multiple IPs for better performance
## Development Notes
### Current Dependencies
- PyYAML==6.0.1 (YAML parsing)
- python-libnmap==0.7.3 (nmap XML parsing)
- sslyze==6.0.0 (SSL/TLS analysis)
- playwright==1.40.0 (webpage screenshot capture)
- Built-in: socket, ssl, subprocess, xml.etree.ElementTree, logging
- System: chromium, chromium-driver (installed via Dockerfile)
### For HTML Reports, Will Need:
- Jinja2 (template engine)
- Optional: weasyprint or pdfkit for PDF export
### Key Files to Modify for New Features:
1. **src/scanner.py** - Core scanning logic (add new phases/methods)
2. **src/screenshot_capture.py** - ✅ Implemented: Webpage screenshot capture module
3. **src/report_generator.py** - New file for HTML report generation (planned)
4. **templates/** - New directory for HTML templates (planned)
5. **requirements.txt** - Add new dependencies
6. **Dockerfile** - Install additional system dependencies (browsers, etc.)
### Testing Strategy for New Features:
**Screenshot Capture Testing** (✅ Implemented):
1. Test with HTTP services (port 80, 8080, etc.)
2. Test with HTTPS services with valid certificates (port 443, 8443)
3. Test with HTTPS services with self-signed certificates
4. Test with non-standard web ports (e.g., Proxmox on 8006)
5. Test with slow-loading pages (verify 15s timeout works)
6. Test with services that return errors (404, 500, etc.)
7. Verify screenshot files are created with correct naming
8. Verify JSON references point to correct screenshot files
9. Verify browser cleanup occurs properly (no zombie processes)
10. Test with multiple IPs and services to ensure browser reuse works
**HTML Report Testing** (Planned):
1. Validate HTML report rendering across browsers
2. Ensure large scans don't cause memory issues with screenshots
3. Test report generation with missing/incomplete data
4. Verify all URLs and links work in generated reports
5. Test embedded screenshots display correctly
## Troubleshooting
### Screenshot Capture Issues
**Problem**: Screenshots not being captured
- **Check**: Verify Chromium installed: `chromium --version` in container
- **Check**: Verify Playwright browsers installed: `playwright install --dry-run chromium`
- **Check**: Look for browser launch errors in stderr output
- **Solution**: Rebuild Docker image ensuring Dockerfile steps complete
**Problem**: "Failed to launch browser" error
- **Check**: Ensure container has sufficient memory (Chromium needs ~200MB)
- **Check**: Docker runs with `--privileged` or appropriate capabilities
- **Solution**: Add `--shm-size=2gb` to docker run command if `/dev/shm` is too small
**Problem**: Screenshots timing out
- **Check**: Network connectivity to target services
- **Check**: Services actually serve webpages (not just open ports)
- **Solution**: Increase timeout in `src/screenshot_capture.py:34` if needed
- **Solution**: Check service responds to HTTP requests: `curl -I http://IP:PORT`
**Problem**: Screenshots are blank/empty
- **Check**: Service returns valid HTML (not just TCP banner)
- **Check**: Page requires JavaScript (may need longer wait time)
- **Solution**: Change `wait_until` strategy from `'networkidle'` to `'load'` or `'domcontentloaded'`
**Problem**: HTTPS certificate errors despite `ignore_https_errors=True`
- **Check**: System certificates up to date in container
- **Solution**: This should not happen; file an issue if it does
### Nmap/Masscan Issues
**Problem**: No ports discovered
- **Check**: Firewall rules allow scanning
- **Check**: Targets are actually online (`ping` test)
- **Solution**: Run manual masscan: `masscan -p80,443 192.168.1.10 --rate 1000`
**Problem**: "Operation not permitted" error
- **Check**: Container runs with `--privileged` or `CAP_NET_RAW`
- **Solution**: Add `--privileged` flag to docker run command
**Problem**: Service detection not working
- **Check**: Nmap can connect to ports: `nmap -p 80 192.168.1.10`
- **Check**: Services actually respond to nmap probes (some firewall/IPS block)
- **Solution**: Adjust nmap intensity or timeout values