Recon Pipeline — Change Documentation
Date: 2026-05-11
Jobs affected: recon-httpx, recon-nuclei
Author: thewestboi
Overview
This document explains all changes made to the recon pipeline. The core goal was to fix three bugs found in production runs and to decouple nuclei from httpx so neither job can block the other.
Problems Found in Production
Bug 1 — Content-Length stored as page title
httpx outputs lines in this format:
http://ford.com [409] [] [16] [41ms] [Cloudflare]
                          ^^
                          this is Content-Length, NOT a title
The parse_state() parser in the diff stage failed to distinguish between a numeric Content-Length value and a real page title. On the first run it stored 16 as the title. On the next run through the same chunk, httpx returned an empty title — so the diff engine saw title: 16 → "" and marked the host as CHANGED.
Impact: 171 out of 173 hosts were marked as CHANGED on every repeated chunk run, sending ~159 false-positive targets to nuclei.
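One way to avoid this class of bug is to read the bracketed fields by position rather than taking the first non-empty value. The sketch below is only an illustration: parse_title is a hypothetical helper (not the pipeline's parse_state()), and it assumes the field order shown in the sample line ([status] [title] [content-length] ...) and that titles never contain a closing bracket.

```bash
# Hypothetical helper (not the pipeline's parse_state): print the title field
# of one httpx output line. Assumes the bracket groups appear in the order
# [status] [title] [content-length] ... and that the title contains no "]".
parse_title() {
  local rest title
  rest=${1#* }                      # drop the URL and the space after it
  case $rest in
    *"] ["*)
      rest=${rest#*"] ["}           # drop the status group and the next "["
      title=${rest%%"]"*}           # keep everything before the closing "]"
      printf '%s\n' "$title"
      ;;
    *)
      printf '\n'                   # fewer than two groups: no title field
      ;;
  esac
}
```

With the sample line above, the second group is empty, so the function prints an empty title instead of 16.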
Bug 2 — State never self-healed for unchanged hosts
The old state update logic only rewrote entries for NEW + CHANGED + REMOVED hosts. Hosts classified as KNOWN (unchanged) kept their old — potentially broken — state entry forever.
# Old logic:
remove (NEW + CHANGED + REMOVED) from old state
append fresh entries for NEW + CHANGED only
→ KNOWN hosts: old entry stays, bug persists indefinitely
This meant a host with a corrupted state entry (e.g. CL stored as title) would trigger a false CHANGED on every single chunk repeat until it genuinely changed.
Bug 3 — nuclei blocked httpx / timeout on first run
nuclei was triggered directly by httpx via a post { success { build job: 'recon-nuclei' } } block. On the first full scan cycle (~57 runs covering all 16,844 hosts), nuclei received up to 159 targets per run and hit its 4-hour timeout — blocking the next httpx chunk from starting.
Architecture: Before vs After
Before
subfinder (daily)
  └── writes all-resolved-latest.txt
            │
            ▼
httpx (every 30 min)
  ├── probes chunk of 300 hosts
  ├── diffs against cumulative state
  └── on success: triggers nuclei directly
            │
            ▼
      nuclei (blocking)
        ├── up to 159 targets
        ├── runs 3h+
        └── httpx waits → next chunk delayed
After
subfinder (daily)
  └── writes all-resolved-latest.txt
            │
            ▼
httpx (every 30 min)                        nuclei (every 1 hour)
  ├── probes chunk of 300 hosts               ├── reads nuclei-queue.txt
  ├── diffs against cumulative state          ├── takes max 50 hosts
  ├── applies blacklist filter                ├── removes them from queue
  ├── appends NEWs+CHANGEDs to queue          ├── scans them
  └── done (never waits for nuclei)           └── sends Matrix notification
            │                                       if findings
            │                                         ▲
            └──────── nuclei-queue.txt ───────────────┘
The two jobs share only one file: nuclei-queue.txt. httpx writes to it, nuclei reads from it. Neither job triggers or waits for the other.
Changes: recon-httpx
Stage 4 — httpx diff (State Fix)
What changed: The state update logic now removes all hosts in the current chunk from the old state before writing fresh entries, instead of only removing NEW + CHANGED + REMOVED hosts.
Old logic:
URLS_TO_REMOVE = NEW_URLS + CHANGED_URLS + REM_TXT
grep -vFf URLS_TO_REMOVE OLD_STATE > kept.txt
cat kept.txt + NEW_STATE + CHANGED_STATE > OLD_STATE
# KNOWN hosts: old entry stays forever → bug persists
New logic:
# Build list of ALL URLs seen in this chunk (http + https variants)
while read host; do
echo "https://${host}" >> CHUNK_URLS
echo "http://${host}" >> CHUNK_URLS
done < chunk.txt
# Remove entire chunk from state, rewrite with fresh live-state
grep -vFf CHUNK_URLS OLD_STATE > kept.txt
cat kept.txt + LIVE_STATE | sort -u > OLD_STATE
Effect: Every host is rewritten with its current httpx output on each chunk pass. Corrupted entries (CL-as-title, stale redirects) are automatically corrected the next time that chunk runs. No manual state deletion needed.
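A runnable sketch of the same idea follows. It is not the pipeline's actual stage code: rewrite_chunk_state and its three file arguments are hypothetical, and it assumes one bare hostname per chunk line and that the first whitespace-separated field of each state line is the URL. It uses awk with an exact match on that field instead of grep -vFf, because grep -F matches fixed substrings anywhere in the line, so a pattern such as http://ford.com would also remove an entry for http://ford.com.au.

```bash
# Hypothetical helper: replace all state entries for the hosts in the current
# chunk with the fresh httpx output. Assumes one bare hostname per chunk line
# and that the first whitespace-separated field of each state line is the URL.
rewrite_chunk_state() {
  local state="$1" chunk="$2" live="$3" tmp
  touch "$state"
  tmp=$(mktemp "${state}.XXXXXX")
  {
    # While reading the chunk file, remember both URL variants of each host.
    # While reading the state file, print only lines whose URL field is not
    # one of those variants (exact match, not substring).
    awk 'FILENAME == ARGV[1] { drop["http://" $1]; drop["https://" $1]; next }
         !($1 in drop)' "$chunk" "$state"
    cat "$live"
  } | sort -u > "$tmp"
  # Swap the new state into place only after it has been fully written.
  mv "$tmp" "$state"
}
```

A call would look like rewrite_chunk_state httpx-state-cumulative.txt chunk.txt live-state.txt, where the last two file names are placeholders for whatever the stage produces.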
Stage 6 — Queue: feed nuclei (new stage)
This is an entirely new stage that replaces the old post { success { build 'recon-nuclei' } } trigger.
What it does:
- Collects httpx-new-urls.txt and httpx-changed-urls.txt as candidates
- Applies the NUCLEI_BLACKLIST parameter (wildcard and exact matching)
- Appends filtered candidates to nuclei-queue.txt using sort -u to prevent duplicates
- Logs how many entries were added and the new queue total
Blacklist parameter (NUCLEI_BLACKLIST) — text area, one rule per line:
*.hubspot.com
*.twilio.com
notifybf1.hubspot.com
Wildcard rules (*.example.com) match any subdomain of that domain. Exact rules match only that specific host. Matching is case-insensitive. Uses pure bash case pattern matching — no regex.
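The matching rules described here could be implemented roughly as follows. This is a sketch under assumptions, not the job's actual code: is_blacklisted is a hypothetical name, the rules are passed as newline-separated text (as a Jenkins text parameter would deliver them), and lower-casing is done with tr so the function does not depend on a particular bash version.

```bash
# Hypothetical helper: succeed (exit 0) if the host or URL in $1 matches any
# rule in the newline-separated blacklist text in $2.
is_blacklisted() {
  local host rules="$2" rule
  host=${1#*://}                    # drop a scheme such as https://, if any
  host=${host%%/*}                  # drop any path
  host=${host%%:*}                  # drop any port
  host=$(printf '%s' "$host" | tr '[:upper:]' '[:lower:]')
  while IFS= read -r rule; do
    rule=$(printf '%s' "$rule" | tr '[:upper:]' '[:lower:]')
    case $rule in
      '')
        continue ;;                 # skip blank lines
      '*.'*)
        # Wildcard rule: the host must end in ".domain", so the bare
        # domain itself is not matched.
        case $host in
          *".${rule#??}") return 0 ;;
        esac ;;
      *)
        # Exact rule: the whole host must be identical.
        if [ "$host" = "$rule" ]; then return 0; fi ;;
    esac
  done <<EOF
$rules
EOF
  return 1
}
```

Under these rules *.hubspot.com matches api.hubspot.com but not hubspot.com or evilhubspot.com, because the pattern requires a dot before the domain.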
Duplicate prevention:
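# Write to a temp file first: redirecting straight into nuclei-queue.txt
# can truncate the queue before it has been read.
sort -u nuclei-queue.txt new-candidates.txt > nuclei-queue.tmp
mv nuclei-queue.tmp nuclei-queue.txt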
A host already in the queue will not be added again. A host that was already processed (removed from queue) and later genuinely changes will be re-added correctly.
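Note that sort -u also reorders the queue lexicographically on every append. If arrival order should be preserved, so that hosts queued earlier are scanned earlier, an order-preserving dedupe is one alternative. The sketch below is that alternative, not what the stage does today; enqueue_unique and its arguments are hypothetical.

```bash
# Hypothetical helper: append candidates to the queue, dropping duplicates
# while keeping first-seen order (existing queue entries stay in front).
enqueue_unique() {
  local queue="$1" candidates="$2" tmp
  touch "$queue"
  tmp=$(mktemp "${queue}.XXXXXX")
  # Print each line only the first time it is seen across both files.
  awk '!seen[$0]++' "$queue" "$candidates" > "$tmp"
  # Replace the queue only after the new content is complete.
  mv "$tmp" "$queue"
}
```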
New parameter added to httpx:
| Parameter | Type | Default | Description |
|---|---|---|---|
| NUCLEI_BLACKLIST | text | *.hubspot.com / *.twilio.com | Domains to exclude from nuclei queue. One per line. Wildcards supported. |
Changes: recon-nuclei
The nuclei job has been completely rewritten as a standalone queue-based scanner.
Schedule
Runs every hour via cron: 0 * * * *
Timeout: 2 hours (was 4 hours — with max 50 targets this is always sufficient)
Stage 3 — Take chunk from queue (new)
Takes the first QUEUE_CHUNK_SIZE (default: 50) entries from nuclei-queue.txt.
nuclei-queue.txt (before):        nuclei-queue.txt (after):
https://admin.ford.com            https://jenkins.hubspot.com   ← next run starts here
https://grafana.twilio.com        https://portal.deere.com
https://jenkins.hubspot.com       ...
https://portal.deere.com
...
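Entries are removed from the queue immediately before scanning starts. If the job is killed or times out mid-scan, the hosts in that batch are therefore treated as processed and are not scanned again on the next run. Any part of the batch that was not reached is not retried; such a host only returns to the queue when httpx next reports it as NEW or CHANGED.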
The blacklist is also applied here as a second defensive pass — any blacklisted entries found in the queue are silently removed without scanning.
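The take-then-scan step can be sketched as below. take_batch and its arguments are hypothetical names; the only behaviour taken from this document is that the first QUEUE_CHUNK_SIZE lines are copied out and removed from the queue before any scanning begins.

```bash
# Hypothetical helper: move the first $2 lines of the queue file $1 into the
# batch file $3, leaving the remainder in the queue.
take_batch() {
  local queue="$1" size="$2" batch="$3" tmp
  touch "$queue"
  tmp=$(mktemp "${queue}.XXXXXX")
  head -n "$size" "$queue" > "$batch"             # the hosts to scan now
  tail -n "+$((size + 1))" "$queue" > "$tmp"      # everything after them
  mv "$tmp" "$queue"                              # queue now starts at the next host
}
```

Because the queue is rewritten before the scan starts, a killed job leaves the queue pointing at the next unprocessed host, which matches the behaviour described above.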
Queue persistence
The queue file lives at:
/var/jenkins_home/recon-state/nuclei/nuclei-queue.txt
It persists across Jenkins restarts and job runs. If nuclei does not finish the full queue in one day, it continues exactly where it left off on the next run. There is no reset, no re-queue, no loss of work.
Stage 4 — Build login targets
Filters the current batch of targets using LOGIN_PATTERNS to identify hosts worth checking for default credentials (admin panels, grafana, jenkins, etc.). Only these go through the http/default-logins/ template scan.
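A minimal sketch of this filter, assuming LOGIN_PATTERNS is an extended-regex alternation matched case-insensitively against each target URL. The function name and file arguments are hypothetical, and the real pattern list is longer than the two words visible in the parameter default.

```bash
# Hypothetical helper: write the targets in $1 that match the pattern in $2
# to the file $3. Assumes $2 is an ERE alternation such as "login|admin".
build_login_targets() {
  local batch="$1" patterns="$2" out="$3"
  # grep exits 1 when nothing matches, which is a normal outcome here;
  # any other non-zero status (a real error) is passed on to the caller.
  grep -iE -- "$patterns" "$batch" > "$out" || [ $? -eq 1 ]
}
```

An empty output file then simply means Scan 2 has nothing to do for this batch.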
Stage 5 — nuclei scan
Two scans per run:
| Scan | Targets | Templates |
|---|---|---|
| Scan 1 | All targets (max 50) | http/exposures/ + http/misconfiguration/ |
| Scan 2 | Login-filtered targets only | http/default-logins/ |
Fixed flags (was broken in previous version):
# Old (broken — -jsonl is not a valid flag, -output is wrong):
-o findings.txt -jsonl -output findings.jsonl
# New (correct):
-o findings.txt -je findings.jsonl
Diff and cumulative state
Same logic as before — new findings are diffed against nuclei-findings-cumulative.txt. Only genuinely new findings trigger a Matrix notification.
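The diff against the cumulative file can be sketched as follows. This assumes one finding per line and that identical lines mean identical findings; record_new_findings and its arguments are hypothetical, and the job's real diff logic may differ.

```bash
# Hypothetical helper: write findings from this run ($1) that are not yet in
# the cumulative file ($2) to $3, then add them to the cumulative file.
record_new_findings() {
  local run="$1" cumulative="$2" fresh="$3"
  touch "$cumulative"
  # While reading the cumulative file, remember every known finding.
  # While reading this run, print lines that are unknown and not yet printed.
  awk 'FILENAME == ARGV[1] { seen[$0]; next }
       !($0 in seen) && !dup[$0]++' "$cumulative" "$run" > "$fresh"
  cat "$fresh" >> "$cumulative"
}
```

A caller could then test whether the fresh file is non-empty (for example with [ -s fresh.txt ]) before sending a notification.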
The diff report now includes Queue remaining so you can see how much work is left:
========================================
recon-nuclei — Diff Report
========================================
Job: recon-nuclei #71
Timestamp: 2026-05-11T10:00:00Z
Queue remaining: 7650
Total findings this run: 3
...
New and changed parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| QUEUE_CHUNK_SIZE | string | 50 | Max hosts taken from queue per run |
| NUCLEI_BLACKLIST | text | *.hubspot.com / *.twilio.com | Domains to skip. One per line. Wildcards supported. |
| NUCLEI_CONCURRENCY | string | 10 | Parallel template execution |
| NUCLEI_RATE_LIMIT | string | 50 | Max HTTP requests/sec |
| NUCLEI_SEVERITY | string | low,medium,high,critical | Severity filter |
| INCLUDE_INFO | boolean | false | Include info-severity findings |
| LOGIN_PATTERNS | string | login\|admin\|... | Patterns for default-login scan |
Removed parameter: SCAN_NEW_ONLY — no longer needed since nuclei only ever sees what httpx puts in the queue.
Queue Lifecycle — Full Timeline
Day 1: First full scan cycle
httpx runs every 30 minutes, processing chunks of 300 hosts from a total of ~16,844. It takes ~57 runs (~28 hours) to cover all hosts once.
httpx Run 1 (t=0h): +173 to queue → queue: 173
httpx Run 2 (t=0.5h): +160 to queue → queue: 333
...
httpx Run 57 (t=28h): +140 to queue → queue: ~8,000
nuclei Run 1 (t=1h): -50 from queue → queue: ~7,950
nuclei Run 2 (t=2h): -50 from queue → queue: ~7,900
...
nuclei Run 160 (t≈160h, ~6.7d): queue empty
During this period nuclei works through the backlog at 50 hosts/hour. Critical hosts (admin panels, API gateways) appear in the queue early and are scanned first since httpx processes chunk 0 first.
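The run counts in this timeline follow from ceiling division. The sketch below reproduces the arithmetic, assuming the figures quoted in this section (16,844 hosts, chunks of 300 every 30 minutes, a backlog of roughly 8,000 queued hosts, 50 hosts per hourly nuclei run) and assuming nuclei always finds a full batch waiting.

```bash
# Integer ceiling division: how many batches of size $2 cover $1 items.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

httpx_runs=$(ceil_div 16844 300)          # 57 chunks for one full pass
httpx_minutes=$(( httpx_runs * 30 ))      # one chunk every 30 minutes
nuclei_runs=$(ceil_div 8000 50)           # 160 batches to drain ~8,000 hosts
nuclei_hours=$(( nuclei_runs * 1 ))       # one batch per hour

echo "httpx:  $httpx_runs runs, $(( httpx_minutes / 60 ))h $(( httpx_minutes % 60 ))m"
echo "nuclei: $nuclei_runs runs, $(( nuclei_hours / 24 ))d $(( nuclei_hours % 24 ))h"
```

Under those assumptions one httpx pass takes 28h 30m, and draining the backlog takes 160 hourly runs, a little under seven days.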
Day 2+: Steady state
httpx now only adds genuine NEWs and CHANGEDs per chunk — typically 0–5 per run. The queue stays small. nuclei processes the queue faster than httpx fills it.
Typical daily queue additions (steady state): ~20–50 hosts
nuclei capacity per day: 50 hosts/run × 24 runs = 1,200 hosts/day
→ queue is always empty within hours
What happens if nuclei misses a run
Nothing special. The queue file is untouched. The next run picks up exactly where the last one left off. No data is lost, no hosts are skipped, no duplicates are created.
File Structure
/var/jenkins_home/recon-state/
├── subfinder/
│ ├── all-resolved-latest.txt ← httpx reads this
│ └── all-subdomains-latest.txt
├── httpx/
│ ├── httpx-state-cumulative.txt ← diff baseline (auto-healed each chunk)
│ ├── chunk-pointer.txt ← current position in resolved list
│ ├── daily-digest.txt ← all NEWs+CHANGEDs today
│ └── history/
│ └── build-NNN.txt
└── nuclei/
├── nuclei-queue.txt ← shared queue between httpx and nuclei
├── nuclei-findings-cumulative.txt ← all findings ever seen
├── metadata.txt ← last run stats incl. queue remaining
└── history/
└── build-NNN.txt
Groovy Parser Compatibility Notes
Jenkins' Groovy parser scans the entire Jenkinsfile — including the content of sh '''...''' blocks — before execution. Any backslash sequence it does not recognise causes a compile error at startup, before the pipeline runs.
The following patterns cause errors and must be avoided inside sh '''...''':
| Pattern | Error | Replacement |
|---|---|---|
| `sed 's/\]$//'` | unexpected char: '\' | `sed 's/]$//'` (no backslash needed) |
| `sed 's\|https\?://\|\|'` | unexpected char: '\' | bash parameter expansion |
| `grep -q '^\*\.'` | unexpected char: '\' | `case "$var" in *.*) ...` |
| `grep -qiE "(^\|\.)domain"` | unexpected char: '\' | bash case pattern matching |
All blacklist matching in both jobs now uses pure bash case pattern matching with no regex backslashes.
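The replacements listed in the table can be written without any backslash at all. The helpers below are illustrative sketches with hypothetical names, not the functions used in the Jenkinsfile. Their comments also avoid backslashes, since a shell comment inside sh '''...''' is still part of the Groovy string.

```bash
# Strip the URL scheme with parameter expansion instead of sed.
strip_scheme() {
  echo "${1#*://}"
}

# Remove one trailing closing bracket, if present, instead of using sed.
strip_trailing_bracket() {
  local s
  s=${1%"]"}
  echo "$s"
}

# Detect a wildcard rule (one starting with an asterisk and a dot) with a
# case pattern instead of grep. The quotes make the asterisk literal.
is_wildcard_rule() {
  case $1 in
    '*.'*) return 0 ;;
    *)     return 1 ;;
  esac
}
```

For example, strip_scheme https://admin.ford.com prints admin.ford.com, and is_wildcard_rule succeeds for *.hubspot.com but not for notifybf1.hubspot.com.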