Blade34242 c50c56baab First with All

2026-05-11 10:03:09 +07:00

12 KiB

Raw Permalink Blame History

Recon Pipeline — Change Documentation

Date: 2026-05-11
Jobs affected: recon-httpx, recon-nuclei
Author: thewestboi

Overview

This document explains all changes made to the recon pipeline. The core goal was to fix three bugs found in production runs and to decouple nuclei from httpx so neither job can block the other.

Problems Found in Production

Bug 1 — Content-Length stored as page title

httpx outputs lines in this format:

http://ford.com [409] [] [16] [41ms] [Cloudflare]
                           ^^
                           this is Content-Length, NOT a title

The parse_state() parser in the diff stage failed to distinguish between a numeric Content-Length value and a real page title. On the first run it stored 16 as the title. On the next run through the same chunk, httpx returned an empty title — so the diff engine saw title: 16 → "" and marked the host as CHANGED.

Impact: 171 out of 173 hosts were marked as CHANGED on every repeated chunk run, sending ~159 false-positive targets to nuclei.

Bug 2 — State never self-healed for unchanged hosts

The old state update logic only rewrote entries for NEW + CHANGED + REMOVED hosts. Hosts classified as KNOWN (unchanged) kept their old — potentially broken — state entry forever.

# Old logic:
remove (NEW + CHANGED + REMOVED) from old state
append fresh entries for NEW + CHANGED only
→ KNOWN hosts: old entry stays, bug persists indefinitely

This meant a host with a corrupted state entry (e.g. CL stored as title) would trigger a false CHANGED on every single chunk repeat until it genuinely changed.

Bug 3 — nuclei blocked httpx / timeout on first run

nuclei was triggered directly by httpx via a post { success { build job: 'recon-nuclei' } } block. On the first full scan cycle (~57 runs covering all 16,844 hosts), nuclei received up to 159 targets per run and hit its 4-hour timeout — blocking the next httpx chunk from starting.

Architecture: Before vs After

Before

subfinder (daily)
    └── writes all-resolved-latest.txt
            │
            ▼
httpx (every 30 min)
    ├── probes chunk of 300 hosts
    ├── diffs against cumulative state
    └── on success: triggers nuclei directly
                        │
                        ▼
                    nuclei (blocking)
                    ├── up to 159 targets
                    ├── runs 3h+
                    └── httpx waits → next chunk delayed

After

subfinder (daily)
    └── writes all-resolved-latest.txt
            │
            ▼
httpx (every 30 min)                     nuclei (every 1 hour)
    ├── probes chunk of 300 hosts            ├── reads nuclei-queue.txt
    ├── diffs against cumulative state       ├── takes max 50 hosts
    ├── applies blacklist filter             ├── scans them
    ├── appends NEWs+CHANGEDs to queue       ├── removes them from queue
    └── done (never waits for nuclei)        └── sends Matrix notification
                                                 if findings
            │                                        ▲
            └──── nuclei-queue.txt ───────────────────┘

The two jobs share only one file: nuclei-queue.txt. httpx writes to it, nuclei reads from it. Neither job triggers or waits for the other.

Changes: recon-httpx

Stage 4 — httpx diff (State Fix)

What changed: The state update logic now removes all hosts in the current chunk from the old state before writing fresh entries, instead of only removing NEW + CHANGED + REMOVED hosts.

Old logic:

URLS_TO_REMOVE = NEW_URLS + CHANGED_URLS + REM_TXT
grep -vFf URLS_TO_REMOVE OLD_STATE > kept.txt
cat kept.txt + NEW_STATE + CHANGED_STATE > OLD_STATE
# KNOWN hosts: old entry stays forever → bug persists

New logic:

# Build list of ALL URLs seen in this chunk (http + https variants)
while read host; do
    echo "https://${host}" >> CHUNK_URLS
    echo "http://${host}"  >> CHUNK_URLS
done < chunk.txt

# Remove entire chunk from state, rewrite with fresh live-state
grep -vFf CHUNK_URLS OLD_STATE > kept.txt
cat kept.txt + LIVE_STATE | sort -u > OLD_STATE

Effect: Every host is rewritten with its current httpx output on each chunk pass. Corrupted entries (CL-as-title, stale redirects) are automatically corrected the next time that chunk runs. No manual state deletion needed.

Stage 6 — Queue: feed nuclei (new stage)

This is an entirely new stage that replaces the old post { success { build 'recon-nuclei' } } trigger.

What it does:

Collects httpx-new-urls.txt and httpx-changed-urls.txt as candidates
Applies the NUCLEI_BLACKLIST parameter (wildcard and exact matching)
Appends filtered candidates to nuclei-queue.txt using sort -u to prevent duplicates
Logs how many entries were added and the new queue total

Blacklist parameter (NUCLEI_BLACKLIST) — text area, one rule per line:

*.hubspot.com
*.twilio.com
notifybf1.hubspot.com

Wildcard rules (*.example.com) match any subdomain of that domain. Exact rules match only that specific host. Matching is case-insensitive. Uses pure bash case pattern matching — no regex.

Duplicate prevention:

{ cat nuclei-queue.txt; cat new-candidates.txt; } | sort -u > nuclei-queue.txt

A host already in the queue will not be added again. A host that was already processed (removed from queue) and later genuinely changes will be re-added correctly.

New parameter added to httpx:

Parameter	Type	Default	Description
`NUCLEI_BLACKLIST`	text	`.hubspot.com` / `.twilio.com`	Domains to exclude from nuclei queue. One per line. Wildcards supported.

Changes: recon-nuclei

The nuclei job has been completely rewritten as a standalone queue-based scanner.

Schedule

Runs every hour via cron: 0 * * * *
Timeout: 2 hours (was 4 hours — with max 50 targets this is always sufficient)

Stage 3 — Take chunk from queue (new)

Takes the first QUEUE_CHUNK_SIZE (default: 50) entries from nuclei-queue.txt.

nuclei-queue.txt (before):          nuclei-queue.txt (after):
https://admin.ford.com              https://jenkins.hubspot.com    ← next run starts here
https://grafana.twilio.com          https://portal.deere.com
https://jenkins.hubspot.com         ...
https://portal.deere.com
...

Entries are removed from the queue immediately before scanning starts. This ensures that if the job is killed or times out mid-scan, those hosts are not scanned again on the next run. They are considered processed.

The blacklist is also applied here as a second defensive pass — any blacklisted entries found in the queue are silently removed without scanning.

Queue persistence

The queue file lives at:

/var/jenkins_home/recon-state/nuclei/nuclei-queue.txt

It persists across Jenkins restarts and job runs. If nuclei does not finish the full queue in one day, it continues exactly where it left off on the next run. There is no reset, no re-queue, no loss of work.

Filters the current batch of targets using LOGIN_PATTERNS to identify hosts worth checking for default credentials (admin panels, grafana, jenkins, etc.). Only these go through the http/default-logins/ template scan.

Stage 5 — nuclei scan

Two scans per run:

Scan	Targets	Templates
Scan 1	All targets (max 50)	`http/exposures/` + `http/misconfiguration/`
Scan 2	Login-filtered targets only	`http/default-logins/`

Fixed flags (was broken in previous version):

# Old (broken — -jsonl is not a valid flag, -output is wrong):
-o findings.txt -jsonl -output findings.jsonl

# New (correct):
-o findings.txt -je findings.jsonl

Diff and cumulative state

Same logic as before — new findings are diffed against nuclei-findings-cumulative.txt. Only genuinely new findings trigger a Matrix notification.

The diff report now includes Queue remaining so you can see how much work is left:

========================================
  recon-nuclei — Diff Report
========================================
Job:       recon-nuclei  #71
Timestamp: 2026-05-11T10:00:00Z
Queue remaining: 7650
Total findings this run: 3
...

New and changed parameters

Parameter	Type	Default	Description
`QUEUE_CHUNK_SIZE`	string	`50`	Max hosts taken from queue per run
`NUCLEI_BLACKLIST`	text	`.hubspot.com` / `.twilio.com`	Domains to skip. One per line. Wildcards supported.
`NUCLEI_CONCURRENCY`	string	`10`	Parallel template execution
`NUCLEI_RATE_LIMIT`	string	`50`	Max HTTP requests/sec
`NUCLEI_SEVERITY`	string	`low,medium,high,critical`	Severity filter
`INCLUDE_INFO`	boolean	`false`	Include info-severity findings
`LOGIN_PATTERNS`	string	`login\|admin\|...`	Patterns for default-login scan

Removed parameter: SCAN_NEW_ONLY — no longer needed since nuclei only ever sees what httpx puts in the queue.

Queue Lifecycle — Full Timeline

Day 1: First full scan cycle

httpx runs every 30 minutes, processing chunks of 300 hosts from a total of ~16,844. It takes ~57 runs (~28 hours) to cover all hosts once.

httpx Run 1  (t=0h):    +173 to queue  → queue: 173
httpx Run 2  (t=0.5h):  +160 to queue  → queue: 333
...
httpx Run 57 (t=28h):   +140 to queue  → queue: ~8,000

nuclei Run 1  (t=1h):   -50 from queue → queue: ~7,950
nuclei Run 2  (t=2h):   -50 from queue → queue: ~7,900
...
nuclei Run 160 (t=13d): queue empty

During this period nuclei works through the backlog at 50 hosts/hour. Critical hosts (admin panels, API gateways) appear in the queue early and are scanned first since httpx processes chunk 0 first.

Day 2+: Steady state

httpx now only adds genuine NEWs and CHANGEDs per chunk — typically 0–5 per run. The queue stays small. nuclei processes the queue faster than httpx fills it.

Typical daily queue additions (steady state): ~20–50 hosts
nuclei capacity per day: 50 hosts/run × 24 runs = 1,200 hosts/day
→ queue is always empty within hours

What happens if nuclei misses a run

Nothing special. The queue file is untouched. The next run picks up exactly where the last one left off. No data is lost, no hosts are skipped, no duplicates are created.

File Structure

/var/jenkins_home/recon-state/
├── subfinder/
│   ├── all-resolved-latest.txt        ← httpx reads this
│   └── all-subdomains-latest.txt
├── httpx/
│   ├── httpx-state-cumulative.txt     ← diff baseline (auto-healed each chunk)
│   ├── chunk-pointer.txt              ← current position in resolved list
│   ├── daily-digest.txt               ← all NEWs+CHANGEDs today
│   └── history/
│       └── build-NNN.txt
└── nuclei/
    ├── nuclei-queue.txt               ← shared queue between httpx and nuclei
    ├── nuclei-findings-cumulative.txt ← all findings ever seen
    ├── metadata.txt                   ← last run stats incl. queue remaining
    └── history/
        └── build-NNN.txt

Groovy Parser Compatibility Notes

Jenkins' Groovy parser scans the entire Jenkinsfile — including the content of sh '''...''' blocks — before execution. Any backslash sequence it does not recognise causes a compile error at startup, before the pipeline runs.

The following patterns cause errors and must be avoided inside sh '''...''':

Pattern	Error	Replacement
`sed 's/\]$//'`	`unexpected char: '\'`	`sed 's/]$//'` (no backslash needed)
`sed 's\|https\?://\|\|'`	`unexpected char: '\'`	bash parameter expansion
`grep -q '^\*\.'`	`unexpected char: '\'`	`case "$var" in .) ...`
`grep -qiE "(^	.)domain"`	`unexpected char: '\'`

All blacklist matching in both jobs now uses pure bash case pattern matching with no regex backslashes.

12 KiB Raw Permalink Blame History Unescape Escape