recon-pipeline/docs/architecture.md
2026-05-11 10:03:09 +07:00

3.2 KiB

Architecture

Pipeline Overview

subfinder (every 2 days, 23:00)
    │
    └── writes: all-resolved-latest.txt
                    │
                    ▼
httpx (every 30 min)                    nuclei (every 1 hour)
    ├── reads chunk from resolved list      ├── reads nuclei-queue.txt
    ├── probes 300 hosts per run            ├── takes max 50 hosts
    ├── diffs against cumulative state      ├── scans: exposures + misconfig
    ├── self-heals state on each chunk      ├── scans: default-logins (filtered)
    ├── applies blacklist filter            ├── diffs against known findings
    └── appends NEWs+CHANGEDs to queue     └── Matrix notification if new
                    │                                    ▲
                    └──── nuclei-queue.txt ──────────────┘

Key Design Decisions

httpx as infinite queue — chunk pointer never resets externally. subfinder writes a new resolved list, httpx picks up where it left off. New subdomains at the end of the file get scanned on the next natural cycle.

State self-healing — on every chunk run, all hosts in that chunk are removed from the cumulative state and rewritten with fresh httpx output. Corrupted entries (e.g. Content-Length stored as title) heal automatically on the next pass through that chunk.

nuclei decoupled from httpx — nuclei no longer triggered by httpx. httpx writes to a queue file, nuclei reads it independently every hour. Max 50 targets per nuclei run = no more timeouts.

Duplicate preventionsort -u on queue additions ensures no host appears in the queue more than once at any time.

File Layout

/var/jenkins_home/recon-state/
├── subfinder/
│   ├── all-resolved-latest.txt     ← httpx input
│   ├── all-subdomains-latest.txt
│   └── history/
├── httpx/
│   ├── httpx-state-cumulative.txt  ← diff baseline (self-healing)
│   ├── chunk-pointer.txt           ← current position (never externally reset)
│   ├── daily-digest.txt            ← all NEWs+CHANGEDs today
│   ├── metadata.txt                ← last run stats
│   └── history/
└── nuclei/
    ├── nuclei-queue.txt            ← shared queue (httpx writes, nuclei reads)
    ├── nuclei-findings-cumulative.txt
    ├── metadata.txt                ← includes queue_remaining
    └── history/

Schedules

Job Schedule Duration Notes
subfinder every 2 days, 23:00 ~10 min Does NOT reset httpx pointer
httpx every 30 min ~20 min 300 hosts/chunk, ~57 runs/cycle
nuclei every 1 hour ~10 min max 50 hosts/run from queue

Timelines

First full cycle (~28h):

  • httpx scans all 16,844 hosts in ~57 runs
  • Queue fills up as NEWs are discovered
  • nuclei works through queue in background (~13 days for full backlog)
  • Critical hosts (login panels, APIs) get scanned first

Steady state (after 2 cycles):

  • httpx produces 0-5 real CHANGEDs per chunk
  • Queue stays small, nuclei empties it within hours
  • State fully healed, no more false CHANGEDs from CL-bug