hoover

A site-mirroring tool in the vein of HTTrack: hoover archives web pages locally, preserving directory structure and rewriting URLs for offline viewing.

Overview

Hoover is a web archiving tool that downloads web pages and their associated resources (CSS, JavaScript, images, fonts, etc.) while converting absolute URLs to relative paths. This allows archived content to be viewed offline without network dependencies.

Installation

From source

git clone https://git.lan.thwap.org/thwap/hoover.git
cd hoover
make install

Building manually

go build -o hoover ./cmd/hoover

Usage

Basic command structure:

hoover -url <URL> -output <DIRECTORY> [options]

Required flags

  • -url, -u: URL to archive (e.g., https://example.com)
  • -output, -o: Output directory where archived files will be saved

Optional flags

  • -depth, -d: Maximum recursion depth (default: 1)
  • -delay, -D: Delay between requests in milliseconds (default: 100)
  • -verbose, -v: Enable verbose logging
  • -no-robots, -R: Ignore robots.txt restrictions (default: respect robots.txt)
  • -force, -f: Overwrite existing files (default: skip)
  • -pretty, -p: Pretty-print HTML output (default: preserve original formatting)

Examples

Basic archiving

Archive a single page with its immediate assets:

hoover -url https://docs.example.com -output ./docs

Recursive archiving with depth limit

Archive pages up to 2 levels deep:

hoover -url https://docs.example.com -output ./docs -depth 2

Archiving with rate limiting

Add a 200ms delay between requests to be respectful to servers:

hoover -url https://example.com -output ./archive -delay 200

Force overwrite existing files

hoover -url https://example.com -output ./archive -force

Pretty-print HTML output

hoover -url https://example.com -output ./archive -pretty

Output Structure

Hoover preserves the original URL path structure in the output directory. Here's an example of what the output might look like:

archive/
├── index.html
├── about.html
├── contact.html
├── css/
│   └── style.css
├── js/
│   └── app.js
├── images/
│   ├── logo.png
│   └── banner.jpg
└── cdn.example.com/
    └── library.js

Key features of the output structure:

  • HTML pages are saved with their original paths
  • Directories are created automatically as needed
  • External-domain assets are placed in subdirectories named after the host, with the port separator : replaced by an underscore (e.g. cdn.example.com:8443 becomes cdn.example.com_8443)
  • index.html is created for directory URLs
  • All URLs in HTML, CSS, and JavaScript files are rewritten to be relative
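The mapping rules above can be sketched roughly as follows. localPath is a hypothetical helper that approximates the behavior of internal/writer and internal/urltraversal, not Hoover's actual code:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// localPath approximates how an archived URL maps to a file on disk under
// the conventions above. Illustrative only.
func localPath(rawURL, rootHost string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	p := u.Path
	// Directory URLs get an index.html.
	if p == "" || strings.HasSuffix(p, "/") {
		p = path.Join(p, "index.html")
	}
	p = strings.TrimPrefix(p, "/")
	// Same-host assets keep their path; external hosts get a subdirectory
	// named after the host, with the port separator ':' turned into '_'.
	if u.Host == rootHost {
		return p
	}
	return path.Join(strings.ReplaceAll(u.Host, ":", "_"), p)
}

func main() {
	fmt.Println(localPath("https://example.com/docs/", "example.com"))
	fmt.Println(localPath("https://cdn.example.com:8443/library.js", "example.com"))
}
```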

Features

  • Recursive downloading: Follow links within the same domain up to a configurable depth
  • Asset discovery: Automatically finds and downloads CSS, JavaScript, images, fonts, and media files
  • URL rewriting: Converts absolute URLs to relative paths for offline viewing
  • Directory preservation: Maintains the original site structure in the local filesystem
  • Rate limiting: Configurable delay between requests to be server-friendly
  • Robots.txt support: Respects robots.txt restrictions by default
  • Safe file writing: Sanitizes filenames, limits path lengths, and avoids overwriting existing files (unless -force is used)
  • Parallel downloading: Uses a worker pool for concurrent asset downloads
  • Summary reporting: Provides statistics on files fetched, skipped, failed, and total bytes written
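The URL-rewriting step can be pictured as follows: once a page and an asset both have local paths (see Output Structure), an absolute URL becomes a relative link by computing the path from the page's directory to the asset. relLink is an illustrative sketch, not the actual internal/rewriter implementation:

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// relLink returns a relative link from the page's directory to the asset,
// so the reference works when the archive is opened offline.
func relLink(pagePath, assetPath string) (string, error) {
	return filepath.Rel(path.Dir(pagePath), assetPath)
}

func main() {
	link, _ := relLink("docs/guide/index.html", "css/style.css")
	fmt.Println(link)
}
```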

Limitations

Hoover has several important limitations to be aware of:

No JavaScript execution

  • Hoover performs static analysis only and does not execute JavaScript
  • Dynamically loaded content (via XHR/fetch, client-side rendering, etc.) will not be captured unless explicitly linked in the initial HTML
  • URLs constructed dynamically in JavaScript will not be discovered

No login/session handling

  • Hoover does not perform logins and maintains no session state beyond basic cookies
  • Password-protected areas cannot be archived; pages behind a login are inaccessible to the crawler

Limited dynamic content detection

  • Content loaded via AJAX after page load is not captured
  • Infinite scroll implementations will only capture initial content
  • Single Page Applications (SPAs) may not archive correctly

Static analysis of JavaScript

  • JavaScript files are analyzed for string patterns containing resource URLs
  • Complex JavaScript that constructs URLs dynamically may not be parsed correctly
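The kind of string-pattern scan described above can be approximated with a regular expression over quoted literals. The pattern here is a deliberate simplification for illustration; Hoover's actual JavaScript scanner may differ:

```go
package main

import (
	"fmt"
	"regexp"
)

// urlLit matches quoted literals that look like resource URLs: absolute or
// protocol-relative URLs, or root-relative paths with a known extension.
var urlLit = regexp.MustCompile(`["']((?:https?:)?//[^"']+|/[^"']+\.(?:js|css|png|jpe?g|gif|svg|woff2?))["']`)

// extractURLs returns the URL-like string literals found in JavaScript source.
// Dynamically constructed URLs (string concatenation, template literals) are
// invisible to this approach, as noted above.
func extractURLs(js string) []string {
	var urls []string
	for _, m := range urlLit.FindAllStringSubmatch(js, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

func main() {
	src := `load("/static/app.css"); fetch("https://api.example.com/data"); img.src = "/img/" + name + ".png";`
	fmt.Println(extractURLs(src))
}
```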

Certificate validation

  • Hoover validates SSL certificates; self-signed certificates may cause errors
  • Use with caution on sites with invalid certificates

No interactive content

  • Forms, buttons, and other interactive elements will not function in the archived version
  • Search functionality and other server-side interactions will not work offline

Development

Running tests

make test

Building from source

make build

Project structure

hoover/
├── cmd/hoover/          # Main CLI entry point
├── internal/
│   ├── archiver/        # Core archiving orchestration
│   ├── fetcher/         # HTTP client and fetching logic
│   ├── parser/          # HTML/CSS/JS parsing
│   ├── rewriter/        # URL rewriting logic
│   ├── urltraversal/    # URL tracking and traversal
│   ├── workerpool/      # Concurrent job processing
│   ├── writer/          # File system operations
│   └── robots/          # robots.txt handling
└── README.md
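The concurrency model suggested by internal/workerpool can be sketched with the standard channel-based worker-pool pattern: a fixed number of goroutines drain a job channel and send results back. Names and signatures here are illustrative only:

```go
package main

import (
	"fmt"
	"sync"
)

// runPool fans jobs out to a fixed number of workers and collects results.
// In Hoover's case the jobs would be asset URLs and do() an HTTP fetch.
func runPool(workers int, jobs []string, do func(string) string) []string {
	in := make(chan string)
	out := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range in {
				out <- do(job)
			}
		}()
	}
	go func() {
		for _, job := range jobs {
			in <- job
		}
		close(in)
	}()
	go func() {
		wg.Wait()
		close(out)
	}()
	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	got := runPool(3, []string{"a.css", "b.js", "logo.png"}, func(u string) string {
		return "fetched " + u // stand-in for an HTTP download
	})
	fmt.Println(len(got))
}
```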

Code Quality and CI/CD

The project uses pre-commit hooks and automated CI/CD with Forgejo Actions.

Pre-commit Hooks

Install pre-commit and run:

pre-commit install

This will set up hooks for:

  • Go formatting (gofmt)
  • Go module tidiness (go mod tidy)
  • Go vetting (go vet)
  • Optional golangci-lint (if installed)
  • Markdown linting
  • YAML validation
  • Spell checking

Continuous Integration

CI runs on every push and pull request via Forgejo Actions:

  • Unit tests with race detection and coverage
  • go vet and gofmt checks
  • golangci-lint analysis
  • Multi-platform binary builds on version tags

Release Automation

When a tag starting with v is pushed (e.g., v1.0.0), the workflow:

  1. Runs all tests and linting
  2. Builds binaries for Linux (amd64, arm64), macOS (arm64), and Windows (amd64)
  3. Packages artifacts as .tar.gz files
  4. Creates a release with the artifacts

License

[To be determined - MIT or GPL]