hoover

A site-mirroring tool in the vein of HTTrack: hoover archives web pages locally, preserving directory structure and rewriting URLs for offline viewing.

Overview

Hoover is a web archiving tool that downloads web pages and their associated resources (CSS, JavaScript, images, fonts, etc.) while converting absolute URLs to relative paths. This allows archived content to be viewed offline without network dependencies.

Installation

From source

git clone https://git.lan.thwap.org/thwap/hoover.git
cd hoover
make install

Building manually

go build -o hoover ./cmd/hoover

Usage

Basic command structure:

hoover -url <URL> -output <DIRECTORY> [options]

Required flags

  • -url, -u: URL to archive (e.g., https://example.com)
  • -output, -o: Output directory where archived files will be saved

Optional flags

  • -depth, -d: Maximum recursion depth (default: 1)
  • -delay, -D: Delay between requests in milliseconds (default: 100)
  • -verbose, -v: Enable verbose logging
  • -no-robots, -R: Ignore robots.txt restrictions (default: respect robots.txt)
  • -force, -f: Overwrite existing files (default: skip)
  • -pretty, -p: Pretty-print HTML output (default: preserve original formatting)

Examples

Basic archiving

Archive a single page with its immediate assets:

hoover -url https://docs.example.com -output ./docs

Recursive archiving with depth limit

Archive pages up to 2 levels deep:

hoover -url https://docs.example.com -output ./docs -depth 2

Archiving with rate limiting

Add a 200ms delay between requests to be respectful to servers:

hoover -url https://example.com -output ./archive -delay 200

Force overwrite existing files

hoover -url https://example.com -output ./archive -force

Pretty-print HTML output

hoover -url https://example.com -output ./archive -pretty

Output Structure

Hoover preserves the original URL path structure in the output directory. Here's an example of what the output might look like:

archive/
├── index.html
├── about.html
├── contact.html
├── css/
│   └── style.css
├── js/
│   └── app.js
├── images/
│   ├── logo.png
│   └── banner.jpg
└── cdn.example.com/
    └── library.js

Key features of the output structure:

  • HTML pages are saved with their original paths
  • Directories are created automatically as needed
  • External-domain assets are placed in subdirectories named after the host, with the port separator : replaced by an underscore (e.g. cdn.example.com:8443 becomes cdn.example.com_8443)
  • index.html is created for directory URLs
  • All URLs in HTML, CSS, and JavaScript files are rewritten to be relative
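The mapping rules above can be sketched roughly as follows. localPath is a hypothetical helper that approximates the behavior of internal/writer and internal/urltraversal, not Hoover's actual code:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// localPath approximates how an archived URL maps to a file on disk under
// the conventions above. Illustrative only.
func localPath(rawURL, rootHost string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return ""
	}
	p := u.Path
	// Directory URLs get an index.html.
	if p == "" || strings.HasSuffix(p, "/") {
		p = path.Join(p, "index.html")
	}
	p = strings.TrimPrefix(p, "/")
	// Same-host assets keep their path; external hosts get a subdirectory
	// named after the host, with the port separator ':' turned into '_'.
	if u.Host == rootHost {
		return p
	}
	return path.Join(strings.ReplaceAll(u.Host, ":", "_"), p)
}

func main() {
	fmt.Println(localPath("https://example.com/docs/", "example.com"))
	fmt.Println(localPath("https://cdn.example.com:8443/library.js", "example.com"))
}
```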

Features

  • Recursive downloading: Follow links within the same domain up to a configurable depth
  • Asset discovery: Automatically finds and downloads CSS, JavaScript, images, fonts, and media files
  • URL rewriting: Converts absolute URLs to relative paths for offline viewing
  • Directory preservation: Maintains the original site structure in the local filesystem
  • Rate limiting: Configurable delay between requests to be server-friendly
  • Robots.txt support: Respects robots.txt restrictions by default
  • Safe file writing: Sanitizes filenames, limits path lengths, and avoids overwriting existing files (unless -force is used)
  • Parallel downloading: Uses a worker pool for concurrent asset downloads
  • Summary reporting: Provides statistics on files fetched, skipped, failed, and total bytes written
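The URL-rewriting step can be pictured as follows: once a page and an asset both have local paths (see Output Structure), an absolute URL becomes a relative link by computing the path from the page's directory to the asset. relLink is an illustrative sketch, not the actual internal/rewriter implementation:

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

// relLink returns a relative link from the page's directory to the asset,
// so the reference works when the archive is opened offline.
func relLink(pagePath, assetPath string) (string, error) {
	return filepath.Rel(path.Dir(pagePath), assetPath)
}

func main() {
	link, _ := relLink("docs/guide/index.html", "css/style.css")
	fmt.Println(link)
}
```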

Limitations

Hoover has several important limitations to be aware of:

No JavaScript execution

  • Hoover performs static analysis only and does not execute JavaScript
  • Dynamically loaded content (via XHR/fetch, client-side rendering, etc.) will not be captured unless explicitly linked in the initial HTML
  • URLs constructed dynamically in JavaScript will not be discovered

No login/session handling

  • Hoover does not perform logins and maintains no session state beyond basic cookies
  • Password-protected areas cannot be archived; pages behind a login are inaccessible to the crawler

Limited dynamic content detection

  • Content loaded via AJAX after page load is not captured
  • Infinite scroll implementations will only capture initial content
  • Single Page Applications (SPAs) may not archive correctly

Static analysis of JavaScript

  • JavaScript files are analyzed for string patterns containing resource URLs
  • Complex JavaScript that constructs URLs dynamically may not be parsed correctly
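The kind of string-pattern scan described above can be approximated with a regular expression over quoted literals. The pattern here is a deliberate simplification for illustration; Hoover's actual JavaScript scanner may differ:

```go
package main

import (
	"fmt"
	"regexp"
)

// urlLit matches quoted literals that look like resource URLs: absolute or
// protocol-relative URLs, or root-relative paths with a known extension.
var urlLit = regexp.MustCompile(`["']((?:https?:)?//[^"']+|/[^"']+\.(?:js|css|png|jpe?g|gif|svg|woff2?))["']`)

// extractURLs returns the URL-like string literals found in JavaScript source.
// Dynamically constructed URLs (string concatenation, template literals) are
// invisible to this approach, as noted above.
func extractURLs(js string) []string {
	var urls []string
	for _, m := range urlLit.FindAllStringSubmatch(js, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

func main() {
	src := `load("/static/app.css"); fetch("https://api.example.com/data"); img.src = "/img/" + name + ".png";`
	fmt.Println(extractURLs(src))
}
```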

Certificate validation

  • Hoover validates SSL certificates; self-signed certificates may cause errors
  • Use with caution on sites with invalid certificates

No interactive content

  • Forms, buttons, and other interactive elements will not function in the archived version
  • Search functionality and other server-side interactions will not work offline

Development

Running tests

make test

Building from source

make build

Project structure

hoover/
├── cmd/hoover/          # Main CLI entry point
├── internal/
│   ├── archiver/        # Core archiving orchestration
│   ├── fetcher/         # HTTP client and fetching logic
│   ├── parser/          # HTML/CSS/JS parsing
│   ├── rewriter/        # URL rewriting logic
│   ├── urltraversal/    # URL tracking and traversal
│   ├── workerpool/      # Concurrent job processing
│   ├── writer/          # File system operations
│   └── robots/          # robots.txt handling
└── README.md
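The concurrency model suggested by internal/workerpool can be sketched with the standard channel-based worker-pool pattern: a fixed number of goroutines drain a job channel and send results back. Names and signatures here are illustrative only:

```go
package main

import (
	"fmt"
	"sync"
)

// runPool fans jobs out to a fixed number of workers and collects results.
// In Hoover's case the jobs would be asset URLs and do() an HTTP fetch.
func runPool(workers int, jobs []string, do func(string) string) []string {
	in := make(chan string)
	out := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range in {
				out <- do(job)
			}
		}()
	}
	go func() {
		for _, job := range jobs {
			in <- job
		}
		close(in)
	}()
	go func() {
		wg.Wait()
		close(out)
	}()
	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	got := runPool(3, []string{"a.css", "b.js", "logo.png"}, func(u string) string {
		return "fetched " + u // stand-in for an HTTP download
	})
	fmt.Println(len(got))
}
```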

Code Quality and CI/CD

The project uses pre-commit hooks and automated CI/CD with Forgejo Actions.

Pre-commit Hooks

Install pre-commit and run:

pre-commit install

This will set up hooks for:

  • Go formatting (gofmt)
  • Go module tidiness (go mod tidy)
  • Go vetting (go vet)
  • Optional golangci-lint (if installed)
  • Markdown linting
  • YAML validation
  • Spell checking

Continuous Integration

CI runs on every push and pull request via Forgejo Actions:

  • Unit tests with race detection and coverage
  • go vet and gofmt checks
  • golangci-lint analysis
  • Multi-platform binary builds on version tags

Release Automation

When a tag starting with v is pushed (e.g., v1.0.0), the workflow:

  1. Runs all tests and linting
  2. Builds binaries for Linux (amd64, arm64), macOS (arm64), and Windows (amd64)
  3. Packages artifacts as .tar.gz files
  4. Creates a release with the artifacts

License

[To be determined - MIT or GPL]