# hoover

A tool for archiving web pages locally, preserving directory structure and rewriting URLs for offline viewing.

## Overview

Hoover is a web archiving tool that downloads web pages and their associated resources (CSS, JavaScript, images, fonts, etc.) while converting absolute URLs to relative paths. This allows archived content to be viewed offline without network dependencies.
## Installation

### From source

```
git clone https://git.lan.thwap.org/thwap/hoover.git
cd hoover
make install
```

### Building manually

```
go build -o hoover ./cmd/hoover
```
## Usage

Basic command structure:

```
hoover -url <URL> -output <DIRECTORY> [options]
```

### Required flags

- `-url`, `-u`: URL to archive (e.g., `https://example.com`)
- `-output`, `-o`: Output directory where archived files will be saved

### Optional flags

- `-depth`, `-d`: Maximum recursion depth (default: `1`)
- `-delay`, `-D`: Delay between requests in milliseconds (default: `100`)
- `-verbose`, `-v`: Enable verbose logging
- `-no-robots`, `-R`: Ignore `robots.txt` restrictions (default: respect robots.txt)
- `-force`, `-f`: Overwrite existing files (default: skip)
- `-pretty`, `-p`: Pretty-print HTML output (default: preserve original formatting)
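Dual long/short flag spellings like these are typically registered by binding both names to the same variable with the standard library's `flag` package. A minimal sketch of that pattern — the flag names mirror the table above, but `parseFlags` is a hypothetical helper, not hoover's actual wiring in `cmd/hoover`:

```go
package main

import (
	"flag"
	"fmt"
)

// parseFlags registers each option under both its long and short name by
// binding them to the same variable (illustrative; not hoover's actual code).
func parseFlags(args []string) (string, int, error) {
	fs := flag.NewFlagSet("hoover", flag.ContinueOnError)
	var rawURL string
	fs.StringVar(&rawURL, "url", "", "URL to archive")
	fs.StringVar(&rawURL, "u", "", "URL to archive (shorthand)")
	depth := fs.Int("depth", 1, "maximum recursion depth")
	fs.IntVar(depth, "d", 1, "maximum recursion depth (shorthand)")
	err := fs.Parse(args)
	return rawURL, *depth, err
}

func main() {
	u, d, _ := parseFlags([]string{"-u", "https://example.com", "-depth", "2"})
	fmt.Println(u, d) // https://example.com 2
}
```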
## Examples

### Basic archiving

Archive a single page with its immediate assets:

```
hoover -url https://docs.example.com -output ./docs
```

### Recursive archiving with depth limit

Archive pages up to 2 levels deep:

```
hoover -url https://docs.example.com -output ./docs -depth 2
```

### Archiving with rate limiting

Add a 200 ms delay between requests to be respectful to servers:

```
hoover -url https://example.com -output ./archive -delay 200
```
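A throttle like this amounts to a fixed sleep between consecutive requests. A sketch of the idea — `throttle` is an illustrative helper, not the actual implementation in `internal/fetcher`:

```go
package main

import (
	"fmt"
	"time"
)

// throttle calls fn for each URL, sleeping for delay between consecutive
// calls (sketch of a -delay style rate limit; not hoover's actual code).
func throttle(urls []string, delay time.Duration, fn func(string)) {
	for i, u := range urls {
		if i > 0 {
			time.Sleep(delay)
		}
		fn(u)
	}
}

func main() {
	throttle([]string{"/a", "/b", "/c"}, 200*time.Millisecond, func(u string) {
		fmt.Println("fetching", u)
	})
}
```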
### Force overwrite existing files

```
hoover -url https://example.com -output ./archive -force
```

### Pretty-print HTML output

```
hoover -url https://example.com -output ./archive -pretty
```
## Output Structure

Hoover preserves the original URL path structure in the output directory. Here's an example of what the output might look like:

```
archive/
├── index.html
├── about.html
├── contact.html
├── css/
│   └── style.css
├── js/
│   └── app.js
├── images/
│   ├── logo.png
│   └── banner.jpg
└── cdn.example.com/
    └── library.js
```

Key features of the output structure:

- HTML pages are saved with their original paths
- Directories are created automatically as needed
- External domain assets are placed in subdirectories named after the host (with ports converted to underscores)
- `index.html` is created for directory URLs
- All URLs in HTML, CSS, and JavaScript files are rewritten to be relative
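The host-to-directory mapping described above (the port separator turned into an underscore) can be pictured with a small helper; `hostDir` is a hypothetical name for illustration, not hoover's actual function:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hostDir maps an external asset URL to the subdirectory name used for its
// host, replacing the ":" before a port with "_" so the name is safe on all
// filesystems (hypothetical helper; not hoover's actual code).
func hostDir(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return strings.ReplaceAll(u.Host, ":", "_"), nil
}

func main() {
	a, _ := hostDir("https://cdn.example.com/library.js")
	b, _ := hostDir("https://cdn.example.com:8443/library.js")
	fmt.Println(a) // cdn.example.com
	fmt.Println(b) // cdn.example.com_8443
}
```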
## Features

- Recursive downloading: Follow links within the same domain up to a configurable depth
- Asset discovery: Automatically finds and downloads CSS, JavaScript, images, fonts, and media files
- URL rewriting: Converts absolute URLs to relative paths for offline viewing
- Directory preservation: Maintains the original site structure in the local filesystem
- Rate limiting: Configurable delay between requests to be server-friendly
- Robots.txt support: Respects `robots.txt` restrictions by default
- Safe file writing: Sanitizes filenames, limits path lengths, and avoids overwriting existing files (unless `-force` is used)
- Parallel downloading: Uses a worker pool for concurrent asset downloads
- Summary reporting: Provides statistics on files fetched, skipped, failed, and total bytes written
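The core of URL rewriting — turning an absolute same-host URL into a path relative to the page that references it — can be sketched with the standard library. This is a simplified illustration under the assumption that both URLs share a host; hoover's rewriter in `internal/rewriter` also handles cross-host assets and directory index pages:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"path/filepath"
)

// relativize converts an absolute same-host asset URL into a path relative
// to the referencing page (simplified sketch; not hoover's actual code).
func relativize(pageURL, assetURL string) (string, error) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	asset, err := url.Parse(assetURL)
	if err != nil {
		return "", err
	}
	// Compute the path from the page's directory to the asset.
	rel, err := filepath.Rel(path.Dir(page.Path), asset.Path)
	if err != nil {
		return "", err
	}
	return filepath.ToSlash(rel), nil
}

func main() {
	rel, _ := relativize("https://example.com/blog/post.html",
		"https://example.com/css/style.css")
	fmt.Println(rel) // ../css/style.css
}
```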
## Limitations

Hoover has several important limitations to be aware of:

### No JavaScript execution

- Hoover performs static analysis only and does not execute JavaScript
- Dynamically loaded content (via XHR/fetch, client-side rendering, etc.) will not be captured unless explicitly linked in the initial HTML
- URLs constructed dynamically in JavaScript will not be discovered

### No login/session handling

- Hoover does not handle authentication or maintain sessions beyond basic cookies
- Password-protected areas cannot be archived
- Sites requiring login will not be accessible

### Limited dynamic content detection

- Content loaded via AJAX after page load is not captured
- Infinite scroll implementations will only capture the initial content
- Single Page Applications (SPAs) may not archive correctly

### Static analysis of JavaScript

- JavaScript files are analyzed for string patterns containing resource URLs
- Complex JavaScript that constructs URLs dynamically may not be parsed correctly
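String-pattern analysis like this can be pictured as a regex pass over the source that picks out URL-shaped string literals. The pattern and `findURLs` helper below are illustrative stand-ins, not the actual expressions used in `internal/parser` — note how a URL concatenated at runtime slips through:

```go
package main

import (
	"fmt"
	"regexp"
)

// urlLiteral matches quoted string literals that look like resource URLs:
// absolute http(s) URLs, or root-relative paths with a known extension.
// Illustrative only; hoover's parser may use different patterns.
var urlLiteral = regexp.MustCompile(
	`["'](https?://[^"']+|/[^"']+\.(?:js|css|png|jpe?g|gif|woff2?))["']`)

// findURLs returns the URL-like string literals found in JavaScript source.
func findURLs(js string) []string {
	var out []string
	for _, m := range urlLiteral.FindAllStringSubmatch(js, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	src := `fetch("/api/data.js");
const logo = "https://cdn.example.com/logo.png";
const icon = "/img/" + name + ".png"; // built at runtime: not discovered`
	fmt.Println(findURLs(src)) // [/api/data.js https://cdn.example.com/logo.png]
}
```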
### Certificate validation

- Hoover validates SSL certificates; self-signed certificates may cause errors
- Use with caution on sites with invalid certificates
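Go's default HTTP client performs this validation, which is why a self-signed certificate produces an error rather than a download. The behavior can be reproduced offline with an `httptest` TLS server, which presents a self-signed certificate (illustration of the underlying mechanism only — hoover exposes no option to skip verification):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// selfSignedRejected starts a TLS test server (self-signed certificate) and
// shows that a default http.Client refuses to complete the handshake.
func selfSignedRejected() bool {
	srv := httptest.NewTLSServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {}))
	defer srv.Close()

	_, err := http.Get(srv.URL) // default client: verification enabled
	return err != nil
}

func main() {
	fmt.Println("self-signed rejected:", selfSignedRejected()) // true
}
```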
### No interactive content

- Forms, buttons, and other interactive elements will not function in the archived version
- Search functionality and other server-side interactions will not work offline
## Development

### Running tests

```
make test
```

### Building from source

```
make build
```
### Project structure

```
hoover/
├── cmd/hoover/        # Main CLI entry point
├── internal/
│   ├── archiver/      # Core archiving orchestration
│   ├── fetcher/       # HTTP client and fetching logic
│   ├── parser/        # HTML/CSS/JS parsing
│   ├── rewriter/      # URL rewriting logic
│   ├── urltraversal/  # URL tracking and traversal
│   ├── workerpool/    # Concurrent job processing
│   ├── writer/        # File system operations
│   └── robots/        # robots.txt handling
└── README.md
```
## Code Quality and CI/CD

The project uses pre-commit hooks and automated CI/CD with Forgejo Actions.

### Pre-commit Hooks

Install pre-commit and run:

```
pre-commit install
```

This will set up hooks for:

- Go formatting (`gofmt`)
- Go module tidiness (`go mod tidy`)
- Go vetting (`go vet`)
- Optional `golangci-lint` (if installed)
- Markdown linting
- YAML validation
- Spell checking
### Continuous Integration

CI runs on every push and pull request via Forgejo Actions:

- Unit tests with race detection and coverage
- `go vet` and `gofmt` checks
- `golangci-lint` analysis
- Multi-platform binary builds on version tags
### Release Automation

When a tag starting with `v` is pushed (e.g., `v1.0.0`), the workflow:

- Runs all tests and linting
- Builds binaries for Linux (amd64, arm64), macOS (arm64), and Windows (amd64)
- Packages artifacts as `.tar.gz` files
- Creates a release with the artifacts
## License

[To be determined - MIT or GPL]