Integrate worker pool for parallel downloading of assets #99

Open
opened 2026-04-18 00:36:20 +00:00 by fuzzy · 0 comments
Owner

Phase 2 of the roadmap includes "Download assets in parallel" with a worker pool. The worker pool implementation exists in `internal/workerpool` but is not integrated into the archiver's fetching loop.

Currently, the archiver fetches pages sequentially. To improve performance, we should use the worker pool to fetch pages (and later assets) concurrently, while still respecting rate limiting and depth constraints.

This task involves modifying the `Archive` method in `internal/archiver/archiver.go` to use the worker pool for parallel downloading.

Acceptance criteria:

  • Worker pool is used to fetch HTML pages concurrently (up to configured worker count)
  • Rate limiting (delay between requests) is respected across all workers
  • Depth limits are still enforced (breadth-first or depth-first ordering)
  • Visited URL tracking remains thread-safe (already added mutex to URLTracker)
  • Existing functionality (robots.txt checking, retries, directory structure preservation) continues to work.

This issue completes the remaining item of Phase 2.
