Web Crawler

Crawls a website, extracts each page's content as Markdown, and saves every page to disk.

Quick Start

# Set your target URL and run
./crawler --url https://example.com

# Check the results
ls output/

For verbose output:

DEBUG=1 ./crawler --url https://example.com

Configuration

The crawler reads runtime options from environment variables. The only required input is the --url command-line parameter; every environment variable is optional and has a sensible default.

| Variable | Default | Effect |
| --- | --- | --- |
| `DEBUG` | unset | When `1`, logs every queued URL and every URL skipped by a filter rule. |
| `MAX_URL_LENGTH` | `0` (unlimited) | URLs longer than this (after fragment and trailing-slash cleanup) are skipped. Catches session-encoded URLs and crawler traps. `0` disables the length check entirely; an unset or non-numeric value is treated as `0`. |
| `SKIP_WORDS_IN_URL` | unset | Comma-separated list of stopwords. A URL is skipped when any of its path or query tokens exactly matches one of these words. Token boundaries are non-alphanumeric characters, so `en` matches `/en/` but not `/engineering/`. |

Example:

DEBUG=1 \
MAX_URL_LENGTH=300 \
SKIP_WORDS_IN_URL=en,veranstaltungen,login \
./crawler --url https://example.com
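
The two configurable rules can be illustrated with a short Python sketch. This is not the crawler's actual implementation (that lives in links.aro); the function names are hypothetical, the URL is assumed to be already cleaned of fragment and trailing slash, and token matching is assumed to be case-sensitive and ASCII-only.

```python
import re
from urllib.parse import urlsplit


def parse_max_length(raw):
    """Return the configured limit; unset or non-numeric values become 0 (disabled)."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        return 0


def parse_skip_words(raw):
    """Turn a comma-separated string into a set of stopwords."""
    if not raw:
        return set()
    return {word.strip() for word in raw.split(",") if word.strip()}


def url_tokens(url):
    """Split the path and query on non-alphanumeric characters (assumption: ASCII only)."""
    parts = urlsplit(url)
    return [t for t in re.split(r"[^A-Za-z0-9]+", parts.path + " " + parts.query) if t]


def should_skip(url, max_length, skip_words):
    """Apply the length check first, then the exact-token stopword check."""
    if max_length > 0 and len(url) > max_length:
        return True
    return any(token in skip_words for token in url_tokens(url))


skip_words = parse_skip_words("en,veranstaltungen,login")
skip_en = should_skip("https://example.com/en/about", 300, skip_words)
skip_engineering = should_skip("https://example.com/engineering/", 300, skip_words)
```

Because whole tokens are compared rather than substrings, `skip_en` is true while `skip_engineering` is false.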

In addition to these environment variables, the FilterUrl Handler (in links.aro) hard-codes two filters that are not configurable: a non-HTML extension blocklist (images, archives, fonts, …) and a repeated-path-segment guard (URLs like /x/x/x/… are usually broken templates).
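
As a rough illustration of what such hard-coded filters might look like, here is a Python sketch. The extension list, the threshold of three consecutive identical segments, and both function names are assumptions made for the example; the real values are defined in links.aro and may differ.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical blocklist for illustration only; not copied from links.aro.
NON_HTML_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".svg",
    ".zip", ".gz", ".woff", ".woff2", ".ttf",
}


def has_blocked_extension(url):
    """True when the URL path ends in an extension from the blocklist."""
    _, ext = posixpath.splitext(urlsplit(url).path)
    return ext.lower() in NON_HTML_EXTENSIONS


def has_repeated_segment(url, threshold=3):
    """True when one path segment appears `threshold` or more times in a row."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    run = 1
    for previous, current in zip(segments, segments[1:]):
        run = run + 1 if current == previous else 1
        if run >= threshold:
            return True
    return False


blocked = has_blocked_extension("https://example.com/img/logo.PNG")
trapped = has_repeated_segment("https://example.com/x/x/x/page")
```

Checking only consecutive repeats keeps legitimate paths such as /a/b/a from being rejected; whether the crawler counts repeats the same way is not stated here.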

How It Works

The crawler is fully event-driven. No feature set calls another directly; they communicate exclusively through events and a repository observer:

Application-Start
       |
       v
  QueueUrl event ───> QueueUrl Handler
                           |
                      Store into crawled-repository
                           |
                      crawled-repository Observer
                           |
                      CrawlPage event ───> CrawlPage Handler
                                                |
                                    ┌───────────┴───────────┐
                                    v                       v
                              SavePage event          ExtractLinks event
                                    |                       |
                              SavePage Handler        ExtractLinks Handler
                              (write .md file)              |
                                                   parallel for each link:
                                                            |
                                                   NormalizeUrl Handler
                                                            |
                                                   FilterUrl Handler
                                                            |
                                                   QueueUrl event ──┐
                                                                    |
                                               (loops back, repo   |
                                                deduplicates)  <───┘

URL deduplication happens automatically: the repository ignores stores with duplicate IDs (each URL is hashed to a deterministic ID). The observer only fires for new entries.
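
The deduplication behaviour described above can be modelled in a few lines of Python. This is a conceptual sketch, not the ARO runtime's repository: the class name and methods are hypothetical, and SHA-256 is assumed as the ID hash purely for illustration.

```python
import hashlib


class CrawledRepository:
    """Toy repository: ignores duplicate IDs and notifies observers of new entries only."""

    def __init__(self):
        self._entries = {}
        self._observers = []

    def observe(self, callback):
        """Register a callback that runs once per newly stored URL."""
        self._observers.append(callback)

    def store(self, url):
        """Store a URL under a deterministic ID; return False if it was already present."""
        entry_id = hashlib.sha256(url.encode("utf-8")).hexdigest()
        if entry_id in self._entries:
            return False  # duplicate: nothing stored, no observer fires
        self._entries[entry_id] = url
        for callback in self._observers:
            callback(url)
        return True


# The observer stands in for "emit CrawlPage": it records each URL to crawl.
to_crawl = []
repo = CrawledRepository()
repo.observe(to_crawl.append)

repo.store("https://example.com/")
repo.store("https://example.com/about")
repo.store("https://example.com/")  # same ID as the first store, so it is ignored
```

After these three stores, `to_crawl` holds two URLs. This is why the link loop in the diagram terminates: a page that links back to an already queued URL triggers no further crawl.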

Project Structure

Crawler/
├── main.aro        # Entry point: reads --url parameter, creates output dir, emits first QueueUrl
├── crawler.aro     # CrawlPage Handler: fetches HTML, extracts Markdown, emits SavePage + ExtractLinks
├── links.aro       # Link pipeline: ExtractLinks, NormalizeUrl, FilterUrl, QueueUrl, repository Observer
├── storage.aro     # SavePage Handler: writes Markdown files to output/
├── openapi.yaml    # Event schemas (CrawlPageEvent, ExtractLinksEvent, etc.)
└── output/         # Crawled pages as .md files (created at runtime)

Output Format

Each crawled page is saved to output/ as a Markdown file named after the SHA-256 hash of its URL. The file has the following layout:

# Page Title

**Source:** https://example.com/page

---

Content with headings, links, lists, and formatting preserved...
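
To locate the file for a given URL, the naming scheme can be reproduced along these lines in Python. The sketch assumes the file name is the full lowercase hex digest of the UTF-8 encoded URL plus a .md suffix, and that the URL is hashed in the same form the crawler stored it; neither detail is confirmed here, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def output_path(url, output_dir="output"):
    """Build the expected file path: <output_dir>/<sha256 hex of url>.md (assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(output_dir) / f"{digest}.md"


def render_page(title, url, body):
    """Assemble a page in the layout shown above: title, source line, rule, content."""
    return f"# {title}\n\n**Source:** {url}\n\n---\n\n{body}\n"


path = output_path("https://example.com/page")
page = render_page("Page Title", "https://example.com/page", "Content...")
```

If the computed path does not exist after a crawl, the most likely cause is that the crawler hashed a normalized variant of the URL (for example without a trailing slash).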
