🕷️ Python Website Cloner & HTML Crawl Downloader

A powerful, robust Python website crawl downloader and HTML cloner to download entire websites for offline viewing while bypassing anti-bot protections.

📖 Overview

This Python Website Cloner is an advanced website crawl downloader built with BeautifulSoup4 and cloudscraper. It acts as a powerful HTML cloner, allowing you to recursively crawl and download all HTML pages, CSS files, JavaScript, and images from any target website, maintaining the exact directory structure.

If you want to crawl a website and save a navigable offline copy, this tool is for you. Two things set it apart: it uses cloudscraper to get past WAFs and anti-bot protections such as Cloudflare, and its built-in URL rewriting engine updates links in the downloaded HTML and CSS files so the cloned site can be navigated offline.
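
The link rewriting described above can be pictured with a small standard-library sketch. It is not the code in clone_site.py: the helper names `url_to_local_path` and `relative_link` are hypothetical, and the rule of saving extension-less routes as `index.html` is taken from the feature list in this README.

```python
import posixpath
from urllib.parse import urlparse


def url_to_local_path(url: str) -> str:
    """Map a URL to a relative file path inside the output directory (assumed layout)."""
    path = urlparse(url).path
    # Treat "/" and extension-less routes such as "/about" as directories
    # and store the page as index.html inside them.
    if path.endswith("/") or not posixpath.splitext(path)[1]:
        path = posixpath.join(path, "index.html")
    return path.lstrip("/")


def relative_link(from_url: str, to_url: str) -> str:
    """Return the relative link from one saved page to another saved file."""
    from_dir = posixpath.dirname(url_to_local_path(from_url))
    return posixpath.relpath(url_to_local_path(to_url), start=from_dir or ".")


# A page saved from /about links to the stylesheet one directory up.
print(relative_link("https://example.com/about", "https://example.com/css/site.css"))
```

Because every link is expressed relative to the directory of the page that contains it, the output folder can be moved or opened directly from disk without breaking navigation.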

✨ Key Features

🛡️ Anti-Bot Bypass: Uses cloudscraper and realistic TLS fingerprinting to bypass Cloudflare, Sucuri, and other Web Application Firewalls (WAF).
🔗 Intelligent Link Rewriting: Automatically converts absolute URLs and absolute paths into relative local paths, ensuring offline browsing works flawlessly.
🖼️ Deep Asset Downloading: Parses inline <style> tags and style="..." attributes to download hidden CSS background images (url(...)), not just standard <img> tags.
📁 Smart Directory Structuring: Automatically appends index.html to extension-less routes to prevent file/folder naming collisions on your operating system.
💤 Rate Limit Protection: Built-in polite crawling delays to prevent triggering IP bans on deeper site crawls.
🌐 Domain Restricted: Safely crawls pages within the target domain without accidentally downloading the entire internet.
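
To illustrate the deep asset downloading feature, here is a minimal sketch of pulling `url(...)` targets out of a CSS string such as the contents of a `<style>` tag or a `style="..."` attribute. The regular expression and the helper name `css_urls` are assumptions made for this example and may differ from what clone_site.py actually uses.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)


def css_urls(css_text, base_url):
    """Return absolute URLs for every url(...) reference, skipping data: URIs."""
    found = []
    for match in CSS_URL_RE.finditer(css_text):
        target = match.group(1).strip()
        if not target.startswith("data:"):
            # Resolve relative references against the page or stylesheet URL.
            found.append(urljoin(base_url, target))
    return found


inline_style = "background: url('/img/hero.jpg') no-repeat"
print(css_urls(inline_style, "https://example.com/blog/post"))
```

Embedded `data:` URIs are skipped because they already carry their content and need no download.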

🚀 Installation & Setup

Clone the repository:

```
git clone https://github.com/yourusername/python-website-cloner.git
cd python-website-cloner
```

Install the dependencies (a virtual environment is recommended):
```
pip install -r requirements.txt
```
Required packages: requests, beautifulsoup4, urllib3, cloudscraper.

💻 Usage

Run the clone_site.py script via the command line, providing the target URL and your desired output directory.

```
python clone_site.py <TARGET_URL> <OUTPUT_DIRECTORY>
```

Example:

```
python clone_site.py https://example.com ./downloads/example_clone
```

Once the crawl is finished, navigate to the ./downloads/example_clone directory and open index.html in any web browser to view the fully functional offline clone!

🛠️ How It Works Under The Hood

Queue-Based Crawling: Instead of deep recursion, which can exhaust Python's call stack on large sites, it uses a breadth-first queue of URLs still to be visited.
HTML Parsing: Uses BeautifulSoup4 to find all internal href and src links.
Regex CSS Parsing: Uses Regular Expressions to capture url(...) declarations inside CSS blocks.
Content-Type Detection: Accurately distinguishes between HTML pages and assets to determine the correct way to save the file.
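
The queue-based, domain-restricted crawl can be sketched as follows. This is an illustration only: the function name `crawl` and the `get_links` callback are hypothetical, and link extraction is passed in as a plain function so the example runs without network access (in the real tool that step would be done by fetching the page and parsing it with BeautifulSoup4).

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links):
    """Visit same-domain pages breadth-first; get_links(url) returns raw hrefs."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    order = []
    while queue:
        url = queue.popleft()  # FIFO pop gives breadth-first order
        order.append(url)
        for href in get_links(url):
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#", 1)[0]
            # Stay on the start domain and never enqueue a URL twice.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# A tiny in-memory "site" standing in for real HTTP responses.
site = {
    "https://example.com/": ["/a", "/b", "https://other.org/x"],
    "https://example.com/a": ["/c", "/"],
    "https://example.com/b": ["/c#top"],
    "https://example.com/c": [],
}
print(crawl("https://example.com/", lambda u: site.get(u, [])))
```

The `seen` set is what keeps the crawl finite when pages link back to each other, and the `netloc` comparison is one simple way to keep the crawl inside the target domain.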

⚠️ Disclaimer

This tool is intended for educational purposes, archiving your own sites, or offline viewing of public data. Please respect the robots.txt policies of websites and ensure you have permission to download copyrighted assets.

🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
