A powerful, robust Python website crawl downloader and HTML cloner to download entire websites for offline viewing while bypassing anti-bot protections.
This Python Website Cloner is a crawl-based website downloader built with BeautifulSoup4 and cloudscraper. Starting from a target URL, it follows internal links and downloads the HTML pages, CSS files, JavaScript, and images it finds, preserving the site's directory structure on disk.
If you want to crawl a website and keep a navigable offline copy, this tool is for you. Two things set it apart: it can get past WAFs and anti-bot protections (like Cloudflare), and its built-in URL rewriting engine automatically updates links in the downloaded HTML and CSS files so the cloned site can be navigated offline.
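To make the link-rewriting idea concrete, here is a minimal sketch that uses only the standard library. The helper names `local_path` and `rewrite_link` are hypothetical and are not taken from `clone_site.py`; the sketch also assumes that extension-less routes are stored as `<route>/index.html` and ignores query strings and fragments.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url: str) -> str:
    """Map a URL to a file path inside the clone (hypothetical layout)."""
    path = urlsplit(url).path
    # Assumption: routes without a file extension are saved as <route>/index.html.
    if path.endswith("/") or "." not in posixpath.basename(path):
        path = posixpath.join(path, "index.html")
    return path.lstrip("/")


def rewrite_link(page_url: str, href: str) -> str:
    """Return href rewritten as a path relative to the saved copy of page_url."""
    target = urljoin(page_url, href)
    # Links that point at a different host are left untouched.
    if urlsplit(target).netloc != urlsplit(page_url).netloc:
        return href
    page_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(target), start=page_dir or ".")


print(rewrite_link("https://example.com/blog/post", "/css/site.css"))
# → ../../css/site.css
```

Because the result is relative to the directory of the saved page, the clone can be moved to another folder without breaking its internal links.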
- 🛡️ Anti-Bot Bypass: Uses `cloudscraper` and realistic TLS fingerprinting to bypass Cloudflare, Sucuri, and other Web Application Firewalls (WAFs).
- 🔗 Intelligent Link Rewriting: Automatically converts absolute URLs and absolute paths into relative local paths, ensuring offline browsing works flawlessly.
- 🖼️ Deep Asset Downloading: Parses inline `<style>` tags and `style="..."` attributes to download hidden CSS background images (`url(...)`), not just standard `<img>` tags.
- 📁 Smart Directory Structuring: Automatically appends `index.html` to extension-less routes to prevent file/folder naming collisions on your operating system.
- 💤 Rate Limit Protection: Built-in polite crawling delays to prevent triggering IP bans on deeper site crawls.
- 🌐 Domain Restricted: Safely crawls pages within the target domain without accidentally downloading the entire internet.
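As a rough illustration of the Deep Asset Downloading feature, the sketch below pulls `url(...)` references out of a CSS string, which could be the text of a `<style>` tag or the value of a `style="..."` attribute. The pattern and the `extract_css_urls` helper are assumptions made for this example; the expression used by the tool itself may differ.

```python
import re

# Matches url(...) with optional single or double quotes around the value.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def extract_css_urls(css: str) -> list[str]:
    """Return the asset references found in url(...) declarations."""
    urls = []
    for _quote, value in CSS_URL_RE.findall(css):
        value = value.strip()
        # Inline data: URIs carry their own payload, so there is nothing to download.
        if value and not value.startswith("data:"):
            urls.append(value)
    return urls


css = """
body { background: url("img/bg.png"); }
.icon { background-image: URL(/icons/a.svg); }
.dot { background: url(data:image/png;base64,AAAA); }
"""
print(extract_css_urls(css))
# → ['img/bg.png', '/icons/a.svg']
```

Each extracted reference would still need to be resolved against the URL of the page or stylesheet it came from before it can be fetched.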
- Clone the repository:

      git clone https://github.com/yourusername/python-website-cloner.git
      cd python-website-cloner

- Install the dependencies: It is recommended to use a virtual environment.

      pip install -r requirements.txt

  Required packages: `requests`, `beautifulsoup4`, `urllib3`, `cloudscraper`.
Run the `clone_site.py` script via the command line, providing the target URL and your desired output directory.

    python clone_site.py <TARGET_URL> <OUTPUT_DIRECTORY>

Example:

    python clone_site.py https://example.com ./downloads/example_clone

Once the crawl is finished, navigate to the `./downloads/example_clone` directory and open `index.html` in any web browser to view the fully functional offline clone!
- Queue-Based Crawling: Instead of deep recursion which can cause memory issues, it uses a breadth-first queue approach.
- HTML Parsing: Uses `BeautifulSoup4` to find all internal `href` and `src` links.
- Regex CSS Parsing: Uses Regular Expressions to capture `url(...)` declarations inside CSS blocks.
- Content-Type Detection: Accurately distinguishes between HTML pages and assets to determine the correct way to save the file.
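The steps above can be sketched as a small breadth-first crawler. This is an illustration under stated assumptions, not the code in `clone_site.py`: the standard library's `html.parser` stands in for BeautifulSoup4 so the example has no dependencies, and `fetch` is a hypothetical callable that returns a `(content_type, body)` pair, which lets the sketch run against an in-memory site instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href/src attribute values (stdlib stand-in for BeautifulSoup4)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl restricted to the host of start_url.

    Returns the URLs in the order they were visited.
    """
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        content_type, body = fetch(url)
        visited.append(url)
        # Only HTML pages are parsed for further links; assets are leaves.
        if not content_type.startswith("text/html"):
            continue
        parser = LinkCollector()
        parser.feed(body)
        for link in parser.links:
            absolute, _fragment = urldefrag(urljoin(url, link))
            # Domain restriction: skip other hosts and already-queued URLs.
            if urlsplit(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" used in place of real HTTP requests.
site = {
    "https://example.com/": (
        "text/html; charset=utf-8",
        '<a href="/about">About</a><img src="logo.png">'
        '<a href="https://other.org/">External</a>',
    ),
    "https://example.com/about": ("text/html", '<a href="/">Home</a>'),
    "https://example.com/logo.png": ("image/png", ""),
}
print(crawl("https://example.com/", lambda url: site[url]))
```

The `seen` set is what keeps the queue from growing without bound on sites whose pages link back to each other, and the explicit queue avoids the deep call stacks that a recursive crawler would build up.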
This tool is intended for educational purposes, archiving your own sites, or offline viewing of public data. Please respect the robots.txt policies of websites and ensure you have permission to download copyrighted assets.
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
This project is licensed under the MIT License - see the LICENSE file for details.