MrTheMech/html-website-cloner-python

🕷️ Python Website Cloner & HTML Crawl Downloader

A Python website crawler and HTML cloner that downloads entire websites for offline viewing and can get past common anti-bot protections.


📖 Overview

This Python Website Cloner is a website crawl downloader built with BeautifulSoup4 and cloudscraper. It crawls a target website page by page and downloads its HTML pages, CSS files, JavaScript, and images, preserving the site's directory structure.

If you want to crawl a website and keep a browsable HTML copy of it, this tool is for you. Two things set it apart: it uses cloudscraper to get past WAFs and anti-bot protections such as Cloudflare, and its built-in URL rewriting engine updates links in the downloaded HTML and CSS files so the cloned site can be navigated offline.

✨ Key Features

  • 🛡️ Anti-Bot Bypass: Uses cloudscraper and realistic TLS fingerprinting to get past Cloudflare, Sucuri, and other Web Application Firewalls (WAFs).
  • 🔗 Intelligent Link Rewriting: Converts absolute URLs and absolute paths into relative local paths so that links keep working when browsing offline.
  • 🖼️ Deep Asset Downloading: Parses inline <style> tags and style="..." attributes to download CSS background images referenced via url(...), not just standard <img> tags.
  • 📁 Smart Directory Structuring: Appends index.html to extension-less routes to prevent file/folder naming collisions on your operating system.
  • 💤 Rate Limit Protection: Built-in delays between requests reduce the risk of triggering IP bans on deeper site crawls.
  • 🌐 Domain Restricted: Crawls only pages within the target domain and does not follow links to external sites.
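The link rewriting, index.html handling, and url(...) extraction described above can be sketched with the standard library alone. This is a minimal illustration and not the code in clone_site.py: the names local_path, relative_link, css_asset_urls, and CSS_URL_RE are hypothetical, and the rule used here for extension-less routes (no dot in the last path segment) is an assumption.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Matches url(...) with optional single or double quotes around the target.
CSS_URL_RE = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)


def local_path(url: str) -> str:
    """Map a URL to a relative file path, adding index.html to extension-less routes."""
    path = urlsplit(url).path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if path.endswith("/"):
        path += "index.html"
    elif "." not in last_segment:
        # Assumption: a last segment without a dot is a route, not a file.
        path += "/index.html"
    return path.lstrip("/")


def relative_link(page_url: str, target_url: str) -> str:
    """Return the relative link a saved page should use to reach a saved target."""
    page_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(local_path(target_url), start=page_dir or ".")


def css_asset_urls(css_text: str, base_url: str) -> list[str]:
    """Resolve every url(...) reference in a CSS snippet against base_url."""
    found = []
    for _quote, target in CSS_URL_RE.findall(css_text):
        # Skip empty targets and inline data: URIs, which need no download.
        if target and not target.startswith("data:"):
            found.append(urljoin(base_url, target))
    return found
```

For example, under these assumptions a page saved from https://example.com/blog/post would reach https://example.com/css/site.css through the link ../../css/site.css, because the page itself is stored as blog/post/index.html.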

🚀 Installation & Setup

  1. Clone the repository:

    git clone https://github.com/MrTheMech/html-website-cloner-python.git
    cd html-website-cloner-python
  2. Install the dependencies (using a virtual environment is recommended):

    pip install -r requirements.txt

    Required packages: requests, beautifulsoup4, urllib3, cloudscraper.

💻 Usage

Run the clone_site.py script from the command line, passing the target URL and the desired output directory.

python clone_site.py <TARGET_URL> <OUTPUT_DIRECTORY>

Example:

python clone_site.py https://example.com ./downloads/example_clone

Once the crawl finishes, open index.html in the ./downloads/example_clone directory with any web browser to view the offline clone.

🛠️ How It Works Under The Hood

  1. Queue-Based Crawling: Instead of deep recursion, which can exhaust the call stack on large sites, it processes URLs from a breadth-first queue.
  2. HTML Parsing: Uses BeautifulSoup4 to find all internal href and src links.
  3. Regex CSS Parsing: Uses regular expressions to capture url(...) declarations inside CSS blocks.
  4. Content-Type Detection: Uses the response's Content-Type to distinguish HTML pages from assets and decide how each file should be saved.
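The steps above can be sketched as a small breadth-first crawl loop. This is an illustration under stated assumptions, not the tool's actual implementation: crawl, LinkCollector, and the fetch callback are hypothetical names, and the standard library's html.parser stands in for BeautifulSoup4 so the sketch has no third-party dependencies. In practice, fetch could wrap the get method of a session created with cloudscraper.create_scraper().

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect href and src attribute values (stdlib stand-in for BeautifulSoup4)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(start_url, fetch, delay=0.0, max_pages=100):
    """Breadth-first crawl restricted to start_url's host.

    fetch(url) is a caller-supplied function returning (content_type, body_text).
    Returns a dict mapping each visited URL to "page" or "asset".
    """
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = {}

    while queue and len(visited) < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        content_type, body = fetch(url)
        is_html = content_type.split(";")[0].strip().lower() == "text/html"
        visited[url] = "page" if is_html else "asset"
        if is_html:
            collector = LinkCollector()
            collector.feed(body)
            for link in collector.links:
                absolute, _fragment = urldefrag(urljoin(url, link))
                # Domain restriction: only queue URLs on the starting host.
                if urlsplit(absolute).netloc == host and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        if delay:
            time.sleep(delay)  # polite pause between requests
    return visited
```

Passing fetch in as a parameter keeps the traversal logic separate from networking, so the loop can be exercised against an in-memory dictionary of pages before pointing it at a real site.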

⚠️ Disclaimer

This tool is intended for educational purposes, archiving your own sites, or offline viewing of public data. Please respect the robots.txt policies of websites and ensure you have permission to download copyrighted assets.
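One way to honor robots.txt before fetching a URL is Python's built-in urllib.robotparser. The sketch below is an assumption about how such a check could look and is not something clone_site.py is documented to do; build_robots_checker is a hypothetical helper that takes the robots.txt text as a string, so it can be tried without network access.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network call

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Example rules: everything is allowed except paths under /private/.
rules = "User-agent: *\nDisallow: /private/\n"
allowed = build_robots_checker(rules)
print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data.html"))
```

A crawler could call allowed(url) before queueing each link and skip any URL for which it returns False.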

🤝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.
