Zyte

Web data extraction platform with anti-ban technology, browser automation, and Scrapy Cloud hosting.

GitHub Education Access

Claim your free access at GitHub Education Pack - look for Zyte.

Dashboard: https://app.zyte.com/

Features

  • Zyte API: HTTP requests without bans
  • Browser Automation: Headless browser control
  • AI Parsing: Automatic data extraction
  • Scrapy Cloud: Hosted spider deployment
  • Web Scraping Copilot: Code generation

Quick Setup

Zyte API (HTTP)

pip install zyte-api
import os

from zyte_api import ZyteAPI

# Read the API key from the environment rather than hard-coding it
client = ZyteAPI(api_key=os.environ.get("ZYTE_API_KEY"))

# Fetch the page with browser rendering enabled
response = client.get({
    "url": "https://example.com",
    "browserHtml": True,
})
html = response["browserHtml"]

Scrapy with Zyte

pip install scrapy scrapy-zyte-api
# settings.py
import os

DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
# scrapy-zyte-api requires the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# Send every request through Zyte API by default
ZYTE_API_TRANSPARENT_MODE = True
ZYTE_API_KEY = os.environ.get("ZYTE_API_KEY")
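
With ZYTE_API_TRANSPARENT_MODE enabled, every request goes through Zyte API. Per-request parameters can still be set through Request.meta; a minimal sketch using the zyte_api_automap meta key from scrapy-zyte-api (the URL is a placeholder):

# Inside a spider callback: request browser rendering for one URL
yield scrapy.Request(
    "https://example.com/js-heavy-page",
    meta={"zyte_api_automap": {"browserHtml": True}},
)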

Scrapy Cloud Deployment

pip install shub

# Login
shub login

# Deploy spider
shub deploy
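
Once deployed, a spider can also be started on Scrapy Cloud from the command line; a sketch assuming a spider named products:

# Schedule a run of the deployed spider
shub schedule products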

Configuration

scrapy.cfg

[settings]
default = myproject.settings

[deploy]
project = YOUR_PROJECT_ID

Environment Variables

ZYTE_API_KEY=your_api_key
# Get from: https://app.zyte.com/o/account/api-access
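
If you keep the key in a local .env file instead, one common pattern is loading it with python-dotenv (an extra dependency, not required by Zyte):

# pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # reads ZYTE_API_KEY from .env into the environment
api_key = os.environ["ZYTE_API_KEY"]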

Zyte API Features

Browser Rendering

response = client.get({
    "url": "https://example.com",
    "browserHtml": True,  # rendered page HTML
    "javascript": True,   # execute JavaScript during rendering
    "screenshot": True,   # capture a screenshot of the page
})
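
The screenshot is returned base64-encoded; a minimal sketch for writing it to disk (PNG is assumed here, the actual format follows screenshotOptions):

from base64 import b64decode

# Decode the base64 screenshot payload and save it
with open("screenshot.png", "wb") as f:
    f.write(b64decode(response["screenshot"]))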

AI Data Extraction

response = client.get({
    "url": "https://example.com/product",
    "product": True  # Auto-extract product data
})

product = response["product"]
print(product["name"], product["price"])
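
Automatic extraction covers other data types as well. For a listing page, productList returns all products found; a sketch following the same pattern (field names per the Zyte API product schema):

response = client.get({
    "url": "https://example.com/products",
    "productList": True,  # auto-extract every product on a listing page
})

for product in response["productList"]["products"]:
    print(product.get("name"), product.get("price"))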

Geolocation

response = client.get({
    "url": "https://example.com",
    "browserHtml": True,   # at least one output type must be requested
    "geolocation": "US",   # route the request through a US IP
})

Scrapy Spider Example

import scrapy


# With the scrapy-zyte-api settings above, requests are routed
# through Zyte API automatically; a plain Spider subclass is enough.
class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for product in response.css(".product"):
            yield {
                "name": product.css(".title::text").get(),
                "price": product.css(".price::text").get(),
            }

        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
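
Run it locally before deploying (the output file name is just an example):

scrapy crawl products -O products.json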

Best Practices

  1. Use browser rendering sparingly: More expensive than HTTP
  2. Set appropriate delays: Respect rate limits
  3. Handle errors gracefully: Implement retry logic (see the sketch after this list)
  4. Use geolocation: When content varies by region
  5. Schedule in Scrapy Cloud: For recurring scrapes
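
A minimal retry sketch around the Python client. The official client already retries many transient errors internally, so treat this as an illustrative application-level wrapper; the attempt count, backoff, and broad except are arbitrary choices:

import os
import time

from zyte_api import ZyteAPI

client = ZyteAPI(api_key=os.environ.get("ZYTE_API_KEY"))


def get_with_retries(query, attempts=3, backoff=2.0):
    """Retry a Zyte API call with exponential backoff (illustrative)."""
    for attempt in range(1, attempts + 1):
        try:
            return client.get(query)
        except Exception:  # narrow to the client's error classes in real code
            if attempt == attempts:
                raise
            time.sleep(backoff ** attempt)


result = get_with_retries({"url": "https://example.com", "browserHtml": True})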

Resources