diff --git a/sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md b/sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md
index f466c22349..3d7aceca25 100644
--- a/sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md
+++ b/sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md
@@ -253,8 +253,7 @@ apify run
 It runs, that's nice! But looking at the output, we can't really verify what exactly gets scraped! While we're at it, let's change that with another prompt:
 
 ```text
-In the output of the scraper I want to see
-how the items being saved look like.
+I want the scraper to log each item before it's saved.
 ```
 
 We'll approve all changes and go to the command line again:
@@ -263,7 +262,7 @@ We'll approve all changes and go to the command line again:
 apify run
 ```
 
-Now, the output of the scraper contains the actual items being scraped and we can verify we've been successful in changing the format of the prices (they appear at the very end of each line):
+Now, the scraper prints the actual items being scraped and we can verify we've been successful in changing the format of the prices (they appear at the very end of each line):
 
 ```text
 ...
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md b/sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md
index 62957d551b..307879df4a 100644
--- a/sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md
+++ b/sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md
@@ -1,25 +1,237 @@
 ---
-title: Docs driven prompting
-description: TBD
+title: Developing a scraper with docs-driven prompting
+description: Improve your Apify scraper by documenting its behavior first and letting an AI agent follow it as a practical spec.
 slug: /scraping-with-apify-and-ai/docs-driven-prompting
 unlisted: true
 ---
-
+**In this lesson, we'll keep improving our app for tracking prices on an e-commerce website. We'll write documentation which isn't only useful for people to read, but also gives Cursor the context it needs.**
+
+---
+
+We made our lives easier with an AI agent. Improving our scraper now takes way less back-and-forth than using a regular AI chat. Still, both approaches share one downside.
+
+Prompting a chat or agent is quick and straightforward, but it doesn't leave much trace of our intentions:
+
+- If we want someone else to take over later, it'll be hard for them to figure out why we made some decisions and whether behavior is intentional or accidental.
+- If we get busy with other things and return after a few months, we'll basically become that "someone else" from the first bullet. After a week, we remember why we process prices a certain way. After a year, it's mostly fuzzy memories.
+- If we want other people to use our scraper, they need simple instructions on how to run it and what to expect.
+
+Traditionally, we'd write this documentation after finishing the software. With AI, we can describe how the program should work before it's done, point the agent to that spec, and ask it to make it real.
+
+## Starting with README
+
+It's good practice to have a README file in every software project. It's a plain text file where project authors write down the info people usually need to understand what the project is about.
+
+The file is just text, but people use special characters to format it. A popular convention for that is Markdown, and when a README uses it, the file is usually called `README.md`.
+
+If we look at our files in Cursor, we'll see the Apify template already includes a `README.md`.
After opening it, we'll see something like this:
+
+```md
+# JavaScript Crawlee & CheerioCrawler Actor Template
+
+
+
+This template example was built with [Crawlee](https://crawlee.dev/) to scrape data from a website using [Cheerio](https://cheerio.js.org/) wrapped into [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler).
+
+## Quick Start
+
+...
+```
+
+Many sections follow, covering how the project works, how to develop it, how to deploy it, and more.
+
+Notice that headings start with one or more `#` characters. We also get bullet points, links, and code blocks. That's Markdown.
+
+![Markdown syntax highlighting in Cursor](images/cursor-readme.webp)
+
+Cursor understands Markdown, so it improves readability by coloring the formatted parts (this is called syntax highlighting). The **Preview** button shows how your Markdown will look once rendered.
+
+![Markdown preview in Cursor](images/cursor-readme-preview.webp)
+
+:::tip README and Markdown basics
+
+The [Make a README](https://www.makeareadme.com/) website explains why people shouldn't skip adding a README to their projects. To learn Markdown basics, check out [Getting Started](https://www.markdownguide.org/getting-started/) on Markdown Guide, and keep their [Cheat Sheet](https://www.markdownguide.org/cheat-sheet/) handy.
-:::note Course under construction
-This page hasn't been written yet. Come later, please!
 :::
+## Recreating README.md
+
+We could edit the existing `README.md`, but for this lesson it's easier to start from scratch. We'll clear the file and start with a new title and intro:
+
+```md
+# My Actor
+
+Small app for tracking prices on an e-commerce website.
+```
+
+Now let's add a short section on how to work with the project:
+
+```md
+## Development
+
+This is an Apify Actor that runs on Apify.
+
+- Have Node.js and Apify CLI ready
+- Run `npm install` to install dependencies
+- Run `apify run` to start scraping
+- Run `apify push` to upload a new version of the program to Apify
+```
+
+This is enough for both people and AI agents to quickly understand how to set things up and run them.
+
+## Documenting current behavior
+
+Now let's add a summary of what our scraper already does:
+
+```md
+## Behavior
+
+- Downloads the Sales page: https://warehouse-theme-metal.myshopify.com/collections/sales
+- The Sales page is the default input URL of the Actor.
+- Extracts all products and saves this for each one:
+  - Product name
+  - Product detail page URL
+  - Price
+- Logs each item before it's saved.
+- Before it ends, it logs how many products it collected.
+- The Actor output schema ensures that Apify interface shows saved items in the best way.
+
+### Price handling
+
+Saves prices as numbers. Some prices are "from" values, so we call the field `minPrice`.
+
+- `Sale price$74.95` becomes `74.95`
+- `Sale priceFrom $1,398.00` becomes `1398.00`
+- `Sale price$158.00` becomes `158.00`
+```
+
+Most of the text above is just our past prompts, slightly rephrased. Because we now describe behavior in the README, anyone can understand details like price handling. If a bug shows up later, our original intent is clear.
+
+## Adding vendor name
+
+The README documents what we already have. Now let's use it as a spec for what comes next. We'll add the vendor name to the output:
+
+```md
+- Extracts all products and saves this for each one:
+  - Product name
+  - Product detail page URL
+  - Price
+
+  - Vendor name
+```
+
+We'll save the file with Ctrl+S (or ⌘+S on macOS), then send this prompt to the AI agent:
+
+```text
+Ensure all behavior documented in README is correctly implemented.
+```
+
+We'll likely need to approve some commands, because the agent may fetch the Warehouse store page and run local tools.
+
+When it's done, it'll print a summary and we'll review the changes.
Then we'll approve them and run this command to check whether the Actor now scrapes the vendor name too:
+
+```text
+apify run
+```
+
+In the output, we should see each item logged before it's saved, and each item should now include the vendor name. It's a bit hard to spot, but in the example below, the first product has `vendorName` set to `JBL` and the second to `Sony`:
+
+```text
+INFO Saving product {"productName":"JBL Flip 4 Waterproof Portable B
+luetooth Speaker","productUrl":"https://warehouse-theme-metal.myshopi
+fy.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker","ve
+ndorName":"JBL","minPrice":74.95}
+INFO Saving product {"productName":"Sony XBR-950G BRAVIA 4K HDR Ultr
+a HD TV","productUrl":"https://warehouse-theme-metal.myshopify.com/pr
+oducts/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv"
+,"vendorName":"Sony","minPrice":1398}
+...
+```
+
+Nice! We just used a docs-first approach with an AI agent!
+
+## Adding image URL and SKU
+
+Now let's add two more details for each product. We want the scraper to get the product image URL and the number of units in stock. We'll call the latter SKU, although strictly speaking a [SKU](https://en.wikipedia.org/wiki/Stock_keeping_unit) (stock keeping unit) is a code that identifies a product, not a count of units in stock:
+
+```md
+- Extracts all products and saves this for each one:
+  - Product name
+  - Product detail page URL
+  - Price
+  - Vendor name
+
+  - Product image URL
+
+  - SKU
+```
+
+For SKU, it's better to describe exactly how we want it handled, so we'll add another section to the README. We'll scroll through the Sales page, find different SKU formats, and write clear examples of what should happen:
+
+```md
+### SKU handling
+
+Saves SKU as a number. Examples:
+
+- `In stock, 672 units` becomes `672`
+- `Only 2 units left` becomes `2`
+- `Sold out` becomes `0`
+```
+
+We'll save the file again and repeat the same prompt as before to turn our spec into code:
+
+```text
+Ensure all behavior documented in README is correctly implemented.
+```
+
+When it's done, let's check how the scraped items look now:
+
+```text
+apify run
+```
+
+This is the first product we see in the output:
+
+```text
+INFO Saving product {"productName":"JBL Flip 4 Waterproof Portable B
+luetooth Speaker","productUrl":"https://warehouse-theme-metal.myshopi
+fy.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker","ve
+ndorName":"JBL","imageUrl":"https://warehouse-theme-metal.myshopify.c
+om/cdn/shop/products/13549_790__2_73a2a189-b3d5-4ec8-a4c3-b506e1beab7
+0.jpg?v=1559820925&width=500","minPrice":74.95,"sku":672}
+```
+
+With a bit of effort, we can see `sku` is `672`. If we copy the `imageUrl` value into a browser, we can check it's the right image for the JBL Bluetooth speaker. That's a bit tedious, so let's see if Apify shows it better.
+
+## Pushing Actor to Apify
+
+We've made quite a few changes to our Actor and tested them, so this is a good time to push a new version to Apify:
+
+```text
+apify push
+```
+
+After the command finishes, we'll open the URL it gives us at the end:
+
+```text
+...
+Actor detail https://console.apify.com/actors/EL7U7aNddXOzwEJ66
+Success: Actor was deployed to Apify cloud and built there.
+```
+
+In the Apify interface, we'll click the **Start** button. Soon we should see items popping up in the **Output** section.
+
+Thanks to the sentence "The Actor output schema ensures that Apify interface shows saved items in the best way," the agent improved how our Actor talks to Apify, so we don't have to switch to **All fields** anymore:
+
+![Improved Apify output](images/apify-output-products.webp)
+
+Even better, we can see images right away!
+
+## Wrapping up
+
+We wrote down how our scraper should behave, waved a magic wand, and those words turned into working code. But instead of letting all our decisions disappear in prompt windows, we saved them in a file as durable documentation anyone can read later.
-
+In the next lesson, we'll take a look at how we can develop our scraper by saving pieces of the target website and testing our program against it.
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/04_tests_driven_prompting.md b/sources/academy/platform/scraping_with_apify_and_ai/04_tests_driven_prompting.md
index 69abc81fd8..c21390efd0 100644
--- a/sources/academy/platform/scraping_with_apify_and_ai/04_tests_driven_prompting.md
+++ b/sources/academy/platform/scraping_with_apify_and_ai/04_tests_driven_prompting.md
@@ -16,12 +16,18 @@ This page hasn't been written yet. Come later, please!
 :::
+
+
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-products.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-products.webp
new file mode 100644
index 0000000000..bdb9e72537
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-products.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-readme-preview.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-readme-preview.webp
new file mode 100644
index 0000000000..cd27f0350f
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-readme-preview.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-readme.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-readme.webp
new file mode 100644
index 0000000000..44079b8e85
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-readme.webp differ
diff --git a/typos.toml b/typos.toml
index bf2696b1e2..04fa2e543a 100644
--- a/typos.toml
+++ b/typos.toml
@@ -6,6 +6,12 @@ extend-ignore-re = [
     "https?://[^\\s]+", # Ignore URLs
 ]
 
+[type.md]
+extend-ignore-re = [
+    # Ignore fenced code blocks marked as 'text' in Markdown
+    "(?s)```(?:text)\\b[^\\n]*\\n.*?```",
+]
+
 [files]
 # Extend the default exclude list
 extend-exclude = [