@@ -20,7 +20,7 @@ Thus far, you've run Actors on the platform and written an Actor of your own, wh

## Advanced Actor overview {#advanced-actors}

In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in [three short lessons](../../webscraping/scraping_basics_legacy/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same.
In this course, we'll be working out of the Amazon scraper built in the old [Web scraping basics for JavaScript devs](../../webscraping/scraping_basics_legacy/challenge/index.md) course (not the [current scraping basics course](../../webscraping/scraping_basics_javascript/index.md)). If you haven't gone through it yet, we recommend doing so - it covers the fundamentals this project is built on. If you'd rather skip straight to this course, you can use this working implementation instead: [academy-amazon-scraper](https://github.com/apify-projects/academy-amazon-scraper).

Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile).

@@ -46,9 +46,11 @@ Prior to moving forward, please read over these resources:

## Our task {#our-task}

In this task, we'll be building on top of what we already created in the [Web scraping basics for JavaScript devs](../../webscraping/scraping_basics_legacy/challenge/index.md) course's final challenge, so keep those files safe!
In this task, we'll be building on top of the [Amazon scraper project mentioned above](#advanced-actors).

Once our Amazon Actor has completed its run, we will, rather than sending an email to ourselves, call an Actor through a webhook. The Actor called will be a new Actor that we will create together, which will take the dataset ID as input, then subsequently filter through all of the results and return only the cheapest one for each product. All of the results of the Actor will be pushed to its default dataset.
Once our Amazon Actor has completed its run, we could send ourselves an email, but instead let's call another Actor through a webhook. The called Actor will be a new one that you will create. It takes the dataset ID as input, filters the results, and returns only the cheapest result for each unique ASIN. All of its results will be pushed to its default dataset.

> Note: the [starter repo](https://github.com/apify-projects/academy-amazon-scraper) produces one result per product, so in practice the filtering Actor will pass every item through unchanged. That is fine. The goal here is to learn how to pass data between Actors using webhooks, not to do complex filtering.
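
The filtering step described above can be sketched in plain JavaScript. This is a minimal sketch, not the course's solution: it assumes each dataset item has the `asin` and `offer` fields shown in the scraper's output, and that `offer` is a price string such as `$329.99`. The helper names `parseOffer` and `cheapestPerAsin` and the sample items are hypothetical, and loading the items from the dataset is left out.

```javascript
// Hypothetical helper: turn a price string such as "$1,329.99" into a number.
// Items whose offer cannot be parsed are treated as infinitely expensive.
function parseOffer(offer) {
    const amount = Number.parseFloat(String(offer).replace(/[^0-9.]/g, ''));
    return Number.isNaN(amount) ? Infinity : amount;
}

// Hypothetical helper: keep only the cheapest item for each unique ASIN.
function cheapestPerAsin(items) {
    const cheapest = new Map();
    for (const item of items) {
        const current = cheapest.get(item.asin);
        if (!current || parseOffer(item.offer) < parseOffer(current.offer)) {
            cheapest.set(item.asin, item);
        }
    }
    return [...cheapest.values()];
}

// Made-up sample data, only for illustration.
const sampleItems = [
    { asin: 'A1', sellerName: 'Seller one', offer: '$169.98' },
    { asin: 'A1', sellerName: 'Seller two', offer: '$162.97' },
    { asin: 'B2', sellerName: 'Seller three', offer: '$329.99' },
];

console.log(cheapestPerAsin(sampleItems));
```

With one result per product, as the starter repo produces, every ASIN appears only once, so the function returns the input unchanged.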

[**Solution**](./solutions/integrating_webhooks.md)

@@ -40,27 +40,23 @@ First, let's create a repository. This can be done [in a number of ways](https:/

Then, we'll run the commands it tells us in our terminal (while within the **demo-actor** directory) to initialize the repository locally, and then push all of the files to the remote one.

After you've created your repo, navigate on the Apify platform to the Actor we called **demo-actor**. In the **Source** tab, click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**, which is what we've been using so far.
After you've created your repo, navigate on the Apify platform to your Amazon scraping Actor (referred to throughout this course as **demo-actor**). In the **Source** tab, click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**, which is what we've been using so far.

![Select source code location](./images/select-source-location.png)

Then, go ahead and paste the link to your repository into the **Git URL** text field and click **Save**.

The final step is to click on **API** in the top right corner of your Actor's page:
The final step is to check your **Build settings**. Under the Git URL field, you'll see two options: **Automatic builds** and **Manual builds**. Make sure **Automatic builds** is selected - this tells Apify to rebuild your Actor whenever you push to GitHub, with no extra configuration needed.

![API button](./images/api-button.jpg)

And scroll through all of the links until you find the **Build Actor** API endpoint. Copy this endpoint's URL, then head back over to your GitHub repository and navigate to **Settings > Webhooks > Add webhook**. The final thing to do is to paste the URL and save the webhook.

![Adding a webhook to your GitHub repo](../../../platform/actors/development/deployment/images/ci-github-integration.png)
![Build settings with Automatic builds selected](./images/build-settings.webp)

And you're done! 🎉

## Quick chat about code management {#code-management}

This was a bit of overhead, but the good news is that you don't ever have to configure this stuff again for this Actor. Now, every time the content of your **main**/**master** branch changes, the Actor on the Apify platform will rebuild based on the newest code.

Think of it as combining two steps into one! Normally, you'd have to do a `git push` from your terminal in order to get the newest code onto GitHub, then run `apify push` to push it to the platform.
Think of it as combining two steps into one. Normally, you'd have to do a `git push` from your terminal to get the newest code onto GitHub, then run `apify push` to push it to the platform.

It's also important to know that GitHub/GitLab repository integration is standard practice. As projects grow and the number of contributors and maintainers increases, it only makes sense to have a GitHub repository integrated with the project's Actor. For the remainder of this course, all Actors created will be integrated with a GitHub repository.

@@ -41,7 +41,7 @@ In our Amazon Actor, each dataset result must now have the following extra keys:
}
```

Also, an object including these values should be persisted during the run in th Key-Value store and logged to the console every 10 seconds:
Also, an object including these values should be persisted during the run in the Key-Value store and logged to the console every 10 seconds:

```json
{
@@ -81,7 +81,7 @@ In the next lesson, we'll start with our Node.js project. First we'll be figurin

### Extract the price of IKEA's most expensive artificial plant

At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use the [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number.
At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use the [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number (you may need [`replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) to handle spaces).

<details>
<summary>Solution</summary>
@@ -93,8 +93,8 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a
1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value.
1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price.
1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`.
1. Convert the price text into a number by executing `parseInt(price.textContent)`.
1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek).
1. Convert the price text into a number by executing `parseInt(price.textContent.replace(' ', ''))`. The price text contains spaces as thousand separators, so `.replace(' ', '')` strips them before parsing - a technique you'll explore further in the [Extracting data](./07_extracting_data.md#removing-dollar-sign-and-commas) lesson.
1. At the time of writing, this returns `1299`, meaning [1 299 SEK](https://www.google.com/search?q=1299%20sek).
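
One caveat with the step above: when `replace()` gets a string pattern, it replaces only the first match, so a price with two separators would still contain a space. A slightly more defensive variant, sketched below, uses a regular expression with the `g` flag instead. The helper name `parsePrice` is hypothetical, and the assumption that the separator might also be a non-breaking space has not been checked against the IKEA page.

```javascript
// Hypothetical helper: strip every whitespace character, then parse.
// In JavaScript, \s also matches the non-breaking space (U+00A0).
function parsePrice(text) {
    return Number.parseInt(text.replace(/\s/g, ''), 10);
}

console.log(parsePrice('1 299'));
console.log(parsePrice('12\u00a0999'));
```

In the **Console**, this could be called as `parsePrice(price.textContent)`.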

</details>

@@ -119,7 +119,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto

### Extract details about the first post on Guardian's F1 news

On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo.
On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph (if it has one), and URL of the associated photo.

![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png)

@@ -132,7 +132,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone),
1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead.
1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post.
1. Extract the post's title by executing `post.querySelector('h3').textContent`.
1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`.
1. Extract the lead paragraph (if it has one) by executing `post.querySelector('span div').textContent`.
1. Extract the photo URL by executing `post.querySelector('img').src`.
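
Because a post may have no lead paragraph, `querySelector()` can return `null`, and reading `.textContent` from `null` throws a `TypeError`. A small guard avoids that. The sketch below uses a hypothetical helper named `textOrNull`, and the stand-in `fakePost` object exists only so the snippet runs outside a browser.

```javascript
// Hypothetical helper: return the trimmed text of the first match,
// or null when nothing matches the selector.
function textOrNull(root, selector) {
    const element = root.querySelector(selector);
    return element ? element.textContent.trim() : null;
}

// Stand-in for a DOM element, used only to demonstrate both outcomes.
const fakePost = {
    querySelector: (selector) => (selector === 'h3' ? { textContent: ' Some title ' } : null),
};

console.log(textOrNull(fakePost, 'h3'));
console.log(textOrNull(fakePost, 'span div'));
```

In the **Console**, the same helper could be called with the real element, for example `textOrNull(post, 'span div')`.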

</details>
@@ -39,7 +39,7 @@ if (response.ok) {
}
```

Calling [`toArray()`](https://cheerio.js.org/docs/api/classes/Cheerio#toarray) converts the Cheerio selection to a standard JavaScript array. We can then loop over that array and process each selected element.
Calling [`toArray()`](https://cheerio.js.org/docs/api/classes/cheerio#toarray) converts the Cheerio selection to a standard JavaScript array. We can then loop over that array and process each selected element.

Cheerio requires us to wrap each element with `$()` again before we can work with it further, and then we call `.text()`. If we run the code, it… well, it definitely prints _something_…

@@ -136,7 +136,7 @@ When translated to a tree of JavaScript objects, the element with class `price`
- a `span` HTML element,
- a textual node representing the actual amount and possibly also white space.

We can use Cheerio's [`.contents()`](https://cheerio.js.org/docs/api/classes/Cheerio#contents) method to access individual nodes. It returns a list of nodes like this:
We can use Cheerio's [`.contents()`](https://cheerio.js.org/docs/api/classes/cheerio#contents) method to access individual nodes. It returns a list of nodes like this:

```text
LoadedCheerio {
@@ -197,7 +197,7 @@ if (response.ok) {
}
```

We're enjoying the fact that Cheerio selections provide utility methods for accessing items, such as [`.first()`](https://cheerio.js.org/docs/api/classes/Cheerio#first) or [`.last()`](https://cheerio.js.org/docs/api/classes/Cheerio#last). If we run the scraper now, it should print prices as only amounts:
We're enjoying the fact that Cheerio selections provide utility methods for accessing items, such as [`.first()`](https://cheerio.js.org/docs/api/classes/cheerio#first) or [`.last()`](https://cheerio.js.org/docs/api/classes/cheerio#last). If we run the scraper now, it should print prices as only amounts:

```text
$ node index.js
@@ -237,7 +237,7 @@ Macao, China

:::tip Need a nudge?

You may want to check out Cheerio's [`.eq()`](https://cheerio.js.org/docs/api/classes/Cheerio#eq).
You may want to check out Cheerio's [`.eq()`](https://cheerio.js.org/docs/api/classes/cheerio#eq).

:::

@@ -290,7 +290,7 @@ Hamilton reveals distress over ‘devastating’ groundhog accident at Canadian
:::tip Need a nudge?

- HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as ISO 8601.
- Cheerio gives you [.attr()](https://cheerio.js.org/docs/api/classes/Cheerio#attr) to access attributes.
- Cheerio gives you [.attr()](https://cheerio.js.org/docs/api/classes/cheerio#attr) to access attributes.
- In JavaScript you can use an ISO 8601 string to create a [`Date`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date) object.
- To get the date, you can call `.toDateString()` on `Date` objects.

@@ -43,31 +43,20 @@ Our crawler's input will look like this:
}
```

The goal at hand is to scrape all of the products from the first page of results for whatever keyword was provided (for our test case, it will be **iPhone**), then to scrape all available offers of each product and push the results to the dataset. For context, the offers for a product look like this:

![Amazon product offers](../../../platform/expert_scraping_with_apify/images/product-offers.jpg)
The goal is to scrape all of the products from the first page of results for whatever keyword was provided (for our test case, it will be **iPhone**), then for each product visit its page and scrape the featured offer. Push the results to the dataset.

In the end, we'd like our final output to look something like this:

```json
[
{
"title": "Apple iPhone 6 a1549 16GB Space Gray Unlocked (Certified Refurbished)",
"asin": "B07P6Y7954",
"itemUrl": "https://www.amazon.com/Apple-iPhone-Unlocked-Certified-Refurbished/dp/B00YD547Q6/ref=sr_1_2?s=wireless&ie=UTF8&qid=1539772626&sr=1-2&keywords=iphone",
"description": "What's in the box: Certified Refurbished iPhone 6 Space Gray 16GB Unlocked , USB Cable/Adapter. Comes in a Generic Box with a 1 Year Limited Warranty.",
"title": "Apple iPhone 16e, 128GB, Black - Unlocked (Renewed)",
"asin": "B0F4RM7Y2L",
"itemUrl": "https://www.amazon.com/Apple-iPhone-128GB-eSIM-Black/dp/B0F4RM7Y2L",
"description": "This product is certified refurbished...",
"keyword": "iphone",
"sellerName": "Blutek Intl",
"offer": "$162.97"
},
{
"title": "Apple iPhone 6 a1549 16GB Space Gray Unlocked (Certified Refurbished)",
"asin": "B07P6Y7954",
"itemUrl": "https://www.amazon.com/Apple-iPhone-Unlocked-Certified-Refurbished/dp/B00YD547Q6/ref=sr_1_2?s=wireless&ie=UTF8&qid=1539772626&sr=1-2&keywords=iphone",
"description": "What's in the box: Certified Refurbished iPhone 6 Space Gray 16GB Unlocked , USB Cable/Adapter. Comes in a Generic Box with a 1 Year Limited Warranty.",
"keyword": "iphone",
"sellerName": "PLATINUM DEALS",
"offer": "$169.98"
"offer": "$329.99"
},
{
"...": "..."
@@ -78,7 +67,7 @@ In the end, we'd like our final output to look something like this:

> The `asin` is the ID of the product, which is data present on the Amazon website.

Each of the items in the dataset will represent a scraped offer and will have the same `title`, `asin`, `itemUrl`, and `description`. The offer-specific fields will be `sellerName` and `offer`.
Each item in the dataset represents one product. The `sellerName` and `offer` fields come from the featured offer shown on the product page, which means you will end up with one result per product.

<!-- After the scrape has completed, we'll programmatically call a [public Actor which sends emails](https://apify.com/apify/send-mail) to send ourselves an email with a publicly viewable link to the Actor's final dataset. -->
