diff --git a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md index c5c7d1d4a8..4300825980 100644 --- a/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md +++ b/sources/academy/platform/expert_scraping_with_apify/actors_webhooks.md @@ -20,7 +20,7 @@ Thus far, you've run Actors on the platform and written an Actor of your own, wh ## Advanced Actor overview {#advanced-actors} -In this course, we'll be working out of the Amazon scraper project from the **Web scraping basics for JavaScript devs** course. If you haven't already built that project, you can do it in [three short lessons](../../webscraping/scraping_basics_legacy/challenge/index.md). We've made a few small modifications to the project with the Apify SDK, but 99% of the code is still the same. +In this course, we'll be working out of the Amazon scraper built in the old [Web scraping basics for JavaScript devs](../../webscraping/scraping_basics_legacy/challenge/index.md) course (not the [current scraping basics course](../../webscraping/scraping_basics_javascript/index.md)). If you haven't gone through it yet, we recommend doing so - it covers the fundamentals this project is built on. If you'd rather skip straight to this course, you can use this working implementation instead: [academy-amazon-scraper](https://github.com/apify-projects/academy-amazon-scraper). Take another look at the files within your Amazon scraper project. You'll notice that there is a **Dockerfile**. Every single Actor has a Dockerfile (the Actor's **Image**) which tells Docker how to spin up a container on the Apify platform which can successfully run the Actor's code. "Apify Actors" is a serverless platform that runs multiple Docker containers. For a deeper understanding of Actor Dockerfiles, refer to the [Apify Actor Dockerfile docs](/sdk/js/docs/guides/docker-images#example-dockerfile). 
@@ -46,9 +46,11 @@ Prior to moving forward, please read over these resources: ## Our task {#our-task} -In this task, we'll be building on top of what we already created in the [Web scraping basics for JavaScript devs](../../webscraping/scraping_basics_legacy/challenge/index.md) course's final challenge, so keep those files safe! +In this task, we'll be building on top of the [Amazon scraper project mentioned above](#advanced-actors). -Once our Amazon Actor has completed its run, we will, rather than sending an email to ourselves, call an Actor through a webhook. The Actor called will be a new Actor that we will create together, which will take the dataset ID as input, then subsequently filter through all of the results and return only the cheapest one for each product. All of the results of the Actor will be pushed to its default dataset. +Once our Amazon Actor has completed its run, we might want to send an email to ourselves, but instead of that let's call another Actor through a webhook. The Actor called will be a new Actor that you will create, which will take the dataset ID as input, then filter through all of the results and return only the cheapest result for each unique ASIN. All of the results of the Actor will be pushed to its default dataset. + +> Note: the [starter repo](https://github.com/apify-projects/academy-amazon-scraper) produces one result per product, so in practice the filtering Actor will pass every item through unchanged. That is fine. The goal here is to learn how to pass data between Actors using webhooks, not to do complex filtering. 
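As a rough illustration of the filtering step described above, here is a minimal sketch of keeping only the cheapest result per ASIN. The names `parseOffer` and `filterCheapest` are hypothetical, and the sketch assumes each item carries an `asin` and an `offer` string formatted like `$329.99`. Loading the dataset by its ID and pushing the results are left out.

```js
// Hypothetical helper: turns an offer string such as "$1,299.99" into a number.
// Assumes the only non-numeric characters are a currency sign and thousands commas.
const parseOffer = (offer) => Number.parseFloat(String(offer).replace(/[^0-9.]/g, ''));

// Hypothetical helper: keeps only the cheapest item for each unique ASIN.
const filterCheapest = (items) => {
    const cheapestByAsin = new Map();

    for (const item of items) {
        const price = parseOffer(item.offer);
        // Skip items whose price could not be parsed
        if (Number.isNaN(price)) continue;

        const current = cheapestByAsin.get(item.asin);
        if (!current || price < parseOffer(current.offer)) {
            cheapestByAsin.set(item.asin, item);
        }
    }

    return [...cheapestByAsin.values()];
};
```

With one result per product, as the starter repo produces, every item with a parsable price passes through unchanged; the comparison only matters once the same ASIN appears more than once.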
[**Solution**](./solutions/integrating_webhooks.md) diff --git a/sources/academy/platform/expert_scraping_with_apify/images/build-settings.webp b/sources/academy/platform/expert_scraping_with_apify/images/build-settings.webp new file mode 100644 index 0000000000..fc6037dc60 Binary files /dev/null and b/sources/academy/platform/expert_scraping_with_apify/images/build-settings.webp differ diff --git a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md index e9cf36d342..51dbea531a 100644 --- a/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md +++ b/sources/academy/platform/expert_scraping_with_apify/managing_source_code.md @@ -40,19 +40,15 @@ First, let's create a repository. This can be done [in a number of ways](https:/ Then, we'll run the commands it tells us in our terminal (while within the **demo-actor** directory) to initialize the repository locally, and then push all of the files to the remote one. -After you've created your repo, navigate on the Apify platform to the Actor we called **demo-actor**. In the **Source** tab, click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**, which is what we've been using so far. +After you've created your repo, navigate on the Apify platform to your Amazon scraping Actor (referred to throughout this course as **demo-actor**). In the **Source** tab, click the dropdown menu under **Source code** and select **Git repository**. By default, this is set to **Web IDE**, which is what we've been using so far. ![Select source code location](./images/select-source-location.png) Then, go ahead and paste the link to your repository into the **Git URL** text field and click **Save**. -The final step is to click on **API** in the top right corner of your Actor's page: +The final step is to check your **Build settings**. 
Under the Git URL field, you'll see two options: **Automatic builds** and **Manual builds**. Make sure **Automatic builds** is selected - this tells Apify to rebuild your Actor whenever you push to GitHub, with no extra configuration needed. -![API button](./images/api-button.jpg) - -And scroll through all of the links until you find the **Build Actor** API endpoint. Copy this endpoint's URL, then head back over to your GitHub repository and navigate to **Settings > Webhooks > Add webhook**. The final thing to do is to paste the URL and save the webhook. - -![Adding a webhook to your GitHub repo](../../../platform/actors/development/deployment/images/ci-github-integration.png) +![Build settings with Automatic builds selected](./images/build-settings.webp) And you're done! 🎉 @@ -60,7 +56,7 @@ And you're done! 🎉 This was a bit of overhead, but the good news is that you don't ever have to configure this stuff again for this Actor. Now, every time the content of your **main**/**master** branch changes, the Actor on the Apify platform will rebuild based on the newest code. -Think of it as combining two steps into one! Normally, you'd have to do a `git push` from your terminal in order to get the newest code onto GitHub, then run `apify push` to push it to the platform. +Think of it as combining two steps into one. Normally, you'd have to do a `git push` from your terminal to get the newest code onto GitHub, then run `apify push` to push it to the platform. It's also important to know that GitHub/Gitlab repository integration is standard practice. As projects grow and the number of contributors and maintainers increases, it only makes sense to have a GitHub repository integrated with the project's Actor. For the remainder of this course, all Actors created will be integrated with a GitHub repository. 
diff --git a/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md b/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md index 6bc13433f1..e62ffd69bd 100644 --- a/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md +++ b/sources/academy/platform/expert_scraping_with_apify/saving_useful_stats.md @@ -41,7 +41,7 @@ In our Amazon Actor, each dataset result must now have the following extra keys: } ``` -Also, an object including these values should be persisted during the run in th Key-Value store and logged to the console every 10 seconds: +Also, an object including these values should be persisted during the run in the Key-Value store and logged to the console every 10 seconds: ```json { diff --git a/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md index e774b7e1d7..da431dfc19 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md +++ b/sources/academy/webscraping/scraping_basics_javascript/03_devtools_extracting_data.md @@ -81,7 +81,7 @@ In the next lesson, we'll start with our Node.js project. First we'll be figurin ### Extract the price of IKEA's most expensive artificial plant -At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use the [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number. 
+At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use the [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number (you may need [`replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) to handle spaces).
Solution @@ -93,8 +93,8 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a 1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value. 1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price. 1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`. - 1. Convert the price text into a number by executing `parseInt(price.textContent)`. - 1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek). + 1. Convert the price text into a number by executing `parseInt(price.textContent.replace(' ', ''))`. The price text contains spaces as thousand separators, so `.replace(' ', '')` strips them before parsing - a technique you'll explore further in the [Extracting data](./07_extracting_data.md#removing-dollar-sign-and-commas) lesson. + 1. At the time of writing, this returns `1299`, meaning [1 299 SEK](https://www.google.com/search?q=1299%20sek).
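One caveat worth keeping in mind: with a string pattern, `replace()` swaps only the first match, so `.replace(' ', '')` is enough for `1 299` but not for a price with two separators. A minimal sketch of a more tolerant variant follows. The helper name `parseSpacedInt` is hypothetical, and it assumes the separators are whitespace characters (regular or non-breaking).

```js
// Hypothetical helper, not part of the lesson: removes every whitespace
// character (the `g` flag makes the replacement global) and parses the
// remaining digits as a base-10 integer.
const parseSpacedInt = (text) => Number.parseInt(text.replace(/\s/g, ''), 10);
```

In the **Console**, this could be called as `parseSpacedInt(price.textContent)`.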
@@ -119,7 +119,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto ### Extract details about the first post on Guardian's F1 news -On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo. +On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph (if it has one), and URL of the associated photo. ![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png) @@ -132,7 +132,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), 1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead. 1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post. 1. Extract the post's title by executing `post.querySelector('h3').textContent`. - 1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`. + 1. Extract the lead paragraph (if it has one) by executing `post.querySelector('span div').textContent`. 1. Extract the photo URL by executing `post.querySelector('img').src`. 
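Since a post may have no lead paragraph, `post.querySelector('span div')` can return `null`, and reading `.textContent` on `null` would throw. A minimal sketch of a null-safe variant, assuming the same selector as above; the helper name `getLeadParagraph` is hypothetical:

```js
// Hypothetical helper: returns the trimmed lead paragraph text,
// or null when the post has no matching element.
// `post` is any element-like object that exposes querySelector().
const getLeadParagraph = (post) => post.querySelector('span div')?.textContent.trim() ?? null;
```

In the **Console**, `getLeadParagraph(post)` then gives either the text or `null` instead of an error.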
diff --git a/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md b/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md index fedd418abd..b4d1f8124f 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md +++ b/sources/academy/webscraping/scraping_basics_javascript/06_locating_elements.md @@ -39,7 +39,7 @@ if (response.ok) { } ``` -Calling [`toArray()`](https://cheerio.js.org/docs/api/classes/Cheerio#toarray) converts the Cheerio selection to a standard JavaScript array. We can then loop over that array and process each selected element. +Calling [`toArray()`](https://cheerio.js.org/docs/api/classes/cheerio#toarray) converts the Cheerio selection to a standard JavaScript array. We can then loop over that array and process each selected element. Cheerio requires us to wrap each element with `$()` again before we can work with it further, and then we call `.text()`. If we run the code, it… well, it definitely prints _something_… @@ -136,7 +136,7 @@ When translated to a tree of JavaScript objects, the element with class `price` - a `span` HTML element, - a textual node representing the actual amount and possibly also white space. -We can use Cheerio's [`.contents()`](https://cheerio.js.org/docs/api/classes/Cheerio#contents) method to access individual nodes. It returns a list of nodes like this: +We can use Cheerio's [`.contents()`](https://cheerio.js.org/docs/api/classes/cheerio#contents) method to access individual nodes. It returns a list of nodes like this: ```text LoadedCheerio { @@ -197,7 +197,7 @@ if (response.ok) { } ``` -We're enjoying the fact that Cheerio selections provide utility methods for accessing items, such as [`.first()`](https://cheerio.js.org/docs/api/classes/Cheerio#first) or [`.last()`](https://cheerio.js.org/docs/api/classes/Cheerio#last). 
If we run the scraper now, it should print prices as only amounts: +We're enjoying the fact that Cheerio selections provide utility methods for accessing items, such as [`.first()`](https://cheerio.js.org/docs/api/classes/cheerio#first) or [`.last()`](https://cheerio.js.org/docs/api/classes/cheerio#last). If we run the scraper now, it should print prices as only amounts: ```text $ node index.js @@ -237,7 +237,7 @@ Macao, China :::tip Need a nudge? -You may want to check out Cheerio's [`.eq()`](https://cheerio.js.org/docs/api/classes/Cheerio#eq). +You may want to check out Cheerio's [`.eq()`](https://cheerio.js.org/docs/api/classes/cheerio#eq). ::: diff --git a/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md b/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md index 74b440b4bc..d728da6954 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md +++ b/sources/academy/webscraping/scraping_basics_javascript/07_extracting_data.md @@ -290,7 +290,7 @@ Hamilton reveals distress over ‘devastating’ groundhog accident at Canadian :::tip Need a nudge? - HTML's `time` element can have an attribute `datetime`, which [contains data in a machine-readable format](https://developer.mozilla.org/en-US/docs/Web/HTML/Element/time), such as the ISO 8601. -- Cheerio gives you [.attr()](https://cheerio.js.org/docs/api/classes/Cheerio#attr) to access attributes. +- Cheerio gives you [.attr()](https://cheerio.js.org/docs/api/classes/cheerio#attr) to access attributes. - In JavaScript you can use an ISO 8601 string to create a [`Date`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date) object. - To get the date, you can call `.toDateString()` on `Date` objects. 
diff --git a/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md b/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md index 8410611660..da3e594d2b 100644 --- a/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md +++ b/sources/academy/webscraping/scraping_basics_legacy/challenge/index.md @@ -43,31 +43,20 @@ Our crawler's input will look like this: } ``` -The goal at hand is to scrape all of the products from the first page of results for whatever keyword was provided (for our test case, it will be **iPhone**), then to scrape all available offers of each product and push the results to the dataset. For context, the offers for a product look like this: - -![Amazon product offers](../../../platform/expert_scraping_with_apify/images/product-offers.jpg) +The goal is to scrape all of the products from the first page of results for whatever keyword was provided (for our test case, it will be **iPhone**), then for each product visit its page and scrape the featured offer. Push the results to the dataset. In the end, we'd like our final output to look something like this: ```json [ { - "title": "Apple iPhone 6 a1549 16GB Space Gray Unlocked (Certified Refurbished)", - "asin": "B07P6Y7954", - "itemUrl": "https://www.amazon.com/Apple-iPhone-Unlocked-Certified-Refurbished/dp/B00YD547Q6/ref=sr_1_2?s=wireless&ie=UTF8&qid=1539772626&sr=1-2&keywords=iphone", - "description": "What's in the box: Certified Refurbished iPhone 6 Space Gray 16GB Unlocked , USB Cable/Adapter. 
Comes in a Generic Box with a 1 Year Limited Warranty.", + "title": "Apple iPhone 16e, 128GB, Black - Unlocked (Renewed)", + "asin": "B0F4RM7Y2L", + "itemUrl": "https://www.amazon.com/Apple-iPhone-128GB-eSIM-Black/dp/B0F4RM7Y2L", + "description": "This product is certified refurbished...", "keyword": "iphone", "sellerName": "Blutek Intl", - "offer": "$162.97" - }, - { - "title": "Apple iPhone 6 a1549 16GB Space Gray Unlocked (Certified Refurbished)", - "asin": "B07P6Y7954", - "itemUrl": "https://www.amazon.com/Apple-iPhone-Unlocked-Certified-Refurbished/dp/B00YD547Q6/ref=sr_1_2?s=wireless&ie=UTF8&qid=1539772626&sr=1-2&keywords=iphone", - "description": "What's in the box: Certified Refurbished iPhone 6 Space Gray 16GB Unlocked , USB Cable/Adapter. Comes in a Generic Box with a 1 Year Limited Warranty.", - "keyword": "iphone", - "sellerName": "PLATINUM DEALS", - "offer": "$169.98" + "offer": "$329.99" }, { "...": "..." @@ -78,7 +67,7 @@ In the end, we'd like our final output to look something like this: > The `asin` is the ID of the product, which is data present on the Amazon website. -Each of the items in the dataset will represent a scraped offer and will have the same `title`, `asin`, `itemUrl`, and `description`. The offer-specific fields will be `sellerName` and `offer`. +Each item in the dataset represents one product. The `sellerName` and `offer` fields come from the featured offer shown on the product page, which means you will end up with one result per product. diff --git a/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md b/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md index 94545b3e7d..065bf8d291 100644 --- a/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md +++ b/sources/academy/webscraping/scraping_basics_legacy/challenge/scraping_amazon.md @@ -31,29 +31,17 @@ router.addHandler(labels.PRODUCT, async ({ $ }) => { ``` -Great! But wait, where do we go from here? 
We need to go to the offers page next and scrape each offer, but how can we do that? Let's take a small break from writing the scraper and open up Proxyman to analyze requests which we might be difficult to find in the network tab, then we'll click the button on the product page that loads up all of the product offers: +Great! But where do we go from here? We need to visit the product page and scrape the offer from there. -![View offers button](./images/view-offers-button.jpg) +:::note Why only one offer per product? -After clicking this button and checking back in Proxyman, we discovered this link: +Amazon product pages list a single featured offer in their static HTML. The full list of sellers is loaded separately by JavaScript after the page loads, which means CheerioCrawler (which only fetches the raw HTML, without running JavaScript) cannot see it. -> You can find the request below in the network tab just fine, but with Proxyman, it is much easier and faster due to the extended filtering options. +If you need all seller offers, you would have to use [PlaywrightCrawler](../../puppeteer_playwright/index.md), which runs a real browser and can wait for that content to load. For this course, scraping the featured offer is enough to cover the key concepts. -```text -https://www.amazon.com/gp/aod/ajax/ref=auto_load_aod?asin=B07ZPKBL9V&pc=dp -``` - -The `asin` [query parameter](https://www.branch.io/glossary/query-parameters/) matches up with our product's ASIN, which means we can use this for any product of which we have the ASIN. - -Here's what this page looks like: - -![View offers page](./images/offers-page.jpg) - -Wow, that's ugly. But for our scenario, this is really great. When we click the **View offers** button, we usually have to wait for the offers to load and render, which would mean we could have to switch our entire crawler to a **PuppeteerCrawler** or **PlaywrightCrawler**. 
The data on this page we've just found appears to be loaded statically, which means we can still use CheerioCrawler and keep the scraper as efficient as possible 😎 - -> It's totally possible to scrape the same data as this crawler using [Puppeteer or Playwright](../../puppeteer_playwright/index.md); however, with this offers link found in Postman, we can follow the same workflow much more quickly with static HTTP requests using CheerioCrawler. +::: -First, we'll create a request for each product's offers page: +We'll add a request to the product page for each product: ```js // routes.js @@ -63,23 +51,22 @@ First, we'll create a request for each product's offers page: router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => { const { data } = request.userData; - const element = $('div#productDescription'); + const description = $('div#productDescription').text().trim(); - // Add to the request queue await crawler.addRequests([{ - url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`, + url: `${BASE_URL}/dp/${data.asin}?th=1&psc=1`, label: labels.OFFERS, userData: { data: { ...data, - description: element.text().trim(), + description, }, }, }]); }); ``` -Finally, we can handle the offers in a separate handler: +Then we handle it in the OFFERS handler: ```js // routes.js @@ -87,16 +74,14 @@ Finally, we can handle the offers in a separate handler: router.addHandler(labels.OFFERS, async ({ $, request }) => { const { data } = request.userData; - for (const offer of $('#aod-offer')) { - const element = $(offer); - - await Dataset.pushData({ - ...data, - sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(), - offer: element.find('.a-price .a-offscreen').text().trim(), - }); + const price = $('.a-price .a-offscreen').first().text().trim(); + const sellerName = $('#sellerProfileTriggerId, #merchant-info a').first().text().trim(); - } + await Dataset.pushData({ + ...data, + sellerName, + offer: price, + }); }); ``` @@ 
-153,16 +138,16 @@ router.addHandler(labels.START, async ({ $, crawler, request }) => { router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => { const { data } = request.userData; - const element = $('div#productDescription'); + const description = $('div#productDescription').text().trim(); await crawler.addRequests([ { - url: `${BASE_URL}/gp/aod/ajax/ref=auto_load_aod?asin=${data.asin}&pc=dp`, + url: `${BASE_URL}/dp/${data.asin}?th=1&psc=1`, label: labels.OFFERS, userData: { data: { ...data, - description: element.text().trim(), + description, }, }, }, @@ -172,15 +157,14 @@ router.addHandler(labels.PRODUCT, async ({ $, crawler, request }) => { router.addHandler(labels.OFFERS, async ({ $, request }) => { const { data } = request.userData; - for (const offer of $('#aod-offer')) { - const element = $(offer); + const price = $('.a-price .a-offscreen').first().text().trim(); + const sellerName = $('#sellerProfileTriggerId, #merchant-info a').first().text().trim(); - await Dataset.pushData({ - ...data, - sellerName: element.find('div[id*="soldBy"] a[aria-label]').text().trim(), - offer: element.find('.a-price .a-offscreen').text().trim(), - }); - } + await Dataset.pushData({ + ...data, + sellerName, + offer: price, + }); }); ``` diff --git a/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md b/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md index d185e8b01a..3b9c03aad0 100644 --- a/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md +++ b/sources/academy/webscraping/scraping_basics_legacy/crawling/finding_links.md @@ -57,7 +57,7 @@ We'll start from a boilerplate that's very similar to the scraper we built in [B {Example} -Aside from importing libraries and downloading HTML, we load the HTML into Cheerio and then use it to retrieve all the `<a>` elements. 
After that, we iterate over the collected links and print their `href` attributes, which we access using the [`.attr()`](https://cheerio.js.org/docs/api/classes/Cheerio#attr) method. +Aside from importing libraries and downloading HTML, we load the HTML into Cheerio and then use it to retrieve all the `<a>` elements. After that, we iterate over the collected links and print their `href` attributes, which we access using the [`.attr()`](https://cheerio.js.org/docs/api/classes/cheerio#attr) method. When you run the above code, you'll see quite a lot of links in the terminal. Some of them may look wrong, because they don't start with the regular `https://` protocol. We'll learn what to do with them in the following lessons. diff --git a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md index f864362f8a..ad21d99501 100644 --- a/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md +++ b/sources/academy/webscraping/scraping_basics_python/03_devtools_extracting_data.md @@ -78,7 +78,7 @@ In the next lesson, we'll start with our Python project. First we'll be figuring ### Extract the price of IKEA's most expensive artificial plant -At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use JavaScript's [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number. 
+At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/artificial-plants-flowers-20492/), use CSS selectors and HTML elements manipulation in the **Console** to extract the price of the most expensive artificial plant (sold in Sweden, as you'll be browsing their Swedish offer). Before opening DevTools, use your judgment to adjust the page to make the task as straightforward as possible. Finally, use JavaScript's [`parseInt()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt) function to convert the price text into a number (you may need [`replace()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace) to handle spaces).
Solution @@ -90,8 +90,8 @@ At IKEA's [Artificial plants & flowers listing](https://www.ikea.com/se/en/cat/a 1. Notice that the price is structured into two elements, with the integer separated from the currency, under a class named `plp-price__integer`. This structure is convenient for extracting the value. 1. In the **Console**, execute `document.querySelector('.plp-price__integer')`. This returns the element representing the first price in the listing. Since `document.querySelector()` returns the first matching element, it directly selects the most expensive plant's price. 1. Save the element in a variable by executing `price = document.querySelector('.plp-price__integer')`. - 1. Convert the price text into a number by executing `parseInt(price.textContent)`. - 1. At the time of writing, this returns `699`, meaning [699 SEK](https://www.google.com/search?q=699%20sek). + 1. Convert the price text into a number by executing `parseInt(price.textContent.replace(' ', ''))`. The price text contains spaces as thousand separators, so `.replace(' ', '')` strips them before parsing - a technique you'll explore further in the [Extracting data](./07_extracting_data.md#removing-dollar-sign-and-commas) lesson. + 1. At the time of writing, this returns `1299`, meaning [1 299 SEK](https://www.google.com/search?q=1299%20sek).
@@ -116,7 +116,7 @@ On Fandom's [Movies page](https://www.fandom.com/topics/movies), use CSS selecto ### Extract details about the first post on Guardian's F1 news -On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph, and URL of the associated photo. +On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), use CSS selectors and HTML manipulation in the **Console** to extract details about the first post. Specifically, extract its title, lead paragraph (if it has one), and URL of the associated photo. ![F1 news page](../scraping_basics/images/devtools-exercise-guardian2.png) @@ -129,7 +129,7 @@ On the Guardian's [F1 news page](https://www.theguardian.com/sport/formulaone), 1. Notice that the markup does not provide clear, reusable class names for this task. The structure uses generic tag names and randomized classes, requiring you to rely on the element hierarchy and order instead. 1. In the **Console**, execute `post = document.querySelector('#maincontent ul li')`. This returns the element representing the first post. 1. Extract the post's title by executing `post.querySelector('h3').textContent`. - 1. Extract the lead paragraph by executing `post.querySelector('span div').textContent`. + 1. Extract the lead paragraph (if it has one) by executing `post.querySelector('span div').textContent`. 1. Extract the photo URL by executing `post.querySelector('img').src`. diff --git a/sources/platform/integrations/data-storage/drive.md b/sources/platform/integrations/data-storage/drive.md index 126103484a..700dd9a0ee 100644 --- a/sources/platform/integrations/data-storage/drive.md +++ b/sources/platform/integrations/data-storage/drive.md @@ -1,6 +1,6 @@ --- title: Google Drive integration -description: Automatically save Apify Actor run results to Google Drive. 
Set up the integration on your task to upload files to your Drive after each successful run. +description: Automatically save Apify Actor run results to Google Drive. Set up the integration on an Actor or saved task to upload files to your Drive after each successful run. sidebar_label: Google Drive sidebar_position: 3 slug: /integrations/drive @@ -8,7 +8,7 @@ slug: /integrations/drive import ThirdPartyDisclaimer from '@site/sources/_partials/_third-party-integration.mdx'; -Save Apify Actor run results directly to Google Drive. Set up the integration on your task to automatically upload files after each successful run. +Save Apify Actor run results directly to Google Drive. Set up the integration on an Actor or saved task to automatically upload files after each successful run. @@ -18,11 +18,10 @@ To use the Apify integration for Google Drive, you will need: - An [Apify account](https://console.apify.com/). - A Google account -- A saved Actor Task ## Set up Google Drive integration -1. Head over to **Integrations** tab in your saved task and click on the **Upload results to GDrive** integration. +1. Head over to the **Integrations** tab of your Actor or saved task and click on the **Upload results to GDrive** integration. ![Google Drive integration](../images/google/google-integrations-add.png) diff --git a/sources/platform/integrations/workflows-and-notifications/gmail.md b/sources/platform/integrations/workflows-and-notifications/gmail.md index cb2fd4814a..357643504d 100644 --- a/sources/platform/integrations/workflows-and-notifications/gmail.md +++ b/sources/platform/integrations/workflows-and-notifications/gmail.md @@ -8,7 +8,7 @@ slug: /integrations/gmail import ThirdPartyDisclaimer from '@site/sources/_partials/_third-party-integration.mdx'; -Send automated email notifications with Actor run results to any Gmail address. Set up the integration on your task to receive emails after each successful run. 
+Send automated email notifications with Actor run results to any Gmail address. Set up the integration on an Actor or saved task to receive emails after each successful run. @@ -18,11 +18,10 @@ To use the Apify integration for Gmail, you will need: - An [Apify account](https://console.apify.com/). - A Google account -- A saved Actor Task ## Set up Gmail integration -1. Head over to **Integrations** tab in your task and click on the **Send results email via Gmail** integration. +1. Head over to the **Integrations** tab of your Actor or saved task and click on the **Send results email via Gmail** integration. ![Google Drive integration](../images/google/google-integrations-add.png)