feat(docs): Multiple datasets #2228
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,143 @@ | ||
| --- | ||
| title: Multiple datasets | ||
| description: Learn how to use multiple datasets within your Actors to organize and store different types of data separately. | ||
| sidebar_position: 2 | ||
| slug: /actors/development/actor-definition/dataset-schema/multiple-datasets | ||
| --- | ||
|
|
||
| import Tabs from '@theme/Tabs'; | ||
| import TabItem from '@theme/TabItem'; | ||
|
|
||
| Actors that scrape different data types can store each type in its own dataset with separate validation rules. For example, an e-commerce scraper might store products in one dataset and categories in another. | ||
|
|
||
| Each dataset: | ||
|
|
||
| - Is created when the run starts | ||
| - Follows the run's data retention policy | ||
| - Can have its own validation schema | ||
|
|
||
| ## Define multiple datasets | ||
|
|
||
| Define datasets in `.actor/actor.json` using the `storages.datasets` object: | ||
|
|
||
| ```json title=".actor/actor.json" | ||
| { | ||
| "actorSpecification": 1, | ||
| "name": "my-e-commerce-scraper", | ||
| "title": "E-Commerce Scraper", | ||
| "version": "1.0.0", | ||
| "storages": { | ||
| "datasets": { | ||
| "default": "./products_dataset_schema.json", | ||
| "categories": "./categories_dataset_schema.json" | ||
| } | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| Provide schemas for individual datasets as file references or inline. Schemas follow the same structure as single-dataset schemas. | ||
|
|
||
| The keys of the `datasets` object are aliases that refer to specific datasets. The previous example defines two datasets aliased as `default` and `categories`. | ||
|
|
||
| :::info Alias versus named dataset | ||
|
|
||
| Aliases and names are different. Named datasets have specific behavior on the Apify platform (the automatic data retention policy doesn't apply to them). Aliased datasets follow the data retention of their run. Aliases only have meaning within a specific run. | ||
|
|
||
| ::: | ||
|
|
||
| Requirements: | ||
|
|
||
| - The `datasets` object must contain the `default` alias | ||
| - The `datasets` and `dataset` objects are mutually exclusive (use one or the other) | ||
|
Contributor: It'd be great to link the actor.json reference from here so that the reader can understand the difference between
Member (Author): Do you mean the reference from docs or the publicly available schema files?
Contributor: The reference in the docs
Member (Author): Adding |
||
|
|
||
| See the full [Actor schema reference](../actor_json.md#reference). | ||
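The two requirements above can be checked before deploying. The following is a minimal sketch, not part of the Apify tooling: `check_datasets_config` is a hypothetical helper, and it assumes both the `dataset` and `datasets` keys sit under `storages` in the parsed `actor.json`.

```python
def check_datasets_config(actor_config: dict) -> list[str]:
    """Return a list of problems found in the `storages` section (empty if none)."""
    # Assumption: `dataset` and `datasets` both live under `storages`.
    storages = actor_config.get("storages", {})
    problems = []

    has_single = "dataset" in storages
    has_multiple = "datasets" in storages

    # Requirement: use one or the other, never both.
    if has_single and has_multiple:
        problems.append("`dataset` and `datasets` are mutually exclusive")

    # Requirement: the `datasets` object must define the `default` alias.
    if has_multiple and "default" not in storages["datasets"]:
        problems.append("`datasets` must contain the `default` alias")

    return problems
```

Calling the helper on the result of `json.load` over `.actor/actor.json` returns an empty list when both requirements hold.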
|
|
||
| ## Access datasets in Actor code | ||
|
|
||
| Access aliased datasets in one of two ways: using the Apify SDK or reading the `ACTOR_STORAGES_JSON` environment variable directly. | ||
|
|
||
| ### Apify SDK | ||
|
|
||
| <Tabs groupId="main"> | ||
| <TabItem value="JavaScript" label="JavaScript"> | ||
|
|
||
| In the JavaScript/TypeScript SDK `>=3.7.0`, use `openDataset` with the `alias` option: | ||
|
|
||
| ```js | ||
| const categoriesDataset = await Actor.openDataset({alias: 'categories'}); | ||
| ``` | ||
|
|
||
| :::note Running outside the Apify platform | ||
|
|
||
| When the JavaScript SDK runs outside the Apify platform, aliases fall back to names (using an alias is the same as using a named dataset). The dataset is purged on the first access when accessed using the `alias` option. | ||
|
|
||
| ::: | ||
|
|
||
| </TabItem> | ||
| <TabItem value="Python" label="Python"> | ||
|
|
||
| In the Python SDK `>=3.3.0`, use `open_dataset` with the `alias` parameter: | ||
|
|
||
| ```py | ||
| categories_dataset = await Actor.open_dataset(alias='categories') | ||
| ``` | ||
|
|
||
| :::note Running outside the Apify platform | ||
|
|
||
| When the Python SDK runs outside the Apify platform, it uses the [Crawlee for Python aliasing mechanism](https://crawlee.dev/python/docs/guides/storages#named-and-unnamed-storages). Aliased datasets are created as unnamed storages and purged when the Actor starts. | ||
|
|
||
| ::: | ||
|
|
||
| </TabItem> | ||
| </Tabs> | ||
|
|
||
|
|
||
| ### Environment variable | ||
|
|
||
| `ACTOR_STORAGES_JSON` contains JSON-encoded unique identifiers of all storages associated with the current Actor run. Use this approach when working without the SDK: | ||
|
|
||
| ```sh | ||
| echo "$ACTOR_STORAGES_JSON" \| jq '.datasets.categories' | ||
| # Outputs the ID of the categories dataset, e.g. `"3ZojQDdFTsyE7Moy4"` | ||
| ``` | ||
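The same lookup can be done from Actor code without `jq`. This is a minimal sketch: `dataset_id_for_alias` is a hypothetical helper, and the JSON shape (a top-level `datasets` object mapping aliases to IDs) is inferred from the `jq` path above rather than taken from a formal specification.

```python
import json
import os


def dataset_id_for_alias(alias, environ=os.environ):
    """Return the dataset ID stored under `alias`, or None if it is not available."""
    raw = environ.get("ACTOR_STORAGES_JSON")
    if raw is None:
        # The variable is not set, e.g. when running outside the platform.
        return None
    storages = json.loads(raw)
    # Assumed shape: {"datasets": {"<alias>": "<dataset id>", ...}, ...}
    return storages.get("datasets", {}).get(alias)
```

The `environ` parameter defaults to the process environment and exists only so the helper can be exercised with a plain dictionary in tests.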
|
|
||
|
|
||
| ## Configure the output schema | ||
|
|
||
| ### Storage tab | ||
|
|
||
| The **Storage** tab in the Actor run view displays all datasets defined by the Actor and used by the run (up to 10). | ||
|
|
||
| The Storage tab shows data but doesn't surface it clearly to end users. To present datasets more clearly, define an [output schema](../../actor_definition/output_schema/index.md). | ||
|
|
||
| ### Output schema | ||
|
|
||
| Actors with output schemas can reference datasets through variables using aliases: | ||
|
|
||
| ```json | ||
| { | ||
| "actorOutputSchemaVersion": 1, | ||
| "title": "Output schema", | ||
| "properties": { | ||
| "products": { | ||
| "type": "string", | ||
| "title": "Products", | ||
| "template": "{{storages.datasets.default.apiUrl}}/items" | ||
| }, | ||
| "categories": { | ||
| "type": "string", | ||
| "title": "Categories", | ||
| "template": "{{storages.datasets.categories.apiUrl}}/items" | ||
| } | ||
|
Comment on lines +123 to +132
Contributor: So... what does this (mainly the template field) actually do?
Member (Author): That should be clear from the output schema docs. Basically, this will show a dropdown (or tabs) in the Actor run Output tab, with "Products" and "Categories", and if the datasets also have some views defined, the users can switch between those too.
Contributor: Bold of you to assume that the reader saw and understood that 🙂 Maybe you could add a link to the "How templates work" section?
Member (Author): :D Adding |
||
| } | ||
| } | ||
| ``` | ||
|
|
||
| [Read more](../output_schema/index.md#how-templates-work) about how templates work. | ||
|
|
||
| ## Billing for non-default datasets | ||
|
|
||
| When an Actor uses multiple datasets, only items pushed to the `default` dataset trigger the built-in `apify-default-dataset-item` event. Items in other datasets are not charged automatically. | ||
|
|
||
| To charge for items in other datasets, implement custom billing in your Actor code. Refer to the [billing documentation](../../../publishing/monetize/pay_per_event.mdx) for implementation details. | ||
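One way to structure custom billing is to pair each batch of pushes with a single charge call. The sketch below is an illustration under assumptions, not the platform's prescribed method: `push_and_charge` is a hypothetical helper, the event name `category-item` is made up, and the `push_data` and `charge` callables stand in for whatever the SDK provides (for example, a dataset's push method and a thin wrapper around the SDK's pay-per-event charging call, whose exact signature should be checked in the billing documentation).

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def push_and_charge(
    push_data: Callable[[dict[str, Any]], Awaitable[None]],
    charge: Callable[[str, int], Awaitable[Any]],
    items: Iterable[dict[str, Any]],
    event_name: str,
) -> int:
    """Push items to a non-default dataset, then charge one custom event per item."""
    count = 0
    for item in items:
        await push_data(item)
        count += 1
    # Charge once for the whole batch; skip the call when nothing was pushed.
    if count:
        await charge(event_name, count)
    return count


if __name__ == "__main__":
    # Stand-ins for the real dataset and charging call, for local experimentation.
    async def fake_push(item: dict[str, Any]) -> None:
        print("pushed", item)

    async def fake_charge(name: str, count: int) -> None:
        print("charged", name, count)

    asyncio.run(push_and_charge(fake_push, fake_charge, [{"name": "Shoes"}], "category-item"))
```

Charging after the pushes succeed avoids billing for items that were never stored; whether to charge per item or per batch is a design choice that depends on the Actor's pricing model.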