Merged
38 commits
93a1efe
feat(docs): Multiple datasets
valekjo Feb 5, 2026
29d954f
Added todos
valekjo Feb 5, 2026
9bd61b8
Update variable name
valekjo Feb 9, 2026
dbf37ac
Address changes
valekjo Feb 12, 2026
da0b5f0
Comments
valekjo Feb 24, 2026
97effac
Mention UI
valekjo Mar 3, 2026
c0dd0c4
Progress
valekjo Mar 11, 2026
de69ce7
Progress
valekjo Mar 11, 2026
6f690e7
Progress
valekjo Mar 12, 2026
4c48fdf
Fix lint error
valekjo Mar 12, 2026
9af53b2
Merge branch 'master' into feature/multiple-datasets
valekjo Mar 12, 2026
0a79a25
missing .
valekjo Mar 12, 2026
1f7afe0
fix
valekjo Mar 12, 2026
1b23702
Address CR
valekjo Mar 12, 2026
fdbcb23
Update multiple datasets documentation
jgagne Mar 13, 2026
be76aa7
Fix formatting of info and warning blocks in markdown
jgagne Mar 13, 2026
3bb3fa9
MD blocks end marks
valekjo Mar 13, 2026
64f7dcb
fix: trailing newline
valekjo Mar 13, 2026
d6064df
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
3606805
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
e8bf944
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
a43ca79
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
9052aef
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
e7628bc
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
cb2a3be
Fix link
valekjo Mar 18, 2026
64be8f1
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
9a1c122
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
aa4dae6
fix link
valekjo Mar 18, 2026
31b586f
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
d6381bc
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 18, 2026
206c1aa
Address CR
valekjo Mar 18, 2026
9f612ed
Added more links
valekjo Mar 19, 2026
0798374
Fix lint
valekjo Mar 19, 2026
8eb93fc
Fix lint
valekjo Mar 19, 2026
d679842
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 20, 2026
13f6171
Update sources/platform/actors/development/actor_definition/dataset_s…
valekjo Mar 20, 2026
8346081
Update sources/platform/actors/development/actor_definition/actor_jso…
valekjo Mar 20, 2026
a04a694
Merge branch 'master' into feature/multiple-datasets
valekjo Mar 20, 2026
@@ -81,6 +81,7 @@ Actor `name`, `version`, `buildTag`, and `environmentVariables` are currently on
| `input` | Optional | You can embed your [input schema](./input_schema/index.md) object directly in `actor.json` under the `input` field. You can also provide a path to a custom input schema. If not provided, the input schema at `.actor/INPUT_SCHEMA.json` or `INPUT_SCHEMA.json` is used, in this order of preference. |
| `changelog` | Optional | The path to the CHANGELOG file displayed in the Information tab of the Actor in Apify Console next to Readme. If not provided, the CHANGELOG at `.actor/CHANGELOG.md` or `CHANGELOG.md` is used, in this order of preference. Your Actor doesn't need to have a CHANGELOG but it is a good practice to keep it updated for published Actors. |
| `storages.dataset` | Optional | You can define the schema of the items in your dataset under the `storages.dataset` field. This can be either an embedded object or a path to a JSON schema file. [Read more](/platform/actors/development/actor-definition/dataset-schema) about Actor dataset schemas. |
| `storages.datasets` | Optional | You can define multiple datasets for the Actor under the `storages.datasets` field. This is an object whose keys are dataset aliases and whose values are either embedded schema objects or paths to JSON schema files. [Read more](/platform/actors/development/actor-definition/dataset-schema/multiple-datasets) about multiple dataset schemas. |
| `defaultMemoryMbytes` | Optional | Specifies the default amount of memory in megabytes to be used when the Actor is started. Can be an integer or a [dynamic memory expression string](./dynamic_actor_memory/index.md). |
| `minMemoryMbytes` | Optional | Specifies the minimum amount of memory in megabytes required by the Actor to run. Requires an _integer_ value. If both `minMemoryMbytes` and `maxMemoryMbytes` are set, then `minMemoryMbytes` must be equal or lower than `maxMemoryMbytes`. Refer to the [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
| `maxMemoryMbytes` | Optional | Specifies the maximum amount of memory in megabytes required by the Actor to run. It can be used to control the costs of run, especially when developing pay per result Actors. Requires an _integer_ value. Refer to the [Usage and resources](https://docs.apify.com/platform/actors/running/usage-and-resources#memory) for more details about memory allocation. |
@@ -0,0 +1,143 @@
---
title: Multiple datasets
description: Learn how to use multiple datasets within your Actors to organize and store different types of data separately.
sidebar_position: 2
slug: /actors/development/actor-definition/dataset-schema/multiple-datasets
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

Actors that scrape different data types can store each type in its own dataset with separate validation rules. For example, an e-commerce scraper might store products in one dataset and categories in another.

Each dataset:

- Is created when the run starts
- Follows the run's data retention policy
- Can have its own validation schema

## Define multiple datasets

Define datasets in your Actor schema (`.actor/actor.json`) using the `datasets` object under `storages`:

```json title=".actor/actor.json"
{
    "actorSpecification": 1,
    "name": "my-e-commerce-scraper",
    "title": "E-Commerce Scraper",
    "version": "1.0.0",
    "storages": {
        "datasets": {
            "default": "./products_dataset_schema.json",
            "categories": "./categories_dataset_schema.json"
        }
    }
}
```

Provide schemas for individual datasets as file references or inline. Schemas follow the same structure as single-dataset schemas.

The keys of the `datasets` object are aliases that refer to specific datasets. The previous example defines two datasets aliased as `default` and `categories`.

:::info Alias versus named dataset

Aliases and names are different. Named datasets have specific behavior on the Apify platform (the automatic data retention policy doesn't apply to them). Aliased datasets follow the data retention of their run. Aliases only have meaning within a specific run.

:::

Requirements:

- The `datasets` object must contain the `default` alias
- The `datasets` and `dataset` objects are mutually exclusive (use one or the other)

Review comments on the requirement above:

- Contributor: It'd be great to link the `actor.json` reference from here so that the reader can better understand the difference between `dataset` and `datasets`.
- Member Author: Do you mean the reference from the docs or the publicly available schema files?
- Contributor: The reference in the docs.
- Member Author: Adding "See the full [Actor schema reference](../actor_json.md#reference)."

See the full [Actor schema reference](../actor_json.md#reference).
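
The two requirements listed earlier can be checked mechanically before you push an Actor. The following is a minimal sketch, not part of any Apify tooling: the function name `check_storages` is hypothetical, and it only inspects the `storages` section of an already parsed `actor.json`.

```python
def check_storages(actor_json: dict) -> list[str]:
    """Return a list of problems found in the `storages` section (empty if none).

    Hypothetical helper: it only encodes the two requirements from this page.
    """
    storages = actor_json.get("storages", {})
    problems = []

    # `dataset` and `datasets` must not be used together.
    if "dataset" in storages and "datasets" in storages:
        problems.append("`storages.dataset` and `storages.datasets` are mutually exclusive")

    # When `datasets` is used, it must define the `default` alias.
    datasets = storages.get("datasets")
    if datasets is not None and "default" not in datasets:
        problems.append("`storages.datasets` must contain the `default` alias")

    return problems
```

Calling it with the example configuration from this page returns an empty list. A configuration that defines only `categories` returns the second message.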

## Access datasets in Actor code

You can access aliased datasets in two ways: using the Apify SDK, or reading the `ACTOR_STORAGES_JSON` environment variable directly.

### Apify SDK

<Tabs groupId="main">
<TabItem value="JavaScript" label="JavaScript">

In the JavaScript/TypeScript SDK `>=3.7.0`, use `openDataset` with the `alias` option:

```js
const categoriesDataset = await Actor.openDataset({alias: 'categories'});
```

:::note Running outside the Apify platform

When the JavaScript SDK runs outside the Apify platform, aliases fall back to names (using an alias is the same as using a named dataset). The dataset is purged on the first access when accessed using the `alias` option.

:::

</TabItem>
<TabItem value="Python" label="Python">

In the Python SDK `>=3.3.0`, use `open_dataset` with the `alias` parameter:

```py
categories_dataset = await Actor.open_dataset(alias='categories')
```

:::note Running outside the Apify platform

When the Python SDK runs outside the Apify platform, it uses the [Crawlee for Python aliasing mechanism](https://crawlee.dev/python/docs/guides/storages#named-and-unnamed-storages). Aliases are created as unnamed and purged on Actor start.

:::

</TabItem>
</Tabs>
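
To show how the two datasets from the example `actor.json` might be used together, here is a minimal Python sketch. The helper name `store_results` and the item shapes are hypothetical. The helper receives the Actor object as a parameter (in a real Actor you would pass `Actor` imported from `apify`), and it assumes the `push_data` and `open_dataset(alias=...)` calls described above.

```python
async def store_results(actor, products: list[dict], categories: list[dict]) -> None:
    """Write products to the default dataset and categories to the aliased one.

    `actor` is expected to be the Apify SDK `Actor` object (an assumption of this sketch).
    """
    # Items pushed through the Actor object land in the `default` dataset.
    await actor.push_data(products)

    # The `categories` alias must be defined under `storages.datasets` in actor.json.
    categories_dataset = await actor.open_dataset(alias='categories')
    await categories_dataset.push_data(categories)
```

Inside an `async with Actor:` block, you would call it as `await store_results(Actor, products, categories)`.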


### Environment variable

`ACTOR_STORAGES_JSON` contains JSON-encoded unique identifiers of all storages associated with the current Actor run. Use this approach when working without the SDK:

```sh
echo "$ACTOR_STORAGES_JSON" | jq '.datasets.categories'
# Outputs the ID of the categories dataset, e.g. "3ZojQDdFTsyE7Moy4"
```
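
The same lookup can be done from Python without the SDK. This is a minimal sketch that assumes the layout implied by the `jq` query above: a top-level `datasets` object that maps each alias to a dataset ID. The function name `resolve_dataset_id` is hypothetical.

```python
import json
import os


def resolve_dataset_id(alias: str, environ=os.environ) -> str:
    """Return the dataset ID for `alias` from ACTOR_STORAGES_JSON.

    Assumes the variable holds JSON shaped like {"datasets": {"<alias>": "<id>"}}.
    """
    raw = environ.get("ACTOR_STORAGES_JSON")
    if raw is None:
        raise RuntimeError("ACTOR_STORAGES_JSON is not set")

    storages = json.loads(raw)
    try:
        return storages["datasets"][alias]
    except KeyError:
        raise KeyError(f"No dataset with alias {alias!r} in ACTOR_STORAGES_JSON") from None
```

The `environ` parameter defaults to the process environment and exists only so the function can be tried locally with a plain dictionary.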


## Configure the output schema

### Storage tab

The **Storage** tab in the Actor run view displays all datasets defined by the Actor and used by the run (up to 10).

The Storage tab shows data but doesn't surface it clearly to end users. To present datasets more clearly, define an [output schema](../../actor_definition/output_schema/index.md).

### Output schema

Actors with output schemas can reference datasets through variables using aliases:

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema",
    "properties": {
        "products": {
            "type": "string",
            "title": "Products",
            "template": "{{storages.datasets.default.apiUrl}}/items"
        },
        "categories": {
            "type": "string",
            "title": "Categories",
            "template": "{{storages.datasets.categories.apiUrl}}/items"
        }
    }
}
```

Review comments on the `properties` block (diff lines +123 to +132):

- Contributor: So... what does this (mainly the `template` field) actually do?
- Member Author: That should be clear from the output schema docs. Basically, this will show a dropdown (or tabs) in the Actor run Output tab, with "Products" and "Categories", and if the datasets also have some views defined, the users can switch between those too.
- Contributor: Bold of you to assume that the reader saw and understood that 🙂 Maybe you could add a link to the "How templates work" section?
- Member Author: :D Adding "[Read more](../output_schema/index.md#how-templates-work) about how templates work."

[Read more](../output_schema/index.md#how-templates-work) about how templates work.
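
As an illustration of the idea only, the sketch below substitutes `{{...}}` variables in a template string using a nested dictionary. It is not the platform's implementation: the function `render_template`, the context layout, and the `api.example.com` URL are all assumptions made for the example.

```python
import re


def render_template(template: str, context: dict) -> str:
    """Replace each {{dotted.path}} in `template` with the value found in `context`.

    Illustrative only: the real resolution happens on the Apify platform.
    """
    def lookup(match: re.Match) -> str:
        value = context
        # Walk the dotted path one key at a time, e.g. storages -> datasets -> categories.
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# Hypothetical context for one run; the URL is a placeholder.
context = {
    "storages": {
        "datasets": {
            "categories": {"apiUrl": "https://api.example.com/datasets/abc"},
        }
    }
}
url = render_template("{{storages.datasets.categories.apiUrl}}/items", context)
```

Here the alias in the template path (`categories`) selects which dataset the resulting URL points to.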

## Billing for non-default datasets

When an Actor uses multiple datasets, only items pushed to the `default` dataset trigger the built-in `apify-default-dataset-item` event. Items in other datasets are not charged automatically.

To charge for items in other datasets, implement custom billing in your Actor code. Refer to the [billing documentation](../../../publishing/monetize/pay_per_event.mdx) for implementation details.
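
A custom charge for a non-default dataset might look like the following sketch. It assumes the Python SDK exposes `Actor.charge(event_name=..., count=...)` as described in the pay-per-event documentation, and the event name `category-item` is hypothetical: you would need to define your own event in the Actor's pricing configuration. Verify both against the billing documentation before relying on them.

```python
async def push_and_charge(actor, items: list[dict],
                          alias: str = 'categories',
                          event_name: str = 'category-item') -> int:
    """Push items to an aliased dataset and charge one custom event per item.

    `actor` is expected to be the Apify SDK `Actor` object; `event_name` is a
    hypothetical pay-per-event name, not a built-in event.
    """
    dataset = await actor.open_dataset(alias=alias)
    await dataset.push_data(items)

    # Items in non-default datasets do not trigger `apify-default-dataset-item`,
    # so the charge is issued explicitly.
    await actor.charge(event_name=event_name, count=len(items))
    return len(items)
```

The helper takes the Actor object as a parameter so it can be exercised locally with a stand-in object.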
@@ -44,6 +44,7 @@ Here's a table of key system environment variables:
| `ACTOR_BUILD_TAGS` | A comma-separated list of tags of the Actor build used in the run. Note that this environment variable is assigned at the time of start of the Actor and doesn't change over time, even if the assigned build tags change. |
| `ACTOR_TASK_ID` | ID of the Actor task. Empty if Actor is run outside of any task, e.g. directly using the API. |
| `ACTOR_EVENTS_WEBSOCKET_URL` | Websocket URL where Actor may listen for [events](/platform/actors/development/programming-interface/system-events) from Actor platform. |
| `ACTOR_STORAGES_JSON` | JSON-encoded unique identifiers of storages associated with the current Actor run. |
| `ACTOR_DEFAULT_DATASET_ID` | Unique identifier for the default dataset associated with the current Actor run. |
| `ACTOR_DEFAULT_KEY_VALUE_STORE_ID` | Unique identifier for the default key-value store associated with the current Actor run. |
| `ACTOR_DEFAULT_REQUEST_QUEUE_ID` | Unique identifier for the default request queue associated with the current Actor run. |