diff --git a/src/blog/2026-02-25-persistence/index.malloynb b/src/blog/2026-02-25-persistence/index.malloynb new file mode 100644 index 00000000..780a0ace --- /dev/null +++ b/src/blog/2026-02-25-persistence/index.malloynb @@ -0,0 +1,171 @@ +>>>markdown +# Persistence in Malloy: A Foundation, Not a Framework + +*February 25, 2026 by Michael Toy* + +As complex data is mined for information, inevitably there are transformations that are expensive to compute. A summary table that takes thirty seconds to build is fine the first time, but not when it gets recomputed every time someone opens a dashboard. The obvious answer is to save those intermediate results as tables — but "obvious" and "simple" are very different things. + +Persistence in data computation is a huge problem space. Caching. Invalidation. Scheduling. Garbage collection. Production versus staging environments. Developer workflows. Multi-tenant quotas. The list goes on. Most tools either ignore this entirely — leaving you to manage it outside the system — or ship an opinionated pipeline that works great until your needs don't fit the opinions. + +We wanted a third option. So Malloy provides a **simple** implementation of persistence — and a foundation for **sophisticated** ones. + +## Separate the Machinery from the Policy + +The simple implementation is `malloy-cli build` and the VS Code extension: annotate sources, run a build, query pre-built tables. It handles the common case — a developer or small team persisting expensive intermediate results — with minimal setup. + +But the persistence *foundation* is deliberately more general. The core provides three primitives: + +1. **Annotation** — mark a source as persistent with `#@ persist` +2. **Graph examination** — walk a model's dependency graph to understand what needs building, and in what order +3. 
**Table substitution** — given a manifest of what's been built, swap in the pre-built table at query time + +Everything else — when to build, when to invalidate, how to handle environments, what to do about failures — is the application's problem. This is a design choice, not a limitation. The simple builder and a multi-tenant SaaS platform serving thousands of users with age-based refresh and per-user quotas both use the same three primitives. The foundation doesn't care what you build on it. + +## The Design Journey: From Queries to Sources + +Our initial thinking focused on queries as the unit of persistence. After all, when you persist something, you're persisting the result of running a query. + +But as we worked through the design, we realized that while a *query* is what gets executed and persisted, it is always being persisted *as a source*. The resulting table becomes something that other Malloy code references — and references in Malloy are to sources, not queries. + +This shift clarified the design significantly. To mark something as persistent, you annotate the source: + +```malloy +##! experimental.persistence + +#@ persist name=by_carrier +source: by_carrier is flights -> { + group_by: carrier + aggregate: flight_count is count() +} +``` + +The `#@ persist` annotation is metadata, not a language keyword. It lives in Malloy's tag system, which means it's extensible — a builder application can look at whatever additional data it wants in the annotation. The `name=by_carrier` part tells the simple builder what table to create, but a more sophisticated application might use annotations for table lifetime, environment targeting, or refresh policy. The language doesn't prescribe any of that. + +### Inheritance + +Persistence flows through `extend`. 
If the base data is worth persisting, derived versions usually are too: + +```malloy +#@ persist name=by_carrier +source: by_carrier is flights -> { + group_by: carrier + aggregate: flight_count is count() +} + +// Also persistent — inherits from by_carrier +source: enriched is by_carrier extend { + dimension: upper_carrier is upper(carrier) +} + +// Break the chain when you need to +#@ -persist +source: scratch is by_carrier extend { ... } +``` + +## The Simple Implementation: Builder + VS Code + +Here's what the simple implementation looks like in practice. + +### Annotate your sources + +Add `#@ persist name=` to any source backed by a query. Enable the experiment at the top of your file: + +```malloy +##! experimental.persistence + +source: flights is duckdb.table('flights.parquet') extend { + measure: flight_count is count() +} + +#@ persist name=by_carrier +source: by_carrier is flights -> { + group_by: carrier + aggregate: flight_count +} + +#@ persist name=by_origin +source: by_origin is flights -> { + group_by: origin + aggregate: flight_count +} +``` + +### Build with the CLI + +`malloy-cli build` is the builder. It compiles your models, walks the dependency graph, and creates tables in order: + +```bash +malloy-cli build models/ +``` + +It's incremental — unchanged sources are skipped. When you need to force a rebuild (say the underlying data has refreshed but your Malloy source hasn't changed), use `--refresh`: + +```bash +malloy-cli build --refresh duckdb:by_carrier +``` + +The builder writes a manifest when it's done. That's its only output beyond the database tables themselves. + +### Query in VS Code + +The VS Code extension, once it's set up to read the same config the builder uses, reads the manifest automatically. After a build, queries against persistent sources use the pre-built tables — no restart needed. + +To verify it's working, compile a query without running it and check the SQL. 
You'll see `FROM by_carrier` instead of an inlined subquery. + +### Setup + +Many connections work with no configuration at all — if a connection name matches a registered database type, Malloy creates one with default settings. For persistence, the main thing you need to decide is where the manifest lives: + +- **Global config** (simplest): The CLI reads `~/.config/malloy/malloy-config.json` by default and writes the manifest next to it. Point VS Code's `malloy.globalConfigDirectory` at the same directory, and both tools share the same manifest. No flags needed. + +- **Project config**: Put a `malloy-config.json` in your project root (it can be as minimal as `{}`) to anchor the manifest to your project. VS Code finds it automatically; the CLI needs `--config .`. + +See the [persistence documentation](https://docs.malloydata.dev/experiments/persistence.malloynb) for the full setup guide. + +## How It Works Under the Hood + +### BuildID: Content-Addressed Identity + +Every persistent source gets a **BuildID** — a hash of the SQL that would be generated and the connection it targets. This captures both the logical content of the source and the context in which it runs. Two users with different connection parameters get different BuildIDs for the same Malloy source, which is correct — their built tables may differ because the underlying data differs. + +Because the BuildID is content-addressed, incremental builds fall out naturally. If the SQL hasn't changed and the connection hasn't changed, the BuildID is the same, and the table doesn't need rebuilding. Change a `where` clause, and the BuildID changes, and the builder knows to rebuild. + +### Manifests: The Bridge Between Builder and Runtime + +The manifest is a simple data structure: a map from BuildID to information about the built table (connection, table name, when it was built). A builder writes it. The runtime reads it. That's the entire contract. + +This simplicity is the point. 
The manifest is just a JSON file. How it's stored and shared is up to you — it could be a file on disk, a blob in cloud storage, a row in a database. The runtime doesn't care where it came from. + +### The Spectrum of Manifest Usage + +Because the manifest is optional, persistence supports very different experiences depending on what you provide: + +**No manifest (development):** Everything expands inline. The model runs exactly as if no `#@ persist` annotations existed. Persistence is invisible until you opt in. This is the default — you can always run a model without building anything first. + +**Partial manifest (incremental builds):** Some sources are pre-built, others expand inline. The compiler doesn't care which sources are in the manifest and which aren't. This lets a builder do incremental work — build what's missing, skip what's fresh. + +**Full manifest with `strictPersist` (production):** The `strictPersist` option tells the compiler to fail if any persistent source is *missing* from the manifest. This catches configuration errors — if a table should have been built but wasn't, you find out at compile time rather than by accidentally running an expensive query in production. + +## Building a Sophisticated Persistence Application + +The simple implementation is built on top of the same API that a sophisticated application would use. If you need more — multi-environment manifests, custom invalidation logic, integration with your deployment pipeline, per-user table management — the foundation is there. 
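The contract described above (a content-addressed BuildID, a manifest keyed by it, and a strict mode that fails on a miss) can be sketched in a few lines. This is an illustration only, under stated assumptions: `ManifestEntry`, `computeBuildId`, and `resolveSource` are hypothetical names, not Malloy's API, and a toy FNV-1a hash stands in for whatever hash Malloy actually uses.

```typescript
// Hypothetical sketch: none of these names are Malloy's real API.

// What the manifest records about one built table (fields taken from the
// prose above: connection, table name, when it was built).
interface ManifestEntry {
  connection: string;
  tableName: string;
  builtAt: string;
}

// The manifest itself: a map from BuildID to the built table's details.
type Manifest = Record<string, ManifestEntry>;

// Toy 32-bit FNV-1a hash, a stand-in for the real (unspecified) hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, modulo 2^32.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// A BuildID depends on both the generated SQL and the target connection.
function computeBuildId(connection: string, sql: string): string {
  return fnv1a(connection + "\n" + sql);
}

// Pick what a query reads from: the pre-built table when the manifest has
// an entry, otherwise the inlined SQL. In strict mode a miss is an error.
function resolveSource(
  manifest: Manifest,
  connection: string,
  sql: string,
  strict: boolean
): string {
  const entry = manifest[computeBuildId(connection, sql)];
  if (entry !== undefined) return entry.tableName;
  if (strict) throw new Error("persistent source missing from manifest");
  return "(" + sql + ")";
}

// Example: record one built table, then resolve against it.
const exampleSql =
  "SELECT carrier, COUNT(*) AS flight_count FROM flights GROUP BY 1";
const exampleManifest: Manifest = {};
exampleManifest[computeBuildId("duckdb", exampleSql)] = {
  connection: "duckdb",
  tableName: "by_carrier",
  builtAt: "2026-02-25T00:00:00Z",
};
```

With the example manifest, `resolveSource(exampleManifest, "duckdb", exampleSql, true)` yields the table name, while an empty manifest falls back to the inlined SQL, or throws when `strict` is set. Changing the SQL or the connection changes the key, which is why an unchanged source can be skipped on the next build.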
+ +The key API surfaces: + +- **`MalloyConfig`** — loads config and manifest, creates connections from the registry +- **`Manifest`** — in-memory manifest store; load, update, serialize +- **Model compilation** — compile a model and walk it to find persistent sources and their dependencies +- **BuildID computation** — content-addressed identity for each persistent source +- **`buildManifest` on Runtime** — set a manifest and the compiler handles substitution + +A builder application owns everything else: which models to scan, when to rebuild, how to handle partial failures, where to store manifests, how to manage environments. The foundation provides the "what" (these sources need building, in this order, and here's how to record what you built). Your application provides the "when," "where," and "how." + +For a concrete starting point, [`simple_builder.ts`](https://github.com/malloydata/malloy/blob/main/scripts/simple_builder.ts) is a skeleton builder application that shows how to find what needs building, build it, and record the results in a manifest. It's a good place to start if you want to build your own persistence layer. + +For the full persistence design, see [WN-0022: Persistent Sources](https://github.com/malloydata/whatsnext/blob/main/wns/WN-0022-persistence/wn-0022.md). For the shared configuration system that underpins both the simple workflow and custom applications, see [WN-0023: Shared Configuration](https://github.com/malloydata/whatsnext/blob/main/wns/WN-0023-connection-config/wn-0023.md). + +## Try It + +Persistence is experimental — enable it with `##! experimental.persistence` and let us know what you think. The [documentation](https://docs.malloydata.dev/experiments/persistence.malloynb) covers setup and usage. The WN specs linked above cover the design in full. And if you're building something interesting on the persistence API, we'd love to hear about it. 
+ +>>>markdown diff --git a/src/documentation/experiments/experiments.malloynb b/src/documentation/experiments/experiments.malloynb index 65b8a9ee..7f73550e 100644 --- a/src/documentation/experiments/experiments.malloynb +++ b/src/documentation/experiments/experiments.malloynb @@ -13,4 +13,5 @@ Below is a list of currently running experiements and how to turn them on. * `##! experimental.sql_functions` - [Write expression in SQL](sql_expressions.malloynb) * `##! experimental.parameters` - [Declare sources with parameters](parameters.malloynb) * `##! experimental.composite_sources` - [Create virtual sources backed by multiple cube tables or source definitions](composite_sources.malloynb) -* `##! experimental.access_modifiers` - [Limit access to fields in a source](include.malloynb) \ No newline at end of file +* `##! experimental.access_modifiers` - [Limit access to fields in a source](include.malloynb) +* `##! experimental.persistence` - [Persist sources as tables with the CLI builder](persistence.malloynb) \ No newline at end of file diff --git a/src/documentation/experiments/persistence.malloynb b/src/documentation/experiments/persistence.malloynb new file mode 100644 index 00000000..c2bf9718 --- /dev/null +++ b/src/documentation/experiments/persistence.malloynb @@ -0,0 +1,158 @@ +>>>markdown +## Persistence + +Malloy sources backed by queries can be "persisted" — their results saved as database tables. When queries run against a persistent source, the runtime reads from the pre-built table instead of recomputing the query. + +This is an experimental feature. Enable it with `##! experimental.persistence` at the top of your `.malloy` file. + +### Persistence in Malloy + +Malloy's persistence support is a foundation, not a complete solution. 
The core provides machinery for annotating sources, examining models for dependencies, and substituting pre-built tables at query time — but all policy decisions (scheduling, invalidation, environments, quotas, garbage collection) are left to the application layer. This makes it possible to build sophisticated persistence strategies for complex applications. See [WN-0022 (Persistent Sources)](https://github.com/malloydata/whatsnext/blob/main/wns/WN-0022-persistence/wn-0022.md) and [WN-0023 (Shared Configuration)](https://github.com/malloydata/whatsnext/blob/main/wns/WN-0023-connection-config/wn-0023.md) for the full design. + +**This document** covers the simple, built-in persistence workflow: annotate sources with `#@ persist`, build tables with `malloy-cli build`, and use them in the [VS Code extension](../setup/extension.malloynb). No custom application code required. + +### Annotating sources + +Add `#@ persist name=` before any source backed by a query. The `name` is required — it specifies the database table that will hold the results. + +```malloy +##! experimental.persistence + +source: flights is duckdb.table('flights.parquet') extend { + measure: flight_count is count() +} + +#@ persist name=by_carrier +source: by_carrier is flights -> { + group_by: carrier + aggregate: flight_count +} + +#@ persist name=by_origin +source: by_origin is flights -> { + group_by: origin + aggregate: flight_count +} +``` + +Persistence is inherited when you `extend` a persistent source. The child keeps the same table name unless you override it. Use `#@ -persist` to opt out: + +```malloy +// Inherits persistence from by_carrier +source: enriched is by_carrier extend { + dimension: upper_carrier is upper(carrier) +} + +// Opt out of persistence +#@ -persist +source: temporary is by_carrier extend { ... 
} +``` + +### Setup + +Many database connections work with no [configuration](../setup/config.malloynb) at all — if a connection name matches a registered database type (DuckDB, BigQuery, Postgres, etc.), Malloy creates one with default settings. So for simple cases, you may not need a `malloy-config.json` to get started. + +The builder writes a **manifest** (`malloy-manifest.json`) that tells the runtime which tables have been built. The manifest lives in a directory next to the config file (default: `MANIFESTS/`, configurable via `manifestPath` in `malloy-config.json`). Both the builder and VS Code read the manifest from the same location. + +You need to decide where your config and manifest live. There are two common setups: + +#### Global config + +The CLI reads `~/.config/malloy/malloy-config.json` by default. The manifest is written to `~/.config/malloy/MANIFESTS/malloy-manifest.json`. + +This is the simplest setup — no flags needed when running the builder: + +```bash +malloy-cli build models/analytics.malloy +``` + +To have VS Code use the same global config and manifest, set the `malloy.globalConfigDirectory` setting to `~/.config/malloy`. + +#### Project config + +For project-specific connections or to keep the manifest with your project, place a `malloy-config.json` in the project directory. It can be as minimal as `{}` if default connections are sufficient — what matters is that it anchors the manifest location. + +``` +my-project/ + malloy-config.json + MANIFESTS/ + malloy-manifest.json ← created by the builder + models/ + analytics.malloy +``` + +VS Code detects `malloy-config.json` in the workspace root automatically. + +When using a project config with the CLI, pass `--config`: + +```bash +malloy-cli --config . 
build models/ +``` + +### Building + +The builder compiles `.malloy` files, finds `#@ persist` sources, and creates the database tables: + +```bash +# Build all .malloy files in the current directory (recursive) +malloy-cli build + +# Build a specific file +malloy-cli build models/analytics.malloy + +# Build all files in a directory (recursive) +malloy-cli build models/ + +# Preview what would be built without executing +malloy-cli build --dry-run +``` + +The builder: +1. Finds `#@ persist` sources in the specified files or directories +2. Computes a dependency graph and processes sources in topological order +3. Checks whether each table is already up to date (same SQL, same connection) +4. Skips unchanged sources; creates or replaces changed ones +5. Writes the manifest once at the end + +Output shows the status of each source: + +``` +models/analytics.malloy + ✓ by_carrier (duckdb) — up to date + ✓ by_origin (duckdb) — built (1.2s) → by_origin + +Manifest written: MANIFESTS/malloy-manifest.json + +Build complete: 1 built, 1 up to date +``` + +### Refreshing tables + +When a table needs to be rebuilt even though the Malloy source hasn't changed — for example, a summary of data that updates daily — use `--refresh` to force a rebuild: + +```bash +# Refresh a specific table +malloy-cli build --refresh duckdb:daily_summary + +# Refresh multiple tables +malloy-cli build --refresh duckdb:daily_summary,duckdb:hourly_counts +``` + +Tables not named in `--refresh` are still checked normally and skipped if up to date. + +Since the builder is a command-line tool, you can schedule refreshes however you like — cron, CI pipelines, or any other scheduler. For example: + +```bash +# crontab example +0 0 * * * malloy-cli --config /path/to/project build --refresh duckdb:daily_summary +``` + +### Using persisted tables in VS Code + +Once the builder has written the manifest, VS Code picks it up automatically — no restart needed. 
Queries against persistent sources use the persisted tables instead of recomputing. + +To verify, compile a query (without running it) and check the generated SQL. With a manifest, you'll see `FROM by_carrier` instead of an inlined subquery. + +If VS Code and the builder are reading different config files, they'll have different manifests. Make sure both point at the same `malloy-config.json` (see [Setup](#setup) above). +>>>markdown + diff --git a/src/documentation/malloy_cli/index.malloynb b/src/documentation/malloy_cli/index.malloynb index 517487e3..083549ab 100644 --- a/src/documentation/malloy_cli/index.malloynb +++ b/src/documentation/malloy_cli/index.malloynb @@ -21,7 +21,7 @@ For detailed setup instructions including environment variables, connection comm ### Default Connections -Two connections are created automatically if you don't already have a name that overrides them — `bigquery` and `duckdb`. If `.malloy` or `.malloysql` files reference these connection names, they work without explicit setup. DuckDB uses a built-in instance, and BigQuery attempts to connect using any existing gcloud authentication on your computer. +If a connection name in your model matches a registered database type (DuckDB, BigQuery, Postgres, Snowflake, Trino, or Presto), the CLI creates one with default settings automatically. For example, `duckdb.table('data.parquet')` works out of the box. Some database types rely on environment variables for their defaults (e.g. `SNOWFLAKE_ACCOUNT`, `TRINO_SERVER`) — see the [Configuration](../setup/config.malloynb) reference for the full list of defaults per connection type. 
## Usage diff --git a/src/documentation/setup/cli.malloynb b/src/documentation/setup/cli.malloynb index f72322dd..c3d301fe 100644 --- a/src/documentation/setup/cli.malloynb +++ b/src/documentation/setup/cli.malloynb @@ -81,7 +81,7 @@ malloy-cli connections create postgres pg host=localhost port=5432 databaseName= ### Default Connections -Two connections are created automatically if you don't already have a connection that overrides them — `bigquery` and `duckdb`. If your Malloy files reference these connection names, they work without explicit setup. DuckDB uses a built-in in-memory instance, and BigQuery attempts to connect using any existing gcloud authentication on your computer. +If a connection name in your model matches a registered database type (DuckDB, BigQuery, Postgres, Snowflake, Trino, or Presto), the CLI creates one with default settings automatically. For example, `duckdb.table('data.parquet')` works out of the box. Some database types rely on environment variables for their defaults (e.g. `SNOWFLAKE_ACCOUNT`, `TRINO_SERVER`) — see the [Configuration](config.malloynb) reference for the full list of defaults per connection type. ### DuckDB Working Directory @@ -116,12 +116,41 @@ Useful for: - Copying SQL to other tools - Understanding what Malloy generates -### Get Help +### Build Persistent Tables + +Build persistent tables defined with `#@ persist` in Malloy files: ```bash +# Build all .malloy files in the current directory +malloy-cli build + +# Build a specific file +malloy-cli build path/to/model.malloy + +# Build all files in a directory (recursive) +malloy-cli build models/ + +# Preview what would be built without executing +malloy-cli build --dry-run + +# Force rebuild specific tables +malloy-cli build --refresh duckdb:daily_summary,duckdb:hourly_counts +``` + +See **[Persistence](../experiments/persistence.malloynb)** for full details on setting up persistent sources, manifests, and using them with VS Code. 
+ +### Global Options + +```bash +# Use a project-level config instead of the global default +malloy-cli --config . run model.malloy +malloy-cli --config /path/to/malloy-config.json build + +# Get help malloy-cli --help malloy-cli run --help malloy-cli compile --help +malloy-cli build --help ``` --- @@ -130,4 +159,5 @@ malloy-cli compile --help - **[Database Support](database_support.malloynb)** — Overview of all supported databases - **[Transform & Materialize](../user_guides/transform.malloynb)** — Use the CLI with MalloySQL for data transformations +- **[Persistence](../experiments/persistence.malloynb)** — Persist sources as tables with the CLI builder - [malloy-cli on GitHub](https://github.com/malloydata/malloy-cli) diff --git a/src/documentation/setup/config.malloynb b/src/documentation/setup/config.malloynb index e847226d..cba7e2b2 100644 --- a/src/documentation/setup/config.malloynb +++ b/src/documentation/setup/config.malloynb @@ -145,6 +145,21 @@ SET search_path TO analytics; CREATE TEMP TABLE foo AS SELECT 1; --- +## Manifest Path + +The config file can specify where [persistence](../experiments/persistence.malloynb) manifests are stored: + +```json +{ + "connections": { ... }, + "manifestPath": "MANIFESTS" +} +``` + +The `manifestPath` property is optional and defaults to `"MANIFESTS"`. The manifest file is `/malloy-manifest.json`, relative to the config file's location. Both the builder (`malloy-cli build`) and the VS Code extension read the manifest from this path. + +--- + ## Environment Variables Any property value can be replaced with an environment variable reference. 
This is especially useful for sensitive values like passwords and API tokens, so they don't need to appear directly in your config file: diff --git a/src/documentation/setup/extension.malloynb b/src/documentation/setup/extension.malloynb index 394a3d32..55ffca08 100644 --- a/src/documentation/setup/extension.malloynb +++ b/src/documentation/setup/extension.malloynb @@ -66,7 +66,9 @@ Click any connection to open the connection editor, or use the **+** button to c Place a `malloy-config.json` file in the root of your project (workspace root). The extension detects it automatically and picks up changes whenever you save. In multi-root workspaces, each workspace root can have its own file with independent connection namespaces. -See **[Configuration](config.malloynb)** for the full config file format, connection type properties, setup SQL, and environment variables. +The extension also reads [persistence manifests](../experiments/persistence.malloynb) from the config's `manifestPath` directory. When the builder (`malloy-cli build`) writes a manifest, VS Code picks it up automatically — no restart needed. + +See **[Configuration](config.malloynb)** for the full config file format, connection type properties, manifest path, setup SQL, and environment variables. ### VS Code Settings @@ -175,3 +177,4 @@ Add a Trino or Presto connection via **Malloy: Edit Connections**. 
Enter the ser - **[Database Support](database_support.malloynb)** — Overview of all supported databases - **[Query existing models](../user_guides/querying_a_model.malloynb)** — Explore the malloy-samples to see Malloy in action - **[Build a semantic model](../user_guides/quickstart_modeling.malloynb)** — Create your first Malloy model from scratch +- **[Persistence](../experiments/persistence.malloynb)** — Persist sources as pre-built tables diff --git a/src/table_of_contents.json b/src/table_of_contents.json index 6e590dbb..42aab4fe 100644 --- a/src/table_of_contents.json +++ b/src/table_of_contents.json @@ -416,6 +416,10 @@ { "title": "Access Modifiers", "link": "/experiments/include.malloynb" + }, + { + "title": "Persistence", + "link": "/experiments/persistence.malloynb" } ] }