Feat (bigquery): store resource schema in the table's own OPTIONS, drop the _table_metadata store#3
Conversation
…op the _table_metadata store The Frictionless schema (fields + the `info` data dictionary + primaryKey/unique_key) is now JSON-encoded into each data table's own table-level `description` OPTION under a `datastore` key — written via `OPTIONS(...)` on CREATE and `SET OPTIONS(...)` on ALTER, and read back with a single `tables.get` call (no query job, ~200ms). Tables created outside the engine fall back to BigQuery-column inference. This removes the need for a separate per-resource metadata table. - Delete bigquery/metadata.py (BigQueryMetadataStore) and the MetadataStore Protocol from engines/base.py; encode/parse helpers now live in bigquery/lib.py (table_options_clause / set_table_options_sql / table_to_schema). - Drop the `_table_metadata` mention from the BIGQUERY_DATASET config description. Write-path refactor (readability): - Collapse the duplicated insert/merge/update plumbing into two helpers — _build_dml (builder SQL; ValueError -> 400) and _write_rows (run a @rows write; BQ errors -> 400); remove _insert_records_sql. - Share normalize_pk() and _user_fields() in lib.py (used by merge_sql/update_sql/info/_drop_columns). - Extract search_sql's COUNT-path selection into _search_sql_count_query. Tests: replace the obsolete test_metadata.py (which tested the deleted store and broke collection) with unit coverage for the native-metadata helpers; update test_tables.py. Full suite: 292 passing, ruff clean. Also: add PERF_PROFILE_ENABLED config flag (on-demand pyinstrument profiling via X-Profile header); minor ckan_client docstring tidy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 25 minutes and 57 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (12)
📝 WalkthroughWalkthroughRemove the BigQuery MetadataStore and _table_metadata; store Frictionless schemas JSON-encoded in BigQuery table descriptions, read schemas via direct table metadata, refactor BigQuery backend DDL/DML into shared helpers, and update tests to exercise native-metadata helpers and new orchestration. ChangesBigQuery Metadata Storage Migration
Sequence DiagramsequenceDiagram
participant Client
participant Backend
participant LibHelpers
participant BigQuery
Client->>Backend: create(resource_id, schema, rows)
Backend->>BigQuery: get_table(resource_id)
alt exists
BigQuery-->>Backend: table metadata
Backend->>LibHelpers: table_to_schema
LibHelpers-->>Backend: schema
Backend->>BigQuery: client.query(ALTER statements or INSERT)
else missing
BigQuery-->>Backend: NotFound
Backend->>LibHelpers: _create_table_sql
LibHelpers-->>Backend: CREATE TABLE with OPTIONS and optional INSERT
Backend->>BigQuery: client.query(CREATE and optional INSERT)
end
BigQuery-->>Backend: job result
Backend-->>Client: success or error
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…F_PROFILE_ENABLED `create_app()` runs at import (module-level `app`), and the auth refactor's validator rejects AUTH_TYPE=ckan with an empty CKAN_URL — so `pytest` failed at collection in CI (no .env, no CKAN_URL env var). conftest now sets a dummy CKAN_URL in os.environ before importing datastore.main (pydantic prioritises env over .env); tests override the CKAN client via DI, so it is never contacted. This had left main red since the auth/bigquery merge. Also drop the unused PERF_PROFILE_ENABLED config field (no consumer; the profiling middleware that would read it isn't in this branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
datastore/infrastructure/engines/bigquery/backend.py (1)
78-96:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't make missing BigQuery config fall through to placeholder CRUD.
Leaving
client=Nonehere makes the no-op/empty-data branches below reachable for ordinary misconfiguration. That meanscreate/upsert/deletecan acknowledge success without persisting anything, whilesearch/infofabricate empty results. Please fail fast here, or gate placeholder mode behind an explicit test-only switch.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@datastore/infrastructure/engines/bigquery/backend.py` around lines 78 - 96, The current client-builder returns early with self.client left as None, which allows normal CRUD methods (create, upsert, delete, search, info) to silently run in placeholder/no-op mode on misconfiguration; change the build logic in the BigQueryBackend client construction method so it fails fast instead of falling through: either raise a clear configuration error (e.g., RuntimeError) when BIGQUERY_PROJECT or BIGQUERY_DATASET are missing, or explicitly set a guarded placeholder flag (e.g., self._placeholder_mode) that is only enabled via an explicit test-only config switch and is checked by create/upsert/delete/search/info; update the code paths that currently assume client may be None to rely on the explicit flag or to never be reached because an exception was raised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@datastore/infrastructure/engines/bigquery/backend.py`:
- Around line 105-126: The dump() implementation still calls
self.metadata.get(...), but self.metadata was removed; update dump() to call the
existing _read_schema(resource_id) for the existence check instead: replace the
self.metadata.get(...) call in dump() with schema =
self._read_schema(resource_id) and if schema is None raise the same not-found
error type used elsewhere in this module (e.g., ResourceNotFound or equivalent)
or return early, then proceed using the schema/result; this removes the
dependency on self.metadata and reuses _read_schema(resource_id) for presence
checking.
- Around line 105-126: Several public and helper signatures in BigQuery backend
use bare generics (e.g., `_read_schema(...)-> dict | None`, variables `schema:
dict`, `records: list`, and return types like `-> tuple[str | None, list]`)
which fails mypy --strict; update these to parameterized container types: use
dict[str, Any] for schema maps, list[dict[str, Any]] for record lists, and
explicit element types for any parameter lists (or list[Any] if heterogeneous).
Locate `_read_schema`, plus the functions/variables around the noted ranges
(including create, upsert, search and any helper that returns (maybe_resource,
records)), and replace bare dict/list annotations with the concrete typed forms
(import Any from typing if needed) so all signatures and local variable type
hints are parameterized. Ensure return annotations (e.g., Optional[dict[str,
Any]] or tuple[str | None, list[dict[str, Any]]]) match the new types.
- Around line 173-178: The search() method currently calls
self.client.query(...) for both COUNT and page queries outside of the
_run_query() wrapper so submit-time exceptions aren't converted to ServerError;
update search() to wrap each self.client.query(...) call (both the COUNT submit
and the page/data query submit) in the same try/except logic used by
_run_query() or call _run_query() for submission so any exception from
client.query is caught and re-raised as ServerError (use the same op +
resource_id format), and ensure search_sql() remains consistent by treating the
non-fatal COUNT submit the same way.
In `@datastore/infrastructure/engines/bigquery/lib.py`:
- Around line 518-529: The fallback schema returned by _infer_schema_from_bq
currently uses raw BigQuery names (e.g., "INT64", "STRING") which downstream
helpers like table_to_schema() / delete_sql() and _FILTER_PARAM_TYPE don't
recognize; update _infer_schema_from_bq to map col.field_type to the module's
canonical Frictionless types before returning (e.g., use the existing
BigQuery->Frictionless mapping or add one in this module) so the returned
{"fields": ...} uses the normalized type names that _FILTER_PARAM_TYPE expects,
or alternatively extend the downstream type map to accept both representations.
---
Outside diff comments:
In `@datastore/infrastructure/engines/bigquery/backend.py`:
- Around line 78-96: The current client-builder returns early with self.client
left as None, which allows normal CRUD methods (create, upsert, delete, search,
info) to silently run in placeholder/no-op mode on misconfiguration; change the
build logic in the BigQueryBackend client construction method so it fails fast
instead of falling through: either raise a clear configuration error (e.g.,
RuntimeError) when BIGQUERY_PROJECT or BIGQUERY_DATASET are missing, or
explicitly set a guarded placeholder flag (e.g., self._placeholder_mode) that is
only enabled via an explicit test-only config switch and is checked by
create/upsert/delete/search/info; update the code paths that currently assume
client may be None to rely on the explicit flag or to never be reached because
an exception was raised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7bf7b408-d04a-44e1-9d06-a80243d01326
📒 Files selected for processing (8)
datastore/core/config.pydatastore/infrastructure/ckan_client.pydatastore/infrastructure/engines/base.pydatastore/infrastructure/engines/bigquery/backend.pydatastore/infrastructure/engines/bigquery/lib.pydatastore/infrastructure/engines/bigquery/metadata.pytests/engines/bigquery/test_metadata.pytests/engines/bigquery/test_tables.py
💤 Files with no reviewable changes (1)
- datastore/infrastructure/engines/bigquery/metadata.py
| def table_options_clause(schema: dict[str, Any]) -> str: | ||
| """Return ` OPTIONS(description = '…', labels = […])` for a table DDL.""" | ||
| desc = _encode_table_description(schema) | ||
| return ( | ||
| f" OPTIONS(description = {_sql_literal(desc)}, " | ||
| f"labels = [(\"datastore_managed\", \"true\")])" | ||
| ) | ||
|
|
||
|
|
||
| def set_table_options_sql(table_ref: str, schema: dict[str, Any]) -> str: | ||
| """Stand-alone `ALTER TABLE … SET OPTIONS(...)` statement. | ||
|
|
||
| Separate from column ALTERs because BQ refuses to mix column | ||
| actions and SET OPTIONS in one statement. | ||
| """ | ||
| desc = _encode_table_description(schema) | ||
| return ( | ||
| f"ALTER TABLE {table_ref} " | ||
| f"SET OPTIONS(description = {_sql_literal(desc)}, " | ||
| f"labels = [(\"datastore_managed\", \"true\")])" | ||
| ) |
There was a problem hiding this comment.
This metadata writer still overwrites unrelated table metadata.
_encode_table_description() and set_table_options_sql() synthesize description and labels from schema alone, so every refresh drops any existing human-authored description content and non-datastore labels. That contradicts the preservation claim at Lines 41-44, and unless the backend contract is being updated in lockstep, it also moves schema metadata off the repo's required _table_metadata round-trip path. As per coding guidelines, "Store field schema metadata (info dict with title, description, comment, example, unit, custom keys) verbatim via _table_metadata table in BigQuery; round-trip via datastore_info."
Also applies to: 566-575
- dump(): drop the dead `self.metadata` existence check (the metadata store was removed in this PR, so it would AttributeError on the first dump). Rely on the existing get_table call, mapping NotFound -> 404. [Critical] - search(): move the COUNT + page query submits inside the try so submit-time failures map to ServerError, not raw google exceptions. [Major] - _infer_schema_from_bq(): map BigQuery column types to canonical Frictionless names so downstream filter type maps work for unmanaged tables. [Major] - lib.py: correct the table-description note — the description is engine-owned and rewritten wholesale; manual edits / non-datastore labels are NOT preserved. [Major] - CLAUDE.md: update the stale `_table_metadata` references to the native table-level metadata design (schema stored in the table's own `description` OPTION). - tests: expect Frictionless types for unmanaged-table inference; drop the now-dead `backend.metadata` setup.
When datastore_delete drops columns (the `fields` path), the response now includes `result.schema` — the Frictionless Table Schema after the listed columns were removed — so callers can confirm the table's new shape without a follow-up datastore_info. Threaded through the existing `new_schema` that `_drop_columns` already builds: _drop_columns now returns it, delete() carries it on WriteResult.schema, and the service maps it onto the response. Row-delete and whole-table-drop paths leave schema unset (omitted from the envelope).
…y force guard
- datastore_create's resource-dict path now sends `url_type="datastore"` in the
resource_create payload, so the new CKAN resource is marked datastore-managed.
- datastore_create / _upsert / _delete refuse to write a CKAN resource whose
record carries `url_type="datastore"` unless the request sets `force: true`,
raising a Validation Error ("Cannot update a read-only resource. Use \"force\"
to force update."). Mirrors CKAN's guard against clobbering managed data;
no-op under non-CKAN auth (no resource record).
- Guard lives in api/auth.py (ensure_resource_writable), called after authorize
in each write endpoint.
The guard already no-op'd under non-CKAN auth (only the CKAN provider returns a resource record with url_type), but make the scope explicit: ensure_resource_writable now takes auth_type and returns early unless it's 'ckan', so a custom non-CKAN provider that happens to surface a url_type can't trip it. Adds unit tests for the gate (ckan blocks/forces, anonymous/jwt skip, non-datastore url_type ignored).
What
Stores the resource's Frictionless schema inside each data table's own metadata instead of a separate per-resource
_table_metadatatable.The schema (fields + the
infodata dictionary +primaryKey/unique_key) is JSON-encoded under adatastorekey in the table-leveldescriptionOPTION:OPTIONS(...)onCREATE TABLEandSET OPTIONS(...)onALTER,tables.getcall (no query job, ~200ms),This removes the need for the separate metadata store entirely.
Changes
_table_metadatastore: deletebigquery/metadata.py(BigQueryMetadataStore) and theMetadataStoreProtocol fromengines/base.py. Encode/parse helpers now live inbigquery/lib.py(table_options_clause/set_table_options_sql/table_to_schema)._build_dml(builder SQL;ValueError→ 400) and_write_rows(run a@rowswrite; BQ errors → 400); remove_insert_records_sql. Sharenormalize_pk()/_user_fields()inlib.py. Extractsearch_sql's COUNT-path selection into_search_sql_count_query.test_metadata.py(it tested the deleted store and broke collection) with unit coverage for the native-metadata helpers; updatetest_tables.py.PERF_PROFILE_ENABLEDconfig flag; minorckan_clientdocstring tidy.Testing
pytest: 318 passing (includes the dump-feature suite already onmain).ruff check: clean.datastore_dumpfeature from Feat (dump): /datastore/dump/<rid> with 302-or-stream-concat download #2 is preserved —client.pywas resolved to keepload_credentials, whichbackend.dumpdepends on.Notes
mainascc4cf95).mypy --stricthas pre-existing failures across the engine package (barelist/dictgenerics,Config | Noneaccess,WriteResult-as-dict) that this PR neither introduces nor fixes.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores
Documentation
Tests