[HWORKS-2802 / -2807] Document partitioned_by on feature group creation#585

Open
jimdowling wants to merge 17 commits into logicalclocks:main from jimdowling:HWORKS-2802

@jimdowling jimdowling commented May 21, 2026

Contributor

Summary

User-guide section documenting the partitioned_by parameter on feature group creation, under the existing partitioning area in docs/user_guides/fs/feature_group/create.md.

Covers:

  • Usage with create_feature_group / get_or_create_feature_group, and the resulting Hive on-disk layout (year=.../month=.../day=.../).
  • The contract: the user's dataframe never carries the grain columns. The client derives them from event_time on each write, and the backend registers them as ordinary partition columns through the normal table-creation path — there are no Delta GENERATED columns and no dedicated backend Spark job.
  • Validation rules: mutually exclusive with partition_key, requires event_time, grain enum membership with no duplicates, no collision with event_time or an existing feature name, and the hour grain requires a timestamp event_time.
  • Partition pruning: the grain columns are real partition columns, so a direct grain filter prunes natively. An event_time-range read is rewritten by the query layer into equivalent grain predicates, which enables pruning for hierarchical specs. Includes a hierarchical vs non-hierarchical behavior table.
  • Online feature store: online-enabled partitioned_by is not supported yet (deferred to HWORKS-2808), so the grains are offline-only and online_partition_columns is effectively always False today. A feature view may still select a derived grain — it is served offline (training data / batch inference), but get_feature_vector / get_feature_vectors raise a FeatureStoreException, and feature-view creation warns when the view also joins an online-enabled feature group.
  • Formats: DELTA, ICEBERG, and HUDI on non-stream feature groups (the client materializes the grains; Hudi partitions on them). Stream feature groups are not yet supported, which also excludes HUDI on the Python engine because it becomes a stream feature group.
  • The feature group UI Table DDL card.
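
The contract described above (the dataframe never carries the grain columns; they are derived from event_time on each write) can be illustrated with a minimal standalone sketch. This is not the Hopsworks client code: the helper names `derive_grains` and `partition_path` are hypothetical, only the year/month/day/hour grains are covered, and the unpadded `name=value` formatting is an assumption, since the PR text shows the layout only as year=.../month=.../day=.../.

```python
from datetime import datetime

# Hypothetical mapping from grain name to the value taken from event_time.
# The grain names come from the PR description; the helpers are illustrative.
GRAIN_EXTRACTORS = {
    "year": lambda ts: ts.year,
    "month": lambda ts: ts.month,
    "day": lambda ts: ts.day,
    "hour": lambda ts: ts.hour,
}


def derive_grains(event_time, partitioned_by):
    """Return the grain columns for one row, in the order of the spec."""
    return {grain: GRAIN_EXTRACTORS[grain](event_time) for grain in partitioned_by}


def partition_path(event_time, partitioned_by):
    """Build a Hive-style relative directory for the row's partition.

    Assumption: values are written unpadded (month=5, not month=05).
    """
    grains = derive_grains(event_time, partitioned_by)
    return "/".join(f"{name}={value}" for name, value in grains.items()) + "/"


row_time = datetime(2026, 5, 21, 14, 30)
print(partition_path(row_time, ["year", "month", "day"]))
# prints "year=2026/month=5/day=21/"
```

Because the grains are a pure function of event_time, the user never has to supply them, and the same derivation can be repeated on every write.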

Pairs with:

  • hopsworks-api#961 — Python client: client-side grain materialization, cross-engine predicate translator, and the feature-view online-serving guard.
  • hopsworks-ee#3034 — Backend: persistence, validation, Hudi activation, and the offline_only online groundwork.
  • loadtest#859 — End-to-end workflows (feature group + feature-view serving guard).

JIRA: HWORKS-2802. Engineering walkthrough: Confluence page.

Test plan

  • hopsworks-docs markdownlint clean (221 files, 0 errors).
  • hopsworks-docs snakeoil clean (Python code blocks pass ruff at line length 88).
  • hopsworks-docs check (mkdocs strict build) clean — API cross-references resolve, nav intact, no broken links.

🤖 Generated with Claude Code

…tion

https://hopsworks.atlassian.net/browse/HWORKS-2802

Add a section to docs/user_guides/fs/feature_group/create.md
describing the storage-engine-native partitioned_by parameter for
Delta feature groups. Covers:

- Usage example with create_feature_group / get_or_create_feature_group.
- The CREATE TABLE … USING DELTA … GENERATED ALWAYS AS … contract:
  the storage layer derives the partition columns; the user's
  dataframe never carries them.
- Validation rules: mutual exclusion with partition_key, requires
  event_time.
- Partition pruning table — Delta auto-derives partition predicates
  from the GENERATED expressions for hierarchical specs (year /
  year+month / year+month+day / year+month+day+hour), so
  `fg.read(start_time=..., end_time=...)` and
  `fg.filter(fg.event_time >= ...)` prune at the partition level.
  Non-hierarchical specs (e.g. ["month"], ["year","week"]) are valid
  but skip the auto-derivation — only direct predicates on the
  grain columns prune. Recommend hierarchical specs.
- Online feature store behavior: derived columns live offline-only
  by default; online_partition_columns=true opts into online
  materialization. Until the onlinefs consumer filter ships, the
  backend rejects partitioned_by + online_enabled=true with the
  default online_partition_columns=false. Document both
  workarounds.
- Hudi: partitioned_by + HUDI is rejected at creation; Hudi support
  is tracked under a separate follow-up ticket.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jimdowling jimdowling changed the title [HWORKS-2802] Document partitioned_by parameter on feature group creation [HWORKS-2802 / -2807] Document partitioned_by parameter on feature group creation May 21, 2026
jimdowling and others added 7 commits May 30, 2026 11:43
https://hopsworks.atlassian.net/browse/HWORKS-2802

The partitioned_by section described Delta GENERATED ALWAYS AS columns and
storage-engine-side derivation, which is no longer how it works. Document
the real design: the client derives the grain columns from event_time and
writes them as real partition columns, pruning works natively on grain
filters and via predicate translation on event_time ranges. Correct the
online-store note: online-enabled partitioned_by feature groups are
rejected entirely until HWORKS-2808, not only with the default
online_partition_columns.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
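
As a rough illustration of the predicate translation this commit message describes, the sketch below shows why an event_time range can be turned into grain predicates for hierarchical specs. It is a simplified model, not the query-layer implementation: `is_hierarchical`, `grain_key`, and `prune_partitions` are hypothetical names, partitions are modeled as plain dicts, and the range is assumed to be inclusive at both ends.

```python
from datetime import datetime

# Hierarchical specs are prefixes of this ordering, as listed in the PR:
# year / year+month / year+month+day / year+month+day+hour.
HIERARCHY = ("year", "month", "day", "hour")


def is_hierarchical(spec):
    """True when the spec is a non-empty prefix of year, month, day, hour."""
    return len(spec) > 0 and tuple(spec) == HIERARCHY[: len(spec)]


def grain_key(ts, spec):
    """Project a timestamp onto the grains of the spec, e.g. (2026, 5)."""
    return tuple(getattr(ts, grain) for grain in spec)


def prune_partitions(partitions, spec, start, end):
    """Keep only partitions that can hold rows with start <= event_time <= end.

    For a hierarchical spec, tuple ordering of the grain values matches time
    ordering, so the range becomes a simple bound on the grain tuple. For a
    non-hierarchical spec no translation is attempted and every partition is
    kept (only direct grain filters would prune).
    """
    if not is_hierarchical(spec):
        return list(partitions)
    low, high = grain_key(start, spec), grain_key(end, spec)
    return [p for p in partitions if low <= tuple(p[g] for g in spec) <= high]


partitions = [{"year": 2026, "month": m} for m in (3, 4, 5, 6)]
kept = prune_partitions(
    partitions, ["year", "month"], datetime(2026, 4, 15), datetime(2026, 5, 2)
)
print(kept)
```

A spec such as ["month"] breaks the ordering argument, because month 1 of a later year sorts before month 12 of an earlier one, which is why only hierarchical specs benefit from the rewrite.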
…note

https://hopsworks.atlassian.net/browse/HWORKS-2802

The Hudi follow-up materializes the grain columns server-side and
partitions on them directly; the CustomKeyGenerator phrasing described
a mechanism the revised design no longer uses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@jimdowling jimdowling marked this pull request as ready for review June 11, 2026 04:35
@jimdowling jimdowling requested a review from Copilot June 11, 2026 04:35

Copilot AI left a comment

Contributor

Pull request overview

Adds documentation to the Feature Group creation guide describing the new partitioned_by parameter for time-grain partitioning.

Changes:

  • Introduces a new “Time-grain partitioning with partitioned_by” section with a Python usage example.
  • Documents partition-pruning behavior for hierarchical vs non-hierarchical grain specs.
  • Adds notes about online feature store and Hudi behavior (currently conflicting with the PR description).

Comment thread docs/user_guides/fs/feature_group/create.md Outdated
Comment thread docs/user_guides/fs/feature_group/create.md Outdated
Comment thread docs/user_guides/fs/feature_group/create.md Outdated
Comment thread docs/user_guides/fs/feature_group/create.md Outdated
Comment thread docs/user_guides/fs/feature_group/create.md Outdated
Comment thread docs/user_guides/fs/feature_group/create.md
Comment thread docs/user_guides/fs/feature_group/create.md Outdated
jimdowling and others added 4 commits June 11, 2026 06:41
https://hopsworks.atlassian.net/browse/HWORKS-2802

Flesh out the partitioned_by section into reference for the shipped
feature: the parameter list (partitioned_by + online_partition_columns
with their constraints), cross-session persistence and the round-trip
through get_feature_group, the on-disk Hive layout, a read/partition-
pruning example with the hierarchical-vs-non-hierarchical matrix, a
clickstream-by-hour example, and the current online and Hudi
limitations (online rejected at create and on enable).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
https://hopsworks.atlassian.net/browse/HWORKS-2807

partitioned_by now works on DELTA and ICEBERG; NONE is rejected alongside
Hudi. Update the section heading, supported-formats note, and the Hudi
fallback guidance.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
https://hopsworks.atlassian.net/browse/HWORKS-2807

Non-stream Hudi feature groups now support partitioned_by (direct Spark
write); stream feature groups and NONE are rejected. Update the section
heading, supported-formats note, Hudi note, and add a stream note.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment thread docs/user_guides/fs/feature_group/create.md
Comment thread docs/user_guides/fs/feature_group/create.md
Comment thread docs/user_guides/fs/feature_group/create.md Outdated
jimdowling and others added 4 commits June 15, 2026 10:43
…eature views

https://hopsworks.atlassian.net/browse/HWORKS-2802

Document that the hour grain requires a timestamp event_time (rejected on
a date event_time), and that a feature view may select the derived grain
columns even when it joins online-enabled feature groups: the grains are
served from the offline store (training data, batch inference) and
excluded from the online feature vector.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
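
The validation rules from the PR summary, together with the hour-grain rule documented in this commit, can be sketched as a single check. This is an illustrative stand-in, not the client or backend validator: the function name, its parameters, the use of ValueError, and the exact grain set (year, month, week, day, hour, inferred from the examples in this PR) are all assumptions.

```python
# Assumed grain enum, inferred from the specs mentioned in this PR.
ALLOWED_GRAINS = {"year", "month", "week", "day", "hour"}


def validate_partitioned_by(
    partitioned_by, *, event_time, event_time_type, partition_key, feature_names
):
    """Raise ValueError when a partitioned_by spec breaks one of the rules.

    event_time_type is a hypothetical string, either "timestamp" or "date".
    """
    if partition_key:
        raise ValueError("partitioned_by is mutually exclusive with partition_key")
    if not event_time:
        raise ValueError("partitioned_by requires event_time")

    unknown = [g for g in partitioned_by if g not in ALLOWED_GRAINS]
    if unknown:
        raise ValueError(f"unknown grains: {unknown}")
    if len(set(partitioned_by)) != len(partitioned_by):
        raise ValueError("partitioned_by must not contain duplicate grains")

    # A grain column may not shadow event_time or an existing feature.
    collisions = [g for g in partitioned_by if g == event_time or g in feature_names]
    if collisions:
        raise ValueError(f"grains collide with existing columns: {collisions}")

    # A date has no hour component, so the hour grain needs a timestamp.
    if "hour" in partitioned_by and event_time_type != "timestamp":
        raise ValueError("the hour grain requires a timestamp event_time")


# A valid hierarchical spec passes silently.
validate_partitioned_by(
    ["year", "month", "day", "hour"],
    event_time="ts",
    event_time_type="timestamp",
    partition_key=None,
    feature_names=["user_id", "amount"],
)
```

Checking everything up front means a bad spec fails at feature group creation rather than on the first write.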
https://hopsworks.atlassian.net/browse/HWORKS-2802

Document that the feature group overview shows a Table DDL card with the
Spark SQL CREATE TABLE for the offline table (format + partition columns)
and the RonDB CREATE TABLE for the online table when online-enabled.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…y grains

https://hopsworks.atlassian.net/browse/HWORKS-2802

Selecting a derived partitioned_by grain column into a feature view does
not silently exclude it from the online vector: get_feature_vector and
get_feature_vectors raise a FeatureStoreException, and feature-view
creation warns when the view also joins an online-enabled feature group.
The grains remain available offline (training data, batch inference).

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jimdowling jimdowling changed the title [HWORKS-2802 / -2807] Document partitioned_by parameter on feature group creation [HWORKS-2802 / -2807] Document partitioned_by on feature group creation Jun 18, 2026
https://hopsworks.atlassian.net/browse/HWORKS-2802

Reword the hourly-partitioning example so "clickstream" is not mistaken
for a stream feature group (stream=True), which partitioned_by does not
yet support.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
