[HWORKS-2802 / -2807] Document partitioned_by on feature group creation #585
Open
jimdowling wants to merge 17 commits into
…tion
https://hopsworks.atlassian.net/browse/HWORKS-2802

Add a section to docs/user_guides/fs/feature_group/create.md describing the storage-engine-native partitioned_by parameter for Delta feature groups. Covers:

- Usage example with create_feature_group / get_or_create_feature_group.
- The CREATE TABLE … USING DELTA … GENERATED ALWAYS AS … contract: the storage layer derives the partition columns; the user's dataframe never carries them.
- Validation rules: mutual exclusion with partition_key, requires event_time.
- Partition pruning table: Delta auto-derives partition predicates from the GENERATED expressions for hierarchical specs (year / year+month / year+month+day / year+month+day+hour), so `fg.read(start_time=..., end_time=...)` and `fg.filter(fg.event_time >= ...)` prune at the partition level. Non-hierarchical specs (e.g. ["month"], ["year","week"]) are valid but skip the auto-derivation: only direct predicates on the grain columns prune. Recommend hierarchical specs.
- Online feature store behavior: derived columns live offline-only by default; online_partition_columns=true opts into online materialization. Until the onlinefs consumer filter ships, the backend rejects partitioned_by + online_enabled=true with the default online_partition_columns=false. Document both workarounds.
- Hudi: partitioned_by + HUDI is rejected at creation; Hudi support is tracked under a separate follow-up ticket.

Signed-off-by: Jim Dowling <jim@logicalclocks.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
https://hopsworks.atlassian.net/browse/HWORKS-2802 The partitioned_by section described Delta GENERATED ALWAYS AS columns and storage-engine-side derivation, which is no longer how it works. Document the real design: the client derives the grain columns from event_time and writes them as real partition columns, pruning works natively on grain filters and via predicate translation on event_time ranges. Correct the online-store note: online-enabled partitioned_by feature groups are rejected entirely until HWORKS-2808, not only with the default online_partition_columns. Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
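As a rough illustration of the client-side derivation this commit describes, the sketch below computes grain values from an `event_time` field using only the standard library. The helper name `derive_grains`, the extractor table, and the choice of ISO week numbering for `week` are assumptions made for illustration; they are not taken from the Hopsworks client.

```python
from datetime import datetime

# Hypothetical extractor table. The grain names appear in this PR's text; the
# ISO week numbering used for "week" is an assumption.
GRAIN_EXTRACTORS = {
    "year": lambda ts: ts.year,
    "month": lambda ts: ts.month,
    "week": lambda ts: ts.isocalendar()[1],
    "day": lambda ts: ts.day,
    "hour": lambda ts: ts.hour,
}


def derive_grains(row, event_time, partitioned_by):
    """Return a copy of row with one extra column per requested grain."""
    ts = row[event_time]
    derived = {grain: GRAIN_EXTRACTORS[grain](ts) for grain in partitioned_by}
    return {**row, **derived}


row = {"user_id": 7, "event_time": datetime(2026, 6, 18, 14, 30)}
out = derive_grains(row, "event_time", ["year", "month", "day"])
```

The input row carries only `event_time`; the grain columns are added at write time, which matches the contract that the user's dataframe never has to supply them.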
…io into HWORKS-2802
…io into HWORKS-2802
…note https://hopsworks.atlassian.net/browse/HWORKS-2802 The Hudi follow-up materializes the grain columns server-side and partitions on them directly; the CustomKeyGenerator phrasing described a mechanism the revised design no longer uses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…io into HWORKS-2802
…io into HWORKS-2802
Contributor
Pull request overview
Adds documentation to the Feature Group creation guide describing the new partitioned_by parameter for time-grain partitioning.
Changes:
- Introduces a new “Time-grain partitioning with `partitioned_by`” section with a Python usage example.
- Documents partition-pruning behavior for hierarchical vs non-hierarchical grain specs.
- Adds notes about online feature store and Hudi behavior (currently conflicting with the PR description).
https://hopsworks.atlassian.net/browse/HWORKS-2802 Flesh out the partitioned_by section into reference for the shipped feature: the parameter list (partitioned_by + online_partition_columns with their constraints), cross-session persistence and the round-trip through get_feature_group, the on-disk Hive layout, a read/partition-pruning example with the hierarchical-vs-non-hierarchical matrix, a clickstream-by-hour example, and the current online and Hudi limitations (online rejected at create and on enable). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
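To make the Hive layout and the effect of pruning concrete, here is a minimal sketch that lists the day-level partition directories an `event_time` range would touch under a hierarchical `year`/`month`/`day` spec. The function name `day_partitions` and the unpadded directory values are assumptions; the actual query layer rewrites predicates rather than enumerating directories, so this only shows which partitions remain in scope.

```python
from datetime import date, timedelta


def day_partitions(start, end):
    """Return Hive-style partition paths for every day in [start, end]."""
    paths = []
    current = start
    while current <= end:
        # Assumed formatting: unpadded integers, as in year=.../month=.../day=.../
        paths.append(
            f"year={current.year}/month={current.month}/day={current.day}/"
        )
        current += timedelta(days=1)
    return paths


paths = day_partitions(date(2026, 6, 29), date(2026, 7, 2))
```

A four-day range that crosses a month boundary resolves to four directories, so a reader only has to scan those instead of the whole table.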
…io into HWORKS-2802
https://hopsworks.atlassian.net/browse/HWORKS-2807 partitioned_by now works on DELTA and ICEBERG; NONE is rejected alongside Hudi. Update the section heading, supported-formats note, and the Hudi fallback guidance. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
https://hopsworks.atlassian.net/browse/HWORKS-2807 Non-stream Hudi feature groups now support partitioned_by (direct Spark write); stream feature groups and NONE are rejected. Update the section heading, supported-formats note, Hudi note, and add a stream note. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…eature views https://hopsworks.atlassian.net/browse/HWORKS-2802 Document that the hour grain requires a timestamp event_time (rejected on a date event_time), and that a feature view may select the derived grain columns even when it joins online-enabled feature groups: the grains are served from the offline store (training data, batch inference) and excluded from the online feature vector. Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…io into HWORKS-2802
https://hopsworks.atlassian.net/browse/HWORKS-2802 Document that the feature group overview shows a Table DDL card with the Spark SQL CREATE TABLE for the offline table (format + partition columns) and the RonDB CREATE TABLE for the online table when online-enabled. Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…y grains https://hopsworks.atlassian.net/browse/HWORKS-2802 Selecting a derived partitioned_by grain column into a feature view does not silently exclude it from the online vector: get_feature_vector and get_feature_vectors raise a FeatureStoreException, and feature-view creation warns when the view also joins an online-enabled feature group. The grains remain available offline (training data, batch inference). Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gibchikafa
approved these changes
Jun 18, 2026
https://hopsworks.atlassian.net/browse/HWORKS-2802 Reword the hourly-partitioning example so "clickstream" is not mistaken for a stream feature group (stream=True), which partitioned_by does not yet support. Signed-off-by: Jim Dowling <jim@logicalclocks.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
User-guide section documenting the `partitioned_by` parameter on feature group creation, under the existing partitioning area in `docs/user_guides/fs/feature_group/create.md`.

Covers:

- Usage with `create_feature_group` / `get_or_create_feature_group`, and the resulting Hive on-disk layout (`year=.../month=.../day=.../`).
- How it works: the client derives the grain columns from `event_time` on each write, and the backend registers them as ordinary partition columns through the normal table-creation path. There are no Delta `GENERATED` columns and no dedicated backend Spark job.
- Validation rules: mutual exclusion with `partition_key`, requires `event_time`, grain enum membership with no duplicates, no collision with `event_time` or an existing feature name, and the `hour` grain requires a `timestamp` `event_time`.
- Partition pruning: filters on the grain columns prune natively, and an `event_time`-range read is rewritten into equivalent grain predicates by the query layer (pruning for hierarchical specs). Includes a hierarchical vs non-hierarchical behavior table.
- Online feature store: online-enabled `partitioned_by` is not supported yet (deferred to HWORKS-2808), so the grains are offline-only and `online_partition_columns` is effectively always `False` today. A feature view may still select a derived grain: it is served offline (training data / batch inference), but `get_feature_vector` / `get_feature_vectors` raise a `FeatureStoreException`, and feature-view creation warns when the view also joins an online-enabled feature group.

Pairs with:

- `offline_only` online groundwork.

JIRA: HWORKS-2802. Engineering walkthrough: Confluence page.
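The validation rules listed in the summary can be sketched as a small standalone checker. Everything here is illustrative: `validate_partitioned_by`, its parameters, the error strings, and the exact contents of the grain enum are assumptions, not the Hopsworks implementation.

```python
# Assumed grain set: these five names appear in this PR's text; the real enum
# may differ.
ALLOWED_GRAINS = ("year", "month", "week", "day", "hour")


def validate_partitioned_by(
    partitioned_by,
    *,
    partition_key=None,
    event_time=None,
    event_time_type=None,
    feature_names=(),
):
    """Return a list of rule violations; an empty list means the spec is valid."""
    errors = []
    if partition_key:
        errors.append("partitioned_by and partition_key are mutually exclusive")
    if event_time is None:
        errors.append("partitioned_by requires event_time")
    unknown = [g for g in partitioned_by if g not in ALLOWED_GRAINS]
    if unknown:
        errors.append(f"unknown grains: {unknown}")
    if len(set(partitioned_by)) != len(partitioned_by):
        errors.append("partitioned_by contains duplicate grains")
    # A grain column may not reuse the event_time name or a feature name.
    taken = set(feature_names) | {event_time}
    collisions = sorted(set(partitioned_by) & taken)
    if collisions:
        errors.append(f"grain names collide with existing columns: {collisions}")
    if "hour" in partitioned_by and event_time_type != "timestamp":
        errors.append("the hour grain requires a timestamp event_time")
    return errors


ok = validate_partitioned_by(
    ["year", "month", "day"],
    event_time="event_time",
    event_time_type="timestamp",
    feature_names=["user_id", "amount"],
)
bad = validate_partitioned_by(
    ["year", "hour", "hour"],
    partition_key=["country"],
    event_time="event_time",
    event_time_type="date",
    feature_names=["user_id", "year"],
)
```

Collecting all violations instead of raising on the first one is a design choice of this sketch; it makes it easier to see how the rules interact on a single bad spec.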
Test plan
- `hopsworks-docs markdownlint` clean (221 files, 0 errors).
- `hopsworks-docs snakeoil` clean (Python code blocks pass ruff at line length 88).
- `hopsworks-docs check` (mkdocs strict build) clean: API cross-references resolve, nav intact, no broken links.

🤖 Generated with Claude Code