-
Notifications
You must be signed in to change notification settings - Fork 186
pack-objects: integrate --path-walk and some --filter options #2101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: en/backfill-fixes-and-edges
Are you sure you want to change the base?
Changes from all commits
0840110
d7c8754
fb8a0f9
e77c8a6
f4904f8
f37467e
133c1b1
0f517be
b4dc09a
0b1eed0
b23244c
7e1e503
a615b1a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -402,9 +402,11 @@ will be automatically changed to version `1`. | |
| of filenames that cause collisions in Git's default name-hash | ||
| algorithm. | ||
| + | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taylor Blau wrote on the Git mailing list (how to reply to this email): On Mon, May 11, 2026 at 06:13:04PM +0000, Derrick Stolee via GitGitGadget wrote:
> [...] Blobs whose size cannot be determined (e.g. missing in a partial
> clone) are conservatively included, matching the existing filter
> behavior.
Makes sense, but...
> Notice that this inspection of object sizes requires the content to be
> present in the repository. The odb_read_object_info() call will download
> a missing blob on-demand.
... this says that we do download missing blobs on-demand. Should we be
(temporarily) disabling 'fetch_if_missing' for this phase, or using
odb_read_object_info_extended() with the OBJECT_INFO_SKIP_FETCH_OBJECT
bit set?
I don't know enough about 'git backfill' to know whether the current
behavior is more reasonable than the above suggestion, so please let me
know if I'm missing something here!
The rest looks good to me.
Thanks,
TaylorThere was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Derrick Stolee wrote on the Git mailing list (how to reply to this email): On 5/11/2026 9:33 PM, Taylor Blau wrote:
> On Mon, May 11, 2026 at 06:13:04PM +0000, Derrick Stolee via GitGitGadget wrote:
>> [...] Blobs whose size cannot be determined (e.g. missing in a partial
>> clone) are conservatively included, matching the existing filter
>> behavior.
>
> Makes sense, but...
>
>> Notice that this inspection of object sizes requires the content to be
>> present in the repository. The odb_read_object_info() call will download
>> a missing blob on-demand.
>
> ... this says that we do download missing blobs on-demand. Should we be
> (temporarily) disabling 'fetch_if_missing' for this phase, or using
> odb_read_object_info_extended() with the OBJECT_INFO_SKIP_FETCH_OBJECT
> bit set?
I don't know that we'd want to set this everywhere. The case that is
technically correct, but probably not ideal, is someone running 'git
pack-objects --filter=blob:<size>' from a blobless partial clone. They
are asking for something inefficient to create, but we _can_ still
create it.
> I don't know enough about 'git backfill' to know whether the current
> behavior is more reasonable than the above suggestion, so please let me
> know if I'm missing something here!
For 'git backfill', a size limit filter doesn't make sense as it's all
about the client making decisions about how to download batches of
objects based on the local tree data. Filtering on size isn't something
that the client can infer in advance of downloading everything. Further,
the direct blob requests to the server can't be filtered by size, by
design.
Thanks,
-Stolee
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taylor Blau wrote on the Git mailing list (how to reply to this email): On Mon, May 11, 2026 at 06:13:07PM +0000, Taylor Blau via GitGitGadget wrote:
> @@ -534,6 +545,18 @@ static int setup_pending_objects(struct path_walk_info *info,
> free(tagged_blobs);
> }
> }
> + if (tagged_trees) {
> + if (tagged_trees->oids.nr) {
> + const char *tagged_tree_path = "/tagged-trees";
> + tagged_trees->type = OBJ_TREE;
> + tagged_trees->maybe_interesting = 1;
> + strmap_put(&ctx->paths_to_lists, tagged_tree_path, tagged_trees);
> + push_to_stack(ctx, tagged_tree_path);
> + } else {
> + oid_array_clear(&tagged_trees->oids);
> + free(tagged_trees);
> + }
> + }
> if (tags) {
> if (tags->oids.nr) {
> const char *tag_path = "/tags";
It looks like there is some prior art here for enumerating a sentinel
path for "/tags", but I am curious why we did the same for
directly-listed trees in the presence of --filter=tree:0.
Thanks,
Taylor |
||
| Incompatible with `--delta-islands`, `--shallow`, or `--filter`. The | ||
| `--use-bitmap-index` option will be ignored in the presence of | ||
| `--path-walk.` | ||
| Incompatible with `--delta-islands`. The `--use-bitmap-index` option is | ||
| ignored in the presence of `--path-walk`. The `--path-walk` option | ||
| supports the `--filter=<spec>` forms `blob:none`, `blob:limit=<n>`, | ||
| `tree:0`, `object:type=<type>`, and `sparse:<oid>`. These supported filter | ||
| types can be combined with the `combine:<spec>+<spec>` form. | ||
|
|
||
|
|
||
| DELTA ISLANDS | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -48,6 +48,13 @@ commits. | |
| applications could disable some options to make it simpler to walk | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taylor Blau wrote on the Git mailing list (how to reply to this email): On Mon, May 11, 2026 at 06:13:01PM +0000, Derrick Stolee via GitGitGadget wrote:
> We can tell that a path is part of the directly-referenced objects if its
> path name starts with '/' (other paths, including root trees never have this
> starting character). Create a path_is_for_direct_objects() to make this
> meaning clear, especially as we add more references in the future as we
> integrate the path-walk API with partial clone filter options.
I don't know that I have anything in the way of a better suggestion, but
I can't help but feel like the `path_is_for_direct_objects()` check is
somewhat brittle as-is.
I am not familiar enough with the path-walk.c internals to come up with
a good suggestion off the top of my head, but I figured I'd raise it
here in case you had thoughts on alternatives.
> diff --git a/path-walk.c b/path-walk.c
> index 6e426af433..59a7670c5b 100644
> --- a/path-walk.c
> +++ b/path-walk.c
> @@ -248,6 +248,16 @@ static int add_tree_entries(struct path_walk_context *ctx,
> return 0;
> }
>
> +/*
> + * Paths starting with '/' (e.g., "/tags", "/tagged-blobs") hold objects that
> + * were directly requested by 'pending' objects rather than discovered during
> + * tree traversal.
> + */
> +static int path_is_for_direct_objects(const char *path)
> +{
> + return path[0] == '/';
If we do end up keeping this approach, should we have a NULL check on
path itself here? I think that could even be an ASSERT(), since
something has gone wrong if we have a NULL at this point, but I'd rather
die by an assertion rather than a segfault here if so.
Thanks,
TaylorThere was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Derrick Stolee wrote on the Git mailing list (how to reply to this email): On 5/11/2026 9:23 PM, Taylor Blau wrote:
> On Mon, May 11, 2026 at 06:13:01PM +0000, Derrick Stolee via GitGitGadget wrote:
>> We can tell that a path is part of the directly-referenced objects if its
>> path name starts with '/' (other paths, including root trees never have this
>> starting character). Create a path_is_for_direct_objects() to make this
>> meaning clear, especially as we add more references in the future as we
>> integrate the path-walk API with partial clone filter options.
>
> I don't know that I have anything in the way of a better suggestion, but
> I can't help but feel like the `path_is_for_direct_objects()` check is
> somewhat brittle as-is.
>
> I am not familiar enough with the path-walk.c internals to come up with
> a good suggestion off the top of my head, but I figured I'd raise it
> here in case you had thoughts on alternatives.
The brittle-ness is due to how trees and blobs without known paths are
added to sets that are emitted using distinct paths. The starting '/'
character does prevent the "/tagged-blobs" and "/tagged-trees" sets
from ever colliding with valid paths that were discovered starting from
root trees (with path "").
I could imagine a world where we store a more robust struct that has a
flag member to indicate that these objects were not discovered via
normal tree walking. Changing that could lead to this method not doing
what we expect it to do. But we are not in that future.
>> diff --git a/path-walk.c b/path-walk.c
>> index 6e426af433..59a7670c5b 100644
>> --- a/path-walk.c
>> +++ b/path-walk.c
>> @@ -248,6 +248,16 @@ static int add_tree_entries(struct path_walk_context *ctx,
>> return 0;
>> }
>>
>> +/*
>> + * Paths starting with '/' (e.g., "/tags", "/tagged-blobs") hold objects that
>> + * were directly requested by 'pending' objects rather than discovered during
>> + * tree traversal.
>> + */
>> +static int path_is_for_direct_objects(const char *path)
>> +{
>> + return path[0] == '/';
>
> If we do end up keeping this approach, should we have a NULL check on
> path itself here? I think that could even be an ASSERT(), since
> something has gone wrong if we have a NULL at this point, but I'd rather
> die by an assertion rather than a segfault here if so.
The ASSERT() is a good idea to prevent incorrect use.
Thanks,
-Stolee |
||
| the objects or to have fewer calls to `path_fn`. | ||
| + | ||
| Note that objects directly requested as pending objects (such as targets | ||
| of lightweight tags or other ref tips) are always emitted to `path_fn`, | ||
| even when the corresponding type flag is disabled. Only objects | ||
| discovered during the tree walk are subject to these type filters. This | ||
| ensures that objects specifically requested through the revision input | ||
| are never silently dropped. | ||
| + | ||
| While it is possible to walk only commits in this way, consumers would be | ||
| better off using the revision walk API instead. | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -96,9 +96,10 @@ static void reject_unsupported_rev_list_options(struct rev_info *revs) | |
| if (revs->explicit_diff_merges) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taylor Blau wrote on the Git mailing list (how to reply to this email): On Mon, May 11, 2026 at 06:13:03PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> The 'git backfill' command uses the path-walk API in a critical way: it
> uses the objects output from the command to find the batches of missing
> objects that should be requested from the server. Unlike 'git
> pack-objects', we cannot fall back to another mechanism.
>
> The previous change added the path_walk_filter_compatible() method that
> we can reuse here. Use it during argument validation in cmd_backfill().
Makes sense.
> @@ -96,9 +96,8 @@ static void reject_unsupported_rev_list_options(struct rev_info *revs)
> if (revs->explicit_diff_merges)
> die(_("'%s' cannot be used with 'git backfill'"),
> "--diff-merges");
> - if (revs->filter.choice)
> - die(_("'%s' cannot be used with 'git backfill'"),
> - "--filter");
> + if (!path_walk_filter_compatible(&revs->filter))
> + die(_("cannot backfill with these filter options"));
I was going to suggest that we indicate the type of object filter which
was incompatible, but that gets a little tricky if the incompatible
filter is a child of a LOFC_COMBINE filter.
Resolving that does not seem worth our while, so I think that what you
wrote here is more than sufficient.
Thanks,
Taylor |
||
| die(_("'%s' cannot be used with 'git backfill'"), | ||
| "--diff-merges"); | ||
| if (revs->filter.choice) | ||
| die(_("'%s' cannot be used with 'git backfill'"), | ||
| "--filter"); | ||
| if (!path_walk_filter_compatible(&revs->filter)) | ||
| die(_("cannot backfill with these filter options")); | ||
| if (revs->filter.blob_limit_value) | ||
| die(_("cannot backfill with blob size limits")); | ||
| } | ||
|
|
||
| static int do_backfill(struct backfill_context *ctx) | ||
|
|
@@ -108,6 +109,7 @@ static int do_backfill(struct backfill_context *ctx) | |
|
|
||
| if (ctx->sparse) { | ||
| CALLOC_ARRAY(info.pl, 1); | ||
| info.pl_sparse_trees = 1; | ||
| if (get_sparse_checkout_patterns(info.pl)) { | ||
| path_walk_info_clear(&info); | ||
| return error(_("problem loading sparse-checkout")); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4754,7 +4754,7 @@ static int add_objects_by_path(const char *path, | |
| return 0; | ||
| } | ||
|
|
||
| static void get_object_list_path_walk(struct rev_info *revs) | ||
| static int get_object_list_path_walk(struct rev_info *revs) | ||
| { | ||
| struct path_walk_info info = PATH_WALK_INFO_INIT; | ||
| unsigned int processed = 0; | ||
|
|
@@ -4777,8 +4777,9 @@ static void get_object_list_path_walk(struct rev_info *revs) | |
| result = walk_objects_by_path(&info); | ||
| trace2_region_leave("pack-objects", "path-walk", revs->repo); | ||
|
|
||
| if (result) | ||
| die(_("failed to pack objects via path-walk")); | ||
| path_walk_info_clear(&info); | ||
|
|
||
| return result; | ||
| } | ||
|
|
||
| static void get_object_list(struct rev_info *revs, struct strvec *argv) | ||
|
|
@@ -4841,8 +4842,13 @@ static void get_object_list(struct rev_info *revs, struct strvec *argv) | |
| fn_show_object = show_object; | ||
|
|
||
| if (path_walk) { | ||
| get_object_list_path_walk(revs); | ||
| } else { | ||
| if (get_object_list_path_walk(revs)) { | ||
| warning(_("failed to pack objects via path-walk")); | ||
| path_walk = 0; | ||
| } | ||
| } | ||
|
|
||
| if (!path_walk) { | ||
| if (prepare_revision_walk(revs)) | ||
| die(_("revision walk setup failed")); | ||
| mark_edges_uninteresting(revs, show_edge, sparse); | ||
|
|
@@ -5177,7 +5183,7 @@ int cmd_pack_objects(int argc, | |
|
|
||
| if (path_walk) { | ||
| const char *option = NULL; | ||
| if (filter_options.choice) | ||
| if (!path_walk_filter_compatible(&filter_options)) | ||
| option = "--filter"; | ||
| else if (use_delta_islands) | ||
| option = "--delta-islands"; | ||
|
|
@@ -5190,10 +5196,7 @@ int cmd_pack_objects(int argc, | |
| } | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: Derrick Stolee <stolee@gmail.com>
>
> When 'git pack-objects' has the --path-walk option enabled, it uses a
> different set of revision walk parameters than normal. For once,
"once" -> "one" (or "instance")?
> --objects was previously assumed by the path-walk API and was not needed
> to be added. We also needed --boundary to allow discovering
> UNINTERESTING objects to use as delta bases.
>
> We will be updating the path-walk API soon to work with some filter
> options. However, the revision machinery will trigger a fatal error:
>
> fatal: object filtering requires --objects
>
> The fix is easy: add the --objects option as an argument. This has no
> effect on the path-walk API but does simplify the revision option
> parsing for the objects filter.
>
> We can remove the comment about "removing" the options because they were
> never removed and instead not added. We still need to disable using
> bitmaps.
In the old code, there was a valid reason why bitmaps were not used
(i.e., "--objects" not enabled), but that no longer holds (i.e., now
we add "--objects" ourselves). Do we need to give an updated
rationale to keep bitmap disabled?
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
> builtin/pack-objects.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index dd2480a73d..4338962904 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -5190,10 +5190,7 @@ int cmd_pack_objects(int argc,
> }
> if (path_walk) {
> strvec_push(&rp, "--boundary");
> - /*
> - * We must disable the bitmaps because we are removing
> - * the --objects / --objects-edge[-aggressive] options.
> - */
> + strvec_push(&rp, "--objects");
> use_bitmap_index = 0;
> } else if (thin) {
> use_internal_rev_list = 1;There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Derrick Stolee wrote on the Git mailing list (how to reply to this email): On 5/3/2026 8:49 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> When 'git pack-objects' has the --path-walk option enabled, it uses a
>> different set of revision walk parameters than normal. For once,
>
> "once" -> "one" (or "instance")?
Yes, "one". Sorry for the typo.
>> --objects was previously assumed by the path-walk API and was not needed
>> to be added. We also needed --boundary to allow discovering
>> UNINTERESTING objects to use as delta bases.
>>
>> We will be updating the path-walk API soon to work with some filter
>> options. However, the revision machinery will trigger a fatal error:
>>
>> fatal: object filtering requires --objects
>>
>> The fix is easy: add the --objects option as an argument. This has no
>> effect on the path-walk API but does simplify the revision option
>> parsing for the objects filter.
>>
>> We can remove the comment about "removing" the options because they were
>> never removed and instead not added. We still need to disable using
>> bitmaps.
>
> In the old code, there was a valid reason why bitmaps were not used
> (i.e., "--objects" not enabled), but that no longer holds (i.e., now
> we add "--objects" ourselves). Do we need to give an updated
> rationale to keep bitmap disabled?
>> if (path_walk) {
>> strvec_push(&rp, "--boundary");
>> - /*
>> - * We must disable the bitmaps because we are removing
>> - * the --objects / --objects-edge[-aggressive] options.
>> - */
>> + strvec_push(&rp, "--objects");
>> use_bitmap_index = 0;
>> } else if (thin) {
This old comment is perhaps confusing things. The important thing here
is to disable bitmaps with 'use_bitmap_index = 0;' (though perhaps not
for long [1]).
[1] https://lore.kernel.org/git/f50f8df01a9f216d5b4388b2fe4ff58077b574f3.1777853408.git.me@ttaylorr.com/
The path-walk API itself disables the objects walk for the revision
machinery in walk_objects_by_path():
info->revs->blob_objects = info->revs->tree_objects = 0;
This allows the path-walk API to rely on the revision walk for a
_commits only_ walk and then have the path-walk API handle the trees
and blobs.
The reason we need to add "--objects" now is to allow for parsing the
"--filter" option without the revision logic complaining.
Thanks,
-StoleeThere was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Taylor Blau wrote on the Git mailing list (how to reply to this email): On Mon, May 11, 2026 at 06:12:59PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <stolee@gmail.com>
>
> When 'git pack-objects' has the --path-walk option enabled, it uses a
> different set of revision walk parameters than normal. For once,
s/once/one/ ? Not sure.
> --objects was previously assumed by the path-walk API and was not needed
s/was not/did not/ ? Also not sure.
> ---
> builtin/pack-objects.c | 5 +----
> 1 file changed, 1 insertion(+), 4 deletions(-)
Looks good.
Thanks,
Taylor |
||
| if (path_walk) { | ||
| strvec_push(&rp, "--boundary"); | ||
| /* | ||
| * We must disable the bitmaps because we are removing | ||
| * the --objects / --objects-edge[-aggressive] options. | ||
| */ | ||
| strvec_push(&rp, "--objects"); | ||
| use_bitmap_index = 0; | ||
| } else if (thin) { | ||
| use_internal_rev_list = 1; | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taylor Blau wrote on the Git mailing list (how to reply to this email):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Derrick Stolee wrote on the Git mailing list (how to reply to this email):