Skip to content

feat(graph): add suggest-collection command for collection optimization#1030

Open
smoparth wants to merge 1 commit intopython-wheel-build:mainfrom
smoparth:feat/suggest-collection
Open

feat(graph): add suggest-collection command for collection optimization#1030
smoparth wants to merge 1 commit intopython-wheel-build:mainfrom
smoparth:feat/suggest-collection

Conversation

@smoparth
Copy link
Copy Markdown
Contributor

@smoparth smoparth commented Apr 6, 2026

Pull Request Description

What

Add fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend the best-fit collection for each package. Helps pipeline maintainers make data-driven decisions when assigning new packages to permanent collections.

  • Iterative DFS traversal over all edge types for full dependency closure
  • Ranks collections by fewest new packages, then highest coverage
  • Rich table (default) and JSON output formats
  • Error handling for bad graph files and duplicate collection names
  • Tests covering helpers, CLI, and analysis logic

Why

Closes: #971

@smoparth smoparth requested a review from a team as a code owner April 6, 2026 19:26
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 6, 2026

📝 Walkthrough

Walkthrough

Adds a new CLI command fromager graph suggest-collection with --format (table/json). The command loads an onboarding DependencyGraph, finds top-level packages from ROOT, computes each package’s transitive dependency closure via a new DependencyNode.iter_all_dependencies() and get_dependency_closure(), and compares closures to package sets extracted from one or more collection graphs (get_package_names() and extract_collection_name()). It scores collections per package by new vs existing packages and coverage percentage, ranks results, and emits either JSON or a Rich table. New unit and integration tests cover closure, name extraction, scoring, and CLI output.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main feature: adding a suggest-collection command for optimizing collection organization, which aligns with the primary change in the changeset.
Description check ✅ Passed The description is directly related to the changeset, explaining the suggest-collection feature, its implementation approach, and linking to issue #971.
Linked Issues check ✅ Passed The PR implements all major requirements from issue #971: CLI command with table/json formats, dependency closure computation via iterative DFS, ranking logic, ranking by new_packages then coverage_percentage, and comprehensive helper functions and tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to implementing the suggest-collection command and its supporting infrastructure; no out-of-scope modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify bot added the ci label Apr 6, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fromager/commands/graph.py`:
- Around line 1031-1033: The current check in the onboarding output path always
prints a human message when toplevel_nodes is empty; update the block that
references toplevel_nodes so that when output_format == "json" it emits a valid
JSON empty list (e.g., using click.echo("[]")) and returns, otherwise keep the
human-friendly message "No top-level packages found in onboarding graph." This
change touches the conditional that checks toplevel_nodes in the function
handling the onboarding graph output and should reference the output_format
variable to decide which output to produce.
- Around line 916-918: The current sort used for selecting best_score
(scores.sort(...); best_score = scores[0]) only considers new_packages and
coverage_percentage, which leaves true ties to fall back to the original
collection_graphs order; make tie-breaking deterministic by adding a stable
filename-derived key as a tertiary sort field (e.g., change the sort key to
(score.new_packages, -score.coverage_percentage, score.filename) or, if Score
objects don’t carry filenames, derive the filename from the corresponding
collection_graphs entry when building scores) so best_score / best_fit is chosen
predictably by filename when the first two metrics tie.

In `@tests/test_suggest_collection.py`:
- Line 5: Replace the fragile regex-based JSON extraction in
tests/test_suggest_collection.py with direct JSON parsing of the entire CLI
output: remove the re import and any use of re.search that extracts a bracketed
block, and instead call json.loads(...) on the full output string produced by
the CLI (e.g., the variable holding stdout or `cli_output`/`result.stdout`) in
each affected test (the blocks around lines 24-29, 294-295, 327 etc.); ensure
the tests assert against the parsed dict/list returned by json.loads so the test
fails if extra bytes surround the JSON payload.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 13fe9c39-2bda-4331-802b-d350ab77b34e

📥 Commits

Reviewing files that changed from the base of the PR and between d86f938 and 16bfc24.

📒 Files selected for processing (2)
  • src/fromager/commands/graph.py
  • tests/test_suggest_collection.py

@smoparth smoparth force-pushed the feat/suggest-collection branch 2 times, most recently from 55faca2 to 1836b93 Compare April 6, 2026 20:01
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/fromager/commands/graph.py (1)

1048-1067: Consider loading collection graphs in parallel for large inputs.

For many collection graphs, sequential loading could be slow. This is fine for typical usage but worth noting if performance becomes a concern with many large graphs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fromager/commands/graph.py` around lines 1048 - 1067, Parallelize loading
by extracting collection names first (call extract_collection_name for each
graph_path, check for duplicates) and then use a
concurrent.futures.ThreadPoolExecutor to call
DependencyGraph.from_file(graph_path) for each unique path in parallel; for each
future, on success call get_package_names(collection_graph) and populate
collection_packages[collection_name], and on exception wrap and raise a
click.ClickException that includes the graph_path and original error (preserve
the existing logger.debug call per collection_name). Reference symbols:
collection_graphs, extract_collection_name, DependencyGraph.from_file,
get_package_names, collection_packages.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/fromager/commands/graph.py`:
- Around line 1048-1067: Parallelize loading by extracting collection names
first (call extract_collection_name for each graph_path, check for duplicates)
and then use a concurrent.futures.ThreadPoolExecutor to call
DependencyGraph.from_file(graph_path) for each unique path in parallel; for each
future, on success call get_package_names(collection_graph) and populate
collection_packages[collection_name], and on exception wrap and raise a
click.ClickException that includes the graph_path and original error (preserve
the existing logger.debug call per collection_name). Reference symbols:
collection_graphs, extract_collection_name, DependencyGraph.from_file,
get_package_names, collection_packages.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b23b6d9-1f42-421f-9b08-b2e1bdfe4769

📥 Commits

Reviewing files that changed from the base of the PR and between 16bfc24 and 1836b93.

📒 Files selected for processing (2)
  • src/fromager/commands/graph.py
  • tests/test_suggest_collection.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_suggest_collection.py

Copy link
Copy Markdown

@jlarkin09 jlarkin09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than the one suggestion by coderabbit for json.loads()

@tiran
Copy link
Copy Markdown
Collaborator

tiran commented Apr 7, 2026

I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

@smoparth
Copy link
Copy Markdown
Contributor Author

smoparth commented Apr 7, 2026

I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

@tiran, I had the same thought when I picked this.
@dhellmann, would love to hear your take on right place this feature could land.

@dhellmann
Copy link
Copy Markdown
Member

I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

I want this feature here. Let's think of a better name for it.

print(f"\nBuilding {len(graph)} packages in {rounds} rounds.")


def _traverse_edges(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be a method of DependencyNode.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to DependencyNode.

}


def extract_collection_name(graph_path: str) -> str:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This encodes some very opinionated naming conventions in the filename. How about if we just use the filename as the identifier? Then users could name the files in ways that are meaningful to them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified to just return the filename.

"best_fit": best_score.collection if best_score else "none",
"new_packages": best_score.new_packages if best_score else 0,
"existing_packages": best_score.existing_packages if best_score else 0,
"coverage_percentage": round(best_score.coverage_percentage, 1)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting this expression over multiple lines without any () made it a bit confusing to read.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Wrapped in parentheses.

Collections are ranked by fewest new packages required, then by highest
dependency coverage.

\b
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this \b do something when printed as part of the command help?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its Click's way to preserves line breaks in --help output. Without it, Click displays the argument descriptions into a single paragraph.

if output_format == "json":
click.echo("[]")
else:
click.echo("No top-level packages found in onboarding graph.")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we always send this message to stderr and then always print the data structure the same way, instead of checking the output format here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. The message now always goes to stderr via click.echo(..., err=True) and the normal output path runs regardless. The only trade-off is that it loads the collection graph files before producing an empty result, but that's negligible time.

@rd4398
Copy link
Copy Markdown
Contributor

rd4398 commented Apr 8, 2026

I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.

I want this feature here. Let's think of a better name for it.

How about find-best-fit as a name?

Add fromager graph suggest-collection that analyzes dependency overlap
between onboarding packages and existing collections to recommend the
best-fit collection for each package. Helps pipeline maintainers make
data-driven decisions when assigning new packages to permanent collections.

- Iterative DFS traversal over all edge types for full dependency closure
- Ranks collections by fewest new packages, then highest coverage
- Rich table (default) and JSON output formats
- Error handling for bad graph files and duplicate collection names
- Tests covering helpers, CLI, and analysis logic

Co-Authored-By: Claude-4.6-opus-high <claude@anthropic.com>
Closes: python-wheel-build#971
Signed-off-by: Shanmukh Pawan <smoparth@redhat.com>
@smoparth smoparth force-pushed the feat/suggest-collection branch from 1836b93 to 2ec1705 Compare April 8, 2026 22:16
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/fromager/commands/graph.py`:
- Around line 903-923: The JSON being emitted in the results dict in
src/fromager/commands/graph.py must follow the documented schema: replace the
top-level
best_fit/new_packages/existing_packages/coverage_percentage/all_collections
shape with package, version, total_dependencies, and an ordered collections list
named "collections" where each entry has keys name (use score.collection),
new_packages, existing_packages, total_dependencies (compute per-score total if
available), and coverage_pct (rounded to 1 decimal); remove the flattened
top-level best_* fields from this serialized dict and, if you still need those
flattened values for the human table, compute them in
_print_suggest_collection_table() by taking the first-ranked collection from the
collections list rather than emitting a second JSON shape here.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b7011d4c-bcb5-4b92-8f48-db53729ab322

📥 Commits

Reviewing files that changed from the base of the PR and between 1836b93 and 2ec1705.

📒 Files selected for processing (3)
  • src/fromager/commands/graph.py
  • src/fromager/dependency_graph.py
  • tests/test_suggest_collection.py
✅ Files skipped from review due to trivial changes (1)
  • tests/test_suggest_collection.py

Comment on lines +903 to +923
results.append(
{
"package": str(node.canonicalized_name),
"version": str(node.version),
"total_dependencies": total_dependency_count,
"best_fit": best_score.collection if best_score else "none",
"new_packages": best_score.new_packages if best_score else 0,
"existing_packages": best_score.existing_packages if best_score else 0,
"coverage_percentage": (
round(best_score.coverage_percentage, 1) if best_score else 0.0
),
"all_collections": [
{
"collection": score.collection,
"new_packages": score.new_packages,
"existing_packages": score.existing_packages,
"coverage_percentage": round(score.coverage_percentage, 1),
}
for score in scores
],
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve the documented JSON schema.

The payload built here doesn't match the --format json contract from issue #971. It emits top-level best_fit/new_packages/all_collections, but the advertised machine-readable shape is package, version, total_dependencies, plus an ordered collections list whose entries include name, new_packages, existing_packages, total_dependencies, and coverage_pct. That will break consumers expecting the documented structure.

Suggested fix
         results.append(
             {
                 "package": str(node.canonicalized_name),
                 "version": str(node.version),
                 "total_dependencies": total_dependency_count,
-                "best_fit": best_score.collection if best_score else "none",
-                "new_packages": best_score.new_packages if best_score else 0,
-                "existing_packages": best_score.existing_packages if best_score else 0,
-                "coverage_percentage": (
-                    round(best_score.coverage_percentage, 1) if best_score else 0.0
-                ),
-                "all_collections": [
+                "collections": [
                     {
-                        "collection": score.collection,
+                        "name": score.collection,
                         "new_packages": score.new_packages,
                         "existing_packages": score.existing_packages,
-                        "coverage_percentage": round(score.coverage_percentage, 1),
+                        "total_dependencies": total_dependency_count,
+                        "coverage_pct": round(score.coverage_percentage, 1),
                     }
                     for score in scores
                 ],
             }
         )

If you still want the flattened fields for table rendering, derive them in _print_suggest_collection_table() from the first ranked collection instead of serializing a second JSON shape here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/fromager/commands/graph.py` around lines 903 - 923, The JSON being
emitted in the results dict in src/fromager/commands/graph.py must follow the
documented schema: replace the top-level
best_fit/new_packages/existing_packages/coverage_percentage/all_collections
shape with package, version, total_dependencies, and an ordered collections list
named "collections" where each entry has keys name (use score.collection),
new_packages, existing_packages, total_dependencies (compute per-score total if
available), and coverage_pct (rounded to 1 decimal); remove the flattened
top-level best_* fields from this serialized dict and, if you still need those
flattened values for the human table, compute them in
_print_suggest_collection_table() by taking the first-ranked collection from the
collections list rather than emitting a second JSON shape here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add graph suggest-collection command to optimize collection organization

5 participants