feat(graph): add suggest-collection command for collection optimization #1030
smoparth wants to merge 1 commit into python-wheel-build:main from
Conversation
📝 Walkthrough
Adds a new CLI command.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/fromager/commands/graph.py`:
- Around line 1031-1033: The current check in the onboarding output path always
prints a human message when toplevel_nodes is empty; update the block that
references toplevel_nodes so that when output_format == "json" it emits a valid
JSON empty list (e.g., using click.echo("[]")) and returns, otherwise keep the
human-friendly message "No top-level packages found in onboarding graph." This
change touches the conditional that checks toplevel_nodes in the function
handling the onboarding graph output and should reference the output_format
variable to decide which output to produce.
- Around line 916-918: The current sort used for selecting best_score
(scores.sort(...); best_score = scores[0]) only considers new_packages and
coverage_percentage, which leaves true ties to fall back to the original
collection_graphs order; make tie-breaking deterministic by adding a stable
filename-derived key as a tertiary sort field (e.g., change the sort key to
(score.new_packages, -score.coverage_percentage, score.filename) or, if Score
objects don’t carry filenames, derive the filename from the corresponding
collection_graphs entry when building scores) so best_score / best_fit is chosen
predictably by filename when the first two metrics tie.
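The deterministic tie-break this comment asks for can be sketched as follows. The `Score` tuple and the sample data are hypothetical stand-ins for whatever the command actually builds:

```python
from typing import NamedTuple

class Score(NamedTuple):
    collection: str
    new_packages: int
    coverage_percentage: float
    filename: str  # stable tertiary tie-breaker

scores = [
    Score("web", 3, 80.0, "web.json"),
    Score("ai", 3, 80.0, "ai.json"),
    Score("base", 5, 95.0, "base.json"),
]

# Primary: fewest new packages; secondary: highest coverage;
# tertiary: filename, so exact ties resolve the same way on every run
# instead of falling back to the input order of collection_graphs.
scores.sort(key=lambda s: (s.new_packages, -s.coverage_percentage, s.filename))
best_score = scores[0]
```

With the filename key, "ai" beats "web" on every run because both tie on the first two metrics and "ai.json" sorts first.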
In `@tests/test_suggest_collection.py`:
- Line 5: Replace the fragile regex-based JSON extraction in
tests/test_suggest_collection.py with direct JSON parsing of the entire CLI
output: remove the re import and any use of re.search that extracts a bracketed
block, and instead call json.loads(...) on the full output string produced by
the CLI (e.g., the variable holding stdout or `cli_output`/`result.stdout`) in
each affected test (the blocks around lines 24-29, 294-295, 327 etc.); ensure
the tests assert against the parsed dict/list returned by json.loads so the test
fails if extra bytes surround the JSON payload.
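The fix described above amounts to parsing the whole captured output rather than regex-hunting for a bracketed block. A sketch, with `cli_output` standing in for the stdout the tests capture:

```python
import json

# Stand-in for the captured CLI stdout in the tests.
cli_output = '[{"package": "example-pkg", "version": "1.0"}]\n'

# Parsing the full string fails loudly if anything surrounds the JSON
# payload, unlike a regex that silently extracts the first bracketed block.
results = json.loads(cli_output)
assert isinstance(results, list)
assert results[0]["package"] == "example-pkg"
```

`json.loads` raises `json.JSONDecodeError` on trailing or leading garbage, which is exactly the failure mode the regex version hid.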
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 13fe9c39-2bda-4331-802b-d350ab77b34e
📒 Files selected for processing (2)
- src/fromager/commands/graph.py
- tests/test_suggest_collection.py
55faca2 to 1836b93 (Compare)
🧹 Nitpick comments (1)
src/fromager/commands/graph.py (1)
1048-1067: Consider loading collection graphs in parallel for large inputs.
For many collection graphs, sequential loading could be slow. This is fine for typical usage but worth noting if performance becomes a concern with many large graphs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/fromager/commands/graph.py`:
- Around line 1048-1067: Parallelize loading by extracting collection names
first (call extract_collection_name for each graph_path, check for duplicates)
and then use a concurrent.futures.ThreadPoolExecutor to call
DependencyGraph.from_file(graph_path) for each unique path in parallel; for each
future, on success call get_package_names(collection_graph) and populate
collection_packages[collection_name], and on exception wrap and raise a
click.ClickException that includes the graph_path and original error (preserve
the existing logger.debug call per collection_name). Reference symbols:
collection_graphs, extract_collection_name, DependencyGraph.from_file,
get_package_names, collection_packages.
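A minimal sketch of the suggested parallel load. The loader here is a stand-in; the real code would call DependencyGraph.from_file and raise click.ClickException instead of RuntimeError:

```python
import concurrent.futures
import json

def load_graph(path: str) -> dict:
    # Stand-in for DependencyGraph.from_file(graph_path).
    with open(path) as f:
        return json.load(f)

def load_all(graph_paths: list[str]) -> dict[str, dict]:
    collection_packages: dict[str, dict] = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Submit every load up front, then collect results as they finish.
        futures = {executor.submit(load_graph, p): p for p in graph_paths}
        for future in concurrent.futures.as_completed(futures):
            path = futures[future]
            try:
                collection_packages[path] = future.result()
            except Exception as err:
                # The real command would raise click.ClickException here,
                # including the graph path and the original error.
                raise RuntimeError(f"failed to load {path}: {err}") from err
    return collection_packages
```

Threads are a reasonable fit here because the work is dominated by file I/O and JSON parsing releases the GIL only partially; for a handful of graphs the sequential version is fine, as the nitpick notes.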
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6b23b6d9-1f42-421f-9b08-b2e1bdfe4769
📒 Files selected for processing (2)
- src/fromager/commands/graph.py
- tests/test_suggest_collection.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/test_suggest_collection.py
jlarkin09
left a comment
LGTM other than the one suggestion by coderabbit for json.loads()
I'm not convinced that it makes sense to add this feature to Fromager upstream. Collections are a downstream concept, not a Fromager feature.
@tiran, I had the same thought when I picked this.
I want this feature here. Let's think of a better name for it.
src/fromager/commands/graph.py (Outdated)

    print(f"\nBuilding {len(graph)} packages in {rounds} rounds.")


def _traverse_edges(
This should probably be a method of DependencyNode.
Moved to DependencyNode.
    }


def extract_collection_name(graph_path: str) -> str:
This encodes some very opinionated naming conventions in the filename. How about if we just use the filename as the identifier? Then users could name the files in ways that are meaningful to them.
Simplified to just return the filename.
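With that simplification the helper reduces to roughly the following (a sketch; the actual function in graph.py may differ):

```python
import pathlib

def extract_collection_name(graph_path: str) -> str:
    # Use the filename itself as the identifier, so users can name
    # collection graph files in ways that are meaningful to them
    # instead of following an opinionated naming convention.
    return pathlib.Path(graph_path).name
```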
src/fromager/commands/graph.py (Outdated)

    "best_fit": best_score.collection if best_score else "none",
    "new_packages": best_score.new_packages if best_score else 0,
    "existing_packages": best_score.existing_packages if best_score else 0,
    "coverage_percentage": round(best_score.coverage_percentage, 1)
Splitting this expression over multiple lines without any () made it a bit confusing to read.
Thanks! Wrapped in parentheses.
    Collections are ranked by fewest new packages required, then by highest
    dependency coverage.

    \b
Does this \b do something when printed as part of the command help?
It's Click's way of preserving line breaks in --help output. Without it, Click collapses the argument descriptions into a single paragraph.
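A minimal illustration of the marker (a hypothetical command, not the actual fromager code):

```python
import click

@click.command()
def demo() -> None:
    """Rank collections for each package.

    \b
    GRAPH_FILE   onboarding graph to analyze
    COLLECTIONS  one or more collection graph files
    """
```

The `\b` on a line by itself tells Click's help formatter not to rewrap the paragraph that follows, so each argument description stays on its own line in `demo --help`.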
src/fromager/commands/graph.py
Outdated
| if output_format == "json": | ||
| click.echo("[]") | ||
| else: | ||
| click.echo("No top-level packages found in onboarding graph.") |
Can we always send this message to stderr and then always print the data structure the same way, instead of checking the output format here?
Updated. The message now always goes to stderr via click.echo(..., err=True) and the normal output path runs regardless of format. The only trade-off is that the collection graph files are loaded before producing an empty result, but that takes negligible time.
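The pattern described here, diagnostics to stderr and data always to stdout in one shape, can be sketched without Click (plain `print` and `sys.stderr` stand in for `click.echo`):

```python
import json
import sys

def emit(results: list[dict], toplevel_found: bool) -> None:
    # Human-facing notice goes to stderr so it never corrupts piped JSON.
    if not toplevel_found:
        print("No top-level packages found in onboarding graph.", file=sys.stderr)
    # The data structure is always printed the same way, even when empty,
    # so consumers never need to special-case the "no results" path.
    print(json.dumps(results))
```

A caller piping stdout to `jq` sees a valid `[]` while the notice still reaches the terminal.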
How about
Add fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend the best-fit collection for each package. Helps pipeline maintainers make data-driven decisions when assigning new packages to permanent collections.

- Iterative DFS traversal over all edge types for full dependency closure
- Ranks collections by fewest new packages, then highest coverage
- Rich table (default) and JSON output formats
- Error handling for bad graph files and duplicate collection names
- Tests covering helpers, CLI, and analysis logic

Co-Authored-By: Claude-4.6-opus-high <claude@anthropic.com>
Closes: python-wheel-build#971
Signed-off-by: Shanmukh Pawan <smoparth@redhat.com>
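The "iterative DFS traversal for full dependency closure" mentioned in the commit message can be sketched like this; the adjacency dict is a hypothetical stand-in for the edges a DependencyGraph node exposes:

```python
def dependency_closure(graph: dict[str, list[str]], start: str) -> set[str]:
    # Iterative DFS with an explicit stack: collects every node reachable
    # from start without recursing, so deep dependency chains cannot hit
    # Python's recursion limit, and cycles are handled by the seen set.
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen - {start}
```

Cycles (here d depends back on a) terminate naturally because visited nodes are skipped.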
1836b93 to 2ec1705 (Compare)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/fromager/commands/graph.py`:
- Around line 903-923: The JSON being emitted in the results dict in
src/fromager/commands/graph.py must follow the documented schema: replace the
top-level
best_fit/new_packages/existing_packages/coverage_percentage/all_collections
shape with package, version, total_dependencies, and an ordered collections list
named "collections" where each entry has keys name (use score.collection),
new_packages, existing_packages, total_dependencies (compute per-score total if
available), and coverage_pct (rounded to 1 decimal); remove the flattened
top-level best_* fields from this serialized dict and, if you still need those
flattened values for the human table, compute them in
_print_suggest_collection_table() by taking the first-ranked collection from the
collections list rather than emitting a second JSON shape here.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b7011d4c-bcb5-4b92-8f48-db53729ab322
📒 Files selected for processing (3)
- src/fromager/commands/graph.py
- src/fromager/dependency_graph.py
- tests/test_suggest_collection.py
✅ Files skipped from review due to trivial changes (1)
- tests/test_suggest_collection.py
results.append(
    {
        "package": str(node.canonicalized_name),
        "version": str(node.version),
        "total_dependencies": total_dependency_count,
        "best_fit": best_score.collection if best_score else "none",
        "new_packages": best_score.new_packages if best_score else 0,
        "existing_packages": best_score.existing_packages if best_score else 0,
        "coverage_percentage": (
            round(best_score.coverage_percentage, 1) if best_score else 0.0
        ),
        "all_collections": [
            {
                "collection": score.collection,
                "new_packages": score.new_packages,
                "existing_packages": score.existing_packages,
                "coverage_percentage": round(score.coverage_percentage, 1),
            }
            for score in scores
        ],
    }
Preserve the documented JSON schema.
The payload built here doesn't match the --format json contract from issue #971. It emits top-level best_fit/new_packages/all_collections, but the advertised machine-readable shape is package, version, total_dependencies, plus an ordered collections list whose entries include name, new_packages, existing_packages, total_dependencies, and coverage_pct. That will break consumers expecting the documented structure.
Suggested fix
 results.append(
     {
         "package": str(node.canonicalized_name),
         "version": str(node.version),
         "total_dependencies": total_dependency_count,
-        "best_fit": best_score.collection if best_score else "none",
-        "new_packages": best_score.new_packages if best_score else 0,
-        "existing_packages": best_score.existing_packages if best_score else 0,
-        "coverage_percentage": (
-            round(best_score.coverage_percentage, 1) if best_score else 0.0
-        ),
-        "all_collections": [
+        "collections": [
             {
-                "collection": score.collection,
+                "name": score.collection,
                 "new_packages": score.new_packages,
                 "existing_packages": score.existing_packages,
-                "coverage_percentage": round(score.coverage_percentage, 1),
+                "total_dependencies": total_dependency_count,
+                "coverage_pct": round(score.coverage_percentage, 1),
             }
             for score in scores
         ],
     }
 )

If you still want the flattened fields for table rendering, derive them in _print_suggest_collection_table() from the first ranked collection instead of serializing a second JSON shape here.
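Under the documented schema, each result entry would be assembled roughly as below. The helper name and the dict-based `scores` are illustrative, not the actual fromager code:

```python
def build_result(package: str, version: str, total_deps: int,
                 scores: list[dict]) -> dict:
    # "scores" is assumed to be ordered best-first; serialize it as the
    # ordered "collections" list the documented --format json contract
    # describes, with no flattened best_* fields at the top level.
    return {
        "package": package,
        "version": version,
        "total_dependencies": total_deps,
        "collections": [
            {
                "name": s["collection"],
                "new_packages": s["new_packages"],
                "existing_packages": s["existing_packages"],
                "total_dependencies": total_deps,
                "coverage_pct": round(s["coverage_percentage"], 1),
            }
            for s in scores
        ],
    }
```

Consumers then read the best fit as `result["collections"][0]` rather than relying on duplicated top-level fields.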
Pull Request Description
What
Add fromager graph suggest-collection that analyzes dependency overlap between onboarding packages and existing collections to recommend the best-fit collection for each package. Helps pipeline maintainers make data-driven decisions when assigning new packages to permanent collections.
Why
Closes: #971