WIP: Source Links Fix #440
Conversation
License Check Results 🚀 The license check job ran with the Bazel command: `bazel run --lockfile_mode=error //src:license-check`
The created documentation from the pull request is available at: docu-html
```diff
@@ -0,0 +1,23 @@
+# *******************************************************************************
```
I probably want to delete this file to make everything easier
Pull request overview
This PR reworks how “source code links” are generated and injected into Sphinx needs, aiming to support both local builds (derive Git hash from the current repo) and combo/ref-integration builds (use per-module metadata such as repo URL + commit hash).
Changes:
- Add metadata support (module name / repo URL / commit hash) to NeedLink JSON handling and Bazel CLI generation/merge scripts.
- Introduce grouping of needs by module and a new module-grouped cache JSON.
- Update Sphinx extension flow to read the new caches and generate GitHub links using either git-derived or metadata-derived refs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
Summary per file:
| File | Description |
|---|---|
| src/helper_lib/additional_functions.py | Changes GitHub link generation API to require module metadata. |
| src/extensions/score_source_code_linker/needlinks.py | Extends NeedLink with metadata fields; adds new metadata-aware JSON format/load path. |
| src/extensions/score_source_code_linker/need_source_links.py | Moves group_by_need into this module. |
| src/extensions/score_source_code_linker/module_source_links.py | New module-grouped cache format and grouping logic. |
| src/extensions/score_source_code_linker/metadata.py | New TypedDict + TypeGuard for metadata records. |
| src/extensions/score_source_code_linker/generate_source_code_links_json.py | Refactors extraction helper signature and logging (currently inconsistent call sites). |
| src/extensions/score_source_code_linker/`__init__.py` | Adds module-linker stage and changes injection to use module-grouped cache + metadata-based link generation. |
| scripts_bazel/merge_sourcelinks.py | Merges per-module sourcelinks and enriches with known-good repo/hash metadata. |
| scripts_bazel/generate_sourcelinks_cli.py | Emits sourcelinks JSON with a leading metadata dict and uses updated extraction helper signature. |
| docs.bzl | Adds optional known_good wiring into the sourcelinks merge genrule and the public docs() macro. |
```python
store_source_code_links_with_metadata_json(
    file=args.output, metadata=metadata, needlist=all_need_references
)
```
This switches the generated JSON format from a plain list of NeedLinks to a list whose first element is a metadata dict. Any existing consumers/tests that expect the old schema will now fail. Consider either keeping the old format as default (with an opt-in flag for metadata), or updating all in-repo consumers and tests in the same PR to avoid a partially-migrated state.
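For illustration, a minimal sketch of the opt-in approach. The helper name `store_links` and the metadata keys are assumptions, not the PR's actual API; the point is that the old plain-list schema stays the default and the metadata header is only prepended when a caller explicitly passes it:

```python
import json
import tempfile
from pathlib import Path

def store_links(file_path, needlist, metadata=None):
    # Keep the old plain-list schema by default; only prepend the
    # metadata dict when a caller opts in by passing it explicitly.
    payload = list(needlist)
    if metadata is not None:
        payload = [metadata] + payload
    Path(file_path).write_text(json.dumps(payload))

tmp = Path(tempfile.mkdtemp())
# Old consumers keep working: without metadata the file is a plain list.
store_links(tmp / "scl_cache.json", [{"need": "REQ_1"}])
# Opted-in cache (e.g. named scl_metadata_cache.json) carries the header.
store_links(
    tmp / "scl_metadata_cache.json",
    [{"need": "REQ_1"}],
    metadata={"module_name": "m", "hash": "abc123", "url": "https://example.org/repo"},
)

old = json.loads((tmp / "scl_cache.json").read_text())
new = json.loads((tmp / "scl_metadata_cache.json").read_text())
```

A distinct file name for the metadata variant (as suggested below) would also let readers pick the right loader without sniffing the content.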
@a-zw Valid issue here.
Do you think it would be better to rename this specific cache so it is clear it contains metadata?
Something like `scl_metadata_cache.json`, so that the name makes it clear to use the metadata reader?
```python
    metadata: moduleInfo,
    link: NeedLink | DataForTestLink | DataOfTestCase | None = None,
) -> str:
    if link is None:
        link = DefaultNeedLink()
    if not metadata.hash:
        # Local path (//:docs)
        return get_github_link_from_git(link)
    # Ref-Integration path (//:docs_combo..)
    return get_github_link_from_json(metadata, link)
```
get_github_link now requires a metadata argument, but there are existing call sites in the repo that still call get_github_link(link) (e.g. in tests and xml_parser.py). This is a breaking API change that will raise TypeError. Consider keeping backwards compatibility by making metadata optional with a sensible default (e.g. infer local/git behavior when metadata is omitted) or providing a new function name for the metadata-based behavior.
Suggested change:

```python
    metadata: moduleInfo
    | NeedLink
    | DataForTestLink
    | DataOfTestCase
    | None = None,
    link: NeedLink | DataForTestLink | DataOfTestCase | None = None,
) -> str:
    """
    Return a GitHub link for the given location.

    Backwards compatible calling conventions:
    - New style: get_github_link(metadata, link)
    - Old style: get_github_link(link)  # metadata omitted
    """
    # Distinguish between new-style and old-style calls.
    if isinstance(metadata, moduleInfo):
        actual_metadata: moduleInfo | None = metadata
        actual_link = link
    else:
        # Old-style: first argument is actually the link (or None)
        actual_metadata = None
        # If both metadata and link are provided but metadata is not a
        # moduleInfo, prefer the explicitly provided link argument.
        actual_link = link if link is not None else metadata
    if actual_link is None:
        actual_link = DefaultNeedLink()
    # If no metadata is available or metadata.hash is falsy, fall back to git-based link.
    if actual_metadata is None or not getattr(actual_metadata, "hash", None):
        # Local path (//:docs)
        return get_github_link_from_git(actual_link)
    # Ref-Integration path (//:docs_combo..)
    return get_github_link_from_json(actual_metadata, actual_link)
```
```python
    Returns a nested dictionary structure with 'CodeLink' and 'TestLink' categories.
    Example output:

    {
        "need": "<need_id>",
        "module_name": <module_name>,
        "hash": <git hash>,
        "url": <github base url>,
        "links": {
            "CodeLinks": [NeedLink, NeedLink, ...],
            "TestLinks": [testlink, testlink, ...]
```
The group_by_need docstring/documented example includes top-level module_name/hash/url fields, but SourceCodeLinks only contains need and links and this function does not populate metadata. Please update the docstring/example to match the actual returned structure to avoid confusion for future callers.
Suggested change:

```python
    Returns a list of SourceCodeLinks objects, each containing a 'need' ID and
    associated 'CodeLinks' and 'TestLinks'.
    Example output element:

    {
        "need": "<need_id>",
        "links": {
            "CodeLinks": [NeedLink, NeedLink, ...],
            "TestLinks": [DataForTestLink, DataForTestLink, ...]
```
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
```python
def get_github_link(
    metadata: ModuleInfo,
    link: NeedLink | DataForTestLink | DataOfTestCase | None = None,
) -> str:
    if link is None:
        link = DefaultNeedLink()
    if not metadata.hash:
        # Local path (//:docs)
        return get_github_link_from_git(link)
    # Ref-Integration path (//:docs_combo..)
    return get_github_link_from_json(metadata, link)
```
get_github_link() now requires a metadata: ModuleInfo argument, but there are still in-repo call sites that pass only the link object (e.g. in src/extensions/score_source_code_linker/xml_parser.py and multiple tests under src/extensions/score_source_code_linker/tests/). This will raise a TypeError at runtime and break the existing test suite. Consider either keeping metadata optional with a default representing local builds, or providing a small helper to derive ModuleInfo and updating all remaining call sites in this PR.
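A sketch of the "metadata optional with a local-build default" option. The `ModuleInfo` dataclass here is a stand-in for the extension's real type, and the two link builders are placeholders (strings instead of the real `get_github_link_from_git` / `get_github_link_from_json`):

```python
from dataclasses import dataclass

@dataclass
class ModuleInfo:  # stand-in for the extension's real ModuleInfo
    name: str = ""
    hash: str = ""
    url: str = ""

# Sentinel describing a local build: the empty hash triggers the git path.
LOCAL_BUILD = ModuleInfo()

def get_github_link(link: str, metadata: ModuleInfo = LOCAL_BUILD) -> str:
    # Old call sites can keep passing only `link`; the default metadata
    # routes them through the git-derived behavior.
    if not metadata.hash:
        return f"git:{link}"  # placeholder for get_github_link_from_git(link)
    # placeholder for get_github_link_from_json(metadata, link)
    return f"{metadata.url}/blob/{metadata.hash}/{link}"

local = get_github_link("src/a.py")
combo = get_github_link(
    "src/a.py", ModuleInfo("m", "abc123", "https://github.com/org/repo")
)
```

Note this variant keeps `link` as the first parameter so existing `get_github_link(link)` calls survive unchanged; the PR's new signature puts `metadata` first, which is what breaks them.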
```diff
     # Use os.walk to have better control over directory traversal
     for file in iterate_files_recursively(search_path):
-        references = _extract_references_from_file(search_path, file)
+        references = _extract_references_from_file(search_path,Path(file.name), file)
```
In find_all_need_references(), iterate_files_recursively() yields paths relative to search_path and can include subdirectories (e.g. src/foo/bar.py). Passing Path(file.name) into _extract_references_from_file() discards the parent directories, so (root / file_path_name) will not exist for files not in the top-level directory, triggering the assertion and preventing scanning. Pass the full relative path (file) as the path-to-open argument instead of only file.name.
Suggested change:

```diff
-        references = _extract_references_from_file(search_path,Path(file.name), file)
+        references = _extract_references_from_file(search_path, file, file)
```
```python
def ModuleSourceLinks_JSON_Decoder(
    d: dict[str, Any],
) -> ModuleSourceLinks | dict[str, Any]:
    if "module_name" in d and "needs" in d:
        module_name = d["module_name"]
        needs = d["needs"]
        return ModuleSourceLinks(
            module=ModuleInfo(
                name=module_name.get("module_name"),
                hash=module_name.get("hash"),
                url=module_name.get("url"),
            ),
            # We know this can only be list[SourceCodeLinks] and nothing else
            # Therefore => we ignore the type error here
            needs=[SourceCodeLinks_JSON_Decoder(need) for need in needs],  # type: ignore
        )
```
ModuleSourceLinks_JSON_Decoder() is looking for keys "module_name" and "needs", but ModuleSourceLinks_JSON_Encoder/asdict() will serialize ModuleSourceLinks as { "module": {"name": ..., "hash": ..., "url": ...}, "needs": [...] }. As a result, decoding will never construct ModuleSourceLinks, and load_module_source_links_json() will fail its isinstance(link, ModuleSourceLinks) assertion. Update the decoder to match the actual serialized shape (use the module key and its name/hash/url fields).
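The corrected decoder could look like the sketch below, which uses plain dicts in place of the real `ModuleSourceLinks`/`ModuleInfo` types to stay self-contained; the key point is matching on `"module"` and reading its `name`/`hash`/`url` fields:

```python
from typing import Any

def decode_module_source_links(d: dict[str, Any]) -> dict[str, Any]:
    # Match the shape asdict() actually emits:
    # {"module": {"name": ..., "hash": ..., "url": ...}, "needs": [...]}
    if "module" in d and "needs" in d:
        module = d["module"]
        return {
            "module": {
                "name": module.get("name"),
                "hash": module.get("hash"),
                "url": module.get("url"),
            },
            "needs": d["needs"],
        }
    # Not a ModuleSourceLinks record: return untouched for other decoders.
    return d

decoded = decode_module_source_links(
    {"module": {"name": "m", "hash": "abc123", "url": "https://example.org"}, "needs": []}
)
```

A round-trip test (encode with `asdict()`, decode, compare) would catch this class of encoder/decoder drift automatically.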
```python
    # This isn't pretty will think of a better solution later, for now this should work
    try:
        source_code_links = load_source_code_links_json(source_code_links_json)
    except AssertionError:
        source_code_links = load_source_code_links_with_metadata_json(
            source_code_links_json
        )
```
build_and_save_combined_file() uses try/except AssertionError to detect which JSON schema is in use. This is fragile because load_source_code_links_json() relies on assert statements for validation; running Python with optimizations (-O) disables asserts and would skip the schema check, letting the wrong format through and failing later. Prefer explicit validation that raises a real exception (e.g. TypeError/ValueError) and catch that, or detect the metadata header by inspecting the decoded JSON.
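A sketch of detecting the metadata header by inspecting the decoded JSON rather than catching `AssertionError`. The function name and key names here are illustrative, not the PR's actual API:

```python
import json

def split_metadata_header(raw: str):
    # Inspect the decoded structure instead of relying on assert-based
    # validation: asserts vanish under `python -O`.
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of need links")
    if (
        data
        and isinstance(data[0], dict)
        and "module_name" in data[0]
        and "need" not in data[0]
    ):
        # Metadata header present: first element is the module metadata dict.
        return data[0], data[1:]
    # Old schema: plain list, no metadata.
    return None, data

meta, links = split_metadata_header(
    '[{"module_name": "m", "hash": "abc123", "url": "u"}, {"need": "REQ_1"}]'
)
plain_meta, plain_links = split_metadata_header('[{"need": "REQ_1"}]')
```

Raising `ValueError` on malformed input also gives callers a real exception to catch, unlike an assertion.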
```python
    _ = parser.add_argument(
        "--known_good",
        required=True,
        help="Optional path to a 'known good' JSON file (provided by Bazel).",
```
The --known_good argument help text says it is an "Optional" path, but the argument is marked required=True. This is misleading for users and Bazel rule authors; update the help text or make the flag truly optional.
Suggested change:

```diff
-        help="Optional path to a 'known good' JSON file (provided by Bazel).",
+        help="Path to a 'known good' JSON file (provided by Bazel).",
```
```diff
 def _merge_sourcelinks(name, sourcelinks, known_good = None):
     """Merge multiple sourcelinks JSON files into a single file.

     Args:
         name: Name for the merged sourcelinks target
         sourcelinks: List of sourcelinks JSON file targets
     """

     extra_srcs = []
     known_good_arg = ""
     if known_good != None:
         extra_srcs = [known_good]
         known_good_arg = "--known_good $(location %s)" % known_good

     native.genrule(
         name = name,
-        srcs = sourcelinks,
+        srcs = sourcelinks + extra_srcs,
         outs = [name + ".json"],
         cmd = """
         $(location @score_docs_as_code//scripts_bazel:merge_sourcelinks) \
             --output $@ \
             {known_good_arg} \
             $(SRCS)
-        """,
+        """.format(known_good_arg = known_good_arg),
```
merge_sourcelinks.py now defines --known_good as required=True, but docs.bzl only passes --known_good when known_good != None. With the default known_good=None, the genrule will invoke the script without --known_good and fail argument parsing. Either make known_good mandatory in docs() / _merge_sourcelinks, or make --known_good optional again and handle the missing case in the script.
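If `--known_good` is made truly optional, the argparse side could look like this sketch (assuming the script takes its input files as positionals; names are illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Optional with a None default, so the genrule may legitimately omit it.
    parser.add_argument(
        "--known_good",
        default=None,
        help="Path to a 'known good' JSON file (optional; provided by Bazel).",
    )
    parser.add_argument("inputs", nargs="*", help="Per-module sourcelinks JSON files")
    return parser

# Without the flag the script still parses; callers then check for None
# before trying to read the known-good file.
args = build_parser().parse_args(["a.json", "b.json"])
```

This matches the `docs.bzl` side, which only appends `--known_good $(location ...)` when `known_good != None`.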
```python
    Returns a nested dictionary structure with 'CodeLink' and 'TestLink' categories.
    Example output:

    {
        "need": "<need_id>",
        "module_name": <module_name>,
        "hash": <git hash>,
        "url": <github base url>,
        "links": {
            "CodeLinks": [NeedLink, NeedLink, ...],
            "TestLinks": [testlink, testlink, ...]
        }
    }
```
The group_by_need() docstring claims the returned structure contains top-level module_name/hash/url fields, but SourceCodeLinks only has need and links and the function doesn't populate any module metadata. This makes it unclear where callers should expect module info to come from (now handled via NeedLink metadata / module grouping). Update the docstring to reflect the actual return shape.
Suggested change:

```python
    Returns a list of SourceCodeLinks objects, each containing a need ID and
    grouped 'CodeLinks' and 'TestLinks' for that need.
    Example output (as JSON-serializable structure):

    [
        {
            "need": "<need_id>",
            "links": {
                "CodeLinks": [NeedLink, NeedLink, ...],
                "TestLinks": [DataForTestLink, DataForTestLink, ...]
            }
        },
        ...
    ]

    Module or repository metadata (such as module name, commit hash, or base URL)
    is expected to be carried by the individual NeedLink / DataForTestLink
    instances or their metadata, not as top-level fields in this structure.
```
```python
        if not source_link.links.CodeLinks:
            continue

        first_link = source_link.links.CodeLinks[0]
        module_key = first_link.module_name

        if module_key not in module_groups:
            module_groups[module_key] = ModuleSourceLinks(
                module=ModuleInfo(
                    name=module_key, hash=first_link.hash, url=first_link.url
                )
            )

        module_groups[module_key].needs.append(source_link)  # Much clearer!
```
group_needs_by_module() currently skips any SourceCodeLinks entries that have no CodeLinks (if not source_link.links.CodeLinks: continue). This drops needs that only have TestLinks, so their test links will never be injected into the Sphinx needs output. If test-only coverage is a valid case, consider assigning these to a default/unknown module group (or grouping by test link file path) so they remain visible.
Suggested change:

```python
        code_links = source_link.links.CodeLinks
        test_links = getattr(source_link.links, "TestLinks", None)
        # Skip only if there are neither code links nor test links
        if not code_links and not test_links:
            continue
        if code_links:
            first_link = code_links[0]
            module_key = first_link.module_name
            module_info = ModuleInfo(
                name=module_key, hash=first_link.hash, url=first_link.url
            )
        else:
            # No CodeLinks but there are TestLinks: assign to a default/unknown module
            module_key = "unknown"
            module_info = ModuleInfo(name=module_key, hash="", url="")
        if module_key not in module_groups:
            module_groups[module_key] = ModuleSourceLinks(module=module_info)
        module_groups[module_key].needs.append(source_link)
```
|
Just realized that test links also need the module name earlier than the source_code_linker extension's end state. I need to rethink this approach a bit and see what I can adapt to make this more convenient. I think I have an idea, but I'm unsure.
Still in progress, do not merge.
It works, but I need to fix the tests and clean it up a lot.