Skip to content

fix(exceptions): surface trace ID and underlying reason in FailedUpload#593

Open
kannwism wants to merge 1 commit into
mainfrom
fix(exceptions)/surface-trace-id-in-failed-uploads
Open

fix(exceptions): surface trace ID and underlying reason in FailedUpload#593
kannwism wants to merge 1 commit into
mainfrom
fix(exceptions)/surface-trace-id-in-failed-uploads

Conversation

@kannwism
Copy link
Copy Markdown
Contributor

Summary

  • FailedUpload now carries an optional trace_id, populated from RapidataError responses, so the backend trace that produced the failure is preserved end-to-end.
  • FailedUploadException.__str__ prints [trace_id=…] next to each affected item. Grouping by reason is unchanged (the new field is rendered per item, not folded into the grouping key) so identical failures still cluster.
  • Datapoint-level failures caused by asset upload errors now inherit the underlying asset errors' real reasons and trace IDs instead of the opaque One or more required assets failed to upload placeholder. _rapidata_dataset._collect_and_return_results accepts the asset-level FailedUploads and maps them back to the blocked datapoints via asset_to_datapoints.

Motivation

Before this change, a user hitting an asset-upload failure saw a generic message with no way to correlate it to the backend trace, so support tickets had to be opened blind. RapidataError already extracts details.traceId for top-level errors — this change just stops dropping it on the way through the upload pipeline.

Test plan

  • uv run pyright src/rapidata/rapidata_client — clean (0/0/0).
  • Manual smoke: FailedUpload.from_exception populates trace_id for RapidataError, leaves None otherwise.
  • Manual smoke: FailedUploadException renders [trace_id=…] per item with two different trace IDs grouped under one reason.
  • Manual smoke: legacy path with no trace_id still renders without the suffix.
  • End-to-end upload that hits a failing asset (reviewer to validate against staging).

🔗 Session: https://session-8649337b.poseidon.rapidata.internal/

🤖 Generated with Claude Code

FailedUpload now carries an optional trace_id, populated from RapidataError
responses, and FailedUploadException's rendering prints it next to each item.
Datapoint-level failures caused by asset upload errors now inherit the
underlying asset errors' reasons (and trace IDs) instead of the opaque
"One or more required assets failed to upload" placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: RapidPoseidon <poseidon@rapidata.ai>
@claude
Copy link
Copy Markdown

claude Bot commented May 18, 2026

Code Review — PR #593: surface trace ID and underlying reason in FailedUpload

Overview

This PR solves a real UX problem: previously, users hitting asset-upload failures saw an opaque One or more required assets failed to upload message with no way to correlate it back to a backend trace. The change is well-scoped and the implementation is generally clean. Below are observations ranging from blocking concerns to minor polish.


Correctness

asset_to_datapoints mutation dependency (subtle, worth a comment)

_collect_and_return_results relies on an implicit invariant: for failed assets, their entries in asset_to_datapoints are still intact at call time. This works because _find_ready_datapoints (the success-path callback) only calls asset_to_datapoints[asset].discard(datapoint_idx) for successfully completed assets — failed assets never trigger that callback, so their sets remain untouched. However, this is a non-obvious cross-method dependency. I'd suggest a brief comment at the lookup site to make the invariant explicit:

# asset_to_datapoints entries for failed assets are still intact here because
# _find_ready_datapoints only removes entries on *successful* completion.
affected = asset_to_datapoints.get(asset_failure.item, set())

Multiple trace IDs joined into a single string

In _build_asset_failure_for_datapoint, when a datapoint's multiple assets fail with different trace IDs, they're concatenated as trace_id = ", ".join(unique_trace_ids). The resulting trace_id field is then a comma-separated string, not a single ID. This works for display, but if anyone downstream tries to use FailedUpload.trace_id programmatically to look up a specific trace, they'll need to split first. It's a minor concern since this is primarily a diagnostics field, but worth noting.


Design / Duplication

__str__ re-implements grouping that failures_by_reason already does

failures_by_reason groups by error_messagelist[Datapoint], and __str__ now builds an identical grouping (same key, same ordering) but over list[FailedUpload] so it can access trace_id. The comment explains why (public API returns bare Datapoint, not FailedUpload), but the duplication means the two can silently diverge if failures_by_reason's grouping logic ever changes.

A forward-compatible option would be to add a new detailed_failures_by_reason property that returns dict[str, list[FailedUpload[Datapoint]]] and have __str__ delegate to it. That avoids the duplication without being a breaking change.

details dict accessed twice in from_exception

get_reason() already checks isinstance(exception.details, dict) internally; then from_exception does the same check again to extract traceId. Extracting the trace ID inside RapidataError (e.g., as a trace_id property) would centralise the dict access and make from_exception cleaner, but that's a refactor beyond the PR's scope — not blocking.


Test Coverage

The test plan lists manual smoke tests but no automated ones. Given the grouping and rendering logic touches several code paths, the following would be valuable additions:

  1. FailedUpload.from_exception — assert trace_id is populated for RapidataError with details={"traceId": "abc"} and None for generic exceptions.
  2. _build_asset_failure_for_datapoint — assert deduplication of reasons/trace IDs, fallback message when no related failures are provided, and the multi-trace-ID join.
  3. FailedUploadException.__str__ — assert the [trace_id=…] suffix appears only when trace_id is set.

These are unit-testable in isolation and would prevent regressions as the pipeline evolves.


Minor / Nits

  • format_error_details now appends Trace Id: before Timestamp:. That ordering feels natural (reason → trace → time), no change needed.
  • The docstring on _build_asset_failure_for_datapoint is thorough — good.
  • The fallback message ("One or more required assets failed to upload") is preserved when related_asset_failures is empty, which is the right safe default.

Summary

The change is well-motivated and the core logic is correct. The main asks before merge:

  1. Add a comment at the asset_to_datapoints.get(asset_failure.item, set()) lookup explaining the invariant being relied on.
  2. Consider a detailed_failures_by_reason property (or similar) to eliminate the duplicated grouping logic in __str__.
  3. Add automated unit tests for the three scenarios listed above.

The multiple-trace-ID string concern is low-priority given the diagnostic-only use case, but worth tracking if trace_id is ever used programmatically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant