14 changes: 7 additions & 7 deletions examples/dataset/make_dataset.py
@@ -276,14 +276,14 @@ async def _load_ultrachat_conversations(
     ds = ds.shuffle(seed=42)
     yield len(ds)
     for i in range(len(ds)):
-        prompt = ds[i]["prompt"].strip()
         prompt_id = ds[i]["prompt_id"].strip()
-        if prompt:
-            msgs = [{"role": "user", "content": prompt}]
-            if not prompt_id:
-                prompt_id = id_for_conversation(msgs)
-            prompt_id = f"ultrachat-{split_name}-{prompt_id}"
-            yield {"conversation_id": prompt_id, "conversations": msgs}
+        msgs = ds[i]["messages"]
+        if not msgs:
+            continue
+        if not prompt_id:
+            prompt_id = id_for_conversation(msgs)
+        prompt_id = f"ultrachat-{split_name}-{prompt_id}"
Comment on lines 279 to +285

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle missing/non-string prompt_id before calling .strip()

Line 279 calls .strip() directly on ds[i]["prompt_id"]. A null value raises AttributeError and a missing key raises KeyError, so the loop crashes before the new fallback ID logic is reached. Normalize the value first, then strip.

Proposed fix
-        prompt_id = ds[i]["prompt_id"].strip()
+        raw_prompt_id = ds[i].get("prompt_id")
+        prompt_id = raw_prompt_id.strip() if isinstance(raw_prompt_id, str) else ""
         msgs = ds[i]["messages"]
         if not msgs:
             continue
         if not prompt_id:
             prompt_id = id_for_conversation(msgs)
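
To make the two failure modes concrete, here is a minimal standalone sketch of the normalization in the proposed fix. The helper names `normalize_prompt_id` and `unsafe_prompt_id` are hypothetical (they are not in the PR), and plain dicts stand in for dataset rows, on the assumption that `ds[i]` returns a dict-like row.

```python
from typing import Any, Mapping


def normalize_prompt_id(row: Mapping[str, Any]) -> str:
    """Return a stripped prompt_id, or "" if it is missing or not a string.

    Hypothetical helper that mirrors the two added lines of the proposed fix.
    """
    raw_prompt_id = row.get("prompt_id")
    return raw_prompt_id.strip() if isinstance(raw_prompt_id, str) else ""


def unsafe_prompt_id(row: Mapping[str, Any]) -> str:
    """The pattern flagged on line 279: index the row, then strip.

    Raises KeyError if the key is absent and AttributeError if the value
    is None (or any other object without a .strip() method).
    """
    return row["prompt_id"].strip()


if __name__ == "__main__":
    # Assumed row shapes: a normal ID, a null ID, and a missing key.
    for row in ({"prompt_id": "  abc  "}, {"prompt_id": None}, {}):
        print(repr(normalize_prompt_id(row)))
```

An empty string is a convenient sentinel here because the existing `if not prompt_id:` check already treats it as "no ID", so the fallback path needs no further changes.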
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/dataset/make_dataset.py` around lines 279 - 285, the code calls
.strip() on ds[i]["prompt_id"], which raises if prompt_id is missing or not a
string. Retrieve prompt_id safely first (e.g., use ds[i].get("prompt_id") or
check for the key), set it to "" if it is not a string, and only then call
.strip(). Keep the existing fallback that sets prompt_id =
id_for_conversation(msgs) when prompt_id is empty, and preserve the final
formatting that prefixes it with f"ultrachat-{split_name}-{prompt_id}". The
relevant logic is the prompt_id handling in make_dataset.py (variables:
prompt_id, ds, msgs, id_for_conversation, split_name).
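
For reference, the whole loop with the guard applied might look like the sketch below. It makes several assumptions: `iter_conversations` is a hypothetical synchronous stand-in for the async `_load_ultrachat_conversations`, `fallback_id` is a hypothetical substitute for the repo's `id_for_conversation` (whose real implementation is not shown in this diff), and rows are plain dicts rather than a loaded dataset.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Iterator, List, Mapping


def fallback_id(msgs: List[Dict[str, Any]]) -> str:
    """Hypothetical stand-in for id_for_conversation(): a content hash."""
    payload = json.dumps(msgs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def iter_conversations(
    rows: Iterable[Mapping[str, Any]], split_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield conversation records, tolerating null or missing prompt_id."""
    for row in rows:
        # Normalize before stripping so bad rows fall through to the fallback.
        raw_prompt_id = row.get("prompt_id")
        prompt_id = raw_prompt_id.strip() if isinstance(raw_prompt_id, str) else ""
        msgs = row["messages"]
        if not msgs:
            continue  # skip rows with no messages, as in the PR
        if not prompt_id:
            prompt_id = fallback_id(msgs)
        prompt_id = f"ultrachat-{split_name}-{prompt_id}"
        yield {"conversation_id": prompt_id, "conversations": msgs}


if __name__ == "__main__":
    sample = [
        {"prompt_id": " p1 ", "messages": [{"role": "user", "content": "hi"}]},
        {"prompt_id": None, "messages": [{"role": "user", "content": "yo"}]},
        {"messages": []},
    ]
    for record in iter_conversations(sample, "train_sft"):
        print(record["conversation_id"])
```

The second sample row would crash the current line 279; with the guard it receives a hash-derived ID instead.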

+        yield {"conversation_id": prompt_id, "conversations": msgs}
     logger.info(f"Finished loading UltraChat {split_name} conversations.")

