Async engine: side-effect columns not persisted to row buffer #523

@nabinchha

Bug

In the async engine (DATA_DESIGNER_ASYNC_ENGINE=1), AsyncTaskScheduler._run_cell only writes columns tracked in _instance_to_columns back to the RowGroupBufferManager. Side-effect columns produced by generators (e.g. __trace from with_trace, __reasoning_content from extract_reasoning_content) are present in the result dict but are silently dropped during the buffer write-back.

When a downstream column references a side-effect column in its prompt template, the value is missing from the row buffer, causing a template rendering error:

The following ['<column>__reasoning_content'] columns are missing!

All rows for that downstream column fail as non-retryable, and the entire dataset generation fails.

Root Cause

_instance_to_columns is built from the generators dict, which maps only primary column names to generator instances, so side-effect columns are never registered. The buffer write loop in _run_cell (lines 796-799) iterates only over the output_cols from this map, so any extra keys in the result dict are never written to the buffer.

The same issue exists in _run_batch for batch generators.
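
To make the dropped write-back concrete, here is a minimal sketch of the behavior described above. It does not use the real scheduler or RowGroupBufferManager: the buffer is modeled as a plain dict, and the function name write_back_tracked_only and the column names topic and answer are hypothetical, chosen only for illustration.

```python
# Hypothetical stand-in for the write loop described above; the real
# AsyncTaskScheduler and RowGroupBufferManager are not used here.
def write_back_tracked_only(buffer_row: dict, result: dict, output_cols: list) -> None:
    """Copy only the tracked output columns from result into the buffer row."""
    for col in output_cols:
        if col in result:
            buffer_row[col] = result[col]


# Row as stored in the buffer before the generator runs.
buffer_row = {"topic": "math"}

# Generator result: the primary column plus a side-effect column.
result = {
    "topic": "math",
    "answer": "42",
    "answer__reasoning_content": "worked it out step by step",
}

# Only the primary column is tracked, mirroring _instance_to_columns.
write_back_tracked_only(buffer_row, result, output_cols=["answer"])

# A downstream column that references the side-effect column cannot find it.
required = ["answer__reasoning_content"]
missing = [col for col in required if col not in buffer_row]
```

After the call, buffer_row holds topic and answer only, and missing contains answer__reasoning_content, which corresponds to the "columns are missing" rendering error.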

Impact

Any pipeline using extract_reasoning_content=True or with_trace != TraceType.NONE where a downstream column references the side-effect column will fail under the async engine. The sync engine is unaffected because it mutates the row dict in place.

Fix

After writing tracked output_cols, also persist any new keys from the result dict (keys not present in the input row_data) to the buffer. Apply the same pattern to _run_batch.
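
The proposed fix could look roughly like the sketch below. This is an illustration under assumptions, not the actual patch: the buffer is again a plain dict, and write_back_with_side_effects is a hypothetical helper name. The only logic taken from the description is "write tracked output_cols, then persist any result keys that were not in the input row_data".

```python
# Hypothetical helper illustrating the proposed fix; the buffer is modeled
# as a plain dict rather than the real RowGroupBufferManager.
def write_back_with_side_effects(
    buffer_row: dict, row_data: dict, result: dict, output_cols: list
) -> None:
    """Write tracked columns, then any keys the generator newly added."""
    # Step 1: tracked output columns, as in the existing loop.
    for col in output_cols:
        if col in result:
            buffer_row[col] = result[col]

    # Step 2: side-effect columns, i.e. keys absent from the input row.
    for key, value in result.items():
        if key not in row_data and key not in output_cols:
            buffer_row[key] = value


row_data = {"topic": "math"}   # input row handed to the generator
fixed_row = dict(row_data)     # buffer row before write-back
generated = {
    "topic": "math",
    "answer": "42",
    "answer__reasoning_content": "worked it out step by step",
}

write_back_with_side_effects(fixed_row, row_data, generated, output_cols=["answer"])
```

Comparing against the input row_data rather than the current buffer contents means only keys introduced by this generator call are persisted, and existing input columns are not overwritten by the second step.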

Affected files

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
