Extending benchmarking to allow pre-schedules #65
Conversation
Pull request overview
Extends the Lighthouse workload/benchmarking interface to support applying multiple transform schedules, enabling “pre-schedules” to run before the benchmark wrapper is emitted (to accommodate signature changes from sharding/partitioning).
Changes:
- Rename the workload API from `schedule_module()` to `schedule_modules()` and apply the schedules sequentially during lowering.
- Update the benchmarking flow to optionally apply the first schedule before emitting the benchmark wrapper, then apply the remaining schedules.
- Update examples (XeGPU and mlp-mpi) to return schedule lists; adjust mlp-mpi payload/signature and pipeline to support benchmarking.
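The renamed interface can be pictured with a minimal stand-in. The names `Workload.schedule_modules` and `lower_payload` come from the PR; the schedules here are plain callables rather than real transform-dialect `ir.Module` scripts:

```python
# Minimal stand-in for the renamed API: schedule_modules() returns a
# list, and lowering applies each schedule in order. Real schedules are
# transform-dialect ir.Modules; callables are used here for illustration.

class Workload:
    def schedule_modules(self):
        # Two toy "schedules": each rewrites the payload string.
        return [lambda p: p + "->tiled", lambda p: p + "->bufferized"]

def lower_payload(workload, payload):
    schedules = workload.schedule_modules()
    assert isinstance(schedules, list)  # mirrors the check added in the PR
    for schedule in schedules:
        payload = schedule(payload)
    return payload

print(lower_payload(Workload(), "payload"))  # payload->tiled->bufferized
```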
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| lighthouse/workload/workload.py | Updates the Workload interface to return multiple schedules and applies them in order during lowering. |
| lighthouse/workload/runner.py | Updates benchmark() to support pre-schedules before emitting the benchmark wrapper; adds an execution print. |
| examples/xegpu/mlp.py | Migrates to schedule_modules() returning a single schedule in a list. |
| examples/xegpu/matmul.py | Migrates to schedule_modules() returning a single schedule in a list. |
| examples/workload/example.py | Migrates to schedule_modules() and returns a list, but still has an early return path that returns a single module. |
| examples/mlp-mpi/mlp_weight_stationary.py | Adjusts payload signature to take an explicit destination argument and uses bufferization materialization into destination. |
| examples/mlp-mpi/mlp-mpi.py | Switches example driver to use benchmark() and splits schedule into pre/main schedules for signature-sensitive benchmarking. |
Comments suppressed due to low confidence (1)
examples/workload/example.py:151
`schedule_modules` is now expected to return `list[ir.Module]`, but this implementation still annotates `-> ir.Module` and (more importantly) returns a bare `schedule_module` when `stop_at_stage == "bufferized"` (line 151). This violates the new interface and will fail at runtime due to the `assert isinstance(schedule_modules, list)` in `Workload.lower_payload`. Update the return annotation and make the early return return a list as well (or restructure to avoid returning from inside the insertion-point block).
```python
def schedule_modules(
    self, stop_at_stage: Optional[str] = None, parameters: Optional[dict] = None
) -> ir.Module:
    schedule_module = ir.Module.create()
    schedule_module.operation.attributes["transform.with_named_sequence"] = (
        ir.UnitAttr.get()
    )
    with ir.InsertionPoint(schedule_module.body):
        named_sequence = transform.named_sequence(
            "__transform_main",
            [transform.AnyOpType.get()],
            [],
            arg_attrs=[{"transform.readonly": ir.UnitAttr.get()}],
        )
        with ir.InsertionPoint(named_sequence.body):
            anytype = transform.AnyOpType.get()
            func = match(named_sequence.bodyTarget, ops={"func.func"})
            mod = transform.get_parent_op(
                anytype,
                func,
                op_name="builtin.module",
                deduplicate=True,
            )
            mod = apply_registered_pass(mod, "one-shot-bufferize")
            mod = apply_registered_pass(mod, "convert-linalg-to-loops")
            transform.apply_cse(mod)
            canonicalize(mod)
            if stop_at_stage == "bufferized":
                transform.YieldOp()
                return schedule_module
```
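A minimal sketch of the fix the comment asks for, using a stand-in `Module` class since `mlir.ir` is not importable here: the annotation becomes `list[ir.Module]` and both return paths wrap the module in a list.

```python
# Sketch of the suggested fix (stand-in Module; the real code builds a
# transform-dialect script). Both exits now return a list, which
# satisfies the assert isinstance(schedule_modules, list) in lower_payload.
class Module:
    pass

def schedule_modules(stop_at_stage=None) -> list:
    schedule_module = Module()
    # ... build the named sequence and bufferization passes here ...
    if stop_at_stage == "bufferized":
        return [schedule_module]  # early return is also a list now
    # ... remaining lowering passes here ...
    return [schedule_module]
```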
The proposed fix indeed requires that we're able to interrupt the lowering schedule in the middle, do something (i.e., emit the benchmark function with local memref shapes), and then do the rest of the lowering.

One possible solution is that the returned "schedules" is a dictionary of (tag, schedule) pairs. For uninterrupted lowering we just apply all the schedules in order (in Python >= 3.7, dictionaries preserve insertion order). If we need to interrupt at the sharded state, we check whether the schedules contain a tag "sharded" and, if so, we know where to stop.

Alternative solution: for the present benchmark func issue, another option is to copy the payload func's shard annotations (if any) to the benchmark function. Then we can apply the whole pipeline in one go, and the benchmark func will get converted from global to local shapes correctly. We'd avoid splitting the schedule but would still have a sharding-specific change in the runner.
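The (tag, schedule) dictionary idea could be sketched as follows; the helper name `split_at_tag` is hypothetical, and the schedules are opaque objects here:

```python
def split_at_tag(schedules: dict, tag=None):
    """Split an ordered dict of (tag, schedule) pairs into the schedules
    to apply before the interruption point (inclusive) and those after it.

    With tag=None everything lands in the first list, i.e. uninterrupted
    lowering just applies all schedules in insertion order.
    """
    pre, post, seen = [], [], False
    for name, schedule in schedules.items():
        (post if seen else pre).append(schedule)
        if name == tag:
            seen = True  # everything after this tag runs post-interruption
    return pre, post
```

For example, `split_at_tag({"sharded": s1, "bufferized": s2}, tag="sharded")` yields `([s1], [s2])`, letting the benchmark runner stop after sharding to emit the wrapper.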
Yes, something nicer would be nice.
How do you suggest copying the shard annotations? A shard-specific transform schedule? What's bad about allowing multiple schedules? Whether schedules or pipelines, being able to combine more than one looks like a useful feature to me no matter what.
```python
from mlir.dialects.transform import x86
...
def _cleanup(target):
```
This already exists partially in the pipeline helper. Better to improve that one and reuse it here.
```python
# This allows for modifying the function signature before extracting it
# for the benchmark function.
schedule_modules = workload.schedule_modules(parameters=schedule_parameters)
assert isinstance(schedule_modules, list)
if len(schedule_modules) > 1:
```
I don't like the idea of separating the schedules numerically. If there are two distinct stages in which they run, I'd make them into separate lists, and apply all schedules in the first list here, and all in the second list down there.
I don't like the idea of list of lists in this case. What is between first and second is a pure artefact of how the benchmarking is implemented. Other tools might require other separations. I don't think there is a clear way to make this generic enough without relying on something that identifies arbitrary (maybe continuous) subsets (like with numeric offsets or dicts).
As discussed elsewhere, a better approach might be to put things which modify the IR (like inserting the benchmark wrapper) into separate schedules and leave it to the workload to put various schedules in the right order.
I can try to implement that with @rolfmorel's latest extensions to MLIR.
Not a list of lists, two separate lists. Or one list that is always applied in full, and the user does that twice.
Encoding the stage in which things run on the list index is asking for trouble, and it becomes literally a list of lists.
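The two-separate-lists suggestion could look like this in the runner; the method names `pre_schedules` and `main_schedules` are hypothetical, as are the `emit_wrapper`/`apply_schedule` callbacks:

```python
def benchmark(workload, payload, emit_wrapper, apply_schedule):
    # Hypothetical split: signature-changing schedules run before the
    # benchmark wrapper is emitted, the rest after. No list index
    # carries hidden meaning; each list is always applied in full.
    for schedule in workload.pre_schedules():
        payload = apply_schedule(schedule, payload)
    payload = emit_wrapper(payload)
    for schedule in workload.main_schedules():
        payload = apply_schedule(schedule, payload)
    return payload
```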
```python
transform.apply_patterns_canonicalization()
...
def create(tile_size=64) -> ir.Module:
```
Let's keep it as is for now.
I'll modularize the schedule when a CPU example is added.
One nit: I'd add an assert to ensure that tile_size is a multiple of 32.
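The nit could be addressed with a guard like this (a sketch; the real `create` builds and returns an `ir.Module`):

```python
def create(tile_size=64):
    # Suggested guard: the schedule assumes tile sizes divisible by 32.
    assert tile_size > 0 and tile_size % 32 == 0, (
        f"tile_size must be a positive multiple of 32, got {tile_size}"
    )
    # ... build and return the module ...
    return tile_size
```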
Sharding modifies function signatures to make them operate on partitions, not the whole tensor.
The benchmark utility extracted the payload's signature from the unpartitioned IR and created the wrapper function based on that.
This of course breaks when it tries to call the partitioned function with global shapes.
This PR allows workloads to return more than one schedule. If so, the benchmark will apply the first, add the benchmark wrapper and finally apply all the remaining schedules.
The mlp-mpi example can now be properly benchmarked. To this end, the payload function was adjusted to accept a return argument instead of returning a tensor, and it now provides two schedules. All other examples simply return a list with a single schedule.
Requires a fix in MLIR to make mlp-mpi.py pass.
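The benchmarking behaviour described in this PR can be sketched as follows. This is a hedged illustration: `run_benchmark`, `emit_wrapper`, and `apply_schedule` are illustrative names rather than the runner's actual API, and the ordering for the single-schedule case is assumed.

```python
def run_benchmark(schedules, payload, emit_wrapper, apply_schedule):
    # Per the PR: with more than one schedule, the first acts as a
    # pre-schedule applied before the benchmark wrapper is emitted, so
    # sharding-induced signature changes are visible to the wrapper;
    # the remaining schedules run afterwards.
    if len(schedules) > 1:
        payload = apply_schedule(schedules[0], payload)
        remaining = schedules[1:]
    else:
        # Assumed single-schedule order: wrapper first, then the schedule.
        remaining = schedules
    payload = emit_wrapper(payload)
    for schedule in remaining:
        payload = apply_schedule(schedule, payload)
    return payload
```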