Basic implementation for mlos_benchd service#949
Basic implementation for mlos_benchd service#949eujing wants to merge 2 commits intomicrosoft:mainfrom
Conversation
| self._opt_targets = opt_targets | ||
| self._ts_start = ts_start or datetime.now(UTC) | ||
| self._ts_end: datetime | None = None | ||
| self._status = Status.PENDING |
There was a problem hiding this comment.
This should match what was stored in the backend for resumable Experiments, right?
|
|
||
| This method is called by `Storage.Experiment.__enter__()`. | ||
| """ | ||
| self._status = Status.RUNNING |
There was a problem hiding this comment.
Maybe add some asserts on expected status to check for invalid state transitions.
Wouldn't be terrible to document the expected state transitions in a README.md or docstring either.
| """ | ||
| self._status = Status.RUNNING | ||
| self._driver_name = platform.node() | ||
| self._driver_pid = os.getpid() |
There was a problem hiding this comment.
These seem ok to initialize the values from None, but might be problematic for resuming an Experiment.
There are a few cases I can think of:
- An individual
mlos_benchdriver process fails and needs to be restarted. - The
mlos_benchdservice dies, but themlos_benchprocess is still running. - The whole driver host fails and either needs to be restarted on that same backend or else picked up by a new one (for simplicity, let's assume we only support the former for now. We can add a heartbeat mechanism later to support the latter).
|
|
||
| def _setup(self) -> None: | ||
| super()._setup() | ||
| def _ensure_persisted(self) -> None: |
There was a problem hiding this comment.
| def _ensure_persisted(self) -> None: | |
| def _save(self) -> None: |
or
| def _ensure_persisted(self) -> None: | |
| def _try_save(self) -> None: |
or
| def _ensure_persisted(self) -> None: | |
| def _persist(self) -> None: |
| def _setup(self) -> None: | ||
| super()._setup() | ||
| self._ensure_persisted() | ||
| with self._engine.begin() as conn: |
There was a problem hiding this comment.
Might need to separate that out to an _update method, or just incorporate it into _save
There was a problem hiding this comment.
Could also rename _save to _create to separate the INSERT from UPDATE calls
| except exc.SQLAlchemyError: | ||
| # probably a conflict | ||
| trans.rollback() | ||
|
|
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| parser = argparse.ArgumentParser(description="mlos_benchd") |
There was a problem hiding this comment.
Add a long text description, similar to mlos_bench.launcher
| help="Number of workers to use. Default is 1.", | ||
| ) | ||
| parser.add_argument( | ||
| "--poll_interval", |
There was a problem hiding this comment.
Also provide poll-interval variants, like mlos_bench.launcher
| "--environment", | ||
| root_env_config, | ||
| "--experiment_id", | ||
| exp_id, |
There was a problem hiding this comment.
We'll eventually need to include other things here too:
- cli config
- globals (could be subsumed by cli config)
- working dir (maybe we adjust the storage backend to store the target directory and cli-config and optionally globals instead of root_env_config)
bpkroth
left a comment
There was a problem hiding this comment.
This is a great start! Thanks so much ❤️
I left a few comments for initial changes.
We'll also need tests.
| This script is responsible for polling the storage for runnable experiments and | ||
| executing them in parallel. | ||
|
|
||
| See the current ``--help`` `output for details. |
There was a problem hiding this comment.
| See the current ``--help`` `output for details. | |
| See the current ``--help`` output for details. |
| """ | ||
| mlos_bench background execution daemon. | ||
|
|
||
| This script is responsible for polling the storage for runnable experiments and |
There was a problem hiding this comment.
| This script is responsible for polling the storage for runnable experiments and | |
| This script is responsible for polling the :py:mod:`~mlos_bench.storage` for runnable :py:class:`.Experiment`s and |
There was a problem hiding this comment.
Some minor tweaks to help make all of the docstring generation cross referencing. Might need some tweaks.
| default=1, | ||
| help="Polling interval in seconds. Default is 1.", | ||
| ) | ||
| _main(parser.parse_args()) |
There was a problem hiding this comment.
For testing, may also want some sort of hidden argument or environment variable used to set the MAX_ITERATIONS.
Pull Request
Basic implementation for mlos_benchd service
This PR introduces:
Experimentvia the storage API, separate from running via CLI.mlos_benchdscript that polls storage for runnable experiments, then executes them.This allows for the separation of experiment creation (e.g. scheduling), and their execution (on potentially multiple hosts).
The new
mlos_benchdscript can run on any host to poll for new experiments, as long as they monitor the right storage backend.Builds upon schema changes from #931
Example usage
Create new experiment, via notebook or any python environment:
mlos_benchd:
Limitations:
Description
Type of Change
Indicate the type of change by choosing one (or more) of the following:
Testing
Manual testing, unit tests to come.
Additional Notes (optional)
Add any additional context or information for reviewers.