Conversation
Pull request overview
This PR implements a "jobs-centric" redesign of the drug discovery execution API. It introduces a composable class hierarchy (Execution base + QuoteMixin/SyncExecutableMixin/AsyncExecutableMixin mixins) and replaces the previous workflow-step pattern with standalone job classes (Docking, ABFE, PocketFinder). The old classes are preserved as docking_step.py/abfe_step.py for backward compatibility with the Complex class. The Results.get() API is also refactored to accept a raw filter_dict instead of named filter parameters.
Changes:
- New standalone Docking, ABFE, and PocketFinder job classes with composable execution mixins (Execution, AsyncExecutableMixin, etc.)
- Results.get() refactored to accept a filter_dict; a new _build_result_filter() helper handles construction of filter dicts
- Old docking.py and abfe.py WorkflowStep classes renamed to docking_step.py and abfe_step.py
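The composable hierarchy described above can be sketched as follows (a minimal sketch: method bodies and the placeholder values are stand-ins for the real platform calls; only the class names come from this PR):

```python
class Execution:
    """Base class holding read-only id/estimate/cost and status state."""

    def __init__(self):
        self._id = None
        self._estimate = None
        self._cost = None
        self.status = None

    @property
    def id(self):
        return self._id


class QuoteMixin:
    def quote(self) -> float:
        # Real implementation would request a price estimate from the platform.
        self._estimate = 0.0
        return self._estimate


class AsyncExecutableMixin:
    def start(self) -> str:
        # Real implementation would submit the job and store the execution ID.
        self._id = "ex-123"
        self.status = "Created"
        return self._id


class Docking(QuoteMixin, AsyncExecutableMixin, Execution):
    """A standalone job class composed from the mixins."""
```

A job class picks up behavior purely through the MRO; no mixin defines `__init__`, so `Execution.__init__` runs unchanged.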
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/drug_discovery/execution.py | New base class with read-only id/estimate/cost properties and status transition logic |
| src/drug_discovery/execution_mixins.py | New mixins: QuoteMixin, SyncExecutableMixin, AsyncExecutableMixin, JupyterVizMixin |
| src/drug_discovery/docking.py | New standalone Docking class replacing old WorkflowStep-based one |
| src/drug_discovery/abfe.py | New standalone ABFE class replacing old WorkflowStep-based one |
| src/drug_discovery/pocket_finder.py | New standalone PocketFinder class |
| src/drug_discovery/docking_step.py | Old Docking WorkflowStep class (kept for Complex compatibility) |
| src/drug_discovery/abfe_step.py | Old ABFE WorkflowStep class (kept for Complex compatibility) |
| src/drug_discovery/complex.py | Updated imports to use renamed _step.py files |
| src/platform/results.py | Refactored Results.get() to accept filter_dict; added _build_result_filter() helper |
| src/platform/constants.py | Added PlatformStatus literal type and ALLOWED_STATUS_TRANSITIONS dict; bumped PocketFinder version |
| src/drug_discovery/utils/__init__.py | Simplified _start_tool_run signature (removed output_dir_path, provider) |
| src/drug_discovery/__init__.py | Exports new PocketFinder, Docking, ABFE, Execution, PlatformStatus |
| src/jobs/__init__.py | New backward-compatibility shim re-exporting job classes |
| src/drug_discovery/constants.py | Added PocketFinder to valid_tools and tool_mapper |
| src/drug_discovery/structures/pocket.py | Added Pocket.from_id() classmethod and pocket_count/pocket_min_size args to from_result() |
| src/drug_discovery/structures/ligand.py | Added LigandSet.from_ids() classmethod |
| tests/test_jobs_base.py | New tests for Execution base class, mixins, and PocketFinder |
| tests/test_jobs_docking.py | New tests for standalone Docking class |
| tests/test_jobs_abfe.py | New tests for standalone ABFE class |
| tests/test_results.py | Updated tests for refactored Results.get() API |
| tests/test_functions_with_data_platform.py | Updated call site to use new filter_dict API |
| jobs-refactor-plan.md | Design document describing the refactor |
src/jobs/__init__.py (Outdated)

```python
from deeporigin.drug_discovery.abfe import ABFE
from deeporigin.drug_discovery.docking import Docking
from deeporigin.drug_discovery.execution import Execution, PlatformStatus
```
PlatformStatus is imported from deeporigin.drug_discovery.execution in both src/jobs/__init__.py (line 10) and src/drug_discovery/__init__.py (line 66), but PlatformStatus is not defined or re-exported in src/drug_discovery/execution.py — it is only defined in src/platform/constants.py. This will raise an ImportError at runtime whenever PlatformStatus is accessed via either of these import paths. execution.py needs to either re-export PlatformStatus with from deeporigin.platform.constants import PlatformStatus, or both call sites need to import directly from deeporigin.platform.constants.
Suggested change:

```diff
-from deeporigin.drug_discovery.execution import Execution, PlatformStatus
+from deeporigin.drug_discovery.execution import Execution
+from deeporigin.platform.constants import PlatformStatus
```
src/drug_discovery/abfe.py (Outdated)

```python
for key, val in output_files.items():
    if isinstance(val, dict) and "system_pdb_file_path" in val:
        pass
```
The elif isinstance(output_files, dict) branch at lines 147-150 is dead code — it enters the branch but only executes pass, discarding any dict-shaped output_files without extracting binding_xml or solvation_xml. This means if the API ever returns output_files as a dict (rather than a list), self.binding_xml_path and self.solvation_xml_path will remain None, causing quote() and start() to raise ValueError. Either extract the XML paths from the dict format or remove the branch entirely and document the expected format.
Suggested change:

```diff
-for key, val in output_files.items():
-    if isinstance(val, dict) and "system_pdb_file_path" in val:
-        pass
+def _extract_xml_paths(obj) -> None:
+    nonlocal binding_xml, solvation_xml
+    if isinstance(obj, dict):
+        for v in obj.values():
+            _extract_xml_paths(v)
+    elif isinstance(obj, list):
+        for v in obj:
+            _extract_xml_paths(v)
+    elif isinstance(obj, str):
+        if obj.endswith("bsm_system.xml"):
+            binding_xml = obj
+        elif obj.endswith("solvation.xml"):
+            solvation_xml = obj
+
+_extract_xml_paths(output_files)
```
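As a sanity check that the suggested recursive walk really recovers both paths from a dict-shaped payload, here is a standalone variant (module-level, returning the two paths instead of assigning nonlocals; the example file names and keys are invented):

```python
def extract_xml_paths(output_files):
    """Recursively collect the binding/solvation XML paths from any
    nesting of dicts and lists, mirroring the suggested fix."""
    binding_xml = None
    solvation_xml = None

    def _walk(obj):
        nonlocal binding_xml, solvation_xml
        if isinstance(obj, dict):
            for v in obj.values():
                _walk(v)
        elif isinstance(obj, list):
            for v in obj:
                _walk(v)
        elif isinstance(obj, str):
            if obj.endswith("bsm_system.xml"):
                binding_xml = obj
            elif obj.endswith("solvation.xml"):
                solvation_xml = obj

    _walk(output_files)
    return binding_xml, solvation_xml


# The dict shape that the dead branch currently discards is handled too:
dict_shaped = {
    "complex": {"system_pdb_file_path": "a.pdb", "xml": "run1/bsm_system.xml"},
    "solvent": ["run1/solvation.xml"],
}
print(extract_xml_paths(dict_shaped))  # → ('run1/bsm_system.xml', 'run1/solvation.xml')
```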
tests/test_jobs_docking.py (Outdated)

```python
def docking_with_smiles(protein, pocket, ligand_set, client):
    """Create a Docking instance using a second LigandSet (was SMILES path)."""
    return Docking(
        protein=protein,
        pocket=pocket,
        ligands=ligand_set,
```
The docking_with_smiles fixture (lines 70-77) is identical to docking_with_ligands — both pass ligands=ligand_set and do not use the smiles_list argument. The associated test_with_smiles_list test (line 99) claims to test "Docking constructed with smiles_list converts to a LigandSet", but it never exercises the smiles_list code path. The fixture should instead pass smiles_list=["CCO", "CCCO"] (without ligand IDs pre-assigned, since smiles_list conversion happens internally) to correctly cover that constructor branch.
Suggested change:

```diff
-def docking_with_smiles(protein, pocket, ligand_set, client):
-    """Create a Docking instance using a second LigandSet (was SMILES path)."""
-    return Docking(
-        protein=protein,
-        pocket=pocket,
-        ligands=ligand_set,
+def docking_with_smiles(protein, pocket, client):
+    """Create a Docking instance using a SMILES list."""
+    return Docking(
+        protein=protein,
+        pocket=pocket,
+        smiles_list=["CCO", "CCCO"],
```
```diff
-        metadata = {
-            "protein_file": protein_basename,
-            "protein_hash": self.parent.protein.to_hash(),
-        }
-
-        if batch_size is None and n_workers is None:
-            raise DeepOriginException(
-                "Either batch_size or n_workers must be specified."
-            ) from None
-        elif batch_size is not None and n_workers is not None:
-            print(
-                "Both batch_size and n_workers are specified. Using n_workers to determine batch_size..."
-            )
-
-        if n_workers is not None:
-            batch_size = math.ceil(len(self.parent.ligands) / n_workers)
-            print(f"Using a batch size of {batch_size}")
-
-        if pocket is not None:
-            box_size = float(2 * np.cbrt(pocket.volume))
-            box_size = [box_size, box_size, box_size]
-            pocket_center = pocket.get_center().tolist()
-
-        # Get all ligands and filter out already docked ones
-        all_ligands = list(self.parent.ligands)
-
-        df = self.get_jobs_df(pocket_center=pocket_center, box_size=box_size)
-
-        already_docked_smiles = set()
-
-        if not re_run:
-            for _, row in df.iterrows():
-                user_inputs = row.get("user_inputs", {})
-                # Handle both old format (smiles_list) and new format (ligands array)
-                if "smiles_list" in user_inputs:
-                    already_docked_smiles.update(user_inputs["smiles_list"])
-                elif "ligands" in user_inputs:
-                    already_docked_smiles.update(
-                        ligand.get("smiles", "") for ligand in user_inputs["ligands"]
-                    )
-
-        # Filter ligands to only include those not already docked
-        ligands_to_dock = [
-            ligand
-            for ligand in all_ligands
-            if ligand.smiles not in already_docked_smiles
-        ]
-
-        # Sort ligands by SMILES for consistent batching
-        ligands_to_dock = sorted(ligands_to_dock, key=lambda ligand: ligand.smiles)
-
-        job_ids = []
-
-        chunks = list(more_itertools.chunked(ligands_to_dock, batch_size))
-
-        def process_chunk(chunk):
-            params = {
-                "box_size": list(box_size),
-                "pocket_center": list(pocket_center),
-                "protein": {
-                    "file_path": self.parent.protein._remote_path,
-                },
-                "ligands": [
-                    {
-                        "smiles": ligand.smiles,
-                    }
-                    for ligand in chunk
-                ],
-            }
-
-            # Include protein ID if available
-            if self.parent.protein.id is not None:
-                params["protein"]["id"] = self.parent.protein.id
-
-            # Include ligand IDs if available
-            for i, ligand in enumerate(chunk):
-                if ligand.id is not None:
-                    params["ligands"][i]["id"] = ligand.id
-
-            # Include pocket_id if pocket is provided and has an ID
-            if pocket is not None and hasattr(pocket, "id") and pocket.id is not None:
-                params["pocket_id"] = pocket.id
-
-            execution_dto = utils._start_tool_run(
-                params=params,
-                metadata=metadata,
-                tool="Docking",
-                tool_version=self.tool_version,
-                client=self.parent.client,
-                output_dir_path=output_dir_path,
-                approve_amount=approve_amount,
-            )
-            return execution_dto
-
-        if len(ligands_to_dock) > 0:
-            if use_parallel:
-                with concurrent.futures.ThreadPoolExecutor() as executor:
-                    # Submit all chunks to the executor
-                    future_to_chunk = {
-                        executor.submit(process_chunk, chunk): chunk for chunk in chunks
-                    }
-
-                    # Process results with progress bar
-                    for future in tqdm(
-                        concurrent.futures.as_completed(future_to_chunk),
-                        total=len(chunks),
-                        desc="Starting docking jobs",
-                    ):
-                        execution_dto = future.result()
-                        if execution_dto is not None:
-                            job_ids.append(execution_dto)
-            else:
-                for chunk in tqdm(
-                    chunks, total=len(chunks), desc="Starting docking jobs"
-                ):
-                    execution_dto = process_chunk(chunk)
-                    if execution_dto is not None:
-                        job_ids.append(execution_dto)
-        else:
-            raise DeepOriginException(
-                title="Cannot run Docking: no new ligands to dock",
-                message="No new ligands to dock. All ligands have already been docked.",
-                fix=" If you want to re-run the docking, set <code>re_run=True</code>.",
-                level="warning",
-            ) from None
-
-        jobs = [
-            Job.from_dto(execution_dto, client=self.parent.client)
-            for execution_dto in job_ids
-        ]
-
-        self.jobs = JobList(jobs)
-
-        return self.jobs
+    def get_results(self) -> LigandSet | None:
+        """Retrieve docking results from the platform.
+
+        Downloads result SDF files and returns a ``LigandSet``.
+
+        Returns:
+            A ``LigandSet`` of docked poses, or ``None`` if no results yet.
+        """
+        response = self.client.results.get_poses(protein_id=self.protein.id)
+        records = response.get("data", [])
+        remote_paths: list[str] = []
+        for record in records:
+            data = record.get("data", {})
+            fp = data.get("file_path")
+            if fp:
+                remote_paths.append(fp)
+
+        if not remote_paths:
+            return None
+
+        local_paths = self.client.files.download_files(files=remote_paths, lazy=True)
+
+        result = LigandSet()
+        for path in local_paths:
+            result.ligands += LigandSet.from_sdf(path).ligands
+
+        return result
```
According to the design specification in jobs-refactor-plan.md (line 400), calling get_results() before start() (when there is no execution ID) should "raise an error". However, the current implementation of get_results() in Docking does not check whether self.id is None before querying the results API — it simply queries by protein_id and may return results from any previous docking run for that protein, not necessarily the one started by this instance. Compare this with ABFE.get_results(), which correctly raises ValueError("No execution has been started...") when self.id is None. Consider adding a similar guard here.
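The missing guard could mirror ABFE's behavior. A minimal illustrative sketch (the class name and message are stand-ins, not code from this PR):

```python
class DockingResultsGuard:
    """Illustrative sketch: refuse to fetch results before start()."""

    def __init__(self):
        self._id = None  # set by start() in the real class

    def get_results(self):
        if self._id is None:
            raise ValueError(
                "No execution has been started; call start() before get_results()."
            )
        # The real code would query the results API scoped to self._id here.
        return []
```

The check costs one line and turns a silently-wrong answer (poses from some earlier run) into an immediate, explainable error.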
src/drug_discovery/docking_step.py (Outdated)

```python
tool="Docking",
tool_version=self.tool_version,
client=self.parent.client,
output_dir_path=output_dir_path,
```
The _start_tool_run function signature was refactored to remove the output_dir_path and provider parameters. However, both docking_step.py (line 465) and abfe_step.py (line 388) still pass output_dir_path=output_dir_path as a keyword argument to _start_tool_run. Since @beartype is applied to _start_tool_run and the parameter no longer exists in the signature, calling Complex.docking.run(...) or Complex.abfe.run(...) will raise a TypeError at runtime, breaking the existing Complex-based workflow.
These callers need to be updated: instead of passing output_dir_path to _start_tool_run, they should either pass the outputs as a pre-built outputs dict or remove the parameter entirely (since the new _start_tool_run now accepts an outputs dict directly).
Suggested change:

```diff
-output_dir_path=output_dir_path,
```
src/drug_discovery/abfe_step.py (Outdated)

```python
execution_dto = utils._start_tool_run(
    metadata=metadata,
    params=params,
    tool="ABFE",
    tool_version=self.tool_version,
    client=self.parent.client,
    output_dir_path=output_dir_path,
```
Same issue as in docking_step.py: abfe_step.py still passes output_dir_path=output_dir_path (line 388) to _start_tool_run, which no longer accepts that parameter. This will raise a TypeError at runtime when calling Complex.abfe.run(...). The call needs to be updated to construct and pass an outputs dict directly instead.
Suggested change:

```diff
-execution_dto = utils._start_tool_run(
-    metadata=metadata,
-    params=params,
-    tool="ABFE",
-    tool_version=self.tool_version,
-    client=self.parent.client,
-    output_dir_path=output_dir_path,
+outputs = {"output_dir_path": output_dir_path}
+execution_dto = utils._start_tool_run(
+    metadata=metadata,
+    params=params,
+    tool="ABFE",
+    tool_version=self.tool_version,
+    client=self.parent.client,
+    outputs=outputs,
```
src/drug_discovery/docking.py (Outdated)

```diff
-                        "id": record.get("id"),
-                        "tool_id": record.get("tool_id"),
-                        "tool_version": record.get("tool_version"),
-                        "ligand_id": data.get("ligand_id"),
-                        "protein_id": data.get("protein_id"),
-                        "binding_energy": data.get("binding_energy"),
-                        "pose_score": data.get("pose_score"),
-                        "file_path": data.get("file_path"),
-                    }
-                )
-
-        return pd.DataFrame(rows)
-
-    @beartype
-    def get_jobs_df(
-        self,
-        *,
-        pocket_center: Optional[tuple[Number, Number, Number] | list[Number]] = None,
-        box_size: Optional[tuple[Number, Number, Number] | list[Number]] = None,
-    ):
-        """search for all jobs that match this protein and ligands in the Job DB, and return a dataframe of the results
-
-        Args:
-            pocket_center: Optional tuple of (x, y, z) coordinates to filter by pocket center
-            box_size: Optional tuple of (x, y, z) dimensions to filter by box size
-        """
-        # Use parent method with docking-specific parameters
-        df = super().get_jobs_df(
-            include_metadata=True,
-            include_inputs=True,
-            include_outputs=True,
-            status=list(NON_FAILED_STATES),
-        )
+        params = {
+            "box_size": box_size_list,
+            "pocket_center": pocket_center,
+            "protein": {
+                "id": self.protein.id,
+                "file_path": self.protein._remote_path,
+            },
+            "ligands": [
+                {
+                    "id": lig.id,
+                    "smiles": lig.smiles,
+                }
+                for lig in ligands
+            ],
+        }
+
+        if self.protein.id is not None:
+            params["protein"]["id"] = self.protein.id
+
+        for i, lig in enumerate(ligands):
+            if lig.id is not None:
+                params["ligands"][i]["id"] = lig.id
+
+        if hasattr(self.pocket, "id") and self.pocket.id is not None:
+            params["pocket_id"] = self.pocket.id
```
The _build_tool_inputs method contains redundant code. The protein ID is set twice: first in the params dict literal at line 291 ("id": self.protein.id), then again conditionally at lines 303-304. Similarly, ligand IDs are already included in the list comprehension at lines 294-299 ("id": lig.id), but then set again in the loop at lines 306-308. The constructor already enforces that protein, pocket, and all ligands have IDs, so the conditional checks at lines 303-304 and the re-assignment loop at lines 306-308 are unreachable dead code. These redundant sections should be removed.
Pull request overview
Copilot reviewed 29 out of 30 changed files in this pull request and generated 3 comments.
src/drug_discovery/docking.py (Outdated)

```diff
-        if len(df) == 0:
-            return df
+        if self.protein.id is not None:
+            params["protein"]["id"] = self.protein.id
+
+        for i, lig in enumerate(ligands):
+            if lig.id is not None:
+                params["ligands"][i]["id"] = lig.id
```
The _build_tool_inputs method contains redundant code that sets the same fields twice. protein.id is set in the initial params dict at line 291, then conditionally re-set at lines 303-304. Ligand IDs are set in the list comprehension at lines 294-299, then conditionally re-set in the loop at lines 306-308. Since __init__ already validates that protein.id, all ligand IDs, and pocket.id are not None, these duplicate assignments are always executed and the conditionals are always true. The duplicate assignments (lines 303-308) should be removed to avoid confusion.
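With the duplicates removed, the construction collapses to a single dict literal. A standalone sketch (plain dicts stand in for the Protein/Pocket/Ligand objects; the function name mirrors but is not the PR's method):

```python
def build_tool_inputs(protein, pocket, ligands, box_size_list, pocket_center):
    """Sketch of _build_tool_inputs with the duplicate ID assignments removed.
    Assumes the constructor has already validated that every ID is present."""
    return {
        "box_size": box_size_list,
        "pocket_center": pocket_center,
        "protein": {
            "id": protein["id"],
            "file_path": protein["file_path"],
        },
        # Each ligand's ID is set exactly once, in the comprehension.
        "ligands": [{"id": lig["id"], "smiles": lig["smiles"]} for lig in ligands],
        "pocket_id": pocket["id"],
    }
```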
```python
while True:
    response = self.search(
        "ligands",
        cursor=cursor,
        filter_dict=filter_dict,
        limit=limit,
        offset=offset,
        select=select,
        sort=sort,
    )
    all_data.extend(response.get("data", []))

    cursor = response.get("meta", {}).get("nextCursor")
    if not cursor:
        break
```
In the search_ligands pagination loop, the offset parameter is passed on every iteration including subsequent pages. When cursor-based pagination is active, using offset on subsequent page requests (where a cursor is already set) may cause incorrect pagination behavior — double-skipping records. The offset should only be applied on the first request and omitted once a cursor is available.
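The fix the comment asks for can be demonstrated against a toy backend (`paginate` and `fake_search` are illustrative stand-ins, not this codebase's API):

```python
def paginate(search, *, offset=0, limit=2):
    """Cursor pagination that applies `offset` only on the first request,
    as the review comment suggests. `search` stands in for self.search(...)."""
    all_data = []
    cursor = None
    first = True
    while True:
        page = search(cursor=cursor, offset=offset if first else None, limit=limit)
        all_data.extend(page["data"])
        first = False
        cursor = page.get("meta", {}).get("nextCursor")
        if not cursor:
            break
    return all_data


# Toy backend: 5 records, pages of 2, the cursor is just the next index.
RECORDS = list(range(5))

def fake_search(*, cursor, offset, limit):
    start = int(cursor) if cursor is not None else (offset or 0)
    chunk = RECORDS[start : start + limit]
    nxt = str(start + limit) if start + limit < len(RECORDS) else None
    return {"data": chunk, "meta": {"nextCursor": nxt}}

print(paginate(fake_search, offset=1))  # → [1, 2, 3, 4]: the offset is applied once
```

If `offset` were passed on every page as well, each follow-up request would skip records the cursor had already advanced past.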
src/drug_discovery/abfe.py (Outdated)

```python
ABFE_TOOL_KEY = "deeporigin.abfe-end-to-end"
ABFE_TOOL_VERSION = "0.2.19"
```
ABFE_TOOL_KEY and ABFE_TOOL_VERSION are defined as local constants in abfe.py rather than being imported from platform/constants.py like DOCKING_TOOL_KEY and DOCKING_TOOL_VERSION are in docking.py. The same ABFE tool key already exists in drug_discovery/constants.py (tool_mapper["ABFE"]). This duplication risks divergence if the key or version is ever updated in one place but not the other.
Pull request overview
Copilot reviewed 29 out of 30 changed files in this pull request and generated 8 comments.
src/drug_discovery/docking.py (Outdated)

```diff
-    for ligand in all_ligands
-    if ligand.smiles not in already_docked_smiles
-]
+response = self.client.results.get_poses(protein_id=self.protein.id)
```
In Docking.get_results, the method queries get_poses(protein_id=self.protein.id) which returns ALL pose records for that protein — not just those belonging to this specific execution (identified by self.id). If multiple docking runs have been performed on the same protein, this method will return results from all of them, not just from the current execution. The filter should also include compute_job_id=self.id to scope results to this specific execution.
Suggested change:

```diff
-response = self.client.results.get_poses(protein_id=self.protein.id)
+response = self.client.results.get_poses(
+    protein_id=self.protein.id,
+    compute_job_id=self.id,
+)
```
```python
instance = object.__new__(cls)

instance.client = client
instance._id = dto["executionId"]
instance._estimate = None
instance._cost = None
instance.tool_key = tool_info["key"]
instance.tool_version = tool_info["version"]

instance.status = dto.get("status")
instance.progress = dto.get("progressReport")
instance._execution_dto = dto
```
In AsyncExecutableMixin.from_id, the method uses object.__new__(cls) which bypasses __init__, so Execution.__init__ is never called. This means _id, _estimate, and _cost are set manually on instance (lines 160-162), but the mixin code itself sets instance._estimate = None and instance._cost = None rather than via the property-backed attributes. If a future subclass adds additional state initialized in __init__, from_id would silently skip that initialization. A comment documenting this responsibility or a pattern for subclasses to hook into would improve maintainability.
src/drug_discovery/docking.py (Outdated)

```python
    inputs["protein"]["id"],
    client=instance.client,
)
fut_ligands = executor.submit(
    LigandSet.from_ids,
    [lig["id"] for lig in inputs["ligands"]],
```
In Docking.from_id, the code accesses inputs["protein"]["id"] and inputs["ligands"] directly (without .get()), which will raise a KeyError with an unhelpful message if these fields are absent (e.g., for executions created by an older version of the code that used a different input schema). Consider using .get() with appropriate error messages, e.g., inputs.get("protein", {}).get("id") and inputs.get("ligands", []), similar to how pocket_id is handled.
```python
def test_terminal_state_blocks_further_transitions(self):
    """Once in Succeeded, no further transitions are allowed."""
    job = AsyncJob("x")
    job._set_status("Created")
    job._set_status("Running")
    job._set_status("Succeeded")
    with pytest.raises(ValueError, match="Invalid status transition"):
        job._set_status("Running")
```
In test_terminal_state_blocks_further_transitions, the test transitions from Created directly to Running (skipping Queued) at line 153. According to ALLOWED_STATUS_TRANSITIONS, "Created" allows transitioning to {"Queued", "Running", "Failed", "Cancelled"}, so Created → Running is valid. This is a correct test. However, the test docstring says "None -> Created -> Queued -> Running -> Succeeded" but the actual path is None -> Created -> Running -> Succeeded (skipping Queued). The docstring is misleading.
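The transition semantics under discussion can be sketched as a table-driven guard. The table below is an assumption, except that Created → {Queued, Running, Failed, Cancelled} is quoted in the comment and the Quoted behavior comes from the maintainer's reply later in the thread; the names mirror the PR's ALLOWED_STATUS_TRANSITIONS and `_set_status`:

```python
# Assumed transition table; only the Created and Quoted rows are
# attested in this review thread. Terminal states allow no transitions.
ALLOWED_STATUS_TRANSITIONS = {
    None: {"Created", "Quoted"},
    "Quoted": {"Running", "Queued"},
    "Created": {"Queued", "Running", "Failed", "Cancelled"},
    "Queued": {"Running", "Failed", "Cancelled"},
    "Running": {"Succeeded", "Failed", "Cancelled"},
    "Succeeded": set(),
    "Failed": set(),
    "Cancelled": set(),
}


class Job:
    def __init__(self):
        self.status = None

    def _set_status(self, new):
        if new not in ALLOWED_STATUS_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"Invalid status transition: {self.status} -> {new}")
        self.status = new
```

Under this table the test's actual path (None → Created → Running → Succeeded) is valid because Queued may be skipped, which is exactly why only the docstring needs fixing.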
src/drug_discovery/abfe_step.py (Outdated)

```python
Objects instantiated here are meant to be used within the Complex class."""

"""tool version to use for ABFE"""
tool_version = "0.2.19"
```
The abfe_step.py file uses the hardcoded string "0.2.19" for tool_version (line 29) instead of using the newly added ABFE_TOOL_VERSION constant from src/platform/constants.py. This creates a maintenance risk as these two values could diverge. The new standalone abfe.py correctly uses ABFE_TOOL_VERSION, so abfe_step.py should be updated to match.
```python
pocket_data["id"] = record["id"]
remote_path = pocket_data["file_path"]
pocket_data["file_path"] = client.files.download_file(
    remote_path=remote_path,
```
In Pocket.from_id, the file is downloaded without lazy=True (line 383-385), whereas other download calls in the same module (e.g., _from_platform_record in ligand.py) use lazy=True. This inconsistency means the pocket file is eagerly downloaded even if it might not be immediately needed. Consider passing lazy=True here to be consistent with the rest of the codebase.
Suggested change:

```diff
 remote_path=remote_path,
+lazy=True,
```
src/platform/results.py (Outdated)

```python
if compute_job_id is not None:
    filter_dict["compute_job_id"] = compute_job_id
```
In _build_result_filter, the compute_job_id is added to the filter dict without an operator wrapper (filter_dict["compute_job_id"] = compute_job_id), while every other field uses {"eq": value}. In the mock server's _apply_search_filters, the else branch (results = [r for r in results if r.get(key) == value]) would handle the raw string comparison correctly; however, the real platform API may expect the same {"eq": ...} operator structure for consistency. The inconsistency between compute_job_id and all other fields (which use {"eq": ...}) is a potential source of bugs against the real API.
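A consistent builder would wrap every field, including compute_job_id, in the operator dict. An illustrative sketch (the field names follow the snippet above; the function name mirrors but is not the PR's helper):

```python
def build_result_filter(*, protein_id=None, ligand_id=None, compute_job_id=None):
    """Sketch of _build_result_filter with every field, including
    compute_job_id, wrapped in the same {"eq": ...} operator structure."""
    filter_dict = {}
    if protein_id is not None:
        filter_dict["protein_id"] = {"eq": protein_id}
    if ligand_id is not None:
        filter_dict["ligand_id"] = {"eq": ligand_id}
    if compute_job_id is not None:
        filter_dict["compute_job_id"] = {"eq": compute_job_id}
    return filter_dict
```

A uniform shape means server-side filter parsing only has to handle one case, and the mock server cannot mask a mismatch with the real API.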
src/drug_discovery/docking.py (Outdated)

```python
for ligand in ligands:
    if ligand.id is None:
        raise ValueError("Ligands must have an ID.")
```
In Docking.__init__, there is a check that requires all ligands to have an ID (if ligand.id is None: raise ValueError("Ligands must have an ID.")). However, the run() sync path uses the functions API and likely works fine with ligands that have no platform ID. This overly strict constraint prevents users from using the sync run() path for ligands that haven't been synced to the platform yet, which breaks the intended workflow described in the docstring ("Sync path: uses the functions API for small ligand sets").
Suggested change:

```diff
-for ligand in ligands:
-    if ligand.id is None:
-        raise ValueError("Ligands must have an ID.")
```
```python
        "Cannot cancel: no execution has been started (id is None)."
    )

cancellable = {"Created", "Queued", "Running"}
```
Is Quoted a non-cancellable state?
The only transition from Quoted is to Running (well, technically it can also go to Queued...).
Pull request overview
Copilot reviewed 29 out of 30 changed files in this pull request and generated 5 comments.
```python
@classmethod
def from_id(cls, id: str, *, client: DeepOriginClient | None = None) -> Self:
    """Construct an instance from an existing platform execution ID.

    Creates a bare instance via ``object.__new__`` (bypassing
    ``__init__``), fetches the execution DTO, and populates the
    common execution fields. Subclasses should call
    ``super().from_id()`` then rehydrate domain-specific fields
    from ``instance._execution_dto["userInputs"]``.

    Args:
        id: Platform execution ID.
        client: Optional API client. Uses the default if not provided.

    Returns:
        A partially-hydrated instance with common fields populated.
    """
    if client is None:
        client = DeepOriginClient.get()

    dto = client.executions.get_execution(execution_id=id)
    tool_info = dto["tool"]

    instance = object.__new__(cls)

    instance.client = client
    instance._id = dto["executionId"]
    instance._estimate = None
    instance._cost = None
    instance.tool_key = tool_info["key"]
    instance.tool_version = tool_info["version"]

    instance.status = dto.get("status")
    instance.progress = dto.get("progressReport")
    instance._execution_dto = dto

    quotation = dto.get("quotationResult", {})
    successful = quotation.get("successfulQuotations", [])
    if successful:
        price = successful[0].get("priceTotal")
        if price is not None:
            instance._estimate = float(price)
        if instance.status == "Succeeded" and price is not None:
            instance._cost = float(price)

    return instance
```
In AsyncExecutableMixin.from_id, object.__new__(cls) bypasses __init__, so none of the initialization in AsyncExecutableMixin.__init__ or Execution.__init__ runs. The method compensates by manually setting client, _id, _estimate, _cost, and _execution_dto on the bare instance, which covers the common fields. However, any subclass __init__ state beyond these (e.g., _protein, _pocket, _ligands for Docking) is entirely absent from the bare instance, and the code relies on subclass overrides of from_id to set them. This pattern is fragile: if a new subclass overrides from_id incorrectly or forgets a field, the instance will silently have missing attributes, with no AttributeError at construction time.
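One way to make this less fragile is a rehydration hook that subclasses must implement, so a forgotten override fails loudly at construction time. A minimal sketch (the `_rehydrate` name and the stubbed DTO fetch are inventions for illustration, not code from this PR):

```python
class Execution:
    @classmethod
    def from_id(cls, id, *, client=None):
        instance = object.__new__(cls)  # bypasses __init__, as in the PR
        instance._id = id
        instance._estimate = None
        instance._cost = None
        # The real code would fetch the DTO from the platform; stubbed here.
        dto = {"userInputs": {}}
        instance._execution_dto = dto
        instance._rehydrate(dto)  # subclasses MUST implement this hook
        return instance

    def _rehydrate(self, dto):
        raise NotImplementedError(
            f"{type(self).__name__} must implement _rehydrate() for from_id()."
        )


class Docking(Execution):
    def _rehydrate(self, dto):
        # Domain-specific state is restored here, never scattered in from_id.
        self._inputs = dto.get("userInputs", {})
```

With this shape, a subclass that forgets the hook raises NotImplementedError immediately instead of producing a half-initialized object.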
```python
pocket_id = inputs.get("pocket_id")
if pocket_id is None:
    raise ValueError(
        "Missing 'pocket_id' in execution userInputs; "
        "this execution may have been created without an associated pocket."
    )
```
In Docking.from_id, the userInputs of the execution DTO from the tools API (via client.executions.get_execution) has the structure {"box_size": ..., "ligands": [...], "pocket_id": ..., "protein": {...}} as set by the mock server ("userInputs": body.get("inputs", {})). The code correctly accesses inputs["protein"]["id"], inputs["ligands"], and inputs.get("pocket_id"). However, if a docking job was submitted with the old docking_step.py format (which uses smiles_list and doesn't include pocket_id in userInputs), this code will raise a ValueError on line 331 saying pocket_id is missing. The error message on lines 332-334 acknowledges this case, which is good, but users with old async docking jobs can never reconstruct them via from_id. Consider documenting this limitation explicitly in the docstring.
```python
def test_cancel_wrong_state_raises(self):
    """cancel() from a non-cancellable state raises ValueError."""
    job = AsyncJob("x")
    job._id = "some-id"
    job.status = "Succeeded"
    with pytest.raises(ValueError, match="Cannot cancel"):
        job.cancel()
```
The test_cancel_wrong_state_raises test (line 224) directly sets job.status = "Succeeded" to simulate a terminal state for testing cancel(). However, if status were enforced as read-only in the future (or via the _set_status guard), this direct assignment would silently bypass validation. This is fine as a test implementation detail now, but it's inconsistent with how the production code uses _set_status. Using job._set_status("Created"); job._set_status("Running"); job._set_status("Succeeded") would be more robust and consistent with the other status transition tests in the same test class.
```python
all_data: list[dict[str, Any]] = []
cursor: str | None = None
is_first_page = True

while True:
    response = self.search(
        "ligands",
        cursor=cursor,
        filter_dict=filter_dict,
        limit=limit,
        offset=offset if is_first_page else None,
        select=select,
        sort=sort,
    )
    all_data.extend(response.get("data", []))
    is_first_page = False

    cursor = response.get("meta", {}).get("nextCursor")
    if not cursor:
        break

response["data"] = all_data
return response
```
In search_ligands, if the loop executes at least one iteration, response holds the last page response and response["data"] is then replaced with all_data (line 272). However, if no pages are returned (empty result, meaning response.get("data", []) is empty on the first call and nextCursor is absent), response still holds the last response from self.search(...), and response["data"] is set to all_data (an empty list). This works correctly. But response is referenced on line 272 outside the while loop body, meaning it's set from the last iteration — if the while True body never executes, Python would raise UnboundLocalError. Practically the loop always executes at least once (it's while True), so this is not a runtime risk, but it would be clearer to initialize response = {} before the loop.
```python
def prepare(
    self,
    *,
    ligands: list[Ligand] | LigandSet,
    re_run: bool,
) -> list[Ligand]:
    """Helper method to determine which ligands need to be run based on already run jobs and re_run flag."""
    if isinstance(ligands, LigandSet):
        ligands = ligands.ligands

    if re_run:
        # we're re-running, so we need to re-run all ligands
        return ligands

    jobs = JobList.list(
        client=self.parent.client,
    )
    df = jobs.filter(
        tool_key=tool_mapper["ABFE"],
        status=["Succeeded", "Running", "Queued", "Created"],
        require_metadata=True,
    ).to_dataframe(
        include_metadata=True,
        resolve_user_names=False,
        client=self.parent.client,
    )

    # Build set of ligand hashes that have already been run
    if len(df) > 0:
        ligand_hashes_already_run = {
            ligand_hash
            for ligand_hash in df["metadata"].apply(
                lambda d: d.get("ligand_hash") if isinstance(d, dict) else None
            )
            if isinstance(ligand_hash, str) and ligand_hash
        }
    else:
        ligand_hashes_already_run = set()

    # no re-run, remove already run ligands
    ligands_to_run = [
        ligand
        for ligand in ligands
        if ligand.to_hash() not in ligand_hashes_already_run
    ]
    return ligands_to_run

@beartype
def check_dt(self):
    """Validate that every "dt" in params is numeric and within allowed bounds.

    Traverses the nested parameters dictionary and validates that each
    occurrence of a key named "dt" has a numeric value within the
    inclusive range [min_dt, max_dt]. If any non-numeric or out-of-range
    values are found, an error is raised listing all offending paths.

    Raises:
        DeepOriginException: If any "dt" value is non-numeric or outside
            the allowed range.
    """
    min_dt = 0.001
    max_dt = 0.004

    def is_number(value) -> bool:
        return isinstance(value, (int, float))

    def find_dt_violations(obj, path: list[str]) -> list[str]:
        """Return list of JSON-like paths with invalid dt values."""
        violations: list[str] = []
        if isinstance(obj, dict):
            for key, value in obj.items():
                next_path = path + [str(key)]
                if key == "dt" and (
                    not is_number(value) or not (min_dt <= float(value) <= max_dt)
                ):
                    violations.append(".".join(next_path))
                # Recurse into nested structures
                if isinstance(value, (dict, list, tuple)):
                    violations.extend(find_dt_violations(value, next_path))
        elif isinstance(obj, (list, tuple)):
            for idx, value in enumerate(obj):
                next_path = path + [str(idx)]
                if isinstance(value, (dict, list, tuple)):
                    violations.extend(find_dt_violations(value, next_path))
        return violations

    violations = find_dt_violations(self._params, ["_params"])
    if violations:
        paths = ", ".join(violations)
        raise DeepOriginException(
            f"Found invalid dt values; must be numeric and within range [{min_dt}, {max_dt}]. Offending paths: {paths}"
        ) from None
```

```python
def prepare(
    self,
    *,
    padding: float = 1.0,
    retain_waters: bool = False,
    add_H_atoms: bool = False,
    protonate_protein: bool = False,
) -> Protein:
    """Run system preparation for the protein-ligand pair.

    This step produces the solvation and binding XML files required
    by ``start()``. The resulting prepared system PDB is returned as
    a ``Protein`` object and can be visualised.

    Args:
        padding: Padding around the system box in nm. Defaults to 1.0.
        retain_waters: Keep water molecules from the crystal structure.
        add_H_atoms: Add hydrogen atoms to the ligand.
        protonate_protein: Protonate the protein.

    Returns:
        A ``Protein`` built from the prepared system PDB.

    Raises:
        ValueError: If the ligand is charged (ABFE does not support charged ligands).
    """
    raise NotImplementedError("ABFE preparation is not supported yet.")
```
In ABFE.prepare(), the method declares parameters (padding, retain_waters, add_H_atoms, protonate_protein) and documents that it returns a Protein, but the body unconditionally raises NotImplementedError. The declared return type -> Protein is misleading since the function never returns. Either keep -> Protein but make the docstring state that raising NotImplementedError is the sole current behavior (and drop the misleading parameter and return docs), or annotate the return type as NoReturn.
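The second option, sketched as a minimal stand-alone version (the surrounding ABFE class is reduced to a stub here; only the annotation and docstring shape are the point):

```python
from typing import NoReturn

class ABFE:
    def prepare(
        self,
        *,
        padding: float = 1.0,
        retain_waters: bool = False,
        add_H_atoms: bool = False,
        protonate_protein: bool = False,
    ) -> NoReturn:
        """System preparation for the protein-ligand pair.

        Not implemented yet; the parameters are reserved for the
        future implementation.

        Raises:
            NotImplementedError: Always, until preparation is supported.
        """
        raise NotImplementedError("ABFE preparation is not supported yet.")
```

With `NoReturn`, type checkers flag any code that tries to use the "result" of `prepare()`, which matches the actual behavior until the feature lands.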