
Jobs refactor#463

Merged
sg-s merged 17 commits into main from jobs-refactor
Mar 9, 2026

Conversation

@sg-s
Collaborator

@sg-s sg-s commented Mar 5, 2026

changes

[ still in progress]

Contributor

Copilot AI left a comment


Pull request overview

This PR implements a "jobs-centric" redesign of the drug discovery execution API. It introduces a composable class hierarchy (Execution base + QuoteMixin/SyncExecutableMixin/AsyncExecutableMixin mixins) and replaces the previous workflow-step pattern with standalone job classes (Docking, ABFE, PocketFinder). The old classes are preserved as docking_step.py/abfe_step.py for backward compatibility with the Complex class. The Results.get() API is also refactored to accept a raw filter_dict instead of named filter parameters.

Changes:

  • New standalone Docking, ABFE, and PocketFinder job classes with composable execution mixins (Execution, AsyncExecutableMixin, etc.)
  • Results.get() refactored to accept a filter_dict; a new _build_result_filter() helper handles construction of filter dicts
  • Old docking.py and abfe.py WorkflowStep classes renamed to docking_step.py and abfe_step.py

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/drug_discovery/execution.py New base class with read-only id/estimate/cost properties and status transition logic
src/drug_discovery/execution_mixins.py New mixins: QuoteMixin, SyncExecutableMixin, AsyncExecutableMixin, JupyterVizMixin
src/drug_discovery/docking.py New standalone Docking class replacing old WorkflowStep-based one
src/drug_discovery/abfe.py New standalone ABFE class replacing old WorkflowStep-based one
src/drug_discovery/pocket_finder.py New standalone PocketFinder class
src/drug_discovery/docking_step.py Old Docking WorkflowStep class (kept for Complex compatibility)
src/drug_discovery/abfe_step.py Old ABFE WorkflowStep class (kept for Complex compatibility)
src/drug_discovery/complex.py Updated imports to use renamed _step.py files
src/platform/results.py Refactored Results.get() to accept filter_dict; added _build_result_filter() helper
src/platform/constants.py Added PlatformStatus literal type and ALLOWED_STATUS_TRANSITIONS dict; bumped PocketFinder version
src/drug_discovery/utils/__init__.py Simplified _start_tool_run signature (removed output_dir_path, provider)
src/drug_discovery/__init__.py Exports new PocketFinder, Docking, ABFE, Execution, PlatformStatus
src/jobs/__init__.py New backward-compatibility shim re-exporting job classes
src/drug_discovery/constants.py Added PocketFinder to valid_tools and tool_mapper
src/drug_discovery/structures/pocket.py Added Pocket.from_id() classmethod and pocket_count/pocket_min_size args to from_result()
src/drug_discovery/structures/ligand.py Added LigandSet.from_ids() classmethod
tests/test_jobs_base.py New tests for Execution base class, mixins, and PocketFinder
tests/test_jobs_docking.py New tests for standalone Docking class
tests/test_jobs_abfe.py New tests for standalone ABFE class
tests/test_results.py Updated tests for refactored Results.get() API
tests/test_functions_with_data_platform.py Updated call site to use new filter_dict API
jobs-refactor-plan.md Design document describing the refactor



from deeporigin.drug_discovery.abfe import ABFE
from deeporigin.drug_discovery.docking import Docking
from deeporigin.drug_discovery.execution import Execution, PlatformStatus

Copilot AI Mar 6, 2026


PlatformStatus is imported from deeporigin.drug_discovery.execution in both src/jobs/__init__.py (line 10) and src/drug_discovery/__init__.py (line 66), but PlatformStatus is not defined or re-exported in src/drug_discovery/execution.py — it is only defined in src/platform/constants.py. This will raise an ImportError at runtime whenever PlatformStatus is accessed via either of these import paths. execution.py needs to either re-export PlatformStatus with from deeporigin.platform.constants import PlatformStatus, or both call sites need to import directly from deeporigin.platform.constants.

Suggested change

-from deeporigin.drug_discovery.execution import Execution, PlatformStatus
+from deeporigin.drug_discovery.execution import Execution
+from deeporigin.platform.constants import PlatformStatus

Comment on lines +148 to +151
    for key, val in output_files.items():
        if isinstance(val, dict) and "system_pdb_file_path" in val:
            pass


Copilot AI Mar 6, 2026


The elif isinstance(output_files, dict) branch at lines 147-150 is dead code — it enters the branch but only executes pass, discarding any dict-shaped output_files without extracting binding_xml or solvation_xml. This means if the API ever returns output_files as a dict (rather than a list), self.binding_xml_path and self.solvation_xml_path will remain None, causing quote() and start() to raise ValueError. Either extract the XML paths from the dict format or remove the branch entirely and document the expected format.

Suggested change

-        for key, val in output_files.items():
-            if isinstance(val, dict) and "system_pdb_file_path" in val:
-                pass
+        def _extract_xml_paths(obj) -> None:
+            nonlocal binding_xml, solvation_xml
+            if isinstance(obj, dict):
+                for v in obj.values():
+                    _extract_xml_paths(v)
+            elif isinstance(obj, list):
+                for v in obj:
+                    _extract_xml_paths(v)
+            elif isinstance(obj, str):
+                if obj.endswith("bsm_system.xml"):
+                    binding_xml = obj
+                elif obj.endswith("solvation.xml"):
+                    solvation_xml = obj
+
+        _extract_xml_paths(output_files)

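The recursive traversal proposed in the suggestion can be exercised standalone. A minimal self-contained sketch, with an invented sample payload (the file names mirror the suffixes the suggestion checks for; the dict shape is illustrative, not the actual API response):

```python
def extract_xml_paths(obj):
    """Recursively walk nested dicts/lists and pick out the two XML paths."""
    binding_xml = None
    solvation_xml = None

    def _walk(o):
        nonlocal binding_xml, solvation_xml
        if isinstance(o, dict):
            for v in o.values():
                _walk(v)
        elif isinstance(o, list):
            for v in o:
                _walk(v)
        elif isinstance(o, str):
            if o.endswith("bsm_system.xml"):
                binding_xml = o
            elif o.endswith("solvation.xml"):
                solvation_xml = o

    _walk(obj)
    return binding_xml, solvation_xml


# Dict-shaped payload — the case the dead `pass` branch currently drops.
output_files = {
    "complex": {"system_pdb_file_path": "run/complex.pdb", "xml": "run/bsm_system.xml"},
    "solvent": ["run/solvation.xml"],
}
print(extract_xml_paths(output_files))  # → ('run/bsm_system.xml', 'run/solvation.xml')
```

Because the walk is shape-agnostic, the same helper covers both the list-shaped and dict-shaped `output_files` formats the comment worries about.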
Comment on lines +70 to +75
def docking_with_smiles(protein, pocket, ligand_set, client):
    """Create a Docking instance using a second LigandSet (was SMILES path)."""
    return Docking(
        protein=protein,
        pocket=pocket,
        ligands=ligand_set,

Copilot AI Mar 6, 2026


The docking_with_smiles fixture (lines 70-77) is identical to docking_with_ligands — both pass ligands=ligand_set and do not use the smiles_list argument. The associated test_with_smiles_list test (line 99) claims to test "Docking constructed with smiles_list converts to a LigandSet", but it never exercises the smiles_list code path. The fixture should instead pass smiles_list=["CCO", "CCCO"] (without ligand IDs pre-assigned, since smiles_list conversion happens internally) to correctly cover that constructor branch.

Suggested change

-def docking_with_smiles(protein, pocket, ligand_set, client):
-    """Create a Docking instance using a second LigandSet (was SMILES path)."""
-    return Docking(
-        protein=protein,
-        pocket=pocket,
-        ligands=ligand_set,
+def docking_with_smiles(protein, pocket, client):
+    """Create a Docking instance using a SMILES list."""
+    return Docking(
+        protein=protein,
+        pocket=pocket,
+        smiles_list=["CCO", "CCCO"],

Comment on lines +353 to +379

Removed (old batched run() implementation):

    metadata = {
        "protein_file": protein_basename,
        "protein_hash": self.parent.protein.to_hash(),
    }

    if batch_size is None and n_workers is None:
        raise DeepOriginException(
            "Either batch_size or n_workers must be specified."
        ) from None
    elif batch_size is not None and n_workers is not None:
        print(
            "Both batch_size and n_workers are specified. Using n_workers to determine batch_size..."
        )

    if n_workers is not None:
        batch_size = math.ceil(len(self.parent.ligands) / n_workers)
        print(f"Using a batch size of {batch_size}")

    if pocket is not None:
        box_size = float(2 * np.cbrt(pocket.volume))
        box_size = [box_size, box_size, box_size]
        pocket_center = pocket.get_center().tolist()

    # Get all ligands and filter out already docked ones
    all_ligands = list(self.parent.ligands)

    df = self.get_jobs_df(pocket_center=pocket_center, box_size=box_size)

    already_docked_smiles = set()

    if not re_run:
        for _, row in df.iterrows():
            user_inputs = row.get("user_inputs", {})
            # Handle both old format (smiles_list) and new format (ligands array)
            if "smiles_list" in user_inputs:
                already_docked_smiles.update(user_inputs["smiles_list"])
            elif "ligands" in user_inputs:
                already_docked_smiles.update(
                    ligand.get("smiles", "") for ligand in user_inputs["ligands"]
                )

    # Filter ligands to only include those not already docked
    ligands_to_dock = [
        ligand
        for ligand in all_ligands
        if ligand.smiles not in already_docked_smiles
    ]

    # Sort ligands by SMILES for consistent batching
    ligands_to_dock = sorted(ligands_to_dock, key=lambda ligand: ligand.smiles)

    job_ids = []

    chunks = list(more_itertools.chunked(ligands_to_dock, batch_size))

    def process_chunk(chunk):
        params = {
            "box_size": list(box_size),
            "pocket_center": list(pocket_center),
            "protein": {
                "file_path": self.parent.protein._remote_path,
            },
            "ligands": [
                {
                    "smiles": ligand.smiles,
                }
                for ligand in chunk
            ],
        }

        # Include protein ID if available
        if self.parent.protein.id is not None:
            params["protein"]["id"] = self.parent.protein.id

        # Include ligand IDs if available
        for i, ligand in enumerate(chunk):
            if ligand.id is not None:
                params["ligands"][i]["id"] = ligand.id

        # Include pocket_id if pocket is provided and has an ID
        if pocket is not None and hasattr(pocket, "id") and pocket.id is not None:
            params["pocket_id"] = pocket.id

        execution_dto = utils._start_tool_run(
            params=params,
            metadata=metadata,
            tool="Docking",
            tool_version=self.tool_version,
            client=self.parent.client,
            output_dir_path=output_dir_path,
            approve_amount=approve_amount,
        )
        return execution_dto

    if len(ligands_to_dock) > 0:
        if use_parallel:
            with concurrent.futures.ThreadPoolExecutor() as executor:
                # Submit all chunks to the executor
                future_to_chunk = {
                    executor.submit(process_chunk, chunk): chunk for chunk in chunks
                }

                # Process results with progress bar
                for future in tqdm(
                    concurrent.futures.as_completed(future_to_chunk),
                    total=len(chunks),
                    desc="Starting docking jobs",
                ):
                    execution_dto = future.result()
                    if execution_dto is not None:
                        job_ids.append(execution_dto)
        else:
            for chunk in tqdm(
                chunks, total=len(chunks), desc="Starting docking jobs"
            ):
                execution_dto = process_chunk(chunk)
                if execution_dto is not None:
                    job_ids.append(execution_dto)
    else:
        raise DeepOriginException(
            title="Cannot run Docking: no new ligands to dock",
            message="No new ligands to dock. All ligands have already been docked.",
            fix=" If you want to re-run the docking, set <code>re_run=True</code>.",
            level="warning",
        ) from None

    jobs = [
        Job.from_dto(execution_dto, client=self.parent.client)
        for execution_dto in job_ids
    ]

    self.jobs = JobList(jobs)

    return self.jobs

Added (new get_results() implementation):

    def get_results(self) -> LigandSet | None:
        """Retrieve docking results from the platform.

        Downloads result SDF files and returns a ``LigandSet``.

        Returns:
            A ``LigandSet`` of docked poses, or ``None`` if no results yet.
        """
        response = self.client.results.get_poses(protein_id=self.protein.id)
        records = response.get("data", [])
        remote_paths: list[str] = []
        for record in records:
            data = record.get("data", {})
            fp = data.get("file_path")
            if fp:
                remote_paths.append(fp)

        if not remote_paths:
            return None

        local_paths = self.client.files.download_files(files=remote_paths, lazy=True)

        result = LigandSet()
        for path in local_paths:
            result.ligands += LigandSet.from_sdf(path).ligands

        return result

Copilot AI Mar 6, 2026


According to the design specification in jobs-refactor-plan.md (line 400), calling get_results() before start() (when there is no execution ID) should "raise an error". However, the current implementation of get_results() in Docking does not check whether self.id is None before querying the results API — it simply queries by protein_id and may return results from any previous docking run for that protein, not necessarily the one started by this instance. Compare this with ABFE.get_results(), which correctly raises ValueError("No execution has been started...") when self.id is None. Consider adding a similar guard here.

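A sketch of the guard the comment asks for, mirroring the ABFE behavior it cites. The class body below is a stub for illustration only, not the PR's actual Docking class:

```python
class Docking:
    def __init__(self):
        self.id = None  # populated once start() submits an execution

    def get_results(self):
        # Guard: fail loudly instead of returning poses from unrelated runs
        if self.id is None:
            raise ValueError(
                "No execution has been started; call start() before get_results()."
            )
        # ...query poses scoped to this execution...


job = Docking()
try:
    job.get_results()
except ValueError as e:
    print(e)  # prints the "No execution has been started" message
```

With this guard in place, the scoping problem (results leaking in from other runs on the same protein) can only occur once an execution ID actually exists.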
        tool="Docking",
        tool_version=self.tool_version,
        client=self.parent.client,
        output_dir_path=output_dir_path,

Copilot AI Mar 6, 2026


The _start_tool_run function signature was refactored to remove the output_dir_path and provider parameters. However, both docking_step.py (line 465) and abfe_step.py (line 388) still pass output_dir_path=output_dir_path as a keyword argument to _start_tool_run. Since @beartype is applied to _start_tool_run and the parameter no longer exists in the signature, calling Complex.docking.run(...) or Complex.abfe.run(...) will raise a TypeError at runtime, breaking the existing Complex-based workflow.

These callers need to be updated: instead of passing output_dir_path to _start_tool_run, they should either pass the outputs as a pre-built outputs dict or remove the parameter entirely (since the new _start_tool_run now accepts an outputs dict directly).

Suggested change

-        output_dir_path=output_dir_path,

Comment on lines +382 to +388
    execution_dto = utils._start_tool_run(
        metadata=metadata,
        params=params,
        tool="ABFE",
        tool_version=self.tool_version,
        client=self.parent.client,
        output_dir_path=output_dir_path,

Copilot AI Mar 6, 2026


Same issue as in docking_step.py: abfe_step.py still passes output_dir_path=output_dir_path (line 388) to _start_tool_run, which no longer accepts that parameter. This will raise a TypeError at runtime when calling Complex.abfe.run(...). The call needs to be updated to construct and pass an outputs dict directly instead.

Suggested change

-    execution_dto = utils._start_tool_run(
-        metadata=metadata,
-        params=params,
-        tool="ABFE",
-        tool_version=self.tool_version,
-        client=self.parent.client,
-        output_dir_path=output_dir_path,
+    outputs = {"output_dir_path": output_dir_path}
+    execution_dto = utils._start_tool_run(
+        metadata=metadata,
+        params=params,
+        tool="ABFE",
+        tool_version=self.tool_version,
+        client=self.parent.client,
+        outputs=outputs,

Comment on lines +286 to +311

Removed (old get_jobs_df-based code):

                "id": record.get("id"),
                "tool_id": record.get("tool_id"),
                "tool_version": record.get("tool_version"),
                "ligand_id": data.get("ligand_id"),
                "protein_id": data.get("protein_id"),
                "binding_energy": data.get("binding_energy"),
                "pose_score": data.get("pose_score"),
                "file_path": data.get("file_path"),
            }
        )

    return pd.DataFrame(rows)

    @beartype
    def get_jobs_df(
        self,
        *,
        pocket_center: Optional[tuple[Number, Number, Number] | list[Number]] = None,
        box_size: Optional[tuple[Number, Number, Number] | list[Number]] = None,
    ):
        """search for all jobs that match this protein and ligands in the Job DB, and return a dataframe of the results

        Args:
            pocket_center: Optional tuple of (x, y, z) coordinates to filter by pocket center
            box_size: Optional tuple of (x, y, z) dimensions to filter by box size
        """
        # Use parent method with docking-specific parameters
        df = super().get_jobs_df(
            include_metadata=True,
            include_inputs=True,
            include_outputs=True,
            status=list(NON_FAILED_STATES),
        )

Added (new _build_tool_inputs body):

    params = {
        "box_size": box_size_list,
        "pocket_center": pocket_center,
        "protein": {
            "id": self.protein.id,
            "file_path": self.protein._remote_path,
        },
        "ligands": [
            {
                "id": lig.id,
                "smiles": lig.smiles,
            }
            for lig in ligands
        ],
    }

    if self.protein.id is not None:
        params["protein"]["id"] = self.protein.id

    for i, lig in enumerate(ligands):
        if lig.id is not None:
            params["ligands"][i]["id"] = lig.id

    if hasattr(self.pocket, "id") and self.pocket.id is not None:
        params["pocket_id"] = self.pocket.id

Copilot AI Mar 6, 2026


The _build_tool_inputs method contains redundant code. The protein ID is set twice: first in the params dict literal at line 291 ("id": self.protein.id), then again conditionally at lines 303-304. Similarly, ligand IDs are already included in the list comprehension at lines 294-299 ("id": lig.id), but then set again in the loop at lines 306-308. The constructor already enforces that protein, pocket, and all ligands have IDs, so the conditional checks at lines 303-304 and the re-assignment loop at lines 306-308 are unreachable dead code. These redundant sections should be removed.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 30 changed files in this pull request and generated 3 comments.



Comment on lines +303 to +308

-        if len(df) == 0:
-            return df
+        if self.protein.id is not None:
+            params["protein"]["id"] = self.protein.id
+
+        for i, lig in enumerate(ligands):
+            if lig.id is not None:
+                params["ligands"][i]["id"] = lig.id

Copilot AI Mar 7, 2026


The _build_tool_inputs method contains redundant code that sets the same fields twice. protein.id is set in the initial params dict at line 291, then conditionally re-set at lines 303-304. Ligand IDs are set in the list comprehension at lines 294-299, then conditionally re-set in the loop at lines 306-308. Since __init__ already validates that protein.id, all ligand IDs, and pocket.id are not None, these duplicate assignments are always executed and the conditionals are always true. The duplicate assignments (lines 303-308) should be removed to avoid confusion.

Comment on lines +254 to +268
        while True:
            response = self.search(
                "ligands",
                cursor=cursor,
                filter_dict=filter_dict,
                limit=limit,
                offset=offset,
                select=select,
                sort=sort,
            )
            all_data.extend(response.get("data", []))

            cursor = response.get("meta", {}).get("nextCursor")
            if not cursor:
                break

Copilot AI Mar 7, 2026


In the search_ligands pagination loop, the offset parameter is passed on every iteration including subsequent pages. When cursor-based pagination is active, using offset on subsequent page requests (where a cursor is already set) may cause incorrect pagination behavior — double-skipping records. The offset should only be applied on the first request and omitted once a cursor is available.

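The fix the comment describes, apply `offset` only on the first request and rely solely on the cursor afterwards, can be demonstrated standalone. Here `fake_search` stands in for `self.search` and the page contents are invented:

```python
# Three fake pages keyed by cursor; None means "first request".
PAGES = {
    None: {"data": [1, 2], "meta": {"nextCursor": "c1"}},
    "c1": {"data": [3, 4], "meta": {"nextCursor": "c2"}},
    "c2": {"data": [5], "meta": {}},  # no nextCursor: last page
}


def fake_search(*, cursor=None, offset=None):
    # A real backend would skip `offset` records; here we only enforce
    # the invariant the review comment calls out.
    assert not (cursor and offset), "offset must not accompany a cursor"
    return PAGES[cursor]


def search_all(offset=0):
    all_data, cursor, is_first_page = [], None, True
    while True:
        resp = fake_search(cursor=cursor, offset=offset if is_first_page else None)
        all_data.extend(resp.get("data", []))
        is_first_page = False
        cursor = resp.get("meta", {}).get("nextCursor")
        if not cursor:
            break
    return all_data


print(search_all())  # → [1, 2, 3, 4, 5]
```

Passing `offset` alongside a cursor would double-skip records on the second and later pages; dropping it after the first request keeps cursor pagination authoritative.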
Comment on lines +22 to +23
ABFE_TOOL_KEY = "deeporigin.abfe-end-to-end"
ABFE_TOOL_VERSION = "0.2.19"

Copilot AI Mar 7, 2026


ABFE_TOOL_KEY and ABFE_TOOL_VERSION are defined as local constants in abfe.py rather than being imported from platform/constants.py like DOCKING_TOOL_KEY and DOCKING_TOOL_VERSION are in docking.py. The same ABFE tool key already exists in drug_discovery/constants.py (tool_mapper["ABFE"]). This duplication risks divergence if the key or version is ever updated in one place but not the other.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 30 changed files in this pull request and generated 8 comments.



-            for ligand in all_ligands
-            if ligand.smiles not in already_docked_smiles
-        ]
+        response = self.client.results.get_poses(protein_id=self.protein.id)

Copilot AI Mar 9, 2026


In Docking.get_results, the method queries get_poses(protein_id=self.protein.id) which returns ALL pose records for that protein — not just those belonging to this specific execution (identified by self.id). If multiple docking runs have been performed on the same protein, this method will return results from all of them, not just from the current execution. The filter should also include compute_job_id=self.id to scope results to this specific execution.

Suggested change

-        response = self.client.results.get_poses(protein_id=self.protein.id)
+        response = self.client.results.get_poses(
+            protein_id=self.protein.id,
+            compute_job_id=self.id,
+        )

Comment on lines +157 to +168
        instance = object.__new__(cls)

        instance.client = client
        instance._id = dto["executionId"]
        instance._estimate = None
        instance._cost = None
        instance.tool_key = tool_info["key"]
        instance.tool_version = tool_info["version"]

        instance.status = dto.get("status")
        instance.progress = dto.get("progressReport")
        instance._execution_dto = dto

Copilot AI Mar 9, 2026


In AsyncExecutableMixin.from_id, the method uses object.__new__(cls) which bypasses __init__, so Execution.__init__ is never called. This means _id, _estimate, and _cost are set manually on instance (lines 160-162), but the mixin code itself sets instance._estimate = None and instance._cost = None rather than via the property-backed attributes. If a future subclass adds additional state initialized in __init__, from_id would silently skip that initialization. A comment documenting this responsibility or a pattern for subclasses to hook into would improve maintainability.

Comment on lines +339 to +344
            inputs["protein"]["id"],
            client=instance.client,
        )
        fut_ligands = executor.submit(
            LigandSet.from_ids,
            [lig["id"] for lig in inputs["ligands"]],

Copilot AI Mar 9, 2026


In Docking.from_id, the code accesses inputs["protein"]["id"] and inputs["ligands"] directly (without .get()), which will raise a KeyError with an unhelpful message if these fields are absent (e.g., for executions created by an older version of the code that used a different input schema). Consider using .get() with appropriate error messages, e.g., inputs.get("protein", {}).get("id") and inputs.get("ligands", []), similar to how pocket_id is handled.

Comment on lines +149 to +156
    def test_terminal_state_blocks_further_transitions(self):
        """Once in Succeeded, no further transitions are allowed."""
        job = AsyncJob("x")
        job._set_status("Created")
        job._set_status("Running")
        job._set_status("Succeeded")
        with pytest.raises(ValueError, match="Invalid status transition"):
            job._set_status("Running")

Copilot AI Mar 9, 2026


In test_terminal_state_blocks_further_transitions, the test transitions from Created directly to Running (skipping Queued) at line 153. According to ALLOWED_STATUS_TRANSITIONS, "Created" allows transitioning to {"Queued", "Running", "Failed", "Cancelled"}, so Created → Running is valid. This is a correct test. However, the test docstring says "None -> Created -> Queued -> Running -> Succeeded" but the actual path is None -> Created -> Running -> Succeeded (skipping Queued). The docstring is misleading.

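The transition guard under test can be sketched self-contained. The table below is reconstructed from details scattered through this review (Created may go directly to Running; Quoted goes only to Running, or technically Queued; cancellable states are Created/Queued/Running; terminal states block everything), so treat it as illustrative rather than an exact copy of the one in src/platform/constants.py:

```python
# Illustrative reconstruction of ALLOWED_STATUS_TRANSITIONS — inferred
# from this review, not copied from src/platform/constants.py.
ALLOWED_STATUS_TRANSITIONS = {
    None: {"Created", "Quoted"},
    "Created": {"Queued", "Running", "Failed", "Cancelled"},
    "Quoted": {"Queued", "Running"},
    "Queued": {"Running", "Failed", "Cancelled"},
    "Running": {"Succeeded", "Failed", "Cancelled"},
    # Terminal states: no outgoing transitions allowed.
    "Succeeded": set(),
    "Failed": set(),
    "Cancelled": set(),
}


class Execution:
    def __init__(self):
        self.status = None

    def _set_status(self, new_status):
        allowed = ALLOWED_STATUS_TRANSITIONS.get(self.status, set())
        if new_status not in allowed:
            raise ValueError(
                f"Invalid status transition: {self.status!r} -> {new_status!r}"
            )
        self.status = new_status


job = Execution()
for s in ("Created", "Running", "Succeeded"):  # Queued may be skipped
    job._set_status(s)
print(job.status)  # → Succeeded
try:
    job._set_status("Running")  # terminal state blocks further transitions
except ValueError as e:
    print("blocked:", e)
```

This is exactly the shape the test above exercises: a valid Created → Running shortcut followed by a rejected transition out of a terminal state.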
    Objects instantiated here are meant to be used within the Complex class."""

    """tool version to use for ABFE"""
    tool_version = "0.2.19"

Copilot AI Mar 9, 2026


The abfe_step.py file uses the hardcoded string "0.2.19" for tool_version (line 29) instead of using the newly added ABFE_TOOL_VERSION constant from src/platform/constants.py. This creates a maintenance risk as these two values could diverge. The new standalone abfe.py correctly uses ABFE_TOOL_VERSION, so abfe_step.py should be updated to match.

        pocket_data["id"] = record["id"]
        remote_path = pocket_data["file_path"]
        pocket_data["file_path"] = client.files.download_file(
            remote_path=remote_path,

Copilot AI Mar 9, 2026


In Pocket.from_id, the file is downloaded without lazy=True (line 383-385), whereas other download calls in the same module (e.g., _from_platform_record in ligand.py) use lazy=True. This inconsistency means the pocket file is eagerly downloaded even if it might not be immediately needed. Consider passing lazy=True here to be consistent with the rest of the codebase.

Suggested change

-            remote_path=remote_path,
+            remote_path=remote_path,
+            lazy=True,

Comment on lines +54 to +55
if compute_job_id is not None:
filter_dict["compute_job_id"] = compute_job_id
Copy link

Copilot AI Mar 9, 2026


In _build_result_filter, the compute_job_id is added to the filter dict without an operator wrapper (filter_dict["compute_job_id"] = compute_job_id), while every other field uses {"eq": value}. In the mock server's _apply_search_filters, the else branch (results = [r for r in results if r.get(key) == value]) would handle the raw string comparison correctly; however, the real platform API may expect the same {"eq": ...} operator structure for consistency. The inconsistency between compute_job_id and all other fields (which use {"eq": ...}) is a potential source of bugs against the real API.

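The consistent alternative the comment argues for, wrapping every field in the `{"eq": ...}` operator form, is easy to sketch. This is a hypothetical stand-in for `_build_result_filter`, with an invented parameter list:

```python
def build_result_filter(*, protein_id=None, ligand_id=None, compute_job_id=None):
    """Build a filter dict where every present field uses the {"eq": ...} form."""
    filter_dict = {}
    for key, value in (
        ("protein_id", protein_id),
        ("ligand_id", ligand_id),
        ("compute_job_id", compute_job_id),  # wrapped like every other field
    ):
        if value is not None:
            filter_dict[key] = {"eq": value}
    return filter_dict


print(build_result_filter(protein_id="p1", compute_job_id="j9"))
# → {'protein_id': {'eq': 'p1'}, 'compute_job_id': {'eq': 'j9'}}
```

Routing all fields through one loop makes it structurally impossible for one field to drift out of the operator format while the others keep it.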
Comment on lines 96 to 99
        for ligand in ligands:
            if ligand.id is None:
                raise ValueError("Ligands must have an ID.")


Copilot AI Mar 9, 2026


In Docking.__init__, there is a check that requires all ligands to have an ID (if ligand.id is None: raise ValueError("Ligands must have an ID.")). However, the run() sync path uses the functions API and likely works fine with ligands that have no platform ID. This overly strict constraint prevents users from using the sync run() path for ligands that haven't been synced to the platform yet, which breaks the intended workflow described in the docstring ("Sync path: uses the functions API for small ligand sets").

Suggested change

-        for ligand in ligands:
-            if ligand.id is None:
-                raise ValueError("Ligands must have an ID.")
                "Cannot cancel: no execution has been started (id is None)."
            )

        cancellable = {"Created", "Queued", "Running"}


Is Quoted a non-cancellable state?

Collaborator Author


yes.

Collaborator Author


the only transition from quoted is to "running" (well technically, it can go to queued...)

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 29 out of 30 changed files in this pull request and generated 5 comments.



Comment on lines +134 to +179
    @classmethod
    def from_id(cls, id: str, *, client: DeepOriginClient | None = None) -> Self:
        """Construct an instance from an existing platform execution ID.

        Creates a bare instance via ``object.__new__`` (bypassing
        ``__init__``), fetches the execution DTO, and populates the
        common execution fields. Subclasses should call
        ``super().from_id()`` then rehydrate domain-specific fields
        from ``instance._execution_dto["userInputs"]``.

        Args:
            id: Platform execution ID.
            client: Optional API client. Uses the default if not provided.

        Returns:
            A partially-hydrated instance with common fields populated.
        """
        if client is None:
            client = DeepOriginClient.get()

        dto = client.executions.get_execution(execution_id=id)
        tool_info = dto["tool"]

        instance = object.__new__(cls)

        instance.client = client
        instance._id = dto["executionId"]
        instance._estimate = None
        instance._cost = None
        instance.tool_key = tool_info["key"]
        instance.tool_version = tool_info["version"]

        instance.status = dto.get("status")
        instance.progress = dto.get("progressReport")
        instance._execution_dto = dto

        quotation = dto.get("quotationResult", {})
        successful = quotation.get("successfulQuotations", [])
        if successful:
            price = successful[0].get("priceTotal")
            if price is not None:
                instance._estimate = float(price)
            if instance.status == "Succeeded" and price is not None:
                instance._cost = float(price)

        return instance

Copilot AI Mar 9, 2026


In AsyncExecutableMixin.from_id, object.__new__(cls) is used to bypass __init__, but the resulting instance is missing the _execution_dto attribute initialization that normally happens in AsyncExecutableMixin.__init__ (line 72). The code then sets instance._execution_dto = dto on line 168, so it does get set — but the _estimate and _cost attributes that come from Execution.__init__ (lines 41-42) are set manually here on lines 161-162. However, Execution.client is set on line 159, which is correct. The concern is that any subclass __init__ state beyond what's explicitly set here (e.g., _protein, _pocket, _ligands for Docking) is entirely absent from the bare instance, and the code relies on subclasses to set these in overrides of from_id. This pattern is fragile: if a new subclass overrides from_id incorrectly or forgets a field, the instance will silently have missing attributes without an AttributeError at construction time.

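One way to make the `object.__new__` pattern less fragile, in the spirit of the comment: funnel all DTO-driven initialization through a single hook that subclasses extend. The names (`from_dto`, `_init_from_dto`) are illustrative, not the PR's actual API:

```python
class Execution:
    @classmethod
    def from_dto(cls, dto):
        instance = object.__new__(cls)  # deliberately bypass __init__
        instance._init_from_dto(dto)    # single, overridable init path
        return instance

    def _init_from_dto(self, dto):
        self._id = dto["executionId"]
        self.status = dto.get("status")
        self._estimate = None
        self._cost = None


class Docking(Execution):
    def _init_from_dto(self, dto):
        # Forgetting this super() call is now the only way to end up with
        # a half-initialized instance — and it fails fast with AttributeError.
        super()._init_from_dto(dto)
        self.protein_id = dto.get("userInputs", {}).get("protein", {}).get("id")


job = Docking.from_dto(
    {"executionId": "x1", "status": "Running",
     "userInputs": {"protein": {"id": "p7"}}}
)
print(job._id, job.protein_id)  # → x1 p7
```

Every subclass then has exactly one place to declare its rehydration state, instead of each `from_id` override re-listing the base fields.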
Comment on lines +329 to +334
        pocket_id = inputs.get("pocket_id")
        if pocket_id is None:
            raise ValueError(
                "Missing 'pocket_id' in execution userInputs; "
                "this execution may have been created without an associated pocket."
            )

Copilot AI Mar 9, 2026


In Docking.from_id, the userInputs of the execution DTO from the tools API (via client.executions.get_execution) has the structure {"box_size": ..., "ligands": [...], "pocket_id": ..., "protein": {...}} as set by the mock server ("userInputs": body.get("inputs", {})). The code correctly accesses inputs["protein"]["id"], inputs["ligands"], and inputs.get("pocket_id"). However, if a docking job was submitted with the old docking_step.py format (which uses smiles_list and doesn't include pocket_id in userInputs), this code will raise a ValueError on line 331 saying pocket_id is missing. The error message on lines 332-334 acknowledges this case, which is good, but users with old async docking jobs can never reconstruct them via from_id. Consider documenting this limitation explicitly in the docstring.

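A defensive sketch of what tolerating both formats might look like. The parse_docking_inputs helper and the dict shapes are hypothetical, inferred from the comment, not the real API:

```python
# Hypothetical helper that accepts both the old docking_step.py userInputs
# format (smiles_list, no pocket_id) and the new one (ligands + pocket_id).
def parse_docking_inputs(inputs: dict) -> dict:
    ligands = inputs.get("ligands")
    if ligands is None and "smiles_list" in inputs:
        # legacy format: wrap each SMILES string in a ligand-like dict
        ligands = [{"smiles": s} for s in inputs["smiles_list"]]
    pocket_id = inputs.get("pocket_id")  # None for legacy jobs
    return {"ligands": ligands, "pocket_id": pocket_id}


legacy = parse_docking_inputs({"smiles_list": ["CCO"]})
# legacy["pocket_id"] is None; legacy["ligands"] == [{"smiles": "CCO"}]
current = parse_docking_inputs(
    {"ligands": [{"id": "l1"}], "pocket_id": "p1", "protein": {"id": "x"}}
)
# current["pocket_id"] == "p1"
```

Whether from_id should degrade gracefully like this or keep raising for legacy jobs is a design choice; either way the docstring should state which formats are reconstructable.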
Comment on lines +220 to +226
def test_cancel_wrong_state_raises(self):
"""cancel() from a non-cancellable state raises ValueError."""
job = AsyncJob("x")
job._id = "some-id"
job.status = "Succeeded"
with pytest.raises(ValueError, match="Cannot cancel"):
job.cancel()
Copilot AI Mar 9, 2026

The test_cancel_wrong_state_raises test (line 224) directly sets job.status = "Succeeded" to simulate a terminal state for testing cancel(). However, if status were enforced as read-only in the future (or via the _set_status guard), this direct assignment would silently bypass validation. This is fine as a test implementation detail now, but it's inconsistent with how the production code uses _set_status. Using job._set_status("Created"); job._set_status("Running"); job._set_status("Succeeded") would be more robust and consistent with the other status transition tests in the same test class.

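For illustration, a minimal guarded-transition sketch in the spirit of the _set_status described here. The transition table and Job class are hypothetical, not the production code:

```python
# Hypothetical status guard: each state lists the states it may move to.
ALLOWED = {
    None: {"Created"},
    "Created": {"Running", "Cancelled"},
    "Running": {"Succeeded", "Failed", "Cancelled"},
}


class Job:
    def __init__(self):
        self.status = None

    def _set_status(self, new: str) -> None:
        if new not in ALLOWED.get(self.status, set()):
            raise ValueError(f"Cannot transition {self.status} -> {new}")
        self.status = new


job = Job()
for state in ("Created", "Running", "Succeeded"):
    job._set_status(state)
print(job.status)  # Succeeded
```

Under a guard like this, a test that assigns `job.status = "Succeeded"` directly would bypass validation, whereas walking the chain with `_set_status` exercises the same path the production code uses.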
Comment on lines +251 to +273
all_data: list[dict[str, Any]] = []
cursor: str | None = None
is_first_page = True

while True:
response = self.search(
"ligands",
cursor=cursor,
filter_dict=filter_dict,
limit=limit,
offset=offset if is_first_page else None,
select=select,
sort=sort,
)
all_data.extend(response.get("data", []))
is_first_page = False

cursor = response.get("meta", {}).get("nextCursor")
if not cursor:
break

response["data"] = all_data
return response
Copilot AI Mar 9, 2026

In search_ligands, response is assigned inside the while True loop and referenced on line 272 after the loop, where response["data"] is replaced with the accumulated all_data. Because the loop body always executes at least once, response is always bound, and the empty-result case works correctly (all_data is simply an empty list). Still, referencing a loop-local variable after the loop is fragile: if the loop were ever restructured so the body could be skipped, Python would raise UnboundLocalError. It would be clearer to initialize response = {} before the loop.

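The suggested fix is a one-line initialisation. A self-contained sketch of the same loop shape, where fetch_page stands in for self.search(...) and is invented for the example:

```python
# Fake two-page backend: the first page points at cursor "c1",
# the second page has no nextCursor, ending the pagination.
def fetch_page(cursor):
    pages = {
        None: {"data": [1, 2], "meta": {"nextCursor": "c1"}},
        "c1": {"data": [3], "meta": {}},
    }
    return pages[cursor]


def search_all():
    all_data = []
    response = {}  # initialised so the final assignment is always safe
    cursor = None
    while True:
        response = fetch_page(cursor)
        all_data.extend(response.get("data", []))
        cursor = response.get("meta", {}).get("nextCursor")
        if not cursor:
            break
    # last page's envelope, with data replaced by the full accumulation
    response["data"] = all_data
    return response


print(search_all()["data"])  # [1, 2, 3]
```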
Comment on lines +77 to +103
-    def prepare(
-        self,
-        *,
-        ligands: list[Ligand] | LigandSet,
-        re_run: bool,
-    ) -> list[Ligand]:
-        """Helper method to determine which ligands need to be run based on already run jobs and re_run flag."""
-
-        if isinstance(ligands, LigandSet):
-            ligands = ligands.ligands
-
-        if re_run:
-            # we're re-running, so we need to re-run all ligands
-            return ligands
-
-        jobs = JobList.list(
-            client=self.parent.client,
-        )
-        df = jobs.filter(
-            tool_key=tool_mapper["ABFE"],
-            status=["Succeeded", "Running", "Queued", "Created"],
-            require_metadata=True,
-        ).to_dataframe(
-            include_metadata=True,
-            resolve_user_names=False,
-            client=self.parent.client,
-        )
-
-        # Build set of ligand names that have already been run
-        if len(df) > 0:
-            ligand_hashes_already_run = {
-                ligand_hash
-                for ligand_hash in df["metadata"].apply(
-                    lambda d: d.get("ligand_hash") if isinstance(d, dict) else None
-                )
-                if isinstance(ligand_hash, str) and ligand_hash
-            }
-        else:
-            ligand_hashes_already_run = set()
-
-        # no re-run, remove already run ligands
-        ligands_to_run = [
-            ligand
-            for ligand in ligands
-            if ligand.to_hash() not in ligand_hashes_already_run
-        ]
-        return ligands_to_run
-
-    @beartype
-    def check_dt(self):
-        """Validate that every "dt" in params is numeric and within allowed bounds.
-
-        Traverses the nested parameters dictionary and validates that each
-        occurrence of a key named "dt" has a numeric value within the
-        inclusive range [min_dt, max_dt]. If any non-numeric or out-of-range
-        values are found, an error is raised listing all offending paths.
-
-        Raises:
-            DeepOriginException: If any "dt" value is non-numeric or outside
-                the allowed range.
-        """
-        min_dt = 0.001
-        max_dt = 0.004
-
-        def is_number(value) -> bool:
-            return isinstance(value, (int, float))
-
-        def find_dt_violations(obj, path: list[str]) -> list[str]:
-            """Return list of JSON-like paths with invalid dt values."""
-            violations: list[str] = []
-            if isinstance(obj, dict):
-                for key, value in obj.items():
-                    next_path = path + [str(key)]
-                    if key == "dt" and (
-                        not is_number(value) or not (min_dt <= float(value) <= max_dt)
-                    ):
-                        violations.append(".".join(next_path))
-                    # Recurse into nested structures
-                    if isinstance(value, (dict, list, tuple)):
-                        violations.extend(find_dt_violations(value, next_path))
-            elif isinstance(obj, (list, tuple)):
-                for idx, value in enumerate(obj):
-                    next_path = path + [str(idx)]
-                    if isinstance(value, (dict, list, tuple)):
-                        violations.extend(find_dt_violations(value, next_path))
-            return violations
-
-        violations = find_dt_violations(self._params, ["_params"])
-        if violations:
-            paths = ", ".join(violations)
-            raise DeepOriginException(
-                f"Found invalid dt values; must be numeric and within range [{min_dt}, {max_dt}]. Offending paths: {paths}"
-            ) from None
+    def prepare(
+        self,
+        *,
+        padding: float = 1.0,
+        retain_waters: bool = False,
+        add_H_atoms: bool = False,
+        protonate_protein: bool = False,
+    ) -> Protein:
+        """Run system preparation for the protein-ligand pair.
+
+        This step produces the solvation and binding XML files required
+        by ``start()``. The resulting prepared system PDB is returned as
+        a ``Protein`` object and can be visualised.
+
+        Args:
+            padding: Padding around the system box in nm. Defaults to 1.0.
+            retain_waters: Keep water molecules from the crystal structure.
+            add_H_atoms: Add hydrogen atoms to the ligand.
+            protonate_protein: Protonate the protein.
+
+        Returns:
+            A ``Protein`` built from the prepared system PDB.
+
+        Raises:
+            ValueError: If the ligand is charged (ABFE does not support charged ligands).
+        """
+        raise NotImplementedError("ABFE preparation is not supported yet.")
Copilot AI Mar 9, 2026

In ABFE.prepare(), the method declares parameters (padding, retain_waters, add_H_atoms, protonate_protein) and documents that it returns a Protein, but the body unconditionally raises NotImplementedError, so the declared -> Protein return type is misleading. Either document the NotImplementedError as the sole behavior (removing the misleading parameter and return docs), or annotate the return type as NoReturn.

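A minimal sketch of the NoReturn alternative (hypothetical class; the real ABFE.prepare signature differs):

```python
from typing import NoReturn


class ABFESketch:
    def prepare(self) -> NoReturn:
        """Not yet supported.

        Raises:
            NotImplementedError: always, until preparation is implemented.
        """
        raise NotImplementedError("ABFE preparation is not supported yet.")


try:
    ABFESketch().prepare()
except NotImplementedError as exc:
    print(exc)  # ABFE preparation is not supported yet.
```

Type checkers treat a NoReturn-annotated call as never producing a value, so callers assigning the result are flagged instead of silently receiving nothing.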