Classify TaskCancelledException as client error instead of system error#5273
Classify TaskCancelledException as client error instead of system error#5273ahkcs wants to merge 4 commits intoopensearch-project:mainfrom
Conversation
When a user cancels a PPL query, the TaskCancelledException is wrapped by Calcite (RuntimeException → SQLException → TaskCancelledException) and reaches the REST error handler as a generic exception, resulting in a 500 status code and incrementing the system error metric. Since query cancellation is user-initiated, it should be classified as a client error (400) to avoid inflating system failure metrics. Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Failed to generate code suggestions for PR |
TaskCancelledException extends OpenSearchException, so ErrorMessageFactory.unwrapCause() was finding it and creating an OpenSearchErrorMessage with exception.status() (500), overriding the 400 status passed from the REST handler. Handle TaskCancelledException before the OpenSearchException check to use the passed-in status for both the HTTP response and JSON body. Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Failed to generate code suggestions for PR |
Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Failed to generate code suggestions for PR |
ErrorMessage.fetchReason() returns "Invalid Query" for all 400 status codes. Now that TaskCancelledException returns 400, the reason should be "Query cancelled" instead of "Invalid Query". Signed-off-by: Kai Huang <ahkcs@amazon.com>
|
Failed to generate code suggestions for PR |
|
Note: |
|
If we still need this after error enhancements, we can do it by expanding on the OpenSearchException check in the new interface to add some overridden exception types: https://github.com/opensearch-project/sql/pull/5266/changes#diff-3ad41d4c6d562cf927e38e1186467082b21d979bbeb732b87f73651bed1c3fa2R65-R67 |
Summary
When a user cancels a PPL query via the
_tasks/_cancelAPI, theTaskCancelledExceptionis wrapped by Calcite (RuntimeException → SQLException → TaskCancelledException) and reaches the REST error handler as a generic exception. This results in:PPL_FAILED_REQ_COUNT_SYS(system error metric) instead ofPPL_FAILED_REQ_COUNT_CUS(client error metric)Since query cancellation is user-initiated, it should not be classified as a system error.
Changes
hasCauseOf()helper to walk the exception cause chainTaskCancelledExceptiontoisClientError()via cause chain checkTaskCancelledExceptionRelated
Test plan
TaskCancelledExceptionclassified as client errorTaskCancelledExceptionclassified as client errorRuntimeExceptionstill classified as system errorOSD side log: