Speed Up Tool::Schema Validation by 5x to 100x #369
Open
koic wants to merge 1 commit into
## Motivation and Context

`MCP::Tool::Schema` previously used the pure-Ruby `json-schema` gem for both construction-time metaschema validation and runtime argument / result validation. For deep schemas this cost ~32-100ms per construction (see issue modelcontextprotocol#364). `json_schemer` (https://rubygems.org/gems/json_schemer, v2.5.0) is an actively maintained alternative that is much faster on this workload. Switching reduces cold-start warming pressure in horizontally scaled multi-tenant deployments (the scenario raised in modelcontextprotocol#364) and brings runtime data validation down to sub-millisecond per call once the per-instance schemer is memoized.

The implementation preserves the legacy behavior of the `json-schema` based code. Metaschema is pinned to draft-04 (`meta_schema:` option); the `$schema` dialect URI is still emitted in `to_h` as 2020-12 but runtime validation continues to use draft-04, exactly as before. Moving the runtime validator itself to 2020-12 is out of scope for this PR and warrants its own discussion. `format` keywords are not enforced (`format: false` option), matching what `json-schema`'s draft-04 path did. Malformed schemas (e.g. an invalid `pattern` regex) continue to surface as `ArgumentError, "Invalid JSON Schema: ..."`; the `RegexpError` `json_schemer` raises during construction is wrapped. `Symbol` values in argument data are coerced to strings inside the internal `stringify` helper so they validate against `type: "string"` the same way they did before.

The public `Schema#schema` accessor is dropped. It was a refactor artifact from modelcontextprotocol#198 (the consolidation of `InputSchema`'s `properties` / `required` readers into a single hash) with no callers in `lib/`; tests that needed the merged hash now read it through `to_h`. The accessor's only remaining role would have been to expose a mutation path that could desynchronize the memoized `@schemer`, which this change removes.
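The `Symbol` coercion described above can be pictured with a small pure-Ruby sketch. This is not the helper from the PR: the name `deep_stringify` and its exact traversal are assumptions made for illustration, and the real `stringify` implementation may differ.

```ruby
# Hypothetical stand-in for the internal `stringify` helper (assumed
# name and behavior). It walks hashes and arrays, turning Symbol keys
# and Symbol values into Strings so that a value like :alice can be
# validated against `type: "string"`.
def deep_stringify(value)
  case value
  when Hash
    # Stringify every key and recurse into every value.
    value.each_with_object({}) { |(k, v), out| out[k.to_s] = deep_stringify(v) }
  when Array
    value.map { |v| deep_stringify(v) }
  when Symbol
    value.to_s
  else
    # Numbers, strings, booleans, and nil pass through unchanged.
    value
  end
end

args = { name: :alice, tags: [:a, "b"], count: 3 }
p deep_stringify(args)
```

Coercing before validation keeps the validator itself unaware of Ruby-specific types, which is one plausible reason to place the conversion in a helper rather than in the schema.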
Runtime dependency delta: drops `json-schema` (and `addressable`), adds `json_schemer`, `hana`, `regexp_parser`, and `simpleidn`. Each added gem is a single-purpose library tied to JSON Schema spec compliance.

Closes modelcontextprotocol#364.

## How Has This Been Tested?

Side-by-side benchmark on representative schemas (simple, with `$ref` / `additionalProperties`, nested, depth 20, depth 40): construction-time metaschema validation is 4.7x to ~100x faster; runtime data validation with the memoized schemer is sub-millisecond across all sizes.

Four existing tests had their stub targets or message assertions updated to track the new validator while preserving the original test intent. The "unexpected errors bubble up" tests in `input_schema_test.rb` and `output_schema_test.rb` now stub `JSONSchemer::Schema#validate` instead of `JSON::Validator.fully_validate`. The cache tests in `schema_test.rb` now stub `JSONSchemer::Schema#validate_schema`. The two "detailed error message" tests in `tool_test.rb` switched from asserting `json-schema`'s exact wording to format-agnostic substrings (`"properties/count/minimum"`, `"number"`). The "required arguments are converted to strings" test in `input_schema_test.rb` now reads the result through `to_h[:required]` instead of the removed `schema` accessor.

Regression tests added for the legacy-behavior preservations above: `format` keyword is not enforced, invalid `pattern` raises `ArgumentError` (not `RegexpError`), and `Symbol` values validate against `type: "string"`.

## Breaking Changes

`Schema#schema` is no longer a public method. The accessor had no callers outside the test suite, but anyone reading the internal representation directly (rather than through `to_h`) will need to switch. `Schema` / `InputSchema` / `OutputSchema` otherwise keep their constructor signatures, `to_h` output, `ValidationError` / `ArgumentError` raise points, and the `$schema` dialect emission.
The validator error message wording inside `ArgumentError` and `ValidationError` now comes from `json_schemer` and still includes the JSON pointer path and a description of the mismatch.
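The preserved `ArgumentError` raise point for malformed schemas can be illustrated with a minimal sketch of the wrapping pattern. It is not the PR's code: `build_pattern_validator` is a hypothetical name, and the standard library's `Regexp.new` stands in for the validator construction step that raises `RegexpError`.

```ruby
# Hypothetical builder (assumed name). Regexp.new is a stand-in for the
# construction step that can raise RegexpError on an invalid `pattern`.
def build_pattern_validator(schema)
  Regexp.new(schema.fetch("pattern"))
rescue RegexpError => e
  # Re-raise under the error class and message prefix callers already
  # rescue, so the underlying validator's exception type does not leak.
  raise ArgumentError, "Invalid JSON Schema: #{e.message}"
end

begin
  build_pattern_validator({ "pattern" => "(" }) # unbalanced parenthesis
rescue ArgumentError => e
  puts e.message
end
```

Callers that rescue `ArgumentError` around schema construction keep working under this pattern; only the text after the `Invalid JSON Schema: ` prefix depends on the underlying library.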