@Tyler-IN Tyler-IN commented Dec 24, 2025

This PR reduces UTF-16 string materialization and intermediate byte[]/string copies across MCP JSON serialization/deserialization and tool-result content handling, while preserving existing public APIs and behaviors by default.

Motivation and Context

Several hot paths (tool result content, JSON-RPC messaging, content block parsing, and URI template processing) were creating avoidable UTF-16 strings and/or staging buffers (e.g., GetRawText(), Encoding.UTF8.GetBytes(...), ToArray() patterns). This change set focuses on enabling more UTF-8-first processing and streaming writes, minimizing allocations/copies and making it easier to keep data as UTF-8 until (and unless) a string is actually required.

How Has This Been Tested?

  • dotnet test --filter "(Execution!=Manual)" (passes locally)
  • Total: 4843, Failed: 0, Succeeded: 4653, Skipped: 190

Breaking Changes

None intended.

  • Default behavior remains: text content materializes as TextContentBlock unless opted in.
  • Additive APIs/options/types were introduced to enable UTF-8-first behavior without forcing downstream changes.
  • The public API is intended to be fully backward compatible once this PR is ready.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

Key changes in this branch include:

  • Added Utf8TextContentBlock and a configurable option to choose whether JSON “text” blocks materialize as TextContentBlock (default) or Utf8TextContentBlock (opt-in), with tests parameterized accordingly.
  • Improved TextContentBlock serialization/deserialization paths to reduce UTF-16 transitions where feasible (prefer operating on UTF-8 bytes and caching string only when needed).
  • Reduced ToArray() / staging patterns in tool content result paths and tests; replaced literal UTF-8 conversions with "..."u8 where applicable.
  • Introduced shared process-path resolution utilities (including PATHEXT handling on Windows) to reliably locate executables such as npx in conformance tests.
  • Optimized UriTemplate processing to reduce allocations (including netstandard2.0-compatible approaches).
  • Changed the intended lifespan of CancellationTokenSource (CTS) instances to indefinite: the design relies on them being detached from the object graph and cleaned up later when the GC finalizes them. Currently they are still disposed in Dispose, but each one is replaced with a CanceledTokenSource, which is never disposed and only returns canceled tokens. This avoids ObjectDisposedException when checking whether the tokens are canceled.
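To illustrate the "..."u8 replacement mentioned in the list above, here is a minimal sketch contrasting a run-time conversion with a C# 11 UTF-8 literal. The class and method names are hypothetical and are not taken from this branch.

```csharp
using System;
using System.Text;

// Hypothetical demo type, not part of the PR.
public static class Utf8LiteralDemo
{
    // Transcodes the UTF-16 literal at run time and allocates a new byte[] on every call.
    public static byte[] Transcoded() => Encoding.UTF8.GetBytes("hello");

    // C# 11 UTF-8 literal: the bytes are stored in the assembly's data section,
    // so no run-time transcoding or heap allocation is needed.
    public static ReadOnlySpan<byte> Literal() => "hello"u8;
}
```

Both members yield the same five bytes; only the second avoids the per-call allocation.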

Copilot AI and others added 7 commits December 24, 2025 02:52
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
@Tyler-IN

Based on #1070

@Tyler-IN

Tyler-IN commented Dec 24, 2025

Key protocol / model updates

  • ContentBlock JSON converter now reads the "text" property directly into UTF-8 bytes (including unescaping) without first materializing a UTF-16 string.
  • Added Utf8TextContentBlock ("type":"text") to allow hot paths to keep text payloads as UTF-8; TextContentBlock now supports a cached UTF-16 string backed by a Utf8Text buffer.
  • Added opt-in deserialization behavior to materialize Utf8TextContentBlock instead of TextContentBlock via McpJsonUtilities.CreateOptions(materializeUtf8TextContentBlocks: true); DefaultOptions remains compatible and continues to materialize TextContentBlock.
  • ImageContentBlock and AudioContentBlock still have Data as a base64 string; added DataUtf8 and DecodedData helpers to avoid repeated conversions.
  • BlobResourceContents updated to cache and interoperate between (1) base64 string (Blob), (2) base64 UTF-8 bytes (BlobUtf8), and (3) decoded bytes (DecodedData).
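For readers unfamiliar with reading a JSON string straight into UTF-8, the idea behind the first bullet can be sketched roughly as follows. This is not the converter from this branch: Utf8StringReader and ReadPropertyAsUtf8 are hypothetical names, and the sketch assumes .NET 7 or later for Utf8JsonReader.CopyString.

```csharp
using System;
using System.Buffers;
using System.Text.Json;

// Hypothetical helper, not part of the PR.
public static class Utf8StringReader
{
    // Returns the string value of the first top-level property named `propertyName`
    // as unescaped UTF-8 bytes, without ever creating a System.String.
    public static byte[]? ReadPropertyAsUtf8(ReadOnlySpan<byte> json, ReadOnlySpan<byte> propertyName)
    {
        var reader = new Utf8JsonReader(json);
        if (!reader.Read() || reader.TokenType != JsonTokenType.StartObject)
            throw new JsonException("Expected a JSON object.");

        while (reader.Read() && reader.TokenType == JsonTokenType.PropertyName)
        {
            bool match = reader.ValueTextEquals(propertyName);
            reader.Read(); // advance from the property name to its value
            if (match && reader.TokenType == JsonTokenType.String)
                return CopyUnescaped(ref reader);
            reader.Skip(); // skips nested objects/arrays; a no-op for primitive values
        }
        return null;
    }

    private static byte[] CopyUnescaped(ref Utf8JsonReader reader)
    {
        // The escaped length is an upper bound for the unescaped UTF-8 length.
        int maxLength = reader.HasValueSequence
            ? checked((int)reader.ValueSequence.Length)
            : reader.ValueSpan.Length;
        byte[] rented = ArrayPool<byte>.Shared.Rent(maxLength);
        try
        {
            int written = reader.CopyString(rented); // unescapes while copying
            return rented.AsSpan(0, written).ToArray();
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

The final ToArray() copy keeps the sample simple; a real converter would more likely hand out the pooled or owned buffer directly.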

Serialization / JSON-RPC improvements

  • JsonRpcMessage converter now determines concrete message type (request/notification/response/error) by scanning the top-level payload with Utf8JsonReader.ValueTextEquals and skipping values in-place, avoiding JsonElement.GetRawText() / UTF-16 round-trips.
  • McpJsonUtilities now exposes CreateOptions(...) to produce an options instance that can override the ContentBlock converter (materializeUtf8TextContentBlocks) while still chaining MEAI type info resolver support.
  • Added McpTextUtilities helpers (UTF-8 decode, base64 encode on older TFMs, and common whitespace checks used by transports).
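The message-type detection described in the first bullet can be approximated with the sketch below. It is an illustration under assumptions, not the converter in this branch: JsonRpcKind and JsonRpcClassifier are hypothetical names, and the classification rules are the usual JSON-RPC 2.0 ones (a request has "method" and "id", a notification has "method" only, and a response has "id" plus either "result" or "error").

```csharp
using System;
using System.Text.Json;

// Hypothetical types, not part of the PR.
public enum JsonRpcKind { Unknown, Request, Notification, Response, Error }

public static class JsonRpcClassifier
{
    // Scans only the top-level property names of a UTF-8 payload; values are
    // skipped in place, so no JsonElement or string is materialized.
    public static JsonRpcKind Classify(ReadOnlySpan<byte> utf8Json)
    {
        var reader = new Utf8JsonReader(utf8Json);
        if (!reader.Read() || reader.TokenType != JsonTokenType.StartObject)
            throw new JsonException("Expected a JSON object.");

        bool hasId = false, hasMethod = false, hasResult = false, hasError = false;
        while (reader.Read() && reader.TokenType == JsonTokenType.PropertyName)
        {
            if (reader.ValueTextEquals("id"u8)) hasId = true;
            else if (reader.ValueTextEquals("method"u8)) hasMethod = true;
            else if (reader.ValueTextEquals("result"u8)) hasResult = true;
            else if (reader.ValueTextEquals("error"u8)) hasError = true;

            reader.Skip(); // from a property name, Skip() moves past the whole value
        }

        if (hasMethod) return hasId ? JsonRpcKind.Request : JsonRpcKind.Notification;
        if (hasId && hasError) return JsonRpcKind.Error;
        if (hasId && hasResult) return JsonRpcKind.Response;
        return JsonRpcKind.Unknown;
    }
}
```

A converter could then rewind (Utf8JsonReader is a struct, so a copy taken before scanning still points at the start) and deserialize into the concrete type.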

Transport / streaming changes (stdio + streams)

  • StreamClientSessionTransport and StreamServerTransport reading loops were refactored to a newline-delimited (LF) byte scanner:
    • Parses messages directly from pooled byte buffers + a reusable MemoryStream buffer.
    • Handles both LF and CRLF by trimming a trailing '\r' after splitting on '\n'.
    • Avoids UTF-16 materialization for parsing and (optionally) for logging by only decoding to string when trace logging is enabled.
  • The previous StreamClientSessionTransport implementation became TextStreamClientSessionTransport for the "text writer/reader" client transport.
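The LF scanning and '\r' trimming described above can be sketched as follows. LineScanner and ExtractLines are hypothetical names; skipping empty lines and copying each line into its own array are simplifying assumptions of this sample, whereas the transports in this branch are described as parsing from pooled buffers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the PR.
public static class LineScanner
{
    // Extracts every complete LF-terminated line from `buffer`, trims one trailing
    // '\r' so CRLF input works, and returns the number of bytes consumed.
    // Bytes after the last '\n' are left for the caller to keep until more data arrives.
    public static int ExtractLines(ReadOnlySpan<byte> buffer, List<byte[]> lines)
    {
        int consumed = 0;
        while (true)
        {
            int newline = buffer.Slice(consumed).IndexOf((byte)'\n');
            if (newline < 0)
                break; // no complete line left in the buffer

            ReadOnlySpan<byte> line = buffer.Slice(consumed, newline);
            if (!line.IsEmpty && line[^1] == (byte)'\r')
                line = line[..^1];

            if (!line.IsEmpty)
                lines.Add(line.ToArray()); // copy only to keep the sample simple

            consumed += newline + 1;
        }
        return consumed;
    }
}
```

Because the split happens on bytes, a message is only decoded to a string if something (for example trace logging) actually asks for one.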

HTTP transport / client plumbing

  • McpHttpClient now uses JsonContent.Create on NET TFMs and a new JsonTypeInfoHttpContent<T> on non-NET TFMs to serialize via JsonTypeInfo without buffering to compute Content-Length.
  • StreamableHttpClientSessionTransport and SseClientSessionTransport disposal paths now consistently cancel and "defuse" CTS instances to reduce races/leaks during teardown.
  • StreamableHttpSession updates activity tracking and shutdown ordering (dispose transport first, then cancel, then await server run) for cleaner termination.
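As a rough picture of what "serialize via JsonTypeInfo without buffering to compute Content-Length" can look like, here is a minimal HttpContent sketch. StreamingJsonContent<T> is a hypothetical name used for illustration; it is not the JsonTypeInfoHttpContent<T> type from this branch and may differ from it in detail.

```csharp
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;
using System.Text.Json.Serialization.Metadata;
using System.Threading.Tasks;

// Hypothetical type, not part of the PR.
public sealed class StreamingJsonContent<T> : HttpContent
{
    private readonly T _value;
    private readonly JsonTypeInfo<T> _typeInfo;

    public StreamingJsonContent(T value, JsonTypeInfo<T> typeInfo)
    {
        _value = value;
        _typeInfo = typeInfo;
        Headers.ContentType = new MediaTypeHeaderValue("application/json") { CharSet = "utf-8" };
    }

    // Writes UTF-8 JSON directly to the outgoing stream instead of building
    // an intermediate string or byte[] first.
    protected override Task SerializeToStreamAsync(Stream stream, TransportContext? context)
        => JsonSerializer.SerializeAsync(stream, _value, _typeInfo);

    // Reporting an unknown length avoids pre-serializing the payload just to
    // measure it; over HTTP/1.1 the request is then typically sent chunked.
    protected override bool TryComputeLength(out long length)
    {
        length = 0;
        return false;
    }
}
```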

Cancellation / lifecycle utilities

  • Added CanceledTokenSource: a singleton already-canceled CancellationTokenSource plus a Defuse(ref cts, ...) helper to safely swap out mutable CTS fields during disposal.
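A minimal sketch of the swap-on-dispose idea follows. The class name, the single-parameter Defuse signature, and the cancel-then-dispose order are assumptions made for illustration; the helper in this branch takes additional arguments that are not reproduced here.

```csharp
using System.Threading;

// Hypothetical type, not part of the PR.
public static class CanceledTokenSourceSketch
{
    // Shared source that is canceled once and never disposed, so reading its
    // Token can never throw ObjectDisposedException.
    public static readonly CancellationTokenSource Instance = CreateCanceled();

    private static CancellationTokenSource CreateCanceled()
    {
        var cts = new CancellationTokenSource();
        cts.Cancel();
        return cts;
    }

    // Atomically replaces the field with the shared canceled instance, then
    // cancels and disposes the source that was there before.
    public static void Defuse(ref CancellationTokenSource field)
    {
        CancellationTokenSource previous = Interlocked.Exchange(ref field, Instance);
        if (ReferenceEquals(previous, Instance))
            return; // already defused; never dispose the shared instance

        try { previous.Cancel(); }
        finally { previous.Dispose(); }
    }
}
```

After Defuse, code that races with disposal and reads the field's Token sees an already-canceled token rather than hitting a disposed source.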

URI template parsing

  • UriTemplate parsing/formatting updated with a more explicit template-expression regex and improved query expression handling; uses GeneratedRegex and (on NET) a non-backtracking regex option for performance.
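For context, a simplified template-expression regex might look like the sketch below. The pattern is an assumption based on the RFC 6570 operator set and is not the regex used in this branch; it ignores the explode (*) and prefix (:n) modifiers. It also constructs the Regex at run time rather than through GeneratedRegex, to keep the sample independent of the source generator, and it requires .NET 7 or later for RegexOptions.NonBacktracking.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the PR.
public static class UriTemplateExpressions
{
    // {op?var1,var2,...} where op is one of the RFC 6570 operators.
    // NonBacktracking guarantees matching time linear in the input length.
    private static readonly Regex Expression = new(
        @"\{(?<op>[+#./;?&])?(?<vars>[A-Za-z0-9_%.]+(?:,[A-Za-z0-9_%.]+)*)\}",
        RegexOptions.NonBacktracking | RegexOptions.CultureInvariant);

    // Returns each expression's operator ("" when absent) and its variable names.
    public static List<(string Operator, string[] Variables)> Parse(string template)
    {
        var result = new List<(string Operator, string[] Variables)>();
        foreach (Match match in Expression.Matches(template))
        {
            result.Add((match.Groups["op"].Value, match.Groups["vars"].Value.Split(',')));
        }
        return result;
    }
}
```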

Tests / samples

  • Added ProcessStartInfoUtilities to robustly locate executables on PATH and to handle Windows .cmd/.bat invocation semantics when UseShellExecute=false; updated integration tests accordingly. (This is needed for wrappers such as npx.cmd.)
  • Added TextMaterializationTestHelpers to let tests run with either TextContentBlock or Utf8TextContentBlock materialization.
  • Updated tests and samples to align with the new base64-string Data representation for image/audio blocks and to avoid unnecessary allocations in transport tests.
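The PATH/PATHEXT lookup mentioned in the first bullet can be sketched roughly as below. ExecutableLocator and FindOnPath are hypothetical names, the fallback extension list is an assumption, and the sample does not check the executable bit on Unix or wrap .cmd/.bat files in a cmd.exe invocation the way the utilities in this branch are described as doing.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the PR.
public static class ExecutableLocator
{
    // Returns the full path of the first match for `command` on PATH, or null.
    // On Windows, a command without an extension is tried with each PATHEXT entry.
    public static string? FindOnPath(string command)
    {
        string[] directories = (Environment.GetEnvironmentVariable("PATH") ?? "")
            .Split(Path.PathSeparator, StringSplitOptions.RemoveEmptyEntries);

        string[] extensions = OperatingSystem.IsWindows() && !Path.HasExtension(command)
            ? (Environment.GetEnvironmentVariable("PATHEXT") ?? ".COM;.EXE;.BAT;.CMD")
                .Split(';', StringSplitOptions.RemoveEmptyEntries)
            : new[] { "" };

        foreach (string directory in directories)
        {
            foreach (string extension in extensions)
            {
                string candidate = Path.Combine(directory, command + extension);
                if (File.Exists(candidate))
                    return candidate;
            }
        }
        return null;
    }
}
```

With UseShellExecute=false, Process.Start does not consult PATHEXT, which is why a bare "npx" can fail on Windows while "npx.cmd" resolved this way does not.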

Behavioral notes / compatibility

  • Wire format remains MCP/JSON-RPC compatible: messages are still newline-delimited JSON; CRLF continues to work.
  • The Utf8TextContentBlock materialization is opt-in via McpJsonUtilities.CreateOptions(materializeUtf8TextContentBlocks: true); default behavior preserves prior materialized types for text blocks.

@Tyler-IN Tyler-IN marked this pull request as draft December 24, 2025 13:15
@Tyler-IN Tyler-IN changed the title from "Perf/utf8 contentblock streaming" to "Reduce copies, eliminate a lot of UTF-16 transcoding" Dec 24, 2025
@stephentoub stephentoub added the "NO MERGE" label (PR should not be merged until the label is removed) Dec 24, 2025
@Tyler-IN Tyler-IN changed the title from "Reduce copies, eliminate a lot of UTF-16 transcoding" to "WIP: Reduce copies, eliminate a lot of UTF-16 transcoding" Dec 24, 2025
@Tyler-IN Tyler-IN force-pushed the perf/utf8-contentblock-streaming branch 4 times, most recently from 29fa71b to 30422d2 December 24, 2025 16:12
… transports

This change set focuses on reducing allocations and unnecessary transcoding across the MCP wire path (JSON-RPC + MCP content), especially for line-delimited stream transports (stdio / raw streams) and for text content blocks that frequently originate as UTF-8 already.

@Tyler-IN Tyler-IN force-pushed the perf/utf8-contentblock-streaming branch from 30422d2 to 50a8255 December 24, 2025 16:31