High CPU usage (100%) due to infinite retry loop on CancelledError in retry_on_closed_resource after upgrading to v1.20.0+ with FastAPI/uvicorn #4009

@stonyme

Description

Upgrading google-adk-python from v1.18.0 to v1.20.0 or v1.21.0 causes 100% CPU usage on the Gunicorn workers running our FastAPI app (built with get_fast_api_app). The problem appears to stem from the retry_on_closed_resource decorator, which now retries on all errors, including asyncio.CancelledError. When a client disconnects or a request times out, the cancellation is retried instead of propagated, and the event loop busy-waits.

Stable version: v1.18.0 (no retry extension).
Broken versions: v1.20.0 (introduced retry-all-errors via commit a3aa077) and v1.21.0.
Impact: Long-running async streaming endpoints (e.g., Gemini agent streaming) sustain 100% CPU for 20+ minutes until worker restart.

My speculated cause (based on CHANGELOG analysis): The retry logic change in v1.20.0 aims to improve robustness but overlooks async code paths where cancellation is special. CancelledError is treated as a "closed resource" error instead of a termination signal, so anyio/uvicorn cannot shut the task down cleanly and keeps looping in _deliver_cancellation (16% own time in the py-spy output). The effect is amplified in FastAPI + Gunicorn deployments, especially under high concurrency. v1.18.0 retries only connection/timeout errors, so it is not affected.
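
To illustrate the suspected mechanism, here is a minimal sketch that uses only asyncio. The retry_all_errors decorator is a hypothetical stand-in, not ADK's actual retry_on_closed_resource implementation. It assumes the retry clause is broad enough to catch asyncio.CancelledError, which derives from BaseException on Python 3.8+ and is therefore not caught by a plain `except Exception`.

```python
import asyncio
import functools


def retry_all_errors(func):
    """Hypothetical stand-in for a decorator that retries once on *any* error."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        try:
            return await func(*args, **kwargs)
        except BaseException:  # assumption: broad enough to catch CancelledError
            # Retrying here swallows the cancellation request.
            return await func(*args, **kwargs)

    return wrapper


async def main():
    attempts = []

    @retry_all_errors
    async def stream_response():
        attempts.append(1)
        # The first attempt simulates a long stream; the retry finishes quickly.
        await asyncio.sleep(10 if len(attempts) == 1 else 0.05)
        return "finished"

    task = asyncio.create_task(stream_response())
    await asyncio.sleep(0.01)  # let the task start
    task.cancel()  # simulate a client disconnect
    try:
        result = await task
    except asyncio.CancelledError:
        result = "cancelled"
    return result, len(attempts)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch the cancelled task runs a second attempt and completes normally instead of ending with CancelledError. If something similar happens inside an anyio cancel scope, the scope would keep trying to cancel a task that never exits, which would be consistent with the _deliver_cancellation hotspot, though I have not confirmed that this is the exact code path in ADK.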

Steps to Reproduce

Build a FastAPI app with get_fast_api_app that integrates async agents (e.g., streaming).
Deploy with: gunicorn --worker-class uvicorn.workers.UvicornWorker --workers 8 --timeout 120 main:app
Trigger a long-running endpoint and simulate a client disconnect with a 10-second timeout: curl -m 10 http://localhost:8000/message -d "long prompt"
Monitor with: py-spy top --pid <gunicorn_pid> (shows 100% CPU with hotspots in asyncio/anyio).
Minimal Repro Code (main.py):

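import os

from dotenv import load_dotenv
from google.adk.cli.fast_api import get_fast_api_app

# Load DATABASE_URL (and any other settings) from a local .env file.
load_dotenv()

# Build the ADK FastAPI app; agents are loaded from the current directory.
app = get_fast_api_app(agents_dir=".", allow_origins=["*"], session_service_uri=os.getenv("DATABASE_URL"))
# Run with the gunicorn command above.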

Expected vs. Actual Behavior

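Expected: On disconnect/timeout, CancelledError propagates, the task ends, and CPU stays below 20%.
Actual: The request enters a retry loop and the worker stays at 100% CPU. Example py-spy output, followed by top output for the same worker: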

  %Own   %Total  OwnTime  TotalTime  Function (filename)
 36.00%  36.00%    8.62s     8.62s   current_task (asyncio/tasks.py)
 16.00%  52.00%    5.21s    13.86s   _deliver_cancellation (anyio/_backends/_asyncio.py)
  9.00%  63.00%   0.920s    15.25s   _run_once (asyncio/base_events.py)
  2.00%  54.00%   0.460s    14.33s   _run (asyncio/events.py)
  1.00%  64.00%   0.240s    15.49s   run_forever (asyncio/base_events.py)
  0.00%   0.00%   0.040s    0.040s   _do_waitpid (asyncio/unix_events.py)
  0.00%   0.00%   0.020s    0.020s   _worker (concurrent/futures/thread.py)
  0.00%   0.00%   0.020s    0.030s   _call_soon (asyncio/base_events.py)
  0.00%   0.00%   0.010s    0.010s   __init__ (asyncio/events.py)
  0.00%   0.00%   0.010s    0.010s   sleep (asyncio/tasks.py)
  0.00%   0.00%   0.000s    0.010s   main_loop (uvicorn/server.py)

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      7 myuser    20   0 1261644 359828  39388 R 100.0   0.6  22:45.31 gunicorn
      1 myuser    20   0   48684  33268  10124 S   0.0   0.1   0:00.79 gunicorn
     21 myuser    20   0    2580    928    828 S   0.0   0.0   0:00.00 sh

Environment

OS: Linux (Docker/K8s).
Python: 3.11.x.
ADK: v1.20.0/v1.21.0 (broken); v1.18.0 (stable).
Other: FastAPI 0.104+, Uvicorn 0.24+, Gunicorn 21.2+, Anyio 4.x.
Deployment: Docker ENV WEB_CONCURRENCY=8, 1-2 CPU cores.

Additional Context

Workaround: Roll back to a release before v1.20.0 (v1.18.0 is the version verified stable here; v1.19.0 should be the latest release without the retry extension), or use a custom decorator that excludes CancelledError (code available).
Logs: Debug mode shows repeated "retry on closed resource".
Attachments: py-spy output, log snippets.

Suggested Fix

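Make retry_on_closed_resource exclude CancelledError, for example through a new option such as exclude_cancellation=True (a proposed parameter; it does not exist today).
Call aclose() explicitly on async generators in streaming contexts so they are finalized when the client disconnects.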

Metadata

Labels: core [Component] (this issue is related to the core interface and implementation)