Description
Upgrading google-adk-python from v1.18.0 to v1.20.0 or v1.21.0 causes 100% CPU usage in our FastAPI app (built with get_fast_api_app) running on Gunicorn workers. The issue appears to stem from the retry_on_closed_resource decorator now retrying on all errors, including asyncio.CancelledError. Client disconnections and timeouts then trigger repeated retries, and the event loop busy-waits.
Stable version: v1.18.0 (no retry extension).
Broken versions: v1.20.0 (introduced retry-all-errors via commit a3aa077) and v1.21.0.
Impact: Long-running async streaming endpoints (e.g., Gemini agent streaming) sustain 100% CPU for 20+ minutes until worker restart.
My speculated cause (based on CHANGELOG analysis): The v1.20.0 retry logic change aims to improve robustness but overlooks async scenarios where cancellation is a special signal. CancelledError is treated as a "closed resource" error instead of a termination signal, so anyio/uvicorn cannot shut the task down cleanly and keep looping in _deliver_cancellation (16% own time in py-spy). The effect is amplified in FastAPI + Gunicorn deployments, especially under high concurrency. v1.18.0 retries only connection/timeout errors, hence no issue.
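To illustrate the failure mode I suspect, here is a minimal sketch. It is not the ADK implementation: `retry_all` and `stream` are hypothetical names, and the wrapper retries only once to keep the demo finite. It shows that a retry wrapper which catches every exception also catches the cancellation request, so the task keeps running after `cancel()`.

```python
import asyncio


def retry_all(fn):
    """Hypothetical retry wrapper that catches every exception type."""
    async def wrapper(*args, **kwargs):
        try:
            return await fn(*args, **kwargs)
        except BaseException:
            # On Python 3.8+, asyncio.CancelledError derives from BaseException,
            # so it lands here too and the cancellation request is swallowed.
            return await fn(*args, **kwargs)  # single retry for the demo
    return wrapper


calls = 0


@retry_all
async def stream():
    """Stand-in for a long-running streaming call."""
    global calls
    calls += 1
    await asyncio.sleep(0.05)
    return "done"


async def main():
    task = asyncio.create_task(stream())
    await asyncio.sleep(0.01)   # let the task start
    task.cancel()               # simulate a client disconnect
    try:
        result = await task
    except asyncio.CancelledError:
        result = "cancelled"
    return result, calls


result, calls = asyncio.run(main())
print(result, calls)
```

The task finishes normally after a second call instead of ending as cancelled. With an unbounded retry loop, each new cancellation attempt from the server would be absorbed the same way.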
Steps to Reproduce
1. Build a FastAPI app with get_fast_api_app integrating async agents (e.g., streaming).
2. Deploy: gunicorn --worker-class uvicorn.workers.UvicornWorker --workers 8 --timeout 120 main:app.
3. Trigger a long endpoint and simulate disconnect: curl -m 10 http://localhost:8000/message -d "long prompt".
4. Monitor: py-spy top --pid <gunicorn_pid> shows 100% CPU with hotspots in asyncio/anyio.
Minimal Repro Code (main.py):
from google.adk.cli.fast_api import get_fast_api_app
import os
from dotenv import load_dotenv
load_dotenv()
app = get_fast_api_app(agents_dir=".", allow_origins=["*"], session_service_uri=os.getenv("DATABASE_URL"))
# Run above gunicorn command
Expected vs. Actual Behavior
Expected: On disconnect/timeout, CancelledError propagates; task ends, CPU <20%.
Actual: Retry loop; py-spy example:
%Own %Total OwnTime TotalTime Function (filename)
36.00% 36.00% 8.62s 8.62s current_task (asyncio/tasks.py)
16.00% 52.00% 5.21s 13.86s _deliver_cancellation (anyio/_backends/_asyncio.py)
9.00% 63.00% 0.920s 15.25s _run_once (asyncio/base_events.py)
2.00% 54.00% 0.460s 14.33s _run (asyncio/events.py)
1.00% 64.00% 0.240s 15.49s run_forever (asyncio/base_events.py)
0.00% 0.00% 0.040s 0.040s _do_waitpid (asyncio/unix_events.py)
0.00% 0.00% 0.020s 0.020s _worker (concurrent/futures/thread.py)
0.00% 0.00% 0.020s 0.030s _call_soon (asyncio/base_events.py)
0.00% 0.00% 0.010s 0.010s __init__ (asyncio/events.py)
0.00% 0.00% 0.010s 0.010s sleep (asyncio/tasks.py)
0.00% 0.00% 0.000s 0.010s main_loop (uvicorn/server.py)

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7 myuser 20 0 1261644 359828 39388 R 100.0 0.6 22:45.31 gunicorn
1 myuser 20 0 48684 33268 10124 S 0.0 0.1 0:00.79 gunicorn
21 myuser 20 0 2580 928 828 S 0.0 0.0 0:00.00 sh
Environment
OS: Linux (Docker/K8s).
Python: 3.11.x.
ADK: v1.20.0/v1.21.0 (broken); v1.18.0 (stable).
Other: FastAPI 0.104+, Uvicorn 0.24+, Gunicorn 21.2+, Anyio 4.x.
Deployment: Docker ENV WEB_CONCURRENCY=8, 1-2 CPU cores.
Additional Context
Workaround: Roll back to v1.19.0 (latest stable without retry extension); or custom decorator excluding CancelledError (code available).
Logs: Debug mode shows repeated "retry on closed resource".
Attachments: py-spy output, log snippets.
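For reference, a minimal sketch of the kind of custom decorator meant in the workaround. This is an assumption-laden illustration, not our production code and not the ADK's retry_on_closed_resource: `retry_once_except_cancel`, `flaky`, and `slow` are hypothetical names. It retries once on ordinary errors and always lets cancellation propagate.

```python
import asyncio
import functools


def retry_once_except_cancel(fn):
    """Hypothetical decorator: retry once on ordinary errors, never on cancellation."""
    @functools.wraps(fn)
    async def wrapper(*args, **kwargs):
        try:
            return await fn(*args, **kwargs)
        except asyncio.CancelledError:
            # Cancellation is a termination signal, not a transient failure.
            raise
        except Exception:
            return await fn(*args, **kwargs)
    return wrapper


attempts = {"flaky": 0, "slow": 0}


@retry_once_except_cancel
async def flaky():
    """Fails once with a connection-style error, then succeeds."""
    attempts["flaky"] += 1
    if attempts["flaky"] == 1:
        raise ConnectionError("closed resource")
    return "ok"


@retry_once_except_cancel
async def slow():
    """Stand-in for a long streaming call that gets cancelled."""
    attempts["slow"] += 1
    await asyncio.sleep(1)


async def demo():
    flaky_result = await flaky()
    task = asyncio.create_task(slow())
    await asyncio.sleep(0.01)
    task.cancel()               # simulate a client disconnect
    try:
        await task
    except asyncio.CancelledError:
        pass
    return flaky_result, task.cancelled()


flaky_result, was_cancelled = asyncio.run(demo())
print(flaky_result, was_cancelled, attempts)
```

The connection-style error is retried, while the cancelled task ends as cancelled after a single attempt. On Python 3.8+ `except Exception` already skips CancelledError; the explicit re-raise just makes the intent visible.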
Suggested Fix
Configure retry_on_closed_resource to exclude CancelledError (e.g., exclude_cancellation=True).
Add explicit aclose() in streaming contexts.
