Feat/agent tool resilience sample #4086

sarojrout · 2026-01-06T23:14:00Z

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

Closes:
Related: Built-in Resilience Support for AgentTool #4087

2. Or, if no issue exists, describe the change:

If applicable, please follow the issue templates to provide as much detail as
possible.

Problem:
Currently, building resilient multi-agent systems with AgentTool requires significant custom code. When sub-agents time out or fail, developers must:

- Create custom timeout wrappers
- Manually handle errors and retries
- Engineer complex prompts for intelligent routing to alternative agents
- Build error recovery logic
- Format user-friendly error messages

This creates a high barrier to entry and leads to inconsistent implementations across different projects.

Solution:
This PR adds a working sample (contributing/samples/agent_tool_resilience/) that demonstrates how to build resilient multi-agent systems using ADK's existing components:

1. TimeoutAgentTool wrapper - Adds timeout protection to AgentTool
   - Wraps AgentTool.run_async() with asyncio.wait_for()
   - Returns structured error responses compatible with retry plugins
   - Handles both synchronous and async generator patterns
2. Integration with ReflectAndRetryToolPlugin - Handles automatic retries
   - Intercepts tool failures (including timeouts)
   - Provides structured reflection guidance to the LLM
   - Tracks retry counts per-tool
3. Prompt-based dynamic routing - Enables intelligent fallback
   - Coordinator agent receives error responses as tool results
   - LLM reasons about errors and chooses alternative agents
   - Demonstrates fallback patterns without requiring core changes
4. Error recovery agent - Provides user-friendly error analysis
   - Specialized agent that analyzes failures
   - Hides complexity from users
   - Suggests alternative approaches
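
The timeout-wrapper idea described above can be sketched roughly as follows. This is a minimal illustration and not the sample's TimeoutAgentTool: `call_with_timeout`, the keys of the returned dict, and the two stand-in agents are hypothetical names chosen for this example. The only library call assumed is the standard `asyncio.wait_for()`.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def call_with_timeout(
    call: Callable[[], Awaitable[Any]],
    timeout: float,
    tool_name: str = "sub_agent",  # hypothetical label, for illustration only
) -> dict[str, Any]:
  """Awaits `call()` and turns a timeout into a structured error result."""
  try:
    result = await asyncio.wait_for(call(), timeout=timeout)
  except asyncio.TimeoutError:
    # Return data instead of raising so the caller can inspect the failure.
    return {
        "status": "error",
        "error_type": "timeout",
        "tool": tool_name,
        "message": f"{tool_name} did not finish within {timeout} seconds",
    }
  return {"status": "ok", "result": result}


async def fast_agent() -> str:
  return "answer"


async def slow_agent() -> str:
  await asyncio.sleep(1.0)  # simulates a sub-agent that takes too long
  return "late answer"


if __name__ == "__main__":
  print(asyncio.run(call_with_timeout(fast_agent, timeout=0.5)))
  print(asyncio.run(call_with_timeout(slow_agent, timeout=0.05)))
```

Returning a dict rather than raising keeps the failure visible as an ordinary tool result, which is the property the coordinator-side routing relies on.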

Why this solution:

Works with current ADK features (no core changes needed)
Provides a reference implementation for developers
Demonstrates best practices for resilience patterns
Can be merged immediately as a sample
Serves as proof-of-concept for future built-in support

Testing Plan

Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.

Unit Tests:

I have added or updated unit tests for my change.
All unit tests pass locally.

Note: This is a sample addition, not a core feature change. The sample code itself is tested through manual E2E testing. The TimeoutAgentTool wrapper uses standard Python asyncio.wait_for() which is well-tested.

Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:

Setup:

# 1. Activate virtual environment
source .venv/bin/activate

# 2. Launch the sample
adk web contributing/samples/agent_tool_resilience

Normal Operation:
- Query: "What is quantum computing?"
- Expected: Primary agent handles the query successfully
- Result: Comprehensive answer returned

Timeout Scenario:
- Configuration: Set timeout=5.0 on the TimeoutAgentTool in agent.py
- Query: Very complex research request (multiple domains, detailed requirements)
- Expected: Primary agent times out after 5 s, then the fallback agent succeeds
- Result: Timeout detected, fallback automatically tried, answer returned
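
The timeout scenario can be reproduced without an LLM using a deterministic stand-in. In the sample the coordinator's LLM chooses the fallback through prompting; the sketch below hard-codes that choice as "try agents in order", so it illustrates the expected outcome only. `primary_agent`, `fallback_agent`, and `first_success` are hypothetical names, and the timeout is shortened so the example runs quickly.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

AgentCall = Callable[[str], Awaitable[str]]


async def primary_agent(query: str) -> str:
  await asyncio.sleep(1.0)  # stands in for a slow, complex research request
  return f"primary: {query}"


async def fallback_agent(query: str) -> str:
  return f"fallback: {query}"


async def first_success(
    query: str, agents: Sequence[AgentCall], timeout: float
) -> tuple[str, list[str]]:
  """Tries each agent in order; returns the first answer and a failure log."""
  failures: list[str] = []
  for agent in agents:
    try:
      answer = await asyncio.wait_for(agent(query), timeout=timeout)
      return answer, failures
    except asyncio.TimeoutError:
      # Record the timeout and move on to the next candidate.
      failures.append(f"{agent.__name__} timed out after {timeout}s")
  raise RuntimeError("all agents failed: " + "; ".join(failures))


if __name__ == "__main__":
  answer, failures = asyncio.run(
      first_success("research request", [primary_agent, fallback_agent], 0.05)
  )
  print(answer)
  print(failures)
```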

Test Results Summary:

Normal operation works correctly
Timeout protection functions as expected
Automatic retry via ReflectAndRetryToolPlugin works
Dynamic fallback routing works
Error recovery agent provides helpful guidance
User-friendly error messages (no complexity leakage)

Checklist

I have read the CONTRIBUTING.md document.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
- TimeoutAgentTool includes detailed docstrings
- Complex timeout logic for async generators is commented
- Agent instructions explain error handling protocols
I have added tests that prove my fix is effective or that my feature works.
- Manual E2E tests demonstrate all scenarios
New and existing unit tests pass locally with my changes.
- No core changes - only sample addition
I have manually tested my changes end-to-end.
- All test scenarios verified (see "Manual E2E Tests" above)
- Sample runs successfully with adk web
Any dependent changes have been merged and published in downstream modules.
- No dependencies - uses existing ADK features only

Additional context

What This PR Adds

Files Added:

contributing/samples/agent_tool_resilience/agent.py - Complete implementation (~320 lines)
contributing/samples/agent_tool_resilience/__init__.py - Package initialization
contributing/samples/agent_tool_resilience/README.md - User documentation

Key Features:

Timeout Protection - Custom TimeoutAgentTool wrapper
Automatic Retry - Integration with ReflectAndRetryToolPlugin
Dynamic Fallback - Prompt-based routing to alternative agents
Error Recovery - Specialized agent for user-friendly error analysis

Impact

Low Risk: Only adds a sample, no core changes
High Value: Provides reference implementation for resilience patterns
Backward Compatible: No breaking changes
Well Documented: Comprehensive README and inline comments

Screenshots

Demonstrates timeout protection, automatic retry, and dynamic fallback patterns for multi-agent workflows using AgentTool.

gemini-code-assist · 2026-01-06T23:14:13Z

Summary of Changes

Hello @sarojrout, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new sample that provides a robust reference implementation for building resilient multi-agent systems within the ADK framework. It addresses the challenge of managing sub-agent timeouts and failures by demonstrating how to combine existing ADK components to achieve timeout protection, automatic retries, intelligent fallback mechanisms, and user-friendly error recovery, all without requiring core framework changes.

Highlights

Timeout Protection: Introduces a custom TimeoutAgentTool wrapper that adds timeout handling to sub-agents, catching asyncio.TimeoutError and returning structured error responses for processing by retry plugins. It also includes careful handling for async generators.
Automatic Retry: Demonstrates integration with ReflectAndRetryToolPlugin to automatically intercept tool failures (including timeouts) and provide structured reflection guidance to the LLM, tracking retry counts per-tool.
Dynamic Fallback Routing: Showcases prompt-based dynamic routing where a coordinator agent receives error responses as tool results, allowing the LLM to reason about errors and intelligently choose alternative agents, such as a fallback research agent.
Error Recovery Agent: Includes a specialized error_recovery_agent that analyzes failures and provides user-friendly guidance, hiding complexity from the end-user and suggesting alternative approaches.
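
To make "tracking retry counts per-tool" concrete, here is a small sketch of a per-tool retry budget. It is not the implementation of ReflectAndRetryToolPlugin, whose internals are not shown in this PR; `RetryBudget` and its methods are hypothetical names, and the policy (a fixed limit per tool name, reset on success) is an assumption made for illustration.

```python
class RetryBudget:
  """Tracks failed attempts per tool name against a fixed limit."""

  def __init__(self, max_retries: int) -> None:
    self.max_retries = max_retries
    self._failures: dict[str, int] = {}

  def record_failure(self, tool_name: str) -> bool:
    """Records a failure; returns True if the tool may be retried."""
    count = self._failures.get(tool_name, 0) + 1
    self._failures[tool_name] = count
    return count <= self.max_retries

  def record_success(self, tool_name: str) -> None:
    """Clears the failure count once the tool succeeds."""
    self._failures.pop(tool_name, None)


if __name__ == "__main__":
  budget = RetryBudget(max_retries=2)
  for attempt in range(3):
    print(attempt, budget.record_failure("research_agent"))
```

Keeping the counters keyed by tool name means a failing primary agent does not consume the budget of a fallback agent.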

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a valuable sample demonstrating resilience patterns for multi-agent systems, including timeouts, retries, and fallbacks. The implementation is well-structured and provides a clear example for developers. My review includes suggestions to improve code clarity and maintainability in agent.py by refactoring the timeout handling logic and ensuring consistency in configuration. I also noted a minor issue in the README.md file.

gemini-code-assist · 2026-01-06T23:15:40Z

contributing/samples/agent_tool_resilience/agent.py

+    try:
+      while True:
+        # Check overall timeout
+        elapsed = time.time() - start_time
+        if elapsed >= self.timeout:
+          # Timeout exceeded
+          yield Event(
+              content=types.Content(
+                  role='assistant',
+                  parts=[
+                      types.Part.from_text(
+                          text=f"Timeout: {self.timeout_error_message}"
+                      )
+                  ],
+              ),
+          )
+          return
+
+        # Calculate remaining time
+        remaining = self.timeout - elapsed
+        if remaining <= 0:
+          yield Event(
+              content=types.Content(
+                  role='assistant',
+                  parts=[
+                      types.Part.from_text(
+                          text=f"Timeout: {self.timeout_error_message}"
+                      )
+                  ],
+              ),
+          )
+          return
+
+        # Get next event with timeout check
+        try:
+          event = await asyncio.wait_for(
+              agen.__anext__(),
+              timeout=min(remaining, 0.5)  # Check frequently
+          )
+          yield event
+        except StopAsyncIteration:
+          # Generator finished normally
+          break
+        except asyncio.TimeoutError:
+          # This iteration timed out, but check overall timeout
+          if time.time() - start_time >= self.timeout:
+            yield Event(
+                content=types.Content(
+                    role='assistant',
+                    parts=[
+                        types.Part.from_text(
+                            text=f"Timeout: {self.timeout_error_message}"
+                        )
+                    ],
+                ),
+            )
+            return
+          # Otherwise, continue waiting for next event
+          continue
+    except Exception:
+      # Re-raise other exceptions
+      raise


The `run_async_with_events` method can be simplified for better readability and maintainability.

The code to create and yield a timeout `Event` is duplicated. This can be extracted into a local helper function.

The check `if remaining <= 0:` is redundant because the preceding check `if elapsed >= self.timeout:` already covers this condition, making it unreachable. Removing this dead code will make the logic clearer.

Here is a suggested refactoring that addresses both points:

    try:
      def _create_timeout_event() -> Event:
        return Event(
            content=types.Content(
                role='assistant',
                parts=[
                    types.Part.from_text(
                        text=f"Timeout: {self.timeout_error_message}"
                    )
                ],
            ),
        )

      while True:
        # Check overall timeout
        elapsed = time.time() - start_time
        if elapsed >= self.timeout:
          # Timeout exceeded
          yield _create_timeout_event()
          return

        # Get next event with timeout check
        remaining = self.timeout - elapsed
        try:
          event = await asyncio.wait_for(
              agen.__anext__(),
              timeout=min(remaining, 0.5)  # Check frequently
          )
          yield event
        except StopAsyncIteration:
          # Generator finished normally
          break
        except asyncio.TimeoutError:
          # This iteration timed out, but check overall timeout
          if time.time() - start_time >= self.timeout:
            yield _create_timeout_event()
            return
          # Otherwise, continue waiting for next event
          continue
    except Exception:
      # Re-raise other exceptions
      raise
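
For comparison, the same overall-deadline behavior can be sketched as a standalone helper that is easy to test outside ADK. This is an illustrative alternative under stated assumptions, not the code in this PR: `with_deadline`, `ticks`, and `collect` are hypothetical names, and a plain sentinel value stands in for the timeout `Event`. Note that `asyncio.wait_for` cancels the awaitable it wraps when its timeout expires, so a short polling timeout may end the wrapped generator early; the sketch sidesteps this by waiting for the full remaining time on each step.

```python
import asyncio
import time
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def with_deadline(
    source: AsyncIterator[T], timeout: float, on_timeout: T
) -> AsyncIterator[T]:
  """Re-yields items from `source` until it ends or `timeout` seconds pass."""
  deadline = time.monotonic() + timeout
  while True:
    remaining = deadline - time.monotonic()
    if remaining <= 0:
      yield on_timeout
      return
    try:
      # Wait for the next item, but never past the overall deadline.
      item = await asyncio.wait_for(source.__anext__(), timeout=remaining)
    except StopAsyncIteration:
      return  # source finished normally
    except asyncio.TimeoutError:
      yield on_timeout
      return
    yield item


async def ticks(n: int, delay: float) -> AsyncIterator[int]:
  """Yields 0..n-1, sleeping `delay` seconds before each item."""
  for i in range(n):
    await asyncio.sleep(delay)
    yield i


async def collect(source: AsyncIterator[T]) -> list[T]:
  return [item async for item in source]


if __name__ == "__main__":
  print(asyncio.run(collect(with_deadline(ticks(3, 0.01), 1.0, -1))))
  print(asyncio.run(collect(with_deadline(ticks(3, 0.2), 0.05, -1))))
```

Using `time.monotonic()` for the deadline avoids surprises if the wall clock is adjusted while the sub-agent is running.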


sarojrout · 2026-01-08T19:59:26Z

@ryanaiagent, can you please get this sample reviewed and merged so that others can pull it?

sarojrout added 2 commits January 6, 2026 14:25

feat(samples): Add agent_tool_resilience sample

4179470

Demonstrates timeout protection, automatic retry, and dynamic fallback patterns for multi-agent workflows using AgentTool.

cleaned up comments

e6a2b07

adk-bot added the "core [Component]" label ("This issue is related to the core interface and implementation") Jan 6, 2026

gemini-code-assist bot reviewed Jan 6, 2026

sarojrout mentioned this pull request Jan 6, 2026

Built-in Resilience Support for AgentTool #4087

Open

review comments incorporated google#4086

70f1fa1

ryanaiagent self-assigned this Jan 7, 2026

made skip summarization false to prevent the function response event from being marked as final

b214380
