@sarojrout sarojrout commented Jan 6, 2026

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

1. Link to an existing issue (if applicable):

2. Or, if no issue exists, describe the change:

If applicable, please follow the issue templates to provide as much detail as
possible.

Problem:
Currently, building resilient multi-agent systems with AgentTool requires significant custom code. When sub-agents time out or fail, developers must:

  • Create custom timeout wrappers
  • Manually handle errors and retries
  • Engineer complex prompts for intelligent routing to alternative agents
  • Build error recovery logic
  • Format user-friendly error messages

This creates a high barrier to entry and leads to inconsistent implementations across different projects.

Solution:
This PR adds a working sample (contributing/samples/agent_tool_resilience/) that demonstrates how to build resilient multi-agent systems using ADK's existing components:

  1. TimeoutAgentTool wrapper - Adds timeout protection to AgentTool

    • Wraps AgentTool.run_async() with asyncio.wait_for()
    • Returns structured error responses compatible with retry plugins
    • Handles both synchronous and async generator patterns
  2. Integration with ReflectAndRetryToolPlugin - Handles automatic retries

    • Intercepts tool failures (including timeouts)
    • Provides structured reflection guidance to the LLM
    • Tracks retry counts per-tool
  3. Prompt-based dynamic routing - Enables intelligent fallback

    • Coordinator agent receives error responses as tool results
    • LLM reasons about errors and chooses alternative agents
    • Demonstrates fallback patterns without requiring core changes
  4. Error recovery agent - Provides user-friendly error analysis

    • Specialized agent that analyzes failures
    • Hides complexity from users
    • Suggests alternative approaches

Why this solution:

  • Works with current ADK features (no core changes needed)
  • Provides a reference implementation for developers
  • Demonstrates best practices for resilience patterns
  • Can be merged immediately as a sample
  • Serves as proof-of-concept for future built-in support

Testing Plan

Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Note: This is a sample addition, not a core feature change. The sample code itself is tested through manual E2E testing. The TimeoutAgentTool wrapper relies on Python's standard asyncio.wait_for(), which is well tested.

Please include a summary of passed pytest results.

Manual End-to-End (E2E) Tests:

Setup:

# 1. Activate virtual environment
source .venv/bin/activate

# 2. Launch the sample
adk web contributing/samples/agent_tool_resilience

  1. Normal Operation:

    • Query: "What is quantum computing?"
    • Expected: Primary agent handles query successfully
    • Result: Comprehensive answer returned
  2. Timeout Scenario:

    • Configuration: Set timeout=5.0 in agent.py as part of TimeoutAgentTool
    • Query: Very complex research request (multiple domains, detailed requirements)
    • Expected: Primary times out after 5s → Fallback agent succeeds
    • Result: Timeout detected, fallback automatically tried, answer returned
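
The expected behavior in the timeout scenario (primary times out, fallback answers) can be mimicked deterministically. In the sample itself the routing decision is made by the coordinator LLM from its prompt; the loop below is only a hand-written stand-in for that decision. route_with_fallback, slow_primary, fast_fallback, and the result keys are hypothetical names for this sketch.

```python
import asyncio
from collections import Counter
from typing import Any, Awaitable, Callable

AgentFn = Callable[[str], Awaitable[str]]


async def route_with_fallback(
    query: str,
    agents: list[tuple[str, AgentFn]],
    timeout: float,
    max_retries: int = 1,
) -> dict[str, Any]:
  """Tries agents in order; retries each on timeout, then moves to the next."""
  attempts: Counter = Counter()  # Number of attempts made, per agent name.
  for name, agent in agents:
    # Each agent gets one initial attempt plus max_retries retries.
    while attempts[name] <= max_retries:
      attempts[name] += 1
      try:
        answer = await asyncio.wait_for(agent(query), timeout=timeout)
      except asyncio.TimeoutError:
        continue
      return {
          "status": "success",
          "agent": name,
          "answer": answer,
          "attempts": dict(attempts),
      }
  return {"status": "error", "error_type": "timeout", "attempts": dict(attempts)}


async def slow_primary(query: str) -> str:
  # Stand-in for a primary agent that exceeds the timeout.
  await asyncio.sleep(0.5)
  return f"primary answer to {query!r}"


async def fast_fallback(query: str) -> str:
  # Stand-in for a fallback agent that answers immediately.
  return f"fallback answer to {query!r}"


if __name__ == "__main__":
  agents = [("primary", slow_primary), ("fallback", fast_fallback)]
  print(asyncio.run(route_with_fallback("complex request", agents, 0.05)))
```

With a 0.05 s timeout the primary is attempted twice (one try plus one retry) before the fallback is used, which mirrors the "timeout detected, fallback tried" result reported above.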

Test Results Summary:

  • Normal operation works correctly
  • Timeout protection functions as expected
  • Automatic retry via ReflectAndRetryToolPlugin works
  • Dynamic fallback routing works
  • Error recovery agent provides helpful guidance
  • User-friendly error messages (no complexity leakage)

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
    - TimeoutAgentTool includes detailed docstrings
    - Complex timeout logic for async generators is commented
    - Agent instructions explain error handling protocols
  • I have added tests that prove my fix is effective or that my feature works.
    - Manual E2E tests demonstrate all scenarios
  • New and existing unit tests pass locally with my changes.
    - No core changes - only sample addition
  • I have manually tested my changes end-to-end.
    - All test scenarios verified (see "Manual E2E Tests" above)
    - Sample runs successfully with adk web
  • Any dependent changes have been merged and published in downstream modules.
    - No dependencies - uses existing ADK features only

Additional context

What This PR Adds

Files Added:

  • contributing/samples/agent_tool_resilience/agent.py - Complete implementation (~320 lines)
  • contributing/samples/agent_tool_resilience/__init__.py - Package initialization
  • contributing/samples/agent_tool_resilience/README.md - User documentation

Key Features:

  1. Timeout Protection - Custom TimeoutAgentTool wrapper
  2. Automatic Retry - Integration with ReflectAndRetryToolPlugin
  3. Dynamic Fallback - Prompt-based routing to alternative agents
  4. Error Recovery - Specialized agent for user-friendly error analysis

Impact

  • Low Risk: Only adds a sample, no core changes
  • High Value: Provides reference implementation for resilience patterns
  • Backward Compatible: No breaking changes
  • Well Documented: Comprehensive README and inline comments

Screenshots

[Screenshot 2026-01-06 at 1 16 08 PM] [Screenshot 2026-01-06 at 1 24 55 PM]

Demonstrates timeout protection, automatic retry, and dynamic fallback
patterns for multi-agent workflows using AgentTool.
@gemini-code-assist

Summary of Changes

Hello @sarojrout, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new sample that provides a robust reference implementation for building resilient multi-agent systems within the ADK framework. It addresses the challenge of managing sub-agent timeouts and failures by demonstrating how to combine existing ADK components to achieve timeout protection, automatic retries, intelligent fallback mechanisms, and user-friendly error recovery, all without requiring core framework changes.

Highlights

  • Timeout Protection: Introduces a custom TimeoutAgentTool wrapper that adds timeout handling to sub-agents, catching asyncio.TimeoutError and returning structured error responses for processing by retry plugins. It also includes careful handling for async generators.
  • Automatic Retry: Demonstrates integration with ReflectAndRetryToolPlugin to automatically intercept tool failures (including timeouts) and provide structured reflection guidance to the LLM, tracking retry counts per-tool.
  • Dynamic Fallback Routing: Showcases prompt-based dynamic routing where a coordinator agent receives error responses as tool results, allowing the LLM to reason about errors and intelligently choose alternative agents, such as a fallback research agent.
  • Error Recovery Agent: Includes a specialized error_recovery_agent that analyzes failures and provides user-friendly guidance, hiding complexity from the end-user and suggesting alternative approaches.


@adk-bot added the "core [Component]" label (This issue is related to the core interface and implementation) on Jan 6, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable sample demonstrating resilience patterns for multi-agent systems, including timeouts, retries, and fallbacks. The implementation is well-structured and provides a clear example for developers. My review includes suggestions to improve code clarity and maintainability in agent.py by refactoring the timeout handling logic and ensuring consistency in configuration. I also noted a minor issue in the README.md file.

Comment on lines 108 to 169

    try:
      while True:
        # Check overall timeout
        elapsed = time.time() - start_time
        if elapsed >= self.timeout:
          # Timeout exceeded
          yield Event(
              content=types.Content(
                  role='assistant',
                  parts=[
                      types.Part.from_text(
                          text=f"Timeout: {self.timeout_error_message}"
                      )
                  ],
              ),
          )
          return

        # Calculate remaining time
        remaining = self.timeout - elapsed
        if remaining <= 0:
          yield Event(
              content=types.Content(
                  role='assistant',
                  parts=[
                      types.Part.from_text(
                          text=f"Timeout: {self.timeout_error_message}"
                      )
                  ],
              ),
          )
          return

        # Get next event with timeout check
        try:
          event = await asyncio.wait_for(
              agen.__anext__(),
              timeout=min(remaining, 0.5)  # Check frequently
          )
          yield event
        except StopAsyncIteration:
          # Generator finished normally
          break
        except asyncio.TimeoutError:
          # This iteration timed out, but check overall timeout
          if time.time() - start_time >= self.timeout:
            yield Event(
                content=types.Content(
                    role='assistant',
                    parts=[
                        types.Part.from_text(
                            text=f"Timeout: {self.timeout_error_message}"
                        )
                    ],
                ),
            )
            return
          # Otherwise, continue waiting for next event
          continue
    except Exception:
      # Re-raise other exceptions
      raise

medium

The run_async_with_events method can be simplified for better readability and maintainability.

  • The code to create and yield a timeout Event is duplicated. This can be extracted into a local helper function.
  • The check if remaining <= 0: is redundant because the preceding check if elapsed >= self.timeout: already covers this condition, making it unreachable. Removing this dead code will make the logic clearer.

Here is a suggested refactoring that addresses both points:

    try:
      def _create_timeout_event() -> Event:
        return Event(
            content=types.Content(
                role='assistant',
                parts=[
                    types.Part.from_text(
                        text=f"Timeout: {self.timeout_error_message}"
                    )
                ],
            ),
        )

      while True:
        # Check overall timeout
        elapsed = time.time() - start_time
        if elapsed >= self.timeout:
          # Timeout exceeded
          yield _create_timeout_event()
          return

        # Get next event with timeout check
        remaining = self.timeout - elapsed
        try:
          event = await asyncio.wait_for(
              agen.__anext__(),
              timeout=min(remaining, 0.5)  # Check frequently
          )
          yield event
        except StopAsyncIteration:
          # Generator finished normally
          break
        except asyncio.TimeoutError:
          # This iteration timed out, but check overall timeout
          if time.time() - start_time >= self.timeout:
            yield _create_timeout_event()
            return
          # Otherwise, continue waiting for next event
          continue
    except Exception:
      # Re-raise other exceptions
      raise
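
For comparison, the same overall-deadline idea can be tried outside ADK with a toy generator. This is a sketch under stated assumptions, not the PR's code: with_deadline, slow_events, and the "TIMEOUT" sentinel are hypothetical names, and plain strings stand in for Event objects. One point that may be worth checking in the polling version: asyncio.wait_for() cancels the awaited `__anext__()` call when its timeout expires, and that cancellation normally finishes the async generator, so this sketch waits for the full remaining budget in a single call instead of 0.5 s slices.

```python
import asyncio
import time
from typing import Any, AsyncIterator


async def with_deadline(
    agen: AsyncIterator[Any], timeout: float, timeout_item: Any
) -> AsyncIterator[Any]:
  """Yields items from agen until it ends or the overall deadline passes."""
  deadline = time.monotonic() + timeout
  try:
    while True:
      remaining = deadline - time.monotonic()
      if remaining <= 0:
        yield timeout_item
        return
      try:
        # Wait for the next item, but never past the overall deadline.
        item = await asyncio.wait_for(agen.__anext__(), timeout=remaining)
      except StopAsyncIteration:
        return  # Source generator finished normally.
      except asyncio.TimeoutError:
        yield timeout_item
        return
      yield item
  finally:
    # Release the source generator even if the consumer stops early.
    await agen.aclose()


async def slow_events() -> AsyncIterator[str]:
  # Toy source: five events, 50 ms apart.
  for i in range(5):
    await asyncio.sleep(0.05)
    yield f"event-{i}"


async def collect(timeout: float) -> list:
  return [e async for e in with_deadline(slow_events(), timeout, "TIMEOUT")]


if __name__ == "__main__":
  print(asyncio.run(collect(1.0)))
  print(asyncio.run(collect(0.12)))
```

Using time.monotonic() for the deadline avoids surprises if the wall clock is adjusted while the generator is running.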

@ryanaiagent ryanaiagent self-assigned this Jan 7, 2026
@sarojrout

@ryanaiagent, can you please get this sample reviewed and merged so that others can pull it?
