A sophisticated AI agent evaluation framework using the Google Agent Development Kit (ADK) and Vertex AI GenAI Evaluation Service for production-scale testing.
Start the ADK dev UI:

```bash
adk web src/agents
```

Then open http://127.0.0.1:8000.

Run the evaluations:

```bash
adk eval src/agents src/agents/story_agent.evalset.json
pytest tests/test_vertex_eval.py -v -s
```

Project layout:

```
src/agents/
├── story_flow_agent.py # Sophisticated Custom Agent (StoryFlow pattern)
├── sample_agent.py # Simple Calculator Agent
├── evaluator_agent.py # LLM-as-Judge Agent
├── orchestrator_agent.py # Evaluation Pipeline Orchestrator
├── story_agent.evalset.json # ADK Evalset Format
└── test_config.json # Evaluation Criteria Config
src/evaluation/
└── vertex_ai_evaluator.py # Vertex AI GenAI Evaluation Service
tests/
├── test_vertex_eval.py # Vertex AI evaluation tests
├── test_story_eval.py # Story agent evaluation
└── data/
    ├── story_eval_dataset.json # 50-case golden dataset
    └── evaluation_results.json # Results output
```
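`story_agent.evalset.json` holds the golden eval cases. As a rough illustration only (the evalset schema differs across ADK versions, and the case name, query, and reference below are invented for the example):

```json
[
  {
    "name": "brave_kitten_story",
    "data": [
      {
        "query": "Write a short story about a brave kitten.",
        "expected_tool_use": [],
        "reference": "An upbeat short story in which a kitten overcomes its fear."
      }
    ],
    "initial_session": {"state": {}}
  }
]
```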
ADK evaluation metrics (`adk eval`):

| Metric | Description |
|---|---|
| `tool_trajectory_avg_score` | Exact match of tool call trajectory |
| `response_match_score` | ROUGE-1 similarity to reference |
| `final_response_match_v2` | LLM-judged semantic match |
| `rubric_based_tool_use_quality_v1` | LLM-judged tool usage quality |
| `hallucinations_v1` | Groundedness check |
| `safety_v1` | Safety/harmlessness check |
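`test_config.json` sets the pass thresholds for these metrics. A minimal sketch, assuming the standard ADK criteria format (the threshold values here are illustrative):

```json
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```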
Vertex AI GenAI Evaluation Service metrics:

| Metric | Description |
|---|---|
| `coherence` | Logical flow and structure |
| `fluency` | Grammar and readability |
| `groundedness` | Factual accuracy |
| `summarization_quality` | Summary effectiveness |
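These metrics come from the Vertex AI GenAI Evaluation Service, which can also be called directly. A minimal sketch, assuming the `vertexai.evaluation` module from `google-cloud-aiplatform` (import paths and prebuilt metric names vary by SDK version; the prompt/response rows are invented):

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# Bring-your-own-response mode: the agent's outputs are already in the dataset.
eval_dataset = pd.DataFrame(
    {
        "prompt": ["Write a short story about a brave kitten."],
        "response": ["Milo the kitten stared up at the tall fence..."],
    }
)

# Prebuilt model-based metrics can be referenced by name.
eval_task = EvalTask(dataset=eval_dataset, metrics=["coherence", "fluency"])
result = eval_task.evaluate()
print(result.summary_metrics)
```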
The StoryFlowAgent demonstrates ADK best practices:
- Custom Orchestration - Implements `BaseAgent._run_async_impl` (see the sketch below the hierarchy diagram)
- LoopAgent - Iterative critique/revision (max 3 iterations)
- SequentialAgent - Post-processing pipeline
- Conditional Logic - Regenerate if tone is negative
```
StoryFlowAgent (Custom BaseAgent)
├─ StoryGenerator (LlmAgent)
├─ CriticReviserLoop (LoopAgent)
│  ├─ Critic (LlmAgent)
│  └─ Reviser (LlmAgent)
└─ PostProcessing (SequentialAgent)
   ├─ GrammarCheck (LlmAgent)
   └─ ToneCheck (LlmAgent)
```
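A condensed sketch of the custom orchestration, following the ADK custom-agent pattern (the sub-agent field names and the `tone_check_result` state key are assumptions about how the sub-agents are wired; see `story_flow_agent.py` for the real implementation):

```python
from typing import AsyncGenerator

from google.adk.agents import BaseAgent, LlmAgent, LoopAgent, SequentialAgent
from google.adk.agents.invocation_context import InvocationContext
from google.adk.events import Event


class StoryFlowAgent(BaseAgent):
    """Orchestrates: generate -> critique/revise loop -> post-processing."""

    story_generator: LlmAgent
    critic_reviser_loop: LoopAgent
    post_processing: SequentialAgent

    model_config = {"arbitrary_types_allowed": True}

    async def _run_async_impl(
        self, ctx: InvocationContext
    ) -> AsyncGenerator[Event, None]:
        # 1. Initial draft.
        async for event in self.story_generator.run_async(ctx):
            yield event

        # 2. Critique/revision loop (the LoopAgent caps iterations at 3).
        async for event in self.critic_reviser_loop.run_async(ctx):
            yield event

        # 3. Grammar and tone post-processing.
        async for event in self.post_processing.run_async(ctx):
            yield event

        # 4. Conditional logic: regenerate once if the tone check was negative.
        if ctx.session.state.get("tone_check_result") == "negative":
            async for event in self.story_generator.run_async(ctx):
                yield event
```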
```bash
# Run with default criteria
adk eval src/agents src/agents/story_agent.evalset.json

# Run with custom config
adk eval src/agents src/agents/story_agent.evalset.json --config test_config.json
```

```python
from src.evaluation.vertex_ai_evaluator import run_evaluation
from src.agents.story_flow_agent import root_agent
# Run inside an async context (e.g. asyncio.run or pytest-asyncio)
result = await run_evaluation(
agent=root_agent,
evalset_path="src/agents/story_agent.evalset.json",
output_path="tests/data/vertex_eval_results.json",
use_vertex_ai=True
)
print(f"Avg Trajectory Score: {result.avg_trajectory_score}")
print(f"Avg Coherence: {result.avg_coherence}")
print(f"Avg Groundedness: {result.avg_groundedness}")# All evaluation tests
pytest tests/test_vertex_eval.py -v -s
# Trajectory tests only
pytest tests/test_vertex_eval.py::TestTrajectoryEvaluation -v
# Full pipeline integration
pytest tests/test_vertex_eval.py::TestFullPipeline -v
```

Install dependencies:

```bash
pip install google-adk pytest python-dotenv pandas vertexai
```

Create `.env`:

```
GOOGLE_GENAI_USE_VERTEXAI=TRUE
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
```
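ADK picks up `GOOGLE_GENAI_USE_VERTEXAI` from the environment. A minimal bootstrap sketch for scripts and tests, assuming `python-dotenv` (already in the install list above):

```python
import os

import vertexai
from dotenv import load_dotenv

load_dotenv()  # export the GOOGLE_* variables from .env

vertexai.init(
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location=os.environ["GOOGLE_CLOUD_LOCATION"],
)
```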