Agent Evaluation Project

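An evaluation framework for AI agents built with the Google Agent Development Kit (ADK). It combines ADK's built-in evaluation criteria with the Vertex AI GenAI Evaluation Service and a pytest suite, so the same agents can be checked from the web UI, the CLI, or CI.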

🚀 Quick Start

Launch ADK Web

adk web src/agents

Then open http://127.0.0.1:8000

Run ADK CLI Evaluation

adk eval src/agents src/agents/story_agent.evalset.json

Run Pytest Evaluation

pytest tests/test_vertex_eval.py -v -s

📁 Project Structure

src/agents/
├── story_flow_agent.py        # Sophisticated Custom Agent (StoryFlow pattern)
├── sample_agent.py            # Simple Calculator Agent
├── evaluator_agent.py         # LLM-as-Judge Agent
├── orchestrator_agent.py      # Evaluation Pipeline Orchestrator
├── story_agent.evalset.json   # ADK Evalset Format
└── test_config.json           # Evaluation Criteria Config

src/evaluation/
└── vertex_ai_evaluator.py     # Vertex AI GenAI Evaluation Service

tests/
├── test_vertex_eval.py        # Vertex AI evaluation tests
├── test_story_eval.py         # Story agent evaluation
└── data/
    ├── story_eval_dataset.json   # 50-case golden dataset
    └── evaluation_results.json   # Results output

📊 Evaluation Metrics

ADK Built-in Criteria

| Metric | Description |
| --- | --- |
| `tool_trajectory_avg_score` | Exact match of tool call trajectory |
| `response_match_score` | ROUGE-1 similarity to reference |
| `final_response_match_v2` | LLM-judged semantic match |
| `rubric_based_tool_use_quality_v1` | LLM-judged tool usage quality |
| `hallucinations_v1` | Groundedness check |
| `safety_v1` | Safety/harmlessness check |
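
To make the first two criteria concrete, here is a minimal plain-Python sketch of what "exact match" and "ROUGE-1" mean. It is not ADK's implementation: the function names are hypothetical, and the whitespace tokenizer is a simplifying assumption (real ROUGE tooling tokenizes and normalizes more carefully).

```python
from collections import Counter


def trajectory_exact_match(actual: list[dict], expected: list[dict]) -> float:
    """Return 1.0 only when both runs made the same tool calls in the same order."""
    return 1.0 if actual == expected else 0.0


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 over naive lowercase whitespace tokens (an assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    call = {"name": "add", "args": {"a": 1, "b": 2}}
    print(trajectory_exact_match([call], [call]))
    print(rouge1_f1("the cat", "the cat sat"))
```

A trajectory score averaged over an eval set is then the mean of these per-case 0/1 values, which is why a single out-of-order tool call lowers the average.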

Vertex AI GenAI Metrics

| Metric | Description |
| --- | --- |
| `coherence` | Logical flow and structure |
| `fluency` | Grammar and readability |
| `groundedness` | Whether the response is supported by the provided context |
| `summarization_quality` | Summary effectiveness |

🧠 StoryFlowAgent Architecture

The StoryFlowAgent combines several ADK orchestration patterns:

Custom Orchestration - Subclasses BaseAgent and implements _run_async_impl
LoopAgent - Iterative critique/revision (max 3 iterations)
SequentialAgent - Post-processing pipeline (grammar check, then tone check)
Conditional Logic - Regenerates the story if the tone check reports a negative tone

StoryFlowAgent (Custom BaseAgent)
├─ StoryGenerator (LlmAgent)
├─ CriticReviserLoop (LoopAgent)
│  ├─ Critic (LlmAgent)
│  └─ Reviser (LlmAgent)
└─ PostProcessing (SequentialAgent)
   ├─ GrammarCheck (LlmAgent)
   └─ ToneCheck (LlmAgent)
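
The control flow in the tree above can be sketched in plain asyncio, without ADK. This is an illustration only: `run_story_flow`, the `Step` type, and the `"tone"` state key are hypothetical names, and unlike a real LoopAgent this sketch always runs the full number of iterations rather than stopping early.

```python
import asyncio
from typing import Awaitable, Callable

# Each step reads and writes a shared state dict, loosely like session state.
Step = Callable[[dict], Awaitable[None]]


async def run_story_flow(
    generate: Step,
    critic: Step,
    reviser: Step,
    grammar_check: Step,
    tone_check: Step,
    state: dict,
    max_iterations: int = 3,
) -> dict:
    await generate(state)                # StoryGenerator
    for _ in range(max_iterations):      # CriticReviserLoop
        await critic(state)
        await reviser(state)
    await grammar_check(state)           # PostProcessing, step 1
    await tone_check(state)              # PostProcessing, step 2
    if state.get("tone") == "negative":  # conditional regeneration
        await generate(state)
    return state


if __name__ == "__main__":
    async def noop(state: dict) -> None:
        """Stand-in for an LLM-backed step."""

    async def write_story(state: dict) -> None:
        state["story"] = "Once upon a time..."

    print(asyncio.run(run_story_flow(write_story, noop, noop, noop, noop, {})))
```

Keeping the regeneration check in the custom parent, rather than inside the sequential pipeline, is what makes a custom BaseAgent necessary here: the built-in workflow agents run their children in a fixed pattern and do not branch on state.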

🔬 Evaluation Framework

1. ADK CLI Evaluation

# Run with default criteria
adk eval src/agents src/agents/story_agent.evalset.json

# Run with custom config
adk eval src/agents src/agents/story_agent.evalset.json --config_file_path=src/agents/test_config.json
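
The custom config is a JSON file of per-metric thresholds. As a rough sketch, assuming the simple form where a top-level `criteria` object maps metric names to numeric thresholds, a file like `test_config.json` could be generated as below. The helper name and the threshold values are illustrative, and criteria that need richer settings (for example rubric-based ones) are not covered.

```python
import json
from pathlib import Path


def write_eval_config(path: Path, criteria: dict[str, float]) -> None:
    """Write numeric thresholds under a top-level "criteria" key."""
    for name, threshold in criteria.items():
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"{name}: threshold {threshold} is outside [0, 1]")
    path.write_text(json.dumps({"criteria": criteria}, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Illustrative thresholds, not values taken from this repository.
    write_eval_config(
        Path("test_config.json"),
        {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.8},
    )
```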

2. Vertex AI Programmatic Evaluation

import asyncio

from src.evaluation.vertex_ai_evaluator import run_evaluation
from src.agents.story_flow_agent import root_agent


async def main() -> None:
    # run_evaluation is awaited, so it has to run inside an event loop
    result = await run_evaluation(
        agent=root_agent,
        evalset_path="src/agents/story_agent.evalset.json",
        output_path="tests/data/vertex_eval_results.json",
        use_vertex_ai=True,
    )

    print(f"Avg Trajectory Score: {result.avg_trajectory_score}")
    print(f"Avg Coherence: {result.avg_coherence}")
    print(f"Avg Groundedness: {result.avg_groundedness}")


if __name__ == "__main__":
    asyncio.run(main())
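
For orientation, the averaged fields printed above could be produced by a reduction like the one below. This is a guess at the shape, not the code in `vertex_ai_evaluator.py`: `EvalSummary`, `summarize`, and the per-case keys are hypothetical names chosen to mirror the attribute names in the example.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class EvalSummary:
    avg_trajectory_score: float
    avg_coherence: float
    avg_groundedness: float


def summarize(cases: list[dict[str, float]]) -> EvalSummary:
    """Average per-case scores; every case must carry all three keys."""
    if not cases:
        raise ValueError("no evaluation cases to summarize")
    return EvalSummary(
        avg_trajectory_score=fmean(c["trajectory_score"] for c in cases),
        avg_coherence=fmean(c["coherence"] for c in cases),
        avg_groundedness=fmean(c["groundedness"] for c in cases),
    )


if __name__ == "__main__":
    # Made-up scores, purely to show the call shape.
    print(summarize([
        {"trajectory_score": 1.0, "coherence": 4.0, "groundedness": 1.0},
        {"trajectory_score": 0.0, "coherence": 5.0, "groundedness": 0.0},
    ]))
```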

3. Pytest Suite

# All evaluation tests
pytest tests/test_vertex_eval.py -v -s

# Trajectory tests only
pytest tests/test_vertex_eval.py::TestTrajectoryEvaluation -v

# Full pipeline integration
pytest tests/test_vertex_eval.py::TestFullPipeline -v

🔧 Setup

pip install google-adk pytest python-dotenv pandas vertexai

Create .env:

GOOGLE_GENAI_USE_VERTEXAI=TRUE
GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
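
A missing variable tends to surface later as an opaque authentication or routing error, so it can help to check the three settings up front. The helper below is a hypothetical, stdlib-only sketch. It assumes the `.env` values have already been loaded into the process environment, for example with `load_dotenv()` from python-dotenv.

```python
import os
from typing import Mapping

REQUIRED_SETTINGS = (
    "GOOGLE_GENAI_USE_VERTEXAI",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit(f"Missing settings: {', '.join(missing)}")
    print("Environment looks complete.")
```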
