🔖 Introduction
About the project
Prompt Optimiser Agent is a comparative AI engineering project that explores two emerging agent orchestration frameworks, Google Agent Development Kit (ADK) and LangGraph. The project implements a reflection-based multi-agent system where specialised AI agents collaborate to iteratively improve user prompts through structured feedback loops. The primary objective was to gain hands-on experience with modern agentic architectures and evaluate how different orchestration approaches impact developer experience, state management, workflow complexity, token consumption, and runtime performance. The final result is an interactive web application that allows users to benchmark both frameworks side-by-side while visualizing latency, token usage, and revision cycles.
The application is served by a FastAPI backend exposing dedicated endpoints for each framework, and a React + TypeScript + Vite frontend that submits user prompts to both engines in parallel, rendering the results in a unified comparison view with latency and token metrics.



Key outcomes of the project:
Demonstrated measurable latency and token-usage differences between ADK and LangGraph on identical reflection-loop tasks.
Validated the Reflection Pattern as an effective stress-test for comparing framework orchestration overhead.
Delivered a working side-by-side demo UI that transparently surfaces engine metrics per request.
Published findings analysing framework trade-offs for production agentic solutions.
🤔 Problem space
Problems to solve/Requirements to Create
As agentic AI systems become increasingly complex, developers face challenges in selecting orchestration frameworks that balance flexibility, scalability, maintainability, and performance. While frameworks such as LangGraph and Google ADK are becoming popular choices for building multi-agent applications, practical comparisons between their architectural approaches remain limited.
Understanding Agent Orchestration Trade-offs
Organizations adopting agentic AI need a clear understanding of how different frameworks handle state management, workflow orchestration, memory, and agent communication.
Current Solution
Developers often rely on documentation, tutorials, and isolated examples when evaluating frameworks. However, these resources rarely provide direct comparisons using identical business logic and workflows.
Evaluating Reflection-Based Multi-Agent Workflows
Reflection patterns are commonly used in AI systems to improve output quality through iterative critique and revision. However, these workflows introduce additional orchestration complexity and runtime overhead.
Current Solution
Many implementations rely on single-agent prompt engineering or simplistic chains that lack structured feedback loops.

Adding structured feedback loops, however, has its own costs:
Reflection workflows increase token consumption
Additional agent interactions increase latency
Complex state transitions become difficult to manage at scale
Why Should We Identify Trade-offs?
Understanding orchestration trade-offs helps engineering teams make informed architectural decisions before investing heavily in a framework.
Enables faster framework evaluation
Reduces architectural uncertainty
Provides practical performance benchmarks
Improves understanding of agentic design patterns
Goals
Objective 🎯
Build a modern agentic AI platform capable of orchestrating multiple specialized AI agents for benchmarking LangGraph and Google ADK using the reflection design pattern while maintaining performance, scalability, and developer productivity.
Project Goals
Implement the same reflection-based workflow design pattern using both Google ADK and LangGraph
Compare orchestration complexity and developer experience
Benchmark token consumption, revision cycles, and latency
Explore state management approaches used by each framework
Build an interactive interface for framework comparison
Gain practical experience with production-oriented agentic architectures
🌟 Design space
UI Design
The frontend is designed as a single-page chat experience. The layout centres a scrollable conversation panel on a full-screen background with a translucent overlay. When an optimisation request completes, both engine results are rendered inside a single message card in a side-by-side grid. Latency and token metrics are displayed below each result in a lightweight badge row. A composer bar at the bottom accepts new prompts. The session resets automatically after every third optimised prompt to keep the demo predictable.
Key design goals:
Low-fidelity Wireframe
💡 The wireframe concept used three zones:
Top bar: branding and session indicator.
Conversation panel (centre): scrollable message list with user bubbles on the right and engine result cards on the left.
Composer bar (bottom): text input, submit button, and session reset indicator.
Each engine result card within a message contained: framework name badge, optimised draft text, and a metric row (latency | input tokens | output tokens).

High-fidelity design
💡 The final implementation matched the wireframe intent. Key visual decisions included a translucent dark overlay on the background image for contrast, a white card surface for message content, colour-coded framework labels (ADK in blue, LangGraph in indigo), and compact monospaced token metric display.

Design system 🎨
💡 The UI used Tailwind CSS utility classes for spacing, colour, and typography, supplemented by custom component logic for the comparison card. No third-party component library was introduced, keeping the dependency surface minimal for a demo application. Consistent use of rounded corners, subtle shadows, and muted grays created a clean, professional aesthetic appropriate for a technical benchmark tool.
Development Phase
Technology Stack Selection
1. Backend: FastAPI + Python
Why FastAPI?
Async-first: both ADK and LangGraph engines are async; FastAPI's ASGI runtime handles concurrent engine calls without blocking.
Schema validation: Pydantic DTOs (PromptRequest, OptimizationResponse) provide automatic request validation and clear API contracts.
Developer velocity: automatic OpenAPI docs, minimal boilerplate, and hot-reload via uvicorn accelerate iteration.
2. Orchestration: Google ADK
Why ADK?
Native Gemini integration: ADK is built by Google for Gemini models, eliminating provider adapter overhead for the ADK engine path.
Imperative control flow: agents are defined as independent Agent instances and composed with standard Python async while loops, matching how backend developers already think.
Mutable session model: InMemorySessionService passes data directly into an active session workspace, minimising orchestration-layer state transformations.
3. Orchestration: LangGraph
Why LangGraph?
Structured graph model: explicit StateGraph with typed nodes and directional edges makes the reflection workflow's control flow visible.
MemorySaver checkpointing: thread-scoped in-memory checkpointing maps cleanly to per-session state via session_id.
Broad ecosystem: LangGraph has a large community and is not tied to a single model provider.
4. Frontend: React + TypeScript
Why this stack?
Component-based architecture: chat bubbles, engine result cards, and metric badges are reusable, self-contained components.
TypeScript: end-to-end type safety across the API client and UI state management, catching DTO mismatches at compile time.
High-Level Architecture Diagram
The application follows a client-server architecture in which the React frontend communicates with a FastAPI backend. The backend routes each request to its corresponding engine (Google ADK or LangGraph), which runs the four-node reflection loop, collects performance metrics, and returns a normalised OptimizationResponse. Both responses arrive at the frontend and are rendered in the same message card.
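The metric collection step could look roughly like the sketch below. The `timed_call` helper, the `EngineMetrics` and `EngineResult` containers, their field names, and the `fake_engine` stand-in are all illustrative assumptions rather than the project's actual code. The only idea taken from the text is that each engine call is wrapped so that latency and token counts travel back with the result.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class EngineMetrics:
    # Hypothetical metric container; field names are assumptions.
    latency_seconds: float
    input_tokens: int
    output_tokens: int
    revision_count: int


@dataclass
class EngineResult:
    draft: str
    metrics: EngineMetrics


async def timed_call(
    engine: Callable[[str], Awaitable[dict]], prompt: str
) -> EngineResult:
    """Run one engine coroutine and attach wall-clock latency to its output."""
    start = time.perf_counter()
    raw = await engine(prompt)
    elapsed = time.perf_counter() - start
    return EngineResult(
        draft=raw["draft"],
        metrics=EngineMetrics(
            latency_seconds=elapsed,
            input_tokens=raw["input_tokens"],
            output_tokens=raw["output_tokens"],
            revision_count=raw["revision_count"],
        ),
    )


async def fake_engine(prompt: str) -> dict:
    # Stand-in for a real ADK or LangGraph engine call.
    await asyncio.sleep(0.01)
    return {
        "draft": f"Optimised: {prompt}",
        "input_tokens": 10,
        "output_tokens": 20,
        "revision_count": 1,
    }


result = asyncio.run(timed_call(fake_engine, "make a portfolio website"))
print(result.draft, round(result.metrics.latency_seconds, 3))
```

Measuring with `time.perf_counter()` around the awaited call captures end-to-end latency as the caller experiences it, including model round trips and any orchestration overhead.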

Both engines implement the same four-node Reflection Design Pattern. This design was chosen specifically because reflection loops depend on repeated feedback cycles, which makes them computationally demanding and exposes framework orchestration overhead clearly. A single user prompt triggers a sequential waterfall of agent-to-agent calls.
Agent Node | Role | Input | Output |
Generator | Drafts a highly structured, production-ready system prompt from raw user intent | User query + revision history | Optimised draft |
Critic | Evaluates the draft against rubrics: edge cases, formatting, token constraints | Current draft | Critique feedback: APPROVED, or REASONING if not approved |
Assessor | Determines if the draft meets the sufficiency threshold | Draft + critique | SUFFICIENT: YES/NO |
Reviser | Updates the draft if Assessor signals further revision is needed | Draft + critique | Revised draft |
The loop repeats until the Assessor returns SUFFICIENT: YES or the max_iterations limit is reached, then the final draft and accumulated metrics are returned.
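Independently of either framework, the loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the `run_reflection` function, the `AgentFn` type, and the four stub agents are hypothetical, and the real engines call LLM-backed agents instead of these stubs. Only the node order, the `SUFFICIENT: YES` signal, and the `max_iterations` cap come from the text.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical agent signature: text in, text out.
AgentFn = Callable[[str], Awaitable[str]]


async def run_reflection(
    query: str,
    generator: AgentFn,
    critic: AgentFn,
    assessor: AgentFn,
    reviser: AgentFn,
    max_iterations: int = 3,
) -> tuple[str, int]:
    """Return the final draft and the number of revisions performed."""
    draft = await generator(query)
    revisions = 0
    while revisions < max_iterations:
        critique = await critic(draft)
        verdict = await assessor(f"{draft}\n\nCritique: {critique}")
        if "SUFFICIENT: YES" in verdict.upper():
            break  # the Assessor accepted the draft
        draft = await reviser(f"{draft}\n\nCritique: {critique}")
        revisions += 1
    return draft, revisions


# Stub agents so the sketch runs without any model calls.
async def stub_generator(text: str) -> str:
    return "draft v0"


async def stub_critic(text: str) -> str:
    return "add edge cases"


async def stub_assessor(text: str) -> str:
    return "SUFFICIENT: YES" if "v1" in text else "SUFFICIENT: NO"


async def stub_reviser(text: str) -> str:
    return "draft v1"


final_draft, revisions = asyncio.run(
    run_reflection(
        "make a portfolio website",
        stub_generator,
        stub_critic,
        stub_assessor,
        stub_reviser,
    )
)
print(final_draft, revisions)
```

With these stubs the first draft is rejected, revised once, and then accepted, so the loop ends after a single revision even though up to three are allowed.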
Agent Definition & Workflow Construction
Although both implementations follow the same Reflection Pattern, the way agents and workflows are defined differs significantly between LangGraph and Google ADK.
LangGraph: Agent Nodes + State Graph
LangGraph models workflows as a graph of nodes connected through explicit edges. Each agent is implemented as a node function that receives and updates a shared state object.
Reflection State Definition:
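import operator
from typing import Annotated, TypedDict

from langgraph.graph.message import add_messages


class ReflectionState(TypedDict):
    query: str                                        # raw user request
    current_draft: str                                # latest prompt draft
    critique: str                                     # latest Critic feedback
    revision_history: Annotated[list, operator.add]   # updates are appended
    messages: Annotated[list, add_messages]           # message accumulator
    is_sufficient: bool                               # Assessor verdict
    iteration: int                                    # completed revision cycles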
Agent Declaration:
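async def draft_node(state: ReflectionState):
    # llm is the project's model client, defined elsewhere.
    response = llm.generate(
        prompt=f"Query: {state['query']}",
        system_prompt="You are an expert prompt engineer..."
    )
    # Return only the fields this node updates; LangGraph merges them into the state.
    return {
        "current_draft": response.text
    }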
Graph Construction:
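from langgraph.graph import END, START, StateGraph

workflow = StateGraph(ReflectionState)
workflow.add_node("Draft", draft_node)
workflow.add_node("Critic", critic_node)
workflow.add_node("Assess", assessment_node)
workflow.add_node("Revise", revise_node)
workflow.add_edge(START, "Draft")
workflow.add_edge("Draft", "Critic")
workflow.add_edge("Critic", "Assess")
# should_continue returns one of the keys in the mapping below.
workflow.add_conditional_edges(
    "Assess",
    should_continue,
    {
        "sufficient": END,
        "needs_improvement": "Revise"
    }
)
# Assumed loop-back edge: the revised draft returns to the Critic for another pass.
workflow.add_edge("Revise", "Critic")
graph = workflow.compile()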
Key Characteristics
Explicit workflow visualization through graph edges.
Strongly structured state management.
Conditional routing defined separately through router functions.
Suitable for complex workflows requiring observability and durable execution.
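The router function passed to add_conditional_edges is not shown in the write-up. A minimal sketch of what it might look like follows. The `RouterState` type, the `MAX_ITERATIONS` constant, and the choice to end the graph once the cap is reached are assumptions; only the two routing keys and the `is_sufficient` and `iteration` fields come from the snippets above.

```python
from typing import TypedDict

MAX_ITERATIONS = 3  # assumed cap, mirroring max_iterations in the request


class RouterState(TypedDict):
    # Only the ReflectionState fields the router reads.
    is_sufficient: bool
    iteration: int


def should_continue(state: RouterState) -> str:
    """Return the routing key that the conditional edge maps to a node."""
    # Stop when the Assessor is satisfied or the iteration budget is spent.
    if state["is_sufficient"] or state["iteration"] >= MAX_ITERATIONS:
        return "sufficient"
    return "needs_improvement"


print(should_continue({"is_sufficient": False, "iteration": 1}))
```

Keeping the router a pure function of the state makes it easy to unit-test without compiling the graph or calling a model.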
Google ADK: Independent Agents + Imperative Control Flow
Google ADK treats agents as standalone objects with their own instructions and execution context. Workflow orchestration is handled through standard programming constructs rather than an explicit graph.
Agent Declaration
from google.adk.agents import Agent

draft_agent = Agent(
    name="Generator",
    model="gemini-2.5-flash-lite",
    description="Generates optimized prompts",
    instruction="""
    You are an Expert AI Prompt Engineer.
    Transform user requests into
    structured production-ready prompts.
    """
)
# The remaining agents follow the same pattern (arguments elided).
critic_agent = Agent(...)
assessor_agent = Agent(...)
reviser_agent = Agent(...)
Workflow Orchestration:
# call_agent_async is the project's helper that runs one agent turn and returns its text.
while iterations < max_iterations:
    assessment = await call_agent_async(
        assessor_agent,
        current_draft
    )
    if "SUFFICIENT: YES" in assessment:
        break
    current_draft = await call_agent_async(
        reviser_agent,
        current_draft
    )
    iterations += 1
Key Characteristics
Agents are independent runtime components.
Native Python control flow handles orchestration.
Routing logic is implemented directly with familiar language constructs.
Lower conceptual overhead for developers familiar with backend development.
As you can see, while both frameworks can implement the same reflection pattern, LangGraph emphasizes structured orchestration and stateful workflows, whereas ADK emphasizes flexibility and imperative control flow. This difference became one of the most noticeable aspects of the project from a developer experience perspective.
Challenges Faced and Solutions
Problem 1: State Overhead in LangGraph Reflection Loops
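LangGraph treats state as a series of snapshots: each node returns a partial update that is merged into the full ReflectionState TypedDict, and with a checkpointer such as MemorySaver attached, a new snapshot is recorded on every node transition. In a three-iteration reflection loop (up to twelve state snapshots for four nodes × three iterations), this overhead compounded noticeably.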
Solution:
Accepted this as a framework characteristic and documented it as a key finding. Mitigated it partially by declaring revision_history as Annotated[list, operator.add], so each node returns only its new entries instead of rewriting the whole list, and by keeping the messages accumulator lean.
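To make the reducer idea concrete, here is a simplified emulation of how a partial update might be merged into the state. It is not LangGraph's internal implementation: the `REDUCERS` table and the `apply_update` helper are hypothetical and only mimic the effect of annotating a field with `operator.add`.

```python
import operator
from typing import Any, Callable

# Reducers keyed by state field, mimicking Annotated[list, operator.add].
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {
    "revision_history": operator.add,
}


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into a new state dict."""
    merged = dict(state)  # shallow copy; the previous snapshot stays untouched
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in state:
            merged[key] = reducer(state[key], value)  # combine old and new
        else:
            merged[key] = value  # plain fields are overwritten
    return merged


state = {"current_draft": "v0", "revision_history": ["v0"]}
new_state = apply_update(
    state, {"current_draft": "v1", "revision_history": ["v1"]}
)
print(new_state)
```

The node only emits the new entry (`["v1"]`), yet the merged state holds the full history, while the earlier snapshot remains unchanged.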
Problem 2: Session ID Propagation Across the Stack
Early in development the session_id was not propagated from the API router all the way through to the engine execution functions. ADK's InMemorySessionService and LangGraph's MemorySaver each require a consistent identifier to recall prior state; without end-to-end propagation, every request started a fresh session regardless of what the frontend sent, so conversation history was lost.
Solution:
Added session_id to the PromptRequest DTO and wired it through the router → services → execute_adk_optimization() / execute_langgraph_optimization(). For LangGraph, session_id maps to thread_id in the MemorySaver config. For ADK, it is passed as SESSION_ID to InMemorySessionService. The frontend generates a UUID on load and holds it for the session lifetime.
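On the LangGraph side, the mapping from session_id to thread_id can be sketched as below. The helper names `new_session_id` and `langgraph_config` are hypothetical; the `{"configurable": {"thread_id": ...}}` shape is the config LangGraph checkpointers use to scope state to a thread. The UUID generation here merely stands in for what the frontend does on load.

```python
import uuid


def new_session_id() -> str:
    """Generate an opaque session identifier (stand-in for the frontend's UUID)."""
    return str(uuid.uuid4())


def langgraph_config(session_id: str) -> dict:
    """Build the per-invocation config that scopes checkpointed state to a thread."""
    return {"configurable": {"thread_id": session_id}}


session_id = new_session_id()
config = langgraph_config(session_id)
print(config)

# Assumed usage inside the engine function, with a graph compiled with a checkpointer:
#     result = await graph.ainvoke(inputs, config=config)
```

Reusing the same session_id across requests is what lets the checkpointer recall prior state; a fresh identifier on every call would reproduce the bug described above.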
Problem 3: Revision Count Divergence Between Frameworks
Under identical inputs and max_iterations=3, ADK completed the workflow in a single revision cycle while LangGraph required three. This meant raw latency comparisons could be driven by iteration count, not just orchestration overhead.
Solution:
Exposed revision_count as part of the response metrics and documented the confounding factor explicitly. The difference likely stems from model behaviour (Gemini 2.5 Flash Lite for ADK vs OpenAI for LangGraph) and context management strategy, not purely from framework logic. Both the article and this blog note that benchmark results should be interpreted alongside iteration count, not just latency.
Performance Benchmarks
Framework Comparison Results
Both frameworks were subjected to an identical multi-turn task: taking the same user input, generating a detailed system prompt, and running the agentic reflection loop to apply strict formatting and edge-case criteria. The test input was:
{
  "initial_prompt": "make a portfolio website, its for an AI engineer",
  "max_iterations": 3,
  "session_id": "1"
}
Metric | Google ADK | LangGraph |
Latency | 8.42 s | 32.52 s |
Input Tokens | 961 | 3,102 |
Output Tokens | 971 | 1,648 |
Revision Count | 1 | 3 |
The latency advantage for ADK can be explained by two compounding factors. First, ADK's native imperative loop avoids the state-snapshot overhead that LangGraph incurs on every node transition in its compiled StateGraph. Second, ADK completed the workflow in a single revision cycle versus three for LangGraph, meaning fewer total model calls were made. The token usage gap reflects both the shorter loop and ADK's more efficient context management.
Important Note: These benchmarks measured orchestration simplicity and runtime behaviour in a reflection-based workflow, not durable execution, checkpointing, human-in-the-loop workflows, or graph observability, areas where LangGraph has deliberate design advantages. The revision count difference likely reflects model behaviour, context management, and the user query rather than pure framework logic, and should be considered alongside raw latency figures.
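One rough way to read the figures alongside iteration count is to divide each metric by the revision count. The sketch below does that with the reported numbers. The `per_revision` helper is hypothetical, and the normalisation is deliberately crude: the Generator runs only once per request and the two engines used different models, so the result is an orientation aid, not a corrected benchmark.

```python
# Figures copied from the results reported for the single test input.
results = {
    "Google ADK": {
        "latency_s": 8.42, "input_tokens": 961,
        "output_tokens": 971, "revisions": 1,
    },
    "LangGraph": {
        "latency_s": 32.52, "input_tokens": 3102,
        "output_tokens": 1648, "revisions": 3,
    },
}


def per_revision(metrics: dict) -> dict:
    """Divide each metric by the number of revision cycles."""
    n = metrics["revisions"]
    return {
        "latency_s": round(metrics["latency_s"] / n, 2),
        "input_tokens": round(metrics["input_tokens"] / n),
        "output_tokens": round(metrics["output_tokens"] / n),
    }


normalised = {name: per_revision(m) for name, m in results.items()}
for name, values in normalised.items():
    print(name, values)
```

Per revision cycle, LangGraph comes out at about 10.84 s and 1,034 input tokens against 8.42 s and 961 for ADK, so the latency gap narrows from roughly 3.9x to roughly 1.3x once iteration count is taken into account.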
Future Vision / Next Steps
Long-Term Vision
The Prompt Optimiser Agent is intentionally scoped as a benchmark demo. The next iterations should move it toward a production-grade comparison platform and extend the research surface.
1) Production Readiness
Persistent session storage: replace InMemorySessionService and MemorySaver with Redis-backed or Postgres-backed checkpointers so sessions survive backend restarts and scale across multiple instances.
Multi-tenant session isolation: move from a fixed demo session_id to per-user UUID sessions with proper isolation and TTL expiry.
Automated test suite: add pytest tests for API routes and engine response schemas, and a demo script that posts a prompt sequence and prints both framework outputs for regression testing.
Token-Cost Reduction: Implement a sliding-window or summarisation strategy that caps the revision history passed to the model at N entries, discarding or compressing older iterations. This directly reduces input token growth across reflection cycles, lowering API costs without reducing output quality.
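The sliding-window idea could be prototyped along these lines. The `window_history` function and the placeholder note it prepends are assumptions for illustration; a summarisation strategy would replace the note with a model-generated summary of the dropped entries.

```python
def window_history(history: list[str], max_entries: int) -> list[str]:
    """Keep the most recent max_entries revisions and note how many were dropped."""
    if max_entries <= 0:
        return []
    dropped = len(history) - max_entries
    if dropped <= 0:
        return list(history)  # nothing to trim
    note = f"[{dropped} earlier revision(s) omitted]"
    return [note] + history[-max_entries:]


history = ["draft v0", "draft v1", "draft v2", "draft v3"]
print(window_history(history, 2))
```

Capping the history at N entries bounds the prompt growth per reflection cycle, at the price of the model no longer seeing the oldest drafts verbatim.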
2) Extended Benchmark Coverage
Additional frameworks: add a CrewAI or other framework engine under the same API contract to expand the comparison surface.
Human-in-the-loop comparison: implement an interrupt/resume flow in LangGraph (where it has a native advantage) to provide a balanced view of framework strengths.
Token-control layer: limit history sent to models via sliding window or summarisation to keep latency and costs predictable at longer session lengths.
3) Tooling and Observability
Session inspector endpoint: add GET /api/v1/optimize/session/{session_id}/history to expose revision_history for live demonstration and debugging.
Visual session timeline: extend the frontend with a collapsible per-message revision history panel showing each draft, critique, and assessment turn.
CLI / Postman collection: publish a CLI helper or Postman collection so users can reproduce benchmark scenarios without a running frontend.
Final Verdict
After building the same multi-agent reflection system independently in both LangGraph and Google ADK, the clearest takeaway is that neither framework is universally better; they solve different problems at different points in agentic applications.
Dimension | Google ADK | LangGraph |
Learning curve | Low - standard Python async patterns | Higher - graph DSL, TypedDict, router functions |
Orchestration model | Imperative while loop | Compiled StateGraph (nodes and conditional edges) |
State management | Mutable session workspace | Immutable TypedDict snapshots |
Latency (reflection loop) | Lower (fewer copies, less overhead) | Higher (snapshot on every transition) |
Observability | Manual logging required | Built-in graph tracing and checkpointing |
Durable execution | In-memory only here (InMemorySessionService) | MemorySaver, swappable for persistent checkpointers |
Ecosystem | Google Cloud / Gemini-native | Framework-agnostic, large community |
Best for | Rapid prototyping, latency-sensitive loops | Complex, observable, production pipelines |
ADK is the stronger choice when development speed, familiar programming patterns, and lower latency in tight reflection loops are the priority. LangGraph earns its place when reliability, structured observability, durable state, and production-grade orchestration matter more than raw speed. The benchmark numbers alone should not decide framework selection: model choice, workflow architecture, and team familiarity all shape real-world outcomes.