Obsidian Metadata
| description | Google AI mode - "what is the best way to build llm agents in 2025 which includes tracing and evals?" |
Google AI Mode Response
Step 1: Define the agent’s purpose and architecture
Before writing any code, determine the agent’s specific role, such as customer support, data analysis, or web research. Then, choose the appropriate architecture based on the task’s complexity.
- Single-agent systems: Ideal for simple, tightly coupled, and linear tasks.
- Multi-agent systems: Best for complex, open-ended, and parallelizable tasks that require specialized agents to collaborate.
- Manager/decentralized patterns: You can also choose between centralized (manager-controlled) and decentralized (peer-to-peer) multi-agent communication patterns (a minimal manager sketch follows this list).
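To make the manager pattern concrete, here is a stripped-down sketch of a manager routing tasks to specialist workers. It is plain Python under stated assumptions: call_llm is a hypothetical stand-in for a real model call, and the keyword routing only keeps the example self-contained.

```python
from typing import Callable


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (swap in your SDK client)."""
    return f"[model response to: {prompt[:40]}...]"


# Specialist workers: each wraps one narrow, well-described capability.
WORKERS: dict[str, Callable[[str], str]] = {
    "research": lambda task: call_llm(f"Research and summarize: {task}"),
    "support": lambda task: call_llm(f"Draft a customer support reply for: {task}"),
}


def manager(task: str) -> str:
    """Centralized (manager) pattern: one agent routes each task to a specialist.

    A real manager would ask an LLM to pick the route; a keyword check keeps
    this sketch self-contained.
    """
    route = "support" if "refund" in task.lower() else "research"
    return WORKERS[route](task)


print(manager("Customer asks for a refund on order 12345"))
```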
Step 2: Choose your tools and frameworks
Select a framework and observability platform based on your project’s needs.
Agent frameworks
Frameworks abstract the complexity of building agents by managing the core components: reasoning, memory, and tool usage. A minimal single-agent example follows the list below.
- LangChain/LangGraph: A popular, flexible framework for building complex, stateful workflows with extensive tool integrations.
- AutoGen (Microsoft): Facilitates building collaborative, multi-agent systems that communicate with each other to complete complex tasks.
- CrewAI: A beginner-friendly framework for creating multi-agent teams with clearly defined roles and tasks.
- LlamaIndex Agents: A data-focused framework specializing in retrieval-augmented generation (RAG) over private or enterprise data.
- Semantic Kernel (Microsoft): A lightweight SDK designed for integrating AI capabilities into existing applications using C#, Python, or Java.
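As an orientation point, a minimal single-agent example built with LangGraph's prebuilt ReAct agent might look like the sketch below. It assumes recent versions of langgraph, langchain-core, and langchain-openai; the model name and the search_orders tool are illustrative choices, not requirements.

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent


@tool
def search_orders(order_id: str) -> str:
    """Look up the shipping status of an order by its ID."""
    # Illustrative stub; a real tool would query an order database or API.
    return f"Order {order_id}: shipped, arriving Friday"


# The framework handles the reasoning loop: the model decides when to call
# the tool and composes the final answer from the tool result.
model = ChatOpenAI(model="gpt-4o-mini")
agent = create_react_agent(model, tools=[search_orders])

result = agent.invoke({"messages": [("user", "Where is order 12345?")]})
print(result["messages"][-1].content)
```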
Observability and evaluation platforms
These dedicated platforms integrate directly with agent frameworks to provide end-to-end monitoring and testing.
- Maxim AI: Offers a comprehensive suite for simulation, evaluation (human-in-the-loop and LLM-as-a-judge), and production observability with robust security features.
- Future AGI: Combines real-time monitoring, evaluation, and prototyping in a unified platform with OpenTelemetry-native tracing.
- LangSmith: Provides seamless tracing, debugging, and evaluation for applications built with the LangChain ecosystem.
- Arize AI: An enterprise-grade platform that uses OpenTelemetry for tracing and offers strong evaluation capabilities for large-scale operations.
Step 3: Architect core components
All production-ready agents require robust core components for reliable performance.
- Memory: Implement both short-term memory (for the current conversation context) and long-term memory (for past interactions and knowledge) using databases or vector stores. Manage context size to control cost and avoid performance degradation (a minimal sketch follows this list).
- Tool use: Integrate tools that allow the agent to perform actions beyond text generation, such as querying a database or calling an API. Use clear and specific tool descriptions to ensure the agent selects the correct tool for the job.
- Planning and reasoning: Guide the agent’s decision-making process. Effective techniques include problem decomposition and self-reflection prompts that force the agent to review its own responses for errors.
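As one concrete illustration of short-term memory management, the sketch below trims the running conversation to a rough token budget before each model call. It is framework-agnostic plain Python; the ShortTermMemory class and the four-characters-per-token estimate are assumptions for illustration, not a specific library's API.

```python
from dataclasses import dataclass, field


@dataclass
class ShortTermMemory:
    """Keep the most recent conversation turns within a rough token budget."""
    max_tokens: int = 4000
    messages: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self) -> None:
        # Drop the oldest turns until the estimate fits the budget.
        while self._estimate_tokens() > self.max_tokens and len(self.messages) > 1:
            self.messages.pop(0)

    def _estimate_tokens(self) -> int:
        # Crude estimate (~4 characters per token); use a real tokenizer in production.
        return sum(len(m["content"]) for m in self.messages) // 4


memory = ShortTermMemory(max_tokens=500)
memory.add("user", "Summarize yesterday's support tickets.")
memory.add("assistant", "There were 12 tickets; 3 are still open.")
# memory.messages is what you would pass to the model as conversation context.
```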
Step 4: Implement tracing and logging
Tracing is the foundation of agent observability. Instrument your code to capture the full lifecycle of a request, from the user prompt to the final response.
- Use OpenTelemetry: For vendor-neutral tracing, ensure your tools are compatible with OpenTelemetry, a widely adopted standard (a minimal instrumentation sketch follows this list).
- Log granular data: Capture all key data points at each step of the agent’s execution, including:
  - Inputs and outputs: Log the model inputs, outputs, and intermediate states.
  - Latency and cost: Track these metrics for performance monitoring and cost optimization.
  - Tool calls: Record which tools were used and the results returned.
  - Custom metadata: Add relevant tags for filtering and analysis, such as user IDs or model versions.
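A minimal instrumentation sketch along these lines, using the OpenTelemetry Python SDK with a console exporter, might look as follows. The span names, attribute keys, and the stand-in tool and model calls are illustrative assumptions; a production setup would export to your observability backend instead.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export spans to the console for demonstration; in production you would
# point an OTLP exporter at your observability backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-demo")


def run_agent(user_prompt: str, user_id: str) -> str:
    with tracer.start_as_current_span("agent.request") as span:
        # Custom metadata for filtering and analysis.
        span.set_attribute("user.id", user_id)
        span.set_attribute("model.version", "gpt-4o-mini")
        span.set_attribute("llm.input", user_prompt)

        start = time.time()
        with tracer.start_as_current_span("tool.call") as tool_span:
            tool_span.set_attribute("tool.name", "search_orders")
            tool_result = "Order 12345: shipped"  # stand-in for a real tool call
            tool_span.set_attribute("tool.result", tool_result)

        response = f"Your order update: {tool_result}"  # stand-in for the LLM call
        span.set_attribute("llm.output", response)
        span.set_attribute("latency.seconds", time.time() - start)
        return response


run_agent("Where is order 12345?", user_id="u-42")
```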
Step 5: Design and automate evaluation
Continuous evaluation is critical for ensuring agent reliability and preventing regressions.
- Define success metrics: Move beyond simple accuracy to measure metrics specific to agents, such as task completion rate, tool usage efficiency, and memory retrieval success.
- Automate evaluations:
  - Use LLM-as-a-judge: For rapid iteration, use a more capable LLM to evaluate the agent’s responses against a rubric (a minimal judge sketch follows this list).
  - Use real-world data: Use production logs and user feedback to create test suites for regression testing.
- Implement human-in-the-loop (HIL): Use human reviewers for nuanced, subjective assessments that automated metrics can’t capture, such as tone or creativity.
- Set up guardrails: Implement safety and compliance checks to filter harmful, biased, or non-compliant outputs.
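To make LLM-as-a-judge concrete, here is one possible sketch using the OpenAI Python client. The rubric, the judge model name, and the scoring schema are assumptions to adapt to your own agent and evaluation criteria.

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_RUBRIC = """You are grading an AI support agent's reply.
Score each criterion from 1 (poor) to 5 (excellent):
- task_completion: did the reply resolve the user's request?
- tool_use: were the right tools used, with correct arguments?
- tone: was the reply clear and professional?
Return a JSON object with those three keys and a short 'rationale'."""


def judge(user_request: str, agent_reply: str) -> dict:
    """Ask a stronger model to grade an agent transcript against the rubric."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {
                "role": "user",
                "content": f"User request:\n{user_request}\n\nAgent reply:\n{agent_reply}",
            },
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)


scores = judge("Where is order 12345?", "Order 12345 shipped and arrives Friday.")
print(scores)
```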
Step 6: Deploy and monitor in production
Roll out your agent gradually to limit risk, then use your observability tools to monitor performance continuously.
- Monitor key metrics: Keep a close watch on KPIs such as latency, token usage, and error rates using live dashboards.
- Set up alerts: Configure alerts to notify your team of anomalies, such as cost spikes, degraded performance, or increased error rates (a minimal threshold-check sketch follows this list).
- Establish a feedback loop: Curate data from production runs for continuous evaluation, fine-tuning, and improvement.
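As a simple illustration of threshold-based alerting, the sketch below checks error rate and p95 latency over a batch of hypothetical request records in plain Python. Real deployments would typically configure these rules in the observability platform itself.

```python
from statistics import quantiles


def check_alerts(
    requests: list[dict], max_error_rate: float = 0.05, max_p95_latency_s: float = 8.0
) -> list[str]:
    """Return alert messages when KPIs breach their thresholds."""
    alerts = []

    error_rate = sum(r["error"] for r in requests) / len(requests)
    if error_rate > max_error_rate:
        alerts.append(f"Error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")

    latencies = sorted(r["latency_s"] for r in requests)
    p95 = quantiles(latencies, n=20)[-1]  # 95th percentile cut point
    if p95 > max_p95_latency_s:
        alerts.append(f"p95 latency {p95:.1f}s exceeds {max_p95_latency_s:.1f}s")

    return alerts


# Hypothetical request records curated from production logs.
sample = [
    {"error": False, "latency_s": 2.1},
    {"error": True, "latency_s": 9.4},
    {"error": False, "latency_s": 3.0},
]
print(check_alerts(sample))
```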

