Overview
Observability is more than monitoring: it's the ability to ask new questions of your system without shipping new code. AWS's native stack includes Amazon CloudWatch (metrics, logs, dashboards, alarms), AWS X-Ray (distributed tracing), Amazon Managed Service for Prometheus, and Amazon Managed Grafana. CloudWatch Application Signals automatically discovers services and shows golden signals. AI-driven tools like Amazon Q in CloudWatch and partner solutions (Datadog, New Relic, Dynatrace, Splunk) help triage incidents.
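As a rough illustration of what "golden signals" means in practice, the sketch below derives three of the four signals usually listed in SRE practice (traffic, errors, latency) from raw request records. It does not show how CloudWatch Application Signals computes anything. The `Request` record, the `golden_signals` function, and the choice of p95 are assumptions made for this example, and saturation is left out because it needs resource data rather than request data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP-style status code


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int, 1-100)."""
    rank = max(1, (len(sorted_values) * pct + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def golden_signals(requests, window_seconds):
    """Summarise traffic, errors, and latency for one time window."""
    if not requests:
        return {"traffic_rps": 0.0, "error_rate": 0.0, "p95_latency_ms": None}
    latencies = sorted(r.latency_ms for r in requests)
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "p95_latency_ms": percentile(latencies, 95),
    }
```

In a real deployment these values would normally come from metrics the platform already emits, not from recomputing them over raw records.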
Key concepts
- Three pillars: logs, metrics, traces — plus events and profiles
- OpenTelemetry as the vendor-neutral standard
- SLOs, SLIs, error budgets — the SRE language
- Anomaly detection and automated root-cause analysis
- Cost-aware observability — sampling, retention, log-class tiers
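The SLO, SLI, and error-budget vocabulary above comes down to simple arithmetic. The sketch below assumes a request-based availability SLI; the `ErrorBudget` class and its field names are hypothetical and not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that counted against the SLI

    @property
    def sli(self) -> float:
        """Measured success ratio for the window."""
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """How many failures the SLO tolerates in this window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative once it is overspent."""
        if self.allowed_failures == 0:
            return 1.0 if self.failed_requests == 0 else 0.0
        return 1 - self.failed_requests / self.allowed_failures


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {budget.sli:.5f}, budget remaining: {budget.budget_remaining:.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 250 failures leave about three quarters of the budget unspent.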
Key AWS services
- Amazon CloudWatch
- AWS X-Ray
- Amazon Managed Service for Prometheus
- Amazon Managed Grafana
- AWS Distro for OpenTelemetry
Learn more — curated resources
Hand-picked official docs, foundational papers, and the best community guides for going deeper on this topic.
Sessions on this topic
18 sessions from the Summit covered this topic. Each is a self-contained mini-lesson.
- PRT215-S Intermediate
The Visibility Gap: Turning Observability into DevSecOps Signals
(Sponsored by Datadog) Security teams and dev teams share the same production environment but operate from different signals. Without runtime context, security monitoring has blind spots around misconfigured infrastructure and threats in flight. This session draws on Fone Dynamics' ISO 27001 journey to show how runtime telemetry, cloud audit logs, and code scanning give DevSecOps and SecOps teams shared context.
- AIM201 Intermediate
From demo to deployment: solving agentic AI's toughest challenges
Most AI agent projects stall when moving from prototype to production. This session tackles the top challenges builders face when deploying agentic AI at scale. You'll learn how to answer the fundamental question of whether to build custom agents or leverage pre-built agents for DevOps, security, development, and business productivity use cases. Then you'll discover how to address the critical production challenges of reliability, observability, cost management, security, and evaluation. Drawing from real customer deployments and AWS's portfolio of agentic AI capabilities, you'll gain actionable approaches for building agents that don't just demo well but ship and scale.
- DEV204 Intermediate
AI-Powered EKS Troubleshooting with AWS DevOps Agent
Managing EKS clusters means correlating logs, metrics, IAM policies, and network configurations under pressure. The AWS DevOps Agent, announced at re:Invent 2025, changes this workflow fundamentally. In this session, you'll watch a live demonstration where the DevOps Agent autonomously investigates an EKS service failure, tracing issues from Pod logs through VPC Security Groups without manual intervention. You'll learn how the agent correlates cross-service dependencies, generates verified remediation plans, and integrates into existing SRE workflows.
- ISV302 Advanced
Architecting Scalable AI Agents using Amazon Bedrock AgentCore
Discover how to build powerful AI agents using Amazon Bedrock's suite of tools, with a focus on Amazon Bedrock AgentCore. This session explores how Parrot Analytics leveraged the modular components of Amazon Bedrock AgentCore and Amazon Nova foundation models to achieve 10x the processing speed of manual classification across 2M+ entities. We will dive into prompt and context engineering, knowledge bases, and observability for production agentic workloads.
- DEV207 Intermediate
Data Observability Without the Pain - Lessons from a Production System
Modern IoT platforms are inherently data platforms. Events flow through APIs, queues, AWS Lambda serverless functions, storage systems, and device networks before becoming meaningful data. When something goes wrong, tracing a single event across these distributed components quickly becomes painful, and the question shifts from _what happened_ to _where do I even start looking?_ I'll walk through three practical observability patterns drawn from building and operating a production, event-driven IoT healthcare platform on AWS that processes tens of thousands of device events daily. Using OpenTelemetry, AWS X-Ray, and Honeycomb, we'll explore techniques for gaining visibility into asynchronous event pipelines, correlating activity across services, and tracing events as they move through distributed systems. You'll leave with three concrete patterns you can apply immediately to your own event-driven data systems.
- COP301 Advanced
Elevating your Agentic AI Observability
Gain deep visibility into the performance and reliability of autonomous agents with Amazon CloudWatch. This session showcases how CloudWatch delivers end-to-end observability for agentic AI workloads, tracking decision quality, token efficiency, and workflow execution at scale. Explore prebuilt dashboards and advanced metrics that help you optimize agent performance, control operational costs, and maintain consistent behavior across complex intelligent systems. Walk away ready to implement production-grade observability that ensures your AI agents operate reliably, make optimal decisions, and deliver measurable outcomes at scale.
- ISV210 Intermediate
Boost performance and reduce costs with Aurora: Canva's story
From initial idea to execution, learn about Canva's journey migrating MySQL workloads from Amazon RDS to Aurora at scale. Discover how Canva achieved meaningful performance improvements, cost savings, and operational efficiencies through this strategic migration. This lightning talk shares real-world insights on planning and executing large-scale database migrations, key Aurora best practices for optimizing cost and performance, and how the latest monitoring features help maintain efficiency as you scale. Learn how AWS Countdown Premium (CDP) accelerated and de-risked Canva's migration, delivering tangible business value while minimizing operational disruption.
- PRT209-S Intermediate
How Auto & General leverage observability foundations for AI
As one of Australia's leading general insurers, Auto & General knew AI would play an important part in their future IT operations. To ensure success, the team embarked on a comprehensive observability maturity journey to build solid foundations, governance, and structure. Learn how A&G worked with New Relic to successfully lay observability foundations for the AI age.
- PRT101-S Foundational
Accelerating Innovation with GitLab DAP Powered by Amazon Bedrock
Learn how GitLab Duo Agent Platform (DAP) powered by Amazon Bedrock brings agentic AI into every stage of the software development lifecycle while keeping data, logs, and inference traffic inside your AWS environment. We'll show how teams can orchestrate AI-assisted workflows for planning, coding, security, and compliance using Amazon Bedrock foundation models behind GitLab's AI Gateway.
- PRT203-S Intermediate
Design, Deploy, and Govern AI Agents with Boomi's Agentstudio: 5 Steps to Enterprise-Grade AI Security for Amazon Bedrock Projects
Transform enterprise automation with Boomi's AI agent ecosystem. Learn to use Agent Designer to visually build agents that connect across systems, and Agent Control Tower for centralised governance, compliance, and performance monitoring. Securely orchestrate your AI lifecycle at scale with Amazon Bedrock.
- DEV312 Advanced
Strands Agents on Lambda: Observability With Powertools & X-Ray
When a Strands Agent fails across five Lambda log streams with no correlation, debugging takes 20 minutes minimum. This session demonstrates a structured observability layer that reduces diagnosis to under two minutes. You'll learn how Lambda Powertools Tracer wraps Strands tool invocations as X-Ray subsegments, how Powertools Logger injects AgentCore session correlation IDs across invocations, and how Powertools Metrics surfaces tool retry frequency as CloudWatch alarms — before timeouts occur. The session covers three production failure classes — tool timeout, reasoning loop, and retry storm — and delivers a reusable CDK construct providing full instrumentation for any Strands Agent Lambda deployment.
- DAT402 Expert
Deep dive into database integrations with AWS Zero-ETL
Learn how AWS zero-ETL integrations eliminate complex data movement pipelines across multiple database engines, enabling data engineers, architects, and DBAs to reduce maintenance overhead while ensuring near real-time data availability for analytics and ML workloads. Examine the underlying architecture for supported zero-ETL integrations between Amazon Aurora, Amazon DynamoDB, and Amazon RDS sources to Amazon Redshift, Amazon SageMaker, and Amazon OpenSearch Service targets. Explore data movement options, tunable settings, and monitoring capabilities for ongoing data replication, all without traditional ETL complexity.
- DEV305 Advanced
Agents in the enterprise: Best practices with Amazon Bedrock AgentCore
As organizations scale AI agent development, robust enterprise architecture patterns become essential. In this advanced session, we'll explore how Amazon Bedrock AgentCore enables teams to build modular systems using their preferred frameworks while sharing tools through MCP gateways. Learn about A2A collaboration, shared memory, identity-based access controls, and integrated observability. Discover practical strategies for secure runtime deployment, standardized tool integration, evaluation frameworks, and end-to-end monitoring. Leave with actionable insights to build secure, scalable agent infrastructures that balance centralized governance with team autonomy.
- SEC302 Advanced
Leap ahead in Cloud Operations with AWS DevOps Agent
Downtime costs revenue. Alert fatigue burns out your best engineers. Manual incident investigation wastes hours that could be spent building. Every cloud team faces these operational challenges, yet most still rely on tribal knowledge and context-switching across multiple tools to diagnose issues. In this session, we demonstrate how AWS DevOps Agent transforms incident response from hours of manual investigation to minutes of autonomous analysis. Watch as the agent automatically correlates data across your observability tools, identifies root causes, and delivers actionable mitigation plans, freeing your team to build instead of firefight.
- DEV210 Intermediate
AI-Driven Incident Triage: From Slack Alert to Root Cause
Modern AWS environments generate more alerts than teams can realistically investigate. This session demonstrates a proof-of-concept that transforms Slack alerts into automated investigation workflows using AI. Learn how to trigger parallel queries across CloudWatch, Amazon EKS, Prometheus, and deployment history when an alert fires, returning correlated summaries with probable causes and dashboard links directly in Slack. You'll leave understanding practical integration patterns for AI-assisted triage, telemetry hygiene requirements, and guardrails for safely introducing AI into production incident response. Discover how AI augments, rather than replaces, your existing observability stack, meaningfully reducing time-to-insight during incidents.
- WPS203 Intermediate
Optimising Outpatient Waitlists with ML at Gold Coast Health
Deploying ML in high-stakes environments demands enterprise readiness, governance, and continuous monitoring. In this session, you'll learn how Gold Coast Health moved from pilot to production with a predictive model identifying patients unlikely to attend procedures — achieving 33% precision, doubling the 15% manual baseline — while ensuring fairness across cohorts. The session covers real-world ML architecture on Amazon SageMaker Pipelines, production monitoring including data quality, pipeline health, and drift detection, plus navigating AI governance through bias analysis and impact assessment. Whether you're in healthcare, financial services, or any regulated industry, walk away with actionable patterns for deploying responsible ML at scale.
- DEV206 Intermediate
AI Isn't Just for Developers: Using Kiro CLI & AWS MCP for Cloud Ops
You can't turn your head sideways without seeing a slew of articles, blogs, or videos about AI, and most of them focus on developer tooling and writing code. But AI isn't just for developers. It's an incredibly powerful tool for operations folks, too. In this lightning talk, I'll share how I use Kiro CLI and the Kiro console with AWS Model Context Protocol (MCP) integrations for day-to-day cloud operations. From information gathering and log analysis to reporting and IAM policy interpretation, these tools help reduce cognitive load and speed up your output when working with AWS environments. I'll also discuss how I used Kiro's spec-driven development approach to build a Python-based reporting tool, despite not being a software developer. This session is designed to make AI tooling feel approachable and practical for anyone working in AWS, not just developers.
- SMB203 Intermediate
From Vision AI to Agentic AI: Real-Time Ops & Compliance in QSR
Fingermark's Eyecue platform turns drive-thru video feeds into real-time operational intelligence for some of the world's largest QSR brands. Using hybrid edge-cloud architecture on AWS, they track every customer journey, capturing precise timing at order points, windows, and bays, while keeping sensitive data at the edge. Now they're taking the next leap: agentic AI powered by Amazon Bedrock AgentCore. Autonomous agents automatically answer compliance questions ("Are there spills? Are staff following food handling protocols?"), replacing manual audits with continuous monitoring. See how a Kiwi company scaled from local innovation to global impact, and from computer vision to autonomous agents.
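Several of the sessions above (for example DEV207 and DEV312) describe correlating log lines across services or invocations through a shared correlation ID. The sketch below shows that general idea using only the Python standard library. It is not the Powertools for AWS Lambda API, and the names `correlation_id`, `JsonFormatter`, `build_logger`, and `handle_event`, along with the `"correlation_id"` event key, are assumptions made for illustration.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the event currently being processed in this context.
correlation_id = contextvars.ContextVar("correlation_id", default="unset")


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object that carries the current correlation ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name="pipeline", stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr when None)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_event(event, logger):
    """Reuse the upstream ID when present, otherwise mint one, then log with it."""
    token = correlation_id.set(event.get("correlation_id") or str(uuid.uuid4()))
    try:
        logger.info("event received")
        # ... real processing would happen here ...
        logger.info("event processed")
        return correlation_id.get()
    finally:
        correlation_id.reset(token)  # avoid leaking the ID into the next event
```

Passing the returned ID along with any downstream message lets the next service log under the same value, so a single query on `correlation_id` can retrieve the whole path of an event.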
Live updates related to this topic
Sourced via Parallel AI Monitor, a continuous web watch on 21 topical streams.
- producthunt.com high confidence Agent dev tools & observability
The best new AI agents in 2026 - Product Hunt
TraceRoot launched an open-source observability platform for AI agents featuring a 'self-healing layer' that captures traces and uses AI to automatically identify bugs and open fix PRs by analyzing source code and GitHub history. It includes an OpenTelemetry-compatible SDK for ca
- traceroot.ai high confidence Agent dev tools & observability
TraceRoot
TraceRoot launched an open-source observability platform for AI agents featuring a 'self-healing layer' that captures traces and uses AI to automatically identify bugs and open fix PRs by analyzing source code and GitHub history. It includes an OpenTelemetry-compatible SDK for ca
- windsurf.com high confidence Agent dev tools & observability
Introducing the Agent Command Center and Devin in ...
- prnewswire.com high confidence Agent dev tools & observability
Edge Delta Makes All Telemetry Pipelines Data ...
- grafana.com Agent dev tools & observability
grafana.com
Grafana Labs announced several AI-focused observability and agent tools on April 21, 2026: 1) AI Observability in Grafana Cloud for real-time monitoring of agent inputs, outputs, and execution flows; 2) Expanded Grafana Assistant with a new API, Automations, and Remote MCP server
External links matched to this topic via topic relevance. The KB does not endorse third-party content; verify before citing.
Non-obvious insights
From the Playbook. One sharp, contrarian insight per session: the things teams don't think of unprompted.