Observability & Monitoring

Logs, metrics, traces — and now AI-powered insights.

18 sessions at the summit · 5 external resources

Overview

Observability is more than monitoring: it's the ability to ask new questions of your system without shipping new code. AWS's native stack includes Amazon CloudWatch (metrics, logs, dashboards, alarms), AWS X-Ray (distributed tracing), Amazon Managed Service for Prometheus, and Amazon Managed Grafana. CloudWatch Application Signals automatically discovers services and shows golden signals. AI-driven tools like Amazon Q in CloudWatch and partner solutions (Datadog, New Relic, Dynatrace, Splunk) help triage incidents.

Key concepts

  1. Three pillars: logs, metrics, traces — plus events and profiles
  2. OpenTelemetry as the vendor-neutral standard
  3. SLOs, SLIs, error budgets — the SRE language
  4. Anomaly detection and automated root-cause analysis
  5. Cost-aware observability — sampling, retention, log-class tiers
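Concept 5 above can be made concrete with head-based sampling that always keeps error traces and only a fixed fraction of the rest. A minimal sketch, where the 10% default rate and the `is_error` flag are illustrative assumptions rather than any particular SDK's API:

```python
import random

def should_sample(is_error: bool, rate: float = 0.10, rng=random.random) -> bool:
    # Always retain error traces; sample the healthy ones to cap ingestion cost.
    if is_error:
        return True
    return rng() < rate
```

Tail-based sampling (deciding after the trace completes) also catches slow-but-successful requests, at the cost of buffering whole traces before the decision.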

Key AWS services

  • Amazon CloudWatch
  • AWS X-Ray
  • Amazon Managed Service for Prometheus
  • Amazon Managed Grafana
  • AWS Distro for OpenTelemetry

Learn more — curated resources

Hand-picked official docs, foundational papers, and the best community guides for going deeper on this topic.

Sessions on this topic

18 sessions from the Summit covered this topic. Each is a self-contained mini-lesson.

  1. PRT215-S (Intermediate)

    The Visibility Gap: Turning Observability into DevSecOps Signals

    Sponsored by Datadog. Security teams and dev teams share the same production environment but operate from different signals. Without runtime context, security monitoring has blind spots: misconfigured infrastructure and threats in flight go unnoticed. This session draws on Fone Dynamics' ISO 27001 journey to show how runtime telemetry, cloud audit logs, and code scanning give DevSecOps and SecOps teams shared context.

  2. AIM201 (Intermediate)

    From demo to deployment: solving agentic AI's toughest challenges

    Most AI agent projects stall when moving from prototype to production. This session tackles the top challenges builders face when deploying agentic AI at scale. You'll learn how to answer the fundamental question of whether to build custom agents or leverage pre-built agents for DevOps, security, development, and business productivity use cases. Then you'll discover how to address the critical production challenges of reliability, observability, cost management, security, and evaluation. Drawing from real customer deployments and AWS's portfolio of agentic AI capabilities, you'll gain actionable approaches for building agents that don't just demo well but ship and scale.

  3. DEV204 (Intermediate)

    AI-Powered EKS Troubleshooting with AWS DevOps Agent

    Managing EKS clusters means correlating logs, metrics, IAM policies, and network configurations under pressure. The AWS DevOps Agent, announced at re:Invent 2025, changes this workflow fundamentally. In this session, you'll watch a live demonstration where the DevOps Agent autonomously investigates an EKS service failure, tracing issues from Pod logs through VPC Security Groups without manual intervention. You'll learn how the agent correlates cross-service dependencies, generates verified remediation plans, and integrates into existing SRE workflows.

  4. ISV302 (Advanced)

    Architecting Scalable AI Agents using Amazon Bedrock AgentCore

    Discover how to build powerful AI agents using Amazon Bedrock's suite of tools, with a focus on Amazon Bedrock AgentCore. This session explores how Parrot Analytics leveraged the modular components of Amazon Bedrock AgentCore and Amazon Nova foundation models to achieve 10x the processing speed of manual classification across 2M+ entities. We will dive into prompt and context engineering, knowledge bases, and observability for production agentic workloads.

  5. DEV207 (Intermediate)

    Data Observability Without the Pain - Lessons from a Production System

    Modern IoT platforms are inherently data platforms. Events flow through APIs, queues, AWS Lambda functions, storage systems, and device networks before becoming meaningful data. When something goes wrong, tracing a single event across these distributed components quickly becomes painful, and the question shifts from _what happened_ to _where do I even start looking?_ I'll walk through three practical observability patterns drawn from building and operating a production, event-driven IoT healthcare platform on AWS that processes tens of thousands of device events daily. Using OpenTelemetry, AWS X-Ray, and Honeycomb, we'll explore techniques for gaining visibility into asynchronous event pipelines, correlating activity across services, and tracing events as they move through distributed systems. You'll leave with three concrete patterns you can apply immediately to your own event-driven data systems.

  6. COP301 (Advanced)

    Elevating your Agentic AI Observability

    Gain deep visibility into the performance and reliability of autonomous agents with Amazon CloudWatch. This session showcases how CloudWatch delivers end-to-end observability for agentic AI workloads, tracking decision quality, token efficiency, and workflow execution at scale. Explore pre-built dashboards and advanced metrics that help you optimize agent performance, control operational costs, and maintain consistent behavior across complex intelligent systems. Walk away ready to implement production-grade observability that ensures your AI agents operate reliably, make optimal decisions, and deliver measurable outcomes at scale.

  7. ISV210 (Intermediate)

    Boost performance and reduce costs with Aurora: Canva's story

    From initial idea to execution: learn about Canva's journey migrating MySQL workloads from Amazon RDS to Aurora at scale. Discover how Canva achieved meaningful performance improvements, cost savings, and operational efficiencies through this strategic migration. This lightning talk shares real-world insights on planning and executing large-scale database migrations, key Aurora best practices for optimizing cost and performance, and how the latest monitoring features help maintain efficiency as you scale. Learn how AWS Countdown Premium (CDP) accelerated and de-risked Canva's migration, delivering tangible business value while minimizing operational disruption.

  8. PRT209-S (Intermediate)

    How Auto & General leverage observability foundations for AI

    As one of Australia's leading general insurers, Auto & General knew AI would play an important part in their future IT operations. To ensure success, the team embarked on a comprehensive observability maturity journey to build solid foundations, governance, and structure. Learn how A&G worked with New Relic to successfully lay observability foundations for the AI age.

  9. PRT101-S (Foundational)

    Accelerating Innovation with GitLab DAP Powered by Amazon Bedrock

    Learn how GitLab Duo Agent Platform (DAP) powered by Amazon Bedrock brings agentic AI into every stage of the software development lifecycle while keeping data, logs, and inference traffic inside your AWS environment. We'll show how teams can orchestrate AI-assisted workflows for planning, coding, security, and compliance using Amazon Bedrock foundation models behind GitLab's AI Gateway.

  10. PRT203-S (Intermediate)

    Design, Deploy, and Govern AI Agents with Boomi's Agentstudio: 5 Steps to Enterprise-Grade AI Security for Amazon Bedrock Projects

    Transform enterprise automation with Boomi's AI agent ecosystem. Learn to use Agent Designer to visually build agents that connect across systems, and Agent Control Tower for centralised governance, compliance, and performance monitoring. Securely orchestrate your AI lifecycle at scale with Amazon Bedrock.

  11. DEV312 (Advanced)

    Strands Agents on Lambda: Observability With Powertools & X-Ray

    When a Strands Agent fails across five Lambda log streams with no correlation, debugging takes 20 minutes minimum. This session demonstrates a structured observability layer that reduces diagnosis to under two minutes. You'll learn how Lambda Powertools Tracer wraps Strands tool invocations as X-Ray subsegments, how Powertools Logger injects AgentCore session correlation IDs across invocations, and how Powertools Metrics surfaces tool retry frequency as CloudWatch alarms — before timeouts occur. The session covers three production failure classes — tool timeout, reasoning loop, and retry storm — and delivers a reusable CDK construct providing full instrumentation for any Strands Agent Lambda deployment.
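The correlation-ID pattern described above can be illustrated with the standard library alone: a context variable carries the session ID, and a logging filter stamps it onto every record. The names here are illustrative stand-ins, not the Powertools or AgentCore API:

```python
import contextvars
import logging

# Illustrative name; stands in for an AgentCore-style session correlation ID.
session_id = contextvars.ContextVar("session_id", default="-")

class CorrelationFilter(logging.Filter):
    """Inject the current session correlation ID into every log record."""
    def filter(self, record):
        record.session_id = session_id.get()
        return True

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(session_id)s %(message)s"))
handler.addFilter(CorrelationFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_invocation(event: dict) -> None:
    # Set once per invocation; every nested log call inherits it,
    # so all five log streams carry the same searchable ID.
    session_id.set(event.get("sessionId", "-"))
    logger.info("tool invoked")

handle_invocation({"sessionId": "sess-42"})
```

Once every record carries the ID, a single CloudWatch Logs Insights filter on that field reassembles the whole agent run.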

  12. DAT402 (Expert)

    Deep dive into database integrations with AWS Zero-ETL

    Learn how AWS zero-ETL integrations eliminate complex data movement pipelines across multiple database engines, enabling data engineers, architects, and DBAs to reduce maintenance overhead while ensuring near real-time data availability for analytics and ML workloads. Examine the underlying architecture for supported zero-ETL integrations between Amazon Aurora, Amazon DynamoDB, and Amazon RDS sources to Amazon Redshift, Amazon SageMaker, and Amazon OpenSearch Service targets. Explore data movement options, tunable settings, and monitoring capabilities for ongoing data replication, all without traditional ETL complexity.

  13. DEV305 (Advanced)

    Agents in the enterprise: Best practices with Amazon Bedrock AgentCore

    As organizations scale AI agent development, robust enterprise architecture patterns become essential. In this advanced session, we'll explore how Amazon Bedrock AgentCore enables teams to build modular systems using their preferred frameworks while sharing tools through MCP gateways. Learn about A2A collaboration, shared memory, identity-based access controls, and integrated observability. Discover practical strategies for secure runtime deployment, standardized tool integration, evaluation frameworks, and end-to-end monitoring. Leave with actionable insights to build secure, scalable agent infrastructures that balance centralized governance with team autonomy.

  14. SEC302 (Advanced)

    Leap ahead in Cloud Operations with AWS DevOps Agent

    Downtime costs revenue. Alert fatigue burns out your best engineers. Manual incident investigation wastes hours that could be spent building. Every cloud team faces these operational challenges, yet most still rely on tribal knowledge and context-switching across multiple tools to diagnose issues. In this session, we demonstrate how AWS DevOps Agent transforms incident response from hours of manual investigation to minutes of autonomous analysis. Watch as the agent automatically correlates data across your observability tools, identifies root causes, and delivers actionable mitigation plans, freeing your team to build instead of firefight.

  15. DEV210 (Intermediate)

    AI-Driven Incident Triage: From Slack Alert to Root Cause

    Modern AWS environments generate more alerts than teams can realistically investigate. This session demonstrates a proof-of-concept that transforms Slack alerts into automated investigation workflows using AI. Learn how to trigger parallel queries across CloudWatch, Amazon EKS, Prometheus, and deployment history when an alert fires — returning correlated summaries with probable causes and dashboard links directly in Slack. You'll leave understanding practical integration patterns for AI-assisted triage, telemetry hygiene requirements, and guardrails for safely introducing AI into production incident response. Discover how AI augments — rather than replaces — your existing observability stack, meaningfully reducing time-to-insight during incidents.
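The fan-out pattern the session describes can be sketched with `concurrent.futures`; the four query functions below are hypothetical stubs standing in for real CloudWatch, EKS, Prometheus, and deployment-history clients:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs; real implementations would call the respective APIs.
def query_cloudwatch(alert):  return {"source": "cloudwatch", "error_count": 14}
def query_eks(alert):         return {"source": "eks", "pod_restarts": 3}
def query_prometheus(alert):  return {"source": "prometheus", "p99_ms": 950}
def query_deploys(alert):     return {"source": "deploys", "last_release": "v1.4.2"}

QUERIES = [query_cloudwatch, query_eks, query_prometheus, query_deploys]

def triage(alert: dict) -> list[dict]:
    # Fire all queries in parallel so total latency is the slowest
    # single source, not the sum of all four.
    with ThreadPoolExecutor(max_workers=len(QUERIES)) as pool:
        futures = [pool.submit(q, alert) for q in QUERIES]
        return [f.result() for f in futures]
```

The correlated list then feeds a summarization step that posts probable causes and dashboard links back to Slack.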

  16. WPS203 (Intermediate)

    Optimising Outpatient Waitlists with ML at Gold Coast Health

    Deploying ML in high-stakes environments demands enterprise readiness, governance, and continuous monitoring. In this session, you'll learn how Gold Coast Health moved from pilot to production with a predictive model identifying patients unlikely to attend procedures — achieving 33% precision, doubling the 15% manual baseline — while ensuring fairness across cohorts. The session covers real-world ML architecture on Amazon SageMaker Pipelines, production monitoring including data quality, pipeline health, and drift detection, plus navigating AI governance through bias analysis and impact assessment. Whether you're in healthcare, financial services, or any regulated industry, walk away with actionable patterns for deploying responsible ML at scale.

  17. DEV206 (Intermediate)

    AI Isn't Just for Developers: Using Kiro CLI & AWS MCP for Cloud Ops

    You can't turn your head sideways without seeing a slew of articles, blogs, or videos about AI, and most of them focus on developer tooling and writing code. But AI isn't just for developers. It's an incredibly powerful tool for operations folks, too. In this lightning talk, I'll share how I use Kiro CLI and the Kiro console with AWS Model Context Protocol (MCP) integrations for day-to-day cloud operations. From information gathering and log analysis to reporting and IAM policy interpretation, these tools help reduce cognitive load and speed up your output when working with AWS environments. I'll also discuss how I used Kiro's spec-driven development approach to build a Python-based reporting tool, despite not being a software developer. This session is designed to make AI tooling feel approachable and practical for anyone working in AWS — not just developers.

  18. SMB203 (Intermediate)

    From Vision AI to Agentic AI: Real-Time Ops & Compliance in QSR

    Fingermark's Eyecue platform turns drive-thru video feeds into real-time operational intelligence for some of the world's largest QSR brands. Using hybrid edge-cloud architecture on AWS, they track every customer journey, capturing precise timing at order points, windows, and bays, while keeping sensitive data at the edge. Now they're taking the next leap: agentic AI powered by Amazon Bedrock AgentCore. Autonomous agents automatically answer compliance questions ("Are there spills? Are staff following food handling protocols?"), replacing manual audits with continuous monitoring. See how a Kiwi company scaled from local innovation to global impact, and from computer vision to autonomous agents.

Live updates related to this topic

Sourced via Parallel AI Monitor — continuous web watch on 21 topical streams.

External links matched to this topic via topic relevance. The KB does not endorse third-party content; verify before citing.

Non-obvious insights

From the Playbook

One sharp, contrarian insight per session — the things teams don't think of unprompted.

The fastest path to ISO 27001 evidence isn't more controls — it's tagging existing logs to existing control IDs. Most enterprises already have 70%+ of evidence; they just can't find it on demand. (PRT215-S — The Visibility Gap: Turning Observability into DevSecOps Signals)
The single highest-leverage practice in agent ops is the offline eval suite. It's tedious to build but it unlocks everything downstream — model upgrades, prompt iteration, regression testing, vendor swaps. Teams that skip evals end up trapped on a single model and prompt forever. (AIM201 — From demo to deployment: solving agentic AI's toughest challenges)
Agents are best at the boring 80% of incidents. The hard 20% they'll fumble — that's where humans still win. So measure success on *time-to-page-the-human*, not on full autoresolution. The agent's job is to short-circuit the easy stuff and hand off cleanly when it's stuck. (DEV204 — AI-Powered EKS Troubleshooting with AWS DevOps Agent)
Modular AgentCore decomposition lets you swap models per stage. Use a cheap model for triage ("is this even worth processing?"), a mid-tier for the bulk, and an expensive model only for ambiguous cases that fail confidence checks. Don't run uniform inference. The cost difference is 10×. (ISV302 — Architecting Scalable AI Agents using Amazon Bedrock AgentCore)
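The tiered-routing idea can be sketched as a confidence-gated cascade; the model callables and the 0.8 threshold below are assumptions for illustration, not a Bedrock API:

```python
def route(item, triage_model, mid_model, big_model, threshold=0.8):
    """Run the cheapest model first and escalate only on low confidence.
    Each model is a callable returning (label, confidence)."""
    for tier, model in (("triage", triage_model), ("mid", mid_model)):
        label, conf = model(item)
        if conf >= threshold:
            return label, tier
    label, _ = big_model(item)  # only ambiguous cases pay full price
    return label, "big"
```

If most items clear the triage tier, the expensive model handles only the residual tail, which is where the order-of-magnitude cost gap comes from.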
The biggest observability win is not tools — it's a *correlation ID standard* the team enforces. Pick one (the X-Ray trace ID is fine), enforce it everywhere, and stop debating. Tooling matters far less than you think once the IDs are consistent. (DEV207 — Data Observability Without the Pain - Lessons from a Production System)
The metric most agentic systems should track and don't is *loop count* — how many tool calls per completed task. It's the canary for prompt regression, model drift, and broken tools. When loop count starts trending up week-over-week, something is wrong even if all your other metrics look fine. (COP301 — Elevating your Agentic AI Observability)
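Tracking loop count needs nothing exotic: a counter keyed by task, aggregated into a mean you can chart week-over-week. A stdlib sketch with illustrative names:

```python
from collections import defaultdict

class LoopCounter:
    """Count tool calls per task; a rising mean is an early warning of
    prompt regression, model drift, or a broken tool."""
    def __init__(self):
        self._calls = defaultdict(int)

    def record_tool_call(self, task_id: str) -> None:
        self._calls[task_id] += 1

    def loop_count(self, task_id: str) -> int:
        return self._calls[task_id]

    def mean_loop_count(self) -> float:
        # The number to chart and alarm on week-over-week.
        return sum(self._calls.values()) / max(len(self._calls), 1)
```

In production this aggregate would typically be emitted as a custom CloudWatch metric with an anomaly-detection alarm on the trend.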