Overview
AWS gives you two managed container orchestrators: Amazon EKS (managed Kubernetes, conformant with upstream) and Amazon ECS (AWS-native, simpler). Both run on either EC2 (you manage the nodes) or AWS Fargate (serverless containers; AWS manages the nodes). EKS Auto Mode further abstracts node lifecycle. For images, use Amazon ECR; for service mesh, Amazon VPC Lattice or App Mesh; for node autoscaling on EKS, Karpenter (open source, AWS-built).
Key concepts
- Kubernetes core objects: pods, deployments, services, ingress
- EKS Auto Mode and managed node groups vs. self-managed
- Fargate vs. EC2 — cost, isolation, and operational trade-offs
- Karpenter — fast, just-in-time node provisioning
- Service mesh and zero-trust networking with VPC Lattice
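To make the first concept above concrete, here is a minimal sketch of how a Deployment and a Service relate: the Deployment keeps a set of identical pods running, and the Service finds those pods through a label selector. The application name `web`, the ports, and the ECR image URI are hypothetical placeholders, not values from this page. Ingress is left out to keep the sketch short.

```python
import json


def make_deployment(name: str, image: str, replicas: int = 2, port: int = 8080) -> dict:
    """Build a Deployment: a controller that keeps `replicas` identical pods running."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels on the pod template below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


def make_service(name: str, port: int = 80, target_port: int = 8080) -> dict:
    """Build a ClusterIP Service that routes to pods labelled app=<name>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "ClusterIP",
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


# Hypothetical image URI; substitute your own ECR repository and tag.
IMAGE = "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/web:1.0.0"
deployment = make_deployment("web", IMAGE)
service = make_service("web")

if __name__ == "__main__":
    # Print both objects as one JSON document for inspection or piping.
    bundle = {"apiVersion": "v1", "kind": "List", "items": [deployment, service]}
    print(json.dumps(bundle, indent=2))
```

The same two objects apply whether the pods land on EC2 nodes, Fargate, or nodes provisioned by Karpenter; where the pods run is decided by the cluster's compute configuration rather than by anything in this sketch.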
Key AWS services
- Amazon EKS
- Amazon ECS
- AWS Fargate
- Amazon ECR
- Karpenter
Learn more — curated resources
Hand-picked official docs, foundational papers, and the best community guides for going deeper on this topic.
Sessions on this topic
19 sessions from the Summit covered this topic. Each is a self-contained mini-lesson.
- DEV204 (Intermediate)
AI-Powered EKS Troubleshooting with AWS DevOps Agent
Managing EKS clusters means correlating logs, metrics, IAM policies, and network configurations under pressure. The AWS DevOps Agent, announced at re:Invent 2025, changes this workflow fundamentally. In this session, you'll watch a live demonstration where the DevOps Agent autonomously investigates an EKS service failure, tracing issues from Pod logs through VPC security groups without manual intervention. You'll learn how the agent correlates cross-service dependencies, generates verified remediation plans, and integrates into existing SRE workflows.
- STP210 (Intermediate)
TeamForm's Generative Dashboards with Strands & Bedrock AgentCore
Most teams are still piloting AI - TeamForm is shipping it. In this session, we show how we built enterprise and production-ready generative dashboards in weeks on AWS Bedrock and AgentCore, and how an AI-native operating model made that velocity possible. Learn what it actually takes to operationalise AI across product and engineering, not just prototype it.
- PRT217-S (Intermediate)
Your Agents Should Be Durable
Sponsored by Temporal. Building AI agents is easy; making them production-ready is hard. Crashes, API failures, and state management are just a few challenges when moving from PoC to production. Learn how durable execution with Temporal makes it simple to build reliable agents that run for days, weeks, or months, using a code-first approach developers love.
- DEV209 (Intermediate)
CI/CD Guardrails for Agentic Coding Workflows
AI coding agents introduce failure modes traditional CI/CD pipelines weren't built to catch — deleted tests, weakened type constraints, silent cross-service regressions. This session examines practical pipeline-level guardrails for agentic workflows running on ECS Fargate and distributed CI environments. You'll learn which failure patterns agents introduce that humans rarely do, which automated checks reliably catch them, and how to structure pipelines that apply appropriate scrutiny to agent-generated code without blocking developer velocity. Leave with concrete, implementable patterns covering test integrity enforcement, type safety validation, and cross-service regression detection — applicable whether you're managing one agent or coordinating many across multiple repositories.
- ARC303 (Advanced)
Unlock GenAI inference anywhere with Amazon EKS Hybrid Nodes
Join this session to explore how Amazon EKS Hybrid Nodes enables GenAI inference anywhere. We'll discuss reference architectures for adding on-prem GPUs to your EKS hybrid cluster, and for running real-time data capture and processing at the edge. You'll learn how EKS Hybrid Nodes enables seamless integration between the cloud and your on-prem or edge environments. We'll also walk through a real-world example, showcasing how to accelerate GenAI inference at the edge using Amazon EKS Hybrid Nodes with the NVIDIA DGX platform.
- DEV210 (Intermediate)
AI-Driven Incident Triage: From Slack Alert to Root Cause
Modern AWS environments generate more alerts than teams can realistically investigate. This session demonstrates a proof-of-concept that transforms Slack alerts into automated investigation workflows using AI. Learn how to trigger parallel queries across CloudWatch, Amazon EKS, Prometheus, and deployment history when an alert fires, returning correlated summaries with probable causes and dashboard links directly in Slack. You'll leave understanding practical integration patterns for AI-assisted triage, telemetry hygiene requirements, and guardrails for safely introducing AI into production incident response. Discover how AI augments, rather than replaces, your existing observability stack, meaningfully reducing time-to-insight during incidents.
- STP203 (Intermediate)
Build, Evaluate and Scale Production ready Agents with AWS Containers
Building an agent that works once is easy; building an agent that works reliably for thousands of users is an architectural challenge. This session bridges the gap between experimental notebooks and deployed systems, focusing on the specific engineering disciplines needed for success. Join us to learn practical strategies for: 1. System Design: architecting decoupled, scalable agent backends from day one. 2. Continuous Evaluation: moving beyond "vibes-based" testing to metrics-driven evaluation suites that ensure reliability. 3. DevEx & Tooling: streamlining the developer experience to tighten feedback loops and ship improvements faster using open-source frameworks.
- ISV201 (Intermediate)
MCP on EKS: Xero's AI-Driven Developer Experience
AI coding agents are transforming how developers build and operate modern cloud-native applications. With tools such as Kiro CLI, Kiro IDE, or any MCP-compatible AI coding assistant, developers are embracing AI to move faster and scale smarter. This session explores how MCP servers help developers streamline code generation, deployment, and debugging by embedding infrastructure awareness directly into the AI assistant. Learn how Xero is leveraging MCP to speed up development, simplify operations, and deliver more reliable containerized apps at scale. Xero will also share their success story using Kiro CLI, Prometheus MCP, EKS MCP, and AWS Knowledge Base MCP to identify and resolve Prometheus cost spikes, slashing costs by 40%.
- DEV307 (Advanced)
Active-Active Global Architecture with CloudFront and Route 53
In this lightning talk, we'll walk through a real-world architectural pattern used in production: combining AWS CloudFront with Route 53 latency-based routing to make your ECS-backed services truly global. Starting with the problem of slow response times for APAC users, we'll build up a practical active-active architecture step by step. You'll see how CloudFront sits in front of your regional ALBs, how WAF is woven into the design from the start rather than bolted on later, and why getting your domain configuration right — distinguishing between your ALB origin domain and your public-facing CloudFront alternate domain — is critical to making this pattern work correctly.
- IND202 (Intermediate)
How Zuru Uses AI to Analyze TikTok Trends for Rapid Content Creation
For modern consumer brands, winning means moving at the speed of culture. Zuru uses Amazon Bedrock and Twelve Labs to analyze up to 10,000 TikTok videos a day, rapidly identifying viral trends, emotional cues, and content patterns to create creator briefs in hours instead of weeks. Join this session to see how AWS gives Zuru a measurable edge, from 30 million organic views in seven days to 50x faster content creation, with industry peers now looking to replicate its speed-to-market advantage.
- DEV203 (Intermediate)
Decisions Over Diagrams: How Bell Financial Group Architects on AWS
Architecture diagrams show what you built. They don't explain why. At Bell Financial Group, every major technology choice — from landing zone design to compute platform to database engine — is captured in an Architecture Decision Document that forces honest evaluation of trade-offs. In this talk, the Head of Engineering at Bell Financial Group walks through the real decisions behind their AWS platform: why ECS Fargate beat EKS, when DynamoDB wins over relational databases, why the entire infrastructure is written in TypeScript CDK, and the deliberate constraints they place on Lambda usage. No slides full of boxes and arrows — just the reasoning, the trade-offs, and the lessons learned building a regulated financial services platform on AWS.
- ISV102 (Foundational)
From documents to voice - building AI products on AWS
How Affinda leverages Amazon Bedrock (Claude), SageMaker, EKS & CloudFormation to deliver intelligent document processing at enterprise scale, cutting setup time and costs by 90% with 95%+ accuracy. This session will demonstrate how Affinda powers real-world AI product development, from Affinda's Intelligent Document Processing platform to custom AI agents from Pathfindr (acquired by Affinda). The session will showcase the complete journey of building Honey Insurance's voice agent, Australia's first voice agent in financial services, and how the Affinda-AWS partnership enables rapid AI product development for enterprises.
- DEV310 (Advanced)
Zero-Downtime Migration from Sydney to Auckland (ap-southeast-6)
With AWS ap-southeast-6 (Auckland) now open, New Zealand organizations can repatriate workloads from Sydney. This advanced session provides practical migration strategies minimizing downtime and eliminating data loss across every layer of your stack. You'll learn region-to-region migration patterns for storage (S3 replication, EBS snapshots, EFS cross-region transfers), databases (RDS read replicas, DynamoDB global tables, self-managed EC2 database replication), and applications (Lambda, ECS/EKS workload migration, EC2 AMI copying). Walk away with a prioritized migration playbook, realistic RTO/RPO targets, and battle-tested sequencing strategies for large-scale data transfers without extended application outages.
- INO103 (Foundational)
Adopting AI-DLC at Scale: How SEEK Is Transforming Product Delivery
Most organisations use AI to help developers code faster, but few have figured out what needs to change when building is no longer the bottleneck. This session introduces AI-DLC, the next evolution in how teams deliver software: a methodology that compresses specification timelines from months to weeks, and fundamentally changes how product teams operate. SEEK's Principal Product Manager shares how AI-DLC reshaped their people, process, and technology, and how they're now scaling across multiple product teams. You'll hear what's working, what's hard, what they're still figuring out and what it means for how your organisation delivers.
- ISV213 (Intermediate)
From GRC Platform to AI-Native Risk Intelligence on AWS: Protecht Story
Protecht, a global leader in enterprise risk management software, partnered with AWS and Caylent to build Cognita AI, an embedded AI assistant purpose-built for governance, risk, and compliance (GRC). Backed by a $280M PSG investment, Protecht built Cognita on a production-grade Amazon EKS foundation, integrating Amazon Bedrock and Anthropic's Claude models with a RAG architecture grounded in Protecht's proprietary GRC content. The result is a contextual, explainable, and auditable AI assistant that guides risk professionals through complex workflows, earning high accolades at the Gartner Enterprise Risk, Audit & Compliance Conference and setting a new benchmark for investor-grade, regulator-trusted AI in months.
- FSI202 (Intermediate)
Accelerating Payment Innovation: Spec-Driven Development with AWS Kiro
Australian Payments Plus (AP+), operator of Australia's critical payment infrastructure including eftpos, BPAY, and NPP, processing millions of daily transactions, transformed their development practices by adopting Spec-Driven Development using AWS Kiro. AP+ manages the payment rails connecting banks, merchants, and consumers throughout Australia. Through intensive Event-Driven Architecture bootcamps and hands-on training, engineering teams now independently run development workshops every two weeks, accelerating delivery of payment platform innovations while maintaining the highest security and compliance standards required for national financial infrastructure. Learn the practical framework for building development velocity in regulated environments.
- ISV207 (Intermediate)
How Canva Scales and Optimizes AI Workloads with Karpenter
This session explores how Canva leverages Karpenter to scale and optimize diverse workloads on Amazon EKS. Learn how Canva manages AI workloads using On-Demand Capacity Reservations (ODCRs) and EC2 Capacity Blocks for ML, while maximizing resource utilization by intelligently co-locating CPU and GPU workloads on GPU nodes. We will dive into NodePool management strategies for efficient scheduling of AI workloads and examine how Canva uses a range of Amazon EC2 instance types to operate a multi-tenant container orchestration platform for all workloads, optimizing for cost-effectiveness and resource efficiency. Ideal for platform engineers and Kubernetes operators looking to optimize their EKS clusters for both AI and general workloads at scale.
- MAE204 (Intermediate)
How Amazon Ads Creative Agent uses AWS to democratize ad creation
Media advertisers see up to 25% higher engagement when delivering custom creative to relevant audiences, yet producing quality video ads traditionally requires weeks of expensive and specialized expertise. Discover the inner workings of Amazon Ads new AI Creative Agent, and how it's transforming the creative process by automating and enhancing the generation of multi-format ads to businesses regardless of their size or creative expertise. Explore how Amazon Bedrock, custom-built ML models, GPUs, and model evaluations are used to orchestrate and generate compelling ad creatives into full video productions with professional voiceovers from conversational natural language, while reducing creative development time.
- FSI203 (Intermediate)
How HBF Transformed Claims Processing From Two Weeks to Two Minutes
In this session discover how HBF revolutionized claims processing using AWS. By leveraging Amazon Bedrock and Amazon Textract, they cut claim costs from $2 to just 10 cents and reduced the processing time from two weeks to two minutes. With accuracy in the high 90s and 70,000 claims processed monthly, their end-to-end AI-powered architecture for claims processing sets a new benchmark for speed, cost, and customer satisfaction.
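One of the guardrails named in the DEV209 abstract, test integrity enforcement, can be sketched in a few lines. This is an illustration under stated assumptions, not the approach presented in the session: it assumes the pipeline can obtain the output of `git diff --name-status` for the change under review, and it uses a hypothetical naming heuristic (`test_*`, `*_test.py`, or a `tests` directory) to decide what counts as a test file.

```python
from pathlib import PurePosixPath


def is_test_file(path: str) -> bool:
    """Heuristic (an assumption): pytest-style names or anything under a tests directory."""
    p = PurePosixPath(path)
    return (
        p.name.startswith("test_")
        or p.name.endswith("_test.py")
        or "tests" in p.parts
    )


def deleted_tests(name_status: str) -> list[str]:
    """Return test files deleted in a change, given `git diff --name-status` output.

    Each input line is tab-separated: a status letter, then one or more paths.
    Only plain deletions (status "D") are flagged here.
    """
    flagged = []
    for line in name_status.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0] == "D" and is_test_file(fields[1]):
            flagged.append(fields[1])
    return flagged


# Hypothetical diff summary: one modified file, one deleted test, one deleted doc, one added file.
sample = "M\tsrc/app.py\nD\ttests/test_billing.py\nD\tdocs/old.md\nA\tsrc/new.py"
flagged = deleted_tests(sample)

if __name__ == "__main__":
    for path in flagged:
        print(f"deleted test file: {path}")
```

A pipeline step could fail the build, or require an explicit human approval, whenever the returned list is non-empty. Renames and tests that are emptied rather than deleted would need additional checks that this sketch does not cover.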
Non-obvious insights
From the Playbook: one sharp, contrarian insight per session, the things teams don't think of unprompted.