Overview
Resilience is the ability of a workload to recover from failures. AWS offers AWS Resilience Hub (continuous resilience assessment), AWS Backup (centralized backup), AWS Elastic Disaster Recovery (DRS — sub-minute RPO), and AWS Fault Injection Service (chaos engineering). The four common DR strategies are Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active — trading cost for RTO/RPO. Multi-region active-active with Aurora DSQL or DynamoDB global tables is now feasible for many applications.
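The cost-for-RTO/RPO trade-off among the four strategies can be sketched as a small lookup: given a workload's recovery targets, pick the cheapest strategy that still meets them. This is a minimal sketch, not an AWS tool. The `Strategy` type, the `cheapest_strategy` helper, and every number in the table are illustrative assumptions chosen only to preserve the ordering (cheaper strategies recover more slowly); real figures depend on the workload and must come from testing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float   # assumed achievable recovery time (illustrative)
    rpo_minutes: float   # assumed achievable data-loss window (illustrative)
    relative_cost: int   # 1 = cheapest, 4 = most expensive


# Hypothetical figures: only the ordering reflects the trade-off in the text.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60, 24 * 60, 1),
    Strategy("Pilot Light", 60, 15, 2),
    Strategy("Warm Standby", 15, 5, 3),
    Strategy("Multi-Site Active/Active", 1, 1, 4),
]


def cheapest_strategy(rto_target_min: float, rpo_target_min: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose assumed RTO and RPO meet both targets."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= rto_target_min and s.rpo_minutes <= rpo_target_min
    ]
    if not candidates:
        return None  # no strategy in this table is fast enough
    return min(candidates, key=lambda s: s.relative_cost)


if __name__ == "__main__":
    choice = cheapest_strategy(rto_target_min=30, rpo_target_min=10)
    print(choice.name if choice else "no strategy meets the targets")
```

Tightening either target moves the choice toward the more expensive end of the table, which is the trade-off the overview describes.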
Key concepts
- RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
- Four DR strategies and the cost/RTO trade-off
- Chaos engineering and game days
- Multi-AZ vs. multi-region — what each protects against
- Cell-based architectures and bulkheads
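As a rough illustration of the cell-based and bulkhead idea in the list above, the sketch below pins each customer to one cell so that a failing cell affects only the customers assigned to it. The cell names and the `cell_for` and `blast_radius` helpers are hypothetical. Hash-modulo placement is an assumption made for brevity; production routers often keep an explicit mapping table instead, so that customers can be moved between cells without remapping everyone when the cell count changes.

```python
import hashlib
from typing import Sequence

# Hypothetical cell identifiers; each cell would be a full, isolated copy of the stack.
CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]


def cell_for(customer_id: str, cells: Sequence[str] = CELLS) -> str:
    """Deterministically map a customer to exactly one cell using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def blast_radius(failed_cell: str, customer_ids: Sequence[str],
                 cells: Sequence[str] = CELLS) -> float:
    """Fraction of customers affected if one cell fails (the bulkhead effect)."""
    if not customer_ids:
        return 0.0
    affected = sum(1 for c in customer_ids if cell_for(c, cells) == failed_cell)
    return affected / len(customer_ids)


if __name__ == "__main__":
    customers = [f"customer-{i}" for i in range(1000)]
    for cell in CELLS:
        print(cell, f"{blast_radius(cell, customers):.1%}")
```

With four cells, a single-cell failure reaches roughly a quarter of customers rather than all of them. Adding cells shrinks that fraction at the price of more infrastructure to run.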
Key AWS services
- AWS Resilience Hub
- AWS Backup
- AWS Elastic Disaster Recovery
- AWS Fault Injection Service
Learn more — curated resources
Hand-picked official docs, foundational papers, and the best community guides for going deeper on this topic.
Sessions on this topic
11 sessions from the Summit covered this topic. Each is a self-contained mini-lesson.
- PRT104-S Foundational
Building Resilience for AI Data Foundations and Cloud-Native Apps 5 Steps to Enterprise-Grade AI Security for Amazon Bedrock Projects
AI innovation depends on consistent, trusted data. When disrupted, AI systems and the business decisions they support are at risk. In this session, learn how cloud-native protection models support AI pipelines, reduce recovery time after disruptions, and minimise operational overhead. Discover best practices to protect AI and cloud-native applications in AWS while innovating with confidence.
- PRT207-S Intermediate
Charting the CX Frontier: A Cohesive, AI-Enabled Engagement Platform
Geopolitical instability, rising CX demands, rapid tech shifts, and escalating cyber threats converge faster than manual processes can handle. Join our expert panel as they leverage AWS and AI to build customer solutions, elevate engagement, and neutralise cyber threats. We'll share real deployments, proven governance, and measurable gains in efficiency, resilience, and customer impact.
- AIM301 Advanced
Commbank pioneering AI-driven DevSecOps with AWS DevOps Agent
CBA is achieving operational excellence by harnessing the power of the AWS DevOps Agent, part of AWS's new Frontier Agents. In this session, discover how CBA is using AI-driven automation to streamline incident response, reduce operational friction, and strengthen resilience across critical systems. We'll discuss CBA's cloud transformation journey and operational challenges, explore the DevOps Agent implementation including architecture, integration, and user journeys, and share results and business impact with real-world metrics. You'll see how automated remediation and proactive insights are helping teams move faster with greater confidence. Join us to discover how CBA is shaping a future where operations are smarter, safer, and built for scale.
- ARC201 Intermediate
Building on AWS resilience: Innovations for critical success
Essential services that power global economies and critical infrastructure demand exceptional resilience. Through nearly two decades of focused innovation, AWS has developed core engineering practices and operational approaches that power critical workloads worldwide. Explore how AWS's architectural innovations and organizational practices help customers build robust services that maintain resilience during severe disruptions. Learn how AWS's continued investment in resilience provides the foundation for delivering essential services across governments, economies, and critical infrastructure.
- ARC307 Advanced
AI Powered Resilience Lifecycle
Not all disaster recovery strategies can address the complex, dynamic nature of modern cloud infrastructures, leading to gaps in system resilience and compliance adherence. Discover how to enhance resilience and disaster recovery on AWS empowered by AI. This approach bridges infrastructure insights and application-level testing, enabling more effective disaster recovery preparation. You will learn how to leverage Large Language Models (LLMs) with AWS Resilience Hub and AWS Systems Manager to modernize testing, analyze infrastructure, and generate targeted AWS Fault Injection Service experiments and recovery runbooks. Walk away with practical examples of automated test generation with templates and learn to design prompts.
- ARC402 Expert
DynamoDB: Resilience & lessons from the Oct 2025 service disruption
In this session, we will walk through the architecture for the Amazon DynamoDB DNS management system that triggered the service disruption on October 20, 2025. We will share the lessons that the DynamoDB team learned from this event and explain how we are using these insights to improve both DynamoDB and AWS. You will walk away with actionable knowledge that you can apply to the systems you build.
- PRT111-S Foundational
From Risk to Resilience - How Mimecast Works with AWS
Human risk is a critical layer of any security strategy. Human risk management addresses how employee behavior, from accidental sharing to shadow AI use, creates organizational exposure. Discover how Mimecast, on AWS, helps identify risky behavior, protect critical data and account access, and support compliance. Real-world insights. Behavioural analytics. Adaptive controls. Measurable ROI.
- FSI201 Intermediate
BELIEVE: The Impossible Migration That Transformed Australian Banking
Commonwealth Bank migrated the world's largest SAP core banking deployment to AWS in 18 months: the system behind 40% of Australia's payments, 15 million customers, running 24/7. This isn't a lift-and-shift story; it's a reinvention of how the bank runs critical systems, from architecture and resilience engineering to replacing siloed operational teams with full-stack automation, and the cultural shift this required. Join us to hear how CBA's critical financial infrastructure was modernised with AWS, and what this unlocks for their AI-enabled future. If you're building foundations for regulated, mission-critical workloads, this is the session you don't want to miss.
- ISV202 Intermediate
Architecting for growth and resilience: Cell based design deep dive
As business demands evolve, architectural patterns must evolve too. SafetyCulture and Buildkite implemented cell-based architectures driven by distinct business objectives: scaling for hypergrowth and enhancing infrastructure resilience. SafetyCulture's expansion plans required proactive architectural evolution to unlock unlimited scaling capacity. Buildkite needed to meet stringent security isolation requirements while achieving scale through repeatable deployment units. This session shares real-world experiences as both companies designed and implemented cell-based architectures for their SaaS platforms. Discover how SafetyCulture identified bottlenecks, redesigned systems for isolation and resilience, and aligned technical capabilities with business growth targets. Learn how Buildkite leveraged cell-based design to achieve both scale and security isolation. Walk away with actionable patterns for building resilient, scalable architectures.
- WPS204 Intermediate
Safe Transport Victoria's Migration to AWS Cloud
Join us for an in-depth case study on Safe Transport Victoria's successful use of cloud to modernise, streamline, and save costs while moving from on-premises infrastructure to AWS Cloud. This session will demonstrate how a small regulator has added resilience to its safety outcomes, including for Victorians with accessibility and mobility needs, and achieved modernization while maintaining service continuity and reliability.
- WPS302 Advanced
Secure and Resilient Agentic AI for High-Assurance Environments
Autonomous AI systems that plan, decide, and act across workflows are transforming how organisations deliver mission-critical services. This session shares security-first architecture best practices for designing, deploying, and governing agentic AI in high-assurance environments, drawing from Australia's Information Security Manual (ISM) and AWS security frameworks. Discover practical patterns for architecting proactive, intelligent services while maintaining security, transparency, and operational resilience through defense-in-depth strategies and purpose-built AWS capabilities.
Live updates related to this topic
Sourced via Parallel AI Monitor, a continuous web watch on 21 topical streams.
- cloud.google.com high confidence Scaling infra for agent workloads
What’s new in compute at Next ‘26 | Google Cloud Blog
AgentBudget was identified as an open-source Python SDK that provides real-time cost enforcement for AI agents, allowing developers to set a hard dollar limit on any single AI agent session to prevent runaway expenses.
- gruve.ai high confidence Scaling infra for agent workloads
FAQs
AgentBudget was identified as an open-source Python SDK that provides real-time cost enforcement for AI agents, allowing developers to set a hard dollar limit on any single AI agent session to prevent runaway expenses.
- fast.io Scaling infra for agent workloads
AI Agent Rate Limiting Strategies & Best Practices
Arcjet introduced 'Guards,' a runtime security service for AI agent workflows that enables enforcement of per-user token budgets and spend limits inside agent loops and can detect prompt injection in tool results.
- cencori.com high confidence Scaling infra for agent workloads
Circuit Breakers for AI Agents: How We Stop Cascading ...
Waxell published a detailed framework on AI Agent Circuit Breakers, proposing automated circuit breakers implemented at the governance plane (outside agent code) to prevent runaway loops, monitor cost velocity, handle consecutive failures, and stop scope violations.
- insights.reinventing.ai high confidence Scaling infra for agent workloads
Multi-Agent Orchestration Patterns Drive Enterprise ROI in 2026
Waxell published a detailed framework on AI Agent Circuit Breakers, proposing automated circuit breakers implemented at the governance plane (outside agent code) to prevent runaway loops, monitor cost velocity, handle consecutive failures, and stop scope violations.
External links matched to this topic via topic relevance. The KB does not endorse third-party content; verify before citing.
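The circuit-breaker and spend-limit ideas summarized in the feed above can be sketched in a few lines. This is a generic illustration and not the API of AgentBudget, Arcjet Guards, or the Waxell framework. The `AgentCircuitBreaker` class, its thresholds, and the per-step cost figures are all hypothetical. The breaker sits outside the agent's own logic, trips on either consecutive failures or a hard dollar limit, and stays open until an operator resets it.

```python
class AgentCircuitBreaker:
    """Stops an agent loop after repeated failures or once a spend cap is reached."""

    def __init__(self, max_consecutive_failures: int = 3, max_spend_usd: float = 5.0):
        self.max_consecutive_failures = max_consecutive_failures
        self.max_spend_usd = max_spend_usd
        self.consecutive_failures = 0
        self.spent_usd = 0.0
        self.is_open = False  # open = calls are blocked

    def allow(self) -> bool:
        """Check before each agent step; False means the loop must stop."""
        return not self.is_open

    def record(self, succeeded: bool, cost_usd: float) -> None:
        """Report the outcome and cost of a step; trips the breaker if a limit is hit."""
        self.spent_usd += cost_usd
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        if (self.consecutive_failures >= self.max_consecutive_failures
                or self.spent_usd >= self.max_spend_usd):
            self.is_open = True

    def reset(self) -> None:
        """Manual reset, for example after an operator has reviewed the session."""
        self.consecutive_failures = 0
        self.spent_usd = 0.0
        self.is_open = False


if __name__ == "__main__":
    breaker = AgentCircuitBreaker(max_consecutive_failures=3, max_spend_usd=1.0)
    steps = 0
    while breaker.allow():
        # A real loop would call the model or a tool here; this stub always fails.
        breaker.record(succeeded=False, cost_usd=0.10)
        steps += 1
    print(f"stopped after {steps} steps, spent ${breaker.spent_usd:.2f}")
```

Keeping the limits in a wrapper rather than in the agent's prompt or code means a misbehaving agent cannot talk its way past them.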
Non-obvious insights
From the Playbook: one sharp, contrarian insight per session, the things teams don't think of unprompted.