Resilience & Disaster Recovery

Design for failure — because everything fails, all the time.

11 sessions at the summit · 4 external resources

Overview

Resilience is the ability of a workload to withstand failures and recover from them. AWS offers AWS Resilience Hub (continuous resilience assessment), AWS Backup (centralized backup), AWS Elastic Disaster Recovery (DRS, sub-minute RPO), and AWS Fault Injection Service (chaos engineering). The four common DR strategies are Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active/Active; in that order, each costs more to run and delivers a lower RTO and RPO. Multi-region active-active with Aurora DSQL or DynamoDB global tables is now feasible for many applications.
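
The cost versus RTO/RPO trade-off can be sketched as a small selection routine: given recovery targets, pick the cheapest of the four strategies that still meets them. This is a minimal sketch, not AWS guidance. The typical RTO/RPO figures and the relative cost ranks are illustrative assumptions, and the names Strategy and cheapest_strategy are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta  # assumed order-of-magnitude figure, not an AWS number
    typical_rpo: timedelta  # assumed order-of-magnitude figure, not an AWS number
    relative_cost: int      # 1 = cheapest to run, 4 = most expensive


# Illustrative assumptions only; measure your own workload before relying on them.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=15), 2),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=5), 3),
    Strategy("Multi-Site Active/Active", timedelta(minutes=1), timedelta(seconds=5), 4),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose assumed RTO and RPO meet both targets."""
    for strategy in sorted(STRATEGIES, key=lambda s: s.relative_cost):
        if strategy.typical_rto <= rto_target and strategy.typical_rpo <= rpo_target:
            return strategy
    return None  # no listed strategy meets targets this tight


if __name__ == "__main__":
    choice = cheapest_strategy(timedelta(minutes=30), timedelta(minutes=10))
    print(choice.name if choice else "no strategy meets the targets")
```

Walking the list cheapest-first mirrors how the decision is usually framed: start from Backup & Restore and move up only as far as the business targets force you to.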

Key concepts

  1. RTO (Recovery Time Objective) and RPO (Recovery Point Objective)
  2. Four DR strategies and the cost/RTO trade-off
  3. Chaos engineering and game days
  4. Multi-AZ vs. multi-region — what each protects against
  5. Cell-based architectures and bulkheads
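
Of the concepts above, cell-based routing is the easiest to show in a few lines. The sketch below maps each tenant to exactly one cell with a stable hash, so a failure in one cell affects only the tenants assigned to it (the bulkhead effect). The function names, cell names, and tenant IDs are hypothetical, and the hash-modulo mapping is an assumption chosen for brevity rather than a method described in the sessions.

```python
import hashlib


def assign_cell(tenant_id: str, cells: list[str]) -> str:
    """Deterministically map a tenant to one cell (hypothetical helper)."""
    if not cells:
        raise ValueError("at least one cell is required")
    # A stable hash keeps the mapping identical across processes and restarts.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def blast_radius(tenants: list[str], cells: list[str], failed_cell: str) -> list[str]:
    """Return the tenants affected when a single cell fails."""
    return [t for t in tenants if assign_cell(t, cells) == failed_cell]


if __name__ == "__main__":
    cells = ["cell-a", "cell-b", "cell-c"]             # hypothetical cell names
    tenants = [f"tenant-{i}" for i in range(12)]       # hypothetical tenant IDs
    affected = blast_radius(tenants, cells, "cell-a")
    print(f"{len(affected)} of {len(tenants)} tenants affected by a cell-a failure")
```

One caveat on the design: hash-modulo assignment moves tenants between cells whenever the number of cells changes, so a production router would more likely keep an explicit tenant-to-cell mapping table and use hashing only for initial placement.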

Key AWS services

  • AWS Resilience Hub
  • AWS Backup
  • AWS Elastic Disaster Recovery
  • AWS Fault Injection Service

Learn more — curated resources

Hand-picked official docs, foundational papers, and the best community guides for going deeper on this topic.

Sessions on this topic

11 sessions from the Summit covered this topic. Each is a self-contained mini-lesson.

  1. PRT104-S · Foundational

    Building Resilience for AI Data Foundations and Cloud-Native Apps 5 Steps to Enterprise-Grade AI Security for Amazon Bedrock Projects

    AI innovation depends on consistent, trusted data. When disrupted, AI systems and the business decisions they support are at risk. In this session, learn how cloud-native protection models support AI pipelines, reduce recovery time after disruptions, and minimise operational overhead. Discover best practices to protect AI and cloud-native applications in AWS while innovating with confidence.

  2. PRT207-S · Intermediate

    Charting the CX Frontier: A Cohesive, AI-Enabled Engagement Platform

    Geopolitical instability, rising CX demands, rapid tech shifts, and escalating cyber threats converge faster than manual processes can handle. Join our expert panel as they leverage AWS and AI to build customer solutions, elevate engagement, and neutralise cyber threats. We'll share real deployments, proven governance, and measurable gains in efficiency, resilience, and customer impact.

  3. AIM301 · Advanced

    Commbank pioneering AI-driven DevSecOps with AWS DevOps Agent

    CBA is achieving operational excellence by harnessing the power of the AWS DevOps Agent, part of AWS's new Frontier Agents. In this session, discover how CBA is using AI-driven automation to streamline incident response, reduce operational friction, and strengthen resilience across critical systems. We'll discuss CBA's cloud transformation journey and operational challenges, explore the DevOps Agent implementation including architecture, integration, and user journeys, and share results and business impact with real-world metrics. You'll see how automated remediation and proactive insights are helping teams move faster with greater confidence. Join us to discover how CBA is shaping a future where operations are smarter, safer, and built for scale.

  4. ARC201 · Intermediate

    Building on AWS resilience: Innovations for critical success

    Essential services that power global economies and critical infrastructure demand exceptional resilience. Through nearly two decades of focused innovation, AWS has developed core engineering practices and operational approaches that power critical workloads worldwide. Explore how AWS's architectural innovations and organizational practices help customers build robust services that maintain resilience during severe disruptions. Learn how AWS's continued investment in resilience provides the foundation for delivering essential services across governments, economies, and critical infrastructure.

  5. ARC307 · Advanced

    AI Powered Resilience Lifecycle

    Not all disaster recovery strategies can address the complex, dynamic nature of modern cloud infrastructures, leading to gaps in system resilience and compliance adherence. Discover how to enhance resilience and disaster recovery on AWS empowered by AI. This approach bridges infrastructure insights and application-level testing, enabling more effective disaster recovery preparation. You will learn how to leverage Large Language Models (LLMs) with AWS Resilience Hub and AWS Systems Manager to modernize testing, analyze infrastructure, and generate targeted AWS Fault Injection Service experiments and recovery runbooks. Walk away with practical examples of automated test generation with templates and learn to design prompts.

  6. ARC402 · Expert

    DynamoDB: Resilience & lessons from the Oct 2025 service disruption

    In this session, we will walk through the architecture for the Amazon DynamoDB DNS management system that triggered the service disruption on October 20, 2025. We will share the lessons that the DynamoDB team learned from this event and explain how we are using these insights to improve both DynamoDB and AWS. You will walk away with actionable knowledge that you can apply to the systems you build.

  7. PRT111-S · Foundational

    From Risk to Resilience - How Mimecast Works with AWS

    Human risk is a critical layer of any security strategy. Human risk management addresses how employee behavior, from accidental sharing to shadow AI use, creates organizational exposure. Discover how Mimecast, on AWS, helps identify risky behavior, protect critical data and account access, and support compliance. Real-world insights. Behavioural analytics. Adaptive controls. Measurable ROI.

  8. FSI201 · Intermediate

    BELIEVE: The Impossible Migration That Transformed Australian Banking

    Commonwealth Bank migrated the world's largest SAP core banking deployment to AWS in 18 months: the system behind 40% of Australia's payments, 15 million customers, running 24/7. This isn't a lift-and-shift story; it's a reinvention of how the bank runs critical systems, from architecture and resilience engineering to replacing siloed operational teams with full-stack automation, and the cultural shift this required. Join us to hear how CBA's critical financial infrastructure was modernised with AWS, and what this unlocks for their AI-enabled future. If you're building foundations for regulated, mission-critical workloads, this is the session you don't want to miss.

  9. ISV202 · Intermediate

    Architecting for growth and resilience: Cell based design deep dive

    As business demands evolve, architectural patterns must evolve too. SafetyCulture and Buildkite implemented cell-based architectures driven by distinct business objectives: scaling for hypergrowth and enhancing infrastructure resilience. SafetyCulture's expansion plans required proactive architectural evolution to unlock unlimited scaling capacity. Buildkite needed to meet stringent security isolation requirements while achieving scale through repeatable deployment units. This session shares real-world experiences as both companies designed and implemented cell-based architectures for their SaaS platforms. Discover how SafetyCulture identified bottlenecks, redesigned systems for isolation and resilience, and aligned technical capabilities with business growth targets. Learn how Buildkite leveraged cell-based design to achieve both scale and security isolation. Walk away with actionable patterns for building resilient, scalable architectures.

  10. WPS204 · Intermediate

    Safe Transport Victoria's Migration to AWS Cloud

    Join us for an in-depth case study on how Safe Transport Victoria used the cloud to modernise, streamline, and cut costs while moving from on-premises infrastructure to AWS. This session will demonstrate how a small regulator added resilience to its safety outcomes, including for Victorians with accessibility and mobility needs, and achieved modernisation while maintaining service continuity and reliability.

  11. WPS302 · Advanced

    Secure and Resilient Agentic AI for High-Assurance Environments

    Autonomous AI systems that plan, decide, and act across workflows are transforming how organisations deliver mission-critical services. This session shares security-first architecture best practices for designing, deploying, and governing agentic AI in high-assurance environments, drawing from Australia's Information Security Manual (ISM) and AWS security frameworks. Discover practical patterns for architecting proactive, intelligent services while maintaining security, transparency, and operational resilience through defense-in-depth strategies and purpose-built AWS capabilities.

Non-obvious insights

From the Playbook

One sharp, contrarian insight per session — the things teams don't think of unprompted.

Most teams over-protect model weights (cheap to retrain) and under-protect feature stores (expensive to rebuild from raw). Your backup budget is probably allocated wrongly.
  (PRT104-S: Building Resilience for AI Data Foundations and Clou…)

The CX value of AI is rarely full automation; it's making humans *look smarter* to customers. Warm context, faster recall of customer history, faster handoffs. The "AI-augmented agent" outperforms both pure-AI and pure-human in most studies. Optimise for human-AI teaming, not displacement.
  (PRT207-S: Charting the CX Frontier: A Cohesive, AI-Enabled Eng…)

"Shadow AI use" (employees feeding company data into public LLMs) is the new shadow IT. It's already everywhere; your DLP probably can't see most of it. Plan for it as a category, not a one-off incident.
  (PRT111-S: From Risk to Resilience - How Mimecast Works with AWS)

18 months for a SAP core banking migration sounds impossible because it usually *is* impossible, except that CBA also redesigned their team structure simultaneously. The migration was an *outcome* of the org change, not the cause of it. Most failed migrations try to lift-and-shift the org alongside the workloads. That doesn't work.
  (FSI201: BELIEVE: The Impossible Migration That Transformed A…)

Cells aren't just an architecture pattern; they're an *organisational* pattern. The team structure must mirror the cells, or operational complexity explodes. Most cell migrations fail because the org didn't move with the architecture. Conway's Law applies in reverse here too.
  (ISV202: Architecting for growth and resilience: Cell based d…)

Cloud platforms now provide accessibility primitives (live captioning, screen-reader-friendly APIs, language translation) that on-prem rarely had. The migration unlocks accessibility, not just scale. Frame the business case accordingly when selling to government boards.
  (WPS204: Safe Transport Victoria's Migration to AWS Cloud)