Behind Canva’s November 2024 Outage: What Went Wrong and What’s Next
In November 2024, Canva suffered a significant outage that made its platform unavailable for nearly an hour, leaving users worldwide unable to reach the design tool. The outage interrupted workflows for creative professionals, marketers, and small business owners who depend on Canva for social media graphics, presentations, and promotional materials, which underlines how central the platform is to their day-to-day operations. From 9:08 AM UTC to 10:00 AM UTC, canva.com was completely unavailable, a rare occurrence for the platform. This post covers the root causes of the outage, how it unfolded, the immediate steps taken to restore service, and the preventative measures Canva is implementing to avoid similar incidents.
The Anatomy of the Outage
The outage stemmed from a confluence of factors, including:
- A software deployment issue: The rollout introduced enhancements to Canva’s editor, including improved object panel performance and additional layer-management features. However, an unforeseen bug in the deployment pipeline caused compatibility issues with client-side caching, which contributed to the incident.
- Network instability: Cloudflare, Canva’s CDN provider, encountered latency and packet loss issues in its Singapore-to-Ashburn network route.
- A locking issue in the API Gateway: A telemetry bug caused thread-locking inside Canva’s API Gateway, which degraded its performance once traffic surged.
These interrelated issues ultimately overwhelmed Canva’s API Gateway, a critical component that handles authentication, authorization, and rate limiting for API requests, causing a cascading failure that rendered the site inaccessible.
How the Incident Unfolded
Initial Deployment (8:47 AM UTC)
A new version of Canva’s editor went live, prompting client devices to fetch updated static assets from Cloudflare’s caching system. Among these assets was a JavaScript file essential for displaying the editor’s object panel.
Network Latency Emerges
Concurrently, latency on Cloudflare’s Singapore-to-Ashburn network route increased dramatically, with time to first byte (TTFB) rising by more than 1,700%. One critical JavaScript file took up to 20 minutes to fetch, leaving users in Asia unable to load the object panel.
Cache Stream Overload
Cloudflare’s caching layer coalesced more than 270,000 requests for the same JavaScript file behind the slow fetch. When the asset finally loaded at 9:07 AM UTC, all of those clients proceeded at once, producing a “thundering herd” of roughly 1.5 million API requests per second against Canva’s API Gateway, about three times its typical peak load.
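To see how a cache can hold many requests behind one slow fetch and then release them together, consider the toy model below. It is a sketch for illustration only, not Cloudflare’s implementation: the class `CoalescingCache`, its methods, and the file name `object-panel.js` are all hypothetical.

```python
from collections import defaultdict


class CoalescingCache:
    """Toy model of request coalescing: one origin fetch per key, many waiters.

    Hypothetical illustration; not how any real CDN is implemented.
    """

    def __init__(self):
        self.waiters = defaultdict(list)  # key -> client ids queued behind an in-flight fetch
        self.origin_fetches = 0

    def request(self, key, client_id):
        # Only the first request for a key triggers an origin fetch;
        # later requests for the same key queue behind it.
        if key not in self.waiters:
            self.origin_fetches += 1
        self.waiters[key].append(client_id)

    def complete(self, key):
        # When the slow fetch finishes, every queued client is released at once.
        return self.waiters.pop(key, [])


cache = CoalescingCache()
for client_id in range(270_000):
    cache.request("object-panel.js", client_id)  # hypothetical asset name

released = cache.complete("object-panel.js")
print(cache.origin_fetches, len(released))  # 1 270000
```

Coalescing protects the origin from duplicate fetches, but it also synchronizes the waiting clients: whatever each of them does next happens at the same moment.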
API Gateway Collapse
Under the surge in traffic, the API Gateway’s performance degraded due to a telemetry bug causing thread-locking issues. This led to memory overuse, triggering the Linux Out-Of-Memory Killer and terminating all tasks running on the Gateway. By 9:08 AM UTC, canva.com was entirely offline.
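One plausible way to reason about how slower request handling turns into memory pressure is Little’s law: the average number of requests in flight equals the arrival rate multiplied by the time each request spends in the system. The sketch below applies that relation. All figures in it are hypothetical and chosen only to show the shape of the effect; none of them come from Canva’s report.

```python
def in_flight_requests(arrival_rate_per_s, latency_ms):
    """Little's law: requests in flight = arrival rate x time spent per request."""
    return arrival_rate_per_s * latency_ms / 1000


def memory_in_use_mib(arrival_rate_per_s, latency_ms, mib_per_request):
    """Rough memory estimate, assuming each in-flight request holds a fixed amount."""
    return in_flight_requests(arrival_rate_per_s, latency_ms) * mib_per_request


# Hypothetical figures for illustration only.
healthy = memory_in_use_mib(arrival_rate_per_s=1000, latency_ms=50, mib_per_request=0.5)
contended = memory_in_use_mib(arrival_rate_per_s=1000, latency_ms=2000, mib_per_request=0.5)
print(healthy, contended)  # 25.0 1000.0
```

At the same arrival rate, a 40-fold increase in per-request latency means 40 times as many requests held in memory, which is why contention on a shared lock can end in an out-of-memory kill even before traffic grows.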
Mitigating the Crisis
Canva’s engineering team responded with a series of measures:
- Scaling API Gateway tasks: Initial attempts to autoscale tasks failed as new tasks became overwhelmed by ongoing traffic spikes.
- Blocking traffic at the CDN level: At 9:29 AM UTC, Canva temporarily blocked all traffic at the CDN layer to stabilize the API Gateway.
- Gradual traffic restoration: Starting with Australian users under strict rate limits, Canva incrementally restored global access, ensuring system stability at each step.
By 10:00 AM UTC, the platform was back online.
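A gradual restoration of this kind can be sketched as a rate limiter whose limit is raised step by step while the backend stays healthy. The code below is a minimal sketch under stated assumptions: `TokenBucket`, `next_rate`, the doubling step, and the halving backoff are hypothetical choices for illustration, not Canva’s actual controls.

```python
class TokenBucket:
    """Simple token-bucket rate limiter driven by an explicit clock (seconds)."""

    def __init__(self, rate_per_s, capacity):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def next_rate(current, target, healthy, step=2.0, backoff=0.5):
    """Raise the admitted rate while the backend stays healthy; cut it if not."""
    if healthy:
        return min(target, current * step)
    return current * backoff


# Hypothetical ramp: start low, double on each healthy check, halve on an unhealthy one.
rate = 100.0
for healthy in [True, True, False, True]:
    rate = next_rate(rate, target=1000.0, healthy=healthy)
print(rate)  # 400.0
```

Passing the clock in explicitly keeps the limiter deterministic and easy to test; a production limiter would read a monotonic clock and apply one bucket per region or per client group.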
Lessons Learned and Action Plan
To enhance reliability and prevent future outages, Canva has outlined immediate and long-term measures to address critical areas:
In terms of incident response, Canva is developing a comprehensive runbook for traffic management during emergencies and working to improve user communication by providing clearer error pages during downtime. To strengthen API Gateway resilience, the team plans to increase its baseline capacity and memory allocation, implement load-shedding rules for better handling of traffic surges, and conduct regular load testing to simulate extreme scenarios.
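Load shedding, in its simplest form, means rejecting excess requests immediately instead of queueing them. The sketch below shows one common variant, a cap on concurrent in-flight requests. The names `LoadShedder` and `handle`, and the choice of a 503 response, are assumptions for illustration; the source does not describe the rules Canva plans to use.

```python
import threading


class LoadShedder:
    """Caps concurrent in-flight requests; anything above the cap is rejected."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.shed_count = 0
        self._lock = threading.Lock()  # held only for the counter update

    def try_acquire(self):
        with self._lock:
            if self.in_flight >= self.max_in_flight:
                self.shed_count += 1
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self._lock:
            self.in_flight -= 1


def handle(shedder, work):
    """Run work() if there is capacity; otherwise reject immediately with 503."""
    if not shedder.try_acquire():
        return 503, None
    try:
        return 200, work()
    finally:
        shedder.release()


shedder = LoadShedder(max_in_flight=100)  # hypothetical limit
print(handle(shedder, lambda: "ok"))  # (200, 'ok')
```

Rejecting early keeps memory bounded: a request that is refused costs almost nothing, while a request that is accepted and then stalls holds its buffers until it finishes.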
To address specific issues like the telemetry bug, Canva has deployed a patch to fix the thread-locking problem and is enhancing its testing processes to avoid similar complications in the future. For deployment guardrails, additional safeguards are being introduced, including monitoring page load completion events, extending canary release durations to better detect issues during staged rollouts, and adding timeouts for asset requests to avoid prolonged delays.
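The asset-timeout idea can be sketched as follows. This is a generic illustration using Python’s `asyncio.wait_for`, not Canva’s client code (which would run in the browser); `fetch_with_timeout` and the two stand-in fetch coroutines are hypothetical names.

```python
import asyncio


async def fetch_with_timeout(fetch, timeout_s):
    """Await fetch(), but give up after timeout_s seconds and return None."""
    try:
        return await asyncio.wait_for(fetch(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending fetch is cancelled by wait_for; the caller decides what to do next.
        return None


async def fast_fetch():
    # Stand-in for an asset that arrives promptly.
    return "asset-bytes"


async def slow_fetch():
    # Stand-in for an asset stuck behind a slow network path.
    await asyncio.sleep(1)
    return "asset-bytes"


print(asyncio.run(fetch_with_timeout(fast_fetch, 0.5)))   # asset-bytes
print(asyncio.run(fetch_with_timeout(slow_fetch, 0.01)))  # None
```

With a bound on how long any one asset request may wait, clients fail or fall back at different moments instead of all completing together when a long-delayed response finally arrives.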
Lastly, Canva is collaborating closely with Cloudflare to refine traffic routing and caching mechanisms, ensuring smoother handling of high-demand situations. Together, these measures aim to bolster Canva’s infrastructure and prevent similar outages from occurring again.
A Commitment to Transparency
The report on this outage is Canva’s first publicly shared incident report, reflecting its dedication to transparency and continuous improvement. As Canva’s user base grows, so does its commitment to building a resilient infrastructure that supports its mission to empower the world to design.
Canva’s efforts to analyze and address the outage underscore the company’s proactive approach to learning from challenges. By implementing these changes, Canva aims to ensure a more robust and reliable platform for its millions of users worldwide.
Leveraging Solutions to Prevent Outages
Outages like Canva’s can often be mitigated, and sometimes prevented, with solutions designed to improve infrastructure resilience. Tools like RELIANOID’s high-performance proxies and API Gateway optimizations offer key advantages, including real-time load balancing, advanced traffic routing, and automated failover mechanisms. Combined with telemetry systems and hot-restart features, such tools are designed to keep services running under extreme load. Organizations adopting these solutions can address performance bottlenecks proactively, improve incident response, and maintain more consistent uptime for critical applications.