500 errors on USPROD for 21 minutes

Incident Report for Hivebrite Status page

Postmortem

Dear customer,

You were recently affected by an incident on our services. We understand the impact this may have had on your experience and want to explain what happened.

Incident Overview

INCIDENT PROPERTIES
Started Jan 22, 2025 04:45 pm UTC
Impact Ended Jan 22, 2025 05:06 pm UTC
Impact Duration 21 minutes
Environment Usprod

What Happened

Between 2025-01-22T16:40:00 and 2025-01-22T17:06:00, customers on the US production platform experienced a large number of 500 errors. This was caused by two simultaneous surges of activity:

  1. A large campaign triggered a flood of security bot scans on the platform. This is a known issue that occurs regularly and causes repeated micro-outages (under 1 minute).
  2. A big event in a large community led to an additional surge of traffic. Under normal conditions this would not have been an issue, but combined with the first event, it led to this unusually long service disruption.

Preventive Measures

We identified the security bot scanning issue two months ago, and our teams are working on a long-term fix targeted for Q1 2025. We are also working to improve our autoscaling, our app boot time and our caching of public pages. This will allow our platform to achieve better performance, scalability and, above all, reliability.

We sincerely appreciate your understanding and are committed to continually improving our services to deliver a more resilient and reliable experience.
If you have any remaining questions, do not hesitate to reach out to your CSM or our support team.

Sincerely,

Hivebrite

Posted Jan 24, 2025 - 23:55 CET

Resolved

Between 17:45 and 18:06 on Jan 22 (French time), a large number of 500 errors affected all US production customers, causing service disruptions.
Posted Jan 24, 2025 - 23:52 CET