On November 12, the Geniuslink service temporarily went down, and all of our corresponding links were essentially “broken” for nearly an hour and a half. For that, I offer my sincerest apologies to all of our clients who rely on our service.
The gist is that a chain of events took down the bulk of our regional databases, databases that are essential for serving clicks from high-traffic regions. The issue was exacerbated by a spike in traffic and by connection issues with a third-party service we rely on. The full postmortem write-up from our CTO is below, along with the steps we have taken to ensure an issue like this doesn’t happen again.
While Thursday’s events were humbling and exposed some real blind spots, we remain really proud of how our service is built and runs. It’s been a while since we wrote about our stack ([Technical post] Geniuslink’s Full Tech Stack), but in the last four years we’ve continued to maintain “nearly” 100% uptime (the trailing 12 months show us at 99.96% uptime, including Thursday’s outage, according to Pingdom’s monitoring, and 99.99% via our StatusPage.io numbers; more on each below).
We understand that speed and uptime are paramount when building a SaaS company, especially one dealing with e-commerce (and affiliate marketing). That appreciation likely stems from our work in the hosting space twenty years ago. While my co-founder, the other Jesse (Pasichnyk), and I were in college, we started a web hosting company that Pasichnyk went on to run for nearly a decade. Our Chief Product Officer, Steven Sundheim, was also a co-founder of a separate, and much larger, web hosting company (ModWest) for the bulk of the 2000s.
While we’ve traditionally had a good track record, and are always actively focused on improving our infrastructure, it was a hard and important lesson to see things crash as hard as they did on Thursday (a lesson we don’t want to repeat).
So what happened, you ask? In the words of our co-founder and Chief Technology Officer:
Yesterday, November 12th, at 1710Z, we received numerous service monitoring alerts indicating that link processing, our API, and our client dashboard had become slow and/or unavailable. We responded with all hands on deck and worked quickly to assess the situation. An initial fix was applied at 1730Z, which allowed normal click volumes to process again in our US/West data center; however, it did not completely resolve the issue. At 1830Z we began rolling out another fix to the same data center, verified it, then applied it worldwide. The system was fully restored (with the exception of Sovrn Commerce / VigLink affiliation support) by 1900Z, less than two hours after the issues manifested. During this outage, while we were able to serve some requests, the majority, unfortunately, resulted in HTTP 503 error responses or timeouts.
During initial troubleshooting, we observed database cluster issues, most likely caused by a problematic database replication event combined with lower memory limits in several of our data centers; this is not something we had seen happen before. The result was that several regions stopped replicating data and paused/blocked incoming connections. Each of our regions has a local replica for increased performance and availability (ironically), so those regions were immediately impacted. Further, since we allow regional hosts to cross-connect to data replicas in other data centers for better availability (again, ironically), otherwise healthy regions were impacted as well.
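The failure mode described here, where a paused replica drags otherwise healthy regions down through their cross-connects, is a classic argument for bounding every connection attempt. Below is a minimal, illustrative Python sketch (host names and ports are hypothetical, not our actual topology) of regional failover with a short connect timeout:

```python
import socket

# Hypothetical replica list for one region: the local replica first,
# then cross-region fallbacks. Host names are purely illustrative.
US_WEST_REPLICAS = [
    "db.us-west.internal",
    "db.us-east.internal",
    "db.eu-west.internal",
]

def connect_with_failover(replicas, port=5432, timeout=0.25,
                          connect=socket.create_connection):
    """Try each replica in order, bounding every attempt with a timeout.

    The short timeout is the important detail: an unresponsive replica
    would otherwise block callers for the OS default timeout, which is
    how healthy regions can be dragged into a regional outage.
    """
    last_err = None
    for host in replicas:
        try:
            return connect((host, port), timeout=timeout)
        except OSError as err:
            last_err = err  # replica down or blocking; try the next one
    raise ConnectionError("all replicas unreachable") from last_err
```

The `connect` parameter exists only so the failover logic can be exercised without a live database; in practice the same shape applies to whatever client library sits in front of the replicas.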
During the investigation of this outage, we also discovered an issue with TLS certificate verification when calling the Sovrn Commerce / VigLink API (used for client affiliation if tokens are provided). It doesn’t appear to have been a root cause of the outage, but it obscured and exacerbated the database issues. This is the reason Sovrn Commerce / VigLink support was not restored at the same time as the rest of the system.
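For readers curious what defensive handling of a third-party call like this can look like, here is a small Python sketch (the URL and timeout are placeholders, and this is not our actual integration code): TLS verification stays enabled so certificate problems fail fast and visibly, but the call is bounded and any failure degrades to "no affiliation" rather than backing up into the rest of the system.

```python
import ssl
import urllib.request

def fetch_affiliation(url, timeout=0.5):
    """Call a third-party affiliation API defensively (illustrative).

    TLS verification remains ON via the default context; a certificate
    that cannot be verified should fail loudly rather than silently.
    The short timeout and broad OSError handling ensure the failure is
    contained: we return None and carry on without affiliation data.
    """
    ctx = ssl.create_default_context()  # full certificate verification
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.read()
    except (ssl.SSLError, OSError):
        return None  # degrade gracefully; the caller continues without us
```

The key design choice is that the caller never distinguishes "TLS broken" from "API slow"; both simply mean the enrichment is skipped for that request.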
We were already undertaking a project to rebuild our regional deployments, including many infrastructure design improvements and upgraded software versions, with a planned completion before the end of the year. However, we will further expedite this effort and include additional changes to better isolate the impact of regional issues such as the ones we saw yesterday.
Actively Learning From Our Mistakes
As Jesse P. mentions in his postmortem above, we’ve had a major infrastructure project in the works for a good chunk of 2020. I’m excited to announce that the team has put in some ridiculous hours over the last week and a half since the outage, and we are now live with a significantly improved process for monitoring, managing, and (re)deploying our regional service nodes. Last Thursday, the first of our new regional nodes went live; two more will go live today, and we plan to provision more throughout the week.
We’ve also made the hard decision that no third-party service can be relied upon during the processing of a click, regardless of its track record. Adding this layer of insulation ensures we have no dependencies that can slow down or break the service (good!), with minimal downside (the first click on a new link may not get the “Geniuslink magic,” but all subsequent clicks will).
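The shape of that insulation can be sketched in a few lines of Python. This is an illustrative toy, not our production code: all names (`handle_click`, `affiliation_cache`, `pending`) are hypothetical. The click path touches only local state; third-party lookups happen on a background worker, which is exactly why a first click may miss the "magic" while every later click gets it.

```python
import queue

# Local state only -- the click path never makes a remote call.
affiliation_cache = {}
pending = queue.Queue()

def handle_click(link_id, destination):
    """Serve the redirect from local state; never block on a third party."""
    affiliated = affiliation_cache.get(link_id)
    if affiliated is None:
        pending.put((link_id, destination))  # enrich later, off the hot path
        return destination                   # first click: plain redirect
    return affiliated                        # later clicks get the "magic"

def enrichment_worker(lookup):
    """Background thread body: calls the third party and fills the cache."""
    while True:
        link_id, destination = pending.get()
        try:
            affiliation_cache[link_id] = lookup(destination)
        except Exception:
            pass  # a third-party failure can never break click serving
        pending.task_done()
```

In this arrangement the third party can be slow, down, or misbehaving and the worst case is an unenriched redirect, never a broken one.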
As much as I want to promise you 100% uptime in perpetuity, the nature of the beast is that sometime in the future (hopefully the very far off future) something may happen again and something could still break. We will obviously do everything that we can to prevent this, but if something does go wrong, here are a few resources to help you keep an eye on us.
Status (Powered by StatusPage.io) – The best place to keep an eye on us, or dig in when something is funky, is the Geniuslink Status Page (https://status.geni.us). From there you can see a breakdown of the different aspects of Geniuslink (e.g., Metrics, Dashboard, API, Website, Links, and Real-time Link Optimization), get the current details of an incident, and, most importantly, subscribe to updates so you are alerted when we have an issue and as soon as it is resolved.
Uptime (Powered by Pingdom) – One of the key features of Geniuslink is that we have servers around the world to quickly process and handle links, so an interruption or outage in one region doesn’t mean the whole service has crashed. So, while Pingdom’s monitoring of Geniuslink can be super helpful, it only monitors our US and European uptime. It is also very binary: the service is reported as either up or down, with no way to reflect reduced capacity or other nuances. While Pingdom is a great monitoring tool, it’s often the most pessimistic of them all (which isn’t necessarily a bad thing!). You can find our Pingdom uptime monitoring of the Geniuslink service here: http://stats.pingdom.com/a8w38imsfvbm/322734/history.
Twitter – While the bulk of our tweeting is sharing industry news, blog posts, and having fun interacting with our community, we are also diligent about posting updates when things go awry with the service (e.g., https://twitter.com/geniuslink/status/1326942423059103750).
Geniuslink Support – We have an awesome team (Joey, Isaac, Andy, and Matt) who manage our support/client success inbox and chat. This is the team to go to when you have questions or concerns! You can find them via the Intercom chat bubble from inside the dashboard, the “Contact Us” form at the footer of the website, or emailing email@example.com.
Jesse Lakes – Many of our thousands of clients have been in personal contact with me (CEO / Co-Founder) over the past eight years and know that your success, and good technical support, are incredibly important to me. As a result, I read every email in my inbox every day (and reply to about 99% of them within 12 hours or less), but as we slowly continue to grow in size, I, unfortunately, am not the best point of contact for time-sensitive issues. Case in point: I was in an all-day class on Thursday with my cell phone, email, and messaging turned off and “Do Not Disturb” turned on. I was actually one of the last members of the team to learn about the outage. While I still want to hear from you, our awesome clients, I encourage you to reach out directly to our support team for anything urgent.
Again, I want to personally offer my sincerest apologies for Thursday’s issue, and on behalf of the team, I want to promise you that we are doing our absolute best to ensure this never happens again.
I also want to assure you that your trust in us is something we value over everything else. We understand that trust has to be earned and Thursday’s event did the exact opposite of that. We are prepared to work even harder to rebuild that trust. Please bear with us.
On that final note about trust, I have a request: please share any feedback, at any time, when you feel we are not being fully transparent or you think we are abusing your trust. We can always get better, and we appreciate your help in doing so.
My email is simply my initials at geni.us and is the best way to contact me directly. Or ping the support team and they can always pass along the note.
Again, I’m deeply sorry for the outage and really appreciate you being willing to continue to work with us so we can work together to maximize your commissions and marketing efforts through the holidays and into 2021.
Co-Founder / CEO