• The underlying domain name for this website is lfgss.microco.sm, and I've just discovered that the microco.sm domain name has been suspended (why?!) by nic.sm, which manages it.

    The domain name has been suspended!!!!

    https://www.nic.sm

    Domain Name: microco.sm
    Registration date: 21/12/2011
    Status: Suspended
    

    This was discovered by people reporting broken avatars and image attachments.

    I've opened a support case with Gandi (the domain registrar), but will start moving things to a spare domain.

    I'll be moving from microco.sm to microcosm.app, which I already own and which is already active on Cloudflare. That means I can adjust the DNS records there and expect each change to take effect in under 10 seconds.

    There will 100% be breakage during this process, as microco.sm is a SaaS platform and changing its domain breaks the many websites using it (of which LFGSS is the largest)... One doesn't just change the domain name for a SaaS service, but I shall.

    Started: 2023-05-15T21:26
    Finished: 2023-05-16T11:25
    Duration of incident: 13h 59m
    Impact: 1h 3m of total outage from 21:26 to 22:24 on 2023-05-15
    Costs: £550 (TLS certs, email campaigns updating people on some sites, domain renewal / extension, DNS provider, new domain name)
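
    For anyone checking the arithmetic, the incident duration follows directly from the two timestamps above. A minimal sketch; the helper name format_duration is just for illustration:

```python
from datetime import datetime, timedelta

# Timestamps as given in the summary above.
started = datetime.fromisoformat("2023-05-15T21:26")
finished = datetime.fromisoformat("2023-05-16T11:25")


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as 'Xh Ym' (illustrative helper)."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


incident_duration = format_duration(finished - started)
print(incident_duration)  # 13h 59m
```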

    1. Auth0 has been updated to send from auth0@microcosm.app
    2. Sendgrid has been configured to send from microcosm.app
    3. Confirmed that Sendgrid can send with the new config
    4. Re-wrote every reference in microcosm app to point microco.sm to microcosm.app
    5. Purchased new TLS wildcard cert for *.microcosm.app (£250!!!)
    6. Updated all references to microco.sm within the Microcosm Go API
    7. Updated all references to microco.sm within the Python web site
    8. Updated all references to microco.sm within the https://microcosm.app site
    9. Updated all /etc/hosts references
    10. Updated the load balancer
    11. Installed the TLS cert
    12. Restarted everything
    13. Contacted the sites that have had new content this year and informed them of how to update their URLs or DNS and offered email lists should they need them for communication.
    14. Purge HTML cache to rebuild all of the content (fixes outbound links).
    15. Fix all forum logos on all sites, as they typically are assets on microco.sm
    16. Replace all microco.sm with microcosm.app in all comments
    17. Purge nginx file cache for the API
    18. Fix Sendgrid DKIM and SPF, fix the HSTS for the pervasive link tracking that I can't remove
    19. Reconfigure Cloudflare page rules for the API paths to cache all the files, but bypass cache for the API
    20. Open Support ticket with Cloudflare as microco.sm had cross-user CNAME grandfathered in, and microcosm.app does not, which means that other forums with their domains on Cloudflare cannot CNAME their custom domain.
    21. Update 37 applications in Auth0 to ensure that the CORS all correctly point to microcosm.app
    22. Extend the microcosm.app domain to 10 years (cost of £110)
    23. Add microcosm.app to DNS Made Easy, and pay 1 year (cost of £180)
    24. Change DNS Nameservers for microcosm.app from Cloudflare to DNS Made Easy as I need to temporarily get off Cloudflare whilst the CNAME does not work
    25. Hard disable SendGrid link tracking to prevent HTTPS errors
    26. Removed all Cloudflare proxying from microcosm.app, tested nothing broke, awaiting nameserver updates to flush through.
    27. Disable automatic renewal of microco.sm in Gandi (why pay for a suspended domain!?)
    28. Network traffic is more than double now that I've moved microcosm.app off Cloudflare. Verified that we have a fast enough set of SSDs and several 4Gbps links, so we should be able to keep up. But I shall prep a load-balanced second cache machine to bear the load if needed.
    29. Verify SPF and DMARC
    30. The admin panel login on https://microcosm.app was broken; re-wrote all references in the admin app.
    31. Changed nameservers to Linode as DNS Made Easy simply does not work in Firefox 🤷 will get the pro-rated refund and see how Linode fares. Linode Support confirmed that they're not white-labelling Cloudflare, so the cross-user CNAME issues should not arise.
    32. ERR no spare domains... purchased microcosm.ch as one should always have a spare domain name lying around in case you need it.
    33. Permanently HTTP redirect (HTTP 301) microco.sm to microcosm.app, the old domain will exist until mid-December, which is enough time for everything to be pointed at the new URLs.
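
    Most of the steps above boil down to two operations: rewriting references to microco.sm inside stored text (code, comments, cached HTML), and redirecting requests for the old hostnames to the new ones. The following is only a sketch of that idea in Python, not the code that was actually run. The function names rewrite_domain and redirect_for are made up for illustration, and forcing https on the redirect target is an assumption.

```python
import re
from urllib.parse import urlsplit, urlunsplit

OLD_DOMAIN = "microco.sm"
NEW_DOMAIN = "microcosm.app"

# Match the old domain only when it is not glued to other hostname characters:
# "lfgss.microco.sm" matches, "notmicroco.sm" and "microco.sms" do not.
_OLD = re.compile(
    r"(?<![A-Za-z0-9-])" + re.escape(OLD_DOMAIN) + r"(?![A-Za-z0-9-])"
)


def rewrite_domain(text: str) -> str:
    """Replace every reference to the old domain in a blob of text."""
    return _OLD.sub(NEW_DOMAIN, text)


def redirect_for(url: str):
    """Return (301, new_url) for a URL on the old domain, otherwise None."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host != OLD_DOMAIN and not host.endswith("." + OLD_DOMAIN):
        return None
    # Keep any subdomain prefix (e.g. "lfgss.") and swap the base domain.
    new_host = host[: -len(OLD_DOMAIN)] + NEW_DOMAIN
    if parts.port:
        new_host += f":{parts.port}"
    # Assumption: everything on the new domain is served over https.
    new_url = urlunsplit(
        ("https", new_host, parts.path, parts.query, parts.fragment)
    )
    return 301, new_url


print(rewrite_domain("Logo at https://lfgss.microco.sm/logo.png"))
print(redirect_for("http://lfgss.microco.sm/conversations/1?page=2"))
```

    Note that the pattern would also rewrite a longer hostname such as microco.sm.example.com, so anything run against real content deserves a dry run and a diff first.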

    Note: microcosm.app now has DNS hosted by Akamai Linode rather than Cloudflare. As such we'll have higher outbound traffic fees in the future. The reason for this is I worked at Cloudflare and microco.sm was on a free (staff) Enterprise plan, but microcosm.app has no such favour given to it, and so it does not allow the same configuration to be achieved without a high price tag.

    No-one has "the domain name of your SaaS provider is suspended" in their disaster recovery playbooks, which goes to show the art of the game is to be able to handle the unexpected. Top tips for those doing disaster recovery plans... just model the following scenarios: compute failure, DNS failure (including domain name), network failure, storage failure. If you have those modelled, you can compose the response to cover any scenario.
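
    To make the "compose the response" idea concrete, here is a toy sketch. The four failure classes come from the paragraph above; the playbook steps are placeholders for illustration only (the DNS ones loosely mirror what was done in this incident), and the names Failure, PLAYBOOKS and compose_response are invented for this example.

```python
from enum import Enum


class Failure(Enum):
    COMPUTE = "compute"
    DNS = "dns"  # includes losing the domain name itself
    NETWORK = "network"
    STORAGE = "storage"


# Placeholder steps only; a real plan would be far more detailed.
PLAYBOOKS = {
    Failure.COMPUTE: ["fail over to standby machines"],
    Failure.DNS: [
        "switch to the spare domain",
        "reissue TLS certificates",
        "update references to the old domain",
    ],
    Failure.NETWORK: ["route traffic via an alternative link"],
    Failure.STORAGE: ["restore from the latest backup"],
}


def compose_response(failures):
    """Concatenate the playbooks for each failure, dropping repeated steps."""
    steps = []
    for failure in failures:
        for step in PLAYBOOKS[failure]:
            if step not in steps:
                steps.append(step)
    return steps


# A suspended domain is modelled as a DNS failure.
for step in compose_response([Failure.DNS]):
    print(step)
```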

    Attachment shows impact to web traffic... it's about 1h3m of total outage, but mostly things have recovered. There is still a bumpy ride expected as DNS records flush through.


    1 Attachment

    • Screenshot 2023-05-16 111859.png
Avatar for Velocio @Velocio started