The underlying domain name for this website is lfgss.microco.sm and I've just discovered that the microco.sm domain name has been suspended (why?!) by nic.sm, which manages it.
This was discovered by people reporting broken avatars and image attachments.
I've opened a support case with Gandi (the domain registrar), but will start moving things to a spare domain.
I'll be moving from microco.sm to microcosm.app, which I already own and which is already active on Cloudflare, so I can adjust the DNS records there and expect each change to take effect in under 10s.
There will 100% be breakage during this process, as microco.sm is a SaaS platform and changing its domain breaks the many websites using it (of which LFGSS is the largest)... One doesn't just change the domain name for a SaaS service, but I shall.
Started: 2023-05-15T21:26
Finished: 2023-05-16T11:25
Duration of incident: 13h 59m
Impact: 1h 3m of total outage from 21:26 to 22:24 on 2023-05-15
Costs: £550 (TLS certs, email campaigns updating people on some sites, domain renewal / extension, DNS provider, new domain name)
Auth0 has been updated to send from auth0@microcosm.app
Sendgrid has been configured to send from microcosm.app
Confirmed that Sendgrid can send with the new config
Re-wrote every reference in the microcosm app so that microco.sm now points to microcosm.app
Purchased new TLS wildcard cert for *.microcosm.app (£250!!!)
Updated all references to microco.sm within the Microcosm Go API
Updated all references to microco.sm within the Python web site
Contacted the sites that have had new content this year and informed them of how to update their URLs or DNS and offered email lists should they need them for communication.
Purge HTML cache to rebuild all of the content (fixes outbound links).
Fix all forum logos on all sites, as they typically are assets on microco.sm
Replace all microco.sm with microcosm.app in all comments
Purge nginx file cache for the API
Fix Sendgrid DKIM and SPF, fix the HSTS for the pervasive link tracking that I can't remove
Reconfigure Cloudflare page rules for the API paths to cache all the files, but bypass cache for the API
Open Support ticket with Cloudflare as microco.sm had cross-user CNAME grandfathered in, and microcosm.app does not, which means that other forums with their domains on Cloudflare cannot CNAME their custom domain.
Update 37 applications in Auth0 to ensure that the CORS settings all correctly point to microcosm.app
Extend the microcosm.app domain to 10 years (cost of £110)
Add microcosm.app to DNS Made Easy, and pay 1 year (cost of £180)
Change DNS nameservers for microcosm.app from Cloudflare to DNS Made Easy, as I need to temporarily get off Cloudflare whilst the CNAME does not work
Hard disable SendGrid link tracking to prevent HTTPS errors
Removed all Cloudflare proxying from microcosm.app, tested nothing broke, awaiting nameserver updates to flush through.
Disable automatic renewal of microco.sm in Gandi (why pay for a suspended domain!?)
Network traffic has more than doubled now that I've moved microcosm.app off Cloudflare. Verified that we have a fast enough set of SSDs and several 4Gbps links, so we should be able to keep up, but I shall prep a load-balanced second cache machine to bear the load if needed.
Verify SPF and DMARC
The admin panel on https://microcosm.app login was broken, re-wrote all references in the admin app.
Changed nameservers to Linode as DNS Made Easy simply does not work in Firefox 🤷 will get the pro-rated refund and see how Linode fares. Linode Support confirmed that they're not white-labelling Cloudflare, so the cross-user CNAME issues should not arise.
ERR no spare domains... purchased microcosm.ch as one should always have a spare domain name lying around in case you need it.
Permanently redirect (HTTP 301) microco.sm to microcosm.app. The old domain will exist until mid-December, which is enough time for everything to be pointed at the new URLs.
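For anyone facing a similar move, here is a minimal Python sketch of the two mechanical parts of the list above: rewriting old-domain references in stored text (comments, cached HTML) and working out where a 301 should send a request for the old domain. It is not the code that was actually run. The function names, the regex, and the example paths are all assumptions made for illustration; only the two domain names come from this post.

```python
import re
from urllib.parse import urlsplit, urlunsplit

OLD_DOMAIN = "microco.sm"
NEW_DOMAIN = "microcosm.app"

# Hypothetical pattern: match the old domain only when it is not glued to
# other hostname characters, so "lfgss.microco.sm" matches (the subdomain
# is left in place) but "microco.small" does not.
_OLD_DOMAIN_RE = re.compile(r"(?<![\w-])microco\.sm(?![\w-])")


def rewrite_references(text):
    """Replace every reference to the old domain in a block of text."""
    return _OLD_DOMAIN_RE.sub(NEW_DOMAIN, text)


def redirect_target(url):
    """Return the URL a 301 should point at, or None if the host is not
    on the old domain. Subdomain, path, query and port are preserved."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host != OLD_DOMAIN and not host.endswith("." + OLD_DOMAIN):
        return None
    # Swap only the domain suffix, keeping any subdomain labels.
    new_host = host[: -len(OLD_DOMAIN)] + NEW_DOMAIN
    if parts.port is not None:
        new_host += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, new_host, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Illustrative inputs only; the paths are made up.
    print(rewrite_references('<img src="https://lfgss.microco.sm/logo.png">'))
    print(redirect_target("https://lfgss.microco.sm/some/path?page=2"))
```

Keeping the subdomain and path intact is the point of doing it this way: every site on the platform has its own subdomain, so a blanket redirect to the bare new domain would break deep links.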
Note: microcosm.app now has DNS hosted by Akamai Linode rather than Cloudflare. As such we'll have higher outbound traffic fees in the future. The reason is that I worked at Cloudflare and microco.sm was on a free (staff) Enterprise plan, but microcosm.app has no such favour given to it, so the same configuration cannot be achieved without a high price tag.
No-one has "the domain name of your SaaS provider is suspended" in their disaster recovery playbooks, which goes to show the art of the game is to be able to handle the unexpected. Top tips for those doing disaster recovery plans... just model the following scenarios: compute failure, DNS failure (including domain name), network failure, storage failure. If you have those modelled, you can compose the response to cover any scenario.
Attachment shows impact to web traffic... it's about 1h3m of total outage, but mostly things have recovered. There is still a bumpy ride expected as DNS records flush through.