-
Yeah, it's here:
https://github.com/microcosm-cc/microcosm/commit/1c22d43dd994abdf6a48c3715c5d87275e520978Gist:
- You watch a conversation.
- Every new reply to something you're watching triggers a write to the
updates
table to say "someone has replied"... you may get a notification (based on the last thing you've read in the conversation, we don't spam you for each reply made only the first - this happens at the time of writing to the table) - The Following page uses the last entry in the table to determine the things that you're watching that have been updated.
So... if 100 people are watching a conversation, and 100 replies are made... 10,000 rows just got created, but we only need the most recent 100, the most recent one for each person watching the conversation.
The job of the query then...
- Find all updates for a user for a user, and delete everything that isn't the latest.
The old query did it like this "find the latest updates as we want to keep those, use that to find all of the old updates as we want to lose those, delete the ones we want to lose":
WITH keep AS ( SELECT MAX(u.update_id) update_id ,u.for_profile_id ,u.parent_item_type_id ,u.parent_item_id FROM updates u WHERE u.update_type_id IN (1,4) GROUP BY u.for_profile_id ,u.parent_item_type_id ,u.parent_item_id ), lose AS ( SELECT update_id FROM updates WHERE update_type_id IN (1,4) AND for_profile_id != 0 AND parent_item_type_id != 0 AND parent_item_id != 0 AND update_id NOT IN (SELECT update_id FROM keep) ) DELETE FROM updates WHERE update_id IN (SELECT * FROM lose);
But this is using a couple of sequential scans, and most importantly some very high cardinality JOINs, i.e. within the
NOT IN
. At a certain point we appear to have crossed some magic number and this query went from being really fast... to basically not completing. The JOINs introduced the problem.The new query does it differently "find all updates except for the latest":
DELETE FROM updates WHERE update_id IN ( -- All updates for an item SELECT update_id FROM updates WHERE update_type_id = 1 AND parent_item_type_id > 0 EXCEPT -- Latest update for an item SELECT update_id FROM ( SELECT DISTINCT ON (for_profile_id, parent_item_type_id, parent_item_id) update_id FROM updates WHERE update_type_id = 1 AND parent_item_type_id > 0 ORDER BY for_profile_id, parent_item_type_id, parent_item_id, created DESC ) AS latest );
This still has a sequential scan, but it no longer has a JOIN that forces an iteration. The UNION that underpins this does a hash compare.
I am reminded though, that I'm not sure I like
SELECT DISTINCT ON
let aloneEXCEPT
. They're powerful, but I always have to contort my thinking to remember precisely what they're doing. My natural habit is to reach for aSELECT GROUP BY HAVING
which I did try but it didn't yield anything near the performance as it ran an additional aggregation step and sort which theDISTINCT ON
didn't seem to do in the plans I looked at. My traditional approach only got it down to a few minutes, but theEXCEPT
andDISTINCT ON
got it down to seconds. - You watch a conversation.
That's quite the optimisation!