You are reading a single comment by @mashton and its replies. Click here to read the full conversation.
  • That's quite the optimisation!

  • Yeah, it's here:
    https://github.com/microcosm-cc/microcosm/commit/1c22d43dd994abdf6a48c3715c5d87275e520978

    Gist:

    • You watch a conversation.
    • Every new reply to something you're watching triggers a write to the updates table to say "someone has replied"... you may get a notification (based on the last thing you've read in the conversation, we don't spam you for each reply made only the first - this happens at the time of writing to the table)
    • The Following page uses the last entry in the table to determine the things that you're watching that have been updated.

    So... if 100 people are watching a conversation, and 100 replies are made... 10,000 rows just got created, but we only need the most recent 100, the most recent one for each person watching the conversation.

    The job of the query then...

    • Find all updates for a user for a user, and delete everything that isn't the latest.

    The old query did it like this "find the latest updates as we want to keep those, use that to find all of the old updates as we want to lose those, delete the ones we want to lose":

    WITH keep AS (
    SELECT MAX(u.update_id) update_id
          ,u.for_profile_id
          ,u.parent_item_type_id
          ,u.parent_item_id
      FROM updates u
     WHERE u.update_type_id IN (1,4)
     GROUP BY u.for_profile_id
          ,u.parent_item_type_id
          ,u.parent_item_id
    ), lose AS (
        SELECT update_id
          FROM updates
         WHERE update_type_id IN (1,4)
           AND for_profile_id != 0
           AND parent_item_type_id != 0
           AND parent_item_id != 0
           AND update_id NOT IN (SELECT update_id FROM keep)
    )
    DELETE FROM updates
     WHERE update_id IN (SELECT * FROM lose);
    

    But this is using a couple of sequential scans, and most importantly some very high cardinality JOINs, i.e. within the NOT IN. At a certain point we appear to have crossed some magic number and this query went from being really fast... to basically not completing. The JOINs introduced the problem.

    The new query does it differently "find all updates except for the latest":

    DELETE
      FROM updates
     WHERE update_id IN (
              -- All updates for an item
              SELECT update_id
                FROM updates
               WHERE update_type_id = 1
                 AND parent_item_type_id > 0
              EXCEPT 
              -- Latest update for an item
              SELECT update_id
                FROM (
                        SELECT DISTINCT ON (for_profile_id, parent_item_type_id, parent_item_id) update_id
                          FROM updates
                         WHERE update_type_id = 1
                           AND parent_item_type_id > 0
                         ORDER BY for_profile_id, parent_item_type_id, parent_item_id, created DESC
                     ) AS latest
           );
    

    This still has a sequential scan, but it no longer has a JOIN that forces an iteration. The UNION that underpins this does a hash compare.

    I am reminded though, that I'm not sure I like SELECT DISTINCT ON let alone EXCEPT. They're powerful, but I always have to contort my thinking to remember precisely what they're doing. My natural habit is to reach for a SELECT GROUP BY HAVING which I did try but it didn't yield anything near the performance as it ran an additional aggregation step and sort which the DISTINCT ON didn't seem to do in the plans I looked at. My traditional approach only got it down to a few minutes, but the EXCEPT and DISTINCT ON got it down to seconds.

About

Avatar for mashton @mashton started