-
We have a table that looks roughly like this:
CREATE TABLE read AS ( profile_id bigint NOT NULL, item_type_id bigint NOT NULL, item_id bigint NOT NULL, when timestamp NOT NULL );
Where
item_type_id
contains a number that basically answers the question "Is this a conversation? Is it a PM? Is it a forum?" and so on for every "type" we support.When you read a conversation (like this one) we will put a value in there:
INSERT INTO read (50180, 6, 131364, timestamp_of_last_post_on_page)
.If you read a forum, then that trumps a conversation... so we can delete from
read
everything that refers to an item within the forum, and then insert the row for the forum.Then when we want to check whether something is "read", we work out what you're talking about... a conversation... and determine whether a
read
row exists for that, and whether aread
row exists for any of it's parents (forum or site), and then take the MAX of any of those values, and compare it to the last_modified for the conversation.All of that is presently wrapped in a decoration row-based function (used in the
SELECT
and not in theWHERE
that roughly looks like:SELECT is_unread(50180, 6, 131364) AS unread
.This all works blindingly well. If you read something on your mobile, the desktop reflects it, when we build apps and you read something in an app, the desktop reflects it. Read it on your desktop, the app would reflect it. And so on.
It only doesn't work now that there is a clamour for a filter based on
is_unread
.As
is_unread
is a row-based function, were you to apply it to the search (for example), then it would be applied as part of theWHERE
. Applying a row-based function to theSELECT
means that on a page of 25 items it's only going to be called 25 times. Applying a row-based function to theWHERE
or to a subquery that will be filtered means applying it to potentially hundreds of thousands (or even millions) of items, the vast majority of which are going to be discarded.An example scenario in which row-based functions bite people in the arse is in Postgres's full-text search: http://www.postgresql.org/docs/9.2/static/textsearch-controls.html#TEXTSEARCH-HEADLINE
They have a row-based function which is also cheap on a per row basis, but expensive across lots of rows. It fetches from the searched documents the highlighted fragment that matched the search:
SELECT ts_headline('english', 'The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query.', to_tsquery('query & similarity')); ts_headline ------------------------------------------------------------ containing given <b>query</b> terms and return them in order of their <b>similarity</b> to the <b>query</b>.
Their warning on that page: A typical mistake is to call ts_headline for every matching document when only ten documents are to be shown. SQL subqueries can help; here is an example:
And their example:
SELECT id, ts_headline(body, q), rank FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank FROM apod, to_tsquery('stars') q WHERE ti @@ q ORDER BY rank DESC LIMIT 10) AS foo;
Row-based queries are extremely ill-suited to filtering, and should never be used when the majority of the work will actually just be thrown away afterwards.
Obviously we have the data in the
read
table, but our current way of using that data is to check the "read" status of individual items through a row-based function for the purpose of painting a "New" label. The current row-based function is ill-suited to being used as a filter (we know because we've done it).
It is the same.
Presently determining "unread" is a row-based function.
Performance with SQL is never achieved when row-based functions are called across collections.