Speeding up text pattern searching and sorting in PostgreSQL
Introduction
Suppose you have a table with records that you want to be searchable by a textual attribute, such as name or title. You also need to sort those records in a certain way that is not necessarily linked to the name or title - for example, you want to display the most recent records, or the most popular ones, first. After a certain period, the table has grown in size and search has become very slow, so you need to speed it up somehow. What are your options?
What follows are some of the SQL performance optimization techniques, with accent on avoiding the sort operation as being potentially expensive operation.
As an example, we’ll be using a sample table that holds data about various music artists:
create table music_artist (
id integer,
name character varying(255),
cover_photo character varying(255),
short_bio text,
active boolean,
rating bigint
);
While some of the concepts can be reused in different database systems, this article will discuss mostly PostgreSQL and it’s features.
Prefix/postfix search
While searching for text that starts (or ends) in a certain way is not that common, it is worth describing on how to optimize in such cases, since it can be done with standard b-tree indexes.
Prefix and postfix search use the LIKE
operator, with a %
sign at the end (prefix), or in front (postfix) of input text to search for. To make the search case-insensitive, you can either use ILIKE
operator , or convert both column and input text to either lower or upper case.
Prefix search
Prepared statement for prefix search could look like this:
prepare mas(text) as
select * from music_artist
where name ilike $1 || '%' and active = true
order by rating desc
limit 5;
Running explain
on this query shows that whole table needs to be read sequentially, results are then sorted in memory, and then the first 5 results are selected:
=> explain execute mas('the');
QUERY PLAN
------------------------------------------------------------------------------------------------
Limit (cost=29890.60..29891.17 rows=5 width=50)
-> Gather Merge (cost=29890.60..34297.28 rows=38319 width=50)
Workers Planned: 1
-> Sort (cost=28890.59..28986.39 rows=38319 width=50)
Sort Key: rating DESC
-> Parallel Seq Scan on music_artist (cost=0.00..28254.12 rows=38319 width=50)
Filter: (active AND ((name)::text ~~* 'the%'::text))
Since b-tree index supports prefix search of text (case sensitive), we can create an index that will contain name and rating columns, with name stored in lower case:
create index pref_search_idx on music_artist (rating desc, lower(name))
where active = true;
This index is a partial, function-based index that stores transformed data for all rows that have the active column set to true
. As we are interested in the most popular results first, we’ll use the modifier desc
on the rating column, and also place it as a first column within the index definition, since the index needs to match the ORDER BY
clause as well.
To update statistics after the new index is created, use analyze music_artist;
command.
Now the query needs to be rewritten so it makes use of the new index by using standard LIKE
operator, with the column and the input text transformed to lower case:
prepare mas2(text) as
select * from music_artist
where lower(name) like lower($1) || '%' and active = true
order by rating desc
limit 5;
Execution plan of the new query now only has an index scan that stops after the first 5 results are retrieved:
=> explain execute mas2('the');
QUERY PLAN
--------------------------------------------------------------------------------------------------
Limit (cost=0.43..36.77 rows=5 width=50)
-> Index Scan using pref_search_idx on music_artist (cost=0.43..58586.92 rows=8061 width=50)
Filter: (lower((name)::text) ~~ 'the%'::text)
Response time is now improved from several hundreds of milliseconds down to only a handful of milliseconds (in a test environment with around 1.6M records in the table)!
As you can see, there is no Sort
step in the query execution plan, which means that results are already sorted, and no additional sorting is required.
Caveats
While the retrieval of popular records will be fast, you could experience significant drop in performance when the search needs to go deeper to find less popular records. If the performance becomes really bad in such cases, it can be further improved by creating a covering index (instead of the previous one) that will include all the columns from the query:
-- covering index
create index pref_search_idx on music_artist (rating desc, lower(name))
include (id, name, cover_photo, short_bio)
where active = true;
-- new query
prepare mas4(text) as
select id, name, cover_photo, short_bio, rating from music_artist
where lower(name) like lower($1) || '%' and active = true
order by rating desc
limit 5;
Creating a covering index essentially means duplicating lots of columns (or even the whole table), so it could become a pricey solution in terms of disk usage. But the benefit of a covering index is that it keeps all the included data physically close in an order defined by the index, so looking deeper for results would become faster by reading just the index, without the need for random data access on the table heap.
Postfix search
Postfix search consists of simply reversing both the column and the input text, and creating an index with the same transformation functions:
-- index
create index pref_search_rev_idx on music_artist (rating desc, reverse(lower(name)))
where active = true;
-- query
prepare mas3(text) as
select * from music_artist
where reverse(lower(name)) like reverse(lower($1)) || '%' and active = true
order by rating desc
limit 5;
Same execution plan and caveats apply just like for the prefix search.
Infix search
This is a more common way of searching, when the input text can be anywhere within a string - the infix search. Infix search is when the input text is surrounded with %
signs, for example ILIKE %input text%
. How to speed up such query?
Most common way in PostgreSQL is to use the pg_trgm extension which supports pattern search operators (LIKE
, ILIKE
, ~*
), and then create a GIN (or GiST) index over the text column:
-- add extension
create extension pg_trgm;
-- create GIN index
create index ma_name_gin_idx on music_artist using gin(name gin_trgm_ops)
where active = true;
-- query
prepare mas5(text) as
select * from music_artist
where name ilike '%' || $1 || '%' and active = true
order by rating desc
limit 5;
There is a huge performance gain when doing it this way. With the reasonable amount of data, it should be enough to keep the response time short. But what when it isn’t enough? Is there anything else that can be done?
If you look at the execution plan of the query above, it does involve several steps, including sorting:
=> explain (costs off) execute mas5('the');
QUERY PLAN
---------------------------------------------------------------------------
Limit
-> Sort
Sort Key: rating DESC
-> Bitmap Heap Scan on music_artist
Recheck Cond: (((name)::text ~~* '%the%'::text) AND active)
-> Bitmap Index Scan on ma_name_gin_idx
Index Cond: ((name)::text ~~* '%the%'::text)
Addressing this requires a little bit more of creativity.
GiST index
The pg_trgm extension can also use the GiST index for pattern search indexing, even though the GIN is generally recommended as being faster in this particular use-case. GiST is also “lossy”, which affects performance.
What is special about GiST index? This index can use the k-nearest neighbours algorithm to sort the results via distance (<->
) operator, whereas GIN index doesn’t support sorting at all.
Since GiST index doesnt support scalar values by default, and rating column is a scalar (bigint), firstly we’ll need an extension to enable such support:
create extension btree_gist;
Now, let’s create a GiST index on the table (note: any previously created indexes should be dropped). Order of the columns is not important:
create index ma_name_gist_idx on music_artist using gist(name gist_trgm_ops, rating)
where active = true;
A query that will utilize the new index can be written like this:
prepare mas6(text) as
select * from music_artist
where name ilike '%' || $1 || '%' and active = true
order by rating <-> 50000
limit 5;
Explain plan looks far simpler now:
=> explain (costs off) execute mas6('the');
QUERY PLAN
-------------------------------------------------------------------------
Limit
-> Index Scan using ma_name_gist_idx on music_artist
Index Cond: ((name)::text ~~* (('%'::text || $1) || '%'::text))
Order By: (rating <-> '50000'::bigint)
Value 50000 here is the value equal or larger than the maximum value of the rating column in the table, so that results are retrieved in an descending order. What is important to notice here is that the distance operator needs a constant. If you try to use a subquery instead of value 50000, index-enabled sorting will not trigger, and a Sort
operation will show up in the execution plan! This means that you should have this value collected beforehand, if you’d like to use the search in this way.
Note: be wary of using bigint type with this approach, as there seems to be a bug if you try to use an overly large value as a constant, which could mess up the search results. I even asked a question on Stack Overflow about this, but there doesn’t seem to be an answer.
Caveats
This way of indexing suffers from the same problem as the prefix/postfix search. Lesser the popularity, bigger the search time. Unfortunately, it’s performance cannot be improved further with a covering index. GiST index does support covering, but not all operator classes will trigger the index-only scan. And it seems that it won’t trigger in this very use-case.
Conclusion
Whether you will use standard index, GIN or GiST to speed up your search depends on what your goal is, and on how much data do you have. I hope I’ve managed to give some insight into pros and cons for each of them when it comes to performance of text pattern searching. However, if a relational database is not up to the task of searching lots of text, perhaps using a specialized tool (i.e. Elastic Search) would be a better option.
Further reading
- I highly recommend visiting Use The Index, Luke! website by Markus Winand, where you can learn all the power of database indexing and SQL, a much needed skill if your job description includes databases
- If you’d like to see more performance tuning examples, I have a GitHub repo where I’ve dealt with optimization techniques similar to the ones in this article, and also with some of the data denormalization strategies