Version: 1.0.16

pg_trgm

mdx: format: md

The pg_trgm module provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings.

This module is considered "trusted", meaning it can be installed by non-superusers who have CREATE privilege on the current database.

1. Trigram (or Trigraph) Concept

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea has proven to be an effective method for measuring word similarity in many natural languages.

Note: When extracting trigrams from a string, pg_trgm ignores non-word characters (non-alphanumeric). When determining the set of trigrams contained in a string, each word is considered to have two spaces prefixed and one space suffixed. For example, the set of trigrams in the string "cat" is: " c", " ca", "cat", and "at ". The string "foo

2. Functions and Operators

The functions provided by the pg_trgm module are shown in Table C.24, and the operators are shown in Table C.25.

Table C.24. pg_trgm Functions

Function/Description
similarity ( text, text ) → real Returns a number indicating how similar the two arguments are. The result ranges from 0 (indicating the two strings are completely dissimilar) to 1 (indicating the two strings are identical).
show_trgm ( text ) → text[] Returns an array of all trigrams in the given string (rarely useful except for debugging).
word_similarity ( text, text ) → real Returns a number that is the greatest similarity between the trigram set of the first string and any continuous extent of an ordered set of trigrams in the second string. strict_word_similarity ( text, text ) → real Same as word_similarity(text, text), but forces the boundaries of the continuous extent to match word boundaries. Since we do not have cross-word trigrams, this function actually returns the similarity between the first string and any continuous word-based extent of the second string.
show_limit () → real Returns the current similarity threshold used by the % operator. For example, this sets the minimum similarity that two words must satisfy to be considered sufficiently similar (deprecated; use SHOW pg_trgm.similarity_threshold instead).
set_limit ( real ) → real Sets the current similarity threshold used by the % operator. The threshold must be between 0 and 1 (default is 0.3). Returns the same value that was passed in (deprecated; use SET pg_trgm.similarity_threshold instead)

Consider the following examples:

test=## SELECT word_similarity('word', 'two words');

word_similarity

-----------------

0.8

(1 row)

In the first string, the trigram set is {" w"," wo","wor","ord","rd "}. In the second string, the ordered set of trigrams is {" t"," tw","two","wo "," w"," wo","wor","ord","rds","ds "}.

The most similar portion of the ordered trigram set in the second string is {" w"," wo","wor","ord"}, and the similarity is 0.8.

The value returned by this function can be roughly understood as the maximum similarity between the first string and any substring of the second string.

However, this function does not add padding to the boundaries of the portion. Therefore, the number of extra characters present in the second string beyond mismatched word boundaries is not taken into account.

Meanwhile, strict_word_similarity selects a word-based portion in the second string. In the example above, strict_word_similarity would select the portion formed by the single word 'words', whose trigram set is {"w"," wo","wor","ord","rds","ds "}.

test=## SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words');

strict_word_similarity | similarity

------------------------+------------

0.5714286 | 0.5714286

(1 row)

Thus, the strict_word_similarity function is useful for computing whole-word similarity, while word_similarity is more suitable for computing partial word similarity.

Table C.25. pg_trgm Operators

Operator/Description
text % text → boolean Returns true if the arguments have a similarity greater than the similarity threshold set by pg_trgm.similarity_threshold. text <% text → boolean Returns true if the similarity between the trigram set of the first argument and a continuous extent of the ordered trigram set of the second argument exceeds the similarity threshold set by the pg_trgm.word_similarity_threshold parameter.
text %> text → boolean The commutator of the <% operator.
text <<% text → boolean Returns true if a continuous extent of the ordered trigram set of the second argument matches word boundaries, and its similarity with the trigram set of the first argument exceeds the similarity threshold set by the pg_trgm.strict_word_similarity_threshold parameter.
text %>> text → boolean The commutator of the <<% operator.
text <-> text → real Returns the "distance" between the arguments, that is, 1 minus the similarity() value.
text <<-> text → real Returns the "distance" between the arguments, which is 1 minus the word_similarity() value.
text <->> text → real The commutator of the <<-> operator.
text <<<-> text → real Returns the "distance" between the arguments, that is, 1 minus the strict_word_similarity() value.
text <->>> text → real The commutator of the <<<-> operator.

3. GUC Parameters

pg_trgm.similarity_threshold (real)

Sets the current similarity threshold used by the % operator. The threshold must be between 0 and 1 (default is 0.3).

pg_trgm.word_similarity_threshold (real)

Sets the current word similarity threshold used by the <% and %> operators. The threshold must be between 0 and 1 (default is 0.6).

pg_trgm.strict_word_similarity_threshold (real)

Sets the current strict word similarity threshold used by the <<% and %>> operators. The threshold must be between 0 and 1 (default is 0.5).

4. Index Support

The pg_trgm module provides GiST and GIN index operator classes that allow you to create indexes on a text column for fast similarity searches. These index types support the similarity operators described above, and additionally support trigram-based index searches for LIKE, ILIKE, ~, and ~* queries (these indexes do not support equality or simple comparison operators, so you may also need a regular B-tree index).

Examples:

CREATE TABLE test_trgm (t text);

CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);

Or CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);

The gist_trgm_ops GiST opclass approximates a set of trigrams as a bitmap signature. Its optional integer parameter siglen determines the signature length in bytes. The default length is 12 bytes. Valid values for signature length are between 1 and 2024 bytes. Longer signatures result in more precise searches (scanning a smaller portion of the index and fewer heap pages), but at the cost of a larger index.

Example of creating such an index with a signature length of 32 bytes:

CREATE INDEX trgm_idx ON test_trgm USING GIST

(t gist_trgm_ops(siglen=32));

At this point, you will have an index on the t column that you can use for similarity searches. A typical query is:

test=## SELECT t, similarity(t, 'word') AS sml

test-## FROM test_trgm

test-## WHERE t % 'word'

test-## ORDER BY sml DESC, t;

t | sml

---+-----

(0 rows)

This will return all values in the text column that are sufficiently similar to word, sorted from best match to worst match. The index will be used to speed up this search, even on a very large data set.

A variant of the above query is:

test=## SELECT t, t <-> 'word' AS dist FROM test_trgm

test-## ORDER BY dist LIMIT 10;

t | dist

---+------

(0 rows)

This can be efficiently implemented with a GiST index, but not with a GIN index. When only a few closest matches are desired, this is typically better than the first form.

You can also use an index on the t column for word similarity or strict word similarity searches. Typical queries are:

test=## SELECT t, word_similarity('word', t) AS sml

test-## FROM test_trgm

test-## WHERE 'word' <% t

test-## ORDER BY sml DESC, t;

t | sml

---+-----

(0 rows)

And test=## SELECT t, strict_word_similarity('word', t) AS sml

test-## FROM test_trgm

test-## WHERE 'word' <<% t

test-## ORDER BY sml DESC, t;

t | sml

---+-----

(0 rows)

This will return all values in the text column where there is a continuous extent in the ordered trigram set that is sufficiently similar to the trigram set of word, sorted from best match to worst match. Even on very large data sets, the index will make this operation fast enough.

Possible variants of the above query are:

test=## SELECT t, 'word' <<-> t AS dist

test-## FROM test_trgm

test-## ORDER BY dist LIMIT 10;

t | dist

---+------

(0 rows)

And

test=## SELECT t, 'word' <<<-> t AS dist

test-## FROM test_trgm

test-## ORDER BY dist LIMIT 10;

t | dist

---+------

(0 rows)

This can be efficiently implemented with a GiST index, but not with a GIN index.

For example:

SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

The index search works by extracting trigrams from the search string and looking them up in the index. The more trigrams in the search string, the more efficient the index search. Unlike B-tree-based searches, the search string does not need to be left-anchored.

For example:

SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';

The index search works by extracting trigrams from the regular expression and looking them up in the index. The more trigrams that can be extracted from the regular expression, the more efficient the index search. Unlike B-tree-based searches, the search string does not need to be left-anchored.

For LIKE and regular expression searches, remember that patterns with no extractable trigrams will degrade to a full index scan.

The choice between GiST and GIN indexes depends on the relative performance characteristics of GiST and GIN, which are discussed elsewhere.

5. Text Search Integration

When used in conjunction with a full-text index, trigram matching is a very useful tool. In particular, it can help identify misspelled input words that would not be matched directly by the full-text search mechanism.

The first step is to generate an auxiliary table containing all unique words from the documents:

CREATE TABLE words AS SELECT word FROM

ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');

Where documents is a table with a text field bodytext that we want to search. The reason for using the simple configuration with to_tsvector instead of a language-specific configuration is that we want a list of raw (unstemmed) words.

Next, create a trigram index on the word column:

CREATE INDEX words_idx ON words USING GIN(word gin_trgm_ops);

Now, a SELECT query similar to the previous example can be used to suggest spellings for misspelled words in a user's search terms. Requiring the selected words to also have a similar length to the misspelled word is a useful additional test.

Note: Since the words table has been generated as a separate, static table, it will need to be regenerated periodically so that it stays reasonably consistent with the document collection. However, it typically does not need to be exactly synchronized with the document collection.

mdx: format: md​

1. Trigram (or Trigraph) Concept​

2. Functions and Operators​

3. GUC Parameters​

4. Index Support​

5. Text Search Integration​