PostgreSQL Full Text Search
Summary: in this tutorial, you will learn about PostgreSQL full text search and how to use it to perform complex searches on text stored in the database.
Introduction to the PostgreSQL full-text search
In PostgreSQL, full-text search is a built-in feature that allows you to perform complex searches on text stored in the database.
Full-text search enables efficient searching of documents, articles, or any form of text-based content by indexing their words and phrases.
The following describes an overview of how the PostgreSQL full text search works:
- Indexing: PostgreSQL allows you to create indexes on columns with the tsvector data type. When creating a full-text search index, PostgreSQL analyzes text data and generates a specialized data structure called tsvector, which represents the indexed document in a format optimized for searches.
- Text analysis: For building a full text index, PostgreSQL performs a text analysis process that involves tokenizing text into individual words or tokens, removing stop words ( like
the
,and
,..) applying stemming (or lemmatization) to reduce words to their root forms, and performing other linguistic transformations to prepare the text for indexing. - Querying: Once PostgreSQL creates the full text index, you can perform full-text search queries using dedicated full text search functions and operators. For example, you can search for specific words or phrases, apply boolean operators to combine search terms, and more.
- Ranking: allows you to rank the search results based on their relevance to the query. PostgreSQL offers the
ts_rank()
function to calculate a rank score for each document based on its similarity to the search query. You can use the ranking feature to sort the search results based on relevance. - Highlighting: PostgreSQL can generate snippets or summaries of documents containing the matching words from the search query.
In practice, you often use the full text search feature in applications including content management systems (CMS
), document management systems, and so on to enable fast searches on a large volume of text.
PostgreSQL full text search data types
PostgreSQL provides you with two data types unique to full text search: tsvector
and tsquery
:
tsvector
tsvector
is a data type that allows you to store preprocessed documents in a format optimized for efficient searching and retrieval of text.
A tsvector
value contains a sorted list of lexemes (words) and their positions and weight within a document.
Note that lexemes are words without the variation created by suffixes, for example, watches
, watched
, and watching
words have the lexeme watch
.
For example, the following uses the to_tsvector()
function to convert the words watches
, watched
, and watching
to tsvector
:
Output:
In this example, the to_tsvector()
function converts the words into tsvector values. Instead of returning the original words, it returns the lexemes of these words, which is watch
.
The following example uses the to_tsvector()
function to convert a string to a tsvector value. For example:
Output:
In this example:
- Each entry in the tsvector value represents a word (lexeme) and its position within the string (or document). For example, the word
quick
appears at position 2, the wordbrown
appears at position 3, and so on. - Words are sorted in alphabetical order.
- Articles and stop words are omitted like
The
andover
.
tsquery
tsquery
is a data type that represents search queries in full-text searches. It allows you to specify search conditions containing the indexed document’s words or phrases.
Additionally, a tsquery
value can include search operators to refine the search conditions.
- Boolean operators AND (&), OR (|), and NOT (!): can combine search terms and define logical relationships between them.
- Phrase search (“”): Double quotes (“”) indicate that the enclosed words must appear together in the index document in the specified order.
- Prefix search (:) : A colon (:) after a word indicates that the search term should match words with the same prefix.
- Negation (-): a negation excludes specific terms from the search results.
- Grouping (): You can use parentheses to group terms and operators to define flexible search conditions.
For example:
This tsquery
searches for the documents that contain the words “quick” and “brown” together (in any order), or the word “fox”.
The to_tsquery()
converts a string to a tsquery
. For example, the following statement uses the to_tsquery()
to convert the word “jumping” to a tsquery
:
Output:
The match operator (@@)
The match operator (@@
) evaluates the similarity between the text in a document (tsvector
) and the terms specified in the search query (tsquery
) and returns true if there is a match or false otherwise.
For example, the following statement uses the @@
operator to determine if the string matches the tsquery
:
Output:
It returns true because the tsvector contains the word jump which is the lexeme of the word jumping.
The following example uses the match operator (@@
) to determine if the string contains the word cat:
Output:
Using PostgreSQL full-text search index with table data
Let’s take some examples of using full-text searches with boolean operators.
1) Setting up a sample table
First, create a new table called posts
:
In the posts
table, the body_search
is a generated column with the data type tsvector
.
Whenever you change data in the body
column, PostgreSQL will convert it to a tsvector
using the to_tsvector()
function and store it in the body_search
column.
Second, insert some rows into the posts
table:
Third, retrieve data from the id
and body_search
columns:
Output:
2) Simple full-text search
The following example uses the match operator (@@
) to search for posts whose body
contains the word "PostgreSQL"
:
Output:
3) Full-text search with AND operator
The following example uses the AND operator (&) to search for posts whose body contains both words “PostgreSQL” and “techniques” that can appear in any order:
Output:
4) Full-text search with OR operator
The following example uses the OR operator (|) to search for posts whose body contains either the word "efficient"
or "optimization"
:
Output:
5) full-text search with phrase
The following example searches for posts whose body contains the phrase “PostgreSQL technique”:
Output:
6) Full-text search with negation
The following example searches for posts whose body does not contain the word “efficient”:
Output:
Full text search using GIN indexes
In PostgreSQL, GIN stands for Generalized Inverted Index. GIN index is a type of index that is optimized for full-text search vectors (tsvector
).
First, drop the posts
table and recreate it using the following statements:
Second, insert rows into the posts
table:
Third, create a GIN index on the body
column of the posts
table:
Finally, search for the posts
whose body contains either the word basic
or advanced
:
Output:
Summary
- Use PostgreSQL full text search feature to perform complex searches on text stored in the database.
- Use tsvector and tsquery data types for full text searches.
- Use the match operator (
@@
) to check if documents match a query. - Use the generated column with
tsvector
type to store the tsvector data for full text searches. - Use
GIN
indexes for full-text search vectors (tsvector
).