Full Text Search
Full-text retrieval is a core capability of modern data applications: it lets users find relevant records in large volumes of unstructured text. This article covers the pg_trgm extension and the tsvector and tsquery types, which together make it easy to implement full-text retrieval in data processing applications.
Challenges of Full Text Search
Implementing full-text retrieval presents several challenges within data architecture.
Functional Challenges
- User-Friendly Query Syntax: Provide a query syntax that is easy to understand and use, lowering the learning cost and the barrier to adoption.
- Data Preprocessing Functions:
  - Word Stemming: Words take different forms, so the search must reduce them to a common stem. For example, searching for "cat" should also find documents containing "cats."
  - Stop Word Filtering: Remove common but insignificant words, such as "the" and "is" in English.
Performance Challenges
- Performance and Scalability: Must meet QPS and latency requirements in business scenarios, with the ability to scale horizontally as data volumes increase.
- Real-Time Data: Ensure search queries can access up-to-date information in real-time while maintaining consistent query results.
Calculate string similarity using the trgm extension
For simple retrieval cases involving string fuzzy matching, such as finding users by a few characters in their email, Tacnode integrates the PostgreSQL pg_trgm extension. This extension enhances partial text search functionality through trigram matching and provides various functions and operators to assess text similarity.
A trigram consists of three consecutive characters from a text string. By segmenting text into trigrams, users can conduct similarity searches more effectively and flexibly.
Compute Trigrams
The pg_trgm module extracts trigrams from a text string as follows:
- Only alphanumeric characters are included.
- The string is converted to lowercase before the trigrams are computed.
- Each word is treated as having a prefix of two spaces and a suffix of one space.
- The result is a deduplicated set of trigrams.
Calculate Similarity
For two strings A and B, pg_trgm computes the similarity score from their trigram sets by dividing the size of the intersection of the two sets by the size of their union. The score ranges from 0 (no shared trigrams) to 1 (identical trigram sets).
The show_trgm and similarity functions allow us to examine how pg_trgm extracts trigrams from a string and how the similarity score is computed:
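The rules above can also be reproduced outside the database. The following is a minimal Python sketch, not the extension's actual code: the function names `trigrams` and `similarity` are chosen here to mirror show_trgm and similarity, and the word splitting is an ASCII-only simplification.

```python
import re

def trigrams(text):
    """Return the set of trigrams for `text`, following the rules listed above."""
    result = set()
    # Lowercase, then keep only runs of alphanumeric characters as words
    # (assumption: ASCII input; the real extension is locale-aware).
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two-space prefix, one-space suffix
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result

def similarity(a, b):
    """Size of the intersection of the trigram sets divided by the size of their union."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

print(sorted(trigrams("cat")))                    # ['  c', ' ca', 'at ', 'cat']
print(round(similarity("word", "two words"), 4))  # 0.3636 (4 shared trigrams out of 11)
```

"word" and "two words" share four trigrams out of eleven distinct ones, which is why the score is 4/11.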
This straightforward and user-friendly extension is ideal for fuzzy string-matching situations.
Here's an easy example:
SPLIT_GIN Index
Tacnode provides SPLIT_GIN, a distributed inverted index designed to improve query efficiency in full-text retrieval scenarios.
Create SPLIT_GIN indexes to accelerate these search queries. When created with the gin_trgm_ops operator class, a SPLIT_GIN index speeds up the LIKE and ILIKE operators.
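To see why a trigram-based inverted index helps LIKE and ILIKE, consider the toy Python sketch below. It is an illustration of the general technique only and makes no claim about how SPLIT_GIN is implemented; the class `TrigramIndex` and its methods are hypothetical names, and trigrams are taken as plain 3-character windows without the word padding that pg_trgm applies.

```python
from collections import defaultdict

def substring_trigrams(s):
    """All 3-character windows of `s`, lowercased (no padding in this toy version)."""
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Hypothetical in-memory inverted index: trigram -> ids of rows containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.rows = {}

    def add(self, row_id, text):
        self.rows[row_id] = text
        for tri in substring_trigrams(text):
            self.postings[tri].add(row_id)

    def ilike_contains(self, needle):
        """Emulate a case-insensitive '%needle%' match."""
        tris = substring_trigrams(needle)
        if not tris:
            # Fewer than 3 characters: the index cannot narrow anything, scan all rows.
            candidates = set(self.rows)
        else:
            # Only rows containing every trigram of the needle can possibly match.
            candidates = set.intersection(*(self.postings.get(t, set()) for t in tris))
        # Recheck the candidates: sharing all trigrams is necessary, not sufficient.
        return sorted(r for r in candidates if needle.lower() in self.rows[r].lower())

index = TrigramIndex()
index.add(1, "alice@example.com")
index.add(2, "bob@example.org")
index.add(3, "malice@test.io")
print(index.ilike_contains("ALICE"))  # [1, 3]
print(index.ilike_contains("ex"))     # [1, 2]
```

The index turns a full scan into a few posting-list intersections followed by a recheck of a small candidate set, which is the reason short patterns (under three characters) benefit little.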
Leverage advanced retrieval expressions using tsvector and tsquery
Basic text search is provided through native operators like ~, ~*, LIKE, and ILIKE, along with the trgm extension.
For more sophisticated search needs, use the full-text search functions, which support precise word matching and logical combinations of terms using AND, OR, and NOT.
Tacnode offers two data types and a match operator for complex full-text searches:
- tsvector type: holds a collection of lexemes produced from a string according to the chosen word segmentation rules.
- tsquery type: defines a text query, using the Boolean operators & (AND), | (OR), and ! (NOT) to combine lexemes.
- @@ operator: evaluates tsvector @@ tsquery, returning a boolean value indicating whether the segmented document satisfies the query.
tsvector type
- tsvector: Represents a string in segmented form, based on the chosen word segmentation rules. It can be seen as a collection of normalized lexemes. For further details, refer to tsvector.
- Additionally, the built-in to_tsvector function performs normalization.
tsquery type
- tsquery: Represents a text query made of lexemes combined using the Boolean operators & (AND), | (OR), and ! (NOT). Refer to tsquery.
- The built-in to_tsquery function also performs normalization.
tsquery Operator
- & (AND): Both arguments must be present in the document for a match to occur.
- | (OR): At least one argument must be present.
- ! (NOT): Its argument must not appear for a match to occur. For example, the query fat & ! rat matches documents containing fat while excluding those with rat.
- <-> (FOLLOWED BY): Searches for phrases, matching only if its arguments are adjacent and appear in the specified order. For example, the query fatal <-> error matches the text "fatal error" but not "error is not fatal".
- <N> is a generalized form of the FOLLOWED BY operator, where N is an integer specifying the distance between token positions. <1> is equivalent to <->, whereas <2> allows exactly one other token between the matches, and so forth. The phraseto_tsquery function uses this operator to build a tsquery that can match a multi-word phrase even when some of the words are stop words. For instance, with the English configuration, phraseto_tsquery('the cats ate the rats') yields 'cat' <-> 'ate' <2> 'rat'.
- Use parentheses to control the nesting of tsquery operators.
  - Without parentheses, | binds least tightly, followed in ascending order by &, <->, and !.
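The distance semantics of <-> and <N> can be illustrated with token positions. The Python sketch below is a simplified model under stated assumptions: `index_positions` and `followed_by` are hypothetical helper names, tokens are split on whitespace, and no stemming or stop-word handling is performed.

```python
from collections import defaultdict

def index_positions(tokens):
    """Map each token to the list of its 1-based positions in the document."""
    positions = defaultdict(list)
    for pos, token in enumerate(tokens, start=1):
        positions[token].append(pos)
    return positions

def followed_by(positions, left, right, distance=1):
    """Model of `left <distance> right`: right occurs exactly `distance` positions after left."""
    right_positions = set(positions.get(right, ()))
    return any(p + distance in right_positions for p in positions.get(left, ()))

doc = index_positions("fatal disk error".split())
print(followed_by(doc, "fatal", "disk"))      # True:  fatal <-> disk
print(followed_by(doc, "fatal", "error"))     # False: fatal <-> error (not adjacent)
print(followed_by(doc, "fatal", "error", 2))  # True:  fatal <2> error
print(followed_by(doc, "error", "fatal", 2))  # False: order matters
```

Because the check compares positions rather than mere presence, a phrase query is stricter than combining the same terms with &.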
tsquery functions
plainto_tsquery
The function plainto_tsquery transforms the unformatted text querytext into a tsquery value. The input text is parsed and normalized in much the same way as for to_tsvector, and the & (AND) Boolean operator is then inserted between the remaining terms.
websearch_to_tsquery
The function websearch_to_tsquery generates a tsquery value from querytext using an alternative syntax in which plain, unformatted text is a valid query. Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain operators. Moreover, this function never raises syntax errors, so raw user-supplied input can be passed to it directly. The supported syntax is:
- unquoted text: Text not surrounded by quotes is converted to terms separated by & operators, as if processed by plainto_tsquery.
- "quoted text": Text enclosed in quotes is converted to terms separated by <-> operators, as if processed by phraseto_tsquery.
- OR: The word "or" is converted to the | operator.
- -: A dash is converted to the ! operator.
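The four rules above can be modeled with a short translator. This Python sketch is a rough approximation for illustration only: `websearch_to_query` is a hypothetical name, it performs no stemming or stop-word removal, and it is not guaranteed to reproduce the exact output of websearch_to_tsquery for every input.

```python
import re

# A token is either an optionally negated quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'-?"[^"]*"|\S+')

def websearch_to_query(text):
    """Translate web-search style input into a tsquery-like string."""
    out = []
    pending_or = False
    for token in TOKEN.findall(text):
        if token.lower() == "or":
            # "or" only has an effect when an operand precedes it.
            pending_or = bool(out)
            continue
        negate = token.startswith("-")
        quoted = token.lstrip("-").startswith('"')
        words = re.findall(r"\w+", token)
        if not words:
            continue  # nothing searchable in this token; never raise an error
        term = (" <-> " if quoted else " & ").join(words)
        if negate:
            term = "!(" + term + ")" if len(words) > 1 else "!" + term
        if out:
            out.append(" | " if pending_or else " & ")
        out.append(term)
        pending_or = False
    return "".join(out)

print(websearch_to_query('fat rats'))                      # fat & rats
print(websearch_to_query('"sad cat" or "fat rat"'))        # sad <-> cat | fat <-> rat
print(websearch_to_query('signal -"segmentation fault"'))  # signal & !(segmentation <-> fault)
```

Malformed input, such as a stray dash or an unclosed quote, is skipped or treated as plain text rather than rejected, which mirrors the no-syntax-error guarantee described above.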
SPLIT_GIN Index
Creating a SPLIT_GIN index on the tsvector column is recommended.
Examples: