tacnode

Full Text Search

Full-text retrieval is essential in modern data applications, significantly enhancing their capabilities and maximizing data value. This article covers functions like trgm, tsvector, and tsquery, making it easy for users to implement full-text retrieval in data processing applications.

Implementing full-text retrieval presents several challenges within data architecture.

Functional Challenges

  • User-Friendly Query Syntax:
    • Provide a query syntax that is easy for users to understand and use, reducing learning costs and barriers to usage.
  • Data Preprocessing Functions:
    • Word Stemming: Support for word stemming is necessary due to different word forms in various languages. For example, searching for "cat" should also find documents containing "cats."
    • Stop Word Filtering: Remove common but insignificant words, such as "the" and "is" in English.

Performance Challenges

  • Performance and Scalability: Must meet QPS and latency requirements in business scenarios, with the ability to scale horizontally as data volumes increase.
  • Real-Time Data: Ensure search queries can access up-to-date information in real-time while maintaining consistent query results.

Calculate string similarity using the trgm extension

For simple retrieval cases involving string fuzzy matching, such as finding users by a few characters in their email, Tacnode integrates the PostgreSQL pg_trgm extension. This extension enhances partial text search functionality through trigram matching and provides various functions and operators to assess text similarity.

A trigram consists of three consecutive characters from a text string. By segmenting text into trigrams, users can conduct similarity searches more effectively and flexibly.

Compute Triples

The pg_trgm module calculates triples from text strings as follows:

  • Only alphanumeric characters are included.
  • Convert the string to lowercase before computing the triples.
  • Each word is treated as having a prefix of two spaces and a suffix of one space.
  • A set of triples is produced for deduplicated results.

Calculate Similarity

For two strings A and B, pg_trgm determines the similarity score using a set of triples by:

  • Dividing the size of the intersection of both sets by the size of their union.

The show_trgm and similarity functions allow us to examine how pg_trgm counts triples in a string and how the similarity score is computed:

SELECT show_trgm('Zhangsan'), show_trgm('Zhan'), similarity('Zhangsan', 'Zhan');
 
                  show_trgm                  |          show_trgm          | similarity
---------------------------------------------+-----------------------------+------------
{"  z"," zh","an ",ang,gsa,han,ngs,san,zha} | {"  z"," zh","an ",han,zha} |  0.5555556(1 row)
 
-- The two input strings have 9 different triplets (union) and 5 identical triplets (intersection).
-- Therefore, the similarity score is 5/9 (0.5555556).

This straightforward and user-friendly extension is ideal for fuzzy string-matching situations.

Here's an easy example:

-- Create the test_email table
CREATE TABLE email_search (
    email TEXT
) USING COLUMNAR;
 
INSERT INTO email_search (email) VALUES
('alice@example.com'),
('bob@sample.org'),
('charlie@gmail.com'),
('david@yahoo.com'),
('eve@hotmail.com'),
('frank@outlook.com'),
('gerry@qq.com'),
('lisi@sina.com'),
('zhangsan@qq.com');

SPLIT_GIN Index

Tacnode features a SPLIT_GIN distributed inverted index, designed to enhance query efficiency in full-text retrieval situations.

Implement SPLIT_GIN indexes to accelerate related search queries. By utilizing gin_trgm_ops parameters, boost the performance of LIKE and ILIKE operators.

-- create index
CREATE INDEX email_search_email_gin ON email_search USING SPLIT_GIN(email gin_trgm_ops);
 
-- verify index
textsearch=> SELECT email FROM email_search where email like '%zhangsan%';
      email
-----------------
 zhangsan@qq.com
(1 row)
 
textsearch=> SELECT email FROM email_search where email ilike '%lisi%';
     email
---------------
 lisi@sina.com
(1 row)

Leverage advanced retrieval expressions using tsvector and tsquery

Essential text search features are provided through native operators like ~, ~*, LIKE, and ILIKE, along with the trgm extension.

For more sophisticated search needs, employ advanced functions for precise word matching and logical combinations of terms using AND, OR, and NOT.

Tacnode offers two essential data types and query operators for complex full-text searches:

  1. tsvector type: encapsulates a collection of lexemes formatted into a string based on chosen word segmentation rules.
  2. tsquery type: defines a text query, utilizing Boolean operators & (AND), | (OR), and ! (NOT) to formulate lexeme combinations.
  3. @@ operator: evaluates tsvector @@ tsquery, returning a boolean value indicating whether the word segmentation aligns with the query.

tsvector type

  • tsvector: Transforms a string into a word segmentation format based on chosen word segmentation rules. It can be seen as a collection of abstracted lexemes. For further details, refer to tsvector.
    • Additionally, the built-in to_tsvector function conducts normalization.
textsearch=> SELECT 'The Fat Rats'::tsvector;
      tsvector
--------------------
 'Fat' 'Rats' 'The'
(1 row)
 
textsearch=> SELECT to_tsvector('The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3
(1 row)

tsquery type

  • Lexemes represent a text query and combine them using the Boolean operators & (AND), | (OR), and ! (NOT). Refer to tsquery.
  • The built-in to_tsquery functions will also undergo normalization.
textsearch=> SELECT 'Fat & Rats'::tsquery;
    tsquery
----------------
 'Fat' & 'Rats'(1 row)
 
textsearch=> SELECT to_tsquery('english', 'Fat & Rats');
  to_tsquery
---------------
 'fat' & 'rat'(1 row)

tsquery Operator

  • & (AND): Both parameters must be present in the document for a match to occur.

  • | (OR): At least one argument must be present.

  • ! (NOT): Ensures its argument does not appear to match. For example, the query fat &! rat matches documents containing fat while excluding those with rat.

  • <-> (FOLLOWED BY) tsquery: Looks for phrases, matching only if its arguments are adjacent and arranged in the specified order. For example:

    textsearch=> SELECT to_tsvector('fatal error') @@ to_tsquery('fatal <-> error');
     ?column?
    ----------
     t
     
    textsearch=> SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal <-> error');
     ?column?
    ----------
     f
  • <N> serves as a broader variant of the FOLLOWED BY operator, where N denotes an integer specifying the gap between token positions. <1> aligns with <->, whereas <2> permits one additional token between the matches, and so forth. The phraseto_tsquery function employs this operator to create a phrase that corresponds to a multi-word tsquery, even when some words are stop words. For instance:

    textsearch=> SELECT phraseto_tsquery('cats ate rats');
    	 phraseto_tsquery
    ---------------------------
     'cat' <-> 'ate' <-> 'rat'(1 row)
     
    textsearch=> SELECT phraseto_tsquery('the cats ate the rats');
         phraseto_tsquery
    ---------------------------
     'cat' <-> 'ate' <2> 'rat'(1 row)
  • Use parentheses to manage the nesting of tsquery operators.

    • Without parentheses, the priority of | is the lowest, followed in ascending order by &, <->, and !.

tsquery functions

  • plainto_tsquery transforms the raw text querytext into a tsquery value. The input text undergoes parsing and normalization, similar to to_tsvector, and the & (AND) Boolean operator is placed between the remaining terms.

plainto_tsquery

textsearch=> SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'(1 row)
  • websearch_to_tsquery

The function websearch_to_tsquery generates a tsquery value from querytext by utilizing an alternate syntax where plain, unformatted text qualifies as a valid query. In contrast to plainto_tsquery and phraseto_tsquery, it also accepts special operators. Moreover, this function guarantees no syntax errors will occur, enabling the direct use of user-supplied input for search purposes. The supported syntaxes include:

  • no quoted text: Text not surrounded by quotes will be processed into & operator-separated words using plainto_tsquery.
  • "quote text": Text enclosed in quotes will be transformed into <-> operator-separated words via phraseto_tsquery.
  • OR: The term "or” will be converted into the | operator.
  • -: This symbol will be changed to ! operators.
textsearch=> SELECT websearch_to_tsquery('english', 'The fat rats');
 websearch_to_tsquery
----------------------
 'fat' & 'rat'
(1 row)
 
textsearch=> SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
 websearch_to_tsquery
----------------------------------
 'supernova' <-> 'star' & !'crab'
 
textsearch=> SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
 websearch_to_tsquery
-----------------------------------
 'sad' <-> 'cat' | 'fat' <-> 'rat'
(1 row)
 
textsearch=> SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
 websearch_to_tsquery
---------------------------------------
 'signal' & !( 'segment' <-> 'fault' )
(1 row)
 
textsearch=> SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->');
 websearch_to_tsquery
----------------------
 'dummi' <-> 'queri'
(1 row)

SPLIT_GIN Index

Creating a SPLIT_GIN index for the ts_vector field is recommended.

-- Specify columnar storage to facilitate the use of SPLIT_GIN index
CREATE TABLE text_search (content text) using COLUMNAR;
 
INSERT INTO text_search VALUES
('Ruby on Rails for web application development.'),
('Rust ownership and memory safety.'),
('Concurrency in Golang programming.'),
('Building web APIs with Golang and Rust.'),
('Golang vs Rust: A comparison of programming languages.'),
('Mobile app development using React Native.'),
('Functional programming with Haskell and Scala.'),
('The impact of cloud computing on modern IT.'),
('Security best practices in web development.'),
('Optimizing SQL queries for performance.'),
('JavaScript frameworks for web application development.'),
('Software development life cycle methodologies.'),
('Artificial intelligence and machine learning in Python.');
 
--Add a tsvector calculated column to the document (can also be added as a normal column)
ALTER TABLE text_search ADD tsv_content TSVECTOR
GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
 
-- Create an index for the tsvector column
CREATE INDEX text_search_tsv_gin ON text_search USING SPLIT_GIN(tsv_content);

Examples:

textsearch=> SELECT content FROM text_search WHERE tsv_content @@ to_tsquery('english', 'web & development');
                   content
--------------------------------------------------------
 Security best practices in web development.
 Ruby on Rails for web application development.
 JavaScript frameworks for web application development.
(3 rows)
 
textsearch=> SELECT content FROM text_search WHERE tsv_content @@ to_tsquery('english', 'web <-> development');
                   content
---------------------------------------------
 Security best practices in web development.
(1 row)
 
textsearch=> SELECT content FROM text_search  WHERE tsv_content @@ to_tsquery('english', 'JavaScript | application');
                   content
--------------------------------------------------------
 Ruby on Rails for web application development.
 JavaScript frameworks for web application development.
(2 rows)
 
textsearch=> SELECT content FROM text_search  WHERE tsv_content @@ to_tsquery('english', 'Rust & !programming');
                   content
-----------------------------------------
 Building web APIs with Golang and Rust.
 Rust ownership and memory safety.
(2 rows)
 
textsearch=> SELECT content FROM text_search WHERE tsv_content @@ to_tsquery('english', 'cloud | !develop');
                   content
---------------------------------------------------------
 Rust ownership and memory safety.
 Concurrency in Golang programming.
 Building web APIs with Golang and Rust.
 Golang vs Rust: A comparison of programming languages.
 Functional programming with Haskell and Scala.
 The impact of cloud computing on modern IT.
 Optimizing SQL queries for performance.
 Artificial intelligence and machine learning in Python.
(8 rows)