
Vector Search

Overview

Artificial intelligence algorithms can convert various forms of unstructured data—such as speech, images, videos, language, text, and behaviors—into multi-dimensional vectors. These vectors act like coordinates in a mathematical space, representing different entities and their relationships.

The process of transforming unstructured data into vectors is called Embedding, while retrieving these vectors to find the associated entities is known as unstructured retrieval.

Traditional keyword search focuses on keyword matching and often misses the meaning and context behind queries. Semantic search, on the other hand, improves user intent fulfillment, especially for complex queries. It understands query context, handles ambiguous or incomplete searches, and delivers more relevant and useful results.

Vector retrieval is already widely used in production systems; common approaches include:

  1. Using open-source vector retrieval kernels (such as faiss, scann, proxima, and nmslib) to build an in-house online vector retrieval service exposed via a RESTful API.
  2. Implementing specialized vector retrieval databases to facilitate vector retrieval tasks.
  3. Choosing database or data warehouse products that support vector search features.

In many cases, the disadvantages of a dedicated vector database outweigh its benefits: data redundancy, excessive data movement, inconsistencies between distributed components, and the need for specialized expertise. Added to these are learning and labor costs, software licensing fees, limited query language capabilities and programmability, scalability constraints, a narrow toolchain, and weaker data integrity and availability than established databases provide. Supporting vector retrieval inside a distributed database instead leverages its distributed architecture to manage large volumes of vector data efficiently, and because vector writes, updates, and deletes participate in distributed transactions, they inherit the database's strong consistency guarantees.

Tacnode offers vector retrieval features and supports two index types: IVFFlat (Inverted File with Flat Quantization) and HNSW (Hierarchical Navigable Small World), both enabling efficient vector retrieval and recall. It supports batch data imports, real-time updates and deletions, and combined queries that integrate conditional filtering with vector retrieval.

Additionally, Tacnode implements pgvector compatibility from the PostgreSQL ecosystem via a plug-in, exposing vector retrieval through pgvector's simple, familiar interface while retaining the broader benefits of the PostgreSQL ecosystem.
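
In stock PostgreSQL, the pgvector extension must be enabled per database before the vector type can be used; whether Tacnode requires this explicit step or ships it pre-installed is an assumption worth verifying for your deployment. A minimal sketch:

-- Standard pgvector enablement in PostgreSQL; Tacnode may pre-install this (assumption)
CREATE EXTENSION IF NOT EXISTS vector;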

Create Vector Table

The vector data type supports up to 2000 dimensions, and a vector column must be declared with its dimension. Internally, a vector is an array of single-precision floating-point values, serialized into a compact binary format for efficient storage and retrieval.

CREATE TABLE items
(
    id        bigserial PRIMARY KEY,
    embedding vector(3) -- dimensions
)
    USING columnar;
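
The declared dimension is enforced on every write; for example, inserting a vector of the wrong length is rejected:

-- Fails: the table expects 3 dimensions, but this vector has only 2
INSERT INTO items (embedding) VALUES ('[1.0,0.0]');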

Vector Index

Currently, three methods for distance measurement are available:

  • <->: Euclidean distance (L2 distance), using the index operator class vector_l2_ops
  • <#>: inner product distance (pgvector's <#> returns the negative inner product), using the index operator class vector_ip_ops
  • <=>: cosine distance, using the index operator class vector_cosine_ops
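
As a minimal sketch against the items table above, each operator can be applied directly in a query; for all three, smaller values mean closer vectors, which is why <#> returns the negative inner product:

SELECT embedding <-> '[3,1,2]' AS l2_distance,
       embedding <#> '[3,1,2]' AS negative_inner_product,
       embedding <=> '[3,1,2]' AS cosine_distance
FROM items;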

Index construction supports multiple vector index types:

  • IVFFlat is an approximate nearest neighbor search algorithm that uses an inverted index to compare vector similarities efficiently. It partitions the vector space into multiple regions, each containing several vectors, and uses the inverted index to quickly locate vectors close to a query vector. The number of cluster centers is set with the lists parameter. The recommended sizing rule is: for fewer than 1 million rows, use lists = rows / 1000; beyond 1 million rows, use lists = sqrt(rows), which improves clustering quality (see the sizing sketch after this list). IVFFlat builds indexes faster and uses less memory, at the cost of slightly lower query performance. It is particularly well suited to large datasets, significantly shrinking the search space and improving query efficiency.
  • HNSW is a graph-based index that builds a multi-layer navigation graph and answers proximity searches by greedily descending through the layers, quickly homing in on similar vectors. It requires more memory and builds more slowly, but delivers better query performance and accuracy on high-dimensional data.
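
As a rough sketch of the sizing rule above, a suggested lists value can be computed from the table's current row count:

-- rows < 1M: lists = rows / 1000; rows >= 1M: lists = sqrt(rows)
SELECT CASE
           WHEN count(*) < 1000000 THEN greatest(count(*) / 1000, 1)
           ELSE ceil(sqrt(count(*)))::bigint
       END AS suggested_lists
FROM items;

With a lists value chosen, create an index for the desired distance measure: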
-- Create a Euclidean distance index
CREATE INDEX ON items USING split_ivfflat (embedding vector_l2_ops) WITH (lists = 100);
-- Create an Inner Product distance index
CREATE INDEX ON items USING split_ivfflat (embedding vector_ip_ops) WITH (lists = 100);
-- Create a Cosine distance index
CREATE INDEX ON items USING split_ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
 
-- Import vector data
INSERT INTO items (embedding) VALUES ('[1.0,0.0,-1.0]');
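
As noted in the overview, batch imports and real-time updates and deletions are also supported; a minimal sketch:

-- Batch import: insert multiple vectors in one statement
INSERT INTO items (embedding)
VALUES ('[0.5,0.5,0.5]'),
       ('[-1.0,2.0,0.0]');

-- Vectors update and delete like any other column value
UPDATE items SET embedding = '[0.0,1.0,0.0]' WHERE id = 1;
DELETE FROM items WHERE id = 2;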

Note: The HNSW method is not yet officially available. Currently, only brute-force search and IVFFlat are offered. If no index is created, vector search falls back to brute-force search.
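
To check whether a query uses an IVFFlat index or falls back to brute-force search, the standard EXPLAIN command can be used (the exact plan output depends on the deployment):

EXPLAIN
SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;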

-- ivfflat.probes sets the number of cluster centers examined per query. Larger values
-- raise recall but slow retrieval; when probes equals the lists value used to build
-- the index, the search degenerates into a brute-force scan.
 
-- Globally effective configuration
ALTER SYSTEM SET ivfflat.probes = 10;
 
-- Session-level configuration; takes priority over the global setting
SET ivfflat.probes = 10;
 
SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;
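
The combined queries mentioned in the overview pair ordinary SQL filtering with vector ordering. A minimal sketch, assuming a hypothetical category column added to items:

-- Hypothetical: add a filterable attribute to the table
ALTER TABLE items ADD COLUMN category text;

-- Combined query: conditional filter plus vector similarity ordering
SELECT *
FROM items
WHERE category = 'news'
ORDER BY embedding <-> '[3,1,2]'
LIMIT 5;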
