Developing UDFs with PL/Python
1. PL/Python Extension Overview
PL/Python is a loadable procedural language that allows you to write database User-Defined Functions (UDF) or stored procedures using Python. This enables you to leverage the powerful Python ecosystem directly within the database for complex data processing, analysis, and machine learning inference, without needing to move data out of the database, thereby improving development efficiency and reducing data movement overhead.
Core Value:
- Extensibility: Break through the limitations of built-in SQL functions to implement arbitrary complex logic.
- Flexibility: Leverage Python's vast ecosystem (such as NumPy, Pandas, Scikit-learn, Requests, etc.).
- Convenience: Data processing logic is closer to the data, reducing data movement and improving development efficiency.
In Tacnode, the supported PL/Python extensions include:
plpython3u
: Untrusted language extension supporting Python 3. Theu
stands for "untrusted," meaning it has fewer security sandbox restrictions, making it powerful but requiring careful use.plpythonvec3u
: Untrusted language extension for Python 3 that supports vectorized execution. It's designed to significantly improve performance when processing batch data rows.
Note: Using PL/Python requires first binding a UDF Server and keeping the UDF Server in a running state.
2. Built-in Python Library Support
To improve development convenience and performance, the system comes pre-installed with the following commonly used Python libraries that you can directly import and use in PL/Python UDFs without additional installation:
2.1 Database Connection and Operations
- psycopg2: PostgreSQL adapter for connecting and operating PostgreSQL databases in Python
- SQLAlchemy: Python SQL toolkit and Object-Relational Mapping, providing efficient data access patterns
2.2 Data Processing and Analysis
- numpy: Fundamental scientific computing library providing high-performance multidimensional array objects and mathematical functions
- pandas: Powerful data analysis and manipulation tool providing DataFrame data structures
- SciPy: Scientific computing library based on numpy, providing mathematical, scientific, and engineering computing functions
- Scikit-learn: Machine learning library providing various classification, regression, and clustering algorithms
2.3 Security and Encryption
- pycryptodome: Encryption toolkit providing various encryption algorithm implementations
- cryptography: Cryptographic library providing encryption recipes and primitives
- passlib: Password hashing library supporting multiple password hashing schemes
- pyOpenSSL: OpenSSL bindings providing SSL/TLS functionality
- base58: Base58 encoding/decoding library commonly used for cryptocurrency address encoding
2.4 Network and Cloud Services
- Requests: HTTP library for sending HTTP requests and calling REST APIs
- boto3: AWS SDK for Python for interacting with AWS services
2.5 Utilities and Tools
- pytest: Testing framework for writing and running test cases
- pendulum: Date-time library providing more intuitive date-time operation interfaces
These pre-installed libraries cover common use cases from data processing, machine learning to security encryption and cloud service integration, allowing you to quickly build complex in-database processing logic.
3. Enabling PL/Python Support
To use PL/Python, you first need to enable the corresponding extension in the target database.
3.1 Enable plpython3u
Use the CREATE EXTENSION
command to enable plpython3u
. Since it's an untrusted language, superuser privileges are required.
3.2 Enable plpythonvec3u
4. Writing PL/Python UDFs with plpython3u
The core of PL/Python UDF is writing Python scripts within SQL function definitions.
4.1 Basic Structure
A basic PL/Python UDF creation syntax is as follows:
4.2 Data Type Mapping
PL/Python automatically converts between SQL and Python data types:
SQL Type | Python Type |
---|---|
TEXT | str |
INTEGER | int |
FLOAT8 | float |
BOOL | bool |
ARRAY (e.g., INTEGER[]) | list (e.g., [int, int, ...]) |
JSON/JSONB | dict/list (Python objects) |
NULL | None |
4.3 Basic Examples
4.3.1 String Processing Function
4.3.2 Numerical Calculation and Array Processing
4.3.3 Returning Arrays
5. Writing PL/Python UDFs with plpythonvec3u
5.1 Vectorized Execution and plpythonvec3u
plpythonvec3u
is an extension designed to significantly improve PL/Python UDF performance when processing batch data.
Core Concept: Vectorized Execution
-
Traditional UDF (plpython3u): The Python function is called once for each row in the query result set. This means if you're processing 1 million rows of data, the Python interpreter will be called 1 million times, processing only one row of data each time. The overhead of this pattern (inter-process communication, function calls) is very large.
-
Vectorized UDF (plpythonvec3u): Receives a batch of rows at once (such as an array or matrix) and performs vectorized or batch processing operations within the Python function. This greatly reduces the number of function calls and inter-process communication overhead, and can better utilize the vectorized operation advantages of Python scientific computing libraries (such as NumPy, Pandas).
Expected Advantages of plpythonvec3u
- Dramatically Improve Performance: Reduce function call and process switching overhead, especially suitable for aggregation, transformation, and machine learning inference scenarios that need to process large numbers of rows.
- Better Hardware Utilization: More efficiently utilize modern CPU SIMD instructions and multi-core parallel capabilities.
- Seamless Integration with Scientific Computing Libraries: Directly receive and return NumPy arrays or Pandas DataFrames, facilitating the use of optimized batch computation functions from these libraries.
5.2 Usage Examples
6. Built-in Python Library UDF Usage Examples
6.1 NumPy UDF Array Statistical Analysis Example
6.2 Pandas UDF Data Cleaning and Transformation Example
6.3 Cryptography UDF Password Hash Verification Example
6.4 Requests Exchange Rate Query Example
7. Security and Production Environment Considerations
-
Untrusted Language: Both
plpython3u
andplpythonvec3u
are untrusted languages, powerful but high-risk. Never grant permissions to use or create these functions to untrusted users. -
Module Installation: Python libraries used in UDFs can only use built-in Python libraries; manual installation of third-party libraries is not supported.
-
Error Handling: Use
try-except
blocks in Python code to properly handle potential exceptions, avoiding entire query failures due to uncaught exceptions. -
Resource Management: Complex Python functions may consume significant CPU and memory resources. Monitor database resource usage to ensure UDFs don't affect overall database stability.
-
Code Quality: Ensure written Python code is robust, efficient, and thoroughly tested, especially handling edge cases (such as empty input, NULL values).
Advanced Examples
Machine Learning Integration
Time Series Analysis
This comprehensive guide provides everything needed to develop sophisticated PL/Python UDFs in Tacnode, from basic concepts to advanced machine learning integration.