Monitoring and Metrics
Effective monitoring is essential for maintaining optimal performance and reliability in your Tacnode data warehouse. This guide covers the comprehensive metrics available for tracking system health, performance, and resource utilization.
Overview of Monitoring Capabilities
Key Monitoring Areas
- Nodegroup Performance: CPU, memory, and network utilization
- Query Performance: Latency, throughput, and error rates
- Database Storage: Size tracking and growth patterns
- System Health: Connection pools, failed operations, and resource contention
Monitoring Dashboard Features
- Real-time metrics visualization
- Historical trend analysis
- Configurable time ranges
- Exportable data for external analysis
Nodegroup Metrics
Monitor the health and performance of your compute resources through comprehensive nodegroup metrics.
Resource Utilization
Target and Current Size
- Shows the nodegroup's planned (target) versus actual unit count
- The current size reflects the number of units in normal service
- Temporary differences during scaling operations are normal
- Persistent mismatches may indicate a system issue; contact support if the gap does not resolve
Resource Utilization
- Combined CPU and memory consumption percentage
- Values consistently above 80% indicate a need for capacity expansion (see the sketch below)
- Helps identify optimal scaling points
- Tracks resource efficiency trends
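To illustrate the "consistently above 80%" guideline, here is a minimal sketch, assuming you sample the normalized utilization metric at a fixed interval. The function name and the 12-sample window are illustrative assumptions, not part of any Tacnode API:

```python
# Hypothetical helper: decide whether sustained utilization warrants scaling.
# `samples` are nodegroup_resource_percent_normalized values (0.0 - 1.0),
# collected at a fixed interval, most recent last.

def needs_capacity_expansion(samples, threshold=0.8, sustained_points=12):
    """Return True if the last `sustained_points` samples all exceed `threshold`."""
    if len(samples) < sustained_points:
        return False
    recent = samples[-sustained_points:]
    return all(value > threshold for value in recent)

# Example: twelve consecutive samples above 80% suggests scaling out.
utilization = [0.62, 0.71, 0.84, 0.88, 0.91, 0.86, 0.83,
               0.85, 0.87, 0.90, 0.92, 0.89, 0.84, 0.86]
print(needs_capacity_expansion(utilization))  # True
```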
Query Performance Monitoring
Queries Per Second (QPS)
Track the number of SQL statements processed per second:
- SELECT QPS: Read operations throughput
- INSERT QPS: Data insertion rates
- UPDATE QPS: Modification operations
- DELETE QPS: Data removal operations
- COPY QPS: Bulk data loading operations
SQL Latency Metrics
Monitor query execution times with percentile-based measurements:
- P99 Latency: 99th percentile response time
- P90 Latency: 90th percentile response time
- Available for each SQL operation type
- Extended abnormal latency requires investigation
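To make the percentile definitions concrete, here is a small sketch that computes P90 and P99 from raw latency samples using the common nearest-rank method. It is independent of how Tacnode computes these metrics internally:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which `pct` percent of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Example: latencies in milliseconds for 1,000 queries.
latencies_ms = [5] * 890 + [20] * 99 + [150] * 11
print(percentile(latencies_ms, 90))  # 20  -> P90: 90% of queries finish within 20 ms
print(percentile(latencies_ms, 99))  # 150 -> P99: the slowest 1% take up to 150 ms
```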
Network and Connection Monitoring
Network Throughput
- Bytes received and sent per second
- Identifies network bottlenecks
- Tracks data transfer patterns
- Helps size network capacity
Connection Management
- Active Connections: Currently executing queries
- Idle Connections: Established but inactive connections
- Total Connections: Overall connection pool usage
- Connection pool optimization insights
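For ad-hoc inspection alongside these metrics, a sketch like the one below can break connections down by state. It assumes Tacnode exposes a PostgreSQL-compatible pg_stat_activity view and uses the psycopg2 driver; the connection string is a hypothetical placeholder:

```python
import psycopg2  # assumes a PostgreSQL-compatible SQL endpoint

def connection_breakdown(dsn):
    """Return a {state: count} map from pg_stat_activity, e.g. {'active': 12, 'idle': 40}."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT state, count(*) FROM pg_stat_activity "
                "WHERE state IS NOT NULL GROUP BY state"
            )
            return dict(cur.fetchall())
    finally:
        conn.close()

# Example (hypothetical connection string):
# print(connection_breakdown("host=<tacnode-endpoint> port=5432 dbname=postgres user=monitor"))
```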
Error and Performance Tracking
Failed Query Count
- Number of failed SQL statements per second
- Sudden increases indicate system or application issues
- Requires correlation with business and system conditions
- Essential for troubleshooting
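One simple way to turn these counters into a signal is an error-rate ratio that combines the failed-query QPS with the per-type QPS metrics. In the sketch below, `get_metric` is a hypothetical lookup function standing in for however you retrieve metric values:

```python
QPS_METRICS = [
    "nodegroup_select_qps",
    "nodegroup_insert_qps",
    "nodegroup_update_qps",
    "nodegroup_delete_qps",
    "nodegroup_copy_qps",
]

def error_rate(get_metric):
    """Failed statements as a fraction of all statements, given a metric lookup function."""
    total = sum(get_metric(name) for name in QPS_METRICS)
    failed = get_metric("nodegroup_failure_qps")
    return failed / total if total else 0.0

# Example with canned values standing in for a real metrics source:
sample = {"nodegroup_select_qps": 1000, "nodegroup_insert_qps": 200,
          "nodegroup_update_qps": 50, "nodegroup_delete_qps": 10,
          "nodegroup_copy_qps": 1, "nodegroup_failure_qps": 5}
print(f"{error_rate(sample.get):.2%}")  # ~0.40%
```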
Affected Rows
Tracks data modification impact:
- Rows affected by INSERT operations
- Rows affected by UPDATE operations
- Rows affected by DELETE operations
- Rows affected by COPY operations
Database Storage Metrics
Storage Size Tracking
Nodegroup-Level Storage
- Total storage used by all databases in each nodegroup
- Helps with capacity planning
- Tracks storage consumption patterns
Database-Level Storage
- Individual database storage consumption
- Includes table data, indexes, and transaction logs
- Affected by data ingestion, modifications, indexing, and replication
Storage Growth Factors
- Data insertions and updates
- Index creation and maintenance
- Transaction log retention
- Schema changes and reorganization
- Backup and snapshot operations
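For capacity planning, a rough growth-rate estimate can be derived from two storage-size samples. This is a simple linear-extrapolation sketch over sampled nodegroup_size_bytes values, not a built-in Tacnode feature:

```python
def days_until_full(size_then_bytes, size_now_bytes, elapsed_days, capacity_bytes):
    """Linearly extrapolate current growth to estimate remaining runway in days."""
    growth_per_day = (size_now_bytes - size_then_bytes) / elapsed_days
    if growth_per_day <= 0:
        return float("inf")  # storage is not growing
    return (capacity_bytes - size_now_bytes) / growth_per_day

# Example: 1.2 TiB last week, 1.5 TiB today, 4 TiB provisioned.
TIB = 1024 ** 4
print(round(days_until_full(1.2 * TIB, 1.5 * TIB, 7, 4 * TIB)))  # ~58 days
```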
Monitoring Metrics Reference
Nodegroup Computation Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_expect_units | Expected Units | gauge | Target unit count for the nodegroup |
nodegroup_running_units | Running Units | gauge | Current units in normal service |
nodegroup_resource_percent_normalized | Utilization | gauge | Combined CPU and memory utilization percentage |
Query Performance Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_select_qps | SELECT QPS | gauge | SELECT statements per second |
nodegroup_insert_qps | INSERT QPS | gauge | INSERT statements per second |
nodegroup_update_qps | UPDATE QPS | gauge | UPDATE statements per second |
nodegroup_delete_qps | DELETE QPS | gauge | DELETE statements per second |
nodegroup_copy_qps | COPY QPS | gauge | COPY statements per second |
nodegroup_failure_qps | Failed Query QPS | gauge | Failed statements per second |
Latency Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_sql_service_p99_latency | SQL Latency (P99) | gauge | 99th percentile query execution time |
nodegroup_sql_service_p90_latency | SQL Latency (P90) | gauge | 90th percentile query execution time |
nodegroup_sql_select_p99_latency | SELECT Latency (P99) | gauge | 99th percentile SELECT execution time |
nodegroup_sql_insert_p99_latency | INSERT Latency (P99) | gauge | 99th percentile INSERT execution time |
Network and Connection Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_network_receive_bytes | Network Throughput (Receive) | gauge | Bytes received per second |
nodegroup_network_send_bytes | Network Throughput (Send) | gauge | Bytes sent per second |
nodegroup_active_sql_connections | Active SQL Connections | gauge | Currently active database connections |
nodegroup_idle_sql_connections | Idle SQL Connections | gauge | Idle database connections |
Storage Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_size_bytes | Database Storage Size | gauge | Storage used by each database |
backup_size_bytes | Backup Storage Size | gauge | Storage used by backup files |
Data Sync Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
datasync_source_idle_time | Source Idle Time | gauge | Time since last data source activity |
datasync_emit_event_time | Event Processing Delay | gauge | Delay between event time and processing |
datasync_source_heartbeat_time | Source Heartbeat Time | gauge | Time since the most recent attempt to read the source |
datasync_rps | Records Per Second | gauge | Data sync throughput in records |
datasync_bps | Bytes Per Second | gauge | Data sync throughput in bytes |
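Concretely, the idle-time and delay metrics are differences between timestamps. The sketch below shows how such values are derived from an event's business time and its receipt time; it is an illustrative computation following the metric definitions, not Tacnode's internal implementation:

```python
from datetime import datetime, timezone

def source_idle_seconds(last_event_time, now=None):
    """datasync_source_idle_time: current time minus the last record's event time."""
    now = now or datetime.now(timezone.utc)
    return (now - last_event_time).total_seconds()

def emit_event_delay_seconds(received_at, event_business_time):
    """datasync_emit_event_time: receipt timestamp minus the event's business time."""
    return (received_at - event_business_time).total_seconds()

# Example: an event stamped 12:00:00 arrived at 12:00:03, and nothing has arrived since.
event_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
received = datetime(2024, 1, 1, 12, 0, 3, tzinfo=timezone.utc)
print(emit_event_delay_seconds(received, event_time))  # 3.0
print(source_idle_seconds(event_time,
                          now=datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)))  # 300.0
```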
Best Practices for Monitoring
Setting Up Effective Monitoring
- Establish Baselines
  - Monitor normal operating patterns
  - Document typical resource utilization ranges
  - Identify seasonal or cyclical patterns
- Configure Alerting Thresholds
  - Set up proactive alerts for key metrics
  - Avoid alert fatigue with appropriate thresholds (see the sketch after this list)
  - Implement escalation procedures for critical issues
- Regular Performance Reviews
  - Analyze trends over time
  - Identify optimization opportunities
  - Plan capacity expansions proactively
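As a concrete illustration of the first two practices, the sketch below derives a baseline band from historical samples and raises an alert only when a metric stays outside that band for several consecutive checks, which helps keep thresholds from generating noise. The function names, the three-sigma band, and the three-sample persistence rule are illustrative assumptions, not Tacnode defaults:

```python
from statistics import mean, stdev

def baseline_band(history, sigmas=3.0):
    """Expected range for a metric, derived from historical samples."""
    mu, sd = mean(history), stdev(history)
    return mu - sigmas * sd, mu + sigmas * sd

def should_alert(recent, band, min_consecutive=3):
    """Alert only if the last `min_consecutive` samples all fall outside the band."""
    low, high = band
    tail = recent[-min_consecutive:]
    return len(tail) == min_consecutive and all(v < low or v > high for v in tail)

history = [0.55, 0.60, 0.58, 0.62, 0.57, 0.61, 0.59, 0.60]  # normal utilization
band = baseline_band(history)
print(should_alert([0.63, 0.90, 0.92, 0.95], band))  # True: sustained breach
print(should_alert([0.60, 0.91, 0.58, 0.61], band))  # False: a single spike
```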
Troubleshooting with Metrics
High Resource Utilization:
- Monitor CPU and memory trends
- Correlate with query patterns and workload changes
- Consider scaling or workload optimization
Query Performance Issues:
- Analyze latency percentiles by operation type
- Identify long-running or inefficient queries
- Correlate with resource utilization metrics
Storage Growth Concerns:
- Track database size growth rates
- Identify tables or databases with rapid expansion
- Plan storage capacity and cleanup strategies
Integration with External Systems
Metrics Export:
- Export metrics data for external analysis
- Integration with existing monitoring systems
- Custom dashboards and visualization tools
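If you export metrics into an external stack, one common pattern is to re-expose them in Prometheus format so existing dashboards and alert rules can consume them. The sketch below uses the prometheus_client library; `fetch_tacnode_metrics()` is a placeholder you would replace with however you actually pull metric values from Tacnode, so treat this as an integration sketch rather than an official exporter:

```python
import time
from prometheus_client import Gauge, start_http_server

# Re-export a few Tacnode metrics under their original names.
GAUGES = {name: Gauge(name, f"Tacnode metric {name}")
          for name in ("nodegroup_resource_percent_normalized",
                       "nodegroup_select_qps",
                       "nodegroup_failure_qps")}

def fetch_tacnode_metrics():
    """Placeholder: return {metric_key: value} pulled from Tacnode's monitoring data."""
    return {"nodegroup_resource_percent_normalized": 0.72,
            "nodegroup_select_qps": 850.0,
            "nodegroup_failure_qps": 0.0}

if __name__ == "__main__":
    start_http_server(9200)        # Prometheus scrapes http://<host>:9200/metrics
    while True:
        for key, value in fetch_tacnode_metrics().items():
            GAUGES[key].set(value)
        time.sleep(15)             # align with your scrape interval
```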
Automated Response:
- Set up automated scaling based on metrics
- Implement self-healing procedures
- Create operational runbooks linked to specific metric patterns
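Automated scaling policies usually reduce to a small decision function over recent metrics. The sketch below shows the shape of such a policy; `set_nodegroup_target_units` is a hypothetical stand-in for whatever scaling interface (console, API, or infrastructure tooling) you actually use, and the thresholds are illustrative:

```python
def decide_target_units(current_units, utilization_samples,
                        scale_up_at=0.8, scale_down_at=0.4,
                        min_units=1, max_units=16):
    """Return a new target unit count based on average recent utilization."""
    avg = sum(utilization_samples) / len(utilization_samples)
    if avg > scale_up_at:
        return min(current_units + 1, max_units)
    if avg < scale_down_at:
        return max(current_units - 1, min_units)
    return current_units

def set_nodegroup_target_units(units):
    """Hypothetical hook: call your scaling API or tooling here."""
    print(f"requesting nodegroup target of {units} units")

current = 5
target = decide_target_units(current, [0.85, 0.88, 0.90, 0.87])
if target != current:
    set_nodegroup_target_units(target)  # -> requesting nodegroup target of 6 units
```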
This comprehensive monitoring approach ensures proactive management of your Tacnode data warehouse environment.
Detailed Metric Reference
The following tables list each metric with a sample value and a fuller description.
Nodegroup Computation Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
nodegroup_expect_units | Expected Units | gauge | 5 | Target/Current Units: the target and current unit count of the nodegroup. The current unit count reflects the number of units in normal service. During cluster creation or scaling, target units may temporarily exceed current units before reaching equilibrium. If target units != current units persists, it may indicate an abnormal state; contact support if this occurs. |
nodegroup_running_units | Running Units | gauge | 5 | |
nodegroup_resource_percent_normalized | Utilization | gauge | 0.8 | Resource Utilization: overall resource utilization of the nodegroup, incorporating both CPU and memory usage. If persistently above 80%, consider scaling the cluster. |
nodegroup_select_qps | Select QPS | gauge | 1000 | QPS: number of SQL statements handled per second by the nodegroup, broken out by SELECT, UPDATE, INSERT, DELETE, and COPY. |
nodegroup_update_qps | Update QPS | gauge | 1000 | |
nodegroup_insert_qps | Insert QPS | gauge | 1000 | |
nodegroup_delete_qps | Delete QPS | gauge | 1000 | |
nodegroup_copy_qps | Copy QPS | gauge | 1 | |
nodegroup_failure_qps | Failed Query QPS | gauge | 1 | Failed Queries: number of failed SQL statements per second executed by the nodegroup. Investigate surges in conjunction with business and system status. |
nodegroup_insert_affected_rows | Rows Affected by Insert | gauge | 10000 | Rows Affected: number of rows impacted per second by INSERT, UPDATE, DELETE, or COPY operations executed by the nodegroup. If anomalies or unexpected results occur, analyze further in conjunction with application and system status. |
nodegroup_update_affected_rows | Rows Affected by Update | gauge | 10000 | |
nodegroup_delete_affected_rows | Rows Affected by Delete | gauge | 10000 | |
nodegroup_copy_affected_rows | Rows Affected by Copy | gauge | 10000 | |
nodegroup_sql_select_p90_latency | Select Latency (P90) | gauge | 38,818,282.52 (ns) | SQL Latency by Type: P90 and P99 latency for each type of SQL operation (SELECT, INSERT, UPDATE, DELETE, COPY) in the nodegroup. Extended abnormal values (lasting several minutes or more) should be investigated in relation to business processes and system conditions. |
nodegroup_sql_select_p99_latency | Select Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_insert_p90_latency | Insert Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_insert_p99_latency | Insert Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_update_p90_latency | Update Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_update_p99_latency | Update Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_delete_p90_latency | Delete Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_delete_p99_latency | Delete Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_copy_p90_latency | Copy Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_copy_p99_latency | Copy Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_service_p90_latency | SQL Latency (P90) | gauge | 38,818,282.52 (ns) | SQL Latency: overall P90 and P99 latency across all SQL statements handled by the nodegroup. |
nodegroup_sql_service_p99_latency | SQL Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_network_receive_bytes | Network Throughput (Receive) | gauge | 92,468.533333 (bytes) | Network Throughput: nodegroup network throughput, including bytes received and sent per second. |
nodegroup_network_send_bytes | Network Throughput (Send) | gauge | 92,468.533333 (bytes) | |
nodegroup_active_sql_connections | Active SQL Connections | gauge | 100 | Connections: SQL connections on the nodegroup, including active and idle connections. |
nodegroup_idle_sql_connections | Idle SQL Connections | gauge | 20 | |
Database Storage Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
nodegroup_size_bytes | Database Storage Size | gauge | 1,073,741,824 (1 GiB) | Per-database storage size: the storage used by each database, including all underlying physical storage: table data, indexes, and WAL. Storage size is affected by insertions, updates, index rebuilds, transactions, schema changes, replication, and snapshots. |
Backup Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
backup_size_bytes | Backup Storage Size | gauge | 1,073,741,824 (1 GiB) | Storage size per backup. |
Data Sync Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
datasync_source_idle_time | Source Idle Time | gauge | todo | Source idle time (seconds): current system time minus the last record's event time. Increases when there is no incoming data. |
datasync_emit_event_time | Event Processing Delay | gauge | todo | Delay of the most recently received data (seconds): the last system receipt timestamp minus the last event's business time. Does not increase when there is no data at the source. |
datasync_source_heartbeat_time | Source Heartbeat Time | gauge | todo | Source heartbeat time (seconds): metric generation time minus the most recent attempt to read the source. Growth indicates downstream backpressure. |
datasync_rps | Records Per Second | gauge | todo | Records synchronized per second. |
datasync_bps | Bytes Per Second | gauge | todo | Bytes synchronized per second. |