Monitoring and Metrics
Effective monitoring is essential for maintaining optimal performance and reliability in your Tacnode data warehouse. This guide covers the comprehensive metrics available for tracking system health, performance, and resource utilization.
Overview of Monitoring Capabilities
Key Monitoring Areas
- Nodegroup Performance: CPU, memory, and network utilization
- Query Performance: Latency, throughput, and error rates
- Database Storage: Size tracking and growth patterns
- System Health: Connection pools, failed operations, and resource contention
Monitoring Dashboard Features
- Real-time metrics visualization
- Historical trend analysis
- Configurable time ranges
- Exportable data for external analysis
Nodegroup Metrics
Monitor the health and performance of your compute resources through comprehensive nodegroup metrics.
Resource Utilization
Target and Current Size
- Shows the nodegroup's planned (target) versus actual unit count
- The current size reflects the number of units in normal service
- Temporary differences during scaling operations are normal
- Persistent mismatches may indicate a system issue; contact support if the gap does not resolve
Resource Utilization
- Combined CPU and memory consumption percentage
- Values consistently above 80% indicate a need for capacity expansion (see the sketch below)
- Helps identify optimal scaling points
- Tracks resource efficiency trends
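To illustrate the "consistently above 80%" guideline, here is a minimal sketch, assuming you sample the normalized utilization metric at a fixed interval. The function name and the 12-sample window are illustrative assumptions, not part of any Tacnode API:

```python
# Hypothetical helper: decide whether sustained utilization warrants scaling.
# `samples` are nodegroup_resource_percent_normalized values (0.0 - 1.0),
# collected at a fixed interval, most recent last.

def needs_capacity_expansion(samples, threshold=0.8, sustained_points=12):
    """Return True if the last `sustained_points` samples all exceed `threshold`."""
    if len(samples) < sustained_points:
        return False
    recent = samples[-sustained_points:]
    return all(value > threshold for value in recent)

# Example: twelve consecutive samples above 80% suggests scaling out.
utilization = [0.62, 0.71, 0.84, 0.88, 0.91, 0.86, 0.83,
               0.85, 0.87, 0.90, 0.92, 0.89, 0.84, 0.86]
print(needs_capacity_expansion(utilization))  # True
```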
Query Performance Monitoring
Queries Per Second (QPS)
Track the number of SQL statements processed per second:
- SELECT QPS: Read operations throughput
- INSERT QPS: Data insertion rates
- UPDATE QPS: Modification operations
- DELETE QPS: Data removal operations
- COPY QPS: Bulk data loading operations
SQL Latency Metrics
Monitor query execution times with percentile-based measurements:
- P99 Latency: 99th percentile response time
- P90 Latency: 90th percentile response time
- Available for each SQL operation type
- Extended abnormal latency requires investigation
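To make the percentile definitions concrete, here is a small sketch that computes P90 and P99 from raw latency samples using the common nearest-rank method. It is independent of how Tacnode computes these metrics internally:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which `pct` percent of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Example: latencies in milliseconds for 1,000 queries.
latencies_ms = [5] * 890 + [20] * 99 + [150] * 11
print(percentile(latencies_ms, 90))  # 20  -> P90: 90% of queries finish within 20 ms
print(percentile(latencies_ms, 99))  # 150 -> P99: the slowest 1% take up to 150 ms
```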
Network and Connection Monitoring
Network Throughput
- Bytes received and sent per second
- Identifies network bottlenecks
- Tracks data transfer patterns
- Helps size network capacity
Connection Management
- Active Connections: Currently executing queries
- Idle Connections: Established but inactive connections
- Total Connections: Overall connection pool usage
- Connection pool optimization insights
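For ad-hoc inspection alongside these metrics, a sketch like the one below can break connections down by state. It assumes Tacnode exposes a PostgreSQL-compatible pg_stat_activity view and uses the psycopg2 driver; the connection string is a hypothetical placeholder:

```python
import psycopg2  # assumes a PostgreSQL-compatible SQL endpoint

def connection_breakdown(dsn):
    """Return a {state: count} map from pg_stat_activity, e.g. {'active': 12, 'idle': 40}."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT state, count(*) FROM pg_stat_activity "
                "WHERE state IS NOT NULL GROUP BY state"
            )
            return dict(cur.fetchall())
    finally:
        conn.close()

# Example (hypothetical connection string):
# print(connection_breakdown("host=<tacnode-endpoint> port=5432 dbname=postgres user=monitor"))
```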
Error and Performance Tracking
Failed Query Count
- Number of failed SQL statements per second
- Sudden increases indicate system or application issues
- Requires correlation with business and system conditions
- Essential for troubleshooting
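One simple way to turn these counters into a signal is an error-rate ratio that combines the failed-query QPS with the per-type QPS metrics. In the sketch below, `get_metric` is a hypothetical lookup function standing in for however you retrieve metric values:

```python
QPS_METRICS = [
    "nodegroup_select_qps",
    "nodegroup_insert_qps",
    "nodegroup_update_qps",
    "nodegroup_delete_qps",
    "nodegroup_copy_qps",
]

def error_rate(get_metric):
    """Failed statements as a fraction of all statements, given a metric lookup function."""
    total = sum(get_metric(name) for name in QPS_METRICS)
    failed = get_metric("nodegroup_failure_qps")
    return failed / total if total else 0.0

# Example with canned values standing in for a real metrics source:
sample = {"nodegroup_select_qps": 1000, "nodegroup_insert_qps": 200,
          "nodegroup_update_qps": 50, "nodegroup_delete_qps": 10,
          "nodegroup_copy_qps": 1, "nodegroup_failure_qps": 5}
print(f"{error_rate(sample.get):.2%}")  # ~0.40%
```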
Affected Rows
Tracks data modification impact:
- Rows affected by INSERT operations
- Rows affected by UPDATE operations
- Rows affected by DELETE operations
- Rows affected by COPY operations
Database Storage Metrics
Storage Size Tracking
Nodegroup-Level Storage
- Total storage used by all databases in each nodegroup
- Helps with capacity planning
- Tracks storage consumption patterns
Database-Level Storage
- Individual database storage consumption
- Includes table data, indexes, and transaction logs
- Affected by data ingestion, modifications, indexing, and replication
Storage Growth Factors
- Data insertions and updates
- Index creation and maintenance
- Transaction log retention
- Schema changes and reorganization
- Backup and snapshot operations
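For capacity planning, a rough growth-rate estimate can be derived from two storage-size samples. This is a simple linear-extrapolation sketch over sampled nodegroup_size_bytes values, not a built-in Tacnode feature:

```python
def days_until_full(size_then_bytes, size_now_bytes, elapsed_days, capacity_bytes):
    """Linearly extrapolate current growth to estimate remaining runway in days."""
    growth_per_day = (size_now_bytes - size_then_bytes) / elapsed_days
    if growth_per_day <= 0:
        return float("inf")  # storage is not growing
    return (capacity_bytes - size_now_bytes) / growth_per_day

# Example: 1.2 TiB last week, 1.5 TiB today, 4 TiB provisioned.
TIB = 1024 ** 4
print(round(days_until_full(1.2 * TIB, 1.5 * TIB, 7, 4 * TIB)))  # ~58 days
```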
Monitoring Metrics Reference
Nodegroup Computation Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_expect_units | Expected Units | gauge | Target unit count for the nodegroup |
nodegroup_running_units | Running Units | gauge | Current units in normal service |
nodegroup_resource_percent_normalized | Utilization | gauge | Combined CPU and memory utilization percentage |
Query Performance Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_select_qps | SELECT QPS | gauge | SELECT statements per second |
nodegroup_insert_qps | INSERT QPS | gauge | INSERT statements per second |
nodegroup_update_qps | UPDATE QPS | gauge | UPDATE statements per second |
nodegroup_delete_qps | DELETE QPS | gauge | DELETE statements per second |
nodegroup_copy_qps | COPY QPS | gauge | COPY statements per second |
nodegroup_failure_qps | Failed Query QPS | gauge | Failed statements per second |
Latency Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_sql_service_p99_latency | SQL Latency (P99) | gauge | 99th percentile query execution time |
nodegroup_sql_service_p90_latency | SQL Latency (P90) | gauge | 90th percentile query execution time |
nodegroup_sql_select_p99_latency | SELECT Latency (P99) | gauge | 99th percentile SELECT execution time |
nodegroup_sql_insert_p99_latency | INSERT Latency (P99) | gauge | 99th percentile INSERT execution time |
Network and Connection Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_network_receive_bytes | Network Throughput (Receive) | gauge | Bytes received per second |
nodegroup_network_send_bytes | Network Throughput (Send) | gauge | Bytes sent per second |
nodegroup_active_sql_connections | Active SQL Connections | gauge | Currently active database connections |
nodegroup_idle_sql_connections | Idle SQL Connections | gauge | Idle database connections |
Storage Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
nodegroup_size_bytes | Database Storage Size | gauge | Storage used by each database |
backup_size_bytes | Backup Storage Size | gauge | Storage used by backup files |
Data Sync Metrics
Metric Key | Metric Name | Type | Description |
---|---|---|---|
datasync_source_idle_time | Source Idle Time | gauge | Time since last data source activity |
datasync_emit_event_time | Event Processing Delay | gauge | Delay between event time and processing |
datasync_source_heartbeat_time | Source Heartbeat Time | gauge | Time since the most recent attempt to read the source |
datasync_rps | Records Per Second | gauge | Data sync throughput in records |
datasync_bps | Bytes Per Second | gauge | Data sync throughput in bytes |
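Concretely, the idle-time and delay metrics are differences between timestamps. The sketch below shows how such values are derived from an event's business time and its receipt time; it is an illustrative computation following the metric definitions, not Tacnode's internal implementation:

```python
from datetime import datetime, timezone

def source_idle_seconds(last_event_time, now=None):
    """datasync_source_idle_time: current time minus the last record's event time."""
    now = now or datetime.now(timezone.utc)
    return (now - last_event_time).total_seconds()

def emit_event_delay_seconds(received_at, event_business_time):
    """datasync_emit_event_time: receipt timestamp minus the event's business time."""
    return (received_at - event_business_time).total_seconds()

# Example: an event stamped 12:00:00 arrived at 12:00:03, and nothing has arrived since.
event_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
received = datetime(2024, 1, 1, 12, 0, 3, tzinfo=timezone.utc)
print(emit_event_delay_seconds(received, event_time))  # 3.0
print(source_idle_seconds(event_time,
                          now=datetime(2024, 1, 1, 12, 5, 0, tzinfo=timezone.utc)))  # 300.0
```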
Best Practices for Monitoring
Setting Up Effective Monitoring
- Establish Baselines
  - Monitor normal operating patterns
  - Document typical resource utilization ranges
  - Identify seasonal or cyclical patterns
- Configure Alerting Thresholds
  - Set up proactive alerts for key metrics
  - Avoid alert fatigue with appropriate thresholds (see the sketch after this list)
  - Implement escalation procedures for critical issues
- Regular Performance Reviews
  - Analyze trends over time
  - Identify optimization opportunities
  - Plan capacity expansions proactively
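As a concrete illustration of the first two practices, the sketch below derives a baseline band from historical samples and raises an alert only when a metric stays outside that band for several consecutive checks, which helps keep thresholds from generating noise. The function names, the three-sigma band, and the three-sample persistence rule are illustrative assumptions, not Tacnode defaults:

```python
from statistics import mean, stdev

def baseline_band(history, sigmas=3.0):
    """Expected range for a metric, derived from historical samples."""
    mu, sd = mean(history), stdev(history)
    return mu - sigmas * sd, mu + sigmas * sd

def should_alert(recent, band, min_consecutive=3):
    """Alert only if the last `min_consecutive` samples all fall outside the band."""
    low, high = band
    tail = recent[-min_consecutive:]
    return len(tail) == min_consecutive and all(v < low or v > high for v in tail)

history = [0.55, 0.60, 0.58, 0.62, 0.57, 0.61, 0.59, 0.60]  # normal utilization
band = baseline_band(history)
print(should_alert([0.63, 0.90, 0.92, 0.95], band))  # True: sustained breach
print(should_alert([0.60, 0.91, 0.58, 0.61], band))  # False: a single spike
```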
Troubleshooting with Metrics
High Resource Utilization:
- Monitor CPU and memory trends
- Correlate with query patterns and workload changes
- Consider scaling or workload optimization
Query Performance Issues:
- Analyze latency percentiles by operation type
- Identify long-running or inefficient queries
- Correlate with resource utilization metrics
Storage Growth Concerns:
- Track database size growth rates
- Identify tables or databases with rapid expansion
- Plan storage capacity and cleanup strategies
Integration with External Systems
Metrics Export:
- Export metrics data for external analysis
- Integration with existing monitoring systems
- Custom dashboards and visualization tools
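If you export metrics into an external stack, one common pattern is to re-expose them in Prometheus format so existing dashboards and alert rules can consume them. The sketch below uses the prometheus_client library; `fetch_tacnode_metrics()` is a placeholder you would replace with however you actually pull metric values from Tacnode, so treat this as an integration sketch rather than an official exporter:

```python
import time
from prometheus_client import Gauge, start_http_server

# Re-export a few Tacnode metrics under their original names.
GAUGES = {name: Gauge(name, f"Tacnode metric {name}")
          for name in ("nodegroup_resource_percent_normalized",
                       "nodegroup_select_qps",
                       "nodegroup_failure_qps")}

def fetch_tacnode_metrics():
    """Placeholder: return {metric_key: value} pulled from Tacnode's monitoring data."""
    return {"nodegroup_resource_percent_normalized": 0.72,
            "nodegroup_select_qps": 850.0,
            "nodegroup_failure_qps": 0.0}

if __name__ == "__main__":
    start_http_server(9200)        # Prometheus scrapes http://<host>:9200/metrics
    while True:
        for key, value in fetch_tacnode_metrics().items():
            GAUGES[key].set(value)
        time.sleep(15)             # align with your scrape interval
```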
Automated Response:
- Set up automated scaling based on metrics
- Implement self-healing procedures
- Create operational runbooks linked to specific metric patterns
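Automated scaling policies usually reduce to a small decision function over recent metrics. The sketch below shows the shape of such a policy; `set_nodegroup_target_units` is a hypothetical stand-in for whatever scaling interface (console, API, or infrastructure tooling) you actually use, and the thresholds are illustrative:

```python
def decide_target_units(current_units, utilization_samples,
                        scale_up_at=0.8, scale_down_at=0.4,
                        min_units=1, max_units=16):
    """Return a new target unit count based on average recent utilization."""
    avg = sum(utilization_samples) / len(utilization_samples)
    if avg > scale_up_at:
        return min(current_units + 1, max_units)
    if avg < scale_down_at:
        return max(current_units - 1, min_units)
    return current_units

def set_nodegroup_target_units(units):
    """Hypothetical hook: call your scaling API or tooling here."""
    print(f"requesting nodegroup target of {units} units")

current = 5
target = decide_target_units(current, [0.85, 0.88, 0.90, 0.87])
if target != current:
    set_nodegroup_target_units(target)  # -> requesting nodegroup target of 6 units
```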
This comprehensive monitoring approach ensures proactive management of your Tacnode data warehouse environment.
Detailed Metric Reference
The following tables list each metric with a sample value and a fuller description.
Nodegroup Computation Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
nodegroup_expect_units | Expected Units | gauge | 5 | Target/Current Units: the target and current unit count of the nodegroup. The current unit count reflects the number of units in normal service. During cluster creation or scaling, target units may temporarily exceed current units before reaching equilibrium. If target units != current units persists, it may indicate an abnormal state; contact support if this occurs. |
nodegroup_running_units | Running Units | gauge | 5 | |
nodegroup_resource_percent_normalized | Utilization | gauge | 0.8 | Resource Utilization: overall resource utilization of the nodegroup, incorporating both CPU and memory usage. If persistently above 80%, consider scaling the cluster. |
nodegroup_select_qps | Select QPS | gauge | 1000 | QPS: number of SQL statements handled per second by the nodegroup, broken out by SELECT, UPDATE, INSERT, DELETE, and COPY. |
nodegroup_update_qps | Update QPS | gauge | 1000 | |
nodegroup_insert_qps | Insert QPS | gauge | 1000 | |
nodegroup_delete_qps | Delete QPS | gauge | 1000 | |
nodegroup_copy_qps | Copy QPS | gauge | 1 | |
nodegroup_failure_qps | Failed Query QPS | gauge | 1 | Failed Queries: number of failed SQL statements per second executed by the nodegroup. Investigate surges in conjunction with business and system status. |
nodegroup_insert_affected_rows | Rows Affected by Insert | gauge | 10000 | Rows Affected: number of rows impacted per second by INSERT, UPDATE, DELETE, or COPY operations executed by the nodegroup. If anomalies or unexpected results occur, analyze further in conjunction with application and system status. |
nodegroup_update_affected_rows | Rows Affected by Update | gauge | 10000 | |
nodegroup_delete_affected_rows | Rows Affected by Delete | gauge | 10000 | |
nodegroup_copy_affected_rows | Rows Affected by Copy | gauge | 10000 | |
nodegroup_sql_select_p90_latency | Select Latency (P90) | gauge | 38,818,282.52 (ns) | SQL Latency by Type: P90 and P99 latency for each type of SQL operation (SELECT, INSERT, UPDATE, DELETE, COPY) in the nodegroup. Extended abnormal values (lasting several minutes or more) should be investigated in relation to business processes and system conditions. |
nodegroup_sql_select_p99_latency | Select Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_insert_p90_latency | Insert Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_insert_p99_latency | Insert Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_update_p90_latency | Update Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_update_p99_latency | Update Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_delete_p90_latency | Delete Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_delete_p99_latency | Delete Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_copy_p90_latency | Copy Latency (P90) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_copy_p99_latency | Copy Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_sql_service_p90_latency | SQL Latency (P90) | gauge | 38,818,282.52 (ns) | SQL Latency: overall P90 and P99 latency across all SQL statements handled by the nodegroup. |
nodegroup_sql_service_p99_latency | SQL Latency (P99) | gauge | 38,818,282.52 (ns) | |
nodegroup_network_receive_bytes | Network Throughput (Receive) | gauge | 92,468.533333 (bytes) | Network Throughput: nodegroup network throughput, including bytes received and sent per second. |
nodegroup_network_send_bytes | Network Throughput (Send) | gauge | 92,468.533333 (bytes) | |
nodegroup_active_sql_connections | Active SQL Connections | gauge | 100 | Connections: SQL connections on the nodegroup, including active and idle connections. |
nodegroup_idle_sql_connections | Idle SQL Connections | gauge | 20 | |
Database Storage Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
nodegroup_size_bytes | Database Storage Size | gauge | 1,073,741,824 (1 GiB) | Per-database storage size: the storage used by each database, including all underlying physical storage: table data, indexes, and WAL. Storage size is affected by insertions, updates, index rebuilds, transactions, schema changes, replication, and snapshots. |
Backup Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
backup_size_bytes | Backup Storage Size | gauge | 1,073,741,824 (1 GiB) | Storage size per backup. |
Data Sync Metrics
Metric Key | Metric Name | Type | Sample Value | Description |
---|---|---|---|---|
datasync_source_idle_time | Source Idle Time | gauge | todo | Source idle time (seconds): current system time minus the last record's event time. Increases when there is no incoming data. |
datasync_emit_event_time | Event Processing Delay | gauge | todo | Delay of the most recently received data (seconds): the last system receipt timestamp minus the last event's business time. Does not increase when there is no data at the source. |
datasync_source_heartbeat_time | Source Heartbeat Time | gauge | todo | Source heartbeat time (seconds): metric generation time minus the most recent attempt to read the source. Growth indicates downstream backpressure. |
datasync_rps | Records Per Second | gauge | todo | Records synchronized per second. |
datasync_bps | Bytes Per Second | gauge | todo | Bytes synchronized per second. |