Monitoring and Metrics

Effective monitoring is essential for maintaining performance and reliability in your Tacnode data warehouse. This guide covers the metrics available for tracking system health, performance, and resource utilization.

Overview of Monitoring Capabilities

Key Monitoring Areas

  • Nodegroup Performance: CPU, memory, and network utilization
  • Query Performance: Latency, throughput, and error rates
  • Database Storage: Size tracking and growth patterns
  • System Health: Connection pools, failed operations, and resource contention

Monitoring Dashboard Features

  • Real-time metrics visualization
  • Historical trend analysis
  • Configurable time ranges
  • Exportable data for external analysis

Nodegroup Metrics

Monitor the health and performance of your compute resources through comprehensive nodegroup metrics.

Resource Utilization

Target and Current Size

  • Shows the nodegroup's target (planned) and current (actual) unit counts
  • Current size counts only the units in normal service
  • Temporary differences during scaling operations are normal
  • A mismatch that persists after scaling completes may indicate a system issue
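
The distinction between a temporary and a persistent mismatch can be sketched in code. The following is a minimal illustration, not a Tacnode API: it assumes you have already collected paired samples of nodegroup_expect_units and nodegroup_running_units, and the function name and the five-sample window are hypothetical choices.

```python
from typing import Sequence, Tuple

def mismatch_persists(
    samples: Sequence[Tuple[int, int]], min_consecutive: int = 5
) -> bool:
    """Return True if the latest `min_consecutive` samples all disagree.

    Each sample is (nodegroup_expect_units, nodegroup_running_units),
    ordered oldest to newest. The window size is an assumed value.
    """
    if len(samples) < min_consecutive:
        return False  # not enough history to call the mismatch persistent
    recent = samples[-min_consecutive:]
    return all(expected != running for expected, running in recent)

# Scale-up from 3 to 5 units that settles: a temporary, normal difference.
scaling = [(5, 3), (5, 4), (5, 5), (5, 5), (5, 5)]
# Nodegroup stuck at 4 of 5 units across six samples: worth investigating.
stuck = [(5, 4)] * 6

print(mismatch_persists(scaling))  # False
print(mismatch_persists(stuck))    # True
```

Choose the window so that it is longer than a typical scaling operation in your environment; otherwise normal scaling will be flagged.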

Resource Utilization

  • Combined CPU and memory consumption, expressed as a percentage
  • Utilization that stays above 80% indicates a need to expand capacity
  • Helps identify optimal scaling points
  • Tracks resource efficiency trends
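
"Consistently above 80%" needs a working definition before it can drive a decision. One possible sketch, under stated assumptions: utilization samples are fractions between 0 and 1 (the reference table's sample value for nodegroup_resource_percent_normalized is 0.8), and "consistently" means at least 90% of the samples in the window exceed the threshold. The function name and the 90% figure are hypothetical.

```python
from typing import Sequence

def sustained_high_utilization(
    values: Sequence[float], threshold: float = 0.8, min_fraction: float = 0.9
) -> bool:
    """Return True when at least `min_fraction` of samples exceed `threshold`.

    `values` are utilization samples as fractions (0.8 means 80%).
    """
    if not values:
        return False
    above = sum(1 for v in values if v > threshold)
    return above / len(values) >= min_fraction

busy = [0.85, 0.90, 0.82, 0.88, 0.91]   # every sample above 80%
spiky = [0.50, 0.95, 0.60, 0.55]        # a single short spike

print(sustained_high_utilization(busy))   # True
print(sustained_high_utilization(spiky))  # False
```

Counting the share of samples above the threshold, rather than reacting to any single sample, keeps short spikes from triggering a capacity change.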

Query Performance Monitoring

Queries Per Second (QPS)

Track the number of SQL statements processed per second:

  • SELECT QPS: Read operations throughput
  • INSERT QPS: Data insertion rates
  • UPDATE QPS: Modification operations
  • DELETE QPS: Data removal operations
  • COPY QPS: Bulk data loading operations

SQL Latency Metrics

Monitor query execution times with percentile-based measurements:

  • P99 Latency: 99th percentile response time
  • P90 Latency: 90th percentile response time
  • Available for each SQL operation type
  • Abnormal latency that persists for several minutes or more requires investigation
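
To make the percentile terminology concrete, here is a small sketch of the nearest-rank method applied to a list of latency samples. It illustrates what P90 and P99 mean; it is not a statement of how Tacnode computes these metrics. The helper names are hypothetical, and the nanosecond conversion reflects the "(ns)" unit shown with the latency sample values in the metrics reference.

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    `pct` percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def ns_to_ms(nanoseconds: float) -> float:
    """Convert a latency value from nanoseconds to milliseconds."""
    return nanoseconds / 1_000_000

latencies_ms = list(range(1, 101))  # 100 samples: 1 ms, 2 ms, ..., 100 ms
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
print(p90, p99)  # 90 99

print(round(ns_to_ms(38818282.52), 2))  # 38.82
```

P99 is always at least as large as P90 for the same samples, so a widening gap between the two points to a slow tail rather than a general slowdown.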

Network and Connection Monitoring

Network Throughput

  • Bytes received and sent per second
  • Identifies network bottlenecks
  • Tracks data transfer patterns
  • Helps size network capacity

Connection Management

  • Active Connections: Connections that are currently executing queries
  • Idle Connections: Established but inactive connections
  • Total Connections: Overall connection pool usage
  • Use these counts to guide connection pool sizing

Error and Performance Tracking

Failed Query Count

  • Number of failed SQL statements per second
  • Sudden increases indicate system or application issues
  • Correlate increases with business activity and system conditions before drawing conclusions
  • Essential for troubleshooting
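
A "sudden increase" can be checked by comparing the current failure rate with the recent average. The sketch below is one simple way to do that, assuming you have a short history of nodegroup_failure_qps samples; the function name, the multiplier of 3, and the floor of 1 failure per second are hypothetical tuning values.

```python
from statistics import mean
from typing import Sequence

def is_failure_spike(
    history: Sequence[float], current: float,
    factor: float = 3.0, floor: float = 1.0,
) -> bool:
    """Return True when `current` exceeds `factor` times the recent average.

    `floor` keeps a near-zero baseline from flagging trivial failure rates.
    """
    baseline = mean(history) if history else 0.0
    return current > max(baseline * factor, floor)

recent_failure_qps = [1, 1, 2, 1]  # average 1.25, so the threshold is 3.75

print(is_failure_spike(recent_failure_qps, 10))  # True
print(is_failure_spike(recent_failure_qps, 2))   # False
```

A flagged spike is a prompt to look at deployments, traffic changes, and system status at the same timestamp, not a diagnosis by itself.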

Affected Rows

Tracks data modification impact:

  • Rows affected by INSERT operations
  • Rows affected by UPDATE operations
  • Rows affected by DELETE operations
  • Rows affected by COPY operations

Database Storage Metrics

Storage Size Tracking

Nodegroup-Level Storage

  • Total storage used by all databases in each nodegroup
  • Helps with capacity planning
  • Tracks storage consumption patterns

Database-Level Storage

  • Individual database storage consumption
  • Includes table data, indexes, and transaction logs
  • Affected by data ingestion, modifications, indexing, and replication

Storage Growth Factors

  • Data insertions and updates
  • Index creation and maintenance
  • Transaction log retention
  • Schema changes and reorganization
  • Backup and snapshot operations
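
For capacity planning, storage size samples can be turned into a growth rate and a rough time-to-budget estimate. This is a sketch under assumptions: the samples are (day, size in bytes) pairs taken from a storage size metric, growth is treated as linear, and the "budget" is a planning figure you choose, not a Tacnode limit. All names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

GIB = 1024 ** 3

def daily_growth_bytes(samples: Sequence[Tuple[float, float]]) -> float:
    """Average growth in bytes per day between the first and last sample.

    Each sample is (day, size_bytes), ordered oldest to newest.
    """
    (first_day, first_size), (last_day, last_size) = samples[0], samples[-1]
    if last_day <= first_day:
        raise ValueError("samples must span a positive time range")
    return (last_size - first_size) / (last_day - first_day)

def days_until_budget(
    current: float, budget: float, growth_per_day: float
) -> Optional[float]:
    """Days until `budget` is reached, or None if storage is not growing."""
    if growth_per_day <= 0:
        return None
    return max(0.0, (budget - current) / growth_per_day)

history = [(0, 100 * GIB), (5, 130 * GIB), (10, 150 * GIB)]
growth = daily_growth_bytes(history)
print(growth / GIB)                                      # 5.0 GiB per day
print(days_until_budget(150 * GIB, 200 * GIB, growth))   # 10.0
```

A linear estimate is crude when growth comes in bursts (bulk loads, index rebuilds, snapshots), so treat the result as an early warning rather than a forecast.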

Monitoring Metrics Reference

Nodegroup Computation Metrics

| Metric Key | Metric Name | Type | Description |
| --- | --- | --- | --- |
| nodegroup_expect_units | Expected Units | gauge | Target unit count for the nodegroup |
| nodegroup_running_units | Running Units | gauge | Current units in normal service |
| nodegroup_resource_percent_normalized | Utilization | gauge | Combined CPU and memory utilization percentage |

Query Performance Metrics

| Metric Key | Metric Name | Type | Description |
| --- | --- | --- | --- |
| nodegroup_select_qps | SELECT QPS | gauge | SELECT statements per second |
| nodegroup_insert_qps | INSERT QPS | gauge | INSERT statements per second |
| nodegroup_update_qps | UPDATE QPS | gauge | UPDATE statements per second |
| nodegroup_delete_qps | DELETE QPS | gauge | DELETE statements per second |
| nodegroup_copy_qps | COPY QPS | gauge | COPY statements per second |
| nodegroup_failure_qps | Failed Query QPS | gauge | Failed statements per second |

Latency Metrics

| Metric Key | Metric Name | Type | Description |
| --- | --- | --- | --- |
| nodegroup_sql_service_p99_latency | SQL Latency (P99) | gauge | 99th percentile query execution time |
| nodegroup_sql_service_p90_latency | SQL Latency (P90) | gauge | 90th percentile query execution time |
| nodegroup_sql_select_p99_latency | SELECT Latency (P99) | gauge | 99th percentile SELECT execution time |
| nodegroup_sql_insert_p99_latency | INSERT Latency (P99) | gauge | 99th percentile INSERT execution time |

Network and Connection Metrics

| Metric Key | Metric Name | Type | Description |
| --- | --- | --- | --- |
| nodegroup_network_receive_bytes | Network Throughput (Receive) | gauge | Bytes received per second |
| nodegroup_network_send_bytes | Network Throughput (Send) | gauge | Bytes sent per second |
| nodegroup_active_sql_connections | Active SQL Connections | gauge | Currently active database connections |
| nodegroup_idle_sql_connections | Idle SQL Connections | gauge | Idle database connections |

Storage Metrics

| Metric Key | Metric Name | Type | Description |
| --- | --- | --- | --- |
| nodegroup_size_bytes | Database Storage Size | gauge | Storage used by each database |
| backup_size_bytes | Backup Storage Size | gauge | Storage used by backup files |

Data Sync Metrics

| Metric Key | Metric Name | Type | Description |
| --- | --- | --- | --- |
| datasync_source_idle_time | Source Idle Time | gauge | Time since last data source activity |
| datasync_emit_event_time | Event Processing Delay | gauge | Delay between event time and processing |
| datasync_rps | Records Per Second | gauge | Data sync throughput in records |
| datasync_bps | Bytes Per Second | gauge | Data sync throughput in bytes |

Best Practices for Monitoring

Setting Up Effective Monitoring

  1. Establish Baselines

    • Monitor normal operating patterns
    • Document typical resource utilization ranges
    • Identify seasonal or cyclical patterns
  2. Configure Alerting Thresholds

    • Set up proactive alerts for key metrics
    • Avoid alert fatigue with appropriate thresholds
    • Implement escalation procedures for critical issues
  3. Regular Performance Reviews

    • Analyze trends over time
    • Identify optimization opportunities
    • Plan capacity expansions proactively
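
Alert thresholds with escalation levels can be expressed as a small rule table. The following sketch is illustrative only: the AlertRule class and evaluate function are hypothetical, the 0.8 warning level comes from the 80% utilization guidance in this guide, and every other threshold is an assumed placeholder that you would replace with values derived from your own baselines.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Mapping, Optional

@dataclass(frozen=True)
class AlertRule:
    metric: str
    warning: float   # notify the owning team
    critical: float  # escalate, for example by paging on-call

    def severity(self, value: float) -> Optional[str]:
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return None

# Thresholds other than the 0.8 utilization warning are assumptions.
RULES = [
    AlertRule("nodegroup_resource_percent_normalized", warning=0.8, critical=0.95),
    AlertRule("nodegroup_failure_qps", warning=5, critical=50),
]

def evaluate(rules: Iterable[AlertRule], snapshot: Mapping[str, float]) -> Dict[str, str]:
    """Return {metric: severity} for every metric that breaches a threshold."""
    alerts = {}
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is None:
            continue  # metric missing from this snapshot
        level = rule.severity(value)
        if level is not None:
            alerts[rule.metric] = level
    return alerts

snapshot = {"nodegroup_resource_percent_normalized": 0.85, "nodegroup_failure_qps": 1}
print(evaluate(RULES, snapshot))
# {'nodegroup_resource_percent_normalized': 'warning'}
```

Keeping two levels per metric is one way to limit alert fatigue: warnings can go to a dashboard or chat channel, while only critical breaches interrupt someone.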

Troubleshooting with Metrics

High Resource Utilization:

  • Monitor CPU and memory trends
  • Correlate with query patterns and workload changes
  • Consider scaling or workload optimization

Query Performance Issues:

  • Analyze latency percentiles by operation type
  • Identify long-running or inefficient queries
  • Correlate with resource utilization metrics
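
Correlating latency with resource utilization can start with a plain Pearson coefficient over samples taken at the same timestamps. This is a generic statistical sketch, not a Tacnode feature; the function name and the sample values are made up for illustration.

```python
from math import sqrt
from typing import Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

utilization = [0.50, 0.60, 0.70, 0.80]   # utilization samples (fractions)
p99_latency_ms = [20, 30, 40, 50]        # P99 latency at the same timestamps

r = pearson(utilization, p99_latency_ms)
print(round(r, 6))  # 1.0
```

A coefficient near 1 suggests latency rises with load and points toward capacity; a coefficient near 0 suggests looking at specific queries or external dependencies instead. Correlation alone does not establish the cause.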

Storage Growth Concerns:

  • Track database size growth rates
  • Identify tables or databases with rapid expansion
  • Plan storage capacity and cleanup strategies

Integration with External Systems

Metrics Export:

  • Export metrics data for external analysis
  • Integration with existing monitoring systems
  • Custom dashboards and visualization tools

Automated Response:

  • Set up automated scaling based on metrics
  • Implement self-healing procedures
  • Create operational runbooks linked to specific metric patterns
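
As an illustration of metric-driven scaling, the sketch below computes a recommended unit count from current utilization. It borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × observed / target)); it is not Tacnode's scaling logic, and the function name, the 0.6 target, and the unit limits are assumptions. Applying the recommendation would still require whatever scaling mechanism your deployment provides.

```python
import math

def recommend_units(
    current_units: int, utilization: float,
    target: float = 0.6, min_units: int = 1, max_units: int = 16,
) -> int:
    """Recommend a unit count that would bring utilization near `target`.

    `utilization` and `target` are fractions (0.8 means 80%).
    """
    if current_units < 1:
        raise ValueError("current_units must be at least 1")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    desired = math.ceil(current_units * utilization / target)
    # Clamp to the allowed range so one bad sample cannot cause a runaway change.
    return max(min_units, min(max_units, desired))

print(recommend_units(5, 0.90))               # 8
print(recommend_units(4, 0.25))               # 2
print(recommend_units(10, 0.95, target=0.5))  # 16 (capped at max_units)
```

Feeding this function a smoothed utilization value, rather than the latest raw sample, avoids scaling back and forth on short spikes.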

This comprehensive monitoring approach ensures proactive management of your Tacnode data warehouse environment.

Detailed Metric Reference

The tables below list each metric with a sample value, followed by the labels attached to the metrics in that group and notes on how to read them.

Nodegroup Computation Metrics

| Metric Key | Metric Name | Type | Sample Value |
| --- | --- | --- | --- |
| nodegroup_expect_units | Expected Units | gauge | 5 |
| nodegroup_running_units | Running Units | gauge | 5 |
| nodegroup_resource_percent_normalized | Utilization | gauge | 0.8 |
| nodegroup_select_qps | Select QPS | gauge | 1000 |
| nodegroup_update_qps | Update QPS | gauge | 1000 |
| nodegroup_insert_qps | Insert QPS | gauge | 1000 |
| nodegroup_delete_qps | Delete QPS | gauge | 1000 |
| nodegroup_copy_qps | Copy QPS | gauge | 1 |
| nodegroup_failure_qps | Failed Query QPS | gauge | 1 |
| nodegroup_insert_affected_rows | Rows Affected by Insert | gauge | 10000 |
| nodegroup_update_affected_rows | Rows Affected by Update | gauge | 10000 |
| nodegroup_delete_affected_rows | Rows Affected by Delete | gauge | 10000 |
| nodegroup_copy_affected_rows | Rows Affected by Copy | gauge | 10000 |
| nodegroup_sql_select_p90_latency | Select Latency (P90) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_select_p99_latency | Select Latency (P99) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_insert_p90_latency | Insert Latency (P90) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_insert_p99_latency | Insert Latency (P99) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_update_p90_latency | Update Latency (P90) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_update_p99_latency | Update Latency (P99) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_delete_p90_latency | Delete Latency (P90) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_delete_p99_latency | Delete Latency (P99) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_copy_p90_latency | Copy Latency (P90) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_copy_p99_latency | Copy Latency (P99) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_service_p90_latency | SQL Latency (P90) | gauge | 38,818,282.52 (ns) |
| nodegroup_sql_service_p99_latency | SQL Latency (P99) | gauge | 38,818,282.52 (ns) |
| nodegroup_network_receive_bytes | Network Throughput (Receive) | gauge | 92,468.533333 (bytes) |
| nodegroup_network_send_bytes | Network Throughput (Send) | gauge | 92,468.533333 (bytes) |
| nodegroup_active_sql_connections | Active SQL Connections | gauge | 100 |
| nodegroup_idle_sql_connections | Idle SQL Connections | gauge | 20 |

Labels:

  • _cloudProvider
  • _region
  • _datacloudId
  • _id
  • _name

Target/Current Units: Displays the target and current unit counts of the nodegroup. The current unit count is the number of units in normal service. During cluster creation or scaling, the target count may temporarily exceed the current count until the nodegroup reaches equilibrium. If the two counts remain different, the nodegroup may be in an abnormal state; contact support if this occurs.

Resource Utilization: Indicates the overall resource utilization of the nodegroup, combining CPU and memory usage. If it stays above 80%, consider scaling the cluster.

QPS: Number of SQL statements the nodegroup handles per second, reported separately for Select, Update, Insert, Delete, and Copy statements.

Failed Queries: Number of SQL statements that fail per second on the nodegroup. If this value surges, investigate it together with business and system status.

Rows Affected: Number of rows affected per second by INSERT, UPDATE, DELETE, or COPY operations executed by the nodegroup. If anomalies or unexpected results occur, analyze them together with application and system status.

SQL Latency by Type: P99 and P90 latency for each type of SQL operation (SELECT, INSERT, UPDATE, DELETE, COPY) in the nodegroup. Abnormal values that last several minutes or more should be investigated against business processes and system conditions.

SQL Latency: Latency across all SQL statements in the nodegroup.

  • P99: 99th percentile of query execution duration measured in the nodegroup
  • P90: 90th percentile of query execution duration measured in the nodegroup

If P99 or P90 latency remains abnormal for several minutes, refer to business processes and system status during diagnosis.

Network Throughput: Displays nodegroup network throughput, including bytes received and bytes sent.

Connections: Shows SQL connections on the nodegroup, including active and idle connections.

Database Storage Metrics

| Metric Key | Metric Name | Type | Sample Value |
| --- | --- | --- | --- |
| nodegroup_size_bytes | Database Storage Size | gauge | 1,073,741,824 (1 GiB) |

Labels:

  • _cloudProvider
  • _region
  • _datacloudId
  • _handle

Per-database storage size: Displays the storage size for each database.

The value includes all underlying physical storage: table data, indexes, and the write-ahead log (WAL). Storage size is affected by insertions, updates, index rebuilds, transactions, schema changes, replication, and snapshots.

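Backup Metrics

| Metric Key | Metric Name | Type | Sample Value |
| --- | --- | --- | --- |
| backup_size_bytes | Backup Storage Size | gauge | 1,073,741,824 (1 GiB) |

Labels:

  • _cloudProvider
  • _region
  • _datacloudId
  • _handle

Backup storage size: Storage size per backup.
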
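Data Sync Metrics

| Metric Key | Type | Sample Value |
| --- | --- | --- |
| datasync_source_idle_time | gauge | todo |
| datasync_emit_event_time | gauge | todo |
| datasync_source_heartbeat_time | gauge | todo |
| datasync_rps | gauge | todo |
| datasync_bps | gauge | todo |

Labels:

  • _cloudProvider
  • _region
  • _datacloudId
  • _jobId
  • _jobName

datasync_source_idle_time: Source idle time (seconds). Calculated as the current system time minus the event time of the last record. Increases while no data is arriving.

datasync_emit_event_time: Delay of the most recently received data (seconds). Calculated as the timestamp at which the system received the last record minus that record's business (event) time. Does not increase while the source has no data.

datasync_source_heartbeat_time: Source heartbeat time (seconds). Calculated as the metric generation time minus the time of the most recent attempt to read the source. Growth indicates downstream backpressure.

datasync_rps: Records per second.

datasync_bps: Bytes per second.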
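
The three time-based data sync gauges are differences between timestamps, which a short sketch can make explicit. This mirrors the definitions given for these metrics and is not Tacnode's implementation: the SyncState class, its field names, and the use of epoch seconds are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class SyncState:
    last_event_time: float     # business (event) time of the last record, epoch seconds
    last_receive_time: float   # when the system received that record
    last_read_attempt: float   # most recent attempt to read the source

def datasync_gauges(state: SyncState, now: float) -> Dict[str, float]:
    """Derive the time-based gauges, in seconds, from the raw timestamps."""
    return {
        # Keeps growing while no new records arrive.
        "datasync_source_idle_time": now - state.last_event_time,
        # Fixed until the next record arrives; does not grow on an idle source.
        "datasync_emit_event_time": state.last_receive_time - state.last_event_time,
        # Grows when the reader stops polling, e.g. under downstream backpressure.
        "datasync_source_heartbeat_time": now - state.last_read_attempt,
    }

state = SyncState(last_event_time=1000.0, last_receive_time=1002.0, last_read_attempt=1055.0)
print(datasync_gauges(state, now=1060.0))
# {'datasync_source_idle_time': 60.0, 'datasync_emit_event_time': 2.0,
#  'datasync_source_heartbeat_time': 5.0}
```

Reading the three values together helps separate causes: a high idle time with a low heartbeat time suggests a quiet source, while a growing heartbeat time suggests the sync job itself is stalled.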