Monitoring & Observability Guide
HyperStudy uses Prometheus and Grafana for comprehensive monitoring, providing real-time insights into application performance, resource usage, and system health.
Architecture Overview
The monitoring stack consists of:
- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- Node Exporter: Hardware and OS metrics
- Kube-state-metrics: Kubernetes cluster metrics
- Application Metrics: Custom metrics from HyperStudy services
Accessing Monitoring Tools
Local Access (Port Forwarding)
Prometheus
# Forward Prometheus to localhost:9090
kubectl port-forward -n monitoring svc/prometheus 9090:9090
# Access Prometheus UI
open http://localhost:9090
Grafana
# Forward Grafana to localhost:3000
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Access Grafana
open http://localhost:3000
# Default credentials:
# Username: admin
# Password: <GRAFANA_ADMIN_PASSWORD from secrets>
Production Access
In production, Grafana is accessible through the ingress:
https://grafana.hyperstudy.app
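The ingress route can be provisioned along the lines of the following sketch (the TLS secret name and cert-manager issuer are illustrative assumptions, not confirmed HyperStudy settings; adjust to the cluster's actual ingress class and certificate setup):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # hypothetical issuer name
spec:
  rules:
    - host: grafana.hyperstudy.app
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: grafana
                port:
                  number: 3000
  tls:
    - hosts:
        - grafana.hyperstudy.app
      secretName: grafana-tls  # hypothetical secret name
```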
Available Dashboards
1. Cluster Overview Dashboard
Purpose: High-level view of cluster health and resource utilization
Key Metrics:
- Node CPU and memory usage
- Pod distribution across nodes
- Cluster capacity and allocation
- Network I/O statistics
- Disk usage and I/O
Use Cases:
- Capacity planning
- Identifying resource bottlenecks
- Node health monitoring
2. Application Performance Dashboard
Purpose: Monitor HyperStudy application metrics
Key Metrics:
- Request rate and latency (p50, p95, p99)
- Error rates by endpoint
- Active connections
- Response time distribution
- Database query performance
Use Cases:
- Performance optimization
- SLA monitoring
- Troubleshooting slow endpoints
3. Socket.IO Metrics Dashboard
Purpose: Real-time communication monitoring
Key Metrics:
- Active WebSocket connections
- Room occupancy
- Message throughput
- Connection/disconnection rates
- Event processing times
- Redis pub/sub metrics
Use Cases:
- Scaling decisions
- Connection stability monitoring
- Debugging synchronization issues
4. Pod Performance Dashboard
Purpose: Individual pod monitoring
Key Metrics:
- CPU usage per pod
- Memory consumption
- Network traffic
- Restart count
- Container states
- Request handling capacity
Use Cases:
- Identifying problematic pods
- Resource optimization
- Load balancing verification
5. Redis Metrics Dashboard
Purpose: Redis performance and health
Key Metrics:
- Memory usage
- Cache hit/miss rates
- Command throughput
- Connection count
- Evicted keys
- Persistence status
Use Cases:
- Cache optimization
- Memory management
- Performance tuning
6. Experiment Metrics Dashboard
Purpose: HyperStudy experiment-specific metrics
Key Metrics:
- Active experiments
- Participant distribution
- Media sync accuracy
- Experiment completion rates
- Error rates by experiment phase
- LiveKit room statistics
Use Cases:
- Experiment health monitoring
- Participant experience tracking
- Debugging experiment issues
Key Metrics Explained
Application Metrics
Request Latency
# 95th percentile latency
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
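To build intuition for what this query returns: histogram_quantile finds the first cumulative bucket that reaches the target rank and linearly interpolates within it. A small stdlib sketch of that interpolation (the bucket values below are made up for illustration, not HyperStudy data):

```javascript
// Approximate Prometheus histogram_quantile: linear interpolation
// within the first cumulative bucket that reaches the target rank.
function histogramQuantile(q, buckets) {
  // buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
  // ending with the +Inf bucket (le: Infinity).
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0, prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (le === Infinity) return prevLe; // cannot interpolate into +Inf
      // Linear interpolation between the bucket's boundaries.
      return prevLe + (le - prevLe) * ((rank - prevCount) / (count - prevCount));
    }
    prevLe = le;
    prevCount = count;
  }
  return NaN;
}

// Hypothetical request-duration buckets (seconds, cumulative counts).
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.8125
```

Note the resolution limit this implies: the reported percentile can never be more precise than the bucket boundaries, which is why the histogram buckets defined in the application code should bracket the latencies you care about.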
Error Rate
# Percentage of 5xx errors
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
Active Connections
# Current WebSocket connections
socketio_connected_clients
Resource Metrics
CPU Usage
# CPU usage by pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
Memory Usage
# Memory usage percentage
container_memory_working_set_bytes
/
container_spec_memory_limit_bytes * 100
Network Traffic
# Network bytes received per second
rate(container_network_receive_bytes_total[5m])
Setting Up Alerts
Alert Configuration
Alerts are defined in k8s/monitoring/prometheus-alerts.yaml:
groups:
- name: hyperstudy
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is above 5% for 5 minutes"
Common Alerts
High CPU Usage
alert: HighCPUUsage
expr: |
(sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
/
sum(container_spec_cpu_quota / container_spec_cpu_period) by (pod)) > 0.8
for: 10m
Memory Pressure
alert: MemoryPressure
expr: |
container_memory_working_set_bytes
/
container_spec_memory_limit_bytes > 0.9
for: 5m
Pod Restarts
alert: FrequentPodRestarts
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
Socket.IO Connection Issues
alert: SocketIOConnectionDrop
expr: |
rate(socketio_disconnect_total[5m]) > 10
Alert Notifications
Configure alert notifications in Grafana:
1. Navigate to Alerting → Contact points
2. Add a notification channel (Email, Slack, PagerDuty, etc.)
3. Configure alert routing rules
4. Test notifications
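Contact points can also be provisioned as configuration rather than through the UI. A sketch of Grafana's alerting provisioning file format (the contact point name and Slack webhook URL are placeholders):

```yaml
# provisioning/alerting/contact-points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: slack-oncall            # hypothetical contact point name
    receivers:
      - uid: slack-oncall-1
        type: slack
        settings:
          url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder webhook
```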
Custom Metrics
Adding Application Metrics
Backend Metrics (Node.js)
// metrics.js
const promClient = require('prom-client');
// Create custom metrics
const httpDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status'],
buckets: [0.1, 0.5, 1, 2, 5]
});
const activeExperiments = new promClient.Gauge({
name: 'hyperstudy_active_experiments',
help: 'Number of active experiments'
});
const syncAccuracy = new promClient.Histogram({
name: 'hyperstudy_sync_accuracy_ms',
help: 'Media synchronization accuracy in milliseconds',
buckets: [1, 5, 10, 25, 50, 100, 250, 500, 1000]
});
// Export metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
});
Socket.IO Metrics
// Socket.IO metric definitions (names match the socketio_* series used in dashboards)
const connectedClients = new promClient.Gauge({
  name: 'socketio_connected_clients', help: 'Currently connected clients'
});
const disconnectTotal = new promClient.Counter({
  name: 'socketio_disconnect_total', help: 'Total disconnections'
});
const roomOccupancy = new promClient.Gauge({
  name: 'socketio_room_occupancy', help: 'Occupancy per room', labelNames: ['room']
});
// Track Socket.IO connections
io.on('connection', (socket) => {
  connectedClients.inc();
  socket.on('disconnect', () => {
    connectedClients.dec();
    disconnectTotal.inc();
  });
  socket.on('join-room', (room) => {
    roomOccupancy.inc({ room });
  });
});
Prometheus Scraping Configuration
Expose the metrics port through a labeled Service so Prometheus can discover and scrape it:
apiVersion: v1
kind: Service
metadata:
name: backend-metrics
labels:
app: backend
metrics: "true"
spec:
ports:
- name: metrics
port: 8080
targetPort: 8080
selector:
app: backend
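For Prometheus to pick up Services carrying the `metrics: "true"` label, its scrape configuration needs a matching discovery rule. A sketch of such a rule (the job name is illustrative; the actual HyperStudy scrape config may differ):

```yaml
scrape_configs:
  - job_name: 'hyperstudy-services'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints whose backing Service carries metrics: "true"
      - source_labels: [__meta_kubernetes_service_label_metrics]
        action: keep
        regex: "true"
```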
Debugging with Metrics
Performance Issues
1. Check latency percentiles:
   histogram_quantile(0.99,
     rate(http_request_duration_seconds_bucket[5m])
   )
2. Identify slow endpoints:
   topk(10,
     histogram_quantile(0.95,
       sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
     )
   )
3. Check resource constraints:
   - CPU throttling
   - Memory limits
   - Network saturation
Connection Issues
1. Monitor WebSocket connections:
   socketio_connected_clients
2. Check disconnection reasons:
   sum(increase(socketio_disconnect_total[5m])) by (reason)
3. Verify Redis connectivity:
   redis_connected_clients
Scaling Decisions
1. CPU-based scaling indicators:
   avg(rate(container_cpu_usage_seconds_total[5m])) by (deployment)
2. Memory pressure indicators:
   container_memory_working_set_bytes / container_spec_memory_limit_bytes
3. Request queue depth:
   http_requests_pending
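When the CPU and memory signals above run consistently hot, they can drive a HorizontalPodAutoscaler. A sketch for the backend deployment (the replica bounds and 70% utilization target are assumptions for illustration, not HyperStudy's actual settings):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend
  namespace: hyperstudy
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2        # assumed floor
  maxReplicas: 10       # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target
```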
Best Practices
Dashboard Design
- Use consistent time ranges across panels
- Group related metrics logically
- Include context (thresholds, targets)
- Use appropriate visualizations:
- Graphs for time series
- Gauges for current values
- Tables for detailed breakdowns
- Heatmaps for distributions
Query Optimization
1. Use recording rules for expensive queries:
   - record: job:http_requests:rate5m
     expr: sum(rate(http_requests_total[5m])) by (job)
2. Limit cardinality in labels
3. Use appropriate time ranges
4. Aggregate before graphing
Alert Management
1. Avoid alert fatigue:
   - Set appropriate thresholds
   - Use proper time windows
   - Group related alerts
2. Include actionable information:
   - Clear descriptions
   - Runbook links
   - Suggested remediation
3. Test alerts regularly:
   - Verify they fire correctly
   - Check notification delivery
   - Update as system evolves
Troubleshooting Monitoring
Prometheus Issues
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets
# Verify metrics ingestion
curl http://localhost:9090/api/v1/query?query=up
# Check Prometheus logs
kubectl logs -n monitoring deployment/prometheus
Grafana Issues
# Reset admin password
kubectl exec -n monitoring deployment/grafana -- \
grafana-cli admin reset-admin-password newpassword
# Check datasource connectivity
kubectl exec -n monitoring deployment/grafana -- \
curl http://prometheus:9090/api/v1/query?query=up
# Review Grafana logs
kubectl logs -n monitoring deployment/grafana
Missing Metrics
1. Verify service discovery:
   kubectl get servicemonitor -n monitoring
2. Check scrape configuration:
   kubectl get configmap -n monitoring prometheus-config -o yaml
3. Test metric endpoint:
   kubectl port-forward -n hyperstudy pod/backend-0 8080:8080
   curl http://localhost:8080/metrics
Advanced Topics
Long-term Storage
Configure remote storage for historical data:
remote_write:
- url: "https://prometheus-storage.example.com/api/v1/write"
basic_auth:
username: user
password: pass
Federation
Set up Prometheus federation for multi-cluster monitoring:
scrape_configs:
- job_name: 'federate'
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="hyperstudy"}'
static_configs:
- targets:
- 'prometheus-cluster-b:9090'
Custom Exporters
Create custom exporters for third-party services:
// firebase-exporter.js
const admin = require('firebase-admin');
const express = require('express');
const promClient = require('prom-client');
admin.initializeApp(); // credentials from GOOGLE_APPLICATION_CREDENTIALS
const userCount = new promClient.Gauge({
  name: 'firebase_user_count',
  help: 'Total number of users'
});
// Update metrics periodically (note: listUsers returns at most 1000 users per page)
setInterval(async () => {
  const users = await admin.auth().listUsers();
  userCount.set(users.users.length);
}, 60000);
// Expose the metrics for Prometheus to scrape
const app = express();
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
app.listen(9400);