
Sprout Monitoring Stack

A comprehensive monitoring and logging solution for the Sprout platform using Prometheus, Grafana, Loki, and Alertmanager. This stack provides real-time visibility into system health, performance metrics, centralized logging, and automated alerting.

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   ETL Fastify   │    │ NestJS Monolith │    │  NestJS Worker  │
│     Service     │    │   Application   │    │   (Background)  │
│   (Port 3001)   │    │   (Port 3000)   │    │   (Port 4001)   │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────┬───────────┘                      │ push
                    │ scrape /metrics                  ▼
                    ▼                        ┌─────────────────┐
           ┌─────────────────┐               │   Pushgateway   │
           │   Prometheus    │◄─── scrape ────│   (Port 9091)   │
           │   (Port 9090)   │               └─────────────────┘
           └────────┬────────┘
                    │
         ┌──────────┴──────────┐
         ▼                     ▼
┌─────────────────┐   ┌─────────────────┐
│     Grafana     │   │  Alertmanager   │
│   (Port 3000)   │   │   (Port 9093)   │
└─────────────────┘   └─────────────────┘

Services Monitored

HTTP Metrics Services

  • ETL Fastify Service (Port 3001): Fastify-based ETL service with /metrics endpoint
  • NestJS Monolith (Port 3000): Main NestJS application with /metrics endpoint

Pushgateway Services

  • NestJS Worker (Port 4001): Background job processor that pushes metrics to Pushgateway
  • ETL Worker (Port 4002): ETL background processor; reports via Pushgateway (or is scraped directly if it exposes HTTP metrics)

Optional Services

  • Redis (Port 6379): Cache and job queue (requires Redis exporter)
  • MeiliSearch (Port 7700): Search engine (if it exposes metrics)

Key Features

Metrics Collection

  • HTTP Scraping: Direct metrics collection from services with /metrics endpoints
  • Pushgateway: Batch job and worker metrics collection via Prometheus Pushgateway
  • Custom Labels: Service-specific labeling for better metric organization
  • Relabeling: Automatic service identification and categorization

Dashboards

  • System Health Overview: Service status, request rates, and error rates
  • Performance Metrics: Response times, processing durations, and throughput
  • Worker Monitoring: Background job processing rates and active job counts
  • Business Metrics: Assigned purchase requests, inventory matches, and cost analysis
  • Resource Utilization: CPU, memory, and system resource monitoring

Alerting

  • Service Down Alerts: Immediate notification when services become unavailable
  • High Error Rate Alerts: Alert when error rates exceed thresholds
  • Performance Degradation: Alert on slow response times and processing delays
  • Worker Health: Monitor background job processing and failures

Note: All alerting is now managed in Grafana. Prometheus no longer loads alert rules from etl-alerts.yml.

Setup Instructions

1. Configure Service URLs

Edit config/services.yaml to match your deployment environment:

services:
  etl_fastify:
    host: "your-etl-host.com"    # Change from host.docker.internal
    port: 3001
    enabled: true

  nestjs_monolith:
    host: "your-nestjs-host.com" # Change from host.docker.internal
    port: 3000
    enabled: true

  nestjs_worker:
    host: "your-worker-host.com" # Change from host.docker.internal
    port: 4001
    enabled: false               # Uses Pushgateway, not HTTP metrics
    pushgateway: true

2. Generate Prometheus Configuration

./scripts/generate-prometheus-config.sh

This script reads config/services.yaml and generates the appropriate Prometheus configuration.
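As a rough illustration, the generated scrape configuration for an HTTP-metrics service might contain a job along these lines (a sketch only; the actual output is determined by the script and your services.yaml):

```yaml
# Hypothetical fragment of the generated Prometheus configuration.
scrape_configs:
  - job_name: "etl_fastify"
    metrics_path: /metrics
    static_configs:
      - targets: ["your-etl-host.com:3001"]
        labels:
          service: "etl_fastify"
```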

3. Deploy the Stack

Option A: Local Development

docker-compose up -d

Option B: Production Deployment

./scripts/deploy-monitoring-droplet.sh

This script:

  • Creates a DigitalOcean droplet
  • Sets up Docker and Docker Compose
  • Configures Cloudflare Tunnel for secure access
  • Deploys the monitoring stack
  • Sets up SSL certificates

Configuration

Service Configuration (config/services.yaml)

The services configuration file allows you to:

  • Enable/disable monitoring for specific services
  • Configure service URLs and ports
  • Specify metrics and health check endpoints
  • Mark services that use pushgateway instead of HTTP metrics

Environment-Specific Settings

environments:
  development:
    host_prefix: "host.docker.internal"
  production:
    host_prefix: "your-production-host.com"
  staging:
    host_prefix: "your-staging-host.com"

current_environment: "development"
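A sketch of how an environment's host_prefix might combine with a service port to form a scrape target (assumed behavior of the generator script, shown with the development values):

```shell
# Assumed resolution logic: target = <host_prefix>:<port>.
host_prefix="host.docker.internal"   # environments.development.host_prefix
port=3001                            # services.etl_fastify.port
target="${host_prefix}:${port}"
echo "$target"                       # -> host.docker.internal:3001
```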

Prometheus Configuration

The Prometheus configuration includes:

  • Service discovery for HTTP metrics endpoints
  • Pushgateway configuration for batch jobs
  • Relabeling rules for proper service identification
  • Alerting rule integration

Dashboards

Main Dashboard: Sprout System Health

URL: http://your-grafana-host:3000/d/sprout-system-health

Panels Include:

  • Service status indicators
  • Request rates and success rates
  • Response time percentiles
  • Error rates by service
  • Job queue sizes and processing rates
  • Worker metrics from pushgateway
  • Business metrics (assigned purchases, inventory matches)
  • Cost analysis and performance trends
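Panels like these are typically backed by PromQL queries along the following lines (metric and label names here are illustrative, not taken from the actual dashboards):

```
# Request rate and error rate by service (assumes a standard HTTP counter):
sum(rate(http_requests_total[5m])) by (service)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# 95th-percentile response time (assumes a duration histogram):
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```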

Dashboard Features

  • Real-time Updates: 30-second refresh intervals
  • Service Filtering: Template variables for service and integration filtering
  • Threshold Alerts: Color-coded indicators for performance issues
  • Historical Trends: Time-series data for trend analysis

Alerting

Alert Rules (Grafana)

  • All alert rules are now managed in Grafana via the UI or Grafana provisioning. There is no longer an etl-alerts.yml file for Prometheus.

Alertmanager Configuration

  • Email Notifications: Configure SMTP settings for email alerts
  • Slack Integration: Webhook-based Slack notifications
  • PagerDuty: Integration with PagerDuty for incident management
  • Escalation Policies: Multi-level alert escalation
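A minimal alertmanager.yml sketch wiring up a Slack receiver (the webhook URL and channel are placeholders; SMTP and PagerDuty receivers follow the same pattern):

```yaml
route:
  receiver: slack-alerts        # default route; add sub-routes for escalation
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#sprout-alerts"
        send_resolved: true
```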

Pushgateway Integration

How Pushgateway Works

The NestJS worker and other background services use Prometheus Pushgateway to report metrics:

  1. Metric Collection: Workers collect metrics during job processing
  2. Push to Gateway: Metrics are pushed to Pushgateway when jobs complete
  3. Prometheus Scraping: Prometheus scrapes Pushgateway to collect these metrics
  4. Dashboard Display: Grafana displays pushgateway metrics alongside HTTP metrics
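Step 2 amounts to an HTTP PUT/POST of text-format metrics to a URL that encodes the job (and optionally instance) labels. A sketch, assuming the default Pushgateway port and hypothetical job/instance names:

```shell
# Build the push URL (names here are examples, not fixed by the stack).
PUSHGATEWAY_URL="http://localhost:9091"
push_url="${PUSHGATEWAY_URL}/metrics/job/nestjs_worker/instance/worker-1"
echo "$push_url"

# The actual push (requires a running Pushgateway, so commented out here):
# echo 'worker_jobs_processed_total 42' | curl --data-binary @- "$push_url"
```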

Worker Metrics Available

  • worker_jobs_processed_total: Total jobs processed
  • worker_jobs_failed_total: Total job failures
  • worker_active_jobs: Currently active jobs
  • worker_job_duration_seconds: Job processing duration
  • worker_queue_size: Current queue size
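Typical PromQL over these counters and gauges (illustrative queries, not taken from the provisioned dashboards):

```
# Job throughput and failure ratio over the last 5 minutes:
sum(rate(worker_jobs_processed_total[5m]))
sum(rate(worker_jobs_failed_total[5m]))
  / sum(rate(worker_jobs_processed_total[5m]))

# Current load:
sum(worker_active_jobs)
max(worker_queue_size)
```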

Configuration

Pushgateway is automatically configured in the Prometheus setup with:

  • Proper relabeling for service identification
  • Honor labels to preserve worker-specific metadata
  • Regular scraping intervals for real-time updates
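In Prometheus terms, those three points boil down to a scrape job like this sketch (the target address assumes the docker-compose service name; the interval is an example):

```yaml
scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true          # keep job/instance labels set by the pushers
    scrape_interval: 15s
    static_configs:
      - targets: ["pushgateway:9091"]
```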

Troubleshooting

Common Issues

  1. Services Not Appearing

    • Check service URLs in config/services.yaml
    • Verify services are running and accessible
    • Check firewall settings
  2. No Metrics Data

    • Verify /metrics endpoints are working
    • Check Prometheus targets page
    • Review service logs for metric collection issues
  3. Pushgateway Metrics Missing

    • Verify workers are pushing to correct Pushgateway URL
    • Check Pushgateway logs for connection issues
    • Ensure proper metric naming conventions
  4. Dashboard Not Loading

    • Check Grafana datasource configuration
    • Verify Prometheus is accessible from Grafana
    • Review browser console for JavaScript errors

Debug Commands

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check Pushgateway metrics
curl http://localhost:9091/metrics

# Check service metrics
curl http://your-service:port/metrics

# View Prometheus logs
docker-compose logs prometheus

# View Grafana logs
docker-compose logs grafana
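When checking Pushgateway output, it can help to grep for the expected metric families. A self-contained sketch (the here-doc stands in for a real `curl http://localhost:9091/metrics` response):

```shell
# Count worker metric lines in a sample /metrics response.
count=$(cat <<'EOF' | grep -c '^worker_jobs'
worker_jobs_processed_total{job="nestjs_worker"} 42
worker_jobs_failed_total{job="nestjs_worker"} 1
EOF
)
echo "$count"   # -> 2
```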

Log Locations

  • Prometheus: /var/log/prometheus/
  • Grafana: /var/log/grafana/
  • Alertmanager: /var/log/alertmanager/
  • Pushgateway: /var/log/pushgateway/

Security Considerations

Network Security

  • Internal Network: All monitoring services run on an internal network
  • Cloudflare Tunnel: Secure external access via Cloudflare Tunnel
  • SSL/TLS: Automatic SSL certificate management
  • Authentication: Grafana login required for dashboard access

Data Protection

  • Metrics Retention: Configurable data retention policies
  • Access Control: Role-based access control in Grafana
  • Audit Logging: Comprehensive audit trails for all operations

Performance Optimization

Resource Requirements

  • Minimum: 2GB RAM, 2 vCPUs
  • Recommended: 4GB RAM, 4 vCPUs
  • Storage: 20GB+ for metrics retention

Scaling Considerations

  • Horizontal Scaling: Add more Prometheus instances for high-volume metrics
  • Federation: Use Prometheus federation for multi-region monitoring
  • Remote Storage: Integrate with remote storage for long-term retention

Maintenance

Regular Tasks

  • Backup Configuration: Backup config/ and grafana/provisioning/
  • Update Dashboards: Regularly review and update dashboard panels
  • Review Alerts: Adjust alert thresholds based on historical data
  • Clean Old Data: Configure retention policies for old metrics
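Retention is typically set via Prometheus startup flags; a docker-compose sketch (15d is an example value, not this stack's actual setting):

```yaml
services:
  prometheus:
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=15d"   # drop data older than 15 days
```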

Updates

  • Docker Images: Regularly update monitoring stack images
  • Security Patches: Apply security updates promptly
  • Feature Updates: Review new monitoring features and integrations

Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review service logs for error messages
  3. Verify configuration files are correct
  4. Test individual components (Prometheus, Grafana, etc.)

Contributing

To add new services or metrics:

  1. Update config/services.yaml
  2. Run ./scripts/generate-prometheus-config.sh
  3. Add relevant dashboard panels
  4. Update alerting rules if needed
  5. Test the configuration locally before deploying
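For step 1, a new entry in config/services.yaml would follow the same shape as the existing ones (the service name, host, and port below are hypothetical):

```yaml
services:
  my_new_service:
    host: "your-new-service-host.com"
    port: 5000
    enabled: true
    # pushgateway: true   # set instead of HTTP scraping for background workers
```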