
Sprout Monitoring Stack

A comprehensive monitoring and logging solution for the Sprout platform using Prometheus, Grafana, Loki, and Alertmanager. This stack provides real-time visibility into system health, performance metrics, centralized logging, and automated alerting.

Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   ETL Fastify   │    │ NestJS Monolith │    │  NestJS Worker  │
│     Service     │    │   Application   │    │   (Background)  │
│   (Port 3001)   │    │   (Port 3000)   │    │   (Port 4001)   │
└────────┬────────┘    └────────┬────────┘    └────────┬────────┘
         │                      │                      │
         └──────────┬───────────┘                      │ push
                    │ scrape /metrics                  ▼
                    ▼                        ┌─────────────────┐
           ┌─────────────────┐               │   Pushgateway   │
           │   Prometheus    │◄─── scrape ────│   (Port 9091)   │
           │   (Port 9090)   │               └─────────────────┘
           └────────┬────────┘
                    │
         ┌──────────┴──────────┐
         ▼                     ▼
┌─────────────────┐   ┌─────────────────┐
│     Grafana     │   │  Alertmanager   │
│   (Port 3000)   │   │   (Port 9093)   │
└─────────────────┘   └─────────────────┘

Services Monitored

HTTP Metrics Services

  • ETL Fastify Service (Port 3001): Fastify-based ETL service with /metrics endpoint
  • NestJS Monolith (Port 3000): Main NestJS application with /metrics endpoint

Pushgateway Services

  • NestJS Worker (Port 4001): Background job processor that pushes metrics to Pushgateway
  • ETL Worker (Port 4002): ETL background processor; reports via Pushgateway (or is scraped directly if it exposes HTTP metrics)

Optional Services

  • Redis (Port 6379): Cache and job queue (requires Redis exporter)
  • MeiliSearch (Port 7700): Search engine (if it exposes metrics)

Key Features

Metrics Collection

  • HTTP Scraping: Direct metrics collection from services with /metrics endpoints
  • Pushgateway: Batch job and worker metrics collection via Prometheus Pushgateway
  • Custom Labels: Service-specific labeling for better metric organization
  • Relabeling: Automatic service identification and categorization

Dashboards

  • System Health Overview: Service status, request rates, and error rates
  • Performance Metrics: Response times, processing durations, and throughput
  • Worker Monitoring: Background job processing rates and active job counts
  • Business Metrics: Assigned purchase requests, inventory matches, and cost analysis
  • Resource Utilization: CPU, memory, and system resource monitoring

Alerting

  • Service Down Alerts: Immediate notification when services become unavailable
  • High Error Rate Alerts: Alert when error rates exceed thresholds
  • Performance Degradation: Alert on slow response times and processing delays
  • Worker Health: Monitor background job processing and failures

Note: All alerting is now managed in Grafana. Prometheus no longer loads alert rules from etl-alerts.yml.

Setup Instructions

1. Configure Service URLs

Edit config/services.yaml to match your deployment environment:

services:
  etl_fastify:
    host: "your-etl-host.com"    # Change from host.docker.internal
    port: 3001
    enabled: true

  nestjs_monolith:
    host: "your-nestjs-host.com" # Change from host.docker.internal
    port: 3000
    enabled: true

  nestjs_worker:
    host: "your-worker-host.com" # Change from host.docker.internal
    port: 4001
    enabled: false               # Uses Pushgateway, not HTTP metrics
    pushgateway: true

2. Generate Prometheus Configuration

./scripts/generate-prometheus-config.sh

This script reads config/services.yaml and generates the appropriate Prometheus configuration.
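As a rough illustration, the generated scrape configuration for an HTTP-metrics service might contain a job along these lines (a sketch only; the actual output is determined by the script and your services.yaml):

```yaml
# Hypothetical fragment of the generated Prometheus configuration.
scrape_configs:
  - job_name: "etl_fastify"
    metrics_path: /metrics
    static_configs:
      - targets: ["your-etl-host.com:3001"]
        labels:
          service: "etl_fastify"
```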

3. Deploy the Stack

Option A: Local Development

docker-compose up -d

Option B: Production Deployment

./scripts/deploy-monitoring-droplet.sh

This script:

  • Creates a DigitalOcean droplet
  • Sets up Docker and Docker Compose
  • Configures Cloudflare Tunnel for secure access
  • Deploys the monitoring stack
  • Sets up SSL certificates

Configuration

Service Configuration (config/services.yaml)

The services configuration file allows you to:

  • Enable/disable monitoring for specific services
  • Configure service URLs and ports
  • Specify metrics and health check endpoints
  • Mark services that use pushgateway instead of HTTP metrics

Environment-Specific Settings

environments:
  development:
    host_prefix: "host.docker.internal"
  production:
    host_prefix: "your-production-host.com"
  staging:
    host_prefix: "your-staging-host.com"

current_environment: "development"
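A sketch of how an environment's host_prefix might combine with a service port to form a scrape target (assumed behavior of the generator script, shown with the development values):

```shell
# Assumed resolution logic: target = <host_prefix>:<port>.
host_prefix="host.docker.internal"   # environments.development.host_prefix
port=3001                            # services.etl_fastify.port
target="${host_prefix}:${port}"
echo "$target"                       # -> host.docker.internal:3001
```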

Prometheus Configuration

The Prometheus configuration includes:

  • Service discovery for HTTP metrics endpoints
  • Pushgateway configuration for batch jobs
  • Relabeling rules for proper service identification
  • Alerting rule integration

Dashboards

Main Dashboard: Sprout System Health

URL: http://your-grafana-host:3000/d/sprout-system-health

Panels Include:

  • Service status indicators
  • Request rates and success rates
  • Response time percentiles
  • Error rates by service
  • Job queue sizes and processing rates
  • Worker metrics from pushgateway
  • Business metrics (assigned purchases, inventory matches)
  • Cost analysis and performance trends
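Panels like these are typically backed by PromQL queries along the following lines (metric and label names here are illustrative, not taken from the actual dashboards):

```
# Request rate and error rate by service (assumes a standard HTTP counter):
sum(rate(http_requests_total[5m])) by (service)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# 95th-percentile response time (assumes a duration histogram):
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```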

Dashboard Features

  • Real-time Updates: 30-second refresh intervals
  • Service Filtering: Template variables for service and integration filtering
  • Threshold Alerts: Color-coded indicators for performance issues
  • Historical Trends: Time-series data for trend analysis

Alerting

Alert Rules (Grafana)

  • All alert rules are now managed in Grafana via the UI or Grafana provisioning. There is no longer an etl-alerts.yml file for Prometheus.

Alertmanager Configuration

  • Email Notifications: Configure SMTP settings for email alerts
  • Slack Integration: Webhook-based Slack notifications
  • PagerDuty: Integration with PagerDuty for incident management
  • Escalation Policies: Multi-level alert escalation
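A minimal alertmanager.yml sketch wiring up a Slack receiver (the webhook URL and channel are placeholders; SMTP and PagerDuty receivers follow the same pattern):

```yaml
route:
  receiver: slack-alerts        # default route; add sub-routes for escalation
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#sprout-alerts"
        send_resolved: true
```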

Pushgateway Integration

How Pushgateway Works

The NestJS worker and other background services use Prometheus Pushgateway to report metrics:

  1. Metric Collection: Workers collect metrics during job processing
  2. Push to Gateway: Metrics are pushed to Pushgateway when jobs complete
  3. Prometheus Scraping: Prometheus scrapes Pushgateway to collect these metrics
  4. Dashboard Display: Grafana displays pushgateway metrics alongside HTTP metrics
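Step 2 amounts to an HTTP PUT/POST of text-format metrics to a URL that encodes the job (and optionally instance) labels. A sketch, assuming the default Pushgateway port and hypothetical job/instance names:

```shell
# Build the push URL (names here are examples, not fixed by the stack).
PUSHGATEWAY_URL="http://localhost:9091"
push_url="${PUSHGATEWAY_URL}/metrics/job/nestjs_worker/instance/worker-1"
echo "$push_url"

# The actual push (requires a running Pushgateway, so commented out here):
# echo 'worker_jobs_processed_total 42' | curl --data-binary @- "$push_url"
```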

Worker Metrics Available

  • worker_jobs_processed_total: Total jobs processed
  • worker_jobs_failed_total: Total job failures
  • worker_active_jobs: Currently active jobs
  • worker_job_duration_seconds: Job processing duration
  • worker_queue_size: Current queue size
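Typical PromQL over these counters and gauges (illustrative queries, not taken from the provisioned dashboards):

```
# Job throughput and failure ratio over the last 5 minutes:
sum(rate(worker_jobs_processed_total[5m]))
sum(rate(worker_jobs_failed_total[5m]))
  / sum(rate(worker_jobs_processed_total[5m]))

# Current load:
sum(worker_active_jobs)
max(worker_queue_size)
```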

Configuration

Pushgateway is automatically configured in the Prometheus setup with:

  • Proper relabeling for service identification
  • Honor labels to preserve worker-specific metadata
  • Regular scraping intervals for real-time updates
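In Prometheus terms, those three points boil down to a scrape job like this sketch (the target address assumes the docker-compose service name; the interval is an example):

```yaml
scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true          # keep job/instance labels set by the pushers
    scrape_interval: 15s
    static_configs:
      - targets: ["pushgateway:9091"]
```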

Troubleshooting

Common Issues

  1. Services Not Appearing

    • Check service URLs in config/services.yaml
    • Verify services are running and accessible
    • Check firewall settings
  2. No Metrics Data

    • Verify /metrics endpoints are working
    • Check Prometheus targets page
    • Review service logs for metric collection issues
  3. Pushgateway Metrics Missing

    • Verify workers are pushing to correct Pushgateway URL
    • Check Pushgateway logs for connection issues
    • Ensure proper metric naming conventions
  4. Dashboard Not Loading

    • Check Grafana datasource configuration
    • Verify Prometheus is accessible from Grafana
    • Review browser console for JavaScript errors

Debug Commands

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check Pushgateway metrics
curl http://localhost:9091/metrics

# Check service metrics
curl http://your-service:port/metrics

# View Prometheus logs
docker-compose logs prometheus

# View Grafana logs
docker-compose logs grafana
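When checking Pushgateway output, it can help to grep for the expected metric families. A self-contained sketch (the here-doc stands in for a real `curl http://localhost:9091/metrics` response):

```shell
# Count worker metric lines in a sample /metrics response.
count=$(cat <<'EOF' | grep -c '^worker_jobs'
worker_jobs_processed_total{job="nestjs_worker"} 42
worker_jobs_failed_total{job="nestjs_worker"} 1
EOF
)
echo "$count"   # -> 2
```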

Log Locations

  • Prometheus: /var/log/prometheus/
  • Grafana: /var/log/grafana/
  • Alertmanager: /var/log/alertmanager/
  • Pushgateway: /var/log/pushgateway/

Security Considerations

Network Security

  • Internal Network: All monitoring services run on an internal network
  • Cloudflare Tunnel: Secure external access via Cloudflare Tunnel
  • SSL/TLS: Automatic SSL certificate management
  • Authentication: Grafana login required for dashboard access

Data Protection

  • Metrics Retention: Configurable data retention policies
  • Access Control: Role-based access control in Grafana
  • Audit Logging: Comprehensive audit trails for all operations

Performance Optimization

Resource Requirements

  • Minimum: 2GB RAM, 2 vCPUs
  • Recommended: 4GB RAM, 4 vCPUs
  • Storage: 20GB+ for metrics retention

Scaling Considerations

  • Horizontal Scaling: Add more Prometheus instances for high-volume metrics
  • Federation: Use Prometheus federation for multi-region monitoring
  • Remote Storage: Integrate with remote storage for long-term retention

Maintenance

Regular Tasks

  • Backup Configuration: Backup config/ and grafana/provisioning/
  • Update Dashboards: Regularly review and update dashboard panels
  • Review Alerts: Adjust alert thresholds based on historical data
  • Clean Old Data: Configure retention policies for old metrics
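Retention is typically set via Prometheus startup flags; a docker-compose sketch (15d is an example value, not this stack's actual setting):

```yaml
services:
  prometheus:
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=15d"   # drop data older than 15 days
```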

Updates

  • Docker Images: Regularly update monitoring stack images
  • Security Patches: Apply security updates promptly
  • Feature Updates: Review new monitoring features and integrations

Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review service logs for error messages
  3. Verify configuration files are correct
  4. Test individual components (Prometheus, Grafana, etc.)

Contributing

To add new services or metrics:

  1. Update config/services.yaml
  2. Run ./scripts/generate-prometheus-config.sh
  3. Add relevant dashboard panels
  4. Update alerting rules if needed
  5. Test the configuration locally before deploying
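For step 1, a new entry in config/services.yaml would follow the same shape as the existing ones (the service name, host, and port below are hypothetical):

```yaml
services:
  my_new_service:
    host: "your-new-service-host.com"
    port: 5000
    enabled: true
    # pushgateway: true   # set instead of HTTP scraping for background workers
```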