# Sprout Monitoring Stack

A comprehensive monitoring and logging solution for the Sprout platform, built on Prometheus, Grafana, Loki, and Alertmanager. The stack provides real-time visibility into system health and performance metrics, centralized logging, and automated alerting.
## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  ETL Fastify    │    │ NestJS Monolith │    │  NestJS Worker  │
│    Service      │    │   Application   │    │  (Background)   │
│  (Port 3001)    │    │   (Port 3000)   │    │   (Port 4001)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │   Prometheus    │
                       │   (Port 9090)   │
                       └─────────────────┘
                                │
                       ┌─────────────────┐
                       │   Pushgateway   │
                       │   (Port 9091)   │
                       └─────────────────┘
                                │
                       ┌─────────────────┐
                       │     Grafana     │
                       │   (Port 3000)   │
                       └─────────────────┘
                                │
                       ┌─────────────────┐
                       │  Alertmanager   │
                       │   (Port 9093)   │
                       └─────────────────┘
```
## Services Monitored

### HTTP Metrics Services

- ETL Fastify Service (Port 3001): Fastify-based ETL service with a `/metrics` endpoint
- NestJS Monolith (Port 3000): Main NestJS application with a `/metrics` endpoint

### Pushgateway Services

- NestJS Worker (Port 4001): Background job processor that pushes metrics to Pushgateway
- ETL Worker (Port 4002): ETL background processor (if it exposes HTTP metrics)

### Optional Services

- Redis (Port 6379): Cache and job queue (requires the Redis exporter)
- MeiliSearch (Port 7700): Search engine (if it exposes metrics)
## Key Features

### Metrics Collection

- HTTP Scraping: Direct metrics collection from services with `/metrics` endpoints
- Pushgateway: Batch job and worker metrics collected via the Prometheus Pushgateway
- Custom Labels: Service-specific labels for better metric organization
- Relabeling: Automatic service identification and categorization
### Dashboards
- System Health Overview: Service status, request rates, and error rates
- Performance Metrics: Response times, processing durations, and throughput
- Worker Monitoring: Background job processing rates and active job counts
- Business Metrics: Assigned purchase requests, inventory matches, and cost analysis
- Resource Utilization: CPU, memory, and system resource monitoring
### Alerting
- Service Down Alerts: Immediate notification when services become unavailable
- High Error Rate Alerts: Alert when error rates exceed thresholds
- Performance Degradation: Alert on slow response times and processing delays
- Worker Health: Monitor background job processing and failures
**Note:** All alerting is now managed in Grafana. Prometheus no longer loads alert rules from `etl-alerts.yml`.
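As an illustration, a Grafana-managed "high error rate" rule typically evaluates a PromQL expression along these lines (the metric and label names here are assumptions; adjust them to what your services actually export):

```promql
# Fires when more than 5% of a service's requests return 5xx over 5 minutes.
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
sum(rate(http_requests_total[5m])) by (service) > 0.05
```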
## Setup Instructions

### 1. Configure Service URLs

Edit `config/services.yaml` to match your deployment environment:
```yaml
services:
  etl_fastify:
    host: "your-etl-host.com"    # Change from host.docker.internal
    port: 3001
    enabled: true

  nestjs_monolith:
    host: "your-nestjs-host.com" # Change from host.docker.internal
    port: 3000
    enabled: true

  nestjs_worker:
    host: "your-worker-host.com" # Change from host.docker.internal
    port: 4001
    enabled: false               # Uses pushgateway, not HTTP metrics
    pushgateway: true
```
### 2. Generate Prometheus Configuration

```shell
./scripts/generate-prometheus-config.sh
```

This script reads `config/services.yaml` and generates the corresponding Prometheus configuration.
### 3. Deploy the Stack

**Option A: Local Development**

```shell
docker-compose up -d
```

**Option B: Production Deployment**

```shell
./scripts/deploy-monitoring-droplet.sh
```

This script:

- Creates a DigitalOcean droplet
- Sets up Docker and Docker Compose
- Configures Cloudflare Tunnel for secure access
- Deploys the monitoring stack
- Sets up SSL certificates
## Configuration

### Service Configuration (`config/services.yaml`)

The services configuration file lets you:

- Enable or disable monitoring for specific services
- Configure service URLs and ports
- Specify metrics and health-check endpoints
- Mark services that use Pushgateway instead of HTTP metrics
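For example, a fully specified service entry might look like the following sketch (the `metrics_path` and `health_path` keys are illustrative names, not confirmed fields; use whatever keys `generate-prometheus-config.sh` actually reads):

```yaml
services:
  etl_fastify:
    host: "host.docker.internal"
    port: 3001
    enabled: true
    metrics_path: "/metrics"   # illustrative: metrics endpoint override
    health_path: "/health"     # illustrative: health-check endpoint
```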
Environment-Specific Settings
environments:
development:
host_prefix: "host.docker.internal"
production:
host_prefix: "your-production-host.com"
staging:
host_prefix: "your-staging-host.com"
current_environment: "development"
### Prometheus Configuration
The Prometheus configuration includes:
- Service discovery for HTTP metrics endpoints
- Pushgateway configuration for batch jobs
- Relabeling rules for proper service identification
- Alerting rule integration
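A minimal sketch of what the generated scrape configuration might contain (job names and relabeling values are assumptions based on the services listed above; the file actually produced by `generate-prometheus-config.sh` is authoritative):

```yaml
scrape_configs:
  - job_name: "etl-fastify"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:3001"]
    relabel_configs:
      # Attach a stable service label for dashboards and alerts.
      - target_label: service
        replacement: etl_fastify

  - job_name: "nestjs-monolith"
    metrics_path: /metrics
    static_configs:
      - targets: ["host.docker.internal:3000"]
    relabel_configs:
      - target_label: service
        replacement: nestjs_monolith
```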
## Dashboards

### Main Dashboard: Sprout System Health

URL: `http://your-grafana-host:3000/d/sprout-system-health`

Panels include:
- Service status indicators
- Request rates and success rates
- Response time percentiles
- Error rates by service
- Job queue sizes and processing rates
- Worker metrics from pushgateway
- Business metrics (assigned purchases, inventory matches)
- Cost analysis and performance trends
### Dashboard Features
- Real-time Updates: 30-second refresh intervals
- Service Filtering: Template variables for service and integration filtering
- Threshold Alerts: Color-coded indicators for performance issues
- Historical Trends: Time-series data for trend analysis
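Dashboards query Prometheus through a provisioned Grafana datasource. A typical file under `grafana/provisioning/datasources/` looks roughly like this (the internal hostname `prometheus` assumes the docker-compose service name):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # docker-compose service name (assumption)
    isDefault: true
```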
## Alerting

### Alert Rules (Grafana)

All alert rules are now managed in Grafana, via the UI or Grafana provisioning. There is no longer an `etl-alerts.yml` file for Prometheus.
### Alertmanager Configuration
- Email Notifications: Configure SMTP settings for email alerts
- Slack Integration: Webhook-based Slack notifications
- PagerDuty: Integration with PagerDuty for incident management
- Escalation Policies: Multi-level alert escalation
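A skeletal `alertmanager.yml` illustrating these receivers (the SMTP host, addresses, and PagerDuty key are placeholders you must supply):

```yaml
global:
  smtp_smarthost: "smtp.example.com:587"   # placeholder
  smtp_from: "alerts@example.com"          # placeholder

route:
  receiver: "team-email"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty"

receivers:
  - name: "team-email"
    email_configs:
      - to: "oncall@example.com"           # placeholder
  - name: "pagerduty"
    pagerduty_configs:
      - service_key: "<pagerduty-key>"     # placeholder
```

Slack can be added as another receiver with a `slack_configs` block pointing at your webhook URL.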
## Pushgateway Integration

### How Pushgateway Works

The NestJS worker and other background services use the Prometheus Pushgateway to report metrics:
1. Metric Collection: Workers collect metrics during job processing
2. Push to Gateway: Metrics are pushed to the Pushgateway when jobs complete
3. Prometheus Scraping: Prometheus scrapes the Pushgateway to collect these metrics
4. Dashboard Display: Grafana displays Pushgateway metrics alongside HTTP metrics
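The push step can be sketched in plain shell. The payload uses the Prometheus text exposition format, and the job name in the URL path groups the pushed metrics; the Pushgateway address is an assumption based on the default compose setup:

```shell
# Sketch: push one counter sample to the Pushgateway.
PUSHGATEWAY_URL="http://localhost:9091"   # assumption: default local address
JOB="nestjs_worker"

# Build the payload: a TYPE hint plus one sample, in text exposition format.
payload='# TYPE worker_jobs_processed_total counter
worker_jobs_processed_total 42'
echo "$payload"

# To actually push (requires a running Pushgateway):
#   echo "$payload" | curl --data-binary @- "$PUSHGATEWAY_URL/metrics/job/$JOB"
```

In practice the workers do this from application code via a Prometheus client library rather than curl, but the wire format is the same.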
### Worker Metrics Available

- `worker_jobs_processed_total`: Total jobs processed
- `worker_jobs_failed_total`: Total job failures
- `worker_active_jobs`: Currently active jobs
- `worker_job_duration_seconds`: Job processing duration
- `worker_queue_size`: Current queue size
### Configuration

Pushgateway is configured automatically in the Prometheus setup with:

- Relabeling for service identification
- `honor_labels` to preserve worker-specific metadata
- Regular scrape intervals for near-real-time updates
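The corresponding scrape job looks roughly like this (a sketch; the generated configuration is authoritative, and the `pushgateway:9091` target assumes the docker-compose service name):

```yaml
scrape_configs:
  - job_name: "pushgateway"
    honor_labels: true        # keep the job/instance labels pushed by workers
    scrape_interval: 15s
    static_configs:
      - targets: ["pushgateway:9091"]
```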
## Troubleshooting

### Common Issues

1. Services Not Appearing
   - Check service URLs in `config/services.yaml`
   - Verify services are running and accessible
   - Check firewall settings

2. No Metrics Data
   - Verify `/metrics` endpoints are working
   - Check the Prometheus targets page
   - Review service logs for metric collection issues

3. Pushgateway Metrics Missing
   - Verify workers are pushing to the correct Pushgateway URL
   - Check Pushgateway logs for connection issues
   - Ensure proper metric naming conventions

4. Dashboard Not Loading
   - Check the Grafana datasource configuration
   - Verify Prometheus is accessible from Grafana
   - Review the browser console for JavaScript errors
### Debug Commands

```shell
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets

# Check Pushgateway metrics
curl http://localhost:9091/metrics

# Check service metrics
curl http://your-service:port/metrics

# View Prometheus logs
docker-compose logs prometheus

# View Grafana logs
docker-compose logs grafana
```
### Log Locations

- Prometheus: `/var/log/prometheus/`
- Grafana: `/var/log/grafana/`
- Alertmanager: `/var/log/alertmanager/`
- Pushgateway: `/var/log/pushgateway/`
## Security Considerations

### Network Security

- Internal Network: All monitoring services run on an internal network
- Cloudflare Tunnel: Secure external access via Cloudflare Tunnel
- SSL/TLS: Automatic SSL certificate management
- Authentication: Grafana login required for dashboard access
### Data Protection
- Metrics Retention: Configurable data retention policies
- Access Control: Role-based access control in Grafana
- Audit Logging: Comprehensive audit trails for all operations
## Performance Optimization

### Resource Requirements
- Minimum: 2GB RAM, 2 vCPUs
- Recommended: 4GB RAM, 4 vCPUs
- Storage: 20GB+ for metrics retention
### Scaling Considerations
- Horizontal Scaling: Add more Prometheus instances for high-volume metrics
- Federation: Use Prometheus federation for multi-region monitoring
- Remote Storage: Integrate with remote storage for long-term retention
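Retention and remote storage are controlled on the Prometheus side. For example (the flag values are illustrative, and the `remote_write` endpoint is a placeholder):

```yaml
# docker-compose.yml (excerpt)
services:
  prometheus:
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"   # illustrative retention window
      - "--storage.tsdb.retention.size=15GB"  # cap on-disk usage

# prometheus.yml (excerpt)
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"  # placeholder
```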
## Maintenance

### Regular Tasks

- Backup Configuration: Back up `config/` and `grafana/provisioning/`
- Update Dashboards: Regularly review and update dashboard panels
- Review Alerts: Adjust alert thresholds based on historical data
- Clean Old Data: Configure retention policies for old metrics
### Updates
- Docker Images: Regularly update monitoring stack images
- Security Patches: Apply security updates promptly
- Feature Updates: Review new monitoring features and integrations
## Support
For issues and questions:
- Check the troubleshooting section above
- Review service logs for error messages
- Verify configuration files are correct
- Test individual components (Prometheus, Grafana, etc.)
## Contributing

To add a new service or metric:

1. Update `config/services.yaml`
2. Run `./scripts/generate-prometheus-config.sh`
3. Add relevant dashboard panels
4. Update alerting rules if needed
5. Test the configuration locally before deploying
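For example, adding a hypothetical `reporting_service` (the name and port are made up for illustration) would start with a new entry like this:

```yaml
services:
  reporting_service:
    host: "host.docker.internal"
    port: 3005            # hypothetical port
    enabled: true
```

then re-run the generator script and reload Prometheus.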