sprout_monitoring Architecture
Comprehensive monitoring stack using Prometheus, Grafana, and Loki.
Overview
The sprout_monitoring repository provides:
- Prometheus for metrics collection
- Grafana for visualization and dashboards
- Loki for centralized logging
- Alertmanager for alerting
Architecture
Components
Prometheus
Purpose: Metrics collection and storage
Configuration:
- Scrapes HTTP endpoints (/metrics)
- Collects from Pushgateway
- Stores time-series data
- Alert rule evaluation
Targets:
- sprout_backend API (port 3000)
- sprout_etl API (port 3001)
- Pushgateway (port 9091) for worker metrics
Grafana
Purpose: Visualization and dashboards
Features:
- Pre-configured dashboards
- Real-time metrics visualization
- Log exploration
- Alert management
Dashboards:
- Sprout System Health
- ETL System Monitoring
- Inventory Monitoring
- TickOps Quota Monitoring
Loki
Purpose: Centralized logging
Features:
- Aggregates logs from services
- Log querying and exploration
- Integration with Grafana
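As a rough illustration of log querying, the sketch below builds a request URL for Loki's `query_range` HTTP endpoint. The base URL is the local Loki address listed in this document. The `service` label and its value are assumptions, since the label scheme used by the log shippers is not described here.

```python
from urllib.parse import urlencode

LOKI_BASE_URL = "http://localhost:3100"  # local Loki address from this document


def build_loki_query_url(logql: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Return a URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({
        "query": logql,     # LogQL expression
        "start": start_ns,  # nanosecond Unix timestamp
        "end": end_ns,      # nanosecond Unix timestamp
        "limit": limit,     # maximum number of log lines to return
    })
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{params}"


# Hypothetical label: assumes log streams carry a `service` label.
url = build_loki_query_url(
    '{service="sprout_etl"} |= "error"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_600_000_000_000,
)
```

The same LogQL expression can be pasted into Grafana's Explore view once Loki is configured as a datasource.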
Pushgateway
Purpose: Metrics from batch jobs and workers
Usage:
- Background workers push metrics
- Batch jobs report completion
- Prometheus scrapes Pushgateway
Alertmanager
Purpose: Alert routing and notification
Features:
- Alert deduplication
- Grouping and routing
- Notification channels (email, Slack, etc.)
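To make deduplication and grouping concrete, here is a conceptual sketch. It is not Alertmanager's implementation, only an illustration of the idea: alerts with identical label sets collapse into one, and the survivors are bucketed by a chosen set of labels. The alert names and the `service` label are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Group alerts by the given labels, dropping exact duplicate label sets."""
    groups = defaultdict(list)
    seen = set()
    for alert in alerts:
        fingerprint = tuple(sorted(alert["labels"].items()))
        if fingerprint in seen:
            continue  # identical label set already recorded: deduplicate
        seen.add(fingerprint)
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts for illustration only.
alerts = [
    {"labels": {"alertname": "ServiceDown", "service": "sprout_etl"}},
    {"labels": {"alertname": "ServiceDown", "service": "sprout_etl"}},  # duplicate
    {"labels": {"alertname": "ServiceDown", "service": "sprout_backend"}},
    {"labels": {"alertname": "HighErrorRate", "service": "sprout_backend"}},
]
grouped = group_alerts(alerts, group_by=["alertname"])
```

Grouping by `alertname` would send one notification per alert type rather than one per affected service.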
Service Discovery
Services are configured in config/services.yaml:
services:
  etl_fastify:
    host: "host.docker.internal"
    port: 3001
    enabled: true
  nestjs_monolith:
    host: "host.docker.internal"
    port: 3000
    enabled: true
  nestjs_worker:
    host: "host.docker.internal"
    port: 4001
    enabled: false
    pushgateway: true
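A minimal sketch of how these entries could be turned into scrape targets. It mirrors the service entries as a plain dict to avoid a YAML dependency. It assumes that `pushgateway: true` belongs to `nestjs_worker` and means the service pushes metrics instead of being scraped directly; the `Service` class and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    host: str
    port: int
    enabled: bool
    pushgateway: bool = False  # assumption: True means "pushes via Pushgateway"


# Mirrors config/services.yaml as a dict (assumed nesting for `pushgateway`).
SERVICES = {
    "etl_fastify": {"host": "host.docker.internal", "port": 3001, "enabled": True},
    "nestjs_monolith": {"host": "host.docker.internal", "port": 3000, "enabled": True},
    "nestjs_worker": {
        "host": "host.docker.internal", "port": 4001,
        "enabled": False, "pushgateway": True,
    },
}


def load_services(raw):
    """Convert the raw mapping into typed Service records."""
    return [Service(name=name, **fields) for name, fields in raw.items()]


def scrape_targets(services):
    """Return host:port targets for enabled services that are scraped directly."""
    return [f"{s.host}:{s.port}" for s in services if s.enabled and not s.pushgateway]


targets = scrape_targets(load_services(SERVICES))
```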
Metrics Collection
HTTP Metrics
Services expose Prometheus metrics at /metrics:
- Request duration
- Success rates
- Error counts
- Business metrics
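For orientation, the sketch below renders one counter in the Prometheus text exposition format, which is what a `/metrics` endpoint returns. In practice a client library would generate this. The metric name and labels are hypothetical, and label-value escaping is left out for brevity.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric for illustration only.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [
        ({"method": "GET", "status": "200"}, 42),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
```

Success rates and error counts can then be derived at query time from labelled counters like this one, rather than being exported as separate metrics.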
Pushgateway Metrics
Workers push metrics to Pushgateway:
- Job processing duration
- Job success/failure counts
- Queue sizes
- Active jobs
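A worker-side push could look roughly like the sketch below, which builds (but does not send) a request against the Pushgateway HTTP API. The port comes from this document. The host, job name, instance name, and metric names are assumptions.

```python
from urllib.parse import quote
from urllib.request import Request

PUSHGATEWAY_URL = "http://localhost:9091"  # port from this document; host is assumed


def build_push_request(job, instance, metrics):
    """Build a PUT request that replaces all metrics in the job/instance group."""
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    url = (
        f"{PUSHGATEWAY_URL}/metrics/job/{quote(job, safe='')}"
        f"/instance/{quote(instance, safe='')}"
    )
    return Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical worker metrics for illustration only.
request = build_push_request(
    "nestjs_worker", "worker-1", {"job_queue_size": 12, "jobs_active": 3}
)
# To send it: urllib.request.urlopen(request)
```

`PUT` replaces the whole metric group for that job and instance, while `POST` would replace only metrics with matching names. Which one fits depends on whether stale metrics from earlier pushes should disappear.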
Dashboards
Sprout System Health
URL: /d/sprout-system-health
Panels:
- Service status indicators
- Request rates and success rates
- Response time percentiles
- Error rates by service
- Job queue sizes
- Worker metrics
- Business metrics
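Response time percentiles are commonly estimated from histogram buckets rather than from raw timings. The sketch below shows the general idea, linear interpolation inside the bucket that contains the target rank, which is the same idea as PromQL's `histogram_quantile()`. The bucket bounds and counts are made up, and the actual dashboard queries may differ.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have an upper bound of +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-duration buckets in seconds (cumulative counts).
BUCKETS = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, BUCKETS)
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the bucket bounds are spaced around the latencies of interest.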
ETL System Monitoring
URL: /d/etl-system-monitoring
Panels:
- ETL pipeline health
- Processing rates
- Error tracking
- Queue metrics
Inventory Monitoring
URL: /d/inventory-monitoring
Panels:
- Inventory counts
- Matching rates
- Price changes
- Purchase processing
Alerting
Alert Rules
Configured in Grafana (not Prometheus):
- Service down alerts
- High error rate alerts
- Performance degradation
- Worker health issues
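The logic behind a high error rate alert can be sketched as follows: fire only when the error rate stays above a threshold for several consecutive evaluations, so a single spike does not page anyone. The threshold, window length, and sample data are assumptions and not values taken from the actual Grafana rules.

```python
def error_rate(errors, total):
    """Return the fraction of failed requests, or 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_fire(samples, threshold, min_consecutive):
    """Fire when the last `min_consecutive` samples all exceed the threshold.

    `samples` is a chronological list of (error_count, request_count) pairs.
    """
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(error_rate(errors, total) > threshold for errors, total in recent)


# Hypothetical evaluation history: 1%, 8%, 9%, 12% error rates.
history = [(1, 100), (8, 100), (9, 100), (12, 100)]
fires_after_three = should_fire(history, threshold=0.05, min_consecutive=3)
fires_after_four = should_fire(history, threshold=0.05, min_consecutive=4)
```

Requiring more consecutive breaches trades detection speed for fewer false alarms.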
Notification Channels
- Slack webhooks
- PagerDuty
- Custom webhooks
Configuration
Prometheus Config
Generated from config/services.yaml:
./scripts/generate-prometheus-config.sh
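The script's internals are not shown here. As a hedged illustration of the kind of output such a generator might produce, the sketch below builds a Prometheus configuration with one static scrape job per service. The function name, the 15s interval, and the job naming are assumptions. The output is JSON only to avoid a YAML dependency.

```python
import json


def build_prometheus_config(targets_by_job, scrape_interval="15s"):
    """Build a Prometheus config dict with one static scrape job per entry."""
    return {
        "global": {"scrape_interval": scrape_interval},
        "scrape_configs": [
            {
                "job_name": job,
                "metrics_path": "/metrics",
                "static_configs": [{"targets": targets}],
            }
            for job, targets in targets_by_job.items()
        ],
    }


# Enabled services from config/services.yaml, used as example input.
config = build_prometheus_config({
    "etl_fastify": ["host.docker.internal:3001"],
    "nestjs_monolith": ["host.docker.internal:3000"],
})
print(json.dumps(config, indent=2))
```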
Grafana Provisioning
- Datasources: grafana/provisioning/datasources/
- Dashboards: grafana/provisioning/dashboards/
- Alerting: grafana/provisioning/alerting/
Deployment
Local Development
docker-compose up -d
Production
./scripts/deploy-monitoring-droplet.sh
This script:
- Creates DigitalOcean droplet
- Sets up Docker
- Configures Cloudflare Tunnel
- Deploys monitoring stack
Access
Local
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- Loki: http://localhost:3100
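A quick way to confirm the local stack is up is to probe each component's health endpoint. The sketch below uses the standard health paths of Grafana (`/api/health`), Prometheus (`/-/healthy`), and Loki (`/ready`) on the local addresses listed here. The helper names are hypothetical.

```python
from urllib.error import URLError
from urllib.request import urlopen

HEALTH_ENDPOINTS = {
    "grafana": "http://localhost:3000/api/health",
    "prometheus": "http://localhost:9090/-/healthy",
    "loki": "http://localhost:3100/ready",
}


def is_up(url, timeout=2.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        return False  # connection refused, timeout, or non-2xx response


def check_all(endpoints=HEALTH_ENDPOINTS, probe=is_up):
    """Probe every endpoint and return a name -> bool status map."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_all()` after `docker-compose up -d` returns a status map per component. The `probe` parameter exists so the check can be exercised without a running stack.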
Production
- Access via Cloudflare Tunnel
- SSL certificates configured
- Authentication required
Maintenance
Backup Configuration
# Back up configs (create the target directory first)
mkdir -p backups
cp -r config backups/
cp -r grafana/provisioning backups/
Update Dashboards
- Edit dashboard JSON in grafana/provisioning/dashboards/
- Restart Grafana or reload provisioning
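Before restarting Grafana, it can help to confirm that the edited files still parse. The sketch below is a minimal check, assuming each dashboard is a `.json` file directly inside the provisioning directory with top-level `uid` and `title` fields. The function names are hypothetical.

```python
import json
from pathlib import Path


def validate_dashboard(path):
    """Return a list of problems found in one dashboard JSON file."""
    try:
        dashboard = json.loads(Path(path).read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(dashboard, dict):
        return ["top-level value is not an object"]
    problems = []
    for field in ("uid", "title"):
        if not dashboard.get(field):
            problems.append(f"missing {field}")
    return problems


def validate_all(directory="grafana/provisioning/dashboards"):
    """Validate every *.json file in the directory; map path -> problems."""
    return {
        str(path): validate_dashboard(path)
        for path in sorted(Path(directory).glob("*.json"))
    }
```

An empty problem list for every file suggests the dashboards are at least structurally loadable. It does not check panel queries or datasource references.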
Review Alerts
- Adjust thresholds based on historical data
- Test alert channels regularly
- Review alert noise
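One simple way to review alert noise is to count how often each alert fired over a period and look at the top offenders. The sketch below assumes a hypothetical event format, a list of dicts with `alertname` and `state` keys, which would need to be adapted to however alert history is actually exported.

```python
from collections import Counter


def noisiest_alerts(events, top_n=3):
    """Count firings per alert name and return the most frequent ones."""
    firing = (event["alertname"] for event in events if event["state"] == "firing")
    return Counter(firing).most_common(top_n)


# Hypothetical alert history for illustration only.
events = [
    {"alertname": "HighErrorRate", "state": "firing"},
    {"alertname": "HighErrorRate", "state": "resolved"},
    {"alertname": "HighErrorRate", "state": "firing"},
    {"alertname": "ServiceDown", "state": "firing"},
]
top = noisiest_alerts(events)
```

Alerts that fire often but rarely lead to action are candidates for a higher threshold or a longer evaluation window.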