Sprout Properties Scraping Architecture
A serverless scraping architecture built on Cloudflare Workers. It processes Skybox inventory data by scraping event listings, managing competitor purchases, and updating ticket prices in real time.
Overview
This project implements a distributed scraping system using Cloudflare Workers, leveraging:
- Durable Objects for persistent processing and state management
- KV Storage for efficient data caching and rate limiting
- Queues for reliable task scheduling and processing
- TypeScript for type safety and maintainability
For a comprehensive technical review of the architecture, design decisions, and areas for improvement, see docs/review.md.
Core Components
- Scraping Worker: Entry point handling scheduled tasks and HTTP requests
- Durable Objects: Manage persistent state and retries for inventory processing
- Services Layer:
  - FirecrawlService: Web scraping with structured actions
  - InventoryProcessorService: Core logic for data processing and price updates
  - MeilisearchService: Inventory retrieval and search
  - RateLimiterService: API call throttling with KV-backed storage
  - SkyboxService: Real-time inventory updates via Skybox API
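As a rough illustration of how the entry point might tie scheduled tasks and HTTP requests to a queue, here is a minimal sketch. It is not the project's actual code: the `INVENTORY_QUEUE` binding, the `ScrapeTask` shape, and `CRON_BATCH_LIMIT` are hypothetical names chosen for the example, and the runtime types are reduced to small stand-in interfaces so the snippet is self-contained.

```typescript
// Minimal stand-in for a Workers queue producer binding (assumed shape).
interface QueueLike<T> {
  send(message: T): Promise<void>;
}

// Hypothetical task message; the real message format may differ.
interface ScrapeTask {
  limit: number;
  source: "cron" | "http";
}

interface Env {
  INVENTORY_QUEUE: QueueLike<ScrapeTask>; // hypothetical binding name
}

const CRON_BATCH_LIMIT = 10; // assumed default batch size

// Both triggers funnel into the same enqueue step.
async function enqueueScrape(env: Env, task: ScrapeTask): Promise<void> {
  await env.INVENTORY_QUEUE.send(task);
}

// A real Worker module would end with `export default worker`.
const worker = {
  // Invoked by a cron trigger.
  async scheduled(_event: unknown, env: Env): Promise<void> {
    await enqueueScrape(env, { limit: CRON_BATCH_LIMIT, source: "cron" });
  },

  // Invoked by HTTP requests; only POST /trigger-scraping is handled here.
  async fetch(request: Request, env: Env): Promise<Response> {
    const { pathname } = new URL(request.url);
    if (request.method !== "POST" || pathname !== "/trigger-scraping") {
      return new Response("Not found", { status: 404 });
    }
    // Fall back to an empty object when the body is missing or not JSON.
    const body = (await request.json().catch(() => ({}))) as { limit?: number } | null;
    await enqueueScrape(env, { limit: body?.limit ?? CRON_BATCH_LIMIT, source: "http" });
    return new Response(JSON.stringify({ queued: true }), {
      status: 202,
      headers: { "Content-Type": "application/json" },
    });
  },
};
```

Routing both triggers through one enqueue function keeps the handlers thin, so the heavier work can live in the queue consumer and the services listed above.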
Features
- Scheduled Processing: Automated inventory processing via cron triggers
- Intelligent Rate Limiting: KV-backed throttling with exponential backoff
- Distributed Processing: Queue-based architecture for scalability
- State Management: Persistent state using Durable Objects
- Search Integration: Meilisearch for efficient inventory retrieval
- Benchmark Tools: Analysis of optimal scraping parameters
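The rate-limiting feature above combines a KV-backed counter with exponential backoff. The sketch below shows one way those two pieces could look; it is an assumption-laden illustration, not the RateLimiterService itself. `KVLike`, `MemoryKV`, `tryAcquire`, and `backoffDelayMs`, along with the base delay, cap, and fixed-window strategy, are all hypothetical.

```typescript
// Subset of the Workers KV API used by this sketch (assumed shape).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Fixed-window counter: allow at most `limit` calls per `windowSeconds`.
// KV is eventually consistent and has no atomic increment, so the limit
// is approximate under concurrent callers.
async function tryAcquire(
  kv: KVLike,
  key: string,
  limit: number,
  windowSeconds: number,
  nowMs: number = Date.now(),
): Promise<boolean> {
  const windowIndex = Math.floor(nowMs / (windowSeconds * 1000));
  const bucketKey = `${key}:${windowIndex}`;
  const used = Number((await kv.get(bucketKey)) ?? "0");
  if (used >= limit) return false;
  // Workers KV requires expirationTtl to be at least 60 seconds.
  await kv.put(bucketKey, String(used + 1), {
    expirationTtl: Math.max(60, windowSeconds * 2),
  });
  return true;
}

// In-memory stand-in for local experiments; ignores expiration.
class MemoryKV implements KVLike {
  private store = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null;
  }
  async put(key: string, value: string): Promise<void> {
    this.store.set(key, value);
  }
}
```

A caller would typically check `tryAcquire` before each API call and, when it returns false or the upstream API rejects the request, wait `backoffDelayMs(attempt)` before retrying. If strict limits matter, a Durable Object is likely a better home for the counter than KV, since it serializes access.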
Getting Started
Prerequisites
- Node.js (v18 or later)
- Wrangler CLI
- Cloudflare account with Workers, KV, and Durable Objects enabled
- API keys for:
  - Firecrawl
  - Meilisearch
  - Skybox
Installation
- Clone and install dependencies:
    git clone <repository-url>
    cd cloudflare-workers
    npm install
- Configure environment:
    cp .env.example .dev.vars
  Edit .dev.vars with your API keys:
  - FIRECRAWL_API_KEY
  - MEILISEARCH_HOST and MEILISEARCH_MASTER_KEY
  - SKYBOX_ACCOUNT, SKYBOX_API_TOKEN, SKYBOX_APPLICATION_TOKEN
- Set up Cloudflare resources:
    # Create KV namespaces
    npx wrangler kv:namespace create SCRAPING_STATE
    npx wrangler kv:namespace create RATE_LIMITER_KV
  Update the namespace IDs in wrangler.json.
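Since the Worker depends on several secrets from .dev.vars, it can help to fail fast when one is missing. The following is a small sketch of such a check, assuming the variables reach the Worker as string properties on its environment object; `missingVars` and the `SecretsEnv` type are hypothetical helpers, not part of the project.

```typescript
// Variable names taken from the configuration step above.
const REQUIRED_VARS = [
  "FIRECRAWL_API_KEY",
  "MEILISEARCH_HOST",
  "MEILISEARCH_MASTER_KEY",
  "SKYBOX_ACCOUNT",
  "SKYBOX_API_TOKEN",
  "SKYBOX_APPLICATION_TOKEN",
] as const;

type RequiredVar = (typeof REQUIRED_VARS)[number];
type SecretsEnv = Partial<Record<RequiredVar, string>>;

// Returns the names of variables that are absent or blank.
function missingVars(env: SecretsEnv): RequiredVar[] {
  return REQUIRED_VARS.filter((name) => (env[name] ?? "").trim() === "");
}
```

Calling `missingVars(env)` at the start of a handler and returning an error when the result is non-empty surfaces configuration mistakes before any scraping starts.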
Development
Start local development server:
npm run dev
Deployment
Deploy to Cloudflare Workers:
npm run deploy
Usage
Manual Processing
Trigger processing manually:
curl -X POST https://<worker-url>/trigger-scraping \
-H 'Content-Type: application/json' \
-d '{"limit": 10}'
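On the receiving side, the `limit` field from the request body needs validating before it is used. The sketch below shows one possible approach; the default of 10, the cap of 100, and the `parseTriggerPayload` name are assumptions made for illustration and may not match how the endpoint actually behaves.

```typescript
interface TriggerPayload {
  limit: number;
}

const DEFAULT_LIMIT = 10; // assumed default when `limit` is omitted
const MAX_LIMIT = 100; // assumed upper bound to keep one trigger bounded

// Validates the parsed JSON body of POST /trigger-scraping.
function parseTriggerPayload(body: unknown): TriggerPayload {
  if (typeof body !== "object" || body === null) {
    return { limit: DEFAULT_LIMIT };
  }
  const raw = (body as Record<string, unknown>).limit;
  if (raw === undefined) {
    return { limit: DEFAULT_LIMIT };
  }
  if (typeof raw !== "number" || !Number.isInteger(raw) || raw < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  // Clamp oversized requests instead of rejecting them.
  return { limit: Math.min(raw, MAX_LIMIT) };
}
```

With these assumptions, the curl request above would yield `{ limit: 10 }`, and a handler could map the `RangeError` to a 400 response.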
Documentation
- Technical Review (docs/review.md): Detailed architecture review and analysis