Sprout Properties Scraping Architecture

A modern, serverless scraping architecture built on Cloudflare Workers that processes Skybox inventory data by scraping event listings, managing competitor purchases, and updating ticket prices in real time.

Overview

This project implements a distributed scraping system using Cloudflare Workers, leveraging:

  • Durable Objects for persistent processing and state management
  • KV Storage for efficient data caching and rate limiting
  • Queues for reliable task scheduling and processing
  • TypeScript for type safety and maintainability
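
Taken together, those bindings might be modeled as a typed `Env` interface. This is a minimal sketch: the stand-in interfaces replace the real types from `@cloudflare/workers-types`, and the Durable Object and Queue binding names here are hypothetical (only the KV namespaces and secret names appear later in this README).

```typescript
// Simplified stand-ins for types normally provided by @cloudflare/workers-types.
interface KVNamespaceLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}
interface DurableObjectNamespaceLike {
  idFromName(name: string): unknown;
}
interface QueueLike<T> {
  send(message: T): Promise<void>;
}

// The bindings described above: KV, Durable Objects, Queues, plus secrets.
interface Env {
  SCRAPING_STATE: KVNamespaceLike;                 // KV namespace created during setup
  RATE_LIMITER_KV: KVNamespaceLike;                // KV namespace created during setup
  INVENTORY_PROCESSOR: DurableObjectNamespaceLike; // hypothetical DO binding name
  SCRAPING_QUEUE: QueueLike<{ eventId: string }>;  // hypothetical queue binding name
  FIRECRAWL_API_KEY: string;
  MEILISEARCH_HOST: string;
  MEILISEARCH_MASTER_KEY: string;
  SKYBOX_ACCOUNT: string;
  SKYBOX_API_TOKEN: string;
  SKYBOX_APPLICATION_TOKEN: string;
}

// The secrets a local .dev.vars file must provide (see Installation below).
const REQUIRED_SECRETS: ReadonlyArray<keyof Env> = [
  "FIRECRAWL_API_KEY",
  "MEILISEARCH_HOST",
  "MEILISEARCH_MASTER_KEY",
  "SKYBOX_ACCOUNT",
  "SKYBOX_API_TOKEN",
  "SKYBOX_APPLICATION_TOKEN",
];
```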

For a comprehensive technical review of the architecture, design decisions, and areas for improvement, please refer to docs/review.md.

Core Components

  • Scraping Worker: Entry point handling scheduled tasks and HTTP requests
  • Durable Objects: Manage persistent state and retries for inventory processing
  • Services Layer:
    • FirecrawlService: Web scraping with structured actions
    • InventoryProcessorService: Core logic for data processing and price updates
    • MeilisearchService: Inventory retrieval and search
    • RateLimiterService: API call throttling with KV-backed storage
    • SkyboxService: Real-time inventory updates via Skybox API
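
As an illustration of the services layer, a KV-backed rate limiter along the lines of RateLimiterService can count calls per key in KV with a TTL window. The class and method names below are illustrative, not the project's actual API:

```typescript
// Minimal KV interface, matching the shape of a Workers KV binding.
interface KVNamespaceLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

class RateLimiter {
  constructor(
    private kv: KVNamespaceLike,
    private limit: number,     // max calls per window
    private windowSec: number, // window length in seconds (Workers KV TTL minimum is 60)
  ) {}

  // Returns true if the call is allowed, false if the caller should back off.
  async tryAcquire(key: string): Promise<boolean> {
    const raw = await this.kv.get(`rl:${key}`);
    const count = raw ? parseInt(raw, 10) : 0;
    if (count >= this.limit) return false;
    // Note: get-then-put is not atomic and KV is eventually consistent, so
    // this is best-effort throttling — usually acceptable for scraping.
    await this.kv.put(`rl:${key}`, String(count + 1), {
      expirationTtl: this.windowSec,
    });
    return true;
  }
}

// In-memory KV stand-in, handy for local tests.
class MemoryKV implements KVNamespaceLike {
  private store = new Map<string, string>();
  async get(key: string) { return this.store.get(key) ?? null; }
  async put(key: string, value: string) { this.store.set(key, value); }
}
```

The best-effort semantics are a deliberate trade-off: exact counting would need a Durable Object, while KV keeps the hot path cheap.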

Features

  • Scheduled Processing: Automated inventory processing via cron triggers
  • Intelligent Rate Limiting: KV-backed throttling with exponential backoff
  • Distributed Processing: Queue-based architecture for scalability
  • State Management: Persistent state using Durable Objects
  • Search Integration: Meilisearch for efficient inventory retrieval
  • Benchmark Tools: Analysis of optimal scraping parameters
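
The exponential backoff mentioned above can be as simple as doubling a base delay per retry attempt, with an upper cap. A sketch (the constants are illustrative, not the project's actual values):

```typescript
// attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at 60s.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Retry an async operation, sleeping backoffDelayMs between failures.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastErr;
}
```

Production implementations often add jitter to the delay so concurrent workers don't retry in lockstep.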

Getting Started

Prerequisites

  • Node.js (v18 or later)
  • Wrangler CLI
  • Cloudflare account with Workers, KV, and Durable Objects enabled
  • API keys for:
    • Firecrawl
    • Meilisearch
    • Skybox

Installation

  1. Clone and install dependencies:

    git clone <repository-url>
    cd cloudflare-workers
    npm install
  2. Configure environment:

    cp .env.example .dev.vars

    Edit .dev.vars with your API keys:

    • FIRECRAWL_API_KEY
    • MEILISEARCH_HOST and MEILISEARCH_MASTER_KEY
    • SKYBOX_ACCOUNT, SKYBOX_API_TOKEN, SKYBOX_APPLICATION_TOKEN
  3. Set up Cloudflare resources:

    # Create KV namespaces
    npx wrangler kv:namespace create SCRAPING_STATE
    npx wrangler kv:namespace create RATE_LIMITER_KV

    Update the namespace IDs in wrangler.json.
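
    The create commands print the new namespace IDs; the corresponding entries in wrangler.json typically look like the following sketch (IDs are placeholders, and the real file will carry additional fields such as Durable Object bindings and cron triggers):

    ```json
    {
      "kv_namespaces": [
        { "binding": "SCRAPING_STATE", "id": "<id from wrangler output>" },
        { "binding": "RATE_LIMITER_KV", "id": "<id from wrangler output>" }
      ]
    }
    ```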

Development

Start local development server:

npm run dev

Deployment

Deploy to Cloudflare Workers:

npm run deploy

Usage

Manual Processing

Trigger processing manually:

curl -X POST https://<worker-url>/trigger-scraping \
  -H 'Content-Type: application/json' \
  -d '{"limit": 10}'
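
On the Worker side, the fetch handler might route this request roughly as follows. The route and the `{"limit": ...}` body come from the curl example above; the handler shape, the `parseLimit` helper, and the default/cap values are hypothetical:

```typescript
// Clamp the caller-supplied limit to a sane range; fall back to a default.
function parseLimit(body: unknown, fallback = 10, max = 100): number {
  const limit = (body as { limit?: unknown })?.limit;
  if (typeof limit !== "number" || !Number.isInteger(limit) || limit < 1) {
    return fallback;
  }
  return Math.min(limit, max);
}

// Minimal request shape so the sketch stands alone (a real Worker gets a Request).
interface RequestLike {
  method: string;
  url: string;
  json(): Promise<unknown>;
}

async function handleRequest(req: RequestLike): Promise<{ status: number; body: string }> {
  const path = new URL(req.url).pathname;
  if (req.method === "POST" && path === "/trigger-scraping") {
    const limit = parseLimit(await req.json());
    // Here the real Worker would enqueue work or call the Durable Object.
    return { status: 202, body: JSON.stringify({ enqueued: limit }) };
  }
  return { status: 404, body: "not found" };
}
```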

Documentation