Sprout Properties Scraping Architecture

A modern, serverless scraping architecture built on Cloudflare Workers that processes Skybox inventory data by scraping event listings, managing competitor purchases, and updating ticket prices in real time.

Overview

This project implements a distributed scraping system using Cloudflare Workers, leveraging:

  • Durable Objects for persistent processing and state management
  • KV Storage for efficient data caching and rate limiting
  • Queues for reliable task scheduling and processing
  • TypeScript for type safety and maintainability
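
Taken together, those bindings might be modeled as a typed `Env` interface. This is a minimal sketch: the stand-in interfaces replace the real types from `@cloudflare/workers-types`, and the Durable Object and Queue binding names here are hypothetical (only the KV namespaces and secret names appear later in this README).

```typescript
// Simplified stand-ins for types normally provided by @cloudflare/workers-types.
interface KVNamespaceLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}
interface DurableObjectNamespaceLike {
  idFromName(name: string): unknown;
}
interface QueueLike<T> {
  send(message: T): Promise<void>;
}

// The bindings described above: KV, Durable Objects, Queues, plus secrets.
interface Env {
  SCRAPING_STATE: KVNamespaceLike;                 // KV namespace created during setup
  RATE_LIMITER_KV: KVNamespaceLike;                // KV namespace created during setup
  INVENTORY_PROCESSOR: DurableObjectNamespaceLike; // hypothetical DO binding name
  SCRAPING_QUEUE: QueueLike<{ eventId: string }>;  // hypothetical queue binding name
  FIRECRAWL_API_KEY: string;
  MEILISEARCH_HOST: string;
  MEILISEARCH_MASTER_KEY: string;
  SKYBOX_ACCOUNT: string;
  SKYBOX_API_TOKEN: string;
  SKYBOX_APPLICATION_TOKEN: string;
}

// The secrets a local .dev.vars file must provide (see Installation below).
const REQUIRED_SECRETS: ReadonlyArray<keyof Env> = [
  "FIRECRAWL_API_KEY",
  "MEILISEARCH_HOST",
  "MEILISEARCH_MASTER_KEY",
  "SKYBOX_ACCOUNT",
  "SKYBOX_API_TOKEN",
  "SKYBOX_APPLICATION_TOKEN",
];
```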

For a comprehensive technical review of the architecture, design decisions, and areas for improvement, please refer to docs/review.md.

Core Components

  • Scraping Worker: Entry point handling scheduled tasks and HTTP requests
  • Durable Objects: Manage persistent state and retries for inventory processing
  • Services Layer:
    • FirecrawlService: Web scraping with structured actions
    • InventoryProcessorService: Core logic for data processing and price updates
    • MeilisearchService: Inventory retrieval and search
    • RateLimiterService: API call throttling with KV-backed storage
    • SkyboxService: Real-time inventory updates via Skybox API
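
As an illustration of the services layer, a KV-backed rate limiter along the lines of RateLimiterService can count calls per key in KV with a TTL window. The class and method names below are illustrative, not the project's actual API:

```typescript
// Minimal KV interface, matching the shape of a Workers KV binding.
interface KVNamespaceLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

class RateLimiter {
  constructor(
    private kv: KVNamespaceLike,
    private limit: number,     // max calls per window
    private windowSec: number, // window length in seconds (Workers KV TTL minimum is 60)
  ) {}

  // Returns true if the call is allowed, false if the caller should back off.
  async tryAcquire(key: string): Promise<boolean> {
    const raw = await this.kv.get(`rl:${key}`);
    const count = raw ? parseInt(raw, 10) : 0;
    if (count >= this.limit) return false;
    // Note: get-then-put is not atomic and KV is eventually consistent, so
    // this is best-effort throttling — usually acceptable for scraping.
    await this.kv.put(`rl:${key}`, String(count + 1), {
      expirationTtl: this.windowSec,
    });
    return true;
  }
}

// In-memory KV stand-in, handy for local tests.
class MemoryKV implements KVNamespaceLike {
  private store = new Map<string, string>();
  async get(key: string) { return this.store.get(key) ?? null; }
  async put(key: string, value: string) { this.store.set(key, value); }
}
```

The best-effort semantics are a deliberate trade-off: exact counting would need a Durable Object, while KV keeps the hot path cheap.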

Features

  • Scheduled Processing: Automated inventory processing via cron triggers
  • Intelligent Rate Limiting: KV-backed throttling with exponential backoff
  • Distributed Processing: Queue-based architecture for scalability
  • State Management: Persistent state using Durable Objects
  • Search Integration: Meilisearch for efficient inventory retrieval
  • Benchmark Tools: Analysis of optimal scraping parameters
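
The exponential backoff mentioned above can be as simple as doubling a base delay per retry attempt, with an upper cap. A sketch (the constants are illustrative, not the project's actual values):

```typescript
// attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... capped at 60s.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// Retry an async operation, sleeping backoffDelayMs between failures.
async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
  throw lastErr;
}
```

Production implementations often add jitter to the delay so concurrent workers don't retry in lockstep.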

Getting Started

Prerequisites

  • Node.js (v18 or later)
  • Wrangler CLI
  • Cloudflare account with Workers, KV, and Durable Objects enabled
  • API keys for:
    • Firecrawl
    • Meilisearch
    • Skybox

Installation

  1. Clone and install dependencies:

    git clone <repository-url>
    cd cloudflare-workers
    npm install
  2. Configure environment:

    cp .env.example .dev.vars

    Edit .dev.vars with your API keys:

    • FIRECRAWL_API_KEY
    • MEILISEARCH_HOST and MEILISEARCH_MASTER_KEY
    • SKYBOX_ACCOUNT, SKYBOX_API_TOKEN, SKYBOX_APPLICATION_TOKEN
  3. Set up Cloudflare resources:

    # Create KV namespaces
    npx wrangler kv:namespace create SCRAPING_STATE
    npx wrangler kv:namespace create RATE_LIMITER_KV

    Update the namespace IDs in wrangler.json.
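
    The create commands print the new namespace IDs; the corresponding entries in wrangler.json typically look like the following sketch (IDs are placeholders, and the real file will carry additional fields such as Durable Object bindings and cron triggers):

    ```json
    {
      "kv_namespaces": [
        { "binding": "SCRAPING_STATE", "id": "<id from wrangler output>" },
        { "binding": "RATE_LIMITER_KV", "id": "<id from wrangler output>" }
      ]
    }
    ```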

Development

Start local development server:

npm run dev

Deployment

Deploy to Cloudflare Workers:

npm run deploy

Usage

Manual Processing

Trigger processing manually:

curl -X POST https://<worker-url>/trigger-scraping \
  -H 'Content-Type: application/json' \
  -d '{"limit": 10}'
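
On the Worker side, the fetch handler might route this request roughly as follows. The route and the `{"limit": ...}` body come from the curl example above; the handler shape, the `parseLimit` helper, and the default/cap values are hypothetical:

```typescript
// Clamp the caller-supplied limit to a sane range; fall back to a default.
function parseLimit(body: unknown, fallback = 10, max = 100): number {
  const limit = (body as { limit?: unknown })?.limit;
  if (typeof limit !== "number" || !Number.isInteger(limit) || limit < 1) {
    return fallback;
  }
  return Math.min(limit, max);
}

// Minimal request shape so the sketch stands alone (a real Worker gets a Request).
interface RequestLike {
  method: string;
  url: string;
  json(): Promise<unknown>;
}

async function handleRequest(req: RequestLike): Promise<{ status: number; body: string }> {
  const path = new URL(req.url).pathname;
  if (req.method === "POST" && path === "/trigger-scraping") {
    const limit = parseLimit(await req.json());
    // Here the real Worker would enqueue work or call the Durable Object.
    return { status: 202, body: JSON.stringify({ enqueued: limit }) };
  }
  return { status: 404, body: "not found" };
}
```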

Documentation