Recipe: Requests tracking

Track scheduled/received requests with parent linkage and response time.

1) Enable (settings.py)

ITEM_PIPELINES = {
    'scrapy_item_ingest.RequestsPipeline': 300,
}

# Database
DB_URL = 'postgresql://user:password@localhost:5432/database'
# or discrete fields
# DB_HOST = 'localhost'
# DB_PORT = 5432
# DB_USER = 'user'
# DB_PASSWORD = 'password'
# DB_NAME = 'database'

CREATE_TABLES = True

2) What it logs

  • URL, method

  • status_code (when response received)

  • response_time (seconds)

  • fingerprint

  • parent_id / parent_url (when known)

3) Run

scrapy crawl your_spider

Expected

  • Rows in job_requests with parent_id filled when parent known.

  • Items table untouched in this recipe (no ItemsPipeline).

Tip

Start with a small crawl to verify parent linkage before scaling.