Recipe: Requests tracking
Track scheduled/received requests with parent linkage and response time.
1) Enable (settings.py)
ITEM_PIPELINES = {
'scrapy_item_ingest.RequestsPipeline': 300,
}
# Database
DB_URL = 'postgresql://user:password@localhost:5432/database'
# or discrete fields
# DB_HOST = 'localhost'
# DB_PORT = 5432
# DB_USER = 'user'
# DB_PASSWORD = 'password'
# DB_NAME = 'database'
CREATE_TABLES = True
2) What it logs
URL, method
status_code (when response received)
response_time (seconds)
fingerprint
parent_id / parent_url (when known)
3) Run
scrapy crawl your_spider
Expected
Rows in
job_requestswithparent_idfilled when parent known.Items table untouched in this recipe (no ItemsPipeline).
Tip
Start with a small crawl to verify parent linkage before scaling.