Changelog
All notable changes to Scrapy Item Ingest will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
### Added - Advanced batch processing with configurable batch sizes - Connection pooling optimization for high-throughput scenarios - Redis-based job queue system for distributed crawling - Comprehensive monitoring and metrics collection - Webhook notifications for real-time updates - Custom serialization handlers for complex data types
### Deprecated - Legacy table naming conventions (will be removed in v2.0.0)
### Security - Enhanced database connection security with SSL support - Input validation improvements to prevent SQL injection
[0.2.4] - 2025-11-24
### Fixed - Update getting error logs and insertions logic [scrapy_item_ingest.pipelines.items]
[0.2.3] - 2025-11-21
### Fixed - Removed extra loging from [scrapy_item_ingest.pipelines.items]
[0.2.2] - 2025-11-21
### Changed - Simplified the logging extension to only attach to the root logger, which prevents log duplication and captures all log sources. - Removed complex and unnecessary logging settings (LOG_DB_LOGGERS, LOG_DB_EXCLUDE_LOGGERS, LOG_DB_EXCLUDE_PATTERNS, LOG_DB_DEDUP_TTL, LOG_DB_CAPTURE_LEVEL). The extension now relies on the standard Scrapy LOG_LEVEL.
### Fixed - Resolved an issue where logs from the root logger were not being captured. - Fixed a log duplication issue caused by attaching the database handler to multiple loggers in the same hierarchy.
[0.2.0] - 2025-11-11
### Added - Automatic DSN normalization for PostgreSQL DB_URL to safely handle special characters in credentials (e.g., @, $) - Unified DatabaseConnection singleton API used across pipelines and extensions (connect/execute/commit/rollback/close) - Logging extension now capable of capturing Scrapy framework logs (Zyte-like) in addition to spider logs - Console-like formatter in DB logs honoring LOG_FORMAT and LOG_DATEFORMAT - Fine-grained logging controls for DB persistence:
Allowlist by logger namespaces via LOG_DB_LOGGERS
Exclude by namespaces via LOG_DB_EXCLUDE_LOGGERS
Exclude by message substrings via LOG_DB_EXCLUDE_PATTERNS
Batch size via LOG_DB_BATCH_SIZE
Duplicate suppression via LOG_DB_DEDUP_TTL
### Changed - Attached the DB log handler only to the spider’s base logger and the top-level scrapy logger to avoid propagation duplicates - Applied optional LOG_DB_CAPTURE_LEVEL (default falls back to LOG_DB_LEVEL) to increase capture detail for DB without changing console verbosity - Normalized schema for logs to consistently use level (instead of type) - Simplified and streamlined documentation and README; reduced pages to essentials
### Fixed - Import errors in external integrations expecting DatabaseConnection by providing a compatibility alias to DBConnection - Eliminated repeated DB logging errors by throttling after the first failure - Reduced noise by default: excluded scrapy.core.scraper and messages containing Scraped from < from DB persistence
### Settings (new/updated) - LOG_DB_LEVEL: minimum level stored in DB (default: DEBUG) - LOG_DB_CAPTURE_LEVEL: capture level for attached loggers (DB only) - LOG_DB_LOGGERS: additional allowed logger prefixes (defaults always include [spider.name, ‘scrapy’]) - LOG_DB_EXCLUDE_LOGGERS: logger namespaces to exclude (default: [‘scrapy.core.scraper’]) - LOG_DB_EXCLUDE_PATTERNS: message substrings to exclude (default: [‘Scraped from <’]) - LOG_DB_BATCH_SIZE: DB insert batch size - LOG_DB_DEDUP_TTL: seconds to suppress duplicates
[0.1.1] - 2025-07-21
### Added - Core Pipeline Functionality
DbInsertPipeline - Combined pipeline for items and requests
ItemsPipeline - Standalone items processing pipeline
RequestsPipeline - Standalone requests tracking pipeline
BasePipeline - Base class for custom implementations
Database Integration - PostgreSQL database support with automatic table creation - JSONB storage for flexible item data structure - Request tracking with parent-child relationships - Performance optimized with proper indexing
Logging Extension - LoggingExtension - Comprehensive spider event logging - Real-time log storage in database - Support for all Python log levels - Spider lifecycle event tracking
Configuration Management - Flexible settings validation - Environment-based configuration - Multi-environment support (dev, staging, production) - Automatic fallback to spider name for job IDs
Database Schema - job_items table for scraped data storage - job_requests table for request/response tracking - job_logs table for spider events and messages - Foreign key relationships and proper constraints
Utility Functions - Item serialization with datetime and Decimal support - Request fingerprinting for deduplication - Database connection management - Data validation and cleaning utilities
Production Features - Docker container support - Kubernetes deployment configurations - Monitoring and alerting integration - High-availability database setup
Developer Tools - Comprehensive test suite with pytest - Development environment setup - Code quality tools (Black, flake8, mypy) - Pre-commit hooks configuration
### Documentation - Complete ReadTheDocs documentation - Installation and quick start guides - API reference for all components - Production deployment examples - Troubleshooting guide - Contributing guidelines
### Technical Details
Database Schema:
-- Items table
CREATE TABLE job_items (
id BIGSERIAL PRIMARY KEY,
item JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
job_id INTEGER NOT NULL
);
-- Requests table
CREATE TABLE job_requests (
id BIGSERIAL PRIMARY KEY,
url VARCHAR(200) NOT NULL,
method VARCHAR(10) NOT NULL,
status_code INTEGER,
response_time FLOAT,
fingerprint VARCHAR(255),
parent_url VARCHAR(255),
created_at TIMESTAMPTZ NOT NULL,
job_id INTEGER NOT NULL,
parent_id BIGINT,
FOREIGN KEY (parent_id) REFERENCES job_requests(id)
);
-- Logs table
CREATE TABLE job_logs (
id BIGSERIAL PRIMARY KEY,
job_id INTEGER NOT NULL,
type VARCHAR(50) NOT NULL,
message TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL
);
Configuration Options:
DB_URL - PostgreSQL connection string (required)
CREATE_TABLES - Auto-create tables (default: True)
JOB_ID - Job identifier (default: spider name)
DB_SETTINGS - Advanced database configuration
TABLE_NAMES - Custom table name mapping
Pipeline Integration:
# Basic setup
ITEM_PIPELINES = {
'scrapy_item_ingest.DbInsertPipeline': 300,
}
EXTENSIONS = {
'scrapy_item_ingest.LoggingExtension': 500,
}
Key Features:
Real-time Data Storage: Items and requests stored as they’re processed
Flexible Data Structure: JSONB storage supports any item structure
Request Tracking: Complete request/response lifecycle tracking
Performance Optimized: Connection pooling and batch processing
Production Ready: Docker, Kubernetes, and monitoring support
Developer Friendly: Comprehensive documentation and testing
### Breaking Changes None (initial release)
### Migration Guide Not applicable (initial release)
—
## Release Notes Template
For future releases, use this template:
```markdown [X.Y.Z] - YYYY-MM-DD ——————–
### Added - New features and capabilities
### Changed - Changes to existing functionality
### Deprecated - Features marked for removal in future versions
### Removed - Features removed in this version
### Fixed - Bug fixes and corrections
### Security - Security-related improvements
### Breaking Changes - Changes that break backward compatibility
### Migration Guide - Instructions for upgrading from previous versions ```
## Changelog Guidelines
### Categories
Added - for new features Changed - for changes in existing functionality Deprecated - for soon-to-be removed features Removed - for now removed features Fixed - for any bug fixes Security - in case of vulnerabilities
### Format
Use past tense for all entries
Include issue/PR references where applicable
Group related changes under subheadings
Provide migration instructions for breaking changes
Include code examples for significant new features
### Examples
``markdown ### Added - New `BatchProcessor class for high-performance item processing (#123) - Support for MySQL databases in addition to PostgreSQL (#145) - Real-time metrics collection via Prometheus integration (#167)
### Changed - Improved error handling in database connections with automatic retry (#134) - Updated default batch size from 100 to 500 items for better performance (#156)
### Fixed - Fixed memory leak in long-running spiders (#142) - Resolved issue with Unicode characters in item serialization (#158)
### Breaking Changes - Renamed table_prefix setting to table_names for consistency - Changed default job ID format from timestamp to spider name