Changelog
=========

All notable changes to Scrapy Item Ingest will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]
------------

### Added

- Advanced batch processing with configurable batch sizes
- Connection pooling optimization for high-throughput scenarios
- Redis-based job queue system for distributed crawling
- Comprehensive monitoring and metrics collection
- Webhook notifications for real-time updates
- Custom serialization handlers for complex data types

### Deprecated

- Legacy table naming conventions (will be removed in v2.0.0)

### Security

- Enhanced database connection security with SSL support
- Input validation improvements to prevent SQL injection

[0.2.4] - 2025-11-24
--------------------

### Fixed

- Updated the logic for getting error logs and for insertions in `scrapy_item_ingest.pipelines.items`

[0.2.3] - 2025-11-21
--------------------

### Fixed

- Removed extra logging from `scrapy_item_ingest.pipelines.items`

[0.2.2] - 2025-11-21
--------------------

### Changed

- Simplified the logging extension to only attach to the root logger, which prevents log duplication and captures all log sources.
- Removed complex and unnecessary logging settings (`LOG_DB_LOGGERS`, `LOG_DB_EXCLUDE_LOGGERS`, `LOG_DB_EXCLUDE_PATTERNS`, `LOG_DB_DEDUP_TTL`, `LOG_DB_CAPTURE_LEVEL`). The extension now relies on the standard Scrapy `LOG_LEVEL`.

### Fixed

- Resolved an issue where logs from the `root` logger were not being captured.
- Fixed a log duplication issue caused by attaching the database handler to multiple loggers in the same hierarchy.
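The duplication fixed in 0.2.2 follows from how Python's standard `logging` module propagates records up the logger hierarchy. The sketch below uses only the standard library to illustrate the effect; `ListHandler` is a hypothetical stand-in for the package's database handler, and the logger names are chosen for illustration rather than taken from the package's code.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical stand-in for a database log handler: keeps messages in a list."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())


def capture(attach_to):
    """Attach one handler to each named logger, emit one record, return what was stored."""
    handler = ListHandler()
    loggers = [logging.getLogger(name) for name in attach_to]
    for logger in loggers:
        logger.addHandler(handler)
    try:
        child = logging.getLogger("scrapy.core.engine")
        child.setLevel(logging.INFO)
        child.info("Spider opened")
    finally:
        # Detach again so repeated calls do not interfere with each other.
        for logger in loggers:
            logger.removeHandler(handler)
    return handler.records


# Handler on two loggers in the same hierarchy: the record reaches it twice.
duplicated = capture(["scrapy", "scrapy.core"])

# Handler on the root logger only ("" resolves to the root logger): one copy.
root_only = capture([""])

print(len(duplicated), len(root_only))  # prints: 2 1
```

Because every logger propagates to the root by default, a single root-level handler sees each record exactly once, which is consistent with the behavior described for 0.2.2.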
[0.2.0] - 2025-11-11
--------------------

### Added

- Automatic DSN normalization for PostgreSQL `DB_URL` to safely handle special characters in credentials (e.g., `@`, `$`)
- Unified `DatabaseConnection` singleton API used across pipelines and extensions (`connect/execute/commit/rollback/close`)
- Logging extension now capable of capturing Scrapy framework logs (Zyte-like) in addition to spider logs
- Console-like formatter in DB logs honoring `LOG_FORMAT` and `LOG_DATEFORMAT`
- Fine-grained logging controls for DB persistence:
  - Allowlist by logger namespaces via `LOG_DB_LOGGERS`
  - Exclude by namespaces via `LOG_DB_EXCLUDE_LOGGERS`
  - Exclude by message substrings via `LOG_DB_EXCLUDE_PATTERNS`
  - Batch size via `LOG_DB_BATCH_SIZE`
  - Duplicate suppression via `LOG_DB_DEDUP_TTL`

### Changed

- Attached the DB log handler only to the spider’s base logger and the top-level `scrapy` logger to avoid propagation duplicates
- Applied optional `LOG_DB_CAPTURE_LEVEL` (default falls back to `LOG_DB_LEVEL`) to increase capture detail for DB without changing console verbosity
- Normalized schema for logs to consistently use `level` (instead of `type`)
- Simplified and streamlined documentation and README; reduced pages to essentials

### Fixed

- Import errors in external integrations expecting `DatabaseConnection` by providing a compatibility alias to `DBConnection`
- Eliminated repeated DB logging errors by throttling after the first failure
- Reduced noise by default: excluded `scrapy.core.scraper` and messages containing `Scraped from <` from DB persistence

### Settings (new/updated)

- `LOG_DB_LEVEL`: minimum level stored in DB (default: `DEBUG`)
- `LOG_DB_CAPTURE_LEVEL`: capture level for attached loggers (DB only)
- `LOG_DB_LOGGERS`: additional allowed logger prefixes (defaults always include `[spider.name, 'scrapy']`)
- `LOG_DB_EXCLUDE_LOGGERS`: logger namespaces to exclude (default: `['scrapy.core.scraper']`)
- `LOG_DB_EXCLUDE_PATTERNS`: message substrings to exclude (default: `['Scraped from <']`)
- `LOG_DB_BATCH_SIZE`: DB insert batch size
- `LOG_DB_DEDUP_TTL`: seconds to suppress duplicates

[0.1.1] - 2025-07-21
--------------------

### Added

- **Core Pipeline Functionality**
  - `DbInsertPipeline` - Combined pipeline for items and requests
  - `ItemsPipeline` - Standalone items processing pipeline
  - `RequestsPipeline` - Standalone requests tracking pipeline
  - `BasePipeline` - Base class for custom implementations
- **Database Integration**
  - PostgreSQL database support with automatic table creation
  - JSONB storage for flexible item data structure
  - Request tracking with parent-child relationships
  - Performance optimized with proper indexing
- **Logging Extension**
  - `LoggingExtension` - Comprehensive spider event logging
  - Real-time log storage in database
  - Support for all Python log levels
  - Spider lifecycle event tracking
- **Configuration Management**
  - Flexible settings validation
  - Environment-based configuration
  - Multi-environment support (dev, staging, production)
  - Automatic fallback to spider name for job IDs
- **Database Schema**
  - `job_items` table for scraped data storage
  - `job_requests` table for request/response tracking
  - `job_logs` table for spider events and messages
  - Foreign key relationships and proper constraints
- **Utility Functions**
  - Item serialization with datetime and Decimal support
  - Request fingerprinting for deduplication
  - Database connection management
  - Data validation and cleaning utilities
- **Production Features**
  - Docker container support
  - Kubernetes deployment configurations
  - Monitoring and alerting integration
  - High-availability database setup
- **Developer Tools**
  - Comprehensive test suite with pytest
  - Development environment setup
  - Code quality tools (Black, flake8, mypy)
  - Pre-commit hooks configuration

### Documentation

- Complete ReadTheDocs documentation
- Installation and quick start guides
- API reference for all components
- Production deployment examples
- Troubleshooting guide
- Contributing guidelines

### Technical Details

**Database Schema:**

.. code-block:: sql

   -- Items table
   CREATE TABLE job_items (
       id BIGSERIAL PRIMARY KEY,
       item JSONB NOT NULL,
       created_at TIMESTAMPTZ NOT NULL,
       job_id INTEGER NOT NULL
   );

   -- Requests table
   CREATE TABLE job_requests (
       id BIGSERIAL PRIMARY KEY,
       url VARCHAR(200) NOT NULL,
       method VARCHAR(10) NOT NULL,
       status_code INTEGER,
       response_time FLOAT,
       fingerprint VARCHAR(255),
       parent_url VARCHAR(255),
       created_at TIMESTAMPTZ NOT NULL,
       job_id INTEGER NOT NULL,
       parent_id BIGINT,
       FOREIGN KEY (parent_id) REFERENCES job_requests(id)
   );

   -- Logs table
   CREATE TABLE job_logs (
       id BIGSERIAL PRIMARY KEY,
       job_id INTEGER NOT NULL,
       type VARCHAR(50) NOT NULL,
       message TEXT NOT NULL,
       created_at TIMESTAMPTZ NOT NULL
   );

**Configuration Options:**

- `DB_URL` - PostgreSQL connection string (required)
- `CREATE_TABLES` - Auto-create tables (default: True)
- `JOB_ID` - Job identifier (default: spider name)
- `DB_SETTINGS` - Advanced database configuration
- `TABLE_NAMES` - Custom table name mapping

**Pipeline Integration:**

.. code-block:: python

   # Basic setup
   ITEM_PIPELINES = {
       'scrapy_item_ingest.DbInsertPipeline': 300,
   }

   EXTENSIONS = {
       'scrapy_item_ingest.LoggingExtension': 500,
   }

**Key Features:**

- **Real-time Data Storage**: Items and requests stored as they're processed
- **Flexible Data Structure**: JSONB storage supports any item structure
- **Request Tracking**: Complete request/response lifecycle tracking
- **Performance Optimized**: Connection pooling and batch processing
- **Production Ready**: Docker, Kubernetes, and monitoring support
- **Developer Friendly**: Comprehensive documentation and testing

### Breaking Changes

None (initial release)

### Migration Guide

Not applicable (initial release)

---

## Release Notes Template

For future releases, use this template:

```markdown
[X.Y.Z] - YYYY-MM-DD
--------------------

### Added
- New features and capabilities

### Changed
- Changes to existing functionality

### Deprecated
- Features marked for removal in future versions

### Removed
- Features removed in this version

### Fixed
- Bug fixes and corrections

### Security
- Security-related improvements

### Breaking Changes
- Changes that break backward compatibility

### Migration Guide
- Instructions for upgrading from previous versions
```

## Changelog Guidelines

### Categories

- **Added** - for new features
- **Changed** - for changes in existing functionality
- **Deprecated** - for soon-to-be removed features
- **Removed** - for now removed features
- **Fixed** - for any bug fixes
- **Security** - in case of vulnerabilities

### Format

- Use past tense for all entries
- Include issue/PR references where applicable
- Group related changes under subheadings
- Provide migration instructions for breaking changes
- Include code examples for significant new features

### Examples

````markdown
### Added
- New `BatchProcessor` class for high-performance item processing (#123)
- Support for MySQL databases in addition to PostgreSQL (#145)
- Real-time metrics collection via Prometheus integration (#167)

### Changed
- Improved error handling in database connections with automatic retry (#134)
- Updated default batch size from 100 to 500 items for better performance (#156)

### Fixed
- Fixed memory leak in long-running spiders (#142)
- Resolved issue with Unicode characters in item serialization (#158)

### Breaking Changes
- Renamed `table_prefix` setting to `table_names` for consistency
- Changed default job ID format from timestamp to spider name

**Migration:**
Update your settings.py:

```python
# Old
TABLE_PREFIX = 'custom_'

# New
TABLE_NAMES = {
    'items': 'custom_items',
    'requests': 'custom_requests',
    'logs': 'custom_logs'
}
```
````