Changelog

All notable changes to Scrapy Item Ingest will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

### Added

- Advanced batch processing with configurable batch sizes
- Connection pooling optimization for high-throughput scenarios
- Redis-based job queue system for distributed crawling
- Comprehensive monitoring and metrics collection
- Webhook notifications for real-time updates
- Custom serialization handlers for complex data types

### Deprecated

- Legacy table naming conventions (will be removed in v2.0.0)

### Security

- Enhanced database connection security with SSL support
- Input validation improvements to prevent SQL injection

[0.2.4] - 2025-11-24

### Fixed

- Updated the error-log retrieval and insertion logic in scrapy_item_ingest.pipelines.items

[0.2.3] - 2025-11-21

### Fixed

- Removed extra logging from scrapy_item_ingest.pipelines.items

[0.2.2] - 2025-11-21

### Changed

- Simplified the logging extension to only attach to the root logger, which prevents log duplication and captures all log sources.
- Removed complex and unnecessary logging settings (LOG_DB_LOGGERS, LOG_DB_EXCLUDE_LOGGERS, LOG_DB_EXCLUDE_PATTERNS, LOG_DB_DEDUP_TTL, LOG_DB_CAPTURE_LEVEL). The extension now relies on the standard Scrapy LOG_LEVEL.

### Fixed

- Resolved an issue where logs from the root logger were not being captured.
- Fixed a log duplication issue caused by attaching the database handler to multiple loggers in the same hierarchy.
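The duplication issue described in this release follows from how Python's standard `logging` module propagates records up the logger hierarchy. The sketch below is illustrative only and is not the extension's code: `ListHandler` and `count_deliveries` are hypothetical names, and an in-memory list stands in for the database handler.

```python
import logging


class ListHandler(logging.Handler):
    """Stand-in for a database handler: collects messages in memory."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())


def count_deliveries(logger_names):
    """Attach one handler to each named logger, log a single record from a
    child logger, and return how many times the handler received it."""
    handler = ListHandler()
    loggers = [logging.getLogger(name) for name in logger_names]
    for lg in loggers:
        lg.addHandler(handler)

    child = logging.getLogger("scrapy.core.engine")
    previous_level = child.level
    child.setLevel(logging.INFO)
    try:
        child.info("Spider opened")
    finally:
        # Restore global logging state so the demo has no side effects.
        child.setLevel(previous_level)
        for lg in loggers:
            lg.removeHandler(handler)
    return len(handler.records)


# "" names the root logger. Attaching to both the root and "scrapy"
# delivers the same record twice; attaching to the root alone does not.
duplicated = count_deliveries(["", "scrapy"])
single = count_deliveries([""])
```

Because every logger propagates to the root by default, a single handler on the root logger sees each record exactly once, which is consistent with the simplification described above.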

[0.2.0] - 2025-11-11

### Added

- Automatic DSN normalization for PostgreSQL DB_URL to safely handle special characters in credentials (e.g., @, $)
- Unified DatabaseConnection singleton API used across pipelines and extensions (connect/execute/commit/rollback/close)
- Logging extension now capable of capturing Scrapy framework logs (Zyte-like) in addition to spider logs
- Console-like formatter in DB logs honoring LOG_FORMAT and LOG_DATEFORMAT
- Fine-grained logging controls for DB persistence:

  • Allowlist by logger namespaces via LOG_DB_LOGGERS

  • Exclude by namespaces via LOG_DB_EXCLUDE_LOGGERS

  • Exclude by message substrings via LOG_DB_EXCLUDE_PATTERNS

  • Batch size via LOG_DB_BATCH_SIZE

  • Duplicate suppression via LOG_DB_DEDUP_TTL
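To illustrate what DSN normalization for special characters in credentials can look like, here is a minimal sketch that percent-encodes the user and password parts of a connection URL. It is an assumption about the general technique, not the library's actual implementation; the function name `normalize_dsn` is hypothetical, and the sketch assumes the host, path, and query portions contain no `@`.

```python
from urllib.parse import quote, unquote


def normalize_dsn(db_url: str) -> str:
    """Percent-encode the credentials of a URL such as
    postgresql://user:password@host:5432/dbname (hypothetical helper)."""
    scheme, sep, rest = db_url.partition("://")
    if not sep or "@" not in rest:
        return db_url  # no credentials present, nothing to normalize

    # Split at the LAST "@" so that "@" characters inside the password
    # stay on the credentials side.
    userinfo, _, host_part = rest.rpartition("@")
    user, colon, password = userinfo.partition(":")

    # Decode first so credentials that are already encoded are not
    # encoded a second time.
    user = quote(unquote(user), safe="")
    password = quote(unquote(password), safe="")
    return f"{scheme}://{user}{colon}{password}@{host_part}"


raw = "postgresql://scraper:p@ss$word@localhost:5432/jobs"
normalized = normalize_dsn(raw)
```

One caveat of decoding before encoding: a password that literally contains a percent-escape sequence would be altered, so a real implementation may need a stricter rule.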

### Changed

- Attached the DB log handler only to the spider’s base logger and the top-level scrapy logger to avoid propagation duplicates
- Applied optional LOG_DB_CAPTURE_LEVEL (default falls back to LOG_DB_LEVEL) to increase capture detail for DB without changing console verbosity
- Normalized schema for logs to consistently use level (instead of type)
- Simplified and streamlined documentation and README; reduced pages to essentials

### Fixed

- Import errors in external integrations expecting DatabaseConnection by providing a compatibility alias to DBConnection
- Eliminated repeated DB logging errors by throttling after the first failure
- Reduced noise by default: excluded scrapy.core.scraper and messages containing Scraped from < from DB persistence

### Settings (new/updated)

- LOG_DB_LEVEL: minimum level stored in DB (default: DEBUG)
- LOG_DB_CAPTURE_LEVEL: capture level for attached loggers (DB only)
- LOG_DB_LOGGERS: additional allowed logger prefixes (defaults always include [spider.name, 'scrapy'])
- LOG_DB_EXCLUDE_LOGGERS: logger namespaces to exclude (default: ['scrapy.core.scraper'])
- LOG_DB_EXCLUDE_PATTERNS: message substrings to exclude (default: ['Scraped from <'])
- LOG_DB_BATCH_SIZE: DB insert batch size
- LOG_DB_DEDUP_TTL: seconds to suppress duplicates

[0.1.1] - 2025-07-21

### Added

Core Pipeline Functionality:

  • DbInsertPipeline - Combined pipeline for items and requests

  • ItemsPipeline - Standalone items processing pipeline

  • RequestsPipeline - Standalone requests tracking pipeline

  • BasePipeline - Base class for custom implementations

Database Integration:

  • PostgreSQL database support with automatic table creation

  • JSONB storage for flexible item data structure

  • Request tracking with parent-child relationships

  • Performance optimized with proper indexing

Logging Extension:

  • LoggingExtension - Comprehensive spider event logging

  • Real-time log storage in database

  • Support for all Python log levels

  • Spider lifecycle event tracking

Configuration Management:

  • Flexible settings validation

  • Environment-based configuration

  • Multi-environment support (dev, staging, production)

  • Automatic fallback to spider name for job IDs

Database Schema:

  • job_items table for scraped data storage

  • job_requests table for request/response tracking

  • job_logs table for spider events and messages

  • Foreign key relationships and proper constraints

Utility Functions:

  • Item serialization with datetime and Decimal support

  • Request fingerprinting for deduplication

  • Database connection management

  • Data validation and cleaning utilities

Production Features:

  • Docker container support

  • Kubernetes deployment configurations

  • Monitoring and alerting integration

  • High-availability database setup

Developer Tools:

  • Comprehensive test suite with pytest

  • Development environment setup

  • Code quality tools (Black, flake8, mypy)

  • Pre-commit hooks configuration
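As an illustration of the item serialization with datetime and Decimal support listed under the utility functions, the following sketch shows one common way to make such values JSON-safe before storing them in a JSONB column. The names `serialize_item` and `_json_default` are hypothetical, and the choice to render Decimal as a string (to avoid float rounding) is an assumption rather than the library's documented behavior.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def _json_default(value):
    """Fallback used by json.dumps for types it cannot encode natively."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        # Assumption: keep exact precision by storing Decimal as text.
        return str(value)
    raise TypeError(f"Cannot serialize value of type {type(value).__name__}")


def serialize_item(item: dict) -> str:
    """Return a JSON string suitable for a JSONB column (hypothetical helper)."""
    return json.dumps(item, default=_json_default, sort_keys=True)


payload = serialize_item(
    {"price": Decimal("19.99"), "seen": datetime(2025, 7, 21, 12, 0)}
)
```

Unsupported types raise `TypeError` instead of being silently dropped, so unexpected field types surface early.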

### Documentation

- Complete ReadTheDocs documentation
- Installation and quick start guides
- API reference for all components
- Production deployment examples
- Troubleshooting guide
- Contributing guidelines

### Technical Details

Database Schema:

-- Items table
CREATE TABLE job_items (
    id BIGSERIAL PRIMARY KEY,
    item JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    job_id INTEGER NOT NULL
);

-- Requests table
CREATE TABLE job_requests (
    id BIGSERIAL PRIMARY KEY,
    url VARCHAR(200) NOT NULL,
    method VARCHAR(10) NOT NULL,
    status_code INTEGER,
    response_time FLOAT,
    fingerprint VARCHAR(255),
    parent_url VARCHAR(255),
    created_at TIMESTAMPTZ NOT NULL,
    job_id INTEGER NOT NULL,
    parent_id BIGINT,
    FOREIGN KEY (parent_id) REFERENCES job_requests(id)
);

-- Logs table
CREATE TABLE job_logs (
    id BIGSERIAL PRIMARY KEY,
    job_id INTEGER NOT NULL,
    type VARCHAR(50) NOT NULL,
    message TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL
);

Configuration Options:

  • DB_URL - PostgreSQL connection string (required)

  • CREATE_TABLES - Auto-create tables (default: True)

  • JOB_ID - Job identifier (default: spider name)

  • DB_SETTINGS - Advanced database configuration

  • TABLE_NAMES - Custom table name mapping
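For orientation, the first two options might appear in a project's settings.py roughly as follows. This is a sketch with placeholder values: reading the connection string from an environment variable named DB_URL is an assumption, and DB_SETTINGS and TABLE_NAMES are left out because their exact structure is not described here.

```python
# settings.py (illustrative values only)
import os

# Required: PostgreSQL connection string. The environment lookup and the
# fallback URL below are assumptions made for this example.
DB_URL = os.environ.get(
    "DB_URL",
    "postgresql://user:password@localhost:5432/scrapy_jobs",
)

# Optional: create the job_items, job_requests and job_logs tables
# automatically (this is also the documented default).
CREATE_TABLES = True

# JOB_ID is omitted here, so the job identifier falls back to the spider name.
```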

Pipeline Integration:

# Basic setup
ITEM_PIPELINES = {
    'scrapy_item_ingest.DbInsertPipeline': 300,
}

EXTENSIONS = {
    'scrapy_item_ingest.LoggingExtension': 500,
}

Key Features:

  • Real-time Data Storage: Items and requests stored as they’re processed

  • Flexible Data Structure: JSONB storage supports any item structure

  • Request Tracking: Complete request/response lifecycle tracking

  • Performance Optimized: Connection pooling and batch processing

  • Production Ready: Docker, Kubernetes, and monitoring support

  • Developer Friendly: Comprehensive documentation and testing

### Breaking Changes

None (initial release)

### Migration Guide

Not applicable (initial release)

## Release Notes Template

For future releases, use this template:

```markdown
[X.Y.Z] - YYYY-MM-DD
--------------------

### Added
- New features and capabilities

### Changed
- Changes to existing functionality

### Deprecated
- Features marked for removal in future versions

### Removed
- Features removed in this version

### Fixed
- Bug fixes and corrections

### Security
- Security-related improvements

### Breaking Changes
- Changes that break backward compatibility

### Migration Guide
- Instructions for upgrading from previous versions
```

## Changelog Guidelines

### Categories

  • Added - for new features

  • Changed - for changes in existing functionality

  • Deprecated - for soon-to-be removed features

  • Removed - for now removed features

  • Fixed - for any bug fixes

  • Security - in case of vulnerabilities

### Format

  • Use past tense for all entries

  • Include issue/PR references where applicable

  • Group related changes under subheadings

  • Provide migration instructions for breaking changes

  • Include code examples for significant new features

### Examples

```markdown
### Added
- New `BatchProcessor` class for high-performance item processing (#123)
- Support for MySQL databases in addition to PostgreSQL (#145)
- Real-time metrics collection via Prometheus integration (#167)

### Changed
- Improved error handling in database connections with automatic retry (#134)
- Updated default batch size from 100 to 500 items for better performance (#156)

### Fixed
- Fixed memory leak in long-running spiders (#142)
- Resolved issue with Unicode characters in item serialization (#158)

### Breaking Changes
- Renamed `table_prefix` setting to `table_names` for consistency
- Changed default job ID format from timestamp to spider name
```

Migration: Update your settings.py:

```python
# Old
TABLE_PREFIX = 'custom_'

# New
TABLE_NAMES = {
    'items': 'custom_items',
    'requests': 'custom_requests',
    'logs': 'custom_logs',
}
```