Changelog
=========
All notable changes to Scrapy Item Ingest will be documented in this file.
The format is based on `Keep a Changelog `_, and this project adheres to `Semantic Versioning `_.
[Unreleased]
------------
### Added
- Advanced batch processing with configurable batch sizes
- Connection pooling optimization for high-throughput scenarios
- Redis-based job queue system for distributed crawling
- Comprehensive monitoring and metrics collection
- Webhook notifications for real-time updates
- Custom serialization handlers for complex data types
### Deprecated
- Legacy table naming conventions (will be removed in v2.0.0)
### Security
- Enhanced database connection security with SSL support
- Input validation improvements to prevent SQL injection
[0.2.4] - 2025-11-24
--------------------
### Fixed
- Update getting error logs and insertions logic [scrapy_item_ingest.pipelines.items]
[0.2.3] - 2025-11-21
--------------------
### Fixed
- Removed extra loging from [scrapy_item_ingest.pipelines.items]
[0.2.2] - 2025-11-21
--------------------
### Changed
- Simplified the logging extension to only attach to the root logger, which prevents log duplication and captures all log sources.
- Removed complex and unnecessary logging settings (`LOG_DB_LOGGERS`, `LOG_DB_EXCLUDE_LOGGERS`, `LOG_DB_EXCLUDE_PATTERNS`, `LOG_DB_DEDUP_TTL`, `LOG_DB_CAPTURE_LEVEL`). The extension now relies on the standard Scrapy `LOG_LEVEL`.
### Fixed
- Resolved an issue where logs from the `root` logger were not being captured.
- Fixed a log duplication issue caused by attaching the database handler to multiple loggers in the same hierarchy.
[0.2.0] - 2025-11-11
-------------------
### Added
- Automatic DSN normalization for PostgreSQL `DB_URL` to safely handle special characters in credentials (e.g., `@`, `$`)
- Unified `DatabaseConnection` singleton API used across pipelines and extensions (`connect/execute/commit/rollback/close`)
- Logging extension now capable of capturing Scrapy framework logs (Zyte-like) in addition to spider logs
- Console-like formatter in DB logs honoring `LOG_FORMAT` and `LOG_DATEFORMAT`
- Fine-grained logging controls for DB persistence:
- Allowlist by logger namespaces via `LOG_DB_LOGGERS`
- Exclude by namespaces via `LOG_DB_EXCLUDE_LOGGERS`
- Exclude by message substrings via `LOG_DB_EXCLUDE_PATTERNS`
- Batch size via `LOG_DB_BATCH_SIZE`
- Duplicate suppression via `LOG_DB_DEDUP_TTL`
### Changed
- Attached the DB log handler only to the spider’s base logger and the top-level `scrapy` logger to avoid propagation duplicates
- Applied optional `LOG_DB_CAPTURE_LEVEL` (default falls back to `LOG_DB_LEVEL`) to increase capture detail for DB without changing console verbosity
- Normalized schema for logs to consistently use `level` (instead of `type`)
- Simplified and streamlined documentation and README; reduced pages to essentials
### Fixed
- Import errors in external integrations expecting `DatabaseConnection` by providing a compatibility alias to `DBConnection`
- Eliminated repeated DB logging errors by throttling after the first failure
- Reduced noise by default: excluded `scrapy.core.scraper` and messages containing `Scraped from <` from DB persistence
### Settings (new/updated)
- `LOG_DB_LEVEL`: minimum level stored in DB (default: `DEBUG`)
- `LOG_DB_CAPTURE_LEVEL`: capture level for attached loggers (DB only)
- `LOG_DB_LOGGERS`: additional allowed logger prefixes (defaults always include `[spider.name, 'scrapy']`)
- `LOG_DB_EXCLUDE_LOGGERS`: logger namespaces to exclude (default: `['scrapy.core.scraper']`)
- `LOG_DB_EXCLUDE_PATTERNS`: message substrings to exclude (default: `['Scraped from <']`)
- `LOG_DB_BATCH_SIZE`: DB insert batch size
- `LOG_DB_DEDUP_TTL`: seconds to suppress duplicates
[0.1.1] - 2025-07-21
-------------------
### Added
- **Core Pipeline Functionality**
- `DbInsertPipeline` - Combined pipeline for items and requests
- `ItemsPipeline` - Standalone items processing pipeline
- `RequestsPipeline` - Standalone requests tracking pipeline
- `BasePipeline` - Base class for custom implementations
- **Database Integration**
- PostgreSQL database support with automatic table creation
- JSONB storage for flexible item data structure
- Request tracking with parent-child relationships
- Performance optimized with proper indexing
- **Logging Extension**
- `LoggingExtension` - Comprehensive spider event logging
- Real-time log storage in database
- Support for all Python log levels
- Spider lifecycle event tracking
- **Configuration Management**
- Flexible settings validation
- Environment-based configuration
- Multi-environment support (dev, staging, production)
- Automatic fallback to spider name for job IDs
- **Database Schema**
- `job_items` table for scraped data storage
- `job_requests` table for request/response tracking
- `job_logs` table for spider events and messages
- Foreign key relationships and proper constraints
- **Utility Functions**
- Item serialization with datetime and Decimal support
- Request fingerprinting for deduplication
- Database connection management
- Data validation and cleaning utilities
- **Production Features**
- Docker container support
- Kubernetes deployment configurations
- Monitoring and alerting integration
- High-availability database setup
- **Developer Tools**
- Comprehensive test suite with pytest
- Development environment setup
- Code quality tools (Black, flake8, mypy)
- Pre-commit hooks configuration
### Documentation
- Complete ReadTheDocs documentation
- Installation and quick start guides
- API reference for all components
- Production deployment examples
- Troubleshooting guide
- Contributing guidelines
### Technical Details
**Database Schema:**
.. code-block:: sql
-- Items table
CREATE TABLE job_items (
id BIGSERIAL PRIMARY KEY,
item JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL,
job_id INTEGER NOT NULL
);
-- Requests table
CREATE TABLE job_requests (
id BIGSERIAL PRIMARY KEY,
url VARCHAR(200) NOT NULL,
method VARCHAR(10) NOT NULL,
status_code INTEGER,
response_time FLOAT,
fingerprint VARCHAR(255),
parent_url VARCHAR(255),
created_at TIMESTAMPTZ NOT NULL,
job_id INTEGER NOT NULL,
parent_id BIGINT,
FOREIGN KEY (parent_id) REFERENCES job_requests(id)
);
-- Logs table
CREATE TABLE job_logs (
id BIGSERIAL PRIMARY KEY,
job_id INTEGER NOT NULL,
type VARCHAR(50) NOT NULL,
message TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL
);
**Configuration Options:**
- `DB_URL` - PostgreSQL connection string (required)
- `CREATE_TABLES` - Auto-create tables (default: True)
- `JOB_ID` - Job identifier (default: spider name)
- `DB_SETTINGS` - Advanced database configuration
- `TABLE_NAMES` - Custom table name mapping
**Pipeline Integration:**
.. code-block:: python
# Basic setup
ITEM_PIPELINES = {
'scrapy_item_ingest.DbInsertPipeline': 300,
}
EXTENSIONS = {
'scrapy_item_ingest.LoggingExtension': 500,
}
**Key Features:**
- **Real-time Data Storage**: Items and requests stored as they're processed
- **Flexible Data Structure**: JSONB storage supports any item structure
- **Request Tracking**: Complete request/response lifecycle tracking
- **Performance Optimized**: Connection pooling and batch processing
- **Production Ready**: Docker, Kubernetes, and monitoring support
- **Developer Friendly**: Comprehensive documentation and testing
### Breaking Changes
None (initial release)
### Migration Guide
Not applicable (initial release)
---
## Release Notes Template
For future releases, use this template:
```markdown
[X.Y.Z] - YYYY-MM-DD
--------------------
### Added
- New features and capabilities
### Changed
- Changes to existing functionality
### Deprecated
- Features marked for removal in future versions
### Removed
- Features removed in this version
### Fixed
- Bug fixes and corrections
### Security
- Security-related improvements
### Breaking Changes
- Changes that break backward compatibility
### Migration Guide
- Instructions for upgrading from previous versions
```
## Changelog Guidelines
### Categories
**Added** - for new features
**Changed** - for changes in existing functionality
**Deprecated** - for soon-to-be removed features
**Removed** - for now removed features
**Fixed** - for any bug fixes
**Security** - in case of vulnerabilities
### Format
- Use past tense for all entries
- Include issue/PR references where applicable
- Group related changes under subheadings
- Provide migration instructions for breaking changes
- Include code examples for significant new features
### Examples
```markdown
### Added
- New `BatchProcessor` class for high-performance item processing (#123)
- Support for MySQL databases in addition to PostgreSQL (#145)
- Real-time metrics collection via Prometheus integration (#167)
### Changed
- Improved error handling in database connections with automatic retry (#134)
- Updated default batch size from 100 to 500 items for better performance (#156)
### Fixed
- Fixed memory leak in long-running spiders (#142)
- Resolved issue with Unicode characters in item serialization (#158)
### Breaking Changes
- Renamed `table_prefix` setting to `table_names` for consistency
- Changed default job ID format from timestamp to spider name
**Migration:** Update your settings.py:
```python
# Old
TABLE_PREFIX = 'custom_'
# New
TABLE_NAMES = {
'items': 'custom_items',
'requests': 'custom_requests',
'logs': 'custom_logs'
}
```
```