Contributing
Thank you for your interest in contributing to Scrapy Item Ingest! This guide covers setting up a development environment, following the project's conventions, and submitting changes.
Getting Started
Development Setup
Fork and clone the repository:
git clone https://github.com/fawadss1/scrapy_item_ingest.git
cd scrapy_item_ingest
Create a virtual environment:
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
Install development dependencies:
pip install -e .
pip install -r requirements-dev.txt
Set up pre-commit hooks:
pre-commit install
Set up test database:
# Create test database
createdb scrapy_test
# Run initial tests
pytest tests/
Development Environment
Required Tools:
Python 3.7+
PostgreSQL 12+
Git
Code editor (VS Code, PyCharm, etc.)
Recommended Tools:
Docker (for testing different environments)
Redis (for advanced features testing)
pgAdmin or similar (for database inspection)
Environment Variables:
# .env.development
TEST_DATABASE_URL=postgresql://postgres:password@localhost:5432/scrapy_test
DEBUG=true
LOG_LEVEL=DEBUG
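The variables above could be read during test setup roughly as follows. This is a minimal sketch: the `DevConfig` dataclass and the `load_dev_config` helper are hypothetical names that are not part of the package, and the fallback values simply mirror the sample file.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DevConfig:
    """Hypothetical container for the development settings shown above."""

    test_database_url: str
    debug: bool
    log_level: str


def load_dev_config(env: Optional[Mapping[str, str]] = None) -> DevConfig:
    """Build a DevConfig from environment variables (os.environ by default)."""
    env = os.environ if env is None else env
    return DevConfig(
        # Fallback mirrors the sample .env.development value (an assumption).
        test_database_url=env.get(
            "TEST_DATABASE_URL",
            "postgresql://postgres:password@localhost:5432/scrapy_test",
        ),
        # Treat "true", "1", and "yes" (any letter case) as enabled.
        debug=env.get("DEBUG", "false").strip().lower() in {"true", "1", "yes"},
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to exercise in unit tests.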
Project Structure
Understanding the Codebase
scrapy_item_ingest/
├── __init__.py            # Package initialization and exports
├── config/                # Configuration management
│   ├── __init__.py
│   └── settings.py        # Settings validation and defaults
├── database/              # Database operations
│   ├── __init__.py
│   ├── connection.py      # Connection management
│   └── schema.py          # Table creation and management
├── extensions/            # Scrapy extensions
│   ├── __init__.py
│   ├── base.py            # Base extension class
│   └── logging.py         # Logging extension
├── pipelines/             # Scrapy pipelines
│   ├── __init__.py
│   ├── base.py            # Base pipeline class
│   ├── items.py           # Items pipeline
│   ├── main.py            # Combined pipeline
│   └── requests.py        # Requests pipeline
└── utils/                 # Utility functions
    ├── __init__.py
    ├── fingerprint.py     # Request fingerprinting
    └── serialization.py   # Data serialization
Key Components:
Pipelines: Core functionality for processing items and requests
Extensions: Additional features like logging and monitoring
Database: Connection management and schema operations
Config: Settings validation and configuration management
Utils: Helper functions and utilities
Code Style and Standards
Coding Standards
We follow PEP 8 with some additional guidelines:
Line length: Maximum 88 characters (Black formatter)
Imports: Use absolute imports, group by standard/third-party/local
Docstrings: Use Google-style docstrings
Type hints: Use type hints where appropriate
Variable names: Use descriptive names, avoid abbreviations
Example Code Style:
import logging
from typing import Any, Dict, Optional

# Connection type for annotations; psycopg2 is assumed as the database driver
from psycopg2.extensions import connection as Connection
from scrapy import Spider
from scrapy.item import Item

logger = logging.getLogger(__name__)


class ItemsPipeline:
    """Pipeline for storing scraped items in the database.

    This pipeline handles the storage of scraped items into a PostgreSQL
    database with automatic serialization and error handling.

    Args:
        settings: Scrapy settings object containing configuration

    Attributes:
        db_url: Database connection string
        job_id: Unique identifier for the crawl job
    """

    def __init__(self, settings: Dict[str, Any]) -> None:
        self.db_url: str = settings.get('DB_URL')
        self.job_id: Optional[str] = settings.get('JOB_ID')
        self._connection: Optional[Connection] = None

    def process_item(self, item: Item, spider: Spider) -> Item:
        """Process and store item in database.

        Args:
            item: The scraped item to process
            spider: The spider that scraped the item

        Returns:
            The processed item

        Raises:
            DatabaseError: If database operation fails
        """
        # _serialize_item and _store_item are omitted from this excerpt
        try:
            serialized_item = self._serialize_item(item)
            self._store_item(serialized_item, spider)
            return item
        except Exception as e:
            logger.error(f"Failed to process item: {e}")
            raise
Documentation Standards
Code Comments: Explain why, not what
Docstrings: Document all public functions, classes, and methods
Type Hints: Use type hints for function signatures
Examples: Include usage examples in docstrings
Docstring Example:
from typing import Any, Dict, Union

from scrapy.item import Item


def serialize_item(item: Union[Item, Dict]) -> Dict[str, Any]:
    """Serialize Scrapy item to JSON-compatible format.

    Converts Scrapy Item objects and dictionaries to a format that can
    be safely serialized to JSON. Handles datetime objects, Decimal
    numbers, and other non-JSON-serializable types.

    Args:
        item: The item to serialize. Can be a Scrapy Item or dictionary.

    Returns:
        A dictionary with all values converted to JSON-serializable types.

    Raises:
        SerializationError: If the item contains objects that cannot be
            serialized or converted to a compatible format.

    Example:
        >>> from datetime import datetime
        >>> item = {'name': 'Product', 'created': datetime.now()}
        >>> serialized = serialize_item(item)
        >>> isinstance(serialized['created'], str)
        True
    """
Testing Guidelines
Test Structure
We use pytest for testing with the following structure:
tests/
├── conftest.py                      # Pytest configuration and fixtures
├── unit/                            # Unit tests
│   ├── test_pipelines.py
│   ├── test_extensions.py
│   ├── test_serialization.py
│   └── test_config.py
├── integration/                     # Integration tests
│   ├── test_database.py
│   └── test_scrapy_integration.py
└── fixtures/                        # Test data and fixtures
    ├── sample_items.json
    └── test_responses.html
Writing Tests
Unit Test Example:
import pytest
from unittest.mock import Mock, patch

from scrapy_item_ingest.pipelines.items import ItemsPipeline


class TestItemsPipeline:
    @pytest.fixture
    def pipeline(self, mock_settings):
        return ItemsPipeline(mock_settings)

    @pytest.fixture
    def mock_settings(self):
        return {
            'DB_URL': 'postgresql://test:test@localhost:5432/test_db',
            'CREATE_TABLES': True,
            'JOB_ID': 'test_job'
        }

    @pytest.fixture
    def sample_item(self):
        return {
            'title': 'Test Product',
            'price': 29.99,
            'url': 'https://example.com/product/123'
        }

    def test_process_item_success(self, pipeline, sample_item):
        spider = Mock()
        spider.name = 'test_spider'
        with patch.object(pipeline, '_store_item') as mock_store:
            result = pipeline.process_item(sample_item, spider)
            assert result == sample_item
            mock_store.assert_called_once()

    def test_process_item_database_error(self, pipeline, sample_item):
        spider = Mock()
        with patch.object(pipeline, '_store_item', side_effect=Exception("DB Error")):
            with pytest.raises(Exception):
                pipeline.process_item(sample_item, spider)
Integration Test Example:
from unittest.mock import Mock

import psycopg2
import pytest

from scrapy_item_ingest.pipelines.main import DbInsertPipeline


@pytest.mark.integration
class TestDatabaseIntegration:
    @pytest.fixture(scope='class')
    def test_database(self):
        # Setup: open a connection to confirm the test database is reachable
        db_url = 'postgresql://test:test@localhost:5432/test_scrapy'
        conn = psycopg2.connect(db_url)
        yield db_url
        # Cleanup
        conn.close()

    def test_end_to_end_pipeline(self, test_database):
        settings = {
            'DB_URL': test_database,
            'CREATE_TABLES': True,
            'JOB_ID': 'integration_test'
        }
        pipeline = DbInsertPipeline(settings)
        spider = Mock()
        spider.name = 'test_spider'

        # Test pipeline functionality
        pipeline.open_spider(spider)
        item = {'title': 'Test Item', 'url': 'https://example.com'}
        result = pipeline.process_item(item, spider)
        assert result == item
        pipeline.close_spider(spider)
Running Tests
# Run all tests
pytest
# Run specific test file
pytest tests/unit/test_pipelines.py
# Run with coverage (requires the pytest-cov plugin)
pytest --cov=scrapy_item_ingest --cov-report=html
# Run only integration tests
pytest -m integration
# Run tests in parallel (requires the pytest-xdist plugin)
pytest -n auto
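The `-m integration` filter selects tests carrying the `integration` marker. pytest emits a warning for markers it does not recognize, so one option is to register the marker in `tests/conftest.py`. The following is a minimal sketch rather than the project's actual conftest; the marker description text is an assumption.

```python
def pytest_configure(config):
    """Register custom markers so `pytest -m integration` runs without warnings."""
    config.addinivalue_line(
        "markers",
        # Description text is illustrative, not taken from the project.
        "integration: tests that need a running PostgreSQL instance",
    )
```

With the marker registered, `pytest -m "not integration"` can be used to skip database-backed tests on machines without PostgreSQL.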
Test Coverage
We aim for 90%+ test coverage. Check coverage with:
pytest --cov=scrapy_item_ingest --cov-report=term-missing
Contribution Workflow
Making Changes
Create a feature branch:
git checkout -b feature/your-feature-name
Make your changes:
- Write code following our style guidelines
- Add or update tests
- Update documentation if needed
Run tests and linting:
# Run tests
pytest
# Run linting
flake8 scrapy_item_ingest/
black --check scrapy_item_ingest/
mypy scrapy_item_ingest/
Commit your changes:
git add .
git commit -m "feat: add new feature description"
Push and create pull request:
git push origin feature/your-feature-name
Commit Message Format
We use the Conventional Commits format:
type(scope): description
[optional body]
[optional footer]
Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- style: Code style changes (formatting, etc.)
- refactor: Code refactoring
- test: Adding or updating tests
- chore: Maintenance tasks
Examples:
feat(pipelines): add batch processing support
fix(database): handle connection timeout errors
docs(api): update pipeline documentation
test(integration): add database integration tests
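A commit header in this format can be checked mechanically, for example from a local hook. The sketch below is illustrative only: `is_valid_commit_header` is a hypothetical helper, the accepted types are the ones listed above, and the set of characters allowed in a scope is an assumption.

```python
import re

# Types taken from the list in this guide.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional "(scope)", then ": " and a non-empty description.
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9_-]+)\))?"  # scope character set is an assumption
    r": (?P<description>\S.*)$"
)


def is_valid_commit_header(header: str) -> bool:
    """Return True if the first line of a commit message fits the format."""
    return HEADER_RE.match(header) is not None


if __name__ == "__main__":
    for line in (
        "feat(pipelines): add batch processing support",
        "fix(database) handle connection timeout errors",  # missing colon
    ):
        print(f"{line!r}: {is_valid_commit_header(line)}")
```

Only the header line is checked here; the optional body and footer are left unconstrained.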
Pull Request Guidelines
Before submitting:
[ ] Tests pass locally
[ ] Code follows style guidelines
[ ] Documentation is updated
[ ] CHANGELOG.md is updated (for significant changes)
[ ] Commit messages follow conventional format
Pull Request Template:
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing completed
## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] Tests pass
Release Process
Versioning
We use Semantic Versioning (SemVer):
MAJOR: Incompatible API changes
MINOR: New functionality (backward compatible)
PATCH: Bug fixes (backward compatible)
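The rules above translate into a simple bump operation: raising one component resets every component to its right. This is a small sketch for illustration; `parse_version` and `bump_version` are hypothetical helpers that handle only plain `MAJOR.MINOR.PATCH` strings (with an optional leading `v`), not pre-release or build metadata.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' (optionally prefixed with 'v') into integers."""
    if version.startswith("v"):
        version = version[1:]
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)


def bump_version(version: str, part: str) -> str:
    """Return the next version after bumping 'major', 'minor', or 'patch'."""
    major, minor, patch = parse_version(version)
    if part == "major":
        return f"{major + 1}.0.0"  # incompatible API change
    if part == "minor":
        return f"{major}.{minor + 1}.0"  # backward-compatible feature
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"  # backward-compatible fix
    raise ValueError(f"Unknown version part: {part!r}")
```

For example, `bump_version("1.2.0", "minor")` returns `"1.3.0"`.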
Release Checklist
Update version in setup.py
Update CHANGELOG.md
Create release tag
Build and publish to PyPI
Update documentation
# Create release
git tag v1.2.0
git push origin v1.2.0
# Build package
python setup.py sdist bdist_wheel
# Upload to PyPI
twine upload dist/*
Getting Help
Development Questions:
- GitHub Discussions
- Open an issue with question label
Bug Reports:
- Use GitHub Issues
- Include minimal reproduction case
- Provide environment details
Feature Requests:
- Open GitHub Issue with enhancement label
- Describe use case and expected behavior
Code of Conduct
We are committed to providing a welcoming and inclusive environment for all contributors. Please read and follow our Code of Conduct.
Our Standards:
Be respectful and inclusive
Welcome newcomers and help them learn
Focus on constructive feedback
Respect different viewpoints and experiences
Unacceptable Behavior:
Harassment or discrimination
Personal attacks or trolling
Publishing private information
Inappropriate sexual content
Reporting:
Report unacceptable behavior to the maintainers at admin@yourproject.com.
Recognition
Contributors are recognized in:
CONTRIBUTORS.md file
Release notes
Project documentation
Thank you for contributing to Scrapy Item Ingest! 🎉