Contributing

Thank you for your interest in contributing to Scrapy Item Ingest! This guide will help you get started with contributing to the project.

Getting Started

Development Setup

  1. Fork and clone the repository:

    # If you forked the repository, replace fawadss1 with your GitHub username
    git clone https://github.com/fawadss1/scrapy_item_ingest.git
    cd scrapy_item_ingest
    
  2. Create a virtual environment:

    python -m venv venv
    # On Windows:
    venv\Scripts\activate
    # On macOS/Linux:
    source venv/bin/activate
    
  3. Install development dependencies:

    pip install -e .
    pip install -r requirements-dev.txt
    
  4. Set up pre-commit hooks:

    pre-commit install
    
  5. Set up the test database:

    # Create test database
    createdb scrapy_test
    
    # Run initial tests
    pytest tests/
    

Development Environment

Required Tools:

  • Python 3.7+

  • PostgreSQL 12+

  • Git

  • Code editor (VS Code, PyCharm, etc.)

Recommended Tools:

  • Docker (for testing different environments)

  • Redis (for advanced features testing)

  • pgAdmin or similar (for database inspection)

Environment Variables:

# .env.development
TEST_DATABASE_URL=postgresql://postgres:password@localhost:5432/scrapy_test
DEBUG=true
LOG_LEVEL=DEBUG
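
As a rough illustration of how test code might consume these variables, the sketch below reads them from the process environment with fallbacks. The function name `load_dev_config` and the fallback values are assumptions made for this example. How the project itself loads `.env.development` is not shown in this guide.

```python
import os
from typing import Any, Dict, Mapping, Optional


def load_dev_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Read the development variables, falling back to assumed defaults.

    `load_dev_config` is a hypothetical helper, not part of the package.
    """
    env = os.environ if env is None else env
    return {
        "test_database_url": env.get(
            "TEST_DATABASE_URL",
            "postgresql://postgres:password@localhost:5432/scrapy_test",
        ),
        # Treat common truthy spellings as True; anything else is False.
        "debug": env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to exercise in unit tests.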

Project Structure

Understanding the Codebase

scrapy_item_ingest/
├── __init__.py              # Package initialization and exports
├── config/                  # Configuration management
│   ├── __init__.py
│   └── settings.py         # Settings validation and defaults
├── database/               # Database operations
│   ├── __init__.py
│   ├── connection.py       # Connection management
│   └── schema.py          # Table creation and management
├── extensions/             # Scrapy extensions
│   ├── __init__.py
│   ├── base.py            # Base extension class
│   └── logging.py         # Logging extension
├── pipelines/              # Scrapy pipelines
│   ├── __init__.py
│   ├── base.py            # Base pipeline class
│   ├── items.py           # Items pipeline
│   ├── main.py            # Combined pipeline
│   └── requests.py        # Requests pipeline
└── utils/                  # Utility functions
    ├── __init__.py
    ├── fingerprint.py      # Request fingerprinting
    └── serialization.py    # Data serialization

Key Components:

  • Pipelines: Core functionality for processing items and requests

  • Extensions: Additional features like logging and monitoring

  • Database: Connection management and schema operations

  • Config: Settings validation and configuration management

  • Utils: Helper functions and utilities
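
To see how these components are reached from a Scrapy project, here is a minimal settings sketch. The class path comes from the import used in the integration test example in this guide, and the setting names come from the test fixtures. The priority value `300`, the example values, and the idea that the pipeline reads these names directly from the Scrapy settings are assumptions.

```python
# settings.py of a Scrapy project (illustrative sketch, not the official setup)

# ITEM_PIPELINES is Scrapy's standard setting for enabling pipelines;
# the priority 300 is an arbitrary choice for this example.
ITEM_PIPELINES = {
    "scrapy_item_ingest.pipelines.main.DbInsertPipeline": 300,
}

# Setting names taken from the test fixtures in this guide; values are examples.
DB_URL = "postgresql://postgres:password@localhost:5432/scrapy_test"
CREATE_TABLES = True
JOB_ID = "local_dev_job"
```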

Code Style and Standards

Coding Standards

We follow PEP 8 with some additional guidelines:

  • Line length: Maximum 88 characters (Black formatter)

  • Imports: Use absolute imports, group by standard/third-party/local

  • Docstrings: Use Google-style docstrings

  • Type hints: Use type hints where appropriate

  • Variable names: Use descriptive names, avoid abbreviations

Example Code Style:

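import logging
from typing import Any, Dict, Optional

# Assumed for this example: the connection type comes from psycopg2
from psycopg2.extensions import connection as Connection
from scrapy import Spider
from scrapy.item import Item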

logger = logging.getLogger(__name__)


class ItemsPipeline:
    """Pipeline for storing scraped items in database.

    This pipeline handles the storage of scraped items into PostgreSQL
    database with automatic serialization and error handling.

    Args:
        settings: Scrapy settings object containing configuration

    Attributes:
        db_url: Database connection string
        job_id: Unique identifier for the crawl job
    """

    def __init__(self, settings: Dict[str, Any]) -> None:
        self.db_url: str = settings.get('DB_URL')
        self.job_id: Optional[str] = settings.get('JOB_ID')
        self._connection: Optional[Connection] = None

    def process_item(self, item: Item, spider: Spider) -> Item:
        """Process and store item in database.

        Args:
            item: The scraped item to process
            spider: The spider that scraped the item

        Returns:
            The processed item

        Raises:
            DatabaseError: If database operation fails
        """
        try:
            serialized_item = self._serialize_item(item)
            self._store_item(serialized_item, spider)
            return item
        except Exception as e:
            logger.error(f"Failed to process item: {e}")
            raise

Documentation Standards

  • Code Comments: Explain why, not what

  • Docstrings: Document all public functions, classes, and methods

  • Type Hints: Use type hints for function signatures

  • Examples: Include usage examples in docstrings

Docstring Example:

from typing import Any, Dict, Union

from scrapy.item import Item


def serialize_item(item: Union[Item, Dict]) -> Dict[str, Any]:
    """Serialize Scrapy item to JSON-compatible format.

    Converts Scrapy Item objects and dictionaries to a format that can
    be safely serialized to JSON. Handles datetime objects, Decimal
    numbers, and other non-JSON-serializable types.

    Args:
        item: The item to serialize. Can be a Scrapy Item or dictionary.

    Returns:
        A dictionary with all values converted to JSON-serializable types.

    Raises:
        SerializationError: If the item contains objects that cannot be
            serialized or converted to a compatible format.

    Example:
        >>> from datetime import datetime
        >>> item = {'name': 'Product', 'created': datetime.now()}
        >>> serialized = serialize_item(item)
        >>> isinstance(serialized['created'], str)
        True
    """
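
For orientation, a body matching this docstring could look roughly like the sketch below. It is not the package's actual implementation: the names `to_json_compatible` and `serialize_item_sketch` are hypothetical, it accepts any dict-like mapping, and it omits the `SerializationError` handling described above.

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Any, Dict, Mapping


def to_json_compatible(value: Any) -> Any:
    """Recursively convert a value to a JSON-compatible type."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        # A string keeps the exact decimal digits; float could lose precision.
        return str(value)
    if isinstance(value, Mapping):
        return {str(key): to_json_compatible(val) for key, val in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_json_compatible(val) for val in value]
    return value


def serialize_item_sketch(item: Mapping[str, Any]) -> Dict[str, Any]:
    """Convert a dict-like item to a dictionary of JSON-compatible values."""
    return {str(key): to_json_compatible(val) for key, val in item.items()}
```

Representing `Decimal` as a string is one possible design choice; a project that needs numeric JSON values might convert to `float` instead and accept the rounding.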

Testing Guidelines

Test Structure

We use pytest for testing with the following structure:

tests/
├── conftest.py              # Pytest configuration and fixtures
├── unit/                    # Unit tests
│   ├── test_pipelines.py
│   ├── test_extensions.py
│   ├── test_serialization.py
│   └── test_config.py
├── integration/             # Integration tests
│   ├── test_database.py
│   └── test_scrapy_integration.py
└── fixtures/                # Test data and fixtures
    ├── sample_items.json
    └── test_responses.html

Writing Tests

Unit Test Example:

import pytest
from unittest.mock import Mock, patch
from scrapy_item_ingest.pipelines.items import ItemsPipeline


class TestItemsPipeline:
    @pytest.fixture
    def pipeline(self, mock_settings):
        return ItemsPipeline(mock_settings)

    @pytest.fixture
    def mock_settings(self):
        return {
            'DB_URL': 'postgresql://test:test@localhost:5432/test_db',
            'CREATE_TABLES': True,
            'JOB_ID': 'test_job'
        }

    @pytest.fixture
    def sample_item(self):
        return {
            'title': 'Test Product',
            'price': 29.99,
            'url': 'https://example.com/product/123'
        }

    def test_process_item_success(self, pipeline, sample_item):
        spider = Mock()
        spider.name = 'test_spider'

        with patch.object(pipeline, '_store_item') as mock_store:
            result = pipeline.process_item(sample_item, spider)

            assert result == sample_item
            mock_store.assert_called_once()

    def test_process_item_database_error(self, pipeline, sample_item):
        spider = Mock()

        with patch.object(pipeline, '_store_item', side_effect=Exception("DB Error")):
            with pytest.raises(Exception):
                pipeline.process_item(sample_item, spider)

Integration Test Example:

from unittest.mock import Mock

import psycopg2
import pytest

from scrapy_item_ingest.pipelines.main import DbInsertPipeline


@pytest.mark.integration
class TestDatabaseIntegration:
    @pytest.fixture(scope='class')
    def test_database(self):
        # Connect to the test database created during development setup
        db_url = 'postgresql://postgres:password@localhost:5432/scrapy_test'
        conn = psycopg2.connect(db_url)

        yield db_url

        # Cleanup
        conn.close()

    def test_end_to_end_pipeline(self, test_database):
        settings = {
            'DB_URL': test_database,
            'CREATE_TABLES': True,
            'JOB_ID': 'integration_test'
        }

        pipeline = DbInsertPipeline(settings)
        spider = Mock()
        spider.name = 'test_spider'

        # Test pipeline functionality
        pipeline.open_spider(spider)

        item = {'title': 'Test Item', 'url': 'https://example.com'}
        result = pipeline.process_item(item, spider)

        assert result == item

        pipeline.close_spider(spider)

Running Tests

# Run all tests
pytest

# Run specific test file
pytest tests/unit/test_pipelines.py

# Run with coverage
pytest --cov=scrapy_item_ingest --cov-report=html

# Run only integration tests
pytest -m integration

# Run tests in parallel
pytest -n auto
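
Selecting tests with `-m integration` works best when the marker is registered, since pytest warns about unknown markers. One way to register it is from `conftest.py`, as in the sketch below. Whether the project's `conftest.py` already does this is not shown here, and the marker description text is an assumption.

```python
# tests/conftest.py (sketch)


def pytest_configure(config):
    """Register the custom marker used by `pytest -m integration`."""
    config.addinivalue_line(
        "markers",
        "integration: tests that need a running PostgreSQL database",
    )
```

`pytest_configure` is a standard pytest hook, and `config.addinivalue_line` appends a line to the `markers` configuration value, which is equivalent to listing the marker in the project's pytest configuration file.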

Test Coverage

We aim for 90%+ test coverage. Check coverage with:

pytest --cov=scrapy_item_ingest --cov-report=term-missing

Contribution Workflow

Making Changes

  1. Create a feature branch:

    git checkout -b feature/your-feature-name
    
  2. Make your changes:

    • Write code following our style guidelines

    • Add or update tests

    • Update documentation if needed

  3. Run tests and linting:

    # Run tests
    pytest
    
    # Run linting
    flake8 scrapy_item_ingest/
    black --check scrapy_item_ingest/
    mypy scrapy_item_ingest/
    
  4. Commit your changes:

    git add .
    git commit -m "feat: add new feature description"
    
  5. Push and create pull request:

    git push origin feature/your-feature-name
    

Commit Message Format

We use the Conventional Commits format:

type(scope): description

[optional body]

[optional footer]

Types:

  • feat: New feature

  • fix: Bug fix

  • docs: Documentation changes

  • style: Code style changes (formatting, etc.)

  • refactor: Code refactoring

  • test: Adding or updating tests

  • chore: Maintenance tasks

Examples:

feat(pipelines): add batch processing support
fix(database): handle connection timeout errors
docs(api): update pipeline documentation
test(integration): add database integration tests
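
If you want to check a commit header locally before pushing, a small validator along these lines may help. It is only a sketch: the function name `is_valid_commit_header` is hypothetical, and the allowed scope characters (lowercase letters, digits, underscores, hyphens) are an assumption, since this guide does not define them.

```python
import re

# Commit types listed in this guide.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# type, optional (scope), then ": " and a non-empty description.
HEADER_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"(\((?P<scope>[a-z0-9_-]+)\))?"
    r": (?P<description>\S.*)$"
)


def is_valid_commit_header(header: str) -> bool:
    """Return True if the first line of a commit message matches the format."""
    return HEADER_PATTERN.match(header) is not None
```

For example, `is_valid_commit_header("feat(pipelines): add batch processing support")` passes, while a header such as `"update stuff"` is rejected because it lacks a type prefix.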

Pull Request Guidelines

Before submitting:

  • [ ] Tests pass locally

  • [ ] Code follows style guidelines

  • [ ] Documentation is updated

  • [ ] CHANGELOG.md is updated (for significant changes)

  • [ ] Commit messages follow conventional format

Pull Request Template:

## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] Tests pass

Release Process

Versioning

We use Semantic Versioning (SemVer):

  • MAJOR: Incompatible API changes

  • MINOR: New functionality (backward compatible)

  • PATCH: Bug fixes (backward compatible)
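
The bump rules above can be sketched as a small helper. The function names are hypothetical and the sketch handles only plain `MAJOR.MINOR.PATCH` strings (with an optional leading `v`), not pre-release or build metadata.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string (optionally prefixed with 'v')."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    return major, minor, patch


def bump_version(version: str, change: str) -> str:
    """Return the next version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse_version(version)
    if change == "major":
        # Incompatible API change: reset minor and patch.
        return f"{major + 1}.0.0"
    if change == "minor":
        # Backward-compatible feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```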

Release Checklist

  1. Update version in setup.py

  2. Update CHANGELOG.md

  3. Create release tag

  4. Build and publish to PyPI

  5. Update documentation

# Create release
git tag v1.2.0
git push origin v1.2.0

# Build package (requires the "build" package: pip install build;
# invoking setup.py directly is deprecated by setuptools)
python -m build

# Upload to PyPI
twine upload dist/*

Getting Help

Development Questions:

  • GitHub Discussions

  • Open an issue with the question label

Bug Reports:

  • Use GitHub Issues

  • Include a minimal reproduction case

  • Provide environment details

Feature Requests:

  • Open a GitHub Issue with the enhancement label

  • Describe the use case and expected behavior

Code of Conduct

We are committed to providing a welcoming and inclusive environment for all contributors. Please read and follow our Code of Conduct.

Our Standards:

  • Be respectful and inclusive

  • Welcome newcomers and help them learn

  • Focus on constructive feedback

  • Respect different viewpoints and experiences

Unacceptable Behavior:

  • Harassment or discrimination

  • Personal attacks or trolling

  • Publishing private information

  • Inappropriate sexual content

Reporting:

Report unacceptable behavior to the maintainers at admin@yourproject.com.

Recognition

Contributors are recognized in:

  • CONTRIBUTORS.md file

  • Release notes

  • Project documentation

Thank you for contributing to Scrapy Item Ingest! 🎉