Contributing
============

Thank you for your interest in contributing to Scrapy Item Ingest! This guide will help you get started with contributing to the project.

Getting Started
---------------

Development Setup
~~~~~~~~~~~~~~~~

1. **Fork and clone the repository:**

   .. code-block:: bash

      git clone https://github.com/fawadss1/scrapy_item_ingest.git
      cd scrapy_item_ingest

2. **Create a virtual environment:**

   .. code-block:: bash

      python -m venv venv
      # On Windows:
      venv\Scripts\activate
      # On macOS/Linux:
      source venv/bin/activate

3. **Install development dependencies:**

   .. code-block:: bash

      pip install -e .
      pip install -r requirements-dev.txt

4. **Set up pre-commit hooks:**

   .. code-block:: bash

      pre-commit install

5. **Set up test database:**

   .. code-block:: bash

      # Create test database
      createdb scrapy_test

      # Run initial tests
      pytest tests/

Development Environment
~~~~~~~~~~~~~~~~~~~~~~

**Required Tools:**

* Python 3.7+
* PostgreSQL 12+
* Git
* Code editor (VS Code, PyCharm, etc.)

**Recommended Tools:**

* Docker (for testing different environments)
* Redis (for advanced features testing)
* pgAdmin or similar (for database inspection)

**Environment Variables:**

.. code-block:: bash

   # .env.development
   TEST_DATABASE_URL=postgresql://postgres:password@localhost:5432/scrapy_test
   DEBUG=true
   LOG_LEVEL=DEBUG

Project Structure
----------------

Understanding the Codebase
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: text

   scrapy_item_ingest/
   ├── __init__.py              # Package initialization and exports
   ├── config/                  # Configuration management
   │   ├── __init__.py
   │   └── settings.py         # Settings validation and defaults
   ├── database/               # Database operations
   │   ├── __init__.py
   │   ├── connection.py       # Connection management
   │   └── schema.py          # Table creation and management
   ├── extensions/             # Scrapy extensions
   │   ├── __init__.py
   │   ├── base.py            # Base extension class
   │   └── logging.py         # Logging extension
   ├── pipelines/              # Scrapy pipelines
   │   ├── __init__.py
   │   ├── base.py            # Base pipeline class
   │   ├── items.py           # Items pipeline
   │   ├── main.py            # Combined pipeline
   │   └── requests.py        # Requests pipeline
   └── utils/                  # Utility functions
       ├── __init__.py
       ├── fingerprint.py      # Request fingerprinting
       └── serialization.py    # Data serialization

**Key Components:**

* **Pipelines**: Core functionality for processing items and requests
* **Extensions**: Additional features like logging and monitoring
* **Database**: Connection management and schema operations
* **Config**: Settings validation and configuration management
* **Utils**: Helper functions and utilities

Code Style and Standards
-----------------------

Coding Standards
~~~~~~~~~~~~~~~

We follow PEP 8 with some additional guidelines:

* **Line length**: Maximum 88 characters (Black formatter)
* **Imports**: Use absolute imports, group by standard/third-party/local
* **Docstrings**: Use Google-style docstrings
* **Type hints**: Use type hints where appropriate
* **Variable names**: Use descriptive names, avoid abbreviations

**Example Code Style:**

.. code-block:: python

   from typing import Dict, List, Optional
   import logging
   from scrapy import Spider
   from scrapy.item import Item

   logger = logging.getLogger(__name__)


   class ItemsPipeline:
       """Pipeline for storing scraped items in database.

       This pipeline handles the storage of scraped items into PostgreSQL
       database with automatic serialization and error handling.

       Args:
           settings: Scrapy settings object containing configuration

       Attributes:
           db_url: Database connection string
           job_id: Unique identifier for the crawl job
       """

       def __init__(self, settings: Dict[str, Any]) -> None:
           self.db_url: str = settings.get('DB_URL')
           self.job_id: Optional[str] = settings.get('JOB_ID')
           self._connection: Optional[Connection] = None

       def process_item(self, item: Item, spider: Spider) -> Item:
           """Process and store item in database.

           Args:
               item: The scraped item to process
               spider: The spider that scraped the item

           Returns:
               The processed item

           Raises:
               DatabaseError: If database operation fails
           """
           try:
               serialized_item = self._serialize_item(item)
               self._store_item(serialized_item, spider)
               return item
           except Exception as e:
               logger.error(f"Failed to process item: {e}")
               raise

Documentation Standards
~~~~~~~~~~~~~~~~~~~~~~

* **Code Comments**: Explain why, not what
* **Docstrings**: Document all public functions, classes, and methods
* **Type Hints**: Use type hints for function signatures
* **Examples**: Include usage examples in docstrings

**Docstring Example:**

.. code-block:: python

   def serialize_item(item: Union[Item, Dict]) -> Dict[str, Any]:
       """Serialize Scrapy item to JSON-compatible format.

       Converts Scrapy Item objects and dictionaries to a format that can
       be safely serialized to JSON. Handles datetime objects, Decimal
       numbers, and other non-JSON-serializable types.

       Args:
           item: The item to serialize. Can be a Scrapy Item or dictionary.

       Returns:
           A dictionary with all values converted to JSON-serializable types.

       Raises:
           SerializationError: If the item contains objects that cannot be
               serialized or converted to a compatible format.

       Example:
           >>> from datetime import datetime
           >>> item = {'name': 'Product', 'created': datetime.now()}
           >>> serialized = serialize_item(item)
           >>> isinstance(serialized['created'], str)
           True
       """

Testing Guidelines
-----------------

Test Structure
~~~~~~~~~~~~~

We use pytest for testing with the following structure:

.. code-block:: text

   tests/
   ├── conftest.py              # Pytest configuration and fixtures
   ├── unit/                    # Unit tests
   │   ├── test_pipelines.py
   │   ├── test_extensions.py
   │   ├── test_serialization.py
   │   └── test_config.py
   ├── integration/             # Integration tests
   │   ├── test_database.py
   │   └── test_scrapy_integration.py
   └── fixtures/                # Test data and fixtures
       ├── sample_items.json
       └── test_responses.html

Writing Tests
~~~~~~~~~~~~

**Unit Test Example:**

.. code-block:: python

   import pytest
   from unittest.mock import Mock, patch
   from scrapy_item_ingest.pipelines.items import ItemsPipeline


   class TestItemsPipeline:
       @pytest.fixture
       def pipeline(self, mock_settings):
           return ItemsPipeline(mock_settings)

       @pytest.fixture
       def mock_settings(self):
           return {
               'DB_URL': 'postgresql://test:test@localhost:5432/test_db',
               'CREATE_TABLES': True,
               'JOB_ID': 'test_job'
           }

       @pytest.fixture
       def sample_item(self):
           return {
               'title': 'Test Product',
               'price': 29.99,
               'url': 'https://example.com/product/123'
           }

       def test_process_item_success(self, pipeline, sample_item):
           spider = Mock()
           spider.name = 'test_spider'

           with patch.object(pipeline, '_store_item') as mock_store:
               result = pipeline.process_item(sample_item, spider)

               assert result == sample_item
               mock_store.assert_called_once()

       def test_process_item_database_error(self, pipeline, sample_item):
           spider = Mock()

           with patch.object(pipeline, '_store_item', side_effect=Exception("DB Error")):
               with pytest.raises(Exception):
                   pipeline.process_item(sample_item, spider)

**Integration Test Example:**

.. code-block:: python

   import pytest
   import psycopg2
   from scrapy_item_ingest.pipelines.main import DbInsertPipeline


   @pytest.mark.integration
   class TestDatabaseIntegration:
       @pytest.fixture(scope='class')
       def test_database(self):
           # Setup test database
           db_url = 'postgresql://test:test@localhost:5432/test_scrapy'
           conn = psycopg2.connect(db_url)

           yield db_url

           # Cleanup
           conn.close()

       def test_end_to_end_pipeline(self, test_database):
           settings = {
               'DB_URL': test_database,
               'CREATE_TABLES': True,
               'JOB_ID': 'integration_test'
           }

           pipeline = DbInsertPipeline(settings)
           spider = Mock()
           spider.name = 'test_spider'

           # Test pipeline functionality
           pipeline.open_spider(spider)

           item = {'title': 'Test Item', 'url': 'https://example.com'}
           result = pipeline.process_item(item, spider)

           assert result == item

           pipeline.close_spider(spider)

Running Tests
~~~~~~~~~~~~

.. code-block:: bash

   # Run all tests
   pytest

   # Run specific test file
   pytest tests/unit/test_pipelines.py

   # Run with coverage
   pytest --cov=scrapy_item_ingest --cov-report=html

   # Run only integration tests
   pytest -m integration

   # Run tests in parallel
   pytest -n auto

Test Coverage
~~~~~~~~~~~~

We aim for 90%+ test coverage. Check coverage with:

.. code-block:: bash

   pytest --cov=scrapy_item_ingest --cov-report=term-missing

Contribution Workflow
--------------------

Making Changes
~~~~~~~~~~~~~

1. **Create a feature branch:**

   .. code-block:: bash

      git checkout -b feature/your-feature-name

2. **Make your changes:**
   - Write code following our style guidelines
   - Add or update tests
   - Update documentation if needed

3. **Run tests and linting:**

   .. code-block:: bash

      # Run tests
      pytest

      # Run linting
      flake8 scrapy_item_ingest/
      black --check scrapy_item_ingest/
      mypy scrapy_item_ingest/

4. **Commit your changes:**

   .. code-block:: bash

      git add .
      git commit -m "feat: add new feature description"

5. **Push and create pull request:**

   .. code-block:: bash

      git push origin feature/your-feature-name

Commit Message Format
~~~~~~~~~~~~~~~~~~~

We use conventional commits format:

.. code-block:: text

   type(scope): description

   [optional body]

   [optional footer]

**Types:**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes (formatting, etc.)
- `refactor`: Code refactoring
- `test`: Adding or updating tests
- `chore`: Maintenance tasks

**Examples:**

.. code-block:: text

   feat(pipelines): add batch processing support
   fix(database): handle connection timeout errors
   docs(api): update pipeline documentation
   test(integration): add database integration tests

Pull Request Guidelines
~~~~~~~~~~~~~~~~~~~~~~

**Before submitting:**

- [ ] Tests pass locally
- [ ] Code follows style guidelines
- [ ] Documentation is updated
- [ ] CHANGELOG.md is updated (for significant changes)
- [ ] Commit messages follow conventional format

**Pull Request Template:**

.. code-block:: markdown

   ## Description
   Brief description of changes

   ## Type of Change
   - [ ] Bug fix
   - [ ] New feature
   - [ ] Breaking change
   - [ ] Documentation update

   ## Testing
   - [ ] Unit tests added/updated
   - [ ] Integration tests added/updated
   - [ ] Manual testing completed

   ## Checklist
   - [ ] Code follows style guidelines
   - [ ] Self-review completed
   - [ ] Documentation updated
   - [ ] Tests pass

Release Process
--------------

Versioning
~~~~~~~~~

We use Semantic Versioning (SemVer):

- **MAJOR**: Incompatible API changes
- **MINOR**: New functionality (backward compatible)
- **PATCH**: Bug fixes (backward compatible)

Release Checklist
~~~~~~~~~~~~~~~~

1. Update version in `setup.py`
2. Update `CHANGELOG.md`
3. Create release tag
4. Build and publish to PyPI
5. Update documentation

.. code-block:: bash

   # Create release
   git tag v1.2.0
   git push origin v1.2.0

   # Build package
   python setup.py sdist bdist_wheel

   # Upload to PyPI
   twine upload dist/*

Getting Help
-----------

**Development Questions:**
- GitHub Discussions
- Open an issue with `question` label

**Bug Reports:**
- Use GitHub Issues
- Include minimal reproduction case
- Provide environment details

**Feature Requests:**
- Open GitHub Issue with `enhancement` label
- Describe use case and expected behavior

Code of Conduct
---------------

We are committed to providing a welcoming and inclusive environment for all contributors. Please read and follow our Code of Conduct.

**Our Standards:**

- Be respectful and inclusive
- Welcome newcomers and help them learn
- Focus on constructive feedback
- Respect different viewpoints and experiences

**Unacceptable Behavior:**

- Harassment or discrimination
- Personal attacks or trolling
- Publishing private information
- Inappropriate sexual content

**Reporting:**

Report unacceptable behavior to the maintainers at admin@yourproject.com.

Recognition
----------

Contributors are recognized in:

- `CONTRIBUTORS.md` file
- Release notes
- Project documentation

Thank you for contributing to Scrapy Item Ingest! 🎉