Troubleshooting

This guide covers common issues, error messages, and solutions when using Scrapy Item Ingest in development and production environments.

Common Installation Issues

PostgreSQL Connection Errors

Error: psycopg2.OperationalError: could not connect to server

Symptoms: - Spider fails to start with database connection error - Error occurs during pipeline initialization

Solutions:

  1. Check PostgreSQL service status:

    # Ubuntu/Debian
    sudo systemctl status postgresql
    sudo systemctl start postgresql
    
    # macOS with Homebrew
    brew services start postgresql
    
    # Windows
    net start postgresql-x64-15
    
  2. Verify database connection string:

    # Test connection manually
    import psycopg2
    
    try:
        conn = psycopg2.connect(
            "postgresql://username:password@localhost:5432/database_name"
        )
        print("✅ Connection successful")
        conn.close()
    except Exception as e:
        print(f"❌ Connection failed: {e}")
    
  3. Check firewall and network settings:

    # Test port connectivity
    telnet localhost 5432
    
    # Or using nc
    nc -zv localhost 5432
    
  4. Verify PostgreSQL configuration:

    # Check postgresql.conf
    sudo nano /etc/postgresql/15/main/postgresql.conf
    
    # Ensure these settings:
    listen_addresses = '*'  # or 'localhost'
    port = 5432
    

Error: psycopg2.OperationalError: FATAL: password authentication failed

Solutions:

  1. Reset PostgreSQL password:

    sudo -u postgres psql
    ALTER USER postgres PASSWORD 'newpassword';
    
  2. Check pg_hba.conf authentication method:

    sudo nano /etc/postgresql/15/main/pg_hba.conf
    
    # Change to:
    local   all             all                                     md5
    host    all             all             127.0.0.1/32            md5
    
  3. Restart PostgreSQL after changes:

    sudo systemctl restart postgresql
    

Table Creation Issues

Error: relation "job_items" does not exist

Symptoms: - Spider runs but fails when trying to store items - Error occurs when CREATE_TABLES = False

Solutions:

  1. Enable automatic table creation:

    # settings.py
    CREATE_TABLES = True
    
  2. Manually create tables:

    -- Connect to your database and run:
    CREATE TABLE job_items (
        id BIGSERIAL PRIMARY KEY,
        item JSONB NOT NULL,
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        job_id INTEGER NOT NULL
    );
    
    CREATE TABLE job_requests (
        id BIGSERIAL PRIMARY KEY,
        url VARCHAR(200) NOT NULL,
        method VARCHAR(10) NOT NULL,
        status_code INTEGER,
        response_time FLOAT,
        fingerprint VARCHAR(255),
        parent_url VARCHAR(255),
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        job_id INTEGER NOT NULL,
        parent_id BIGINT,
        FOREIGN KEY (parent_id) REFERENCES job_requests(id)
    );
    
    CREATE TABLE job_logs (
        id BIGSERIAL PRIMARY KEY,
        job_id INTEGER NOT NULL,
        type VARCHAR(50) NOT NULL,
        message TEXT NOT NULL,
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    );
    
  3. Check database permissions:

    -- Ensure user has table creation privileges
    GRANT CREATE ON DATABASE your_database TO your_user;
    GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO your_user;
    

Runtime Issues

Performance Issues

Problem: Slow spider performance or database bottlenecks

Symptoms: - Very slow item processing - Long response times - Database connection timeouts

Solutions:

  1. Optimize database connections:

    # settings.py
    DB_SETTINGS = {
        'pool_size': 20,
        'max_overflow': 30,
        'pool_timeout': 30,
    }
    
  2. Tune Scrapy concurrency:

    # Start with lower values and increase gradually
    CONCURRENT_REQUESTS = 16
    CONCURRENT_REQUESTS_PER_DOMAIN = 8
    DOWNLOAD_DELAY = 0.5
    
  3. Enable autothrottle:

    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1
    AUTOTHROTTLE_MAX_DELAY = 10
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
    
  4. Optimize database queries:

    -- Add indexes for better performance
    CREATE INDEX CONCURRENTLY idx_job_items_job_id ON job_items(job_id);
    CREATE INDEX CONCURRENTLY idx_job_items_created_at ON job_items(created_at);
    

Configuration Problems

Pipeline Order Issues

Error: Items not being processed correctly or pipelines not running

Solutions:

  1. Check pipeline order:

    # Correct order (lower numbers run first)
    ITEM_PIPELINES = {
        'myproject.pipelines.ValidationPipeline': 200,
        'scrapy_item_ingest.DbInsertPipeline': 300,
        'myproject.pipelines.NotificationPipeline': 400,
    }
    
  2. Ensure pipeline returns items:

    def process_item(self, item, spider):
        # Process the item
        processed_item = self.do_processing(item)
    
        # MUST return the item for next pipeline
        return processed_item
    

Job ID Configuration Issues

Problem: Items not grouped correctly or job_id is null

Solutions:

  1. Explicitly set JOB_ID:

    # settings.py
    JOB_ID = 'my_specific_job_001'
    
  2. Check spider attribute:

    # In your spider
    class MySpider(scrapy.Spider):
        name = 'my_spider'
    
        def __init__(self, job_id=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            if job_id:
                self.job_id = job_id
    
  3. Verify job_id is being set:

    # Add logging to check job_id
    def open_spider(self, spider):
        job_id = getattr(spider, 'job_id', spider.name)
        spider.logger.info(f"Using job_id: {job_id}")
    

Data Quality Issues

JSON Serialization Errors

Error: TypeError: Object of type X is not JSON serializable

Solutions:

  1. Use proper field types:

    # Convert datetime objects
    from datetime import datetime
    
    item['scraped_at'] = datetime.now().isoformat()
    
    # Convert Decimal to float
    from decimal import Decimal
    price = Decimal('29.99')
    item['price'] = float(price)
    
  2. Custom serialization:

    import json
    from datetime import datetime, date
    from decimal import Decimal
    
    class CustomJSONEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, (datetime, date)):
                return obj.isoformat()
            elif isinstance(obj, Decimal):
                return float(obj)
            return super().default(obj)
    
  3. Clean data before yielding:

    def clean_item(self, item):
        """Clean item data for JSON serialization"""
        cleaned = {}
        for key, value in item.items():
            if isinstance(value, (str, int, float, bool, list, dict)):
                cleaned[key] = value
            elif value is None:
                cleaned[key] = None
            else:
                cleaned[key] = str(value)
        return cleaned
    

JSONB Query Issues

Problem: Can’t query JSONB fields effectively

Solutions:

  1. Use proper JSONB operators:

    -- Extract text values
    SELECT item->>'name' as product_name FROM job_items;
    
    -- Extract numeric values
    SELECT (item->>'price')::FLOAT as price FROM job_items;
    
    -- Check for key existence
    SELECT * FROM job_items WHERE item ? 'price';
    
    -- Query nested objects
    SELECT * FROM job_items WHERE item->'metadata'->>'source' = 'website';
    
  2. Create functional indexes:

    -- Index for frequently queried fields
    CREATE INDEX idx_items_name ON job_items ((item->>'name'));
    CREATE INDEX idx_items_price ON job_items (((item->>'price')::FLOAT));
    

Monitoring and Debugging

Enable Debug Logging

  1. Enable detailed logging:

    # settings.py
    LOG_LEVEL = 'DEBUG'
    
    # Custom logging configuration
    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'verbose': {
                'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}',
                'style': '{',
            },
        },
        'handlers': {
            'file': {
                'level': 'DEBUG',
                'class': 'logging.FileHandler',
                'filename': 'scrapy_debug.log',
                'formatter': 'verbose',
            },
        },
        'loggers': {
            'scrapy_item_ingest': {
                'handlers': ['file'],
                'level': 'DEBUG',
                'propagate': True,
            },
        },
    }
    
  2. Add debug information to pipelines:

    def process_item(self, item, spider):
        spider.logger.debug(f"Processing item: {item}")
        spider.logger.debug(f"Using job_id: {getattr(spider, 'job_id', 'unknown')}")
    
        # Process item
        return item
    

Database Connection Debugging

  1. Test database connectivity:

    # test_db.py
    import psycopg2
    import sys
    
    def test_database_connection(db_url):
        try:
            conn = psycopg2.connect(db_url)
            cursor = conn.cursor()
    
            # Test basic operations
            cursor.execute("SELECT version();")
            version = cursor.fetchone()[0]
            print(f"✅ Connected to: {version}")
    
            # Test table access
            cursor.execute("SELECT COUNT(*) FROM job_items;")
            count = cursor.fetchone()[0]
            print(f"✅ Items in database: {count}")
    
            conn.close()
            return True
    
        except Exception as e:
            print(f"❌ Database test failed: {e}")
            return False
    
    if __name__ == "__main__":
        db_url = sys.argv[1] if len(sys.argv) > 1 else "postgresql://user:pass@localhost/db"
        test_database_connection(db_url)
    
  2. Monitor database connections:

    -- Check active connections
    SELECT
        pid,
        usename,
        application_name,
        client_addr,
        state,
        query_start,
        query
    FROM pg_stat_activity
    WHERE datname = 'your_database_name';
    

Performance Profiling

  1. Profile spider performance:

    # Add profiling to spider
    import cProfile
    import pstats
    
    class ProfilingSpider(scrapy.Spider):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.profiler = cProfile.Profile()
    
        def spider_opened(self, spider):
            self.profiler.enable()
    
        def spider_closed(self, spider):
            self.profiler.disable()
            stats = pstats.Stats(self.profiler)
            stats.sort_stats('cumulative')
            stats.print_stats(20)  # Top 20 functions
    
  2. Monitor memory usage:

    import psutil
    import os
    
    class MemoryMonitoringExtension:
        def item_scraped(self, item, response, spider):
            process = psutil.Process(os.getpid())
            memory_mb = process.memory_info().rss / 1024 / 1024
            spider.logger.info(f"Memory usage: {memory_mb:.2f} MB")
    

Common Error Messages and Solutions

ImportError: No module named 'scrapy_item_ingest'

Solution: .. code-block:: bash

pip install scrapy-item-ingest # Or if developing: pip install -e .

AttributeError: 'Spider' object has no attribute 'job_id'

Solution: .. code-block:: python

# Ensure job_id is set in settings or spider JOB_ID = ‘your_job_id’ # Or handle missing job_id gracefully: job_id = getattr(spider, ‘job_id’, spider.name)

psycopg2.errors.UndefinedTable: relation "job_items" does not exist

Solution: .. code-block:: python

# Enable table creation CREATE_TABLES = True

Getting Help

When reporting issues, please include:

  1. Environment information: - Python version - Scrapy version - PostgreSQL version - Operating system

  2. Configuration: - Relevant settings.py content - Pipeline configuration - Database connection string (without credentials)

  3. Error logs: - Complete error traceback - Relevant log messages - Spider output

  4. Minimal reproduction case: - Simple spider that reproduces the issue - Sample data if relevant

Support Channels: - GitHub Issues: https://github.com/fawadss1/scrapy_item_ingest/issues - Documentation: This documentation site

Next Steps