Troubleshooting

This guide covers common issues, error messages, and solutions when using Scrapy Item Ingest in development and production environments.

Common Installation Issues

PostgreSQL Connection Errors

Error: psycopg2.OperationalError: could not connect to server

Symptoms:

- Spider fails to start with a database connection error
- Error occurs during pipeline initialization

Solutions:

Check PostgreSQL service status:

# Ubuntu/Debian
sudo systemctl status postgresql
sudo systemctl start postgresql

# macOS with Homebrew
brew services start postgresql

# Windows
net start postgresql-x64-15

Verify database connection string:

# Test connection manually
import psycopg2

try:
    conn = psycopg2.connect(
        "postgresql://username:password@localhost:5432/database_name"
    )
    print("✅ Connection successful")
    conn.close()
except Exception as e:
    print(f"❌ Connection failed: {e}")
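
A connection string that looks correct can still fail when the password contains characters such as `@`, `:` or `/`, because these must be percent-encoded inside a URI. The sketch below uses only the standard library; `build_db_url` is a hypothetical helper for illustration and is not part of Scrapy Item Ingest.

```python
from urllib.parse import quote


def build_db_url(user, password, host, port, dbname):
    """Build a PostgreSQL URI with percent-encoded credentials (hypothetical helper)."""
    # safe="" forces reserved characters such as "@", ":" and "/" to be encoded
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"postgresql://{user_enc}:{password_enc}@{host}:{port}/{dbname}"


# Example with a password that would otherwise break URI parsing
print(build_db_url("username", "p@ss:word/1", "localhost", 5432, "database_name"))
```

The resulting string can be passed to `psycopg2.connect` in the manual test above.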

Check firewall and network settings:

# Test port connectivity
telnet localhost 5432

# Or using nc
nc -zv localhost 5432
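
Where `telnet` and `nc` are not installed, the same check can be made from Python. This is a minimal sketch using the standard `socket` module; the function name `is_port_open` is chosen for illustration.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


print("PostgreSQL port reachable:", is_port_open("localhost", 5432))
```

A `False` result only shows that nothing accepted the TCP connection; it does not distinguish between a stopped service, a firewall rule, and a wrong `listen_addresses` value.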

Verify PostgreSQL configuration:

# Check postgresql.conf
sudo nano /etc/postgresql/15/main/postgresql.conf

# Ensure these settings:
listen_addresses = '*'  # or 'localhost'
port = 5432

Error: psycopg2.OperationalError: FATAL: password authentication failed

Solutions:

Reset PostgreSQL password:

sudo -u postgres psql
ALTER USER postgres PASSWORD 'newpassword';

Check pg_hba.conf authentication method:

sudo nano /etc/postgresql/15/main/pg_hba.conf

# Change to:
local   all             all                                     md5
host    all             all             127.0.0.1/32            md5

Restart PostgreSQL after changes:
```
sudo systemctl restart postgresql
```

Table Creation Issues

Error: relation "job_items" does not exist

Symptoms:

- Spider runs but fails when trying to store items
- Error occurs when CREATE_TABLES = False

Solutions:

Enable automatic table creation:
```
# settings.py
CREATE_TABLES = True
```

Manually create tables:

-- Connect to your database and run:
CREATE TABLE job_items (
    id BIGSERIAL PRIMARY KEY,
    item JSONB NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    job_id INTEGER NOT NULL
);

CREATE TABLE job_requests (
    id BIGSERIAL PRIMARY KEY,
    url VARCHAR(200) NOT NULL,
    method VARCHAR(10) NOT NULL,
    status_code INTEGER,
    response_time FLOAT,
    fingerprint VARCHAR(255),
    parent_url VARCHAR(255),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    job_id INTEGER NOT NULL,
    parent_id BIGINT,
    FOREIGN KEY (parent_id) REFERENCES job_requests(id)
);

CREATE TABLE job_logs (
    id BIGSERIAL PRIMARY KEY,
    job_id INTEGER NOT NULL,
    type VARCHAR(50) NOT NULL,
    message TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Check database permissions:

-- Creating tables requires CREATE on the schema
-- (CREATE on the database only allows creating new schemas)
GRANT USAGE, CREATE ON SCHEMA public TO your_user;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO your_user;
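
To confirm which of the three tables are actually visible to the connecting user, a short check along these lines may help. It is a sketch that accepts any DB-API cursor using `%s` placeholders (for example one from psycopg2); `find_missing_tables` and `EXPECTED_TABLES` are names introduced here, not part of the library.

```python
EXPECTED_TABLES = ("job_items", "job_requests", "job_logs")


def find_missing_tables(cursor, schema="public", expected=EXPECTED_TABLES):
    """Return the expected tables that are not visible in the given schema."""
    # psycopg2 adapts a Python tuple to a parenthesised SQL list for IN
    cursor.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name IN %s",
        (schema, tuple(expected)),
    )
    existing = {row[0] for row in cursor.fetchall()}
    return [name for name in expected if name not in existing]


# Usage with psycopg2 (requires a reachable database):
#   conn = psycopg2.connect(db_url)
#   print(find_missing_tables(conn.cursor()))
```

Because `information_schema.tables` lists only tables the current user owns or has some privilege on, a table reported as missing may exist but be inaccessible, which points back to the grants above.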

Runtime Issues

Performance Issues

Problem: Slow spider performance or database bottlenecks

Symptoms:

- Very slow item processing
- Long response times
- Database connection timeouts

Solutions:

Optimize database connections:

# settings.py
DB_SETTINGS = {
    'pool_size': 20,
    'max_overflow': 30,
    'pool_timeout': 30,
}

Tune Scrapy concurrency:

# Start with lower values and increase gradually
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5

Enable autothrottle:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
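
To get a feel for how these values interact, the following is a simplified model of a single AutoThrottle adjustment step, based on the algorithm described in the Scrapy documentation. It is illustrative only: `next_autothrottle_delay` is a hypothetical function, and the real extension may differ in details.

```python
def next_autothrottle_delay(previous_delay, latency, target_concurrency,
                            min_delay, max_delay, status_ok=True):
    """Simplified model of one AutoThrottle delay adjustment (illustrative only)."""
    # Delay that would keep about `target_concurrency` requests in flight
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay towards the target ...
    new_delay = (previous_delay + target_delay) / 2.0
    # ... but jump straight to the target when the server is slower than expected
    new_delay = max(target_delay, new_delay)
    # Clamp between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses are not allowed to decrease the delay
    if not status_ok and new_delay <= previous_delay:
        return previous_delay
    return new_delay


# With DOWNLOAD_DELAY = 0.5, AUTOTHROTTLE_MAX_DELAY = 10 and target concurrency 2.0
print(next_autothrottle_delay(2.0, latency=1.0, target_concurrency=2.0,
                              min_delay=0.5, max_delay=10.0))
```

In this model a higher `AUTOTHROTTLE_TARGET_CONCURRENCY` shrinks the target delay for the same latency, so raising it increases load on both the remote site and the database.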

Optimize database queries:

-- Add indexes for better performance
CREATE INDEX CONCURRENTLY idx_job_items_job_id ON job_items(job_id);
CREATE INDEX CONCURRENTLY idx_job_items_created_at ON job_items(created_at);

Configuration Problems

Pipeline Order Issues

Problem: Items are not processed correctly or pipelines do not run

Solutions:

Check pipeline order:

# Correct order (lower numbers run first)
ITEM_PIPELINES = {
    'myproject.pipelines.ValidationPipeline': 200,
    'scrapy_item_ingest.DbInsertPipeline': 300,
    'myproject.pipelines.NotificationPipeline': 400,
}
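
When the order is in doubt, sorting the mapping by its numeric value shows the sequence in which items pass through the pipelines. The helper below is a small sketch; `execution_order` is a name chosen here, and entries set to `None` are skipped because Scrapy treats them as disabled.

```python
ITEM_PIPELINES = {
    'myproject.pipelines.NotificationPipeline': 400,
    'myproject.pipelines.ValidationPipeline': 200,
    'scrapy_item_ingest.DbInsertPipeline': 300,
}


def execution_order(pipelines):
    """Return pipeline paths in the order items pass through them."""
    enabled = {path: prio for path, prio in pipelines.items() if prio is not None}
    return [path for path, _ in sorted(enabled.items(), key=lambda kv: kv[1])]


for position, path in enumerate(execution_order(ITEM_PIPELINES), start=1):
    print(position, path)
```

The dictionary's written order does not matter; only the numbers do.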

Ensure pipeline returns items:

def process_item(self, item, spider):
    # Process the item
    processed_item = self.do_processing(item)

    # MUST return the item for next pipeline
    return processed_item

Job ID Configuration Issues

Problem: Items not grouped correctly or job_id is null

Solutions:

Explicitly set JOB_ID:

# settings.py
JOB_ID = 'my_specific_job_001'

Check spider attribute:

# In your spider
class MySpider(scrapy.Spider):
    name = 'my_spider'

    def __init__(self, job_id=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        if job_id:
            self.job_id = job_id

Verify job_id is being set:

# Add logging to check job_id
def open_spider(self, spider):
    job_id = getattr(spider, 'job_id', spider.name)
    spider.logger.info(f"Using job_id: {job_id}")
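
The fallback used above can be extended into a single lookup. The sketch below assumes a precedence of explicit setting, then spider attribute, then spider name; that order is an assumption made for illustration, and `resolve_job_id` is a hypothetical helper rather than the library's own logic.

```python
from types import SimpleNamespace


def resolve_job_id(settings, spider):
    """Pick a job identifier (assumed order: JOB_ID setting, spider.job_id, spider.name)."""
    job_id = settings.get("JOB_ID")
    if job_id is None:
        job_id = getattr(spider, "job_id", None)
    if job_id is None:
        job_id = spider.name
    return job_id


# Stand-ins for a Scrapy settings object and a spider instance
settings = {"JOB_ID": "my_specific_job_001"}
spider = SimpleNamespace(name="my_spider")
print(resolve_job_id(settings, spider))
print(resolve_job_id({}, spider))
```

Logging the resolved value once in `open_spider` makes it easy to see which source supplied it.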

Data Quality Issues

JSON Serialization Errors

Error: TypeError: Object of type X is not JSON serializable

Solutions:

Use proper field types:

# Convert datetime objects
from datetime import datetime

item['scraped_at'] = datetime.now().isoformat()

# Convert Decimal to float
from decimal import Decimal
price = Decimal('29.99')
item['price'] = float(price)

Custom serialization:

import json
from datetime import datetime, date
from decimal import Decimal

class CustomJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        elif isinstance(obj, Decimal):
            return float(obj)
        return super().default(obj)
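
The same conversion rules can be tried outside the pipeline with the `default` hook of `json.dumps`, which is a quick way to find the field that triggers the error. This is a sketch with an illustrative item; `to_jsonable` is a name introduced here, and whether the pipeline accepts a custom encoder is not covered by this guide.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_jsonable(obj):
    """Fallback used by json.dumps for types it cannot encode itself."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


item = {"price": Decimal("29.99"), "scraped_at": datetime(2024, 1, 2, 3, 4, 5)}
encoded = json.dumps(item, default=to_jsonable, sort_keys=True)
print(encoded)
```

Any type not handled in `to_jsonable` still raises `TypeError`, with the offending type named in the message.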

Clean data before yielding:

def clean_item(self, item):
    """Clean item data for JSON serialization"""
    cleaned = {}
    for key, value in item.items():
        if isinstance(value, (str, int, float, bool, list, dict)):
            cleaned[key] = value
        elif value is None:
            cleaned[key] = None
        else:
            cleaned[key] = str(value)
    return cleaned
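
Note that the function above passes lists and dicts through unchanged, so a `datetime` or `Decimal` nested inside them would still fail to serialize. A recursive variant could look like the following sketch; `clean_value` is a hypothetical name, and converting unknown types with `str` is a design choice that trades type fidelity for robustness.

```python
from datetime import date
from decimal import Decimal


def clean_value(value):
    """Recursively convert a value into JSON-serializable types."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, dict):
        return {str(key): clean_value(val) for key, val in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [clean_value(val) for val in value]
    # Fallback for datetime, Decimal, and any other type
    return str(value)


sample = {"tags": ("a", Decimal("1.5")), "meta": {"seen": date(2024, 1, 2)}}
print(clean_value(sample))
```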

JSONB Query Issues

Problem: Can’t query JSONB fields effectively

Solutions:

Use proper JSONB operators:

-- Extract text values
SELECT item->>'name' as product_name FROM job_items;

-- Extract numeric values
SELECT (item->>'price')::FLOAT as price FROM job_items;

-- Check for key existence
SELECT * FROM job_items WHERE item ? 'price';

-- Query nested objects
SELECT * FROM job_items WHERE item->'metadata'->>'source' = 'website';

Create functional indexes:

-- Index for frequently queried fields
CREATE INDEX idx_items_name ON job_items ((item->>'name'));
CREATE INDEX idx_items_price ON job_items (((item->>'price')::FLOAT));
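
The same kind of query can be issued from Python. The sketch below uses the JSONB containment operator `@>` with parameters, so values are never interpolated into the SQL string. It accepts any DB-API cursor with `%s` placeholders, such as one from psycopg2; `find_items_containing` is a name chosen for illustration.

```python
import json


def find_items_containing(cursor, job_id, fragment):
    """Return (id, item) rows whose JSONB item contains the given fragment."""
    cursor.execute(
        "SELECT id, item FROM job_items "
        "WHERE job_id = %s AND item @> %s::jsonb",
        (job_id, json.dumps(fragment)),
    )
    return cursor.fetchall()


# Usage with psycopg2 (requires a reachable database):
#   rows = find_items_containing(conn.cursor(), 1, {"metadata": {"source": "website"}})
```

Containment queries of this form can be served by a GIN index on the `item` column, which may be worth considering if the expression indexes above do not cover the fields being filtered.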

Monitoring and Debugging

Enable Debug Logging

Enable detailed logging:

# settings.py
LOG_LEVEL = 'DEBUG'

# Custom logging configuration
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'formatters': {
        'verbose': {
            'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}',
            'style': '{',
        },
    },
    'handlers': {
        'file': {
            'level': 'DEBUG',
            'class': 'logging.FileHandler',
            'filename': 'scrapy_debug.log',
            'formatter': 'verbose',
        },
    },
    'loggers': {
        'scrapy_item_ingest': {
            'handlers': ['file'],
            'level': 'DEBUG',
            'propagate': True,
        },
    },
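}

# Scrapy has no LOGGING setting, so the dict must be applied explicitly
import logging.config
logging.config.dictConfig(LOGGING)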

Add debug information to pipelines:

def process_item(self, item, spider):
    spider.logger.debug(f"Processing item: {item}")
    spider.logger.debug(f"Using job_id: {getattr(spider, 'job_id', 'unknown')}")

    # Process item
    return item

Database Connection Debugging

Test database connectivity:

# test_db.py
import psycopg2
import sys

def test_database_connection(db_url):
    try:
        conn = psycopg2.connect(db_url)
        cursor = conn.cursor()

        # Test basic operations
        cursor.execute("SELECT version();")
        version = cursor.fetchone()[0]
        print(f"✅ Connected to: {version}")

        # Test table access
        cursor.execute("SELECT COUNT(*) FROM job_items;")
        count = cursor.fetchone()[0]
        print(f"✅ Items in database: {count}")

        conn.close()
        return True

    except Exception as e:
        print(f"❌ Database test failed: {e}")
        return False

if __name__ == "__main__":
    db_url = sys.argv[1] if len(sys.argv) > 1 else "postgresql://user:pass@localhost/db"
    test_database_connection(db_url)
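
If the test succeeds only intermittently, for example while the database is still starting, retrying with an increasing delay can help separate transient failures from configuration errors. This is a generic sketch; `connect_with_retry` is a hypothetical helper, and the broad `except` should be narrowed to `psycopg2.OperationalError` in real code.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `connect()` until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # narrow this to the driver's error type
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Usage with psycopg2 (requires a reachable database):
#   conn = connect_with_retry(lambda: psycopg2.connect(db_url))
```

A failure that persists through every attempt is unlikely to be transient and is better investigated with the checks in PostgreSQL Connection Errors.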

Monitor database connections:

-- Check active connections
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    query_start,
    query
FROM pg_stat_activity
WHERE datname = 'your_database_name';
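
For a quick overview during a crawl, the same view can be summarized by connection state from Python. The sketch accepts any DB-API cursor with `%s` placeholders; `count_connections_by_state` is a name introduced here.

```python
def count_connections_by_state(cursor, database_name):
    """Return a {state: count} mapping for connections to one database."""
    cursor.execute(
        "SELECT COALESCE(state, 'unknown'), COUNT(*) "
        "FROM pg_stat_activity WHERE datname = %s "
        "GROUP BY 1 ORDER BY 2 DESC",
        (database_name,),
    )
    return dict(cursor.fetchall())


# Usage with psycopg2 (requires a reachable database):
#   print(count_connections_by_state(conn.cursor(), "your_database_name"))
```

A growing number of connections in the `idle in transaction` state usually indicates transactions that were opened but never committed or rolled back.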

Performance Profiling

Profile spider performance:

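# Add profiling to spider
import cProfile
import pstats

import scrapy
from scrapy import signals

class ProfilingSpider(scrapy.Spider):
    name = 'profiling_spider'  # placeholder name

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.profiler = cProfile.Profile()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Handlers only run once they are connected to the crawler's signals
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_opened(self, spider):
        self.profiler.enable()

    def spider_closed(self, spider):
        self.profiler.disable()
        stats = pstats.Stats(self.profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(20)  # Top 20 functions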

Monitor memory usage:

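import psutil
import os
from scrapy import signals

class MemoryMonitoringExtension:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Connect the handler so it runs after every scraped item
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        return ext

    def item_scraped(self, item, response, spider):
        process = psutil.Process(os.getpid())
        memory_mb = process.memory_info().rss / 1024 / 1024
        spider.logger.info(f"Memory usage: {memory_mb:.2f} MB")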

Common Error Messages and Solutions

`ImportError: No module named 'scrapy_item_ingest'`

Solution:

pip install scrapy-item-ingest

# Or, if developing from a source checkout:
pip install -e .

`AttributeError: 'Spider' object has no attribute 'job_id'`

Solution:

# Ensure job_id is set in settings or on the spider
JOB_ID = 'your_job_id'

# Or handle a missing job_id gracefully:
job_id = getattr(spider, 'job_id', spider.name)

`psycopg2.errors.UndefinedTable: relation "job_items" does not exist`

Solution:

# Enable table creation
CREATE_TABLES = True

Getting Help

When reporting issues, please include:

Environment information:

- Python version
- Scrapy version
- PostgreSQL version
- Operating system

Configuration:

- Relevant settings.py content
- Pipeline configuration
- Database connection string (without credentials)

Error logs:

- Complete error traceback
- Relevant log messages
- Spider output

Minimal reproduction case:

- Simple spider that reproduces the issue
- Sample data if relevant
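
To share a connection string without credentials, the user and password can be masked before pasting it into a report. The sketch below uses only the standard library; `redact_db_url` is a hypothetical helper, not part of Scrapy Item Ingest.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_db_url(db_url):
    """Return the connection URL with any user name and password masked."""
    parts = urlsplit(db_url)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    has_credentials = parts.username is not None or parts.password is not None
    netloc = f"***:***@{host}" if has_credentials else host
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(redact_db_url("postgresql://username:password@localhost:5432/database_name"))
```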

Support Channels:

- GitHub Issues: https://github.com/fawadss1/scrapy_item_ingest/issues
- Documentation: This documentation site

Next Steps

Pipelines API Reference - Detailed API reference
Contributing - Contributing guidelines
Running with Parameters - Basic setup examples
