Troubleshooting
This guide covers common issues, error messages, and solutions when using Scrapy Item Ingest in development and production environments.
Common Installation Issues
PostgreSQL Connection Errors
Error: psycopg2.OperationalError: could not connect to server
Symptoms:
- Spider fails to start with a database connection error
- Error occurs during pipeline initialization
Solutions:
Check PostgreSQL service status:
    # Ubuntu/Debian
    sudo systemctl status postgresql
    sudo systemctl start postgresql

    # macOS with Homebrew
    brew services start postgresql

    # Windows
    net start postgresql-x64-15
Verify database connection string:
    # Test connection manually
    import psycopg2

    try:
        conn = psycopg2.connect(
            "postgresql://username:password@localhost:5432/database_name"
        )
        print("✅ Connection successful")
        conn.close()
    except Exception as e:
        print(f"❌ Connection failed: {e}")
Check firewall and network settings:
    # Test port connectivity
    telnet localhost 5432

    # Or using nc
    nc -zv localhost 5432
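If neither telnet nor nc is installed, the same check can be sketched in Python with the standard library. The helper name `port_is_open` is hypothetical and not part of Scrapy Item Ingest; the sketch only confirms that something accepts TCP connections on the port, not that it is PostgreSQL.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False


if __name__ == "__main__":
    # 5432 is the PostgreSQL port used throughout this guide
    status = "open" if port_is_open("localhost", 5432) else "closed or unreachable"
    print(f"localhost:5432 is {status}")
```

A closed port on a reachable host usually points at the PostgreSQL service or its listen_addresses setting, while a timeout more often suggests a firewall.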
Verify PostgreSQL configuration:
    # Check postgresql.conf
    sudo nano /etc/postgresql/15/main/postgresql.conf

    # Ensure these settings:
    listen_addresses = '*'  # or 'localhost'
    port = 5432
Error: psycopg2.OperationalError: FATAL: password authentication failed
Solutions:
Reset PostgreSQL password:
    # Open a psql session as the postgres superuser
    sudo -u postgres psql

    -- Then, at the psql prompt:
    ALTER USER postgres PASSWORD 'newpassword';
Check pg_hba.conf authentication method:
    sudo nano /etc/postgresql/15/main/pg_hba.conf

    # Change to:
    local   all   all                  md5
    host    all   all   127.0.0.1/32   md5
Restart PostgreSQL after changes:
sudo systemctl restart postgresql
Table Creation Issues
Error: relation "job_items" does not exist
Symptoms:
- Spider runs but fails when trying to store items
- Error occurs when CREATE_TABLES = False
Solutions:
Enable automatic table creation:
    # settings.py
    CREATE_TABLES = True
Manually create tables:
    -- Connect to your database and run:
    CREATE TABLE job_items (
        id BIGSERIAL PRIMARY KEY,
        item JSONB NOT NULL,
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        job_id INTEGER NOT NULL
    );

    CREATE TABLE job_requests (
        id BIGSERIAL PRIMARY KEY,
        url VARCHAR(200) NOT NULL,
        method VARCHAR(10) NOT NULL,
        status_code INTEGER,
        response_time FLOAT,
        fingerprint VARCHAR(255),
        parent_url VARCHAR(255),
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
        job_id INTEGER NOT NULL,
        parent_id BIGINT,
        FOREIGN KEY (parent_id) REFERENCES job_requests(id)
    );

    CREATE TABLE job_logs (
        id BIGSERIAL PRIMARY KEY,
        job_id INTEGER NOT NULL,
        type VARCHAR(50) NOT NULL,
        message TEXT NOT NULL,
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    );
Check database permissions:
    -- Ensure user has table creation privileges
    GRANT CREATE ON DATABASE your_database TO your_user;
    GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO your_user;
Runtime Issues
Performance Issues
Problem: Slow spider performance or database bottlenecks
Symptoms:
- Very slow item processing
- Long response times
- Database connection timeouts
Solutions:
Optimize database connections:
    # settings.py
    DB_SETTINGS = {
        'pool_size': 20,
        'max_overflow': 30,
        'pool_timeout': 30,
    }
Tune Scrapy concurrency:
    # Start with lower values and increase gradually
    CONCURRENT_REQUESTS = 16
    CONCURRENT_REQUESTS_PER_DOMAIN = 8
    DOWNLOAD_DELAY = 0.5
Enable autothrottle:
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1
    AUTOTHROTTLE_MAX_DELAY = 10
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
Optimize database queries:
    -- Add indexes for better performance
    CREATE INDEX CONCURRENTLY idx_job_items_job_id ON job_items(job_id);
    CREATE INDEX CONCURRENTLY idx_job_items_created_at ON job_items(created_at);
Configuration Problems
Pipeline Order Issues
Error: Items not being processed correctly or pipelines not running
Solutions:
Check pipeline order:
    # Correct order (lower numbers run first)
    ITEM_PIPELINES = {
        'myproject.pipelines.ValidationPipeline': 200,
        'scrapy_item_ingest.DbInsertPipeline': 300,
        'myproject.pipelines.NotificationPipeline': 400,
    }
Ensure pipeline returns items:
    def process_item(self, item, spider):
        # Process the item
        processed_item = self.do_processing(item)

        # MUST return the item for next pipeline
        return processed_item
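To see why the return value matters, here is a simplified pure-Python model of items flowing through pipelines in priority order. It is a sketch for illustration, not Scrapy's actual implementation: `run_pipelines`, `UppercaseTitlePipeline`, and `ForgetfulPipeline` are hypothetical names.

```python
class UppercaseTitlePipeline:
    def process_item(self, item, spider):
        item["title"] = item["title"].upper()
        return item  # hands the item to the next pipeline


class ForgetfulPipeline:
    def process_item(self, item, spider):
        item["seen"] = True
        # Bug: no return statement, so whatever runs next receives None


def run_pipelines(item, pipelines, spider=None):
    """Pass item through pipelines sorted by priority (lowest number first)."""
    for pipeline, _priority in sorted(pipelines.items(), key=lambda kv: kv[1]):
        item = pipeline.process_item(item, spider)
    return item


good = run_pipelines({"title": "widget"}, {UppercaseTitlePipeline(): 200})
print(good)  # {'title': 'WIDGET'}

broken = run_pipelines(
    {"title": "widget"},
    {UppercaseTitlePipeline(): 200, ForgetfulPipeline(): 300},
)
print(broken)  # None
```

In this model, any pipeline placed after `ForgetfulPipeline` would be handed `None` instead of the item, which is the kind of failure the symptom above describes.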
Job ID Configuration Issues
Problem: Items not grouped correctly or job_id is null
Solutions:
Explicitly set JOB_ID:
    # settings.py
    JOB_ID = 'my_specific_job_001'
Check spider attribute:
    # In your spider
    import scrapy


    class MySpider(scrapy.Spider):
        name = 'my_spider'

        def __init__(self, job_id=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Only set the attribute when a job_id was passed in
            if job_id:
                self.job_id = job_id
Verify job_id is being set:
    # Add logging to check job_id (in a pipeline's open_spider method)
    def open_spider(self, spider):
        job_id = getattr(spider, 'job_id', spider.name)
        spider.logger.info(f"Using job_id: {job_id}")
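The fallback chain can be tried outside Scrapy with a small stand-in object. This is a minimal sketch: `resolve_job_id` is a hypothetical helper, and the precedence shown (JOB_ID setting, then spider attribute, then spider name) is an assumption for illustration rather than confirmed library behavior.

```python
from types import SimpleNamespace


def resolve_job_id(settings: dict, spider) -> str:
    """Pick a job_id: explicit setting first, then spider attribute, then name."""
    # Assumed precedence; check the library's own behavior before relying on it
    job_id = settings.get("JOB_ID") or getattr(spider, "job_id", None)
    return str(job_id) if job_id else spider.name


spider = SimpleNamespace(name="my_spider")  # stand-in for a real spider
print(resolve_job_id({}, spider))  # my_spider

spider.job_id = "run_42"
print(resolve_job_id({}, spider))  # run_42

print(resolve_job_id({"JOB_ID": "my_specific_job_001"}, spider))  # my_specific_job_001
```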
Data Quality Issues
JSON Serialization Errors
Error: TypeError: Object of type X is not JSON serializable
Solutions:
Use proper field types:
    # Convert datetime objects
    from datetime import datetime
    item['scraped_at'] = datetime.now().isoformat()

    # Convert Decimal to float
    from decimal import Decimal
    price = Decimal('29.99')
    item['price'] = float(price)
Custom serialization:
    import json
    from datetime import datetime, date
    from decimal import Decimal


    class CustomJSONEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, (datetime, date)):
                return obj.isoformat()
            elif isinstance(obj, Decimal):
                return float(obj)
            return super().default(obj)
Clean data before yielding:
    def clean_item(self, item):
        """Clean item data for JSON serialization"""
        cleaned = {}
        for key, value in item.items():
            if isinstance(value, (str, int, float, bool, list, dict)):
                cleaned[key] = value
            elif value is None:
                cleaned[key] = None
            else:
                cleaned[key] = str(value)
        return cleaned
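A custom encoder takes effect only when it is passed to `json.dumps` through the `cls` argument. The sketch below repeats the encoder so it runs on its own and uses made-up sample data; how (or whether) the ingest pipeline accepts a custom encoder is not covered here, so treat this as a way to check your items before yielding them.

```python
import json
from datetime import date, datetime
from decimal import Decimal


class CustomJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects the standard encoder cannot handle
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return float(obj)
        return super().default(obj)


# Sample item with values that plain json.dumps would reject
item = {
    "name": "Widget",
    "price": Decimal("29.99"),
    "scraped_at": datetime(2024, 1, 15, 12, 30, 0),
    "released": date(2023, 6, 1),
}

encoded = json.dumps(item, cls=CustomJSONEncoder)
print(encoded)
# {"name": "Widget", "price": 29.99, "scraped_at": "2024-01-15T12:30:00", "released": "2023-06-01"}
```

Converting Decimal to float can lose precision for some values; if exact amounts matter, `str(obj)` is a safer choice in the Decimal branch.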
JSONB Query Issues
Problem: Can’t query JSONB fields effectively
Solutions:
Use proper JSONB operators:
    -- Extract text values
    SELECT item->>'name' AS product_name FROM job_items;

    -- Extract numeric values
    SELECT (item->>'price')::FLOAT AS price FROM job_items;

    -- Check for key existence
    SELECT * FROM job_items WHERE item ? 'price';

    -- Query nested objects
    SELECT * FROM job_items WHERE item->'metadata'->>'source' = 'website';
Create functional indexes:
    -- Index for frequently queried fields
    CREATE INDEX idx_items_name ON job_items ((item->>'name'));
    CREATE INDEX idx_items_price ON job_items (((item->>'price')::FLOAT));
Monitoring and Debugging
Enable Debug Logging
Enable detailed logging:
    # settings.py
    import logging.config

    LOG_LEVEL = 'DEBUG'

    # Custom logging configuration
    LOGGING = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'verbose': {
                'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}',
                'style': '{',
            },
        },
        'handlers': {
            'file': {
                'level': 'DEBUG',
                'class': 'logging.FileHandler',
                'filename': 'scrapy_debug.log',
                'formatter': 'verbose',
            },
        },
        'loggers': {
            'scrapy_item_ingest': {
                'handlers': ['file'],
                'level': 'DEBUG',
                'propagate': True,
            },
        },
    }

    # Scrapy has no LOGGING setting of its own, so apply the dict explicitly
    logging.config.dictConfig(LOGGING)
Add debug information to pipelines:
    def process_item(self, item, spider):
        spider.logger.debug(f"Processing item: {item}")
        spider.logger.debug(f"Using job_id: {getattr(spider, 'job_id', 'unknown')}")

        # Process item
        return item
Database Connection Debugging
Test database connectivity:
    # test_db.py
    import psycopg2
    import sys


    def test_database_connection(db_url):
        try:
            conn = psycopg2.connect(db_url)
            cursor = conn.cursor()

            # Test basic operations
            cursor.execute("SELECT version();")
            version = cursor.fetchone()[0]
            print(f"✅ Connected to: {version}")

            # Test table access
            cursor.execute("SELECT COUNT(*) FROM job_items;")
            count = cursor.fetchone()[0]
            print(f"✅ Items in database: {count}")

            conn.close()
            return True
        except Exception as e:
            print(f"❌ Database test failed: {e}")
            return False


    if __name__ == "__main__":
        db_url = sys.argv[1] if len(sys.argv) > 1 else "postgresql://user:pass@localhost/db"
        test_database_connection(db_url)
Monitor database connections:
    -- Check active connections
    SELECT pid, usename, application_name, client_addr, state, query_start, query
    FROM pg_stat_activity
    WHERE datname = 'your_database_name';
Performance Profiling
Profile spider performance:
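    # Add profiling to spider
    import cProfile
    import pstats

    import scrapy
    from scrapy import signals


    class ProfilingSpider(scrapy.Spider):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.profiler = cProfile.Profile()

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # The handlers below run only once they are connected to the signals
            crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_opened(self, spider):
            self.profiler.enable()

        def spider_closed(self, spider):
            self.profiler.disable()
            stats = pstats.Stats(self.profiler)
            stats.sort_stats('cumulative')
            stats.print_stats(20)  # Top 20 functions
Monitor memory usage:
    import os

    import psutil
    from scrapy import signals


    class MemoryMonitoringExtension:
        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            # Connect the handler; the class must also be listed in EXTENSIONS
            crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
            return ext

        def item_scraped(self, item, response, spider):
            process = psutil.Process(os.getpid())
            memory_mb = process.memory_info().rss / 1024 / 1024
            spider.logger.info(f"Memory usage: {memory_mb:.2f} MB")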
Common Error Messages and Solutions
ImportError: No module named 'scrapy_item_ingest'
Solution:
    pip install scrapy-item-ingest

    # Or if developing:
    pip install -e .
AttributeError: 'Spider' object has no attribute 'job_id'
Solution:
    # Ensure job_id is set in settings or spider
    JOB_ID = 'your_job_id'

    # Or handle missing job_id gracefully:
    job_id = getattr(spider, 'job_id', spider.name)
psycopg2.errors.UndefinedTable: relation "job_items" does not exist
Solution:
    # Enable table creation
    CREATE_TABLES = True
Getting Help
When reporting issues, please include:
Environment information:
- Python version
- Scrapy version
- PostgreSQL version
- Operating system
Configuration:
- Relevant settings.py content
- Pipeline configuration
- Database connection string (without credentials)
Error logs:
- Complete error traceback
- Relevant log messages
- Spider output
Minimal reproduction case:
- Simple spider that reproduces the issue
- Sample data if relevant
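When sharing a connection string, it helps to strip the credentials programmatically rather than by hand. The following is a minimal sketch using only the standard library; `redact_db_url` is a hypothetical helper, not part of Scrapy Item Ingest.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_db_url(db_url: str) -> str:
    """Replace the user:password part of a database URL with a placeholder."""
    parts = urlsplit(db_url)
    if "@" not in parts.netloc:
        return db_url  # no credentials present
    # Split on the last '@' so a password containing '@' is still removed
    host_part = parts.netloc.rsplit("@", 1)[1]
    return urlunsplit(parts._replace(netloc=f"***:***@{host_part}"))


print(redact_db_url("postgresql://username:password@localhost:5432/database_name"))
# postgresql://***:***@localhost:5432/database_name
```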
Support Channels:
- GitHub Issues: https://github.com/fawadss1/scrapy_item_ingest/issues
- Documentation: This documentation site
Next Steps
Pipelines API Reference - Detailed API reference
Contributing - Contributing guidelines
Running with Parameters - Basic setup examples