Troubleshooting
===============

This guide covers common issues, error messages, and solutions when using Scrapy Item Ingest in development and production environments.

Common Installation Issues
--------------------------

PostgreSQL Connection Errors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Error**: ``psycopg2.OperationalError: could not connect to server``

**Symptoms:**

- Spider fails to start with database connection error
- Error occurs during pipeline initialization

**Solutions:**

1. **Check PostgreSQL service status:**

   .. code-block:: bash

      # Ubuntu/Debian
      sudo systemctl status postgresql
      sudo systemctl start postgresql

      # macOS with Homebrew
      brew services start postgresql

      # Windows
      net start postgresql-x64-15

2. **Verify database connection string:**

   .. code-block:: python

      # Test connection manually
      import psycopg2

      try:
          conn = psycopg2.connect(
              "postgresql://username:password@localhost:5432/database_name"
          )
          print("✅ Connection successful")
          conn.close()
      except Exception as e:
          print(f"❌ Connection failed: {e}")

3. **Check firewall and network settings:**

   .. code-block:: bash

      # Test port connectivity
      telnet localhost 5432

      # Or using nc
      nc -zv localhost 5432

4. **Verify PostgreSQL configuration:**

   .. code-block:: bash

      # Open postgresql.conf (the path depends on the installed version)
      sudo nano /etc/postgresql/15/main/postgresql.conf

      # Ensure the file contains these settings:
      #   listen_addresses = '*'   # or 'localhost'
      #   port = 5432

**Error**: ``psycopg2.OperationalError: FATAL: password authentication failed``

**Solutions:**

1. **Reset PostgreSQL password:**

   .. code-block:: bash

      # Run the SQL statement through psql as the postgres superuser
      sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'newpassword';"

2. **Check pg_hba.conf authentication method:**

   .. code-block:: bash

      sudo nano /etc/postgresql/15/main/pg_hba.conf

      # Change the matching entries to:
      #   local   all   all                  md5
      #   host    all   all   127.0.0.1/32   md5

3. **Restart PostgreSQL after changes:**

   .. code-block:: bash

      sudo systemctl restart postgresql

Table Creation Issues
~~~~~~~~~~~~~~~~~~~~~

**Error**: ``relation "job_items" does not exist``

**Symptoms:**

- Spider runs but fails when trying to store items
- Error occurs when ``CREATE_TABLES = False``

**Solutions:**

1. **Enable automatic table creation:**

   .. code-block:: python

      # settings.py
      CREATE_TABLES = True

2. **Manually create tables:**

   .. code-block:: sql

      -- Connect to your database and run:
      CREATE TABLE job_items (
          id BIGSERIAL PRIMARY KEY,
          item JSONB NOT NULL,
          created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
          job_id INTEGER NOT NULL
      );

      CREATE TABLE job_requests (
          id BIGSERIAL PRIMARY KEY,
          url VARCHAR(200) NOT NULL,
          method VARCHAR(10) NOT NULL,
          status_code INTEGER,
          response_time FLOAT,
          fingerprint VARCHAR(255),
          parent_url VARCHAR(255),
          created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
          job_id INTEGER NOT NULL,
          parent_id BIGINT,
          FOREIGN KEY (parent_id) REFERENCES job_requests(id)
      );

      CREATE TABLE job_logs (
          id BIGSERIAL PRIMARY KEY,
          job_id INTEGER NOT NULL,
          type VARCHAR(50) NOT NULL,
          message TEXT NOT NULL,
          created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
      );

3. **Check database permissions:**

   .. code-block:: sql

      -- CREATE on the database allows creating new schemas
      GRANT CREATE ON DATABASE your_database TO your_user;

      -- Creating tables requires CREATE on the target schema
      GRANT USAGE, CREATE ON SCHEMA public TO your_user;

      -- Access to tables that already exist
      GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO your_user;

Runtime Issues
--------------

Memory-Related Problems
~~~~~~~~~~~~~~~~~~~~~~~

**Error**: ``MemoryError`` or ``killed`` during crawling

**Symptoms:**

- Spider stops unexpectedly
- High memory usage in system monitor
- Docker container gets killed

**Solutions:**

1. **Enable memory monitoring:**

   .. code-block:: python

      # settings.py
      MEMUSAGE_ENABLED = True
      MEMUSAGE_LIMIT_MB = 2048
      MEMUSAGE_WARNING_MB = 1536

2. **Optimize batch processing:**

   .. code-block:: python

      # Reduce batch sizes
      BATCH_SIZE = 100  # Instead of 1000

      # Process items more frequently
      ITEM_BUFFER_SIZE = 50

3. **Use memory-efficient data structures:**

   .. code-block:: python

      # In your spider
      def parse(self, response):
          # Don't store large objects in memory
          item = {
              'title': response.css('title::text').get(),
              'url': response.url
          }
          # Avoid: item['full_html'] = response.text
          yield item

4. **Configure garbage collection:**

   .. code-block:: python

      # settings.py
      import gc

      # Force garbage collection more frequently
      gc.set_threshold(100, 10, 10)

Performance Issues
~~~~~~~~~~~~~~~~~~

**Problem**: Slow spider performance or database bottlenecks

**Symptoms:**

- Very slow item processing
- Long response times
- Database connection timeouts

**Solutions:**

1. **Optimize database connections:**

   .. code-block:: python

      # settings.py
      DB_SETTINGS = {
          'pool_size': 20,
          'max_overflow': 30,
          'pool_timeout': 30,
      }

2. **Tune Scrapy concurrency:**

   .. code-block:: python

      # Start with lower values and increase gradually
      CONCURRENT_REQUESTS = 16
      CONCURRENT_REQUESTS_PER_DOMAIN = 8
      DOWNLOAD_DELAY = 0.5

3. **Enable autothrottle:**

   .. code-block:: python

      AUTOTHROTTLE_ENABLED = True
      AUTOTHROTTLE_START_DELAY = 1
      AUTOTHROTTLE_MAX_DELAY = 10
      AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

4. **Optimize database queries:**

   .. code-block:: sql

      -- Add indexes for better performance
      -- (CREATE INDEX CONCURRENTLY cannot run inside a transaction block)
      CREATE INDEX CONCURRENTLY idx_job_items_job_id ON job_items(job_id);
      CREATE INDEX CONCURRENTLY idx_job_items_created_at ON job_items(created_at);

Configuration Problems
----------------------

Pipeline Order Issues
~~~~~~~~~~~~~~~~~~~~~

**Error**: Items not being processed correctly or pipelines not running

**Solutions:**

1. **Check pipeline order:**

   .. code-block:: python

      # Correct order (lower numbers run first)
      ITEM_PIPELINES = {
          'myproject.pipelines.ValidationPipeline': 200,
          'scrapy_item_ingest.DbInsertPipeline': 300,
          'myproject.pipelines.NotificationPipeline': 400,
      }

2. **Ensure pipeline returns items:**

   .. code-block:: python

      def process_item(self, item, spider):
          # Process the item
          processed_item = self.do_processing(item)

          # MUST return the item for next pipeline
          return processed_item

Job ID Configuration Issues
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Problem**: Items not grouped correctly or job_id is null

**Solutions:**

1. **Explicitly set JOB_ID:**

   .. code-block:: python

      # settings.py
      JOB_ID = 'my_specific_job_001'

2. **Check spider attribute:**

   .. code-block:: python

      # In your spider
      import scrapy

      class MySpider(scrapy.Spider):
          name = 'my_spider'

          def __init__(self, job_id=None, *args, **kwargs):
              super().__init__(*args, **kwargs)
              if job_id:
                  self.job_id = job_id

3. **Verify job_id is being set:**

   .. code-block:: python

      # Add logging to check job_id
      def open_spider(self, spider):
          job_id = getattr(spider, 'job_id', spider.name)
          spider.logger.info(f"Using job_id: {job_id}")

Data Quality Issues
-------------------

JSON Serialization Errors
~~~~~~~~~~~~~~~~~~~~~~~~~

**Error**: ``TypeError: Object of type X is not JSON serializable``

**Solutions:**

1. **Use proper field types:**

   .. code-block:: python

      # Convert datetime objects
      from datetime import datetime
      item['scraped_at'] = datetime.now().isoformat()

      # Convert Decimal to float
      from decimal import Decimal
      price = Decimal('29.99')
      item['price'] = float(price)

2. **Custom serialization:**

   .. code-block:: python

      import json
      from datetime import datetime, date
      from decimal import Decimal

      class CustomJSONEncoder(json.JSONEncoder):
          def default(self, obj):
              if isinstance(obj, (datetime, date)):
                  return obj.isoformat()
              elif isinstance(obj, Decimal):
                  return float(obj)
              return super().default(obj)

3. **Clean data before yielding:**

   .. code-block:: python

      def clean_item(self, item):
          """Clean item data for JSON serialization"""
          cleaned = {}
          for key, value in item.items():
              if isinstance(value, (str, int, float, bool, list, dict)):
                  cleaned[key] = value
              elif value is None:
                  cleaned[key] = None
              else:
                  # Fall back to the string form for any other type
                  cleaned[key] = str(value)
          return cleaned

JSONB Query Issues
~~~~~~~~~~~~~~~~~~

**Problem**: Can't query JSONB fields effectively

**Solutions:**

1. **Use proper JSONB operators:**

   .. code-block:: sql

      -- Extract text values
      SELECT item->>'name' as product_name FROM job_items;

      -- Extract numeric values
      SELECT (item->>'price')::FLOAT as price FROM job_items;

      -- Check for key existence
      SELECT * FROM job_items WHERE item ? 'price';

      -- Query nested objects
      SELECT * FROM job_items WHERE item->'metadata'->>'source' = 'website';

2. **Create functional indexes:**

   .. code-block:: sql

      -- Index for frequently queried fields
      CREATE INDEX idx_items_name ON job_items ((item->>'name'));
      CREATE INDEX idx_items_price ON job_items (((item->>'price')::FLOAT));

Monitoring and Debugging
------------------------

Enable Debug Logging
~~~~~~~~~~~~~~~~~~~~

1. **Enable detailed logging:**

   .. code-block:: python

      # settings.py
      import logging.config

      LOG_LEVEL = 'DEBUG'

      # Custom logging configuration
      LOGGING = {
          'version': 1,
          'disable_existing_loggers': False,
          'formatters': {
              'verbose': {
                  'format': '{levelname} {asctime} {module} {process:d} {thread:d} {message}',
                  'style': '{',
              },
          },
          'handlers': {
              'file': {
                  'level': 'DEBUG',
                  'class': 'logging.FileHandler',
                  'filename': 'scrapy_debug.log',
                  'formatter': 'verbose',
              },
          },
          'loggers': {
              'scrapy_item_ingest': {
                  'handlers': ['file'],
                  'level': 'DEBUG',
                  'propagate': True,
              },
          },
      }

      # Scrapy has no LOGGING setting of its own, so apply the dict explicitly
      logging.config.dictConfig(LOGGING)

2. **Add debug information to pipelines:**

   .. code-block:: python

      def process_item(self, item, spider):
          spider.logger.debug(f"Processing item: {item}")
          spider.logger.debug(f"Using job_id: {getattr(spider, 'job_id', 'unknown')}")

          # Process item
          return item

Database Connection Debugging
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. **Test database connectivity:**

   .. code-block:: python

      # test_db.py
      import psycopg2
      import sys

      def test_database_connection(db_url):
          try:
              conn = psycopg2.connect(db_url)
              cursor = conn.cursor()

              # Test basic operations
              cursor.execute("SELECT version();")
              version = cursor.fetchone()[0]
              print(f"✅ Connected to: {version}")

              # Test table access
              cursor.execute("SELECT COUNT(*) FROM job_items;")
              count = cursor.fetchone()[0]
              print(f"✅ Items in database: {count}")

              conn.close()
              return True

          except Exception as e:
              print(f"❌ Database test failed: {e}")
              return False

      if __name__ == "__main__":
          db_url = sys.argv[1] if len(sys.argv) > 1 else "postgresql://user:pass@localhost/db"
          test_database_connection(db_url)

2. **Monitor database connections:**

   .. code-block:: sql

      -- Check active connections
      SELECT pid, usename, application_name, client_addr, state, query_start, query
      FROM pg_stat_activity
      WHERE datname = 'your_database_name';

Performance Profiling
~~~~~~~~~~~~~~~~~~~~~

1. **Profile spider performance:**

   .. code-block:: python

      # Add profiling to spider
      import cProfile
      import pstats

      import scrapy
      from scrapy import signals

      class ProfilingSpider(scrapy.Spider):
          name = 'profiling_spider'

          def __init__(self, *args, **kwargs):
              super().__init__(*args, **kwargs)
              self.profiler = cProfile.Profile()

          @classmethod
          def from_crawler(cls, crawler, *args, **kwargs):
              spider = super().from_crawler(crawler, *args, **kwargs)
              # The handlers below only run once they are connected to signals
              crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
              crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
              return spider

          def spider_opened(self, spider):
              self.profiler.enable()

          def spider_closed(self, spider):
              self.profiler.disable()
              stats = pstats.Stats(self.profiler)
              stats.sort_stats('cumulative')
              stats.print_stats(20)  # Top 20 functions

2. **Monitor memory usage:**

   .. code-block:: python

      import os

      import psutil
      from scrapy import signals

      class MemoryMonitoringExtension:
          @classmethod
          def from_crawler(cls, crawler):
              # Register the handler; also list this class in EXTENSIONS
              ext = cls()
              crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
              return ext

          def item_scraped(self, item, response, spider):
              process = psutil.Process(os.getpid())
              memory_mb = process.memory_info().rss / 1024 / 1024
              spider.logger.info(f"Memory usage: {memory_mb:.2f} MB")

Common Error Messages and Solutions
-----------------------------------

``ImportError: No module named 'scrapy_item_ingest'``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Solution:**

.. code-block:: bash

   pip install scrapy-item-ingest

   # Or if developing:
   pip install -e .

``AttributeError: 'Spider' object has no attribute 'job_id'``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Solution:**

.. code-block:: python

   # Ensure job_id is set in settings or spider
   JOB_ID = 'your_job_id'

   # Or handle missing job_id gracefully:
   job_id = getattr(spider, 'job_id', spider.name)

``psycopg2.errors.UndefinedTable: relation "job_items" does not exist``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Solution:**

.. code-block:: python

   # Enable table creation
   CREATE_TABLES = True

Getting Help
------------

When reporting issues, please include:

1. **Environment information:**

   - Python version
   - Scrapy version
   - PostgreSQL version
   - Operating system

2. **Configuration:**

   - Relevant settings.py content
   - Pipeline configuration
   - Database connection string (without credentials)

3. **Error logs:**

   - Complete error traceback
   - Relevant log messages
   - Spider output

4. **Minimal reproduction case:**

   - Simple spider that reproduces the issue
   - Sample data if relevant

**Support Channels:**

- GitHub Issues: https://github.com/fawadss1/scrapy_item_ingest/issues
- Documentation: This documentation site

Next Steps
----------

* :doc:`../api/pipelines` - Detailed API reference
* :doc:`../development/contributing` - Contributing guidelines
* :doc:`../examples/basic-setup` - Basic setup examples
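
The environment details and the credential-free connection string requested under Getting Help can be gathered with a short script. The following is a minimal sketch, not part of Scrapy Item Ingest: the helper names ``package_version``, ``redact_db_url``, and ``collect_report`` are hypothetical, and the sketch assumes the connection string is a URL of the form ``postgresql://user:pass@host:port/db``. The PostgreSQL version is not collected here; it can be read with ``SELECT version();`` as in the connectivity test above.

```python
import platform
import sys
from importlib import metadata
from urllib.parse import urlsplit, urlunsplit


def package_version(name):
    """Return the installed version of a distribution, or a placeholder."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def redact_db_url(db_url):
    """Replace the user:password part of a connection URL with '***'."""
    parts = urlsplit(db_url)
    if parts.username is None and parts.password is None:
        return db_url  # nothing to hide
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=f"***@{host}"))


def collect_report(db_url):
    """Collect the details that are useful in a bug report (illustrative helper)."""
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "scrapy": package_version("Scrapy"),
        "scrapy-item-ingest": package_version("scrapy-item-ingest"),
        # psycopg2 may be installed under either distribution name
        "psycopg2": package_version("psycopg2"),
        "psycopg2-binary": package_version("psycopg2-binary"),
        "db_url": redact_db_url(db_url),
    }


if __name__ == "__main__":
    url = sys.argv[1] if len(sys.argv) > 1 else "postgresql://user:pass@localhost:5432/db"
    for key, value in collect_report(url).items():
        print(f"{key}: {value}")
```

Passing the connection string as the first argument prints one ``key: value`` line per entry, which can be pasted into an issue. The redaction keeps only the host and port, so bracketed IPv6 hosts would need extra handling.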