Configuration API Reference

Minimal, auto-generated docs for configuration.

Settings

class scrapy_item_ingest.config.settings.Settings(crawler_settings)[source]

Bases: object

Handles settings configuration for crawlers, providing access to default values, database table names, and other operational parameters defined in crawler settings.

This class centralizes the settings needed for database operations and crawler configuration. It returns user-specified values when they are present in the crawler settings and falls back to the class-level defaults otherwise.

DEFAULT_ITEMS_TABLE = 'job_items'

DEFAULT_REQUESTS_TABLE = 'job_requests'

DEFAULT_LOGS_TABLE = 'job_logs'

DEFAULT_TIMEZONE = 'Asia/Karachi'

__init__(crawler_settings)[source]

property db_url

Provides access to the database URL from the crawler settings.

This property is used to retrieve the database URL defined in the crawler’s settings. It is helpful when a database configuration needs to be accessed dynamically.

Returns:: The database URL as defined in the crawler’s configuration
Return type:: str or None

property db_type

Retrieves the database type from the crawler settings.

This property fetches the value assigned to the key DB_TYPE within the crawler_settings. Defaults to ‘postgres’ if the key is not set.

Returns:: The database type as a string.
Return type:: str

property db_items_table: Return static table name for items

property db_requests_table

This property fetches the name of the database table used to store request information. It retrieves the value from crawler settings if defined; otherwise, it defaults to the value of DEFAULT_REQUESTS_TABLE.

Returns:: Name of the database table for storing requests.
Return type:: str

property db_logs_table

Retrieve the name of the database logs table.

This property fetches the value of the database logs table name provided in the crawler settings. If the value is not explicitly defined in the settings, it falls back to the default logs table.

Returns:: The name of the database logs table.
Return type:: str

property create_tables

Retrieve the setting for creating database tables from crawler settings.

This property fetches the value of the ‘CREATE_TABLES’ option from the crawler settings. If the option is not specified in the settings, it defaults to True.

Returns:: Boolean value indicating whether to create tables.
Return type:: bool
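
The properties above share one pattern: read a key from the crawler settings and fall back to a default when the key is absent. The following is a minimal sketch of that pattern, not the library's implementation. SettingsSketch is a hypothetical stand-in class, a plain dict stands in for the Scrapy settings object, and the key names DB_REQUESTS_TABLE and DB_LOGS_TABLE are assumptions (only DB_TYPE and CREATE_TABLES are named in this reference).

```python
class SettingsSketch:
    """Hypothetical stand-in that mirrors the documented fallback behaviour."""

    DEFAULT_REQUESTS_TABLE = "job_requests"
    DEFAULT_LOGS_TABLE = "job_logs"

    def __init__(self, crawler_settings):
        # Any object with a dict-style .get(key, default) works here.
        self.crawler_settings = crawler_settings

    @property
    def db_type(self):
        # Documented default: 'postgres' when DB_TYPE is not set.
        return self.crawler_settings.get("DB_TYPE", "postgres")

    @property
    def db_requests_table(self):
        # Key name is an assumption made for this sketch.
        return self.crawler_settings.get(
            "DB_REQUESTS_TABLE", self.DEFAULT_REQUESTS_TABLE
        )

    @property
    def db_logs_table(self):
        # Key name is an assumption made for this sketch.
        return self.crawler_settings.get("DB_LOGS_TABLE", self.DEFAULT_LOGS_TABLE)

    @property
    def create_tables(self):
        # Documented default: True when CREATE_TABLES is not set.
        return self.crawler_settings.get("CREATE_TABLES", True)


defaults = SettingsSketch({})
custom = SettingsSketch({"DB_TYPE": "mysql", "CREATE_TABLES": False})
```

With an empty mapping every property resolves to its default; in `custom`, only the two keys that were set differ from the defaults.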

get_tz()[source]: Return the timezone string for the project. This checks for a ‘TIMEZONE’ setting in the crawler settings and falls back to the default (‘Asia/Karachi’). :returns: The timezone string (e.g., ‘Asia/Karachi’). :rtype: str

static get_identifier_column()[source]: Get the identifier column name

get_identifier_value(spider)[source]: Get the identifier value with smart fallback

Utilities

scrapy_item_ingest.config.settings.validate_settings(settings)[source]: Validate configuration settings

Notes

Configure DB via DB_URL or discrete fields (DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME).
Default tables: job_items, job_requests, job_logs.
See configuration and quickstart for practical examples.
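
The first note describes two ways to supply connection details. One way they could relate is sketched below. The helper name build_db_url, the precedence of DB_URL over the discrete fields, the postgresql scheme, and the host and port defaults are all assumptions made for illustration; the library may resolve these differently.

```python
from urllib.parse import quote


def build_db_url(settings):
    """Hypothetical helper: prefer DB_URL, else assemble a URL from discrete fields."""
    url = settings.get("DB_URL")
    if url:
        return url
    # Percent-encode credentials so characters such as '@' or '/' stay unambiguous.
    user = quote(str(settings["DB_USER"]), safe="")
    password = quote(str(settings["DB_PASSWORD"]), safe="")
    host = settings.get("DB_HOST", "localhost")  # assumed default
    port = settings.get("DB_PORT", 5432)         # assumed default
    name = settings["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


discrete = {
    "DB_HOST": "localhost",
    "DB_PORT": 5432,
    "DB_USER": "scrapy",
    "DB_PASSWORD": "p@ss/word",
    "DB_NAME": "scrapy_data",
}
url = build_db_url(discrete)
```

Encoding the credentials matters because a raw `@` or `/` in a password would otherwise be read as part of the URL structure.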
BATCH_SETTINGS

type: dict

default: {'size': 100, 'timeout': 30}

Configuration:

BATCH_SETTINGS = {
    'size': 1000,       # Items per batch
    'timeout': 60,      # Batch timeout in seconds
    'max_memory': 512,  # Max memory per batch (MB)
}

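
A partial BATCH_SETTINGS override could be layered over the documented defaults as sketched here. This is an assumption about merge behaviour, not a description of the library: it may instead replace the default dict wholesale. The helper name resolve_batch_settings and the positivity checks are hypothetical.

```python
DEFAULT_BATCH_SETTINGS = {"size": 100, "timeout": 30}


def resolve_batch_settings(user_settings=None):
    """Hypothetical helper: overlay user values on the documented defaults."""
    merged = {**DEFAULT_BATCH_SETTINGS, **(user_settings or {})}
    # Basic range checks; the limits are illustrative, not taken from the library.
    for key in ("size", "timeout"):
        if merged[key] <= 0:
            raise ValueError(f"BATCH_SETTINGS[{key!r}] must be positive")
    return merged


batch = resolve_batch_settings({"size": 1000, "max_memory": 512})
```

Here `timeout` keeps its default of 30 because the override does not mention it.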
Methods:
scrapy_item_ingest.config.settings.validate()
Validate all configuration settings.

Returns:

List of validation errors

Return type:

list[str]

Raises:

ConfigurationError if critical settings are invalid

Example:

settings = Settings()
errors = settings.validate()
if errors:
    raise ConfigurationError(f"Configuration errors: {errors}")

scrapy_item_ingest.config.settings.from_crawler_settings(crawler_settings)

Create Settings instance from Scrapy crawler settings.

Parameters:

crawler_settings (scrapy.settings.Settings) – Scrapy settings object

Returns:

Configured Settings instance

Return type:

Settings

scrapy_item_ingest.config.settings.get_database_url()

Get database URL with environment variable substitution.

Returns:

Processed database URL

Return type:

str

Raises:

ConfigurationError if URL is invalid
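
Environment variable substitution in a database URL might look like the sketch below. The `${VAR}` placeholder syntax and the helper name expand_database_url are assumptions, and ValueError stands in for ConfigurationError, whose import path is not given in this reference.

```python
import os
from string import Template
from urllib.parse import urlparse


def expand_database_url(raw_url, environ=None):
    """Hypothetical helper: fill ${VAR} placeholders, then sanity-check the URL."""
    env = os.environ if environ is None else environ
    try:
        expanded = Template(raw_url).substitute(env)
    except KeyError as exc:
        # A placeholder referenced a variable that is not defined.
        raise ValueError(f"Undefined environment variable: {exc.args[0]}") from exc
    parsed = urlparse(expanded)
    if not parsed.scheme or not parsed.hostname:
        raise ValueError(f"Invalid database URL: {expanded!r}")
    return expanded


resolved = expand_database_url(
    "postgresql://scrapy:${DB_PASSWORD}@localhost:5432/scrapy_data",
    {"DB_PASSWORD": "secret"},
)
```

Passing an explicit mapping instead of relying on `os.environ` keeps the function easy to test.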

scrapy_item_ingest.config.settings.get_table_name(table_type)

Get table name for specified table type.

Parameters:

table_type (str) – Type of table (‘items’, ‘requests’, ‘logs’)

Returns:

Table name

Return type:

str
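
A lookup by table type can be sketched with the documented default names. The function name get_table_name_sketch, the overrides parameter, and the error raised for an unknown type are assumptions; the reference does not say how unknown types are handled.

```python
DEFAULT_TABLES = {
    "items": "job_items",
    "requests": "job_requests",
    "logs": "job_logs",
}


def get_table_name_sketch(table_type, overrides=None):
    """Hypothetical lookup: return an override if given, else the default name."""
    if table_type not in DEFAULT_TABLES:
        valid = ", ".join(sorted(DEFAULT_TABLES))
        raise ValueError(f"Unknown table type {table_type!r}; expected one of: {valid}")
    return (overrides or {}).get(table_type, DEFAULT_TABLES[table_type])


logs_table = get_table_name_sketch("logs")
custom_requests = get_table_name_sketch("requests", {"requests": "crawl_requests"})
```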

Validation Functions

Settings Validation

scrapy_item_ingest.config.settings.validate_settings(settings)[source]

Validate configuration settings

Comprehensive validation of pipeline settings.

Parameters:: settings (Settings or scrapy.settings.Settings) – Settings object or Scrapy settings
Returns:: List of validation errors
Return type:: list[str]

Validation Checks:

Database URL format and connectivity
Required settings presence
Type validation for all settings
Range validation for numeric settings
Table name validity
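
Several of the checks listed above can be sketched without a live database. The version below covers required settings, types, a numeric range, URL format, and table names; connectivity is left out because it needs a reachable server. The function name validate_settings_sketch, the accepted URL schemes, and the table-name setting keys are assumptions, not the library's actual rules.

```python
import re
from urllib.parse import urlparse

# Conservative identifier pattern: a letter or underscore, then letters, digits, underscores.
TABLE_NAME_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
ACCEPTED_SCHEMES = ("postgresql", "postgres")  # assumption for this sketch
TABLE_KEYS = ("DB_REQUESTS_TABLE", "DB_LOGS_TABLE")  # assumed key names


def validate_settings_sketch(settings):
    """Return a list of human-readable error strings (empty when valid)."""
    errors = []

    db_url = settings.get("DB_URL")
    if not db_url:
        errors.append("DB_URL is required")
    elif not isinstance(db_url, str):
        errors.append("DB_URL must be a string")
    elif urlparse(db_url).scheme not in ACCEPTED_SCHEMES:
        errors.append(f"Unsupported database URL scheme: {urlparse(db_url).scheme!r}")

    if not isinstance(settings.get("CREATE_TABLES", True), bool):
        errors.append("CREATE_TABLES must be a bool")

    port = settings.get("DB_PORT")
    if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
        errors.append("DB_PORT must be an integer between 1 and 65535")

    for key in TABLE_KEYS:
        name = settings.get(key)
        if name is not None and not TABLE_NAME_PATTERN.match(str(name)):
            errors.append(f"{key} is not a valid table name: {name!r}")

    return errors


ok = validate_settings_sketch({"DB_URL": "postgresql://test:test@localhost:5432/test"})
bad = validate_settings_sketch({"DB_URL": "invalid://url", "DB_LOGS_TABLE": "logs; --"})
```

Collecting errors in a list, rather than raising on the first one, lets the caller report every problem in a single pass.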

Example Usage:

from scrapy_item_ingest.config import validate_settings

def check_configuration(settings):
    errors = validate_settings(settings)
    if errors:
        print("Configuration errors found:")
        for error in errors:
            print(f"  - {error}")
        return False
    return True

Configuration Loaders

Environment Configuration

File Configuration

Configuration Examples

Basic Configuration

# settings.py - Basic configuration
from scrapy_item_ingest.config import Settings

# Create settings instance
config = Settings()
config.DB_URL = 'postgresql://scrapy:password@localhost:5432/scrapy_data'
config.CREATE_TABLES = True
config.JOB_ID = 'basic_job_001'

# Validate configuration
errors = config.validate()
if errors:
    raise ValueError(f"Configuration errors: {errors}")

# Apply to Scrapy settings
DB_URL = config.DB_URL
CREATE_TABLES = config.CREATE_TABLES
JOB_ID = config.JOB_ID

Environment-Based Configuration

# settings.py - Environment-based configuration
import os
import time
from scrapy_item_ingest.config import EnvironmentConfigLoader

# Load from environment
env_loader = EnvironmentConfigLoader()
env_config = env_loader.load()

# Apply environment configuration
DB_URL = env_config.get('db_url', 'postgresql://localhost:5432/scrapy')
CREATE_TABLES = env_config.get('create_tables', True)
JOB_ID = env_config.get('job_id', f'job_{int(time.time())}')

# Advanced settings from environment
if 'db_pool_size' in env_config:
    DB_SETTINGS = {
        'pool_size': env_config['db_pool_size'],
        'max_overflow': env_config.get('db_max_overflow', 30),
    }

Multi-Environment Configuration

# settings.py - Multi-environment setup
import os
from scrapy_item_ingest.config import Settings, FileConfigLoader, EnvironmentConfigLoader

class ConfigurationManager:
    def __init__(self):
        self.environment = os.getenv('SCRAPY_ENV', 'development')
        self.base_config = self._load_base_config()
        self.env_config = self._load_environment_config()

    def _load_base_config(self):
        loader = FileConfigLoader()
        return loader.load_yaml('config/base.yaml')

    def _load_environment_config(self):
        loader = FileConfigLoader()
        config_file = f'config/{self.environment}.yaml'
        if os.path.exists(config_file):
            return loader.load_yaml(config_file)
        return {}

    def get_merged_config(self):
        # Merge base and environment configs
        config = {**self.base_config, **self.env_config}

        # Override with environment variables
        env_loader = EnvironmentConfigLoader()
        env_overrides = env_loader.load()
        config.update(env_overrides)

        return config

# Initialize configuration
config_manager = ConfigurationManager()
merged_config = config_manager.get_merged_config()

# Apply to Scrapy settings
DB_URL = merged_config['database']['url']
CREATE_TABLES = merged_config['database']['create_tables']
JOB_ID = merged_config['job']['id']
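
One point to watch in the example above: `{**base, **env}` is a shallow merge, so an environment file that sets only `database.url` replaces the whole `database` mapping and drops `create_tables`. A recursive merge avoids this. The sketch below uses a hypothetical deep_merge helper and made-up sample data.

```python
def deep_merge(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Sample data for illustration only.
base = {
    "database": {"url": "postgresql://localhost:5432/scrapy", "create_tables": True},
    "job": {"id": "base_job"},
}
production = {"database": {"url": "postgresql://db.internal:5432/scrapy"}}

shallow = {**base, **production}       # loses database.create_tables
deep = deep_merge(base, production)    # keeps it
```

After a shallow merge, `shallow['database']['create_tables']` raises KeyError, while the deep merge keeps the base value and applies only the overridden URL.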

Dynamic Configuration

# settings.py - Dynamic configuration
import os
import socket
from datetime import datetime

class DynamicConfiguration:
    @staticmethod
    def generate_job_id(spider_name, environment='dev'):
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        hostname = socket.gethostname()
        return f'{environment}_{spider_name}_{hostname}_{timestamp}'

    @staticmethod
    def get_database_url(environment='development'):
        db_configs = {
            'development': 'postgresql://dev:dev@localhost:5432/scrapy_dev',
            'staging': os.getenv('STAGING_DB_URL'),
            'production': os.getenv('PRODUCTION_DB_URL'),
        }
        return db_configs.get(environment)

    @staticmethod
    def get_performance_settings(environment='development'):
        settings_map = {
            'development': {
                'concurrent_requests': 8,
                'download_delay': 1,
                'batch_size': 100,
            },
            'staging': {
                'concurrent_requests': 16,
                'download_delay': 0.5,
                'batch_size': 500,
            },
            'production': {
                'concurrent_requests': 32,
                'download_delay': 0.1,
                'batch_size': 1000,
            }
        }
        return settings_map.get(environment, settings_map['development'])

# Apply dynamic configuration
env = os.getenv('SCRAPY_ENV', 'development')
dynamic_config = DynamicConfiguration()

DB_URL = dynamic_config.get_database_url(env)
JOB_ID = dynamic_config.generate_job_id('products', env)

perf_settings = dynamic_config.get_performance_settings(env)
CONCURRENT_REQUESTS = perf_settings['concurrent_requests']
DOWNLOAD_DELAY = perf_settings['download_delay']
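
In the example above, get_database_url returns None when STAGING_DB_URL or PRODUCTION_DB_URL is unset, and that None would flow into DB_URL unnoticed. A fail-fast variant is sketched here; the helper name require_database_url and the choice of exception types are assumptions.

```python
import os


def require_database_url(environment, environ=None):
    """Hypothetical helper: return the URL for an environment or fail loudly."""
    env = os.environ if environ is None else environ
    urls = {
        "development": "postgresql://dev:dev@localhost:5432/scrapy_dev",
        "staging": env.get("STAGING_DB_URL"),
        "production": env.get("PRODUCTION_DB_URL"),
    }
    if environment not in urls:
        raise ValueError(f"Unknown environment: {environment!r}")
    url = urls[environment]
    if not url:
        # The environment is known, but its variable is missing or empty.
        raise RuntimeError(f"No database URL configured for {environment!r}")
    return url


dev_url = require_database_url("development", {})
```

Raising at import time of settings.py surfaces a missing variable before the crawl starts rather than at the first database write.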

Configuration Testing

Unit Testing Configuration

import unittest
from scrapy_item_ingest.config import Settings, validate_settings

class TestConfiguration(unittest.TestCase):
    def test_valid_configuration(self):
        settings = Settings()
        settings.DB_URL = 'postgresql://test:test@localhost:5432/test_db'
        settings.CREATE_TABLES = True
        settings.JOB_ID = 'test_job'

        errors = settings.validate()
        self.assertEqual(len(errors), 0)

    def test_invalid_database_url(self):
        settings = Settings()
        settings.DB_URL = 'invalid://url'

        errors = settings.validate()
        self.assertGreater(len(errors), 0)
        self.assertTrue(any('database' in error.lower() for error in errors))

    def test_missing_required_settings(self):
        settings = Settings()
        # Don't set DB_URL

        errors = settings.validate()
        self.assertGreater(len(errors), 0)
        self.assertTrue(any('DB_URL' in error for error in errors))

Integration Testing

def test_configuration_integration():
    # Test with actual Scrapy settings
    from scrapy.settings import Settings as ScrapySettings
    from scrapy_item_ingest.config import validate_settings

    scrapy_settings = ScrapySettings()
    scrapy_settings.set('DB_URL', 'postgresql://test:test@localhost:5432/test')
    scrapy_settings.set('CREATE_TABLES', True)

    # Validate with our validator
    errors = validate_settings(scrapy_settings)
    assert len(errors) == 0, f"Configuration errors: {errors}"