Configuration API Reference
Minimal, auto-generated docs for configuration.
Settings
- class scrapy_item_ingest.config.settings.Settings(crawler_settings)[source]
Bases: object
Handles settings configuration for crawlers, providing access to default values, database table names, and other operational parameters defined in crawler settings.
This class facilitates the standardized management and retrieval of settings that are essential for database operations and crawler configurations. Its purpose is to provide default fallbacks and dynamically adapt to user-specified settings.
- DEFAULT_ITEMS_TABLE = 'job_items'
- DEFAULT_REQUESTS_TABLE = 'job_requests'
- DEFAULT_LOGS_TABLE = 'job_logs'
- DEFAULT_TIMEZONE = 'Asia/Karachi'
- property db_url
Provides access to the database URL from the crawler settings.
This property is used to retrieve the database URL defined in the crawler’s settings. It is helpful when a database configuration needs to be accessed dynamically.
- Returns:
The database URL as defined in the crawler’s configuration
- Return type:
str or None
- property db_type
Retrieves the database type from the crawler settings.
This property fetches the value assigned to the key DB_TYPE within the crawler_settings. Defaults to ‘postgres’ if the key is not set.
- Returns:
The database type as a string.
- Return type:
str
- property db_items_table
Returns the static table name for items.
- property db_requests_table
This property fetches the name of the database table used to store request information. It retrieves the value from crawler settings if defined; otherwise, it defaults to the value of DEFAULT_REQUESTS_TABLE.
- Returns:
Name of the database table for storing requests.
- Return type:
str
- property db_logs_table
Retrieve the name of the database logs table.
This property fetches the value of the database logs table name provided in the crawler settings. If the value is not explicitly defined in the settings, it falls back to the default logs table.
- Returns:
The name of the database logs table.
- Return type:
str
- property create_tables
Retrieve the setting for creating database tables from crawler settings.
This property fetches the value of the ‘CREATE_TABLES’ option from the crawler settings. If the option is not specified in the settings, it defaults to True.
- Returns:
Boolean value indicating whether to create tables.
- Return type:
bool
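The fallback behaviour described for the properties above can be illustrated with a minimal stand-in class. This is a sketch, not the library's implementation: `SettingsSketch` is a hypothetical name, a plain dict stands in for the crawler settings object, and the `REQUESTS_TABLE` key is an assumption, since this page does not name the setting that overrides the requests table.

```python
class SettingsSketch:
    """Hypothetical stand-in that mirrors the documented fallbacks."""

    DEFAULT_ITEMS_TABLE = "job_items"
    DEFAULT_REQUESTS_TABLE = "job_requests"

    def __init__(self, crawler_settings):
        # Any mapping-like object exposing .get() works, including a dict.
        self.crawler_settings = crawler_settings

    @property
    def db_url(self):
        # No default: returns None when DB_URL is not configured.
        return self.crawler_settings.get("DB_URL")

    @property
    def db_type(self):
        # Documented default is 'postgres'.
        return self.crawler_settings.get("DB_TYPE", "postgres")

    @property
    def db_items_table(self):
        # Documented as static, so crawler settings are not consulted.
        return self.DEFAULT_ITEMS_TABLE

    @property
    def db_requests_table(self):
        # "REQUESTS_TABLE" is an assumed key name, used for illustration only.
        return self.crawler_settings.get(
            "REQUESTS_TABLE", self.DEFAULT_REQUESTS_TABLE
        )

    @property
    def create_tables(self):
        # Documented default is True.
        return self.crawler_settings.get("CREATE_TABLES", True)


settings = SettingsSketch(
    {"DB_URL": "postgresql://localhost:5432/scrapy", "CREATE_TABLES": False}
)
print(settings.db_type, settings.db_requests_table, settings.create_tables)
```

Unset keys fall back to the class-level defaults, while explicitly set values (here `CREATE_TABLES`) win.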
Utilities
Notes
Configure DB via DB_URL or discrete fields (DB_HOST, DB_PORT, DB_USER, DB_PASSWORD, DB_NAME).
Default tables: job_items, job_requests, job_logs.
See configuration and quickstart for practical examples.
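As an illustration of the two configuration styles named in the notes above, the sketch below resolves a connection URL from either DB_URL or the discrete fields. The helper names, the precedence of DB_URL over discrete fields, and the `localhost`/`5432` fallbacks are assumptions made for the example, not documented behaviour.

```python
from urllib.parse import quote


def build_db_url(host, port, user, password, name, scheme="postgresql"):
    # Percent-encode credentials so characters such as "@" or ":" cannot
    # be mistaken for URL delimiters.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{credentials}@{host}:{port}/{name}"


def resolve_db_url(settings):
    # Assumption: a full DB_URL takes precedence over discrete fields.
    url = settings.get("DB_URL")
    if url:
        return url
    return build_db_url(
        host=settings.get("DB_HOST", "localhost"),  # assumed fallback
        port=settings.get("DB_PORT", 5432),         # assumed fallback
        user=settings["DB_USER"],
        password=settings["DB_PASSWORD"],
        name=settings["DB_NAME"],
    )


discrete = {"DB_USER": "scrapy", "DB_PASSWORD": "p@ss", "DB_NAME": "scrapy_data"}
print(resolve_db_url(discrete))
```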
- type:
dict
- default:
{'size': 100, 'timeout': 30}
Configuration:
BATCH_SETTINGS = {
    'size': 1000,       # Items per batch
    'timeout': 60,      # Batch timeout in seconds
    'max_memory': 512,  # Max memory per batch (MB)
}
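One plausible way to combine a user-supplied BATCH_SETTINGS dict with defaults is a key-by-key overlay. The sketch below assumes the dict default listed above applies to BATCH_SETTINGS; `resolve_batch_settings` is a hypothetical helper, not part of the package.

```python
# Assumed default, taken from the "default" entry shown above.
DEFAULT_BATCH_SETTINGS = {"size": 100, "timeout": 30}


def resolve_batch_settings(user_settings=None):
    # Copy the defaults so the module-level dict is never mutated.
    merged = dict(DEFAULT_BATCH_SETTINGS)
    # Keys supplied by the user override defaults; extra keys are kept.
    merged.update(user_settings or {})
    return merged


print(resolve_batch_settings({"size": 1000, "max_memory": 512}))
```

With this approach a project only needs to list the keys it wants to change.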
Methods:
- scrapy_item_ingest.config.settings.validate()
Validate all configuration settings.
- Returns:
List of validation errors
- Return type:
list
- Raises:
ConfigurationError if critical settings are invalid
Example:
settings = Settings()
errors = settings.validate()
if errors:
    raise ConfigurationError(f"Configuration errors: {errors}")
- scrapy_item_ingest.config.settings.from_crawler_settings(crawler_settings)
Create Settings instance from Scrapy crawler settings.
- Parameters:
crawler_settings (scrapy.settings.Settings) – Scrapy settings object
- Returns:
Configured Settings instance
- Return type:
Settings
- scrapy_item_ingest.config.settings.get_database_url()
Get database URL with environment variable substitution.
- Returns:
Processed database URL
- Return type:
str
- Raises:
ConfigurationError if URL is invalid
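Environment variable substitution in a URL can be sketched with the standard library alone. The example assumes `${VAR}` placeholder syntax, which this page does not specify; `substitute_env` is a hypothetical helper and the `ConfigurationError` class here is a local stand-in for the exception named above.

```python
import os
from string import Template
from urllib.parse import urlparse


class ConfigurationError(Exception):
    """Local stand-in for the exception referenced in this page."""


def substitute_env(raw_url, env=None):
    # Default to the process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    try:
        url = Template(raw_url).substitute(env)
    except KeyError as exc:
        # A placeholder named a variable that is not defined.
        raise ConfigurationError(
            f"Undefined environment variable: {exc.args[0]}"
        ) from exc
    except ValueError as exc:
        # A stray "$" that does not form a valid placeholder.
        raise ConfigurationError(f"Malformed placeholder: {exc}") from exc
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.hostname:
        raise ConfigurationError("Database URL is missing a scheme or host")
    return url


raw = "postgresql://${DB_USER}:${DB_PASSWORD}@db.internal:5432/scrapy"
print(substitute_env(raw, {"DB_USER": "scrapy", "DB_PASSWORD": "secret"}))
```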
Validation Functions
Settings Validation
- scrapy_item_ingest.config.settings.validate_settings(settings)[source]
Validate configuration settings
Comprehensive validation of pipeline settings.
- Parameters:
settings (Settings or scrapy.settings.Settings) – Settings object or Scrapy settings
- Returns:
List of validation errors
- Return type:
list
Validation Checks:
Database URL format and connectivity
Required settings presence
Type validation for all settings
Range validation for numeric settings
Table name validity
Example Usage:
from scrapy_item_ingest.config import validate_settings

def check_configuration(settings):
    errors = validate_settings(settings)
    if errors:
        print("Configuration errors found:")
        for error in errors:
            print(f" - {error}")
        return False
    return True
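To make the validation checks listed above more concrete, here is a minimal sketch of a validator that collects errors in a list. It is not the library's `validate_settings`: the function name, the accepted URL schemes, and the `REQUESTS_TABLE`/`LOGS_TABLE` keys are assumptions, and connectivity and range checks are left out.

```python
import re
from urllib.parse import urlparse

# Conservative identifier pattern: a letter or underscore, then word characters.
TABLE_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

# Assumed setting keys and their documented default table names.
TABLE_KEYS = {"REQUESTS_TABLE": "job_requests", "LOGS_TABLE": "job_logs"}


def validate_settings_sketch(settings):
    errors = []

    # Required setting presence and URL format.
    db_url = settings.get("DB_URL")
    if not db_url:
        errors.append("DB_URL is required")
    else:
        parsed = urlparse(db_url)
        if parsed.scheme not in {"postgresql", "postgres"} or not parsed.hostname:
            errors.append(f"Invalid database URL: {db_url!r}")

    # Type validation.
    if not isinstance(settings.get("CREATE_TABLES", True), bool):
        errors.append("CREATE_TABLES must be a bool")

    # Table name validity.
    for key, default in TABLE_KEYS.items():
        name = settings.get(key, default)
        if not isinstance(name, str) or not TABLE_NAME_RE.match(name):
            errors.append(f"{key} is not a valid table name: {name!r}")

    return errors


print(validate_settings_sketch({"DB_URL": "invalid://url", "CREATE_TABLES": "yes"}))
```

Returning a list rather than raising on the first problem lets callers report every issue in one pass.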
Configuration Loaders
Environment Configuration
File Configuration
Configuration Examples
Basic Configuration
# settings.py - Basic configuration
from scrapy_item_ingest.config import Settings
# Create settings instance
config = Settings()
config.DB_URL = 'postgresql://scrapy:password@localhost:5432/scrapy_data'
config.CREATE_TABLES = True
config.JOB_ID = 'basic_job_001'
# Validate configuration
errors = config.validate()
if errors:
    raise ValueError(f"Configuration errors: {errors}")
# Apply to Scrapy settings
DB_URL = config.DB_URL
CREATE_TABLES = config.CREATE_TABLES
JOB_ID = config.JOB_ID
Environment-Based Configuration
# settings.py - Environment-based configuration
import time
from scrapy_item_ingest.config import EnvironmentConfigLoader
# Load from environment
env_loader = EnvironmentConfigLoader()
env_config = env_loader.load()
# Apply environment configuration
DB_URL = env_config.get('db_url', 'postgresql://localhost:5432/scrapy')
CREATE_TABLES = env_config.get('create_tables', True)
JOB_ID = env_config.get('job_id', f'job_{int(time.time())}')
# Advanced settings from environment
if 'db_pool_size' in env_config:
    DB_SETTINGS = {
        'pool_size': env_config['db_pool_size'],
        'max_overflow': env_config.get('db_max_overflow', 30),
    }
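Since this page does not document what `EnvironmentConfigLoader.load()` returns, the following sketch shows one way such a loader could map environment variables to the lower-case keys used above. The function name, the variable names (`DB_URL`, `CREATE_TABLES`, `JOB_ID`, `DB_POOL_SIZE`), and the truthy strings are all assumptions.

```python
import os

# Strings treated as True when parsing boolean variables (assumed set).
TRUE_VALUES = {"1", "true", "yes", "on"}


def load_env_config(env=None):
    # Default to the process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    config = {}
    if "DB_URL" in env:
        config["db_url"] = env["DB_URL"]
    if "CREATE_TABLES" in env:
        # Environment values are always strings, so coerce explicitly.
        config["create_tables"] = env["CREATE_TABLES"].strip().lower() in TRUE_VALUES
    if "JOB_ID" in env:
        config["job_id"] = env["JOB_ID"]
    if "DB_POOL_SIZE" in env:
        config["db_pool_size"] = int(env["DB_POOL_SIZE"])
    return config


print(load_env_config({"CREATE_TABLES": "false", "DB_POOL_SIZE": "20"}))
```

Only variables that are actually set appear in the result, so `env_config.get(key, default)` keeps working for everything else.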
Multi-Environment Configuration
# settings.py - Multi-environment setup
import os
from scrapy_item_ingest.config import Settings, FileConfigLoader, EnvironmentConfigLoader

class ConfigurationManager:
    def __init__(self):
        self.environment = os.getenv('SCRAPY_ENV', 'development')
        self.base_config = self._load_base_config()
        self.env_config = self._load_environment_config()

    def _load_base_config(self):
        loader = FileConfigLoader()
        return loader.load_yaml('config/base.yaml')

    def _load_environment_config(self):
        loader = FileConfigLoader()
        config_file = f'config/{self.environment}.yaml'
        if os.path.exists(config_file):
            return loader.load_yaml(config_file)
        return {}

    def get_merged_config(self):
        # Merge base and environment configs (shallow: a top-level key in
        # env_config replaces the whole base value for that key)
        config = {**self.base_config, **self.env_config}
        # Override with environment variables
        env_loader = EnvironmentConfigLoader()
        env_overrides = env_loader.load()
        config.update(env_overrides)
        return config

# Initialize configuration
config_manager = ConfigurationManager()
merged_config = config_manager.get_merged_config()

# Apply to Scrapy settings
DB_URL = merged_config['database']['url']
CREATE_TABLES = merged_config['database']['create_tables']
JOB_ID = merged_config['job']['id']
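The example above merges nested dictionaries with `{**base, **env}`, which replaces a nested section such as `database` wholesale when the environment file defines only part of it. A recursive merge avoids losing the base keys. The sketch below uses a hypothetical `deep_merge` helper and made-up config values.

```python
def deep_merge(base, override):
    # Start from a shallow copy so neither input is mutated.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, or mismatched types: the override wins.
            merged[key] = value
    return merged


base = {
    "database": {"url": "postgresql://localhost:5432/scrapy", "create_tables": True},
    "job": {"id": "base_job"},
}
production = {"database": {"create_tables": False}}

shallow = {**base, **production}       # "url" is lost from "database"
merged = deep_merge(base, production)  # "url" is preserved

print("url" in shallow["database"], "url" in merged["database"])
```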
Dynamic Configuration
# settings.py - Dynamic configuration
import os
import socket
from datetime import datetime

class DynamicConfiguration:
    @staticmethod
    def generate_job_id(spider_name, environment='dev'):
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        hostname = socket.gethostname()
        return f'{environment}_{spider_name}_{hostname}_{timestamp}'

    @staticmethod
    def get_database_url(environment='development'):
        db_configs = {
            'development': 'postgresql://dev:dev@localhost:5432/scrapy_dev',
            'staging': os.getenv('STAGING_DB_URL'),
            'production': os.getenv('PRODUCTION_DB_URL'),
        }
        # Returns None for an unknown environment or an unset variable
        return db_configs.get(environment)

    @staticmethod
    def get_performance_settings(environment='development'):
        settings_map = {
            'development': {
                'concurrent_requests': 8,
                'download_delay': 1,
                'batch_size': 100,
            },
            'staging': {
                'concurrent_requests': 16,
                'download_delay': 0.5,
                'batch_size': 500,
            },
            'production': {
                'concurrent_requests': 32,
                'download_delay': 0.1,
                'batch_size': 1000,
            },
        }
        # Fall back to development settings for unknown environments
        return settings_map.get(environment, settings_map['development'])

# Apply dynamic configuration
env = os.getenv('SCRAPY_ENV', 'development')
dynamic_config = DynamicConfiguration()
DB_URL = dynamic_config.get_database_url(env)
JOB_ID = dynamic_config.generate_job_id('products', env)
perf_settings = dynamic_config.get_performance_settings(env)
CONCURRENT_REQUESTS = perf_settings['concurrent_requests']
DOWNLOAD_DELAY = perf_settings['download_delay']
Configuration Testing
Unit Testing Configuration
import unittest
from scrapy_item_ingest.config import Settings, validate_settings

class TestConfiguration(unittest.TestCase):
    def test_valid_configuration(self):
        settings = Settings()
        settings.DB_URL = 'postgresql://test:test@localhost:5432/test_db'
        settings.CREATE_TABLES = True
        settings.JOB_ID = 'test_job'
        errors = settings.validate()
        self.assertEqual(len(errors), 0)

    def test_invalid_database_url(self):
        settings = Settings()
        settings.DB_URL = 'invalid://url'
        errors = settings.validate()
        self.assertGreater(len(errors), 0)
        self.assertTrue(any('database' in error.lower() for error in errors))

    def test_missing_required_settings(self):
        settings = Settings()
        # Don't set DB_URL
        errors = settings.validate()
        self.assertGreater(len(errors), 0)
        self.assertTrue(any('DB_URL' in error for error in errors))
Integration Testing
from scrapy_item_ingest.config import validate_settings

def test_configuration_integration():
    # Test with actual Scrapy settings
    from scrapy.settings import Settings as ScrapySettings
    scrapy_settings = ScrapySettings()
    scrapy_settings.set('DB_URL', 'postgresql://test:test@localhost:5432/test')
    scrapy_settings.set('CREATE_TABLES', True)
    # Validate with our validator
    errors = validate_settings(scrapy_settings)
    assert len(errors) == 0, f"Configuration errors: {errors}"
See Also
Configuration - Configuration user guide
Advanced Configurations - Advanced configuration examples
Pipelines API Reference - Pipeline API reference