Utilities API Reference ====================== Minimal, auto-generated docs for utility modules. Serialization ------------- .. automodule:: scrapy_item_ingest.utils.serialization :members: Fingerprint ----------- .. automodule:: scrapy_item_ingest.utils.fingerprint :members: Time helpers ------------ .. automodule:: scrapy_item_ingest.utils.time :members: Notes ----- - These helpers are used by pipelines/extensions; they are safe to import in user code. - See `quickstart` and `examples` for practical usage. 'title': 200, 'description': 1000 } } validation_result = validate_item_structure(item, schema) Custom Serialization Classes ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: ItemSerializer :members: :undoc-members: Advanced serialization class with customizable behavior. **Methods:** .. method:: __init__(options=None) Initialize serializer with custom options. :param options: Serialization options :type options: dict **Options:** .. code-block:: python options = { 'include_none': False, # Include None values 'include_empty': False, # Include empty strings/lists 'datetime_format': 'iso', # 'iso', 'timestamp', or format string 'decimal_precision': 2, # Decimal places for numbers 'max_string_length': 1000, # Truncate long strings 'normalize_unicode': True, # Normalize Unicode strings } .. method:: serialize(item) Serialize item with configured options. :param item: Item to serialize :returns: Serialized item :rtype: dict .. method:: add_custom_handler(type_class, handler_func) Add custom serialization handler for specific types. :param type_class: Type to handle :type type_class: type :param handler_func: Handler function :type handler_func: callable **Example:** .. code-block:: python from dataclasses import dataclass @dataclass class CustomObject: value: str serializer = ItemSerializer() serializer.add_custom_handler( CustomObject, lambda obj: {'custom_value': obj.value} ) Request Fingerprinting --------------------- .. automodule:: scrapy_item_ingest.utils.fingerprint :members: :undoc-members: Fingerprint Functions ~~~~~~~~~~~~~~~~~~~ .. autofunction:: request_fingerprint Generate unique fingerprint for Scrapy requests. :param request: Scrapy request object :type request: scrapy.Request :param include_headers: Include headers in fingerprint :type include_headers: bool :returns: Unique request fingerprint :rtype: str **Algorithm:** * Uses SHA1 hash of normalized request data * Includes URL, method, and optionally headers * Excludes dynamic headers like User-Agent * Handles query parameter ordering **Example:** .. code-block:: python import scrapy from scrapy_item_ingest.utils.fingerprint import request_fingerprint request1 = scrapy.Request('https://example.com/page?a=1&b=2') request2 = scrapy.Request('https://example.com/page?b=2&a=1') fp1 = request_fingerprint(request1) fp2 = request_fingerprint(request2) # fp1 == fp2 (same fingerprint for equivalent requests) .. autofunction:: normalize_url Normalize URL for consistent fingerprinting. :param url: URL to normalize :type url: str :returns: Normalized URL :rtype: str **Normalization Steps:** * Converts to lowercase * Sorts query parameters * Removes fragment identifiers * Handles URL encoding consistently * Removes default ports .. autofunction:: url_fingerprint Generate fingerprint for URL only (without method/headers). :param url: URL to fingerprint :type url: str :returns: URL fingerprint :rtype: str Database Utilities ----------------- .. automodule:: scrapy_item_ingest.utils.database :members: :undoc-members: Connection Utilities ~~~~~~~~~~~~~~~~~~ .. autofunction:: test_connection Test database connection and return status. :param db_url: Database connection string :type db_url: str :returns: Connection test results :rtype: dict **Example:** .. code-block:: python from scrapy_item_ingest.utils.database import test_connection result = test_connection('postgresql://user:pass@localhost:5432/db') # Result: {'connected': True, 'version': '15.3', 'latency': 0.023} .. autofunction:: get_table_info Get information about database tables. :param connection: Database connection :type connection: psycopg2.connection :param table_name: Name of table to inspect :type table_name: str :returns: Table information :rtype: dict .. autofunction:: execute_with_retry Execute database operation with automatic retry logic. :param connection: Database connection :param query: SQL query to execute :param params: Query parameters :param max_retries: Maximum retry attempts :returns: Query results :raises: DatabaseError if all retries fail Query Builders ~~~~~~~~~~~~~ .. autoclass:: QueryBuilder :members: :undoc-members: Helper class for building database queries dynamically. **Example:** .. code-block:: python from scrapy_item_ingest.utils.database import QueryBuilder builder = QueryBuilder('job_items') query = (builder .select(['item', 'created_at']) .where('job_id = %s') .where('created_at > %s') .order_by('created_at DESC') .limit(100) .build()) .. autoclass:: BatchInserter :members: :undoc-members: Optimized batch insertion utility for high-performance scenarios. **Example:** .. code-block:: python from scrapy_item_ingest.utils.database import BatchInserter inserter = BatchInserter(connection, 'job_items', batch_size=1000) for item in items: inserter.add_item({ 'item': json.dumps(item), 'job_id': job_id, 'created_at': datetime.now() }) inserter.flush() # Insert remaining items Data Processing -------------- .. automodule:: scrapy_item_ingest.utils.processing :members: :undoc-members: Data Cleaning ~~~~~~~~~~~~ .. autofunction:: clean_text Clean and normalize text data. :param text: Text to clean :type text: str :param options: Cleaning options :type options: dict :returns: Cleaned text :rtype: str **Cleaning Options:** .. code-block:: python options = { 'strip_whitespace': True, # Remove leading/trailing whitespace 'normalize_spaces': True, # Normalize multiple spaces to single 'remove_empty_lines': True, # Remove empty lines 'normalize_unicode': True, # Normalize Unicode characters 'remove_html': False, # Strip HTML tags 'max_length': None, # Maximum text length } .. autofunction:: extract_numbers Extract numeric values from text. :param text: Text containing numbers :type text: str :param number_type: Type of numbers to extract :type number_type: str :returns: Extracted numbers :rtype: list **Example:** .. code-block:: python price_text = "Regular price: $29.99 (was $39.99)" numbers = extract_numbers(price_text, 'float') # Result: [29.99, 39.99] .. autofunction:: normalize_currency Normalize currency values to standard format. :param currency_text: Text containing currency :type currency_text: str :param target_currency: Target currency code :type target_currency: str :returns: Normalized currency value :rtype: dict Data Validation ~~~~~~~~~~~~~~ .. autoclass:: DataValidator :members: :undoc-members: Comprehensive data validation utility. **Example:** .. code-block:: python from scrapy_item_ingest.utils.processing import DataValidator validator = DataValidator() # Add validation rules validator.add_rule('email', r'^[^@]+@[^@]+\.[^@]+$') validator.add_rule('url', validator.is_valid_url) validator.add_rule('phone', validator.is_valid_phone) # Validate data data = { 'email': 'user@example.com', 'website': 'https://example.com', 'phone': '+1-555-123-4567' } is_valid, errors = validator.validate(data) Caching Utilities ---------------- .. automodule:: scrapy_item_ingest.utils.cache :members: :undoc-members: Redis Cache ~~~~~~~~~~ .. autoclass:: RedisCache :members: :undoc-members: Redis-based caching utility for scraped data. **Example:** .. code-block:: python from scrapy_item_ingest.utils.cache import RedisCache cache = RedisCache('redis://localhost:6379/0') # Cache item cache.set('item:123', item_data, ttl=3600) # Retrieve item cached_item = cache.get('item:123') # Check if exists if cache.exists('item:123'): print("Item found in cache") Memory Cache ~~~~~~~~~~~ .. autoclass:: MemoryCache :members: :undoc-members: In-memory caching for temporary data storage. **Example:** .. code-block:: python from scrapy_item_ingest.utils.cache import MemoryCache cache = MemoryCache(max_size=1000) cache.set('key', 'value', ttl=300) value = cache.get('key') Monitoring Utilities ------------------- .. automodule:: scrapy_item_ingest.utils.monitoring :members: :undoc-members: Performance Monitoring ~~~~~~~~~~~~~~~~~~~~~ .. autoclass:: PerformanceMonitor :members: :undoc-members: Monitor spider performance metrics. **Example:** .. code-block:: python from scrapy_item_ingest.utils.monitoring import PerformanceMonitor monitor = PerformanceMonitor() # Start monitoring monitor.start_timer('page_processing') # ... process page ... # Stop and record elapsed = monitor.stop_timer('page_processing') monitor.record_metric('items_processed', 1) # Get statistics stats = monitor.get_statistics() Health Checks ~~~~~~~~~~~~ .. autofunction:: check_database_health Perform comprehensive database health check. :param db_url: Database connection string :returns: Health check results :rtype: dict .. autofunction:: check_redis_health Check Redis connection and performance. :param redis_url: Redis connection string :returns: Health check results :rtype: dict Configuration Helpers --------------------- .. automodule:: scrapy_item_ingest.utils.config :members: :undoc-members: Environment Utilities ~~~~~~~~~~~~~~~~~~~~ .. autofunction:: load_env_file Load environment variables from .env file. :param file_path: Path to .env file :type file_path: str :returns: Loaded environment variables :rtype: dict .. autofunction:: get_env_with_fallback Get environment variable with multiple fallback options. :param keys: List of environment variable names to try :type keys: list :param default: Default value if none found :returns: Environment variable value :rtype: str **Example:** .. code-block:: python # Try multiple environment variable names db_url = get_env_with_fallback([ 'DATABASE_URL', 'SCRAPY_DB_URL', 'DB_CONNECTION_STRING' ], default='postgresql://localhost:5432/scrapy') Testing Utilities ---------------- .. automodule:: scrapy_item_ingest.utils.testing :members: :undoc-members: Test Data Generation ~~~~~~~~~~~~~~~~~~ .. autofunction:: generate_test_items Generate test items for pipeline testing. :param count: Number of items to generate :type count: int :param item_type: Type of items to generate :type item_type: str :returns: List of test items :rtype: list .. autofunction:: create_test_spider Create mock spider for testing. :param name: Spider name :type name: str :param settings: Spider settings :type settings: dict :returns: Mock spider instance :rtype: Mock Database Testing ~~~~~~~~~~~~~~~ .. autoclass:: TestDatabaseManager :members: :undoc-members: Manage test databases for unit testing. **Example:** .. code-block:: python from scrapy_item_ingest.utils.testing import TestDatabaseManager # In test setup db_manager = TestDatabaseManager() test_db_url = db_manager.create_test_database() # Run tests with test database # In test teardown db_manager.cleanup_test_database() See Also -------- * :doc:`pipelines` - Pipeline API reference * :doc:`extensions` - Extensions API reference * :doc:`../user-guide/advanced-usage` - Advanced usage patterns