Master Web Scraping with Scrapy: Fundamentals
Welcome to this comprehensive tutorial series on mastering web scraping with Scrapy! In this first part, we’ll establish the foundation by understanding web scraping concepts, setting up Scrapy, and building your first production-ready spider.
Series Overview
This tutorial series will guide you through:
- Part 1: Scrapy Fundamentals (This tutorial)
- Part 2: Advanced Scraping Techniques
- Part 3: Anti-Detection and Scaling
- Part 4: Data Processing and Storage
- Part 5: Production Deployment
What You’ll Learn in This Part
- Understanding web scraping ethics and legality
- Setting up a professional Scrapy development environment
- Building your first spider with proper architecture
- Data extraction using selectors and XPath
- Handling different data types and edge cases
- Best practices for maintainable scraping code
Understanding Web Scraping
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites. It’s like having a robot that can visit web pages, read the content, and collect specific information according to your instructions.
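To make this concrete, here is a minimal, self-contained sketch of that idea using `parsel`, the selector library that ships with Scrapy. The HTML snippet and class names below are made up purely for illustration:

```python
# Minimal illustration: parse a small HTML snippet and pull out specific fields.
from parsel import Selector

html = """
<html><body>
  <div class="product"><h2>Blue Widget</h2><span class="price">$19.99</span></div>
  <div class="product"><h2>Red Widget</h2><span class="price">$24.99</span></div>
</body></html>
"""

selector = Selector(text=html)
for product in selector.css('.product'):
    name = product.css('h2::text').get()
    price = product.css('.price::text').get()
    print(name, price)
# Blue Widget $19.99
# Red Widget $24.99
```

Scrapy automates the "visit web pages" part of this loop at scale; the selector API is the same one you will use inside your spiders.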
When to Use Web Scraping
```python
# Common use cases for web scraping
use_cases = {
    "e_commerce": [
        "Price monitoring and comparison",
        "Product catalog aggregation",
        "Inventory tracking",
        "Competitor analysis",
    ],
    "real_estate": [
        "Property listings collection",
        "Market price analysis",
        "Investment opportunity identification",
    ],
    "news_media": [
        "News aggregation",
        "Sentiment analysis",
        "Content monitoring",
    ],
    "research": [
        "Academic paper collection",
        "Social media data analysis",
        "Market research",
    ],
}
```
Legal and Ethical Considerations
Before we dive into the technical aspects, let’s address the important legal and ethical considerations:
```python
# Legal scraping checklist
legal_checklist = {
    "robots_txt": "Always check and respect robots.txt",
    "terms_of_service": "Review website terms before scraping",
    "rate_limiting": "Don't overload servers with requests",
    "personal_data": "Be careful with personal/sensitive information",
    "copyright": "Respect intellectual property rights",
    "public_data": "Focus on publicly available information",
}
```
```python
# Ethical scraping principles
ethical_principles = [
    "Be respectful of website resources",
    "Don't impact site performance for other users",
    "Use scraped data responsibly",
    "Give attribution when appropriate",
    "Consider contacting site owners for large-scale scraping",
]
```
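Scrapy enforces the robots.txt rule automatically when `ROBOTSTXT_OBEY = True` (set later in this tutorial), but it is worth knowing how to check it yourself. Here is a small sketch using Python's standard-library `urllib.robotparser`; the URL in the example is a placeholder:

```python
# Sketch: check whether robots.txt permits fetching a URL before scraping it.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="webscraper"):
    """Return True if the site's robots.txt permits fetching the URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)

# Example (placeholder URL):
# print(is_allowed("https://example-store.com/products"))
```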
Setting Up Your Scrapy Environment
Step 1: Python Environment Setup
Let’s create a professional development environment:
```bash
# Create a virtual environment
python -m venv scrapy_env

# Activate the environment
# On Windows:
scrapy_env\Scripts\activate
# On macOS/Linux:
source scrapy_env/bin/activate

# Upgrade pip
pip install --upgrade pip
```
Step 2: Install Scrapy and Dependencies
```bash
# Install Scrapy
pip install scrapy

# Install additional useful packages
pip install scrapy-splash            # For JavaScript rendering
pip install scrapy-user-agents       # For rotating user agents
pip install scrapy-rotating-proxies  # For proxy rotation
pip install itemadapter              # For item processing
pip install pymongo                  # For MongoDB storage
pip install psycopg2-binary          # For PostgreSQL storage
pip install redis                    # For Redis-based deduplication

# Development tools (the Scrapy shell itself ships with Scrapy)
pip install ipython                  # Better REPL, also used by the Scrapy shell
pip install black                    # Code formatting
pip install flake8                   # Linting

# Save requirements
pip freeze > requirements.txt
```
Step 3: Create Project Structure
```bash
# Create a new Scrapy project
scrapy startproject webscraper

# Navigate to the project directory
cd webscraper

# Create additional directories for organization
mkdir -p data/raw data/processed data/exports
mkdir -p logs
mkdir -p scripts
mkdir -p tests
```
Your project structure should look like this:
```
webscraper/
├── scrapy.cfg              # Deploy configuration
├── requirements.txt        # Python dependencies
├── data/                   # Data storage
│   ├── raw/                # Raw scraped data
│   ├── processed/          # Cleaned data
│   └── exports/            # Final exports
├── logs/                   # Log files
├── scripts/                # Utility scripts
├── tests/                  # Test files
└── webscraper/             # Main package
    ├── __init__.py
    ├── items.py            # Item definitions
    ├── middlewares.py      # Custom middlewares
    ├── pipelines.py        # Data processing pipelines
    ├── settings.py         # Project settings
    └── spiders/            # Spider modules
        └── __init__.py
```
Building Your First Spider
Step 1: Define Data Items
First, let’s define what data we want to extract. We’ll build a spider for scraping e-commerce products:
```python
# webscraper/items.py
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags


def clean_price(value):
    """Clean a price string (or number) and convert it to a float."""
    if value is None:
        return None
    # Remove currency symbols, thousands separators and whitespace
    cleaned = ''.join(char for char in str(value) if char.isdigit() or char == '.')
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_text(value):
    """Clean text by removing HTML tags and extra whitespace."""
    if value:
        cleaned = remove_tags(value).strip()
        return ' '.join(cleaned.split())
    return None


class ProductItem(scrapy.Item):
    # Basic product information
    name = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst(),
    )
    original_price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst(),
    )
    currency = scrapy.Field(output_processor=TakeFirst())
    description = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=Join(' '),
    )

    # Product details
    brand = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst(),
    )
    category = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst(),
    )
    sku = scrapy.Field(output_processor=TakeFirst())
    availability = scrapy.Field(output_processor=TakeFirst())
    rating = scrapy.Field(output_processor=TakeFirst())
    review_count = scrapy.Field(output_processor=TakeFirst())

    # Images and media
    images = scrapy.Field()

    # Metadata
    url = scrapy.Field(output_processor=TakeFirst())
    scraped_at = scrapy.Field(output_processor=TakeFirst())

    # Additional fields for tracking
    source = scrapy.Field(output_processor=TakeFirst())
```
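As a quick sanity check of the cleaning helpers before wiring them into a spider (the input strings below are illustrative):

```python
# Expected behaviour of the processors defined above:
assert clean_price("$1,299.99") == 1299.99   # currency symbol and comma stripped
assert clean_price("N/A") is None            # unparseable prices become None
assert clean_text("<p>  Hello   <b>world</b> </p>") == "Hello world"
```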
Step 2: Create Your First Spider
Now let’s create a robust spider that demonstrates best practices:
```python
# webscraper/spiders/ecommerce_spider.py
import json
import re
from datetime import datetime

import scrapy
from scrapy.loader import ItemLoader

from webscraper.items import ProductItem


class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    allowed_domains = ['example-store.com']

    # Custom settings for this spider
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 0.5,
        'AUTOTHROTTLE_MAX_DELAY': 3,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 2.0,
        'FEEDS': {
            'data/raw/products_%(time)s.json': {
                'format': 'json',
                'encoding': 'utf8',
                'store_empty': False,
                'fields': None,
                'indent': 2,
            },
        },
    }

    def start_requests(self):
        """Generate initial requests."""
        start_urls = [
            'https://example-store.com/products',
            'https://example-store.com/categories/electronics',
            'https://example-store.com/categories/clothing',
        ]

        for url in start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={
                    'source': 'category_page',
                    # Only needed for JavaScript-heavy pages, and only if
                    # scrapy-playwright is installed and configured:
                    # 'playwright': True,
                },
            )

    def parse(self, response):
        """Parse category pages and extract product links."""
        self.logger.info(f'Parsing category page: {response.url}')

        # Extract product links using CSS selectors
        product_links = response.css('.product-item a::attr(href)').getall()

        if not product_links:
            # Try an alternative selector
            product_links = response.css('.product-link::attr(href)').getall()

        # Follow product links
        for link in product_links:
            product_url = response.urljoin(link)
            yield scrapy.Request(
                url=product_url,
                callback=self.parse_product,
                meta={
                    'source': 'product_page',
                    'category_url': response.url,
                },
            )

        # Follow pagination
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield scrapy.Request(
                url=response.urljoin(next_page),
                callback=self.parse,
                meta=response.meta,
            )

    def parse_product(self, response):
        """Parse individual product pages."""
        self.logger.info(f'Parsing product: {response.url}')

        # Create an item loader for clean data extraction
        loader = ItemLoader(item=ProductItem(), response=response)

        # Basic product information
        loader.add_css('name', 'h1.product-title::text')
        loader.add_css('name', '.product-name::text')  # Fallback selector

        # Price extraction with multiple selectors
        loader.add_css('price', '.price-current::text')
        loader.add_css('price', '.current-price::text')
        loader.add_xpath('price', '//span[@class="price"]//text()')

        # Original price (if on sale)
        loader.add_css('original_price', '.price-original::text')
        loader.add_css('original_price', '.old-price::text')

        # Product description
        loader.add_css('description', '.product-description p::text')
        loader.add_xpath('description', '//div[@class="description"]//text()')

        # Product details
        loader.add_css('brand', '.brand-name::text')
        loader.add_css('category', '.breadcrumb li:last-child::text')
        loader.add_css('sku', '.product-sku::text')

        # Availability
        availability = response.css('.stock-status::text').get()
        if availability:
            loader.add_value(
                'availability',
                'in_stock' if 'in stock' in availability.lower() else 'out_of_stock',
            )

        # Rating and reviews
        rating = response.css('.rating-value::text').get()
        if rating:
            try:
                loader.add_value('rating', float(rating.strip()))
            except ValueError:
                self.logger.debug(f'Could not parse rating: {rating!r}')

        review_count_text = response.css('.review-count::text').get()
        if review_count_text:
            review_count = re.search(r'(\d+)', review_count_text)
            if review_count:
                loader.add_value('review_count', int(review_count.group(1)))

        # Images
        image_urls = response.css('.product-images img::attr(src)').getall()
        if image_urls:
            # Convert relative URLs to absolute
            absolute_urls = [response.urljoin(url) for url in image_urls]
            loader.add_value('images', absolute_urls)

        # Metadata
        loader.add_value('url', response.url)
        loader.add_value('scraped_at', datetime.now().isoformat())
        loader.add_value('source', response.meta.get('source', 'unknown'))

        # Extract structured data if available
        structured_data = self.extract_structured_data(response)
        if structured_data:
            self.update_loader_from_structured_data(loader, structured_data)

        yield loader.load_item()

    def extract_structured_data(self, response):
        """Extract JSON-LD structured data."""
        scripts = response.xpath('//script[@type="application/ld+json"]/text()').getall()

        for script in scripts:
            try:
                data = json.loads(script)
                if isinstance(data, dict) and data.get('@type') == 'Product':
                    return data
                elif isinstance(data, list):
                    for item in data:
                        if isinstance(item, dict) and item.get('@type') == 'Product':
                            return item
            except json.JSONDecodeError:
                continue

        return None

    def update_loader_from_structured_data(self, loader, data):
        """Update the item loader with JSON-LD structured data."""
        if 'name' in data:
            loader.add_value('name', data['name'])

        if 'offers' in data and isinstance(data['offers'], dict):
            offer = data['offers']
            if 'price' in offer:
                loader.add_value('price', offer['price'])  # cleaned by clean_price
            if 'priceCurrency' in offer:
                loader.add_value('currency', offer['priceCurrency'])
            if 'availability' in offer:
                availability = offer['availability'].split('/')[-1].lower()
                loader.add_value('availability', availability)

        if 'brand' in data:
            brand = data['brand']
            if isinstance(brand, dict) and 'name' in brand:
                loader.add_value('brand', brand['name'])
            elif isinstance(brand, str):
                loader.add_value('brand', brand)

        if 'aggregateRating' in data:
            rating_data = data['aggregateRating']
            if 'ratingValue' in rating_data:
                loader.add_value('rating', float(rating_data['ratingValue']))
            if 'reviewCount' in rating_data:
                loader.add_value('review_count', int(rating_data['reviewCount']))
```
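Before running the full spider, it helps to try selectors interactively in the Scrapy shell (bundled with Scrapy). The URL below is the placeholder domain used throughout this tutorial; substitute a real page you are allowed to scrape:

```bash
# Open an interactive shell against a page
scrapy shell "https://example-store.com/products"

# Then experiment inside the shell, e.g.:
#   >>> response.css('.product-item a::attr(href)').getall()
#   >>> response.css('h1.product-title::text').get()
#   >>> response.xpath('//script[@type="application/ld+json"]/text()').getall()
```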
Step 3: Data Processing Pipeline
Create a pipeline to process and validate extracted data:
```python
# webscraper/pipelines.py
import json
from datetime import datetime

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    """Validate and clean scraped items."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Validate required fields
        required_fields = ['name', 'url']
        for field in required_fields:
            if not adapter.get(field):
                raise DropItem(f"Missing required field: {field}")

        # Clean and validate price
        price = adapter.get('price')
        if price is not None:
            if not isinstance(price, (int, float)) or price < 0:
                spider.logger.warning(f"Invalid price for {adapter['name']}: {price}")
                adapter['price'] = None

        # Validate rating
        rating = adapter.get('rating')
        if rating is not None:
            if not isinstance(rating, (int, float)) or not (0 <= rating <= 5):
                spider.logger.warning(f"Invalid rating for {adapter['name']}: {rating}")
                adapter['rating'] = None

        return item


class DuplicationFilterPipeline:
    """Filter out duplicate items."""

    def __init__(self):
        self.seen_items = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Create a unique identifier for the item
        identifier = f"{adapter['name']}_{adapter['url']}"

        if identifier in self.seen_items:
            raise DropItem(f"Duplicate item found: {adapter['name']}")
        self.seen_items.add(identifier)
        return item


class JsonWriterPipeline:
    """Write items to a JSON file."""

    def __init__(self):
        self.file = None
        self.items = []

    def open_spider(self, spider):
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"data/processed/{spider.name}_{timestamp}.json"
        self.file = open(filename, 'w', encoding='utf-8')
        spider.logger.info(f"Opened file: {filename}")

    def close_spider(self, spider):
        if self.file:
            json.dump(self.items, self.file, indent=2, ensure_ascii=False)
            self.file.close()
            spider.logger.info(f"Saved {len(self.items)} items")

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.items.append(dict(adapter))
        return item


class StatisticsPipeline:
    """Collect scraping statistics."""

    def __init__(self):
        self.stats = {
            'items_scraped': 0,
            'items_dropped': 0,
            'start_time': None,
            'end_time': None,
        }

    def open_spider(self, spider):
        self.stats['start_time'] = datetime.now()
        spider.logger.info("Statistics collection started")

    def close_spider(self, spider):
        self.stats['end_time'] = datetime.now()
        duration = self.stats['end_time'] - self.stats['start_time']

        spider.logger.info("=== SCRAPING STATISTICS ===")
        spider.logger.info(f"Items scraped: {self.stats['items_scraped']}")
        spider.logger.info(f"Items dropped: {self.stats['items_dropped']}")
        spider.logger.info(f"Duration: {duration}")
        minutes = max(duration.total_seconds() / 60, 1e-9)  # avoid division by zero
        spider.logger.info(f"Items per minute: {self.stats['items_scraped'] / minutes:.2f}")

    def process_item(self, item, spider):
        self.stats['items_scraped'] += 1
        return item
```
Step 4: Configure Settings
Update your settings for optimal performance:
```python
# Scrapy settings for the webscraper project
BOT_NAME = 'webscraper'

SPIDER_MODULES = ['webscraper.spiders']
NEWSPIDER_MODULE = 'webscraper.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure pipelines
ITEM_PIPELINES = {
    'webscraper.pipelines.ValidationPipeline': 300,
    'webscraper.pipelines.DuplicationFilterPipeline': 400,
    'webscraper.pipelines.JsonWriterPipeline': 500,
    'webscraper.pipelines.StatisticsPipeline': 600,
}

# Configure delays and throttling
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True

# AutoThrottle settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = False

# User agent settings
USER_AGENT = 'webscraper (+http://www.yourdomain.com)'

# Configure caching
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
HTTPCACHE_DIR = 'httpcache'

# Logging settings
LOG_LEVEL = 'INFO'
LOG_FILE = 'logs/scrapy.log'

# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Concurrent requests
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Memory usage optimization
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1024

# Request and response size limits
DOWNLOAD_MAXSIZE = 1073741824  # 1 GB
DOWNLOAD_WARNSIZE = 33554432   # 32 MB
```
Running Your Spider
Basic Execution
```bash
# Run the spider
scrapy crawl ecommerce

# Run with custom settings
scrapy crawl ecommerce -s DOWNLOAD_DELAY=2

# Save output to a specific file
scrapy crawl ecommerce -o products.json

# Run with a custom log level
scrapy crawl ecommerce -L DEBUG
```
Advanced Execution with Parameters
```python
# A script to run spiders with custom parameters
import subprocess
from datetime import datetime


def run_spider(spider_name, **kwargs):
    """Run a spider with custom settings via the Scrapy CLI."""
    cmd = ['scrapy', 'crawl', spider_name]

    # Add custom settings
    for key, value in kwargs.items():
        cmd.extend(['-s', f'{key}={value}'])

    # Add a timestamped output file
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f'data/raw/{spider_name}_{timestamp}.json'
    cmd.extend(['-o', output_file])

    print(f"Running command: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        print("Spider completed successfully!")
        print(f"Output saved to: {output_file}")
        return True
    except subprocess.CalledProcessError as e:
        print(f"Spider failed with error: {e}")
        print(f"Error output: {e.stderr}")
        return False


if __name__ == "__main__":
    # Example usage
    run_spider(
        'ecommerce',
        DOWNLOAD_DELAY=1.5,
        CONCURRENT_REQUESTS=8,
        LOG_LEVEL='INFO',
    )
```
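If you would rather stay inside Python than shell out to the CLI, Scrapy's `CrawlerProcess` API can run the spider in-process. A minimal sketch using the project settings (the override names and values are just examples):

```python
# Run the spider in-process instead of via subprocess.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_in_process(spider_name='ecommerce', **overrides):
    settings = get_project_settings()
    settings.update(overrides)      # e.g. DOWNLOAD_DELAY=1.5
    process = CrawlerProcess(settings)
    process.crawl(spider_name)      # spider is looked up by name in the project
    process.start()                 # blocks until the crawl finishes


if __name__ == "__main__":
    run_in_process(DOWNLOAD_DELAY=1.5, LOG_LEVEL='INFO')
```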
Testing Your Spider
Unit Tests
```python
import unittest

from scrapy.http import HtmlResponse, Request

from webscraper.spiders.ecommerce_spider import EcommerceSpider


class TestEcommerceSpider(unittest.TestCase):

    def setUp(self):
        self.spider = EcommerceSpider()

    def test_parse_product(self):
        """Test product parsing."""
        # Sample HTML response
        html = """
        <html>
          <body>
            <h1 class="product-title">Test Product</h1>
            <span class="price-current">$99.99</span>
            <p class="product-description">This is a test product</p>
          </body>
        </html>
        """

        request = Request(url='http://example.com/product/1')
        response = HtmlResponse(
            url='http://example.com/product/1',
            request=request,
            body=html.encode('utf-8'),
        )

        # Process the response
        items = list(self.spider.parse_product(response))

        # Assertions
        self.assertEqual(len(items), 1)
        item = items[0]
        self.assertEqual(item['name'], 'Test Product')
        self.assertEqual(item['price'], 99.99)


if __name__ == '__main__':
    unittest.main()
```
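Beyond unit tests, Scrapy has built-in spider contracts, which exercise a callback against a live URL when you run `scrapy check`. A sketch of what annotating `parse_product` could look like; the URL is a placeholder and must point to a real, reachable product page for the check to run:

```python
def parse_product(self, response):
    """Parse individual product pages.

    @url https://example-store.com/products/sample-item
    @returns items 1 1
    @scrapes name url scraped_at
    """
    ...
```

Running `scrapy check ecommerce` then fetches the contract URL and verifies that the callback returns exactly one item with the listed fields populated.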
Best Practices and Tips
1. Robust Selector Strategy
```python
def extract_with_fallbacks(response, selectors):
    """Extract data with multiple fallback selectors."""
    for selector in selectors:
        result = response.css(selector).get()
        if result:
            return result.strip()
    return None


# Usage example
price = extract_with_fallbacks(response, [
    '.price-current::text',
    '.current-price::text',
    '.price::text',
    '[data-price]::attr(data-price)',
])
```
2. Error Handling
```python
def safe_extract_float(value, default=None):
    """Safely extract a float from a string."""
    if not value:
        return default

    try:
        # Keep only digits, the decimal point and minus signs
        cleaned = ''.join(char for char in str(value) if char.isdigit() or char in '.-')
        return float(cleaned)
    except (ValueError, TypeError):
        return default
```
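A few illustrative inputs and the values the helper returns:

```python
safe_extract_float("$1,299.00")     # 1299.0  -> symbols and commas stripped
safe_extract_float("out of stock")  # None    -> nothing numeric to extract
safe_extract_float(None, 0.0)       # 0.0     -> falls back to the default
```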
3. Logging and Monitoring
```python
# Add custom logging and simple counters to your spider
import scrapy


class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'products_found': 0,
            'products_processed': 0,
            'errors': 0,
        }

    def parse_product(self, response):
        try:
            self.stats['products_found'] += 1
            # ... processing logic ...
            self.stats['products_processed'] += 1
        except Exception as e:
            self.stats['errors'] += 1
            self.logger.error(f"Error processing {response.url}: {e}")

    def closed(self, reason):
        self.logger.info(f"Spider closed: {reason}")
        self.logger.info(f"Statistics: {self.stats}")
```
Summary and Next Steps
Congratulations! You’ve completed Part 1 of the Web Scraping with Scrapy series. You now have:
✅ Understanding of web scraping ethics and best practices
✅ Professional Scrapy environment setup
✅ Production-ready spider with proper architecture
✅ Data extraction using selectors and structured data
✅ Processing pipelines for data validation and storage
✅ Testing framework for reliable spiders
What’s Next?
In Part 2: Advanced Scraping Techniques, we’ll explore:
- Handling JavaScript-heavy websites with Splash
- Form submission and login handling
- Advanced selector techniques and data extraction
- Handling AJAX requests and dynamic content
- Custom middleware development
Practice Exercise
Before moving to Part 2, try building a spider for your favorite e-commerce site:
- Create a new spider targeting a simple e-commerce site
- Extract product names, prices, and descriptions
- Implement proper error handling and logging
- Add data validation pipelines
- Test your spider with different product categories
Resources
Happy scraping! 🕷️