Master Web Scraping - Part 1: Scrapy Fundamentals

Surendra Tamang

30 min read · Beginner

Prerequisites

  • Basic Python knowledge
  • Understanding of HTML/CSS selectors
  • Command line familiarity

Master Web Scraping with Scrapy: Fundamentals

Welcome to this comprehensive tutorial series on mastering web scraping with Scrapy! In this first part, we’ll establish the foundation by understanding web scraping concepts, setting up Scrapy, and building your first production-ready spider.

Series Overview

This tutorial series will guide you through:

  1. Part 1: Scrapy Fundamentals (This tutorial)
  2. Part 2: Advanced Scraping Techniques
  3. Part 3: Anti-Detection and Scaling
  4. Part 4: Data Processing and Storage
  5. Part 5: Production Deployment

What You’ll Learn in This Part

  • Understanding web scraping ethics and legality
  • Setting up a professional Scrapy development environment
  • Building your first spider with proper architecture
  • Data extraction using selectors and XPath
  • Handling different data types and edge cases
  • Best practices for maintainable scraping code

Understanding Web Scraping

What is Web Scraping?

Web scraping is the process of automatically extracting data from websites. It’s like having a robot that can visit web pages, read the content, and collect specific information according to your instructions.
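
To make this concrete, here is a minimal sketch of that idea using Scrapy’s standalone Selector class. The HTML snippet and field names are invented purely for illustration:

# A tiny illustration of "reading a page and collecting specific information".
# The HTML below is made up for this example.
from scrapy.selector import Selector

html = """
<div class="product">
  <h2 class="title">Wireless Mouse</h2>
  <span class="price">$24.99</span>
</div>
"""

sel = Selector(text=html)
print(sel.css(".title::text").get())   # Wireless Mouse
print(sel.css(".price::text").get())   # $24.99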

When to Use Web Scraping

# Common use cases for web scraping:
use_cases = {
    "e_commerce": [
        "Price monitoring and comparison",
        "Product catalog aggregation",
        "Inventory tracking",
        "Competitor analysis"
    ],
    "real_estate": [
        "Property listings collection",
        "Market price analysis",
        "Investment opportunity identification"
    ],
    "news_media": [
        "News aggregation",
        "Sentiment analysis",
        "Content monitoring"
    ],
    "research": [
        "Academic paper collection",
        "Social media data analysis",
        "Market research"
    ]
}

Legal and Ethical Considerations

Before we dive into the technical aspects, let’s address the important legal and ethical considerations:

# Legal scraping checklist
legal_checklist = {
    "robots_txt": "Always check and respect robots.txt",
    "terms_of_service": "Review website terms before scraping",
    "rate_limiting": "Don't overload servers with requests",
    "personal_data": "Be careful with personal/sensitive information",
    "copyright": "Respect intellectual property rights",
    "public_data": "Focus on publicly available information"
}

# Ethical scraping principles
ethical_principles = [
    "Be respectful of website resources",
    "Don't impact site performance for other users",
    "Use scraped data responsibly",
    "Give attribution when appropriate",
    "Consider contacting site owners for large-scale scraping"
]

Setting Up Your Scrapy Environment

Step 1: Python Environment Setup

Let’s create a professional development environment:

Terminal window
# Create a virtual environment
python -m venv scrapy_env
# Activate the environment
# On Windows:
scrapy_env\Scripts\activate
# On macOS/Linux:
source scrapy_env/bin/activate
# Upgrade pip
pip install --upgrade pip

Step 2: Install Scrapy and Dependencies

Terminal window
# Install Scrapy with all recommended packages
pip install scrapy
# Install additional useful packages
pip install scrapy-splash # For JavaScript rendering
pip install scrapy-user-agents # For rotating user agents
pip install scrapy-rotating-proxies # For proxy rotation
pip install itemadapter # For item processing
pip install pymongo # For MongoDB storage
pip install psycopg2-binary # For PostgreSQL storage
pip install redis # For Redis-based deduplication
# Development tools
pip install ipython # Better REPL (the built-in Scrapy shell uses IPython when available)
pip install black # Code formatting
pip install flake8 # Linting
# Save requirements
pip freeze > requirements.txt

Step 3: Create Project Structure

Terminal window
# Create a new Scrapy project
scrapy startproject webscraper
# Navigate to project directory
cd webscraper
# Create additional directories for organization
mkdir -p data/raw data/processed data/exports
mkdir -p logs
mkdir -p scripts
mkdir -p tests

Your project structure should look like this:

webscraper/
├── scrapy.cfg              # Deploy configuration
├── requirements.txt        # Python dependencies
├── data/                   # Data storage
│   ├── raw/                # Raw scraped data
│   ├── processed/          # Cleaned data
│   └── exports/            # Final exports
├── logs/                   # Log files
├── scripts/                # Utility scripts
├── tests/                  # Test files
└── webscraper/             # Main package
    ├── __init__.py
    ├── items.py            # Item definitions
    ├── middlewares.py      # Custom middlewares
    ├── pipelines.py        # Data processing pipelines
    ├── settings.py         # Project settings
    └── spiders/            # Spider modules
        └── __init__.py

Building Your First Spider

Step 1: Define Data Items

First, let’s define what data we want to extract. We’ll build a spider for scraping e-commerce products:

webscraper/items.py
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags


def clean_price(value):
    """Clean price string and convert to float"""
    if value:
        # Remove currency symbols and whitespace
        # (coerce to str so numeric inputs, e.g. from JSON-LD, are handled too)
        cleaned = ''.join(char for char in str(value) if char.isdigit() or char == '.')
        try:
            return float(cleaned)
        except ValueError:
            return None
    return None


def clean_text(value):
    """Clean text by removing extra whitespace and HTML tags"""
    if value:
        cleaned = remove_tags(value).strip()
        return ' '.join(cleaned.split())
    return None


class ProductItem(scrapy.Item):
    # Basic product information
    name = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst()
    )
    price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst()
    )
    original_price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst()
    )
    currency = scrapy.Field(
        output_processor=TakeFirst()
    )
    description = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=Join(' ')
    )

    # Product details
    brand = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst()
    )
    category = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst()
    )
    sku = scrapy.Field(
        output_processor=TakeFirst()
    )
    availability = scrapy.Field(
        output_processor=TakeFirst()
    )
    rating = scrapy.Field(
        output_processor=TakeFirst()
    )
    review_count = scrapy.Field(
        output_processor=TakeFirst()
    )

    # Images and media
    images = scrapy.Field()

    # Metadata
    url = scrapy.Field(
        output_processor=TakeFirst()
    )
    scraped_at = scrapy.Field(
        output_processor=TakeFirst()
    )

    # Additional fields for tracking
    source = scrapy.Field(
        output_processor=TakeFirst()
    )
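
As a quick sanity check, the two processor functions can be exercised on their own; the sample strings below are made up:

# Quick manual check of the processors defined above (sample values are invented).
from webscraper.items import clean_price, clean_text

print(clean_price("$1,299.99"))               # 1299.99 (symbols and commas stripped)
print(clean_price("Call for price"))          # None (no numeric content)
print(clean_text("  <b>Blue  Widget</b>\n"))  # "Blue Widget"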

Step 2: Create Your First Spider

Now let’s create a robust spider that demonstrates best practices:

webscraper/spiders/ecommerce_spider.py
import scrapy
from scrapy.loader import ItemLoader
from webscraper.items import ProductItem
from datetime import datetime
import json
import re


class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    allowed_domains = ['example-store.com']

    # Custom settings for this spider
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 0.5,
        'AUTOTHROTTLE_MAX_DELAY': 3,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 2.0,
        'FEEDS': {
            'data/raw/products_%(time)s.json': {
                'format': 'json',
                'encoding': 'utf8',
                'store_empty': False,
                'fields': None,
                'indent': 2,
            },
        }
    }

    def start_requests(self):
        """Generate initial requests"""
        start_urls = [
            'https://example-store.com/products',
            'https://example-store.com/categories/electronics',
            'https://example-store.com/categories/clothing',
        ]
        for url in start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={
                    'source': 'category_page',
                    'playwright': True,  # Only takes effect if the scrapy-playwright plugin is installed for JS rendering
                }
            )

    def parse(self, response):
        """Parse category pages and extract product links"""
        self.logger.info(f'Parsing category page: {response.url}')

        # Extract product links using CSS selectors
        product_links = response.css('.product-item a::attr(href)').getall()
        if not product_links:
            # Try alternative selectors
            product_links = response.css('.product-link::attr(href)').getall()

        # Follow product links
        for link in product_links:
            product_url = response.urljoin(link)
            yield scrapy.Request(
                url=product_url,
                callback=self.parse_product,
                meta={
                    'source': 'product_page',
                    'category_url': response.url
                }
            )

        # Follow pagination
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield scrapy.Request(
                url=response.urljoin(next_page),
                callback=self.parse,
                meta=response.meta
            )

    def parse_product(self, response):
        """Parse individual product pages"""
        self.logger.info(f'Parsing product: {response.url}')

        # Create item loader for clean data extraction
        loader = ItemLoader(item=ProductItem(), response=response)

        # Basic product information
        loader.add_css('name', 'h1.product-title::text')
        loader.add_css('name', '.product-name::text')  # Fallback selector

        # Price extraction with multiple selectors
        loader.add_css('price', '.price-current::text')
        loader.add_css('price', '.current-price::text')
        loader.add_xpath('price', '//span[@class="price"]//text()')

        # Original price (if on sale)
        loader.add_css('original_price', '.price-original::text')
        loader.add_css('original_price', '.old-price::text')

        # Product description
        loader.add_css('description', '.product-description p::text')
        loader.add_xpath('description', '//div[@class="description"]//text()')

        # Product details
        loader.add_css('brand', '.brand-name::text')
        loader.add_css('category', '.breadcrumb li:last-child::text')
        loader.add_css('sku', '.product-sku::text')

        # Availability
        availability = response.css('.stock-status::text').get()
        if availability:
            loader.add_value('availability', 'in_stock' if 'in stock' in availability.lower() else 'out_of_stock')

        # Rating and reviews
        rating = response.css('.rating-value::text').get()
        if rating:
            loader.add_value('rating', float(rating))

        review_count_text = response.css('.review-count::text').get()
        if review_count_text:
            review_count = re.search(r'(\d+)', review_count_text)
            if review_count:
                loader.add_value('review_count', int(review_count.group(1)))

        # Images
        image_urls = response.css('.product-images img::attr(src)').getall()
        if image_urls:
            # Convert relative URLs to absolute
            absolute_urls = [response.urljoin(url) for url in image_urls]
            loader.add_value('images', absolute_urls)

        # Metadata
        loader.add_value('url', response.url)
        loader.add_value('scraped_at', datetime.now().isoformat())
        loader.add_value('source', response.meta.get('source', 'unknown'))

        # Extract structured data if available
        structured_data = self.extract_structured_data(response)
        if structured_data:
            self.update_loader_from_structured_data(loader, structured_data)

        yield loader.load_item()

    def extract_structured_data(self, response):
        """Extract JSON-LD structured data"""
        scripts = response.xpath('//script[@type="application/ld+json"]/text()').getall()
        for script in scripts:
            try:
                data = json.loads(script)
                if isinstance(data, dict) and data.get('@type') == 'Product':
                    return data
                elif isinstance(data, list):
                    for item in data:
                        if isinstance(item, dict) and item.get('@type') == 'Product':
                            return item
            except json.JSONDecodeError:
                continue
        return None

    def update_loader_from_structured_data(self, loader, data):
        """Update item loader with structured data"""
        if 'name' in data:
            loader.add_value('name', data['name'])

        if 'offers' in data and isinstance(data['offers'], dict):
            offer = data['offers']
            if 'price' in offer:
                loader.add_value('price', float(offer['price']))
            if 'priceCurrency' in offer:
                loader.add_value('currency', offer['priceCurrency'])
            if 'availability' in offer:
                availability = offer['availability'].split('/')[-1].lower()
                loader.add_value('availability', availability)

        if 'brand' in data:
            brand = data['brand']
            if isinstance(brand, dict) and 'name' in brand:
                loader.add_value('brand', brand['name'])
            elif isinstance(brand, str):
                loader.add_value('brand', brand)

        if 'aggregateRating' in data:
            rating_data = data['aggregateRating']
            if 'ratingValue' in rating_data:
                loader.add_value('rating', float(rating_data['ratingValue']))
            if 'reviewCount' in rating_data:
                loader.add_value('review_count', int(rating_data['reviewCount']))
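
If you want to exercise the JSON-LD handling without hitting a live site, you can feed the spider a hand-built response; the HTML and product data below are invented purely for this check:

# Feed the spider a fabricated response to check extract_structured_data()
# (the embedded JSON-LD is made up for illustration).
from scrapy.http import HtmlResponse, Request
from webscraper.spiders.ecommerce_spider import EcommerceSpider

html = b"""
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Demo Widget",
 "offers": {"price": "19.99", "priceCurrency": "USD"}}
</script>
</head><body></body></html>
"""

request = Request(url='https://example-store.com/product/demo')
response = HtmlResponse(url=request.url, request=request, body=html, encoding='utf-8')

spider = EcommerceSpider()
print(spider.extract_structured_data(response))
# {'@type': 'Product', 'name': 'Demo Widget', 'offers': {...}}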

Step 3: Data Processing Pipeline

Create a pipeline to process and validate extracted data:

webscraper/pipelines.py
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
import logging
import json
from datetime import datetime


class ValidationPipeline:
    """Validate and clean scraped items"""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Validate required fields
        required_fields = ['name', 'url']
        for field in required_fields:
            if not adapter.get(field):
                raise DropItem(f"Missing required field: {field}")

        # Clean and validate price
        price = adapter.get('price')
        if price is not None:
            if not isinstance(price, (int, float)) or price < 0:
                spider.logger.warning(f"Invalid price for {adapter['name']}: {price}")
                adapter['price'] = None

        # Validate rating
        rating = adapter.get('rating')
        if rating is not None:
            if not isinstance(rating, (int, float)) or not (0 <= rating <= 5):
                spider.logger.warning(f"Invalid rating for {adapter['name']}: {rating}")
                adapter['rating'] = None

        return item


class DuplicationFilterPipeline:
    """Filter out duplicate items"""

    def __init__(self):
        self.seen_items = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Create a unique identifier for the item
        identifier = f"{adapter['name']}_{adapter['url']}"
        if identifier in self.seen_items:
            raise DropItem(f"Duplicate item found: {adapter['name']}")
        else:
            self.seen_items.add(identifier)
        return item


class JsonWriterPipeline:
    """Write items to JSON file"""

    def __init__(self):
        self.file = None
        self.items = []

    def open_spider(self, spider):
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"data/processed/{spider.name}_{timestamp}.json"
        self.file = open(filename, 'w', encoding='utf-8')
        spider.logger.info(f"Opened file: {filename}")

    def close_spider(self, spider):
        if self.file:
            json.dump(self.items, self.file, indent=2, ensure_ascii=False)
            self.file.close()
            spider.logger.info(f"Saved {len(self.items)} items")

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.items.append(dict(adapter))
        return item


class StatisticsPipeline:
    """Collect scraping statistics"""

    def __init__(self):
        self.stats = {
            'items_scraped': 0,
            'items_dropped': 0,
            'start_time': None,
            'end_time': None
        }

    def open_spider(self, spider):
        self.stats['start_time'] = datetime.now()
        spider.logger.info("Statistics collection started")

    def close_spider(self, spider):
        self.stats['end_time'] = datetime.now()
        duration = self.stats['end_time'] - self.stats['start_time']
        spider.logger.info("=== SCRAPING STATISTICS ===")
        spider.logger.info(f"Items scraped: {self.stats['items_scraped']}")
        spider.logger.info(f"Items dropped: {self.stats['items_dropped']}")
        spider.logger.info(f"Duration: {duration}")
        spider.logger.info(f"Items per minute: {self.stats['items_scraped'] / (duration.total_seconds() / 60):.2f}")

    def process_item(self, item, spider):
        self.stats['items_scraped'] += 1
        return item
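
The validation logic can be checked in isolation with a stub spider and a plain dict item; everything in this snippet (the stub class and sample data) is invented for the test:

# Minimal standalone check of ValidationPipeline (stub spider and data are made up).
import logging
from scrapy.exceptions import DropItem
from webscraper.pipelines import ValidationPipeline

class StubSpider:
    logger = logging.getLogger("stub")

pipeline = ValidationPipeline()

good = {'name': 'Demo Widget', 'url': 'https://example.com/p/1', 'price': -5}
result = pipeline.process_item(good, StubSpider())
print(result['price'])  # None -- the negative price was cleared

try:
    pipeline.process_item({'url': 'https://example.com/p/2'}, StubSpider())
except DropItem as exc:
    print(exc)  # Missing required field: name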

Step 4: Configure Settings

Update your settings for optimal performance:

webscraper/settings.py
# Scrapy settings for webscraper project
BOT_NAME = 'webscraper'
SPIDER_MODULES = ['webscraper.spiders']
NEWSPIDER_MODULE = 'webscraper.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure pipelines
ITEM_PIPELINES = {
    'webscraper.pipelines.ValidationPipeline': 300,
    'webscraper.pipelines.DuplicationFilterPipeline': 400,
    'webscraper.pipelines.JsonWriterPipeline': 500,
    'webscraper.pipelines.StatisticsPipeline': 600,
}
# Configure delays and throttling
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
# AutoThrottle settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = False
# User agent settings
USER_AGENT = 'webscraper (+http://www.yourdomain.com)'
# Configure caching
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
HTTPCACHE_DIR = 'httpcache'
# Logging settings
LOG_LEVEL = 'INFO'
LOG_FILE = 'logs/scrapy.log'
# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]
# Concurrent requests
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
# Memory usage optimization
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1024
# Request and response size limits
DOWNLOAD_MAXSIZE = 1073741824 # 1GB
DOWNLOAD_WARNSIZE = 33554432 # 32MB
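
You can also load and inspect these settings programmatically, which is handy when debugging configuration; a small sketch (run from the project root so scrapy.cfg is found):

# Inspect the project settings outside a running crawl.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('BOT_NAME'))             # webscraper
print(settings.getfloat('DOWNLOAD_DELAY'))  # 1.0
print(settings.getbool('HTTPCACHE_ENABLED'))
print(settings.getdict('ITEM_PIPELINES'))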

Running Your Spider

Basic Execution

Terminal window
# Run the spider
scrapy crawl ecommerce
# Run with custom settings
scrapy crawl ecommerce -s DOWNLOAD_DELAY=2
# Save output to specific file
scrapy crawl ecommerce -o products.json
# Run with custom log level
scrapy crawl ecommerce -L DEBUG

Advanced Execution with Parameters

scripts/run_spider.py
# Create a script to run spiders with parameters
import subprocess
import sys
from datetime import datetime


def run_spider(spider_name, **kwargs):
    """Run spider with custom parameters"""
    cmd = ['scrapy', 'crawl', spider_name]

    # Add custom settings
    for key, value in kwargs.items():
        cmd.extend(['-s', f'{key}={value}'])

    # Add timestamp to output file
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f'data/raw/{spider_name}_{timestamp}.json'
    cmd.extend(['-o', output_file])

    print(f"Running command: {' '.join(cmd)}")
    try:
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        print("Spider completed successfully!")
        print(f"Output saved to: {output_file}")
        return True
    except subprocess.CalledProcessError as e:
        print(f"Spider failed with error: {e}")
        print(f"Error output: {e.stderr}")
        return False


if __name__ == "__main__":
    # Example usage
    run_spider(
        'ecommerce',
        DOWNLOAD_DELAY=1.5,
        CONCURRENT_REQUESTS=8,
        LOG_LEVEL='INFO'
    )

Testing Your Spider

Unit Tests

tests/test_ecommerce_spider.py
import unittest
from scrapy.http import HtmlResponse, Request
from webscraper.spiders.ecommerce_spider import EcommerceSpider


class TestEcommerceSpider(unittest.TestCase):

    def setUp(self):
        self.spider = EcommerceSpider()

    def test_parse_product(self):
        """Test product parsing"""
        # Sample HTML response
        html = """
        <html>
            <body>
                <h1 class="product-title">Test Product</h1>
                <span class="price-current">$99.99</span>
                <p class="product-description">This is a test product</p>
            </body>
        </html>
        """
        request = Request(url='http://example.com/product/1')
        response = HtmlResponse(
            url='http://example.com/product/1',
            request=request,
            body=html.encode('utf-8')
        )

        # Process the response
        items = list(self.spider.parse_product(response))

        # Assertions
        self.assertEqual(len(items), 1)
        item = items[0]
        self.assertEqual(item['name'], 'Test Product')
        self.assertEqual(item['price'], 99.99)


if __name__ == '__main__':
    unittest.main()
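
For larger test suites it is common to load saved HTML fixtures instead of inline strings; a small helper along these lines keeps tests offline and repeatable (the fixtures/ path and file name are just a suggested convention, not something created earlier):

# tests/helpers.py -- build a Scrapy response from a saved HTML fixture.
from pathlib import Path
from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_path, url='http://example.com/product/1'):
    """Create an HtmlResponse from a local HTML file for offline tests."""
    body = Path(file_path).read_bytes()
    request = Request(url=url)
    return HtmlResponse(url=url, request=request, body=body, encoding='utf-8')


# Usage inside a test (assumed fixture path):
# response = fake_response_from_file('tests/fixtures/product_page.html')
# items = list(self.spider.parse_product(response))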

Best Practices and Tips

1. Robust Selector Strategy

def extract_with_fallbacks(response, selectors):
    """Extract data with multiple fallback selectors"""
    for selector in selectors:
        result = response.css(selector).get()
        if result:
            return result.strip()
    return None


# Usage example
price = extract_with_fallbacks(response, [
    '.price-current::text',
    '.current-price::text',
    '.price::text',
    '[data-price]::attr(data-price)'
])

2. Error Handling

def safe_extract_float(value, default=None):
    """Safely extract float from string"""
    if not value:
        return default
    try:
        # Clean the string
        cleaned = ''.join(char for char in str(value) if char.isdigit() or char in '.-')
        return float(cleaned)
    except (ValueError, TypeError):
        return default
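
A few sample calls show the behaviour (the inputs are made up):

# Invented inputs and the values safe_extract_float returns.
print(safe_extract_float("$99.99"))           # 99.99
print(safe_extract_float("Out of stock"))     # None (no numeric content)
print(safe_extract_float(None, default=0.0))  # 0.0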

3. Logging and Monitoring

# Add custom logging to your spider
import scrapy


class EcommerceSpider(scrapy.Spider):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'products_found': 0,
            'products_processed': 0,
            'errors': 0
        }

    def parse_product(self, response):
        try:
            self.stats['products_found'] += 1
            # ... processing logic ...
            self.stats['products_processed'] += 1
        except Exception as e:
            self.stats['errors'] += 1
            self.logger.error(f"Error processing {response.url}: {e}")

    def closed(self, reason):
        self.logger.info(f"Spider closed: {reason}")
        self.logger.info(f"Statistics: {self.stats}")

Summary and Next Steps

Congratulations! You’ve completed Part 1 of the Web Scraping with Scrapy series. You now have:

  • Understanding of web scraping ethics and best practices
  • A professional Scrapy environment setup
  • A production-ready spider with proper architecture
  • Data extraction using selectors and structured data
  • Processing pipelines for data validation and storage
  • A testing framework for reliable spiders

What’s Next?

In Part 2: Advanced Scraping Techniques, we’ll explore:

  • Handling JavaScript-heavy websites with Splash
  • Form submission and login handling
  • Advanced selector techniques and data extraction
  • Handling AJAX requests and dynamic content
  • Custom middleware development

Practice Exercise

Before moving to Part 2, try building a spider for your favorite e-commerce site (a minimal skeleton is sketched after the list below):

  1. Create a new spider targeting a simple e-commerce site
  2. Extract product names, prices, and descriptions
  3. Implement proper error handling and logging
  4. Add data validation pipelines
  5. Test your spider with different product categories
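
As a starting point, here is a minimal skeleton aimed at books.toscrape.com, a sandbox site built specifically for scraping practice; the CSS selectors reflect that site’s current markup and may need adjusting:

# webscraper/spiders/books_spider.py -- practice skeleton.
# Targets books.toscrape.com (a site intended for scraping practice);
# the selectors below are assumptions about its current markup.
import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'name': book.css('h3 a::attr(title)').get(),
                'price': book.css('p.price_color::text').get(),
                'url': response.urljoin(book.css('h3 a::attr(href)').get()),
            }

        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)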

Happy scraping! 🕷️