Master Web Scraping with Scrapy: Fundamentals
Welcome to this comprehensive tutorial series on mastering web scraping with Scrapy! In this first part, we’ll establish the foundation by understanding web scraping concepts, setting up Scrapy, and building your first production-ready spider.
Series Overview
This tutorial series will guide you through:
- Part 1: Scrapy Fundamentals (This tutorial)
- Part 2: Advanced Scraping Techniques
- Part 3: Anti-Detection and Scaling
- Part 4: Data Processing and Storage
- Part 5: Production Deployment
What You’ll Learn in This Part
- Understanding web scraping ethics and legality
- Setting up a professional Scrapy development environment
- Building your first spider with proper architecture
- Data extraction using selectors and XPath
- Handling different data types and edge cases
- Best practices for maintainable scraping code
Understanding Web Scraping
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites. It’s like having a robot that can visit web pages, read the content, and collect specific information according to your instructions.
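To make this concrete, here is a minimal, self-contained sketch of that idea using `parsel`, the selector library that ships with Scrapy. The HTML snippet and class names below are made up purely for illustration:

```python
# Minimal illustration: parse a small HTML snippet and pull out specific fields.
from parsel import Selector

html = """
<html><body>
  <div class="product"><h2>Blue Widget</h2><span class="price">$19.99</span></div>
  <div class="product"><h2>Red Widget</h2><span class="price">$24.99</span></div>
</body></html>
"""

selector = Selector(text=html)
for product in selector.css('.product'):
    name = product.css('h2::text').get()
    price = product.css('.price::text').get()
    print(name, price)
# Blue Widget $19.99
# Red Widget $24.99
```

Scrapy automates the "visit web pages" part of this loop at scale; the selector API is the same one you will use inside your spiders.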
When to Use Web Scraping
```python
# Common use cases for web scraping
use_cases = {
    "e_commerce": [
        "Price monitoring and comparison",
        "Product catalog aggregation",
        "Inventory tracking",
        "Competitor analysis",
    ],
    "real_estate": [
        "Property listings collection",
        "Market price analysis",
        "Investment opportunity identification",
    ],
    "news_media": [
        "News aggregation",
        "Sentiment analysis",
        "Content monitoring",
    ],
    "research": [
        "Academic paper collection",
        "Social media data analysis",
        "Market research",
    ],
}
```
Legal and Ethical Considerations
Before we dive into the technical aspects, let’s address the important legal and ethical considerations:
```python
# Legal scraping checklist
legal_checklist = {
    "robots_txt": "Always check and respect robots.txt",
    "terms_of_service": "Review website terms before scraping",
    "rate_limiting": "Don't overload servers with requests",
    "personal_data": "Be careful with personal/sensitive information",
    "copyright": "Respect intellectual property rights",
    "public_data": "Focus on publicly available information",
}
```
```python
# Ethical scraping principles
ethical_principles = [
    "Be respectful of website resources",
    "Don't impact site performance for other users",
    "Use scraped data responsibly",
    "Give attribution when appropriate",
    "Consider contacting site owners for large-scale scraping",
]
```
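Scrapy enforces the robots.txt rule automatically when `ROBOTSTXT_OBEY = True` (set later in this tutorial), but it is worth knowing how to check it yourself. Here is a small sketch using Python's standard-library `urllib.robotparser`; the URL in the example is a placeholder:

```python
# Sketch: check whether robots.txt permits fetching a URL before scraping it.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="webscraper"):
    """Return True if the site's robots.txt permits fetching the URL."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses robots.txt
    return parser.can_fetch(user_agent, url)

# Example (placeholder URL):
# print(is_allowed("https://example-store.com/products"))
```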
Setting Up Your Scrapy Environment
Step 1: Python Environment Setup
Let’s create a professional development environment:
```bash
# Create a virtual environment
python -m venv scrapy_env

# Activate the environment
# On Windows:
scrapy_env\Scripts\activate
# On macOS/Linux:
source scrapy_env/bin/activate

# Upgrade pip
pip install --upgrade pip
```
Step 2: Install Scrapy and Dependencies
```bash
# Install Scrapy
pip install scrapy

# Install additional useful packages
pip install scrapy-splash            # For JavaScript rendering
pip install scrapy-user-agents       # For rotating user agents
pip install scrapy-rotating-proxies  # For proxy rotation
pip install itemadapter              # For item processing
pip install pymongo                  # For MongoDB storage
pip install psycopg2-binary          # For PostgreSQL storage
pip install redis                    # For Redis-based deduplication

# Development tools (the Scrapy shell itself ships with Scrapy)
pip install ipython                  # Better REPL, also used by the Scrapy shell
pip install black                    # Code formatting
pip install flake8                   # Linting

# Save requirements
pip freeze > requirements.txt
```
Step 3: Create Project Structure
```bash
# Create a new Scrapy project
scrapy startproject webscraper

# Navigate to the project directory
cd webscraper

# Create additional directories for organization
mkdir -p data/raw data/processed data/exports
mkdir -p logs
mkdir -p scripts
mkdir -p tests
```
Your project structure should look like this:
```
webscraper/
├── scrapy.cfg              # Deploy configuration
├── requirements.txt        # Python dependencies
├── data/                   # Data storage
│   ├── raw/                # Raw scraped data
│   ├── processed/          # Cleaned data
│   └── exports/            # Final exports
├── logs/                   # Log files
├── scripts/                # Utility scripts
├── tests/                  # Test files
└── webscraper/             # Main package
    ├── __init__.py
    ├── items.py            # Item definitions
    ├── middlewares.py      # Custom middlewares
    ├── pipelines.py        # Data processing pipelines
    ├── settings.py         # Project settings
    └── spiders/            # Spider modules
        └── __init__.py
```
Building Your First Spider
Step 1: Define Data Items
First, let’s define what data we want to extract. We’ll build a spider for scraping e-commerce products:
```python
# webscraper/items.py
import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags


def clean_price(value):
    """Clean a price string (or number) and convert it to a float."""
    if value is None:
        return None
    # Remove currency symbols, thousands separators and whitespace
    cleaned = ''.join(char for char in str(value) if char.isdigit() or char == '.')
    try:
        return float(cleaned)
    except ValueError:
        return None


def clean_text(value):
    """Clean text by removing HTML tags and extra whitespace."""
    if value:
        cleaned = remove_tags(value).strip()
        return ' '.join(cleaned.split())
    return None


class ProductItem(scrapy.Item):
    # Basic product information
    name = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst(),
    )
    price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst(),
    )
    original_price = scrapy.Field(
        input_processor=MapCompose(clean_price),
        output_processor=TakeFirst(),
    )
    currency = scrapy.Field(output_processor=TakeFirst())
    description = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=Join(' '),
    )

    # Product details
    brand = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst(),
    )
    category = scrapy.Field(
        input_processor=MapCompose(clean_text),
        output_processor=TakeFirst(),
    )
    sku = scrapy.Field(output_processor=TakeFirst())
    availability = scrapy.Field(output_processor=TakeFirst())
    rating = scrapy.Field(output_processor=TakeFirst())
    review_count = scrapy.Field(output_processor=TakeFirst())

    # Images and media
    images = scrapy.Field()

    # Metadata
    url = scrapy.Field(output_processor=TakeFirst())
    scraped_at = scrapy.Field(output_processor=TakeFirst())

    # Additional fields for tracking
    source = scrapy.Field(output_processor=TakeFirst())
```
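As a quick sanity check of the cleaning helpers before wiring them into a spider (the input strings below are illustrative):

```python
# Expected behaviour of the processors defined above:
assert clean_price("$1,299.99") == 1299.99   # currency symbol and comma stripped
assert clean_price("N/A") is None            # unparseable prices become None
assert clean_text("<p>  Hello   <b>world</b> </p>") == "Hello world"
```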
Step 2: Create Your First Spider
Now let’s create a robust spider that demonstrates best practices:
```python
# webscraper/spiders/ecommerce_spider.py
import json
import re
from datetime import datetime

import scrapy
from scrapy.loader import ItemLoader

from webscraper.items import ProductItem


class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'
    allowed_domains = ['example-store.com']

    # Custom settings for this spider
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_START_DELAY': 0.5,
        'AUTOTHROTTLE_MAX_DELAY': 3,
        'AUTOTHROTTLE_TARGET_CONCURRENCY': 2.0,
        'FEEDS': {
            'data/raw/products_%(time)s.json': {
                'format': 'json',
                'encoding': 'utf8',
                'store_empty': False,
                'fields': None,
                'indent': 2,
            },
        },
    }

    def start_requests(self):
        """Generate initial requests."""
        start_urls = [
            'https://example-store.com/products',
            'https://example-store.com/categories/electronics',
            'https://example-store.com/categories/clothing',
        ]

        for url in start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                meta={
                    'source': 'category_page',
                    # Only needed for JavaScript-heavy pages, and only if
                    # scrapy-playwright is installed and configured:
                    # 'playwright': True,
                },
            )

    def parse(self, response):
        """Parse category pages and extract product links."""
        self.logger.info(f'Parsing category page: {response.url}')

        # Extract product links using CSS selectors
        product_links = response.css('.product-item a::attr(href)').getall()

        if not product_links:
            # Try an alternative selector
            product_links = response.css('.product-link::attr(href)').getall()

        # Follow product links
        for link in product_links:
            product_url = response.urljoin(link)
            yield scrapy.Request(
                url=product_url,
                callback=self.parse_product,
                meta={
                    'source': 'product_page',
                    'category_url': response.url,
                },
            )

        # Follow pagination
        next_page = response.css('.pagination .next::attr(href)').get()
        if next_page:
            yield scrapy.Request(
                url=response.urljoin(next_page),
                callback=self.parse,
                meta=response.meta,
            )

    def parse_product(self, response):
        """Parse individual product pages."""
        self.logger.info(f'Parsing product: {response.url}')

        # Create an item loader for clean data extraction
        loader = ItemLoader(item=ProductItem(), response=response)

        # Basic product information
        loader.add_css('name', 'h1.product-title::text')
        loader.add_css('name', '.product-name::text')  # Fallback selector

        # Price extraction with multiple selectors
        loader.add_css('price', '.price-current::text')
        loader.add_css('price', '.current-price::text')
        loader.add_xpath('price', '//span[@class="price"]//text()')

        # Original price (if on sale)
        loader.add_css('original_price', '.price-original::text')
        loader.add_css('original_price', '.old-price::text')

        # Product description
        loader.add_css('description', '.product-description p::text')
        loader.add_xpath('description', '//div[@class="description"]//text()')

        # Product details
        loader.add_css('brand', '.brand-name::text')
        loader.add_css('category', '.breadcrumb li:last-child::text')
        loader.add_css('sku', '.product-sku::text')

        # Availability
        availability = response.css('.stock-status::text').get()
        if availability:
            loader.add_value(
                'availability',
                'in_stock' if 'in stock' in availability.lower() else 'out_of_stock',
            )

        # Rating and reviews
        rating = response.css('.rating-value::text').get()
        if rating:
            try:
                loader.add_value('rating', float(rating.strip()))
            except ValueError:
                self.logger.debug(f'Could not parse rating: {rating!r}')

        review_count_text = response.css('.review-count::text').get()
        if review_count_text:
            review_count = re.search(r'(\d+)', review_count_text)
            if review_count:
                loader.add_value('review_count', int(review_count.group(1)))

        # Images
        image_urls = response.css('.product-images img::attr(src)').getall()
        if image_urls:
            # Convert relative URLs to absolute
            absolute_urls = [response.urljoin(url) for url in image_urls]
            loader.add_value('images', absolute_urls)

        # Metadata
        loader.add_value('url', response.url)
        loader.add_value('scraped_at', datetime.now().isoformat())
        loader.add_value('source', response.meta.get('source', 'unknown'))

        # Extract structured data if available
        structured_data = self.extract_structured_data(response)
        if structured_data:
            self.update_loader_from_structured_data(loader, structured_data)

        yield loader.load_item()

    def extract_structured_data(self, response):
        """Extract JSON-LD structured data."""
        scripts = response.xpath('//script[@type="application/ld+json"]/text()').getall()

        for script in scripts:
            try:
                data = json.loads(script)
                if isinstance(data, dict) and data.get('@type') == 'Product':
                    return data
                elif isinstance(data, list):
                    for item in data:
                        if isinstance(item, dict) and item.get('@type') == 'Product':
                            return item
            except json.JSONDecodeError:
                continue

        return None

    def update_loader_from_structured_data(self, loader, data):
        """Update the item loader with JSON-LD structured data."""
        if 'name' in data:
            loader.add_value('name', data['name'])

        if 'offers' in data and isinstance(data['offers'], dict):
            offer = data['offers']
            if 'price' in offer:
                loader.add_value('price', offer['price'])  # cleaned by clean_price
            if 'priceCurrency' in offer:
                loader.add_value('currency', offer['priceCurrency'])
            if 'availability' in offer:
                availability = offer['availability'].split('/')[-1].lower()
                loader.add_value('availability', availability)

        if 'brand' in data:
            brand = data['brand']
            if isinstance(brand, dict) and 'name' in brand:
                loader.add_value('brand', brand['name'])
            elif isinstance(brand, str):
                loader.add_value('brand', brand)

        if 'aggregateRating' in data:
            rating_data = data['aggregateRating']
            if 'ratingValue' in rating_data:
                loader.add_value('rating', float(rating_data['ratingValue']))
            if 'reviewCount' in rating_data:
                loader.add_value('review_count', int(rating_data['reviewCount']))
```
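Before running the full spider, it helps to try selectors interactively in the Scrapy shell (bundled with Scrapy). The URL below is the placeholder domain used throughout this tutorial; substitute a real page you are allowed to scrape:

```bash
# Open an interactive shell against a page
scrapy shell "https://example-store.com/products"

# Then experiment inside the shell, e.g.:
#   >>> response.css('.product-item a::attr(href)').getall()
#   >>> response.css('h1.product-title::text').get()
#   >>> response.xpath('//script[@type="application/ld+json"]/text()').getall()
```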
Step 3: Data Processing Pipeline
Create a pipeline to process and validate extracted data:
```python
# webscraper/pipelines.py
import json
from datetime import datetime

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    """Validate and clean scraped items."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Validate required fields
        required_fields = ['name', 'url']
        for field in required_fields:
            if not adapter.get(field):
                raise DropItem(f"Missing required field: {field}")

        # Clean and validate price
        price = adapter.get('price')
        if price is not None:
            if not isinstance(price, (int, float)) or price < 0:
                spider.logger.warning(f"Invalid price for {adapter['name']}: {price}")
                adapter['price'] = None

        # Validate rating
        rating = adapter.get('rating')
        if rating is not None:
            if not isinstance(rating, (int, float)) or not (0 <= rating <= 5):
                spider.logger.warning(f"Invalid rating for {adapter['name']}: {rating}")
                adapter['rating'] = None

        return item


class DuplicationFilterPipeline:
    """Filter out duplicate items."""

    def __init__(self):
        self.seen_items = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Create a unique identifier for the item
        identifier = f"{adapter['name']}_{adapter['url']}"

        if identifier in self.seen_items:
            raise DropItem(f"Duplicate item found: {adapter['name']}")
        self.seen_items.add(identifier)
        return item


class JsonWriterPipeline:
    """Write items to a JSON file."""

    def __init__(self):
        self.file = None
        self.items = []

    def open_spider(self, spider):
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        filename = f"data/processed/{spider.name}_{timestamp}.json"
        self.file = open(filename, 'w', encoding='utf-8')
        spider.logger.info(f"Opened file: {filename}")

    def close_spider(self, spider):
        if self.file:
            json.dump(self.items, self.file, indent=2, ensure_ascii=False)
            self.file.close()
            spider.logger.info(f"Saved {len(self.items)} items")

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.items.append(dict(adapter))
        return item


class StatisticsPipeline:
    """Collect scraping statistics."""

    def __init__(self):
        self.stats = {
            'items_scraped': 0,
            'items_dropped': 0,
            'start_time': None,
            'end_time': None,
        }

    def open_spider(self, spider):
        self.stats['start_time'] = datetime.now()
        spider.logger.info("Statistics collection started")

    def close_spider(self, spider):
        self.stats['end_time'] = datetime.now()
        duration = self.stats['end_time'] - self.stats['start_time']

        spider.logger.info("=== SCRAPING STATISTICS ===")
        spider.logger.info(f"Items scraped: {self.stats['items_scraped']}")
        spider.logger.info(f"Items dropped: {self.stats['items_dropped']}")
        spider.logger.info(f"Duration: {duration}")
        minutes = max(duration.total_seconds() / 60, 1e-9)  # avoid division by zero
        spider.logger.info(f"Items per minute: {self.stats['items_scraped'] / minutes:.2f}")

    def process_item(self, item, spider):
        self.stats['items_scraped'] += 1
        return item
```
Step 4: Configure Settings
Update your settings for optimal performance:
```python
# Scrapy settings for the webscraper project
BOT_NAME = 'webscraper'

SPIDER_MODULES = ['webscraper.spiders']
NEWSPIDER_MODULE = 'webscraper.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure pipelines
ITEM_PIPELINES = {
    'webscraper.pipelines.ValidationPipeline': 300,
    'webscraper.pipelines.DuplicationFilterPipeline': 400,
    'webscraper.pipelines.JsonWriterPipeline': 500,
    'webscraper.pipelines.StatisticsPipeline': 600,
}

# Configure delays and throttling
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True

# AutoThrottle settings
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.5
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0
AUTOTHROTTLE_DEBUG = False

# User agent settings
USER_AGENT = 'webscraper (+http://www.yourdomain.com)'

# Configure caching
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
HTTPCACHE_DIR = 'httpcache'

# Logging settings
LOG_LEVEL = 'INFO'
LOG_FILE = 'logs/scrapy.log'

# Retry settings
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Concurrent requests
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# Memory usage optimization
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1024

# Request and response size limits
DOWNLOAD_MAXSIZE = 1073741824  # 1 GB
DOWNLOAD_WARNSIZE = 33554432   # 32 MB
```
Running Your Spider
Basic Execution
```bash
# Run the spider
scrapy crawl ecommerce

# Run with custom settings
scrapy crawl ecommerce -s DOWNLOAD_DELAY=2

# Save output to a specific file
scrapy crawl ecommerce -o products.json

# Run with a custom log level
scrapy crawl ecommerce -L DEBUG
```
Advanced Execution with Parameters
```python
# A script to run spiders with custom parameters
import subprocess
from datetime import datetime


def run_spider(spider_name, **kwargs):
    """Run a spider with custom settings via the Scrapy CLI."""
    cmd = ['scrapy', 'crawl', spider_name]

    # Add custom settings
    for key, value in kwargs.items():
        cmd.extend(['-s', f'{key}={value}'])

    # Add a timestamped output file
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f'data/raw/{spider_name}_{timestamp}.json'
    cmd.extend(['-o', output_file])

    print(f"Running command: {' '.join(cmd)}")

    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
        print("Spider completed successfully!")
        print(f"Output saved to: {output_file}")
        return True
    except subprocess.CalledProcessError as e:
        print(f"Spider failed with error: {e}")
        print(f"Error output: {e.stderr}")
        return False


if __name__ == "__main__":
    # Example usage
    run_spider(
        'ecommerce',
        DOWNLOAD_DELAY=1.5,
        CONCURRENT_REQUESTS=8,
        LOG_LEVEL='INFO',
    )
```
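If you would rather stay inside Python than shell out to the CLI, Scrapy's `CrawlerProcess` API can run the spider in-process. A minimal sketch using the project settings (the override names and values are just examples):

```python
# Run the spider in-process instead of via subprocess.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def run_in_process(spider_name='ecommerce', **overrides):
    settings = get_project_settings()
    settings.update(overrides)      # e.g. DOWNLOAD_DELAY=1.5
    process = CrawlerProcess(settings)
    process.crawl(spider_name)      # spider is looked up by name in the project
    process.start()                 # blocks until the crawl finishes


if __name__ == "__main__":
    run_in_process(DOWNLOAD_DELAY=1.5, LOG_LEVEL='INFO')
```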
Testing Your Spider
Unit Tests
```python
import unittest

from scrapy.http import HtmlResponse, Request

from webscraper.spiders.ecommerce_spider import EcommerceSpider


class TestEcommerceSpider(unittest.TestCase):

    def setUp(self):
        self.spider = EcommerceSpider()

    def test_parse_product(self):
        """Test product parsing."""
        # Sample HTML response
        html = """
        <html>
          <body>
            <h1 class="product-title">Test Product</h1>
            <span class="price-current">$99.99</span>
            <p class="product-description">This is a test product</p>
          </body>
        </html>
        """

        request = Request(url='http://example.com/product/1')
        response = HtmlResponse(
            url='http://example.com/product/1',
            request=request,
            body=html.encode('utf-8'),
        )

        # Process the response
        items = list(self.spider.parse_product(response))

        # Assertions
        self.assertEqual(len(items), 1)
        item = items[0]
        self.assertEqual(item['name'], 'Test Product')
        self.assertEqual(item['price'], 99.99)


if __name__ == '__main__':
    unittest.main()
```
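Beyond unit tests, Scrapy has built-in spider contracts, which exercise a callback against a live URL when you run `scrapy check`. A sketch of what annotating `parse_product` could look like; the URL is a placeholder and must point to a real, reachable product page for the check to run:

```python
def parse_product(self, response):
    """Parse individual product pages.

    @url https://example-store.com/products/sample-item
    @returns items 1 1
    @scrapes name url scraped_at
    """
    ...
```

Running `scrapy check ecommerce` then fetches the contract URL and verifies that the callback returns exactly one item with the listed fields populated.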
Best Practices and Tips
1. Robust Selector Strategy
```python
def extract_with_fallbacks(response, selectors):
    """Extract data with multiple fallback selectors."""
    for selector in selectors:
        result = response.css(selector).get()
        if result:
            return result.strip()
    return None


# Usage example
price = extract_with_fallbacks(response, [
    '.price-current::text',
    '.current-price::text',
    '.price::text',
    '[data-price]::attr(data-price)',
])
```
2. Error Handling
```python
def safe_extract_float(value, default=None):
    """Safely extract a float from a string."""
    if not value:
        return default

    try:
        # Keep only digits, the decimal point and minus signs
        cleaned = ''.join(char for char in str(value) if char.isdigit() or char in '.-')
        return float(cleaned)
    except (ValueError, TypeError):
        return default
```
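A few illustrative inputs and the values the helper returns:

```python
safe_extract_float("$1,299.00")     # 1299.0  -> symbols and commas stripped
safe_extract_float("out of stock")  # None    -> nothing numeric to extract
safe_extract_float(None, 0.0)       # 0.0     -> falls back to the default
```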
3. Logging and Monitoring
```python
# Add custom logging and simple counters to your spider
import scrapy


class EcommerceSpider(scrapy.Spider):
    name = 'ecommerce'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.stats = {
            'products_found': 0,
            'products_processed': 0,
            'errors': 0,
        }

    def parse_product(self, response):
        try:
            self.stats['products_found'] += 1
            # ... processing logic ...
            self.stats['products_processed'] += 1
        except Exception as e:
            self.stats['errors'] += 1
            self.logger.error(f"Error processing {response.url}: {e}")

    def closed(self, reason):
        self.logger.info(f"Spider closed: {reason}")
        self.logger.info(f"Statistics: {self.stats}")
```
Summary and Next Steps
Congratulations! You’ve completed Part 1 of the Web Scraping with Scrapy series. You now have:
✅ Understanding of web scraping ethics and best practices
✅ Professional Scrapy environment setup
✅ Production-ready spider with proper architecture
✅ Data extraction using selectors and structured data
✅ Processing pipelines for data validation and storage
✅ Testing framework for reliable spiders
What’s Next?
In Part 2: Advanced Scraping Techniques, we’ll explore:
- Handling JavaScript-heavy websites with Splash
- Form submission and login handling
- Advanced selector techniques and data extraction
- Handling AJAX requests and dynamic content
- Custom middleware development
Practice Exercise
Before moving to Part 2, try building a spider for your favorite e-commerce site:
- Create a new spider targeting a simple e-commerce site
- Extract product names, prices, and descriptions
- Implement proper error handling and logging
- Add data validation pipelines
- Test your spider with different product categories
Resources
Happy scraping! 🕷️