Async Support
SpiderChef is built from the ground up with asynchronous programming support, allowing for efficient handling of I/O-bound operations like web requests. This guide explains how to use SpiderChef's async capabilities.
Async Execution Model
SpiderChef uses Python's asyncio framework for asynchronous execution. Calling recipe.cook() returns a coroutine; no work happens until that coroutine is awaited or passed to asyncio.run():
import asyncio
from spiderchef import Recipe
recipe = Recipe.from_yaml('recipe.yaml')
# Run the recipe asynchronously
result = asyncio.run(recipe.cook())
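When the calling code is already running inside an event loop (for example, inside another coroutine), asyncio.run() cannot be used, and the coroutine is awaited directly instead. The sketch below illustrates this with a hypothetical StubRecipe stand-in rather than a real SpiderChef recipe. It assumes only that cook() is a coroutine, as described above; the class and the cook_all helper are invented for illustration.

```python
import asyncio
from typing import List


class StubRecipe:
    """Hypothetical stand-in for a recipe; it only mimics an async cook()."""

    def __init__(self, name: str) -> None:
        self.name = name

    async def cook(self) -> str:
        await asyncio.sleep(0.01)  # Simulate I/O-bound work
        return f"result of {self.name}"


async def cook_all(recipes: List[StubRecipe]) -> List[str]:
    # We are already inside a coroutine here, so cook() is awaited
    # (via gather) instead of being wrapped in asyncio.run(), which
    # raises RuntimeError when called from a running event loop.
    return await asyncio.gather(*(r.cook() for r in recipes))


if __name__ == "__main__":
    # asyncio.run() is used once, at the synchronous entry point.
    print(asyncio.run(cook_all([StubRecipe("a"), StubRecipe("b")])))
```

The pattern is to call asyncio.run() once at the top of the program and use await everywhere below it.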
AsyncStep vs SyncStep
SpiderChef provides two base classes for creating steps:
- SyncStep: For synchronous operations
- AsyncStep: For asynchronous operations that use await
When to Use AsyncStep
Use AsyncStep when your step needs to:
- Make HTTP requests
- Query databases asynchronously
- Perform any I/O-bound operations
- Use other async functions or libraries
Here's an example of an AsyncStep:
import asyncio
from spiderchef import AsyncStep, Recipe
from typing import Any
class DelayedProcessStep(AsyncStep):
    delay_seconds: int = 1

    async def _execute(self, recipe: Recipe, previous_output: Any = None) -> Any:
        # Simulate some async processing
        await asyncio.sleep(self.delay_seconds)
        # Process the data
        if isinstance(previous_output, list):
            return [item.upper() if isinstance(item, str) else item
                    for item in previous_output]
        elif isinstance(previous_output, str):
            return previous_output.upper()
        return previous_output
When to Use SyncStep
Use SyncStep for CPU-bound operations or operations that don't benefit from asynchronous execution:
- Data transformation
- Regular expression matching
- Mathematical calculations
- Any processing that doesn't involve waiting for external resources
Here's an example of a SyncStep:
from spiderchef import SyncStep, Recipe
from typing import Any
class FilterStep(SyncStep):
    min_length: int = 5

    def _execute(self, recipe: Recipe, previous_output: Any = None) -> Any:
        # Pass non-list input through unchanged
        if not isinstance(previous_output, list):
            return previous_output
        # Keep only strings that are at least min_length characters long
        return [item for item in previous_output
                if isinstance(item, str) and len(item) >= self.min_length]
Mixing Sync and Async Steps
SpiderChef seamlessly handles the mixing of synchronous and asynchronous steps in a recipe. When an async step follows a sync step (or vice versa), SpiderChef handles the transition automatically.
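One common way to run a mixed pipeline is to call each step and await the result only when it is awaitable. The sketch below shows that general technique with plain functions; it is not SpiderChef's actual implementation, and strip_items, shout_items, and run_pipeline are hypothetical names used only for illustration.

```python
import asyncio
import inspect
from typing import Any, Callable, List


def strip_items(items: List[str]) -> List[str]:
    """Synchronous step: pure data transformation."""
    return [item.strip() for item in items]


async def shout_items(items: List[str]) -> List[str]:
    """Asynchronous step: pretends to wait on I/O before transforming."""
    await asyncio.sleep(0.01)
    return [item.upper() for item in items]


async def run_pipeline(steps: List[Callable[[Any], Any]], data: Any) -> Any:
    """Feed each step's output into the next, awaiting only when needed."""
    for step in steps:
        result = step(data)
        # Async functions return an awaitable; sync functions return a value.
        if inspect.isawaitable(result):
            result = await result
        data = result
    return data


if __name__ == "__main__":
    steps = [strip_items, shout_items, strip_items]
    print(asyncio.run(run_pipeline(steps, [" a ", "b "])))
```

Because the runner itself is a coroutine, synchronous steps simply execute inline between the awaited ones, and each step receives the previous step's output regardless of which kind produced it.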
Parallel Execution
For advanced use cases, you can create steps that run several operations concurrently within a single step:
import asyncio
from spiderchef import AsyncStep, Recipe
from typing import Any, List
class ParallelFetchStep(AsyncStep):
    urls: List[str]

    async def _execute(self, recipe: Recipe, previous_output: Any = None) -> List[str]:
        async def fetch_url(url: str) -> str:
            # This would use a proper HTTP client in real code
            await asyncio.sleep(1)  # Simulate network delay
            return f"Content from {url}"

        # Create one coroutine per URL
        tasks = [fetch_url(url) for url in self.urls]
        # Run them concurrently and wait for all of them to complete
        results = await asyncio.gather(*tasks)
        return results
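Two properties of asyncio.gather() are worth checking when writing a step like this: the awaitables run concurrently on one event loop, so the total time is close to the slowest operation rather than the sum, and the results come back in the order the awaitables were passed in, not the order they finished. The following sketch measures both with a hypothetical fake_fetch helper that only sleeps; the paths and delays are made up.

```python
import asyncio
import time
from typing import List, Tuple


async def fake_fetch(url: str, delay: float) -> str:
    """Hypothetical stand-in for a network request."""
    await asyncio.sleep(delay)
    return f"Content from {url}"


async def timed_gather() -> Tuple[List[str], float]:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_fetch("/slow", 0.2),
        fake_fetch("/fast", 0.1),      # Finishes first, but stays second
        fake_fetch("/slow-too", 0.2),
    )
    elapsed = time.perf_counter() - start
    return list(results), elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(timed_gather())
    print(results)
    print(f"elapsed: {elapsed:.2f}s (sequential would be about 0.50s)")
```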
Best Practices for Async Steps
- Use the right base class: Choose AsyncStep for I/O-bound operations and SyncStep for CPU-bound operations.
- Avoid blocking calls: Inside an AsyncStep, avoid blocking operations that would prevent other tasks from running.
- Handle exceptions properly: Use try/except blocks to properly handle exceptions in asynchronous code.
- Consider rate limiting: When making multiple requests, consider implementing rate limiting to avoid overwhelming the target server.
- Use timeouts: Always set timeouts for network operations to prevent infinite waiting.
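Several of these practices can be combined in one helper. The sketch below uses only the standard library: an asyncio.Semaphore caps the number of requests in flight (a simple concurrency limit, not a true requests-per-second rate limiter), asyncio.wait_for() enforces a per-request timeout, and a try/except keeps one failure from aborting the batch. The fake_fetch function, the URLs, and the default limits are hypothetical placeholders for a real HTTP client and real settings.

```python
import asyncio
from typing import List, Optional


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP request."""
    if "slow" in url:
        await asyncio.sleep(1)  # Longer than the timeout used below
    elif "bad" in url:
        raise ValueError(f"cannot fetch {url}")
    await asyncio.sleep(0.01)
    return f"Content from {url}"


async def fetch_all(urls: List[str], max_concurrent: int = 2,
                    timeout: float = 0.1) -> List[Optional[str]]:
    semaphore = asyncio.Semaphore(max_concurrent)  # Caps in-flight requests

    async def guarded(url: str) -> Optional[str]:
        async with semaphore:
            try:
                # Give up on any single request after `timeout` seconds.
                return await asyncio.wait_for(fake_fetch(url), timeout)
            except (asyncio.TimeoutError, ValueError):
                return None  # Record the failure and keep going

    return list(await asyncio.gather(*(guarded(u) for u in urls)))


if __name__ == "__main__":
    print(asyncio.run(fetch_all(["/a", "/slow", "/bad", "/b"])))
```

Returning None for failed URLs keeps the result list aligned with the input list, so a later step can tell which requests failed. Real code would usually log the exception rather than discard it.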
Example: Asynchronous Web Scraper
Here's a complete example of a recipe that uses async steps to scrape multiple pages in parallel:
name: ParallelScraper
base_url: https://example.com
steps:
  - type: fetch
    name: fetch_category_page
    page_type: text
    path: /categories
  - type: xpath
    name: extract_category_urls
    expression: //a[@class='category-link']/@href
  - type: parallel_fetch
    name: fetch_all_categories
    urls: ${previous_output}
  - type: foreach
    name: extract_products_from_categories
    steps:
      - type: xpath
        name: extract_product_names
        expression: //div[@class='product']/h3/text()
  - type: save
    name: save_category_products
    variable: category_products
This recipe:
- Fetches the main category page
- Extracts all category URLs
- Fetches all category pages in parallel
- For each category page, extracts the product names
- Saves the products for each category
The parallel_fetch step would be a custom async step as shown in the ParallelFetchStep example above.