# Quick Start
This guide will help you create your first scraper with Scrap-E.
## Basic Web Scraping

### HTTP Scraper Example
The simplest way to scrape a website is with the `HttpScraper`:
```python
import asyncio

from scrap_e.scrapers.web.http_scraper import HttpScraper


async def scrape_website():
    # Create a scraper instance
    scraper = HttpScraper()

    # Scrape a webpage
    result = await scraper.scrape("https://example.com")

    if result.success:
        print(f"Status: {result.data.status_code}")
        print(f"Content length: {len(result.data.content)}")
        title = result.data.extracted_data.get("title") if result.data.extracted_data else "No title"
        print(f"Title: {title}")

    # Clean up resources
    await scraper._cleanup()


# Run the scraper
asyncio.run(scrape_website())
```
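Once a scrape succeeds, you will often want to keep the raw HTML around for later parsing. The snippet below is a minimal sketch of that, reusing only `HttpScraper.scrape()` and `result.data.content` from the example above; the `save_page` helper, the output path, and the UTF-8 encoding are illustrative choices, not part of Scrap-E.

```python
import asyncio
from pathlib import Path

from scrap_e.scrapers.web.http_scraper import HttpScraper


async def save_page(url: str, destination: Path) -> bool:
    """Scrape a single page and write the raw HTML to disk."""
    scraper = HttpScraper()
    result = await scraper.scrape(url)
    if result.success:
        # result.data.content holds the response body, as in the example above
        destination.write_text(result.data.content, encoding="utf-8")
    await scraper._cleanup()
    return result.success


asyncio.run(save_page("https://example.com", Path("example.html")))
```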
### Extracting Specific Data
Use extraction rules to target specific elements:
```python
import asyncio

from scrap_e.core.models import ExtractionRule
from scrap_e.scrapers.web.http_scraper import HttpScraper


async def extract_data():
    scraper = HttpScraper()

    # Add extraction rules
    scraper.extraction_rules = [
        ExtractionRule(
            name="headline",
            selector="h1.main-title",
            required=True,
        ),
        ExtractionRule(
            name="paragraphs",
            selector="p.content",
            multiple=True,  # Extract all matching elements
        ),
    ]

    result = await scraper.scrape("https://example.com")

    if result.success and result.data.extracted_data:
        data = result.data.extracted_data
        print(f"Headline: {data.get('headline')}")
        print(f"Found {len(data.get('paragraphs', []))} paragraphs")

    await scraper._cleanup()


asyncio.run(extract_data())
```
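Since `result.data.extracted_data` is a plain dictionary, it is easy to persist. The sketch below dumps the extracted fields to a JSON file; the `extract_to_json` helper and the output path are illustrative, and `default=str` is only a guard in case some extracted values are not natively JSON-serializable.

```python
import asyncio
import json
from pathlib import Path

from scrap_e.core.models import ExtractionRule
from scrap_e.scrapers.web.http_scraper import HttpScraper


async def extract_to_json(url: str, output: Path) -> None:
    scraper = HttpScraper()
    scraper.extraction_rules = [
        ExtractionRule(name="headline", selector="h1.main-title", required=True),
        ExtractionRule(name="paragraphs", selector="p.content", multiple=True),
    ]

    result = await scraper.scrape(url)
    if result.success and result.data.extracted_data:
        # Write the extracted dictionary as pretty-printed JSON
        output.write_text(
            json.dumps(result.data.extracted_data, indent=2, default=str),
            encoding="utf-8",
        )

    await scraper._cleanup()


asyncio.run(extract_to_json("https://example.com", Path("extracted.json")))
```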
## Browser-Based Scraping
For JavaScript-heavy sites, use the `BrowserScraper`:
```python
import asyncio

from scrap_e.scrapers.web.browser_scraper import BrowserScraper


async def scrape_with_browser():
    scraper = BrowserScraper()

    # Wait for a specific element to load
    result = await scraper.scrape(
        "https://example.com",
        wait_for_selector="div.dynamic-content",
        capture_screenshot=True,  # Take a screenshot
    )

    if result.success:
        print(f"Page title: {result.data.title}")
        if result.data.screenshot:
            print(f"Screenshot captured: {len(result.data.screenshot)} bytes")

    await scraper._cleanup()


asyncio.run(scrape_with_browser())
```
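Because `result.data.screenshot` is raw image bytes (as the length check above shows), you can write it straight to a file. The sketch below assumes the bytes are a PNG image; the `screenshot_to_file` helper and the output name are illustrative, not part of Scrap-E.

```python
import asyncio
from pathlib import Path

from scrap_e.scrapers.web.browser_scraper import BrowserScraper


async def screenshot_to_file(url: str, output: Path) -> None:
    scraper = BrowserScraper()
    result = await scraper.scrape(url, capture_screenshot=True)

    if result.success and result.data.screenshot:
        # The screenshot is raw image bytes, so it can be written out directly
        output.write_bytes(result.data.screenshot)
        print(f"Saved {len(result.data.screenshot)} bytes to {output}")

    await scraper._cleanup()


asyncio.run(screenshot_to_file("https://example.com", Path("page.png")))
```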
## Handling Multiple Pages
Scrape multiple pages concurrently:
```python
import asyncio

from scrap_e.scrapers.web.http_scraper import HttpScraper


async def scrape_multiple():
    scraper = HttpScraper()
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]

    results = await scraper.scrape_multiple(urls)

    for result in results:
        if result.success:
            print(f"Scraped: {result.data.url}")
        else:
            print(f"Failed: {result.error}")

    await scraper._cleanup()


asyncio.run(scrape_multiple())
```
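If you need tighter control over how many requests run at once, you can issue individual `scrape()` calls yourself and throttle them with a semaphore. This is a sketch built only on `asyncio` primitives and the `scrape()` call shown earlier; it assumes a single `HttpScraper` instance can serve concurrent calls (as `scrape_multiple` suggests), and the limit of 3 is an arbitrary example value.

```python
import asyncio

from scrap_e.scrapers.web.http_scraper import HttpScraper


async def scrape_with_limit(urls: list[str], limit: int = 5) -> list:
    scraper = HttpScraper()
    semaphore = asyncio.Semaphore(limit)

    async def scrape_one(url: str):
        # Only `limit` scrapes run at any one time
        async with semaphore:
            return await scraper.scrape(url)

    results = await asyncio.gather(*(scrape_one(url) for url in urls))
    await scraper._cleanup()
    return results


urls = [f"https://example.com/page{i}" for i in range(1, 11)]
results = asyncio.run(scrape_with_limit(urls, limit=3))
print(f"Scraped {sum(r.success for r in results)} of {len(results)} pages")
```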
## Error Handling
Proper error handling ensures robust scraping:
```python
import asyncio

from scrap_e.core.exceptions import ScraperError, ConnectionError
from scrap_e.scrapers.web.http_scraper import HttpScraper


async def safe_scraping():
    scraper = HttpScraper()
    try:
        result = await scraper.scrape("https://example.com")
        if result.success:
            process_data(result.data)
        else:
            print(f"Scraping failed: {result.error}")
    except ConnectionError as e:
        print(f"Connection error: {e}")
    except ScraperError as e:
        print(f"Scraper error: {e}")
    finally:
        await scraper._cleanup()


def process_data(data):
    # Process your scraped data here
    print(f"Processing data from {data.url}")


asyncio.run(safe_scraping())
```
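Transient network failures are common, so you may also want to retry on `ConnectionError` instead of giving up immediately. The snippet below is an illustrative retry helper built on the exceptions shown above; `retry_scrape`, the attempt count, and the linear backoff are assumptions for the sketch, not part of Scrap-E.

```python
import asyncio

from scrap_e.core.exceptions import ConnectionError
from scrap_e.scrapers.web.http_scraper import HttpScraper


async def retry_scrape(url: str, attempts: int = 3, delay: float = 1.0):
    """Retry a scrape on connection errors with a simple linear backoff."""
    scraper = HttpScraper()
    try:
        for attempt in range(1, attempts + 1):
            try:
                return await scraper.scrape(url)
            except ConnectionError as e:
                if attempt == attempts:
                    raise  # Out of attempts, let the caller handle it
                wait = delay * attempt
                print(f"Attempt {attempt} failed ({e}), retrying in {wait:.1f}s")
                await asyncio.sleep(wait)
    finally:
        await scraper._cleanup()


result = asyncio.run(retry_scrape("https://example.com"))
print(f"Success: {result.success}")
```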
## Next Steps
- Learn about Configuration options
- Explore Web Scraping in depth
- Check out API Scraping
- See the API Reference for detailed documentation