Installation¶
System Requirements¶
- Python: 3.13 or higher (required for modern async features)
- Operating System: Windows 11, macOS, or Linux
- Memory: Minimum 2GB RAM (4GB+ recommended for browser automation)
- Disk Space: 500MB for basic installation, 2GB+ for full development setup
Installation Methods¶
From PyPI (Production Use)¶
When Scrap-E is published to PyPI:
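The install command for this step appears to have been lost in extraction; assuming the package will be published under the name `scrap-e` (an assumption until it is actually on PyPI), installation would look like:

```bash
# Install from PyPI (package name assumed; not yet published)
pip install scrap-e

# Or with uv
uv pip install scrap-e
```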
From Source (Current Development)¶
For the latest features and development:
```bash
# Clone the repository
git clone https://github.com/beelzer/scrap-e.git
cd scrap-e

# Install with uv (recommended)
uv sync --dev

# Or install with pip
pip install -e ".[dev]"
```
Post-Installation Setup¶
After installing Scrap-E, complete the setup with these steps:
1. Install Pre-commit Hooks (Development)¶
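The command for this step did not survive extraction; with the dev dependencies installed, the standard pre-commit setup is:

```bash
# Register the git hooks defined in .pre-commit-config.yaml
pre-commit install
```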
2. Install Playwright Browsers (Required for Browser Scraping)¶
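The code block for this step was lost in extraction; running `playwright install` with no arguments downloads all supported browsers:

```bash
# Download Chromium, Firefox, and WebKit for Playwright
playwright install
```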
This installs Chromium, Firefox, and WebKit browsers for automated scraping.
3. Verify Installation¶
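The command for this step was lost in extraction; as shown in the CLI Verification section below, the diagnostic command is:

```bash
# Check system configuration and report any issues
scrap-e doctor
```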
This command checks your system configuration and reports any issues.
Optional Dependencies¶
Browser Automation¶
Scrap-E supports multiple browser engines via Playwright:
```bash
# Install specific browsers
playwright install chromium
playwright install firefox
playwright install webkit
```
Database Support¶
For database scraping, install additional drivers:
```bash
# PostgreSQL
pip install asyncpg psycopg2-binary

# MySQL
pip install aiomysql pymysql

# MongoDB
pip install motor pymongo

# Redis (note: aioredis is deprecated; its async API now lives in redis>=4.2)
pip install redis aioredis
```
File Processing¶
For advanced file processing capabilities:
```bash
# PDF processing
pip install pypdf pymupdf

# Excel files
pip install openpyxl xlrd

# Image processing with OCR
pip install pillow pytesseract
```
Development Dependencies¶
The full development environment (installed via `uv sync --dev` or `pip install -e ".[dev]"`) includes:
Testing Tools:
- pytest, pytest-asyncio, pytest-cov
- pytest-benchmark for performance testing
- pytest-mock for mocking
Code Quality:
- ruff for linting and formatting
- mypy for type checking
- pre-commit for git hooks
- bandit for security scanning
Documentation:
- mkdocs-material for documentation
- mkdocstrings for API docs
Environment Variables¶
Configure Scrap-E using environment variables:
```bash
# Core settings
export SCRAPER_DEBUG=true
export SCRAPER_LOG_LEVEL=INFO
export SCRAPER_OUTPUT_DIR=/path/to/output

# Browser settings
export SCRAPER_BROWSER_TYPE=chromium
export SCRAPER_BROWSER_HEADLESS=true

# Database connections
export SCRAPER_POSTGRES_URL=postgresql://user:pass@host:5432/db
export SCRAPER_MONGODB_URL=mongodb://localhost:27017/db
export SCRAPER_REDIS_URL=redis://localhost:6379/0
```
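As a sketch of how these variables could be consumed from Python (Scrap-E's own settings loader may differ, e.g. it could use pydantic-settings):

```python
import os

# Read a boolean flag with a safe default
debug = os.environ.get("SCRAPER_DEBUG", "false").lower() == "true"

# Read a plain string setting with a fallback
log_level = os.environ.get("SCRAPER_LOG_LEVEL", "INFO")

print(debug, log_level)
```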
Docker Installation¶
Run Scrap-E in a containerized environment:
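The Docker commands for this section were lost in extraction; assuming a Dockerfile at the repository root and an image tagged `scrap-e` (the tag matches the troubleshooting example below, but is otherwise an assumption), a typical workflow would be:

```bash
# Build the image from the repository root (Dockerfile assumed to exist there)
docker build -t scrap-e .

# Run the CLI inside the container
docker run --rm scrap-e scrap-e doctor
```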
The Docker setup includes all dependencies and browsers pre-installed.
Verification¶
Basic Verification¶
```python
import asyncio

import scrap_e
from scrap_e.scrapers.web import HttpScraper

print(f"Scrap-E version: {scrap_e.__version__}")

# Test HTTP scraping
async def test():
    scraper = HttpScraper()
    result = await scraper.scrape("https://httpbin.org/json")
    print("HTTP scraping:", "✓" if result.success else "✗")

asyncio.run(test())
```
CLI Verification¶
```bash
# Check version
scrap-e --version

# System diagnostic
scrap-e doctor

# Test scraping
scrap-e scrape https://httpbin.org/json
```
Browser Scraping Verification¶
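The example for this section was lost in extraction. Assuming a browser-backed scraper exposed alongside `HttpScraper` with the same `scrape()` interface (the `BrowserScraper` name is an assumption), a check might look like:

```python
import asyncio

# Class name is an assumption; check the package for the actual export
from scrap_e.scrapers.web import BrowserScraper

async def test_browser():
    scraper = BrowserScraper()
    # Fetch a page through a real browser engine (requires Playwright browsers)
    result = await scraper.scrape("https://httpbin.org/html")
    print("Browser scraping:", "✓" if result.success else "✗")

asyncio.run(test_browser())
```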
Troubleshooting¶
Common Issues¶
Playwright not found:
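The fix for this issue was lost in extraction; the standard remedy is to install the Playwright package and its browsers:

```bash
pip install playwright
playwright install
```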
Permission errors on Windows:
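The suggested fix here was also lost in extraction; a common workaround (an assumption, not the document's original advice) is a per-user install, or running the terminal as Administrator:

```bash
# Install into the per-user site-packages instead of a system location
pip install --user -e ".[dev]"
```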
Browser automation fails:
```bash
# Install system dependencies (Linux)
sudo apt-get install libnss3 libatk-bridge2.0-0 libxss1 libasound2

# Or use our Docker image
docker run --rm scrap-e scrap-e doctor
```
Getting Help¶
If you encounter issues:
- Run `scrap-e doctor` for diagnostic information
- Check the troubleshooting guide
- Search GitHub issues
- Create a new issue with diagnostic output
Next Steps¶
- Read the Quick Start Guide to create your first scraper
- Learn about Configuration options
- Explore the User Guide for detailed examples
- Check out the CLI documentation for command-line usage