Contributing

Thank you for your interest in contributing to Scrap-E! This guide will help you get started.

Quick Start

  1. Fork the repository on GitHub
  2. Clone your fork locally
  3. Set up the development environment
  4. Create a feature branch
  5. Make your changes
  6. Run tests and quality checks
  7. Submit a pull request

Development Environment Setup

Prerequisites

  • Python 3.13 or higher
  • Git
  • UV (recommended) or pip
  • Docker Desktop (optional, for testing)

Clone and Setup

# Clone your fork
git clone https://github.com/YOUR_USERNAME/scrap-e.git
cd scrap-e

# Add upstream remote
git remote add upstream https://github.com/beelzer/scrap-e.git

# Install with development dependencies
uv sync --dev

# Or with pip
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Install Playwright browsers
playwright install

# Verify setup
scrap-e doctor

Development Dependencies

The development environment includes:

Core Tools:

  • pytest for testing
  • mypy for type checking
  • ruff for linting and formatting
  • pre-commit for git hooks

Testing Tools:

  • pytest-asyncio for async testing
  • pytest-cov for coverage
  • pytest-benchmark for performance testing
  • pytest-mock for mocking

Quality Assurance:

  • bandit for security scanning
  • safety for dependency vulnerability checks
  • vulture for dead code detection

Code Standards

Python Style Guide

We follow strict code quality standards:

  • PEP 8 compliance enforced by ruff
  • Type hints required for all functions and methods
  • Docstrings required for all public functions and classes
  • Async/await preferred over callbacks
  • Modern Python features (Python 3.13+)
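
For example, a small function in this style (the function and names are illustrative, not part of the codebase):

import httpx


async def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code for a URL."""
    async with httpx.AsyncClient(timeout=timeout) as client:
        response = await client.get(url)
        return response.status_code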

Code Formatting

Formatting is handled automatically by ruff:

# Format code
ruff format

# Check and fix linting issues
ruff check --fix

# Type checking
mypy src

Import Organization

Imports should be organized as follows:

"""Module docstring."""

# Standard library
import asyncio
import json
from pathlib import Path
from typing import Any

# Third-party packages  
import httpx
from pydantic import BaseModel

# Local imports
from scrap_e.core.base_scraper import BaseScraper
from scrap_e.core.models import ScraperResult

Naming Conventions

  • Classes: PascalCase (HttpScraper, ScraperConfig)
  • Functions/Methods: snake_case (scrape_url, get_data)
  • Variables: snake_case (user_agent, max_retries)
  • Constants: UPPER_SNAKE_CASE (DEFAULT_TIMEOUT, MAX_WORKERS)
  • Private: Leading underscore (_internal_method)
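
Put together (hypothetical names, for illustration only):

DEFAULT_TIMEOUT: float = 30.0  # Constant: UPPER_SNAKE_CASE


class RetryPolicy:  # Class: PascalCase
    def next_delay(self, attempt: int) -> float:  # Method: snake_case
        base_delay = self._backoff(attempt)  # Variable: snake_case
        return min(base_delay, DEFAULT_TIMEOUT)

    def _backoff(self, attempt: int) -> float:  # Private: leading underscore
        return 2.0 ** attempt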

Documentation Standards

All public functions and classes must have docstrings:

async def scrape_multiple(
    self,
    sources: list[str],
    max_concurrent: int | None = None,
    **kwargs: Any,
) -> list[ScraperResult[T]]:
    """
    Scrape multiple sources concurrently.

    Args:
        sources: List of sources to scrape
        max_concurrent: Maximum number of concurrent operations
        **kwargs: Additional arguments passed to each scrape operation

    Returns:
        List of ScraperResult objects

    Raises:
        ScraperError: If configuration is invalid
        ConnectionError: If network issues occur

    Example:
        ```python
        scraper = HttpScraper()
        results = await scraper.scrape_multiple([
            "https://site1.com",
            "https://site2.com"
        ])
        ```
    """

Testing

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=scrap_e

# Run specific test categories
pytest -m "not slow"          # Skip slow tests
pytest -m integration         # Run integration tests only
pytest -m performance         # Run performance tests only

# Run tests in parallel
pytest -n auto
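
The custom markers used above must be registered or pytest warns about unknown marks (the benchmark marker is provided by pytest-benchmark itself). If the project does not already declare them in its pytest configuration, a minimal conftest.py sketch:

# tests/conftest.py — a sketch; the project may already register these
# markers in its pytest configuration.
def pytest_configure(config) -> None:
    """Register the custom markers used in the test suite."""
    for marker in ("slow", "integration", "performance", "network"):
        config.addinivalue_line("markers", f"{marker}: see CONTRIBUTING for usage")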

Test Organization

Tests are organized by category:

tests/
├── unit/                    # Unit tests
│   ├── core/               # Core functionality tests
│   └── scrapers/           # Scraper-specific tests
├── integration/            # Integration tests
├── performance/            # Performance benchmarks
└── fixtures/               # Test fixtures and data

Writing Tests

Unit Tests

"""Test HTTP scraper functionality."""

import pytest
from scrap_e.scrapers.web import HttpScraper
from scrap_e.core.models import ExtractionRule


class TestHttpScraper:
    """Test suite for HttpScraper."""

    @pytest.fixture
    def scraper(self) -> HttpScraper:
        """Create a test scraper instance."""
        # A plain (sync) fixture suffices here; nothing is awaited during setup.
        return HttpScraper()

    async def test_basic_scraping(self, scraper: HttpScraper) -> None:
        """Test basic scraping functionality."""
        result = await scraper.scrape("https://httpbin.org/json")

        assert result.success
        assert result.data is not None
        assert result.data.status_code == 200
        assert "application/json" in result.data.headers.get("content-type", "")

    async def test_extraction_rules(self, scraper: HttpScraper) -> None:
        """Test data extraction with rules."""
        scraper.add_extraction_rule(ExtractionRule(
            name="title",
            selector="h1",
            required=True
        ))

        result = await scraper.scrape("https://example.com")

        assert result.success
        assert "title" in result.data.extracted_data

Integration Tests

"""Integration tests for web scraping."""

import pytest
from scrap_e.scrapers.web import HttpScraper


@pytest.mark.integration
@pytest.mark.network
class TestWebScrapingIntegration:
    """Integration tests requiring network access."""

    async def test_real_website_scraping(self) -> None:
        """Test scraping a real website."""
        scraper = HttpScraper()

        result = await scraper.scrape("https://httpbin.org/headers")

        assert result.success
        assert result.data.status_code == 200

Performance Tests

"""Performance tests and benchmarks."""

import pytest
from scrap_e.scrapers.web import HttpScraper


@pytest.mark.benchmark
class TestScrapingPerformance:
    """Performance benchmarks."""

    async def test_concurrent_scraping_performance(self, benchmark) -> None:
        """Benchmark concurrent scraping performance."""
        scraper = HttpScraper()
        urls = [f"https://httpbin.org/delay/{i}" for i in range(5)]

        result = await benchmark(scraper.scrape_multiple, urls)

        assert len(result) == 5
        assert all(r.success for r in result)

Test Fixtures

Create reusable fixtures in tests/fixtures/:

"""Common test fixtures."""

import pytest
from pathlib import Path


@pytest.fixture
def sample_html() -> str:
    """Sample HTML content for testing."""
    return """
    <html>
        <head><title>Test Page</title></head>
        <body>
            <h1>Main Title</h1>
            <p class="content">Test content</p>
        </body>
    </html>
    """


@pytest.fixture
def temp_config_file(tmp_path: Path) -> Path:
    """Create a temporary configuration file."""
    config_file = tmp_path / "config.yaml"
    config_file.write_text("""
    debug: true
    log_level: DEBUG
    default_timeout: 10.0
    """)
    return config_file
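
Fixtures defined outside a conftest.py are not discovered automatically; expose them from the top-level conftest.py. A sketch, assuming the fixtures above live in tests/fixtures/common.py (the module path is hypothetical):

# tests/conftest.py
pytest_plugins = ["tests.fixtures.common"]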

Pull Request Process

Before Submitting

  1. Create a feature branch:

git checkout -b feature/your-feature-name

  2. Make your changes following code standards

  3. Run the full test suite:

pytest

  4. Run quality checks:

pre-commit run --all-files

  5. Update documentation if needed

  6. Add/update tests for your changes

Submitting the PR

  1. Push your branch:

git push origin feature/your-feature-name

  2. Create a pull request with:

  • Clear title describing the change
  • Detailed description explaining what and why
  • Link to related issues
  • Screenshots for UI changes
  • Breaking changes clearly marked

PR Template

## Description

Brief description of the changes.

## Type of Change

- [ ] Bug fix (non-breaking change)
- [ ] New feature (non-breaking change)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update

## Testing

- [ ] Unit tests pass
- [ ] Integration tests pass
- [ ] Added tests for new functionality
- [ ] Manual testing performed

## Checklist

- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings introduced
- [ ] Performance impact considered

## Related Issues

Closes #(issue number)

Review Process

  1. Automated checks must pass (CI/CD)
  2. Code review by maintainers
  3. Address feedback and update PR
  4. Final approval and merge

Types of Contributions

Bug Fixes

  1. Check existing issues first
  2. Create an issue if one doesn't exist
  3. Include test cases that reproduce the bug
  4. Fix the issue with minimal changes
  5. Verify the fix with tests

New Features

  1. Discuss the feature in an issue first
  2. Consider the API design carefully
  3. Implement with tests
  4. Update documentation
  5. Consider backward compatibility

Documentation

  1. Use clear, concise language
  2. Include code examples
  3. Test code examples
  4. Follow existing structure
  5. Update navigation if needed

Performance Improvements

  1. Benchmark current performance
  2. Implement improvements
  3. Measure impact
  4. Include benchmark tests
  5. Document performance gains
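
For steps 1 and 3, pytest-benchmark can save a baseline run and compare later runs against it:

# Save a baseline before your change, then compare after it
pytest -m performance --benchmark-autosave
pytest -m performance --benchmark-compare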

Code Architecture

Adding New Scrapers

When adding new scraper types:

  1. Inherit from BaseScraper
  2. Implement required methods
  3. Add configuration options
  4. Create comprehensive tests
  5. Update documentation

Example structure:

"""Custom scraper implementation."""

from typing import Any, AsyncIterator
from scrap_e.core.base_scraper import BaseScraper
from scrap_e.core.config import ScraperConfig
from scrap_e.core.models import ScraperType


class CustomScraper(BaseScraper[CustomData, CustomConfig]):
    """Custom scraper implementation."""

    def _get_default_config(self) -> CustomConfig:
        """Get default configuration."""
        return CustomConfig()

    @property
    def scraper_type(self) -> ScraperType:
        """Return scraper type."""
        return ScraperType.CUSTOM

    async def _initialize(self) -> None:
        """Initialize resources."""
        pass

    async def _cleanup(self) -> None:
        """Clean up resources."""
        pass

    async def _scrape(self, source: str, **kwargs: Any) -> CustomData:
        """Perform scraping."""
        # Implementation here
        pass

    async def _stream_scrape(self, source: str, chunk_size: int, **kwargs: Any) -> AsyncIterator[CustomData]:
        """Stream scraping implementation."""
        # Implementation here
        pass

    async def _validate_source(self, source: str, **kwargs: Any) -> None:
        """Validate source accessibility."""
        # Implementation here
        pass
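
A hypothetical usage sketch, assuming BaseScraper exposes the public scrape() wrapper shown earlier in this guide:

import asyncio


async def main() -> None:
    scraper = CustomScraper()
    result = await scraper.scrape("custom://example-source")
    print(result.success)


if __name__ == "__main__":
    asyncio.run(main())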

Issue Reporting

Bug Reports

Include:

  • Environment details (OS, Python version, package versions)
  • Steps to reproduce
  • Expected vs actual behavior
  • Error messages and stack traces
  • Minimal code example
  • scrap-e doctor output

Feature Requests

Include:

  • Use case description
  • Proposed API design
  • Alternative solutions considered
  • Implementation willingness

Security Issues

Report security vulnerabilities privately:

  1. Email: security@scrap-e.dev (if available)
  2. GitHub Security: Use GitHub's security advisory feature
  3. Do not open public issues for security vulnerabilities

Community Guidelines

Code of Conduct

  • Be respectful to all contributors
  • Welcome newcomers and help them learn
  • Focus on technical merit
  • Assume good intentions
  • Resolve conflicts professionally

Communication

  • Use GitHub issues for bugs and features
  • Use GitHub discussions for questions and ideas
  • Be clear and concise in communications
  • Provide context for your requests
  • Follow up on conversations

Release Process

Versioning

We follow Semantic Versioning:

  • Major: Breaking changes
  • Minor: New features, backward compatible
  • Patch: Bug fixes, backward compatible

Release Checklist

  1. Update version numbers
  2. Update CHANGELOG.md
  3. Run full test suite
  4. Build documentation
  5. Create release tag
  6. Publish to PyPI
  7. Update GitHub release

Getting Help

Resources

  • Documentation: Latest docs at /docs
  • Examples: Check /examples directory
  • Tests: See /tests for usage patterns
  • GitHub Issues: Search existing issues first
  • GitHub Discussions: For general questions

Mentorship

New contributors can:

  • Find good first issues labeled good-first-issue
  • Ask for guidance in issues or discussions
  • Pair program with experienced contributors
  • Join community calls (if available)

Recognition

Contributors are recognized through:

  • Git commit attribution
  • CONTRIBUTORS.md file
  • Release notes mentions
  • GitHub contributor stats

Thank you for contributing to Scrap-E! Your contributions help make web scraping accessible to everyone.