CLI Commands API¶
The Scrap-E command-line interface provides commands for running scraping operations directly from the terminal. This reference documents all available commands and their options.
Main Command¶
scrap-e¶
The root command with global options available to all subcommands.
Global Options¶
Option | Description | Default |
---|---|---|
--version | Show version and exit | - |
--config PATH | Path to configuration file | None |
--debug/--no-debug | Enable debug mode | False |
--help | Show help message | - |
Configuration¶
Global configuration affects all commands:
# Using configuration file
scrap-e --config scraper.yaml scrape https://example.com
# Debug mode for detailed logging
scrap-e --debug scrape https://example.com
Commands¶
scrape¶
Scrape data from a single URL.
Arguments¶
URL
- The URL to scrape (required)
Options¶
Option | Short | Type | Description | Default |
---|---|---|---|---|
--method | -m | Choice | Scraping method: http, browser | http |
--output | -o | Path | Output file path | stdout |
--format | -f | Choice | Output format: json, csv, html | json |
--selector | -s | Text | CSS selector (repeatable) | None |
--xpath | -x | Text | XPath expression (repeatable) | None |
--wait-for | | Text | Wait for selector (browser only) | None |
--screenshot | | Flag | Capture screenshot (browser only) | False |
--headless/--no-headless | | Flag | Browser headless mode | True |
--user-agent | | Text | Custom user agent string | Default UA |
--timeout | | Integer | Request timeout in seconds | 30 |
Examples¶
Basic scraping:
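A minimal invocation (the URL is a placeholder); with the defaults, JSON output goes to stdout:
scrap-e scrape https://example.com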
Extract specific data:
scrap-e scrape https://news.example.com \
--selector "h2.headline" \
--selector ".article-summary" \
--output articles.json
Browser scraping with JavaScript:
scrap-e scrape https://spa.example.com \
--method browser \
--wait-for ".dynamic-content" \
--screenshot \
--output spa_data.json
XPath extraction:
scrap-e scrape https://example.com \
--xpath "//h1/text()" \
--xpath "//p[@class='content']/text()" \
--format csv
Return Codes¶
- 0 - Success
- 1 - Scraping failed
- 2 - Invalid arguments
- 3 - Network error
- 4 - Parsing error
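A sketch of acting on these exit codes in a shell script (the URL is a placeholder):
scrap-e scrape https://example.com --output page.json
case $? in
  0) echo "Success" ;;
  3) echo "Network error, retry later" ;;
  *) echo "Scrape failed" ;;
esac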
batch¶
Scrape multiple URLs concurrently.
Arguments¶
URL
- One or more URLs to scrape (required, multiple allowed)
Options¶
Option | Short | Type | Description | Default |
---|---|---|---|---|
--method | -m | Choice | Scraping method: http, browser | http |
--concurrent | -c | Integer | Number of concurrent requests | 5 |
--output-dir | -o | Path | Output directory for results | None |
Examples¶
Basic batch scraping:
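A minimal invocation using the default concurrency of 5 (URLs are placeholders):
scrap-e batch https://site1.com https://site2.com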
High concurrency with output directory:
scrap-e batch \
https://site1.com \
https://site2.com \
https://site3.com \
--concurrent 10 \
--output-dir scraped_results/
Browser batch scraping:
scrap-e batch \
https://app1.example.com \
https://app2.example.com \
--method browser \
--concurrent 2
Output Structure¶
When using --output-dir, files are saved as:
- result_0.json - First URL result
- result_1.json - Second URL result
- etc.
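A small post-processing sketch, assuming jq is installed and the per-URL files follow the JSON format documented under Output Formats below:
for f in scraped_results/result_*.json; do
  jq -r '.data.url' "$f"
done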
Performance Considerations¶
- Default concurrency is 5 to be respectful to target servers
- Increase --concurrent for better performance if the target supports it
- Browser method is slower but handles JavaScript-heavy sites
- Monitor network usage and adjust concurrency accordingly
sitemap¶
Extract URLs from XML sitemaps and optionally scrape them.
Arguments¶
SITEMAP_URL
- URL of the XML sitemap (required)
Options¶
Option | Short | Type | Description |
---|---|---|---|
--output | -o | Path | Output file for extracted URLs |
--scrape | | Flag | Scrape all URLs from sitemap |
Examples¶
Extract URLs from sitemap:
Save URLs to file:
Extract and scrape all URLs:
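Illustrative invocations for the three cases above (the sitemap URL is a placeholder):
# Extract URLs from sitemap
scrap-e sitemap https://example.com/sitemap.xml
# Save URLs to file
scrap-e sitemap https://example.com/sitemap.xml --output urls.txt
# Extract and scrape all URLs
scrap-e sitemap https://example.com/sitemap.xml --scrape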
Sitemap Support¶
Supports standard XML sitemap formats:
- Regular sitemaps (<urlset>)
- Sitemap index files (<sitemapindex>)
- Nested sitemaps (automatically resolved)
Output Format¶
URL extraction only: the extracted URLs are written to the console, or to the file given with --output.
With scraping (--scrape flag):
- Console output shows scraping progress
- Results are not saved unless combined with --output-dir from the batch command
doctor¶
System diagnostics and health check.
No Options¶
This command takes no additional options.
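Run it directly:
scrap-e doctor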
Checks Performed¶
- Python Version - Verifies Python 3.13+ compatibility
- Package Dependencies - Checks installation of required packages:
  - httpx (HTTP client)
  - beautifulsoup4 (HTML parsing)
  - playwright (Browser automation)
  - pandas (Data processing)
  - pydantic (Data validation)
- Browser Drivers - Tests browser availability:
  - Chromium
  - Firefox
  - WebKit
- System Dependencies - Verifies system-level requirements
Example Output¶
Scrap-E System Check
Component Status Result
Python Version 3.13.0 ✓ OK
Package: httpx Installed ✓ OK
Package: beautifulsoup4 Installed ✓ OK
Package: playwright Installed ✓ OK
Package: pandas Installed ✓ OK
Package: pydantic Installed ✓ OK
Browser: chromium Available ✓ OK
Browser: firefox Available ✓ OK
Browser: webkit Not available ✗ FAIL
⚠ WARNING Some checks failed. Run 'playwright install webkit' to install missing browsers.
Return Codes¶
- 0 - All checks passed
- 1 - Some checks failed (warnings)
- 2 - Critical failures (cannot operate)
Troubleshooting¶
Based on doctor output:
# Install missing browsers
playwright install
# Install missing packages
pip install beautifulsoup4 pandas
# Fix Python version
pyenv install 3.13.0
pyenv local 3.13.0
serve¶
Start the Scrap-E API server (planned feature).
Options¶
Option | Type | Description | Default |
---|---|---|---|
--host | Text | API server host | 127.0.0.1 |
--port | Integer | API server port | 8000 |
Examples¶
Start server on default host/port:
Custom host and port:
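Illustrative invocations (host and port values are arbitrary):
# Default host and port (127.0.0.1:8000)
scrap-e serve
# Custom host and port
scrap-e serve --host 0.0.0.0 --port 9000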
Current Status¶
This command is a placeholder for future API server functionality and does not yet start a working server.
Configuration Files¶
YAML Configuration¶
# scraper.yaml
debug: true
user_agent: "Custom Scraper Bot 1.0"
default_timeout: 60
concurrent_requests: 10
browser:
  headless: true
  type: chromium
  viewport:
    width: 1920
    height: 1080
output:
  format: json
  pretty_print: true
Use with any command:
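For example, using the sample file above with the batch command (URLs are placeholders):
scrap-e --config scraper.yaml batch https://site1.com https://site2.com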
Environment Variables¶
Set configuration via environment variables:
export SCRAPER_USER_AGENT="Custom Bot 1.0"
export SCRAPER_DEFAULT_TIMEOUT=60
export SCRAPER_BROWSER_HEADLESS=true
export SCRAPER_DEBUG=true
Variables are automatically loaded by all commands.
Output Formats¶
JSON Format¶
Default format with complete metadata:
{
  "success": true,
  "data": {
    "url": "https://example.com",
    "status_code": 200,
    "headers": {...},
    "content": "<!DOCTYPE html>...",
    "extracted_data": {
      "title": "Page Title",
      "links": ["url1", "url2"]
    },
    "metadata": {...}
  },
  "metadata": {
    "scraper_type": "web_http",
    "source": "https://example.com",
    "timestamp": "2024-01-15T10:30:00Z",
    "duration_seconds": 1.23,
    "records_scraped": 1,
    "errors_count": 0
  }
}
CSV Format¶
Tabular format for extracted data:
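For example, a run like the following (URL and selector are placeholders) writes the extracted values as CSV:
scrap-e scrape https://news.example.com --selector "h2.headline" --format csv --output headlines.csv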
HTML Format¶
Raw HTML content:
<!DOCTYPE html>
<html>
<head><title>Example Page</title></head>
<body>
<h1>Welcome</h1>
<p>Content here...</p>
</body>
</html>
Error Handling¶
Common Error Messages¶
Typical failure categories (the exact message text depends on the error):
- Invalid URL
- Network timeout
- Missing selector
- Browser not available
Debug Mode¶
Enable debug mode for detailed error information:
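For example (the URL is a placeholder):
scrap-e --debug scrape https://example.com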
Debug output includes:
- HTTP request/response headers
- Full exception tracebacks
- Parser backend selection
- Performance timing information
Integration Examples¶
Shell Scripting¶
#!/bin/bash
# bulk_scraper.sh
URLS=(
  "https://example.com/page1"
  "https://example.com/page2"
  "https://example.com/page3"
)
echo "Starting bulk scraping..."
for url in "${URLS[@]}"; do
  filename="output_$(basename "$url" | tr '/' '_').json"
  echo "Scraping $url -> $filename"
  if scrap-e scrape "$url" --output "$filename"; then
    echo "✓ Success: $url"
  else
    echo "✗ Failed: $url"
  fi
done
echo "Bulk scraping complete!"
Cron Jobs¶
# Crontab entry for daily scraping
0 2 * * * /usr/local/bin/scrap-e sitemap https://news.example.com/sitemap.xml --scrape > /var/log/scraper.log 2>&1
# Weekly full site scraping
0 1 * * 0 /usr/local/bin/scrap-e batch $(cat /home/user/urls.txt) --output-dir /data/scraped/$(date +\%Y-\%m-\%d)/
CI/CD Pipeline¶
# .github/workflows/scraping.yml
name: Daily Data Collection
on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.13'
      - name: Install Scrap-E
        run: |
          pip install scrap-e
          playwright install chromium
      - name: Verify Installation
        run: scrap-e doctor
      - name: Scrape Data
        run: |
          scrap-e batch \
            https://api.example.com/data \
            https://feeds.example.com/rss \
            --output-dir data/$(date +%Y-%m-%d) \
            --concurrent 5
      - name: Upload Results
        uses: actions/upload-artifact@v3
        with:
          name: scraped-data
          path: data/
Performance Guidelines¶
Concurrency Settings¶
Scenario | Recommended Concurrency | Notes |
---|---|---|
Small API | 1-2 | Respect rate limits |
Public website | 3-5 | Be respectful |
Internal service | 10-20 | Based on capacity |
Bulk processing | 20-50 | Monitor resources |
Memory Usage¶
- HTTP method: ~5MB per concurrent request
- Browser method: ~50-100MB per concurrent request
- Factor in target response sizes
- Use --output-dir for large batches to avoid memory accumulation
Network Considerations¶
- Default timeout (30s) works for most sites
- Increase timeout for slow APIs: --timeout 120
- Use --user-agent to identify your scraper
- Implement delays between requests if needed (future feature)
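A sketch combining these options (the URL and user-agent string are placeholders):
scrap-e scrape https://slow-api.example.com --timeout 120 --user-agent "ExampleBot/1.0 (contact@example.com)"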
For more detailed usage examples, see the CLI Usage Guide.