docs-scraper
CLI tool that scrapes documents from various sources into local PDF files using browser automation.
Installation
npm install -g docs-scraper
Quick start
Scrape any document URL to PDF:
docs-scraper scrape https://example.com/document
Returns local path: ~/.docs-scraper/output/1706123456-abc123.pdf
Basic scraping
Scrape with daemon (recommended, keeps browser warm):
docs-scraper scrape <url>
Scrape with named profile (for authenticated sites):
docs-scraper scrape <url> -p <profile-name>
Scrape with pre-filled data (e.g., email for DocSend):
docs-scraper scrape <url> -D [email protected]
Direct mode (single-shot, no daemon):
docs-scraper scrape <url> --no-daemon
Authentication workflow
When a document requires authentication (login, email verification, passcode):
1. Initial scrape returns a job ID:
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
# Job ID: abc123
2. Retry with the required data:
docs-scraper update abc123 -D [email protected]
# or with a password:
docs-scraper update abc123 -D [email protected] -D password=1234
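For scripting, the two-step flow can be automated by parsing the job ID out of the blocked response. A minimal sketch, assuming the `Job ID: <id>` output format shown above (verify against your version's actual output):

```bash
#!/usr/bin/env bash
# Scrape a URL; if the scrape is blocked, retry the job with an email.
# Assumes the CLI prints a "Job ID: <id>" line when blocked, as shown above.
URL="https://docsend.com/view/xxx"
EMAIL="[email protected]"

output=$(docs-scraper scrape "$URL")
job_id=$(printf '%s\n' "$output" | awk '/Job ID:/ {print $NF}')

if [ -n "$job_id" ]; then
  docs-scraper update "$job_id" -D "email=$EMAIL"
else
  printf '%s\n' "$output"  # success: the output is the local PDF path
fi
```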
Profile management
Profiles store session cookies for authenticated sites.
docs-scraper profiles list # List saved profiles
docs-scraper profiles clear # Clear all profiles
docs-scraper scrape <url> -p myprofile # Use a profile
Daemon management
The daemon keeps browser instances warm for faster scraping.
docs-scraper daemon status # Check status
docs-scraper daemon start # Start manually
docs-scraper daemon stop # Stop daemon
Note: Daemon auto-starts when running scrape commands.
Cleanup
PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.
Manual cleanup:
docs-scraper cleanup # Delete all PDFs
docs-scraper cleanup --older-than 1h # Delete PDFs older than 1 hour
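If scraping runs on a schedule, cleanup can be scheduled too. A minimal sketch using cron (the hourly interval is an assumption; adjust as needed):

```bash
# crontab entry: every hour, delete PDFs older than one hour
0 * * * * docs-scraper cleanup --older-than 1h
```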
Job management
docs-scraper jobs list # List blocked jobs awaiting auth
Supported sources
- Direct PDF links - Downloads PDF directly
- Notion pages - Exports Notion page to PDF
- DocSend documents - Handles DocSend viewer
- LLM fallback - Uses Claude API for any other webpage
Scraper Reference
Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.
DirectPdfScraper
Handles: URLs ending in .pdf
Data fields: None (downloads directly)
Example:
docs-scraper scrape https://example.com/document.pdf
DocsendScraper
Handles: docsend.com/view/*, docsend.com/v/*, and subdomains (e.g., org-a.docsend.com)
URL patterns:
- Documents: https://docsend.com/view/{id} or https://docsend.com/v/{id}
- Folders: https://docsend.com/view/s/{id}
- Subdomains: https://{subdomain}.docsend.com/view/{id}
Data fields:
| Field | Type | Description |
|---|---|---|
| email | email | Email address for document access |
| password | password | Passcode/password for protected documents |
| name | text | Your name (required for NDA-gated documents) |
Examples:
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D [email protected]
# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D [email protected] -D password=secret123
# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D [email protected] -D name="John Doe"
# Retry blocked job
docs-scraper update abc123 -D [email protected] -D password=secret123
Notes:
- DocSend may require any combination of email, password, and name
- Folders are scraped as a table of contents PDF with document links
- The scraper auto-checks NDA checkboxes when name is provided
NotionScraper
Handles: notion.so/*, *.notion.site/*
Data fields:
| Field | Type | Description |
|---|---|---|
| email | email | Notion account email |
| password | password | Notion account password |
Examples:
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123
# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
-D [email protected] -D password=mypassword
# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
Notes:
- Public Notion pages don't require authentication
- Toggle blocks are automatically expanded before PDF generation
- Uses session profiles to persist login across scrapes
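To take advantage of session persistence, save the login to a named profile on the first scrape and reuse it afterwards. A minimal sketch, assuming `-p` and `-D` combine as in the examples above (the profile name `notion-work` is illustrative):

```bash
# First scrape: log in once and store the session in a profile
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -p notion-work -D [email protected] -D password=mypassword

# Later scrapes: reuse the saved session, no credentials needed
docs-scraper scrape https://notion.so/Another-Page-def456 -p notion-work
```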
LlmFallbackScraper
Handles: Any URL not matched by other scrapers (automatic fallback)
Data fields: Dynamic - determined by Claude analyzing the page
The LLM scraper uses Claude to analyze the page HTML and detect:
- Login forms (extracts field names dynamically)
- Cookie banners (auto-dismisses)
- Expandable content (auto-expands)
- CAPTCHAs (reports as blocked)
- Paywalls (reports as blocked)
Common dynamic fields:
| Field | Type | Description |
|---|---|---|
| email | email | Login email (if detected) |
| password | password | Login password (if detected) |
| username | text | Username (if the login form uses a username) |
Examples:
# Generic webpage (no auth)
docs-scraper scrape https://example.com/article
# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
-D [email protected] -D password=secret
# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
Notes:
- Requires the ANTHROPIC_API_KEY environment variable
- Field names are extracted from the page's actual form fields
- Limited to 2 login attempts before failing
- CAPTCHAs require manual intervention
Data field summary
| Scraper | email | password | name | Other |
|---|---|---|---|---|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓* | ✓* | - | Dynamic* |
*Fields detected dynamically from page analysis
Environment setup (optional)
Only needed for LLM fallback scraper:
export ANTHROPIC_API_KEY=your_key
Optional browser settings:
export BROWSER_HEADLESS=true # Set false for debugging
Common patterns
Archive a Notion page:
docs-scraper scrape https://notion.so/My-Page-abc123
Download protected DocSend:
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D [email protected] -D password=1234
Batch scraping with profiles:
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
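For larger batches, loop over a list of URLs. A minimal sketch, assuming one URL per line in a file (`urls.txt` is illustrative):

```bash
# Scrape every URL in urls.txt, reusing one session profile
while IFS= read -r url; do
  docs-scraper scrape "$url" -p mysite
done < urls.txt
```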
Output
Success: Local file path (e.g., ~/.docs-scraper/output/1706123456-abc123.pdf)
Blocked: Job ID + required credential types
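For scripting, the success path can be captured directly. A sketch, assuming a successful scrape prints only the local file path (check your version's actual output before relying on this):

```bash
# Capture the printed path and open the resulting PDF
pdf_path=$(docs-scraper scrape https://example.com/document.pdf)
xdg-open "$pdf_path"  # use `open` on macOS
```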
Troubleshooting
- Timeout: docs-scraper daemon stop && docs-scraper daemon start
- Auth fails: run docs-scraper jobs list to check pending jobs
- Disk full: run docs-scraper cleanup to remove old PDFs