โ† Back to PDF & Documents
PDF & Documents by @chrisling-dev

links-to-pdfs

Scrape documents from Notion, DocSend, PDFs, and other

0
Source Code

docs-scraper

CLI tool that scrapes documents from various sources into local PDF files using browser automation.

Installation

npm install -g docs-scraper

Quick start

Scrape any document URL to PDF:

docs-scraper scrape https://example.com/document

Returns local path: ~/.docs-scraper/output/1706123456-abc123.pdf

Basic scraping

Scrape with daemon (recommended, keeps browser warm):

docs-scraper scrape <url>

Scrape with named profile (for authenticated sites):

docs-scraper scrape <url> -p <profile-name>

Scrape with pre-filled data (e.g., email for DocSend):

docs-scraper scrape <url> -D [email protected]

Direct mode (single-shot, no daemon):

docs-scraper scrape <url> --no-daemon

Authentication workflow

When a document requires authentication (login, email verification, passcode):

  1. Initial scrape returns a job ID:

    docs-scraper scrape https://docsend.com/view/xxx
    # Output: Scrape blocked
    #         Job ID: abc123
    
  2. Retry with data:

    docs-scraper update abc123 -D [email protected]
    # or with password
    docs-scraper update abc123 -D [email protected] -D password=1234
    

Profile management

Profiles store session cookies for authenticated sites.

docs-scraper profiles list     # List saved profiles
docs-scraper profiles clear    # Clear all profiles
docs-scraper scrape <url> -p myprofile  # Use a profile

Daemon management

The daemon keeps browser instances warm for faster scraping.

docs-scraper daemon status     # Check status
docs-scraper daemon start      # Start manually
docs-scraper daemon stop       # Stop daemon

Note: Daemon auto-starts when running scrape commands.

Cleanup

PDFs are stored in ~/.docs-scraper/output/. The daemon automatically cleans up files older than 1 hour.

Manual cleanup:

docs-scraper cleanup                    # Delete all PDFs
docs-scraper cleanup --older-than 1h    # Delete PDFs older than 1 hour

Job management

docs-scraper jobs list         # List blocked jobs awaiting auth

Supported sources

  • Direct PDF links - Downloads PDF directly
  • Notion pages - Exports Notion page to PDF
  • DocSend documents - Handles DocSend viewer
  • LLM fallback - Uses Claude API for any other webpage

Scraper Reference

Each scraper accepts specific -D data fields. Use the appropriate fields based on the URL type.

DirectPdfScraper

Handles: URLs ending in .pdf

Data fields: None (downloads directly)

Example:

docs-scraper scrape https://example.com/document.pdf

DocsendScraper

Handles: docsend.com/view/*, docsend.com/v/*, and subdomains (e.g., org-a.docsend.com)

URL patterns:

  • Documents: https://docsend.com/view/{id} or https://docsend.com/v/{id}
  • Folders: https://docsend.com/view/s/{id}
  • Subdomains: https://{subdomain}.docsend.com/view/{id}

Data fields:

Field Type Description
email email Email address for document access
password password Passcode/password for protected documents
name text Your name (required for NDA-gated documents)

Examples:

# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D [email protected]

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D [email protected] -D password=secret123

# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D [email protected] -D name="John Doe"

# Retry blocked job
docs-scraper update abc123 -D [email protected] -D password=secret123

Notes:

  • DocSend may require any combination of email, password, and name
  • Folders are scraped as a table of contents PDF with document links
  • The scraper auto-checks NDA checkboxes when name is provided

NotionScraper

Handles: notion.so/*, *.notion.site/*

Data fields:

Field Type Description
email email Notion account email
password password Notion account password

Examples:

# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D [email protected] -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123

Notes:

  • Public Notion pages don't require authentication
  • Toggle blocks are automatically expanded before PDF generation
  • Uses session profiles to persist login across scrapes

LlmFallbackScraper

Handles: Any URL not matched by other scrapers (automatic fallback)

Data fields: Dynamic - determined by Claude analyzing the page

The LLM scraper uses Claude to analyze the page HTML and detect:

  • Login forms (extracts field names dynamically)
  • Cookie banners (auto-dismisses)
  • Expandable content (auto-expands)
  • CAPTCHAs (reports as blocked)
  • Paywalls (reports as blocked)

Common dynamic fields:

Field Type Description
email email Login email (if detected)
password password Login password (if detected)
username text Username (if login uses username)

Examples:

# Generic webpage (no auth)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D [email protected] -D password=secret

# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret

Notes:

  • Requires ANTHROPIC_API_KEY environment variable
  • Field names are extracted from the page's actual form fields
  • Limited to 2 login attempts before failing
  • CAPTCHAs require manual intervention

Data field summary

Scraper email password name Other
DirectPdf - - - -
DocSend โœ“ โœ“ โœ“ -
Notion โœ“ โœ“ - -
LLM Fallback โœ“* โœ“* - Dynamic*

*Fields detected dynamically from page analysis

Environment setup (optional)

Only needed for LLM fallback scraper:

export ANTHROPIC_API_KEY=your_key

Optional browser settings:

export BROWSER_HEADLESS=true   # Set false for debugging

Common patterns

Archive a Notion page:

docs-scraper scrape https://notion.so/My-Page-abc123

Download protected DocSend:

docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D [email protected] -D password=1234

Batch scraping with profiles:

docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite

Output

Success: Local file path (e.g., ~/.docs-scraper/output/1706123456-abc123.pdf) Blocked: Job ID + required credential types

Troubleshooting

  • Timeout: docs-scraper daemon stop && docs-scraper daemon start
  • Auth fails: docs-scraper jobs list to check pending jobs
  • Disk full: docs-scraper cleanup to remove old PDFs