โ† Back to Personal Development

docstrange

Document extraction API by Nanonets

0
Source Code

DocStrange by Nanonets

Document extraction API: convert PDFs, images, and documents to markdown, JSON, or CSV with per-field confidence scoring.

Get your API key: https://docstrange.nanonets.com/app

Quick Start

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

Response:

{
  "success": true,
  "record_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "completed",
  "result": {
    "markdown": {
      "content": "# Invoice\n\n**Invoice Number:** INV-2024-001..."
    }
  }
}
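
For callers who prefer Python to curl, the same request could be assembled with the standard library alone. This is a minimal sketch, not an official client: `encode_multipart` and `build_sync_request` are hypothetical helper names, and the `file` form field is taken from the input names listed under Troubleshooting.

```python
import urllib.request
import uuid

API_URL = "https://extraction-api.nanonets.com/api/v1/extract/sync"


def encode_multipart(fields, file_name, file_bytes):
    """Encode text fields plus one file part as multipart/form-data.

    Hypothetical helper. It does not escape quotes in names, so pass
    simple ASCII field and file names only.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        'Content-Type: application/octet-stream\r\n\r\n'
    ).encode("utf-8")
    parts.append(file_header + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_sync_request(api_key, file_name, file_bytes, output_format="markdown"):
    """Build (but do not send) a POST request for the sync endpoint."""
    body, content_type = encode_multipart(
        {"output_format": output_format}, file_name, file_bytes
    )
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )

# Sending the request (performs a real network call):
#   import json
#   with urllib.request.urlopen(request) as reply:
#       response = json.load(reply)
```

Building the request separately from sending it keeps the upload logic testable without network access.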

Setup

1. Get Your API Key

# Visit the dashboard
https://docstrange.nanonets.com/app

Save your API key:

export DOCSTRANGE_API_KEY="your_api_key_here"

2. OpenClaw Configuration (Optional)

Recommended: Use environment variables (most secure):

{
  skills: {
    entries: {
      "docstrange": {
        enabled: true,
        // API key loaded from environment variable DOCSTRANGE_API_KEY
      },
    },
  },
}

Alternative: Store in config file (use with caution):

{
  skills: {
    entries: {
      "docstrange": {
        enabled: true,
        env: {
          DOCSTRANGE_API_KEY: "your_api_key_here",
        },
      },
    },
  },
}

Security Note: If you store API keys in ~/.openclaw/openclaw.json:

  • Set file permissions: chmod 600 ~/.openclaw/openclaw.json
  • Never commit this file to version control
  • Prefer environment variables or your agent's secret store when possible
  • Rotate keys regularly and limit API key permissions if supported
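
The permission advice above can be checked programmatically on POSIX systems. The sketch below uses a hypothetical helper, `is_owner_only`, which reports whether group and other users have no access to a file, as is the case after `chmod 600`.

```python
import os
import stat


def is_owner_only(path):
    """Return True if only the owner has any permission bits set (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # 0o077 masks the group and other permission bits.
    return (mode & 0o077) == 0
```

For example, `is_owner_only(os.path.expanduser("~/.openclaw/openclaw.json"))` should return True once the file has been restricted with `chmod 600`.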

Common Tasks

Extract to Markdown

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

Access content: response["result"]["markdown"]["content"]
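
A slightly more defensive way to read the content is sketched below. `markdown_content` is a hypothetical helper; it assumes the parsed response has the shape shown in Quick Start and refuses to read a response that is not marked successful and completed.

```python
def markdown_content(response):
    """Return the markdown text from a parsed sync response, or raise ValueError."""
    if not response.get("success") or response.get("status") != "completed":
        raise ValueError(
            f"extraction not completed: status={response.get('status')!r}"
        )
    return response["result"]["markdown"]["content"]
```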

Extract JSON Fields

Simple field list:

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=json" \
  -F 'json_options=["invoice_number", "date", "total_amount", "vendor"]' \
  -F "include_metadata=confidence_score"

With JSON schema:

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=json" \
  -F 'json_options={"type": "object", "properties": {"invoice_number": {"type": "string"}, "total_amount": {"type": "number"}}}'
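
Both forms of `json_options` are JSON text sent as a form field, so a client can serialize either a field list or a schema the same way. The sketch below uses a hypothetical helper, `json_extraction_fields`, that returns the text form fields for such a request; the file part is added separately.

```python
import json


def json_extraction_fields(json_options, with_confidence=False):
    """Build the text form fields for a JSON extraction request.

    json_options may be a list of field names or a JSON schema dict.
    """
    fields = {
        "output_format": "json",
        "json_options": json.dumps(json_options),
    }
    if with_confidence:
        fields["include_metadata"] = "confidence_score"
    return fields
```

Serializing with `json.dumps` avoids the quoting mistakes that are easy to make when writing the JSON by hand inside a shell string.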

Response with confidence scores:

{
  "result": {
    "json": {
      "content": {
        "invoice_number": "INV-2024-001",
        "total_amount": 500.00
      },
      "metadata": {
        "confidence_score": {
          "invoice_number": 98,
          "total_amount": 99
        }
      }
    }
  }
}

Extract Tables to CSV

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/sync" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=csv" \
  -F "csv_options=table"

Async Extraction (Large Documents)

For documents longer than 5 pages, queue the job on the async endpoint, then poll for the result:

Queue the document:

curl -X POST "https://extraction-api.nanonets.com/api/v1/extract/async" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY" \
  -F "file=@document.pdf" \
  -F "output_format=markdown"

# Returns: {"record_id": "12345", "status": "processing"}

Poll for results:

curl -X GET "https://extraction-api.nanonets.com/api/v1/extract/results/12345" \
  -H "Authorization: Bearer $DOCSTRANGE_API_KEY"

# Returns: {"status": "completed", "result": {...}}
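
The queue-then-poll flow above can be wrapped in a small loop. This is a sketch under stated assumptions: `fetch` is a hypothetical callable that performs the GET on `/extract/results/{record_id}` and returns the parsed JSON, and any status other than "processing" or "completed" is treated as an error because no other statuses are documented here.

```python
import time


def wait_for_result(fetch, record_id, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll until the record is completed and return its "result" payload."""
    for _ in range(max_attempts):
        payload = fetch(record_id)
        status = payload.get("status")
        if status == "completed":
            return payload["result"]
        if status != "processing":
            # Assumption: unknown statuses indicate a failed job.
            raise RuntimeError(f"unexpected status {status!r} for record {record_id}")
        sleep(interval)
    raise TimeoutError(f"record {record_id} not completed after {max_attempts} attempts")
```

Passing `fetch` and `sleep` as parameters keeps the loop independent of any HTTP library and makes it easy to exercise with stubbed responses.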

Advanced Features

Bounding Boxes

Get element coordinates for layout analysis:

-F "include_metadata=bounding_boxes"

Hierarchy Output

Extract document structure (sections, tables, key-value pairs):

-F "json_options=hierarchy_output"

Financial Documents Mode

Enhanced table and number formatting:

-F "markdown_options=financial-docs"

Custom Instructions

Guide extraction with prompts:

-F "custom_instructions=Focus on financial data. Ignore headers."
-F "prompt_mode=append"

Multiple Formats

Request multiple formats in one call:

-F "output_format=markdown,json"
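
When several formats are requested, a client needs to pick each one out of the response. The sketch below assumes, without confirmation from the API documentation, that a multi-format response nests each format under `result` by name with a `content` key, as the single-format examples do. `contents_by_format` is a hypothetical helper.

```python
def contents_by_format(response):
    """Map each format name under "result" to its "content" value."""
    result = response.get("result", {})
    return {
        fmt: block["content"]
        for fmt, block in result.items()
        if isinstance(block, dict) and "content" in block
    }
```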

When to Use

Use DocStrange For:

  • Invoice and receipt processing
  • Contract text extraction
  • Bank statement parsing
  • Form digitization
  • Image OCR (scanned documents)

Don't Use For:

  • Documents >5 pages with sync (use async)
  • Video/audio transcription
  • Non-document images

Best Practices

Document Size    Endpoint          Notes
<=5 pages        /extract/sync     Immediate response
>5 pages         /extract/async    Poll for results

JSON Extraction:

  • Field list: ["field1", "field2"] for quick extractions
  • JSON schema: {"type": "object", ...} for strict typing and nested data

Confidence Scores:

  • Add include_metadata=confidence_score
  • Scores are 0-100 per field
  • Review fields <80 manually
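
The review rule above is straightforward to automate. The sketch below uses a hypothetical helper, `fields_needing_review`, and assumes the flat `confidence_score` mapping shown in the earlier response example; nested schemas may report scores in a different shape.

```python
def fields_needing_review(response, threshold=80):
    """Return the names of fields whose confidence score is below the threshold."""
    scores = response["result"]["json"]["metadata"]["confidence_score"]
    return sorted(name for name, score in scores.items() if score < threshold)
```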

Schema Templates

Invoice

{
  "type": "object",
  "properties": {
    "invoice_number": {"type": "string"},
    "date": {"type": "string"},
    "vendor": {"type": "string"},
    "total": {"type": "number"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "price": {"type": "number"}
        }
      }
    }
  }
}

Receipt

{
  "type": "object",
  "properties": {
    "merchant": {"type": "string"},
    "date": {"type": "string"},
    "total": {"type": "number"},
    "items": {
      "type": "array",
      "items": {"type": "object", "properties": {"name": {"type": "string"}, "price": {"type": "number"}}}
    }
  }
}
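
Extracted content can be sanity-checked against templates like these before it is used downstream. The following is a deliberately small sketch, not a full JSON Schema validator: `type_mismatches` is a hypothetical helper that only understands the `string`, `number`, `object`, and `array` types used above, skips fields that are absent, and reports a null value as a mismatch.

```python
JSON_TYPES = {"string": str, "number": (int, float), "object": dict, "array": list}


def type_mismatches(content, schema, path=""):
    """Return dotted paths of present fields whose value has the wrong type."""
    problems = []
    for name, spec in schema.get("properties", {}).items():
        if name not in content:
            continue  # absent fields are not checked in this sketch
        value = content[name]
        where = f"{path}{name}"
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(where)
        elif spec["type"] == "object":
            problems += type_mismatches(value, spec, where + ".")
        elif spec["type"] == "array":
            item_spec = spec.get("items", {})
            if item_spec.get("type") == "object":
                for i, item in enumerate(value):
                    if isinstance(item, dict):
                        problems += type_mismatches(item, item_spec, f"{where}[{i}].")
                    else:
                        problems.append(f"{where}[{i}]")
    return problems
```

An empty list means every present field matched its declared type; for production use, a complete JSON Schema validator would be a better fit.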

Security & Privacy

Data Handling

Important: Documents uploaded to DocStrange are transmitted to https://extraction-api.nanonets.com and processed on external servers.

Before uploading sensitive documents:

  • Review Nanonets' privacy policy and data retention policies: https://docstrange.nanonets.com/docs
  • Verify encryption in transit (HTTPS) and at rest
  • Confirm data deletion/retention timelines
  • Test with non-sensitive sample documents first

Best practices:

  • Do not upload highly sensitive PII (SSNs, medical records, financial account numbers) until you've confirmed the service's security and compliance posture
  • Use API keys with limited permissions/scopes if available
  • Rotate API keys regularly (every 90 days recommended)
  • Monitor API usage logs for unauthorized access
  • Never log or commit API keys to repositories or examples

File Size Limits

  • Sync endpoint: Recommended for documents ≤5 pages
  • Async endpoint: Use for documents >5 pages to avoid timeouts
  • Large files: Consider using file_url with publicly accessible URLs instead of uploading large files directly
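
The page-count guidance above amounts to a one-line routing decision. A minimal sketch follows; `endpoint_for` and the two constants are hypothetical names, and the page count is assumed to be known before upload.

```python
SYNC_PAGE_LIMIT = 5
BASE_URL = "https://extraction-api.nanonets.com/api/v1"


def endpoint_for(page_count):
    """Pick the sync endpoint for short documents and the async one otherwise."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    path = "/extract/sync" if page_count <= SYNC_PAGE_LIMIT else "/extract/async"
    return BASE_URL + path
```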

Operational Safeguards

  • Always use environment variables or secure secret stores for API keys
  • Never include real API keys in code examples or documentation
  • Use placeholder values like "your_api_key_here" in examples
  • Set appropriate file permissions on configuration files (600 for JSON configs)
  • Enable API key rotation and monitor usage through the dashboard

Troubleshooting

400 Bad Request:

  • Provide exactly one input: file, file_url, or file_base64
  • Verify API key is valid
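
The "exactly one input" rule can be enforced on the client before a request is sent. The sketch below uses a hypothetical helper, `single_input`, which returns the name of the one input provided and raises otherwise.

```python
INPUT_FIELDS = ("file", "file_url", "file_base64")


def single_input(**inputs):
    """Return the name of the single provided input, or raise ValueError."""
    unknown = set(inputs) - set(INPUT_FIELDS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    provided = [name for name in INPUT_FIELDS if inputs.get(name) is not None]
    if len(provided) != 1:
        raise ValueError(
            f"provide exactly one of {INPUT_FIELDS}, got {provided or 'none'}"
        )
    return provided[0]
```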

Sync Timeout:

  • Use async for documents >5 pages
  • Poll /extract/results/{record_id}

Missing Confidence Scores:

  • Requires json_options (field list or schema)
  • Add include_metadata=confidence_score

Authentication Errors:

  • Verify DOCSTRANGE_API_KEY environment variable is set
  • Check API key hasn't expired or been revoked
  • Ensure no extra whitespace in API key value
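
The first and third checks above can be folded into the code that reads the key. A minimal sketch, with `load_api_key` as a hypothetical helper name:

```python
import os


def load_api_key(env=os.environ):
    """Read DOCSTRANGE_API_KEY, strip stray whitespace, and fail early if unset."""
    raw = env.get("DOCSTRANGE_API_KEY")
    if raw is None:
        raise RuntimeError("DOCSTRANGE_API_KEY is not set")
    key = raw.strip()
    if not key:
        raise RuntimeError("DOCSTRANGE_API_KEY is empty")
    return key
```

Failing at startup with a clear message is usually easier to diagnose than an authentication error returned by the API later.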

Pre-Publish Security Checklist

Before publishing or updating this skill, verify:

  • package.json declares requiredEnv and primaryEnv for DOCSTRANGE_API_KEY
  • package.json lists API endpoints in endpoints array
  • All code examples use placeholder values ("your_api_key_here") not real keys
  • No API keys or secrets are embedded in SKILL.md or package.json
  • Security & Privacy section documents data handling and risks
  • Configuration examples include security warnings for plaintext storage
  • File permission guidance is included for config files
