# Skill: deep-scraper

## Overview
A high-performance engineering tool for deep web scraping. It runs a containerized Crawlee (Playwright) environment under Docker to bypass anti-bot protections on complex sites such as YouTube and X/Twitter, returning raw, "interception-level" data.
## Requirements
- Docker: Must be installed and running on the host machine.
- Image: Build the environment with the tag `clawd-crawlee`.
- Build command: `docker build -t clawd-crawlee skills/deep-scraper/`
## Integration Guide
Simply copy the skills/deep-scraper directory into your skills/ folder. Ensure the Dockerfile remains within the skill directory for self-contained deployment.
## Standard Interface (CLI)

`docker run -t --rm -v $(pwd)/skills/deep-scraper/assets:/usr/src/app/assets clawd-crawlee node assets/main_handler.js [TARGET_URL]`
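For callers that script this interface, the invocation above can be wrapped from Python. This is a minimal sketch, not part of the skill itself: the function names (`build_scrape_command`, `run_scrape`) and the `repo_root` parameter are illustrative assumptions; the argv simply mirrors the documented command.

```python
import os
import subprocess


def build_scrape_command(target_url: str, repo_root: str = ".") -> list[str]:
    """Assemble the documented `docker run` argv for the deep-scraper container."""
    assets = os.path.join(os.path.abspath(repo_root), "skills/deep-scraper/assets")
    return [
        "docker", "run", "-t", "--rm",
        "-v", f"{assets}:/usr/src/app/assets",  # mount the skill's assets into the container
        "clawd-crawlee",
        "node", "assets/main_handler.js",
        target_url,
    ]


def run_scrape(target_url: str, repo_root: str = ".") -> str:
    """Run the container and return its stdout (the JSON result string)."""
    result = subprocess.run(
        build_scrape_command(target_url, repo_root),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

`build_scrape_command` is kept separate from `run_scrape` so the argv can be inspected or logged without Docker being available.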
## Output Specification (JSON)
The scraping results are printed to stdout as a JSON string:
- `status`: SUCCESS | PARTIAL | ERROR
- `type`: TRANSCRIPT | DESCRIPTION | GENERIC
- `videoId`: (For YouTube) The validated Video ID.
- `data`: The core text content or transcript.
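A consumer can sanity-check this envelope before trusting the payload. A minimal sketch, assuming only the fields and enum values listed in the specification above (the helper name `parse_scrape_output` is illustrative):

```python
import json

# Enum values taken from the output specification.
EXPECTED_STATUSES = {"SUCCESS", "PARTIAL", "ERROR"}
EXPECTED_TYPES = {"TRANSCRIPT", "DESCRIPTION", "GENERIC"}


def parse_scrape_output(raw: str) -> dict:
    """Parse the scraper's stdout and validate the JSON envelope fields."""
    payload = json.loads(raw)
    if payload.get("status") not in EXPECTED_STATUSES:
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    if payload.get("type") not in EXPECTED_TYPES:
        raise ValueError(f"unexpected type: {payload.get('type')!r}")
    return payload
```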
## Core Rules
- ID Validation: All YouTube tasks MUST verify the Video ID to prevent cache contamination.
- Privacy: Strictly forbidden from scraping password-protected or non-public personal information.
- Alpha-Focused: Automatically strips ads and page noise, delivering clean data optimized for LLM processing.
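The ID-validation rule can be sketched as a simple format check. This assumes the current public YouTube ID format (11 characters from `[A-Za-z0-9_-]`), which is an observed convention rather than something this skill guarantees:

```python
import re

# Assumption: YouTube video IDs are 11 base64url-style characters.
VIDEO_ID_RE = re.compile(r"^[A-Za-z0-9_-]{11}$")


def validate_video_id(video_id: str) -> bool:
    """Return True if the string looks like a well-formed YouTube video ID."""
    return bool(VIDEO_ID_RE.fullmatch(video_id))
```

Rejecting malformed IDs up front prevents a stale or mismatched result from being cached under the wrong key.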