Transform web content into structured knowledge with AI-powered analysis tools
Features • Quick Start • Documentation • Examples
Claude Code Analyst is a comprehensive toolkit for capturing, converting, and visualizing web content. Built specifically for Claude Code integration, it provides powerful utilities to transform unstructured web content into clean Markdown documents, preserve complete HTML archives, and generate insightful Mermaid.js visualizations.
- Complete Content Capture: Convert web articles to Markdown or preserve them as clean HTML archives
- Content Intelligence: Extract structured data with comprehensive metadata preservation
- Visual Understanding: Automatically generate diagrams from text to reveal hidden patterns and relationships
- Production Quality: Respects robots.txt, handles edge cases, and produces clean, consistent output
Transform web articles into clean, portable Markdown files:
- Smart Extraction: Uses Mozilla's Readability algorithm to extract main content while filtering out ads, navigation, and clutter
- Image Preservation: Downloads and organizes images with proper relative path references
- Rich Metadata: Captures title, publication date, word count, and source attribution in YAML frontmatter
- Dual Input Support: Works with both web URLs and local HTML files
- Respectful Scraping: Checks robots.txt before processing any URL
- Clean Output: Generates well-formatted Markdown with preserved text flow
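As a rough illustration of the metadata step, the sketch below assembles a YAML frontmatter block using the field names from the sample output shown in this README. The function name `build_frontmatter` and its parameters are hypothetical and are not taken from `scripts/article_to_md.py`; the word count here is a simple whitespace split, which may differ from what the actual tool does.

```python
from datetime import date
from typing import Optional


def build_frontmatter(title: str, source_url: str, article_date: str,
                      body: str, image_count: int,
                      scraped: Optional[date] = None) -> str:
    """Return a YAML frontmatter block followed by the Markdown body.

    Hypothetical helper: the field names mirror the README's sample output,
    but the real converter may build its frontmatter differently.
    """
    scraped = scraped or date.today()
    # Escape backslashes and double quotes so the title remains a valid
    # double-quoted YAML scalar.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    fields = [
        f'title: "{safe_title}"',
        f"source_url: {source_url}",
        f"article_date: {article_date}",
        f"date_scraped: {scraped.isoformat()}",
        f"word_count: {len(body.split())}",  # naive whitespace word count
        f"image_count: {image_count}",
    ]
    return "---\n" + "\n".join(fields) + "\n---\n\n" + body
```

Passing `scraped` explicitly keeps the output reproducible, which is convenient in tests; leaving it out falls back to today's date.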
Create self-contained HTML archives of web pages:
- Complete Preservation: Downloads entire web pages as clean, readable HTML documents
- Smart Content Extraction: Uses advanced algorithms to identify and extract main content
- Image Archiving: Downloads all referenced images with proper HTTP headers to bypass basic protection
- Enhanced Substack Support: Properly handles Substack articles with anchor-wrapped images
- Comprehensive Metadata: Preserves OpenGraph, Twitter cards, publication dates, and source attribution
- Professional Styling: Generates clean HTML5 output with embedded responsive CSS
- Offline Ready: Creates fully self-contained archives perfect for offline reading and research
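To make the metadata preservation above more concrete, here is a minimal sketch that collects OpenGraph and Twitter card tags from a page. It uses only the standard library's `html.parser` so it runs without dependencies; the project itself lists `beautifulsoup4`, so the real downloader most likely works differently. `MetaCollector` and `extract_metadata` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect OpenGraph and Twitter card <meta> tags into a dict."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # OpenGraph uses property="og:..."; Twitter cards use name="twitter:...".
        key = attr.get("property") or attr.get("name") or ""
        content = attr.get("content")
        if content and key.startswith(("og:", "twitter:")):
            self.meta[key] = content


def extract_metadata(html: str) -> dict[str, str]:
    """Return the og:* and twitter:* metadata found in an HTML string."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta
```

Other `<meta>` tags, such as `viewport`, are ignored, so the result contains only the keys an archive would want to carry over.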
Create intelligent visualizations from Markdown content:
- Auto-Analysis: Identifies concepts, workflows, timelines, and relationships from text
- Multiple Diagram Types: Generates flowcharts, timelines, mind maps, Sankey diagrams, and more
- Contextual Output: Each visualization includes relevant source text and explanations
- Batch Processing: Creates comprehensive visualization sets from single documents
- Claude Code Integration: Available as a custom /mermaid command
Convert Mermaid diagrams to high-quality images:
- Professional Quality: Uses official Mermaid CLI for production-grade rendering
- Multiple Formats: Export as PNG, SVG, or PDF with configurable themes and dimensions
- Batch Processing: Convert multiple diagrams from a single markdown file
- Organized Output: Sequential naming and proper folder structure
- Theme Support: Default, dark, forest, neutral, and base themes available
- Custom Configuration: Configurable via config.yml for dimensions, themes, and output settings
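The batch conversion and sequential naming described above can be sketched as follows. This only plans the work: it finds each Mermaid block in a Markdown file and pairs it with an output name in the `<source-stem>-NN.<format>` pattern that appears in this README's sample output. The regular expression and the `plan_outputs` helper are assumptions for illustration, and the actual rendering (done by the Mermaid CLI) is left out.

```python
import re
from pathlib import Path

# Matches a fenced block that opens with ```mermaid and closes with ```.
MERMAID_BLOCK = re.compile(
    r"^```mermaid[ \t]*\n(.*?)^```[ \t]*$", re.DOTALL | re.MULTILINE
)


def plan_outputs(markdown_path: str, text: str,
                 fmt: str = "png") -> list[tuple[str, str]]:
    """Pair each Mermaid block in `text` with a sequential output file name.

    Hypothetical helper: returns (file_name, diagram_source) tuples and
    does not render anything.
    """
    stem = Path(markdown_path).stem
    blocks = MERMAID_BLOCK.findall(text)
    return [
        (f"{stem}-{index:02d}.{fmt}", block.strip())
        for index, block in enumerate(blocks, start=1)
    ]
```

Each returned diagram source could then be written to a temporary file and handed to the Mermaid CLI for rendering.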
- Python 3.13+
- uv package manager
- Optional: Mermaid CLI for image conversion
# Clone the repository
git clone https://github.com/manavsehgal/claude-code-analyst.git
cd claude-code-analyst
# Install dependencies with uv
uv sync
# Convert any web article
uv run python scripts/article_to_md.py https://example.com/article
# Convert local HTML file
uv run python scripts/article_to_md.py /path/to/local/file.html
# Specify custom output directory
uv run python scripts/article_to_md.py https://example.com/article --output-dir my-articles
Output Structure:
markdown/
└── article-title-kebab-case/
    ├── article.md          # Clean Markdown with YAML frontmatter
    └── images/             # Preserved images
        ├── image1.jpg
        └── image2.png
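The output folder above is named after the article title in kebab case. One plausible way to derive such a name is sketched below; `to_kebab_case`, its `max_length` cap, and the `"untitled"` fallback are assumptions for illustration, not the converter's actual rules.

```python
import re
import unicodedata


def to_kebab_case(title: str, max_length: int = 80) -> str:
    """Turn an article title into a filesystem-friendly kebab-case slug.

    Hypothetical helper: the real tool may slugify titles differently.
    """
    # Decompose accented characters, then drop anything outside ASCII.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower()).strip("-")
    # Cap the length and avoid a trailing hyphen after truncation.
    return slug[:max_length].rstrip("-") or "untitled"
```

For instance, the title "Understanding Neural Networks" would map to the folder name `understanding-neural-networks`.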
# Download complete HTML archive
uv run python scripts/html_downloader.py https://example.com/article
# Custom output directory
uv run python scripts/html_downloader.py https://example.com/article --output-dir archives
# Skip robots.txt check (use responsibly)
uv run python scripts/html_downloader.py https://example.com/article --skip-robots
Output Structure:
html/
└── article-title-kebab-case/
    ├── index.html          # Self-contained HTML document
    └── images/             # All downloaded images
        ├── diagram1.png
        └── chart2.svg
# In Claude Code, use the custom command
/mermaid markdown/article-title/article.md
Output Structure:
mermaid/
└── article-title/
    ├── 01-timeline.md
    ├── 02-flowchart.md
    ├── 03-relationships.md
    └── README.md
# Convert Mermaid diagrams to high-quality images
uv run python scripts/mermaid_to_image.py mermaid/article-title/01-timeline.md --format png --theme dark
# Batch convert all diagrams in a file
uv run python scripts/mermaid_to_image.py mermaid/article-title/workflow.md --format svg
Output Structure:
visualizations/
└── article-title/
    ├── 01-timeline-01.png
    ├── 02-flowchart-01.svg
    └── 03-relationships-01.pdf
# Step 1: Create HTML archive for clean reading
uv run python scripts/html_downloader.py https://research-paper.com/ai-study
# Step 2: Create Markdown for text analysis
uv run python scripts/article_to_md.py https://research-paper.com/ai-study
# Step 3: Generate visualizations (in Claude Code)
/mermaid markdown/ai-study/article.md
# Step 4: Convert diagrams to presentation-ready images
uv run python scripts/mermaid_to_image.py mermaid/ai-study/01-workflow.md --format png --theme dark
# Result: Complete research package with readable archive, processable text,
# and visual insights with presentation-ready images
# For offline documentation that preserves original styling
uv run python scripts/html_downloader.py https://docs.example.com/api-guide --output-dir documentation
# For portable markdown documentation
uv run python scripts/article_to_md.py https://docs.example.com/api-guide --output-dir documentation
| Guide | Description |
|---|---|
| Article Converter Guide | Complete guide for Markdown conversion tool |
| HTML Downloader Guide | Comprehensive HTML archiving tool documentation |
| Mermaid Generator Guide | Creating visualizations with Claude Code |
| CLAUDE.md | Claude Code configuration and development settings |
| Documentation Index | All available documentation |
---
title: "Understanding Neural Networks"
source_url: https://example.com/neural-networks
article_date: 2024-12-15
date_scraped: 2024-12-20
word_count: 2847
image_count: 12
---
# Understanding Neural Networks
Article content with preserved formatting and ...
<!DOCTYPE html>
<html lang="en">
<head>
<!-- Comprehensive metadata preservation -->
<meta name="source-url" content="https://original-url.com">
<meta property="og:title" content="Article Title">
<meta name="twitter:card" content="summary_large_image">
<!-- Embedded responsive styling -->
<style>
body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI'... }
img { max-width: 100%; height: auto; }
</style>
</head>
<body>
<!-- Clean, readable content with local image references -->
<img src="images/local-diagram.png" alt="Diagram">
</body>
</html>
timeline
title Evolution of AI Models
2017 : Transformer Architecture
: Attention Mechanism
2020 : GPT-3 Release
: 175B Parameters
2023 : ChatGPT Launch
: Consumer AI Era
claude-code-analyst/
├── scripts/                    # Python tools and utilities
│   ├── article_to_md.py        # Web article to Markdown converter
│   ├── html_downloader.py      # HTML page archiving tool
│   └── mermaid_to_image.py     # Mermaid diagram to image converter
├── docs/                       # User guides and documentation
│   ├── README.md               # Documentation index
│   ├── article-to-md-guide.md
│   ├── html-downloader-guide.md
│   └── mermaid-visualization-guide.md
├── html/                       # HTML archives (generated)
│   └── article-title/
│       ├── index.html
│       └── images/
├── markdown/                   # Converted articles (generated)
│   └── article-title/
│       ├── article.md
│       └── images/
├── mermaid/                    # Visualizations (generated)
│   └── article-title/
│       └── *.md
├── visualizations/             # Generated images (from mermaid_to_image.py)
│   └── article-title/
│       ├── diagram-01.png
│       ├── chart-02.svg
│       └── flow-03.pdf
├── projects/                   # Analysis projects
├── transcripts/                # Video transcripts
├── backlog/                    # Project planning
│   └── active-backlog.md
├── tests/                      # Test suite
├── .claude/                    # Claude Code custom commands
│   └── commands/
│       ├── mermaid.md          # Mermaid visualization generator
│       └── readme.md           # README generation command
├── config.yml                  # Configuration file
├── CLAUDE.md                   # Claude Code configuration
├── pyproject.toml              # Project dependencies
└── README.md                   # This file
# Install all dependencies
uv sync
# Install development dependencies
uv sync --dev
# Run tests
uv run pytest tests/
# Code quality checks
uv run ruff check .
uv run black .
uv run mypy .
- Follow PEP 8 guidelines
- Use type hints for all functions
- Write comprehensive docstrings
- Maintain test coverage
- Respect robots.txt and website terms of service
| Package | Purpose |
|---|---|
| requests | Web fetching and HTTP handling |
| beautifulsoup4 | HTML parsing and manipulation |
| markdownify | HTML to Markdown conversion |
| readability-lxml | Article content extraction |
| mermaid-mcp | Mermaid diagram processing and image conversion |
| pyyaml | Configuration file handling |
| ruff | Fast Python linting |
| black | Code formatting |
| mypy | Static type checking |
| pytest | Testing framework |
- Academic Papers: Archive research papers as HTML for citation and clean Markdown for analysis
- Literature Reviews: Convert multiple sources to consistent formats for comparative analysis
- Reference Management: Build structured knowledge bases with metadata preservation
- Technical Documentation: Convert API docs to portable Markdown or preserve as styled HTML
- Team Knowledge Base: Archive important articles and resources for offline access
- Competitive Intelligence: Analyze competitor content and track changes over time
- News Archiving: Preserve news articles before they change or disappear
- Content Migration: Move content between platforms while maintaining formatting
- Fact Checking: Create timestamped archives of web content for verification
- Market Research: Archive industry reports and analysis
- Competitive Analysis: Track competitor announcements and strategy documents
- Compliance: Maintain records of regulatory content and policy changes
- Strategic Planning: Visualize business processes and strategies from archived content
- Article to Markdown conversion with metadata (web URLs and local HTML files)
- HTML page archiving with image preservation and enhanced Substack support
- Mermaid visualization generation (Claude Code integration)
- Mermaid to image conversion (PNG, SVG, PDF export)
- Comprehensive documentation and user guides
- Professional development tooling (ruff, black, mypy, pytest)
- PDF article processing support
- Batch processing multiple URLs with progress tracking
- Custom CSS themes for HTML archives
- Export to additional formats (JSON, CSV, EPUB)
- Enhanced metadata extraction (author detection, category classification)
- API endpoint for programmatic access
- Video/audio content transcription and processing
- Archive compression (ZIP/TAR formats)
- Integration with more visualization formats beyond Mermaid
- Chrome/Firefox browser extension for one-click archiving
- Cloud storage integration (S3, Google Drive, Dropbox)
We welcome contributions! Please follow these guidelines:
- Fork the repository and create a feature branch
- Follow PEP 8 and add comprehensive type hints
- Write tests for new functionality (aim for >80% coverage)
- Update documentation for any user-facing changes
- Respect ethical guidelines - ensure tools are used responsibly
- Test thoroughly with various website types and edge cases
# 1. Setup development environment
git clone https://github.com/manavsehgal/claude-code-analyst.git
cd claude-code-analyst
uv sync --dev
# 2. Create feature branch
git checkout -b feature/amazing-feature
# 3. Make changes and test
uv run pytest tests/
uv run ruff check .
uv run black .
# 4. Commit and push
git commit -m 'Add amazing feature'
git push origin feature/amazing-feature
# 5. Open Pull Request
See CLAUDE.md for detailed development guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
This toolkit is designed for legitimate research, documentation, and analysis purposes. Please use responsibly:
- Respect robots.txt and website terms of service
- Don't overload servers - use reasonable delays between requests
- Respect copyright - maintain proper attribution and don't republish without permission
- Be transparent - the tools identify themselves with appropriate User-Agent strings
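For readers writing their own scripts around these tools, the robots.txt guideline above can be sketched with the standard library's `urllib.robotparser`. The example parses robots.txt text that has already been fetched, so it runs offline; the `is_allowed` helper and the `example-bot` user agent are hypothetical, and the project's scripts may perform this check differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str,
               user_agent: str = "example-bot") -> bool:
    """Return True if `robots_txt` permits `user_agent` to fetch `url`.

    Hypothetical helper: expects the robots.txt content as a string,
    leaving the download of that file to the caller.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines and records that rules were loaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A caller would typically run this check once per site before requesting any page and pause between requests (for example with `time.sleep`) to avoid overloading the server.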
- Mozilla Readability for content extraction algorithms
- Mermaid.js for beautiful diagram rendering
- Claude Code for AI-powered development capabilities
- uv for modern Python package management
- BeautifulSoup for robust HTML parsing
- Documentation: Check the comprehensive guides for detailed instructions
- Bug Reports: Use the GitHub issue tracker
- Feature Requests: Join discussions in the community forum
- Claude Code: Integrated custom commands for seamless workflow
Built with ❤️ for the Claude Code community