✅ Stage 0: Project Setup & Infrastructure

✅ Development Environment Setup

  • Initialise GitHub repository
  • Set up branch protection
  • Resolve naming issues and override branch protection for admins
  • Create dev/prod branches
  • Set up local development environment
  • Add initial documentation

✅ Go Project Structure

  • Initialise Go project
  • Set up dependency management
  • Create project structure
  • Add basic configs
  • Set up testing framework

✅ Production Infrastructure Setup

  • Set up dev/prod environments
  • Configure environment variables
  • Set up secrets management
  • Create Dockerfile and container setup
  • Configure Fly.io
    • Set up Fly.io account and project
    • Configure deployment settings
    • Set up environment variables in Fly.io
    • Create deployment workflow
    • Add health check endpoint monitoring
  • Test production deployment
  • Initial Sentry.io connection

✅ Stage 1: Core Setup & Basic Crawling

✅ Core API Implementation

  • Initialise Go project structure and dependencies
  • Set up basic API endpoints
  • Set up environment variables and configs
  • Implement basic health checks and monitoring
  • Add basic error monitoring with Sentry
  • Set up endpoint performance tracking
  • Add graceful shutdown handling
  • Implement configuration validation

✅ Enhance Crawler Results

  • Set up Colly crawler configuration
  • Implement concurrent crawling logic
  • Add basic error handling
  • Add rate limiting (fixed client IP detection)
  • Add retry logic
  • Handle different response types/errors
  • Implement cache validation checks
  • Add crawler-specific error tracking
  • Set up crawler performance monitoring

✅ Set up Turso for storing results

  • Design database schema
  • Set up Turso connection and config
  • Implement data models and queries
  • Add basic error handling
  • Add retry logic
  • Add database performance monitoring
  • Set up query error tracking
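
The retry logic listed for the crawler and database layers can be sketched roughly as below. This is a minimal illustration, not the project's implementation: the `retry` helper, the `errTransient` sentinel and the doubling delay are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient marks failures worth retrying (hypothetical sentinel for this sketch).
var errTransient = errors.New("transient failure")

// retry calls op up to attempts times, doubling the delay after each
// transient failure. Any other error is returned immediately.
func retry(attempts int, base time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !errors.Is(err, errTransient) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient // simulate two transient failures
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Wrapping the last error with `%w` keeps it inspectable by callers, which helps when the failure is forwarded to error tracking.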

✅ Stage 2: Multi-domain Support & Job Queue Architecture

✅ Job Queue Architecture

  • Design job and task data structures
  • Implement persistent job storage in database
  • Create worker pool for concurrent URL processing
  • Add job management API (create, start, cancel, status)
  • Implement database retry logic for job operations to handle transient errors
  • Enhance error reporting and monitoring
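
A worker pool for concurrent URL processing could look something like the following sketch. The `Result` type, `runPool` function and the `process` callback are hypothetical names for illustration; the real pool also persists tasks in the database, which is left out here.

```go
package main

import (
	"fmt"
	"sync"
)

// Result pairs a URL with the outcome of processing it.
type Result struct {
	URL string
	Err error
}

// runPool processes urls with a fixed number of workers and returns one
// Result per URL. Result order is not guaranteed.
func runPool(urls []string, workers int, process func(string) error) []Result {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	results := make(chan Result)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				results <- Result{URL: u, Err: process(u)}
			}
		}()
	}

	// Feed the queue, then close it so workers exit their range loops.
	go func() {
		for _, u := range urls {
			jobs <- u
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	out := make([]Result, 0, len(urls))
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	urls := []string{"https://example.com/", "https://example.com/about"}
	for _, r := range runPool(urls, 2, func(string) error { return nil }) {
		fmt.Println(r.URL, r.Err)
	}
}
```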

✅ Sitemap Integration

  • Implement sitemap.xml parser
  • Add URL filtering based on path patterns
  • Handle sitemap index files
  • Process multiple sitemaps
  • Implement robust URL normalisation in sitemap processing
  • Add improved error handling for malformed URLs

✅ Link Discovery & Crawling

  • Extract links from crawled pages
  • Filter links to stay within target domain
  • Basic link discovery logic
  • Queue discovered links for processing
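
As a rough sketch of the link filtering described above, the function below resolves each `href` against the page URL, drops fragment-only anchors and keeps only links on the same host. The name `sameDomainLinks` is hypothetical, and the exact-host comparison is an assumption (it treats `www.example.com` and `example.com` as different sites).

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// sameDomainLinks resolves hrefs against pageURL and returns the unique
// http(s) links whose host matches the page's host.
func sameDomainLinks(pageURL string, hrefs []string) ([]string, error) {
	page, err := url.Parse(pageURL)
	if err != nil {
		return nil, fmt.Errorf("parse page URL: %w", err)
	}
	seen := map[string]bool{}
	var out []string
	for _, h := range hrefs {
		h = strings.TrimSpace(h)
		if h == "" || strings.HasPrefix(h, "#") {
			continue // in-page anchor, nothing new to crawl
		}
		ref, err := url.Parse(h)
		if err != nil {
			continue // skip malformed links rather than failing the page
		}
		u := page.ResolveReference(ref)
		if u.Scheme != "http" && u.Scheme != "https" {
			continue // mailto:, tel:, javascript: and similar
		}
		if !strings.EqualFold(u.Hostname(), page.Hostname()) {
			continue // off-domain
		}
		u.Fragment = ""
		if s := u.String(); !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out, nil
}

func main() {
	links, err := sameDomainLinks("https://example.com/blog/", []string{
		"#top", "/pricing", "post-1#comments", "https://other.com/x", "/pricing",
	})
	fmt.Println(links, err)
}
```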

✅ Job Management API

  • Create job endpoints (create/list/get/cancel)
  • Add progress calculation and reporting
  • Store recent crawled pages in job history
  • Implement multi-domain support

✅ Stage 3: PostgreSQL Migration & Performance Optimisation

✅ Fly.io Production Setup

  • Set up production environment on Fly.io
  • Deploy and test rate limiting in production
  • Configure auto-scaling rules
  • Set up production logging
  • Implement monitoring alerts
  • Configure backup strategies (Supabase handles automatically)

✅ Performance Optimisation

  • Implement caching layer
  • Optimise database queries
  • Configure rate limiting with proper client IP detection
  • Add performance monitoring
  • Decided to switch to PostgreSQL at this point

✅ PostgreSQL Migration

✅ PostgreSQL Setup and Infrastructure

  • Set up PostgreSQL on Fly.io
    • Create database instance
    • Configure connection settings
    • Configure security settings

✅ Database Layer Replacement

  • Implement PostgreSQL schema
    • Convert SQLite schema to PostgreSQL syntax
    • Add proper indexes
    • Implement connection pooling
  • Replace database access layer
    • Update db package to use PostgreSQL
    • Add health checks and monitoring
    • Implement efficient error handling

✅ Task Queue and Worker Redesign

  • Implement PostgreSQL-based task queue
    • Use row-level locking with SELECT FOR UPDATE SKIP LOCKED
    • Optimise for concurrent access
    • Plan task prioritisation implementation (docs created)
  • Redesign worker pool
    • Create single global worker pool
    • Implement optimised task acquisition

✅ URL Processing Improvements

  • Enhanced sitemap processing
    • Implement robust URL normalisation
    • Add support for relative URLs in sitemaps
    • Improve error handling for malformed URLs
  • Improve URL validation
    • Better handling of URL variations
    • Consistent URL formatting throughout the codebase
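
One way to handle relative and inconsistently formatted sitemap entries is to resolve each entry against the sitemap's own URL and then canonicalise the result. The sketch below uses only `net/url`; the function name and the specific rules (lowercase host, strip fragment, default path `/`) are assumptions, not a description of the project's normaliser.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normaliseSitemapURL resolves raw against the sitemap's URL and returns
// a canonical absolute http(s) URL, or an error for unusable entries.
func normaliseSitemapURL(sitemapURL, raw string) (string, error) {
	base, err := url.Parse(sitemapURL)
	if err != nil {
		return "", fmt.Errorf("parse sitemap URL: %w", err)
	}
	ref, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", fmt.Errorf("parse entry %q: %w", raw, err)
	}
	u := base.ResolveReference(ref) // handles relative entries such as "/about"
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q in %q", u.Scheme, raw)
	}
	if u.Host == "" {
		return "", fmt.Errorf("missing host in %q", raw)
	}
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	if u.Path == "" {
		u.Path = "/"
	}
	return u.String(), nil
}

func main() {
	for _, raw := range []string{"/about", " https://Example.com ", "mailto:hi@example.com"} {
		got, err := normaliseSitemapURL("https://example.com/sitemap.xml", raw)
		fmt.Printf("%q -> %q (err: %v)\n", raw, got, err)
	}
}
```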

✅ Code Refactoring

  • Eliminate duplicate code
    • Move database operations to a unified interface
    • Consolidate similar functions into single implementations
    • Move functions to appropriate packages
  • Remove global state
    • Implement proper dependency injection
    • Replace global DB instance with passed parameters
    • Improve transaction management with DbQueue
  • Standardise naming conventions
    • Use consistent function names across packages
    • Clarify responsibilities between packages

✅ Code Cleanup

  • Remove redundant worker pool creation
    • Eliminate duplicate worker pools in API handlers
    • Ensure single global worker pool is used consistently
  • Simplify middleware stack
    • Reduce excessive transaction monitoring
    • Optimise Sentry integrations
    • Remove unnecessary wrapping functions
  • Clean up API endpoints
    • Document endpoints to consolidate or remove
    • Plan endpoint implementation simplification
    • Standardise error handling approach
    • Implementation plan completed in docs/plans/api-cleanup.md
  • Fix metrics collection (plan created)
    • Document metrics to expose
    • Plan for unused metrics tracking removal
    • Identify relevant PostgreSQL metrics to add
  • Remove depth functionality
    • Remove depth column from tasks table
    • Remove max_depth column from jobs table
    • Update EnqueueURLs function to remove depth parameter
    • Update type definitions to remove depth fields
    • Remove depth-related logic from link discovery process
    • Update documentation to remove depth references

✅ Final Transition

  • Update core endpoints to use new implementation
  • Remove SQLite-specific code
  • Clean up dependencies and imports
  • Update configuration and documentation

🟡 Stage 4: Core Authentication & MVP Interface

✅ Implement Supabase Authentication

  • Configure Supabase Auth settings
  • Implement JWT validation middleware in Go
  • Add social login providers configuration (Google, Facebook, Slack, GitHub, Microsoft, Figma, LinkedIn + Email)
  • Set up user session handling and token validation
  • Implement comprehensive auth error handling
  • Create user registration with auto-organisation creation
  • Configure custom domain authentication (hover.auth.goodnative.co)
  • Implement account linking for multiple auth providers per user (handled by Supabase Auth via auth.identities table)

✅ Connect user data to PostgreSQL

  • Design user data schema with Row Level Security
  • Implement user profile storage
  • Add user preferences handling
  • Configure PostgreSQL policies for data access
  • Create database operations for users and organisations

✅ Simple Organisation Sharing

Organisation model implemented:

  • Auto-create organisation when user signs up
  • Create shared access to all jobs/tasks/reports within organisation

✅ API-First Architecture Development (Completed v0.4.2)

  • Comprehensive RESTful API Infrastructure
    • Standardised response format with request IDs and consistent error handling
    • Interface-agnostic RESTful endpoints (/v1/* structure)
    • Comprehensive middleware stack (CORS, logging, rate limiting)
    • Proper HTTP status codes and structured error responses
  • Multi-Interface Authentication Foundations
    • JWT-based authentication with Supabase integration
    • Authentication middleware for protected endpoints
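
A standardised response format with request IDs could take a shape like the one below. The field names (`status`, `request_id`, `data`, `error`) and the helpers are assumptions for illustration; the actual `/v1/*` envelope may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// apiError is the structured error part of a response (assumed shape).
type apiError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// envelope wraps every response so clients can rely on one structure.
type envelope struct {
	Status    string    `json:"status"`
	RequestID string    `json:"request_id"`
	Data      any       `json:"data,omitempty"`
	Error     *apiError `json:"error,omitempty"`
}

// errorEnvelope builds the body for a failed request.
func errorEnvelope(requestID, code, message string) envelope {
	return envelope{
		Status:    "error",
		RequestID: requestID,
		Error:     &apiError{Code: code, Message: message},
	}
}

// writeJSON sends body with the given HTTP status code.
func writeJSON(w http.ResponseWriter, status int, body envelope) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(body)
}

func main() {
	rec := httptest.NewRecorder()
	writeJSON(rec, http.StatusNotFound, errorEnvelope("req-123", "NOT_FOUND", "job not found"))
	fmt.Print(rec.Code, " ", rec.Body.String())
}
```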

✅ MVP Interface Development (Completed v0.5.3)

  • Dashboard Demonstration Infrastructure
    • Working vanilla JavaScript dashboard with modern UI design
    • API integration for job statistics and progress tracking (/v1/dashboard/stats, /v1/jobs)
    • Stable production deployment without Web Components dependencies
    • Responsive design with professional styling and user experience
  • Template + Data Binding Foundation
    • Architecture documentation for template-based integration approach
    • Attribute-based event handling system (gnh-action, gnh-data-*)
    • Event delegation framework for extensible functionality
    • Demonstration of template approach in production dashboard

🟡 Template + Data Binding Implementation (Completed v0.5.5)

  • Core Data Binding Library
    • Basic attribute-based event handling (gnh-action="refresh-dashboard")
    • JavaScript library for data-gnh-bind attribute processing
    • Template engine for data-gnh-template repeated content
    • Authentication integration with conditional element display (data-gnh-auth)
    • Form handling with data-gnh-form and validation (data-gnh-validate)
    • Style and attribute binding (data-gnh-bind-style, data-gnh-bind-attr)
  • Enhanced Job Management
    • Real-time job progress updates via data binding
    • Job creation forms with template-based validation
    • Error handling and user feedback systems
    • Advanced filtering and search capabilities
  • User Experience Features
    • Account settings and profile management templates
    • Notification system integration

✅ Task prioritisation & URL processing

  • Stop duplicate domain crawls running concurrently; close the old job

    • When creating a job, check if there's an active job for this user
    • If so, cancel the old job
  • Task Prioritisation

    • Prioritisation by page hierarchy and importance
    • Implement link priority ordering for header links (1st: 1.000, 2nd: 0.990, etc.)
    • Apply priority ordering logic to all discovered page links
  • Robots.txt Compliance

    • Parse and honour robots.txt crawl-delay directives
    • Filter URLs against Disallow/Allow patterns before enqueueing
    • Cache robots.txt rules at job level to prevent repeated fetches
    • Fail manual URL creation if robots.txt cannot be checked
    • Filter dynamically discovered links against robots rules
  • URL Processing Enhancements

    • Filter out links that are hidden via inline style attributes
    • Remove anchor links from link discovery
    • Support compressed sitemaps (.xml.gz and other formats)
    • If a sitemap can't be found, set up the job with the / page and start as normal, finding links through pages
    • Only store source_url if the page was found on a page, and redirect_url only if it's a redirect that doesn't match the domain/path of the task
  • Consider the impact of the Go v1.25 release and plan updates

  • Blocking Avoidance

    • Series of tweaks to reduce blocking
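
The header link priority ordering above (1st: 1.000, 2nd: 0.990, etc.) can be illustrated with a small function. The sketch assumes the sequence continues in fixed 0.010 steps and is floored at zero, which the roadmap does not state; `headerLinkPriority` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math"
)

// headerLinkPriority returns the priority for the link at the given
// zero-based position: 1.000 for the first, 0.990 for the second, and
// so on. The fixed 0.010 step and the floor at 0 are assumptions.
func headerLinkPriority(position int) float64 {
	if position < 0 {
		position = 0
	}
	p := 1.0 - 0.01*float64(position)
	if p < 0 {
		return 0
	}
	// Round to three decimals to avoid floating-point drift.
	return math.Round(p*1000) / 1000
}

func main() {
	for i := 0; i < 4; i++ {
		fmt.Printf("link %d: %.3f\n", i+1, headerLinkPriority(i))
	}
}
```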

✅ Recurring Job Scheduling (Completed v0.18.0)

  • Scheduler System Implementation
    • Database schema with schedulers table and scheduler_id foreign key
    • Support for 6, 12, 24, and 48-hour intervals
    • Background service polls for ready schedules every 30 seconds
    • Jobs created from schedulers marked with source_type='scheduler'
    • Scheduler management API endpoints (create, update, delete, list)
    • Dashboard UI for managing schedules (enable/disable, view jobs, delete)
    • Schedule dropdown in job creation modal for optional recurring schedules
    • Comprehensive error handling with structured logging
    • Input validation and rollback logic for failed operations
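
The interval validation and "ready" check behind the scheduler might look roughly like this. The function names and the rule "due when now is at or past last run plus interval" are assumptions for illustration; the background service would call something like `isDue` on each 30-second poll.

```go
package main

import (
	"fmt"
	"time"
)

// allowedIntervals lists the supported schedule intervals in hours.
var allowedIntervals = map[int]bool{6: true, 12: true, 24: true, 48: true}

// exampleStart is a fixed timestamp used by the demo in main.
var exampleStart = time.Date(2025, 1, 1, 0, 0, 0, 0, time.UTC)

// nextRun returns when a schedule should next fire, rejecting
// intervals outside the supported set.
func nextRun(lastRun time.Time, intervalHours int) (time.Time, error) {
	if !allowedIntervals[intervalHours] {
		return time.Time{}, fmt.Errorf("unsupported interval: %dh", intervalHours)
	}
	return lastRun.Add(time.Duration(intervalHours) * time.Hour), nil
}

// isDue reports whether a schedule is ready to create a job at now.
func isDue(now, lastRun time.Time, intervalHours int) (bool, error) {
	next, err := nextRun(lastRun, intervalHours)
	if err != nil {
		return false, err
	}
	return !now.Before(next), nil
}

func main() {
	now := exampleStart.Add(13 * time.Hour)
	for _, h := range []int{6, 12, 24, 5} {
		due, err := isDue(now, exampleStart, h)
		fmt.Printf("%dh: due=%v err=%v\n", h, due, err)
	}
}
```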

🟡 Webflow App Integration (Completed v0.23.0)

  • Webflow OAuth Connection
    • Register as Webflow developer and create App
    • OAuth flow with HMAC-signed state for CSRF protection
    • Token storage in Supabase Vault with automatic cleanup
    • User identity display via authorized_user:read scope
    • Dashboard UI showing connection status and username
    • Shared OAuth utilities extracted from Slack integration
  • Webflow Site Selection
    • List user's accessible Webflow sites via /v2/sites endpoint
    • Site picker UI in dashboard connections panel with search/pagination
    • Per-site settings stored in webflow_site_settings table
    • Connection management endpoints (list/get/delete)
  • Manual Job Triggering (Completed v0.24.0)
    • Jobs automatically triggered when schedule or auto-publish enabled
    • Jobs can be triggered via scheduler or webhooks
    • Show last crawl status (via general job list)
  • Scheduling Configuration
    • Connect Webflow sites to existing scheduler system
    • Schedule dropdown for recurring cache warming (None/6h/12h/24h/48h)
    • Per-site schedule management in dashboard
    • Automatic scheduler creation/update/deletion based on interval selection
  • Run on Publish (Webhooks)
    • "Auto-crawl on publish" toggle in site configuration
    • Register site_publish webhook with Webflow API (per-site control)
    • Webhook endpoint to receive publish events (org-scoped and legacy token-based)
    • Webhook signature verification (NOTE: Webflow v2 doesn't provide signatures yet)
    • Trigger cache warming job on verified publish events with auto_publish validation
    • Platform-org mapping for workspace-based webhook resolution
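
The HMAC-signed OAuth state used for CSRF protection can be sketched as below. The `payload.signature` layout, base64url encoding and function names are assumptions for illustration; a production payload would normally also carry a random nonce and an expiry, which are omitted here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strings"
)

// signState returns "<payload>.<signature>", both base64url-encoded,
// where the signature is HMAC-SHA256 over the raw payload.
func signState(secret []byte, payload string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(payload))
	enc := base64.RawURLEncoding
	return enc.EncodeToString([]byte(payload)) + "." + enc.EncodeToString(mac.Sum(nil))
}

// verifyState checks the signature and returns the original payload.
func verifyState(secret []byte, state string) (string, error) {
	p, s, ok := strings.Cut(state, ".")
	if !ok {
		return "", errors.New("malformed state")
	}
	enc := base64.RawURLEncoding
	payload, err := enc.DecodeString(p)
	if err != nil {
		return "", fmt.Errorf("decode payload: %w", err)
	}
	sig, err := enc.DecodeString(s)
	if err != nil {
		return "", fmt.Errorf("decode signature: %w", err)
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	// hmac.Equal compares in constant time.
	if !hmac.Equal(sig, mac.Sum(nil)) {
		return "", errors.New("state signature mismatch")
	}
	return string(payload), nil
}

func main() {
	secret := []byte("example-secret") // placeholder; load from secrets in practice
	state := signState(secret, "org-42")
	payload, err := verifyState(secret, state)
	fmt.Println(payload, err)
}
```

Because the state is self-verifying, the callback handler does not need server-side session storage to reject forged or tampered values.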

✅ Slack Integration (Completed v0.20.0)

  • Slack Application Development
    • OAuth flow for installing GNH Slack app to workspaces
    • Bot tokens stored securely in Supabase Vault
    • Auto-linking users to Slack workspaces via database triggers
    • Supabase Slack OIDC support for user authentication
  • Notification Delivery
    • Job completion notifications via Slack DMs
    • Error notifications when jobs fail
    • API endpoints for workspace management and user preferences

✅ Google Analytics 4 Integration (Completed)

  • OAuth Connection Setup (Steps 1-3)
    • Google OAuth 2.0 configuration and credentials
    • OAuth flow implementation with state token CSRF protection
    • Account and property selection functionality
    • Token storage in Supabase Vault with refresh logic
    • Database schema for user_ga_connections table
    • Dashboard UI for connecting/disconnecting GA4 properties
  • Analytics Data Retrieval (Step 4)
    • Implement GA4 Data API client (analyticsdata/v1beta)
    • Fetch recent visitor/view data for each page path
    • Query metric: screenPageViews only
    • Support for 7, 28, and 180-day lookback periods
    • Scheduled background sync service (opt-in per domain, no sync by default)
    • Token refresh mechanism for expired access tokens
  • Pages Table Integration (Step 5)
    • Add analytics columns to page_analytics table:
      • page_views_7d - Page views (last 7 days)
      • page_views_28d - Page views (last 28 days)
      • page_views_180d - Page views (last 180 days)
      • fetched_at - Timestamp of last GA sync
    • Atomic upsert logic to merge GA data with existing page records
  • Task Prioritisation Enhancement (Step 6)
    • Incorporate page view data into task priority calculation
    • Prioritise high-traffic pages for earlier cache warming
    • Automatically enabled when domain has linked GA account
  • Data Export Integration (Step 7)
    • Include page view metrics in CSV/JSON/Excel exports
    • Add columns: Views (7d), Views (28d), Views (180d)
    • Dashboard displays page view metrics alongside performance data

🎯 STAGE 5: MVP LAUNCH PREPARATION (Current)

5.0 Finalise outstanding actions above

  • GA
  • Account settings / management (settings page operational — billing awaits Paddle in 5.2)

5.1: Webflow Job Triggering & Polish

  • Trigger immediate job when schedule or auto-publish enabled
  • Extension Development
    • Build Webflow Designer Extension using Designer Extension SDK
    • Implement site health metrics display (broken links, slow pages)
    • Add job management interface (view status, trigger crawls)
    • Configuration panel for schedule and webhook settings
  • Integration & Testing
    • Connect extension to GNH API via OAuth
    • Test extension in Webflow Designer workspace
    • Handle error states and loading indicators

5.2: Payment Infrastructure

  • Paddle Integration
    • Set up Paddle account and configuration
    • Implement subscription webhooks and payment flow
    • Create subscription plans and checkout process
  • Subscription Management
    • Link subscriptions to organisations
    • Handle subscription updates and plan changes
    • Add subscription status checks
  • Usage Tracking & Quotas
    • Implement usage counters and basic limits
    • Set up usage reporting functionality
    • Implement organisation-level usage quotas

5.3: Branding & UI Cleanup

  • Visual Design System
    • Define colour palette, typography, spacing scales
    • Create reusable CSS variables and utility classes
    • Design logo and favicon assets
  • Dashboard Redesign & Polish
    • Ensure responsive layout is core to everything
    • Optimise elements for dashboard vs. Webflow designer App
    • Improve nav bar, settings & notifications layout
    • Improve layout consistency and visual hierarchy
    • Refine job cards, status indicators, and data tables
    • Add loading states, empty states, and transitions
  • Error States & Messaging
    • Design clear error messages and recovery suggestions
    • Improve validation feedback for forms
    • Create consistent notification system
  • Onboarding Flow
    • Quick start flow - Crawl domain & create account
      • Marketing page
      • Webflow App + auth Webflow, set schedule, publish
    • Welcome screen for new users - tick box/dismiss cards
      • Quick start guide or tooltip tour
      • Crawl domain, create a schedule
      • Explain plans & update if required
      • View results, export slow and error pages
      • Integrate steps GA, Slack, Webflow

5.4: Marketing Page

  • Marketing Infrastructure
    • Simple Webflow marketing page with product explanation
    • Basic navigation structure and call-to-action
      • Quick crawl & account creation
    • User documentation and help resources
    • Landing pages
      • Cache warmer - make your site load faster
      • Load speed - find slow pages
      • Broken links - find the important ones
      • Integrations - Slack, Webflow, Google Analytics
    • Pricing page with subscription tiers

5.5: Webflow Marketplace Submission

Full details in Webflow Marketplace

  • Marketplace Preparation
    • Complete Webflow App listing (description, screenshots, demo video)
    • Prepare support documentation and setup guide
    • Create terms of service and privacy policy
  • Submission & Approval
    • Submit app to Webflow marketplace for review
    • Address feedback and make required changes
    • Obtain marketplace approval

5.6: Pre-Launch Polish & Testing

  • Alpha Testing
    • Internal testing with team members
    • Beta testing with 3-5 friendly Webflow users
    • Collect feedback and address critical issues
  • Security & Compliance
    • Final security audit of authentication flows
    • Review RLS policies and data isolation
    • Confirm GDPR/privacy compliance basics
  • Responsive Design Cleanup
    • Audit all pages/layouts at mobile (<480px), tablet (480-960px), and desktop (960px+) breakpoints
    • Fix dashboard, settings, job details, and nav for small screens
    • Test integration panels (Webflow sites grid, member lists, GA properties) at all breakpoints
    • Ensure forms, modals, and toast notifications work on touch devices

5.7: Launch & First Customers

  • Soft Launch
    • Make app available to first 10 users
    • Monitor system performance and error rates
    • Provide responsive support to early adopters
  • Iterative Improvements
    • Gather user feedback on critical issues
    • Address bugs and usability problems
    • Track key metrics (signup rate, job success, retention)

⚪ Stage 6: Post-MVP Enhancements

🔴 WordPress Integration

  • WordPress Plugin Development
    • Create basic WordPress plugin for Hover
    • Plugin configuration interface for domain settings
    • Display crawl results and statistics in WordPress admin
    • Trigger manual crawls from WordPress dashboard
  • WordPress.org Submission
    • Prepare plugin listing and screenshots
    • Submit plugin to WordPress plugin directory
    • Address review feedback and obtain approval

🔴 Shopify Integration

  • Shopify App Development
    • OAuth integration with Shopify
    • Embedded app interface for store owners
    • Display site health metrics in Shopify admin
    • Automatic crawl triggers on theme publish
  • Shopify App Store Submission
    • Complete app listing with demo and screenshots
    • Submit to Shopify App Store for review
    • Address feedback and obtain approval

Slack enhancements

  • Slash commands (/crawl sitedomain.com)
  • Threading with progress updates
  • Interactive message actions

🔴 Multi-Platform Authentication Architecture

  • Organisation-Based Data Model (Completed v0.19.0)
    • Implement many-to-many user-organisation relationships
    • Create organisation context switching logic
    • Implement data isolation between organisations
    • Add store/site entity linked to single organisation
  • Platform Authentication Adapters
    • Shopify OAuth and session management
    • WordPress API key integration
    • Map platform stores/sites to BB organisations
    • Progressive account creation for platform users
  • Unified User System
    • Single BB user accessible via multiple platforms
    • Platform context determines visible organisation
    • Shadow accounts for store staff (auto-created on action)
    • Account claiming and upgrade flows

🔴 Platform SDK Development

  • Core JavaScript SDK
    • Extract data-binding system into standalone library
    • Create platform-agnostic API client
    • Implement organisation context management
    • Add platform-specific authentication handlers
  • Platform Adapters
    • Shopify app bridge integration
    • WordPress plugin integration helpers
    • Platform-specific UI component adapters
    • Event handling for platform contexts

⚪ Stage 7: Scale & Advanced Features

🔴 Supabase Platform Integration

  • Real-time Features (See SUPABASE-REALTIME.md) - 60% COMPLETE
    • Real-time notification badge updates via Postgres Changes subscription (v0.20.1)
    • Real-time dashboard job list updates via WebSocket subscriptions (v0.20.1)
    • Real-time job detail progress updates with per-job subscriptions (v0.20.1)
    • Real-time dashboard stats without page refresh (requires API endpoint changes)
    • Live presence indicators for multi-user organisations
  • Database Optimisation
    • Move CPU-intensive analytics queries to PostgreSQL functions
    • Optimise task acquisition with database-side logic
    • Enhance Row Level Security policies for multi-tenant usage
    • Consolidate database connection settings into single configuration location and make them configurable via environment variables (internal/db/db.go:113-115)
  • Backend Simplification via Supabase (See supabase-simplification.md)
    • Phase 1: Migrate stuck job cleanup to pg_cron
      • Create run_job_cleanup() PostgreSQL function
      • Schedule with cron.schedule('job-cleanup', '* * * * *', ...)
      • Remove CleanupStuckJobs() from Go worker monitors (~100 lines)
    • Phase 2: Migrate notification delivery to Edge Functions
      • Create deliver-notification Edge Function
      • Update notify_job_status_change() trigger to call via pg_net
      • Remove Go notification listener and Slack delivery code (~451 lines)
      • Remove slack-go/slack dependency
  • File Storage & Edge Functions
    • Store crawler logs, sitemap caches, and error reports in Supabase Storage
    • Create Edge Functions for webhook handling and scheduled tasks
    • Handle Webflow publish events via Edge Functions
    • Add managed Postgres proxy in front of edge/serverless workloads to shield the primary pool

🔴 API & Integration Enhancements

  • API Client Libraries
    • Enhance core JavaScript client with advanced authentication
    • Create interface-specific adapters
    • Document API with OpenAPI specification
  • Webhook System
    • Implement webhook subscription for site_publish events
    • Verify webhook signatures using x-webflow-signature headers
    • Create webhook system for job completion notifications
  • API Key Management
    • Create API key system for integrations
    • Implement scoped permissions for different interfaces

🔴 Infrastructure & Operations

  • 1Password Secrets Management - Implementation Plan
    • Set up 1Password vault structure for Hover
    • Configure flyctl shell plugin for local development
    • Implement 1Password Service Account for GitHub Actions CI/CD
    • Migrate secrets from GitHub Secrets to 1Password
  • Database Management
    • Set up backup schedule and automated recovery testing
    • Implement data retention policies
    • Create comprehensive database health monitoring
    • Implement burst-protected connection classes (separate Supabase roles/DSNs for batch vs interactive traffic)
    • Introduce read replica routing with lag monitoring and primary fallbacks
    • Add tenant-level pool quotas with schema/role isolation to enforce fairness
  • Scheduling & Automation
    • Create configuration UI for scheduling options (completed v0.18.0)
    • Implement recurring job scheduler for 6/12/24/48 hour intervals (completed v0.18.0)
    • Background service checks for ready schedules every 30 seconds (completed v0.18.0)
    • Automatic cache warming based on Webflow publish events
  • Monitoring & Reporting
    • Fix completion percentage to reflect actual completed vs skipped tasks (not always 100%) (internal/db/db.go:404)
    • Publish OTEL metrics for connection pool saturation and wire Grafana alerts
    • Incident runbook and escalation checklist
    • Minimal status page for alpha
  • Frontend Architecture Consideration
    • Evaluate Vue/Svelte framework migration if dashboard exceeds 8000 LOC or team scaling requires modern framework (current: 4000 LOC vanilla JS with custom data binding, no build process - consider migration only if actual pain points emerge)

⚪ Stage 7: Feature Refinement & Launch Preparation

🔴 Security & Compliance

  • Core app functionality
    • Path inclusion/exclusion rules
    • Domain blocklist/allowlist for crawler (prevent crawling specific domains)
  • Enhanced Authentication
    • Test and refine multi-provider account linking
    • Member invitation system for organisations
  • Audit & Security Features
    • Secure admin endpoints properly with system_role authentication (internal/api/admin.go:11,25)
    • GDPR compliance features (data export, deletion audit trails)

🔴 Launch & Marketing

  • Marketing Infrastructure
    • Simple Webflow marketing page with product explanation
    • Basic navigation structure and call-to-action
    • User documentation and help resources
  • Launch Preparation
    • Complete marketplace submission process
    • Set up support channels and user onboarding
    • Implement usage analytics and tracking

🔴 Data Archiving & Retention

  • Implement two-tier data storage strategy
    • Use Supabase Storage for "hot" data (recent logs, debug files)
    • Implement Cloudflare R2 for "cold" storage of historical HTML page captures
    • Create automated Go job to handle data lifecycle (e.g., move files > 30 days to R2)
    • Update database to track storage location (hot/cold) for each archived file
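
The hot/cold decision in the lifecycle job could be as small as the sketch below. It takes the 30-day threshold from the example above; the `Tier` type and `tierForAge` function are hypothetical names, and the actual move to R2 is out of scope here.

```go
package main

import (
	"fmt"
	"time"
)

// Tier records where an archived file lives.
type Tier string

const (
	TierHot  Tier = "hot"  // Supabase Storage
	TierCold Tier = "cold" // Cloudflare R2
)

// coldAfter is the age beyond which files move to cold storage.
const coldAfter = 30 * 24 * time.Hour

// tierForAge returns the tier a file of the given age belongs in.
func tierForAge(age time.Duration) Tier {
	if age > coldAfter {
		return TierCold
	}
	return TierHot
}

func main() {
	created := time.Now().Add(-45 * 24 * time.Hour)
	fmt.Println(tierForAge(time.Since(created)))
}
```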

🟡 Alpha Data Retention

  • Retention policy for alpha
    • Auto-delete crawler logs and stored HTML older than 90 days

🔴 Content Storage & Change Tracking

  • Implement Semantic Hashing for change detection - Implementation Plan
    • Add content_hash and html_storage_path columns to tasks table
    • Add latest_content_hash column to pages table
    • Implement HTML parsing and canonical content extraction in Go worker
    • Store HTML in Supabase Storage only when semantic hash changes
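
The "store only when the hash changes" step might be sketched as follows. Real canonical content extraction would parse the HTML; here whitespace collapsing and lowercasing stand in for it, so this is plain content hashing over normalised text rather than a true semantic hash. The function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// contentHash returns a SHA-256 hex digest of text after collapsing
// whitespace and lowercasing. This is a stand-in for hashing the
// canonical content extracted from a page.
func contentHash(text string) string {
	canonical := strings.ToLower(strings.Join(strings.Fields(text), " "))
	sum := sha256.Sum256([]byte(canonical))
	return hex.EncodeToString(sum[:])
}

// shouldStore returns the new hash and whether it differs from the
// page's latest stored hash, in which case the HTML should be saved.
func shouldStore(latestHash, text string) (string, bool) {
	h := contentHash(text)
	return h, h != latestHash
}

func main() {
	h, changed := shouldStore("", "Hello   World\n")
	fmt.Println(changed)
	_, changed = shouldStore(h, "hello world")
	fmt.Println(changed)
}
```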

✅ Code Quality & Maintenance (Completed)

  • Increase Test Coverage - Implementation Plan
    • Set up Supabase test branch database infrastructure
    • Add testify testing framework
    • Create simplified test plan (Phase 1: 80-115 lines)
    • Implement Phase 1 tests (GetJob, CreateJob, CancelJob, ProcessSitemapFallback)
    • Implement integration tests (EnqueueJobURLs)
    • Implement unit tests with mocks (CrawlerInterface refactoring)
    • Enable Codecov reporting and Test Analytics
    • Set up CI/CD with Supabase pooler URLs for IPv4 compatibility
    • Fix test environment loading to use .env.test file
    • Reorganise testing documentation into modular structure
    • Fix critical test issues from expert review (P0/P1 priorities)
    • Implement sqlmock tests for database operations
    • Create comprehensive mock infrastructure (MockDB, DSN helpers)
    • Implement Comprehensive API Testing - ✅ COMPLETED
  • Code Quality Improvement - core quality gates now enforced in CI
    • Phase 1: Automated formatting and ineffectual assignments cleanup
    • Phase 2: Refactor high-complexity functions (processTask, processNextTask completed)
    • Add golangci-lint to CI/CD pipeline with Go 1.25 compatibility
    • Improve Go Report Card score from C to A

🔴 Robots.txt Compliance Auditing

  • Track and audit robots.txt filtering decisions
    • Add optional logging table for blocked URLs during job processing
    • Record URL, path, matching disallow pattern, and job context
    • Create admin endpoint to review filtering decisions
    • Add metrics for blocked vs allowed URL ratios per domain
    • Enable/disable audit logging per job for performance
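
Recording the matching disallow pattern could start from a helper like the one below. It is deliberately simplified: it does prefix matching only and ignores `Allow` rules, `*` wildcards and `$` anchors, so it is not a complete robots.txt matcher. The `BlockedURL` record and function name are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// BlockedURL is one audit row for a URL rejected by robots.txt rules
// (assumed shape).
type BlockedURL struct {
	JobID   string
	URL     string
	Path    string
	Pattern string
}

// matchingDisallow returns the longest Disallow prefix that matches
// path, and whether any rule matched. Empty rules are ignored.
func matchingDisallow(path string, disallow []string) (string, bool) {
	best := ""
	for _, rule := range disallow {
		if rule != "" && strings.HasPrefix(path, rule) && len(rule) > len(best) {
			best = rule
		}
	}
	return best, best != ""
}

func main() {
	rules := []string{"/private", "/admin", "/admin/"}
	if pattern, blocked := matchingDisallow("/admin/users", rules); blocked {
		entry := BlockedURL{
			JobID:   "job-1",
			URL:     "https://example.com/admin/users",
			Path:    "/admin/users",
			Pattern: pattern,
		}
		fmt.Printf("%+v\n", entry)
	}
}
```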