Add GitHub Copilot instructions and Python migration with Phase 1 & 3 complete #5
Copilot wants to merge 5 commits into
Co-authored-by: beusj <8457943+beusj@users.noreply.github.com>

@copilot Help me convert the Ruby code to Python modules
Converting the Ruby geocoding engine to Python would be a significant architectural change involving ~2000 lines of code across 7 modules:
This would also require:
This is beyond the scope of the current PR (GitHub Copilot instructions). Would you like me to create a separate issue/PR for this conversion, or would you prefer to discuss the requirements and approach first?

@copilot Let's discuss the requirements and approach
Conversion Requirements & Approach
Current Architecture
Key Components to Convert
1. Core Modules (~2000 lines)
2. Critical Dependencies
Recommended Approach
Option A: Rewrite in Python (Complex)
Option B: Python-R Interface (Moderate)
Option C: Native R Implementation (Alternative)
Testing Strategy
Which approach interests you? Option B (Python via reticulate) offers the best balance of effort and maintainability.

@copilot I think we should rewrite in Python and consider DuckDB to manage the geometry/geography data
Python + DuckDB Migration Plan
Great choice! DuckDB offers significant advantages for this use case.
Why DuckDB is a Strong Fit
Advantages over SQLite:
Migration Architecture
New Stack:
Components to Build:
Key Considerations
Challenges:
Benefits:
Next Steps
Would you like me to:
I recommend starting with #3 to validate the approach before full conversion.

@copilot 1 - also, I think the R entrypoint can likely be rewritten in Python
Full Python Migration Plan
You're absolutely right - the R entrypoint is straightforward to port to Python.
Current entrypoint.R does:
Main Functions (160 lines):
Python Equivalents:
New Architecture
Full Python Stack:
Benefits:
Proposed Structure
Should I proceed with creating this scaffolding on a new branch? I can start with:

@copilot Yes, 1 through 4
- Created geocoder_us/ package with __init__.py
- Ported constants.py from Ruby (1000+ lines)
  - TwoWayMap class for bidirectional lookups
  - Directional prefixes/suffixes
  - Street type prefixes and suffixes
  - US state and territory mappings
- Created preprocessing.py with address validation functions
  - clean_address() for normalization
  - address_is_po_box() detector
  - address_is_institutional() for Cincinnati addresses
  - address_is_nonaddress() for placeholder text
- Created entrypoint.py CLI with full interface
  - Argument parsing (filename, score_threshold)
  - CSV I/O with pandas
  - Preprocessing pipeline
  - Output file naming matching original format
  - Summary reporting with tabulate
- Added requirements.txt with Python dependencies
  - pandas, duckdb, tabulate
  - Levenshtein, metaphone
  - joblib for parallel processing
- Created PYTHON_MIGRATION.md with roadmap
- Updated .gitignore for Python artifacts
Total: ~1060 lines of Python code
Geocoding engine TODO: database.py, address.py, metaphone integration
Co-authored-by: beusj <8457943+beusj@users.noreply.github.com>
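The commit describes TwoWayMap only as a class for bidirectional lookups, so its real interface is not visible here. A minimal sketch of the idea follows, assuming case-insensitive keys; the method names `abbreviate` and `expand` and the sample data are hypothetical and not taken from constants.py.

```python
class TwoWayMap:
    """Bidirectional lookup between long forms and abbreviations (illustrative only)."""

    def __init__(self, pairs):
        # _forward: long form -> abbreviation; _reverse: abbreviation -> long form
        self._forward = {}
        self._reverse = {}
        for long_form, abbreviation in pairs.items():
            self._forward[long_form.lower()] = abbreviation
            # Keep the first long form seen if two entries share an abbreviation.
            self._reverse.setdefault(abbreviation.lower(), long_form)

    def abbreviate(self, long_form):
        """Return the abbreviation for a long form, or None if unknown."""
        return self._forward.get(long_form.lower())

    def expand(self, abbreviation):
        """Return the long form for an abbreviation, or None if unknown."""
        return self._reverse.get(abbreviation.lower())


# Hypothetical sample data; the ported constants hold far more entries.
DIRECTIONAL = TwoWayMap({"north": "N", "south": "S", "east": "E", "west": "W"})

print(DIRECTIONAL.abbreviate("North"))  # N
print(DIRECTIONAL.expand("w"))          # west
```

Holding both dictionaries keeps lookups constant-time in either direction, at the cost of storing each pair twice.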
✅ Complete! Created Python migration scaffolding (commit 02d20e0):
Branch: Currently on
Delivered (items 1-4):
Files Created:
Current Status:
Next: Implement

@copilot Continue with planned steps as in roadmap
Starting Phase 1 implementation. Given the scope, I'll focus on foundational pieces first:
Planned for this session:
This sets up the core modules needed for Phase 2 (database migration) and Phase 3 (integration). The full geocoding queries and scoring logic will follow once the database is migrated. Will commit these modules and update the roadmap. Estimated ~500-800 lines across 3 files.
…database
Phase 1 Core Geocoding Engine:
- Created address.py (330 lines) - Complete address parser
  - Regex-based parsing of number, street, city, state, ZIP
  - Street and city tokenization for database matching
  - Abbreviation expansion using constants
  - PO Box and intersection detection
  - Street/city parts generation for fuzzy matching
- Created metaphone.py (190 lines) - Phonetic matching
  - Metaphone algorithm implementation
  - Phonetic similarity scoring
  - Multi-word encoding support
  - Compatible with external metaphone libraries
- Created database.py (210 lines) - DuckDB interface
  - Thread-safe connection management
  - Spatial and fuzzystrsim extension loading
  - Query method framework (stubs for schema)
  - Scoring weight constants
- Created test_modules.py - Validation test script
  - Tests address parsing with sample addresses
  - Tests metaphone encoding and similarity
  - Validates constants loading
  - Tests street parts generation
- Updated PYTHON_MIGRATION.md roadmap
  - Marked Phase 1 modules complete
  - Updated completed items list
All modules tested and working. Ready for Phase 2 (database migration).
Co-authored-by: beusj <8457943+beusj@users.noreply.github.com>
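The regex-based parsing of number, street, city, state, and ZIP listed for address.py could look roughly like the sketch below. It is not the PR's parser: the pattern, the `parse_address` name, and the returned keys are assumptions chosen for illustration, and it handles only simple "number street, city, state ZIP" inputs with a 5-digit ZIP.

```python
import re

# Illustrative pattern only; the actual address.py is not shown in this thread.
ADDRESS_RE = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+            # house number
    (?P<street>[^,]+?)\s*,\s*     # street name up to the first comma
    (?P<city>[^,]+?)\s*,?\s+      # city, with an optional comma before the state
    (?P<state>[A-Za-z]{2})\s+     # two-letter state code
    (?P<zip>\d{5})                # 5-digit ZIP code
    \s*$""",
    re.VERBOSE,
)


def parse_address(text):
    """Return a dict of address components, or None if the text does not match."""
    match = ADDRESS_RE.match(text)
    if match is None:
        return None
    parts = match.groupdict()
    parts["state"] = parts["state"].upper()
    # Lowercased street tokens, the kind of "street parts" used for matching.
    parts["street_parts"] = parts["street"].lower().split()
    return parts


print(parse_address("3333 Burnet Ave, Cincinnati, OH 45229"))
```

Returning None for non-matching input lets the caller decide how to classify the row rather than raising inside the parser.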

@copilot continue with the plan
Continuing with Phase 3: Integration (Phase 2 requires database file). Implementing now:
This completes the geocoding pipeline (except actual DB queries which need the migrated database). The system will be ready for Phase 2 database work. |
…ring
Phase 3 Integration Complete:
- Updated entrypoint.py (290 → 437 lines)
  - Integrated Address parsing from Phase 1
  - Implemented parallel geocoding with joblib
    - Uses all CPU cores (n_jobs=-1)
    - Progress display during processing
  - Added result caching with joblib Memory
    - Disk-based cache in .geocoding_cache/
    - Prevents re-geocoding same addresses
  - Implemented geocode_single_address() function
    - Parses addresses using Address class
    - Returns structured results (placeholder DB queries)
    - Error handling for failed addresses
  - Added apply_score_threshold() function
    - Filters results by score and precision
    - Classifies results: geocoded, imprecise_geocode, po_box, etc.
    - Sets lat/lon to None for low-quality geocodes
  - Updated write_output_file() to apply filtering
- Created test_integration.py
  - End-to-end workflow test
  - Tests preprocessing, geocoding, filtering, output
  - Creates test CSV, runs pipeline, validates results
- Updated PYTHON_MIGRATION.md
  - Marked Phase 3 tasks complete
  - Updated completed items with integration details
Ready for Phase 2 (database migration) to enable actual coordinate lookups.
Co-authored-by: beusj <8457943+beusj@users.noreply.github.com>
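The filtering step described for apply_score_threshold() can be sketched as follows. This is a hedged illustration, not the committed function: the 0.5 default threshold, the set of precision labels treated as precise, the `not_geocoded` label, the `geocode_result` key, and the sample rows are all assumptions. Only the labels `geocoded`, `imprecise_geocode`, and `po_box` come from the commit message.

```python
# Assumption: which precision labels count as precise enough to keep coordinates.
PRECISE_LEVELS = {"range", "street"}


def apply_score_threshold(results, score_threshold=0.5):
    """Classify each result row and blank out coordinates of low-quality geocodes."""
    filtered = []
    for original in results:
        row = dict(original)  # copy so the caller's rows are not mutated
        if row.get("po_box"):
            status = "po_box"
        elif row.get("score") is None:
            status = "not_geocoded"  # hypothetical label for rows with no match
        elif row["score"] >= score_threshold and row.get("precision") in PRECISE_LEVELS:
            status = "geocoded"
        else:
            status = "imprecise_geocode"
        if status != "geocoded":
            # Low-quality or missing geocodes keep their row but lose coordinates.
            row["lat"] = None
            row["lon"] = None
        row["geocode_result"] = status
        filtered.append(row)
    return filtered


# Made-up sample rows for illustration only.
sample = [
    {"score": 0.9, "precision": "range", "lat": 39.1, "lon": -84.5},
    {"score": 0.3, "precision": "zip", "lat": 39.2, "lon": -84.4},
    {"po_box": True, "score": None, "lat": None, "lon": None},
]
for row in apply_score_threshold(sample):
    print(row["geocode_result"], row["lat"], row["lon"])
```

Keeping every input row and recording a status, rather than dropping rows, preserves a one-to-one mapping between the input CSV and the output file.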
Adds .github/copilot-instructions.md to provide Copilot with repository context for better code suggestions, plus Python migration implementation with Phase 1 core geocoding modules and Phase 3 integration complete.
GitHub Copilot Instructions
The file surfaces domain-specific conventions (e.g., mappp::mappp() for parallel cached geocoding, mandatory 5-digit ZIP codes) and production constraints (backward compatibility, thread safety) that would otherwise require deep codebase knowledge.
Python Migration - Phase 1 & 3 Complete
Created Python implementation structure for migrating to full Python stack with DuckDB. Phase 1 core geocoding modules and Phase 3 integration pipeline are now implemented and tested.
Files Added
Initial Scaffolding:
- geocoder_us/ package: Python module structure
- __init__.py: Package initialization
- constants.py: ~1000 lines ported from Ruby (street types, state codes, directional prefixes)
- preprocessing.py: Address cleaning and validation functions (clean_address, address_is_po_box, etc.)
- entrypoint.py: Complete CLI implementation with integration (437 lines)
Phase 1 Core Modules:
- geocoder_us/address.py (330 lines): Complete address parser
- geocoder_us/metaphone.py (190 lines): Phonetic matching engine
- geocoder_us/database.py (210 lines): DuckDB interface
Documentation & Testing:
- requirements.txt: Python dependencies (pandas, duckdb, tabulate, Levenshtein, metaphone, joblib)
- PYTHON_MIGRATION.md: Complete 5-phase roadmap (Phase 1 & 3 marked complete)
- test_modules.py: Validation test script for core modules
- test_integration.py: End-to-end pipeline integration test
Migration Architecture
Target Stack:
Current Status
Phase 1: Core Geocoding Engine ✅ COMPLETE
Phase 3: Integration ✅ COMPLETE
.geocoding_cache/)
Testing: All modules tested and validated:
Phase 2 Pending: Database Migration
What Works Now
The geocoder can:
What Needs Database Migration
The geocoder CANNOT yet:
The core Python geocoding pipeline is fully implemented and ready to accept real database queries once Phase 2 (database migration) is completed.