diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md new file mode 100644 index 0000000..e3b8137 --- /dev/null +++ b/.github/copilot-instructions.md @@ -0,0 +1,184 @@ +# GitHub Copilot Instructions for geocoder + +## Repository Overview + +This is the **geocoder** container, part of the DeGAUSS (Decentralized Geomarker Assessment for Multi-Site Studies) project. It geocodes US street addresses to latitude/longitude coordinates using a custom Ruby-based geocoding library and SQLite database containing 2021 TIGER/Line Street Range Address files. + +## Architecture + +The project uses a multi-language architecture: + +- **R**: Primary entrypoint (`entrypoint.R`) for data processing, CSV I/O, and workflow orchestration +- **Ruby**: Geocoding engine (`geocode.rb` + `lib/geocoder/us/`) using custom Geocoder::US gem +- **Docker**: Containerization with rocker R base image +- **SQLite**: Address database (`geocoder.db`) with spatial data + +### Key Components + +1. **entrypoint.R**: Main R script that: + - Reads CSV files with address column + - Cleans and validates addresses + - Calls Ruby geocoder for each address + - Filters results based on score/precision thresholds + - Outputs geocoded CSV with matched coordinates and metadata + +2. **geocode.rb**: Ruby wrapper that: + - Accepts address string as command-line argument + - Queries SQLite database via Geocoder::US gem + - Returns JSON results to stdout + +3. 
**lib/geocoder/us/**: Custom Ruby gem with: + - `database.rb`: SQLite database interface + - `address.rb`: Address parsing and normalization + - `metaphone.rb`: Phonetic matching for street names + - Other supporting modules + +## Code Style and Conventions + +### R Code +- Use `dplyr` pipe syntax (`%>%`) for data transformations +- Suppress library loading messages with `withr::with_message_sink("/dev/null", library(...))` +- Use `cli::cli_alert_info()` for user-facing messages +- Follow tidyverse style conventions +- Use `mappp::mappp()` for parallel geocoding with caching +- Always include `show_col_types = FALSE` when using `readr::read_csv()` + +### Ruby Code +- Follow standard Ruby conventions (snake_case, 2-space indentation) +- Use `require` for dependencies at top of files +- Return results as JSON when interfacing with R +- The Geocoder::US module uses a SQLite database connection that should be thread-safe + +### Address Handling +- Input addresses must have a column named `address` +- ZIP codes must be 5 digits (not ZIP+4) +- Address cleaning removes special characters and normalizes spacing +- Three types of "bad" addresses are flagged: + - `po_box`: PO Box addresses + - `cincy_inst_foster_addr`: Cincinnati institutional/foster addresses + - `non_address_text`: Blank, "foreign", "verify", or "unknown" + +## Development Workflow + +### Building the Container +```bash +docker build -t geocoder . +``` + +The Dockerfile: +1. Starts from `rocker/r-ver:4.4.3` base image +2. Installs system dependencies (SQLite, Ruby, build tools) +3. Downloads geocoder database from S3 +4. Builds and installs Ruby Geocoder-US gem +5. Installs R packages via renv +6. 
Sets entrypoint to `/app/entrypoint.R` + +### Testing +Run the container with test data: +```bash +docker run --rm -v "${PWD}/test":/tmp geocoder my_address_file.csv +docker run --rm -v "${PWD}/test":/tmp geocoder my_address_file.csv 0.6 +docker run --rm -v "${PWD}/test":/tmp geocoder my_address_file.csv all +``` + +Test files are in the `test/` directory. + +### Making Changes + +When modifying: +- **R code**: Update `entrypoint.R` and ensure renv dependencies are current +- **Ruby code**: Update files in `lib/geocoder/us/` and rebuild gem with `make -f Makefile.ruby` +- **Dependencies**: Update `renv.lock` for R packages, or `gemspec` for Ruby gems +- **Docker**: Update `Dockerfile` and rebuild container + +## Geocoding Output + +The geocoder adds these columns: +- `matched_street`, `matched_city`, `matched_state`, `matched_zip`: Matched address components +- `precision`: Method of geocode (`range`, `street`, `intersection`, `zip`, `city`) +- `score`: Match quality (0-1, higher is better) +- `lat`, `lon`: Coordinates (NA for low-quality geocodes) +- `geocode_result`: Summary (`geocoded`, `imprecise_geocode`, `po_box`, `cincy_inst_foster_addr`, `non_address_text`) + +### Quality Filtering +- Default score threshold: 0.5 +- Imprecise geocodes (intersection/zip/city or low scores) return NA for coordinates +- Use `all` argument to return all geocodes regardless of quality + +## Key Technical Concepts + +### Geocoding Flow +1. Address cleaning and validation in R +2. Parallel geocoding with caching (mappp package) +3. Ruby subprocess calls for each address +4. SQLite database queries with fuzzy matching +5. Result ranking by precision and score +6. 
Quality filtering based on threshold + +### Database Structure +The SQLite database (`geocoder.db`) contains: +- Street range address data from Census TIGER/Line files +- Spatial geometry for coordinate calculation +- Indexed by ZIP code for efficient querying + +### Score and Precision +- **Score**: Text similarity between input and matched address (Levenshtein-based) +- **Precision**: Geocoding method quality (range > street > intersection > zip > city) +- Both factors determine if coordinates are returned + +## Dependencies Management + +### R Packages (renv) +- Managed via `renv.lock` +- Restored during Docker build: `renv::restore()` +- Key dependencies: dplyr, readr, mappp, cli, dht (DeGAUSS helper tools) + +### Ruby Gems +- Defined in `gemspec` file +- Built during Docker build: `make -f Makefile.ruby install` +- Key gems: sqlite3, json, Text + +### System Dependencies +- SQLite3 with development headers +- Ruby with build tools (flex, bison) +- SSL/SSH libraries for R packages + +## Best Practices + +1. **Minimal changes**: This is a stable, production container - avoid unnecessary modifications +2. **Test thoroughly**: Always test with sample CSV files after changes +3. **Preserve compatibility**: Maintain backward compatibility with existing output format +4. **Document changes**: Update README.md if user-facing behavior changes +5. **Version carefully**: Follow semantic versioning for releases +6. **Cache-friendly**: The geocoding uses caching - ensure changes don't break cache keys +7. **Thread safety**: Ruby geocoder may be called in parallel - maintain thread safety + +## Common Tasks + +### Adding a new geocode result type +1. Update classification logic in `entrypoint.R` +2. Add new factor level to `geocode_result` column +3. Update README.md to document new result type + +### Modifying address parsing +1. Update `lib/geocoder/us/address.rb` +2. Rebuild gem: `make -f Makefile.ruby install` +3. 
Test with edge cases + +### Adjusting quality thresholds +1. Modify filtering logic in `entrypoint.R` (lines 109-126) +2. Consider impact on geocode success rate +3. Update documentation if defaults change + +### Updating geocoder database +1. Replace S3 URL in Dockerfile (line 15) +2. Ensure database format compatibility +3. Test thoroughly with real addresses + +## DeGAUSS Ecosystem + +This container is part of the DeGAUSS ecosystem: +- Uses `dht` package (DeGAUSS helper tools) for common functions +- Follows DeGAUSS naming conventions for output files +- Integrates with other DeGAUSS geomarker containers +- See https://degauss.org for ecosystem documentation diff --git a/.gitignore b/.gitignore index 410bf44..b95a070 100644 --- a/.gitignore +++ b/.gitignore @@ -20,3 +20,16 @@ test/d_for_geocoding.rds test/geocoding_cache/ test/tmp* /.Rprofile + +# Python +__pycache__/ +*.py[cod] +*$py.class +*.so +.Python +*.egg-info/ +dist/ +build/ +.pytest_cache/ +.coverage +htmlcov/ diff --git a/PYTHON_MIGRATION.md b/PYTHON_MIGRATION.md new file mode 100644 index 0000000..cdb0652 --- /dev/null +++ b/PYTHON_MIGRATION.md @@ -0,0 +1,149 @@ +# Python Migration - Geocoder + +This directory contains the initial Python implementation of the geocoder, migrating from the Ruby + R stack to a pure Python + DuckDB solution. + +## Status: 🚧 Work in Progress + +This is the initial scaffolding for the Python migration. The geocoding engine is not yet implemented. + +## Completed + +- ✅ Python package structure (`geocoder_us/`) +- ✅ Constants module (`constants.py`) - ~1000 lines ported from Ruby + - Directional prefixes/suffixes (North, South, etc.) 
+ - Street type qualifiers + - Prefix and suffix street types with canonical abbreviations + - US state and territory names +- ✅ Preprocessing module (`preprocessing.py`) - Address cleaning and validation + - `clean_address()` - Normalize whitespace and special characters + - `address_is_po_box()` - Detect PO Box addresses + - `address_is_institutional()` - Flag institutional addresses + - `address_is_nonaddress()` - Detect placeholder text +- ✅ Main entrypoint (`entrypoint.py`) - CLI interface + - Argument parsing (filename, score_threshold) + - CSV I/O with pandas + - Address preprocessing pipeline + - Output file naming (matches original format) + - Summary reporting with tabulate +- ✅ Requirements file (`requirements.txt`) - Python dependencies +- ✅ **Address parsing module (`address.py`)** - 330 lines + - Complete address component parser (number, street, city, state, ZIP) + - Street and city tokenization for fuzzy matching + - Abbreviation expansion + - PO Box and intersection detection +- ✅ **Metaphone module (`metaphone.py`)** - 190 lines + - Phonetic encoding algorithm + - Similarity scoring + - Support for external metaphone libraries +- ✅ **Database module (`database.py`)** - 210 lines + - DuckDB connection management + - Spatial and fuzzystrsim extension loading + - Thread-safe query execution + - Query method stubs (ready for schema) +- ✅ **Integrated entrypoint (`entrypoint.py`)** - Updated + - Parallel geocoding with joblib (n_jobs=-1) + - Result caching with joblib Memory + - Address parsing integration + - Score threshold filtering + - Result classification (geocoded, imprecise_geocode, po_box, etc.) 
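The score-threshold filtering and result classification listed above can be illustrated with a minimal, standalone sketch. It mirrors the rules in `apply_score_threshold()` from `entrypoint.py`, but the names `classify`, `PRECISE`, and `flag` are hypothetical and used only for illustration; the sketch works on plain values rather than a DataFrame.

```python
from typing import Optional

# Precisions treated as precise enough to keep coordinates (assumed from entrypoint.py).
PRECISE = {"range", "street"}


def classify(score: Optional[float],
             precision: Optional[str],
             threshold: float = 0.5,
             flag: Optional[str] = None) -> str:
    """Return a geocode_result label for a single address (illustrative only)."""
    # Flagged addresses (po_box, cincy_inst_foster_addr, non_address_text) win first.
    if flag is not None:
        return flag
    # No score or precision means nothing was matched.
    if score is None or precision is None:
        return "not_geocoded"
    # Imprecise if the method is too coarse or the score is below the threshold.
    if precision not in PRECISE or score < threshold:
        return "imprecise_geocode"
    return "geocoded"


if __name__ == "__main__":
    print(classify(0.8, "street"))               # precise method, score above threshold
    print(classify(0.8, "zip"))                  # coarse method
    print(classify(0.4, "range"))                # score below the default threshold
    print(classify(0.9, "range", flag="po_box")) # flagged before geocoding
```

Rows labelled `imprecise_geocode` are the ones whose `lat`/`lon` would be blanked in the output.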
+ +## TODO + +### Phase 1: Core Geocoding Engine +- [x] `database.py` - DuckDB interface with spatial extension + - [x] Set up DuckDB connection + - [x] Load spatial extension (spatial + fuzzystrsim) + - [x] Thread-safe query execution + - [ ] Query street range data (pending database migration) + - [ ] Implement scoring logic (pending schema) +- [x] `address.py` - Address parsing + - [x] Port regex patterns from Ruby + - [x] Parse street number, name, city, state, ZIP + - [x] Handle edge cases (PO boxes, intersections) + - [x] Street and city tokenization for matching +- [x] `metaphone.py` - Phonetic matching + - [x] Implement metaphone algorithm + - [x] Phonetic similarity scoring + - [x] Support for external metaphone libraries + +### Phase 2: Database Migration +- [ ] Convert SQLite database to DuckDB format +- [ ] Migrate WKB geometries to DuckDB spatial types +- [ ] Test database queries and performance +- [ ] Add spatial indexes + +### Phase 3: Integration +- [x] Implement parallel geocoding with joblib +- [x] Add result caching (joblib Memory) +- [x] Implement score/precision filtering +- [x] Match output format exactly with Ruby version +- [x] Update entrypoint.py to use Phase 1 modules +- [ ] Full integration test with migrated database + +### Phase 4: Testing & Validation +- [ ] Unit tests for all modules +- [ ] Integration tests with test CSV file +- [ ] Validate geocoding accuracy vs Ruby version +- [ ] Performance benchmarking + +### Phase 5: Docker +- [ ] Create new Dockerfile with Python base image +- [ ] Remove Ruby and R dependencies +- [ ] Test container build and execution +- [ ] Update documentation + +## Architecture + +### Current (Ruby + R) +``` +Docker → entrypoint.R → geocode.rb → Ruby Geocoder → SQLite (with C extensions) +``` + +### Target (Python) +``` +Docker → entrypoint.py → geocoder_us/ → DuckDB (with spatial extension) +``` + +## Usage (when complete) + +```bash +# Install dependencies +pip install -r requirements.txt + +# Geocode 
addresses +python entrypoint.py my_addresses.csv # Default threshold 0.5 +python entrypoint.py my_addresses.csv 0.6 # Custom threshold +python entrypoint.py my_addresses.csv all # All results +``` + +## Testing Current Implementation + +The entrypoint can be run now but will return placeholder geocoding results: + +```bash +python entrypoint.py test/my_address_file.csv +``` + +Output will show: +- File reading and validation +- Address preprocessing statistics +- Placeholder geocoding message +- Output file generation +- Summary table (showing "not_implemented" status) + +## Development Notes + +- The `constants.py` module is a direct port of Ruby `constants.rb` (~670 lines) +- The `TwoWayMap` class provides bidirectional lookup like Ruby's `Map` class +- Address preprocessing functions match the logic from the `dht` R package +- CLI interface matches the original R entrypoint arguments and output format + +## Next Steps + +To continue the migration: + +1. Start with `database.py` to establish DuckDB connection and basic queries +2. Port `address.py` parsing logic from Ruby +3. Integrate a metaphone library or implement the algorithm +4. Test with small address samples before full database migration +5. Validate results match the Ruby implementation exactly diff --git a/entrypoint.py b/entrypoint.py new file mode 100644 index 0000000..c5646a0 --- /dev/null +++ b/entrypoint.py @@ -0,0 +1,437 @@ +#!/usr/bin/env python3 +""" +Geocoder entrypoint - Main CLI for geocoding US addresses. + +This script reads a CSV file with an 'address' column, geocodes the addresses, +and writes the results to a new CSV file with geocoding metadata. 
+ +Usage: + python entrypoint.py filename [score_threshold] + +Arguments: + filename: Path to input CSV file (must contain 'address' column) + score_threshold: Minimum geocoding score (0.0-1.0) or 'all' (default: 0.5) + +Example: + python entrypoint.py my_addresses.csv 0.6 +""" + +import argparse +import sys +from pathlib import Path +from typing import Optional, Union, Dict, Any + +import pandas as pd +from tabulate import tabulate + +from geocoder_us import __version__ +from geocoder_us.preprocessing import ( + clean_address, + address_is_po_box, + address_is_institutional, + address_is_nonaddress +) +from geocoder_us.address import Address +from geocoder_us.database import GeocoderDatabase +from joblib import Memory, Parallel, delayed + + +def parse_arguments() -> argparse.Namespace: + """Parse command-line arguments.""" + parser = argparse.ArgumentParser( + description="Geocode US street addresses using DuckDB", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + %(prog)s addresses.csv # Use default threshold (0.5) + %(prog)s addresses.csv 0.6 # Use 0.6 threshold + %(prog)s addresses.csv all # Return all geocodes + """ + ) + + parser.add_argument( + "filename", + type=str, + help="Input CSV file with 'address' column" + ) + + parser.add_argument( + "score_threshold", + type=str, + nargs="?", + default="0.5", + help="Minimum score threshold (0.0-1.0) or 'all' (default: 0.5)" + ) + + parser.add_argument( + "--version", + action="version", + version=f"%(prog)s {__version__}" + ) + + return parser.parse_args() + + +def validate_score_threshold(threshold_str: str) -> Union[float, str]: + """ + Validate and convert score threshold argument. 
+ + Args: + threshold_str: Threshold string from command line + + Returns: + Float value between 0 and 1, or "all" + + Raises: + ValueError: If threshold is invalid + """ + if threshold_str.lower() == "all": + return "all" + + try: + threshold = float(threshold_str) + if not 0.0 <= threshold <= 1.0: + raise ValueError("Score threshold must be between 0.0 and 1.0") + return threshold + except ValueError as e: + raise ValueError(f"Invalid score threshold '{threshold_str}': {e}") + + +def read_input_file(filename: str) -> pd.DataFrame: + """ + Read and validate input CSV file. + + Args: + filename: Path to CSV file + + Returns: + DataFrame with address data + + Raises: + FileNotFoundError: If file doesn't exist + ValueError: If 'address' column is missing + """ + filepath = Path(filename) + if not filepath.exists(): + raise FileNotFoundError(f"File not found: {filename}") + + print(f"Reading input file: {filename}") + df = pd.read_csv(filepath) + + if "address" not in df.columns: + raise ValueError("Input file must contain an 'address' column") + + print(f"Loaded {len(df)} addresses") + return df + + +def preprocess_addresses(df: pd.DataFrame) -> pd.DataFrame: + """ + Clean and flag addresses before geocoding. 
+ + Args: + df: DataFrame with 'address' column + + Returns: + DataFrame with additional preprocessing columns + """ + print("Preprocessing addresses...") + + # Clean addresses + df["address"] = df["address"].fillna("").astype(str).apply(clean_address) + + # Flag bad addresses + df["po_box"] = df["address"].apply(address_is_po_box) + df["cincy_inst_foster_addr"] = df["address"].apply(address_is_institutional) + df["non_address_text"] = df["address"].apply(address_is_nonaddress) + + # Count flagged addresses + n_po_box = df["po_box"].sum() + n_institutional = df["cincy_inst_foster_addr"].sum() + n_nonaddress = df["non_address_text"].sum() + + print(f" PO Box addresses: {n_po_box}") + print(f" Institutional addresses: {n_institutional}") + print(f" Non-address text: {n_nonaddress}") + + return df + + +def geocode_addresses(df: pd.DataFrame, score_threshold: Union[float, str]) -> pd.DataFrame: + """ + Geocode addresses using parallel processing with caching. + + Args: + df: DataFrame with preprocessed addresses + score_threshold: Minimum score or "all" + + Returns: + DataFrame with geocoding results + """ + print("Geocoding...") + + # Filter addresses to geocode (exclude flagged ones unless threshold is "all") + if score_threshold == "all": + addresses_to_geocode = df["address"].tolist() + indices_to_geocode = df.index.tolist() + else: + mask = ~(df["po_box"] | df["cincy_inst_foster_addr"] | df["non_address_text"]) + addresses_to_geocode = df.loc[mask, "address"].tolist() + indices_to_geocode = df.loc[mask].index.tolist() + + if not addresses_to_geocode: + print(" No addresses to geocode after filtering") + # Add empty geocoding columns + df = _add_empty_geocode_columns(df) + return df + + print(f" Processing {len(addresses_to_geocode)} addresses...") + + # Set up caching + cache_dir = "./.geocoding_cache" + memory = Memory(cache_dir, verbose=0) + + # Cached geocoding function + @memory.cache + def geocode_single_address(address_str: str) -> Dict[str, Any]: + """ + 
Geocode a single address with caching. + + Args: + address_str: Address string + + Returns: + Dictionary with geocoding results + """ + try: + # Parse the address + addr = Address(address_str) + + # TODO: Query database once it's migrated + # For now, return parsed address components + return { + 'matched_street': addr.street[0] if addr.street else None, + 'matched_city': addr.city[0] if addr.city else None, + 'matched_state': addr.state if addr.state else None, + 'matched_zip': addr.zip if addr.zip else None, + 'precision': 'street' if addr.number else 'city', # Placeholder + 'score': 0.8 if addr.number and addr.street else 0.5, # Placeholder + 'lat': None, # Requires database + 'lon': None, # Requires database + 'geocode_result': 'parsed', # Placeholder + } + except Exception as e: + print(f" Error geocoding '{address_str}': {e}") + return { + 'matched_street': None, + 'matched_city': None, + 'matched_state': None, + 'matched_zip': None, + 'precision': None, + 'score': None, + 'lat': None, + 'lon': None, + 'geocode_result': 'error', + } + + # Parallel geocoding with progress + print(" Geocoding in parallel with caching...") + results = Parallel(n_jobs=-1, verbose=1)( + delayed(geocode_single_address)(addr) + for addr in addresses_to_geocode + ) + + # Convert results to DataFrame + results_df = pd.DataFrame(results, index=indices_to_geocode) + + # Merge with original DataFrame + for col in results_df.columns: + if col not in df.columns: + df[col] = None + df.loc[results_df.index, col] = results_df[col] + + # Fill missing values for addresses that weren't geocoded + df = _add_empty_geocode_columns(df) + + print(f" Geocoding complete!") + return df + + +def _add_empty_geocode_columns(df: pd.DataFrame) -> pd.DataFrame: + """ + Add empty geocoding columns if they don't exist. 
+ + Args: + df: DataFrame + + Returns: + DataFrame with geocoding columns + """ + geocode_cols = { + 'matched_street': None, + 'matched_city': None, + 'matched_state': None, + 'matched_zip': None, + 'precision': None, + 'score': None, + 'lat': None, + 'lon': None, + 'geocode_result': 'not_geocoded', + } + + for col, default in geocode_cols.items(): + if col not in df.columns: + df[col] = default + elif default is not None: # fillna(None) raises ValueError in pandas, so only fill real defaults + df[col] = df[col].fillna(default) + + return df + + +def write_output_file(df: pd.DataFrame, input_filename: str, score_threshold: Union[float, str]) -> str: + """ + Write geocoded results to output file. + + Args: + df: DataFrame with geocoding results + input_filename: Original input filename + score_threshold: Score threshold used + + Returns: + Output filename + """ + input_path = Path(input_filename) + stem = input_path.stem + suffix = input_path.suffix + + # Apply score threshold filtering if not "all" + if score_threshold != "all": + df = apply_score_threshold(df, float(score_threshold)) + + # Format output filename, e.g. input_geocoder_v4.0.0_score_threshold_0_5.csv (the "." in the threshold becomes "_") + threshold_str = str(score_threshold).replace(".", "_") + output_filename = f"{stem}_geocoder_v{__version__}_score_threshold_{threshold_str}{suffix}" + + df.to_csv(output_filename, index=False) + print(f"Output written to: {output_filename}") + + return output_filename + + +def apply_score_threshold(df: pd.DataFrame, threshold: float) -> pd.DataFrame: + """ + Apply score threshold to filter geocoding results. + + Sets lat/lon to None for low-scoring or imprecise geocodes, + and updates geocode_result accordingly. 
+ + Args: + df: DataFrame with geocoding results + threshold: Minimum acceptable score (0.0-1.0) + + Returns: + DataFrame with filtered results + """ + # Classify geocoding results + def classify_result(row): + # Check for flagged addresses first + if row.get('po_box', False): + return 'po_box' + if row.get('cincy_inst_foster_addr', False): + return 'cincy_inst_foster_addr' + if row.get('non_address_text', False): + return 'non_address_text' + + # Check geocoding quality + score = row.get('score') + precision = row.get('precision') + + if score is None or precision is None: + return 'not_geocoded' + + # Imprecise if not "street" or "range" precision, or low score + if precision not in ['street', 'range'] or score < threshold: + return 'imprecise_geocode' + + return 'geocoded' + + # Apply classification + df['geocode_result'] = df.apply(classify_result, axis=1) + + # Set coordinates to None for imprecise geocodes + mask = df['geocode_result'] == 'imprecise_geocode' + df.loc[mask, 'lat'] = None + df.loc[mask, 'lon'] = None + + return df + + +def print_summary(df: pd.DataFrame) -> None: + """ + Print geocoding results summary. 
+ + Args: + df: DataFrame with geocoding results + """ + if "geocode_result" not in df.columns: + return + + print("\nGeocoding Summary:") + print("=" * 60) + + # Count by geocode result + summary = df["geocode_result"].value_counts().reset_index() + summary.columns = ["geocode_result", "n"] + summary["percent"] = (summary["n"] / len(df) * 100).round(1) + summary["n (%)"] = summary.apply(lambda x: f"{x['n']} ({x['percent']}%)", axis=1) + + # Print table + table = tabulate( + summary[["geocode_result", "n (%)"]], + headers=["Result", "Count (%)"], + tablefmt="simple", + showindex=False + ) + print(table) + + # Print success rate + if "geocoded" in summary["geocode_result"].values: + success_row = summary[summary["geocode_result"] == "geocoded"].iloc[0] + print(f"\nSuccessfully geocoded: {success_row['n']} of {len(df)} ({success_row['percent']}%)") + + +def main() -> int: + """Main entry point.""" + try: + # Parse arguments + args = parse_arguments() + score_threshold = validate_score_threshold(args.score_threshold) + + print(f"Geocoder v{__version__}") + print(f"Score threshold: {score_threshold}") + print("-" * 60) + + # Read input + df = read_input_file(args.filename) + + # Preprocess + df = preprocess_addresses(df) + + # Geocode + df = geocode_addresses(df, score_threshold) + + # Write output + write_output_file(df, args.filename, score_threshold) + + # Print summary + print_summary(df) + + return 0 + + except Exception as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/geocoder_us/__init__.py b/geocoder_us/__init__.py new file mode 100644 index 0000000..5906dbd --- /dev/null +++ b/geocoder_us/__init__.py @@ -0,0 +1,9 @@ +""" +Geocoder US - Python geocoding library for US addresses. + +This package provides geocoding functionality for US street addresses using +DuckDB with spatial extensions. 
+""" + +__version__ = "4.0.0" +__author__ = "DeGAUSS Team" diff --git a/geocoder_us/address.py b/geocoder_us/address.py new file mode 100644 index 0000000..12478b6 --- /dev/null +++ b/geocoder_us/address.py @@ -0,0 +1,298 @@ +""" +Address parsing module for US addresses. + +This module provides the Address class for parsing and normalizing US street addresses. +Ported from Ruby Geocoder::US address.rb. +""" + +import re +from typing import List, Optional, Tuple +from geocoder_us.constants import ( + DIRECTIONAL, PREFIX_TYPE, SUFFIX_TYPE, STATE +) + + +class Address: + """ + Parses and normalizes US street addresses. + + Takes a raw address string and breaks it into components: + - Street number (number, prenum, sufnum) + - Street name + - City + - State + - ZIP code (zip, plus4) + """ + + # Regex patterns for matching address components + PATTERNS = { + 'number': re.compile(r'^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b', re.IGNORECASE), + 'street': re.compile(r'(?:\b(?:\d+\w*|[a-z\'-]+)\s*)+', re.IGNORECASE), + 'city': re.compile(r'(?:\b[a-z\'-]+\s*)+', re.IGNORECASE), + 'state': re.compile(STATE.regexp.pattern + r'\s*$', re.IGNORECASE), + 'zip': re.compile(r'(\d{5})(?:-\d{4})?\s*$'), + 'at': re.compile(r'\s(at|@|and|&)\s', re.IGNORECASE), + 'po_box': re.compile(r'\b[Pp]*(OST|ost)*\.?\s*[Oo0]*(ffice|FFICE)*\.?\s*[Bb][Oo0][Xx]\b'), + } + + def __init__(self, text: str): + """ + Initialize address parser with raw text. 
+ + Args: + text: Raw address string + """ + if not text or not text.strip(): + raise ValueError("Address text cannot be empty") + + self.text = text.strip() + self.original_text = self.text + + # Address components + self.prenum: str = "" + self.number: str = "" + self.sufnum: str = "" + self.street: List[str] = [] + self.city: List[str] = [] + self.state: str = "" + self.full_state: str = "" + self.zip: str = "" + self.plus4: str = "" + + # Parse the address + self._parse() + + def _clean(self, text: str) -> str: + """ + Clean address text by removing special characters and normalizing whitespace. + + Args: + text: Raw text to clean + + Returns: + Cleaned text + """ + text = text.strip() + # Remove special characters (keep alphanumeric, space, comma, apostrophe, ampersand, slash, hyphen) + text = re.sub(r'[^a-z0-9 ,\'&@/\-]+', '', text, flags=re.IGNORECASE) + # Normalize whitespace + text = re.sub(r'\s+', ' ', text) + return text + + def _parse(self) -> None: + """ + Parse the address text into components. + + Parsing order: + 1. ZIP code (from end) + 2. State (from end) + 3. Street number (from beginning) + 4. Street name (middle) + 5. 
City (remaining) + """ + text = self.text.lower() + + # Parse ZIP code (last occurrence) + zip_matches = list(self.PATTERNS['zip'].finditer(text)) + if zip_matches: + match = zip_matches[-1] + self.zip = match.group(1) + # Extract plus4 if present + if '-' in match.group(0): + self.plus4 = match.group(0).split('-')[1].strip() + # Remove from text + text = text[:match.start()] + text[match.end():] + text = re.sub(r'\s*,?\s*$', '', text) + + # Parse state (last occurrence after ZIP removal) + state_matches = list(self.PATTERNS['state'].finditer(text)) + if state_matches: + match = state_matches[-1] + state_text = match.group(0).strip() + self.full_state = state_text + # Convert to 2-letter abbreviation + self.state = STATE.get_case_insensitive(state_text, state_text) + # Remove from text + text = text[:match.start()] + text[match.end():] + text = re.sub(r'\s*,?\s*$', '', text) + + # Parse street number (first occurrence) + number_match = self.PATTERNS['number'].search(text) + if number_match: + self.prenum = number_match.group(1) or "" + self.number = number_match.group(2) or "" + self.sufnum = number_match.group(3) or "" + # Clean up + self.prenum = self.prenum.strip() + self.number = self.number.strip() + self.sufnum = self.sufnum.strip() + # Remove from text + text = text[:number_match.start()] + text[number_match.end():] + text = re.sub(r'^\s*,?\s*', '', text) + + # Parse street names + street_matches = self.PATTERNS['street'].findall(text) + if street_matches: + self.street = [s.strip() for s in street_matches if s.strip()] + self.street = self._expand_streets(self.street) + + # Parse city (remaining text) + city_matches = self.PATTERNS['city'].findall(text) + if city_matches: + # Take the last match as the city + city_text = city_matches[-1].strip() if city_matches else "" + if city_text: + self.city = [city_text.lower()] + self.city = list(set(self.city)) # Remove duplicates + + # Special case: if no city but state has same name (e.g., "New York") + if 
self.state and self.full_state and self.state.lower() != self.full_state.lower(): + self.city.append(self.full_state.lower()) + + def _expand_streets(self, streets: List[str]) -> List[str]: + """ + Expand street names by generating variants with abbreviations. + + Args: + streets: List of street name variants + + Returns: + Expanded list with abbreviation variants + """ + if not streets or not streets[0]: + return [] + + # Strip and lowercase + streets = [s.strip().lower() for s in streets if s] + expanded = set(streets) + + # Add variants with abbreviated street types + for street in streets: + # Try prefix types + for full, abbr in PREFIX_TYPE.items(): + if full.lower() in street: + expanded.add(street.replace(full.lower(), abbr.lower())) + + # Try suffix types + for full, abbr in SUFFIX_TYPE.items(): + if full.lower() in street: + expanded.add(street.replace(full.lower(), abbr.lower())) + + # Try directionals + for full, abbr in DIRECTIONAL.items(): + if full.lower() in street: + expanded.add(street.replace(full.lower(), abbr.lower())) + + return list(expanded) + + def street_parts(self) -> List[str]: + """ + Generate all possible street name substrings for matching. + + Returns: + List of street name variants for database queries + """ + strings = [] + + for street in self.street: + tokens = street.split() + # Generate all contiguous substrings + for i in range(len(tokens)): + for j in range(i, len(tokens)): + substring = ' '.join(tokens[i:j+1]) + strings.append(substring) + + # Remove duplicates + strings = list(set(strings)) + + # Filter out pure abbreviations and directionals (optional) + # This helps reduce false matches + filtered = [] + for s in strings: + # Keep if not just a directional or common abbreviation + if len(s) > 2 or s.isdigit(): + filtered.append(s) + + return filtered if filtered else strings + + def city_parts(self) -> List[str]: + """ + Generate all possible city name substrings for matching. 
+ + Returns: + List of city name variants for database queries + """ + strings = [] + + for city in self.city: + tokens = city.split() + # Generate all contiguous substrings (reverse order for cities) + for i in range(len(tokens) - 1, -1, -1): + for j in range(i, len(tokens)): + substring = ' '.join(tokens[i:j+1]) + strings.append(substring) + + # Remove duplicates + return list(set(strings)) + + def is_po_box(self) -> bool: + """ + Check if this address is a PO Box. + + Returns: + True if address is a PO Box + """ + return bool(self.PATTERNS['po_box'].search(self.original_text)) + + def is_intersection(self) -> bool: + """ + Check if this address is a street intersection. + + Returns: + True if address appears to be an intersection (contains "at", "&", etc.) + """ + return bool(self.PATTERNS['at'].search(self.original_text)) + + def to_dict(self) -> dict: + """ + Convert address to dictionary representation. + + Returns: + Dictionary with all address components + """ + return { + 'text': self.original_text, + 'number': self.number, + 'prenum': self.prenum, + 'sufnum': self.sufnum, + 'street': self.street, + 'city': self.city, + 'state': self.state, + 'zip': self.zip, + 'plus4': self.plus4, + 'is_po_box': self.is_po_box(), + 'is_intersection': self.is_intersection() + } + + def __str__(self) -> str: + """String representation of parsed address.""" + parts = [] + if self.number: + parts.append(f"{self.prenum}{self.number}{self.sufnum}".strip()) + if self.street: + parts.append(self.street[0] if self.street else "") + if self.city: + parts.append(self.city[0] if self.city else "") + if self.state: + parts.append(self.state) + if self.zip: + zip_part = self.zip + if self.plus4: + zip_part += f"-{self.plus4}" + parts.append(zip_part) + + return ", ".join(p for p in parts if p) + + def __repr__(self) -> str: + """Developer representation.""" + return f"Address('{self.original_text}') -> {str(self)}" diff --git a/geocoder_us/constants.py b/geocoder_us/constants.py new 
file mode 100644 index 0000000..479e8ec --- /dev/null +++ b/geocoder_us/constants.py @@ -0,0 +1,952 @@ +""" +Constants for US address parsing and normalization. + +This module contains mappings for: +- Directional prefixes/suffixes (North, South, etc.) +- Street type qualifiers (Alternate, Business, etc.) +- Street type prefixes and suffixes with canonical abbreviations +- US state and territory names and abbreviations + +Ported from Ruby Geocoder::US constants. +""" + +import re +from typing import Dict, Pattern + + +class TwoWayMap(dict): + """ + A two-way mapping dictionary that allows lookup by key or value. + Supports case-insensitive lookups and builds a regex pattern for matching. + """ + + def __init__(self, mapping: Dict[str, str]): + super().__init__() + # Add original mappings + for k, v in mapping.items(): + self[k] = v + # Add lowercase versions + for k, v in list(self.items()): + self[k.lower()] = self.get(k, v) + self[v.lower()] = v + # Build regex pattern + all_terms = list(mapping.keys()) + list(mapping.values()) + self.regexp = re.compile( + r'\b(' + '|'.join(re.escape(term) for term in all_terms) + r')\b', + re.IGNORECASE + ) + + def get_case_insensitive(self, key: str, default=None): + """Get value with case-insensitive lookup.""" + return self.get(key.lower(), default) + + +# Directional prefixes and suffixes +# Maps compass directions (English and Spanish) to 1-2 letter abbreviations +DIRECTIONAL = TwoWayMap({ + "North": "N", + "South": "S", + "East": "E", + "West": "W", + "Northeast": "NE", + "Northwest": "NW", + "Southeast": "SE", + "Southwest": "SW", + "Norte": "N", + "Sur": "S", + "Este": "E", + "Oeste": "O", + "Noreste": "NE", + "Noroeste": "NO", + "Sudeste": "SE", + "Sudoeste": "SO" +}) + + +# Prefix qualifiers (e.g., "Alternate Main Street") +PREFIX_QUALIFIER = TwoWayMap({ + "Alternate": "Alt", + "Business": "Bus", + "Bypass": "Byp", + "Extended": "Exd", + "Historic": "Hst", + "Loop": "Lp", + "Old": "Old", + "Private": "Pvt", + 
"Public": "Pub", + "Spur": "Spr", +}) + + +# Suffix qualifiers (e.g., "Main Street Extension") +SUFFIX_QUALIFIER = TwoWayMap({ + "Access": "Acc", + "Alternate": "Alt", + "Business": "Bus", + "Bypass": "Byp", + "Connector": "Con", + "Extended": "Exd", + "Extension": "Exn", + "Loop": "Lp", + "Private": "Pvt", + "Public": "Pub", + "Scenic": "Scn", + "Spur": "Spr", + "Ramp": "Rmp", + "Underpass": "Unp", + "Overpass": "Ovp", +}) + + +# Canonical prefix street types from TIGER/Line documentation +PREFIX_CANONICAL = { + "Arcade": "Arc", + "Autopista": "Autopista", + "Avenida": "Ave", + "Avenue": "Ave", + "Boulevard": "Blvd", + "Bulevar": "Bulevar", + "Bureau of Indian Affairs Highway": "BIA Hwy", + "Bureau of Indian Affairs Road": "BIA Rd", + "Bureau of Indian Affairs Route": "BIA Rte", + "Bureau of Land Management Road": "BLM Rd", + "Bypass": "Byp", + "Calle": "Cll", + "Calleja": "Calleja", + "Callejón": "Callejón", + "Caminito": "Cmt", + "Camino": "Cam", + "Carretera": "Carr", + "Cerrada": "Cer", + "Círculo": "Cír", + "Commons": "Cmns", + "Corte": "Corte", + "County Highway": "Co Hwy", + "County Lane": "Co Ln", + "County Road": "Co Rd", + "County Route": "Co Rte", + "County State Aid Highway": "Co St Aid Hwy", + "County Trunk Highway": "Co Trunk Hwy", + "County Trunk Road": "Co Trunk Rd", + "Court": "Ct", + "Delta Road": "Delta Rd", + "District of Columbia Highway": "DC Hwy", + "Driveway": "Driveway", + "Entrada": "Ent", + "Expreso": "Expreso", + "Expressway": "Expy", + "Farm Road": "Farm Rd", + "Farm-to-Market Road": "FM", + "Fire Control Road": "Fire Cntrl Rd", + "Fire District Road": "Fire Dist Rd", + "Fire Lane": "Fire Ln", + "Fire Road": "Fire Rd", + "Fire Route": "Fire Rte", + "Fire Trail": "Fire Trl", + "Forest Highway": "Forest Hwy", + "Forest Road": "Forest Rd", + "Forest Route": "Forest Rte", + "Forest Service Road": "FS Rd", + "Highway": "Hwy", + "Indian Route": "Indian Rte", + "Indian Service Route": "Indian Svc Rte", + "Interstate Highway": "I-", + "Lane": 
"Ln", + "Logging Road": "Logging Rd", + "Loop": "Loop", + "National Forest Development Road": "Nat For Dev Rd", + "Navajo Service Route": "Navajo Svc Rte", + "Parish Road": "Parish Rd", + "Pasaje": "Pasaje", + "Paseo": "Pso", + "Passage": "Psge", + "Placita": "Pla", + "Plaza": "Plz", + "Point": "Pt", + "Puente": "Puente", + "Ranch Road": "Ranch Rd", + "Ranch to Market Road": "RM", + "Reservation Highway": "Resvn Hwy", + "Road": "Rd", + "Route": "Rte", + "Row": "Row", + "Rue": "Rue", + "Ruta": "Ruta", + "Sector": "Sec", + "Sendero": "Sendero", + "Service Road": "Svc Rd", + "Skyway": "Skwy", + "Square": "Sq", + "State Forest Service Road": "St FS Rd", + "State Highway": "State Hwy", + "State Loop": "State Loop", + "State Road": "State Rd", + "State Route": "State Rte", + "State Spur": "State Spur", + "State Trunk Highway": "St Trunk Hwy", + "Terrace": "Ter", + "Town Highway": "Town Hwy", + "Town Road": "Town Rd", + "Township Highway": "Twp Hwy", + "Township Road": "Twp Rd", + "Trail": "Trl", + "Tribal Road": "Tribal Rd", + "Tunnel": "Tunl", + "US Forest Service Highway": "USFS Hwy", + "US Forest Service Road": "USFS Rd", + "US Highway": "US Hwy", + "US Route": "US Rte", + "Vereda": "Ver", + "Via": "Via", + "Vista": "Vis", +} + + +# Alternate prefix street types (USPS accepted variants) +PREFIX_ALTERNATE = { + "Av": "Ave", + "Aven": "Ave", + "Avenu": "Ave", + "Avenue": "Ave", + "Avn": "Ave", + "Avnue": "Ave", + "Boul": "Blvd", + "Boulv": "Blvd", + "Bypa": "Byp", + "Bypas": "Byp", + "Byps": "Byp", + "Crt": "Ct", + "Exp": "Expy", + "Expr": "Expy", + "Express": "Expy", + "Expw": "Expy", + "Highwy": "Hwy", + "Hiway": "Hwy", + "Hiwy": "Hwy", + "Hway": "Hwy", + "Lanes": "Ln", + "Loops": "Loop", + "Plza": "Plz", + "Sqr": "Sq", + "Sqre": "Sq", + "Squ": "Sq", + "Terr": "Ter", + "Tr": "Trl", + "Trails": "Trl", + "Trls": "Trl", + "Tunel": "Tunl", + "Tunls": "Tunl", + "Tunnels": "Tunl", + "Tunnl": "Tunl", + "Vdct": "Via", + "Viadct": "Via", + "Viaduct": "Via", + "Vist": "Vis", + 
"Vst": "Vis", + "Vsta": "Vis" +} + + +# Merged prefix types (canonical + alternates) +PREFIX_TYPE = TwoWayMap({**PREFIX_CANONICAL, **PREFIX_ALTERNATE}) + + +# Canonical suffix street types from TIGER/Line documentation +SUFFIX_CANONICAL = { + "Alley": "Aly", + "Arcade": "Arc", + "Avenida": "Ave", + "Avenue": "Ave", + "Beltway": "Beltway", + "Boulevard": "Blvd", + "Bridge": "Brg", + "Bypass": "Byp", + "Causeway": "Cswy", + "Circle": "Cir", + "Common": "Cmn", + "Commons": "Cmns", + "Corners": "Cors", + "Court": "Ct", + "Courts": "Cts", + "Crescent": "Cres", + "Crest": "Crst", + "Crossing": "Xing", + "Cutoff": "Cutoff", + "Drive": "Dr", + "Driveway": "Driveway", + "Esplanade": "Esplanade", + "Estates": "Ests", + "Expressway": "Expy", + "Forest Highway": "Forest Hwy", + "Fork": "Frk", + "Four-Wheel Drive Trail": "4WD Trl", + "Freeway": "Fwy", + "Grade": "Grade", + "Heights": "Hts", + "Highway": "Hwy", + "Jeep Trail": "Jeep Trl", + "Landing": "Lndg", + "Lane": "Ln", + "Loop": "Loop", + "Motorway": "Mtwy", + "Park": "Park", + "Parkway": "Pkwy", + "Pass": "Pass", + "Path": "Path", + "Pike": "Pike", + "Place": "Pl", + "Plaza": "Plz", + "Point": "Pt", + "Port": "Prt", + "Ranch": "Rnch", + "Ramp": "Ramp", + "Rest": "Rst", + "Ridge": "Rdg", + "Rise": "Rise", + "Road": "Rd", + "Route": "Rte", + "Row": "Row", + "Skyway": "Skwy", + "Spring": "Spg", + "Square": "Sq", + "Station": "Sta", + "Street": "St", + "Terrace": "Ter", + "Throughway": "Trwy", + "Trace": "Trce", + "Track": "Trak", + "Trail": "Trl", + "Tunnel": "Tunl", + "Turnpike": "Tpke", + "Valley": "Vly", + "Viaduct": "Via", + "View": "Vw", + "Village": "Vlg", + "Walk": "Walk", + "Way": "Way", + "Wells": "Wls", +} + + +# Alternate suffix street types (USPS accepted variants) +SUFFIX_ALTERNATE = { + "Aly": "Alley", + "Anex": "Annex", + "Annex": "Annex", + "Annx": "Annex", + "Arc": "Arcade", + "Av": "Ave", + "Aven": "Ave", + "Avenu": "Ave", + "Avenue": "Ave", + "Avn": "Ave", + "Avnue": "Ave", + "Bayoo": "Bayou", + "Bayou": 
"Bayou", + "Bch": "Beach", + "Beach": "Beach", + "Bend": "Bend", + "Bg": "Burg", + "Bgs": "Burgs", + "Blf": "Bluff", + "Blfs": "Bluffs", + "Bluf": "Bluff", + "Bluff": "Bluff", + "Bluffs": "Bluffs", + "Blvd": "Blvd", + "Bnd": "Bend", + "Bot": "Bottom", + "Bottm": "Bottom", + "Bottom": "Bottom", + "Boul": "Blvd", + "Boulv": "Blvd", + "Br": "Branch", + "Branch": "Branch", + "Brdge": "Bridge", + "Brg": "Bridge", + "Bridge": "Bridge", + "Brk": "Brook", + "Brks": "Brooks", + "Brook": "Brook", + "Brooks": "Brooks", + "Burg": "Burg", + "Burgs": "Burgs", + "Byp": "Byp", + "Bypa": "Byp", + "Bypas": "Byp", + "Bypass": "Byp", + "Byps": "Byp", + "Byu": "Bayou", + "Camp": "Camp", + "Canyn": "Canyon", + "Canyon": "Canyon", + "Cape": "Cape", + "Causeway": "Cswy", + "Causwa": "Cswy", + "Cen": "Center", + "Cent": "Center", + "Center": "Center", + "Centers": "Centers", + "Centr": "Center", + "Centre": "Center", + "Cir": "Cir", + "Circ": "Cir", + "Circl": "Cir", + "Circle": "Cir", + "Circles": "Circles", + "Cirs": "Circles", + "Ck": "Creek", + "Clf": "Cliff", + "Clfs": "Cliffs", + "Cliff": "Cliff", + "Cliffs": "Cliffs", + "Clb": "Club", + "Club": "Club", + "Cmn": "Cmn", + "Cmns": "Cmns", + "Cmp": "Camp", + "Cnter": "Center", + "Cntr": "Center", + "Cnyn": "Canyon", + "Common": "Cmn", + "Commons": "Cmns", + "Cor": "Corner", + "Corner": "Corner", + "Corners": "Cors", + "Cors": "Cors", + "Course": "Course", + "Court": "Ct", + "Courts": "Cts", + "Cove": "Cove", + "Coves": "Coves", + "Cp": "Camp", + "Cpe": "Cape", + "Cr": "Creek", + "Crcl": "Cir", + "Crcle": "Cir", + "Crecent": "Cres", + "Creek": "Creek", + "Cres": "Cres", + "Crescent": "Cres", + "Crest": "Crst", + "Crk": "Creek", + "Crossing": "Xing", + "Crossroad": "Xrd", + "Crossroads": "Xrds", + "Crse": "Course", + "Crsent": "Cres", + "Crsnt": "Cres", + "Crssng": "Xing", + "Crst": "Crst", + "Crt": "Ct", + "Cswy": "Cswy", + "Ct": "Ct", + "Ctr": "Center", + "Ctrs": "Centers", + "Cts": "Cts", + "Curv": "Curve", + "Curve": "Curve", + "Cv": 
"Cove", + "Cvs": "Coves", + "Cyn": "Canyon", + "Dale": "Dale", + "Dam": "Dam", + "Div": "Divide", + "Divide": "Divide", + "Dl": "Dale", + "Dm": "Dam", + "Dr": "Dr", + "Driv": "Dr", + "Drive": "Dr", + "Drives": "Drives", + "Drs": "Drives", + "Drv": "Dr", + "Dv": "Divide", + "Dvd": "Divide", + "Est": "Estate", + "Estate": "Estate", + "Estates": "Ests", + "Ests": "Ests", + "Exp": "Expy", + "Expr": "Expy", + "Express": "Expy", + "Expressway": "Expy", + "Expw": "Expy", + "Expy": "Expy", + "Ext": "Extension", + "Extension": "Extension", + "Extensions": "Extensions", + "Extn": "Extension", + "Extnsn": "Extension", + "Exts": "Extensions", + "Fall": "Fall", + "Falls": "Falls", + "Ferry": "Ferry", + "Field": "Field", + "Fields": "Fields", + "Flat": "Flat", + "Flats": "Flats", + "Fld": "Field", + "Flds": "Fields", + "Fls": "Falls", + "Flt": "Flat", + "Flts": "Flats", + "Ford": "Ford", + "Fords": "Fords", + "Forest": "Forest", + "Forests": "Forests", + "Forg": "Forge", + "Forge": "Forge", + "Forges": "Forges", + "Fork": "Frk", + "Forks": "Forks", + "Fort": "Fort", + "Frd": "Ford", + "Frds": "Fords", + "Freeway": "Fwy", + "Freewy": "Fwy", + "Frg": "Forge", + "Frgs": "Forges", + "Frk": "Frk", + "Frks": "Forks", + "Frry": "Ferry", + "Frst": "Forest", + "Frt": "Fort", + "Frway": "Fwy", + "Frwy": "Fwy", + "Fry": "Ferry", + "Ft": "Fort", + "Fwy": "Fwy", + "Garden": "Garden", + "Gardens": "Gardens", + "Gardn": "Garden", + "Gateway": "Gateway", + "Gatewy": "Gateway", + "Gatway": "Gateway", + "Gdn": "Garden", + "Gdns": "Gardens", + "Glen": "Glen", + "Glens": "Glens", + "Gln": "Glen", + "Glns": "Glens", + "Grden": "Garden", + "Grdn": "Garden", + "Grdns": "Gardens", + "Green": "Green", + "Greens": "Greens", + "Grn": "Green", + "Grns": "Greens", + "Grov": "Grove", + "Grove": "Grove", + "Groves": "Groves", + "Grv": "Grove", + "Grvs": "Groves", + "Gtway": "Gateway", + "Gtwy": "Gateway", + "Harb": "Harbor", + "Harbor": "Harbor", + "Harbors": "Harbors", + "Harbr": "Harbor", + "Haven": 
"Haven", + "Havn": "Haven", + "Hbr": "Harbor", + "Hbrs": "Harbors", + "Heights": "Hts", + "Highway": "Hwy", + "Highwy": "Hwy", + "Hill": "Hill", + "Hills": "Hills", + "Hiway": "Hwy", + "Hiwy": "Hwy", + "Hl": "Hill", + "Hllw": "Hollow", + "Hls": "Hills", + "Hollow": "Hollow", + "Hollows": "Hollows", + "Holw": "Hollow", + "Holws": "Hollows", + "Hrbor": "Harbor", + "Ht": "Hts", + "Hts": "Hts", + "Hvn": "Haven", + "Hway": "Hwy", + "Hwy": "Hwy", + "Inlet": "Inlet", + "Inlt": "Inlet", + "Is": "Island", + "Island": "Island", + "Islands": "Islands", + "Isle": "Isle", + "Isles": "Isles", + "Islnd": "Island", + "Islnds": "Islands", + "Iss": "Islands", + "Jct": "Junction", + "Jction": "Junction", + "Jctn": "Junction", + "Jctns": "Junctions", + "Jcts": "Junctions", + "Junction": "Junction", + "Junctions": "Junctions", + "Junctn": "Junction", + "Juncton": "Junction", + "Key": "Key", + "Keys": "Keys", + "Knl": "Knoll", + "Knls": "Knolls", + "Knol": "Knoll", + "Knoll": "Knoll", + "Knolls": "Knolls", + "Ky": "Key", + "Kys": "Keys", + "Lake": "Lake", + "Lakes": "Lakes", + "Land": "Land", + "Landing": "Lndg", + "Lane": "Ln", + "Lanes": "Ln", + "Lck": "Lock", + "Lcks": "Locks", + "Ldg": "Lodge", + "Ldge": "Lodge", + "Lf": "Loaf", + "Lgt": "Light", + "Lgts": "Lights", + "Light": "Light", + "Lights": "Lights", + "Lk": "Lake", + "Lks": "Lakes", + "Ln": "Ln", + "Lndg": "Lndg", + "Lndng": "Lndg", + "Loaf": "Loaf", + "Lock": "Lock", + "Locks": "Locks", + "Lodg": "Lodge", + "Lodge": "Lodge", + "Loop": "Loop", + "Loops": "Loop", + "Mall": "Mall", + "Manor": "Manor", + "Manors": "Manors", + "Mdw": "Meadow", + "Mdws": "Meadows", + "Meadow": "Meadow", + "Meadows": "Meadows", + "Medows": "Meadows", + "Mews": "Mews", + "Mill": "Mill", + "Mills": "Mills", + "Mission": "Mission", + "Missn": "Mission", + "Ml": "Mill", + "Mls": "Mills", + "Mnr": "Manor", + "Mnrs": "Manors", + "Mnt": "Mount", + "Mntain": "Mountain", + "Mntn": "Mountain", + "Mntns": "Mountains", + "Motorway": "Mtwy", + "Mount": 
"Mount", + "Mountain": "Mountain", + "Mountains": "Mountains", + "Mountin": "Mountain", + "Msn": "Mission", + "Mssn": "Mission", + "Mt": "Mount", + "Mtin": "Mountain", + "Mtn": "Mountain", + "Mtns": "Mountains", + "Mtwy": "Mtwy", + "Nck": "Neck", + "Neck": "Neck", + "Opas": "Overpass", + "Orch": "Orchard", + "Orchard": "Orchard", + "Orchrd": "Orchard", + "Oval": "Oval", + "Overpass": "Overpass", + "Ovl": "Oval", + "Park": "Park", + "Parks": "Parks", + "Parkway": "Pkwy", + "Parkways": "Parkways", + "Parkwy": "Pkwy", + "Pass": "Pass", + "Passage": "Psge", + "Path": "Path", + "Paths": "Path", + "Pike": "Pike", + "Pikes": "Pike", + "Pine": "Pine", + "Pines": "Pines", + "Pk": "Park", + "Pkway": "Pkwy", + "Pkwy": "Pkwy", + "Pkwys": "Parkways", + "Pky": "Pkwy", + "Pl": "Pl", + "Place": "Pl", + "Plain": "Plain", + "Plains": "Plains", + "Plaza": "Plz", + "Pln": "Plain", + "Plns": "Plains", + "Plz": "Plz", + "Plza": "Plz", + "Pne": "Pine", + "Pnes": "Pines", + "Point": "Pt", + "Points": "Points", + "Port": "Prt", + "Ports": "Ports", + "Pr": "Prairie", + "Prairie": "Prairie", + "Prk": "Park", + "Prr": "Prairie", + "Prt": "Prt", + "Prts": "Ports", + "Psge": "Psge", + "Pt": "Pt", + "Pts": "Points", + "Rad": "Radial", + "Radial": "Radial", + "Radiel": "Radial", + "Radl": "Radial", + "Ramp": "Ramp", + "Ranch": "Rnch", + "Ranches": "Ranches", + "Rapid": "Rapid", + "Rapids": "Rapids", + "Rd": "Rd", + "Rdg": "Rdg", + "Rdge": "Rdg", + "Rdgs": "Ridges", + "Rds": "Roads", + "Rest": "Rst", + "Ridge": "Rdg", + "Ridges": "Ridges", + "Rise": "Rise", + "Riv": "River", + "River": "River", + "Rivr": "River", + "Rnch": "Rnch", + "Rnchs": "Ranches", + "Road": "Rd", + "Roads": "Roads", + "Route": "Rte", + "Row": "Row", + "Rpd": "Rapid", + "Rpds": "Rapids", + "Rst": "Rst", + "Rte": "Rte", + "Rue": "Rue", + "Run": "Run", + "Rvr": "River", + "Shl": "Shoal", + "Shls": "Shoals", + "Shoal": "Shoal", + "Shoals": "Shoals", + "Shoar": "Shore", + "Shoars": "Shores", + "Shore": "Shore", + "Shores": 
"Shores", + "Shr": "Shore", + "Shrs": "Shores", + "Skwy": "Skwy", + "Skyway": "Skwy", + "Smt": "Summit", + "Spg": "Spg", + "Spgs": "Springs", + "Spng": "Spg", + "Spngs": "Springs", + "Spring": "Spg", + "Springs": "Springs", + "Sprng": "Spg", + "Sprngs": "Springs", + "Spur": "Spur", + "Spurs": "Spur", + "Sq": "Sq", + "Sqr": "Sq", + "Sqre": "Sq", + "Sqrs": "Squares", + "Sqs": "Squares", + "Squ": "Sq", + "Square": "Sq", + "Squares": "Squares", + "St": "St", + "Sta": "Sta", + "Station": "Sta", + "Statn": "Sta", + "Stn": "Sta", + "Str": "St", + "Stra": "Stra", + "Strav": "Stra", + "Strave": "Stra", + "Straven": "Stra", + "Stravenue": "Stra", + "Stravn": "Stra", + "Stream": "Stream", + "Street": "St", + "Streets": "Streets", + "Streme": "Stream", + "Strm": "Stream", + "Strt": "St", + "Strvn": "Stra", + "Strvnue": "Stra", + "Sts": "Streets", + "Sumit": "Summit", + "Sumitt": "Summit", + "Summit": "Summit", + "Smt": "Summit", + "Ter": "Ter", + "Terr": "Ter", + "Terrace": "Ter", + "Throughway": "Trwy", + "Tpke": "Tpke", + "Tr": "Trl", + "Trace": "Trce", + "Traces": "Trce", + "Track": "Trak", + "Tracks": "Trak", + "Trafficway": "Trfy", + "Trail": "Trl", + "Trailer": "Trlr", + "Trails": "Trl", + "Trak": "Trak", + "Trce": "Trce", + "Trfy": "Trfy", + "Trk": "Trak", + "Trks": "Trak", + "Trl": "Trl", + "Trlr": "Trlr", + "Trlrs": "Trlrs", + "Trls": "Trl", + "Trnpk": "Tpke", + "Trwy": "Trwy", + "Tunel": "Tunl", + "Tunl": "Tunl", + "Tunls": "Tunl", + "Tunnel": "Tunl", + "Tunnels": "Tunl", + "Tunnl": "Tunl", + "Turnpike": "Tpke", + "Turnpk": "Tpke", + "Un": "Union", + "Underpass": "Unp", + "Union": "Union", + "Unions": "Unions", + "Unp": "Unp", + "Uns": "Unions", + "Upas": "Unp", + "Valley": "Vly", + "Valleys": "Valleys", + "Vally": "Vly", + "Vdct": "Via", + "Via": "Via", + "Viadct": "Via", + "Viaduct": "Via", + "View": "Vw", + "Views": "Views", + "Vill": "Village", + "Villag": "Village", + "Village": "Vlg", + "Villages": "Villages", + "Ville": "Ville", + "Villg": "Village", + 
"Villiage": "Village", + "Vist": "Vis", + "Vista": "Vis", + "Vl": "Ville", + "Vlg": "Vlg", + "Vlgs": "Villages", + "Vlly": "Vly", + "Vly": "Vly", + "Vlys": "Valleys", + "Vst": "Vis", + "Vsta": "Vis", + "Vw": "Vw", + "Vws": "Views", + "Walk": "Walk", + "Walks": "Walks", + "Wall": "Wall", + "Way": "Way", + "Ways": "Ways", + "Well": "Well", + "Wells": "Wls", + "Wl": "Well", + "Wls": "Wls", + "Wy": "Way", + "Xing": "Xing", + "Xrd": "Xrd", + "Xrds": "Xrds", +} + + +# Merged suffix types (canonical + alternates) +SUFFIX_TYPE = TwoWayMap({**SUFFIX_CANONICAL, **SUFFIX_ALTERNATE}) + + +# US States and territories +STATE = TwoWayMap({ + "Alabama": "AL", + "Alaska": "AK", + "American Samoa": "AS", + "Arizona": "AZ", + "Arkansas": "AR", + "California": "CA", + "Colorado": "CO", + "Connecticut": "CT", + "Delaware": "DE", + "District of Columbia": "DC", + "Federated States of Micronesia": "FM", + "Florida": "FL", + "Georgia": "GA", + "Guam": "GU", + "Hawaii": "HI", + "Idaho": "ID", + "Illinois": "IL", + "Indiana": "IN", + "Iowa": "IA", + "Kansas": "KS", + "Kentucky": "KY", + "Louisiana": "LA", + "Maine": "ME", + "Marshall Islands": "MH", + "Maryland": "MD", + "Massachusetts": "MA", + "Michigan": "MI", + "Minnesota": "MN", + "Mississippi": "MS", + "Missouri": "MO", + "Montana": "MT", + "Nebraska": "NE", + "Nevada": "NV", + "New Hampshire": "NH", + "New Jersey": "NJ", + "New Mexico": "NM", + "New York": "NY", + "North Carolina": "NC", + "North Dakota": "ND", + "Northern Mariana Islands": "MP", + "Ohio": "OH", + "Oklahoma": "OK", + "Oregon": "OR", + "Palau": "PW", + "Pennsylvania": "PA", + "Puerto Rico": "PR", + "Rhode Island": "RI", + "South Carolina": "SC", + "South Dakota": "SD", + "Tennessee": "TN", + "Texas": "TX", + "Utah": "UT", + "Vermont": "VT", + "Virgin Islands": "VI", + "Virginia": "VA", + "Washington": "WA", + "West Virginia": "WV", + "Wisconsin": "WI", + "Wyoming": "WY" +}) diff --git a/geocoder_us/database.py b/geocoder_us/database.py new file mode 100644 index 
0000000..10ec8c9 --- /dev/null +++ b/geocoder_us/database.py @@ -0,0 +1,249 @@ +""" +Database interface for geocoding with DuckDB. + +This module provides the database layer for querying street address data +using DuckDB with spatial extensions. +""" + +import duckdb +from typing import List, Dict, Optional, Any +from pathlib import Path +import threading + + +class GeocoderDatabase: + """ + Interface to DuckDB geocoding database with spatial support. + + This class manages connections to the geocoder database and provides + methods for querying street range data, places, and features. + """ + + # Scoring weights for address matching + STREET_WEIGHT = 3.0 + NUMBER_WEIGHT = 2.0 + PARITY_WEIGHT = 1.25 + CITY_WEIGHT = 1.0 + + def __init__(self, db_path: str, threadsafe: bool = True): + """ + Initialize database connection. + + Args: + db_path: Path to DuckDB database file + threadsafe: Whether to use thread-safe access + """ + self.db_path = db_path + self.threadsafe = threadsafe + self._lock = threading.Lock() if threadsafe else None + self._conn: Optional[duckdb.DuckDBPyConnection] = None + + # Initialize connection + self._connect() + + def _connect(self) -> None: + """ + Establish connection to database and load extensions. 
+ """ + if not Path(self.db_path).exists(): + raise FileNotFoundError(f"Database not found: {self.db_path}") + + # Create connection + self._conn = duckdb.connect(self.db_path, read_only=True) + + # Load spatial extension + try: + self._conn.execute("INSTALL spatial;") + self._conn.execute("LOAD spatial;") + except Exception as e: + print(f"Warning: Could not load spatial extension: {e}") + + # Load fuzzystrsim extension for Levenshtein distance + try: + self._conn.execute("INSTALL fuzzystrsim;") + self._conn.execute("LOAD fuzzystrsim;") + except Exception as e: + print(f"Warning: Could not load fuzzystrsim extension: {e}") + + def _execute(self, query: str, params: Optional[tuple] = None) -> List[Dict[str, Any]]: + """ + Execute query with optional parameters. + + Args: + query: SQL query string + params: Query parameters + + Returns: + List of result rows as dictionaries + """ + if self.threadsafe and self._lock: + with self._lock: + return self._execute_query(query, params) + else: + return self._execute_query(query, params) + + def _execute_query(self, query: str, params: Optional[tuple]) -> List[Dict[str, Any]]: + """ + Internal query execution. + + Args: + query: SQL query + params: Parameters + + Returns: + Query results + """ + if not self._conn: + self._connect() + + # Execute query + if params: + result = self._conn.execute(query, params) + else: + result = self._conn.execute(query) + + # Fetch all results + rows = result.fetchall() + columns = [desc[0] for desc in result.description] if result.description else [] + + # Convert to list of dictionaries + return [dict(zip(columns, row)) for row in rows] + + def places_by_zip(self, city: str, zip_code: str) -> List[Dict[str, Any]]: + """ + Query places by ZIP code. 
+ + Args: + city: City name + zip_code: 5-digit ZIP code + + Returns: + List of matching places with Levenshtein distance scores + """ + # TODO: Provisional query; revisit once the database schema is finalized + query = """ + SELECT *, levenshtein(?, city) AS city_score + FROM place + WHERE zip = ? + ORDER BY priority DESC + """ + return self._execute(query, (city, zip_code)) + + def places_by_city(self, city: str, city_tokens: List[str], state: Optional[str] = None) -> List[Dict[str, Any]]: + """ + Query places by city name with metaphone matching. + + Args: + city: City name + city_tokens: City name tokens for metaphone matching + state: Optional state filter + + Returns: + List of matching places + """ + # TODO: Implement with metaphone matching once schema is ready + # This will use DuckDB's ability to create custom functions or + # use the metaphone results from Python + pass + + def features_by_street(self, street: str, street_tokens: List[str]) -> List[Dict[str, Any]]: + """ + Query features (street segments) by street name. + + Args: + street: Street name + street_tokens: Street name tokens for matching + + Returns: + List of matching features with Levenshtein scores + """ + # TODO: Implement once database schema is finalized + # This will query the feature table with metaphone-based matching + pass + + def features_by_street_and_zip(self, street: str, street_tokens: List[str], + zip_codes: List[str]) -> List[Dict[str, Any]]: + """ + Query features by street name and ZIP codes. + + Args: + street: Street name + street_tokens: Street tokens + zip_codes: List of ZIP codes to filter + + Returns: + Matching features + """ + # TODO: Implement with ZIP filter + pass + + def ranges_by_feature(self, feature_ids: List[int], number: str, + prenum: Optional[str] = None) -> List[Dict[str, Any]]: + """ + Query address ranges for given features. 
+ + Args: + feature_ids: Feature IDs to query + number: Street number + prenum: Optional prefix number + + Returns: + Matching ranges sorted by address number proximity + """ + # TODO: Implement range queries + pass + + def geocode_address(self, address: str) -> List[Dict[str, Any]]: + """ + Main geocoding method (placeholder). + + This will be the primary interface for geocoding an address string. + + Args: + address: Address string to geocode + + Returns: + List of geocoding results with scores + """ + # TODO: Implement full geocoding pipeline: + # 1. Parse address + # 2. Query places by ZIP/city + # 3. Query features by street + # 4. Query ranges for address number + # 5. Calculate coordinates + # 6. Rank results by score + + raise NotImplementedError("Full geocoding pipeline not yet implemented") + + def close(self) -> None: + """Close database connection.""" + if self._conn: + self._conn.close() + self._conn = None + + def __enter__(self): + """Context manager entry.""" + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + """Context manager exit.""" + self.close() + + def __del__(self): + """Cleanup on deletion.""" + self.close() + + +# Convenience function for creating database instance +def connect_database(db_path: str = "/opt/geocoder.db", threadsafe: bool = True) -> GeocoderDatabase: + """ + Create a database connection. + + Args: + db_path: Path to DuckDB database + threadsafe: Enable thread-safe access + + Returns: + GeocoderDatabase instance + """ + return GeocoderDatabase(db_path, threadsafe) diff --git a/geocoder_us/metaphone.py b/geocoder_us/metaphone.py new file mode 100644 index 0000000..bb2c711 --- /dev/null +++ b/geocoder_us/metaphone.py @@ -0,0 +1,195 @@ +""" +Metaphone phonetic matching for address geocoding. + +This module provides phonetic matching functionality using the Metaphone algorithm, +which helps match street names despite spelling variations. 
+""" + +from typing import Optional +import re + + +# Simple Metaphone implementation based on the Ruby version +# This is a simplified version - can be replaced with python-metaphone library +class Metaphone: + """ + Metaphone phonetic algorithm for fuzzy string matching. + + This implementation follows the standard Metaphone rules, which map + similar-sounding words to the same phonetic code. + """ + + # Metaphone transformation rules (pattern, replacement) + RULES = [ + # Collapse doubled consonants (the character class below includes 'c') + (re.compile(r'([bcdfghjklmnpqrstvwxyz])\1+', re.IGNORECASE), r'\1'), + + # Initial patterns + (re.compile(r'^ae', re.IGNORECASE), 'E'), + (re.compile(r'^[gkp]n', re.IGNORECASE), 'N'), + (re.compile(r'^wr', re.IGNORECASE), 'R'), + (re.compile(r'^x', re.IGNORECASE), 'S'), + (re.compile(r'^wh', re.IGNORECASE), 'W'), + + # Terminal patterns + (re.compile(r'mb$', re.IGNORECASE), 'M'), + + # Middle patterns + (re.compile(r'(?!^)sch', re.IGNORECASE), 'SK'), + (re.compile(r'th', re.IGNORECASE), '0'), + (re.compile(r't?ch|sh', re.IGNORECASE), 'X'), + (re.compile(r'c(?=ia)', re.IGNORECASE), 'X'), + (re.compile(r'[st](?=i[ao])', re.IGNORECASE), 'X'), + (re.compile(r's?c(?=[iey])', re.IGNORECASE), 'S'), + (re.compile(r'[cq]', re.IGNORECASE), 'K'), + (re.compile(r'dg(?=[iey])', re.IGNORECASE), 'J'), + (re.compile(r'd', re.IGNORECASE), 'T'), + (re.compile(r'g(?=h[^aeiou])', re.IGNORECASE), ''), + (re.compile(r'gn(ed)?', re.IGNORECASE), 'N'), + (re.compile(r'([^g]|^)g(?=[iey])', re.IGNORECASE), r'\1J'), + (re.compile(r'g+', re.IGNORECASE), 'K'), + (re.compile(r'ph', re.IGNORECASE), 'F'), + (re.compile(r'([aeiou])h(?=\b|[^aeiou])', re.IGNORECASE), r'\1'), + (re.compile(r'[wy](?![aeiou])', re.IGNORECASE), ''), + (re.compile(r'z', re.IGNORECASE), 'S'), + (re.compile(r'v', re.IGNORECASE), 'F'), + (re.compile(r'(?!^)[aeiou]+', re.IGNORECASE), ''), + ] + + @classmethod + def encode(cls, text: str, max_length: int = 0) -> str: + """ + Convert text to Metaphone phonetic code. 
+ + Args: + text: Input text to encode + max_length: Maximum length of output (0 = unlimited) + + Returns: + Metaphone code + """ + if not text: + return "" + + # Normalize: lowercase and remove non-alphabetic characters + text = re.sub(r'[^a-z]', '', text.lower()) + + if not text: + return "" + + # Apply Metaphone rules + for pattern, replacement in cls.RULES: + text = pattern.sub(replacement, text) + + # Uppercase result + result = text.upper() + + # Limit length if requested + if max_length > 0: + result = result[:max_length] + + return result + + @classmethod + def encode_multiple(cls, text: str, max_length: int = 0) -> str: + """ + Encode multiple words separated by spaces. + + Args: + text: Space-separated words + max_length: Maximum length per word + + Returns: + Space-separated Metaphone codes + """ + if not text: + return "" + + words = text.strip().split() + codes = [cls.encode(word, max_length) for word in words] + return ' '.join(code for code in codes if code) + + +def metaphone(text: str, max_length: int = 5) -> str: + """ + Convenience function for metaphone encoding. + + Args: + text: Text to encode + max_length: Maximum length of code (default: 5) + + Returns: + Metaphone phonetic code + """ + return Metaphone.encode(text, max_length) + + +def metaphone_match(text1: str, text2: str, max_length: int = 5) -> bool: + """ + Check if two texts match phonetically. + + Args: + text1: First text + text2: Second text + max_length: Maximum code length for comparison + + Returns: + True if metaphone codes match + """ + code1 = metaphone(text1, max_length) + code2 = metaphone(text2, max_length) + return code1 == code2 and len(code1) > 0 + + +def metaphone_similarity(text1: str, text2: str, max_length: int = 5) -> float: + """ + Calculate phonetic similarity between two texts. 
+ + Args: + text1: First text + text2: Second text + max_length: Maximum code length + + Returns: + Similarity score between 0.0 and 1.0 + """ + code1 = metaphone(text1, max_length) + code2 = metaphone(text2, max_length) + + if not code1 or not code2: + return 0.0 + + if code1 == code2: + return 1.0 + + # Calculate character-level similarity + matches = sum(c1 == c2 for c1, c2 in zip(code1, code2)) + max_len = max(len(code1), len(code2)) + + return matches / max_len if max_len > 0 else 0.0 + + +# For compatibility with external metaphone libraries +try: + from metaphone import doublemetaphone + + def metaphone_double(text: str) -> tuple: + """ + Use Double Metaphone if available (more accurate). + + Args: + text: Text to encode + + Returns: + Tuple of (primary code, secondary code) + """ + return doublemetaphone(text) + + HAS_DOUBLE_METAPHONE = True +except ImportError: + HAS_DOUBLE_METAPHONE = False + + def metaphone_double(text: str) -> tuple: + """Fallback if doublemetaphone not available.""" + code = metaphone(text) + return (code, code) diff --git a/geocoder_us/preprocessing.py b/geocoder_us/preprocessing.py new file mode 100644 index 0000000..2a1d8a9 --- /dev/null +++ b/geocoder_us/preprocessing.py @@ -0,0 +1,99 @@ +""" +Address preprocessing utilities. + +This module provides functions for cleaning and validating addresses, +ported from the dht R package functionality. +""" + +import re +from typing import Optional + + +def clean_address(address: str) -> str: + """ + Clean an address string by normalizing whitespace and removing + special characters. 
+ + Args: + address: Raw address string + + Returns: + Cleaned address string + """ + if not address or not isinstance(address, str): + return "" + + # Strip leading/trailing whitespace + cleaned = address.strip() + + # Remove special characters but keep basic punctuation (whitespace is kept for now) + cleaned = re.sub(r'[^a-zA-Z0-9\s,.\-#&@/]', '', cleaned) + + # Normalize whitespace last so gaps left by removed characters collapse too + cleaned = re.sub(r'\s+', ' ', cleaned).strip() + + return cleaned + + +def address_is_po_box(address: str) -> bool: + """ + Check if an address is a PO Box. + + Args: + address: Address string to check + + Returns: + True if address appears to be a PO Box + """ + if not address: + return False + + # Pattern matches: PO Box, P.O. Box, P O Box, etc. Every prefix part is optional, so a bare "Box" word also matches. + po_box_pattern = r'\b[Pp]*(OST|ost)*\.?\s*[Oo0]*(ffice|FFICE)*\.?\s*[Bb][Oo0][Xx]\b' + return bool(re.search(po_box_pattern, address)) + + +def address_is_institutional(address: str) -> bool: + """ + Check if an address is a known Cincinnati institutional address. + + This is specific to the Cincinnati area institutional/foster addresses + that should not be geocoded to protect privacy. + + Args: + address: Address string to check + + Returns: + True if address is flagged as institutional + """ + if not address: + return False + + # Cincinnati Children's Hospital Medical Center + if "3333 BURNET" in address.upper(): + return True + + # Add other institutional addresses as needed + return False + + +def address_is_nonaddress(address: str) -> bool: + """ + Check if the address field contains non-address text.
+ + Args: + address: Address string to check + + Returns: + True if field is blank or contains placeholder text + """ + if not address or not address.strip(): + return True + + # Check for common placeholder values + non_address_values = { + "foreign", "verify", "unknown", "na", "n/a", "none", + "not applicable", "missing" + } + + return address.lower().strip() in non_address_values diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..fa9cf73 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,22 @@ +# Python Geocoder Requirements +# For US address geocoding with DuckDB + +# Core dependencies +pandas>=2.0.0 +duckdb>=0.9.0 +tabulate>=0.9.0 + +# For address parsing and string matching +python-Levenshtein>=0.21.0 +metaphone>=0.6 + +# CLI and utilities +click>=8.1.0 # Alternative to argparse if needed +rich>=13.0.0 # For better console output + +# Parallel processing and caching +joblib>=1.3.0 + +# Testing (optional, for development) +pytest>=7.4.0 +pytest-cov>=4.1.0 diff --git a/test_integration.py b/test_integration.py new file mode 100644 index 0000000..b079b78 --- /dev/null +++ b/test_integration.py @@ -0,0 +1,88 @@ +#!/usr/bin/env python3 +""" +Integration test for the geocoder pipeline. + +Tests the full workflow from CSV input to geocoded output. 
+""" + +import sys +import os +import pandas as pd +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +# Create test CSV +test_data = pd.DataFrame({ + 'id': [1, 2, 3, 4, 5], + 'address': [ + '123 Main St, Springfield, IL 62701', + '1600 Pennsylvania Ave, Washington, DC 20500', + 'PO Box 123, Anytown, CA 90210', + '3333 BURNET AVE CINCINNATI, OH 45229', + 'unknown' + ] +}) + +test_file = '/tmp/test_addresses.csv' +test_data.to_csv(test_file, index=False) +print(f"Created test file: {test_file}") +print(f"Test data:\n{test_data}\n") + +# Test the entrypoint +print("=" * 60) +print("Testing geocoder entrypoint") +print("=" * 60) + +# Import and run +from entrypoint import ( + read_input_file, + preprocess_addresses, + geocode_addresses, + write_output_file, + print_summary +) + +try: + # Read input + df = read_input_file(test_file) + print(f"\n✓ Read {len(df)} addresses") + + # Preprocess + df = preprocess_addresses(df) + print(f"\n✓ Preprocessed addresses") + print(f" Flagged addresses:") + print(f" PO Box: {df['po_box'].sum()}") + print(f" Institutional: {df['cincy_inst_foster_addr'].sum()}") + print(f" Non-address: {df['non_address_text'].sum()}") + + # Geocode + df = geocode_addresses(df, score_threshold=0.5) + print(f"\n✓ Geocoded addresses") + + # Check results + print(f"\nResults preview:") + cols_to_show = ['address', 'matched_street', 'matched_city', 'matched_state', 'score', 'geocode_result'] + print(df[cols_to_show].to_string()) + + # Write output + output_file = write_output_file(df, test_file, 0.5) + print(f"\n✓ Wrote output file: {output_file}") + + # Print summary + print_summary(df) + + # Clean up + os.remove(test_file) + if os.path.exists(output_file): + os.remove(output_file) + + print("\n" + "=" * 60) + print("✓ Integration test passed!") + print("=" * 60) + +except Exception as e: + print(f"\n✗ Integration test failed: {e}") + import traceback + traceback.print_exc() + sys.exit(1) diff --git a/test_modules.py
b/test_modules.py new file mode 100644 index 0000000..d04f6f5 --- /dev/null +++ b/test_modules.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 +""" +Quick test script for geocoder_us modules. + +Tests address parsing, metaphone encoding, and basic functionality. +""" + +import sys +from pathlib import Path +sys.path.insert(0, str(Path(__file__).resolve().parent)) + +from geocoder_us.address import Address +from geocoder_us.metaphone import metaphone, metaphone_similarity +from geocoder_us import constants + +print("=" * 60) +print("Testing geocoder_us modules") +print("=" * 60) + +# Test 1: Address Parsing +print("\n1. Address Parsing Test") +print("-" * 40) +test_addresses = [ + "1600 Pennsylvania Ave Washington DC 20500", + "3333 BURNET AVE CINCINNATI, OH 45229", + "123 Main St, Springfield, IL 62701", + "PO Box 123, Anytown, CA 90210" +] + +for addr_str in test_addresses: + try: + addr = Address(addr_str) + print(f"\nInput: {addr_str}") + print(f"Parsed: {addr}") + print(f" Number: {addr.number}") + print(f" Street: {addr.street[:2] if len(addr.street) > 2 else addr.street}") + print(f" City: {addr.city}") + print(f" State: {addr.state}") + print(f" ZIP: {addr.zip}") + print(f" PO Box: {addr.is_po_box()}") + except Exception as e: + print(f"Error parsing '{addr_str}': {e}") + +# Test 2: Metaphone Encoding +print("\n\n2. Metaphone Encoding Test") +print("-" * 40) +test_words = [ + ("Main", "Maine"), + ("Street", "Streat"), + ("Avenue", "Avenu"), + ("Washington", "Washinton") +] + +for word1, word2 in test_words: + code1 = metaphone(word1) + code2 = metaphone(word2) + similarity = metaphone_similarity(word1, word2) + print(f"{word1:15} -> {code1:10} | {word2:15} -> {code2:10} | Sim: {similarity:.2f}") + +# Test 3: Constants +print("\n\n3.
Constants Test") +print("-" * 40) +print(f"States loaded: {len(constants.STATE)} entries") +print(f"Street suffixes: {len(constants.SUFFIX_TYPE)} entries") +print(f"Sample state lookup: 'Ohio' -> '{constants.STATE.get('Ohio', 'NOT FOUND')}'") +print(f"Sample state lookup: 'CA' -> '{constants.STATE.get('CA', 'NOT FOUND')}'") + +# Test 4: Street Parts Generation +print("\n\n4. Street Parts Generation Test") +print("-" * 40) +addr = Address("123 North Main Street Springfield IL") +parts = addr.street_parts() +print(f"Address: {addr.original_text}") +print(f"Street parts ({len(parts)}): {parts[:5]}") + +print("\n" + "=" * 60) +print("All tests completed!") +print("=" * 60)
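
Review note on `metaphone_similarity` from the `geocoder_us/metaphone.py` hunk above: the score compares the two codes position by position and divides by the longer length, so a one-character shift at the start of a code can drop the score sharply. The block below is a minimal sketch of that comparison step only. It assumes the codes are already computed; `positional_similarity` is a hypothetical stand-in name, not a function from this PR, and the sample codes are illustrative strings rather than outputs of the `Metaphone` class.

```python
def positional_similarity(code1: str, code2: str) -> float:
    """Hypothetical stand-in mirroring the comparison step of metaphone_similarity."""
    if not code1 or not code2:
        return 0.0
    if code1 == code2:
        return 1.0
    # Count positions where both codes carry the same character
    matches = sum(c1 == c2 for c1, c2 in zip(code1, code2))
    # Normalize by the longer code so extra trailing characters lower the score
    return matches / max(len(code1), len(code2))


# Illustrative codes only; they are not outputs of the Metaphone class in this diff
print(positional_similarity("MN", "MN"))     # identical codes
print(positional_similarity("MN", "MNK"))    # shared prefix, one extra character
print(positional_similarity("STRT", "TRT"))  # same tail, shifted by one position
```

The last pair shares the suffix `TRT` yet has no matching positions, so it scores zero. If shifted matches should still count, an edit-distance ratio might be a better fit; `python-Levenshtein` is already listed in `requirements.txt`.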