A Python tool I built to find undervalued properties automatically. It scrapes Craigslist, cleans the data, and flags properties that are priced below market value.
- Scrapes - Pulls listings from Craigslist NY real estate
- Cleans - Removes spam, rentals, duplicates. Calculates price/sqft
- Stores - Saves everything to a SQLite database
- Analyzes - Finds properties below average price/sqft

```bash
# Install dependencies
pip install -r requirements.txt

# Run the whole pipeline
python main.py
```

That is it. Results go into the database.

```
main.py              # Run this - executes full pipeline
config/
    db_config.py     # Database path
data/
    raw/             # Scraped data (unprocessed)
    cleaned/         # Cleaned data (ready for analysis)
modules/
    scraper.py       # Scrapes Craigslist
    cleaner.py       # Cleans data, filters rentals
    database.py      # Saves to SQLite
```

# Real Estate Arbitrage Scanner
Professional, end-to-end ETL pipeline to discover undervalued real estate. It scrapes listings from Craigslist NY, cleans and enriches the data, stores it in SQLite, and flags opportunities based on price-per-square-foot analysis.
## Highlights
- Automated scraping with resilient selectors (supports evolving Craigslist HTML)
- Smart cleaning: rental filtering, deduplication, normalization
- Location-aware price-per-sqft metrics and undervaluation detection
- SQLite storage for history, reproducibility, and downstream analytics
- Modular codebase with single-command pipeline run
## Repository Layout

```
Arbitrage/
│
├── main.py                 # Entry point – runs the complete ETL pipeline
├── README.md               # Project documentation (you are here)
├── requirements.txt        # Python dependencies (pinned versions)
├── wrok_done.txt           # Development changelog & notes
│
├── config/
│   └── db_config.py        # Database configuration (SQLite path)
│
├── data/
│   ├── raw/
│   │   └── raw_properties.csv      # Raw scraped listings (unprocessed)
│   └── cleaned/
│       └── cleaned_properties.csv  # Cleaned data with computed metrics
│
├── modules/                # Core business logic
│   ├── scraper.py          # Web scraper (Craigslist → CSV)
│   ├── cleaner.py          # Data cleaning & transformation
│   ├── database.py         # SQLite persistence layer
│   └── analyzer.py         # Arbitrage detection & reporting
│
└── scripts/                # Standalone runners
    ├── run_scraper.py      # Execute scraper only
    ├── run_cleaner.py      # Execute cleaner only
    └── run_db_updates.py   # Execute database update only
```

| Directory | Purpose |
|-----------|---------|
| `config/` | Centralized configuration (database path, future API keys) |
| `data/raw/` | Immutable scraped data – preserved for reproducibility |
| `data/cleaned/` | Transformed data ready for analysis and storage |
| `modules/` | Reusable Python modules implementing core pipeline logic |
| `scripts/` | CLI entry points for running individual pipeline stages |
## Setup
### Prerequisites
- Windows (tested), Python 3.12+
### Create venv and install
```bash
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```

## Configuration

The SQLite path is set in `config/db_config.py`:

```python
DB_PATH = "data/properties.db"
```

Change it if you need a different database location.

## Usage

```bash
python main.py
```

This will:
- Scrape → write `data/raw/raw_properties.csv`
- Clean → write `data/cleaned/cleaned_properties.csv`
- Load → update the `properties` table in SQLite
- Analyze → save the `arbitrage_opportunities` table
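
The four steps above run in a fixed order, each one consuming what the previous one produced. A minimal sketch of that chaining is shown here; `run_pipeline` and the `Stage` alias are hypothetical names for illustration and are not taken from `main.py`, which may be organized differently.

```python
from typing import Any, Callable, Iterable, Tuple

# A stage is a (name, function) pair. Both names are illustrative only.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: Iterable[Stage], initial: Any = None) -> Any:
    """Run stages in order, feeding each stage the previous stage's result."""
    result = initial
    for name, stage in stages:
        print(f"[{name}] running")
        result = stage(result)
    return result
```

In a real run, the stage functions would be the scrape, clean, load, and analyze steps; passing them as a list keeps the order explicit and makes it easy to run only a subset.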

### Run individual stages

```bash
python scripts/run_scraper.py
python scripts/run_cleaner.py
python scripts/run_db_updates.py
```

## Modules

### Scraper (`modules/scraper.py`)
- Targets: https://newyork.craigslist.org/search/rea
- Extracts: `title`, `price`, `area_sqft` (from title), `location`, `bedrooms`, `bathrooms`, `url`
- Options: `fetch_details=False` by default (fast); enable it to fetch per-listing pages for missing sqft
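
Since `area_sqft` comes from the listing title, the extraction is essentially pattern matching. The sketch below shows one way that could work; the regular expression and the `extract_sqft` name are assumptions for illustration, not the code in `modules/scraper.py`.

```python
import re
from typing import Optional

# Assumed pattern: a number (with optional thousands separators or decimals)
# followed by a unit such as "sf", "sqft", "sq ft", "sq. ft." or "square feet".
SQFT_PATTERN = re.compile(
    r"(\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?)\s*"
    r"(?:sq\.?\s*ft\.?|sqft|sf|square\s+feet)\b",
    re.IGNORECASE,
)


def extract_sqft(title: str) -> Optional[float]:
    """Return the first square-footage figure found in a title, or None."""
    match = SQFT_PATTERN.search(title)
    if match is None:
        return None
    # Strip thousands separators before converting to a number.
    return float(match.group(1).replace(",", ""))
```

Titles that do not mention an area return `None`, which is the case where enabling `fetch_details` would be needed to fill in the value.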

### Cleaner (`modules/cleaner.py`)
- Converts numeric fields
- Filters rentals: `/reb/` URLs and prices < $100k
- Deduplicates by `url` and by `title` + `price`
- Categorizes property type (Commercial, Land, Multi-Family, etc.)
- Computes `price_per_sqft` and location averages
- Flags `undervalued` using the location average; falls back to the global average
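
The filtering, deduplication, and flagging rules above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `modules/cleaner.py`: the function names are hypothetical, "undervalued" is taken to mean strictly below the reference average, and the fallback to the global average is assumed to apply when a location is missing or has fewer than `min_listings` usable listings.

```python
from collections import defaultdict
from statistics import mean


def drop_rentals_and_duplicates(listings):
    """Drop /reb/ URLs, prices under $100k, and repeated url or title+price."""
    seen_urls, seen_title_price, kept = set(), set(), []
    for row in listings:
        if "/reb/" in row["url"] or row["price"] < 100_000:
            continue
        key = (row["title"], row["price"])
        if row["url"] in seen_urls or key in seen_title_price:
            continue
        seen_urls.add(row["url"])
        seen_title_price.add(key)
        kept.append(row)
    return kept


def flag_undervalued(listings, min_listings=2):
    """Add price_per_sqft, the reference average, and the undervalued flag.

    min_listings is an assumed threshold: locations with fewer usable
    listings are compared against the global average instead.
    """
    # Only rows with both a price and an area can get a price per sqft.
    rows = [dict(r) for r in listings if r.get("price") and r.get("area_sqft")]
    if not rows:
        return []
    for row in rows:
        row["price_per_sqft"] = row["price"] / row["area_sqft"]

    global_avg = mean(r["price_per_sqft"] for r in rows)
    by_location = defaultdict(list)
    for row in rows:
        if row.get("location"):
            by_location[row["location"]].append(row["price_per_sqft"])

    for row in rows:
        values = by_location.get(row.get("location"), [])
        avg = mean(values) if len(values) >= min_listings else global_avg
        row["avg_price_per_sqft_location"] = avg
        row["undervalued"] = row["price_per_sqft"] < avg
    return rows
```

Comparing against a per-location average keeps an expensive neighborhood from hiding a relative bargain, while the global fallback avoids drawing conclusions from a location with too few listings.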

### Database Loader (`modules/database.py`)
- Writes cleaned data to the `properties` table in `data/properties.db`
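
A minimal sketch of such a load step with the standard-library `sqlite3` module is shown below. The column subset, the `UNIQUE` constraint on `url`, and the `save_properties` name are assumptions for illustration; the actual loader may create or update the table differently.

```python
import sqlite3

# Assumed subset of columns; the real table mirrors the cleaned CSV.
COLUMNS = ("title", "price", "area_sqft", "price_per_sqft",
           "location", "undervalued", "url")


def save_properties(conn, rows):
    """Insert or update rows (a list of dicts) in the properties table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS properties (
               title TEXT, price REAL, area_sqft REAL, price_per_sqft REAL,
               location TEXT, undervalued INTEGER, url TEXT UNIQUE)"""
    )
    placeholders = ", ".join(f":{name}" for name in COLUMNS)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            f"INSERT OR REPLACE INTO properties ({', '.join(COLUMNS)}) "
            f"VALUES ({placeholders})",
            rows,
        )
```

Called as `save_properties(sqlite3.connect("data/properties.db"), rows)`, re-running the load with the same listings leaves one row per `url` rather than appending duplicates.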

### Analyzer (`modules/analyzer.py`)
- Loads `properties`, runs `basic_analysis()`
- Builds the arbitrage view from `undervalued` or computed thresholds
- Saves to the `arbitrage_opportunities` table
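
The last step, deriving `arbitrage_opportunities` from the flagged rows, could look roughly like this. It is a sketch only: `save_arbitrage_opportunities` is a hypothetical name, it covers just the `undervalued` flag (not the computed-threshold path), and it assumes a `properties` table with `undervalued` and `price_per_sqft` columns already exists.

```python
import sqlite3


def save_arbitrage_opportunities(conn):
    """Rebuild arbitrage_opportunities from flagged rows; return the row count."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS arbitrage_opportunities")
        # Cheapest price per square foot first, so the best candidates lead.
        conn.execute(
            "CREATE TABLE arbitrage_opportunities AS "
            "SELECT * FROM properties WHERE undervalued = 1 "
            "ORDER BY price_per_sqft ASC"
        )
    return conn.execute(
        "SELECT COUNT(*) FROM arbitrage_opportunities"
    ).fetchone()[0]
```

Rebuilding the table on every run keeps it a pure function of the current `properties` contents, so stale opportunities do not linger after a listing is re-priced.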

## Data Schema
- `title` (str)
- `price` (float)
- `area_sqft` (float)
- `price_per_sqft` (float)
- `location` (str)
- `bedrooms` (int, optional)
- `bathrooms` (float, optional)
- `property_type` (str)
- `undervalued` (bool)
- `avg_price_per_sqft_location` (float)
- `url` (str)
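
For downstream code, the fields listed above can be captured as a typed record. The dataclass below is a sketch: `PropertyRecord` and `from_csv_row` are hypothetical names, and the parsing assumes CSV values arrive as strings with booleans written as `True`/`False` or `1`/`0`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyRecord:
    title: str
    price: float
    area_sqft: float
    price_per_sqft: float
    location: str
    property_type: str
    undervalued: bool
    avg_price_per_sqft_location: float
    url: str
    bedrooms: Optional[int] = None      # optional in the schema
    bathrooms: Optional[float] = None   # optional in the schema

    @classmethod
    def from_csv_row(cls, row: dict) -> "PropertyRecord":
        """Build a record from a csv.DictReader-style row of strings."""
        def optional(value, cast):
            return cast(value) if value not in (None, "") else None

        return cls(
            title=row["title"],
            price=float(row["price"]),
            area_sqft=float(row["area_sqft"]),
            price_per_sqft=float(row["price_per_sqft"]),
            location=row["location"],
            property_type=row["property_type"],
            undervalued=str(row["undervalued"]).strip().lower() in ("true", "1"),
            avg_price_per_sqft_location=float(row["avg_price_per_sqft_location"]),
            url=row["url"],
            # "3.0" style values are tolerated for bedroom counts.
            bedrooms=optional(row.get("bedrooms"), lambda v: int(float(v))),
            bathrooms=optional(row.get("bathrooms"), float),
        )
```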

## Database Tables
- `properties`: mirrors the cleaned CSV
- `arbitrage_opportunities`: subset flagged as undervalued

## Sample Output

Found 4 undervalued properties (arbitrage opportunities)

| Property | Price | $/SqFt | Location |
|----------|-------|--------|----------|
| Commercial condo space , 22,000 sf | $2,000,000 | $90.91 | New York |
| Prime Vacant Lot - 17,600 Sq Ft | $4,250,000 | $241.48 | Brooklyn |
| ... | ... | ... | ... |

## Notes
- Craigslist HTML changes frequently; selectors are maintained but may need updates
- Title-based sqft extraction is heuristic; enable detail fetch for precision
- Avoid aggressive scraping; respect `robots.txt` and site policies
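
One way to act on the last point is to check `robots.txt` rules and pause between requests. The sketch below uses the standard-library `urllib.robotparser`; the helper names, the 2-second default delay, and the idea of passing the rules in as text are assumptions for illustration, and nothing here reflects the content of any real site's `robots.txt`.

```python
import time
from urllib import robotparser


def build_robot_rules(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_delay(rules, user_agent, default_seconds=2.0):
    """Use the site's Crawl-delay if it declares one, else an assumed default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


def crawlable(rules, user_agent, urls, sleep=time.sleep):
    """Yield only URLs the rules allow, pausing after each one."""
    delay = polite_delay(rules, user_agent)
    for url in urls:
        if rules.can_fetch(user_agent, url):
            yield url
            sleep(delay)
```

The `sleep` parameter exists so the pause can be replaced in tests; in normal use the default `time.sleep` applies.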

## Contributing
Open to improvements; PRs welcome. Ideas: new sources, better heuristics, dashboards, alerting.

## License
For educational and research purposes.

## Author
Dikshant Neupane