# Real Estate Arbitrage Scanner

A Python tool I built to find undervalued properties automatically. It scrapes Craigslist, cleans the data, and flags listings priced below the average price per square foot.

## What it does

- Scrapes: pulls listings from Craigslist NY real estate
- Cleans: removes spam, rentals, and duplicates; calculates price/sqft
- Stores: saves everything to a SQLite database
- Analyzes: finds properties below the average price/sqft

## Quick start

    # Install dependencies
    pip install -r requirements.txt

    # Run the whole pipeline
    python main.py

That is it. Results go into the database.

## Project structure

    main.py              # Run this - executes full pipeline
    config/
        db_config.py     # Database path
    data/
        raw/             # Scraped data (unprocessed)
        cleaned/         # Cleaned data (ready for analysis)
    modules/
        scraper.py       # Scrapes Craigslist
        cleaner.py       # Cleans data, filters rentals
        database.py      # Saves to SQLite

# Real Estate Arbitrage Scanner

A professional, end-to-end ETL pipeline for discovering undervalued real estate. It scrapes listings from Craigslist NY, cleans and enriches the data, stores it in SQLite, and flags opportunities based on price-per-square-foot analysis.

## Highlights

- Automated scraping with resilient selectors (supports evolving Craigslist HTML)
- Smart cleaning: rental filtering, deduplication, normalization
- Location-aware price-per-sqft metrics and undervaluation detection
- SQLite storage for history, reproducibility, and downstream analytics
- Modular codebase with single-command pipeline run

## Repository Layout

    Arbitrage/
    │
    ├── main.py                         # Entry point – runs the complete ETL pipeline
    ├── README.md                       # Project documentation (you are here)
    ├── requirements.txt                # Python dependencies (pinned versions)
    ├── wrok_done.txt                   # Development changelog & notes
    │
    ├── config/
    │   └── db_config.py                # Database configuration (SQLite path)
    │
    ├── data/
    │   ├── raw/
    │   │   └── raw_properties.csv      # Raw scraped listings (unprocessed)
    │   └── cleaned/
    │       └── cleaned_properties.csv  # Cleaned data with computed metrics
    │
    ├── modules/                        # Core business logic
    │   ├── scraper.py                  # Web scraper (Craigslist → CSV)
    │   ├── cleaner.py                  # Data cleaning & transformation
    │   ├── database.py                 # SQLite persistence layer
    │   └── analyzer.py                 # Arbitrage detection & reporting
    │
    └── scripts/                        # Standalone runners
        ├── run_scraper.py              # Execute scraper only
        ├── run_cleaner.py              # Execute cleaner only
        └── run_db_updates.py           # Execute database update only


| Directory | Purpose |
|-----------|---------|
| `config/` | Centralized configuration (database path, future API keys) |
| `data/raw/` | Immutable scraped data – preserved for reproducibility |
| `data/cleaned/` | Transformed data ready for analysis and storage |
| `modules/` | Reusable Python modules implementing core pipeline logic |
| `scripts/` | CLI entry points for running individual pipeline stages |

## Setup

### Prerequisites
- Windows (tested), Python 3.12+

### Create venv and install
```bash
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```

## Configuration

The SQLite path is set in `config/db_config.py`:

    DB_PATH = "data/properties.db"

Change it if you need a different database location.
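
As a rough sketch of how a module might consume this setting, the snippet below opens a SQLite connection from a `DB_PATH` value. The helper name `get_connection` is hypothetical and is not taken from the repository; only `DB_PATH` and its value come from the config shown above.

```python
import sqlite3
from pathlib import Path

# Same value as in config/db_config.py, repeated so the sketch is self-contained.
DB_PATH = "data/properties.db"


def get_connection(db_path: str = DB_PATH) -> sqlite3.Connection:
    """Open the SQLite database at db_path (hypothetical helper).

    sqlite3 creates a missing database file, but not missing parent
    directories, so those are created first.
    """
    if db_path != ":memory:":
        Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(db_path)
```

Passing `":memory:"` gives a throwaway database, which is handy for trying the pipeline without touching `data/properties.db`.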

## Usage

### Run the full pipeline

    python main.py

This will:

1. Scrape → write `data/raw/raw_properties.csv`
2. Clean → write `data/cleaned/cleaned_properties.csv`
3. Load → update the `properties` table in SQLite
4. Analyze → save the `arbitrage_opportunities` table
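
The four stages above run strictly in order, each consuming the previous stage's output. A minimal sketch of that orchestration pattern follows; it is not the actual `main.py`, and `run_pipeline` plus the placeholder stage callables are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], None]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed, so a caller can see
    how far the run got. (Hypothetical helper, not from the repository.)
    """
    completed: List[str] = []
    for name, stage in stages:
        print(f"[{name}] starting")
        try:
            stage()
        except Exception as exc:  # later stages depend on earlier output
            print(f"[{name}] failed: {exc}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder callables stand in for the real scraper, cleaner,
    # database loader, and analyzer entry points.
    run_pipeline([
        ("scrape", lambda: None),
        ("clean", lambda: None),
        ("load", lambda: None),
        ("analyze", lambda: None),
    ])
```

Stopping at the first failure is a design choice here: a failed scrape would otherwise let the cleaner silently reprocess a stale CSV.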

### Run steps individually

    python scripts/run_scraper.py
    python scripts/run_cleaner.py
    python scripts/run_db_updates.py

## How It Works

### Scraper (`modules/scraper.py`)

- Targets: https://newyork.craigslist.org/search/rea
- Extracts: `title`, `price`, `area_sqft` (from title), `location`, `bedrooms`, `bathrooms`, `url`
- Options: `fetch_details=False` by default (fast); enable it to fetch per-listing pages for missing sqft
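
To illustrate what title-based `area_sqft` extraction can look like, here is a small regex sketch. The pattern and the function name `extract_sqft` are assumptions for illustration; the scraper's actual heuristic may differ.

```python
import re
from typing import Optional

# Matches a number followed by a square-foot unit, e.g. "22,000 sf",
# "17,600 Sq Ft", "1200sqft", "950 square feet".
_SQFT_RE = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(?:sq\.?\s*ft\.?|sqft|sf|square\s+feet)\b",
    re.IGNORECASE,
)


def extract_sqft(title: str) -> Optional[float]:
    """Return the first square-footage figure found in a title, or None."""
    match = _SQFT_RE.search(title)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    for title in (
        "Commercial condo space , 22,000 sf",
        "Prime Vacant Lot - 17,600 Sq Ft",
        "Cozy 2BR near park",
    ):
        print(title, "->", extract_sqft(title))
```

Titles without a recognizable unit return `None`, which is the case the `fetch_details` option is meant to cover.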

### Cleaner (`modules/cleaner.py`)

- Converts numeric fields to numeric types
- Filters out rentals: drops listings with `/reb/` URLs and listings priced under $100k
- Deduplicates by `url` and by `title` + `price`
- Categorizes property type (Commercial, Land, Multi-Family, etc.)
- Computes `price_per_sqft` and per-location averages
- Flags a listing as undervalued against its location average, falling back to the global average
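
The cleaning rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `cleaner.py`: the function name `clean` is hypothetical, "undervalued" is taken to mean strictly below the average, and the global average is used only when a location has no usable price-per-sqft values.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

MIN_SALE_PRICE = 100_000  # listings priced below this are treated as rentals


def clean(listings: List[dict]) -> List[dict]:
    """Filter, deduplicate, and annotate listings (hypothetical sketch)."""
    seen_urls, seen_title_price = set(), set()
    kept: List[dict] = []
    for row in listings:
        # Rental filter: /reb/ URLs and prices under $100k.
        if "/reb/" in row["url"] or row["price"] < MIN_SALE_PRICE:
            continue
        # Deduplicate by url and by (title, price).
        key = (row["title"], row["price"])
        if row["url"] in seen_urls or key in seen_title_price:
            continue
        seen_urls.add(row["url"])
        seen_title_price.add(key)
        kept.append(dict(row))

    # price_per_sqft is only defined when the area is known and non-zero.
    for row in kept:
        area = row.get("area_sqft")
        row["price_per_sqft"] = row["price"] / area if area else None

    by_location: Dict[str, List[float]] = defaultdict(list)
    for row in kept:
        if row["price_per_sqft"] is not None:
            by_location[row["location"]].append(row["price_per_sqft"])
    all_values = [v for values in by_location.values() for v in values]
    global_avg = mean(all_values) if all_values else None

    for row in kept:
        values = by_location.get(row["location"])
        # Assumed fallback: global average when the location has no data.
        avg = mean(values) if values else global_avg
        row["avg_price_per_sqft_location"] = avg
        pps = row["price_per_sqft"]
        row["undervalued"] = pps is not None and avg is not None and pps < avg
    return kept
```

Note that a location with a single listing can never be flagged under this rule, since the listing equals its own average.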

### Database Loader (`modules/database.py`)

- Writes cleaned data to the `properties` table in `data/properties.db`
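
A minimal sketch of the load step using Python's built-in `sqlite3` module is shown below. The function name `save_properties`, the subset of columns, and the drop-and-recreate strategy are all assumptions; the real loader may append or upsert instead.

```python
import sqlite3
from typing import Iterable

# Illustrative subset of the cleaned columns.
COLUMNS = (
    "title", "price", "area_sqft", "price_per_sqft",
    "location", "undervalued", "url",
)


def save_properties(conn: sqlite3.Connection, rows: Iterable[dict]) -> int:
    """Replace the properties table with rows and return the row count."""
    conn.execute("DROP TABLE IF EXISTS properties")
    conn.execute(
        "CREATE TABLE properties ("
        "title TEXT, price REAL, area_sqft REAL, price_per_sqft REAL, "
        "location TEXT, undervalued INTEGER, url TEXT)"
    )
    placeholders = ", ".join("?" for _ in COLUMNS)
    conn.executemany(
        f"INSERT INTO properties ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        # Missing keys become NULL via dict.get.
        [tuple(row.get(col) for col in COLUMNS) for row in rows],
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM properties").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    count = save_properties(connection, [
        {"title": "Lot", "price": 4_250_000, "area_sqft": 17_600,
         "price_per_sqft": 241.48, "location": "Brooklyn",
         "undervalued": True, "url": "https://example.org/listing/1"},
    ])
    print(count, "row(s) written")
```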

### Analyzer (`modules/analyzer.py`)

- Loads the `properties` table and runs `basic_analysis()`
- Builds the arbitrage view from the `undervalued` flag or from computed thresholds
- Saves the result to the `arbitrage_opportunities` table
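
One way to derive the opportunities table is a single SQL statement over `properties`. The sketch below covers only the `undervalued`-flag path, not the computed-threshold path, and the function name `save_opportunities` is hypothetical.

```python
import sqlite3


def save_opportunities(conn: sqlite3.Connection) -> int:
    """Rebuild arbitrage_opportunities from flagged rows; return the count.

    Assumes a properties table with an integer `undervalued` column and a
    `price_per_sqft` column already exists.
    """
    conn.execute("DROP TABLE IF EXISTS arbitrage_opportunities")
    conn.execute(
        "CREATE TABLE arbitrage_opportunities AS "
        "SELECT * FROM properties WHERE undervalued = 1 "
        "ORDER BY price_per_sqft"
    )
    conn.commit()
    query = "SELECT COUNT(*) FROM arbitrage_opportunities"
    return conn.execute(query).fetchone()[0]
```

Rebuilding the table on every run keeps it consistent with the latest `properties` snapshot, at the cost of losing any history of earlier opportunities.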

## Data Model

### Cleaned CSV Columns

- `title` (str)
- `price` (float)
- `area_sqft` (float)
- `price_per_sqft` (float)
- `location` (str)
- `bedrooms` (int, optional)
- `bathrooms` (float, optional)
- `property_type` (str)
- `undervalued` (bool)
- `avg_price_per_sqft_location` (float)
- `url` (str)
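
For readers who prefer typed records over raw CSV rows, the column list above maps naturally onto a dataclass. This is only a sketch: the class name `CleanedProperty`, the `from_csv_row` helper, and the assumption that booleans are stored as `True`/`False` text are not taken from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CleanedProperty:
    """One row of the cleaned CSV (hypothetical record type)."""
    title: str
    price: float
    area_sqft: float
    price_per_sqft: float
    location: str
    property_type: str
    undervalued: bool
    avg_price_per_sqft_location: float
    url: str
    bedrooms: Optional[int] = None
    bathrooms: Optional[float] = None

    @classmethod
    def from_csv_row(cls, row: dict) -> "CleanedProperty":
        """Build a record from a csv.DictReader row of strings."""
        def optional(value, cast):
            # Empty cells mean the field was not available.
            return cast(value) if value not in (None, "") else None

        return cls(
            title=row["title"],
            price=float(row["price"]),
            area_sqft=float(row["area_sqft"]),
            price_per_sqft=float(row["price_per_sqft"]),
            location=row["location"],
            property_type=row["property_type"],
            # Assumes booleans were written as "True"/"False" (or 1/0).
            undervalued=row["undervalued"].strip().lower() in ("true", "1"),
            avg_price_per_sqft_location=float(row["avg_price_per_sqft_location"]),
            url=row["url"],
            bedrooms=optional(row.get("bedrooms"), lambda v: int(float(v))),
            bathrooms=optional(row.get("bathrooms"), float),
        )
```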

### Database Tables

- `properties`: mirrors the cleaned CSV
- `arbitrage_opportunities`: the subset flagged as undervalued

## Example Output (Analyzer)

    Found 4 undervalued properties (arbitrage opportunities)

    Property                                 Price       $/SqFt   Location
    Commercial condo space , 22,000 sf       $2,000,000  $90.91    New York
    Prime Vacant Lot - 17,600 Sq Ft          $4,250,000  $241.48   Brooklyn
    ...                                      ...         ...       ...
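
The `$/SqFt` column is simply price divided by area. The short sketch below recomputes it for the two rows shown, using the areas stated in the listing titles; `format_row` is a hypothetical helper and its column widths are not meant to match the analyzer's exact layout.

```python
# (title, price in dollars, area in sqft taken from the title, location)
rows = [
    ("Commercial condo space , 22,000 sf", 2_000_000, 22_000, "New York"),
    ("Prime Vacant Lot - 17,600 Sq Ft", 4_250_000, 17_600, "Brooklyn"),
]


def format_row(title: str, price: int, area_sqft: int, location: str) -> str:
    """Format one report line with price per square foot to two decimals."""
    price_per_sqft = price / area_sqft
    return f"{title:<40} ${price:<11,} ${price_per_sqft:<8.2f} {location}"


if __name__ == "__main__":
    for row in rows:
        print(format_row(*row))
```

The computed values round to 90.91 and 241.48, matching the table above.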

## Notes & Limitations

- Craigslist HTML changes frequently; selectors are maintained but may need updates
- Title-based sqft extraction is a heuristic; enable detail fetching for better precision
- Avoid aggressive scraping; respect robots.txt and site policies
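
As one possible way to act on the last point, the sketch below filters URLs through `robots.txt` rules and spaces requests out with a delay, using only the standard library. The helper names, the user-agent string, and the two-second delay are assumptions, and the sample rules in the usage block are made up rather than Craigslist's real policy.

```python
import time
from typing import Callable, Iterable, Iterator
from urllib import robotparser


def build_robot_rules(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed_urls(
    rules: robotparser.RobotFileParser,
    urls: Iterable[str],
    user_agent: str = "arbitrage-scanner",  # hypothetical agent name
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield only permitted URLs, pausing between consecutive ones."""
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue
        if not first:
            sleep(delay_seconds)
        first = False
        yield url


if __name__ == "__main__":
    sample_rules = build_robot_rules("User-agent: *\nDisallow: /private/\n")
    candidates = [
        "https://example.org/search/rea",
        "https://example.org/private/page",
    ]
    for url in allowed_urls(sample_rules, candidates, delay_seconds=0.1):
        print("would fetch", url)
```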

## Contributing

Open to improvements; PRs welcome. Ideas: new sources, better heuristics, dashboards, alerting.

## License

For educational and research purposes.

## Author

Dikshant Neupane
