Real Estate Arbitrage Scanner

A Python tool I built to find undervalued properties automatically. It scrapes Craigslist, cleans the data, and flags properties that are priced below market value.

What it does

  1. Scrapes - Pulls listings from Craigslist NY real estate
  2. Cleans - Removes spam, rentals, duplicates. Calculates price/sqft
  3. Stores - Saves everything to a SQLite database
  4. Analyzes - Finds properties below average price/sqft

Quick start

# Install dependencies
pip install -r requirements.txt

# Run the whole pipeline
python main.py

That's it. Results are written to the SQLite database (data/properties.db by default).

Project structure

 main.py              # Run this - executes full pipeline
 config/
    db_config.py     # Database path
 data/
    raw/             # Scraped data (unprocessed)
    cleaned/         # Cleaned data (ready for analysis)
 modules/
    scraper.py       # Scrapes Craigslist
    cleaner.py       # Cleans data, filters rentals
    database.py      # Saves to SQLite

An end-to-end ETL pipeline for discovering undervalued real estate. It scrapes listings from Craigslist NY, cleans and enriches the data, stores it in SQLite, and flags opportunities based on price-per-square-foot analysis.

## Highlights

- Automated scraping with resilient selectors (supports evolving Craigslist HTML)
- Smart cleaning: rental filtering, deduplication, normalization
- Location-aware price-per-sqft metrics and undervaluation detection
- SQLite storage for history, reproducibility, and downstream analytics
- Modular codebase with single-command pipeline run

## Repository Layout

    Arbitrage/
    │
    ├── main.py                  # Entry point – runs the complete ETL pipeline
    ├── README.md                # Project documentation (you are here)
    ├── requirements.txt         # Python dependencies (pinned versions)
    ├── wrok_done.txt            # Development changelog & notes
    │
    ├── config/
    │   └── db_config.py         # Database configuration (SQLite path)
    │
    ├── data/
    │   ├── raw/
    │   │   └── raw_properties.csv       # Raw scraped listings (unprocessed)
    │   └── cleaned/
    │       └── cleaned_properties.csv   # Cleaned data with computed metrics
    │
    ├── modules/                 # Core business logic
    │   ├── scraper.py           # Web scraper (Craigslist → CSV)
    │   ├── cleaner.py           # Data cleaning & transformation
    │   ├── database.py          # SQLite persistence layer
    │   └── analyzer.py          # Arbitrage detection & reporting
    │
    └── scripts/                 # Standalone runners
        ├── run_scraper.py       # Execute scraper only
        ├── run_cleaner.py       # Execute cleaner only
        └── run_db_updates.py    # Execute database update only

| Directory | Purpose |
|-----------|---------|
| `config/` | Centralized configuration (database path, future API keys) |
| `data/raw/` | Immutable scraped data – preserved for reproducibility |
| `data/cleaned/` | Transformed data ready for analysis and storage |
| `modules/` | Reusable Python modules implementing core pipeline logic |
| `scripts/` | CLI entry points for running individual pipeline stages |

## Setup

### Prerequisites
- Windows (tested), Python 3.12+

### Create venv and install
```bash
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
```

Configuration

SQLite path is set in config/db_config.py:

DB_PATH = "data/properties.db"

Change this value if you need a different database location.
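As a rough illustration of how a module might consume this setting, the sketch below opens the database and creates the parent folder first. The `connect` helper is hypothetical and not taken from the repository; only the `DB_PATH` value comes from the config shown above.

```python
import sqlite3
from pathlib import Path

# Same value as config/db_config.py; repeated here so the sketch is self-contained.
DB_PATH = "data/properties.db"


def connect(db_path: str = DB_PATH) -> sqlite3.Connection:
    """Open the SQLite file, creating its parent folder if it is missing."""
    path = Path(db_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return sqlite3.connect(path)
```

Calling `connect()` with no argument would use the configured path; passing another path is handy for tests or a scratch database.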

Usage

Run the full pipeline

python main.py

This will:

  1. Scrape → write data/raw/raw_properties.csv
  2. Clean → write data/cleaned/cleaned_properties.csv
  3. Load → update properties table in SQLite
  4. Analyze → save arbitrage_opportunities table

Run steps individually

python scripts/run_scraper.py
python scripts/run_cleaner.py
python scripts/run_db_updates.py

How It Works

Scraper (modules/scraper.py)

  • Targets: https://newyork.craigslist.org/search/rea
  • Extracts: title, price, area_sqft (from title), location, bedrooms, bathrooms, url
  • Options: fetch_details=False by default (fast); set it to True to fetch each listing page and fill in missing sqft

Cleaner (modules/cleaner.py)

  • Converts numeric fields
  • Filters rentals: /reb/ URLs and prices < $100k
  • Deduplicates by url and by title+price
  • Categorizes property type (Commercial, Land, Multi-Family, etc.)
  • Computes price_per_sqft and location averages
  • Flags undervalued listings using the location average; falls back to the global average
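The cleaning rules above can be sketched in plain Python. This is not the code in modules/cleaner.py; the `clean` function and `MIN_SALE_PRICE` constant are hypothetical names, and two readings are assumptions: a listing is dropped if it matches either rental condition, and the global average is used when a listing has no location.

```python
from collections import defaultdict
from statistics import mean

MIN_SALE_PRICE = 100_000  # listings priced below this are treated as rentals


def clean(listings):
    """Filter, deduplicate, and flag a list of listing dicts."""
    seen_urls, seen_title_price, kept = set(), set(), []
    for row in listings:
        # Assumed reading of the rental rule: either condition drops the row.
        if "/reb/" in row["url"] or row["price"] < MIN_SALE_PRICE:
            continue
        key = (row["title"].strip().lower(), row["price"])
        if row["url"] in seen_urls or key in seen_title_price:
            continue
        seen_urls.add(row["url"])
        seen_title_price.add(key)
        row = dict(row)  # copy so the caller's data is left untouched
        area = row.get("area_sqft")
        row["price_per_sqft"] = round(row["price"] / area, 2) if area else None
        kept.append(row)

    # Average price per sqft for each location, plus a global fallback.
    by_location = defaultdict(list)
    for row in kept:
        if row["price_per_sqft"] is not None and row.get("location"):
            by_location[row["location"]].append(row["price_per_sqft"])
    priced = [r["price_per_sqft"] for r in kept if r["price_per_sqft"] is not None]
    global_avg = mean(priced) if priced else None

    for row in kept:
        values = by_location.get(row.get("location"))
        avg = mean(values) if values else global_avg
        row["avg_price_per_sqft_location"] = avg
        ppsf = row["price_per_sqft"]
        row["undervalued"] = ppsf is not None and avg is not None and ppsf < avg
    return kept
```

Under this rule, a location with a single listing can never be flagged, because that listing's price per sqft equals its own location average.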

Database Loader (modules/database.py)

  • Loads properties, runs basic_analysis()
  • Builds arbitrage view from undervalued or computed thresholds
  • Saves to arbitrage_opportunities table
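A minimal sketch of the load-then-flag flow, using only the standard sqlite3 module. The function name `load_and_flag` and the reduced column set are assumptions for illustration; the repository's `basic_analysis()` is not reproduced here because the README does not describe what it computes.

```python
import sqlite3


def load_and_flag(conn, rows):
    """Rebuild both tables from a list of cleaned listing dicts."""
    conn.execute("DROP TABLE IF EXISTS properties")
    conn.execute(
        "CREATE TABLE properties ("
        "title TEXT, price REAL, area_sqft REAL, price_per_sqft REAL, "
        "location TEXT, undervalued INTEGER, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO properties VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (r["title"], r["price"], r["area_sqft"], r["price_per_sqft"],
             r["location"], int(r["undervalued"]), r["url"])
            for r in rows
        ],
    )
    # The opportunities table is the subset flagged as undervalued,
    # cheapest price per sqft first.
    conn.execute("DROP TABLE IF EXISTS arbitrage_opportunities")
    conn.execute(
        "CREATE TABLE arbitrage_opportunities AS "
        "SELECT * FROM properties WHERE undervalued = 1 "
        "ORDER BY price_per_sqft"
    )
    conn.commit()
```

Passing `sqlite3.connect(":memory:")` as `conn` gives a throwaway database for experimenting without touching data/properties.db.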

Data Model

Cleaned CSV Columns

  • title (str)
  • price (float)
  • area_sqft (float)
  • price_per_sqft (float)
  • location (str)
  • bedrooms (int, optional)
  • bathrooms (float, optional)
  • property_type (str)
  • undervalued (bool)
  • avg_price_per_sqft_location (float)
  • url (str)
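For downstream scripts, the column list above maps naturally onto a typed record. The sketch below is one way to read the cleaned CSV with the standard library; the `Listing` class and helper names are hypothetical, and it assumes booleans are written as `True`/`False` and that optional fields are left empty when missing.

```python
import csv
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Listing:
    title: str
    price: float
    area_sqft: float
    price_per_sqft: float
    location: str
    property_type: str
    undervalued: bool
    avg_price_per_sqft_location: float
    url: str
    bedrooms: Optional[int] = None
    bathrooms: Optional[float] = None


def _optional(value, cast):
    """Apply cast unless the CSV cell is missing or empty."""
    return cast(value) if value not in (None, "") else None


def parse_row(row: dict) -> Listing:
    return Listing(
        title=row["title"],
        price=float(row["price"]),
        area_sqft=float(row["area_sqft"]),
        price_per_sqft=float(row["price_per_sqft"]),
        location=row["location"],
        property_type=row["property_type"],
        undervalued=row["undervalued"].strip().lower() == "true",
        avg_price_per_sqft_location=float(row["avg_price_per_sqft_location"]),
        url=row["url"],
        # int(float(...)) tolerates values such as "3.0"
        bedrooms=_optional(row.get("bedrooms"), lambda v: int(float(v))),
        bathrooms=_optional(row.get("bathrooms"), float),
    )


def read_cleaned(path: str) -> List[Listing]:
    """Read every row of a cleaned CSV into Listing records."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [parse_row(row) for row in csv.DictReader(handle)]
```

With the layout described in this README, the call would be `read_cleaned("data/cleaned/cleaned_properties.csv")`.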

Database Tables

  • properties: mirrors cleaned CSV
  • arbitrage_opportunities: subset flagged as undervalued

Example Output (Analyzer)

Found 4 undervalued properties (arbitrage opportunities)

Property                                 Price       $/SqFt   Location
Commercial condo space , 22,000 sf       $2,000,000  $90.91    New York
Prime Vacant Lot - 17,600 Sq Ft          $4,250,000  $241.48   Brooklyn
...                                      ...         ...       ...
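The two sample titles show the kind of formats a title-based sqft extractor has to handle ("22,000 sf", "17,600 Sq Ft"). The sketch below is one possible heuristic, not the pattern used in modules/scraper.py; `SQFT_PATTERN`, `extract_sqft`, and `price_per_sqft` are hypothetical names.

```python
import re
from typing import Optional

# Matches a number followed by a square-foot unit, e.g. "22,000 sf",
# "17,600 Sq Ft", "1500 sqft", "900 square feet" (case-insensitive).
SQFT_PATTERN = re.compile(
    r"(\d{1,3}(?:,\d{3})+|\d+)\s*(?:sq\.?\s*ft\.?|sqft|sf|square\s+feet)\b",
    re.IGNORECASE,
)


def extract_sqft(title: str) -> Optional[float]:
    """Return the first square-foot figure found in a title, or None."""
    match = SQFT_PATTERN.search(title)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def price_per_sqft(price: float, sqft: float) -> float:
    """Price divided by area, rounded to cents."""
    return round(price / sqft, 2)


area = extract_sqft("Commercial condo space , 22,000 sf")
print(area, price_per_sqft(2_000_000, area))  # 22000.0 90.91
```

Applied to the second row, `extract_sqft("Prime Vacant Lot - 17,600 Sq Ft")` gives 17600.0 and the price per sqft comes out at 241.48, matching the table. Titles without a recognizable unit return None.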

Notes & Limitations

  • Craigslist HTML changes frequently; selectors are maintained but may need updates
  • Title-based sqft extraction is heuristic; enable detail fetch for precision
  • Avoid aggressive scraping; respect robots.txt and site policies
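One way to act on the last point is to check robots.txt rules and pause between requests. The sketch below uses the standard library's `urllib.robotparser`; the helper names, the user agent string, and the two-second delay are assumptions for illustration, not values from the repository.

```python
import time
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(urls, parser, user_agent="arbitrage-scanner",
                delay_seconds=2.0, sleep=time.sleep):
    """Yield only URLs the rules allow, pausing between consecutive ones."""
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, skip it
        if not first:
            sleep(delay_seconds)
        first = False
        yield url
```

The `sleep` parameter is injectable so the throttling can be tested without actually waiting. The scraper would request each yielded URL in turn.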

Contributing

Open to improvements; PRs are welcome. Ideas: new sources, better heuristics, dashboards, alerting.

License

For educational and research purposes.

Author

Dikshant Neupane
