Jobs.ie Dublin Scraper

A Python-based web scraper for collecting job listings from jobs.ie in the Dublin area. Features automatic deduplication, resume capability, and SQLite database storage.

Features

✓ SQLite Database - All data stored in a single file
✓ Automatic Deduplication - Never scrape the same job twice
✓ Resume Support - Continue from where you left off after interruptions
✓ Retry Failed Jobs - Re-attempt failed scrapes
✓ Status Tracking - Track pending/success/failed status for each job
✓ Incremental Updates - Run daily to collect only new job listings
✓ Low Server Load - Minimal requests with rate limiting
✓ Easy Querying - Use SQL to filter and analyze data
✓ CSV/JSON Export - Export database contents anytime
✓ Minimal Dependencies - Only requests, BeautifulSoup, and lxml required

Installation

1. Clone or Download

Create a project directory and download the source files.

2. Install Dependencies

pip install -r requirements.txt

requirements.txt:

requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0

Note: sqlite3 is part of the Python 3 standard library, so it needs no separate install.

3. Create Directory Structure

jobsparser/
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── scraper.py
│   ├── navigator.py
│   ├── collector.py
│   ├── detail_scraper.py
│   ├── parser.py
│   ├── database.py
│   ├── config.py
│   └── utils/
│       ├── __init__.py
│       ├── request_utils.py
│       ├── validation.py
│       └── logging_utils.py
├── data/                    # Created automatically
├── logs/                    # Created automatically
└── requirements.txt

Quick Start

Basic Usage

from src.scraper import JobScraper
from src.config import Config

# Initialize scraper
scraper = JobScraper(
    base_url=Config.BASE_URL,
    db_path=Config.DATABASE_PATH
)

# Run full scraping pipeline
scraper.run()

# Get statistics
stats = scraper.get_scraping_stats()
print(f"Total jobs: {stats['total']}")
print(f"Successfully scraped: {stats['success']}")
print(f"Pending: {stats['pending']}")
print(f"Failed: {stats['failed']}")

# Optional: Export to CSV
scraper.export_to_csv('jobs_export.csv')

Command Line

cd jobsparser
python -m src.main

Common Operations

1. Initial Scrape

First time running - collects all jobs:

scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
scraper.run()
# Finds 500 jobs, scrapes all 500

2. Update Scrape (Daily Updates)

Run again to get only new jobs:

scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
scraper.run()
# Finds 520 jobs (500 existing + 20 new), scrapes only 20 new

3. Resume Interrupted Scrape

If scraping was interrupted, resume from where it left off:

scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
unscraped_ids = scraper.db.get_unscraped_jobs()
print(f"Resuming {len(unscraped_ids)} pending jobs...")
scraper.scrape_job_details(unscraped_ids)

4. Retry Failed Jobs

Re-attempt jobs that failed during scraping:

scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
failed_ids = scraper.db.get_failed_jobs()
print(f"Retrying {len(failed_ids)} failed jobs...")
scraper.scrape_job_details(failed_ids)

5. Query Database

Access job data directly:

from src.database import DatabaseManager

db = DatabaseManager('./data/jobs.db')

# Get all successful jobs
all_jobs = db.get_all_jobs(status='success')

# Get statistics
stats = db.get_stats()
print(f"Total: {stats['total']}, Success: {stats['success']}")

# Check if specific job exists
if db.job_exists('12345'):
    job = db.get_job_by_id('12345')
    print(job['title'], job['company'])

6. Export Data

from src.database import DatabaseManager
from src.scraper import JobScraper
from src.config import Config

db = DatabaseManager('./data/jobs.db')

# Export to CSV
db.export_to_csv('jobs_export.csv')

# Or use the scraper's export method
scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
scraper.export_to_csv('jobs_export.csv')
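
The feature list also mentions JSON export, but this README shows no method for it. The following is only a sketch using the standard library, assuming the jobs table described under Database Schema. The name export_to_json and its parameters are hypothetical and not part of the project's API:

```python
import json
import sqlite3


def export_to_json(db_path, out_path, status="success"):
    """Hypothetical helper: write jobs with the given scrape_status to a JSON file."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # lets each row be converted to a dict by column name
    try:
        rows = conn.execute(
            "SELECT * FROM jobs WHERE scrape_status = ?", (status,)
        ).fetchall()
    finally:
        conn.close()

    jobs = [dict(row) for row in rows]
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(jobs, fh, ensure_ascii=False, indent=2)
    return len(jobs)  # number of jobs written
```

A call such as export_to_json('./data/jobs.db', 'jobs_export.json') would write every successfully scraped job as one JSON object per row.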

Configuration

Edit src/config.py to customize:

class Config:
    # Target URL
    BASE_URL = "https://www.jobs.ie/jobs/in-dublin"

    # Rate limiting (seconds between requests)
    RATE_LIMIT_DELAY = 2

    # Database location
    DATABASE_PATH = './data/jobs.db'

    # CSS Selectors (update after inspecting website)
    SELECTORS = {
        'job_listing': '.job-listing-item',
        'job_link': 'a.job-link',
        'title': 'h1.job-title',
        'company': '.company-name',
        # ... etc
    }

Important: You must inspect the jobs.ie website and update the CSS selectors in Config.SELECTORS to match the actual HTML structure.

How Deduplication Works

  1. First Run: Database is empty, all jobs are new

    • Collects job IDs from listing pages
    • Inserts all IDs with status='pending'
    • Scrapes each job and updates status to 'success' or 'failed'
  2. Subsequent Runs: Database has existing jobs

    • Collects job IDs from listing pages
    • Queries database for existing IDs
    • Only scrapes: new_jobs = all_jobs - existing_jobs
    • Minimal load on jobs.ie server
  3. Resume After Crash: No need to re-collect IDs

    • Queries database for status='pending' jobs
    • Continues scraping only unfinished jobs
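
The set difference in step 2 can be sketched with the built-in sqlite3 module. This is an illustration under assumptions, not the project's actual code: find_new_job_ids and register_pending are hypothetical names, and the table is reduced to the two columns the logic needs.

```python
import sqlite3


def find_new_job_ids(conn, collected_ids):
    """Return the collected IDs that are not yet in the jobs table."""
    existing = {row[0] for row in conn.execute("SELECT job_id FROM jobs")}
    return set(collected_ids) - existing  # new_jobs = all_jobs - existing_jobs


def register_pending(conn, job_ids):
    """Insert new IDs as 'pending'; IDs already present are left untouched."""
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (job_id, scrape_status) VALUES (?, 'pending')",
        [(job_id,) for job_id in job_ids],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, scrape_status TEXT)")

# First run: the database is empty, so every collected ID is new.
first_run = find_new_job_ids(conn, ["101", "102", "103"])
register_pending(conn, first_run)

# Later run: only the ID that was not seen before needs scraping.
second_run = find_new_job_ids(conn, ["101", "102", "103", "104"])
print(sorted(second_run))  # ['104']
```

Because job_id is the primary key, INSERT OR IGNORE also guards against inserting the same ID twice if the listing pages repeat a job.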

Database Schema

Jobs are stored in SQLite with the following structure:

CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    title TEXT,
    company TEXT,
    location TEXT,
    salary TEXT,
    job_type TEXT,
    posted_date TEXT,
    description TEXT,
    requirements TEXT,      -- JSON array
    benefits TEXT,          -- JSON array
    url TEXT,
    scrape_status TEXT,     -- pending, success, failed
    error_message TEXT,
    scraped_at TIMESTAMP,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

Status Values:

  • pending - Job ID collected, details not yet scraped
  • success - Successfully scraped with full details
  • failed - Scraping failed (network/parsing error)
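
How a job might move between these status values is sketched below. It assumes the schema above (trimmed to the relevant columns); record_result is a hypothetical helper, and setting scraped_at only on success is an assumption about the intended semantics rather than documented behaviour.

```python
import sqlite3
from datetime import datetime, timezone


def record_result(conn, job_id, error=None):
    """Move a job out of 'pending': 'success' by default, 'failed' if error is given."""
    status = "success" if error is None else "failed"
    now = datetime.now(timezone.utc).isoformat()
    scraped_at = now if error is None else None  # assumption: only set on success
    conn.execute(
        "UPDATE jobs SET scrape_status = ?, error_message = ?, "
        "scraped_at = COALESCE(?, scraped_at), updated_at = ? WHERE job_id = ?",
        (status, error, scraped_at, now, job_id),
    )
    conn.commit()
    return status


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, scrape_status TEXT, "
    "error_message TEXT, scraped_at TIMESTAMP, updated_at TIMESTAMP)"
)
conn.executemany(
    "INSERT INTO jobs (job_id, scrape_status) VALUES (?, 'pending')",
    [("101",), ("102",), ("103",)],
)

record_result(conn, "101")                    # scraped fine
record_result(conn, "102", error="HTTP 500")  # failed, error text kept for debugging
# Job 103 is never touched, so it stays 'pending' and is picked up on resume.
```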

Viewing Database

Use any SQLite browser or the sqlite3 command-line shell:

# Open the database from the command line
sqlite3 data/jobs.db

-- Query examples (run at the sqlite> prompt)
SELECT COUNT(*) FROM jobs;
SELECT COUNT(*) FROM jobs WHERE scrape_status = 'success';
SELECT title, company FROM jobs WHERE company LIKE '%Google%';
SELECT * FROM jobs WHERE scrape_status = 'failed';
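
The same kind of query can be run from Python with the built-in sqlite3 module. A minimal sketch, assuming the schema above; status_counts is a hypothetical helper name, and the in-memory table stands in for data/jobs.db:

```python
import sqlite3


def status_counts(conn):
    """Return a {scrape_status: count} mapping for the jobs table."""
    rows = conn.execute(
        "SELECT scrape_status, COUNT(*) FROM jobs GROUP BY scrape_status"
    )
    return dict(rows.fetchall())


# Stand-in data; in practice connect to the real file, e.g. sqlite3.connect("data/jobs.db").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, scrape_status TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [("1", "success"), ("2", "success"), ("3", "failed")],
)

counts = status_counts(conn)
# Statuses with no rows are absent from the mapping, so default them to 0.
print(counts.get("success", 0), counts.get("pending", 0), counts.get("failed", 0))  # 2 0 1
```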

Recommended GUI Tools:

Workflow Examples

Daily Job Collection

Run this script daily to keep database updated:

# daily_update.py
from src.scraper import JobScraper
from src.config import Config

def daily_update():
    scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)

    # Remember how many jobs were already scraped so the new ones can be counted
    scraped_before = scraper.get_scraping_stats()['success']

    print("Starting daily job update...")
    scraper.run()

    stats = scraper.get_scraping_stats()
    print("Update complete!")
    print(f"Total jobs in DB: {stats['total']}")
    print(f"New jobs scraped: {stats['success'] - scraped_before}")

    # Export updated data
    scraper.export_to_csv('daily_jobs_export.csv')

if __name__ == "__main__":
    daily_update()

Maintenance Script

# maintenance.py
from src.database import DatabaseManager
from src.scraper import JobScraper
from src.config import Config

def retry_failed():
    """Retry all failed jobs"""
    scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
    failed = scraper.db.get_failed_jobs()

    if failed:
        print(f"Retrying {len(failed)} failed jobs...")
        scraper.scrape_job_details(failed)
    else:
        print("No failed jobs to retry")

def show_stats():
    """Display database statistics"""
    db = DatabaseManager(Config.DATABASE_PATH)
    stats = db.get_stats()

    print("\nDatabase Statistics:")
    print(f"  Total jobs: {stats['total']}")
    print(f"  Successful: {stats['success']}")
    print(f"  Pending: {stats['pending']}")
    print(f"  Failed: {stats['failed']}")

if __name__ == "__main__":
    show_stats()
    retry_failed()

Rate Limiting & Best Practices

The scraper includes built-in rate limiting to avoid overloading jobs.ie:

  • Default: 2 seconds between requests
  • Configurable via Config.RATE_LIMIT_DELAY
  • Uses proper User-Agent headers
  • Implements exponential backoff on errors
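
Exponential backoff, as listed above, can be sketched as follows. This is an illustration rather than the project's request_utils code: fetch_with_backoff and its parameters are hypothetical, and the fetch function is passed in so the example runs without network access.

```python
import time


def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ... with the defaults


# Demo with a stand-in fetch that fails twice, then succeeds.
attempts = []


def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


delays = []  # record the waits instead of actually sleeping
page = fetch_with_backoff(flaky_fetch, "https://example.invalid/job/1", sleep=delays.append)
print(delays)  # [2.0, 4.0]
```

In real use the fetch argument could wrap requests.get with a timeout, and the except clause would normally be narrowed to network errors instead of catching every exception.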

Recommendations:

  • Don't reduce Config.RATE_LIMIT_DELAY below 1 second
  • Run updates once daily, not continuously
  • Monitor for HTTP 429 (Too Many Requests) errors
  • Use database queries instead of re-scraping for analysis

Troubleshooting

No jobs found

  • Issue: CSS selectors may be outdated
  • Solution: Inspect jobs.ie HTML and update Config.SELECTORS

Jobs marked as failed

  • Issue: Network errors or HTML structure changed
  • Solution: Check error_message in database, update selectors if needed

Database locked

  • Issue: Multiple processes accessing database
  • Solution: Only run one scraper instance at a time

Missing data fields

  • Issue: Selectors don't match HTML structure
  • Solution: Inspect job detail pages, update extraction selectors

File Structure

jobsparser/
├── src/
│   ├── scraper.py          # Main orchestrator
│   ├── navigator.py        # Pagination handler
│   ├── collector.py        # Job ID collector
│   ├── detail_scraper.py   # Job detail extractor
│   ├── parser.py           # HTML parsing utilities
│   ├── database.py         # SQLite manager
│   ├── config.py           # Configuration
│   └── utils/              # Utility functions
├── data/
│   ├── jobs.db             # SQLite database (auto-created)
│   └── jobs_export.csv     # CSV exports
├── logs/
│   └── scraper.log         # Application logs
└── requirements.txt

Support

For issues, questions, or contributions:

License

This scraper is for educational and personal use. Please review jobs.ie's Terms of Service and robots.txt before use.


Created: 2026-01-22
Target: https://www.jobs.ie/jobs/in-dublin
Python: 3.8+
Database: SQLite 3
