A Python-based web scraper for collecting job listings from jobs.ie in the Dublin area. Features automatic deduplication, resume capability, and SQLite database storage.
✓ SQLite Database - All data stored in a single file
✓ Automatic Deduplication - Never scrape the same job twice
✓ Resume Support - Continue from where you left off after interruptions
✓ Retry Failed Jobs - Re-attempt failed scrapes
✓ Status Tracking - Track pending/success/failed status for each job
✓ Incremental Updates - Run daily to collect only new job listings
✓ Low Server Load - Minimal requests with rate limiting
✓ Easy Querying - Use SQL to filter and analyze data
✓ CSV/JSON Export - Export database contents anytime
✓ Minimal Dependencies - Only requests, BeautifulSoup, and lxml required
Create a project directory and download the source files.
pip install -r requirements.txt
requirements.txt:
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
Note: sqlite3 ships with Python's standard library, so it is not listed in requirements.txt.
jobsparser/
├── src/
│   ├── __init__.py
│   ├── main.py
│   ├── scraper.py
│   ├── navigator.py
│   ├── collector.py
│   ├── detail_scraper.py
│   ├── parser.py
│   ├── database.py
│   ├── config.py
│   └── utils/
│       ├── __init__.py
│       ├── request_utils.py
│       ├── validation.py
│       └── logging_utils.py
├── data/                # Created automatically
├── logs/                # Created automatically
└── requirements.txt
from src.scraper import JobScraper
from src.config import Config
# Initialize scraper
scraper = JobScraper(
    base_url=Config.BASE_URL,
    db_path=Config.DATABASE_PATH
)
# Run full scraping pipeline
scraper.run()
# Get statistics
stats = scraper.get_scraping_stats()
print(f"Total jobs: {stats['total']}")
print(f"Successfully scraped: {stats['success']}")
print(f"Pending: {stats['pending']}")
print(f"Failed: {stats['failed']}")
# Optional: Export to CSV
scraper.export_to_csv('jobs_export.csv')
Or run the scraper from the command line:
cd jobsparser
python -m src.main
First run: collects all jobs:
scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
scraper.run()
# Finds 500 jobs, scrapes all 500
Run again to get only new jobs:
scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
scraper.run()
# Finds 520 jobs (500 existing + 20 new), scrapes only 20 new
If scraping was interrupted, resume from where it left off:
scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
unscraped_ids = scraper.db.get_unscraped_jobs()
print(f"Resuming {len(unscraped_ids)} pending jobs...")
scraper.scrape_job_details(unscraped_ids)
Re-attempt jobs that failed during scraping:
scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
failed_ids = scraper.db.get_failed_jobs()
print(f"Retrying {len(failed_ids)} failed jobs...")
scraper.scrape_job_details(failed_ids)
Access job data directly:
from src.database import DatabaseManager
db = DatabaseManager('./data/jobs.db')
# Get all successful jobs
all_jobs = db.get_all_jobs(status='success')
# Get statistics
stats = db.get_stats()
print(f"Total: {stats['total']}, Success: {stats['success']}")
# Check if specific job exists
if db.job_exists('12345'):
    job = db.get_job_by_id('12345')
    print(job['title'], job['company'])
Export data to CSV:
from src.config import Config
from src.database import DatabaseManager
from src.scraper import JobScraper
db = DatabaseManager('./data/jobs.db')
# Export to CSV
db.export_to_csv('jobs_export.csv')
# Or use the scraper's export method
scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
scraper.export_to_csv('jobs_export.csv')
Edit src/config.py to customize:
class Config:
    # Target URL
    BASE_URL = "https://www.jobs.ie/jobs/in-dublin"
    # Rate limiting (seconds between requests)
    RATE_LIMIT_DELAY = 2
    # Database location
    DATABASE_PATH = './data/jobs.db'
    # CSS Selectors (update after inspecting website)
    SELECTORS = {
        'job_listing': '.job-listing-item',
        'job_link': 'a.job-link',
        'title': 'h1.job-title',
        'company': '.company-name',
        # ... etc
    }
Important: You must inspect the jobs.ie website and update the CSS selectors in Config.SELECTORS to match the actual HTML structure.
- First Run: Database is empty, all jobs are new
  - Collects job IDs from listing pages
  - Inserts all IDs with scrape_status='pending'
  - Scrapes each job and updates scrape_status to 'success' or 'failed'
- Subsequent Runs: Database has existing jobs
  - Collects job IDs from listing pages
  - Queries database for existing IDs
  - Only scrapes: new_jobs = all_jobs - existing_jobs
  - Minimal load on jobs.ie server
- Resume After Crash: No need to re-collect IDs
  - Queries database for scrape_status='pending' jobs
  - Continues scraping only unfinished jobs
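The "only scrape new jobs" step above is a set difference between the IDs found on the listing pages and the IDs already in the database. The sketch below shows the idea in isolation; `select_new_job_ids` is a hypothetical helper written for illustration, not a function from this project, and it assumes job IDs are strings.

```python
def select_new_job_ids(collected_ids, existing_ids):
    """Return IDs seen on the listing pages that are not yet in the database.

    Hypothetical helper for illustration only. Listing order is preserved
    and IDs that appear more than once on the listing pages are kept once.
    """
    existing = set(existing_ids)
    seen = set()
    new_ids = []
    for job_id in collected_ids:
        if job_id not in existing and job_id not in seen:
            seen.add(job_id)
            new_ids.append(job_id)
    return new_ids


if __name__ == "__main__":
    collected = ["101", "102", "103", "102"]  # IDs from listing pages
    existing = ["101"]                        # IDs already stored
    print(select_new_job_ids(collected, existing))
```

Using a set for the existing IDs keeps each membership check constant-time, which matters once the database holds thousands of jobs.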
Jobs are stored in SQLite with the following structure:
CREATE TABLE jobs (
    job_id TEXT PRIMARY KEY,
    title TEXT,
    company TEXT,
    location TEXT,
    salary TEXT,
    job_type TEXT,
    posted_date TEXT,
    description TEXT,
    requirements TEXT, -- JSON array
    benefits TEXT, -- JSON array
    url TEXT,
    scrape_status TEXT, -- pending, success, failed
    error_message TEXT,
    scraped_at TIMESTAMP,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
Status Values:
- pending - Job ID collected, details not yet scraped
- success - Successfully scraped with full details
- failed - Scraping failed (network/parsing error)
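To make the status lifecycle concrete, here is a minimal sketch using the standard-library sqlite3 module. It uses a cut-down version of the table (only four of the columns), and the function names (`open_db`, `add_pending`, `mark_success`, `mark_failed`, `ids_with_status`) are assumptions made for this example; the project's own DatabaseManager may work differently.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database and create a reduced jobs table (illustrative subset)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id TEXT PRIMARY KEY,
               title TEXT,
               scrape_status TEXT,
               error_message TEXT
           )"""
    )
    return conn


def add_pending(conn, job_ids):
    # INSERT OR IGNORE leaves existing rows untouched, so re-adding a known
    # ID never resets a job that was already scraped.
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (job_id, scrape_status) VALUES (?, 'pending')",
        [(job_id,) for job_id in job_ids],
    )
    conn.commit()


def mark_success(conn, job_id, title):
    conn.execute(
        "UPDATE jobs SET title = ?, scrape_status = 'success', error_message = NULL "
        "WHERE job_id = ?",
        (title, job_id),
    )
    conn.commit()


def mark_failed(conn, job_id, error):
    conn.execute(
        "UPDATE jobs SET scrape_status = 'failed', error_message = ? WHERE job_id = ?",
        (error, job_id),
    )
    conn.commit()


def ids_with_status(conn, status):
    rows = conn.execute(
        "SELECT job_id FROM jobs WHERE scrape_status = ? ORDER BY job_id", (status,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_db()
    add_pending(conn, ["a", "b", "c"])
    mark_success(conn, "a", "Engineer")
    mark_failed(conn, "b", "timeout")
    print(ids_with_status(conn, "pending"))
```

Because job_id is the primary key, the database itself enforces deduplication; the status column is what makes resume and retry possible.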
Use any SQLite browser or the sqlite3 command-line shell:
# Command line
sqlite3 data/jobs.db
-- Query examples (run inside the sqlite3 shell)
SELECT COUNT(*) FROM jobs;
SELECT COUNT(*) FROM jobs WHERE scrape_status = 'success';
SELECT title, company FROM jobs WHERE company LIKE '%Google%';
SELECT * FROM jobs WHERE scrape_status = 'failed';
Recommended GUI Tools:
- DB Browser for SQLite (Free, cross-platform)
- SQLiteStudio (Free, cross-platform)
- DBeaver (Free, cross-platform)
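The same queries can also be run from Python with the standard-library sqlite3 module. The sketch below additionally decodes the requirements column, which the schema stores as a JSON array in a TEXT field. Both `load_successful_jobs` and `make_demo_db` are hypothetical names used only for this example, and the demo table contains just the columns the query needs.

```python
import json
import sqlite3


def load_successful_jobs(conn):
    """Return successfully scraped jobs as dicts, with requirements decoded."""
    conn.row_factory = sqlite3.Row  # lets rows be accessed by column name
    rows = conn.execute(
        "SELECT job_id, title, company, requirements FROM jobs "
        "WHERE scrape_status = 'success' ORDER BY job_id"
    ).fetchall()
    jobs = []
    for row in rows:
        job = dict(row)
        # requirements is stored as a JSON array in a TEXT column
        job["requirements"] = json.loads(row["requirements"]) if row["requirements"] else []
        jobs.append(job)
    return jobs


def make_demo_db():
    """Build a small in-memory database so the example runs without real data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (job_id TEXT PRIMARY KEY, title TEXT, company TEXT, "
        "requirements TEXT, scrape_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [
            ("1", "Data Engineer", "Example Ltd", json.dumps(["Python", "SQL"]), "success"),
            ("2", None, None, None, "pending"),
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    for job in load_successful_jobs(make_demo_db()):
        print(job["title"], job["requirements"])
```

To query the real database, pass `sqlite3.connect('data/jobs.db')` instead of the demo connection.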
Run this script daily to keep the database updated:
# daily_update.py
from src.scraper import JobScraper
from src.config import Config

def daily_update():
    scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
    print("Starting daily job update...")
    # Record the success count before the run so newly scraped jobs can be counted
    success_before = scraper.get_scraping_stats()['success']
    scraper.run()
    stats = scraper.get_scraping_stats()
    print("Update complete!")
    print(f"Total jobs in DB: {stats['total']}")
    print(f"New jobs scraped: {stats['success'] - success_before}")
    # Export updated data
    scraper.export_to_csv('daily_jobs_export.csv')

if __name__ == "__main__":
    daily_update()
# maintenance.py
from src.database import DatabaseManager
from src.scraper import JobScraper
from src.config import Config
def retry_failed():
    """Retry all failed jobs"""
    scraper = JobScraper(Config.BASE_URL, Config.DATABASE_PATH)
    failed = scraper.db.get_failed_jobs()
    if failed:
        print(f"Retrying {len(failed)} failed jobs...")
        scraper.scrape_job_details(failed)
    else:
        print("No failed jobs to retry")

def show_stats():
    """Display database statistics"""
    db = DatabaseManager(Config.DATABASE_PATH)
    stats = db.get_stats()
    print("\nDatabase Statistics:")
    print(f" Total jobs: {stats['total']}")
    print(f" Successful: {stats['success']}")
    print(f" Pending: {stats['pending']}")
    print(f" Failed: {stats['failed']}")

if __name__ == "__main__":
    show_stats()
    retry_failed()
The scraper includes built-in rate limiting to avoid overloading jobs.ie:
- Default: 2 seconds between requests
- Configurable via Config.RATE_LIMIT_DELAY
- Uses proper User-Agent headers
- Implements exponential backoff on errors
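As a rough illustration of how a fixed delay and exponential backoff fit together, consider the sketch below. It is not the project's request_utils code: `fetch_with_backoff` and its parameters are assumptions made for this example, and the actual HTTP call is passed in as a callable so the retry logic can be read (and tested) without network access.

```python
import time


def fetch_with_backoff(fetch, url, delay=2.0, max_retries=3,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url) politely: wait before every attempt, doubling on retries.

    fetch    -- any callable that returns a response or raises on failure
    delay    -- base pause in seconds (the README default is 2)
    retry_on -- exception types treated as temporary; requests' own
                RequestException is an OSError subclass, so it is covered
    sleep    -- injectable for testing; defaults to time.sleep
    """
    for attempt in range(max_retries + 1):
        # attempt 0 waits `delay`, attempt 1 waits 2*delay, then 4*delay, ...
        sleep(delay * (2 ** attempt))
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: let the caller record the failure


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_fetch(url):
        # Fails twice, then succeeds, to exercise the retry path
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise OSError("temporary failure")
        return f"fetched {url}"

    print(fetch_with_backoff(flaky_fetch, "https://www.jobs.ie/jobs/in-dublin", delay=0.1))
```

In a real scraper, `fetch` would wrap something like `requests.get(url, headers=..., timeout=...)` followed by `raise_for_status()`, so that HTTP errors such as 429 also trigger the backoff.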
Recommendations:
- Don't reduce the delay between requests below 1 second
- Run updates once daily, not continuously
- Monitor for HTTP 429 (Too Many Requests) errors
- Use database queries instead of re-scraping for analysis
- Issue: CSS selectors may be outdated
- Solution: Inspect jobs.ie HTML and update Config.SELECTORS
- Issue: Network errors or HTML structure changed
- Solution: Check error_message in database, update selectors if needed
- Issue: Multiple processes accessing database
- Solution: Only run one scraper instance at a time
- Issue: Selectors don't match HTML structure
- Solution: Inspect job detail pages, update extraction selectors
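For the database-locking issue above, the "one scraper instance at a time" rule could be enforced with a lock file. The sketch below is one possible approach, not something the project is known to implement; `single_instance` and the lock path are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Hold an exclusive lock file while the block runs.

    Raises RuntimeError if the lock file already exists, which suggests
    another scraper instance is running.
    """
    try:
        # O_EXCL makes creation fail atomically if the file already exists
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(
            f"Another scraper instance appears to be running ({lock_path})"
        ) from None
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield
    finally:
        os.remove(lock_path)


if __name__ == "__main__":
    import tempfile

    lock = os.path.join(tempfile.gettempdir(), "jobsparser_demo.lock")
    with single_instance(lock):
        print("Lock acquired; the scraper would run here")
```

One caveat: if the process is killed abruptly, the lock file is left behind and has to be deleted by hand before the next run.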
jobsparser/
├── src/
│   ├── scraper.py           # Main orchestrator
│   ├── navigator.py         # Pagination handler
│   ├── collector.py         # Job ID collector
│   ├── detail_scraper.py    # Job detail extractor
│   ├── parser.py            # HTML parsing utilities
│   ├── database.py          # SQLite manager
│   ├── config.py            # Configuration
│   └── utils/               # Utility functions
├── data/
│   ├── jobs.db              # SQLite database (auto-created)
│   └── jobs_export.csv      # CSV exports
├── logs/
│   └── scraper.log          # Application logs
└── requirements.txt
For issues, questions, or contributions:
- Check technical-breakdown.md for implementation details
- Review architecture-plan.md for system design
This scraper is for educational and personal use. Please review jobs.ie's Terms of Service and robots.txt before use.
Created: 2026-01-22
Target: https://www.jobs.ie/jobs/in-dublin
Python: 3.8+
Database: SQLite 3