Project Submission repository
The scraped data is saved in the following structure:
data/
├── problems/
│   ├── <problem_title>.json   # Problem metadata
│   └── <problem_title>.txt    # Problem statement and test cases
└── editorials/
    └── <problem_title>.txt    # Editorial content
- Selenium
- BeautifulSoup (bs4)
- Undetected ChromeDriver (undetected-chromedriver), used to bypass captchas
Install the required Python packages using:
pip install selenium beautifulsoup4 undetected-chromedriver
- Ensure Google Chrome is installed on your system.
- Download and install the required Python packages.
- Create the following directories for storing scraped data:
mkdir -p data/problems data/editorials
- Add the target URL of the Codeforces problem set to the url variable in the code.
- Execute the script:
python3 scraper.py
Problems can also be scraped individually using the fetch_problems function.
- Problem statements, test cases, and metadata will be saved in the data/problems directory.
- Editorials will be saved in the data/editorials directory.
data/problems/ProblemTitle.json:
{
  "title": "ProblemTitle",
  "tags": ["dp", "greedy"],
  "time_limit": "2 seconds",
  "memory_limit": "256 MB"
}
data/problems/ProblemTitle.txt:
<Problem Statement>
Input
<Input Description>
Output
<Output Description>
Input
<Input Test Case>
Output
<Output Test Case>
data/editorials/ProblemTitle.txt:
<Editorial Content>
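The file layout above can be produced with a few lines of standard-library Python. The following is only a minimal sketch, not the code in scraper.py: the function name save_problem and all of its parameters are hypothetical, and it assumes the problem title is already safe to use as a file name.

```python
import json
from pathlib import Path


def save_problem(title, metadata, statement, editorial, base_dir="data"):
    """Write one problem's files into the data/ layout shown above.

    Hypothetical helper for illustration; names and signature are assumptions.
    """
    base = Path(base_dir)
    problems_dir = base / "problems"
    editorials_dir = base / "editorials"

    # Same effect as `mkdir -p`: no error if the directories already exist.
    problems_dir.mkdir(parents=True, exist_ok=True)
    editorials_dir.mkdir(parents=True, exist_ok=True)

    json_path = problems_dir / f"{title}.json"
    text_path = problems_dir / f"{title}.txt"
    editorial_path = editorials_dir / f"{title}.txt"

    # Metadata goes to JSON; statement and editorial go to plain text files.
    json_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    text_path.write_text(statement, encoding="utf-8")
    editorial_path.write_text(editorial, encoding="utf-8")
    return json_path, text_path, editorial_path
```

Calling save_problem("ProblemTitle", {"title": "ProblemTitle", "tags": ["dp", "greedy"]}, "<statement>", "<editorial>") would create the three files shown in the example. Titles containing characters such as "/" would need to be sanitized first.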
- Scrapes problem metadata, statement, input/output specs, and test cases.
- Saves metadata as a JSON file and problem details as a text file.
- Extracts editorial content from the provided URL.
- Saves the editorial in a text file.
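To illustrate what extracting editorial content from a page involves, here is a minimal sketch that turns an HTML fragment into plain text. The project itself uses BeautifulSoup; this example deliberately swaps in Python's built-in html.parser so it runs without third-party packages. The names TextExtractor and html_to_text are hypothetical and do not come from scraper.py.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a <script> or <style> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of an HTML fragment, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

With BeautifulSoup the same idea is usually expressed by parsing the page and calling get_text() on the element that holds the editorial. Which element that is depends on the page structure, which is not described here.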
- For large-scale scraping, adjust the delay (time.sleep()) to avoid being flagged.
- The code detects whether the editorial link points to a PDF and, if so, skips the problem.
- The code does not work if the editorial is hosted on a different website or if its structure is highly unusual.
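The notes above can be sketched as three small helpers. This is an illustration under stated assumptions, not the logic used in scraper.py: the function names, the default host codeforces.com, and the default delay values are all hypothetical, and the actual script may detect PDFs differently (for example by inspecting the response rather than the URL).

```python
import random
import time
from urllib.parse import urlparse


def looks_like_pdf(url):
    """Guess from the URL path alone whether a link points to a PDF."""
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


def is_same_site(url, expected_host="codeforces.com"):
    """Check whether a link stays on the expected host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == expected_host or host.endswith("." + expected_host)


def polite_sleep(base_seconds=2.0, jitter_seconds=1.0):
    """Sleep for a base delay plus random jitter and return the delay used."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay
```

A scraping loop could skip any editorial link for which looks_like_pdf(link) is true or is_same_site(link) is false, and call polite_sleep() between requests. Raising base_seconds slows the run down but makes the traffic less bursty.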