GitHub - kaizen108/Planning_Scapers: UKPlanning project @ University of Warwick.

Planning_Scapers

----- ----- ----- Project structure ----- ----- -----
The project is structured as follows:
root/
├── 📂 Lists/: The list of all applications from Local Authorities.
├── 📂 ScrapedApplications/: The data and documents scraped from the Local Authorities/PlanIt API.
├── 📂 UKPlanning/: All scripts/scrapers.
│   ├── 📄 requirements.txt
│   ├── 📂 general/: General-purpose scraper logic (not tied to a specific framework).
│   │   ├── 📄 base_scraper.py: Common Scrapy Spider base class.
│   │   ├── 📄 utils.py: General utility functions.
│   │   ├── 📄 items.py: Item classes for downloaded files.
│   │   └── parsers.py: Shared parsing utilities.
│   │
│   ├── 📂 scrapers/: Individual scraper instances for each framework.
│   │   ├── Idox/
│   │   │   ├── Idox1_scraper.py: Idox scraper 1 (inherits from IdoxBaseSpider).
│   │   │   ├── Idox2_scraper.py
│   │   │   └── ...
│   │   ├── Atrium/
│   │   │   ├── Atrium1_scraper.py
│   │   │   ├── Atrium2_scraper.py
│   │   │   └── ...
│   │   ├── PlanningExplorer/
│   │   │   ├── PlanningExplorer1_scraper.py
│   │   │   └── ...
│   │   ├── 📄 Agile_scraper.py: CannockChase, Flintshire, LakeDistrict ... (13 LAs)
│   │   ├── 📄 CCED_scraper.py: DorsetCouncil, Christchurch.
│   │   ├── 📄 CivicaJason_scraper.py: Ashfield, Denbighshire, Eastbourne, StAlbans, Waverley.
│   │   ├── 📄 Custom_scraper.py: Wiltshire.
│   │   ├── 📄 Ocella_scraper.py: Arun, GreatYarmouth, Havering, Hillingdon, SouthHolland.
│   │   ├── 📄 Tascomi.py: Barking, Coventry, Liverpool ... (14 LAs)
│   │   ├── 📄 Thames.py: Richmond (needs updating).
│   │   └── others/
│   │       ├── pdf_scraper.py
│   │       ├── sitemap_scraper.py
│   │       └── ...
│   │
│   ├── 📂 middlewares/: Globally available middleware modules.
│   │   ├── 📄 middlewares.py: Base middlewares.
│   │   ├── middlewares_IP.py: Middlewares for using IP proxies.
│   │   ├── middlewares_IP_rotation.py: Middlewares for rotating IP proxies frequently.
│   │   ├── user_agent_mw.py
│   │   └── custom/
│   │       ├── idox_proxy_mw.py: Idox-specific custom middleware.
│   │       └── atrium_auth_mw.py: Atrium-specific custom middleware.
│   │
│   ├── 📂 pipelines/: Globally available pipeline modules.
│   │   ├── 📄 pipelines.py: Base pipelines.
│   │   ├── 📄 pipelines_extension.py: Pipelines for obtaining file extensions.
│   │   ├── 📄 pipelines_form_extension.py: Pipelines for obtaining file extensions with FormRequest.
│   │   └── custom/
│   │       ├── idox_custom_pipeline.py: Idox-specific custom pipeline.
│   │       └── atrium_custom_pipeline.py: Atrium-specific custom pipeline.
│   │
│   ├── 📂 tools/: External tool modules.
│   │   ├── 📂 reCAPTCHA/
│   │   │   ├── 📄 reCAPTCHA_model.py: Data pre-processing, model training and prediction.
│   │   │   ├── 📄 reCAPTCHA_API.py: APIs for scrapers.
│   │   │   ├── 📂 model/: ML models for solving reCAPTCHA puzzles.
│   │   │   │   └── 📄 image_classifier.h5
│   │   │   ├── 📂 raw_training/: Raw training data before pre-processing.
│   │   │   ├── 📂 training/: Training data.
│   │   │   ├── 📂 test/: Test data.
│   │   │   ├── 📂 predicted/: Prediction results of test data.
│   │   │   ├── 📂 deleted/: Deleted duplicate training samples.
│   │   │   └── 📄 class_names.txt
│   │   ├── ip_rotation/
│   │   │   ├── rotator_proxy_service.py
│   │   │   └── rotator_custom_pool.py
│   │   ├── 📄 data_process.py
│   │   ├── 📄 data_validation.py
│   │   ├── 📄 utils.py
│   │   └── email_sender.py: Notification tools (Slack, email, etc.); available in the local repository only.
│   │
│   ├── 📂 configs/: Project-wide configuration files.
│   │   ├── 📄 settings.py: Global Scrapy settings.
│   │   ├── frameworks/
│   │   │   ├── Idox_settings.py
│   │   │   └── Atrium_settings.py
│   │   └── scrapers/
│   │       ├── Idox1_settings.py
│   │       ├── Idox2_settings.py
│   │       └── Atrium1_settings.py
│   │
│   └── tests/: Unit and integration tests.
│       ├── test_frameworks.py
│       ├── test_scrapers.py
│       └── test_middlewares.py
├── 📄 EC2_commands.py: Script run on the EC2 instance to configure it.
├── 📄 local_commands.py: Script run on the local machine to configure EC2 instances.
├── 📄 scraper_document.pdf: User guide for running the scrapers on local machines.
├── 📄 scrapy.cfg: Scrapy entry configuration.
└── 📄 README.md

----- ----- ----- Run scraper on local machines ----- ----- -----

See scraper_document.pdf for details.

----- ----- ----- Configure EC2 instances ----- ----- -----

Follow the instructions to start a new EC2 instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
Get your EC2 instance's public IPv4 DNS, e.g. ec2-18-130-206-213
Run the following shell command from your local machine:

	python {your local command path}/local_commands.py {your EC2 instance's IPv4 DNS} init

Execute shell commands from your EC2 instance:

	python3 EC2_commands.py init
	source scraper_env/bin/activate

Execute shell command from your local machine:

	python {your local command path}/local_commands.py {your EC2 instance's IPv4 DNS} Data

Execute shell commands from your EC2 instance:

	python EC2_commands.py install_chromedriver
	python EC2_commands.py install_chrome
	python EC2_commands.py configure_env
	cd UKPlanning
	python main.py

----- ----- ----- Develop new scraper ----- ----- -----
Currently, the scraper (UKPlanning_Scraper) can scrape most information items from Idox portals.
To develop a new scraper by adapting the existing one, create a new scraper class as a subclass of UKPlanning_Scraper and override its parse methods.
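
The subclass-and-override pattern described above might look roughly like the sketch below. The real UKPlanning_Scraper is not reproduced here: the class of that name in this snippet is only a stand-in so the example runs on its own, and the method names (parse, parse_summary), the dict-based input and the "App. No." label are all hypothetical.

```python
class UKPlanning_Scraper:
    """Stand-in for the project's scraper class (hypothetical, for illustration only)."""

    def parse_summary(self, fields):
        # Hypothetical default behaviour: keep the fields as scraped.
        return dict(fields)

    def parse(self, fields):
        # The entry point delegates to the overridable parse method.
        return self.parse_summary(fields)


class MyPortal_Scraper(UKPlanning_Scraper):
    """Hypothetical scraper for a portal that labels its fields differently."""

    def parse_summary(self, fields):
        # Reuse the inherited logic, then adjust what differs on this portal.
        item = super().parse_summary(fields)
        if "App. No." in item:
            item["reference"] = item.pop("App. No.")
        return item


item = MyPortal_Scraper().parse({"App. No.": "21/0001", "status": "Decided"})
print(item)
```

Only the parse method that differs is overridden; everything else is inherited unchanged, which is the point of adapting the existing scraper rather than writing a new one.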

--- --- END of UKPlanning_Scraper Guidance --- ---

###########
The guidance below is for UKPlanIt_API.py, not for the local-authority scrapers. Please ignore it.
###########

----- ----- ----- UKPlanIt APIs ----- ----- -----

File 'main.py' contains all APIs related to the scraper. Most APIs take two parameters that specify the range of authorities to scrape or process. There are 424 authorities.
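
As a rough illustration of how a (start, end) pair could select authorities, the sketch below assumes 1-based, inclusive positions, which is what the scrape examples in this section suggest. The helper name select_authorities and the placeholder authority names are hypothetical and are not taken from main.py.

```python
def select_authorities(authorities, start, end):
    """Return the authorities at 1-based positions start..end, inclusive."""
    if not 1 <= start <= end <= len(authorities):
        raise ValueError(f"expected 1 <= start <= end <= {len(authorities)}")
    # Convert the 1-based inclusive range to a 0-based half-open slice.
    return authorities[start - 1:end]


# Placeholder names standing in for the 424 authorities.
authorities = [f"Authority{i}" for i in range(1, 425)]

print(len(select_authorities(authorities, 2, 10)))  # 2nd to 10th authority
print(select_authorities(authorities, 5, 5))        # the 5th authority only
```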

scrape(start, end): Scrape data from the PlanIt API. Results are stored in 'Data_Temp'.
    e.g. scrape(2, 10) scrapes applications from the 2nd to the 10th authority.
         scrape(5, 5) scrapes applications from the 5th authority only.
append_all(temp): Append all CSV files into a single CSV file. By default, temp = True.
    e.g. append_all(temp=True) appends all CSV files in the 'Data_Temp' folder.
         append_all(temp=False) appends all CSV files in the 'Data' folder.
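
For readers who want a feel for what appending CSV files involves, here is a minimal sketch using only the standard library. It is not the append_all implementation from main.py: the function name append_csv_files and its parameters are hypothetical, and it assumes every input file has the same header row.

```python
import csv
from pathlib import Path


def append_csv_files(folder, output_path):
    """Concatenate every CSV in `folder` into `output_path`, writing the header once."""
    output_path = Path(output_path)
    # List the inputs first and leave out the output file in case it sits in the same folder.
    sources = [p for p in sorted(Path(folder).glob("*.csv"))
               if p.resolve() != output_path.resolve()]
    header_written = False
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sources:
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                # Copy the remaining data rows unchanged.
                writer.writerows(reader)
```

A call such as append_csv_files("Data_Temp", "UKPlanning.csv") would mirror the temp=True case described above, under the same assumptions.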

inverse(start, end): The scraped raw data is stored in reverse chronological order. This method rewrites the CSV files so that applications are stored in chronological order. 'Data_Temp' -> 'Data_Temp'.
append_by_year(start, end): Append the CSV files from each authority by year. 'Data_Temp' -> 'Data'.
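
The reordering step can be pictured with the short sketch below. It is an assumption-laden illustration rather than the inverse code from main.py: the function name reverse_csv_rows is hypothetical, and it assumes each file has a single header row followed by data rows stored newest first.

```python
import csv


def reverse_csv_rows(path):
    """Rewrite the CSV at `path` in place with its data rows reversed, header kept first."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows:
        return  # nothing to reorder in an empty file
    header, data = rows[0], rows[1:]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        # Newest-first input becomes oldest-first (chronological) output.
        writer.writerows(reversed(data))
```

Reading the whole file into memory keeps the sketch simple; very large files would need a different approach.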

----- ----- ----- Quick start ----- ----- -----

Run the following pieces of code to get a csv file with applications from the first 10 authorities.

Option 1:

scrape(1, 10)

append_all()

Option 2:

scrape(1, 10)

inverse(1, 10)

append_by_year(1, 10)

append_all(False)

Both options produce the same CSV file, named "UKPlanning.csv". The second option also produces many CSV files in the 'Data' folder; these files are useful for later scraping of comments and documents.
