machinelearning

Exam of Basic and Advanced Machine Learning

Analysing the Discourse on Overtourism using Topic Modelling

Project Aim and Overview

This project conducts a computational analysis of the academic literature on overtourism, the phenomenon in which excessive tourist numbers harm local communities and environments. The primary goal is to use Natural Language Processing (NLP) and a suite of topic modelling techniques to uncover the latent themes, key concepts, and dominant areas of discussion within this emerging field of research. The project applies four algorithms: Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), the Hierarchical Dirichlet Process (HDP), and Non-negative Matrix Factorisation (NMF). It identifies the themes each model produces and compares the models' effectiveness, offering methodological insight into their use for literature reviews. The overall aim is to demonstrate how computational methods can accelerate and enrich the process of understanding a complex, multidisciplinary research topic.

Project Structure

The project is organised into a two-part workflow, contained in two separate Jupyter Notebooks. This structure ensures a logical separation between data preparation and analysis.

Part 1: 1_preprocessing_dataset.ipynb

This notebook is dedicated to cleaning and preparing the raw textual data for analysis. The quality of the topic models is highly dependent on the quality of the input data, making this a critical first step.

Objective: To process the abstracts and keywords from the academic dataset (Scopus/Web of Science).
Input: scopus_overtourism.csv, webofscience_overtourism.txt, cloud.png
Steps:
1. Data Loading: Imports the raw Scopus and Web of Science exports.
2. Text Cleaning: Removes irrelevant characters, HTML tags, and punctuation.
3. Normalisation: Converts text to lowercase.
4. Tokenisation: Splits text into individual words (tokens).
5. Stopword Removal: Filters out common words that carry little semantic meaning.
6. Lemmatisation: Reduces words to their base dictionary form to group related terms.
Output: A cleaned, processed dataset (overtourism_processed.csv) where each document is ready for numerical vectorisation.
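
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, the short `STOPWORDS` set, and the `LEMMAS` lookup table are all assumptions made for the example. In particular, the lookup table only stands in for a real lemmatiser, which the notebook presumably takes from an NLP library.

```python
import html
import re

# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

# Hypothetical lookup table standing in for a real lemmatiser.
LEMMAS = {
    "cities": "city",
    "impacts": "impact",
    "residents": "resident",
    "tourists": "tourist",
}


def preprocess(text: str) -> list[str]:
    """Clean one abstract and return its lemmatised tokens."""
    text = html.unescape(text)                 # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # text cleaning: strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # text cleaning: drop punctuation and digits
    text = text.lower()                        # normalisation
    tokens = text.split()                      # tokenisation on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]  # lemmatisation (toy lookup)


sample = "<p>Tourists &amp; residents: the impacts of overtourism on cities.</p>"
print(preprocess(sample))
# ['tourist', 'resident', 'impact', 'overtourism', 'city']
```

Applying such a function to every abstract and writing the joined tokens to a CSV column would give a file in the shape of overtourism_processed.csv.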

Part 2: 2_LDA_LSI_HDP_NMF.ipynb

This notebook takes the cleaned data from Part 1 and applies the topic modelling algorithms. This is where the core analysis and discovery of themes take place.

Objective: To build, evaluate, and compare four different topic models.
How Part 1 Connects to Part 2: The overtourism_processed.csv file generated by the first notebook is the direct input to this second part. Modelling cannot proceed without the cleaning and structuring performed in Part 1.
Input: overtourism_processed.csv, cloud.png
Steps:
1. Vectorisation: Converts the cleaned text into numerical representations (Bag-of-Words and TF-IDF).
2. Model Training: Trains the LDA, LSI, HDP, and NMF models on the vectorised corpus.
3. Hyperparameter Tuning: Systematically tests different numbers of topics to find the optimal configuration based on coherence scores.
4. Analysis & Visualisation: Interprets the topics generated by each model and uses visualisations like word clouds and bar charts to present the findings.
5. Comparison: Compares the models based on quantitative metrics (coherence) and qualitative interpretability.
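
To make the vectorisation and coherence steps concrete, here is a small pure-Python sketch on a toy corpus. It is an illustration under stated assumptions, not the notebook's implementation: the README does not name the library used for model training or the coherence measure, so the function names (`bag_of_words`, `tf_idf`, `umass_coherence`), the toy documents, and the choice of a UMass-style score are all hypothetical. UMass is used here only because it needs nothing beyond document co-occurrence counts.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of token lists."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        vectors.append(row)
    return vocab, vectors


def tf_idf(vectors):
    """Reweight raw counts by inverse document frequency, log(N / df)."""
    n_docs = len(vectors)
    n_terms = len(vectors[0]) if vectors else 0
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in vectors if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in vectors
    ]


def umass_coherence(top_words, docs):
    """Mean pairwise UMass-style score for one topic's ranked top words.

    Needs at least two words, each occurring in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]
    scores = []
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = sum(1 for s in doc_sets if top_words[l] in s)
            d_ml = sum(
                1 for s in doc_sets if top_words[m] in s and top_words[l] in s
            )
            scores.append(math.log((d_ml + 1) / d_l))
    return sum(scores) / len(scores)


# Toy corpus (assumption): each document is an already preprocessed token list.
docs = [
    ["tourist", "city", "resident", "crowding"],
    ["tourist", "city", "housing"],
    ["resident", "housing", "rent"],
    ["policy", "tax", "tourist"],
]

vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)

# Words that co-occur often score higher (closer to zero) than words that never do.
coherent = umass_coherence(["tourist", "city"], docs)
incoherent = umass_coherence(["tourist", "rent"], docs)
print(vocab)
print(coherent > incoherent)
```

In a tuning loop, a score of this kind would be averaged over all topics of a trained model and compared across candidate topic counts, with the highest-scoring configuration kept.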
