Exam of Basic and Advanced Machine Learning
This project conducts a computational analysis of the academic literature on overtourism, a phenomenon where excessive tourist presence negatively impacts local communities and environments. The primary goal is to use Natural Language Processing (NLP) and a suite of topic modelling techniques to uncover the latent themes, key concepts, and dominant areas of discussion within this emerging field of research. The project applies four algorithms: Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), the Hierarchical Dirichlet Process (HDP), and Non-negative Matrix Factorisation (NMF). It identifies the themes each model produces and compares the models' effectiveness, offering methodological insight into their use for literature reviews. The overall aim is to demonstrate how computational methods can accelerate and enrich the process of understanding a complex, multidisciplinary research topic.
The project is organised into a two-part workflow, contained in two separate Jupyter Notebooks. This structure ensures a logical separation between data preparation and analysis.
The first notebook (Part 1) is dedicated to cleaning and preparing the raw textual data for analysis. The quality of the topic models depends heavily on the quality of the input data, making this a critical first step.
- Objective: To process the abstracts and keywords from the academic dataset (Scopus/Web of Science).
- Input: scopus_overtourism.csv, webofscience_overtourism.txt, cloud.png
- Steps:
  - Data Loading: Imports the initial dataset.
  - Text Cleaning: Removes irrelevant characters, HTML tags, and punctuation.
  - Normalisation: Converts text to lowercase.
  - Tokenisation: Splits text into individual words (tokens).
  - Stopword Removal: Filters out common words that carry little semantic meaning.
  - Lemmatisation: Reduces words to their base dictionary form to group related terms.
- Output: A cleaned, processed dataset (overtourism_processed.csv) where each document is ready for numerical vectorisation.
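The cleaning steps listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the notebook's actual code: the function names `preprocess` and `naive_lemma`, the tiny `STOPWORDS` set, and the sample sentence are all assumptions made for the example. In particular, `naive_lemma` is a crude plural-stripping stand-in; a real pipeline would more likely use a dictionary-based lemmatiser such as NLTK's `WordNetLemmatizer` or spaCy.

```python
import html
import re

# Illustrative stopword list only (assumption); a real run would use a fuller
# list, for example the one shipped with NLTK or spaCy.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is",
             "of", "on", "the", "to", "with"}

TAG_RE = re.compile(r"<[^>]+>")          # matches HTML tags such as <p> or </p>
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # anything that is not a-z or whitespace


def naive_lemma(token: str) -> str:
    """Crude plural stripping; a stand-in for a dictionary-based lemmatiser."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    """Clean one abstract and return its list of normalised tokens."""
    text = html.unescape(text)            # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)          # text cleaning: drop HTML tags
    text = text.lower()                   # normalisation: lowercase
    text = NON_LETTER_RE.sub(" ", text)   # remove punctuation, digits, symbols
    tokens = text.split()                 # tokenisation on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [naive_lemma(t) for t in tokens]             # (naive) lemmatisation


if __name__ == "__main__":
    sample = "<p>Overtourism affects the residents &amp; environments of cities.</p>"
    print(preprocess(sample))
```

Applying `preprocess` to every abstract and writing the joined tokens to a CSV column would yield a file in the spirit of overtourism_processed.csv, although the real notebook may order or implement these steps differently.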
The second notebook (Part 2) takes the cleaned data from Part 1 and applies the topic modelling algorithms. This is where the core analysis and discovery of themes take place.
- Objective: To build, evaluate, and compare four different topic models.
- How Part 1 Connects to Part 2: The overtourism_processed.csv file generated by the first notebook is the direct input for this second part. The modelling cannot proceed without the cleaning and structuring performed in Part 1.
- Input: overtourism_processed.csv, cloud.png
- Steps:
  - Vectorisation: Converts the cleaned text into numerical representations (Bag-of-Words and TF-IDF).
  - Model Training: Trains the LDA, LSI, HDP, and NMF models on the vectorised corpus.
  - Hyperparameter Tuning: Systematically tests different numbers of topics to find the optimal configuration based on coherence scores.
  - Analysis & Visualisation: Interprets the topics generated by each model and uses visualisations like word clouds and bar charts to present the findings.
  - Comparison: Compares the models based on quantitative metrics (coherence) and qualitative interpretability.
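To make the vectorisation and coherence steps above more concrete, here is a small standard-library sketch. It is illustrative only and makes several assumptions: the function names are hypothetical, the TF-IDF weighting shown (raw term count times `log(N / df)`) is one common variant and libraries typically apply smoothing or normalisation on top, and the coherence function is a UMass-style score chosen for the example, since the notebook's actual coherence measure is not stated here. In practice these steps would normally be delegated to a topic modelling library.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    terms = sorted({token for doc in docs for token in doc})
    return {term: index for index, term in enumerate(terms)}


def bag_of_words(doc, vocab):
    """Return raw term counts for one document, ordered by vocabulary index."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]


def tfidf(docs, vocab):
    """Weight raw counts by log(N / df); terms present in every document get 0."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return rows


def umass_coherence(top_words, docs, eps=1.0):
    """UMass-style coherence: sum of log((D(w_i, w_j) + eps) / D(w_j)) over
    ordered pairs of a topic's top words, where D counts documents.
    Assumes every word in top_words occurs in at least one document."""
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_count(top_words[i], top_words[j])
            score += math.log((joint + eps) / doc_count(top_words[j]))
    return score


if __name__ == "__main__":
    # Toy tokenised corpus (assumption), standing in for the processed abstracts.
    docs = [["tourism", "city", "tourism"], ["city", "resident"], ["city", "tourism"]]
    vocab = build_vocabulary(docs)
    print(vocab)
    print([bag_of_words(d, vocab) for d in docs])
    print(tfidf(docs, vocab))
    print(umass_coherence(["tourism", "city"], docs))
```

For hyperparameter tuning, a score of this kind would be computed for the top words of every topic, averaged per model, and compared across candidate topic counts, with higher (less negative) values suggesting more coherent topics.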