LauraVerdesca/machinelearning

machinelearning

Exam of Basic and Advanced Machine Learning

Analysing the Discourse on Overtourism using Topic Modelling

Project Aim and Overview

This project is a computational analysis of the academic literature on overtourism, a phenomenon in which excessive tourist presence negatively affects local communities and environments. The primary goal is to use Natural Language Processing (NLP) and topic modelling to uncover the latent themes, key concepts, and dominant areas of discussion within this emerging field of research. The project applies four algorithms: Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), the Hierarchical Dirichlet Process (HDP), and Non-negative Matrix Factorisation (NMF). It identifies themes with each model and compares the models against one another, offering methodological insight into their use for literature review. The overall aim is to demonstrate how computational methods can accelerate and enrich the understanding of a complex, multidisciplinary research topic.

Project Structure

The project is organised into a two-part workflow, contained in two separate Jupyter Notebooks. This structure ensures a logical separation between data preparation and analysis.

Part 1: 1_preprocessing_dataset.ipynb

This notebook is dedicated to cleaning and preparing the raw textual data for analysis. The quality of the topic models is highly dependent on the quality of the input data, making this a critical first step.

  • Objective: To process the abstracts and keywords from the academic dataset (Scopus/Web of Science).
  • Input: scopus_overtourism.csv, webofscience_overtourism.txt, cloud.png
  • Steps:
    1. Data Loading: Loads the Scopus and Web of Science exports listed under Input.
    2. Text Cleaning: Removes irrelevant characters, HTML tags, and punctuation.
    3. Normalisation: Converts text to lowercase.
    4. Tokenisation: Splits text into individual words (tokens).
    5. Stopword Removal: Filters out common words that carry little semantic meaning.
    6. Lemmatisation: Reduces words to their base dictionary form to group related terms.
  • Output: A cleaned, processed dataset (overtourism_processed.csv) where each document is ready for numerical vectorisation.
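
The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook's actual code: the `STOPWORDS` set, the `naive_lemma` helper, and the `preprocess` function are hypothetical names introduced here, and the plural-stripping rule only stands in for a real lemmatiser (for example from NLTK or spaCy), which the notebook presumably uses.

```python
import html
import re

# Tiny illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "on", "is", "are", "to",
    "for", "that", "this", "it", "as", "by", "with",
}

TAG_RE = re.compile(r"<[^>]+>")          # matches HTML tags such as <p> or </b>
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # everything except ASCII letters and whitespace


def naive_lemma(token: str) -> str:
    """Placeholder for a real lemmatiser: only strips a simple plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def preprocess(text: str) -> list[str]:
    """Clean one abstract and return its list of normalised tokens."""
    text = html.unescape(text)           # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)         # text cleaning: drop HTML tags
    text = text.lower()                  # normalisation: lowercase
    text = NON_LETTER_RE.sub(" ", text)  # text cleaning: drop punctuation, digits, symbols
    tokens = text.split()                # tokenisation on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [naive_lemma(t) for t in tokens]             # (naive) lemmatisation


if __name__ == "__main__":
    sample = "<p>Overtourism is affecting the Residents of Venice &amp; Barcelona.</p>"
    print(preprocess(sample))
    # ['overtourism', 'affecting', 'resident', 'venice', 'barcelona']
```

Note that the regular expression keeps only ASCII letters, so accented characters are dropped; a production pipeline would likely handle Unicode text more carefully.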

Part 2: 2_LDA_LSI_HDP_NMF.ipynb

This notebook takes the cleaned data from Part 1 and applies the topic modelling algorithms. This is where the core analysis and discovery of themes take place.

  • Objective: To build, evaluate, and compare four different topic models.
  • How Part 1 Connects to Part 2: The overtourism_processed.csv file generated by the first notebook is the direct input for this second part. The modelling cannot proceed without the cleaning and structuring performed in Part 1.
  • Input: overtourism_processed.csv, cloud.png
  • Steps:
    1. Vectorisation: Converts the cleaned text into numerical representations (Bag-of-Words and TF-IDF).
    2. Model Training: Trains the LDA, LSI, HDP, and NMF models on the vectorised corpus.
    3. Hyperparameter Tuning: Systematically tests different numbers of topics to find the optimal configuration based on coherence scores.
    4. Analysis & Visualisation: Interprets the topics generated by each model and uses visualisations like word clouds and bar charts to present the findings.
    5. Comparison: Compares the models based on quantitative metrics (coherence) and qualitative interpretability.
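
To make the vectorisation and coherence steps concrete, here is a small standard-library sketch. It is illustrative only: the function names (`bag_of_words`, `tf_idf`, `umass_coherence`) and the toy documents are assumptions made for this example, the TF-IDF weighting shown is just one common variant, and the README does not say which coherence measure the notebook uses, so the UMass-style score below is a stand-in. In practice, libraries such as gensim or scikit-learn provide tested implementations of these steps and of the four models.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Map each tokenised document to a {term: raw count} mapping."""
    return [Counter(doc) for doc in docs]


def tf_idf(docs):
    """Weight raw counts by idf = log(N / df); one common TF-IDF variant."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


def umass_coherence(top_words, docs):
    """UMass-style coherence: mean of log((D(wi, wj) + 1) / D(wj)) over word pairs.

    Assumes every word in top_words occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def df(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            co_occurrence = df(top_words[i], top_words[j])
            scores.append(math.log((co_occurrence + 1) / df(top_words[j])))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy corpus (made up for illustration), already preprocessed into tokens.
    docs = [
        ["tourism", "resident", "city"],
        ["tourism", "city", "crowd"],
        ["resident", "housing"],
    ]
    print(bag_of_words(docs)[0])
    print(tf_idf(docs)[1])
    # Words that co-occur in documents score higher than words that never do.
    print(umass_coherence(["tourism", "city"], docs))
    print(umass_coherence(["tourism", "housing"], docs))
```

For tuning, the same idea applies across models: train one model per candidate number of topics, score the top words of each topic, and keep the configuration with the highest average coherence.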
