This project applies large language model (LLM) text extraction and unsupervised clustering to a corpus of 66 UK public dialogue reports, spanning 2004–2025 and covering technologies including AI, gene editing, nanotechnology, nuclear energy, geoengineering, drones, and quantum technologies.
The goal is to ask: what do members of the public consistently say they are concerned about, or see as beneficial, when asked about new and emerging technologies — and how cross-cutting are those themes?
The analysis pipeline:
- Extracts concern and benefit phrases from each paragraph using a structured LLM prompt (GPT-4o-mini).
- Embeds the extracted phrases using the OpenAI embeddings API.
- Clusters the embeddings (k-means, 75 clusters) to identify recurring themes.
- Labels each cluster using the LLM, producing human-readable summaries.
- Characterises clusters as either cross-cutting (appearing across many technologies) or technology-specific (concentrated in one or two domains), using Shannon entropy.
- Tracks how concern and benefit themes vary over time (by dialogue year) and across technology domains using document-level binary weighting (CIP-0009).
The work contributes to a research paper on the structure of public attitudes towards science and technology in the UK.
The corpus consists of 66 publicly available UK public dialogue reports:
- PDFs: Public dialogue PDFs on Google Drive
- Metadata: tech_metadata Google Sheet — maps each PDF to its technology category and year
| Notebook | Stage | Description |
|---|---|---|
00_data_quality.ipynb |
Assess | Corpus quality assessment — chunk lengths, coverage by technology and year, missing-text diagnostics |
01_processing.ipynb |
Access → Address | PDF chunking, LLM concern/benefit extraction, embedding generation |
01a_clustering.ipynb |
Address | k-means clustering, LLM cluster labelling, framing-lens assignment |
02_shared_structure.ipynb |
Address | Cross-technology concern and benefit structure, cross-cutting cluster identification |
03_ai_distinctiveness.ipynb |
Address | AI-specific cluster salience compared to other technologies |
04_temporal_dynamics.ipynb |
Address | Year-by-year trend analysis using document-level binary weighting |
05_robustness.ipynb |
Address | Sensitivity analyses — alternative cluster counts, prompt wording, temporal stability |
The legacy monolith public_dialogue_analyser_v19.ipynb remains in the repository
for reference but is no longer the active pipeline.
The pub_dialogue/ package implements the Fynesse
Access → Assess → Address pipeline:
| Module | Stage | Description |
|---|---|---|
pub_dialogue/access.py |
Access | PDF chunking, artifact loading, AccessStage config class |
pub_dialogue/assess.py |
Assess | Question-agnostic quality plots and diagnostics, AssessStage |
pub_dialogue/address.py |
Address | Extraction, clustering, labelling, temporal analysis, AddressStage |
pub_dialogue/client.py |
— | LLMClient abstraction over litellm (supports OpenAI, Anthropic, Gemini) |
pub_dialogue/utils.py |
— | Shared re-exports for notebook convenience |
| Path | Description |
|---|---|
tests/ |
336-test pytest suite covering all three modules and stage classes |
outputs/ |
Pipeline artefacts (CSVs, JSONs, figures) — not committed |
checkpoints/ |
Embedding and soft-membership numpy arrays — not committed |
cip/ |
Code Improvement Plans — design decisions (16 CIPs, all closed) |
backlog/ |
Task tracking — bugs, features, documentation |
requirements/ |
Project requirements (VibeSafe governance) |
validation_playbook.md |
Researcher guide for reviewing and validating LLM outputs |
If you have not used Python or the command line before, start with SETUP.md. It walks you through every step — installing Python, creating a virtual environment, configuring your API key, and launching the notebooks — with no assumed knowledge. macOS, Windows, and Linux are all covered.
git clone https://github.com/mlatcl/pub-dialogue.git
cd pub-dialogue
pip install -e ".[dev]"
cp .env.example .env # add your API key(s)Run notebooks in order, starting with 01_processing.ipynb to build the artefact
cache. Subsequent notebooks (01a, 02–05) load pre-computed artefacts and do
not call the LLM.
pytest tests/ -v
# 336 tests across test_access.py, test_assess.py, test_address.py, test_dialogue_utils.pyThe package uses three stage-configuration dataclasses that centralise all path and parameter constants (no more hard-coded values across notebooks).
Access → Assess → Address
| Stage | Owns | Question-agnostic? |
|---|---|---|
| Access | Obtaining raw data (PDFs → chunks → embeddings) | Yes |
| Assess | Characterising data quality without knowing the research question | Yes |
| Address | Answering the research question (extraction, clustering, labelling, analysis) | No |
from pub_dialogue.access import AccessStage
from pub_dialogue.assess import AssessStage
from pub_dialogue.address import AddressStage
# Instantiate (all parameters have sensible defaults)
access = AccessStage() # output_folder="outputs", checkpoint_folder="checkpoints"
assess = AssessStage(access=access)
address = AddressStage(access=access) # n_concern_clusters=75, random_seed=42AccessStage — paths and chunking parameters:
artifacts = access.load_artifacts() # load all pre-computed CSVs + numpy arrays
# Returns dict with: chunks_df, concerns_df, benefits_df,
# concern_embeddings, benefit_embeddings, concern_ids, benefit_idsAssessStage — question-agnostic quality helpers:
assess.plot_quality(chunks_df) # write data_quality_overview.png to outputs/
assess.validate_cache(cache, kind) # check extraction cache for partial-failure runs
assess.validation_summary() # write validation_summary.txt to outputs/AddressStage — analysis computations:
# Year × cluster matrices (document-level binary weighting, CIP-0009)
ai_year = address.concern_year_matrix(concerns_df, chunks_df)
benefit_ai_yr = address.benefit_year_matrix(benefits_df, chunks_df)
# PCA embedding trajectories
traj = address.concern_trajectory(concerns_df, embeddings, phrase_ids)
# Technology × cluster salience
salience = address.concern_salience(concerns_df)
# Pipeline methods (used in 01a_clustering.ipynb)
result = address.cluster_phrases(phrases_df, embeddings, kind='concern',
output_folder=access.output_folder,
checkpoint_folder=access.checkpoint_folder)
# result keys: phrases_df, assignments, embeddings_normalized, centroids_normalized, soft_membership
labels = address.label_clusters(exemplars, kind='concern',
output_folder=access.output_folder, client=client)
mappings = address.assign_framing_lenses(exemplars, labels, n_clusters=75,
kind='concern',
output_folder=access.output_folder, client=client)Every analysis notebook starts with:
from pub_dialogue.utils import AccessStage, AddressStage, AssessStage, show_status
_access = AccessStage()
_address = AddressStage(access=_access)
_assess = AssessStage(access=_access)
OUTPUT_FOLDER = _access.output_folder
CHECKPOINT_FOLDER = _access.checkpoint_folder
TECH_COL = _address.tech_col
artifacts = _access.load_artifacts()The pipeline uses litellm — switch providers by
changing LLM_MODEL in 01_processing.ipynb or 01a_clustering.ipynb.
| Provider | Example LLM_MODEL |
Required env-var |
|---|---|---|
| OpenAI (default) | gpt-4o-mini |
OPENAI_API_KEY |
| Anthropic | claude-3-5-haiku-latest |
ANTHROPIC_API_KEY |
| Google Gemini | gemini/gemini-2.0-flash |
GOOGLE_API_KEY |
Note on embeddings:
EMBEDDING_MODEL(defaulttext-embedding-3-large) is separate fromLLM_MODEL. Changing it invalidates all saved*.npyembedding artefacts and requires a full re-run of01_processing.ipynb.
Running the full pipeline produces outputs in outputs/:
| File | Description |
|---|---|
paragraph_chunks.csv |
All extracted text chunks with chunking_method and was_truncated |
extracted_concerns.csv |
Concern phrases with source chunk, cluster assignment, technology, year |
extracted_benefits.csv |
Benefit phrases with same fields |
cluster_labels.json |
LLM-generated concern cluster labels and descriptions |
benefit_cluster_labels.json |
LLM-generated benefit cluster labels |
cluster_summary.csv |
Concern cluster sizes, entropy, cross-cutting classification |
framing_lens_mappings.json |
Concern framing lens → cluster assignments |
benefit_framing_lens_mappings.json |
Benefit framing lens → cluster assignments |
ai_distinctive_concerns.csv |
Concern clusters most over/under-represented in AI dialogues |
ai_distinctive_benefits.csv |
Benefit clusters most distinctive to AI |
validation_summary.txt |
Key counts and file checklist for result validation |
sensitivity_*_k{60,75,90}.* |
Concern k-sensitivity outputs |
benefit_sensitivity_*_k{60,75,90}.* |
Benefit k-sensitivity outputs |
This project uses VibeSafe for structured project management. The development workflow is:
- Tenets — guiding principles (Fynesse AAA separation, Access/Assess/Address invariant)
- Requirements (
requirements/) — what the system should do - CIPs (
cip/) — design plans and architectural decisions (16 CIPs, all closed — seecip/README.md) - Backlog (
backlog/) — concrete tasks in progress or queued