Public Dialogue Analyser

What this project does

This project applies large language model (LLM) text extraction and unsupervised clustering to a corpus of 66 UK public dialogue reports, spanning 2004–2025 and covering technologies including AI, gene editing, nanotechnology, nuclear energy, geoengineering, drones, and quantum technologies.

The goal is to ask: what do members of the public consistently say they are concerned about, or see as beneficial, when asked about new and emerging technologies — and how cross-cutting are those themes?

The analysis pipeline:

1. Extracts concern and benefit phrases from each paragraph using a structured LLM prompt (GPT-4o-mini).
2. Embeds the extracted phrases using the OpenAI embeddings API.
3. Clusters the embeddings (k-means, 75 clusters) to identify recurring themes.
4. Labels each cluster using the LLM, producing human-readable summaries.
5. Characterises clusters as either cross-cutting (appearing across many technologies) or technology-specific (concentrated in one or two domains), using Shannon entropy.
6. Tracks how concern and benefit themes vary over time (by dialogue year) and across technology domains using document-level binary weighting (CIP-0009).
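
As an illustration of the entropy-based characterisation above, the following minimal sketch computes a normalised Shannon entropy over the technologies that contribute phrases to one cluster. The function name, the normalisation by the log of the number of technologies, and the example data are assumptions made for illustration. They are not taken from the pub_dialogue package, and the project's actual thresholds are not shown here.

```python
import math
from collections import Counter


def normalised_entropy(tech_labels, n_technologies):
    """Shannon entropy of a cluster's technology distribution, scaled to [0, 1].

    tech_labels: one technology label per phrase assigned to the cluster.
    n_technologies: number of technology categories in the corpus (assumed known).
    """
    counts = Counter(tech_labels)
    total = sum(counts.values())
    if total == 0 or n_technologies < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_technologies)


# Hypothetical clusters: one spread evenly over four technologies,
# one dominated by a single technology.
cross_cutting = ["AI", "gene editing", "nanotechnology", "drones"] * 5
tech_specific = ["AI"] * 19 + ["drones"]

print(normalised_entropy(cross_cutting, n_technologies=4))  # close to 1
print(normalised_entropy(tech_specific, n_technologies=4))  # close to 0
```

Values near 1 indicate a theme that appears evenly across technologies; values near 0 indicate a theme concentrated in one domain.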

The work contributes to a research paper on the structure of public attitudes towards science and technology in the UK.

Data

The corpus consists of 66 publicly available UK public dialogue reports:

PDFs: Public dialogue PDFs on Google Drive
Metadata: tech_metadata Google Sheet — maps each PDF to its technology category and year

Repository structure

Analysis notebooks (run in order)

Notebook	Stage	Description
`00_data_quality.ipynb`	Assess	Corpus quality assessment — chunk lengths, coverage by technology and year, missing-text diagnostics
`01_processing.ipynb`	Access → Address	PDF chunking, LLM concern/benefit extraction, embedding generation
`01a_clustering.ipynb`	Address	k-means clustering, LLM cluster labelling, framing-lens assignment
`02_shared_structure.ipynb`	Address	Cross-technology concern and benefit structure, cross-cutting cluster identification
`03_ai_distinctiveness.ipynb`	Address	AI-specific cluster salience compared to other technologies
`04_temporal_dynamics.ipynb`	Address	Year-by-year trend analysis using document-level binary weighting
`05_robustness.ipynb`	Address	Sensitivity analyses — alternative cluster counts, prompt wording, temporal stability
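
The document-level binary weighting used for the year-by-year analysis can be pictured with a small sketch: each report counts at most once per cluster, however many phrases it contributes. The tuple layout, the function name, and the choice to express each cell as a share of that year's documents are assumptions for illustration; the weighting actually specified in CIP-0009 may differ in detail.

```python
from collections import defaultdict


def year_cluster_shares(phrases):
    """phrases: iterable of (doc_id, year, cluster_id), one tuple per extracted phrase.

    Returns {year: {cluster_id: share of that year's documents mentioning the cluster}}.
    """
    docs_by_year = defaultdict(set)
    docs_by_year_cluster = defaultdict(set)
    for doc_id, year, cluster_id in phrases:
        docs_by_year[year].add(doc_id)
        # Using a set means repeated phrases from one document count once.
        docs_by_year_cluster[(year, cluster_id)].add(doc_id)

    shares = defaultdict(dict)
    for (year, cluster_id), docs in docs_by_year_cluster.items():
        shares[year][cluster_id] = len(docs) / len(docs_by_year[year])
    return {year: dict(row) for year, row in shares.items()}


# Hypothetical data: report_a mentions cluster 3 three times, but counts once.
phrases = [
    ("report_a", 2019, 3), ("report_a", 2019, 3), ("report_a", 2019, 3),
    ("report_a", 2019, 7),
    ("report_b", 2019, 7),
]
shares = year_cluster_shares(phrases)
print(shares)  # {2019: {3: 0.5, 7: 1.0}}
```

In this sketch a long report that repeats one concern many times does not outweigh a short report that mentions it once. Note that documents with no extracted phrases are absent from the denominator here, which a fuller implementation would need to handle.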

The legacy monolith public_dialogue_analyser_v19.ipynb remains in the repository for reference but is no longer the active pipeline.

Python package

The pub_dialogue/ package implements the Fynesse Access → Assess → Address pipeline:

Module	Stage	Description
`pub_dialogue/access.py`	Access	PDF chunking, artifact loading, `AccessStage` config class
`pub_dialogue/assess.py`	Assess	Question-agnostic quality plots and diagnostics, `AssessStage`
`pub_dialogue/address.py`	Address	Extraction, clustering, labelling, temporal analysis, `AddressStage`
`pub_dialogue/client.py`	—	`LLMClient` abstraction over `litellm` (supports OpenAI, Anthropic, Gemini)
`pub_dialogue/utils.py`	—	Shared re-exports for notebook convenience

Other files

Path	Description
`tests/`	336-test pytest suite covering all three modules and stage classes
`outputs/`	Pipeline artefacts (CSVs, JSONs, figures) — not committed
`checkpoints/`	Embedding and soft-membership numpy arrays — not committed
`cip/`	Code Improvement Plans — design decisions (16 CIPs, all closed)
`backlog/`	Task tracking — bugs, features, documentation
`requirements/`	Project requirements (VibeSafe governance)
`validation_playbook.md`	Researcher guide for reviewing and validating LLM outputs

New to Python?

If you have not used Python or the command line before, start with SETUP.md. It walks you through every step — installing Python, creating a virtual environment, configuring your API key, and launching the notebooks — with no assumed knowledge. macOS, Windows, and Linux are all covered.

Quick start (local)

git clone https://github.com/mlatcl/pub-dialogue.git
cd pub-dialogue
pip install -e ".[dev]"
cp .env.example .env   # add your API key(s)

Run notebooks in order, starting with 01_processing.ipynb to build the artefact cache. 01a_clustering.ipynb then calls the LLM to label clusters and assign framing lenses. The remaining notebooks (02–05) load pre-computed artefacts and do not call the LLM.

Running tests

pytest tests/ -v
# 336 tests across test_access.py, test_assess.py, test_address.py, test_dialogue_utils.py

pub_dialogue package API

The package uses three stage-configuration dataclasses that centralise all path and parameter constants, so notebooks no longer hard-code these values.

The three stages

Access → Assess → Address

Stage	Owns	Question-agnostic?
Access	Obtaining raw data (PDFs → chunks → embeddings)	Yes
Assess	Characterising data quality without knowing the research question	Yes
Address	Answering the research question (extraction, clustering, labelling, analysis)	No

Stage classes

from pub_dialogue.access import AccessStage
from pub_dialogue.assess import AssessStage
from pub_dialogue.address import AddressStage

# Instantiate (all parameters have sensible defaults)
access  = AccessStage()                   # output_folder="outputs", checkpoint_folder="checkpoints"
assess  = AssessStage(access=access)
address = AddressStage(access=access)     # n_concern_clusters=75, random_seed=42

AccessStage — paths and chunking parameters:

artifacts = access.load_artifacts()       # load all pre-computed CSVs + numpy arrays
# Returns dict with: chunks_df, concerns_df, benefits_df,
#                    concern_embeddings, benefit_embeddings, concern_ids, benefit_ids
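
Before running downstream analysis, it can be useful to confirm that each embedding array lines up with its id list. The sketch below assumes the dictionary keys listed above and assumes that ids and embedding rows correspond one-to-one; `check_alignment` is a hypothetical helper, not part of the package, and the stand-in dictionary uses plain lists in place of the real numpy arrays.

```python
def check_alignment(artifacts, kinds=("concern", "benefit")):
    """Raise ValueError if an embedding array and its id list differ in length."""
    for kind in kinds:
        n_vectors = len(artifacts[f"{kind}_embeddings"])
        n_ids = len(artifacts[f"{kind}_ids"])
        if n_vectors != n_ids:
            raise ValueError(f"{kind}: {n_vectors} embeddings but {n_ids} ids")
    return True


# Stand-in for access.load_artifacts(); real values would be numpy arrays.
fake_artifacts = {
    "concern_embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "concern_ids": ["c1", "c2"],
    "benefit_embeddings": [[0.5, 0.6]],
    "benefit_ids": ["b1"],
}
print(check_alignment(fake_artifacts))  # True
```

Because the check relies only on `len()`, it works unchanged on numpy arrays, where `len()` returns the number of rows.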

AssessStage — question-agnostic quality helpers:

assess.plot_quality(chunks_df)            # write data_quality_overview.png to outputs/
assess.validate_cache(cache, kind)        # check extraction cache for partial-failure runs
assess.validation_summary()              # write validation_summary.txt to outputs/

AddressStage — analysis computations:

# Year × cluster matrices (document-level binary weighting, CIP-0009)
ai_year       = address.concern_year_matrix(concerns_df, chunks_df)
benefit_ai_yr = address.benefit_year_matrix(benefits_df, chunks_df)

# PCA embedding trajectories
traj = address.concern_trajectory(concerns_df, embeddings, phrase_ids)

# Technology × cluster salience
salience = address.concern_salience(concerns_df)

# Pipeline methods (used in 01a_clustering.ipynb)
result = address.cluster_phrases(phrases_df, embeddings, kind='concern',
                                 output_folder=access.output_folder,
                                 checkpoint_folder=access.checkpoint_folder)
# result keys: phrases_df, assignments, embeddings_normalized, centroids_normalized, soft_membership

labels = address.label_clusters(exemplars, kind='concern',
                                output_folder=access.output_folder, client=client)

mappings = address.assign_framing_lenses(exemplars, labels, n_clusters=75,
                                         kind='concern',
                                         output_folder=access.output_folder, client=client)

Notebook setup pattern

Every analysis notebook starts with:

from pub_dialogue.utils import AccessStage, AddressStage, AssessStage, show_status

_access  = AccessStage()
_address = AddressStage(access=_access)
_assess  = AssessStage(access=_access)

OUTPUT_FOLDER     = _access.output_folder
CHECKPOINT_FOLDER = _access.checkpoint_folder
TECH_COL          = _address.tech_col

artifacts = _access.load_artifacts()

Supported LLM providers

The pipeline uses litellm — switch providers by changing LLM_MODEL in 01_processing.ipynb or 01a_clustering.ipynb.

Provider	Example `LLM_MODEL`	Required env-var
OpenAI (default)	`gpt-4o-mini`	`OPENAI_API_KEY`
Anthropic	`claude-3-5-haiku-latest`	`ANTHROPIC_API_KEY`
Google Gemini	`gemini/gemini-2.0-flash`	`GOOGLE_API_KEY`
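
A quick pre-flight check can catch a missing API key before a long extraction run starts. The sketch below mirrors the provider table above; the prefix-matching rule and both function names are illustrative assumptions and do not reflect how litellm itself resolves providers or credentials.

```python
import os

# Model-name prefixes taken from the example LLM_MODEL values in the table above.
REQUIRED_ENV_VAR = {
    "gpt-": "OPENAI_API_KEY",
    "claude-": "ANTHROPIC_API_KEY",
    "gemini/": "GOOGLE_API_KEY",
}


def required_env_var(llm_model):
    """Return the environment variable name expected for a given LLM_MODEL value."""
    for prefix, env_var in REQUIRED_ENV_VAR.items():
        if llm_model.startswith(prefix):
            return env_var
    raise ValueError(f"No known API key variable for model {llm_model!r}")


def api_key_is_set(llm_model, environ=os.environ):
    """True if the expected variable is present and non-empty in the environment."""
    return bool(environ.get(required_env_var(llm_model)))


print(required_env_var("gpt-4o-mini"))  # OPENAI_API_KEY
```

Passing `environ` explicitly makes the check easy to exercise without touching real credentials.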

Note on embeddings: EMBEDDING_MODEL (default text-embedding-3-large) is separate from LLM_MODEL. Changing it invalidates all saved *.npy embedding artefacts and requires a full re-run of 01_processing.ipynb.

Analysis outputs

Running the full pipeline produces outputs in outputs/:

File	Description
`paragraph_chunks.csv`	All extracted text chunks with `chunking_method` and `was_truncated`
`extracted_concerns.csv`	Concern phrases with source chunk, cluster assignment, technology, year
`extracted_benefits.csv`	Benefit phrases with same fields
`cluster_labels.json`	LLM-generated concern cluster labels and descriptions
`benefit_cluster_labels.json`	LLM-generated benefit cluster labels
`cluster_summary.csv`	Concern cluster sizes, entropy, cross-cutting classification
`framing_lens_mappings.json`	Concern framing lens → cluster assignments
`benefit_framing_lens_mappings.json`	Benefit framing lens → cluster assignments
`ai_distinctive_concerns.csv`	Concern clusters most over/under-represented in AI dialogues
`ai_distinctive_benefits.csv`	Benefit clusters most distinctive to AI
`validation_summary.txt`	Key counts and file checklist for result validation
`sensitivity__k{60,75,90}.`	Concern k-sensitivity outputs
`benefit_sensitivity__k{60,75,90}.`	Benefit k-sensitivity outputs

Project management

This project uses VibeSafe for structured project management. The development workflow is:

Tenets — guiding principles (Fynesse AAA separation, Access/Assess/Address invariant)
Requirements (requirements/) — what the system should do
CIPs (cip/) — design plans and architectural decisions (16 CIPs, all closed — see cip/README.md)
Backlog (backlog/) — concrete tasks in progress or queued
