Methodology

How Politicker.ai turns political language into structured insight

Politicker.ai exists to make policymaking discourse measurable, searchable, and accountable. Below, we outline how we collect, process, and analyse UK parliamentary speech using a combination of scraping, structured parsing, and large language models (LLMs).

1. Data Collection

We extract data primarily from Hansard, the official record of UK Parliament debates and question sessions. This includes:

Oral Questions
Ministerial Statements
Debates
Written Questions

We maintain a daily scraping pipeline to collect new content shortly after publication.

2. Cleaning & Structuring

Hansard is published as raw HTML, often with inconsistent formatting. We use a custom parsing layer to:

Extract speaker metadata (name, role, party)
Segment text into discrete question–answer pairs, debate turns, or statements
Normalise date, session, and topic information
Deduplicate and timestamp content

Each item is stored in a structured PostgreSQL database, enabling fast querying and versioning.
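
A minimal sketch of the structuring step, in Python, might look like the following. The `Contribution` record, its field names, the shape of the raw input dictionary, and the date format are illustrative assumptions rather than the production parser or database schema.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Contribution:
    """One debate turn, question, or answer (hypothetical record shape)."""
    speaker: str
    party: str
    sitting_date: date
    text: str

    @property
    def content_hash(self) -> str:
        # Hash the normalised fields so re-scraped copies collapse to one key.
        key = f"{self.speaker}|{self.sitting_date.isoformat()}|{self.text}"
        return hashlib.sha256(key.encode("utf-8")).hexdigest()


def normalise_text(raw: str) -> str:
    """Collapse runs of whitespace left over from HTML extraction."""
    return re.sub(r"\s+", " ", raw).strip()


def parse_contribution(raw: dict) -> Contribution:
    """Build a record from a raw scraped item (assumed dict layout)."""
    return Contribution(
        speaker=normalise_text(raw["speaker"]),
        party=normalise_text(raw["party"]),
        # Assumed input format, e.g. "14 March 2024".
        sitting_date=datetime.strptime(raw["date"].strip(), "%d %B %Y").date(),
        text=normalise_text(raw["text"]),
    )


def deduplicate(items: list[Contribution]) -> list[Contribution]:
    """Keep the first occurrence of each contribution, preserving order."""
    seen: set[str] = set()
    unique = []
    for item in items:
        if item.content_hash not in seen:
            seen.add(item.content_hash)
            unique.append(item)
    return unique
```

Hashing the normalised fields, rather than the raw HTML, means that two scrapes of the same contribution with different whitespace resolve to the same key.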

3. LLM-Powered Analysis

We apply large language models to the cleaned text to extract multiple dimensions of meaning. These include:

✅ Factual Extraction

MP information (e.g. role, party)
Department or portfolio being addressed
Entities and issues mentioned

🧠 Semantic Analysis

Stance detection: supportive, critical, neutral, evasive
Tone: defensive, assertive, conciliatory, hostile
Sentiment: positive, negative, neutral
Narrative framing (e.g. moral, nationalistic, populist)

💡 Bespoke Insights for Clients

Ideological markers (e.g. privatisation vs nationalisation, open-trade vs protectionism)
Qualitative depth on single issues (e.g. Net Zero, regulation of tech)
MP profiling: attitudes towards specific measures, summary of activity

Each model is prompt-tuned for Hansard-specific language and run in batches with internal consistency checks.
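
To illustrate what a batch run with an internal consistency check could look like, here is a hedged Python sketch. The `classify` callable is a hypothetical stand-in for a prompted LLM call, and the acceptance rule (repeat the call and keep only valid, identical outputs) is an assumption chosen for the example. The label sets are the ones listed under Semantic Analysis.

```python
from typing import Callable, Optional

# Label sets taken from the semantic dimensions listed above.
ALLOWED = {
    "stance": {"supportive", "critical", "neutral", "evasive"},
    "tone": {"defensive", "assertive", "conciliatory", "hostile"},
    "sentiment": {"positive", "negative", "neutral"},
}

Labels = dict[str, str]


def is_valid(labels: Labels) -> bool:
    """True if every dimension is present and carries an allowed value."""
    return all(labels.get(dim) in values for dim, values in ALLOWED.items())


def classify_with_check(
    text: str, classify: Callable[[str], Labels], runs: int = 2
) -> Optional[Labels]:
    """Call the model several times; accept only valid, identical outputs.

    `classify` is a hypothetical stand-in for a prompted LLM call.
    Returns None when the item should be routed to human review.
    """
    outputs = [classify(text) for _ in range(runs)]
    first = outputs[0]
    if is_valid(first) and all(out == first for out in outputs):
        return first
    return None


def run_batch(texts: list[str], classify: Callable[[str], Labels]):
    """Split a batch into accepted labels and texts flagged for review."""
    accepted, flagged = {}, []
    for text in texts:
        labels = classify_with_check(text, classify)
        if labels is None:
            flagged.append(text)
        else:
            accepted[text] = labels
    return accepted, flagged
```

Items that come back as `None` are not discarded; in this sketch they are collected in `flagged` so that they can be reviewed by a person.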

4. Validation & Monitoring

We evaluate model output through:

Human spot checks across random samples of data
Statistical modelling of reliability
Consistency tests across time series
Outlier detection
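
Two of the checks above can be sketched in Python. Cohen's kappa is used here as one common way to score agreement between human spot-check labels and model labels, and a z-score rule as one simple form of outlier detection. Both choices, along with the function names and the default threshold, are assumptions made for illustration and not a description of the specific statistics we run.

```python
from collections import Counter
from statistics import mean, pstdev


def cohens_kappa(human: list[str], model: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and equal length")
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    human_counts, model_counts = Counter(human), Counter(model)
    # Expected agreement if both raters labelled independently at their own rates.
    expected = sum(
        human_counts[label] * model_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def flag_outliers(values: list[float], threshold: float = 3.0) -> list[int]:
    """Return indices whose z-score exceeds the threshold (assumed rule)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
```

A kappa near 1 indicates that the model and the human reviewer agree far more often than chance would predict; a value near 0 indicates agreement no better than chance.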

Our work is endorsed by independent academics who specialise in quantitative methods for studying political communication. They assess the reliability of our data, which gives our outputs independent verification.

We’re working on adding public feedback mechanisms so users can flag questionable outputs directly.

5. Outputs

The processed data powers:

Searchable databases of political speech
MP and topic profiles
Time series charts showing frequency and change over time
APIs for programmatic access
Custom reports for NGOs, media, and policy teams
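
As a rough illustration of how the time series output can be derived from tagged records, the Python sketch below counts contributions on a topic per calendar month and then computes the change between consecutive observed months. The record shape (a date paired with a set of topic tags) and both function names are hypothetical.

```python
from collections import Counter
from datetime import date


def monthly_frequency(
    records: list[tuple[date, set[str]]], topic: str
) -> dict[str, int]:
    """Count records tagged with `topic` per calendar month (YYYY-MM keys)."""
    counts = Counter(
        d.strftime("%Y-%m") for d, topics in records if topic in topics
    )
    return dict(sorted(counts.items()))


def period_change(series: dict[str, int]) -> dict[str, int]:
    """Difference between each observed month and the previous observed one.

    Months with no matching records are absent from `series`, so the
    comparison is with the previous month that has data, not the previous
    calendar month.
    """
    months = list(series)
    return {cur: series[cur] - series[prev] for prev, cur in zip(months, months[1:])}
```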

6. Transparency and Limitations

We are not a source of political truth. We extract structure, not certainty. LLMs are fallible, especially with sarcasm, nuance, or poor-quality transcripts. For this reason:

We link every data point to the original transcript
We don’t editorialise or target individual politicians
Our goal is augmentation, not judgment

While our data has limitations, we believe our methodology as a whole produces more reliable, and arguably more objective, data than other available sources of policymaking information.

Upcoming Improvements

Improved speaker disambiguation (e.g. MPs with similar names)
Multi-model ensemble voting for higher reliability
Incorporation of votes, press releases, and committee transcripts
Full dataset release for researchers
Expansion into different legislative environments
Analysis of media sources and partnership with public opinion polling sources, enabling comparisons between groups
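
For the ensemble voting item above, one simple form it could take is a strict majority vote over the labels returned by several models. The Python sketch below is an assumption about how such a scheme might work, not a description of a finished feature; the function name and the `min_share` parameter are hypothetical.

```python
from collections import Counter
from typing import Optional


def majority_vote(labels: list[str], min_share: float = 0.5) -> Optional[str]:
    """Return the label whose share of votes exceeds `min_share`, else None."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_share else None
```

Returning `None` when no label clears the threshold leaves room to send split decisions to human review instead of forcing a result.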

Politicker.ai is a living project. If you have suggestions or critiques, or want to collaborate on validation, contact us.