Methodology

How Politicker.ai turns political language into structured insight

Politicker.ai exists to make policymaking discourse measurable, searchable, and accountable. Below, we outline how we collect, process, and analyse UK parliamentary speech using a combination of scraping, structured parsing, and large language models (LLMs).


1. Data Collection

We extract data primarily from Hansard, the official record of UK Parliament debates and question sessions. This includes:

  • Oral Questions
  • Ministerial Statements
  • Debates
  • Written Questions

We maintain a daily scraping pipeline to collect new content shortly after publication.
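
As a minimal sketch of how a daily, incremental collection step could be organised: the code below works out which days still need collecting and passes each one to a fetch function. The names `pending_dates`, `collect`, and the injected `fetch` callable are illustrative assumptions, not a description of the production pipeline, and the weekday filter is only a rough stand-in for a real sitting-day calendar.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Iterable, List, Set


def pending_dates(last_collected: date, today: date, already_stored: Set[date]) -> List[date]:
    """Return weekdays after `last_collected`, up to `today`, that are not stored yet."""
    days = []
    current = last_collected + timedelta(days=1)
    while current <= today:
        # Assumption: weekends are skipped as a cheap first filter. A real
        # pipeline would consult the parliamentary sitting calendar instead.
        if current.weekday() < 5 and current not in already_stored:
            days.append(current)
        current += timedelta(days=1)
    return days


def collect(days: Iterable[date], fetch: Callable[[date], str]) -> Dict[date, str]:
    """Fetch raw HTML per day. `fetch` is injected (hypothetical): it could wrap
    an HTTP client in production or a stub in tests."""
    pages = {}
    for day in days:
        html = fetch(day)
        if html:  # an empty response is treated as "nothing published yet"
            pages[day] = html
    return pages
```

Injecting `fetch` keeps the scheduling logic testable without network access, and days that return nothing are simply retried on the next run because they never enter the stored set.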


2. Cleaning & Structuring

Hansard is published as raw HTML, often with inconsistent formatting. We use a custom parsing layer to:

  • Extract speaker metadata (name, role, party)
  • Segment text into discrete question–answer pairs, debate turns, or statements
  • Normalise date, session, and topic information
  • Deduplicate and timestamp content

Each item is stored in a structured PostgreSQL database, enabling fast querying and versioning.
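
To illustrate the segmentation and deduplication steps, here is a small sketch using only the Python standard library. The markup it expects (a `span` with class `speaker` followed by a `p`) is a simplified, hypothetical stand-in; real Hansard HTML is more varied. The `Turn` and `TurnParser` names are likewise assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class Turn:
    speaker: str
    text: str
    digest: str  # content hash used for deduplication


class TurnParser(HTMLParser):
    """Collects speaker turns from a simplified, hypothetical markup."""

    def __init__(self) -> None:
        super().__init__()
        self.turns: List[Turn] = []
        self._field: Optional[str] = None  # which element we are currently inside
        self._speaker = ""
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if tag == "span" and "speaker" in classes.split():
            self._field, self._speaker = "speaker", ""
        elif tag == "p":
            self._field, self._chunks = "text", []

    def handle_data(self, data):
        if self._field == "speaker":
            self._speaker += data
        elif self._field == "text":
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._field == "speaker":
            self._field = None
        elif tag == "p" and self._field == "text":
            speaker = self._speaker.strip()
            text = " ".join("".join(self._chunks).split())  # collapse whitespace
            digest = hashlib.sha256(f"{speaker}|{text}".encode("utf-8")).hexdigest()
            self.turns.append(Turn(speaker, text, digest))
            self._field = None


def parse_turns(html: str) -> List[Turn]:
    """Parse turns and drop verbatim repeats, keeping the first occurrence."""
    parser = TurnParser()
    parser.feed(html)
    seen, unique = set(), []
    for turn in parser.turns:
        if turn.digest not in seen:
            seen.add(turn.digest)
            unique.append(turn)
    return unique
```

Hashing the normalised speaker and text gives each turn a stable key, which could also serve as a uniqueness constraint when rows are written to the database.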


3. LLM-Powered Analysis

We apply large language models to the cleaned text to extract multiple dimensions of meaning. These include:

✅ Factual Extraction

  • MP information (e.g. role, party)
  • Department or portfolio being addressed
  • Entities and issues mentioned

🧠 Semantic Analysis

  • Stance detection: supportive, critical, neutral, evasive
  • Tone: defensive, assertive, conciliatory, hostile
  • Sentiment: positive, negative, neutral
  • Narrative framing (e.g. moral, nationalistic, populist)

💡 Bespoke Insights for Clients

  • Ideological markers (e.g. privatisation vs nationalisation, open-trade vs protectionism)
  • Qualitative depth on single issues (e.g. Net Zero, regulation of tech)
  • MP profiling: attitudes towards specific measures, summary of activity

Each model is prompt-tuned for Hansard-specific language, and run in batches with internal consistency checks.
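
One way to picture an internal consistency check, sketched under assumptions: the model is asked to reply in JSON, the reply is validated against the label sets listed above, and repeated runs on the same item are compared. The JSON reply format and the function names `parse_labels` and `disagreements` are hypothetical choices for this example.

```python
import json
from typing import Dict, List

# Label sets taken from the Semantic Analysis list above.
ALLOWED = {
    "stance": {"supportive", "critical", "neutral", "evasive"},
    "tone": {"defensive", "assertive", "conciliatory", "hostile"},
    "sentiment": {"positive", "negative", "neutral"},
}


def parse_labels(raw: str) -> Dict[str, str]:
    """Parse a model's JSON reply and reject anything outside the label sets."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    labels = {}
    for field, allowed in ALLOWED.items():
        value = str(data.get(field, "")).strip().lower()
        if value not in allowed:
            raise ValueError(f"unexpected {field} label: {value!r}")
        labels[field] = value
    return labels


def disagreements(runs: List[Dict[str, str]]) -> List[str]:
    """Return the fields on which repeated runs of the same item disagree."""
    return [field for field in ALLOWED if len({run[field] for run in runs}) > 1]
```

Items with a non-empty `disagreements` result could be re-run or routed to human review rather than stored as-is.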


4. Validation & Monitoring

We evaluate model output through:

  • Human spot checks across random samples of data
  • Statistical modelling of reliability
  • Consistency tests across time series
  • Outlier detection
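
The first two checks can be sketched as follows: draw a reproducible random sample for human review, then measure agreement between human and model labels. Cohen's kappa stands in here as one familiar agreement statistic; it is an assumption for illustration, not a statement of the specific reliability models used. The function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def spot_check_sample(item_ids: Sequence[int], k: int, seed: int = 0) -> List[int]:
    """Draw a reproducible random sample of item ids for human review."""
    rng = random.Random(seed)
    return rng.sample(list(item_ids), min(k, len(item_ids)))


def cohens_kappa(human: Sequence[str], model: Sequence[str]) -> float:
    """Agreement between two label sequences, corrected for chance agreement."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    h_counts, m_counts = Counter(human), Counter(model)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 means the model agrees with reviewers no more often than random labelling with the same label frequencies would.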

Our work is endorsed by independent academics who specialise in quantitative methods for studying political communication. Their assessment of the reliability of our data provides external verification of our outputs.

We’re working on adding public feedback mechanisms so users can flag questionable outputs directly.


5. Outputs

The processed data powers:

  • Searchable databases of political speech
  • MP and topic profiles
  • Time series charts showing frequency and change over time
  • APIs for programmatic access
  • Custom reports for NGOs, media, and policy teams
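
As an illustration of the kind of aggregation behind a frequency chart, the sketch below counts how many stored items mention a term in each calendar month. The `(date, text)` record shape and the `monthly_mentions` name are assumptions for the example; simple substring matching is used in place of any entity or topic tagging.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Tuple


def monthly_mentions(records: Iterable[Tuple[date, str]], term: str) -> Dict[str, int]:
    """Count records per calendar month whose text mentions `term` (case-insensitive)."""
    counts: Counter = Counter()
    needle = term.lower()
    for day, text in records:
        if needle in text.lower():
            counts[day.strftime("%Y-%m")] += 1
    # Sort by month key so the result is ready to plot in chronological order.
    return dict(sorted(counts.items()))
```

Months with no matching items are omitted from the result, so a charting layer would need to fill those gaps with zeros.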


6. Transparency and Limitations

We are not a source of political truth. We extract structure, not certainty. LLMs are fallible, especially when faced with sarcasm, nuance, or poor-quality transcripts. For this reason:

  • We link every data point to the original transcript
  • We don’t editorialise or target individual politicians
  • Our goal is augmentation, not judgment

While our data has its limitations, we believe our methodology as a whole produces more reliable, and arguably more objective, data than other available sources of policymaking information.


Upcoming Improvements

  • Improved speaker disambiguation (e.g. MPs with similar names)
  • Multi-model ensemble voting for higher reliability
  • Incorporation of votes, press releases, and committee transcripts
  • Full dataset release for researchers
  • Expansion into different legislative environments
  • Analysis of media sources and partnership with public opinion polling sources, enabling comparisons between groups
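
Since ensemble voting is still a planned feature, the following is only a sketch of one possible approach: each model proposes a label, and a label is accepted only if it wins a clear majority, otherwise the item is left undecided. The `majority_vote` name and the `min_share` threshold are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(labels: Sequence[str], min_share: float = 0.5) -> Optional[str]:
    """Return the winning label if its share of votes exceeds `min_share`.

    Returns None when there are no votes, when the top labels tie, or when
    the winner's share is too small, so the item can be escalated for review.
    """
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie between the leading labels
    return top_label if top_count / len(labels) > min_share else None
```

Returning `None` instead of forcing a choice keeps uncertain items visible, which fits the aim of linking every data point back to a checkable source.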

Politicker.ai is a living project. If you have suggestions or critiques, or want to collaborate on validation, contact us.