How the odnelazm-ingest pipeline works
Bunge Hub is currently in preview. It exposes Kenya’s parliamentary record as structured, searchable data: every bill, debate, and member contribution from the 13th Parliament. This post walks through the pipeline that builds and maintains the database behind it.
The pipeline lives in odnelazm-ingest, a Rust crate that handles scraping, extraction, storage, and AI enrichment.
The source
Kenya’s Hansard is published through mzalendo.com as structured HTML. Each sitting is a flat document of sections, subsections, and contributions with speaker names. The odnelazm core library parses these into typed Rust structs. odnelazm-ingest takes those structs and writes them to PostgreSQL.
The pipeline
Everything runs through IngestPipeline, which is generic over any DataStore implementation. It holds a scraper, an optional embedder, an optional summarizer, and an optional metrics sink.
pub struct IngestPipeline<S: DataStore> {
    scraper: HansardScraper,
    store: S,
    embedder: Option<Arc<dyn Embedder>>,
    pub summarizer: Option<Arc<dyn Summarizer>>,
    pub metrics: Option<Arc<dyn MetricsSink>>,
}
The public surface is small: ingest_all_sittings, ingest_sittings_in_range, ingest_members, and ingest_member_profiles. Everything else is private.
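To illustrate what being generic over a store buys, here is a minimal sketch with an in-memory store standing in for PostgreSQL. The `upsert_sitting` and `sitting_count` methods, the two-field `Sitting`, and `MemoryStore` are assumptions made for illustration; the real DataStore trait is not shown in this post and is likely async with a much larger surface.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified sitting record (the real struct has many more fields).
#[derive(Clone, Debug, PartialEq)]
pub struct Sitting {
    pub url: String,
    pub raw_json: String,
}

/// Hypothetical storage abstraction, kept synchronous here for brevity.
pub trait DataStore {
    fn upsert_sitting(&mut self, sitting: Sitting);
    fn sitting_count(&self) -> usize;
}

/// In-memory store keyed by sitting URL, handy for tests.
#[derive(Default)]
pub struct MemoryStore {
    sittings: HashMap<String, Sitting>,
}

impl DataStore for MemoryStore {
    fn upsert_sitting(&mut self, sitting: Sitting) {
        // Inserting under an existing URL replaces the previous record.
        self.sittings.insert(sitting.url.clone(), sitting);
    }

    fn sitting_count(&self) -> usize {
        self.sittings.len()
    }
}

/// A pipeline that works with any store, mirroring the shape of IngestPipeline.
pub struct Pipeline<S: DataStore> {
    pub store: S,
}

impl<S: DataStore> Pipeline<S> {
    /// Writes every sitting to the store and returns how many distinct sittings it holds.
    pub fn ingest(&mut self, sittings: Vec<Sitting>) -> usize {
        for sitting in sittings {
            self.store.upsert_sitting(sitting);
        }
        self.store.sitting_count()
    }
}
```

With this shape, a test can build `Pipeline { store: MemoryStore::default() }` and exercise the ingestion logic without a database.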
What happens when a sitting is ingested
ingest_sitting is the core unit of work. For each sitting:
- Sitting: upserted by URL. The full transcript is stored as JSONB in raw_json. On conflict, the raw JSON is overwritten but existing summaries are preserved.
- Speakers: every distinct contributor name becomes a speakers row. Speaker names are extracted as-is from the HTML, which means the same person can appear under multiple names across sittings (“Hon. Kaluma”, “Peter Kaluma”, “Hon. Peter Kaluma”).
- Bills: the bill extractor scans section headings for patterns ending in “BILL” or “ACT”. Each match becomes a bill_mentions row joining a bills record to the sitting, with stage detected from the heading or contribution text. Each speaker who contributed to that section is linked via bill_mention_speakers, with their full contribution text stored for enrichment.
- Topics: “topic” is a generic term used across the codebase for any non-bill discussion item: questions and statements, notices of motion, and communications from the chair. These are identified by section type rather than heading pattern. Each topic is linked to its contributors via topic_speakers, again with full contribution text stored for enrichment.
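The bill step can be sketched roughly as follows. This is not the extractor from odnelazm-ingest: the `Stage` variants, the keyword list, and the handling of a trailing parenthetical are all assumptions chosen to show the idea of classifying a heading by its final word and inferring a stage from nearby text.

```rust
/// Hypothetical stage labels; the real extractor may recognise more or different ones.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stage {
    FirstReading,
    SecondReading,
    CommitteeOfTheWholeHouse,
    ThirdReading,
    Unknown,
}

/// Returns true when the heading's last word is "BILL" or "ACT".
/// Case, trailing punctuation, and one trailing parenthetical are ignored (an assumption).
pub fn is_bill_heading(heading: &str) -> bool {
    let upper = heading.trim().to_uppercase();
    // Drop a trailing parenthetical such as "(Second Reading)".
    let base: &str = if upper.ends_with(')') {
        match upper.rfind('(') {
            Some(idx) => &upper[..idx],
            None => upper.as_str(),
        }
    } else {
        upper.as_str()
    };
    let base = base.trim_end_matches(|c: char| !c.is_alphanumeric());
    // Compare whole words so that "IMPACT" or "CONTRACT" do not count as "ACT".
    matches!(base.split_whitespace().next_back(), Some("BILL") | Some("ACT"))
}

/// Infers a stage from a heading or contribution text by keyword lookup.
pub fn infer_stage(text: &str) -> Stage {
    let lower = text.to_lowercase();
    if lower.contains("first reading") {
        Stage::FirstReading
    } else if lower.contains("second reading") {
        Stage::SecondReading
    } else if lower.contains("third reading") {
        Stage::ThirdReading
    } else if lower.contains("committee of the whole house") {
        Stage::CommitteeOfTheWholeHouse
    } else {
        Stage::Unknown
    }
}
```

Keyword matching of this kind is cheap but fuzzy, which is one reason detected stages should be read as approximations.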
Because bills are linked across sittings rather than stored per-sitting, this schema makes it possible to reconstruct a bill’s full legislative journey: every sitting it appeared in, the stage at each appearance, and everyone who spoke. Bunge Hub surfaces this as an interactive timeline on each bill page. You can trace a bill across sittings, seeing who spoke at each appearance and reading AI-generated summaries of each debate. Stages are inferred from section headings in the transcript so they are a close approximation rather than a guaranteed reflection of the formal legislative record.
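As a sketch of how such a journey could be assembled, the code below groups flattened mention rows by bill and sorts each bill's appearances by date. The `BillMention` struct is hypothetical (it imagines bill_mentions already joined to sittings and speakers), and in practice this grouping would more likely happen in SQL.

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened row, as if bill_mentions were joined to sittings and speakers.
#[derive(Debug, Clone, PartialEq)]
pub struct BillMention {
    pub bill_title: String,
    /// ISO 8601 date (YYYY-MM-DD), so string order equals chronological order.
    pub sitting_date: String,
    pub stage: String,
    pub speakers: Vec<String>,
}

/// Groups mentions by bill and orders each bill's appearances chronologically.
pub fn build_journeys(mentions: Vec<BillMention>) -> BTreeMap<String, Vec<BillMention>> {
    let mut journeys: BTreeMap<String, Vec<BillMention>> = BTreeMap::new();
    for mention in mentions {
        journeys
            .entry(mention.bill_title.clone())
            .or_default()
            .push(mention);
    }
    for appearances in journeys.values_mut() {
        appearances.sort_by(|a, b| a.sitting_date.cmp(&b.sitting_date));
    }
    journeys
}
```

Each value in the returned map is then one timeline: the sittings a bill appeared in, the stage at each, and who spoke.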
The linking problem
After sittings are ingested, member profiles are imported from mzalendo’s member performance tracker. This gives us canonical names, constituencies, parties, and profile URLs.
Linking speakers to members is a three-step pass:
- URL matching: when mzalendo includes a profile link on a contribution, exact URL match is used. Most reliable. This accounts for 1,871 of 3,021 speaker records in the current dataset.
- Fuzzy name matching: for speakers without a URL, pg_trgm trigram similarity is computed against all member names via a match_member SQL function. Matches above a 0.45 score are accepted.
- Role-based matching: presiding officers like “Hon. Speaker” and “Hon. Deputy Speaker” don’t fuzzy-match well because the names are too generic. These are resolved by looking up members with role = 'Speaker' or role ILIKE '%deputy speaker%' scoped to the relevant house and parliament.
Steps 2 and 3 combined account for a further 722 matches. 428 speaker records remain unmatched. Most are extraction noise where speech content leaked into the speaker name field during parsing. Others are genuinely ambiguous: presiding officers recorded under role-only labels like “Hon. Chairlady” that are too generic to resolve to a specific member without additional context.
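The cascade can be sketched as a single function that tries each strategy in turn. Everything here is illustrative: the `Speaker` and `Member` structs, the `presiding_role` helper, and the decision to send role-only labels straight to the role lookup (skipping fuzzy scoring) are assumptions, and the scoping by house and parliament is left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LinkMethod {
    ProfileUrl,
    FuzzyName,
    Role,
}

/// Hypothetical, simplified records.
pub struct Speaker {
    pub name: String,
    pub profile_url: Option<String>,
}

pub struct Member {
    pub id: u32,
    pub name: String,
    pub url: String,
    pub role: Option<String>,
}

/// Maps role-only labels such as "Hon. Speaker" to a role name (assumed label set).
fn presiding_role(name: &str) -> Option<&'static str> {
    let label = name
        .trim()
        .trim_start_matches("Hon.")
        .trim_start_matches("The")
        .trim()
        .to_lowercase();
    match label.as_str() {
        "speaker" => Some("Speaker"),
        "deputy speaker" => Some("Deputy Speaker"),
        _ => None,
    }
}

/// Tries URL match, then role lookup for role-only labels, then fuzzy name match.
pub fn link_speaker<F>(
    speaker: &Speaker,
    members: &[Member],
    score: F,
    threshold: f64,
) -> Option<(u32, LinkMethod)>
where
    F: Fn(&str, &str) -> f64,
{
    // 1. Exact profile URL match.
    if let Some(url) = &speaker.profile_url {
        if let Some(m) = members.iter().find(|m| &m.url == url) {
            return Some((m.id, LinkMethod::ProfileUrl));
        }
    }
    // 2. Role-only labels bypass fuzzy scoring to avoid false positives.
    if let Some(role) = presiding_role(&speaker.name) {
        return members
            .iter()
            .find(|m| m.role.as_deref() == Some(role))
            .map(|m| (m.id, LinkMethod::Role));
    }
    // 3. Best fuzzy score, accepted only at or above the threshold.
    let mut best: Option<(u32, f64)> = None;
    for m in members {
        let s = score(&speaker.name, &m.name);
        if best.map_or(true, |(_, b)| s > b) {
            best = Some((m.id, s));
        }
    }
    match best {
        Some((id, s)) if s >= threshold => Some((id, LinkMethod::FuzzyName)),
        _ => None,
    }
}
```

Returning the method alongside the member id makes it easy to produce per-strategy counts like the ones quoted above.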
The match_member function drives step 2. It cleans the raw speaker name, then scores it against every member using both word_similarity and similarity, taking whichever is higher:
CREATE OR REPLACE FUNCTION match_member(
    query_name TEXT,
    min_score FLOAT DEFAULT 0.3
)
RETURNS TABLE (id UUID, name TEXT, url TEXT, house TEXT, constituency TEXT, score FLOAT)
LANGUAGE sql STABLE AS $$
    WITH cleaned AS (
        SELECT clean_speaker_name(query_name) AS cn
    )
    SELECT
        m.id, m.name, m.url, m.house, m.constituency,
        greatest(
            word_similarity(c.cn, m.name),
            similarity(c.cn, m.name)
        )::FLOAT AS score
    FROM members m, cleaned c
    WHERE greatest(
        word_similarity(c.cn, m.name),
        similarity(c.cn, m.name)
    ) >= min_score
    ORDER BY score DESC
    LIMIT 5
$$;
word_similarity is the key function. Rather than comparing the two strings as wholes, it scores the cleaned name against the best-matching continuous extent of the member name, which handles the common case where the Hansard records only part of a member’s full name. “Kimani Ichung’wah” scores 0.52 against “Anthony Kimani Ichung’wah” via word_similarity even though similarity alone would score lower. link_speakers_by_name calls match_member for each unlinked speaker and takes the top result.
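To make the trigram idea concrete, here is a small Rust sketch of the scoring. `similarity` follows the general recipe pg_trgm documents (lowercase, split into words, pad, compare trigram sets), while `containment` is a deliberately cruder stand-in for word_similarity that only asks what fraction of the query's trigrams occur in the target. Neither function is expected to reproduce PostgreSQL's scores exactly.

```rust
use std::collections::HashSet;

/// Trigrams in the style of pg_trgm: lowercase, split on non-alphanumerics,
/// pad each word with two leading spaces and one trailing space.
pub fn trigrams(text: &str) -> HashSet<String> {
    let mut set = HashSet::new();
    let lower = text.to_lowercase();
    for word in lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
    {
        let padded: Vec<char> = format!("  {} ", word).chars().collect();
        for window in padded.windows(3) {
            set.insert(window.iter().collect::<String>());
        }
    }
    set
}

/// Whole-string score: shared trigrams divided by all distinct trigrams.
pub fn similarity(a: &str, b: &str) -> f64 {
    let (ta, tb) = (trigrams(a), trigrams(b));
    if ta.is_empty() || tb.is_empty() {
        return 0.0;
    }
    let shared = ta.intersection(&tb).count() as f64;
    let total = ta.union(&tb).count() as f64;
    shared / total
}

/// Simplified stand-in for word_similarity (an assumption, not pg_trgm's algorithm):
/// the fraction of the query's trigrams that also occur in the target.
pub fn containment(query: &str, target: &str) -> f64 {
    let (tq, tt) = (trigrams(query), trigrams(target));
    if tq.is_empty() {
        return 0.0;
    }
    tq.intersection(&tt).count() as f64 / tq.len() as f64
}
```

The whole-string score is dragged down by every extra word in the longer name, whereas a query-relative score is not, which is why a partial name can still clear the threshold.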
The clean_speaker_name SQL function strips honorifics, titles, constituency parentheticals, and presiding-role prefixes before matching. The Hansard produces a surprising variety of name formats:
-- Constituency and party stripped
"Hon. Kimani Ichung'wah (Kikuyu, UDA)" → "Kimani Ichung'wah"
-- Military/professional title stripped
"Hon. (Capt.) Ronald Karauri (Kasarani, Ind)" → "Ronald Karauri"
-- Temporary Speaker with nested honorific and title
"The Temporary Speaker (Hon. (Dr) Rachael Nyamai)" → "Rachael Nyamai"
-- Hon. prefix instead of The
"Hon. Temporary Speaker (Hon. Farah Maalim)" → "Farah Maalim"
-- Unclosed parenthesis from malformed HTML
"Hon. Kimani Ichung'wah (Kikuyu, UDA" → "Kimani Ichung'wah"
After cleaning, trigram similarity is computed against all member names. “Kimani Ichung’wah” matches Anthony Kimani Ichung'wah at 0.52, well above the 0.45 threshold. Presiding officer patterns like “Hon. Speaker” and “The Deputy Speaker” produce false positives with fuzzy matching, so those are resolved separately via the role-based pass.
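The real cleaning is done in SQL, and its source is not shown here. As a rough Rust sketch of the same idea, the function below handles the five example formats listed above by peeling off a presiding-role wrapper, then leading honorifics and parenthesised titles, then any trailing parenthetical. The prefix list and the order of steps are assumptions.

```rust
/// Presiding-role prefixes that wrap the real name in parentheses (assumed list).
const PRESIDING_PREFIXES: [&str; 2] = ["The Temporary Speaker", "Hon. Temporary Speaker"];

/// Strips honorifics, titles, constituency parentheticals, and presiding-role prefixes.
pub fn clean_speaker_name(raw: &str) -> String {
    let mut s = raw.trim();

    // 1. Unwrap "The Temporary Speaker (...)" down to the text inside the parentheses.
    for prefix in PRESIDING_PREFIXES.iter() {
        if let Some(rest) = s.strip_prefix(*prefix) {
            if let Some(inner) = rest.trim_start().strip_prefix('(') {
                s = inner.strip_suffix(')').unwrap_or(inner).trim();
            }
            break;
        }
    }

    // 2. Repeatedly strip a leading "Hon." and leading parenthesised titles like "(Capt.)".
    loop {
        let before = s;
        if let Some(rest) = s.strip_prefix("Hon.") {
            s = rest.trim_start();
        }
        if s.starts_with('(') {
            if let Some(end) = s.find(')') {
                s = s[end + 1..].trim_start();
            }
        }
        if s == before {
            break;
        }
    }

    // 3. Cut at the next "(", which removes "(Kikuyu, UDA)" whether or not it is closed.
    let name = match s.find('(') {
        Some(idx) => &s[..idx],
        None => s,
    };
    name.trim().to_string()
}
```

Cutting at the first remaining parenthesis, rather than matching a balanced pair, is what lets the malformed unclosed case fall out for free.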
AI enrichment
Summaries are generated separately from ingestion, via the enrich subcommand of odnelazm-pipeline. The pipeline supports six enrichment targets:
| Target | What it generates |
|---|---|
| bill-mentions | Summary of each bill’s appearance in a sitting, using the full transcript as context |
| bill-journeys | Narrative summary of a bill’s full legislative journey across all sittings |
| bill-speakers | Per-speaker summary of their contributions to a specific bill debate |
| topics | Summary of a topic’s appearance in a sitting |
| topic-speakers | Per-speaker summary of their contributions to a topic |
| sittings | Full structured summary of a sitting |
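One way to model these targets in Rust is an enum that parses from the command-line names in the table. This is a sketch only: the `EnrichTarget` type and its methods are hypothetical, and the actual binary may well rely on a CLI parsing crate instead.

```rust
use std::str::FromStr;

/// The six enrichment targets from the table (hypothetical type name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnrichTarget {
    BillMentions,
    BillJourneys,
    BillSpeakers,
    Topics,
    TopicSpeakers,
    Sittings,
}

impl EnrichTarget {
    pub const ALL: [EnrichTarget; 6] = [
        EnrichTarget::BillMentions,
        EnrichTarget::BillJourneys,
        EnrichTarget::BillSpeakers,
        EnrichTarget::Topics,
        EnrichTarget::TopicSpeakers,
        EnrichTarget::Sittings,
    ];

    /// The name used on the command line.
    pub fn as_str(self) -> &'static str {
        match self {
            EnrichTarget::BillMentions => "bill-mentions",
            EnrichTarget::BillJourneys => "bill-journeys",
            EnrichTarget::BillSpeakers => "bill-speakers",
            EnrichTarget::Topics => "topics",
            EnrichTarget::TopicSpeakers => "topic-speakers",
            EnrichTarget::Sittings => "sittings",
        }
    }
}

impl FromStr for EnrichTarget {
    type Err = String;

    /// Parses a command-line name, rejecting anything not in the table.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        EnrichTarget::ALL
            .iter()
            .copied()
            .find(|t| t.as_str() == s)
            .ok_or_else(|| format!("unknown enrichment target: {}", s))
    }
}
```

Keeping the names in one `as_str` match means parsing and display cannot drift apart.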
All enrichment is done locally using open-source models via LM Studio. The model used to generate each summary is stored alongside it so it can be attributed in the UI and re-enriched with a better model later.
The current model is Qwen3.5 9B (qwen/qwen3.5-9b). I chose it for two reasons: its 262K native context window comfortably holds a full sitting transcript in a single call, and its built-in chain-of-thought reasoning produces noticeably more nuanced summaries than models that don’t reason before generating.
The tradeoff is inference time. At 65K context, Qwen spends the bulk of its budget on reasoning before producing the actual answer. From a recent run:
- ~14 tokens/second on Apple M-series (CPU inference)
- ~4 seconds time to first token
- ~18,000 input tokens per bill appearance summary
- ~5,000 reasoning tokens generated before the final output
At concurrency 1 on a MacBook, that works out to around 3-4 minutes per summary. A full run over 1,000 bill appearances takes several days of wall-clock time. Running on a machine with a dedicated GPU could bring this down to under 30 seconds per summary.
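The wall-clock estimate is simple arithmetic, sketched below. The helper is hypothetical, and it assumes throughput scales linearly with concurrency, which may not hold for local inference on a single machine.

```rust
/// Estimated wall-clock hours for a batch of summaries.
/// Assumes perfect linear scaling with concurrency (an optimistic simplification).
pub fn estimated_hours(summaries: u64, seconds_per_summary: f64, concurrency: u64) -> f64 {
    let workers = concurrency.max(1) as f64;
    summaries as f64 * seconds_per_summary / workers / 3600.0
}
```

For 1,000 summaries at 3 to 4 minutes each, `estimated_hours(1000, 180.0, 1)` and `estimated_hours(1000, 240.0, 1)` give a range of roughly 50 to 67 hours, or two to three days. At 30 seconds per summary the same batch comes to a little over 8 hours.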
I will also be testing Gemma 4 e2b (google/gemma-4-e2b) as an alternative. It’s a 2B parameter model with a 128K context window, designed for on-device inference, and should be significantly faster. The quality tradeoff for complex parliamentary text is still to be evaluated.
The Summarizer trait is simple:
pub trait Summarizer: Send + Sync {
    async fn summarize(&self, prompt: &str) -> Result<String>;
}
The caller is responsible for building the full prompt. Prompt builders for each target live in enricher/prompts.rs. Bill appearance prompts include the full sitting transcript as context so the model can reference cross-bill discussions. Member contribution prompts are narrower, only including that speaker’s contributions.
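The difference in scope between the two kinds of prompt can be sketched like this. The function names, the `Contribution` struct, and the prompt wording are all invented for illustration; the actual builders in enricher/prompts.rs are not shown in this post.

```rust
/// Hypothetical, simplified contribution record.
pub struct Contribution {
    pub speaker: String,
    pub text: String,
}

/// Appends contributions to a prompt as "Speaker: text" lines.
fn push_lines<'a, I>(prompt: &mut String, contributions: I)
where
    I: Iterator<Item = &'a Contribution>,
{
    for c in contributions {
        prompt.push_str(&format!("{}: {}\n", c.speaker, c.text));
    }
}

/// Bill appearance prompt: includes every contribution in the sitting as context.
pub fn bill_mention_prompt(bill_title: &str, transcript: &[Contribution]) -> String {
    let mut prompt = format!(
        "Summarise the debate on \"{}\" in this sitting.\n\nTranscript:\n",
        bill_title
    );
    push_lines(&mut prompt, transcript.iter());
    prompt
}

/// Per-speaker prompt: includes only the named speaker's contributions.
pub fn bill_speaker_prompt(bill_title: &str, speaker: &str, transcript: &[Contribution]) -> String {
    let mut prompt = format!(
        "Summarise what {} said about \"{}\".\n\nContributions:\n",
        speaker, bill_title
    );
    push_lines(&mut prompt, transcript.iter().filter(|c| c.speaker == speaker));
    prompt
}
```

Because the trait only ever sees a finished string, prompt scope can change per target without touching any Summarizer implementation.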
Metrics
The pipeline emits metrics to a Prometheus pushgateway after each enrichment batch. Grafana is used to visualise progress, including throughput, token counts, and time to first token per call.
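A Pushgateway accepts metrics in the Prometheus text exposition format, sent to a URL of the form /metrics/job/<job name>. The sketch below only builds that URL and body, leaving the HTTP call out. The metric names, the `BatchMetrics` struct, and the `target` label are hypothetical and are not the names the pipeline actually emits.

```rust
/// Hypothetical per-batch measurements.
pub struct BatchMetrics {
    pub summaries: u64,
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub time_to_first_token_seconds: f64,
}

/// Builds the Pushgateway URL for a job, e.g. http://localhost:9091/metrics/job/NAME.
pub fn push_url(base: &str, job: &str) -> String {
    format!("{}/metrics/job/{}", base.trim_end_matches('/'), job)
}

/// Renders the batch as gauges in the Prometheus text exposition format.
pub fn render(metrics: &BatchMetrics, target: &str) -> String {
    let samples: [(&str, f64); 4] = [
        ("enrich_summaries", metrics.summaries as f64),
        ("enrich_input_tokens", metrics.input_tokens as f64),
        ("enrich_output_tokens", metrics.output_tokens as f64),
        ("enrich_time_to_first_token_seconds", metrics.time_to_first_token_seconds),
    ];
    let mut body = String::new();
    for (name, value) in samples.iter() {
        body.push_str(&format!("# TYPE {} gauge\n", name));
        body.push_str(&format!("{}{{target=\"{}\"}} {}\n", name, target, value));
    }
    body
}
```

The rendered body would be sent with an HTTP POST or PUT to the URL from `push_url`, after which Grafana can chart the values through Prometheus.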
Running it
The pipeline is driven by a single binary, odnelazm-pipeline:
# Ingest 13th Parliament sittings
odnelazm-pipeline ingest --start-date 2022-09-01 --end-date 2026-05-19
# Enrich bill appearance summaries
odnelazm-pipeline --metrics-url http://localhost:9091 \
enrich bill-mentions --model qwen/qwen3.5-9b --concurrency 1
Enrichment is idempotent. Items that already have a summary are skipped, so runs can be interrupted and resumed without duplicating work.
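The skip-if-present behaviour can be sketched as a loop over items that still lack a summary. The `Item` struct and the closure-based summarizer are stand-ins for illustration, and leaving failed items pending for the next run is an assumption about error handling.

```rust
/// Hypothetical enrichable item: anything with an optional stored summary.
pub struct Item {
    pub id: u32,
    pub summary: Option<String>,
}

/// Summarises only items without a summary and returns how many were generated.
/// Failed items are left untouched so that a later run picks them up again.
pub fn enrich_pending<F>(items: &mut [Item], mut summarize: F) -> usize
where
    F: FnMut(u32) -> Result<String, String>,
{
    let mut generated = 0;
    for item in items.iter_mut().filter(|i| i.summary.is_none()) {
        match summarize(item.id) {
            Ok(text) => {
                item.summary = Some(text);
                generated += 1;
            }
            // Leave the item pending; nothing is written on failure.
            Err(_) => {}
        }
    }
    generated
}
```

Running the function twice over the same items calls the summarizer only for whatever the first run did not finish, which is what makes interrupting and resuming safe.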
The full source is at mwananchi-tech/odnelazm. Bunge Hub is live at bunge-hub.mwananchi.tech.