Search without embeddings: Postgres tsvector, LLM rerank, and 30-second clips
For a few hundred podcast episodes, Postgres full-text search plus an LLM rerank beats embedding-based RAG on both quality and operational simplicity. No vector DB, no embedding pipeline.
Prerequisites. This post assumes you’ve skimmed the uv-2026 series (for the Python toolchain) and python-monorepo-2026 (for the workspace layout). Each post in this series stands alone if you want to dive in cold, but the package paths (packages/api/, packages/shared-schema/) and the task runner come from those two.
Previously: Building a provider-switching LLM client — one complete() interface, three providers, cheap-vs-smart routing. Now we use that client for the search pipeline’s second hop.
Every RAG tutorial published in the last two years opens the same way: ingest documents, chunk them, embed each chunk into a vector, store the vectors in a specialty database, then at query time embed the query and do a nearest-neighbor lookup. It is a perfectly fine architecture. It is also wildly over-engineered for a few hundred podcast episodes.
This post argues — and demonstrates — that for our reference channel’s corpus size, the right answer is Postgres full-text search for recall, an LLM rerank pass for precision, and clip extraction for the UX. No pgvector. No embedding pipeline. No dimensionality hand-wringing.
Why not embeddings, for now
There are three honest reasons.
Cost at ingest. Every transcript segment becomes a vector. A 60-minute episode is ~10k tokens of transcript split into hundreds of short segments. Multiply by the channel’s archive and you’re paying for embedding API calls you’d rather spend on the synthesis steps that actually move the product (extraction, dedup, question generation). Local embedding models work but add a model-management problem you didn’t have before.
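To put a rough shape on that cost, here is a back-of-envelope calculation. Every number is an illustrative assumption for a channel of this size, not a measurement from the repo:

```python
# Back-of-envelope ingest volume for embedding every segment. Every number
# here is an illustrative assumption, not a measured figure.
EPISODES = 300               # rough archive size for a reference channel
SEGMENTS_PER_EPISODE = 500   # ~5-10s caption segments in a 60-min episode
TOKENS_PER_SEGMENT = 20      # short snippets of speech, ~10k tokens/episode

total_segments = EPISODES * SEGMENTS_PER_EPISODE
total_tokens = total_segments * TOKENS_PER_SEGMENT
print(f"{total_segments:,} segments, ~{total_tokens:,} tokens to embed")
# → 150,000 segments, ~3,000,000 tokens to embed
```

And that bill recurs: fix a transcript, change the chunking, or swap embedding models, and you re-embed the lot.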
Operational complexity. A vector DB is another service, another schema, another backup story. pgvector collapses that — same Postgres — but you still own an HNSW index, dimension choices, and a separate query path next to the SQL one. Nothing wrong with it. Just not free.
Recall is fine at this size. Postgres tsvector with English stemming + a GIN index gives you sub-50ms full-text search across hundreds of thousands of segments. Stemming handles “fundraise” / “fundraising” / “fundraised”. Phrase queries (websearch_to_tsquery) handle multi-word intent. The gaps — synonyms, paraphrases, conceptual matches — are exactly what the LLM rerank pass closes.
The rule I keep coming back to: add embeddings when FTS recall actually breaks, not before. For our corpus, that’s somewhere past tens of thousands of documents, and our podcast index is two orders of magnitude under that.
The pipeline at a glance
user query
│
▼
┌──────────────────────────────────────┐
│ FTS over transcript_segments_fts │ ← top 50, ranked by ts_rank
│ (Postgres, GIN-indexed tsvector) │
└──────────────────────────────────────┘
│ top 50 candidates
▼
┌──────────────────────────────────────┐
│ LLM rerank (Haiku via llm_client) │ ← top 10 + one-line justification
│ tier="cheap", schema=RerankResult │
└──────────────────────────────────────┘
│ top 10 segment ids
▼
┌──────────────────────────────────────┐
│ Clip extraction │ ← expand each hit into a 30–60s clip,
│ (window N segments, snap to . │ snapped to sentence boundaries
│ sentence boundaries) │
└──────────────────────────────────────┘
│ ranked clips
▼
┌──────────────────────────────────────┐
│ Cache write + response │ ← cache (query_hash, candidate_set)
│ with YouTube deep links │ for 7 days
└──────────────────────────────────────┘

One model call per query (the rerank), and zero at ingest (no embedding). That single call is the leverage point: it’s where FTS recall becomes search-quality precision.
The FTS layer
We already have a transcript_segments table from post 2 shaped like (id, episode_id, start_ms, end_ms, text). We add a tsvector column maintained as a stored generated column, then index it.
ALTER TABLE transcript_segments
ADD COLUMN search_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', coalesce(text, ''))) STORED;
CREATE INDEX transcript_segments_search_tsv_idx
ON transcript_segments
USING GIN (search_tsv);
-- A view that joins what the rerank prompt needs in one shot.
CREATE OR REPLACE VIEW transcript_segments_search AS
SELECT
s.id AS segment_id,
s.episode_id,
e.video_id,
e.title AS episode_title,
s.start_ms,
s.end_ms,
s.text,
s.search_tsv
FROM transcript_segments s
JOIN episodes e ON e.id = s.episode_id;

A stored generated column is the right shape here: Postgres maintains search_tsv for us on insert and update; no trigger to forget, no UPDATE migration to retroactively backfill old rows. The GIN index keeps lookups fast even as the corpus grows.
The query path:
from dataclasses import dataclass
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncConnection
@dataclass(frozen=True)
class Candidate:
segment_id: int
episode_id: int
video_id: str
episode_title: str
start_ms: int
end_ms: int
text: str
rank: float
_FTS_SQL = text(
"""
SELECT
segment_id, episode_id, video_id, episode_title,
start_ms, end_ms, text,
ts_rank_cd(search_tsv, q) AS rank
FROM transcript_segments_search,
websearch_to_tsquery('english', :query) AS q
WHERE search_tsv @@ q
ORDER BY rank DESC, start_ms ASC
LIMIT :limit
"""
)
async def fts_candidates(
conn: AsyncConnection, *, query: str, limit: int = 50
) -> list[Candidate]:
rows = await conn.execute(_FTS_SQL, {"query": query, "limit": limit})
    return [Candidate(**dict(r._mapping)) for r in rows]

Three things earn their keep:
- websearch_to_tsquery parses the kind of query a human types — quoted phrases, OR, leading minus for negation — without you writing a parser.
- ts_rank_cd (the cover density variant) weights matches that appear close together in the segment higher than scattered ones. For transcripts, where one segment is ~5–10 seconds of speech, that’s exactly what we want.
- LIMIT 50 is deliberate. The rerank model takes ~50 short snippets comfortably; going wider mostly adds noise the LLM has to spend tokens filtering out.
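The LIMIT 50 budget is easy to sanity-check with rough numbers (all of these are illustrative assumptions, not measured counts):

```python
# Rough prompt budget for the rerank call. Every number here is an
# illustrative assumption, not a measured count.
CANDIDATES = 50
TOKENS_PER_SNIPPET = 30   # ~5-10s of speech plus the "[segment_id]" prefix
SYSTEM_TOKENS = 250       # the rerank rules, cached after the first call

prompt_tokens = SYSTEM_TOKENS + CANDIDATES * TOKENS_PER_SNIPPET
print(prompt_tokens)
# → 1750
```

Well within a cheap-tier context window; doubling to 100 candidates mostly buys noise, not recall.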
A quick sanity check, run against the local DB:
[fts] 50 candidates in 38ms
1 (rank 0.412) ep_204 @ 00:14:31 "...the early fundraising landscape in Nepal was..."
2 (rank 0.397) ep_188 @ 00:42:08 "...we raised our seed in Kathmandu, and..."
3 (rank 0.355) ep_172 @ 01:03:55 "...angel investors here mostly come from..."
...
50 (rank 0.041) ep_119 @ 00:08:17 "...fundraise next year if the runway..."

ts_rank_cd is monotone enough to be useful but flat enough that 50 candidates can include both the obvious hit (ep_204 discussing fundraising in Nepal directly) and tangentially-related ones (ep_119 mentioning a future fundraise in passing). That spread is the rerank’s job to sort.
The rerank pass
This is where the provider-switching LLM client pays for itself. We send the top-50 candidates plus the user query to the cheap tier and ask for a ranked top-10 with a one-line justification each. Cheap, fast, structured.
from pydantic import BaseModel, Field
from llm_client import complete
class RerankedHit(BaseModel):
segment_id: int
score: float = Field(ge=0, le=1, description="0–1, higher = more relevant")
why: str = Field(max_length=140, description="One-line justification")
class RerankResult(BaseModel):
hits: list[RerankedHit]
_SYSTEM = """\
You are a search reranker for a podcast transcript index.
You receive a user's query and 50 candidate transcript segments retrieved by
full-text search. Return the 10 most relevant segments, ranked.
Rules:
- Prefer segments where the speaker is directly discussing the query topic.
- Demote segments where the query terms appear in passing or as off-topic asides.
- A segment is more relevant if it contains a complete thought rather than a
partial sentence.
- Score 1.0 = a strong direct match. Score 0.5 = related but not central.
Score below 0.3 should not appear in the output.
Return JSON matching the provided schema. Do not invent segment_ids.
"""
async def rerank(*, query: str, candidates: list[Candidate]) -> RerankResult:
user = "QUERY:\n" + query + "\n\nCANDIDATES:\n" + "\n".join(
f"[{c.segment_id}] {c.text}" for c in candidates
)
return await complete(
system=_SYSTEM,
messages=[{"role": "user", "content": user}],
schema=RerankResult,
tier="cheap",
cache_system=True,
    )

A few notes that aren’t obvious until you ship this:
- The system prompt is cached. It is identical across every query. cache_system=True flips on cache_control: ephemeral in the Anthropic adapter (see post 5). Across a day of queries that’s hundreds of thousands of cached tokens.
- We send segment_id and ask the model to echo it back. This is the cheapest possible “structured output” — no hallucinated text, no fuzzy reconciliation. If the model returns an id we didn’t send, we drop it.
- tier="cheap". Reranking 50 short snippets is a classification-shaped task, not a synthesis task. Haiku-class models are better at this because they’re less likely to over-explain. Smart-tier here is a waste of money.
- why is bounded. 140 chars is enough for “Speaker walks through their own Series A pitch in Nepal” and short enough that the model can’t ramble.
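The “drop ids we didn’t send” rule is a few lines of defensive code. A sketch — validate_hits is a helper name of my own, not something from the repo:

```python
from collections import namedtuple

def validate_hits(hits, sent_ids, top_k=10):
    """Keep only hits whose segment_id we actually sent, best score first.

    Works on anything with .segment_id and .score (RerankedHit included).
    Hallucinated ids are silently dropped.
    """
    sent = set(sent_ids)
    kept = [h for h in hits if h.segment_id in sent]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:top_k]

# Hypothetical model output: 99999 was never sent, so it gets dropped.
Hit = namedtuple("Hit", "segment_id score")
hits = [Hit(81204, 0.94), Hit(99999, 0.99), Hit(76553, 0.71)]
out = validate_hits(hits, sent_ids=[81204, 79018, 76553])
print([h.segment_id for h in out])
# → [81204, 76553]
```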
One response, lightly cleaned, looks like:
{
"hits": [
{ "segment_id": 81204, "score": 0.94, "why": "Founder describes their own Nepal fundraising round in detail." },
{ "segment_id": 79018, "score": 0.88, "why": "Guest contrasts Nepali angel scene with India in 2024." },
{ "segment_id": 76553, "score": 0.71, "why": "Quick mention of fundraising regulation changes in Nepal." },
...
]
}

Clip extraction
A search hit is a single segment — typically 1–10 seconds of speech, often a partial sentence. That’s a terrible thing to drop a user into. What we actually want to surface is a clip: 30–60 seconds of surrounding context, snapped to sentence boundaries where possible.
from dataclasses import dataclass

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncConnection
MIN_CLIP_MS = 30_000
MAX_CLIP_MS = 60_000
SENTENCE_END = ".!?"
@dataclass(frozen=True)
class Clip:
episode_id: int
video_id: str
episode_title: str
start_ms: int
end_ms: int
text: str
hit_segment_id: int
async def build_clip(conn, *, hit: Candidate) -> Clip:
# Pull a generous window of neighbors, then trim to the target span.
rows = (
await conn.execute(
text(
"""
SELECT id, start_ms, end_ms, text
FROM transcript_segments
WHERE episode_id = :episode_id
AND start_ms BETWEEN :lo AND :hi
ORDER BY start_ms
"""
),
{
"episode_id": hit.episode_id,
"lo": max(0, hit.start_ms - MAX_CLIP_MS),
"hi": hit.end_ms + MAX_CLIP_MS,
},
)
).all()
# Center the window on the hit, then expand outward until we hit
# min duration OR a sentence boundary, capped at max duration.
center = next(i for i, r in enumerate(rows) if r.id == hit.segment_id)
lo = hi = center
while True:
span = rows[hi].end_ms - rows[lo].start_ms
boundary_ok = rows[lo].text.lstrip()[:1].isupper() or lo == 0
boundary_ok &= rows[hi].text.rstrip()[-1:] in SENTENCE_END or hi == len(rows) - 1
if span >= MIN_CLIP_MS and boundary_ok:
break
if span >= MAX_CLIP_MS:
break
if lo > 0:
lo -= 1
if hi < len(rows) - 1:
hi += 1
if lo == 0 and hi == len(rows) - 1:
break
return Clip(
episode_id=hit.episode_id,
video_id=hit.video_id,
episode_title=hit.episode_title,
start_ms=rows[lo].start_ms,
end_ms=rows[hi].end_ms,
text=" ".join(r.text for r in rows[lo : hi + 1]),
hit_segment_id=hit.segment_id,
    )

The shape of this is more important than the exact numbers:
- A floor and a ceiling. Below 30 seconds the clip is too short to follow; above 60 seconds it’s no longer a clip, it’s a chapter. Anything in between is acceptable.
- Expand outward, not just forward. The hit segment can land anywhere — start of a thought, end of a thought, middle of a tangent. Expanding both ways gives a balanced clip.
- Snap to sentence boundaries when possible, give up gracefully when not. Auto-captions sometimes don’t punctuate. We don’t want the algorithm to loop forever waiting for a period that’s never coming, so we fall back to the duration cap.
The cap behavior — “give up gracefully when the data is messy” — is the kind of thing you only notice the third time you ship something like this. Worth writing down.
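To see that give-up behavior without a database, here is the same expand-outward loop from build_clip run on in-memory rows. The Row tuple and the expand harness are mine; the loop body mirrors the function above:

```python
from collections import namedtuple

Row = namedtuple("Row", "id start_ms end_ms text")

MIN_CLIP_MS = 30_000
MAX_CLIP_MS = 60_000
SENTENCE_END = ".!?"

def expand(rows, center):
    """The same expand-outward loop as build_clip, on in-memory rows."""
    lo = hi = center
    while True:
        span = rows[hi].end_ms - rows[lo].start_ms
        boundary_ok = rows[lo].text.lstrip()[:1].isupper() or lo == 0
        boundary_ok &= rows[hi].text.rstrip()[-1:] in SENTENCE_END or hi == len(rows) - 1
        if span >= MIN_CLIP_MS and boundary_ok:
            break
        if span >= MAX_CLIP_MS:
            break
        if lo > 0:
            lo -= 1
        if hi < len(rows) - 1:
            hi += 1
        if lo == 0 and hi == len(rows) - 1:
            break
    return lo, hi

# Twenty unpunctuated, lowercase 5-second segments: no sentence boundary
# ever satisfies the check, so the loop exits via the duration cap.
rows = [Row(i, i * 5_000, (i + 1) * 5_000, "some unpunctuated speech") for i in range(20)]
lo, hi = expand(rows, center=10)
print(rows[hi].end_ms - rows[lo].start_ms)
# → 65000
```

The span lands slightly over 60 seconds because the cap is checked after a symmetric two-sided expansion — a harmless overshoot, and far better than spinning forever waiting for a period that never comes.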
YouTube deep links
The whole point of a clip card is that the user clicks it and lands at the right second of the video.
def youtube_deep_link(video_id: str, start_ms: int) -> str:
    return f"https://youtu.be/{video_id}?t={start_ms // 1000}"

YouTube rounds to the second; that’s fine. The sentence-boundary expansion already starts the clip a hair before the hit, and a user landing on the start of a complete sentence beats a user landing mid-word every time.
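A quick check of the rounding (the function restated so the snippet runs standalone): floor division drops the milliseconds, so anything inside second 854 lands at t=854.

```python
def youtube_deep_link(video_id: str, start_ms: int) -> str:
    # Floor division truncates toward the start of the second.
    return f"https://youtu.be/{video_id}?t={start_ms // 1000}"

print(youtube_deep_link("abc123", 854_999))
# → https://youtu.be/abc123?t=854
```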
The API response, then, is just this:
class SearchHit(BaseModel):
video_id: str
episode_title: str
start_ms: int
end_ms: int
text: str
youtube_url: str
score: float
why: str
class SearchResponse(BaseModel):
query: str
hits: list[SearchHit]
cached: bool
@router.get("/search", response_model=SearchResponse)
async def search(q: str, conn: AsyncConnection = Depends(get_conn)) -> SearchResponse:
cached = await search_cache_get(conn, query=q)
if cached is not None:
return SearchResponse(query=q, hits=cached, cached=True)
candidates = await fts_candidates(conn, query=q, limit=50)
if not candidates:
return SearchResponse(query=q, hits=[], cached=False)
reranked = await rerank(query=q, candidates=candidates)
by_id = {c.segment_id: c for c in candidates}
    # Filter hits first so clips and hits stay aligned even when the model
    # echoes back an id we never sent; zipping the raw reranked.hits against
    # a filtered clips list would pair clips with the wrong scores.
    valid_hits = [h for h in reranked.hits if h.segment_id in by_id]
    clips = [
        await build_clip(conn, hit=by_id[h.segment_id]) for h in valid_hits
    ]
    hits = [
        SearchHit(
            video_id=c.video_id,
            episode_title=c.episode_title,
            start_ms=c.start_ms,
            end_ms=c.end_ms,
            text=c.text,
            youtube_url=youtube_deep_link(c.video_id, c.start_ms),
            score=h.score,
            why=h.why,
        )
        for c, h in zip(clips, valid_hits)
    ]
await search_cache_put(conn, query=q, hits=hits)
    return SearchResponse(query=q, hits=hits, cached=False)

Caching the rerank
The expensive hop is the rerank — every other step is sub-50ms Postgres. So we cache the output of the whole pipeline, keyed by a hash of (normalized_query, candidate_id_set). Two identical queries return instantly; a query that’s nearly identical but pulls slightly different FTS candidates does the rerank again (rare; the FTS result for a stable corpus is stable).
CREATE TABLE search_cache (
cache_key text PRIMARY KEY,
query text NOT NULL,
candidate_hash text NOT NULL,
hits_json jsonb NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX search_cache_created_at_idx ON search_cache (created_at);

import hashlib
import json
from datetime import timedelta
CACHE_TTL = timedelta(days=7)
def _key(query: str, candidate_ids: list[int]) -> tuple[str, str]:
norm = " ".join(query.lower().split())
candidate_hash = hashlib.sha256(
",".join(str(i) for i in sorted(candidate_ids)).encode()
).hexdigest()[:16]
    return hashlib.sha256(f"{norm}|{candidate_hash}".encode()).hexdigest(), candidate_hash

A 7-day TTL is the sweet spot: long enough that repeat traffic is free, short enough that new episodes change the candidate set before the cache outlives its usefulness. A weekly task search:cache:prune (covered in post 8) drops stale rows.
A real query, end to end
HTTP/1.1 200 OK
content-type: application/json
{
"query": "fundraising in Nepal",
"cached": false,
"hits": [
{
"video_id": "abc123",
"episode_title": "Ep 204 — Building Khalti, with Binay Mainali",
"start_ms": 854000,
"end_ms": 902000,
"text": "So when we tried to raise our Series A in Nepal, the angel
base was maybe twenty people. We ended up...",
"youtube_url": "https://youtu.be/abc123?t=854",
"score": 0.94,
"why": "Founder describes their own Nepal fundraising round in detail."
},
...
]
}

48ms FTS + ~600ms rerank + ~30ms clip building. Under a second cold, instant on cache hit. Without embeddings.
When you should add embeddings
Three signals, any one of which would push me to layer in pgvector on top of this:
- Recall is actually breaking. Users search for “raising money in South Asia” and miss the obvious Khalti episode about fundraising in Nepal because no segment contains both raising and South Asia literally. FTS stemming handles inflection, not concepts.
- The corpus is past ~50k segments. Cover-density ranking starts to flatten and the rerank prompt’s top-50 begins to miss things.
- Multilingual. Our reference channel is English-heavy; the day it isn’t, FTS with the english config stops being enough and embeddings become structurally necessary, not just nice-to-have.
When that day arrives, the path is short: add a pgvector column to transcript_segments, embed each segment at ingest, and turn the FTS query into a hybrid — tsvector match UNION vector neighbors, both feeding the same rerank. Same Postgres. No new infra. The rerank pass is provider-agnostic and corpus-agnostic; only the recall layer changes shape. That’s a sequel post, not a v1 problem.
Next: the React side — guest pages, search UI, and codegen’d types. The search response shape we just built is what the frontend’s useSearch(query) hook consumes.
Full source: github.com/poudelprakash/ai-podcast-index (tag series3-post6). This series is being written in parallel with the repo build — tagged commits will be added as posts publish, and the URL is the source of truth.
Building an AI Podcast Index
8 parts in this series.
An eight-part build-along: a locally-running tool that ingests a YouTube podcast channel, extracts guests and topics, lets you clip-search by intent, and generates questions for future episodes — using uv, FastAPI, Vite + React, and a provider-switchable LLM client.
- 01 Building an AI Podcast Index: the Project, the Stack, and What You'll Have at the End
- 02 Ingesting YouTube transcripts: the YT Data API path, with yt-dlp + Whisper as honest fallback
- 03 Structured extraction with Pydantic + Claude: guests, topics, and quotes from raw transcripts
- 04 Entity resolution for guests: fuzzy matching first, LLM disambiguation second
- 05 Building a provider-switching LLM client: one interface, three providers, task-tier routing
- 06 Search without embeddings: Postgres tsvector, LLM rerank, and 30-second clips ← you are here
- 07 The React side: guest pages, search UI, and codegen'd types
- 08 The question generator, the cron job, and shipping it locally
What did you take away?
Thoughts, pushback, or a story of your own? Drop a reply below — I read every one.