Adopting Alembic in clipdex without rewriting the query layer
The previous post named four cracks in clipdex's raw-SQL migration story. This one closes them. Declarative models that mirror the live schema, a no-op baseline revision, `alembic stamp head`, and a frozen `migrations/legacy/` directory. No query code changes. Ninety minutes.
Last post named four cracks in clipdex’s raw-SQL migration approach: no applied-state tracking, ordering collisions across branches, no downgrade story, and no autogenerate diff between models and live schema. The fix is Alembic. This post does the migration, in commits the reader can follow.
The constraint that shapes the whole thing: don’t rewrite the query layer. Clipdex’s API uses psycopg with raw SQL strings. That works, the posts that introduced those queries link to them verbatim, and rewriting them to SQLAlchemy ORM would be a separate project. Alembic doesn’t need the query layer. It needs metadata. We can get metadata from declarative models that exist only for autogenerate to diff against.
This is the move. The query layer doesn’t change. Five small steps. Ninety minutes.
What we’re aiming at
After the migration:
packages/api/
├── alembic.ini
├── alembic/
│ ├── env.py
│ └── versions/
│ └── cad2f1272a8a_baseline_raw_sql_migrations_001_004.py
└── src/clipdex_api/
├── db.py # unchanged — still raw SQL via psycopg
├── search.py # unchanged
├── guests.py # unchanged
└── models.py # NEW — declarative tables for autogenerate
migrations/
└── legacy/ # 001-004 moved here, frozentask db:upgrade becomes the only command anyone runs for schema changes. task db:bootstrap handles fresh DBs: apply the legacy SQL once, stamp the Alembic baseline, then upgrade.
Step 1 — add the dependency, scaffold Alembic
In packages/api/pyproject.toml:
dependencies = [
...
"sqlalchemy[asyncio]>=2.0",
"alembic>=1.14",
...
]uv sync to install. Then from the API package directory:
uv run --package clipdex-api alembic init -t async alembicThat creates packages/api/alembic.ini, packages/api/alembic/env.py, packages/api/alembic/script.py.mako, and an empty versions/ directory. The -t async template matters: clipdex’s API uses async SQLAlchemy, and the async template’s env.py already knows how to bridge sync Alembic operations onto an async engine via connection.run_sync().
You can ignore everything in alembic.ini except script_location. The connection URL is going to come from .env, not the ini file.
Step 2 — declarative models that mirror the live schema
This is the trick. We write models.py not as the future shape of the database, but as the current shape — what 001_init.sql through 004_search.sql have already produced. Every column type, every index, every check constraint, every server default.
Excerpt from packages/api/src/clipdex_api/models.py:
"""SQLAlchemy declarative models mirroring the live Postgres schema.
These exist for Alembic's --autogenerate diff. Application query code still
uses raw SQL via psycopg — these models are NOT imported by the request path.
"""
from sqlalchemy import (
BigInteger, CheckConstraint, Column, DateTime, ForeignKey, Index,
Integer, PrimaryKeyConstraint, REAL, Text, text,
)
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase
class Base(DeclarativeBase):
pass
class ProcessedVideo(Base):
__tablename__ = "processed_videos"
video_id = Column(Text, primary_key=True)
title = Column(Text, nullable=False)
published_at = Column(DateTime(timezone=True), nullable=False)
status = Column(Text, nullable=False)
source = Column(Text)
segment_count = Column(Integer)
ingested_at = Column(DateTime(timezone=True))
error = Column(Text)
__table_args__ = (
CheckConstraint(
"status IN ('pending', 'done', 'failed')",
name="processed_videos_status_check",
),
)Five things worth pointing out in that one class:
1. DeclarativeBase (SQLAlchemy 2.x). Not the legacy declarative_base() factory — the new style is a class you subclass. Cleaner, types better with mypy, what every new project should use.
2. text("now()") for server defaults, not Python’s datetime.utcnow. The default is in the database, written by Postgres. If you use a Python callable here, autogenerate will think the live schema is wrong and emit a spurious “drop the server default” migration on first run.
3. Named check constraints. name="processed_videos_status_check" matters because Postgres names check constraints automatically (processed_videos_status_check is the conventional pattern), and autogenerate uses the name to diff. Without the name, autogenerate sees the anonymous constraint in the model and the named one in the DB and emits drop+create.
4. The reserved word text. transcript_segments and quotes_raw have a text column. text is also the SQLAlchemy function we imported. Resolve the conflict like this:
class TranscriptSegment(Base):
__tablename__ = "transcript_segments"
text_ = Column("text", Text, nullable=False) # attr name vs column nameThe Python attribute is text_; the SQL column stays text. Both are correct.
5. The docstring says “NOT imported by the request path.” Important to write down. The temptation in a future PR will be to start using models.ProcessedVideo in queries. That’s fine eventually — but mixing the ORM and raw SQL prematurely makes both worse. Hold the line: models.py is for Alembic, not for query construction.
Step 3 — wire env.py to the DATABASE_URL and to Base.metadata
The alembic init -t async template leaves a placeholder target_metadata = None. We want it pointing at our Base.metadata so autogenerate has something to diff against. And we want the connection URL to come from .env, not from alembic.ini.
packages/api/alembic/env.py, top of file:
import asyncio
import os
from logging.config import fileConfig
from sqlalchemy import pool
from sqlalchemy.engine import Connection
from sqlalchemy.ext.asyncio import async_engine_from_config
from alembic import context
from clipdex_api.models import Base
config = context.config
if config.config_file_name is not None:
fileConfig(config.config_file_name)
# DATABASE_URL is the source of truth. Normalise to +psycopg async driver,
# matching clipdex_api.db.
_db_url = os.getenv("DATABASE_URL", "postgresql://localhost:5432/clipdex")
if _db_url.startswith("postgresql://"):
_db_url = "postgresql+psycopg://" + _db_url[len("postgresql://"):]
config.set_main_option("sqlalchemy.url", _db_url)
target_metadata = Base.metadataAnd in do_run_migrations, two flags worth turning on:
def do_run_migrations(connection: Connection) -> None:
context.configure(
connection=connection,
target_metadata=target_metadata,
compare_type=True,
compare_server_default=True,
)
with context.begin_transaction():
context.run_migrations()compare_type=True makes autogenerate notice column-type changes (e.g., Integer → BigInteger). compare_server_default=True makes it notice changes to server_default. Without these, you can swap a column’s type in models.py and autogenerate will emit an empty revision — a particularly demoralising failure mode.
Step 4 — the baseline revision (the trick)
The schema already exists in the live DB. There’s nothing to upgrade to get there. But Alembic needs a revision in versions/ to be the parent of every future revision. So we create one whose upgrade() is empty:
uv run --package clipdex-api alembic -c packages/api/alembic.ini \
revision -m "baseline (raw SQL migrations 001-004)"Edit the generated file:
"""baseline (raw SQL migrations 001-004)
Revision ID: cad2f1272a8a
Revises:
"""
revision: str = 'cad2f1272a8a'
down_revision = None
def upgrade() -> None:
"""No-op baseline.
The schema this revision represents was produced by migrations/legacy/
001_init.sql through 004_search.sql. Existing DBs are stamped to this
revision (`alembic stamp head`) so future revisions stack on top.
"""
pass
def downgrade() -> None:
"""No inverse for the baseline — drop the database to undo."""
passThen stamp the live DB to declare it caught up to this revision:
DATABASE_URL=postgresql://localhost:5432/clipdex \
uv run --package clipdex-api alembic -c packages/api/alembic.ini stamp headThis creates the alembic_version table and inserts one row: cad2f1272a8a. It runs no DDL. The DB is unchanged. But Alembic now knows where it is.
Verify:
$ psql -d clipdex -X -c "select * from alembic_version;"
version_num
--------------
cad2f1272a8a
$ uv run --package clipdex-api alembic -c packages/api/alembic.ini check
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.ddl.postgresql] Detected sequence named 'guest_aliases_id_seq' as owned by integer column 'guest_aliases(id)', assuming SERIAL and omitting
...
No new upgrade operations detected.No new upgrade operations detected. is the most important line in the whole migration. It means models.py and the live schema agree. If autogenerate sees a diff here, your model is wrong somewhere — fix it before continuing.
Step 5 — retire the legacy SQL and rewire the Taskfile
Move the four legacy files into a migrations/legacy/ directory:
mkdir -p migrations/legacy
git mv migrations/00*.sql migrations/legacy/Add a migrations/legacy/README.md that explains they’re frozen and only used for bootstrap.
The Taskfile changes from:
db:migrate:
cmds:
- psql -d clipdex -f migrations/001_init.sql
- psql -d clipdex -f migrations/002_enrich.sql
- psql -d clipdex -f migrations/003_resolution.sql
- psql -d clipdex -f migrations/004_search.sql…to:
db:bootstrap:
desc: Create the DB if missing, apply legacy SQL, stamp Alembic baseline, then upgrade.
cmds:
- createdb clipdex 2>/dev/null || true
- |
if ! psql -d clipdex -X -tAc \
"SELECT 1 FROM information_schema.tables WHERE table_name='alembic_version'" \
| grep -q 1; then
echo "Bootstrapping fresh DB from migrations/legacy/..."
psql -d clipdex -f migrations/legacy/001_init.sql
psql -d clipdex -f migrations/legacy/002_enrich.sql
psql -d clipdex -f migrations/legacy/003_resolution.sql
psql -d clipdex -f migrations/legacy/004_search.sql
uv run --package clipdex-api alembic -c packages/api/alembic.ini stamp head
fi
- task: db:upgrade
db:upgrade:
cmd: uv run --package clipdex-api alembic -c packages/api/alembic.ini upgrade head
db:revision:
desc: 'Generate a new Alembic revision. Usage: task db:revision -- "short message"'
cmd: uv run --package clipdex-api alembic -c packages/api/alembic.ini revision --autogenerate -m "{{.CLI_ARGS}}"
db:current:
cmd: uv run --package clipdex-api alembic -c packages/api/alembic.ini currentThe conditional in db:bootstrap is the trick that makes it idempotent: if alembic_version exists, the legacy SQL doesn’t run again — only alembic upgrade head does. A fresh createdb hits the first branch once and then never again.
The materialized view caveat
Alembic’s autogenerate does not track materialized views. The transcript_segments_search view from 004_search.sql is invisible to the metadata diff. We can’t put it in models.py and have autogenerate manage it.
So:
-
The legacy file stays as the source of the view’s original definition.
-
Future changes to the view live in hand-written revisions —
alembic revision -m "..."without--autogenerate, then add the DDL inop.execute(...):def upgrade(): op.execute("DROP MATERIALIZED VIEW IF EXISTS transcript_segments_search") op.execute(""" CREATE MATERIALIZED VIEW transcript_segments_search AS SELECT ... """) op.execute("CREATE UNIQUE INDEX ... ON transcript_segments_search (...)")
This is documented in MIGRATIONS.md under the “Materialized views” recipe. Anything that needs hand-written DDL — views, extensions, custom functions, CREATE INDEX CONCURRENTLY — gets the same treatment: skip --autogenerate, write the op.execute calls by hand, commit.
The cost of accepting this caveat is small. Materialized views change rarely; when they do, the hand-written DDL is the same DDL you’d write anyway. The benefit — autogenerate for the 95% of changes that are tables, columns, indexes — is enormous.
What the daily flow looks like now
Adding a column to processed_videos:
# 1. Edit models.py — add the column to ProcessedVideo
vim packages/api/src/clipdex_api/models.py
# 2. Autogenerate the revision
task db:revision -- "add view_count to processed_videos"
# -> Generating .../alembic/versions/<hash>_add_view_count_to_processed_videos.py
# 3. Review the generated file. Edit if it got something wrong (e.g., a rename
# came out as drop+add).
# 4. Apply
task db:upgrade
# 5. Confirm
task db:current
# -> <new hash> (head)Five commands, one of which is opening an editor. The four cracks from last post are all closed:
- Applied-state tracking:
alembic_versiontable. - Ordering across branches: revision hashes +
down_revisionpointers; merge conflicts surface attask db:upgradetime, not silently in production. - Downgrade path:
downgrade()exists in every revision file. Empty by default; fill in when the change is one you might want to roll back. - Autogenerate diff:
task db:revisionwrites the migration for you, catching the missed index / missed NOT NULL / forgotten foreign key class of bug at edit time.
What didn’t change
Worth being explicit about. The query layer didn’t move. packages/api/src/clipdex_api/db.py still creates a psycopg-backed async engine. search.py, guests.py, questions.py, stats.py still execute raw SQL strings. The 25 tests pass without modification.
This is the unlock that made the migration ninety minutes instead of two weeks. Alembic doesn’t require you to commit to the SQLAlchemy ORM. It requires metadata. A small models.py that exists only for target_metadata = Base.metadata is enough.
If clipdex eventually moves the query layer to SQLAlchemy Core or the ORM — for type-safe joins, for repository patterns, for whatever — those models are already there to build on. Or not. Both futures stay open.
What I’d do differently next time
Two small things, both about the baseline.
Port the legacy SQL into the baseline’s upgrade(). I chose not to, because keeping the original blog posts link-able to the literal .sql files mattered for the series. For any project that’s not a teaching artifact, porting the bootstrap into Alembic’s op.create_table(...) calls means a fresh DB needs only alembic upgrade head — no special bootstrap branch in the Taskfile. Simpler.
Generate the baseline from the live DB. There’s an alembic revision --autogenerate you can run against an empty DB with the models pointed at metadata, which will generate a create_all baseline for you. I wrote the models by hand because I wanted them to mirror the legacy SQL exactly, line-by-line, and to catch any drift between what the legacy SQL said and what the live DB actually had. (There was none, in the end. But the verification was worth it.)
Neither of these is a fix. They’re the trade-offs I’d make differently in a non-teaching codebase.
Next up in the series: the first real Alembic revision. A new feature that adds a table, a column, an index, and tests the autogenerate path end-to-end. That’s where the bet pays off — or doesn’t.
Building an AI Podcast Index
11 parts in this series.
An eight-part build-along: a locally-running tool that ingests a YouTube podcast channel, extracts guests and topics, lets you clip-search by intent, and generates questions for future episodes — using uv, FastAPI, Vite + React, and a provider-switchable LLM client.
- 01 Building an AI Podcast Index: the Project, the Stack, and What You'll Have at the End
- 02 Ingesting YouTube transcripts: yt-dlp for subs, Whisper when subs don't exist
- 03 Structured extraction with Pydantic + Claude: guests, topics, and quotes from raw transcripts
- 04 Entity resolution for guests: fuzzy matching first, LLM disambiguation second
- 05 Building a provider-switching LLM client: one interface, three providers, task-tier routing
- 06 Search without embeddings: Postgres tsvector, LLM rerank, and 30-second clips
- 07 The React side: guest pages, search UI, and codegen'd types
- 08 The question generator, the cron job, and shipping it locally
- 09 Raw SQL migrations: when they're enough, and the four cracks that force Alembic previous
- 10 Adopting Alembic in clipdex without rewriting the query layer ← you are here
- 11 The first real Alembic revision: a column, a backfill, and the parts autogenerate can't do up next
What did you take away?
Thoughts, pushback, or a story of your own? Drop a reply below — I read every one.
Comments are powered by Disqus. By posting you agree to their terms.