Phase 3: Embeddings and clustering

t-488.3 · WorkTask · newsreader.hs
Parent: t-488 · Created 3 weeks ago · Updated yesterday

Description

LLM-powered topic detection.

  • Call the ollama API for embeddings (nomic-embed-text); see the sketch after this list
  • Store embeddings in SQLite as BLOBs
  • Cosine similarity for article comparison
  • Hierarchical or k-means clustering for topic detection
  • Auto-generate topic names (optional: use LLM to name clusters)
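
A minimal sketch of the first two bullets, assuming a local ollama server on its default port (11434) and the http-conduit and aeson libraries; fetchEmbedding and EmbedResponse are illustrative names, not code from this repo:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (FromJSON (..), object, withObject, (.:), (.=))
import qualified Data.Text as T
import Network.HTTP.Simple
  (getResponseBody, httpJSON, parseRequest, setRequestBodyJSON)

-- ollama's /api/embeddings endpoint returns {"embedding": [..]};
-- nomic-embed-text yields a 768-dimensional vector.
newtype EmbedResponse = EmbedResponse {embedding :: [Double]}

instance FromJSON EmbedResponse where
  parseJSON = withObject "EmbedResponse" $ \o ->
    EmbedResponse <$> o .: "embedding"

-- Hypothetical helper: POST the article text, decode the vector.
fetchEmbedding :: T.Text -> IO [Double]
fetchEmbedding text = do
  req <- parseRequest "POST http://localhost:11434/api/embeddings"
  let payload =
        object ["model" .= ("nomic-embed-text" :: T.Text), "prompt" .= text]
  resp <- httpJSON (setRequestBodyJSON payload req)
  pure (embedding (getResponseBody resp))
```

Ingest would call this once per article and hand the resulting vector to the BLOB storage helpers.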

Timeline (13)

💬 [human] · 2 days ago

Current clustering groups articles by a single keyword only. Needs proper multi-word topic extraction (TF-IDF, LLM summarization, or embedding-based clustering). The infrastructure exists in Omni/Newsreader/Cluster.hs.

🔄 [human] Open → InProgress · 2 days ago
🔄 [human] InProgress → Review · 2 days ago
💬 [human] · 2 days ago

Replaced single-word clustering with an n-gram (bigram/trigram) approach using TF-IDF weighting and greedy cluster deduplication. Filters out boring common words. Commit: e4117ae2
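
For context, a minimal sketch of the TF-IDF weighting described here, operating on pre-extracted n-grams; names and types are illustrative, not the actual Cluster.hs code:

```haskell
import qualified Data.Map.Strict as Map

type NGram = String

-- Score each document's n-grams by TF-IDF: term frequency times
-- log(total docs / docs containing the term). Rare multi-word
-- phrases score high; boring common words score near zero.
tfidf :: [[NGram]] -> [Map.Map NGram Double]
tfidf docs = map score docs
  where
    total = fromIntegral (length docs)
    -- Document frequency: how many documents contain each n-gram.
    df = Map.unionsWith (+)
      [Map.fromList [(g, 1 :: Double) | g <- dedup d] | d <- docs]
    dedup = Map.keys . Map.fromList . map (\g -> (g, ()))
    score d =
      let tf = Map.fromListWith (+) [(g, 1 :: Double) | g <- d]
       in Map.mapWithKey
            (\g n -> n * log (total / Map.findWithDefault 1 g df)) tf
```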

🔄 [human] Review → Open · yesterday
💬 [human] · yesterday

Reopening: the previous implementation used n-gram TF-IDF instead of actual embeddings. This time we want real embedding-based clustering:

  1) Generate embeddings on article ingest using ollama (nomic-embed-text); store them in the existing embedding_vector column.
  2) Use cosine similarity on embedding vectors for clustering, not TF-IDF n-grams.
  3) The current n-gram approach produces poor cluster labels ('Breakfast Cereal', 'Day Kinks'); embeddings should give semantically meaningful groupings.

The embedding storage plumbing already exists (updateArticleEmbedding, vectorToBlob/blobToVector in Article.hs); it just needs to be wired into ingest and used in Cluster.hs.
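
As a rough illustration of the storage plumbing mentioned above, one plausible shape for the BLOB round-trip, assuming little-endian Float64 packing; the real vectorToBlob/blobToVector in Article.hs may encode differently:

```haskell
import Control.Monad (replicateM)
import qualified Data.Binary.Get as Get
import qualified Data.ByteString as BS
import Data.ByteString.Builder (doubleLE, toLazyByteString)
import qualified Data.ByteString.Lazy as BL

-- Pack a vector as consecutive little-endian 64-bit floats, suitable
-- for a SQLite BLOB column. (Assumed encoding; Article.hs may differ.)
vectorToBlob :: [Double] -> BS.ByteString
vectorToBlob = BL.toStrict . toLazyByteString . foldMap doubleLE

-- Inverse: read 8 bytes per element back out of the BLOB.
blobToVector :: BS.ByteString -> [Double]
blobToVector bs =
  Get.runGet
    (replicateM (BS.length bs `div` 8) Get.getDoublele)
    (BL.fromStrict bs)
```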

🔄 [human] Open → InProgress · yesterday
🔄 [human] InProgress → Review · yesterday
💬 [human] · yesterday

Implemented embedding-based clustering using ollama nomic-embed-text (768-dim). Cosine similarity threshold 0.55 with greedy neighborhood selection. Backfilled 4231 articles; the Topics page responds in ~2s. Falls back to the n-gram approach when embeddings are unavailable. Commit: 0a9b5434
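
A self-contained sketch of cosine similarity plus the greedy neighborhood selection described above, at a fixed threshold; this is an illustration under assumed types, not the actual Cluster.hs implementation:

```haskell
import Data.List (partition)
import qualified Data.Vector.Unboxed as VU

cosineSim :: VU.Vector Double -> VU.Vector Double -> Double
cosineSim a b
  | na == 0 || nb == 0 = 0
  | otherwise = VU.sum (VU.zipWith (*) a b) / (na * nb)
  where
    na = sqrt (VU.sum (VU.map (^ (2 :: Int)) a))
    nb = sqrt (VU.sum (VU.map (^ (2 :: Int)) b))

-- Greedy neighborhood selection: take the first unclustered article,
-- pull in every remaining article whose similarity meets the
-- threshold (0.55 above), and recurse on what is left.
greedyClusters :: Double -> [(Int, VU.Vector Double)] -> [[Int]]
greedyClusters _ [] = []
greedyClusters thr ((i, v) : rest) =
  (i : map fst near) : greedyClusters thr far
  where
    (near, far) = partition (\(_, w) -> cosineSim v w >= thr) rest
```

The threshold trades cluster purity against coverage: raising it yields smaller, tighter topics; lowering it merges related stories at the cost of occasional mismatches.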