t-470 - omni

t-470 · WorkTask

Created 1 month ago · Updated 4 weeks ago

Description

Overview

Build a local document search system for ~/notes and ~/org that supports hybrid search (BM25 + vector + optional LLM rerank). This will enable both interactive CLI search and programmatic access as an agentd tool.

Reference Implementation

qmd by Tobi Lütke: https://github.com/tobi/qmd

Key ideas from qmd:

BM25 (keyword) + vector (semantic) hybrid search
LLM re-ranking for final result quality
Local-first via ollama for embeddings/rerank
Chunking by markdown sections
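
The ticket does not say how the BM25 and vector rankings would be merged. One common option is reciprocal rank fusion (RRF), sketched below. This is an assumption for illustration and not necessarily what qmd does; the function name, the chunk-id format, and the constant k=60 are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk, so a chunk that
    appears near the top of several lists ends up first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical result lists from the two retrievers.
bm25_hits = ["notes/a.md#2", "notes/b.md#1", "org/c.org#4"]
vector_hits = ["org/c.org#4", "notes/a.md#2", "notes/d.md#3"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A weighted score sum is an alternative if tunable search weights are wanted.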

Proposed Architecture

Instead of copying qmd directly, build a Python implementation that:

1. Uses pandoc for parsing - handles both markdown and orgmode natively via JSON AST
2. Chunking by heading - split pandoc AST on Header nodes, preserving hierarchy
3. SQLite FTS5 for BM25 - fast, embedded, no external deps
4. Ollama for embeddings - local vector embeddings (nomic-embed-text or similar)
5. Optional LLM rerank - can add later if BM25+vector isn't sufficient
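
A minimal sketch of points 1 and 2, assuming pandoc is on PATH and infers the input format from the file extension. The helper names (`load_ast`, `inline_text`, `chunk_by_heading`) and the chunk dict layout are hypothetical. The text flattening is deliberately lossy (lists, tables, and most inline types are ignored), and no attempt is made to recover line numbers.

```python
import json
import subprocess


def load_ast(path):
    """Run `pandoc -t json` on a file and return the parsed AST."""
    result = subprocess.run(
        ["pandoc", "-t", "json", path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def inline_text(inlines):
    """Flatten a list of pandoc inline nodes to plain text (lossy)."""
    parts = []
    for node in inlines:
        kind, content = node["t"], node.get("c")
        if kind == "Str":
            parts.append(content)
        elif kind in ("Space", "SoftBreak", "LineBreak"):
            parts.append(" ")
        elif kind == "Code":
            parts.append(content[1])            # [attr, text]
        elif kind in ("Emph", "Strong"):
            parts.append(inline_text(content))  # list of inlines
        elif kind == "Link":
            parts.append(inline_text(content[1]))  # [attr, inlines, target]
    return "".join(parts)


def chunk_by_heading(ast):
    """Split top-level blocks on Header nodes, tracking the heading path."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": list(path), "text": text})
        body.clear()

    for block in ast["blocks"]:
        kind, content = block["t"], block.get("c")
        if kind == "Header":
            flush()
            level, _attr, inlines = content
            # Keep ancestors above this level, then append this heading.
            path[:] = path[: level - 1] + [inline_text(inlines)]
        elif kind in ("Para", "Plain"):
            body.append(inline_text(content))
        elif kind == "CodeBlock":
            body.append(content[1])             # [attr, text]
    flush()
    return chunks


# Hand-written AST fragment so the example runs without pandoc installed.
sample_ast = {"blocks": [
    {"t": "Header", "c": [1, ["agentd", [], []], [{"t": "Str", "c": "Agentd"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "Tool"}, {"t": "Space"},
                        {"t": "Str", "c": "host."}]},
    {"t": "Header", "c": [2, ["arch", [], []], [{"t": "Str", "c": "Architecture"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "Details."}]},
]}
chunks = chunk_by_heading(sample_ast)
```

For a real file the call would be `chunk_by_heading(load_ast(path))`.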

Implementation Steps

1. Walk ~/notes and ~/org, parse each file with pandoc -t json
2. Chunk by heading/section (pandoc AST makes this straightforward)
3. Store chunks in SQLite with FTS5 for BM25, plus a vector table for embeddings
4. CLI interface: notes-search "query" returns ranked results with file:line references
5. Agentd tool endpoint for programmatic access
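
Step 3 could look roughly like the following, assuming the Python sqlite3 build includes FTS5. The table name `chunks_fts`, its columns, and the helper functions are hypothetical, and the vector table is left out. FTS5's `rank` column defaults to its BM25 score, where lower values mean better matches.

```python
import sqlite3


def open_index(db_path=":memory:"):
    """Open (or create) the index with one FTS5 table for chunk text."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5("
        "text, path UNINDEXED, heading UNINDEXED)"
    )
    return conn


def add_chunk(conn, path, heading, text):
    conn.execute(
        "INSERT INTO chunks_fts (text, path, heading) VALUES (?, ?, ?)",
        (text, path, heading),
    )


def search(conn, query, limit=10):
    """Return (path, heading, score) rows, best match first."""
    # Quote each term so punctuation is not parsed as FTS5 query syntax.
    terms = " ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    if not terms:
        return []
    rows = conn.execute(
        "SELECT path, heading, rank FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?",
        (terms, limit),
    )
    return rows.fetchall()


conn = open_index()
add_chunk(conn, "notes/agentd.md", "Agentd > Architecture",
          "The agentd architecture uses a tool registry.")
add_chunk(conn, "notes/shopping.md", "Groceries", "Eggs, milk, bread.")
hits = search(conn, "agentd architecture")
```

Space-separated terms are combined with an implicit AND here, so every query word has to appear in a chunk for it to match.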

Key Design Decisions

Format agnostic: pandoc IR means we can add more formats later (rst, asciidoc, etc)
Incremental updates: track file mtimes, only re-index changed files
Chunk metadata: preserve source file, heading path, line numbers for navigation
Configurable: allow tuning of chunk size, overlap, search weights
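
The incremental-update idea could be sketched as follows. The `files` table, the `.md`/`.org` suffix filter, and both function names are assumptions; the sketch stores nanosecond mtimes as integers and does not handle deleted files.

```python
import os
import sqlite3

NOTE_SUFFIXES = (".md", ".org")  # assumed set of indexed extensions


def changed_files(conn, roots):
    """Return (path, mtime_ns) for files that are new or modified."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, mtime_ns INTEGER)"
    )
    seen = dict(conn.execute("SELECT path, mtime_ns FROM files"))
    changed = []
    for root in roots:
        for dirpath, _dirs, names in os.walk(os.path.expanduser(root)):
            for name in sorted(names):
                if not name.endswith(NOTE_SUFFIXES):
                    continue
                path = os.path.join(dirpath, name)
                mtime_ns = os.stat(path).st_mtime_ns
                if seen.get(path) != mtime_ns:
                    changed.append((path, mtime_ns))
    return changed


def mark_indexed(conn, path, mtime_ns):
    """Record that a file was indexed at the given mtime."""
    conn.execute(
        "INSERT OR REPLACE INTO files (path, mtime_ns) VALUES (?, ?)",
        (path, mtime_ns),
    )
```

A reindex pass would then be: for each entry from `changed_files(conn, ["~/notes", "~/org"])`, drop that file's old chunks, re-chunk it, and call `mark_indexed`.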

Files to Create

Suggested location: Omni/Notes/ or Omni/Search/

Omni/Notes/Search.py - core indexing and search logic
Omni/Notes/Cli.py - CLI interface
Omni/Notes/Tool.py - agentd tool wrapper (optional, can add later)

Libraries

subprocess for pandoc calls
sqlite3 for FTS5
ollama python client for embeddings
consider numpy for vector similarity if not using sqlite-vec

Open Questions

Should we use sqlite-vec extension for vector search, or just numpy cosine similarity?
Chunk overlap strategy: fixed overlap vs semantic boundaries?
How to handle very large files (>1000 lines)?
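
To help weigh the first question, here is a minimal sketch of the numpy option: brute-force cosine similarity over all chunk embeddings held in memory. The function name and the tiny 2-dimensional vectors are hypothetical; real embeddings would come from the Ollama model.

```python
import numpy as np


def top_k_cosine(query_vec, chunk_matrix, k=5):
    """Return (row_index, similarity) for the k most similar chunks."""
    q = np.asarray(query_vec, dtype=np.float32)
    m = np.asarray(chunk_matrix, dtype=np.float32)
    # Normalize so the dot product equals cosine similarity.
    q = q / (np.linalg.norm(q) + 1e-12)
    m = m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)
    sims = m @ q
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]


# Toy vectors standing in for chunk embeddings.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k_cosine([1.0, 0.0], chunk_vectors, k=3)
```

This scans every chunk per query, which is simple and has no extension to install, but cost grows linearly with the number of chunks. Whether that is acceptable for ~/notes and ~/org is exactly what the open question would need to measure.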

Success Criteria

notes-search "agentd architecture" returns relevant results from ~/notes
Orgmode files in ~/org are searchable
Index refresh takes <30s for incremental updates
Can be called as an agentd tool for Ava to search notes during conversations

Timeline (5)

💬 [human] 4 weeks ago

Phase 1 complete: BM25 search working via notes-search CLI

Implemented:

Pandoc parsing for md/org files
Chunking by headings
SQLite FTS5 for keyword search
CLI with --json and --reindex options
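
For readers without the code at hand, a CLI with these two options might be wired up roughly like this. It is a sketch under assumptions, not the actual Omni/Notes/Cli.py: the positional `query` argument, the result dict keys, and the plain-text output format are hypothetical.

```python
import argparse
import json


def build_parser():
    parser = argparse.ArgumentParser(prog="notes-search")
    parser.add_argument("query", nargs="?", help="search terms")
    parser.add_argument("--json", action="store_true",
                        help="print results as JSON")
    parser.add_argument("--reindex", action="store_true",
                        help="rebuild the index before searching")
    return parser


def format_results(results, as_json):
    """Render results either as JSON or as one 'path: heading' line each."""
    if as_json:
        return json.dumps(results, indent=2)
    return "\n".join(f"{r['path']}: {r['heading']}" for r in results)


args = build_parser().parse_args(["agentd architecture", "--json"])
```

Making `query` optional lets `notes-search --reindex` run on its own without a search.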

Next: Phase 2 (vector embeddings via ollama)