Build local notes search with hybrid BM25+vector

t-470 · WorkTask · Created 1 month ago · Updated 4 weeks ago

Description

Overview

Build a local document search system for ~/notes and ~/org that supports hybrid search (BM25 + vector + optional LLM rerank). This will enable both interactive CLI search and programmatic access as an agentd tool.

Reference Implementation

qmd by Tobi Lütke: https://github.com/tobi/qmd

Key ideas from qmd:

  • BM25 (keyword) + vector (semantic) hybrid search
  • LLM re-ranking for final result quality
  • Local-first via ollama for embeddings/rerank
  • Chunking by markdown sections

Proposed Architecture

Instead of copying qmd directly, build a Python implementation that:

1. Uses pandoc for parsing - handles both markdown and orgmode natively via JSON AST
2. Chunking by heading - split pandoc AST on Header nodes, preserving hierarchy
3. SQLite FTS5 for BM25 - fast, embedded, no external deps
4. Ollama for embeddings - local vector embeddings (nomic-embed-text or similar)
5. Optional LLM rerank - can add later if BM25+vector isn't sufficient
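
A minimal sketch of items 1 and 2, assuming pandoc is on PATH and infers the input format from the file extension. The names load_ast, inline_text and chunk_by_heading are hypothetical, not taken from the actual implementation. Line numbers are not recovered here, and a heading with no body text produces no chunk.

```python
import json
import subprocess


def load_ast(path):
    """Parse a markdown or org file into pandoc's JSON AST (needs pandoc installed)."""
    result = subprocess.run(
        ["pandoc", "-t", "json", path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def inline_text(node):
    """Flatten a pandoc AST node (or list of nodes) to plain text."""
    if isinstance(node, list):
        return "".join(inline_text(n) for n in node)
    if isinstance(node, dict):
        kind = node.get("t")
        if kind == "Str":
            return node["c"]
        if kind in ("Space", "SoftBreak", "LineBreak"):
            return " "
        if kind in ("Code", "CodeBlock", "Math"):
            return node["c"][1]  # the text is the second field of these nodes
        return inline_text(node.get("c", []))
    return ""  # bare strings and ints inside attributes are ignored


def chunk_by_heading(ast):
    """Split top-level blocks on Header nodes, tracking the heading hierarchy."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(t for t in body if t.strip())
        if text:
            chunks.append({"heading_path": list(path), "text": text})
        body.clear()

    for block in ast["blocks"]:
        if block["t"] == "Header":
            flush()
            level, _attr, inlines = block["c"]
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(inline_text(inlines))
        else:
            body.append(inline_text(block))
    flush()
    return chunks
```

Usage would be chunk_by_heading(load_ast("some-note.org")), where the path is a placeholder. Each chunk carries its heading path, e.g. ["Agentd", "Design"], so results can show where in the file they came from.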

Implementation Steps

1. Walk ~/notes and ~/org, parse each file with pandoc -t json
2. Chunk by heading/section (pandoc AST makes this straightforward)
3. Store chunks in SQLite with FTS5 for BM25, plus a vector table for embeddings
4. CLI interface: notes-search "query" returns ranked results with file:line references
5. Agentd tool endpoint for programmatic access
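
The keyword half of step 3 could look roughly like this. It is a sketch only: the table name chunks_fts, its columns, and the function names are assumptions, and it needs an SQLite build with FTS5 enabled. The vector table is left out.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the index; table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5("
        "body, heading_path, source UNINDEXED, line UNINDEXED)"
    )
    return conn


def add_chunk(conn, source, line, heading_path, body):
    """Store one chunk; source and line are kept for display but not searched."""
    conn.execute(
        "INSERT INTO chunks_fts (body, heading_path, source, line) "
        "VALUES (?, ?, ?, ?)",
        (body, heading_path, source, line),
    )


def search_bm25(conn, query, limit=10):
    """Return (source, line, heading_path, rank) rows, best match first.

    FTS5's built-in rank column is BM25-based by default, and lower values
    are better matches. The query is passed through as FTS5 query syntax,
    so punctuation in user input may need quoting first.
    """
    return conn.execute(
        "SELECT source, line, heading_path, rank FROM chunks_fts "
        "WHERE chunks_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

A bare multi-word query such as "agentd architecture" is treated by FTS5 as an implicit AND of its terms.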

Key Design Decisions

  • Format agnostic: pandoc IR means we can add more formats later (rst, asciidoc, etc)
  • Incremental updates: track file mtimes, only re-index changed files
  • Chunk metadata: preserve source file, heading path, line numbers for navigation
  • Configurable: allow tuning of chunk size, overlap, search weights
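
The incremental-update decision can be sketched as an mtime comparison. This assumes the index keeps a mapping of file path to the mtime recorded at last indexing; find_stale and EXTENSIONS are hypothetical names, and deleted files are not handled here.

```python
from pathlib import Path

EXTENSIONS = {".md", ".org"}  # assumed set of indexed file types


def find_stale(roots, indexed_mtimes):
    """Return (path, mtime) for files that are new or changed since indexing.

    indexed_mtimes maps path strings to the mtime stored at last index time.
    """
    stale = []
    for root in roots:
        for path in sorted(Path(root).expanduser().rglob("*")):
            if path.suffix not in EXTENSIONS or not path.is_file():
                continue
            mtime = path.stat().st_mtime
            # "!=" rather than ">" so a file restored from an older copy
            # is also picked up.
            if indexed_mtimes.get(str(path)) != mtime:
                stale.append((str(path), mtime))
    return stale
```

Calling find_stale(["~/notes", "~/org"], {}) on an empty index returns every markdown and org file; after indexing, only files whose mtime differs from the stored value come back.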

Files to Create

Suggested location: Omni/Notes/ or Omni/Search/

  • Omni/Notes/Search.py - core indexing and search logic
  • Omni/Notes/Cli.py - CLI interface
  • Omni/Notes/Tool.py - agentd tool wrapper (optional, can add later)

Libraries

  • subprocess for pandoc calls
  • sqlite3 for FTS5
  • ollama python client for embeddings
  • consider numpy for vector similarity if not using sqlite-vec

Open Questions

  • Should we use sqlite-vec extension for vector search, or just numpy cosine similarity?
  • Chunk overlap strategy: fixed overlap vs semantic boundaries?
  • How to handle very large files (>1000 lines)?
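
For the first question, the brute-force option is small enough to sketch. This version uses only the standard library so it runs anywhere; with numpy the same arithmetic becomes one matrix-vector product over normalized embeddings. The names cosine and vector_search and the dict-of-vectors layout are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def vector_search(query_vec, chunk_vecs, limit=10):
    """Score every chunk against the query and return the top (id, score) pairs.

    chunk_vecs maps a chunk id to its embedding. Cost is linear in the number
    of chunks, which is the trade-off against an indexed extension.
    """
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```

Whether a linear scan is fast enough depends on how many chunks ~/notes and ~/org produce; timing this baseline on the real index is one way to decide whether sqlite-vec is worth the extra dependency.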

Success Criteria

  • notes-search "agentd architecture" returns relevant results from ~/notes
  • Orgmode files in ~/org are searchable
  • Index refresh takes <30s for incremental updates
  • Can be called as an agentd tool for Ava to search notes during conversations

Timeline (5)

💬 [human] · 4 weeks ago

Phase 1 complete: BM25 search working via notes-search CLI

Implemented:

  • Pandoc parsing for md/org files
  • Chunking by headings
  • SQLite FTS5 for keyword search
  • CLI with --json and --reindex options

Next: Phase 2 (vector embeddings via ollama)

💬 [human] · 4 weeks ago

Phase 2 complete: Vector embeddings and hybrid search

Implemented:

  • Embeddings via ollama nomic-embed-text (768 dims)
  • Cosine similarity vector search
  • RRF fusion for hybrid BM25+vector
  • CLI: --embed, --vector, --hybrid, --bm25
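
For reference, RRF (reciprocal rank fusion) merges ranked lists by rank position alone, so BM25 scores and cosine similarities never need to share a scale. A minimal sketch follows; the function name rrf_fuse is hypothetical, and k=60 is the conventional default rather than a value confirmed for this implementation.

```python
def rrf_fuse(rankings, k=60, limit=10):
    """Fuse ranked lists of chunk ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to every id it contains, with rank
    starting at 1. Ids that rank well in several lists rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)[:limit]
```

With a BM25 list ["a", "b", "c"] and a vector list ["c", "a", "d"], the fused order is a, c, b, d: "a" and "c" appear in both lists, and "a" holds the better pair of ranks.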

Both phases now complete. Remaining optional enhancements:

  • LLM reranking
  • Agentd tool integration
  • Batch embedding for performance
🔄 [human] Open → Done · 4 weeks ago