Benchmark memory system against LoCoMo eval

t-602 · WorkTask · Omni/Agent/Memory.hs
Created 2 months ago · Updated 1 week ago

Description

Goal

Benchmark Ava's memory system against the LoCoMo (Long-term Conversational Memory) eval to measure how well it performs on long-term conversational memory tasks compared to state of the art.

Background

LoCoMo (Snap Research, ACL 2024) is a benchmark for evaluating very long-term conversational memory in LLM agents. It consists of 10 annotated conversations, each running to hundreds of turns across many sessions (the paper reports about 300 turns on average over up to 35 sessions). Tasks include question answering, event summarization, and multimodal dialogue generation.

Key reference scores:

  • Letta Filesystem (simple file read/write): 74.0%
  • Mem0 (specialized memory system): ~68.5%
  • Backboard IO (best public score as of late 2025): claims to beat these
  • MemMachine v0.2: claims top scores as of Dec 2025

References

  • LoCoMo paper: https://arxiv.org/pdf/2402.17753
  • LoCoMo project page: https://snap-research.github.io/locomo/
  • LoCoMo GitHub (dataset + eval code): https://github.com/snap-research/locomo
  • EasyLocomo (simplified eval framework): https://github.com/playeriv65/EasyLocomo
  • Letta benchmark blog post (filesystem baseline): https://www.letta.com/blog/benchmarking-ai-agent-memory
  • Zep's LoCoMo harness: https://github.com/getzep/zep/tree/main/benchmarks/locomo

Future benchmarks to consider

  • MemoryAgentBench (ICLR 2026): https://github.com/HUST-AI-HYZ/MemoryAgentBench - Tests retrieval, test-time learning, knowledge updating, refusal of outdated info. Good fit for our graph-based memory with supersedes/contradicts links.
  • MemoryBench: https://openreview.net/forum?id=wU4Tjlzg3h - Tests memory + continual learning including user feedback.
  • MemBench (ACL Findings 2025): Tests factual + reflective memory.

Approach

  1. Clone the LoCoMo dataset from GitHub.
  2. Write an adapter that feeds LoCoMo conversation sessions into Ava's memory system (remember, recall, link_memories, query_graph).
  3. Run the QA eval: for each test question, use Ava's recall/query_graph tools to retrieve relevant memories, then answer.
  4. Score using LoCoMo's provided evaluation metrics.
  5. Compare against published baselines.
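The ingest and QA steps above can be sketched roughly as follows. This is a minimal illustration, not the adapter itself. It assumes each LoCoMo sample is a dict with a "conversation" mapping (keys such as session_1 and session_1_date_time, each session a list of turns with "speaker" and "text") and a "qa" list with "question", "answer" and "category". That layout should be checked against the released JSON before relying on it. InMemoryStore is a hypothetical stand-in for Ava's remember/recall tools, and answer_fn is a placeholder for whatever model call produces the answer.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List

_SESSION = re.compile(r"session_(\d+)")


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for Ava's memory tools (remember/recall only)."""
    items: List[str] = field(default_factory=list)

    def remember(self, text: str) -> None:
        self.items.append(text)

    def recall(self, query: str, k: int = 3) -> List[str]:
        # Naive word-overlap ranking; the real system uses semantic search.
        q = set(query.lower().split())
        ranked = sorted(
            self.items,
            key=lambda t: len(q & set(t.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def session_keys(conversation: Dict) -> List[str]:
    """Return 'session_N' keys in numeric order, skipping date/time keys."""
    numbered = []
    for key in conversation:
        match = _SESSION.fullmatch(key)
        if match:
            numbered.append((int(match.group(1)), key))
    return [key for _, key in sorted(numbered)]


def ingest(sample: Dict, store: InMemoryStore) -> int:
    """Feed every turn of every session into the store; return the turn count."""
    conv = sample["conversation"]
    count = 0
    for key in session_keys(conv):
        when = conv.get(f"{key}_date_time", "unknown date")
        for turn in conv[key]:
            store.remember(f"[{when}] {turn['speaker']}: {turn['text']}")
            count += 1
    return count


def answer_questions(
    sample: Dict,
    store: InMemoryStore,
    answer_fn: Callable[[str, List[str]], str],
) -> List[Dict]:
    """Retrieve context for each question and record prediction next to gold."""
    results = []
    for qa in sample["qa"]:
        context = store.recall(qa["question"])
        results.append({
            "question": qa["question"],
            "gold": str(qa["answer"]),
            "prediction": answer_fn(qa["question"], context),
            "category": qa.get("category"),
        })
    return results
```

Keeping the store behind a two-method interface means the same loop can later be pointed at the real remember/recall tools, or at a no-memory baseline, without touching the scoring code.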

EasyLocomo may simplify the eval harness setup. Zep's harness is another reference implementation.
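Before wiring up either harness, a local scorer is handy for smoke-testing the adapter. The sketch below is a SQuAD-style token-level F1, which is an assumption about the general shape of the QA metric and not a copy of LoCoMo's scorer. Reportable numbers should come from the official eval code, and it is worth checking which metric each published baseline actually reports before comparing.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Token F1 rewards partial matches, so a verbose but correct answer scores below a terse one. That matters when comparing a chatty agent against a baseline tuned for short answers.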

Key question

Our memory system uses semantic search + knowledge graph links. This is more sophisticated than Letta's filesystem approach but simpler than some vector DB solutions. The interesting question is whether the graph structure (especially contradiction/supersession tracking) gives us an edge on the harder questions that require reasoning over evolving facts.
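One way to test this is an ablation: run the QA eval once with superseded memories filtered out of the retrieved context and once without, then compare accuracy per question category. The sketch below shows both pieces under stated assumptions. Memories are modelled as an id-to-text dict, supersedes links as (newer_id, older_id) pairs, and each result carries a "category" and a boolean "correct". None of these shapes come from Memory.hs; they are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple


def current_memories(
    memories: Dict[str, str],
    supersedes: List[Tuple[str, str]],
) -> Dict[str, str]:
    """Drop every memory that is the target (older side) of a supersedes link."""
    outdated: Set[str] = {older for _newer, older in supersedes}
    return {mid: text for mid, text in memories.items() if mid not in outdated}


def accuracy_by_category(results: List[Dict]) -> Dict[Hashable, float]:
    """Fraction of correct answers within each question category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [hits, count]
    for result in results:
        bucket = totals[result["category"]]
        bucket[0] += int(result["correct"])
        bucket[1] += 1
    return {cat: hits / count for cat, (hits, count) in totals.items()}
```

If the graph helps, the gap between the two runs should show up mainly in the categories that depend on evolving facts, with little change elsewhere.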

Timeline (17)

🔄 [system] Open → InProgress · 1 month ago
💬 [system] · 1 month ago

Pipeline: dev completed (run=dev-t-602-1771511912, cost=0.0c)

🔄 [system] InProgress → Open · 1 month ago
💬 [system] · 1 month ago

Pipeline: verification failed: Build failed for Omni/Agent/Memory.hs (exit 1):

these 15 derivations will be built:
  /nix/store/0m5fa2krxa2d7m1rd67xnplb90yj9vbw-hs-mod-Omni_Agent_Prompt_IR.drv
  /nix/store/wrciq3ha3bcvsvdjjjphm4ispziykj2k-hs-mod-Omni_Agent_Trace.drv
  /nix/store/ppsjkiss9gjb8xvsmalw6lfbk98ajbjk-hs-mod-Omni_Agent_Op.drv
  /nix/store/kkmxaw0ks2y1ndih75clcsjsq4wc7dy9-hs-mod-Omni_Agent_Models.drv
  /nix/store/rvrifzh4ra6glx7w7p4znb7z0pgjbwi5-hs-mod-Omni_Agent_Provider.drv
  /nix/store/xyn2scqg0ygjhz73md9gbw8al9ragcsb-hs-mod-Omni_Agent_Prompt_Hydrate.drv
  /nix/store/ylwsiw8a9dr2ljc9siwn876bv7gizmnf-hs-mod-Omni_Agent_Prompt_Compile.drv
  /nix/store/11gi7wrxbsiq1x5g4c7cnn7lq3r4vf16-hs-mod-Omni_Agent_Interpreter_Sequential.drv
  /nix/store/rwzsc50issjjj8k3i0x582z269j4vv93-hs-mod-Omni_Agent_Programs_Compaction.drv
  /nix/store/837lb1y7w5fqpa8nfxkds9driy1z4z28-hs-mod-Omni_Agent_Programs_Agent.drv
  /nix/store/lgpvbrwgjmlp7d0vphgigwn6ik39klbf-hs-mod-Omni_Time.drv
  /nix/store/x0q0anj5xyg8pmdl2hq1cw17p6jh4qa9-hs-mod-Omni_Agent_Engine.drv
  /nix/store/riwas6d6ghspjx64h8qckldkaf1s89bi-hs-mod-Omni_Agent_Op_Bridge.drv
  /nix/store/zb6am8iyw1spnvd1h9dmyzcszq36gc5b-hs-mod-Omni_Agent_Memory.drv
  /nix/store/mwhg9r78d2wnhaz5ygy55h3w18gm66x8-omni-agent-memory.drv
building '/nix/store/kkmxaw0ks2y1ndih75clcsjsq4wc7dy9-hs-mod-Omni_Agent_Models.drv'...
building '/nix/store/0m5fa2krxa2d7m1rd67xnplb90yj9vbw-hs-mod-Omni_Agent_Prompt_IR.drv'...
building '/nix/store/wrciq3ha3bcvsvdjjjphm4ispziykj2k-hs-mod-Omni_Agent_Trace.drv'...
building '/nix/store/lgpvbrwgjmlp7d0vphgigwn6ik39klbf-hs-mod-Omni_Time.drv'...

Omni/Agent/Models.hs:35:1: error:
    Could not find module 'Data.Yaml'
    Use -v (or :set -v in ghci) to see a list of the files searched for.
   |
35 | import qualified Data.Yaml as Yaml
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
error: builder for '/nix/store/kkmxaw0ks2y1ndih75clcsjsq4wc7dy9-hs-mod-Omni_Agent_Models.drv' failed with exit code 1
For full logs, run: nix log /nix/store/kkmxaw0ks2y1ndih75clcsjsq4wc7dy9-hs-mod-Omni_Agent_Models.drv
error: 1 dependencies of derivation '/nix/store/mwhg9r78d2wnhaz5ygy55h3w18gm66x8-omni-agent-memory.drv' failed to build
fail: bild: realise: Omni/Agent/Memory.hs

🔄 [system] Open → InProgress · 1 month ago
💬 [human] · 1 month ago

Pipeline scheduler: started run=pipeline-omni-agent-memory-hs-t-602-1771561818 domain=Omni/Agent/Memory.hs

🔄 [human] InProgress → Review · 1 month ago
💬 [human] · 1 month ago

Pipeline scheduler: run=pipeline-omni-agent-memory-hs-t-602-1771561818 domain=Omni/Agent/Memory.hs status=done cost=41c (fund-spend=failed)

💬 [human] · 1 week ago

Ava triage: pipeline auto-run reached status=done but the agent made NO git commits and reported blockers (missing files, path mismatches, or need clarification). This task is not actually in review — there's nothing to review. Resetting status to Open so it can be re-scoped.

🔄 [human] Review → Open · 1 week ago
💬 [human] · 1 week ago

ORPHAN COMMIT: coder agent produced commit a6739cd61cb80c88e9d68a50fb176b3aaf69ebf4 on 2026-02-19 but it was never merged into live. Reachable only via branchless reflog. Pipeline scheduler bug — see separate task. To recover: git cherry-pick a6739cd61cb80c88e9d68a50fb176b3aaf69ebf4 from omni/live (expect conflicts after 6+ weeks of drift). Otherwise re-implement from scratch.