Benchmark memory system against LoCoMo eval

t-602 · WorkTask · Created 6 days ago · Updated 6 days ago

Description

Goal

Benchmark Ava's memory system against the LoCoMo (Long-term Conversational Memory) eval to measure how well it performs on long-term conversational memory tasks compared to state of the art.

Background

LoCoMo (Snap Research, ACL 2024) is a benchmark for evaluating very long-term conversational memory in LLM agents. The released dataset consists of 10 annotated conversations, each with 300+ turns spanning up to 32 sessions. Tasks include question answering, event summarization, and multimodal dialogue generation.

Key reference scores:

  • Letta Filesystem (simple file read/write): 74.0%
  • Mem0 (specialized memory system): ~68.5%
  • Backboard IO: claims the best public score as of late 2025, above both Letta Filesystem and Mem0
  • MemMachine v0.2: claims top scores as of Dec 2025

References

  • LoCoMo paper: https://arxiv.org/pdf/2402.17753
  • LoCoMo project page: https://snap-research.github.io/locomo/
  • LoCoMo GitHub (dataset + eval code): https://github.com/snap-research/locomo
  • EasyLocomo (simplified eval framework): https://github.com/playeriv65/EasyLocomo
  • Letta benchmark blog post (filesystem baseline): https://www.letta.com/blog/benchmarking-ai-agent-memory
  • Zep's LoCoMo harness: https://github.com/getzep/zep/tree/main/benchmarks/locomo

Future benchmarks to consider

  • MemoryAgentBench (ICLR 2026): https://github.com/HUST-AI-HYZ/MemoryAgentBench - Tests retrieval, test-time learning, knowledge updating, refusal of outdated info. Good fit for our graph-based memory with supersedes/contradicts links.
  • MemoryBench: https://openreview.net/forum?id=wU4Tjlzg3h - Tests memory + continual learning including user feedback.
  • MemBench (ACL Findings 2025): Tests factual + reflective memory.

Approach

  1. Clone the LoCoMo dataset from GitHub
  2. Write an adapter that feeds LoCoMo conversation sessions into Ava's memory system (remember, recall, link_memories, query_graph)
  3. Run the QA eval: for each test question, use Ava's recall/query_graph tools to retrieve relevant memories, then answer
  4. Score using LoCoMo's provided evaluation metrics
  5. Compare against published baselines
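
Steps 2 and 3 could be sketched roughly as follows. This is a minimal sketch under several assumptions, not the actual adapter: the `MemoryStore` interface is hypothetical (Ava's real remember/recall signatures may differ), the dataset layout (keys such as `session_1` and `session_1_date_time`, turns with `speaker` and `text`, QA items with `question` and `answer`) should be checked against the real LoCoMo file, and `answer_fn` stands in for whatever LLM call produces the answer.

```python
import re
from typing import Callable, Protocol


class MemoryStore(Protocol):
    """Hypothetical interface; Ava's real remember/recall signatures may differ."""

    def remember(self, text: str, metadata: dict) -> None: ...
    def recall(self, query: str, k: int) -> list[str]: ...


# Matches "session_3" but not "session_3_date_time" (assumed key naming).
SESSION_KEY = re.compile(r"^session_(\d+)$")


def ingest_conversation(store: MemoryStore, conversation: dict) -> int:
    """Feed every turn into the store, session by session, in numeric order.

    Returns the number of turns stored.
    """
    sessions = sorted(
        (int(m.group(1)), key)
        for key in conversation
        if (m := SESSION_KEY.match(key))
    )
    count = 0
    for number, key in sessions:
        date = conversation.get(f"{key}_date_time", "")
        for turn in conversation[key]:
            store.remember(
                f"{turn['speaker']}: {turn['text']}",
                {"session": number, "date": date},
            )
            count += 1
    return count


def answer_questions(
    store: MemoryStore,
    questions: list[dict],
    answer_fn: Callable[[str, list[str]], str],
    k: int = 5,
) -> list[dict]:
    """Retrieve memories for each question and let answer_fn produce a reply."""
    results = []
    for qa in questions:
        memories = store.recall(qa["question"], k)
        results.append(
            {
                "question": qa["question"],
                "gold": qa["answer"],
                "prediction": answer_fn(qa["question"], memories),
            }
        )
    return results
```

Sorting by the numeric session index (rather than by key string) keeps session 10 after session 2, which matters if the memory system records ingestion order or timestamps. Graph calls such as link_memories and query_graph are left out here because their inputs are not described in this task.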

EasyLocomo may simplify the eval harness setup. Zep's harness is another reference implementation.
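
Whichever harness is used, it helps to have a quick local sanity check on predictions before running the full scorer. The sketch below is a generic token-overlap F1 of the kind commonly used for QA; it is an assumption for illustration, not LoCoMo's own scoring code, which remains the authority for step 4. Published baseline numbers may also be computed with a different metric (for example an LLM judge), so confirm which metric each baseline reports before comparing.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For instance, `token_f1("a red car", "red truck")` shares one token out of two on each side, giving precision and recall of 0.5 each.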

Key question

Our memory system uses semantic search + knowledge graph links. This is more sophisticated than Letta's filesystem approach but simpler than some vector DB solutions. The interesting question is whether the graph structure (especially contradiction/supersession tracking) gives us an edge on the harder questions that require reasoning over evolving facts.
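
One way the supersession links might help on evolving facts is by filtering stale memories out of the retrieved set before answering. The sketch below assumes a hypothetical `Memory` record with a `supersedes` list of older memory ids; Ava's actual graph schema and query_graph output are not specified here and may look different.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical record; not Ava's actual memory schema."""

    memory_id: str
    text: str
    supersedes: list[str] = field(default_factory=list)


def drop_superseded(memories: list[Memory]) -> list[Memory]:
    """Keep only memories that no other retrieved memory supersedes."""
    superseded = {old for m in memories for old in m.supersedes}
    return [m for m in memories if m.memory_id not in superseded]


retrieved = [
    Memory("m1", "Caroline lives in Boston"),
    Memory("m2", "Caroline moved to Denver", supersedes=["m1"]),
    Memory("m3", "Caroline adopted a dog"),
]
current = drop_superseded(retrieved)  # keeps m2 and m3, drops m1
```

Comparing eval runs with and without this filter would give a direct measure of whether supersession tracking changes scores on questions about facts that change over time. Note that the filter only sees links among the retrieved memories, so a superseding memory that was not retrieved cannot suppress the stale one.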
