Benchmark Ava's memory system against LoCoMo (Long-term Conversational Memory) to measure how well it performs on long-term conversational memory tasks relative to the state of the art.
LoCoMo (Snap Research, ACL 2024) is a benchmark for evaluating very long-term conversational memory in LLM agents. It consists of 10 annotated conversations with 300+ turns spanning 32 sessions each. Tasks include question answering, event summarization, and multimodal reasoning.
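A minimal sketch of the data shapes we expect to pull out of the LoCoMo release. The field names below (`speaker`, `text`, `question`, `answer`, `category`) are assumptions and should be verified against the actual released JSON.

```python
import json
from dataclasses import dataclass

# Field names are assumptions; check them against the released LoCoMo JSON.

@dataclass
class Turn:
    speaker: str
    text: str

@dataclass
class QAItem:
    question: str
    answer: str
    category: int  # question type, e.g. single-hop vs. multi-hop vs. temporal

def load_conversations(path: str) -> list[dict]:
    """Load the raw LoCoMo JSON; each entry holds sessions plus a QA list."""
    with open(path) as f:
        return json.load(f)
```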
Key reference scores:
Plan:
1. Clone the LoCoMo dataset from GitHub.
2. Write an adapter that feeds LoCoMo conversation sessions into Ava's memory system (remember, recall, link_memories, query_graph).
3. Run the QA eval: for each test question, use Ava's recall/query_graph tools to retrieve relevant memories, then answer.
4. Score using LoCoMo's provided evaluation metrics.
5. Compare against published baselines.
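The adapter and QA steps above can be sketched as follows. The `AvaMemory` interface is a placeholder standing in for Ava's real tool signatures, and `answer_fn` is a hypothetical hook for the LLM call; both are assumptions, not the actual API.

```python
# Sketch of the eval harness (adapter + QA loop), assuming Ava exposes
# memory tools shaped roughly like the ones named in this plan.

class AvaMemory:  # placeholder interface; real signatures may differ
    def remember(self, text: str, session: str) -> None: ...
    def recall(self, query: str, k: int = 5) -> list[str]: ...

def ingest(memory, conversation: dict) -> None:
    """Adapter step: feed each session's turns into Ava's memory."""
    for session_id, turns in conversation.items():
        for turn in turns:
            memory.remember(f"{turn['speaker']}: {turn['text']}", session=session_id)

def run_qa(memory, qa_items: list[dict], answer_fn) -> list[dict]:
    """QA step: retrieve relevant memories per question, then answer."""
    results = []
    for item in qa_items:
        context = memory.recall(item["question"], k=5)
        results.append({
            "question": item["question"],
            "gold": item["answer"],
            "predicted": answer_fn(item["question"], context),
        })
    return results
```

Scoring (step 4) should reuse LoCoMo's own metrics over the `gold`/`predicted` pairs rather than reimplementing them.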
EasyLocomo may simplify the eval harness setup. Zep's harness is another reference implementation.
Our memory system uses semantic search + knowledge graph links. This is more sophisticated than Letta's filesystem approach but simpler than some vector DB solutions. The interesting question is whether the graph structure (especially contradiction/supersession tracking) gives us an edge on the harder questions that require reasoning over evolving facts.
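To make the evolving-facts hypothesis concrete, here is an illustrative sketch (not Ava's actual implementation) of how a supersession edge could filter retrieval: when a newer memory carries a `supersedes` link to an older fact's id, the stale fact is dropped before answering. The `id`/`supersedes` field names are hypothetical.

```python
def resolve(memories: list[dict]) -> list[dict]:
    """Keep only facts that no newer memory in the candidate set supersedes."""
    # Each memory may carry a "supersedes" edge naming an older fact's id.
    stale = {m["supersedes"] for m in memories if m.get("supersedes")}
    return [m for m in memories if m["id"] not in stale]
```

On temporal questions like "where does the speaker live now?", plain semantic search may return both the old and new fact with similar scores; the graph edge is what disambiguates them.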
No activity yet.