LoCoMo10 Benchmark · March 2026

Cortex Memory: Best Accuracy. 11× Fewer Tokens.

#1 on LoCoMo10, outperforming every published AI memory system · ~2,900 tokens/query vs ~33,490 for LanceDB + OpenClaw · upgrade OpenClaw's built-in memory with MemClaw

Get Cortex Memory ⚡ Use MemClaw for OpenClaw
68.42%
Cortex Memory Score
vs 35.65% OpenClaw built-in
+32.8pp
Accuracy Gain vs OpenClaw
LoCoMo10, same judge methodology
11×
Fewer Tokens vs LanceDB + OpenClaw
~2,900 vs ~33,490 tokens/Q
18×
Better Score-per-Token
23.6 vs 1.3 score/1K tokens (LanceDB + OpenClaw)
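The headline ratios above can be recomputed directly from the published figures:

```python
# Recompute the headline ratios from the published LoCoMo10 figures.
cortex_score, cortex_tokens = 68.42, 2_900
lancedb_score, lancedb_tokens = 44.55, 33_490

token_savings = lancedb_tokens / cortex_tokens          # ~11.5x fewer tokens
cortex_eff = cortex_score / (cortex_tokens / 1_000)     # ~23.6 score per 1K tokens
lancedb_eff = lancedb_score / (lancedb_tokens / 1_000)  # ~1.3 score per 1K tokens

print(f"token savings: {token_savings:.1f}x")
print(f"efficiency: {cortex_eff:.1f} vs {lancedb_eff:.1f} "
      f"({cortex_eff / lancedb_eff:.0f}x better)")
```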

How Cortex Memory Compares

LLM-as-a-Judge evaluation on LoCoMo10 (long-conversation memory QA). All results use the same judge methodology.

Cortex Memory + OpenClaw (Intent ON): 68.42%
Cortex Memory + OpenClaw (Intent OFF): 49.67%
OpenViking + OpenClaw (−memory-core): 52.08%
OpenViking + OpenClaw (+memory-core): 51.23%
LanceDB + OpenClaw: 44.55%
OpenClaw (built-in memory): 35.65%

* OpenViking tested on 1,540 questions (10 samples); Cortex Memory tested on 152 questions (the conv-26 sample). Same LLM-as-a-Judge methodology throughout. The Cortex Memory Intent OFF score is estimated from the measured delta on the same dataset.

Wins Across All Question Types

LoCoMo10 tests four distinct memory capabilities. Cortex Memory (Intent ON) leads in every category.

Cat 1 · Factual Recall: Intent ON 37.5% · Intent OFF 28.1%
Cat 2 · Temporal Reasoning: Intent ON 62.2% · Intent OFF 40.5%
Cat 3 · Commonsense Inference: Intent ON 76.9% · Intent OFF 69.2%
Cat 4 · Multi-hop Reasoning: Intent ON 84.3% · Intent OFF 65.7%

Intent Analysis: A Meaningful Upgrade

Cortex Memory's Intent Analysis classifies each query before retrieval, routing to the right memory scope. The improvement is most pronounced on multi-hop and temporal questions.

Intent ON (query-aware retrieval): 68.42% · 104 / 152 correct · LoCoMo10 #1
Cat 1 Factual 37.5% · Cat 2 Temporal 62.2% · Cat 3 Inference 76.9% · Cat 4 Multi-hop 84.3%

Intent OFF (standard retrieval): 49.67% · 75 / 152 correct · estimated
Cat 1 Factual 28.1% · Cat 2 Temporal 40.5% · Cat 3 Inference 69.2% · Cat 4 Multi-hop 65.7%
+18.75pp on Cat 4 (Multi-hop) — Intent Analysis is most effective when questions require combining multiple memories across time. By routing multi-hop queries to entity and relational memory scopes first, retrieval precision improves significantly.
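A minimal sketch of the routing idea, assuming keyword rules and memory-scope names that are purely illustrative (this page does not publish the actual classifier):

```python
# Hypothetical sketch of query-intent routing, modeled on the five intent
# types named on this page (Event, Activity, Entity, Relational, Generic).
# The keyword rules and scope names are stand-in assumptions, not
# Cortex Memory's real classifier.
INTENT_SCOPES = {
    "event":      ["events", "temporal_index"],
    "activity":   ["events", "goals"],
    "entity":     ["entities", "personal_info"],
    "relational": ["relationships", "entities"],
    "generic":    ["abstracts"],
}

def classify_intent(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("who is", "what is", "where does")):
        return "entity"
    if any(w in q for w in ("when", "last", "ago", "date")):
        return "event"
    if any(w in q for w in ("relationship", "friend", "married")):
        return "relational"
    if any(w in q for w in ("doing", "working on", "hobby")):
        return "activity"
    return "generic"

def scopes_for(query: str) -> list[str]:
    """Return the memory scopes to search first for this query."""
    return INTENT_SCOPES[classify_intent(query)]
```

Routing multi-hop and temporal queries to entity/relational or temporal scopes first is what narrows the candidate set before any detail is fetched.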

Feature Comparison

A side-by-side look at what each system supports.

LoCoMo10 Best Score: 68.42% 🏆 (Cortex Memory + OpenClaw) · 52.08% (OpenViking + OpenClaw) · 35.65% (OpenClaw built-in)
Avg Tokens / Question: ~2,900 (Cortex Memory) · ~2,769 (OpenViking) · ~15,982 (OpenClaw built-in)
Hierarchical Memory (L0 / L1 / L2): ✓ 3-layer context loading (Cortex Memory, OpenViking) · ✗ (OpenClaw built-in)
Intent-Driven Retrieval: ✓ multiple intent types (Cortex Memory, OpenViking) · ✗ (OpenClaw built-in)
Open Source: 🦀 Rust, efficient & secure (Cortex Memory) · 🐢 Python, slow (OpenViking) · 🗑️ JavaScript, bloated (OpenClaw)

Higher Accuracy.
Far Fewer Tokens.

Cortex Memory's hierarchical L0/L1/L2 architecture means you only pay for the context you actually need — precision retrieval without the bloat.

Cortex Memory · Tokens per Question
~2,900
Input tokens per query · same tier as OpenViking
Score-per-1K-Tokens Efficiency Ratio
23.6×
vs 1.3 for LanceDB + OpenClaw · 18× better value
Token Savings vs LanceDB + OpenClaw
11×
33,490 → ~2,900 tokens/Q while accuracy goes up +23.9pp
Cortex Memory + OpenClaw (the MemClaw plugin · this work): ~2,900 tokens/Q
OpenViking + OpenClaw (+memory-core): ~1,363 tokens/Q
OpenViking + OpenClaw (−memory-core): ~2,769 tokens/Q
OpenClaw (built-in memory): ~15,982 tokens/Q
LanceDB + OpenClaw (−memory-core): ~33,490 tokens/Q
💡
Why Cortex Memory is token-efficient
The hierarchical L0 / L1 / L2 architecture starts retrieval from ultra-compact ~100-token abstracts for fast relevance filtering, then fetches only the detail layers it needs, so you never pay for full-document context you don't use. The result speaks for itself: 68.42% accuracy at ~2,900 tokens/question, while LanceDB + OpenClaw spends 11× more tokens to reach only 44.55%.
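The staged retrieval described above can be sketched roughly as follows. The `Memory` fields mirror the L0/L1/L2 layers named on this page; the word-overlap scorer and the `k` cutoff are placeholder assumptions, not the real implementation:

```python
# Illustrative sketch of L0 -> L1 -> L2 staged retrieval. Layer sizes
# (~100-token abstracts, ~2K-token overviews, full content) come from the
# text; the scoring function below is an invented stand-in for embeddings.
from dataclasses import dataclass

@dataclass
class Memory:
    abstract: str   # L0: ~100-token summary, always scanned (cheap)
    overview: str   # L1: ~2K-token overview, fetched when relevant
    content: str    # L2: full detail, fetched only for top hits

def relevance(text: str, query: str) -> float:
    # Placeholder: fraction of query words present in the text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(memories: list[Memory], query: str, k: int = 2) -> str:
    # Stage 1: rank everything by its cheap L0 abstract.
    ranked = sorted(memories, key=lambda m: relevance(m.abstract, query),
                    reverse=True)
    # Stage 2: pay for full L2 detail only on the top-k candidates.
    return "\n".join(m.content for m in ranked[:k])
```

The token saving comes from stage 2: full content is loaded for at most `k` memories per query, no matter how large the store grows.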

Built Differently

The architecture choices that drive superior benchmark performance.

🧠
Hierarchical L0 / L1 / L2 Memory
Three-level abstraction: ~100-token abstracts for fast relevance filtering, ~2K-token overviews for context, full content for precise answers. Reduces token cost without sacrificing accuracy.
🎯
Query Intent Classification
Before retrieval, each query is classified into one of 5 intent types (Event, Activity, Entity, Relational, Generic). Multi-hop and temporal queries are routed to the most relevant memory scopes.
⏱️
Temporal-Aware Answering
Conversation timestamps are injected into the answer generation context. Relative time expressions ("last Sunday", "next month") are correctly resolved to absolute dates — a major accuracy driver.
📁
Structured Memory Extraction
Memories are automatically typed and organized: events, entities, preferences, goals, personal info, relationships, work history. Category-scoped retrieval boosts precision on factual queries.
Token-Efficient by Design
~2,900 input tokens per question — on par with OpenViking and 5–11× cheaper than naive full-context retrieval systems, while delivering significantly higher accuracy.
🔌
Flexible Deployment
Ships as an HTTP REST service, MCP server, or embeddable Rust library. Works with any OpenAI-compatible LLM. Single binary, no cloud dependency.
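The temporal-aware answering item above can be illustrated with a toy resolver; the expressions handled and the resolution rules are illustrative assumptions, not the shipped code:

```python
# Sketch of temporal-aware resolution: turning a relative time expression
# into an absolute date, given the timestamp of the session in which it
# was spoken. The rules below are made up for illustration.
from datetime import date, timedelta

def resolve(expression: str, spoken_on: date) -> date:
    expr = expression.lower()
    if expr == "yesterday":
        return spoken_on - timedelta(days=1)
    if expr == "last sunday":
        # Most recent Sunday strictly before the utterance date.
        days_back = spoken_on.weekday() + 1   # Mon=0 ... Sun=6
        return spoken_on - timedelta(days=days_back)
    if expr == "next month":
        # First day of the following month, as a coarse anchor.
        y, m = spoken_on.year, spoken_on.month
        return date(y + m // 12, m % 12 + 1, 1)
    raise ValueError(f"unsupported expression: {expression!r}")
```

Without the session timestamp, "last Sunday" in a May 2023 conversation is unanswerable; with it, the expression pins to a concrete date the judge can verify.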

Evaluation Artifacts

All raw outputs, judge reports, and methodology details are publicly available in the repository.

📊
BENCHMARK.md
Full benchmark results table, token efficiency analysis, methodology notes, and reproduction guide
🗂️
qa-s0-v5.judge.md
v5 judge report — CORRECT/WRONG verdict per question, category scores, wrong-answer examples
⚙️
eval.py
Evaluation script — QA runner, LLM-as-a-Judge, time-aware answer prompt, retry mechanism
Methodology: LLM-as-a-Judge (CORRECT / WRONG) on LoCoMo10 conv-26 (152 questions covering 19 sessions, May–October 2023). OpenViking competitor scores are sourced from the official OpenViking README (1,540 questions, 10 samples). Cortex Memory Intent OFF scores are estimated from the v2 delta on the same dataset. Full raw outputs (QA responses + judge JSON) are available at examples/locomo-evaluation/benchmark/.
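For readers reproducing the numbers, the aggregation step of the judge pipeline reduces to a small sketch like this (the field names are assumptions; see eval.py in the repository for the real runner):

```python
# Minimal sketch of the scoring aggregation: given per-question
# CORRECT/WRONG verdicts (as the judge report records them), compute
# overall and per-category accuracy as percentages.
from collections import defaultdict

def score(verdicts: list[dict]) -> dict:
    total = defaultdict(int)
    correct = defaultdict(int)
    for v in verdicts:
        total[v["category"]] += 1
        correct[v["category"]] += v["verdict"] == "CORRECT"
    by_cat = {c: 100 * correct[c] / total[c] for c in total}
    overall = 100 * sum(correct.values()) / sum(total.values())
    return {"overall": overall, "by_category": by_cat}
```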