LoCoMo10 Benchmark · March 2026

Cortex Memory: Best Accuracy. 11× Fewer Tokens.

#1 on LoCoMo10, outperforming every published AI memory system · ~2,900 tokens/query vs ~33,490 for LanceDB + OpenClaw · upgrade OpenClaw's built-in memory with MemClaw

Get Cortex Memory ⚡ Use MemClaw for OpenClaw
68.42%
Cortex Memory Score
vs 35.65% OpenClaw built-in
+32.8pp
Accuracy Gain vs OpenClaw
LoCoMo10, same judge methodology
11×
Fewer Tokens vs LanceDB + OpenClaw
~2,900 vs ~33,490 tokens/Q
18×
Better Score-per-Token
23.6 vs 1.3 score/1K tokens (LanceDB + OpenClaw)
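The headline ratios above can be recomputed directly from the published figures:

```python
# Recompute the headline ratios from the published LoCoMo10 figures.
cortex_score, cortex_tokens = 68.42, 2_900
lancedb_score, lancedb_tokens = 44.55, 33_490

token_savings = lancedb_tokens / cortex_tokens          # ~11.5x fewer tokens
cortex_eff = cortex_score / (cortex_tokens / 1_000)     # ~23.6 score per 1K tokens
lancedb_eff = lancedb_score / (lancedb_tokens / 1_000)  # ~1.3 score per 1K tokens

print(f"token savings: {token_savings:.1f}x")
print(f"efficiency: {cortex_eff:.1f} vs {lancedb_eff:.1f} "
      f"({cortex_eff / lancedb_eff:.0f}x better)")
```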

How Cortex Memory Compares

LLM-as-a-Judge evaluation on LoCoMo10 (long-conversation memory QA). All results use the same judge methodology.

Cortex Memory + OpenClaw (Intent ON): 68.42%
Cortex Memory + OpenClaw (Intent OFF): 49.67%
OpenViking + OpenClaw (−memory-core): 52.08%
OpenViking + OpenClaw (+memory-core): 51.23%
LanceDB + OpenClaw: 44.55%
OpenClaw (built-in memory): 35.65%

* OpenViking tested on 1,540 questions (10 samples); Cortex Memory tested on 152 questions (the conv-26 sample). Same LLM-as-a-Judge methodology throughout. The Cortex Memory Intent OFF score is estimated from the measured delta on the same dataset.

Wins Across All Question Types

LoCoMo10 tests four distinct memory capabilities. Cortex Memory (Intent ON) leads in every category.

Cat 1 · Factual Recall: Intent ON 37.5% · Intent OFF 28.1%
Cat 2 · Temporal Reasoning: Intent ON 62.2% · Intent OFF 40.5%
Cat 3 · Commonsense Inference: Intent ON 76.9% · Intent OFF 69.2%
Cat 4 · Multi-hop Reasoning: Intent ON 84.3% · Intent OFF 65.7%

Intent Analysis: A Meaningful Upgrade

Cortex Memory's Intent Analysis classifies each query before retrieval, routing to the right memory scope. The improvement is most pronounced on multi-hop and temporal questions.

Intent ON (query-aware retrieval): 68.42% · 104 / 152 correct · LoCoMo10 #1
Cat 1 Factual 37.5% · Cat 2 Temporal 62.2% · Cat 3 Inference 76.9% · Cat 4 Multi-hop 84.3%

Intent OFF (standard retrieval): 49.67% · 75 / 152 correct · estimated
Cat 1 Factual 28.1% · Cat 2 Temporal 40.5% · Cat 3 Inference 69.2% · Cat 4 Multi-hop 65.7%
+18.75pp on Cat 4 (Multi-hop) — Intent Analysis is most effective when questions require combining multiple memories across time. By routing multi-hop queries to entity and relational memory scopes first, retrieval precision improves significantly.
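A minimal sketch of the routing idea, assuming keyword rules and memory-scope names that are purely illustrative (this page does not publish the actual classifier):

```python
# Hypothetical sketch of query-intent routing, modeled on the five intent
# types named on this page (Event, Activity, Entity, Relational, Generic).
# The keyword rules and scope names are stand-in assumptions, not
# Cortex Memory's real classifier.
INTENT_SCOPES = {
    "event":      ["events", "temporal_index"],
    "activity":   ["events", "goals"],
    "entity":     ["entities", "personal_info"],
    "relational": ["relationships", "entities"],
    "generic":    ["abstracts"],
}

def classify_intent(query: str) -> str:
    q = query.lower()
    if any(w in q for w in ("who is", "what is", "where does")):
        return "entity"
    if any(w in q for w in ("when", "last", "ago", "date")):
        return "event"
    if any(w in q for w in ("relationship", "friend", "married")):
        return "relational"
    if any(w in q for w in ("doing", "working on", "hobby")):
        return "activity"
    return "generic"

def scopes_for(query: str) -> list[str]:
    """Return the memory scopes to search first for this query."""
    return INTENT_SCOPES[classify_intent(query)]
```

Routing multi-hop and temporal queries to entity/relational or temporal scopes first is what narrows the candidate set before any detail is fetched.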

Feature Comparison

A side-by-side look at what each system supports.

LoCoMo10 Best Score: 68.42% 🏆 (Cortex Memory + OpenClaw) · 52.08% (OpenViking + OpenClaw) · 35.65% (OpenClaw built-in)
Avg Tokens / Question: ~2,900 (Cortex Memory) · ~2,769 (OpenViking) · ~15,982 (OpenClaw built-in)
Hierarchical Memory (L0 / L1 / L2): ✓ 3-layer context loading (Cortex Memory, OpenViking) · ✗ (OpenClaw built-in)
Intent-Driven Retrieval: ✓ multiple intent types (Cortex Memory, OpenViking) · ✗ (OpenClaw built-in)
Open Source: 🦀 Rust, efficient & secure (Cortex Memory) · 🐢 Python, slow (OpenViking) · 🗑️ JavaScript, bloated (OpenClaw)

Higher Accuracy.
Far Fewer Tokens.

Cortex Memory's hierarchical L0/L1/L2 architecture means you only pay for the context you actually need — precision retrieval without the bloat.

Cortex Memory · Tokens per Question
~2,900
Input tokens per query · same tier as OpenViking
Score-per-1K-Tokens Efficiency Ratio
23.6×
vs 1.3 for LanceDB + OpenClaw · 18× better value
Token Savings vs LanceDB + OpenClaw
11×
33,490 → ~2,900 tokens/Q while accuracy goes up +23.9pp
Cortex Memory + OpenClaw (the MemClaw plugin · this work): ~2,900 tokens/Q
OpenViking + OpenClaw (+memory-core): ~1,363 tokens/Q
OpenViking + OpenClaw (−memory-core): ~2,769 tokens/Q
OpenClaw (built-in memory): ~15,982 tokens/Q
LanceDB + OpenClaw (−memory-core): ~33,490 tokens/Q
💡
Why Cortex Memory is token-efficient
The hierarchical L0 / L1 / L2 architecture starts retrieval from ultra-compact ~100-token abstracts for fast relevance filtering, then fetches only the detail layers it needs, so you never pay for full-document context you don't use. The result speaks for itself: 68.42% accuracy at ~2,900 tokens/question, while LanceDB + OpenClaw spends 11× more tokens to reach only 44.55%.
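The staged retrieval described above can be sketched roughly as follows. The `Memory` fields mirror the L0/L1/L2 layers named on this page; the word-overlap scorer and the `k` cutoff are placeholder assumptions, not the real implementation:

```python
# Illustrative sketch of L0 -> L1 -> L2 staged retrieval. Layer sizes
# (~100-token abstracts, ~2K-token overviews, full content) come from the
# text; the scoring function below is an invented stand-in for embeddings.
from dataclasses import dataclass

@dataclass
class Memory:
    abstract: str   # L0: ~100-token summary, always scanned (cheap)
    overview: str   # L1: ~2K-token overview, fetched when relevant
    content: str    # L2: full detail, fetched only for top hits

def relevance(text: str, query: str) -> float:
    # Placeholder: fraction of query words present in the text.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(memories: list[Memory], query: str, k: int = 2) -> str:
    # Stage 1: rank everything by its cheap L0 abstract.
    ranked = sorted(memories, key=lambda m: relevance(m.abstract, query),
                    reverse=True)
    # Stage 2: pay for full L2 detail only on the top-k candidates.
    return "\n".join(m.content for m in ranked[:k])
```

The token saving comes from stage 2: full content is loaded for at most `k` memories per query, no matter how large the store grows.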

Built Differently

The architecture choices that drive superior benchmark performance.

🧠
Hierarchical L0 / L1 / L2 Memory
Three-level abstraction: ~100-token abstracts for fast relevance filtering, ~2K-token overviews for context, full content for precise answers. Reduces token cost without sacrificing accuracy.
🎯
Query Intent Classification
Before retrieval, each query is classified into one of 5 intent types (Event, Activity, Entity, Relational, Generic). Multi-hop and temporal queries are routed to the most relevant memory scopes.
⏱️
Temporal-Aware Answering
Conversation timestamps are injected into the answer generation context. Relative time expressions ("last Sunday", "next month") are correctly resolved to absolute dates — a major accuracy driver.
📁
Structured Memory Extraction
Memories are automatically typed and organized: events, entities, preferences, goals, personal info, relationships, work history. Category-scoped retrieval boosts precision on factual queries.
Token-Efficient by Design
~2,900 input tokens per question — on par with OpenViking and 5–11× cheaper than naive full-context retrieval systems, while delivering significantly higher accuracy.
🔌
Flexible Deployment
Ships as an HTTP REST service, MCP server, or embeddable Rust library. Works with any OpenAI-compatible LLM. Single binary, no cloud dependency.
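The temporal-aware answering item above can be illustrated with a toy resolver; the expressions handled and the resolution rules are illustrative assumptions, not the shipped code:

```python
# Sketch of temporal-aware resolution: turning a relative time expression
# into an absolute date, given the timestamp of the session in which it
# was spoken. The rules below are made up for illustration.
from datetime import date, timedelta

def resolve(expression: str, spoken_on: date) -> date:
    expr = expression.lower()
    if expr == "yesterday":
        return spoken_on - timedelta(days=1)
    if expr == "last sunday":
        # Most recent Sunday strictly before the utterance date.
        days_back = spoken_on.weekday() + 1   # Mon=0 ... Sun=6
        return spoken_on - timedelta(days=days_back)
    if expr == "next month":
        # First day of the following month, as a coarse anchor.
        y, m = spoken_on.year, spoken_on.month
        return date(y + m // 12, m % 12 + 1, 1)
    raise ValueError(f"unsupported expression: {expression!r}")
```

Without the session timestamp, "last Sunday" in a May 2023 conversation is unanswerable; with it, the expression pins to a concrete date the judge can verify.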

Evaluation Artifacts

All raw outputs, judge reports, and methodology details are publicly available in the repository.

📊
BENCHMARK.md
Full benchmark results table, token efficiency analysis, methodology notes, and reproduction guide
🗂️
qa-s0-v5.judge.md
v5 judge report — CORRECT/WRONG verdict per question, category scores, wrong-answer examples
⚙️
eval.py
Evaluation script — QA runner, LLM-as-a-Judge, time-aware answer prompt, retry mechanism
Methodology: LLM-as-a-Judge (CORRECT / WRONG) on LoCoMo10 conv-26 (152 questions covering 19 sessions, May–October 2023). OpenViking competitor scores are sourced from the official OpenViking README (1,540 questions, 10 samples). Cortex Memory Intent OFF scores are estimated from the v2 delta on the same dataset. Full raw outputs (QA responses + judge JSON) are available at examples/locomo-evaluation/benchmark/.
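For readers reproducing the numbers, the aggregation step of the judge pipeline reduces to a small sketch like this (the field names are assumptions; see eval.py in the repository for the real runner):

```python
# Minimal sketch of the scoring aggregation: given per-question
# CORRECT/WRONG verdicts (as the judge report records them), compute
# overall and per-category accuracy as percentages.
from collections import defaultdict

def score(verdicts: list[dict]) -> dict:
    total = defaultdict(int)
    correct = defaultdict(int)
    for v in verdicts:
        total[v["category"]] += 1
        correct[v["category"]] += v["verdict"] == "CORRECT"
    by_cat = {c: 100 * correct[c] / total[c] for c in total}
    overall = 100 * sum(correct.values()) / sum(total.values())
    return {"overall": overall, "by_category": by_cat}
```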