RULER

About

RULER is a synthetic benchmark for evaluating long-context language models, with flexible configurations for sequence length and task complexity. It extends the basic needle-in-a-haystack test with harder retrieval variants, multi-hop tracing, and aggregation tasks, probing behaviors beyond simple retrieval and revealing how model performance degrades as context length grows.
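
To make the task design concrete, here is a minimal sketch of how a RULER-style needle-in-a-haystack example can be generated. The filler sentence, key/value format, and prompt template are illustrative assumptions, not RULER's exact implementation:

    import random
    import string

    def make_niah_example(context_len_words: int = 1000, seed: int = 0):
        # Toy needle-in-a-haystack example: hide one key-value "needle"
        # in repeated filler text, then ask the model to retrieve it.
        # (Filler and templates here are hypothetical, not RULER's own.)
        rng = random.Random(seed)
        filler = "The grass is green. The sky is blue. The sun is yellow. "
        n_filler_words = len(filler.split())
        words = (filler * (context_len_words // n_filler_words + 1)).split()
        words = words[:context_len_words]

        key = "".join(rng.choices(string.ascii_lowercase, k=8))
        value = "".join(rng.choices(string.digits, k=6))
        needle = f"The special magic number for {key} is {value}."

        # Insert the needle at a random depth; sweeping context_len_words
        # is what lets a synthetic benchmark control sequence length.
        pos = rng.randrange(len(words))
        haystack = " ".join(words[:pos] + [needle] + words[pos:])
        prompt = f"{haystack}\n\nWhat is the special magic number for {key}?"
        return prompt, value

    prompt, answer = make_niah_example(context_len_words=200)
    assert answer in prompt  # the needle is somewhere in the context

Because both the answer and its position are known at generation time, examples like this can be scored exactly and regenerated at any target length.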

Evaluation Stats
Total Models: 2
Organizations: 1
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (2 models)
Top Score: 87.1%
Average Score: 85.6%
High Performers (80%+): 2

Top Organizations
#1 Microsoft: 2 models, 85.6% average
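
As a sanity check, the average shown above is the unweighted mean of the two self-reported scores in the leaderboard below:

    scores = [87.1, 84.1]                # the two leaderboard entries
    average = sum(scores) / len(scores)  # (87.1 + 84.1) / 2 = 85.6
    print(f"{average:.1f}%")             # 85.6%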
Leaderboard
2 models ranked by performance on RULER

Rank  Date          License  Score
#1    Aug 23, 2024  MIT      87.1%
#2    Aug 23, 2024  MIT      84.1%
Resources