CruxEval-O
About
CruxEval-O is a code execution benchmark of 800 Python functions (3-13 lines each) designed to evaluate AI models' ability to predict program outputs. Through these output-prediction tasks, it tests code reasoning, understanding, and execution capabilities: the model must mentally simulate the code and accurately determine its result without actually running it.
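To illustrate the task format, here is a minimal sketch in the spirit of the benchmark (a hypothetical item, not drawn from CruxEval-O itself): the model is shown a short Python function together with a concrete input and must state the value the call returns.

# Hypothetical output-prediction item (illustrative only, not an actual benchmark task).
def f(text):
    # Drop the vowels, then reverse the remaining characters.
    consonants = [c for c in text if c.lower() not in "aeiou"]
    return "".join(reversed(consonants))

# The model sees f and the input "benchmark" and must predict the return value
# without running the code; the correct answer here is "krmhcnb".
assert f("benchmark") == "krmhcnb"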
Evaluation Stats
Total Models: 1
Organizations: 1
Verified Results: 0
Self-Reported: 1
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers
Score Distribution: 1 model
Top Score: 51.3%
Average Score: 51.3%
High Performers (80%+): 0

Top Organizations
#1 Mistral AI: 1 model, 51.3%
Leaderboard
1 model ranked by performance on CruxEval-O
Release Date | License | Score
---|---|---
May 29, 2024 | MNPL-0.1 | 51.3%