HumanEval-Mul

Multilingual | text
About

HumanEval-Mul is a multilingual extension of the HumanEval benchmark that evaluates AI models' code generation across multiple programming languages. It tests a model's ability to produce functional code in languages beyond Python, measuring cross-language programming competence and algorithmic reasoning across different paradigms and syntaxes.
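
HumanEval-style benchmarks score functional correctness: a completion counts as solved only if it runs and passes the task's unit tests, and results are typically reported as pass@1 (one sample per task). The sketch below illustrates that scoring loop for the Python case only; the task fields (prompt, test_code, entry_point) follow the original HumanEval layout and are an assumption here, as is the use of in-process exec rather than the sandboxed, per-language harness a real evaluation would use.

```python
# Minimal sketch of HumanEval-style functional-correctness scoring (Python case only).
# Task layout and field names are assumptions based on the original HumanEval format,
# not taken from the HumanEval-Mul release.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str        # function signature + docstring shown to the model
    test_code: str     # unit tests that define check(candidate)
    entry_point: str   # name of the function under test

def passes(task: Task, completion: str) -> bool:
    """Return True if the model's completion passes the task's unit tests."""
    program = task.prompt + completion + "\n" + task.test_code
    scope: dict = {}
    try:
        exec(program, scope)                      # define the candidate function and tests
        scope["check"](scope[task.entry_point])   # run the tests against the candidate
        return True
    except Exception:                             # compile error, wrong answer, or crash
        return False

def pass_at_1(tasks, completions) -> float:
    """Percentage of tasks solved with a single sample per task (pass@1)."""
    solved = sum(passes(t, c) for t, c in zip(tasks, completions))
    return 100.0 * solved / len(tasks)
```

A real multilingual harness would invoke each language's toolchain (compiler or interpreter plus the translated tests) in a sandbox with timeouts; the in-process exec above is only meant to show the per-task pass/fail decision.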

Evaluation Stats
Total Models: 2
Organizations: 1
Verified Results: 0
Self-Reported: 2
Benchmark Details
Max Score: 1
Language: en
Performance Overview
Score distribution and top performers

Score Distribution (2 models)
Top Score: 82.6%
Average Score: 78.2%
High Performers (80%+): 1

Top Organizations

#1 DeepSeek: 2 models, 78.2% average
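
For reference, the 78.2% average above is consistent with an unweighted mean of the two listed scores (both from the same organization); a quick check, assuming each model counts equally:

```python
# Sanity check of the reported average (assumption: each model weighted equally).
scores = [82.6, 73.8]
print(round(sum(scores) / len(scores), 1))  # 78.2
```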
Leaderboard
2 models ranked by performance on HumanEval-Mul
Rank | Release Date | License / Links | Score
#1 | Dec 25, 2024 | MIT + Model License (Commercial use allowed) | 82.6%
#2 | May 8, 2024 | deepseek | 73.8%
Resources