


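# BigCode Evaluation Harness - Benchmark Guide

A comprehensive guide to all benchmarks supported by the BigCode Evaluation Harness.

## Code Generation with Unit Tests

These benchmarks test functional correctness by executing generated code against unit tests.

### HumanEval

**Overview**: 164 handwritten Python programming problems created by OpenAI.

**Dataset**: `openai_humaneval` on HuggingFace

**Metric**: pass@k (k=1, 10, 100)

**Problem type**: Function completion from a docstring

**Example problem structure**:

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # The model is expected to generate the function body from here.
```

**Usage**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humaneval \
  --temperature 0.2 \
  --n_samples 200 \
  --batch_size 50 \
  --allow_code_execution
```

**Recommended settings**:
- `temperature`: 0.8 for pass@k with a large `n_samples`; 0.2 for near-greedy, pass@1-style evaluation
- `n_samples`: 200 for accurate pass@k estimates
- `max_length_generation`: 512 (enough for most problems)

### HumanEval+

**Overview**: An extended version of HumanEval with 80x more test cases per problem.

**Dataset**: `evalplus/humanevalplus` on HuggingFace

**Why use it**: Catches solutions that pass the original tests but fail on edge cases.

**Usage**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks humanevalplus \
  --temperature 0.2 \
  --n_samples 200 \
  --allow_code_execution
```

**Note**: Execution takes longer because of the additional tests. The timeout settings may need to be adjusted.

### MBPP (Mostly Basic Python Problems)

**Overview**: About 1,000 crowd-sourced Python problems designed for entry-level programmers.

**Dataset**: `mbpp` on HuggingFace

**Test split**: 500 problems (indices 11-511)

**Metric**: pass@k

**Problem structure**:
- Task description in English
- 3 automated test cases per problem
- Code solution (ground truth)

**Usage**:

```bash
accelerate launch main.py \
  --model bigcode/starcoder2-7b \
  --tasks mbpp \
  --temperature 0.2 \
  --n_samples 200 \
  --allow_code_execution
```

### MBPP+

**Overview**: 399 curated MBPP problems with 35x more test cases.

**Dataset**: `evalplus/mbppplus` on HuggingFace

**Usage**:

```bash
accelerate launch main.py --model bigcode/starcoder2-7b --tasks mbppplus --allo
```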
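
The pass@k metric reported by the benchmarks above is commonly computed with the unbiased estimator introduced alongside HumanEval: for a problem with `n` generated samples of which `c` pass all tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The following is a minimal sketch of that formula, not the harness's own code; the function names `pass_at_k` and `mean_pass_at_k` are hypothetical and chosen only for illustration.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed every unit test
    k: the k in pass@k (must not exceed n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative numbers only: 200 samples per problem, as in the settings above.
    print(pass_at_k(n=200, c=50, k=1))
    print(mean_pass_at_k([(200, 50), (200, 0)], k=1))
```

This also shows why `n_samples` should be well above the largest k of interest: with `n` close to `k`, a single passing sample pushes the estimate to 1.0, so the estimate becomes coarse.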

Source: claude-code-templates (MIT). The Chinese translation was generated by AI.
