Evaluation methodology for HumanEval
#15, opened by viplismism
Hi @shunxing1234 and @arieldeng,
Congrats on the impressive KAT-Coder results!
I'm trying to replicate your HumanEval evaluation to benchmark my own models. Could you clarify:
- What sampling temperature did you use for HumanEval (the 96.3% in Table 1)?
- Is this pass@1 or pass@k? If pass@k, what's k and n?
- Did you use the original Python HumanEval or a MultiPL-E translation (Rust or another language)?
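
For context on why n and k matter: below is a minimal sketch of the scoring I am assuming on my side, the unbiased pass@k estimator from the original HumanEval paper, where n samples are drawn per problem and c of them pass the tests. The function names `pass_at_k` and `mean_pass_at_k` are my own, and I don't know whether KAT-Coder was scored this way, which is part of the question.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = 1 and greedy decoding this reduces to plain accuracy, while n > 1 at a nonzero temperature gives a lower-variance pass@1 estimate, so the reported number depends on which setup was used.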
