Evaluation methodology for HumanEval

#15
by viplismism - opened

Hi @shunxing1234 and @arieldeng,

Congrats on the impressive KAT-Coder results!
I'm trying to replicate your HumanEval evaluation to benchmark my own models. Could you clarify:

  1. What temperature did you use for HumanEval (96.3% in Table 1)?
  2. Is this pass@1 or pass@k? If pass@k, what's k and n?
  3. Did you use the Python HumanEval or MultiPL-E (Rust/other language)?
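
For context on question 2, here is a minimal sketch of the unbiased pass@k estimator commonly used with HumanEval, where n is the number of samples generated per problem and c is how many of them pass the tests. This is only an assumption about the metric in question, not a statement of how the KAT-Coder number was computed; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that passed all unit tests
    k: budget of attempts being scored
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = 1 and k = 1 this reduces to the plain fraction of problems solved on a single sample, which is why the temperature and the values of n and k all matter when comparing scores.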

