The performance of large language models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical ...
Boys perform better than girls in tests made up of multiple-choice questions. Multiple-choice questions are considered objective and easy to mark. But my research shows they give an advantage to males ...