Benchmark Model - Search News

22h

ACE ROBOTICS’ Kairos World Model Leads Multiple Global Embodied-Intelligence Benchmarks

SHANGHAI, CHINA - Media OutReach Newswire - 15 June 2026 - ACE ROBOTICS today announced that its open-source Kairos ...

Morning Overview on MSN

The newest Anthropic model just took the top spot on the Super-Agent benchmark — the only AI to finish every test case end-to-end and beat OpenAI’s GPT-5.5

Anthropic’s latest AI model has reportedly reached the top of the Super-Agent benchmark, a grueling test of whether an AI ...

MSN on MSNOpinion

Microsoft unveiled MAI-Code-1-Flash, its first model that turns descriptions into working code

Software developers working on complex, multi-file projects now have a new tool to evaluate after Microsoft released MAI-Code ...

The Next Web

Fable 5 was beating GPT 5.5 on every major benchmark. Then the US government pulled it offline.

Anthropic's Fable 5 led every major AI benchmark over OpenAI's GPT 5.5 before a US export control directive forced it offline three days after launch.

VentureBeat

LiveBench is an open LLM benchmark that uses contamination-free test data and objective scoring

Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now A team of Abacus.AI, New York University, ...

SiliconANGLE

OpenAI details o3 reasoning model with record-breaking benchmark scores

OpenAI today detailed o3, its new flagship large language model for reasoning tasks. The model’s introduction caps off a 12-day product announcement series that started with the launch of a new ...

1mon

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing single-model systems from Anthropic and OpenAI by using more than 100 specialized AI ...

WinBuzzer

New DeepSWE Benchmark Puts GPT-5.5 Ahead of Claude Opus 4.7

Datacurve's new DeepSWE benchmark puts GPT-5.5 ahead of Claude and challenges older AI coding rankings by arguing verifier design can distort results.

Some results have been hidden because they may be inaccessible to you

Show inaccessible results