Large language models (LLMs), artificial intelligence (AI) systems that can process human language and generate texts in ...
MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
With Gemini 3.1 Pro, Google promises an LLM capable of handling more complex forms of work.
AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful. ...
For those who enjoy rooting for the underdog, the latest MLPerf benchmark results will disappoint: Nvidia’s GPUs have dominated the competition yet again. This includes chart-topping performance on ...
Researchers debut "Humanity’s Last Exam," a benchmark of 2,500 expert-level questions that current AI models are failing.
Google DeepMind researchers introduce new benchmark to improve LLM factuality, reduce hallucinations
Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when they are given more complex tasks and when users are looking for ...
Simbian today announced the “AI SOC LLM Leaderboard,” a comprehensive benchmark to measure LLM performance in Security Operations Centers (SOCs). The new benchmark compares LLMs across a diverse range ...
Every AI model release inevitably includes charts touting how it outperformed its competitors on this benchmark test or that evaluation metric. However, these benchmarks often test for general ...