Review: The Welch Labs Illustrated Guide to AI
A review of a rare AI book that uses mathematics to illuminate rather than intimidate, making difficult ideas feel genuinely learnable.
AI-generated tests can look reassuring while proving very little, exposing a dangerous gap between green checkmarks and real verification.
BEACONS offers a model for reliability that AI systems badly need: explicit bounds, checkable guarantees, and less benchmark theater.
LLMs may perform impressively while still failing to know what they are actually capable of, making self-assessment a core safety problem.
OpenAI's usage study shifts attention from benchmark scores to how ordinary people actually use ChatGPT in daily life.
A follow-up on GPT-5's rocky rollout, user frustration, and OpenAI's attempts to tune expectations after launch.
A factual recap of OpenAI's GPT-5 keynote, collecting the main claims, demos, benchmarks, and availability details.
SEAL points toward language models that rewrite their own training material, hinting at AI systems that learn after deployment.
Knowledge graphs are useful, but the post argues they are not a magic cure for LLM hallucination and reasoning failures.
As AI becomes an oracle, a new class of interpreters may emerge to translate machine outputs into human decisions.
OpenAI's competitive-programming work suggests generalist reasoning models can outperform narrow specialists in demanding coding contests.
Goodhart's Law explains why AI alignment can fail when proxy metrics become targets and systems learn the wrong game.
Humanity's Last Exam is framed as a benchmark that tests not only models, but our assumptions about intelligence itself.
Text-to-image models still struggle with counting, making their visual brilliance look surprisingly fragile at the level of basic numeracy.
The opening part of a benchmark series asks what LLM evaluations really measure and why the numbers often mislead.
Part two examines benchmark methods themselves, exposing the assumptions behind the scores used to compare language models.
Part three moves from benchmark scores to application areas, asking where LLM performance actually matters in practice.
Part four digs into the good, the bad, and the misleading in benchmark results and how they get interpreted.
Part five steps beyond scores to consider real-world limitations, reliability, and practical model behavior.
The final benchmark essay looks toward better evaluation methods that test usefulness rather than leaderboard theater.
GPT-4's Turing-test performance revives the old question of whether fooling humans proves intelligence or just fluency.