OpenAI’s recent usage analysis of ChatGPT paints a portrait of an AI tool deeply embedded in daily life: not an exotic laboratory artifact, but an assistant for writing, learning, decision-making, and routine communication. The study sampled more than 1.5 million user messages sent between May 2024 and June/July 2025, and it describes a user base no longer narrowly defined by occupation, tech-savviness, or geography. The demographics are shifting: the share of active users with names classifiable as feminine rose from about 37% in early 2024 to around 52% by mid-2025. Adoption in low- and middle-income countries is accelerating, with growth in the poorest countries running at more than four times the rate in the richest. Meanwhile, nearly half of all messages are sent by users under age 26.
The study’s intent taxonomy shows that approximately 49% of messages are “Asking” (users seeking information or guidance), about 40% are “Doing” (asking the model to produce outputs such as plans, drafts, or completed tasks), and about 11% are “Expressing” (messages serving personal reflection or exploration rather than seeking action or information). Among the subset of messages that are work-related, “Doing” dominates: around 56% of those are output-oriented tasks.
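As a purely illustrative sketch of how such an intent breakdown could be tallied from labeled messages, here is a minimal Python example; the `intent` and `is_work_related` fields, and the labels themselves, are assumptions for illustration, not the OpenAI study’s actual schema or classification pipeline.

```python
from collections import Counter

# Hypothetical labeled messages; the real study classified a large
# sample of conversations with an automated, privacy-preserving pipeline.
messages = [
    {"intent": "asking", "is_work_related": False},
    {"intent": "doing", "is_work_related": True},
    {"intent": "expressing", "is_work_related": False},
    # ... the actual sample contains over a million messages
]

def intent_shares(msgs):
    """Return each intent's share of the given messages."""
    counts = Counter(m["intent"] for m in msgs)
    total = sum(counts.values())
    return {intent: n / total for intent, n in counts.items()}

overall = intent_shares(messages)
work_only = intent_shares([m for m in messages if m["is_work_related"]])

print(overall)    # the paper reports roughly: asking 0.49, doing 0.40, expressing 0.11
print(work_only)  # among work-related messages, doing rises to about 0.56
```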
Even in work-related use, writing dominates, and within writing the most common requests are to modify text the user already has (editing, translation, critique) rather than to generate wholly new content. Only about 4.2% of all messages relate to computer programming, and inquiries about relationships or personal reflection are rare: just under 2%.
From these findings flows a strong conclusion: what people most often ask of ChatGPT are tasks involving decision support, clarity, advice, and help with writing, not edge cases of formal logic or purely technical code production. The most frequent work-adjacent activities are obtaining or interpreting information, giving advice, and solving problems creatively. These patterns hold steady across occupations, education levels, and over time. Intent types are also evolving: “Asking” is growing faster than “Doing,” and the style and content of requests are shifting as more people discover the tool and deepen their usage.
Against that backdrop, the HUMAINE framework from ProlificAI and Hugging Face offers a complementary lens: rather than merely cataloguing what people do, it seeks to measure how well models perform in the ways people care about. In HUMAINE, participants bring real-world problems or topics meaningful to them; they interact in multi-turn conversations with anonymized models; afterwards they state which model they prefer overall and also rate the models side by side on dimensions such as clarity of presentation, task and reasoning quality, adaptiveness and interaction quality, trust, and ethics/safety. Recruitment is stratified across demographic groups (age, ethnicity, and political affiliation in the US and UK) so that differences in perception across groups become visible rather than averaged away.
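To make the shape of such an evaluation concrete, here is a minimal Python sketch of how HUMAINE-style session records might be stored and aggregated into per-dimension and per-group win rates; the field names, schema, and toy data are assumptions for illustration, not HUMAINE’s actual data model.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Session:
    """Hypothetical record of one multi-turn, head-to-head session."""
    model_a: str
    model_b: str
    overall_preference: str      # "a", "b", or "tie"
    dimension_preferences: dict  # e.g. {"clarity": "a", "trust": "b"}
    participant_group: str       # demographic stratum of the rater

def win_rates(sessions, dimension=None, group=None):
    """Share of head-to-head comparisons each model wins, optionally
    restricted to one rated dimension and/or one participant group."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for s in sessions:
        if group is not None and s.participant_group != group:
            continue
        verdict = (s.overall_preference if dimension is None
                   else s.dimension_preferences.get(dimension, "tie"))
        for m in (s.model_a, s.model_b):
            totals[m] += 1
        if verdict == "a":
            wins[s.model_a] += 1
        elif verdict == "b":
            wins[s.model_b] += 1
    return {m: wins[m] / totals[m] for m in totals}

# Toy usage
sessions = [
    Session("model-x", "model-y", "a", {"clarity": "a", "trust": "b"}, "18-24"),
    Session("model-y", "model-x", "b", {"clarity": "b", "trust": "tie"}, "55+"),
]
print(win_rates(sessions))                     # overall preference
print(win_rates(sessions, dimension="trust"))  # per-dimension view
print(win_rates(sessions, group="18-24"))      # per-group view
```

Slicing the same aggregation by dimension or by participant group is what lets differences across strata surface instead of being averaged away.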
The implication is that traditional benchmarks (single-turn problem sets, formulaic reasoning, technical correctness), while valuable, capture only part of what people actually experience and value. When someone approaches ChatGPT, what matters as much as the correctness of a response is how well it is explained, whether follow-ups feel natural, whether uncertainty is handled, and whether the tone suits the context. If leaderboards score only “correct answer” or “reasoning depth,” they risk privileging models that fare very well in academic or technical evaluation but stumble in real-world usage.
Because ChatGPT usage is so heavy on “Asking” and guidance, clarity, trust, and adaptability must be core evaluation dimensions, not luxuries. Because most usage is non-work, evaluation settings need to include personal, creative, and exploratory tasks. And because demographic adoption is broadening, fairness and consistency across different user groups become essential.
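One way such a fairness requirement could be made operational, assuming per-group mean ratings like those sketched above are available, is to report each model’s spread across demographic groups alongside its average; the following sketch flags models whose best- and worst-rated groups diverge by more than an arbitrary threshold (model names, scores, and threshold are all illustrative).

```python
# Hypothetical per-group mean ratings for one dimension (say, trust),
# keyed by demographic stratum; the numbers are invented for illustration.
group_scores = {
    "model-x": {"18-24": 0.78, "25-54": 0.72, "55+": 0.55},
    "model-y": {"18-24": 0.64, "25-54": 0.66, "55+": 0.63},
}

def consistency_report(scores_by_group, max_gap=0.10):
    """Flag models whose best- and worst-scoring groups differ by more
    than `max_gap`, so a strong average can't hide group-level failures."""
    report = {}
    for model, by_group in scores_by_group.items():
        gap = max(by_group.values()) - min(by_group.values())
        report[model] = {
            "mean": round(sum(by_group.values()) / len(by_group), 3),
            "gap": round(gap, 3),
            "flagged": gap > max_gap,
        }
    return report

for model, row in consistency_report(group_scores).items():
    print(model, row)
# In this toy data, model-x has the higher mean but a large gap across
# groups; model-y scores lower on average but is far more consistent.
```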
Moreover, the temporal evolution suggests that what counts as “good” shifts over time. As “Asking” grows and more users from diverse backgrounds arrive, expectations rise: more patience for ambiguity, an expectation of natural follow-ups, higher demand for explanation, perhaps a more conversational style. Evaluations must adapt accordingly.
In this light, model leaderboards that aspire to real significance should do more than sort models by their best technical scores. They should also expose performance along human-centered dimensions, reflect what people actually ask models to help them with, and reveal where models perform well and where they fail for different groups of people. In doing so, they measure not only cleverness or reasoning; they align evaluation with usefulness, satisfaction, and trust, which in the end are what people live with.