Large Language Models (LLMs) have shown remarkable capabilities in understanding and generating text, but they struggle with adaptive, multi-turn information gathering – i.e. asking relevant follow-up questions based on previous answers. The paper “BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design” (arXiv:2508.21184) addresses this shortcoming by introducing a new approach that enables LLMs to proactively seek information in an intelligent, adaptive manner. The authors leverage the framework of sequential Bayesian experimental design (BED), which uses an information-theoretic criterion (expected information gain, EIG) to decide what question to ask next. This approach, termed BED-LLM, is a general-purpose method to make LLM-based agents more effective in multi-turn conversations and interactions with external sources.
In essence, BED-LLM treats the information-gathering process as a probabilistic inference problem: the LLM maintains a belief distribution over possible “hypotheses” (e.g. the hidden answer in a guessing game or a user’s hidden preferences) and chooses questions that maximally reduce its uncertainty about the true hypothesis. The paper provides a principled formulation of the EIG using the LLM’s internal belief model and introduces several innovations to implement this in practice. Key innovations include a carefully designed estimator for EIG, a strategy to update the LLM’s belief state without relying solely on in-context learning, and targeted generation of candidate questions. By integrating these components, BED-LLM aims to significantly improve performance on tasks that require asking a series of questions, compared to naive prompting or simpler question-selection heuristics. The following sections summarize the main results of the paper and discuss their implications for the field.
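To make this objective concrete, the expected information gain of a candidate question can be written in the standard sequential-BED form below (the notation here is ours for illustration; the paper's exact formulation may differ):

```latex
\mathrm{EIG}(q) \;=\; \mathbb{H}\!\left[\, p(y \mid q,\, h_{1:t}) \,\right]
\;-\; \mathbb{E}_{\theta \sim p(\theta \mid h_{1:t})}\!\left[\, \mathbb{H}\!\left[\, p(y \mid q,\, \theta) \,\right] \,\right]
```

Here \(\theta\) is the hidden hypothesis (e.g. the secret entity or user persona), \(h_{1:t}\) is the Q&A history so far, \(y\) is the answer to candidate question \(q\), and \(\mathbb{H}[\cdot]\) denotes Shannon entropy. The questioner asks the question that maximizes this quantity.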
Proposed Method: BED-LLM Overview
BED-LLM (Bayesian Experimental Design with LLMs) is the proposed method that guides an LLM to ask informative questions in a sequential manner. At each dialogue turn, the LLM (acting as a “questioner”) generates a set of candidate questions and evaluates their expected information gain (EIG) – the expected reduction in uncertainty about the target after receiving an answer. The question with the highest EIG is then posed to the user (or to another system acting as an “answerer”). The answer is then used to update the LLM’s belief over the possible hypotheses (e.g. possible secret identities or user profiles). This process repeats, enabling the LLM to adapt its queries based on past responses, homing in on the correct information over multiple turns.
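A minimal Python sketch of this ask, score, and update loop is given below; the injected callables (`propose_questions`, `estimate_eig`, `get_answer`, `update_beliefs`) are hypothetical placeholders for LLM-backed components rather than the paper's actual interface.

```python
def run_bed_llm_loop(propose_questions, estimate_eig, get_answer,
                     update_beliefs, initial_beliefs, max_turns=20):
    """Schematic questioner loop in the spirit of BED-LLM.

    All callables are injected so the sketch stays self-contained:
      propose_questions(history, beliefs) -> list of candidate questions
      estimate_eig(question, beliefs, history) -> float (expected info gain)
      get_answer(question) -> answer from the user or answerer model
      update_beliefs(history, beliefs) -> refreshed hypothesis set consistent
                                          with the full Q&A history
    """
    history, beliefs = [], list(initial_beliefs)
    for _ in range(max_turns):
        candidates = propose_questions(history, beliefs)
        # Greedily ask the candidate with the highest expected information gain.
        question = max(candidates, key=lambda q: estimate_eig(q, beliefs, history))
        answer = get_answer(question)
        history.append((question, answer))
        # The belief state is tracked outside the LLM's context window, so
        # consistency with all past answers is enforced explicitly.
        beliefs = update_beliefs(history, beliefs)
        if len(beliefs) == 1:  # a single hypothesis survives: commit to it
            break
    return history, beliefs
```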
Several design decisions are crucial for making this work effectively. The authors highlight the importance of how the joint probability model is factorized and updated. They opt for a “prior–likelihood” pairing where the LLM’s belief (prior) over hypotheses is updated using a likelihood derived from the LLM’s predicted answer distribution, rather than a simpler “data–estimation” scheme. This means BED-LLM explicitly considers a set of hypothesized answers and how likely each is, given a candidate question and the LLM’s world knowledge. The paper argues (and later shows in ablation results) that this approach yields more faithful uncertainty estimates and better performance than using a fixed candidate set or just the LLM’s entropy over answers. Another innovation is to avoid relying purely on the LLM’s internal context for belief updates; instead, external mechanisms are used to track and update the belief state across turns, ensuring consistency with all past Q&A history. The question generation process is also tailored: rather than arbitrary queries, the model proposes questions likely to differentiate between the most plausible remaining hypotheses (using unconstrained or diversity-seeking generation as needed).
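Under the prior–likelihood pairing described above, a simple Monte Carlo estimate of the EIG can be formed from hypotheses sampled from the current belief together with the LLM's predicted answer distribution for each one. The sketch below (plain Python, hypothetical function names) illustrates the computation; the paper's actual estimator may differ in details such as how hypotheses are weighted or filtered.

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def eig_from_samples(answer_probs):
    """Monte Carlo EIG estimate for one candidate question.

    `answer_probs` holds one entry per hypothesis sampled from the current belief;
    each entry is the LLM's predicted distribution over the question's possible
    answers given that hypothesis. Returns H[marginal answer] minus the average
    of H[answer | hypothesis] over the sampled hypotheses.
    """
    n_hyp = len(answer_probs)
    n_ans = len(answer_probs[0])
    marginal = [sum(dist[a] for dist in answer_probs) / n_hyp for a in range(n_ans)]
    expected_conditional = sum(entropy(dist) for dist in answer_probs) / n_hyp
    return entropy(marginal) - expected_conditional

# Three sampled hypotheses, yes/no question: the answer depends on which
# hypothesis is true, so the question is informative (EIG ~ 0.92 bits here).
print(eig_from_samples([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```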
Baselines: To evaluate BED-LLM, the authors compare it against two main strategies: (1) Naive QA, essentially letting the LLM ask questions and guess answers using its standard prompting with no explicit info-gain objective (relying solely on the LLM’s implicit reasoning); and (2) an Entropy-based strategy, which selects questions by maximizing the predictive entropy of the answer (a common heuristic proxy for information gain) rather than the full EIG. The paper also implements a variant of the aforementioned data–estimation approach as an ablation, representing a more naive BED method similar to some prior works. These comparisons help isolate the impact of BED-LLM’s contributions.
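To see why the full EIG can differ from the entropy heuristic, consider the toy comparison below (our own illustration, not taken from the paper): a question whose answer is a coin flip under every hypothesis has maximal predictive entropy but zero information gain, whereas a question that cleanly splits the hypotheses scores highly under both criteria.

```python
from math import log2

def bernoulli_entropy(p):
    """Entropy in bits of a yes/no answer with P(yes) = p."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

# Two equally likely hypotheses; each tuple gives P(answer "yes" | question, hypothesis).
questions = {"splits the hypotheses": (1.0, 0.0),   # "yes" under h1, "no" under h2
             "coin flip either way":  (0.5, 0.5)}   # uninformative about h1 vs h2

for name, (p1, p2) in questions.items():
    entropy_score = bernoulli_entropy(0.5 * p1 + 0.5 * p2)          # entropy baseline
    eig_score = entropy_score - 0.5 * (bernoulli_entropy(p1) + bernoulli_entropy(p2))
    print(f"{name}: entropy = {entropy_score:.2f} bits, EIG = {eig_score:.2f} bits")
# Both questions score 1.00 bit under the entropy heuristic, but only the first
# has EIG = 1.00 bit; the coin-flip question has EIG = 0.00.
```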
Experiments and Results
The authors test BED-LLM on two challenging interactive tasks and report quantitative results demonstrating its advantages. Experiments are run with several different LLMs (proprietary models such as GPT-4o and GPT-4o-mini as well as open-source models such as Qwen2.5-72B) to ensure the findings are robust across model sizes and types. In each task, the questioner is an LLM employing either BED-LLM or a baseline strategy, and the answerer (simulating the user or oracle) provides responses. Below we summarize the results for each task, focusing on success metrics and comparative performance.
1. 20-Questions Game
Setup: In this classic game, the LLM must figure out a hidden target entity by asking up to 20 yes/no questions. The paper evaluates three categories of targets: Animals, Celebrities, and Things, each comprising 100 secret entities unknown to the questioner. The game continues until the model either guesses the correct answer (considered a success) or exhausts 20 questions. The questioner LLM is prompted only with the general category (e.g. “animal” or “celebrity”) but not the list of possible entities, so it must infer the target through questioning. The evaluation metric is the success rate – the percentage of games (out of 100) in which the model correctly guesses the entity within 20 questions. The authors also track the cumulative success rate as turns progress (to see how quickly each strategy narrows down the answer). Tests are conducted in two modes: (a) the same LLM is used as both questioner and answerer, and (b) the questioner and answerer are different models, to simulate a realistic scenario where the user’s responses may not follow the questioner’s internal model exactly.
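For reference, the success metrics described above can be reconstructed from per-game outcomes roughly as follows (a hypothetical helper; the paper may compute them differently):

```python
def cumulative_success_rate(solved_turns, max_turns=20):
    """Fraction of games solved by each turn (illustrative reconstruction).

    `solved_turns[i]` is the 1-indexed turn at which game i was solved, or
    None if the target was never guessed within `max_turns`.
    """
    n_games = len(solved_turns)
    return [sum(1 for t in solved_turns if t is not None and t <= turn) / n_games
            for turn in range(1, max_turns + 1)]

# Example with three games, solved at turns 5 and 12 and never solved:
# the final entry (turn 20) is the overall success rate, here 2/3.
print(cumulative_success_rate([5, 12, None]))
```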
Results: Across all datasets and models, BED-LLM consistently achieved the highest success rates, significantly outperforming both the entropy-based and naive QA baselines. Table 1 (excerpted below) illustrates the final success rates for an example LLM (GPT-4o) on each dataset:
| Dataset     | Naive QA | Entropy | BED-LLM |
|-------------|----------|---------|---------|
| Animals     | 45%      | 88%     | 93%     |
| Celebrities | 45%      | 64%     | 86%     |
| Things      | 34%      | 49%     | 64%     |
As shown, the BED-LLM strategy dramatically improves the success rate of the guessing game. For instance, on the Animals dataset, BED-LLM solved 93% of the games with GPT-4o, compared to only 45% by the naive questioning and 88% by the entropy-heuristic strategy. Similar patterns hold across other model types: the information-driven approach often more than doubles the success rate of naive QA. In one of the most challenging settings (e.g. a smaller model on the Celebrities set), BED-LLM achieved 91% success versus a mere 14% with the naive method – a striking improvement demonstrating the value of principled question selection. Even on the hardest category (Things), where all methods found it tougher to identify the target, BED-LLM still attained the best performance (around 54–64% success depending on the model, compared to 23–34% for naive QA). All these gains are achieved within the same 20-question limit, underscoring that BED-LLM makes far more efficient use of each question to gather information.
Another important observation is how the performance evolves over the course of questioning. Initially (in the first ~5 turns), BED-LLM and the entropy baseline perform similarly, since any reasonable question will slightly shrink the hypothesis space. However, after around 7–10 questions, BED-LLM’s advantage clearly emerges, with its success rate climbing faster than the entropy strategy’s. By the later turns, BED-LLM is solving substantially more games. This divergence is evident in plots of success rate vs. turn count. The delay in BED-LLM “pulling away” is likely because in early turns the space of possible answers is very large (no method can reliably guess correctly with so little information), but as the game progresses, asking the most informative questions pays off. The entropy-based method, which is a simpler heuristic, does not gain as much discriminatory power in those later stages, whereas the full EIG-based approach continues to zoom in on the target efficiently. This result validates that using the true expected information gain (as BED-LLM does) has tangible benefits over using entropy or intuitive questions, especially as the problem narrows down.
The authors also tested robustness when the questioner and answerer are different models (to mimic a scenario where the LLM is querying an external agent or human). Here too, the benefit of BED-LLM persists. In these mismatched-model games, BED-LLM still achieved higher success rates than both baselines on all datasets. Interestingly, the entropy strategy’s performance dropped more severely in this setting – it often fell closer to the naive baseline when the answerer’s model differed from the questioner’s. In contrast, the naive QA method was relatively “robust” to the misspecification (its performance was consistently low to begin with). This suggests that BED-LLM’s explicit modeling of uncertainty transfers better when the answering distribution is not identical to the questioner’s expectations. The entropy heuristic may have been over-reliant on the questioner’s exact predictive probabilities, so when those probabilities don’t match the actual answerer’s behavior, its effectiveness diminishes. BED-LLM, by using a more principled Bayesian update (and perhaps filtering hypothesis sets by consistency), manages to retain strong performance even under this realistic condition. In practical terms, this robustness is crucial – it indicates that an agent using BED-LLM can still perform well when interacting with humans (who of course won’t follow an LLM’s internal distribution exactly).
Finally, an ablation study compared the chosen modeling approach (prior–likelihood pairing for EIG) with the alternative data–estimation approach. The data–estimation variant is similar to some prior active learning methods (it treats new data as direct evidence without maintaining a probabilistic prior in the same way). On a subset of the 20-Questions task (Celebrities dataset with a Qwen2.5-72B model), this variant was tested against full BED-LLM. The result was that BED-LLM “substantially outperforms” the data-estimation alternative. The data-estimation method did show some improvement over the naive baseline (indicating that any kind of info-driven question selection helps), but it performed even worse than the simple entropy strategy and well below the proper BED-LLM. This confirms the authors’ claim that how the joint model is formulated and updated is a critical design choice – BED-LLM’s approach to maintaining a flexible hypothesis space and correctly estimating uncertainty is superior. In summary, the 20-Questions experiments demonstrate that BED-LLM enables dramatically higher success rates and more robust question-asking behavior than existing approaches, validating the effectiveness of the proposed method in a controlled guessing-game scenario.
2. Active Preference Elicitation
Setup: The second task explores a more open-ended interactive problem: learning a user’s personal preferences (specifically movie tastes) through Q&A, and then making recommendations. This simulates scenarios like a recommender system interviewing a user to understand their likes and dislikes. In the paper’s formulation, the hidden “target” to be discovered is a user persona – essentially a paragraph of text describing the user’s film preferences. The questioner (LLM agent) must ask questions to uncover this persona, and after each round of questions it attempts to recommend movies it thinks the user would rate highly. The answerer simulates a user: the authors took real user rating profiles from the MovieLens dataset and used an oracle LLM to generate a textual persona summarizing each user’s taste. The answerer LLM is prompted with that persona and answers questions as if it were that user (e.g. expressing a preference or not).
A twist in this task is that answers can be subjective or nuanced, which makes evaluating information gain harder. To tackle this, the questions were restricted to a multiple-choice format (A/B/C/D/E). This means the questioner must not only ask something relevant, but also provide five possible answer options (with option “E” always being “None of the above” to allow for an out if none of the listed options apply). This restriction ensures the answer distribution can be captured more reliably – the LLM can assign probabilities to five discrete choices, making entropy and information gain calculations meaningful. (Open-ended answers would be much harder to quantify in terms of uncertainty.) After each question and answer, the evaluation involves the questioner suggesting 10 movie recommendations that it believes the user would like, given the information gathered so far. The answerer (with access to the true user profile) then rates each recommended movie from 1 to 5 stars, and the average rating is recorded. A higher average rating means the recommendations are better aligned with the user’s actual preferences, indicating the questioner has learned more about the user. This Q&A + recommendation cycle is repeated for 5 turns, and the mean rating (with standard error) across 200 simulated users is reported at each turn to gauge performance.
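The per-turn evaluation described above amounts to averaging ratings across simulated users; a small sketch is given below (hypothetical helper name, with the standard error taken as the sample standard deviation over the square root of the number of users).

```python
import statistics

def turn_metric(ratings_per_user):
    """Mean recommendation rating and its standard error at a single turn.

    `ratings_per_user` holds, for each simulated user, the average 1-5 star
    rating the answerer assigned to the 10 movies recommended that turn.
    """
    mean = statistics.fmean(ratings_per_user)
    sem = statistics.stdev(ratings_per_user) / len(ratings_per_user) ** 0.5
    return mean, sem

# Toy example with three simulated users (the paper averages over 200):
print(turn_metric([3.4, 4.1, 3.8]))
```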
Results: In this preference elicitation task, BED-LLM again outperformed the baseline approaches, though the differences were more modest than in the 20-Questions game. When using the same model as both questioner and answerer, BED-LLM began to pull ahead after a couple of questions: by around the 3rd turn, it achieved a higher average recommendation rating than both the naive QA and entropy-based methods. This indicates that the BED-LLM agent was quicker to identify movies the user would enjoy, thanks to more informative questioning. Over the 5 turns, all methods did improve their recommendations (as they learned more about the user with each answer), but BED-LLM consistently maintained the best performance curve for each of the tested LLMs. The paper describes this as an “established advantage by turn n”, implying that while the methods may be close early on, BED-LLM eventually yields noticeably better personalization.
However, unlike in 20-Questions, the entropy method did not significantly outperform the naive strategy on average in this scenario. In fact, the authors note that Entropy performs similarly to Naive QA in the preference elicitation setting. This suggests that simply asking questions that maximize uncertainty in the answer (entropy) is not much more effective than unguided questions when dealing with complex, subjective preferences. One possible reason is that user preference space is high-dimensional and the “naive” LLM might already ask somewhat reasonable broad questions, while the entropy heuristic might not target the most useful aspects of a user’s taste. BED-LLM’s full EIG approach presumably does a better job identifying which question will most reduce uncertainty about the user’s movie likes, leading to more incrementally useful information per question (hence better recommendations sooner).
The team also evaluated the model-misspecification scenario here by pairing different models for the questioner and answerer (for example, Qwen2.5-72B as the questioner vs GPT-4o-mini as the answerer, and vice versa). The results show that BED-LLM and the entropy method remain quite robust even when the questioner’s internal model of the user might not perfectly match the “true” user model (answerer). In these cross-model tests, BED-LLM continued to outperform the baselines, and interestingly the entropy strategy also held up relatively well. In contrast, the naive QA approach was brittle under model mismatch – in at least one pairing, its performance (the average recommendation rating achieved) dropped noticeably when the questioner and answerer were different models. This mirrors the 20-Questions findings: a strategy that relies only on the questioner’s implicit reasoning (naive) can falter when the assumptions about answers are off, whereas an explicit info-gathering strategy can adjust better to surprises. Overall, although the gains in this task were smaller, BED-LLM demonstrated consistently best or equal-best performance in learning user preferences, confirming that its benefits extend beyond simple games to more realistic and subjective domains.
Implications and Future Directions
Main Findings: The results of this paper provide strong evidence that incorporating Bayesian experimental design principles into LLM-driven interactions yields significant improvements in performance for multi-turn information gathering tasks. BED-LLM was able to double or triple the success rates of a naive questioning strategy in the 20-Questions game, and it maintained an edge in a complex preference elicitation scenario as well. These improvements are not just numeric wins; they demonstrate a qualitative leap in capability. Where standard LLMs often get stuck or ask redundant/off-target questions, the BED-LLM approach leads to more efficient and goal-directed questioning, enabling the agent to zero in on the relevant information. The fact that this holds across different underlying models (including smaller open models) suggests the approach is general and not solely reliant on a very powerful LLM – a noteworthy point for making interactive AI more accessible.
Implications for the Field: This work is a step toward more intelligent conversational agents that can actively learn from their users. It addresses a known weakness of current LLMs – the lack of adaptive querying – by combining them with a principled probabilistic strategy. The success of BED-LLM implies that future AI systems (virtual assistants, chatbots, tutoring systems, etc.) can be made substantially more effective by giving them the ability to ask the right questions. Instead of relying purely on massive pretrained knowledge, such systems can interactively gather context-specific information (e.g. a user’s personal needs or a particular problem’s details) and thereby provide more personalized and accurate responses. The paper’s conclusion emphasizes that this is the first approach to use proper EIG (rather than a proxy like entropy) with an LLM’s internal model, and that doing so yields “substantial performance improvements”. In other words, it sets a new benchmark for multi-turn information gathering tasks, demonstrating that a rigorous info-theoretic approach outperforms heuristic strategies. This is likely to influence future research on LLM reasoning and active learning, encouraging others to build on Bayesian methods or improve the modeling of uncertainty within LLMs.
Real-world Applications: The capabilities demonstrated by BED-LLM could enhance a wide range of applications that involve iterative questioning. For example, interactive AI systems in healthcare could ask patients targeted diagnostic questions, improving diagnostic accuracy. Personal assistants and customer service chatbots could clarify user requests with fewer, smarter questions, leading to faster resolution. In education, an AI tutor could adapt to a student’s understanding by asking guiding questions and thus personalize the teaching strategy. The authors specifically note that such adaptive questioning is essential for applications ranging from medical diagnostic assistants and tutoring systems to automated surveys and market research. Below are a few potential application areas:
- Medical and Diagnostic AI: An assistant that asks patients adaptive questions about symptoms and history to converge on likely conditions (improving triage and diagnosis).
- Personalized Tutoring Systems: A tutoring chatbot that figures out a student’s misconceptions by asking probing questions, then provides tailored explanations or exercises.
- Recommendation Systems: Interactive recommenders (for movies, products, etc.) that interview users about their preferences (like the film preference task) and quickly learn to make better suggestions.
- Task Clarification in Assistants: AI assistants (e.g. for coding or home automation) that ask follow-up questions to clarify ambiguous user requests, ensuring the solution fits the user’s intent.
- Market Research & Surveys: AI-driven survey bots that adapt questions based on prior responses to gather consumer preferences or feedback more efficiently than static questionnaires.
- Scientific Discovery: AI agents that help researchers by asking about experimental conditions or hypotheses to systematically narrow down possibilities (an AI scientist that runs optimal “experiments” by querying data or experts).
These examples illustrate how the ability to adaptively gather information can make AI both more powerful and more user-centric. The BED-LLM framework provides a blueprint for developing such systems, as it has shown that even a simulated LLM “agent” can outperform naive approaches in understanding a hidden target or user profile. The next step towards real-world deployment would be to validate this approach with actual human users in the loop. For instance, conducting user studies where an AI agent uses BED-LLM to personalize its interaction in real-time would be an exciting direction. There may be challenges to address (e.g. ensuring that the questions remain natural and the user is comfortable, integrating domain knowledge for specialized fields, etc.), but the reported results are a strong proof-of-concept.
Future Research: The paper opens up several avenues for future work. One technical direction is to further improve how LLMs represent and update their belief state. In the preference elicitation task, the authors had to constrain answers to multiple-choice options to reliably compute uncertainty. Future research could aim to handle free-form answers by developing better ways to extract or approximate uncertainty from an LLM’s open-ended responses (perhaps via calibration or auxiliary models). Additionally, exploring more complex hypothesis spaces is important – e.g. scenarios where the space of possibilities is very large or continuous (the paper kept the hypothesis space implicit and manageable through LLM generation of candidates). Techniques from Bayesian optimization or active learning might be combined with LLMs to extend BED-LLM to such cases. Another direction is to integrate reinforcement learning: since asking questions is akin to a decision-making process, one could train the questioning policy with RL (using the BED-LLM framework as a strong prior or initial policy). The current results already confirm that principled, EIG-driven strategies work well, so learning-based refinements could potentially push performance even further or adapt the system to new domains automatically.
In summary, BED-LLM demonstrates that LLMs augmented with Bayesian experimental design can achieve a new level of interactive intelligence. The quantitative gains in both a structured game and a realistic preference-learning task imply that many AI applications will benefit from this approach. By supporting the theoretical proposal with solid experimental evidence, this paper lays the groundwork for more adaptive, curious AI agents that learn actively from their interactions. The findings encourage a paradigm where large language models are not just knowledge generators, but also knowledge seekers, driving the conversation in order to better serve users or solve problems. This represents a significant advancement in the field of conversational AI and interactive machine learning, with much potential for future development and real-world impact.