I don't entirely understand what the two models mean here, because typically the search strategy (or acquisition function) in Bayesian optimization - which in your case seems to be some form of Entropy Search (ES) - decides the explore-vs-exploit tradeoff on its own (possibly with some additional hyperparameters, of course). For example, ES would resolve this tradeoff one way, Expected Improvement (EI) would do it differently, etc. - all in the service of the objective you want to maximize (or minimize).
Assuming that by "exploitation" you mean this objective, which here is based on the model performing well, wouldn't the search simply pick queries that the model can (or is likely to) answer correctly? That would give a very optimistic evaluation of the LLM.
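To make the first point concrete: here is a minimal sketch (not your setup - the function and variable names are illustrative) of how a single acquisition function like EI already encodes both exploration and exploitation in one formula, so no separate "explore model" is needed.

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization at one candidate point.

    mu, sigma: posterior mean and std of the surrogate at that point.
    best: best objective value observed so far.
    xi: optional exploration margin (one of the 'additional hyperparams').

    The (mu - best) term rewards exploitation; the sigma term rewards
    exploration - both raise EI, so the tradeoff is handled internally.
    """
    if sigma <= 0.0:
        return 0.0  # no posterior uncertainty -> no expected improvement
    z = (mu - best - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal pdf
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal cdf
    return (mu - best - xi) * cdf + sigma * pdf

# Same posterior mean, more uncertainty -> higher EI (exploration),
# and same uncertainty, higher mean -> higher EI (exploitation).
```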