The ACX 2026 Prediction Contest started in January. There is $10,000 in prizes. Anyone can enter; they don’t need, for instance, to have registered a Metaculus account in advance. There is no entry fee. The conclusion is obvious.
First, I would assemble a team of people willing to use their Metaculus account (or create one and use that) to input whatever predictions the team decides they should.
I have no idea whether I’m any good at forecasting, never having tried; nor do I expect to be good at convincing a good forecaster to feed me forecasts for a contest. Fortunately, at the time I was first deciding on a strategy, a bot had placed in the top hundred of last year’s leaderboard, and even more fortunately, all its code was open source. I could therefore simply run this bot to get a baseline slate of predictions.
I would assign a few questions to each person on the team to double-check the bot’s work and make sure it hasn’t made any silly errors.
With each person’s account, we would bias the prediction in favour of some particular way that the questions might turn, to add variance to the score of that account.
Some of the team’s accounts would win. (?)
We would split the proceeds.
I initially proposed this in the KWR Discord server. Mae and Jenn heart reacted. KWR had some leftover funding available to spend on API credits. Mission status: go. I recruited six additional people to the cause, so we could have four questions per person. The team consisted of Jasmine, two anonymous birds, Jenn, pi guy, Brent, timerune, Mae, and myself.
The bot which placed most highly in last year’s contest is metac-o1+asknews; metac-o3+asknews and metac-gpt5+asknews, with newer models, have placed highly in other tournaments on Metaculus. We shouldn’t necessarily expect these to be the best bots1, but they are the most likely of my options to be the best bots (that are open-source). They use OpenAI’s o3 and GPT 5.2 LLMs for reasoning and the AskNews news API endpoint as a “researcher”. Unfortunately, AskNews doesn’t let you pay per API credit; we would have needed to pay $250 for any API access at all. I found a way around this which for obvious reasons I will not explicitly state, figuring that 36 calls would not break anyone’s bank. If you are a staff member of either Metaculus or AskNews, and you are worried about this, contact me and I can let you know what we did.
The bot might miss some basic information. The Metaculus 2024 Q4 AI Benchmarking retrospective includes this quote:
Looking at the data — the histogram and the questions above — no glaring or systematic themes jump out to us. Subjectively, pgodzinai’s rationales and forecasts seem to be considering the right factors and reasoning quite well. However, it does seem like the Pros generally have better information retrieval. For example, on the Nvidia market cap question the Pros correctly understood that Nvidia’s current market cap was $3.6T and they forecast 72%, while pgodzinai’s comments indicate an incorrect $1.2T value and a corresponding 3% forecast.
We can therefore possibly improve on the bot just by checking its sources and correcting them if they are gratuitously off (a 3% prediction resolved positively is a huge score penalty, as will become clear later, so if we find even one of these then that represents a very large expected gain in score). Thus, once we had called the AskNews endpoint to generate 36 news reports, I got the team to look over them to check whether there were any sources missing; everyone claimed four questions and looked over the research for those questions. The plan was that we would remove sources that were misleading or irrelevant, and if there were no relevant articles, we would add some. This was a very good idea; there were serious issues with about ten of the reports, and they were fairly easy to fix with a small amount of legwork.
Let’s talk about what determines a player’s winnings. First, each player has a “peer score”, calculated for each question as:

$$S_i = 100\left(\ln p_i - \frac{1}{N-1}\sum_{j \neq i} \ln p_j\right)$$

where $p_i$ is the probability that player $i$ predicted for the outcome which occurred, and the average runs over the other $N-1$ players. Next, a player’s “take” is calculated as:

$$T_i = \max\left(\sum_{q \in Q} S_{i,q},\; 0\right)^{2}$$

where $Q$ is the set of questions. The proportion of the tournament pot which player $i$ wins is then the proportion their “take” was of the sum total take among all players.
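These two formulas are easy to sketch in code (a minimal Python sketch of the binary-question case; the function and variable names are my own):

```python
import math

def peer_scores(probs):
    """Peer scores for one binary question.

    `probs` holds each player's predicted probability for the outcome
    that actually occurred; a player's score is 100 times the difference
    between their log score and the average log score of the other players.
    """
    logs = [math.log(p) for p in probs]
    total, n = sum(logs), len(logs)
    return [100 * (lp - (total - lp) / (n - 1)) for lp in logs]

def take(question_scores):
    """A player's "take": summed peer score, clamped below at zero, squared."""
    return max(sum(question_scores), 0.0) ** 2

# Three players predicted 80%, 50%, and 20% on an outcome that happened;
# peer scores always sum to zero, so the 80% player gains what the others lose.
scores = peer_scores([0.8, 0.5, 0.2])
```

Note that the clamp at zero in the take is doing a lot of work; it is the source of most of the strategy below.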
Metaculus notes:
For a tournament with a sufficiently large number of independent questions, this scoring method is essentially proper. In short, you should predict your true belief on any question.
As given, this is true, with a few caveats. We can see this by noting that the part of a player’s score that they control is just the (base $e$) surprisal of the event, from their perspective, scaled by a factor of 100:

$$S_{i,q} = 100 \ln p_{i,q} + f_{-i}(q)$$

where $f_{-i}(q)$, in the economist’s fashion, is notation for a function depending only on the actions on question $q$ of players who are not $i$. From the perspective of player $i$, the expected value of the part they control is minus (a hundred times) the cross entropy of the predicted distribution with respect to their credence—since the cross entropy is minimised when the two distributions are the same, the maximum expected score is obtained when credences are reported faithfully. The “take” being the square of your score means that you are slightly incentivised to increase variance in your score even if it slightly decreases expected value (because you win more money getting +10 points than you lose getting -10); this isn’t a big deal, though, since the effect is small enough that any significant deviation from your credences reduces your expected squared score as well.
But there is another problem with the “take” calculation. The fact that your score is restricted to be above zero means that you should only honestly report your credence if you expect your score to be positive under either outcome of each event. Otherwise, for example, if missing some prediction would in itself make your score negative, then you ought to predict a more extreme probability (since the outcome if you predict wrongly is that your score is even more negative, which does not impact your winnings, whereas the outcome if you predict correctly is a slightly higher score, which does). This combines unfortunately (or fortunately, for the purposes of shenanigans) with this footnote:
Note: to limit administrative costs, tournaments that end after 1st of June, 2025 no longer give prizes below 50$, and the remaining money is redistributed to forecasters with higher scores. Tournaments that end before 1st of June, 2025 still do not give prizes below 10$.
This massively changes the entire game. It makes it so that the “zero” point, the point below which you are completely indifferent to your score, is potentially very high—for most players, above their median expected score. In the 2025 ACX prediction contest, as of the time of this writing, only 149 people out of three thousand have scored highly enough that they are winning anything at all. This means that unless you are confident that you will place comfortably at the top of the leaderboard, your chance of winning anything at all matters much more than your expected score—your expected score is almost irrelevant, and can be arbitrarily low in a winnings-maximising strategy.
In order to illustrate this, let’s work out explicitly what a player’s expected winnings are in a hypothetical contest. For simplicity, we’ll assume that the outcome of each question is independent. Let’s suppose that in a tournament of twenty questions we expect each to happen with probability 0.5. Assume that if we honestly predict each question then we expect our score with respect to the minimum winning threshold to be normally distributed (note that the distribution here is over the scores of other players—our part of the score is always the same, because our 50% predictions make our portion of the score evaluate the same regardless of how each question resolves). We consider two strategies: above, if we honestly report our 50% credences; below, if we honestly report our 50% credences for every question except one, which we predict as 100% likely:
Our expected winnings are roughly the size of the shaded portion (the “take” increases somewhat with higher score, but not as much as it would if the y-axis were truly aligned with zero score; more on this later); note that the bell curve in the right graph is half as tall, since half of the time (whenever we guessed wrong) our score is negative infinity. But even with this penalty, we’ve increased the amount we expect to win by roughly six times! Also, note that we could have predicted that one question as 0% and gotten the same expected payoff; this will be relevant later. Even though the trade “gain 70 points with 50% probability, lose infinite points with 50% probability” is bad for our expected score, it’s very good for our expected winnings.
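To make the comparison concrete, here is a small Python sketch of the same calculation. The honest-baseline distribution N(−100, 50²) is a number I made up purely for illustration (the graphs use their own, unstated parameters), and the roughly 70-point gain comes from the 99% clamp:

```python
import math

def win_prob(mu, sigma):
    """P(X > 0) for X ~ Normal(mu, sigma^2): our chance of clearing the threshold."""
    return 0.5 * math.erfc(-mu / (sigma * math.sqrt(2)))

# Assumed distribution of honest-strategy score minus the prize threshold.
MU, SIGMA = -100.0, 50.0

honest = win_prob(MU, SIGMA)

# Predict one 50% question at 99% instead: gain 100*ln(0.99/0.5) ~ 68 points
# when right (half the time), an unrecoverably negative score when wrong.
gain = 100 * math.log(0.99 / 0.5)
extreme = 0.5 * win_prob(MU + gain, SIGMA)
# Under these made-up numbers the extreme slate clears the threshold
# several times as often as the honest one.
```

The exact multiplier depends entirely on where the baseline sits relative to the threshold, which is why the later strategy changes once the bot climbs the leaderboard.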
Let’s formally evaluate our expected winnings as a function of our strategy. Let’s call the prediction we make for question $q$ $p_q$ (or, with respect to some particular question, simply $p$), and our real credence that the question will resolve positively $c_q$ ($c$). If we ignore the contribution of other players, and the scaling factor, we get a contribution of $\ln p$ if a question is resolved to yes, with probability $c$, and $\ln(1-p)$ if it is resolved to no, with probability $1-c$. The mean of the random variable (RV) representing a single question’s score is then $c \ln p + (1-c)\ln(1-p)$, and the variance is $c(1-c)\ln^2\frac{p}{1-p}$.
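The single-question mean and variance can be checked directly against the two-point distribution (a throwaway sketch; the values of c and p are arbitrary):

```python
import math

def score_moments(c, p):
    """Mean and variance of one question's (unscaled) log score, when we
    predict p and hold true credence c.

    Mean: c*ln(p) + (1-c)*ln(1-p).  Variance: c(1-c)*ln(p/(1-p))^2,
    the standard two-point-distribution variance c(1-c)(a-b)^2.
    """
    mean = c * math.log(p) + (1 - c) * math.log(1 - p)
    var = c * (1 - c) * math.log(p / (1 - p)) ** 2
    return mean, var

# Compare against the raw second-moment computation E[X^2] - E[X]^2:
c, p = 0.7, 0.9
mean, var = score_moments(c, p)
second_moment = c * math.log(p) ** 2 + (1 - c) * math.log(1 - p) ** 2
```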
We can use a variant of the central limit theorem to approximate the score as a normal distribution, since (with Metaculus clamping our predictions between 1% and 99%) the series of RVs representing each question’s score satisfies Lyapunov’s condition. If we do this, then we estimate our total score as being distributed according to a normal distribution with the following properties:

$$\mu = \sum_q \left[c_q \ln p_q + (1-c_q)\ln(1-p_q)\right], \qquad \sigma^2 = \sum_q c_q(1-c_q)\ln^2\frac{p_q}{1-p_q}$$
Again, the mean is the (inverse of the) cross-entropy $-H(c, p)$, where $p$ is our predicted distribution over all questions and $c$ is our true expected distribution. This mean isn’t directly meaningful (ha!); what’s important is the difference between the mean and the minimum score required to receive any winnings. If we have an estimate of what this difference is under an honest strategy, then we can evaluate the drop in mean as the relative entropy, $D_{\mathrm{KL}}(c \parallel p)$, of our predicted distribution with respect to our expected distribution. Next, we can calculate our chance of winning as one minus the CDF, evaluated at zero, of the resulting distribution of our score relative to the threshold. To optimise, we fortunately don’t need to explicitly evaluate this; we can just ensure that our mean over our standard deviation is as high as possible. For example, in our hypothetical twenty-question contest, the quantity we wish to maximise is:

$$\frac{\mu_0 - D_{\mathrm{KL}}(c \parallel p)}{\sqrt{\sigma_0^2 + \sum_{q=1}^{20} c_q(1-c_q)\ln^2\frac{p_q}{1-p_q}}}$$

where $\mu_0$ and $\sigma_0^2$ are the mean and variance, under the honest strategy, of our score relative to the winning threshold.
Maybe I could come up with a closed-form solution to this with some effort, but fortunately I don’t have to; I can just chuck it into R’s optim function and not worry about it. I get that an optimal strategy is to predict about 60% for all questions; this gives a chance of winning of about 16%.
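I don’t have the original R code to hand, but the optimisation is easy to reproduce in outline. The sketch below is Python rather than R, restricts all twenty predictions to a single shared value p (by symmetry), and uses honest-baseline parameters μ₀ = −100, σ₀ = 50 that I invented; it will not reproduce the exact 60%/16% figures above, which depend on the unstated baseline distribution:

```python
import math

# Invented honest-baseline distribution of score minus the prize threshold.
MU0, SIGMA0 = -100.0, 50.0
N_QUESTIONS = 20

def objective(p):
    """Mean-over-standard-deviation of (score - threshold), in peer-score
    points, when every 50%-credence question is predicted as p."""
    # Drop in mean: 100 * KL(0.5 || p) per question.
    kl = 100 * (0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / (1 - p)))
    # Added variance: 100^2 * c(1-c) * ln^2(p/(1-p)) per question, c = 0.5.
    var = (100 * math.log(p / (1 - p))) ** 2 / 4
    mu = MU0 - N_QUESTIONS * kl
    sigma = math.sqrt(SIGMA0 ** 2 + N_QUESTIONS * var)
    return mu / sigma

# A grid search over the clamped range stands in for R's optim():
best = max((k / 1000 for k in range(500, 991)), key=objective)
win_chance = 0.5 * math.erfc(-objective(best) / math.sqrt(2))
```

With these parameters the optimum again lands a little above 60%, which is reassuring, though the winning probability comes out lower than 16%.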
Note that this approximation is not all that good for small numbers of questions, especially as predictions get more extreme, although it will work well for contests with large numbers of questions. For instance, in this particular scenario, we can get a better chance of winning (about 20%) by guessing two questions as 99% probable and leaving the rest as 50%. It also won’t work for questions which are numeric or categorical. For our final strategy, I planned to use a Monte-Carlo estimator after finding some initial conditions using this method; this would also have the benefit that we could relax our assumptions and make use of the correlations between questions.
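A Monte-Carlo version is barely more code. This sketch (mine, not the contest code) assumes independent binary questions and a normally distributed prize threshold; sampling the outcome vector jointly is where correlations would slot in:

```python
import math
import random

def mc_win_prob(preds, credences, thr_mu, thr_sigma, trials=100_000, seed=0):
    """Estimate P(our raw log score beats the prize threshold) by simulation.

    Outcomes are sampled independently from our credences; to model
    correlated questions, sample the outcome vector jointly instead.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        score = 0.0
        for p, c in zip(preds, credences):
            happened = rng.random() < c
            score += 100 * math.log(p if happened else 1 - p)
        if score > rng.gauss(thr_mu, thr_sigma):
            wins += 1
    return wins / trials

# Honest 50% slate on twenty questions, against a threshold assumed to sit
# 100 points above our (deterministic) honest score, with spread 50:
honest_score = 2000 * math.log(0.5)
p_win = mc_win_prob([0.5] * 20, [0.5] * 20, honest_score + 100, 50)
```

The same estimator works unchanged for extreme slates, where the normal approximation breaks down.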
At least, this is what I was thinking when the contest was announced; metac-o1+asknews was resting around the prize boundary on the leaderboard. At around the time the team had submitted all the revised research reports, enough additional questions were resolved that the bot had made it to fifth place (now that the 2025 tournament has completely resolved, it is at eleventh place). This completely changes the optimal strategy. When I expected that our “true” predictions would place slightly below the payoff threshold (since when we pick the highest-placing bot we should expect that it also got lucky), the winnings-maximising strategy was to increase variance to ensure the highest chance of meeting the threshold. Now, I expected that our true predictions would place above the prize threshold by themselves. If this is the case, then the winnings-maximising strategy can involve reducing variance—you should be willing to sacrifice some points in order to reduce the chance that you score below the threshold. Unfortunately, I care more about winning than I do about maximising expected payout, so I didn’t do this; I decided that the new goal was to have as high a chance as possible of getting first place, instead of the previous goal of getting as much prize money as possible.
In any case, I ran the bot with two models—o3 and GPT 5.2—on the research reports that the team had revised. Next, each member of the team looked over the bot reports for their assigned questions and decided on a prediction for the team to use for the baseline. For most questions, this was one of the two bot predictions, but for some the person with the question decided that the bot reasoning was terrible and overruled them. I also adjusted our prediction for one question, “What percent of the top 5 human average score will the best bot score in ACX 2026?” to a fairly narrow range between 80 and 90 percent, for strategic reasons. If the best bot has a very good score, then we will probably do very well, so we can afford to lose some points on that question. Likewise, if the best bot does terribly, then we haven’t got any chance at all of winning, so there’s no point trying to get points back on one question. In the same vein, I considered strategically predicting a low value for “What will be the price of Bitcoin at the end of 2026?” and purchasing a very small amount of Bitcoin as a hedge, but decided against it.
Let’s talk about the Metaculus tournament rules. It is not clear to me whether participating as a team in this way is allowed. The rules say:
You may enter a Competition only once. Entering more than once will void all entries and result in disqualification. You may not enter more than once by using multiple email addresses, identities, or devices in an attempt to circumvent this prohibition. Unless otherwise specifically permitted on the Competition-specific rules, Participant Users are not allowed to share accounts or participate as a team under one account – forecasts must be your own work.
The phrase “…as a team under one account” sounds like it should make this legal, as the exception makes the rule; the phrase “forecasts must be your own work” sounds like it should make it illegal. In any case, we’ve told Scott Alexander what we’re doing and intend to ask Metaculus about it if we would win any prize money.
More dubiously, there are a couple of collusive strategies that we aren’t using but which would work in some situations. For example, if you expect to place below the prize threshold, then you could lower the prize threshold by entering with accounts which predict in a way such as to minimise their score. If you do this, you boost the point total of every other player, reduce the difference in winnings between players with the same absolute difference in score, and ensure that more people win (since money is distributed away from the top scorers). Obviously, we didn’t plan on doing this, and it turned out not to be relevant anyway since our expected score is too high. Metaculus doesn’t have any rules against collusion, I assume because they think that their scoring rule is resistant to it; presumably if anyone actually did this they would disqualify them, remove the artificially low-scoring accounts from affecting the peer score, and add a rule against it.
I should comment on why I recruited a team in the first place, because nothing we actually did required anyone else (I would have had to spend some extra hours checking over questions, but I was already spending much longer arranging everything: rewriting the bot to use research on disk instead of API calls, formatting things in human- and machine-readable formats, and so on). Since we’re splitting any money we make evenly, and introducing other people just means that there’s a chance that Metaculus disqualifies us all, why not just enter once?
The answer is simple: I don’t want a chance of winning, I want to win. This necessitates having seven other people to win with.
At this point, disaster struck. On the evening of January 16th, I was still waiting for a few people to confirm the predictions for their assigned questions. The Metaculus tournament UI listed January 18th as the date that predictions would be registered for scoring, and this is the date that I was planning for. I figured that I would pester people to finally get everything in on the 17th, write my simulation code using the completed slate on the 17th and 18th, and get the team to input their customised slates on the evening of the 18th. I double-checked the tournament details, and it turned out that it actually ended on the 17th; it was displaying as the 18th because it ended at midnight Pacific time, and the UI was accounting for my being in EST. I spent the day of the 17th locking in on Jenn’s couch (I was travelling for unrelated reasons), pinging people to get their predictions to me and writing incredibly shitty makeshift simulation code to get a very basic idea of how far we would need to bias things to get a reasonable shot at 1st place (while minimising the loss to our expected score, since the more we deviated from our “true” predictions the less money we would win in expectation). The way I was originally intending to do things was to produce a set of correlations between each pair of questions using more LLM labour, use those to simulate outcomes, then do stochastic gradient descent on an MC simulation of scores (with the function parameters being the inputted predictions of each account, and the loss function being the probability that no account gets points above the baseline slate). This would have taken hours which I did not have.
Instead, I selected three groups of questions that ought to be somewhat correlated—roughly fitting into the categories “AI progress”, “economy”, and “US politics”—and tried to eyeball how far we needed to go to get a 200-point swing if all three of them turned in a certain direction, using some rudimentary simulation code to figure out how often this happened. Then, I biased each of eight prediction slates in one of the eight possible directions, producing, for instance, one slate which is optimistic on all three, one slate which is optimistic on the economy and US politics and pessimistic on AI progress, and so on. At about midnight I was done, and started producing slates of predictions for each team member (save one, pi guy, who was busy and who had input the “baseline” predictions earlier in the day). Thankfully, everyone had stayed up and put in their predictions on time, and we had finished all nine accounts’ predictions by 1:30, an hour and a half before the deadline.
At the time of writing, the community predictions have been released. In retrospect, both of the questions we moved the most on (Supreme Court composition change and Israel/Saudi Arabia relationship normalisation, both significantly downwards) have community predictions closer to the bot prediction; time will tell if “correcting” them was a mistake. Otherwise, our “baseline” predictions are very similar to the community predictions. Last year, the community prediction was 150 points above the prize threshold but 400 points short of the winning account. It is thus very likely that at least half of the team accounts will also be above the prize threshold, but less likely that we will win first place, assuming that the community predictions are the “true” likelihoods of the events occurring. I expect that we will most likely have a single account in the top ten and win about $300 (if Metaculus grants us prize money), totalling around 35 dollars per team member. This wasn’t even slightly worth it for me, of course, since I spent about twenty hours on this all told, but each other member put in an average of around an hour of work and will probably make a decent return. For my part, I’ll consider it worth it if we take a spot in the top three, even if we are disqualified from receiving prize money, because I like winning.
This was fun, so I don’t regret spending time on it, but unfortunately with the amount of prize money Metaculus hands out for tournaments like these and the number of entrants, it’s just not economical to do any effortful strategy unless you already know that you’re an extremely good forecaster. (Of course, one might have motives other than the cash prizes.) The strategy of blindly following a bot would be good except for the fact that all placing bots use the AskNews API, and that’s very expensive to get access to. In retrospect, the strategy which maximises net income per unit time is simply running a single bot with Perplexity or Exa, exerting the minimum amount of human effort necessary to satisfy the Metaculus rules about predictions being based on “[one’s] own understanding”, and fudging towards extreme values enough to place above the winning threshold if you get lucky. This isn’t very fun, though, and feels like it’s even more clearly against the spirit of the game than making a team. It seems difficult to avoid people doing this, though! I guess it’s just a fundamental issue with open contests that have a potentially winning strategy that involves very little effort, and the only realistic solution is either closed contests or trying to ensure that nobody knows that this strategy exists. (Uh, pursuant to this, I will be removing these lines from the public version of this post. Better not to make it public knowledge!)