Four models are better than three

The ensemble that powers this site just went from three models to four. The new member is a small neural network. It is not the strongest model in the group, but it makes the group stronger. Here is what happened.

What changed

Every probability on this site comes from averaging the predictions of independent models. Until today, there were three:

Elo converts team ratings into match probabilities using the standard logistic formula. Simple, battle-tested, no parameters to fit beyond the ratings themselves.
Dixon-Coles estimates attack and defence strengths from historical goal counts, then generates probabilities from Poisson-distributed home and away goal counts with a low-score correction. The workhorse.
Hierarchical Poisson fits the same goal-rate structure as Dixon-Coles but uses Bayesian MCMC sampling (PyMC NUTS), which produces full posterior distributions rather than point estimates.
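As an illustration of how a Poisson goal model turns two goal rates into match probabilities, here is a minimal sketch of the published Dixon-Coles low-score correction. The function names, the goal rates (1.5 and 1.1), the value of rho and the 10-goal cutoff are illustrative assumptions, not this site's fitted values or code.

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate**k / math.factorial(k)

def low_score_factor(h: int, a: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles adjustment; only 0-0, 0-1, 1-0 and 1-1 are changed."""
    if h == 0 and a == 0:
        return 1.0 - lam * mu * rho
    if h == 0 and a == 1:
        return 1.0 + lam * rho
    if h == 1 and a == 0:
        return 1.0 + mu * rho
    if h == 1 and a == 1:
        return 1.0 - rho
    return 1.0

def outcome_probs(lam: float, mu: float, rho: float, max_goals: int = 10):
    """Sum scoreline probabilities into (home win, draw, away win)."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = (poisson_pmf(h, lam) * poisson_pmf(a, mu)
                 * low_score_factor(h, a, lam, mu, rho))
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise the truncated score grid
    return home / total, draw / total, away / total

# Hypothetical inputs: home rate 1.5, away rate 1.1, rho = -0.1.
p_home, p_draw, p_away = outcome_probs(lam=1.5, mu=1.1, rho=-0.1)
```

With a negative rho, the correction moves probability towards 0-0 and 1-1, so draws become slightly more likely than under plain independent Poisson counts.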

The new fourth model:

Neural Poisson replaces the linear log-rate formula in Dixon-Coles with a 2-layer feedforward network (32 hidden units, softplus activation). Each team gets a dense embedding vector instead of just two scalars (attack, defence). The second hidden layer can learn nonlinear interactions between home and away embeddings. The Poisson likelihood and the Dixon-Coles low-score correction are retained.
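A rough sketch of the forward pass described above, assuming two hidden layers of 32 units each, an embedding size of 8, and home and away embeddings concatenated as the input. All names, the embedding size and the random weights are placeholders for illustration; the real model is initialised from the Dixon-Coles solution rather than at random.

```python
import math
import random

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dense(x, weights, bias):
    """One affine layer: weights @ x + bias, written with plain lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_params(emb_dim: int = 8, hidden: int = 32, seed: int = 0):
    """Placeholder random weights (an assumption, not the site's initialisation)."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W1": mat(hidden, 2 * emb_dim), "b1": [0.0] * hidden,
        "W2": mat(hidden, hidden), "b2": [0.0] * hidden,
        "W3": mat(2, hidden), "b3": [0.0, 0.0],
    }

def goal_rates(home_emb, away_emb, params):
    """Map two team embeddings to (home rate, away rate)."""
    x = list(home_emb) + list(away_emb)
    h1 = [softplus(v) for v in dense(x, params["W1"], params["b1"])]
    h2 = [softplus(v) for v in dense(h1, params["W2"], params["b2"])]
    log_home, log_away = dense(h2, params["W3"], params["b3"])
    # Exponentiating the outputs keeps both Poisson rates positive.
    return math.exp(log_home), math.exp(log_away)

params = init_params()
rng = random.Random(1)
home_emb = [rng.gauss(0.0, 1.0) for _ in range(8)]
away_emb = [rng.gauss(0.0, 1.0) for _ in range(8)]
home_rate, away_rate = goal_rates(home_emb, away_emb, params)
```

The two rates would then feed the same Poisson likelihood and low-score correction as in Dixon-Coles.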

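The model is initialised from the Dixon-Coles solution, and L2 regularisation keeps it from departing unless the data supports it. Because the optimiser starts at the Dixon-Coles fit, the training objective can only improve from there, so the network cannot get stuck at a worse solution than the linear one. That guarantee applies to the training fit only, not to out-of-sample accuracy.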

The validation gate

No model enters the ensemble without passing a walk-forward backtest. The protocol:

8 folds, 90 days each, no overlap
Each fold refits all models on data strictly before the evaluation window
The gate requires the 4-model ensemble to have a strictly lower median Brier score than the 3-model ensemble, and a calibration error (ECE) no more than 0.2 percentage points higher
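The protocol above could be sketched roughly as follows. The function names and the end date are hypothetical, and the sketch assumes the gate compares median ECE in the same way it compares median Brier; the model refitting itself is left out.

```python
from datetime import date, timedelta

def walk_forward_windows(end: date, n_folds: int = 8, fold_days: int = 90):
    """Return (start, stop) evaluation windows, oldest first; stop is exclusive."""
    windows = []
    for i in range(n_folds, 0, -1):
        start = end - timedelta(days=i * fold_days)
        windows.append((start, start + timedelta(days=fold_days)))
    return windows

def passes_gate(median_brier_old: float, median_brier_new: float,
                median_ece_old_pp: float, median_ece_new_pp: float,
                ece_tolerance_pp: float = 0.2) -> bool:
    """Strictly lower Brier, and ECE at most `ece_tolerance_pp` worse."""
    brier_ok = median_brier_new < median_brier_old
    ece_ok = median_ece_new_pp <= median_ece_old_pp + ece_tolerance_pp
    return brier_ok and ece_ok

# Hypothetical end date; each fold trains only on matches dated before `start`.
windows = walk_forward_windows(end=date(2024, 1, 1))
for start, stop in windows:
    train_cutoff = start  # fit on data strictly before this date
```

Consecutive windows share a boundary date but do not overlap, because each stop date is exclusive.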

Results:

Metric	3-model ensemble	4-model ensemble
Median Brier	0.4860	0.4844
Median ECE	6.52pp	5.78pp

Both criteria pass. The 4-model ensemble improved Brier in 6 of the 8 folds.
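For readers unfamiliar with the metric, here is a minimal sketch of a Brier score for a three-way outcome. It assumes the multi-class form, which sums squared errors over home win, draw and away win; scores near 0.49 are consistent with that form, but the exact definition used on this site is not stated here.

```python
def brier_score(probs, outcome_index: int) -> float:
    """Sum of squared errors between forecast and the one-hot result."""
    return sum((p - (1.0 if i == outcome_index else 0.0)) ** 2
               for i, p in enumerate(probs))

def mean_brier(forecasts, outcomes) -> float:
    """Average Brier score over a list of matches."""
    scores = [brier_score(p, o) for p, o in zip(forecasts, outcomes)]
    return sum(scores) / len(scores)

# Outcome indices: 0 = home win, 1 = draw, 2 = away win.
uninformed = brier_score((1 / 3, 1 / 3, 1 / 3), 0)
perfect = brier_score((1.0, 0.0, 0.0), 0)
```

Under this definition a perfect forecast scores 0 and a forecast of one third for every outcome scores about 0.667, so lower is better.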

Why it works (and why the margin is small)

The Neural Poisson model is one of the weaker standalone models in the ensemble. Its individual Brier score (0.5055) is worse than Dixon-Coles (0.4981) and the hierarchical Poisson (0.5039), and only slightly better than Elo (0.5137).

So why does adding it help?

Because its errors are different. When the existing three models agree on a prediction, they tend to be wrong in the same direction (they share the same linear-additive log-rate assumption). The neural network disagrees with them 12-14% of the time, and those disagreements carry signal. Averaging in a model whose errors are only partly correlated with the others reduces the variance of the ensemble even when that model is individually weaker.
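The variance argument can be made concrete with a small calculation. The standard deviations and correlations below are invented for illustration and are not measured from these models; the function name is also hypothetical.

```python
def variance_of_average(sds, corr) -> float:
    """Variance of an equal-weight average of errors.

    sds[i] is the error standard deviation of model i and corr[i][j]
    the correlation between the errors of models i and j.
    """
    n = len(sds)
    total = sum(corr[i][j] * sds[i] * sds[j]
                for i in range(n) for j in range(n))
    return total / n**2

# Three similar models: equal noise, errors correlated at 0.9 (made-up values).
sds3 = [1.0, 1.0, 1.0]
corr3 = [[1.0, 0.9, 0.9],
         [0.9, 1.0, 0.9],
         [0.9, 0.9, 1.0]]

# A noisier fourth model whose errors correlate only 0.6 with the others.
sds4 = sds3 + [1.1]
corr4 = [row + [0.6] for row in corr3] + [[0.6, 0.6, 0.6, 1.0]]

var3 = variance_of_average(sds3, corr3)
var4 = variance_of_average(sds4, corr4)
```

In this toy setup the four-model average has lower error variance than the three-model one, even though the added model is noisier on its own. The effect shrinks as the fourth model's correlation with the others approaches theirs with each other.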

This is the classic ensemble result: diversity beats individual accuracy, up to a point. The margin is small (16 basis points on Brier, 0.4860 to 0.4844) because the neural network is only modestly decorrelated from the others. All four models see the same match data, the three goal-based models share the same Poisson likelihood, and the neural network starts from the Dixon-Coles solution. The decorrelation comes entirely from the model class, not from different data or different objectives.

What it took to get right

The first version used ReLU activations and failed the gate badly, making the ensemble 48 basis points worse. The problem was the optimiser. We use scipy's L-BFGS-B, a quasi-Newton method that expects smooth gradients. ReLU has a discontinuous gradient at zero. The optimiser couldn't converge: gradient norms oscillated between 110 and 600 across iterations instead of decreasing. More iterations just overfit the training data.

Switching to softplus (a smooth approximation of ReLU) fixed the convergence. The derivative of softplus is the sigmoid function, which is smooth everywhere, so L-BFGS-B gets the continuous gradients it expects. After tuning the L2 penalty (0.003) and iteration count (2000), the model passed the gate.
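The difference between the two activations near zero is easy to check numerically. This is a standalone sketch with illustrative function names, not code from the model.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x: float) -> float:
    """Derivative of softplus, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu_grad(x: float) -> float:
    """Derivative of ReLU: a step function that jumps at zero."""
    return 1.0 if x > 0 else 0.0

def numeric_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# Gradient change when the input crosses zero by a tiny amount.
relu_jump = relu_grad(1e-9) - relu_grad(-1e-9)
softplus_jump = sigmoid(1e-9) - sigmoid(-1e-9)
```

The ReLU gradient jumps by a full unit across zero, while the softplus gradient barely moves. A quasi-Newton method builds its curvature estimate from successive gradient differences, so jumps of that kind can mislead it.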

What this means for the probabilities

Not much, match by match. For most fixtures, the four-model ensemble shifts predictions by 0.5-2 percentage points compared to the old three-model version. Occasionally the shift is larger when the neural network has a strong view that differs from the linear models.
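To show the size of shift involved, here is a sketch that averages made-up probabilities for a single fixture with and without a fourth model. It assumes an equal-weight average, and none of the numbers come from the site's models.

```python
def ensemble_average(predictions):
    """Equal-weight average of (home, draw, away) probability triples."""
    n = len(predictions)
    return tuple(sum(p[k] for p in predictions) / n for k in range(3))

# Hypothetical per-model probabilities for one fixture.
elo = (0.50, 0.27, 0.23)
dixon_coles = (0.48, 0.28, 0.24)
hierarchical = (0.49, 0.27, 0.24)
neural = (0.44, 0.29, 0.27)

avg3 = ensemble_average([elo, dixon_coles, hierarchical])
avg4 = ensemble_average([elo, dixon_coles, hierarchical, neural])

# Shift in percentage points for home win, draw, away win.
shift_pp = tuple(100.0 * (new - old) for new, old in zip(avg4, avg3))
```

Even though the invented neural forecast sits 4 to 6 points away from the others on the home win, the averaged home-win probability moves by only about 1.25 points, because the new model carries a quarter of the weight.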

The improvement shows up over many matches, not on any single one. Calibration gets tighter: the probabilities better match what actually happens.