Primer – How does the Stocky work?

The Stocky, which is short for stochastic simulation, is a Monte Carlo simulation of the season using ELO modelling to work out what the outcome of that season might be.

The basic premise of a Monte Carlo simulation is that if you have a few pieces of the puzzle, an idea of how they relate and then throw enough random numbers at it, you’ll get a pretty good idea of what the puzzle picture is.

Let’s say you have a circle inside a square with sides the same length as the circle’s diameter. Then throw a bunch of sand onto the square/circle combination and count how many grains of sand end up in the circle. If you know the length of the square’s side and the proportion of sand that ends up in the circle, you can work out a value for π.

(You want more detail? Fine: the side of the square gives you the area of the square. Multiply that by the proportion of sand inside the circle and you have an estimate of the circle’s area. Divide that estimate by the square of half the square’s side length, i.e. the radius squared, and you get an estimate of π.)

The more grains of sand you throw at the square/circle, the closer the estimate will be to the actual answer.
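For the curious, here is a minimal Python sketch of the sand-throwing experiment. The function name, the side length of 2 and the fixed seed are my own choices for illustration; any side length works because it cancels out.

```python
import random

def estimate_pi(grains, seed=0):
    """Estimate pi by throwing `grains` random points at a square of
    side 2 that has a circle of radius 1 inscribed in it."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    side = 2.0
    radius = side / 2
    inside = 0
    for _ in range(grains):
        # Each grain lands uniformly at random somewhere on the square.
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    proportion = inside / grains
    circle_area = proportion * side * side   # share of the square's area
    return circle_area / radius ** 2          # area = pi * r^2, so divide by r^2

for grains in (100, 10_000, 100_000):
    print(grains, estimate_pi(grains))
```

The printed estimates should generally wander closer to 3.14159 as the number of grains goes up, although any single run can get lucky or unlucky.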

To apply this to rugby league, the outcome of each game is determined by a random number generator. The distribution of random numbers is based on the actual calculated results from the ELO models (remember that scorelines can be converted to percentages). The outputs of the number generator follow a normal distribution with a mean of 0.55 and a standard deviation of 0.23.

If the number generated is within the expectancy of the favourite as derived from the two teams’ ratings, the favourite is awarded the win. Repeat this process through the 190 regular season games and you have a season of results to analyse. But this is just one way the season could go and the odds that it will match the way the actual season goes are vanishingly small.
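A rough sketch of how a single game might be decided is below. Two things here are assumptions rather than a description of the actual Stocky: the expectancy uses the standard Elo formula with a 400-point scale, and "within the expectancy" is read as "the random draw is no bigger than the favourite's expectancy". The team names and ratings are made up.

```python
import random

def elo_expectancy(rating_a, rating_b):
    # Standard Elo expected score for A against B (assumed formula).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def simulate_game(team_a, team_b, ratings, rng, mean=0.55, sd=0.23):
    """Return the name of the simulated winner of one game."""
    exp_a = elo_expectancy(ratings[team_a], ratings[team_b])
    # Work out which side is the favourite and what its expectancy is.
    if exp_a >= 0.5:
        favourite, underdog, fav_exp = team_a, team_b, exp_a
    else:
        favourite, underdog, fav_exp = team_b, team_a, 1.0 - exp_a
    # Draw from the normal distribution described above (mean 0.55, sd 0.23).
    draw = rng.gauss(mean, sd)
    # Assumed reading: the favourite wins when the draw falls within its expectancy.
    return favourite if draw <= fav_exp else underdog

# Hypothetical ratings, purely for illustration.
ratings = {"Favourites": 1600, "Underdogs": 1400}
rng = random.Random(1)
results = [simulate_game("Favourites", "Underdogs", ratings, rng) for _ in range(1000)]
print(results.count("Favourites") / 1000)   # share of games won by the favourite
```

Looping the same function over every fixture in the draw, and tallying the winners, gives one simulated season.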

So we throw 10,000 grains of sand at the season: the whole season is re-simulated 10,000 times, using different random numbers each time, which gives us an idea of which pathways are more likely than others. We take the average of the results and use that for the analysis. Perhaps contrary to expectation, patterns do emerge and it’s possible to output estimates for:

  • End of season ratings for each team
  • Projected number of wins
  • Odds of a team hitting the minor premiership, making the finals or getting the wooden spoon
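To show how those estimates fall out of the repetition, here is a toy version with four hypothetical teams playing each other home and away. It assumes the standard Elo expectancy formula, keeps ratings fixed all season (so it does not produce end-of-season ratings) and breaks ladder ties at random. None of that is claimed to match the real Stocky; the names `run_stocky` and `finals_spots` are mine.

```python
import random
from itertools import permutations

def win_probability(rating_a, rating_b):
    # Standard Elo expectancy for A against B (an assumption).
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def simulate_season(ratings, fixtures, rng):
    """Play every fixture once and return a {team: wins} tally."""
    wins = dict.fromkeys(ratings, 0)
    for a, b in fixtures:
        p_a = win_probability(ratings[a], ratings[b])
        fav, dog, p_fav = (a, b, p_a) if p_a >= 0.5 else (b, a, 1.0 - p_a)
        # Normal draw with mean 0.55 and sd 0.23, as described earlier.
        winner = fav if rng.gauss(0.55, 0.23) <= p_fav else dog
        wins[winner] += 1
    return wins

def run_stocky(ratings, fixtures, runs=10_000, finals_spots=2, seed=0):
    """Average many simulated seasons into projected wins and finals odds."""
    rng = random.Random(seed)
    total_wins = dict.fromkeys(ratings, 0)
    finals = dict.fromkeys(ratings, 0)
    for _ in range(runs):
        wins = simulate_season(ratings, fixtures, rng)
        # Ladder: most wins first, ties broken at random (a simplification).
        ladder = sorted(ratings, key=lambda t: (wins[t], rng.random()), reverse=True)
        for team in ratings:
            total_wins[team] += wins[team]
        for team in ladder[:finals_spots]:
            finals[team] += 1
    projected = {t: total_wins[t] / runs for t in ratings}
    finals_odds = {t: finals[t] / runs for t in ratings}
    return projected, finals_odds

# Hypothetical four-team league, each pair meeting home and away.
ratings = {"A": 1560, "B": 1520, "C": 1480, "D": 1440}
fixtures = list(permutations(ratings, 2))
projected, finals_odds = run_stocky(ratings, fixtures)
for team in ratings:
    print(team, round(projected[team], 2), round(finals_odds[team], 3))
```

Minor premiership and wooden spoon odds would come out the same way, by counting how often each team finishes first or last on the simulated ladder.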

These are all things we want to know but they need to be tempered with the knowledge that the outputs reflect some fundamental assumptions. We have a few inputs:

  • Initial ELO ratings which we need to select from an ELO model. At the start of the season, we can’t just assign everyone a 1500 rating because every team would end up with 12 projected wins. Therefore, we must select our ratings from a continuous system.
  • Calculation method for the model – margin vs result/winner-takes-all (WTA)
  • The draw, over which I obviously have no control

To work out the best way forward over the inputs I do have control over, I ran the 2016 season through a set of different scenarios to see which had the best forecasting ability. I used a mix of starting ratings from the different models and used both calculation methodologies to project wins for each team. Those projected wins were compared to the actual wins from the 2016 season and the one closest on average, i.e. the one with the lowest mean absolute error (MAE), is the best forecaster.
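The comparison itself is simple arithmetic. Below is a small sketch of the mean absolute error calculation and of picking the scenario with the lowest one. The scenario labels and all the numbers are invented for illustration; they are not the 2016 results.

```python
def mean_absolute_error(projected, actual):
    """Average of |projected wins - actual wins| across all teams."""
    return sum(abs(projected[team] - actual[team]) for team in actual) / len(actual)

# Made-up actual wins for three teams.
actual = {"A": 16, "B": 11, "C": 9}

# Made-up projections from two hypothetical scenarios.
scenarios = {
    "scenario X": {"A": 14.2, "B": 12.0, "C": 9.8},
    "scenario Y": {"A": 12.0, "B": 12.0, "C": 12.0},
}

errors = {name: mean_absolute_error(proj, actual) for name, proj in scenarios.items()}
best = min(errors, key=errors.get)   # lowest MAE is the best forecaster
print(errors)
print("best:", best)
```

Here scenario X comes out at roughly 1.2 wins per team and scenario Y at roughly 2.7, so X would be the one to keep.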

Nearly 28 million simulated match outcomes (it was actually way more because I kept stuffing up the process) later:

[Image: 2016 Stocky testing results]

If you just give every team 12 wins (190 games divided 16 ways), then you finish with a MAE of 3.63, so systems worth considering should outperform that, as we see with some of the WTA models but not the margin models.

It’s also worth noting that in 2016, Newcastle only won a single game and drew another. This is a singularly poor performance. No other team in the NRL’s nineteen-year history has finished with fewer than three wins. If we take them out, the MAE for scenario C with WTA calculation falls to 2.74 (11.4%), the lowest of any of the fourteen options. With any luck, we will avoid such outlier performances in 2017 and the Stocky will be more accurate.

Accuracy in this context is a matter of relative performance. Scenario C’s mean absolute error is 13.5%, but this is before a single game has been played, so we don’t have any information about performances during the season. By the end of 2016, the MAE is just 0.6, some of which is due to the fact that the Stocky projects part-wins while a team can obviously only win whole numbers of games.

2016 was a pretty kind year for the ELO modelling, generally performing above average and particularly well in the Euclid model:

  • Euclid prediction rate for 2016 – 69.7% (best ever and way up from 56.7% the year before)
  • Hipparchus prediction rate for 2016 – 61.7% (down from 65.5% in 2015, up from 53.2% in 2014)

Just as a test, I took the advantage/disadvantage in each rating in Scenario C and doubled them (e.g. a 1480 team became a 1460 team) to see if that would have any positive impact. Conclusion: it does not (MAE = 3.36).
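In code, that doubling is a one-liner. This sketch assumes the advantage is measured from the 1500 average rating; the function name and the `factor` parameter are mine.

```python
def stretch_rating(rating, factor=2.0, baseline=1500.0):
    """Scale a team's gap from the baseline rating by `factor`."""
    return baseline + factor * (rating - baseline)

print(stretch_rating(1480))   # 1460.0, matching the example above
print(stretch_rating(1530))   # 1560.0
```

A factor below 1 would do the opposite and squash every team back towards 1500.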

So for 2017, Euclid ELO ratings will be the basis of the projection using a winner-takes-all calculation method. The Stocky outputs will be updated each Monday based on the results of the round just played. Initial testing suggests that the Stocky tends to overreact to individual results, so some tuning may be required for 2018.

Longer term, I’m looking to create a Finals Stocky, which would work a bit differently to the normal season but would still assess each team’s relative chances of winning the big one.