A Bayesian Model for NCAA Field Hockey

Approximating skill and predicting outcomes with the negative binomial distribution
R
Stan
brms
Published: November 8, 2023

My lab at Northwestern sits in a building near Lakeside Field, home to the school’s field hockey team. The team is quite successful: they won the national championship in 2021 and were runners-up in 2022. They are once again in the hunt for the national championship in 2023, and I thought it would be fun to try to predict the outcome of the tournament (or at least quantify their odds).

To do so, I build on models developed over the last half-century. To start, I create a model similar to that of Maher (1982), who uses the Poisson distribution to model the number of goals scored by each team. However, because the data are overdispersed (Figure 1), I use the negative binomial distribution instead.

The Model

The Poisson distribution gives the probability of a given number of events occurring over a fixed period of time. Here, the events are goals scored during a game, and the scoring rate varies with team ability. The Poisson distribution can be considered a limiting case of the negative binomial distribution in which the variance equals the mean. Because the variance of the field hockey goals data exceeds the mean, we use the negative binomial distribution instead. We model scoring as a function of both a team’s offensive skill and the defensive skill of its opponent, and we also include a term for home field advantage.
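To make the mean-variance relationship concrete, here is the variance under the mean-and-shape parameterization of the negative binomial, the one used by the negbinomial family in brms. The symbols \(\mu\) (mean) and \(\phi\) (shape) are notation introduced here for illustration.

\[ \mathbb{E}[y] = \mu, \qquad \text{Var}[y] = \mu + \frac{\mu^2}{\phi} \]

As \(\phi \to \infty\) the second term vanishes and the variance equals the mean, which recovers the Poisson. A finite \(\phi\) gives a variance larger than the mean, which is what overdispersed goal counts call for.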

\[ \text{goals}_{ij} \sim \text{NegBinom}(\alpha + \text{offense}_i + \text{defense}_j + \text{home}) \]

where:

  • \(\text{goals}_{ij}\) is the number of goals scored by team \(i\) against team \(j\)
  • \(\alpha\) is an intercept term
  • \(\text{offense}_i\) is the offensive skill of team \(i\)
  • \(\text{defense}_j\) is the defensive skill of team \(j\)
  • \(\text{home}\) is the home field advantage

To help the model converge, we add the zero-sum constraints

\[ \sum_{i=1}^{n} \text{offense}_i = \sum_{j=1}^{n} \text{defense}_j = 0 \]
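Written out in full, the model can be read roughly as follows. This assumes the default log link of the negbinomial family in brms and includes the match-time offset that appears in the brms specification. The symbols \(\mu_{ij}\) (expected goals), \(\phi\) (shape), and \(t_{ij}\) (match time) are notation introduced here, and the home term is taken to apply only when team \(i\) is the host, which is an assumption about how the variable is coded.

\[ \text{goals}_{ij} \sim \text{NegBinom}(\mu_{ij}, \phi), \qquad \log \mu_{ij} = \alpha + \text{offense}_i + \text{defense}_j + \text{home} + \log t_{ij} \]

Because the predictor is on the log scale, each term acts multiplicatively on expected goals: a positive \(\text{offense}_i\) raises team \(i\)'s scoring rate, and a negative \(\text{defense}_j\) lowers the rate at which team \(j\) concedes.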

In brms, this model is specified as

model_formula <- bf(
    gf ~ 1 +                    # intercept
        home +                  # home field advantage
        (1 | team_id) +         # offense
        (1 | opponent) +        # defense
        offset(log(match_time)),# overtime adjustment
    center=TRUE
)
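The formula implies a data layout with one row per team per match, so that every match contributes two observations of goals scored. The sketch below shows one way such a table could be built in base R. The column names gf, home, team_id, opponent, and match_time come from the formula; the matches data frame, its column names, and every value in it are hypothetical.

```r
# Hypothetical match results, one row per match (teams, scores, and times are made up).
matches <- data.frame(
  home_team  = c("A", "B", "C"),
  away_team  = c("B", "C", "A"),
  home_goals = c(3, 0, 2),
  away_goals = c(1, 1, 1),
  match_time = c(60, 60, 75)  # minutes played (assumed unit); longer with overtime
)

# Two rows per match: the goals each team scored (gf) against its opponent.
long <- rbind(
  data.frame(team_id = matches$home_team, opponent = matches$away_team,
             gf = matches$home_goals, home = 1,
             match_time = matches$match_time),
  data.frame(team_id = matches$away_team, opponent = matches$home_team,
             gf = matches$away_goals, home = 0,
             match_time = matches$match_time)
)

# With brms installed, a fit could then look like this (not run here):
# fit <- brm(model_formula, data = long, family = negbinomial())
```

In this layout each team appears in team_id when it is scoring and in opponent when it is defending, which is what lets the two group-level terms be read as offense and defense.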

Metrics and Visualization

Simulations

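| Team | Seed | Championship probability (%) |
|---|---|---|
| Northwestern | 2 | 24.64 |
| North Carolina | 1 | 14.18 |
| Harvard | 9 | 8.41 |
| Duke | 3 | 8.14 |
| Iowa | 7 | 6.92 |
| Maryland | 4 | 6.45 |
| Virginia | 5 | 5.68 |
| Saint Joseph’s | 12 | 5.43 |
| Syracuse | 11 | 5.13 |
| Old Dominion | 13 | 3.98 |
| Liberty | 6 | 3.90 |
| Louisville | 10 | 2.75 |
| Rutgers | 8 | 2.10 |
| American | 14 | 1.19 |
| Miami (OH) | 15 | 1.00 |
| William & Mary | 16 | 0.10 |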
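Championship percentages like these can be produced by simulating the bracket many times and counting how often each team wins. The sketch below illustrates the idea in base R for a four-team single-elimination bracket. It is not the code behind the table: the team names, skill values, intercept, and shape parameter are all hypothetical, games are treated as neutral-site, and ties are broken by a coin flip as a stand-in for overtime and shootouts. A fuller version would also draw the parameters from the posterior on each run rather than fixing them.

```r
set.seed(2023)

# Hypothetical skill parameters on the log scale (NOT fitted values from the post).
teams   <- c("A", "B", "C", "D")
offense <- c(A = 0.4, B = 0.1, C = -0.1, D = -0.4)
defense <- c(A = -0.3, B = 0.0, C = 0.1, D = 0.2)  # lower = concedes fewer goals
alpha   <- 0.5  # intercept: log of baseline goals per game
phi     <- 5    # negative binomial shape parameter

# Simulate one neutral-site game between teams i and j; return the winner's name.
play_game <- function(i, j) {
  goals_i <- rnbinom(1, mu = exp(alpha + offense[[i]] + defense[[j]]), size = phi)
  goals_j <- rnbinom(1, mu = exp(alpha + offense[[j]] + defense[[i]]), size = phi)
  if (goals_i == goals_j) return(sample(c(i, j), 1))  # tie: coin flip (assumption)
  if (goals_i > goals_j) i else j
}

# Single elimination: adjacent entries meet, winners advance, until one team remains.
play_bracket <- function(bracket) {
  while (length(bracket) > 1) {
    idx <- seq(1, length(bracket), by = 2)
    bracket <- vapply(idx, function(k) play_game(bracket[k], bracket[k + 1]),
                      character(1))
  }
  bracket
}

n_sims    <- 2000
champions <- replicate(n_sims, play_bracket(c("A", "D", "B", "C")))

# Percentage of simulated tournaments won by each team.
champ_pct <- round(100 * prop.table(table(factor(champions, levels = teams))), 2)
```

The same loop extends to a 16-team field by passing a longer bracket vector, ordered so that adjacent entries are first-round opponents.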

Too simple?

The simple model above assumes that the goals scored by each team are independent. This assumption, however, misses something we can see in our data: the number of goals scored by one team is negatively correlated with the number of goals scored by the other team (Figure 2). In fact, the probability of a shutout increases as a team scores more goals (or rather, teams that score more goals record shutouts in a larger share of their games).
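A check of this kind can be sketched in a few lines of base R. The games data frame below is made up purely to show the computation and is not the data behind Figure 2; the column name ga (goals against) is also an assumption.

```r
# Hypothetical per-game results from one team's perspective (made-up numbers).
games <- data.frame(
  gf = c(0, 1, 1, 2, 2, 3, 3, 4, 5, 6),  # goals for
  ga = c(3, 2, 1, 1, 0, 1, 0, 0, 1, 0)   # goals against
)

# Rank correlation between goals scored and goals conceded.
rho <- cor(games$gf, games$ga, method = "spearman")

# Share of games in which the opponent was held to zero, grouped by goals scored.
shutout_rate <- tapply(games$ga == 0, games$gf, mean)
```

A negative rho, together with a shutout_rate that rises with goals scored, would be the signature of the dependence described above. Under independent scoring, the shutout share would not vary systematically with a team's own goal count.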

References

Maher, Michael J. 1982. “Modelling Association Football Scores.” Statistica Neerlandica 36 (3): 109–18.