My lab at Northwestern sits in a building near Lakeside Field, home to the school field hockey team. The team is quite sucessful: they won the national championship in 2021 and were runners-up in 2022. The team is once again in the hunt for the national championship in 2023, and I thought it would be fun to try to predict the outcome of the tournament (or at least quantify their odds).

To do so, I build on some existing models built over the last half-century. To start, I create a model similar to Maher (1982), who uses the Poisson distribution to model the number of goals scored by each team. However, because the data are overdispursed (Figure 1), I use the negative binomial distribution instead.

The Model

The Poisson distribution represents the probability of a number of events occurring over a fixed period of time. In this case, we are interested in goals during the course of a game, which varies based on team ability. The Poisson distribution can be considered a special case of the negative binomial distribution when the mean is equal to the variance. Since this is not the case with the field hockey goals data, we use the negative binomial distribution instead. We consider the amount of scoring as a function of both a team’s offensive skill and the defensive skill of their opponent and additionally consider home field advantage.

\[ \text{goals}_{ij} \sim \text{NegBinom}(\alpha + \text{offense}_i + \text{defense}_j + \text{home}) \]

where:

$\text{goals}_{ij}$ is the number of goals scored by team $i$ against team $j$
$\alpha$ is an intercept term
$\text{offense}_i$ is the offensive skill of team $i$
$\text{defense}_j$ is the defensive skill of team $j$
$\text{home}$ is the home field advantage

To get the model to converge, we add the additional zero-sum constraints

\[ \sum_{i=1}^{n} \text{offense}_i = \sum_{i=1}^{n} \text{defense}_j = 0 \]

In brms, this model is specified as

model_formula <- bf(
    gf ~ 1 +                    # intercept
        home +                  # home field advantage
        (1 | team_id) +         # offense
        (1 | opponent) +        # defense
        offset(log(match_time)),# overtime adjustment
    center=TRUE
)

Metrics and Visualization

Simulations

Team	Seed	Champs %
Northwestern	2	24.64
North Carolina	1	14.18
Harvard	9	8.41
Duke	3	8.14
Iowa	7	6.92
Maryland	4	6.45
Virginia	5	5.68
Saint Joseph’s	12	5.43
Syracuse	11	5.13
Old Dominion	13	3.98
Liberty	6	3.90
Louisville	10	2.75
Rutgers	8	2.10
American	14	1.19
Miami (OH)	15	1.00
William & Mary	16	0.10

Too simple?

The simple model above assumes that the goals scored by each team are independent. This assumption, however, misses something we can see in our data: the number of goals scored by one team is negatively correlated with the number of goals scored by the other team (Figure 2). In fact, the probability of a shutout increases as a team scores more goals (or rather, teams that score more goals are more likely to get shutouts proportionally).

References

Maher, Michael J. 1982. “Modelling Association Football Scores.”Statistica Neerlandica 36 (3): 109–18.

--- title: "A Bayesian Model for NCAA Field Hockey" description: "Approximating skill and predicting outcomes with the negative binomial distribution" date: 2023-11-08 bibliography: references.bib categories: - R - Stan - brms execute: echo: false warning: false cache: true --- ```{r, setup, include=FALSE} ``` My lab at Northwestern sits in a building near Lakeside Field, home to the school field hockey team. The team is quite successful: they won the national championship in 2021 and were runners-up in 2022. The team is once again in the hunt for the national championship in 2023, and I thought it would be fun to try to predict the outcome of the tournament (or at least quantify their odds). ```{r libraries-data, warning=FALSE, message=FALSE} library(tidyverse) library(arrow) library(brms) library(tidybayes) library(cfbplotR) library(ggtext) library(glue) library(grid) load_matches_year <- function(year) { read_feather(glue::glue("data/matches_{year}.feather")) %>% mutate( minutes = as.integer(stringr::str_extract(time, "\\d+(?=:)")), seconds = as.integer(stringr::str_extract(time, "(?<=:)\\d+")) ) %>% mutate(quarters = (minutes + seconds/60.0) / 15.0) %>% mutate(match_time = quarters/4.0) %>% mutate(season=year) } # TODO: Some of these are missing times? matches <- c(2022, 2023, 2024) %>% map(load_matches_year) %>% bind_rows() %>% arrange(desc(home), date) %>% mutate(home_team = (home == 1)) teams <- matches %>% dplyr::select(team_id, name) %>% distinct() %>% mutate(team_id = as.integer(team_id)) ``` ```{r} #| label: fig-poisson #| column: margin #| fig-width: 3 #| fig-height: 2.5 matches %>% group_by(team_id, season) %>% summarize( mean = mean(gf), var = var(gf) ) %>% ggplot(aes(x=mean, y=var)) + geom_point() + geom_abline(slope=1, intercept=0, color="red") + labs(x="Mean", y="Variance") ``` To do so, I build on some existing models built over the last half-century. To start, I create a model similar to @maher1982modelling, who uses the Poisson distribution to model the number of goals scored by each team. However, because the data are overdispersed (@fig-poisson), I use the negative binomial distribution instead. # The Model The Poisson distribution represents the probability of a number of events occurring over a fixed period of time. In this case, we are interested in goals during the course of a game, which varies based on team ability. The Poisson distribution can be considered a special case of the negative binomial distribution when the mean is equal to the variance. Since this is not the case with the field hockey goals data, we use the negative binomial distribution instead. We consider the amount of scoring as a function of both a team's offensive skill and the defensive skill of their opponent and additionally consider home field advantage. $$ \text{goals}_{ij} \sim \text{NegBinom}(\alpha + \text{offense}_i + \text{defense}_j + \text{home}) $$ where: - $\text{goals}_{ij}$ is the number of goals scored by team $i$ against team $j$ - $\alpha$ is an intercept term - $\text{offense}_i$ is the offensive skill of team $i$ - $\text{defense}_j$ is the defensive skill of team $j$ - $\text{home}$ is the home field advantage To get the model to converge, we add the additional zero-sum constraints $$ \sum_{i=1}^{n} \text{offense}_i = \sum_{i=1}^{n} \text{defense}_j = 0 $$ In `brms`, this model is specified as ```{r} #| label: model-formula #| echo: true model_formula <- bf( gf ~ 1 + # intercept home + # home field advantage (1 | team_id) + # offense (1 | opponent) + # defense offset(log(match_time)),# overtime adjustment center=TRUE ) ``` ```{r, poisson-model} build_model <- function(matches) { brm( model_formula, data = matches, family = negbinomial(), cores = 4, iter = 2500, warmup = 1500, ) } model_2024 <- build_model((matches %>% filter(season == 2024) %>% filter(date < "11/08/2023"))) model_2023 <- build_model((matches %>% filter(season == 2023))) model_2022 <- build_model((matches %>% filter(season == 2022))) ``` ## Metrics and Visualization ```{r} #| label: plot-offense-defense offense_defense <- function(model) { offense <- model %>% spread_draws(r_team_id[team_id,]) %>% group_by(team_id) %>% summarize( mean = mean(r_team_id), lower = quantile(r_team_id, 0.025), upper = quantile(r_team_id, 0.975) ) %>% arrange(desc(mean)) %>% left_join(teams, by = "team_id") defense <- model %>% spread_draws(r_opponent[opponent,]) %>% group_by(opponent) %>% summarize( mean = mean(r_opponent), lower = quantile(r_opponent, 0.025), upper = quantile(r_opponent, 0.975) ) %>% arrange(mean) %>% rename(team_id = opponent) %>% left_join(teams, by = "team_id") team_rename <- c( "UAlbany"="Albany", "App State"="Appalachian State", "American"="American University", "Boston U."="Boston University", "Ohio St."="Ohio State", "Penn St."="Penn State", "Penn"="Pennsylvania", "Massachusetts"="UMass", "Ball St."="Ball State", "Central Mich."="Central Michigan", "Michigan St."="Michigan State", "LIU"="Long Island University", "Kent St."="Kent State" ) plot_data <- inner_join( (offense %>% dplyr::select(team_id, mean, name) %>% rename(offense=mean)), (defense %>% dplyr::select(team_id, mean) %>% rename(defense=mean)), by="team_id") %>% mutate(name = coalesce(team_rename[name], name)) %>% mutate(has_image = (name %in% valid_team_names())) %>% mutate(defense=-defense) return(plot_data) } plot_offense_defense <- function(plot_data) { plot <- plot_data %>% filter(has_image) %>% ggplot(aes(y=defense, x=offense, label=name, team=name)) + #geom_richtext(angle=-45) + geom_cfb_logos(width = 0.05, angle=-45) + coord_fixed(ratio=1, xlim=c(-1.5, 1.5), ylim=c(-1.5, 1.5)) + labs( #title="2023-2024 NCAA Field Hockey Efficiency", x="Offense", y="Defense" ) + theme_bw() + theme( axis.title.x = element_text(angle = -45, hjust = 0.5, vjust = 0.5), axis.title.y = element_text(angle = -45, hjust = 0.5, vjust = 0.5), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank() ) grid::grid.newpage() vp <- grid::viewport(name = "rotate", y=0.4, angle = 45, width = 0.75, height = 0.75) grid::pushViewport(vp) print(plot, vp="rotate", newpage = FALSE) } ``` ::: panel-tabset ### 2023 ```{r} #| label: offense-defense-2024 #| fig-width: 6 #| fig-height: 6 #| classes: preview-image od_2024 <- offense_defense(model_2024) plot_offense_defense(od_2024) ``` ### 2022 ```{r} #| label: offense-defense-2023 #| fig-width: 6 #| fig-height: 6 od_2023 <- offense_defense(model_2023) plot_offense_defense(od_2023) ``` ### 2021 ```{r} #| label: offense-defense-2022 #| fig-width: 6 #| fig-height: 6 od_2022 <- offense_defense(model_2022) plot_offense_defense(od_2022) ``` ::: ## Simulations ```{r} draws <- model_2024 %>% spread_draws(b_Intercept, shape, b_home, r_team_id[team_id,], r_opponent[opponent,]) ``` ```{r} #| label: simulate-goals simulate_goals <- function(draws, is_home, t1, t2) { draws %>% filter(team_id == t1) %>% filter(opponent == t2) %>% mutate(mu = exp(b_Intercept + r_team_id + r_opponent + b_home*is_home)) %>% rowwise() %>% mutate( goals = list(rnbinom(100, size=shape, mu=mu)), # TODO: how to do this with negative binomial? time = list(rgeom(100, 1 - shape/(shape+mu)) / shape) ) %>% select(goals, time) %>% unnest(cols = c(goals, time)) } ``` ```{r} tournament_teams_list <- c( "North Carolina", "Northwestern", "Duke", "Maryland", "Virginia", "Liberty", "Iowa", "Rutgers", "Harvard", "Louisville", "Syracuse", "Saint Joseph's", "Old Dominion", "American", "Miami (OH)", "William & Mary" ) home_advantage <- c(1, 5, 5, 5, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17) tournament_teams <- data.frame( name=tournament_teams_list, home=home_advantage ) %>% inner_join(teams, by="name") %>% mutate(seed=seq_along(name)) ``` ```{r} teams_win_percentage <- function(draws, home, t1, t2) { sim_1 <- simulate_goals(draws, home, t1, t2) sim_2 <- simulate_goals(draws, -home, t2, t1) goal_diff <- sim_1$goals - sim_2$goals first_rate <- sim_1$time < sim_2$time overtime_win <- (first_rate & (goal_diff == 0) & (pmin(sim_1$time, sim_2$time) < (1/3))) # TODO: Assume random for now shootouts <- sum((goal_diff == 0) & (pmin(sim_1$time, sim_2$time) >= (1/3))) shootout_win <- runif(shootouts) < 0.5 # TODO: Ignores shootout for now. # Could factor in when the rate is lower than 20 return((sum(goal_diff > 0) + sum(overtime_win) + sum(shootout_win)) / length(goal_diff)) } ``` ```{r, eval=FALSE} teams_win_percentage(draws, -1, 457, 509) ``` ```{r} tournament_draws <- draws %>% filter(team_id %in% tournament_teams$team_id) %>% filter(opponent %in% tournament_teams$team_id) m <- matrix(0, nrow=nrow(tournament_teams), ncol=nrow(tournament_teams)) for (i in 1:(nrow(tournament_teams) - 1)) { for (j in (i + 1):nrow(tournament_teams)) { has_home_advantage <- tournament_teams$home[i] <= j win_percentage <- teams_win_percentage( tournament_draws, as.integer(has_home_advantage), tournament_teams$team_id[i], tournament_teams$team_id[j] ) m[i, j] <- win_percentage m[j, i] <- 1 - win_percentage } } ``` ```{r} #| label: tournament-simulation tournament_round <- function(mat, teams) { if (length(teams) == 1) { return(teams[1]) } if (length(teams) == 2) { t1 <- teams[1] t2 <- teams[2] return(ifelse(runif(1) < mat[t1, t2], t1, t2)) } half_length <- length(teams) %/% 2 first_half_winner <- tournament_round(mat, teams[1:half_length]) second_half_winner <- tournament_round(mat, teams[(half_length + 1):length(teams)]) return(tournament_round(mat, c(first_half_winner, second_half_winner))) } ``` ```{r} y <- replicate(100000, tournament_round( m, seq(1, nrow(tournament_teams), 1) )) %>% table() %>% prop.table() %>% as.data.frame() %>% setNames(c("s", "wins")) %>% mutate(seed = as.integer(s)) %>% inner_join(tournament_teams, by="seed") %>% arrange(desc(wins)) y %>% mutate(winsp = wins*100) %>% dplyr::select(name, seed, winsp) %>% knitr::kable(digits=2, col.names=c("Team", "Seed", "Champs %")) ``` ## Too simple? ```{r} #| label: fig-goal-correlation #| column: margin #| fig-width: 3 #| fig-height: 2.5 matches %>% ggplot(aes(x=ga, y=gf)) + geom_point(alpha=0.05) + geom_smooth(method = "glm", method.args = list(family = "poisson"), se = FALSE, color = "red") + labs(x="Goals Allowed", y="Goals For") ``` The simple model above assumes that the goals scored by each team are independent. This assumption, however, misses something we can see in our data: the number of goals scored by one team is *negatively correlated* with the number of goals scored by the other team (@fig-goal-correlation). In fact, the probability of a shutout *increases* as a team scores more goals (or rather, teams that score more goals are more likely to get shutouts proportionally). ```{r, eval=FALSE} model_2024 <- build_model((matches %>% filter(season == 2024))) draws <- model_2024 %>% spread_draws(b_Intercept, shape, b_home, r_team_id[team_id,], r_opponent[opponent,]) tournament_teams_list <- c( "North Carolina", "Northwestern", "Duke", "Virginia" ) home_advantage <- c(1, 5, 5, 5) tournament_teams <- data.frame( name=tournament_teams_list, home=home_advantage ) %>% inner_join(teams, by="name") %>% mutate(seed=seq_along(name)) ``` ```{r, bivariate-model, eval=FALSE} teams <- matches_2023 %>% dplyr::select(team_id, name) %>% distinct() %>% mutate(team_id = as.integer(team_id)) data <- matches_2023 %>% # TODO: At the moment, matches are counted twice... #mutate(i = paste(date, pmin(team_id, opponent), pmax(team_id, opponent))) %>% #distinct(i, .keep_all=T) %>% arrange(date) %>% drop_na() %>% mutate(season=2022) %>% dplyr::select(season, team_id, opponent, gf, ga) %>% mutate(team_id = as.integer(team_id)) %>% inner_join(teams, by=c("team_id"="team_id")) %>% rename(team_id="team_id", home="name") %>% mutate(opponent = as.integer(opponent)) %>% inner_join(teams, by=c("opponent"="team_id")) %>% rename(opponent="opponent", away="name") %>% select(season, home, away, gf, ga) sf <- stan_foot( data=data, model="double_pois", ind_home="FALSE", predict=2 ) ```