For a piece of work I had to calculate the number of matches that a team plays away from home in a row, which we will call days_on_the_road
. I was not sure how to do this with dplyr
but it’s basically a ‘grouped sequence’. For this post, I’ve created some dummy data to illustrate this idea. The num_matches_away
variable is what we want to mimic using some data manipulation.
library(tidyverse)
sports_df <- tibble::tribble(
~team, ~date, ~home_or_away, ~num_matches_away,
"Team A", "07/10/2022", "H", 0L,
"Team A", "14/10/2022", "A", 1L,
"Team A", "21/10/2022", "A", 2L,
"Team A", "28/10/2022", "H", 0L,
"Team A", "04/11/2022", "A", 1L,
"Team A", "11/11/2022", "A", 2L,
"Team A", "18/11/2022", "A", 3L,
"Team A", "25/11/2022", "H", 0L,
"Team A", "02/12/2022", "H", 0L,
"Team A", "09/12/2022", "H", 0L,
"Team B", "07/10/2022", "H", 0L,
"Team B", "14/10/2022", "A", 1L,
"Team B", "21/10/2022", "A", 2L,
"Team B", "28/10/2022", "A", 3L,
"Team B", "04/11/2022", "H", 0L,
"Team B", "11/11/2022", "A", 1L,
"Team B", "18/11/2022", "A", 2L,
"Team B", "25/11/2022", "H", 0L,
"Team B", "02/12/2022", "H", 0L,
"Team B", "09/12/2022", "A", 1L
) %>%
mutate(date = as.Date(date, "%d/%m/%Y")) %>%
arrange(team, date)
sports_df
## # A tibble: 20 × 4
## team date home_or_away num_matches_away
## <chr> <date> <chr> <int>
## 1 Team A 2022-10-07 H 0
## 2 Team A 2022-10-14 A 1
## 3 Team A 2022-10-21 A 2
## 4 Team A 2022-10-28 H 0
## 5 Team A 2022-11-04 A 1
## 6 Team A 2022-11-11 A 2
## 7 Team A 2022-11-18 A 3
## 8 Team A 2022-11-25 H 0
## 9 Team A 2022-12-02 H 0
## 10 Team A 2022-12-09 H 0
## 11 Team B 2022-10-07 H 0
## 12 Team B 2022-10-14 A 1
## 13 Team B 2022-10-21 A 2
## 14 Team B 2022-10-28 A 3
## 15 Team B 2022-11-04 H 0
## 16 Team B 2022-11-11 A 1
## 17 Team B 2022-11-18 A 2
## 18 Team B 2022-11-25 H 0
## 19 Team B 2022-12-02 H 0
## 20 Team B 2022-12-09 A 1
I firstly came up with a complicated solution for this using a combination of slider::slide()
and rle()
(run length encoding).
sports_df2 <- sports_df %>%
group_by(team) %>%
mutate(days_on_road = unlist(slider::slide(rle(home_or_away)$length, ~1:.x)),
days_on_road = if_else(home_or_away == "H", 0L, days_on_road)) %>%
ungroup()
identical(sports_df2$days_on_road, sports_df$num_matches_away)
## [1] TRUE
These are some pretty cool functions and it was nice to know rle()
existed but I was never really happy with this solution as it seemed overly complex and it’s difficult to understand what the code is doing by simply reading it. So I asked a colleague to try to solve this problem and they came up with a better solution which I’m grateful for! It involves using a combination of group_by()
and seq_len()
which is a whole lot simpler to understand in my opinion.
sports_df3 <- sports_df2 %>%
mutate(away = cumsum(home_or_away == "H")) %>%
group_by(team, home_or_away, away) %>%
mutate(days_on_road = seq_len(n())) %>%
ungroup() %>%
mutate(days_on_road = if_else(home_or_away == "H", 0L, days_on_road))
identical(sports_df3$days_on_road, sports_df$num_matches_away)
## [1] TRUE
I hope this is useful for some of you out there coding with R!