This post is a quick note: when you take a grouped summary of a date variable with min(..., na.rm = TRUE) and some groups contain only missing values, R returns Inf for those groups rather than NA. Because Inf is not counted as missing, the problem does not show up when you check for NAs, and it can cause issues later in the analysis if you are not careful.
library(tidyverse)
library(lubridate) # for dmy() and NA_Date_; only attached by tidyverse itself from version 2.0.0
df <- tibble::tribble(
  ~id, ~dt,
  1L, "01/01/2001",
  1L, NA,
  2L, NA,
  2L, NA
) %>%
  mutate(dt = dmy(dt))
z1 <- df %>%
  group_by(id) %>%
  summarise(dt_min = min(dt, na.rm = TRUE),
            .groups = "drop")
z1
# A tibble: 2 × 2
# id dt_min
# <int> <date>
# 1 1 2001-01-01
# 2 2 Inf
sum(is.na(z1$dt_min))
# [1] 0
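The Inf appears to come from min() itself rather than from dplyr: once na.rm = TRUE has dropped every value in the group there is nothing left to compare, so min() falls back to Inf (with a warning) and the result keeps its Date class. Here is a minimal base R sketch of that behaviour, using a stand-in vector x of two missing dates instead of the df above:

```r
# Stand-in for group 2: every date is missing
x <- as.Date(c(NA, NA))

# With the NAs removed there is nothing left, so min() returns Inf
# (and warns); suppressWarnings() just keeps the example quiet
m <- suppressWarnings(min(x, na.rm = TRUE))

class(m)        # the result is still a Date
is.na(m)        # FALSE, which is why counting NAs finds nothing
is.infinite(m)  # TRUE, so this is the check that catches it
```

So a quick way to spot affected groups is to count is.infinite() values as well as is.na() values.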
There are a couple of ways around this. First, you can use an if() statement that returns NA_Date_ when every value in the group is missing.
z2 <- df %>%
  group_by(id) %>%
  summarise(dt_min = if (all(is.na(dt))) NA_Date_ else min(dt, na.rm = TRUE),
            .groups = "drop")
z2
# A tibble: 2 × 2
# id dt_min
# <int> <date>
# 1 1 2001-01-01
# 2 2 NA
sum(is.na(z2$dt_min))
# [1] 1
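If the same check is needed in several summaries, it could be wrapped in a small helper. The sketch below uses base R only, and the name min_date_or_na() is made up for this example; it is not a function from any package.

```r
# Hypothetical helper (not from any package): earliest date in x,
# or a missing Date when x holds nothing but NAs
min_date_or_na <- function(x) {
  if (all(is.na(x))) {
    as.Date(NA)
  } else {
    min(x, na.rm = TRUE)
  }
}

some_missing <- min_date_or_na(as.Date(c("2001-01-01", NA)))
all_missing  <- min_date_or_na(as.Date(c(NA, NA)))

some_missing        # the earliest non-missing date
is.na(all_missing)  # TRUE: a proper NA instead of Inf
```

Inside summarise() it would be called as dt_min = min_date_or_na(dt).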
Or you can use the summary functions from the hablar package, such as min_().
z3 <- df %>%
  group_by(id) %>%
  summarise(dt_min = hablar::min_(dt),
            .groups = "drop")
z3
# A tibble: 2 × 2
# id dt_min
# <int> <date>
# 1 1 2001-01-01
# 2 2 NA
sum(is.na(z3$dt_min))
# [1] 1
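Another possibility, offered only as a sketch, is to leave the summary as it is and replace the infinite dates afterwards. The example below is base R and builds a small stand-in data frame z that mimics z1, so that it runs on its own:

```r
# Stand-in for the z1 summary: group 2 ends up with an infinite date
z <- data.frame(id = 1:2)
z$dt_min <- suppressWarnings(c(
  min(as.Date(c("2001-01-01", NA)), na.rm = TRUE),
  min(as.Date(c(NA, NA)), na.rm = TRUE)
))

# Swap any infinite date for a missing value
z$dt_min[is.infinite(z$dt_min)] <- NA

sum(is.na(z$dt_min))  # the all-missing group now counts as NA
```

This keeps the summarise() call unchanged, at the cost of still triggering the min() warning for the empty groups.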
Is there a reason why R decides to return Inf when summarising dates? Are there any other solutions for summarising date variables that contain missing values? Leave me a comment if you know. Thanks!