The timeplyr R package, created by my colleague Nick, was accepted on CRAN in October 2023. The CRAN page describes it as providing "a set of fast tidy functions for wrangling, completing and summarising date and date-time data". It looks like a really neat package for working with time-series data in a way that's consistent with what people have become used to in the tidyverse. From my chats with Nick, I believe some of the ideas for the package were inspired by problems that came up repeatedly while working with COVID-19 data. So the lesson here is that if you want clever solutions to problems that come up time and time again during analyses, you just have to sufficiently annoy a good programmer!
I’m going to give timeplyr a quick try-out here, with the help of a supermarket sales dataset that I can handily pilfer from Kaggle. To keep the dataset simple, I’ve dropped some variables that are surplus to requirements for this try-out.
library(tidyverse)  # also attaches lubridate (tidyverse >= 2.0), used for mdy_hms()
library(janitor)    # make_clean_names() for tidy column names
library(timeplyr)

sales_df <- read_csv(
  "https://raw.githubusercontent.com/sushantag9/Supermarket-Sales-Data-Analysis/master/supermarket_sales%20-%20Sheet1.csv",
  show_col_types = FALSE, name_repair = make_clean_names
) %>%
  # combine the separate date and time columns into a single date-time
  mutate(date_time = mdy_hms(paste(date, time)), .after = invoice_id) %>%
  # drop variables not needed for this try-out
  select(-(date:rating), -city, -gender)
glimpse(sales_df)
## Rows: 1,000
## Columns: 9
## $ invoice_id <chr> "750-67-8428", "226-31-3081", "631-41-3108", "123-19-117…
## $ date_time <dttm> 2019-01-05 13:08:00, 2019-03-08 10:29:00, 2019-03-03 13…
## $ branch <chr> "A", "C", "A", "A", "A", "C", "A", "C", "A", "B", "B", "…
## $ customer_type <chr> "Member", "Normal", "Normal", "Member", "Normal", "Norma…
## $ product_line <chr> "Health and beauty", "Electronic accessories", "Home and…
## $ unit_price <dbl> 74.69, 15.28, 46.33, 58.22, 86.31, 85.39, 68.84, 73.56, …
## $ quantity <dbl> 7, 5, 7, 8, 7, 7, 6, 10, 2, 3, 4, 4, 5, 10, 10, 6, 7, 6,…
## $ tax_5_percent <dbl> 26.1415, 3.8200, 16.2155, 23.2880, 30.2085, 29.8865, 20.…
## $ total <dbl> 548.9715, 80.2200, 340.5255, 489.0480, 634.3785, 627.616…
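As an aside, name_repair = make_clean_names is what converted the spreadsheet's original headers into the snake_case names seen above; a quick illustration, using headers taken from the Kaggle file:

```r
# janitor::make_clean_names() standardises arbitrary headers to snake_case,
# spelling out symbols like "%" along the way
make_clean_names(c("Invoice ID", "Tax 5%", "Unit price"))
## [1] "invoice_id"    "tax_5_percent" "unit_price"
```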
As there are many sales per day, I’ll start by aggregating up to daily sales for each branch, which we can do with time_by().
daily_df <- sales_df %>%
group_by(branch) %>%
time_by(date_time, "day", .add = TRUE, time_floor = TRUE) %>%
summarise(daily_total = sum(total), .groups = "drop")
daily_df
## # A tibble: 263 × 3
## branch date_time daily_total
## <chr> <dttm> <dbl>
## 1 A 2019-01-01 00:00:00 2371.
## 2 A 2019-01-02 00:00:00 307.
## 3 A 2019-01-03 00:00:00 937.
## 4 A 2019-01-04 00:00:00 483.
## 5 A 2019-01-05 00:00:00 2025.
## 6 A 2019-01-06 00:00:00 1310.
## 7 A 2019-01-07 00:00:00 1106.
## 8 A 2019-01-08 00:00:00 683.
## 9 A 2019-01-09 00:00:00 202.
## 10 A 2019-01-10 00:00:00 731.
## # ℹ 253 more rows
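For comparison, here’s a rough equivalent of that aggregation using only dplyr and lubridate; this is just a sketch of what time_by() is saving us from, not a claim about its internals:

```r
# Manual daily aggregation: floor each timestamp to the day, then summarise.
# time_by() does this more generally, for arbitrary time units.
daily_manual <- sales_df %>%
  mutate(date_time = floor_date(date_time, unit = "day")) %>%
  group_by(branch, date_time) %>%
  summarise(daily_total = sum(total), .groups = "drop")
```

One nice thing about time_by() is that switching to, say, weekly aggregation is just a change of the "day" string.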
We can plot this data quickly using time_ggplot(), which is flexible enough to allow facets as well; we’ll use them here as the lines for each branch overlap a lot. It also has nice defaults to ensure the x-axis doesn’t get overly cluttered with text.
time_ggplot(daily_df, date_time, daily_total, group = branch, facet = TRUE)
Another thing timeplyr offers is filling in missing dates in a dataset using time_complete(). Across the 3 branches there are 4 missing days; these are days with zero sales, which didn’t appear in the first plot. We’ll do some data manipulation to clearly show the points that have been added, and also to demonstrate that standard ggplot2 layers can be combined with time_ggplot().
daily_df2 <- daily_df %>%
time_complete(date_time, .by = branch, time_by = "day",
fill = list(daily_total = 0))
anti_join(daily_df2, daily_df, join_by(branch, date_time))
## # A tibble: 4 × 3
## branch date_time daily_total
## <chr> <dttm> <dbl>
## 1 B 2019-01-11 00:00:00 0
## 2 B 2019-01-23 00:00:00 0
## 3 B 2019-02-01 00:00:00 0
## 4 C 2019-03-22 00:00:00 0
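A hand-rolled alternative with tidyr::complete() would look something like the sketch below, assuming each branch’s series should run from its own first day to its own last day:

```r
# Expand each branch to a full daily sequence, filling missing days with 0
daily_manual2 <- daily_df %>%
  group_by(branch) %>%
  complete(date_time = seq(min(date_time), max(date_time), by = "day"),
           fill = list(daily_total = 0)) %>%
  ungroup()
```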
daily_df2 %>%
  mutate(zero = if_else(daily_total == 0, "Zero", NA_character_)) %>%
  time_ggplot(date_time, daily_total, group = branch, facet = TRUE) +
  # highlight the newly added zero-sales days
  geom_point(aes(colour = zero), show.legend = FALSE) +
  labs(x = NULL, y = "Daily Sales")
To finish off, let’s take a quick look at weekly sales by customer type, just to show another variation with time_by().
weekly_df <- sales_df %>%
group_by(branch, customer_type) %>%
time_by(date_time, "week", .add = TRUE, time_floor = TRUE) %>%
summarise(weekly_total = sum(total), .groups = "drop")
weekly_df
## # A tibble: 78 × 4
## branch customer_type date_time weekly_total
## <chr> <chr> <dttm> <dbl>
## 1 A Member 2018-12-31 00:00:00 3642.
## 2 A Member 2019-01-07 00:00:00 4142.
## 3 A Member 2019-01-14 00:00:00 6202.
## 4 A Member 2019-01-21 00:00:00 4427.
## 5 A Member 2019-01-28 00:00:00 5771.
## 6 A Member 2019-02-04 00:00:00 2677.
## 7 A Member 2019-02-11 00:00:00 2361.
## 8 A Member 2019-02-18 00:00:00 4232.
## 9 A Member 2019-02-25 00:00:00 3359.
## 10 A Member 2019-03-04 00:00:00 5791.
## # ℹ 68 more rows
ggplot(weekly_df, aes(x = date_time, y = weekly_total, fill = customer_type)) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  facet_grid(rows = vars(branch)) +
  scale_x_datetime(date_breaks = "2 weeks", date_labels = "%d %b\n%Y") +
  labs(x = NULL, y = "Weekly Sales") +
  scale_fill_hue("Customer Type") +
  theme(legend.position = "top")
This package definitely looks useful to me: it makes it easy to, for example, summarise and plot time-series data at different time units. There’s a lot more that can be done with timeplyr, and it’s been designed with efficiency in mind, so the functions are generally speedy. If you’re interested in finding out more, take a look at the GitHub page, which already has ample examples.