I recently saw this great post on Nathan Yau’s FlowingData website which guesses a person’s name based on what the name starts with. It also needs you to select a gender and a decade for when you were born before it can guess. Of course, it isn’t really a guess and is really just based on proportions calculated after restricting the data to what has been selected.
It uses data from the Social Security Administration in America so results more specifically apply to the US. I thought it’d be cool to see how it looks for Scottish data which is available from the National Records of Scotland (NRS). I’ve embedded the Shiny app below into an iframe but you can view the app in it’s own page by going to https://alan-y.shinyapps.io/name_guess. I used a little bit of css to make the iframe responsive (resizes based on the amount of screen/window space available). The app is hosted on shinyapps.io on a free account so if nobody visits it for a while, it will no longer be available (unless I restart it). So apologies if you happen to visit this page further down the line and it’s not working!
The R code I used to download and wrangle the data as well as create the Shiny app is provided further down the page. I hope the app is interesting to some people and my acknowledgements again, to Nathan Yau as this is clearly based off his work.
Downloading the data
First I identified the webpages from the NRS website that contained the required babynames csv files and then scraped the links to all the csv files with help from the rvest package. I created some helper functions (one to grab the csv links and one to read the csv files into R and tidy them up) to use with the map()
functions from purrr.
library(tidyverse)
library(janitor)
library(rvest)
# Helper functions
get_csv_links <- function(link) {
read_html(link) %>%
html_nodes("a") %>%
html_attr("href") %>%
str_subset("\\.csv") %>%
paste0("https://www.nrscotland.gov.uk/", .)
}
read_babynames <- function(link, yr) {
b <- read_csv(link, skip = 6) %>%
remove_empty() %>%
select(-contains("Position")) %>%
clean_names()
boy <- b %>%
select(1:2) %>%
set_names(c("name", "number_of_babies")) %>%
mutate(gender = "boy")
girl <- b %>%
select(3:4) %>%
set_names(c("name", "number_of_babies")) %>%
mutate(gender = "girl")
bind_rows(boy, girl) %>%
mutate(year = yr) %>%
filter(!is.na(number_of_babies))
}
# List of webpages containing the csv files
pages <- c("https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/names/babies-first-names/full-lists-of-babies-first-names-archive/full-lists-of-babies-first-names-1974-to-1979",
"https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/names/babies-first-names/full-lists-of-babies-first-names-archive/full-lists-of-babies-first-names-1980-to-1989",
"https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/names/babies-first-names/full-lists-of-babies-first-names-archive/full-lists-of-babies-first-names-1990-to-1999",
"https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/names/babies-first-names/full-lists-of-babies-first-names-2000-to-2009",
"https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/names/babies-first-names/full-lists-of-babies-first-names-2010-to-2014")
csv_links <- map(pages, get_csv_links) %>%
unlist()
# Find the years for each csv file
yr <- parse_number(str_extract(csv_links, "[0-9]+\\.csv")) %>%
if_else(is.na(.), 2018, .) %>%
if_else(. < 1000, . + 2000, .)
babynames <- map2_df(csv_links, yr, read_babynames)
babynames2 <- babynames %>%
mutate(decade = paste0(str_sub(year, 1, 3), "0s")) %>%
group_by(decade, gender, name) %>%
summarise(number_of_babies = sum(number_of_babies)) %>%
ungroup()
# Save as rds so it can be quickly read in for the Shiny app
saveRDS(babynames2, "babynames.rds")
Shiny App
I created the Shiny app by amending the Shiny template available in RStudio as required – all fairly straightforward stuff and nothing fancy involved at all!
# Shiny App: Scotland's most popular babynames by decade
library(shiny)
library(dplyr)
library(ggplot2)
library(scales)
library(stringr)
theme_set(theme_minimal(base_size = 14))
babynames <- readRDS("babynames.rds")
ui <- fluidPage(
titlePanel("Scotland's Most Popular Babynames"),
sidebarLayout(
sidebarPanel(
selectInput("decade", "Born in Decade:",
c("1970s" = "1970s",
"1980s" = "1980s",
"1990s" = "1990s",
"2000s" = "2000s",
"2010s" = "2010s")),
radioButtons("gender", "Gender:",
c("Boy" = "boy",
"Girl" = "girl")),
textInput("name_start", "Name starts with", ""),
),
mainPanel(
plotOutput("barPlot")
)
)
)
server <- function(input, output) {
output$barPlot <- renderPlot({
babynames %>%
filter(decade == input$decade,
gender == input$gender,
str_detect(name, paste0("^", str_to_title(input$name_start)))) %>%
arrange(desc(number_of_babies)) %>%
mutate(perc = number_of_babies / sum(.$number_of_babies),
name = factor(name, levels = rev(.$name))) %>%
slice(1:20) %>%
ggplot(aes(x = name, y = perc)) +
geom_bar(stat = "identity", fill = "orange", width = 0.7) +
scale_y_continuous(labels = percent, limits = c(0, 1)) +
labs(x = NULL, y = NULL,
caption = "Source: National Records of Scotland\nBabynames Data 1974-2018") +
coord_flip() +
theme(panel.grid.major.y = element_blank())
})
}
shinyApp(ui = ui, server = server)