March 25, 2014

Visualize the trend of baby names in the U.S. with rCharts

Today, I will try to visualize popular baby names in the U.S. There are three steps.
  1. Prepare the data
  2. Find the top 10 baby names in 2012
  3. Visualize the trajectory of the top 10 from 1910 to 2012
OK, let's do it.

Prepare the data

First, we need to download the zip file from the URL below.
http://www.ssa.gov/OACT/babynames/state/namesbystate.zip
After downloading and unzipping it, we get 52 files. One is a “ReadMe” file and the rest are the baby name data for each state. We read every file other than the “ReadMe” and set the column names.
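library(plyr)
# Work inside the unzipped folder (adjust the path to your download location)
setwd("./downloads/namesbystate")
# List every file except the ReadMe
fs <- list.files()
fs <- fs[fs != "StateReadMe.pdf"]
# Read each state file (no header row) and stack them into one data frame
dat_babyname <- ldply(as.list(fs), function(x) read.csv(x, as.is = TRUE, header = FALSE))
colnames(dat_babyname) <- c("state", "sex", "year", "name", "freq")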
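
The download and unzip step can also be scripted. The following is only a sketch: the helper names `fetch_names_by_state` and `read_state_files` are made up for illustration, and it assumes the state files end in `.TXT`, which may not hold for every version of the archive. It also fixes the column types with `colClasses`, a choice of mine rather than something the steps above require.

```r
# Hypothetical helper: download the archive and unzip it into `dest_dir`.
# It is defined but not called here, because it needs network access.
fetch_names_by_state <- function(url, dest_dir) {
  zip_path <- tempfile(fileext = ".zip")
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = dest_dir)
  invisible(dest_dir)
}

# Hypothetical helper: read every .TXT file in `dir` into one data frame.
# Assumption: data files end in ".TXT", so the ReadMe PDF is skipped by the pattern.
read_state_files <- function(dir) {
  files <- list.files(dir, pattern = "\\.TXT$", full.names = TRUE, ignore.case = TRUE)
  parts <- lapply(
    files, read.csv,
    header = FALSE,
    col.names = c("state", "sex", "year", "name", "freq"),
    # Explicit types keep a column of only "F" from being read as logical FALSE
    colClasses = c("character", "character", "integer", "character", "integer")
  )
  do.call(rbind, parts)
}
```

With these in place, `fetch_names_by_state(url, "namesbystate")` followed by `read_state_files("namesbystate")` would stand in for the manual download, without changing the working directory.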

Find the top 10 baby names in 2012

Next, we find the top 10 baby names in 2012 with the dplyr package. The code below looks complex, but it is just a chain of simple steps: filter the rows, group by name, sum the counts, sort, and keep the first 10.
library(dplyr)
# Note: the original `%.%` chain operator was removed from later dplyr releases;
# `%>%` is its replacement.
top10_female <- dat_babyname %>% filter(year == 2012 & sex == "F") %>% group_by(name) %>% 
    summarise(count = sum(freq)) %>% arrange(desc(count)) %>% head(10)

top10_male <- dat_babyname %>% filter(year == 2012 & sex == "M") %>% group_by(name) %>% 
    summarise(count = sum(freq)) %>% arrange(desc(count)) %>% head(10)
Here's the female result. Sophia is the most popular baby name in 2012. The second is Emma, and the third is Isabella.
top10_female
## Source: local data frame [10 x 2]
## 
##         name count
## 1     Sophia 22158
## 2       Emma 20791
## 3   Isabella 18931
## 4     Olivia 17147
## 5        Ava 15418
## 6      Emily 13550
## 7    Abigail 12583
## 8        Mia 11940
## 9    Madison 11319
## 10 Elizabeth  9596
In the male result, Jacob ranks first, followed by Mason and Ethan.
top10_male
## Source: local data frame [10 x 2]
## 
##         name count
## 1      Jacob 18899
## 2      Mason 18856
## 3      Ethan 17547
## 4       Noah 17201
## 5    William 16726
## 6       Liam 16687
## 7     Jayden 16013
## 8    Michael 15996
## 9  Alexander 15105
## 10     Aiden 14779
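
Since the female and male tables come from the same chain with only the sex changed, the chain could be wrapped in a function. This is a sketch rather than the code used for the tables above: `top_names` and the `toy` data frame are made up for illustration, and it is written with the `%>%` pipe from current dplyr releases.

```r
library(dplyr)

# Hypothetical helper: the n most frequent names for one year and one sex.
top_names <- function(dat, target_year, target_sex, n = 10) {
  dat %>%
    filter(year == target_year, sex == target_sex) %>%  # keep one year and sex
    group_by(name) %>%                                   # pool the states
    summarise(count = sum(freq)) %>%                     # total per name
    arrange(desc(count)) %>%                             # most frequent first
    head(n)
}

# Made-up data in the same layout as dat_babyname
toy <- data.frame(
  state = c("AK", "AK", "AL", "AL", "AL"),
  sex   = c("F", "F", "F", "F", "M"),
  year  = c(2012L, 2012L, 2012L, 2012L, 2012L),
  name  = c("Emma", "Sophia", "Emma", "Ava", "Mason"),
  freq  = c(5L, 7L, 6L, 3L, 9L),
  stringsAsFactors = FALSE
)

top2 <- top_names(toy, 2012, "F", n = 2)
print(top2)
```

On the toy data, Emma's counts from the two states are added together before sorting, which is the same pooling across states that happens in the real tables.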

Visualize the trajectory of the top 10 from 1910 to 2012

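Have the top 10 names of 2012 also been popular for a long time? Finally, let's visualize their trajectories with rCharts. For each sex and year, we compute every name's share of all recorded births as a percentage (prop) and plot it over time.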
# Note: `%>%` replaces the `%.%` operator, which was removed from later dplyr releases.
babyname_trajectory <- dat_babyname %>% group_by(sex, year, name) %>% summarise(count = sum(freq)) %>% 
    group_by(sex, year) %>% mutate(prop = round(count * 100/sum(count), 3)) %>% 
    filter(name %in% c(top10_female$name, top10_male$name))
# Split by sex, keeping only the names in that sex's own top 10
dF <- subset(babyname_trajectory, sex == "F" & name %in% top10_female$name)
dM <- subset(babyname_trajectory, sex == "M" & name %in% top10_male$name)

library(rCharts)
nF <- nPlot(data = dF, x = "year", y = "prop", group = "name", type = "lineChart")
nM <- nPlot(data = dM, x = "year", y = "prop", group = "name", type = "lineChart")
The female case is as follows. The top 3 names of 2012, Sophia, Emma, and Isabella, have skyrocketed since the mid-1990s. In contrast, Emily ranked highest in the 1990s but plunged in the mid-2000s. Elizabeth, meanwhile, has been a popular name since the beginning of the 20th century.
The top two male baby names of 2012, Jacob and Mason, show contrasting trends. Jacob's popularity peaked in 1998 and then declined, while Mason's has been rising since the late 2000s and might surpass Jacob within the next year. You can also see the past glories of the names William and Michael. William ranked high for a long time after becoming popular in 1910, but Michael soared from the 1940s to the 1960s and took over the top spot.
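
The prop values in the charts are percentages within each sex and year, and a statement such as "peaked in 1998" amounts to finding the year where a name's prop is largest. The following base R sketch shows both calculations on made-up numbers. The `toy` data frame, the names in it, and the `peak_year` helper are all hypothetical, and `ave()` is used here as a cross-check on the dplyr version, not as the method behind the charts.

```r
# Made-up counts for two names over three years (not real SSA figures)
toy <- data.frame(
  sex   = rep("M", 6),
  year  = c(1997L, 1997L, 1998L, 1998L, 1999L, 1999L),
  name  = rep(c("NameA", "NameB"), 3),
  count = c(30L, 70L, 40L, 60L, 35L, 65L),
  stringsAsFactors = FALSE
)

# Share of each name within its sex and year, as a percentage.
# ave() returns the group total aligned with every row.
group_total <- ave(toy$count, toy$sex, toy$year, FUN = sum)
toy$prop <- round(toy$count * 100 / group_total, 3)

# Hypothetical helper: the year in which a name's share is highest.
peak_year <- function(traj, target_name) {
  rows <- traj[traj$name == target_name, ]
  rows$year[which.max(rows$prop)]
}

peak <- peak_year(toy, "NameA")
print(peak)
```

Because prop is a share rather than a raw count, trajectories stay comparable across decades even though the number of recorded births changes from year to year.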
Today we got an interesting result. In the next post, I will explore the trend of baby names in each state.
