March 31, 2014

Why don't you use %>% ?

One of the features of the dplyr package, which is well known as a very useful tool for data manipulation, is the %.% operator. It is called the chain operator, and it chains operations in R as follows.

library(dplyr)
iris %.% group_by(Species) %.% summarise(avg = mean(Sepal.Width))
## Source: local data frame [3 x 2]
## 
##      Species   avg
## 1     setosa 3.428
## 2 versicolor 2.770
## 3  virginica 2.974

%.% works like the pipe operator in UNIX. It is simple but powerful: even when you would otherwise have to create temporary objects, you no longer need to.
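For comparison, here is a quick sketch of the same summary written without chaining, using the same dplyr functions; each intermediate result needs its own temporary object.

grouped <- group_by(iris, Species)                     # temporary object
result <- summarise(grouped, avg = mean(Sepal.Width))  # second temporary object
result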

If you prefer the F# style, you can use the %>% operator from the magrittr package. It has been announced that %.% will be deprecated in the next version of dplyr and replaced by %>%. Please check the link below for details.

https://groups.google.com/forum/#!msg/manipulatr/4EtIPVR3qEw/Xx4Vec7O0CQJ

library(magrittr)
iris %>% group_by(Species) %>% summarise(avg = mean(Sepal.Width))
## Source: local data frame [3 x 2]
## 
##      Species   avg
## 1     setosa 3.428
## 2 versicolor 2.770
## 3  virginica 2.974

March 30, 2014

Try circlize package

Today I try the circlize package. This package enables us to make Circos plots, which are well known for describing transitions of populations. In Japan, this kind of plot was also used in the special TV program covering the last election (below).

http://dl.dropboxusercontent.com/u/956851/test000002.jpg

The circlize package has a good vignette, so I highly recommend reading it.

In this post I try just one example. In future articles, I will describe how to plot in more detail.

library(circlize)
par(mar = c(1, 1, 1, 1))
factors = letters[1:8]
circos.par(points.overflow.warning = FALSE)

# initialize
circos.initialize(factors = factors, xlim = c(0, 10))
circos.trackPlotRegion(factors = factors, ylim = c(0, 1), bg.col = "grey",
    bg.border = NA, track.height = 0.05)

# draw links between sectors: point to point, point to interval, interval to interval
circos.link("a", 5, "c", 5)
circos.link("b", 5, "d", c(4, 6))
circos.link("a", c(2, 3), "f", c(4, 6))

# reset the circular layout parameters
circos.clear()

(Resulting Circos plot)

March 28, 2014

Private decision making: preparing for a meetup

This post is a continuation of the previous post.

Today, I will show an example of private decision making using the three key steps: Collect, Viz, and Imagine. As the example, I will take the case of joining a meetup. Suppose you are going to attend an event, but you do not know most of the participants. To attend, participants have to register on the event site and show their profiles, and some of them make their Twitter accounts public. You want to make some acquaintances at this event. This is the starting point.

Collect

Most event sites have web APIs, for example Meetup, Eventbrite, and ATND. By writing a few lines of code, you can get various data related to events. Okay, let's take ATND as an example. ATND is provided by Recruit, a famous Japanese company, and I usually use this site. For this purpose, I have prepared an R package, firstdate. You can install the package with devtools::install_github. After the installation, you can acquire the data from the API with just one line. Here's the code.

# devtools::install_github("dichika/firstdate")
library(firstdate)
users <- getATNDEventUsers(eventid=48048)
colnames(users)
## [1] "nickname"    "status"      "twitter_img" "twitter_id"  "user_id"

The data is composed of five columns: nickname, status, twitter_img, twitter_id, and user_id. Next, let's see how many people have Twitter accounts.

have_twitter <- sum(!is.na(users$twitter_id))
cat(have_twitter,"/", nrow(users), "(",round(100*have_twitter/nrow(users),1), "% ) people have twitter accounts.")
## 65 / 91 ( 71.4 % ) people have twitter accounts.

71.4% of the attendees have Twitter accounts.

What topics are they interested in? The choice of topics for a talk is a very important step, so let's collect their recent tweets. To get tweets, you need authorization from Twitter: you have to register your account as a Twitter developer and get an API key and API secret. With your key and secret, you can acquire the 180 most recent tweets of each participant. Today, we will take the first 10 participants as an example.

library(plyr)
# take the first 10 participants with Twitter accounts (excluding "holidayworking")
ids <- head(users$twitter_id[!is.na(users$twitter_id) & users$twitter_id != "holidayworking"], 10)
tweets <- ldply(ids, function(x) getTwitter(x, key = "Your key", secret = "Your secret"))

Did you get the tweets successfully? Okay, let's move to the next step, Viz.

Viz

For visualization, you can choose from various methods, for example networks, timelines, simple bar charts, and so on. In this case, we visualize the tweets as simple timelines. As shown below, the frequency and the contents vary among participants.

# write an interactive timeline of the collected tweets and open it in a browser
visTL(tweetdata=tweets, group="name", path="twTL.html")
browseURL("twTL.html")
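If you just want one of the simpler views mentioned above, a bar chart of tweet counts per participant is enough. Below is a minimal sketch with base R, assuming tweets has the name column that visTL() groups by.

# number of recent tweets per participant, sorted in decreasing order
counts <- sort(table(tweets$name), decreasing = TRUE)
barplot(counts, las = 2, cex.names = 0.7, main = "Recent tweets per participant")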

Imagine

The final step is Imagine. This is a simple step: from the participants' tweets, you just imagine what topics they are interested in. In some cases, mathematical models and machine learning methods will help your imagination. For example, LDA might summarise messy tweet data and extract topics. At present I have not implemented any such methods in firstdate, but I will do so soon.
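As a rough sketch of what such a step could look like outside firstdate, the tm and topicmodels packages can fit an LDA model to the collected tweets. Note that the text column below is an assumption about the structure returned by getTwitter().

library(tm)           # text preprocessing
library(topicmodels)  # LDA

# build a document-term matrix from the tweet text (assumed column: text)
corpus <- Corpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]  # drop tweets with no remaining terms

# fit a 5-topic model and inspect the top 10 terms of each topic
fit <- LDA(dtm, k = 5)
terms(fit, 10)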

So far, I have introduced the three steps for private decision making: Collect, Viz, and Imagine. On this blog, I will keep introducing other useful cases.

March 27, 2014

The key steps of private decision making: Collect, Viz, Imagine

It is said that the 21st century is the era of Big Data. We can access various kinds of data, for example web log data, massive sensor data from machines, public statistics, and so on. All of our activity is recorded and output in various forms.

In the business world, data-driven decision making is becoming popular. Data is being stored in forms that are easy to use, and the combination of massive data and fine-grained statistical modeling is achieving success every day.

By comparison, can we say that we handle and make use of all this data for our own decision making, that is, private decision making? Aren't we simply overwhelmed by data? For example, when you join a meetup, I'm afraid most of you do not utilize the information about the other participants. If you made full use of that information, you could open up opportunities for your future. In our everyday life, we can barely breathe in the flood of data.

I'm convinced that Collect, Viz, and Imagine are the key steps for making full use of massive data. First, Collect is the step of collecting data. This includes acquiring data through public web APIs and crawling data across the web. Second, Viz is the visualization of the collected data. As you can see on this blog, we can easily visualize data in various forms with programming. Finally, Imagine is coming up with hypotheses from the data and your own experience. I suspect this step goes well with Bayesian statistics. At present I'm not very good at Bayesian modeling, so I usually rely on the natural Bayesian inference of my own brain and experience.

In the next post, I will describe an example of how to Collect, Viz, and Imagine with R.

March 26, 2014

How did Sophia spread in the U.S.?

In the previous post, I found the popular baby names in the U.S. For girls, the most popular baby name is Sophia. As you can see, Sophia has become popular recently; in fact, its popularity increased sharply from the 1990s to the 2000s.

In today's post, I will graphically examine the regional characteristics of Sophia with the rMaps package.

Preparation

As in the previous post, I prepare the data with the plyr package. Furthermore, I add a prop column to the data with the dplyr package; it denotes the proportion of each name within each state, sex, and year. Okay, we are all ready. Let's visualize it.

library(plyr)
# read the state-level baby name files (one CSV per state), skipping the readme
setwd("./downloads/namesbystate")
fs <- list.files()
fs <- fs[fs!="StateReadMe.pdf"]
dat_babyname <- ldply(as.list(fs), function(x)read.csv(x, as.is=TRUE, header=FALSE))
colnames(dat_babyname) <- c("state", "sex", "year", "name", "freq")

library(dplyr)
# prop: each name's share (%) of all recorded names within the same state, sex and year
sophia <- dat_babyname %.% 
  group_by(state,sex,year) %.% 
  mutate(prop=round(freq*100/sum(freq),2)) %.%
  filter(name=="Sophia")

Visualize 2012

First, I visualize the regional trend of Sophia in 2012. Here's the result. The higher the prop values are, the darker the colors get. You can see that Sophia is more popular on the west side of the U.S. than on the east.

library(rMaps)
sophia2012 <- subset(sophia, year==2012)
isophia2012 <- ichoropleth(prop~state,  data=sophia2012, pal="Reds", 
                           geographyConfig=list(popupTemplate="#!function(geo, data) {
                                                return '<div class=\"hoverinfo\"><strong>'+
                                                geo.properties.name+
                                                '<br>Prop: ' + data.prop +
                                                '</strong></div>';}!#"))
isophia2012$show("iframesrc", cdn = TRUE)

Visualize 1910-2011

Next, I visualize the transition of Sophia over time. In contrast to the present, Sophia does not seem to have been popular on the west side at the beginning of the 20th century. However, it spread to the south after the mid-1960s.

isophia <- ichoropleth(prop~state, data=sophia, pal="Reds", animate="year", play=TRUE,
                       geographyConfig=list(popupTemplate="#!function(geo, data) {
                                            return '<div class=\"hoverinfo\"><strong>'+
                                            geo.properties.name+
                                            '<br>Prop: ' + data.prop +
                                            '</strong></div>';}!#"))
isophia$show("iframesrc", cdn = TRUE)