March 15, 2014

Exploring the github archive 2

Getting the data

Previously, I used Google BigQuery to retrieve the GitHub Archive data. This time I use only R. First, I download one hour of data and decompress it.

# get data
now <- format(as.POSIXct("2014-03-14 15:00:00"), "%Y-%m-%d-%H")
url <- sprintf("http://data.githubarchive.org/%s.json.gz", now)
tmpgz <- "/Users/myname/Downloads/github.json.gz"
tmpjson <- "/Users/myname/Downloads/github.json"
# mode = "wb" keeps the gzip file intact on platforms that default to text mode
download.file(url, tmpgz, mode = "wb")
system(paste0("gunzip ", tmpgz))
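The separate decompression step is optional, since base R can read gzip-compressed text through a `gzfile()` connection. Here is a minimal sketch of that idea. It uses a small temporary file in place of the real archive, so the file name and the two sample lines are assumptions made up for illustration.

```r
# Write a tiny gzip-compressed file to stand in for the downloaded archive.
# The file name and its two lines are hypothetical sample data.
sample_gz <- tempfile(fileext = ".json.gz")
con <- gzfile(sample_gz, open = "w")
writeLines(c('{"type":"PushEvent"}', '{"type":"WatchEvent"}'), con)
close(con)

# Read the lines straight from the compressed file, no gunzip needed.
con <- gzfile(sample_gz, open = "r")
lines <- readLines(con)
close(con)

length(lines)
```

With the real download, reading `gzfile(tmpgz)` this way should be able to stand in for both the `gunzip` call and the later `scan()`, assuming the archive holds one JSON record per line.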

Data cleaning

Some records lack certain fields, such as language or description, so I first check whether each record has them and subset the data accordingly. I also restrict the data to “PushEvent” records and to repositories whose language is R. Finally, I add an extra column, type, which indicates whether the repository is a package or not, judged by whether a DESCRIPTION file exists at the top of its master branch.

# data cleaning
library(rjson)
res <- scan(tmpjson, what="character", sep="\n")
parsed <- lapply(as.list(res), fromJSON)

library(plyr)
condition <- ldply(parsed, function(x){
  data.frame(type=x$type,
             lang=!is.null(x$repository$language),
             desc=!is.null(x$repository$description))})
dat <- ldply(parsed[condition$type=="PushEvent" & condition$lang & condition$desc],
             function(x){
               data.frame(created_at=x$created_at,
                          isFork=x$repository$fork,
                          forks=x$repository$forks,
                          url=x$repository$url,
                          description=x$repository$description,
                          owner=x$repository$owner,
                          name=x$repository$name,
                          language=x$repository$language,
                          stringsAsFactors=FALSE
                          )}
             )
dat <- subset(dat, language=="R")
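The three conditions used above (event type, language present, description present) can also be written as a single predicate. The following is only a sketch in base R on toy data: the `events` list and the helper name `is_complete_push` are made up for illustration, and only the field names follow the code above.

```r
# Toy stand-ins for parsed events; the fields mirror the ones used above,
# but the values are invented for illustration.
events <- list(
  list(type = "PushEvent",
       repository = list(name = "a", language = "R", description = "pkg a")),
  list(type = "PushEvent",
       repository = list(name = "b")),  # no language or description
  list(type = "WatchEvent",
       repository = list(name = "c", language = "R", description = "pkg c"))
)

# TRUE when the event is a push and the repository has both fields.
is_complete_push <- function(x) {
  identical(x$type, "PushEvent") &&
    !is.null(x$repository$language) &&
    !is.null(x$repository$description)
}

keep <- vapply(events, is_complete_push, logical(1))
kept_names <- vapply(events[keep], function(x) x$repository$name, character(1))
kept_names
```

Only the first toy event passes: the second lacks the two fields and the third is not a push.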

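# summarise
library(dplyr)
# %>% replaces the older %.% operator, which later dplyr versions removed
result <- dat %>%
  group_by(owner, name, description, url, isFork) %>%
  summarise(forks=sum(forks), recent_activity=max(created_at)) %>%
  ungroup()  # drop the grouping before modifying columns
# a repository counts as a package if DESCRIPTION exists on its master branch
result$type <- laply(sprintf("https://raw.github.com/%s/%s/master/DESCRIPTION", 
                              result$owner, result$name), 
                     function(x)httr::http_status(httr::GET(x))$category)
result$type <- ifelse(tolower(result$type)=="success", "package", "other")
# turn the repository URL into a clickable link for the table
result$url <- paste0("<a href='", result$url, "' target='_blank'>URL</a>")
result <- result %>% arrange(desc(type), desc(recent_activity))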
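To make the summarise step concrete, here is a small base R sketch on toy data. Every value in `toy` is made up. It also shows `max` next to `sum` for the forks column: if `forks` is the repository's fork count at the time of each event, as it appears to be, then summing adds that count once per push, while `max` returns the count itself.

```r
# Toy push events; every value here is invented for illustration.
toy <- data.frame(
  owner = c("alice", "alice", "bob"),
  name = c("pkgA", "pkgA", "scripts"),
  forks = c(2, 2, 0),
  created_at = c("2014-03-14T15:01:00-07:00",
                 "2014-03-14T15:40:00-07:00",
                 "2014-03-14T15:10:00-07:00"),
  stringsAsFactors = FALSE
)

# One key per repository, then one summary value per key.
repo <- paste(toy$owner, toy$name, sep = "/")
forks_sum <- tapply(toy$forks, repo, sum)
forks_max <- tapply(toy$forks, repo, max)
recent_activity <- tapply(toy$created_at, repo, max)

unname(forks_sum["alice/pkgA"])        # 4: two pushes, fork count added twice
unname(forks_max["alice/pkgA"])        # 2: the fork count itself
unname(recent_activity["alice/pkgA"])  # "2014-03-14T15:40:00-07:00"
```

Taking `max` of the timestamps as strings only picks the latest event when all of them share the same format and UTC offset, which holds for the toy data here.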

Visualize

For the visualization, I use the dTable function in the rCharts package, which renders the data frame as an interactive DataTables table. Here's the result.

library(rCharts)
dTable(result, sScrollX = "600px", sScrollY = "400px",
       bPaginate = TRUE, sPaginationType = "full_numbers",
       bScrollInfinite = TRUE, bScrollCollapse = TRUE)
