Getting data from the Infochimps Geo API in R
I am very intrigued by the Infochimps Geo API, so I wanted to play around with it a little and pull the data into R. I'll start by getting data from the American Community Survey Topline API for a 10 km radius around where I live.
First, some setup code. It loads the two libraries we'll need (RJSONIO and ggplot2), then sets up some variables that we'll use later to construct the REST call to Infochimps Geo.
[sourcecode language="r" light="true"]
library(RJSONIO)
library(ggplot2)
api.uri <- "http://api.infochimps.com/"
acs.topline <- "social/demographics/us_census/topline/search?"
api.key <- "apikey=xxxxxxxxxx"
radius <- 10000 # in meters
lat <- 44.768202
long <- -91.491603
columns <- c("geography_name","median_household_income",
"median_housing_value", "avg_household_size")
[/sourcecode]
Note: if you want to use this code, you'll need to replace the x's in api.key with your own Infochimps API key.
I am going to pass a latitude, longitude and radius into the API, and it will give me back data for just that geographic area. You can specify the geography in a number of ways (see the API docs for more info).
Next we have to construct the URI to call the API, retrieve the data (which comes in as JSON) and convert the JSON object into an R object using RJSONIO.
[sourcecode language="r" light="true"]
uri <- paste(api.uri, acs.topline, api.key, "&g.radius=", radius, "&g.latitude=", lat, "&g.longitude=", long, sep="")
raw.data <- readLines(uri, warn=FALSE) # warn expects a logical, not the string "F"
results <- fromJSON(raw.data)
[/sourcecode]
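The hand-built paste() call above works fine for a fixed set of parameters. As a rough sketch, the query part could also be assembled from a named list, which makes it easier to add or drop geography parameters later. Note that make.query is a hypothetical helper name of my own, not part of RJSONIO or the Infochimps API, and the parameter names are simply the ones used above.

```r
# Hypothetical helper (not part of any package): turn a named list of
# parameters into a URL query string, URL-encoding each value.
make.query <- function(params) {
  values <- sapply(params, function(v) utils::URLencode(as.character(v), reserved = TRUE))
  paste(names(params), values, sep = "=", collapse = "&")
}

# Same parameter names and values as in the setup code above.
query <- make.query(list(g.radius    = 10000,
                         g.latitude  = 44.768202,
                         g.longitude = -91.491603))
print(query)

# The full URI would then be built the same way as before, e.g.
# uri <- paste(api.uri, acs.topline, api.key, "&", query, sep = "")
```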
Next, we need to do some manipulation on the retrieved data to get it into a form that's easier to deal with. I like using data frames a lot, so I'll turn the results into one:
[sourcecode language="r" light="true"]
ml <- lapply(results$results, function(x) x[columns])
mm <- matrix(unlist(ml), ncol=length(columns), byrow=TRUE)
md <- data.frame(mm)
colnames(md) <- columns
[/sourcecode]
You will now have a data frame in md that looks like
[sourcecode language="r" light="true"]
> head(md)
           geography_name median_household_income median_housing_value
1 Altoona School District                   48699               151800
2       Census Tract 5.01                   54498               132100
3       Census Tract 5.02                   60018               139300
4       Census Tract 8.02                   66432               186700
5       Census Tract 8.01                   65833               149900
6           Census Tract 7                  33365               117500
[/sourcecode]
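To see what the reshaping steps do without calling the API, here is a small sketch that runs them on a made-up list shaped like the parsed response. The records and the "extra" field are invented for illustration only. It also shows why the values don't arrive as numbers: unlist() has to coerce names and numbers to a single type, so everything becomes text (and in R versions before 4.0, data.frame() then turns that text into factors by default).

```r
# Made-up stand-in for the parsed API response (illustration only).
mock <- list(results = list(
  list(geography_name = "Tract A", median_household_income = 50000, extra = "x"),
  list(geography_name = "Tract B", median_household_income = 60000, extra = "y")
))
cols <- c("geography_name", "median_household_income")

# Keep only the wanted fields from each record.
ml.mock <- lapply(mock$results, function(x) x[cols])

# Flatten to one vector and refill row by row; mixing text and numbers
# forces every value to character here.
mm.mock <- matrix(unlist(ml.mock), ncol = length(cols), byrow = TRUE)

md.mock <- data.frame(mm.mock, stringsAsFactors = FALSE)
colnames(md.mock) <- cols

str(md.mock) # both columns are character, not numeric
```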
Unfortunately, the columns for median_household_income and median_housing_value are factors instead of numbers at this point. I don't know an automated way to change them to numeric, so you'll have to use
[sourcecode language="r" light="true"]
md$median_household_income <- as.numeric(as.character(md$median_household_income))
md$median_housing_value <- as.numeric(as.character(md$median_housing_value))
[/sourcecode]
by hand to turn them into numbers. You could put that into your script, but it would need to change each time you wanted different columns from the data set. If you know a way to automatically convert a factor column that should be numeric, I'd be grateful to hear it.
Once that's done, you can run your favorite analysis on the data, such as plotting a density graph or fitting a linear model.
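One possible approach, offered as a sketch rather than a definitive answer: base R's type.convert() guesses a sensible type for a character vector, so applying it to every column after as.character() may do the conversion without naming columns by hand. The small data frame below is made up to mimic the all-factor result described above, and md.demo is just an illustrative name.

```r
# Made-up data frame whose columns are all factors, mimicking the problem.
md.demo <- data.frame(geography_name = c("Tract A", "Tract B"),
                      median_household_income = c("48699", "54498"),
                      stringsAsFactors = TRUE)

# For every column: read the factor levels as text first, then let
# type.convert() decide whether the text is numeric. as.is = TRUE keeps
# non-numeric columns as character instead of turning them back into factors.
md.demo[] <- lapply(md.demo, function(col) type.convert(as.character(col), as.is = TRUE))

str(md.demo)
```

The same lapply() line should work on md itself, assuming every value in a numeric column really parses as a number; a single stray non-numeric entry would leave that whole column as text.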
[sourcecode language="r" light="true"]
qplot(median_household_income, data=md, geom="density")
model <- lm(median_housing_value ~ median_household_income, data=md)
[/sourcecode]
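As a small follow-up sketch, here is how the fitted model could be inspected. To keep it runnable without an API key, it uses only the six rows printed by head(md) earlier, so the numbers it produces describe those six rows and not the full result set. The names md.head and model.head, and the income of 50000, are just illustrative choices.

```r
# The six rows shown in the head(md) output earlier in this post.
md.head <- data.frame(
  median_household_income = c(48699, 54498, 60018, 66432, 65833, 33365),
  median_housing_value    = c(151800, 132100, 139300, 186700, 149900, 117500)
)

# Same formula as above, fitted on the six-row sample.
model.head <- lm(median_housing_value ~ median_household_income, data = md.head)

# Intercept and slope of the fitted line.
print(coef(model.head))

# Predicted housing value for a hypothetical household income of 50000.
pred <- predict(model.head, newdata = data.frame(median_household_income = 50000))
print(pred)
```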
Have fun and play around with it. There are quite a few fields in the ACS topline data that you can explore. Here are the fields in the topline data:
"percent_black" "percent_0_to_9_yo" "percent_pacific" "percent_income_50_to_75k" "percent_income_100_to_200k" "percent_asian" "md5id" "avg_household_size" "geo_geometry_type" "percent_house_value_100_to_200k" "percent_race_hispanic" "intersects" "percent_carpool" "percent_housing_owned" "_type" "percent_income_25_to_50k" "percent_income_75_to_100k" "percent_18_to_24_yo" "geography_name" "census_logrecno" "percent_drive_alone" "percent_hs_graduate" "percent_public_trans" "percent_house_value_500_to_1000k" "percent_house_value_lt_50k" "percent_house_value_gtr_1000k" "percent_housing_rented" "fips_id" "percent_10_to_17_yo" "percent_house_value_50_to_100k" "percent_income_lt_25k" "total_pop" "percent_race_nonhispanic" "percent_work_at_home" "percent_female" "percent_65_over_yo" "percent_white" "median_household_income" "percent_mixed_race" "percent_native" "percent_ba_or_above" "percent_less_hs" "percent_50_to_64_yo" "inside" "percent_35_to_49_yo" "percent_25_to_34_yo" "median_housing_value" "percent_income_gtr_200k" "percent_some_college" "percent_other_trans" "percent_male" "percent_house_value_200_to_500k" "coordinates"
I've created a Gist with all the code at https://gist.github.com/1208431.
UPDATE: Thanks to Patrick Hausmann for creating a function that will allow me to turn the entire results received from the API into a data frame without the ugly "columns" variable that I was using. Here is the new function he provided. I have updated the Gist referenced above with a new version of the code.
[sourcecode language="r" light="TRUE"]
## Special thanks to Patrick Hausmann for the GetData function
GetData <- function(x) {
  L <- vector(mode="list", length = x$total)
  a1 <- sapply(x$results, function(z) sapply(z, length) )
  field.names <- names( which(apply(a1, 1, function(z) all(z == 1) )) )
  a2 <- lapply(x$results, function(z) z[names(z) %in% field.names] )
  for (i in seq_along(a2) ) {
    x1 <- a2[[i]]
    x2 <- data.frame(x1)
    L[[i]] <- x2
  }
  x4 <- do.call(rbind, L)
  return(x4)
}

md <- GetData(results)
str(md)
[/sourcecode]
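For anyone who wants to see the idea behind GetData in isolation, here is a condensed sketch of the same approach: keep only the fields that hold a single value in every record (which drops multi-value fields such as "coordinates"), build a one-row data frame per record, and bind the rows together. This is my own simplified variant, not Patrick's code; scalar.fields.to.df is a hypothetical name, and the mock records are made up.

```r
# Made-up stand-in for the parsed API response (illustration only).
mock <- list(total = 2, results = list(
  list(geography_name = "Tract A", total_pop = 1200, coordinates = c(-91.5, 44.8)),
  list(geography_name = "Tract B", total_pop = 3400, coordinates = c(-91.4, 44.7))
))

# Hypothetical helper: one data frame row per record, scalar fields only.
scalar.fields.to.df <- function(x) {
  # Names of the length-1 fields in each record, then the set common to all.
  per.record <- lapply(x$results, function(rec) names(rec)[sapply(rec, length) == 1])
  scalar.names <- Reduce(intersect, per.record)
  # Each record becomes a one-row data frame; numbers stay numeric.
  rows <- lapply(x$results, function(rec) data.frame(rec[scalar.names], stringsAsFactors = FALSE))
  do.call(rbind, rows)
}

md.mock2 <- scalar.fields.to.df(mock)
str(md.mock2)
```

Because each record is converted to a data frame before binding, numeric fields keep their type, so the as.numeric(as.character(...)) step from earlier should no longer be needed.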
Published
10 September 2011