Learning R with Albuquerque Open Data

19 Jan

I have posted several tutorials on using REST endpoints from the City of Albuquerque, but the city has many data sets available in CSV that have no REST endpoint. In this post, I am going to show you how to use RStudio to perform simple statistics and generate graphs from the data.

The first thing we need to do is get a copy of R – a free statistical computing environment. We also need RStudio. This will make working with R much easier.

Once you have everything installed, go the the Albuquerque Open Data page. Scroll to the bottom and select “Graded Employee Earnings.” Download the csv file. I renamed the file “pay.txt” just to make things easier later. This file contains all the graded city employee names, their base pay and ytd earnings. Should be an interesting data set to play with.

Now launch your newly installed RStudio.

The details presented below will be very minimal. It is my intent to show you how quickly you can accomplish basic statistical analysis in R, not to provide an in depth tutorial on R. There are many free tutorials, a long one at the r-project and a shorter one at the r-project.

Start by opening the data. The data is actually tab delimited, so set that as the sep value.

data <- read.delim(file=”pay.txt”,header=TRUE, sep=”\t”, quote=”\””)

You can see what is in it by typing data, but it is a long list so we will use the head function to get the first few rows with a header.

head(data)

If you only want a specific column you can type

data[“BASE.HOURLY.RATE”]

or multiple columns

data[c(“BASE.HOURLY.RATE”,”DEPARTMENT”)]

You can grab some summary statistics. I only want to know about the hourly wage so I will pass that column as shown earlier.

summary(data[“BASE.HOURLY.RATE”])

This should result in the image below

summary

Half of all graded employees make less then $17.48 an hour and there is someone making $59.37. There is not a single graded employee making less than $8.75.

Graphs help us visualize data, so let’s create a frequency graph for hourly rate. We need to know the range of the data – the summary told us (take the min and max), but we will use the range function.

range(data[“BASE.HOURLY.RATE”])

We need groups for our counts so we will create breaks using the range. I rounded down on the min and up on the max.

breaks=seq(8,60,by=2)

Now I want to grab the data without the header and cut it in to the breaks(groups).

mydata=data[“BASE.HOURLY.RATE”][,1]

mydata.cut=cut(mydata,breaks,right=FALSE)

Now combine the data and plot it.

cbind(mydata.freq)
plot(mydata.freq)

In the lower right corner of your RStudio window, you will see your chart. You can export it to several formats. It should look like the image below.

RplotFreq

Hourly wages start to fall after about $17 an hour but jump up again at around $27. It looks like most people make between $12 to $22. This is confirmed by our summary earlier (the average is $19.04).

You can easily generate a histogram

hist(mydata)

RplotHistogram

Or a box plot

boxplot(mydata,horizontal=TRUE)

RplotBoxplot

You can graph the density. This will further confirm our belief that the majority of employees make $12-$22.

den=density(mydata)
plot(den, main=”Density of Base Hourly Rate”)

RplotDensity

I can see a tail on the right side of all three graphs. Let’s check the skewness to confirm it. To do so, we need to install a package called moments, import it and then execute the function.

install.packages(“moments”)
library(moments)
skewness(mydata)

The result is 1.100316. A positive number indicates a skew to the right. I was correct.

The last thing I want to do is find the median hourly wage by department. I will grab the department name and hourly wage columns and then aggregate them.

subset <-data[c(“DEPARTMENT”,”BASE.HOURLY.RATE”)]
theMedian <-aggregate(x=subset[“BASE.HOURLY.RATE”],FUN=median,by=list(Group.Department=subset$DEPARTMENT))
theMedian

The results are shown in the image below.

agg

Council Services has the highest median hourly wage and Family and Community Services has the lowest.

R is a very powerful tool for statistical analysis and RStudio helps make it even easier. In two lines of code we had the Min,Median,Mean, Max, 1st and 3rd Quartiles. We also generated graphs with single lines of code and saved them out as images. If you need to perform data wrangling, R could be a great solution for you. I am going to make a point to try to learn more. Can never have enough tools in your belt.

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: