Tag Archives: Pandas

ESRI REST and Python: pandas.DataFrame

13 Jan

I have been working in JavaScript a lot recently and have missed coding in python. I wanted to work on something that would have real world value so I decided to see if I could load an ESRI REST endpoint in to a pandas DataFrame. This post will only scratch the surface of what you can do with data in python but should give you an idea of what is possible and allow you to imagine some really interesting possibilities. I have posted all of the code as a static IPython notebook and on Python Fiddle for easy copy+paste.

Getting Data

We are going to start by grabbing some open data from the City of Albuquerque – I will use the Crime Incidents REST service. The first step is to make the request and read it.

import urllib, urllib2
param = {‘where’:’1=1′,’outFields’:’*’,’f’:’json’}
url = ‘http://coagisweb.cabq.gov/…/APD_Incidents/MapServer/0/query ?’ + urllib.urlencode(param)
rawreply = urllib2.urlopen(url).read()

If you print rawreply, you will see a long string. We know it is a string because if you print rawreply[0] you will see { as the result. To confirm it, you can type(rawreply) and get back str.

Working with the Data

Now we need to convert it to JSON so that we can do something with the data.

import simplejson
reply = simplejson.loads(rawreply)
print reply[“features”][0][“attributes”][“date”]
print reply[“features”][0][“attributes”][“CVINC_TYPE”]

The above code will return something like

1405468800000
LARCENY ALL OTHER

How many features do we have? 24,440.

print len(reply[“features”])
reply[“features”][:3]

The above code will give you the count and show you the first three results. You can grab subsets of the data by splicing. If you want records 2 through 4, you can reply[“features”][2:5].

Let’s get to the good stuff.

Pandas DataFrame

JSON is great, but let’s put the data in to a table (DataFrame). First, a disclaimer. ESRI REST services return a deeply nested JSON object. Converting this to a DataFrame is more difficult than:

import pandas.io.json as j
jsondata = j.read_json(the URL)

But it is not much more difficult. We just need to recreate the JSON object grabbing what we are interested in – the attributes of the features. We need a loop that will  create an array of dictionaries. Each dictionary will be the attributes of a single feature.

count = 0
myformateddata=[]
while (count < len(reply[“features”])):
mydict={}
for key, value in reply[“features”][count][“attributes”].iteritems():
mydict[key]= value
myformateddata.append(mydict)
count = count + 1

The code above initializes a counter, array and a dictionary. It then loops until there are no more features. The loop reads the attributes of the feature and creates a dictionary for each attribute and its value. Now we have an array of dictionaries. If you len(myFrame) you will get 24,440. That is the same number of records we had in our JSON object.

We have all our data and it is an array. We can now create a DataFrame from it.

from pandas import DataFrame as df
myFrame = df(data=myformateddata)
myFrame

The code above imports the tools we need, assigns the frame to a variable and then prints it. You should see a table like the image below.

DataFrame with Crime Incidents from ESRI REST endpoint

DataFrame with Crime Incidents from ESRI REST endpoint

Data Manipulation

So what can we do with a DataFrame? First, let’s save it. We can export the DataFrame to Excel (The City of Albuquerque provides open data in several formats, but other sources may only provide you with the REST API so this would be a valuable method for converting the data).

from pandas import ExcelWriter
writer = ExcelWriter(‘CrimeIncidents.xlsx’)
myFrame.to_excel(writer,’Sheet1′,index=False)
writer.save()

Great! the data has been exported. Where did it go? The code below will return the directory with the file.

import os
os.getcwd()

You can open the Excel file and work with the data or send it to someone else. But let’s continue with some basic manipulation in the DataFrame.

How many of each type of incident do we have in our data?

myFrame[“CVINC_TYPE”].value_counts()

Incidents Summarized

Incidents Summarized

Now I would like a barchart of the top 5 most frequent Incident Types.

import matplotlib.pyplot as plt
incidentType=myFrame[“CVINC_TYPE”].value_counts()[:5]
incidentType.plot(kind=’barh’,rot=0)

My Bar Chart of the 5 most common Incidents

My Bar Chart of the 5 most common Incidents

Lastly, let’s filter our results by date.

import datetime
import time
myDate=datetime.datetime(2015,1,10)
milli=time.mktime(myDate.timetuple()) * 1000+ myDate.microsecond / 1000

The above code creates a date of January 10, 2015. It then converts it to milliseconds – 1420873200000. If we pass the date to the DataFrame we can grab all the Incidents after January 10, 2015.

myFrame[(myFrame.date>milli)]

DataFrame for Incidents After January 10, 2015

DataFrame for Incidents After January 10, 2015

Now you know how to connect to an ESRI REST endpoint, grab the data, convert it to JSON and put it in a DataFrame. Once you have it in the DataFrame, you can now display a table, export the data to Excel, plot a barchart and filter the data on any field.

IPython Notebook

27 Dec
Image

My IPython Notebook

I love IPython Notebook. It is an interactive web environment for executing code. You can also save and share your notebook. Don’t have IPython? you can view notebooks on the web using the website http://nbviewer.ipython.org. I have a notebook for my Pandas Bar Chart Example. On the viewer site, enter the URL:

http://educationalfacilityplanning.com/MySamplePandasBarChart.ipynb

Or, click http://nbviewer.ipython.org/url/educationalfacilityplanning.com/MySamplePandasBarChart.ipynb

Pandas and Python…..Oh My.

21 Dec

Crunching data and rearranging data in Python is cool, but I really need to visualize it. Nothing fancy, just a bar or line chart. I recently saw a D3 implementation in Python – awesome, but for now I just want to stick to Matplotlib. I grabbed a few books on scientific python and data in python. They seem to love IPython – the web notebook is pretty cool. Another tool that is often mention is Pandas. This is what I want to use – it uses Matplotlib for plotting. The one feature that caught my attention right away was the DataFrame. Think of it as a Excel spreadsheet, then if anyone asks about it, tell them it’s like data.frame() in R.  When I learn something, I like to start bare bones, then build it up with extra options and variations. I have put together some very minimal examples of plotting DataFrames and a Series in Pandas. From here you should have a good grasp of how to do more.

Plotting a Series.

Plotting a Series.

Plotting a Series requires a Series and a type of chart. Here is my code:

from pandas import Series
import matplotlib.pyplot as plt

b=[2,4,6,8,10]
a=Series(b,index=[‘a’,’b’,’c’,’d’,’e’])
Series.plot(a, kind=’bar’) #change to ‘barh’ for horizontal. Can also declare kind=’line’
plt.show()

Plotting a DataFrame

Plotting a DataFrame

 

Plotting a DataFrame is what I need the most in my work. Earlier I compared a DataFrame to an Excel spreadsheet. Here is what a DataFrame looks like:

Simple DataFrame

Simple DataFrame

Looking at the DataFrame and the Chart,  notice that each row plots as a group labeled by the index and columns. The DataFrame is created by passing a Numpy Array.:

a=np.array([[3,6,8,9,6],[2,3,4,5,6],[4,5,6,7,8],[3,6,5,8,6],[5,8,8,6,5]])
df=DataFrame(a, columns=[‘a’,’b’,’c’,’d’,’e’], index=[2,4,6,8,10])

To plot the chart, just call plot and pass a type. Here is the complete code:

from pandas import DataFrame
import matplotlib.pyplot as plt
import numpy as np

a=np.array([[3,6,8,9,6],[2,3,4,5,6],[4,5,6,7,8],[3,6,5,8,6],[5,8,8,6,5]])
df=DataFrame(a, columns=[‘a’,’b’,’c’,’d’,’e’], index=[2,4,6,8,10])

df.plot(kind=’bar’)
plt.show()

This is how I learned to use Pandas DataFrame and to plot my data. Knowing this, I felt much more comfortable looking at more advanced examples online.