Visualizing New York City Traffic Data

As part of a submission to Bracket’s Latest Issue, my research partners and I decided to supplement our paper by visualizing traffic patterns in a dense urban environment. I decided this also provided a great opportunity to begin learning Python. Coming from a design background, I have fairly limited programming knowledge, so I was very much looking forward to the challenge of learning a modern language like Python.

Our specific goal was to map how traffic volume changes over a period of 24 hours in dense urban centers. We began the task by seeking a good data source. Surprisingly, hourly traffic data for major cities was fairly difficult to come by. After scouring various open data resources we eventually found the website for New York State’s Department of Transportation. Their massive archives contained the kind of unadulterated raw data we were seeking. This provided us with enough source material to achieve a fine-grained analysis.

Naturally, the downside in dealing with raw data is that it comes in massive quantities, formatted in ways that are not necessarily easy to work with. The total dataset was about 500MB encoded as CSV files. Based on how they provided the data, we were able to manually obtain a subset of about 60000 entries to work with. Clearly a programmatic approach would be needed to parse our subset into a suitable format for visualization.

For several months I had been dabbling in the Processing programming language in order to create visualizations. Although I intended to use Processing once again to produce the final graphics for this project, it wasn’t the ideal candidate to parse the raw data into a usable form. Having wanted to learn Python for some time, I decided it was more appropriate for this part of the project.

My first step was to determine what data I actually needed for the final visualization. The primary element was obviously the actual hourly traffic counts provided in the Department of Transportation dataset, but I also needed to associate those with some spatial data in order to construct a graphical map. Our source data only had pieces of the necessary information which I would have to stitch together and cross reference with other sources using Python.

The raw traffic count data was composed of about 25000 lines that looked like this:

040004,12/13/2009 02:00,3,0,4109

The first number is a unique ID assigned to each road in New York State; what follows is the date and time of each entry, and the final number is the actual traffic count for each lane:

[Image: New York State Traffic Count Data]

At first glance the data seems pretty limited. This is because most of the information lives behind the ID number, which identifies a particular road and can be decoded by cross-referencing it with a master header file. The header file contains about 36000 entries that look like this:

"710046","NY"," 3",,"CORNELIA ST","244.87","PLATTSBURGH W CITY LN",
"BROAD ST.","CLINTON","PLATTSBURGH","CITY","30","14"

The breakdown is as follows:

[Image: New York State Traffic Data]

This process was pretty simple, even for a Python noob like myself. Using Python’s built-in csv library, I looped through the traffic count file, and each time I encountered a new unique ID I looped through the header file to find the corresponding line. I saved the cross-referenced data to a new CSV in the following format:

041003,HUDSON ST,CHRISTOPHER ST,BETHUNE ST

This provided me with a much reduced dataset that included only the headers that were relevant to the area I was mapping. I saved the ID number, road name and the starting and ending intersections. I would use these later to determine the precise location of the traffic data.
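
For the curious, the lookup boiled down to something like the sketch below. This is a reconstruction rather than my actual script (the file names and column positions are assumptions based on the samples above), and unlike my original approach of re-scanning the header file for every new ID, it loads the headers into a dictionary first:

    import csv

    # Build a lookup from the master header file: road ID -> (road name,
    # starting intersection, ending intersection). File names and column
    # positions here are assumptions for the sake of the example.
    headers = {}
    with open('header_file.csv') as f:
        for row in csv.reader(f):
            if len(row) < 8:
                continue
            headers[row[0]] = (row[4], row[6], row[7])  # name, from, to

    # Walk the traffic count file and write out the header fields for each
    # unique road ID that actually appears in the counts.
    seen = set()
    with open('traffic_counts.csv') as counts, \
         open('reduced_headers.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for row in csv.reader(counts):
            road_id = row[0]
            if road_id in seen or road_id not in headers:
                continue
            seen.add(road_id)
            name, start, end = headers[road_id]
            writer.writerow([road_id, name, start, end])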

After creating the reduced header file I took a closer look at the actual traffic count data. Originally, I had naively assumed that the data for each street would be recorded at consistent dates and times. In reality it was much more of a hodge-podge: various streets were monitored at different times throughout the year, some had complete records, and others were spotty. This presented a problem, because we obviously wanted to map a comparable dataset. Since this was still the best data source we could find, we settled on a compromise: we would only take complete records covering a continuous 24-hour period, and where possible we would ensure that the data was recorded on a Wednesday. In cases where a Wednesday was not available, we would take records from the next closest day. Although this wasn’t the ideal situation for the project, it did make for an interesting programming challenge.

I found this task more difficult than the first, but still manageable. I started by looking closely at Python’s datetime library. It took some tinkering and a lot of trial and error, but eventually I was able to parse the dates correctly and get an integer for the day of the week using the strptime() and strftime() methods. I was then able to use this information to rank each set of entries by its proximity to Wednesday. Checking whether the set of records was complete for a given date was just a matter of ensuring it had the correct number of entries. I wrote this information out to a consolidated traffic count file.
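
The core of that step looked roughly like the following. Again, this is a sketch rather than my actual code, with file names assumed and the completeness check simplified to 24 hourly entries per day (the real check depends on how many lanes a given road reports):

    import csv
    from collections import defaultdict
    from datetime import datetime

    # Group the hourly entries by road ID and calendar date. The date format
    # matches the sample line above ("12/13/2009 02:00").
    records = defaultdict(lambda: defaultdict(list))
    with open('traffic_counts.csv') as f:
        for row in csv.reader(f):
            stamp = datetime.strptime(row[1], '%m/%d/%Y %H:%M')
            records[row[0]][stamp.date()].append(row)

    def distance_from_wednesday(day):
        # strftime('%w') gives the weekday as a string, 0 = Sunday .. 6 = Saturday,
        # so Wednesday is 3; the distance wraps around the week.
        dow = int(day.strftime('%w'))
        return min(abs(dow - 3), 7 - abs(dow - 3))

    with open('consolidated_counts.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for road_id, days in records.items():
            # Keep only dates with a full 24-hour record, then prefer the
            # date closest to a Wednesday.
            complete = [d for d, rows in days.items() if len(rows) == 24]
            if complete:
                best = min(complete, key=distance_from_wednesday)
                writer.writerows(days[best])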

At this point, the Department of Transportation data was in manageable form, though crucially I was still missing location data. The only information I had to work with were the road name and start and end intersections which I had previously saved in the consolidated header file. This part definitely provided the most exciting challenge. My goal was to get the latitudes and longitudes for these intersections in order to construct a map. I was aware of the existence of several geolocation services, but I was uncertain as to how robust they would be in terms of parsing my search queries. I also had never written any code that interfaced with external APIs or web services.

The first service I tried was geocoder.us. Working with their service was pretty straightforward: I simply had to craft a URL containing my search query and then parse the information that geocoder.us returned. Once again, Python’s standard libraries made this fairly simple. I built each URL from my compiled header file with some string concatenation and replacement, then passed it to urllib, which returned the data from the geocoder service. Geocoder.us lets you specify the return format in a number of different ways; I chose CSV since I was already pretty familiar with it at this point. The downsides to their service were that its interpretation of my intersection names was not very robust and that it only allows one query every 15 seconds. Ultimately I was only able to retrieve latitudes and longitudes for a small part of my dataset using this service.
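
For reference, the querying loop looked something like the sketch below. The endpoint and query format are written from memory, so double-check them against the geocoder.us documentation (and note that I was using Python 2’s plain urllib at the time, while this is written for Python 3):

    import csv
    import time
    import urllib.parse
    import urllib.request

    # geocoder.us accepts intersection queries like "Hudson St & Christopher St,
    # New York NY" and can return results as CSV (latitude and longitude first).
    # The endpoint below is how I remember their CSV service; treat it as
    # illustrative.
    GEOCODER_URL = 'http://rpc.geocoder.us/service/csv?address='

    def geocode_intersection(street, cross_street, city='New York', state='NY'):
        query = '%s & %s, %s %s' % (street, cross_street, city, state)
        reply = urllib.request.urlopen(GEOCODER_URL + urllib.parse.quote(query))
        fields = reply.read().decode('utf-8').strip().split(',')
        try:
            return float(fields[0]), float(fields[1])  # latitude, longitude
        except (IndexError, ValueError):
            return None  # the service couldn't make sense of the intersection

    with open('reduced_headers.csv') as f, \
         open('intersections.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for road_id, name, start, end in csv.reader(f):
            coords = geocode_intersection(name, start)
            if coords:
                writer.writerow([road_id, name, start, coords[0], coords[1]])
            time.sleep(15)  # the free service allows one query every 15 seconds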

Since geocoder.us was not up to the task, I set out in search of a new geocoding service. Eventually I found Yahoo! GeoPlanet, which turned out to be fantastic. Its search was much more robust and it didn’t have a rate limit. It did have a couple of added complications: I had to have an API key, and the data was always returned as XML. Getting the API key was painless; I just had to go through a fairly standard online signup process and then append the key to my queries. Working with the XML data wasn’t much more difficult than working with the CSVs once I found the appropriate libraries. Ultimately, I was able to obtain good latitude and longitude information for the majority of my dataset using Yahoo’s service.
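
Just to give a flavor of it, here’s a rough reconstruction of the GeoPlanet lookup. The endpoint, namespace, and element names are written from memory of a service that has since been retired, so treat them as illustrative rather than exact:

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    APP_ID = 'YOUR_APP_ID'  # obtained through Yahoo!'s developer signup

    # The endpoint and XML schema below are reconstructed from memory of the
    # service as it existed at the time.
    BASE_URL = "http://where.yahooapis.com/v1/places.q('%s')?appid=%s"
    NS = '{http://where.yahooapis.com/v1/schema.rng}'

    def geoplanet_lookup(street, cross_street):
        query = urllib.parse.quote('%s and %s, New York, NY' % (street, cross_street))
        xml_data = urllib.request.urlopen(BASE_URL % (query, APP_ID)).read()
        root = ET.fromstring(xml_data)
        # Each matching place carries a centroid with latitude and longitude.
        centroid = root.find('.//' + NS + 'centroid')
        if centroid is None:
            return None
        return (float(centroid.find(NS + 'latitude').text),
                float(centroid.find(NS + 'longitude').text))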

Now that I had all the necessary information to complete the visualization, I created a nice clean CSV file that associated the traffic counts with the appropriate latitude and longitude. I was then able to easily read this into Processing in order to generate the necessary graphics.
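
That final join was only a few lines along these lines, with file names and column positions carried over from the earlier sketches (which column holds the count is assumed from the sample line above):

    import csv

    # Map each road ID to the coordinates returned by the geocoding step.
    coords = {}
    with open('intersections.csv') as f:
        for road_id, name, start, lat, lon in csv.reader(f):
            coords[road_id] = (lat, lon)

    # Join the consolidated hourly counts with those coordinates to produce
    # one flat file for Processing to read.
    with open('consolidated_counts.csv') as counts, \
         open('traffic_map.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for row in csv.reader(counts):
            if row[0] in coords:
                lat, lon = coords[row[0]]
                writer.writerow([row[0], row[1], row[-1], lat, lon])

Some of the raw output from Processing can be seen below: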

[Image: New York City Traffic Volume]

Even without placing the visualization over an existing street map, you can make out the rough outline of lower Manhattan fairly easily. We were later able to use this material to create animated and interactive versions of the data, as well as to visualize it in 3D.

For me, this project was tremendously rewarding. Most of the work I did was quite basic, but it was very empowering to be able to retrieve and integrate large amounts of data from the web in useful ways; it’s definitely a skill I’m looking forward to applying in other projects in the future. I won’t embarrass myself by posting the mess of spaghetti code I wrote to get this all done; there were certainly more elegant and concise approaches to the problem. Despite this, I was able to achieve everything I had hoped to do, and I managed to learn a lot of basic Python features in the process.


