Analysis of NYC taxi trip data
This repository provides Python code to analyze New York City taxi pickups from 2009 - present (currently through June 2016). A detailed description of the data is available here, as well as the actual data in CSV format. The directory should be set up as follows:
- Top Directory (called 'Taxi Analysis' in the code)
- Top Directory\plots
- Top Directory\data
Make sure to download the 'nyc_neighborhoods.json' file to the top folder and the 'jfk_weather.csv' to the data subfolder. These contain the GeoJSON data for New York City neighborhoods (polygon boundaries) and daily weather data for JFK airport over the timeframe of the taxi pickup data. The Python scripts should be run in the following order (make sure to update the top directory location path in each):
- data_download.py (Downloads the trip data and saves to compressed gzip format files in the data directory. Requires ~40GB of disk space.)
- find_nbhd_centroids_boundaries.py (Finds and saves the neighborhood centroids and borders from the NYC neighborhoods JSON file.)
- find_pickup_dropoff_nbhds.py (Finds the neighborhood and borough of each pickup and dropoff location in the full dataset by reverse geocoding the longitude/latitude coordinate pairs using the GeoJSON NYC neighborhoods file.)
- create_summary_data.py (Creates summarized datasets to use for descriptive plots and modeling.)
- nyc_taxi_analysis.py (Creates plots, data summaries, and fits a predictive model for pickup frequencies.)
The next plot shows the percent change in total annual number of pickups for 2015 compared to 2010, by neighborhood.
percent_change | nbhd | borough |
---|---|---|
16.2 | Coney Island | Brooklyn |
14.6 | Marble Hill | Manhattan |
13.9 | Pelham Bay | Bronx |
12.2 | Norwood | Bronx |
11.3 | Bedford-Stuyvesant | Brooklyn |
10.8 | Melrose | Bronx |
10.7 | Kew Gardens | Queens |
9.8 | Belmont | Bronx |
9.6 | Crown Heights | Brooklyn |
9.2 | Jackson Heights | Queens |
Shown next are the relative proportions of late night pickups by neighborhood (late night defined as the time frame from 10pm to 3am), with the top ten shown below.
relative_percent | nbhd | borough |
---|---|---|
52.9 | Bushwick | Brooklyn |
52.6 | Williamsburg | Brooklyn |
45.0 | Lower East Side | Manhattan |
44.6 | Greenpoint | Brooklyn |
41.5 | Nolita | Manhattan |
41.3 | Park Slope | Brooklyn |
41.3 | South Slope | Brooklyn |
41.0 | Prospect Heights | Brooklyn |
39.3 | Ridgewood | Queens |
37.3 | Crown Heights | Brooklyn |
It would be of interest to taxi dispatchers to be able to predict the frequency of pickups by location. The following shows the true frequency of taxi pickups (by latitude / longitude pair rounded to three digits), by hour from January 2009 - June 2016. It's quite beautiful, and you can clearly see the ebb and flow of airport pickups (particularly JFK).
Instead of attempting to predict pickups at each latitude / longitude location, pickup totals are aggregated by neighborhood and the goal is to predict neighborhood pickup totals at a particular hour and date. Data from 2014 is used to predict the number of pickups at every neighborhood for each hour and date in 2015. As a baseline measure, the most naive guess is to use the actual pickups in 2014 to predict the number of pickups in 2015 (by date, hour, and neighborhood). This provides an R^2 value of 91.0 and mean squared error of 32,438.5.
To compare, a random forest (RF) model is built to make predictions. This has the advantage over the naive approach as it can incorporate covariates and compared to other predictive models, can accomodate non-linear relationships between the predictors and outcome (pickup totals). For covariates, daily weather data from JFK airport is used, as well as hour, day of week, month, holiday indication, and neighborhood. A subset of the 2014 data is used to tune the RF parameters via a random search, and predicting on out of sample data from 2015. The tuning parameters which give the best performance are then used to fit a RF to the full 2014 data, to predict total pickups for each hour, neighborhood, and date in 2015. This gives an R^2 value of 95.0 and mean squared error of 14,250.8 (less than 50% of the naive approach). The actual versus predicted total daily pickups by neighborhood for a random date (May 7th, 2015) is shown below.
- Additional covariates can be added to improve the error of the model. A few that come to mind are population density (by neighborhood), more granular weather data (hourly), and some sort of seasonality measure.
- Other supervised learning models may provide better performance than RF. It may be worthwhile to do a comparison.
- Other evaluation metrics can be used, such as Mean Squared Logarithmic Error. However, this increases the penalty on deviations in neighborhoods with small pickup totals, which may not be an appropriate in this situation. For example, a prediction of 2 pickups for a neighborhood with 1 actual pickup gives approximately the same penalty as predicting 1500 pickups for a neighborhood with 1000 actual pickups.