nyc_taxi

Analysis of NYC taxi trip data

This repository provides Python code to analyze New York City taxi pickups from 2009 to the present (currently through June 2016). A detailed description of the data, along with the data itself in CSV format, is available here. The directory should be set up as follows:

Top Directory (called 'Taxi Analysis' in the code)
- Top Directory\plots
- Top Directory\data
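
As a minimal sketch, the layout above could be created with the standard library. The helper name `make_layout` is hypothetical and not part of the repository; the demo builds the folders in a temporary location, so substitute your own top directory path.

```python
import tempfile
from pathlib import Path


def make_layout(top_dir):
    """Create the top directory with its 'plots' and 'data' subfolders."""
    top = Path(top_dir)
    for sub in ("plots", "data"):
        # parents=True also creates the top directory if it is missing
        (top / sub).mkdir(parents=True, exist_ok=True)
    return top


# Demonstration in a temporary folder (hypothetical location)
with tempfile.TemporaryDirectory() as tmp:
    top = make_layout(Path(tmp) / "Taxi Analysis")
    created = sorted(p.name for p in top.iterdir())
```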

Make sure to download the 'nyc_neighborhoods.json' file to the top folder and the 'jfk_weather.csv' file to the data subfolder. These contain, respectively, the GeoJSON data for New York City neighborhoods (polygon boundaries) and daily weather data for JFK airport over the timeframe of the taxi pickup data. The Python scripts should be run in the following order (make sure to update the top directory location path in each):

1. data_download.py (Downloads the trip data and saves it to compressed gzip files in the data directory. Requires ~40GB of disk space.)
2. find_nbhd_centroids_boundaries.py (Finds and saves the neighborhood centroids and borders from the NYC neighborhoods JSON file.)
3. find_pickup_dropoff_nbhds.py (Finds the neighborhood and borough of each pickup and dropoff location in the full dataset by reverse geocoding the longitude/latitude coordinate pairs using the GeoJSON NYC neighborhoods file.)
4. create_summary_data.py (Creates summarized datasets to use for descriptive plots and modeling.)
5. nyc_taxi_analysis.py (Creates plots, data summaries, and fits a predictive model for pickup frequencies.)
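
The reverse geocoding step amounts to a point-in-polygon lookup against the neighborhood boundaries. The sketch below illustrates the idea with a plain ray-casting test; it is not the code in find_pickup_dropoff_nbhds.py, and the property names `neighborhood` and `borough` and the toy feature are assumptions made for illustration.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of [lon, lat] positions?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Does this edge straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that line
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def find_neighborhood(lon, lat, features):
    """Return the properties of the first GeoJSON feature containing the point."""
    for feature in features:
        geom = feature["geometry"]
        if geom["type"] == "Polygon":
            polygons = [geom["coordinates"]]
        elif geom["type"] == "MultiPolygon":
            polygons = geom["coordinates"]
        else:
            continue
        for rings in polygons:
            # rings[0] is the exterior boundary; any further rings are holes
            in_exterior = point_in_ring(lon, lat, rings[0])
            in_hole = any(point_in_ring(lon, lat, hole) for hole in rings[1:])
            if in_exterior and not in_hole:
                return feature["properties"]
    return None


# Toy feature (hypothetical, not taken from nyc_neighborhoods.json)
toy_features = [{
    "type": "Feature",
    "properties": {"neighborhood": "Toy Square", "borough": "Nowhere"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]]],
    },
}]

hit = find_neighborhood(1.0, 1.0, toy_features)
miss = find_neighborhood(3.0, 1.0, toy_features)
```

A linear scan over every polygon is slow for a dataset of this size; a spatial index or a dedicated geometry library would likely be needed in practice.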

Plots

The next plot shows the percent change in total annual number of pickups for 2015 compared to 2010, by neighborhood.

percent_change	nbhd	borough
16.2	Coney Island	Brooklyn
14.6	Marble Hill	Manhattan
13.9	Pelham Bay	Bronx
12.2	Norwood	Bronx
11.3	Bedford-Stuyvesant	Brooklyn
10.8	Melrose	Bronx
10.7	Kew Gardens	Queens
9.8	Belmont	Bronx
9.6	Crown Heights	Brooklyn
9.2	Jackson Heights	Queens
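
A percent change of this kind can be computed from two sets of annual totals. The following is a sketch only; the function name and the toy counts are hypothetical and do not come from the dataset.

```python
def rank_percent_change(counts_base, counts_new, top_n=10):
    """Rank neighborhoods by percent change between two years of pickup totals."""
    changes = {}
    for nbhd, base in counts_base.items():
        # Skip neighborhoods with no baseline pickups or no data in the new year
        if base > 0 and nbhd in counts_new:
            changes[nbhd] = 100.0 * (counts_new[nbhd] - base) / base
    return sorted(changes.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Toy annual totals (hypothetical values)
pickups_2010 = {"A": 1000, "B": 2000, "C": 500}
pickups_2015 = {"A": 1100, "B": 1900, "C": 600}

ranked = rank_percent_change(pickups_2010, pickups_2015)
```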

Shown next are the relative proportions of late night pickups by neighborhood (late night defined as the time frame from 10pm to 3am), with the top ten shown below.

relative_percent	nbhd	borough
52.9	Bushwick	Brooklyn
52.6	Williamsburg	Brooklyn
45.0	Lower East Side	Manhattan
44.6	Greenpoint	Brooklyn
41.5	Nolita	Manhattan
41.3	Park Slope	Brooklyn
41.3	South Slope	Brooklyn
41.0	Prospect Heights	Brooklyn
39.3	Ridgewood	Queens
37.3	Crown Heights	Brooklyn
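
One way to read "relative proportion" is the share of a neighborhood's pickups that fall in the late night window. The sketch below uses that reading, which is an assumption, along with hypothetical names and toy records.

```python
from collections import Counter

# 10pm up to, but not including, 3am
LATE_NIGHT_HOURS = {22, 23, 0, 1, 2}


def late_night_share(pickups):
    """pickups: iterable of (neighborhood, hour) pairs.

    Returns the percent of each neighborhood's pickups made late at night.
    """
    total = Counter()
    late = Counter()
    for nbhd, hour in pickups:
        total[nbhd] += 1
        if hour in LATE_NIGHT_HOURS:
            late[nbhd] += 1
    return {nbhd: 100.0 * late[nbhd] / total[nbhd] for nbhd in total}


# Toy records (hypothetical)
sample = [
    ("A", 23), ("A", 1), ("A", 14), ("A", 9),
    ("B", 3), ("B", 12), ("B", 22), ("B", 8),
]
shares = late_night_share(sample)
```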

Predicting Taxi Pickup Frequencies by Neighborhood Using Random Forests

Taxi dispatchers would benefit from being able to predict the frequency of pickups by location. The following shows the true frequency of taxi pickups (by latitude / longitude pair rounded to three decimal places), by hour from January 2009 - June 2016. It's quite beautiful, and you can clearly see the ebb and flow of airport pickups (particularly JFK).
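
Rounding coordinates to three decimal places groups pickups into cells of roughly 0.001 degrees, which is about 111 meters in the north-south direction. A minimal sketch of the counting step, with a hypothetical function name and toy records:

```python
from collections import Counter


def bin_pickups(records, digits=3):
    """records: iterable of (lat, lon, hour) tuples.

    Counts pickups per (rounded lat, rounded lon, hour) cell.
    """
    counts = Counter()
    for lat, lon, hour in records:
        counts[(round(lat, digits), round(lon, digits), hour)] += 1
    return counts


# Toy records (hypothetical coordinates)
records = [
    (40.64131, -73.77814, 8),
    (40.64149, -73.77786, 8),
    (40.75801, -73.98552, 8),
]
counts = bin_pickups(records)
```

The first two records land in the same cell because they agree to three decimal places.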

Instead of attempting to predict pickups at each latitude / longitude location, pickup totals are aggregated by neighborhood, and the goal is to predict neighborhood pickup totals at a particular hour and date. Data from 2014 is used to predict the number of pickups in every neighborhood for each hour and date in 2015. As a baseline, the most naive guess is to use the actual pickups in 2014 as the prediction for 2015 (matched by date, hour, and neighborhood). This gives an R^2 value of 91.0% and a mean squared error of 32,438.5.
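
The baseline can be sketched as follows. The key layout (month, day, hour, neighborhood), the function names, and the toy totals are assumptions, and R^2 is taken here to be the usual coefficient of determination, which the text does not state explicitly.

```python
def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def naive_baseline(prev_year, this_year):
    """Use last year's total for the same key as this year's prediction."""
    keys = sorted(k for k in this_year if k in prev_year)
    actual = [this_year[k] for k in keys]
    predicted = [prev_year[k] for k in keys]
    return actual, predicted


# Toy totals keyed by (month, day, hour, neighborhood); hypothetical values
pickups_2014 = {(1, 1, 0, "A"): 10, (1, 1, 1, "A"): 20,
                (1, 1, 0, "B"): 30, (1, 1, 1, "B"): 40}
pickups_2015 = {(1, 1, 0, "A"): 12, (1, 1, 1, "A"): 18,
                (1, 1, 0, "B"): 33, (1, 1, 1, "B"): 37}

actual, predicted = naive_baseline(pickups_2014, pickups_2015)
baseline_mse = mse(actual, predicted)
baseline_r2 = r_squared(actual, predicted)
```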

To compare, a random forest (RF) model is built to make predictions. Unlike the naive approach, it can incorporate covariates, and compared to some other predictive models, it can accommodate non-linear relationships between the predictors and the outcome (pickup totals). The covariates are daily weather data from JFK airport, hour, day of week, month, a holiday indicator, and neighborhood. The RF parameters are tuned via a random search: models are fit on a subset of the 2014 data and evaluated on out-of-sample data from 2015. The tuning parameters that give the best performance are then used to fit an RF to the full 2014 data, which predicts total pickups for each hour, neighborhood, and date in 2015. This gives an R^2 value of 95.0% and a mean squared error of 14,250.8 (less than half that of the naive approach). The actual versus predicted total daily pickups by neighborhood for a random date (May 7th, 2015) are shown below.
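
A random search over RF parameters might look like the sketch below, using scikit-learn. It is not the repository's code: the data is synthetic, the parameter grid is made up, the neighborhood is encoded as a plain integer for brevity, and cross-validation stands in for the out-of-sample 2015 evaluation described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the hourly neighborhood data (hypothetical)
rng = np.random.default_rng(0)
n = 400
hour = rng.integers(0, 24, n)
day_of_week = rng.integers(0, 7, n)
nbhd_id = rng.integers(0, 5, n)
temperature = rng.normal(15, 8, n)
X = np.column_stack([hour, day_of_week, nbhd_id, temperature])
y = 50 * (nbhd_id + 1) + 20 * np.sin(2 * np.pi * hour / 24) + rng.normal(0, 5, n)

# Candidate values to sample from (assumed, not the author's grid)
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": [1.0, "sqrt"],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                          # number of random parameter draws
    cv=3,                              # swapped in for the 2015 holdout
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

# Refit on all training data with the best parameters found
final_model = RandomForestRegressor(random_state=0, **search.best_params_)
final_model.fit(X, y)
predictions = final_model.predict(X[:5])
```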

Improvements / Thoughts

Additional covariates can be added to reduce the error of the model. A few that come to mind are population density (by neighborhood), more granular weather data (hourly), and some sort of seasonality measure.
Other supervised learning models may provide better performance than RF. It may be worthwhile to do a comparison.
Other evaluation metrics can be used, such as Mean Squared Logarithmic Error. However, this increases the penalty on deviations in neighborhoods with small pickup totals, which may not be appropriate in this situation. For example, a prediction of 2 pickups for a neighborhood with 1 actual pickup gives approximately the same penalty as predicting 1500 pickups for a neighborhood with 1000 actual pickups.
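
The example in the last point can be checked directly. The sketch below assumes the common definition of the squared logarithmic error, (log(1 + predicted) - log(1 + actual))^2; the function name is hypothetical.

```python
import math


def squared_log_error(actual, predicted):
    """Per-observation term of the Mean Squared Logarithmic Error."""
    return (math.log1p(predicted) - math.log1p(actual)) ** 2


small = squared_log_error(1, 2)         # about 0.164
large = squared_log_error(1000, 1500)   # about 0.164

# For contrast, the plain squared errors differ by orders of magnitude
small_sq = (2 - 1) ** 2                 # 1
large_sq = (1500 - 1000) ** 2           # 250000
```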
