- Overview
- The Data
- The Data Wrangling Process
- Amazon AWS
- Data Pipeline
- Graph Class
- The Statistical Analysis Process
- EDA
- Mesa Verde National Park
- O'Haver Lake
- Conclusion
- Sources
Every year, recreation.gov records every reservation made on its site. These records contain information about each campsite or tour and some information about each customer; specifically, each customer's state and ZIP code are included.
Goals
To attack the question of who these customers are, I looked at two campsites in Colorado: Mesa Verde National Park and O'Haver Lake. I took more than 1,300 census variables and subdivided them into clusters:
- Economics
  - Industry
  - Commute
  - Income
  - Health Insurance
- Social
  - Internet
  - Language
  - Education
  - Veteran Status
- Demographics
  - Age
  - Gender
  - Race
For each cluster, I queried the census.gov API across all customer ZIP codes for each site. For each variable, I graphed my sample distribution against the national average. For select variables, I conducted a hypothesis test to gain insight into the customer base for each site.
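For illustration, here is a minimal sketch of the kind of per-ZIP query involved, assuming the ACS 5-year endpoint and querying by ZIP Code Tabulation Area; the variable codes shown are examples, not the project's exact variable list:

```python
import requests

# Example ACS 5-year variable codes for an "Age"-style cluster (illustrative only).
AGE_CLUSTER = {
    "B01002_001E": "median_age",
    "B01001_002E": "male_population",
    "B01001_026E": "female_population",
}

def query_census_for_zcta(zcta, variables, year=2019, api_key=None):
    """Query the ACS 5-year endpoint for one ZIP Code Tabulation Area."""
    url = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {
        "get": ",".join(variables),
        "for": f"zip code tabulation area:{zcta}",
    }
    if api_key:
        params["key"] = api_key
    resp = requests.get(url, params=params)
    resp.raise_for_status()
    header, *rows = resp.json()          # first row of the response is the header
    return dict(zip(header, rows[0]))    # one ZCTA -> one row of values

# Example usage for a Denver ZIP code:
# query_census_for_zcta("80202", AGE_CLUSTER.keys())
```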
I started with data in reservations.csv, which contained more than 1 million rows and 57 features. I worked to tie in census data and get the data into a more digestible form for each census cluster.
I started with a little more than 1 million rows of reservation data from https://ridb.recreation.gov/download. I used Docker and Sparkbook to gather the columns I wanted to use (site name, site location, and customer location) for sites in Colorado and customers in the United States. Next, I cleaned the ZIP codes and found a corresponding lat/long point for each ZIP code, which would allow me to query the Census API.
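A minimal sketch of that selection and cleaning step with PySpark; the column names (`FacilityName`, `FacilityState`, `CustomerState`, `CustomerZIP`) are assumptions for illustration and may not match the actual reservations.csv headers:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("reservation-wrangling").getOrCreate()

# Load the raw reservation data (1M+ rows, 57 columns).
reservations = spark.read.csv("reservations.csv", header=True, inferSchema=True)

# Keep only the columns of interest: site name/location and customer location.
subset = (
    reservations
    .select("FacilityName", "FacilityState", "CustomerState", "CustomerZIP")
    .filter(F.col("FacilityState") == "CO")           # sites in Colorado
    .filter(F.col("CustomerState").isNotNull())       # customers in the United States
)

# Clean ZIP codes: keep the 5-digit prefix and drop malformed entries.
cleaned = (
    subset
    .withColumn("zip5", F.substring(F.col("CustomerZIP").cast("string"), 1, 5))
    .filter(F.col("zip5").rlike(r"^\d{5}$"))
)

cleaned.write.csv("colorado_reservations", header=True, mode="overwrite")
```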
I pushed this new dataset to Amazon S3 so that I could access the data on an Amazon EC2 instance and quickly query the API for each site and each census variable cluster. I opened an Amazon EC2 instance with far more memory and CPUs than my own computer, installed Docker and Sparkbook on the cloud, and mapped my localhost:8881 to my instance's localhost:8881. With faster computation times, more census variables could be accessed across more ZIP codes in a realistic time frame. I subdivided the Colorado sites into Mesa Verde National Park and O'Haver Lake and collected data across census clusters. I pushed the new data to GitHub, pulled it into visualization.py, and graphed it using my Grapher class.
This project could not have happened without AWS. Early in my dataset creation process, I was paralyzed by slow computation when querying the Census API for multiple sites. AWS came to the rescue, providing powerful tools for producing census data for multiple sites. I used a c4.8xlarge EC2 instance, which provided me with 36 CPUs and 60 GiB of memory. On the instance, I installed Docker and Apache Spark, which allowed me to maneuver my data.
AWS drastically sped up the dataset generation portion of my project and allowed more time for analysis. It also created a great workflow: I would generate data on the cloud and visualize/statistically analyze on my local machine, while syncing up both sides with GitHub.
I used a functional programming paradigm for my data pipeline. I would sequentially add a census tract for each customer ZIP, add census data for a cluster, and finally export the data to CSV and Amazon S3. The "add census data for a cluster" step was a bit layered: in this step I would call a "call api" script which, for each census variable in the cluster, would call another function that actually made the call to the Census API.
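A rough sketch of that chain, assuming hypothetical function names that mirror the description above (the real tract lookup and API call are stubbed out here):

```python
from functools import reduce

def add_census_tract(rows):
    """Attach a census tract to each customer row (placeholder lookup)."""
    return rows.map(lambda r: {**r, "tract": r["zip5"][:3]})

def fetch_variable(row, variable):
    """Innermost layer: one Census API call per row per variable (stubbed here)."""
    return None

def call_api(row, cluster_variables):
    """Middle layer: loop over every variable in the cluster for one row."""
    return {**row, **{v: fetch_variable(row, v) for v in cluster_variables}}

def make_add_cluster(cluster_variables):
    """Outer layer: map the per-row API call over the whole RDD."""
    return lambda rows: rows.map(lambda r: call_api(r, cluster_variables))

def compose(*steps):
    """Run each pipeline step in order: rows -> step1 -> step2 -> ..."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)

pipeline = compose(add_census_tract, make_add_cluster(["B01002_001E", "B19013_001E"]))
# cluster_rdd = pipeline(raw_rdd)   # raw_rdd: an RDD of customer row-dicts
# The final step collected the rows and wrote them to CSV / Amazon S3.
```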
Each census cluster dataset was created using Apache Spark. I chose this approach because of a couple of important characteristics of RDDs (Resilient Distributed Datasets). Large Census API calls were made on each row, and a single failed call could have compromised my entire dataset. Because RDDs are immutable, I could fix errors without ruining the source data, and because transformations are lazily evaluated, I could correct mistakes in my API-calling functions before they actually ran.
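To make the lazy-evaluation point concrete, here is a small sketch (the stub `call_api` stands in for the real per-ZIP request): the `map` below only records the transformation, and nothing touches the Census API until an action such as `collect()` runs, so a buggy function can be fixed and re-mapped before any rows are processed.

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

zip_rdd = sc.parallelize(["80202", "81301", "81211"])   # a few example ZIP codes

def call_api(zcta):
    """Stand-in for the real per-ZIP Census API call."""
    return {"zip": zcta, "median_age": None}

# Transformation only: this line is lazy and makes no API calls yet.
pending = zip_rdd.map(call_api)

# zip_rdd itself is immutable, so a bug in call_api never corrupts the source data;
# the work only happens when an action is triggered.
results = pending.collect()
```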
Once the data creation process finished, I needed a systematic way to statistically analyze my sample population against the national average.
The Grapher class takes in a site's sample population for a cluster, the cluster name, and graph characteristics, and can provide the user with a histogram of census variables, a boxplot of census variables, or a hypothesis test for a single census variable. The class ties in national averages for census variables to plot on histograms and national distributions for census variables to plot a hypothesis test. This class provides quick statistical analysis, and I predict it will be reusable for more EDAs.
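A rough skeleton of what such a class might look like; the method names, signatures, and the choice of a one-sample t-test are assumptions based on the description above, not the exact interface in visualization.py:

```python
import matplotlib.pyplot as plt
from scipy import stats

class Grapher:
    """Compare a site's sample population for one census cluster to national figures."""

    def __init__(self, sample_df, cluster_name, national_means):
        self.sample_df = sample_df            # pandas DataFrame: rows = ZIPs, cols = variables
        self.cluster_name = cluster_name
        self.national_means = national_means  # dict: variable -> national average

    def histogram(self, variable):
        """Sample distribution for one variable, with the national average marked."""
        fig, ax = plt.subplots()
        ax.hist(self.sample_df[variable].dropna(), bins=30)
        ax.axvline(self.national_means[variable], linestyle="--", label="national average")
        ax.set_title(f"{self.cluster_name}: {variable}")
        ax.legend()
        return ax

    def boxplot(self):
        """Boxplot across every variable in the cluster."""
        fig, ax = plt.subplots()
        self.sample_df.boxplot(ax=ax)
        ax.set_title(self.cluster_name)
        return ax

    def hypothesis_test(self, variable):
        """One-sample t-test of the sample mean against the national average."""
        sample = self.sample_df[variable].dropna()
        return stats.ttest_1samp(sample, self.national_means[variable])
```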
To begin, I would look at a boxplot characterizing the sample population for a given census cluster. Here is age for example:
This gives us some insight into the sample population: the median, 25th and 75th percentiles, max, min, and outliers. However, it does not tell us how significant any of this is versus the national average.
This graph is messy and tough to analyze for any given age, so we run hypothesis tests on specific ages.
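For instance, testing whether the share of customers in one age bracket differs from the national share could look like the following one-sample proportion z-test; the counts and the 14% national share are placeholders, not the project's actual numbers:

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder numbers: 180 of 1,000 sampled customers fall in a given age bracket,
# tested against a hypothetical national share of 14% for that bracket.
count, nobs, national_share = 180, 1000, 0.14

z_stat, p_value = proportions_ztest(count, nobs, value=national_share)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")   # a small p suggests the sample differs
```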
(Figures: Customer Distance Traveled and Total Reservations for each candidate campsite)
In deciding which campsites to investigate further, I considered a few factors. First, I wanted a campsite with a diverse clientele in terms of distance from the campsite: clients from all over the country. Data generation also had to be kept in mind, so I also wanted a campsite with fewer reservations than normal, keeping the Census API queries manageable while still leaving a good amount of data to analyze. I settled on Mesa Verde National Park and O'Haver Lake.
Mesa Verde had over 1,000 reservations, with customers from all over the country.
Overall, it was difficult to come to any conclusions based on the census data of Mesa Verde's clientele...
I did have some questions:
(Figures: Walk and Public Transportation commute shares)
O'Haver Lake had around 200 reservations, with customers mainly from Colorado.
Maybe we can make some conclusions...
(Figures: Gender and Race)
It appears that the people who are visiting O'Haver Lake are whiter and more male than the national average.
But what we can do with this is limited. Is there something about this campsite that attracts white men, or is it simply that the campsite attracts Coloradans, whose demographics may differ from the national average? Remember, most of its visitors are from Colorado.
Overall, it is hard to come to many, if any, conclusions from the analysis of Mesa Verde National Park and O'Haver Lake. Most of my sample distributions for different census variables were in line with national averages. The ones that were different have to be treated with scrutiny: it is very easy to conflate different variables and to confuse correlation with causation in this type of analysis. Just look at O'Haver Lake.
However, I was able to accomplish my second goal. I developed a pipeline that creates a new dataset from reservations.csv and the Census API. This type of pipeline will be very useful for other projects, especially ones that draw from the Census API. The pipeline could also be useful to recreation.gov, given more specific data about their clientele, more specific inquiries, and a campsite in question. With my pipeline and a campsite name, you are only a couple of clicks (or ENTERs) away from a wealth of demographic information.
https://www.census.gov/data/developers/data-sets/acs-migration-flows.html