forked from lmullen/dh-r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathshapefiles.Rmd
229 lines (168 loc) · 14.4 KB
/
shapefiles.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
---
title: "Working with Shapefiles"
---
There are several formats in which geographic data (especially data for boundaries) might come, but the most common format in the [shapefile](http://en.wikipedia.org/wiki/Shapefile). The format of shapefiles is controlled by Esri, a corporation that makes ArcGIS. The format, however, is widely used. Shapefiles can contain coordinates in latitude and longitude, but they can also contain projected coordinates in some other coordinate reference system which translates the coordinates on a three-dimensional globe to coordinates on a two-dimensional representation. For example, the coordinates of New Haven, Connecticut, in latitude and longitude are `41.3100° N, 72.9236° W`. In the popular coordinate reference system used by Google Maps and most other web mapping services, called Google Mercator, those coordinates would be represented as `-8117860.8230976425 5058258.564581587` (notice that longitude comes first, then latitude, and that the unit of measurement is not degrees but meters). While R can handle converting between coordinate reference systems using the [rgdal](http://cran.rstudio.org/web/packages/rgdal/) package, it is often easier to handle the conversion outside of R. Furthermore, shapefiles can often be enormous---in the hundreds of megabytes---because they contain much more detail about borders or coastlines than will appear in a an actual map. Simplifying these shapefiles is often a task better done outside of R, though many of the tasks can be done inside R as well. This chapter will explain how to reproject and simplify shapefiles gathered from the NHGIS website for use with [ggplot2](http://cran.rstudio.org/web/packages/ggplot2/). This pattern should be generalizable for most shapefiles.
## Installing external programs
We will use two external programs to work with shapefiles. The first is `ogr*` family of programs in [GDAL/OGR](http://www.gdal.org/). The second is [mapshaper](https://github.com/mbloch/mapshaper), which depends on Node.js.
### Installing on Mac
To install these programs on a Mac, run the following commands (assuming you use Homebrew).
```
brew update
brew install gdal
brew install node
```
You can then install mapshaper globally by running the following commands.
```
npm update
npm install -g mapshaper
```
### Installing on Ubuntu
To install these programs on Ubuntu, run the following commands.
```
apt-get update
apt-get install gdal-bin libgdal-dev libproj-dev
apt-get install nodejs
```
Then run the following command to install mapshaper.
```
npm update
npm install -g mapshaper
```
## Getting information about a shapefile
You can download a shapefile from the [NHGIS](https://www.nhgis.org/). For this example, we will use the shapefile of U.S. counties in 1850.
When you unzip the directory which contains the shapefile, you will notice that it actually contains several files. The file that ends `.shp` is the shapefile itself, but the other files are important too. The file that ends `.prj` contains the projection information for the shapefile, and the file that ends `.dbf` contains the attributes associated with the polygons or points or lines in the shapefile. (R will let you read in this DBF file on its own if you just want that data.)
![A shapefile](screenshots/shapefile.png)
We can see what kind of information is available in the shapefile using the command `ogrinfo`with some flags that tell the command to give us all the information in the file. Assuming you are in your terminal in the same directory as the shapefile, you can run the following command.
```
ogrinfo -so -al US_county_1850.shp
```
This command will return a lot of information:
```
INFO: Open of `US_county_1850.shp'
using driver `ESRI Shapefile' successful.
Layer name: US_county_1850
Geometry: Polygon
Feature Count: 1632
Extent: (-2356113.743199, -1337508.077280) - (2258224.796357, 1565781.659379)
Layer SRS WKT:
PROJCS["USA_Contiguous_Albers_Equal_Area_Conic",
GEOGCS["GCS_North_American_1983",
DATUM["North_American_Datum_1983",
SPHEROID["GRS_1980",6378137.0,298.257222101]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.0174532925199433]],
PROJECTION["Albers_Conic_Equal_Area"],
PARAMETER["False_Easting",0.0],
PARAMETER["False_Northing",0.0],
PARAMETER["longitude_of_center",-96.0],
PARAMETER["Standard_Parallel_1",29.5],
PARAMETER["Standard_Parallel_2",45.5],
PARAMETER["latitude_of_center",37.5],
UNIT["Meter",1.0]]
DECADE: String (4.0)
NHGISNAM: String (50.0)
NHGISST: String (3.0)
NHGISCTY: String (4.0)
ICPSRST: String (3.0)
ICPSRCTY: String (4.0)
ICPSRNAM: String (35.0)
STATENAM: String (25.0)
ICPSRSTI: Integer (10.0)
ICPSRCTYI: Integer (10.0)
ICPSRFIP: Real (17.5)
STATE: String (3.0)
COUNTY: String (4.0)
PID: Real (19.8)
X_CENTROID: Real (19.8)
Y_CENTROID: Real (19.8)
GISJOIN: String (8.0)
GISJOIN2: String (7.0)
SHAPE_AREA: Real (19.11)
SHAPE_LEN: Real (19.11)
```
This output tells us that the file contains polygons instead of points or lines (`Geometry: Polygon`) and that there are 1,632 polygons in the file (`Feature Count: 1632`). We can also see that the file is projected in using an [Albers equal area projection](http://en.wikipedia.org/wiki/Albers_projection). That is a fine choice for a projection, but it will make working with the shapefile in R more difficult. The output also tells us the variables associated with each polygon, such as `STATE` and `COUNTY`. In this file the information is mostly identification of the polygons, but other shapefiles will contain actual data. Note that we could obtain the same information in R by using the [rgdal](http://cran.rstudio.org/web/packages/rgdal/) wrapper around the `ogrinfo` command. The `ogrInfo()` function takes two arguments: the path to the directory with the shapefile, and the name of the shapefile without the `.shp` extension.
```{r}
library(rgdal)
ogrInfo("data/county-shapefile/", "US_county_1850")
```
## Reprojecting a shapefile
Next we want to reproject that shapefile into something more useable in R. There are many coordinate reference systems (CRS) or [spatial reference systems](http://en.wikipedia.org/wiki/Spatial_reference_system) (SRS), which are formal ways of describing the mathematics of a projection. These CRSes are usually defined either by a [Well-Known Text](http://en.wikipedia.org/wiki/Well-known_text) (WKT) or by a [Proj4](http://en.wikipedia.org/wiki/PROJ.4) string. Notice that the output of the `ogrinfo` command line tool gave us the WKT representation:
```
Layer SRS WKT:
PROJCS["USA_Contiguous_Albers_Equal_Area_Conic",
GEOGCS["GCS_North_American_1983",
DATUM["North_American_Datum_1983",
SPHEROID["GRS_1980",6378137.0,298.257222101]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.0174532925199433]],
PROJECTION["Albers_Conic_Equal_Area"],
PARAMETER["False_Easting",0.0],
PARAMETER["False_Northing",0.0],
PARAMETER["longitude_of_center",-96.0],
PARAMETER["Standard_Parallel_1",29.5],
PARAMETER["Standard_Parallel_2",45.5],
PARAMETER["latitude_of_center",37.5],
UNIT["Meter",1.0]]
```
But the `ogrInfo()` R function gave us the Proj4 string:
```
+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
```
These are essentially interchangable, though sometimes functions or programs will expect one rather than the other. There are formal systems provided by ESRI, EPSG, and other organizations which provide identifying numbers to these projections. The best guide to these coordinate reference systems is [Spatial Reference](http://spatialreference.org/), which will list the common names, authority names and numbers, well-known texts, Proj4 strings, and other modes of representing projections. Looking up the "[USA Contiguous Albers Equal Area Conic](http://spatialreference.org/ref/esri/usa-contiguous-albers-equal-area-conic/)" projection, we can find various information about the projection.^[The EPSG codes and their associated Proj4 strings are available as a data frame via the `make_EPSG()` function in rgdal.]
![Spatial reference](screenshots/spatial-reference.png)
We want to convert the shapefile to [EPSG 4326](http://spatialreference.org/ref/epsg/4326/), a system which is useful because it is unprojected. That is, is will represents data points in latitude and longitude. We can do this at the command line using the program `ogr2ogr`. We will specify the SRS that we want to convert to (`-t_srs`) and the name of input and output files. Notice that the output file (here called `US_county_1850_epsg4326.shp`) comes before the name of the input file.
```
ogr2ogr -t_srs EPSG:4326 US_county_1850_epsg4326.shp US_county_1850.shp
```
We could also use [rgdal](http://cran.rstudio.org/web/packages/rgdal/) to reproject the file inside R. First we would have to load the shapefile using the function `readOGR()`. We can find its Proj4 string with the function `proj4string()`.
```{r}
sp <- readOGR("data/county-shapefile/", "US_county_1850")
proj4string(sp)
```
We can define a new projection using the `CRS()` function, which takes as its argument a Proj4 string. This new projection can be used as an argument in the `spTransform()` function, which does the reprojection. From Spatial Reference we can find that the Proj4 string we want is `+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs`.
```{r}
sp_epsg4326 <- spTransform(sp, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"))
proj4string(sp_epsg4326)
```
If we wished, we could then save the reprojected shapefile to disk using `writeOGR()` function.
## Simplifying shapefiles
The real problem with using these shapefiles in R is their size. The 1850 counties shapefile is about 55 MB. While that is not extremely large in terms of disk space, it is very large in terms of the complexity of the shapes that it represents. The shapefile is a vector format, meaning it is a mathematical description of the curves involved rather than a pixel by pixel grid as in a photograph. That means that the file is very space efficient. Consider the needlessly complex, fractal-like level of detail in the coastline of the Chesapeake Bay as contained in this shapefile. (The square in the lower left is the District of Columbia.)
![The Chesapeake in an unsimplified shapefile](screenshots/chesapeake.png)
At almost any size map that we make, that level of detail will be invisible to the user. But it comes at a heavy cost in terms of processing. A plot of the shapefile using the base R function `plot()` will take several minutes on even a fast computer; it will probably never plot at all in ggplot2. It is possible to use various algorithms such as the Douglas-Peucker or Visvalingam algorithm (admirably [explained](http://bost.ocks.org/mike/simplify/) by Mike Bostock) to simplify those lines and remove unnecessary polygons, such as small islands.
The [MapShaper](http://www.mapshaper.org/) website will let you upload a shapefile (or GeoJSON or TopoJSON file) and perform such simplification interactively. Keeping only 2% of the information in the file, we can create a much simpler geometry without losing any information that will be visible the user. This will make working the shapefile in R much faster.
![Mapshaper](screenshots/mapshaper.png)
For simplifying shapefiles one at a time, the mapshaper website is often sufficient. But whenever you have more than one shapefile to simplify or wish to make your work reproducible, it is better to use the mapshaper command line tool. (See the chapter on [reproducible research](reproducible.html).) We will use that command on our shapefile that we reprojected to EPSG:4326. The command takes an input file, a percentage by which to simplify the file, and an output file. It is often good to use the `auto-snap` option, which aligns the points to a grid to make sure that polygons continue to overlap with one another. The `-force` flag tells mapshaper to overwrite the output file if it already exists.^[See the other options at the [mapshaper wiki](https://github.com/mbloch/mapshaper/wiki).]
```
mapshaper US_county_1850_epsg4326.shp auto-snap -simplify 2.5% -o force US_county_1850_simplified.shp
```
The resulting file is much smaller, about 1.8 megabytes instead of 52 megabytes. We can now plot it quite easily after loading in the new shapefile.
```{r plot-simplified-shapefile}
sp_simplified <- readOGR("data/county-shapefile", "US_county_1850_simplified")
plot(sp_simplified)
```
Especially when fortifying shapefiles to use them with ggplot2, as explained in the [chapter on mapping](mapping.html), a simplified shapefile make complain about missing holes or rings. This is usually a sign that you have simplified the shapefile too much and lost some essential information. You should resimplify the original shapefile keeping more of the information. This usually requires some trial and error.
It is possible to use the R wrappers around GDAL/OGR to simplify shapefiles, but I do not recommend that you use them. Mapshaper is a far superior implementation.
## A Makefile for reprojecting a simplifying shapefiles
The following Makefile will take a directory full of shapefiles, convert them to EPSG:4326 and simplify them to a user-specified percentage. This Makefile can be run in parallel to convert many shapefiles at the same time. (For an explanation of Make, see the chapter on [reproducible research](reproducible.html).) The Makefile assumes that within the project directory, the original shapefiles are stored in the `shp/` directory, and that the final shapefiles will be created in the `shp_out/` directory.
```
# A Makefile to re-project and simplify shapefiles.
SIMPLIFY_PERCENTAGE := 2.5
REPROJECTION_CODE := EPSG:4326
SIMPLIFIED := $(patsubst shp/%.shp, shp_out/%.shp, $(wildcard shp/*.shp))
REPROJECTED := $(patsubst shp/%.shp, temp/%.shp, $(wildcard shp/*.shp))
all : $(SIMPLIFIED)
.SECONDARY : $(REPROJECTED)
shp_out/%.shp : temp/%.shp
@echo "Simplifying $*"
mkdir -p shp_out
mapshaper $^ auto-snap -simplify $(SIMPLIFY_PERCENTAGE)% -o force $@
temp/%.shp : shp/%.shp
@echo "Reprojecting $*"
mkdir -p temp
ogr2ogr -t_srs $(REPROJECTION_CODE) $@ $^
clean :
rm -f temp/*
clobber : clean
rm -f shp_out/*
```
You can modify the options at the top of the file as necessary. Running the command `make all` will re-project and simplify all the shapefiles. You can run this task in parallel with the command `make all -j 8`, where the number is the number of parallel processes that you want. The command `make clobber` will remove all output files.