TIGER Geocoder
2004/10/28
A plpgsql based geocoder written for TIGER census data.
Design:
There are two components to the geocoder, the address normalizer and the
address geocoder. These two components are described separately below.
The goal of this project is to build a fully functional geocoder that can
process an arbitrary address string and, using normalized TIGER census data,
produce a point geometry reflecting the location of the given address.
- The geocoder should be simple for anyone familiar with PostGIS to install
and use.
- It should be robust enough to function properly despite formatting and
spelling errors.
- It should be extensible enough to be used with future data updates, or
alternate data sources with a minimum of coding changes.
Installation:
Refer to the INSTALL file for installation instructions.
Usage:
refcursor geocode(refcursor, 'address string');
Notes:
- The assumed format for the address is the US Postal Service standard:
() indicates a field required by the geocoder, [] indicates an optional field.
(address) [dirPrefix] (streetName) [streetType] [dirSuffix]
[internalAddress] [location] [state] [zipCode]
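The field layout above can be pictured as a simple record. The Python sketch
below is only an illustration and is not part of the geocoder: the
NormalizedAddress class, its field names and the to_string helper are
assumptions that mirror the field labels used in this README, and the example
record is hand-built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NormalizedAddress:
    """Illustrative container for the fields named in the format above."""
    address: int                             # required: house number
    street_name: str                         # required
    dir_prefix: Optional[str] = None         # optional, e.g. 'E'
    street_type: Optional[str] = None        # optional, e.g. 'Dr'
    dir_suffix: Optional[str] = None         # optional
    internal_address: Optional[str] = None   # optional
    location: Optional[str] = None           # optional
    state: Optional[str] = None              # optional
    zip_code: Optional[str] = None           # optional


def to_string(a: NormalizedAddress) -> str:
    """Join the fields that are present, in the order of the format above."""
    parts = [str(a.address), a.dir_prefix, a.street_name, a.street_type,
             a.dir_suffix, a.internal_address, a.location, a.state,
             a.zip_code]
    return " ".join(p for p in parts if p)


# Hand-built example record (not geocoder output).
example = NormalizedAddress(address=2554, dir_prefix="E",
                            street_name="Highland", street_type="Dr",
                            location="Seattle", state="WA")
print(to_string(example))
```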
Address Normalizer:
The goal of the address normalizer is to provide a robust function to break a
given address string down into the components of an address. While the
normalizer is built specifically for the normalized US TIGER Census data, it
has been designed to be reasonably extensible to other data sets and localities.
Usage:
normalize_address('address string');
Support functions:
location_extract_countysub_exact('partial address string', 'state abbreviation')
location_extract_countysub_fuzzy('partial address string', 'state abbreviation')
location_extract_place_exact('partial address string', 'state abbreviation')
location_extract_place_fuzzy('partial address string', 'state abbreviation')
cull_null('string')
count_words('string')
get_last_words('string')
state_extract('partial address string')
levenshtein_ignore_case('string', 'string')
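Of the support functions listed, levenshtein_ignore_case has the most
self-explanatory name: presumably an edit distance computed without regard to
letter case. The Python sketch below shows that idea. It illustrates the
technique only and is not the plpgsql implementation shipped with the geocoder.

```python
def levenshtein_ignore_case(a: str, b: str) -> int:
    """Edit distance between a and b, compared case-insensitively."""
    a, b = a.lower(), b.lower()
    # Row 0: distance from the empty prefix of a to each prefix of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Inside the loop, previous is the row for the first i-1 chars of a.
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein_ignore_case("Seatel", "SEATTLE"))
```

A small distance relative to the word length suggests a likely misspelling,
which is the kind of signal a fuzzy location match can use.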
Notes:
- A set of lookup tables, listed below, is used to provide street type,
secondary unit and direction abbreviation standards for a given set
of data. These are provided with the geocoder, but will need to be
customized for the data used.
direction_lookup
secondary_unit_lookup
street_type_lookup
- Additional lookup tables are required to perform matching for state
and location extraction. The state lookup is derived from the
US Postal Service standards, while the place and county subdivision
lookups are generated from the dataset. The creation statements for
the place and countysub tables are given in the INSTALL file.
state_lookup
place_lookup
countysub_lookup
- The use of lookup tables is intended to provide a versatile way of applying
the normalizer to data sets and localities other than the US Census TIGER
data. However, because matching-based extraction is needed when address
strings are poorly formatted or incomplete, assumptions are made about
the data available, most notably the division of locations into place and
county subdivision. For data sets without exactly two logical divisions in
location precision, code changes will be required.
- The normalizer will perform better the more information is provided.
- The process for normalization is roughly as follows:
    Extract the address from the beginning.
    Extract the zipCode from the end.
    Extract the state, using a fuzzy search if exact matching fails.
    Attempt to extract the location by parsing the punctuation
        of the address.
    Find and remove any internal address.
    If internal address was found:
      Set location as everything between internal address and state.
    Extract the street type from the string.
    If multiple potential street types are found:
      If internal address was found:
        Extract the last street type that precedes the internal address.
      Else:
        Extract the last street type.
    If street type was found:
      If a word beginning with a number follows the street type:
        This indicates the street type is part of the street name,
            eg. 'State Hwy 92a'.
        Set street type to NULL.
      Else if location not yet found:
        Set location as everything between street type and state.
      Extract direction prefix from start of street name.
      If internal address was found:
        Extract direction suffix from end of street name.
      Else:
        Extract direction suffix from start of location.
      Set street name as everything that is not the address, direction
          prefix or suffix, internal address, location, state or
          zip code.
    Else:
      If internal address was found:
        Extract direction prefix from beginning of string.
        Extract direction suffix before internal address.
        Set street name as everything that is not the address, direction
            prefix or suffix, internal address, location, state or
            zip code.
      Else:
        Extract direction suffix.
        If direction suffix is found:
          Set location as everything between direction suffix and state,
              zip or end of string as appropriate.
          Extract direction prefix from beginning of string.
          Set street name as everything that is not the address, direction
              prefix or suffix, internal address, location, state or
              zip code.
        Else:
          Attempt to determine the location via exact comparison against
              the places lookup.
          Attempt to determine the location via exact comparison against
              the countysub lookup.
          Attempt to determine the location via fuzzy comparison against
              the places lookup.
          Attempt to determine the location via fuzzy comparison against
              the countysub lookup.
          Extract direction prefix.
          Set street name as everything that is not the address, direction
              prefix or suffix, internal address, location, state or
              zip code.
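A few of the steps above can be sketched in Python: pulling the address number
from the start, the zip code from the end, and choosing the last street type
when no internal address is present. This is a rough illustration under stated
assumptions, not the plpgsql normalizer. The STREET_TYPES dictionary is a tiny
made-up stand-in for street_type_lookup, and both function names are
hypothetical.

```python
import re

# Made-up stand-in for street_type_lookup: word -> standard abbreviation.
STREET_TYPES = {"avenue": "Ave", "ave": "Ave", "street": "St", "st": "St",
                "drive": "Dr", "dr": "Dr", "way": "Way", "hwy": "Hwy"}


def split_number_and_zip(text):
    """Return (house number, remaining text, zip) for an address string."""
    number = zip_code = None
    m = re.match(r"\s*(\d+)\s+", text)        # address number at the start
    if m:
        number, text = int(m.group(1)), text[m.end():]
    m = re.search(r"\s+(\d{5})\s*$", text)    # five-digit zip at the end
    if m:
        zip_code, text = m.group(1), text[:m.start()]
    return number, text.strip(), zip_code


def last_street_type(text):
    """Return (word index, abbreviation) of the last street type, or None.

    A type followed by a word that begins with a digit is treated as part
    of the street name (eg. 'State Hwy 92a'), so no type is reported.
    """
    words = text.split()
    for i in range(len(words) - 1, -1, -1):
        abbrev = STREET_TYPES.get(words[i].lower())
        if abbrev is None:
            continue
        if i + 1 < len(words) and words[i + 1][0].isdigit():
            return None
        return i, abbrev
    return None


number, rest, zip_code = split_number_and_zip(
    "29645 7th Street SW Federal Way 98023")
print(number, rest, zip_code)
print(last_street_type(rest))
```

Because the last candidate wins, 'Way' is chosen over 'Street' in this input,
which is the same behaviour this README lists among its known failures.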
Address Geocoder:
The goal of the address geocoder is to provide a robust means of searching
the database for a match to whatever data the user provides. To accomplish
this, the geocoder uses a series of checks and fallthrough cases. Starting
with the most specific combination of parameters, the algorithm works outwards
towards the most vague combination, until valid results are found. As a
result, the more accurate the information provided, the faster the algorithm
returns.
Usage:
refcursor geocode(refcursor, 'address string');
Support functions:
geocode_address(cursor, address, 'dirPrefix', 'streetName', 'streetType',
'dirSuffix', 'location', 'state', zipCode)
geocode_address_zip(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', zipCode)
geocode_address_countysub_exact(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
geocode_address_countysub_fuzzy(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
geocode_address_place_exact(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
geocode_address_place_fuzzy(cursor, address, 'dirPrefix', 'streetName',
'streetType', 'dirSuffix', 'location', 'state')
rate_attributes('dirPrefixA', 'dirPrefixB', 'streetNameA', 'streetNameB',
'streetTypeA', 'streetTypeB', 'dirSuffixA', 'dirSuffixB')
rate_attributes('dirPrefixA', 'dirPrefixB', 'streetNameA', 'streetNameB',
'streetTypeA', 'streetTypeB', 'dirSuffixA', 'dirSuffixB',
'locationA', 'locationB')
location_extract_countysub_exact('partial address string', 'state abbreviation')
location_extract_countysub_fuzzy('partial address string', 'state abbreviation')
location_extract_place_exact('partial address string', 'state abbreviation')
location_extract_place_fuzzy('partial address string', 'state abbreviation')
cull_null('string')
count_words('string')
get_last_words('string')
state_extract('partial address string')
levenshtein_ignore_case('string', 'string')
interpolate_from_address(given address, from address L, to address L,
from address R, to address R, street segment)
interpolate_from_address(given address, 'from address L', 'to address L',
'from address R', 'to address R', street segment)
includes_address(given address, from address L, to address L,
from address R, to address R)
includes_address(given address, 'from address L', 'to address L',
'from address R', 'to address R')
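The names includes_address and interpolate_from_address suggest a range check
followed by linear interpolation along the street segment. The Python sketch
below illustrates that idea under several assumptions: the segment is a plain
list of (x, y) vertices, only a numeric range check is made (whether the real
function also considers odd/even parity is not stated here), and
address_fraction and point_along are hypothetical helper names.

```python
import math


def includes_address(given, from_l, to_l, from_r, to_r):
    """True if given lies within the left or the right address range."""
    def within(lo, hi):
        if lo is None or hi is None:
            return False
        return min(lo, hi) <= given <= max(lo, hi)
    return within(from_l, to_l) or within(from_r, to_r)


def address_fraction(given, from_addr, to_addr):
    """Position of given between from_addr and to_addr, from 0.0 to 1.0."""
    if from_addr == to_addr:
        return 0.5  # single-address range: assume the midpoint
    return (given - from_addr) / (to_addr - from_addr)


def point_along(segment, fraction):
    """Point at the given fraction of the total length of a polyline."""
    pairs = list(zip(segment, segment[1:]))
    lengths = [math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in pairs]
    target = fraction * sum(lengths)
    for (a, b), length in zip(pairs, lengths):
        if length > 0 and target <= length:
            t = target / length
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        target -= length
    return segment[-1]


street = [(0, 0), (10, 0), (10, 10)]
if includes_address(175, 100, 200, 101, 199):
    print(point_along(street, address_fraction(175, 100, 200)))
```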
Notes:
- The geocoder is quite dependent on the address normalizer. The direction
prefix and suffix, streetType and state are all expected to be standard
abbreviations that will match exactly to the database.
- Either a zip code or a location must be provided. If neither is given,
no exception is thrown, but the result is null. The result is also null
if the zip code or location, together with the other information
provided, cannot be matched against the database.
- The process is as follows:
    If a zipCode is provided:
      Check if the zipCode, streetName and optionally state match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
              interpolated geographic point.
    If location exactly matches a place:
      Check if the place, streetName and optionally state match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
              interpolated geographic point.
    If location exactly matches a countySubdivision:
      Check if the countySubdivision, streetName and optionally state
          match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
              interpolated geographic point.
    If location approximately matches a place:
      Check if the place, streetName and optionally state match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
              interpolated geographic point.
    If location approximately matches a countySubdivision:
      Check if the countySubdivision, streetName and optionally state
          match any roads.
      If they do:
        Check if the given address fits any of the roads.
        If it does:
          Return the matching road segment information, rating and
              interpolated geographic point.
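The fallthrough pattern in the process above can be sketched generically:
try each search from most to least specific and stop at the first one that
yields matches. The Python below is an illustration only. The function name,
the (name, applies, search) tuple layout and the stub searches with their
made-up data are all assumptions, standing in for the database queries the
geocoder actually performs.

```python
def geocode_with_fallthrough(query, strategies):
    """Try each (name, applies, search) strategy in order.

    Returns (name, matches) for the first strategy that yields matches,
    or (None, []) when every strategy comes up empty.
    """
    for name, applies, search in strategies:
        if not applies(query):
            continue  # e.g. skip the zip search when no zip was given
        matches = search(query)
        if matches:
            return name, matches
    return None, []


# Stub checks and searches standing in for database queries (made-up data).
def has_zip(q):
    return q.get("zip") is not None


def has_location(q):
    return q.get("location") is not None


def search_zip(q):
    return []  # pretend nothing matched on zip


def search_place_exact(q):
    return ["road segment 42"] if q["location"] == "Seattle" else []


STRATEGIES = [
    ("zip", has_zip, search_zip),
    ("place_exact", has_location, search_place_exact),
    # countysub_exact, place_fuzzy and countysub_fuzzy would follow here.
]

print(geocode_with_fallthrough({"zip": "98106", "location": "Seattle"},
                               STRATEGIES))
```

With neither a zip code nor a location in the query, every strategy is
skipped and the sketch returns an empty result rather than raising.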
Current Issues / Known Failures:
- If a location starts with a direction, eg. East Seattle, and no suffix
direction is given, the direction from the location will be interpreted
as the street's suffix direction.
'18196 68th Ave East Seattle Washington'
address = 18196
dirPrefix = NULL
streetName = '68th'
streetType = 'Ave'
dirSuffix = 'E'
location = 'Seattle'
state = 'WA'
zip = NULL
- The last possible street type in the string is interpreted as the street type
to allow street names to contain type words. As a result, any location
containing a street type will have the type interpreted as the street type.
'29645 7th Street SW Federal Way 98023'
address = 29645
dirPrefix = NULL
streetName = 7th Street SW Federal
streetType = Way
dirSuffix = NULL
location = NULL
state = NULL
zip = 98023
- While some state misspellings will be picked up by the fuzzy searches,
misspelled or non-standard abbreviations may not be picked up, due to
their length (soundex uses an initial character plus three codeable
characters).
'2554 E Highland Dr Seatel Wash'
address = 2554
dirPrefix = 'E'
streetName = 'Highland'
streetType = 'Dr'
dirSuffix = NULL
location = 'Seatel Wash'
state = NULL
zip = NULL
- If neither a location nor a zip code is found by the normalizer, no search
is performed.
- If no street type, direction suffix or location is given in the
address string, the street name is generally misclassified as the
location.
'98 E Main Washington 98012'
address = 98
dirPrefix = 'E'
streetName = NULL
streetType = NULL
dirSuffix = NULL
location = 'Main'
state = 'WA'
zip = 98012
- If no street type is given and the street name contains a type word, then the
type in the street name is interpreted as the street type.
'1348 SW Orchard Seattle wa 98106'
1348::SW:Orch::Seattle:WA:98106
address = 1348
dirPrefix = NULL
streetName = SW
streetType = Orch
dirSuffix = NULL
location = Seattle
state = WA
zip = 98106
- Misspellings of words are only handled so far as their soundex values match.
'Hiland' will not be matched with 'Highland'
soundex('Hiland') = 'H453'
soundex('Highland') = 'H245'
- Missing words in location or street name are not handled.
'Redmond Fall' will not be matched with 'Redmond Fall City'
- Unacceptable failure cases:
The street name is parsed out as 'West Central Park'
'500 South West Central Park Ave Chicago Illinois 60624'
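The soundex limitations noted in the list above can be reproduced with a short
Python sketch of standard American Soundex. This is an illustration only: the
geocoder presumably relies on a database soundex function, and the code below
is a generic implementation, not the one it calls.

```python
def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    chars = [ch for ch in word.upper() if ch.isalpha()]
    if not chars:
        return ""
    result = chars[0]
    previous = codes.get(chars[0], "")
    for ch in chars[1:]:
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        if ch not in "HW":  # H and W do not separate duplicate codes
            previous = digit
    return (result + "000")[:4]  # pad or truncate to four characters


print(soundex("Hiland"), soundex("Highland"))   # H453 H245
print(soundex("Wash"), soundex("Washington"))   # W200 W252
```

The 'g' in 'Highland' contributes a code that 'Hiland' lacks, so the two
values differ. Likewise the short form 'Wash' pads out with zeros and does not
match the code for 'Washington'.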