Loading OpenStreetMap amenity data into R
extract-amenities is a script for extracting amenities from an OpenStreetMap data export. The script writes the amenity data into three tab-separated text files: one for nodes, one for ways, and one for relations. Here we illustrate how to load these files into R and how to run some simple analyses. The extract-amenities script can run on various OSM export files. Below, the output is shown for the full planet export from 10 August 2015 (67 GB in .osm.bz2 format, MD5 checksum d2a64c0f3c80daf73d5b4ea54ac47f6b). This export includes version data and deleted entries, and for this input file, the extracted amenities take around 1.2 GB.
The R code for this analysis is available on GitHub.
Loading the amenity data
Most of the R code for this analysis is contained in the two files loaded below.
source('load_amenities.R')
source('helper.R')
Then, after running extract-amenities on an OSM input file (see above), the extracted amenity data can be loaded into R as follows:
osmdir <- "~/osm-data/amenities-output-history-150810/"
amenities <- load_amenities_cached(osmdir)
To speed up repeated loading, the above command parses the files only on the first call and caches the result for later calls. To force a reparse, delete the file amenities.cache in the directory passed to the function.
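The parse-once-then-cache behaviour described above can be sketched as follows. This is only an illustration: load_cached and parse_fun are hypothetical names, and storing the cache in RDS format is an assumption, since the actual format of amenities.cache is not stated here.

```r
# Hypothetical helper: parse once, then reuse a serialized copy.
# `parse_fun` stands in for whatever function actually reads the raw files.
# Assumption: the cache is an RDS file; the real amenities.cache may differ.
load_cached <- function(dir, parse_fun, cache_name = "amenities.cache") {
  cache_file <- file.path(dir, cache_name)
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # cache hit: skip parsing
  }
  result <- parse_fun(dir)       # cache miss: parse the raw files
  saveRDS(result, cache_file)    # store the result for later calls
  result
}

# Forcing a reparse then amounts to deleting the cache file, e.g.
# unlink(file.path(dir, "amenities.cache"))
```

With this layout the cache lives next to the data it was built from, so removing the output directory also removes the cache.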
Column structure
The extracted amenities are now loaded into the amenities data frame, which contains 20093735 rows and 9 columns. The first few rows read:
kable(add_osm_links(head(amenities, 10)), format = 'markdown')
id | version | visible | sec1970 | pos1 | pos2 | amenity_type | name | type |
---|---|---|---|---|---|---|---|---|
1 | 11 | TRUE | 1359944817 | -31.638757 | -60.693853 | restaurant | 3390 Restó | node |
1 | 12 | FALSE | 1389141228 | | | | | node |
1 | 13 | TRUE | 1432502726 | 48.566985 | 13.4465242 | | | node |
19 | 2 | TRUE | 1278489078 | 51.9458753 | -0.20698 | post_box | | node |
19 | 3 | TRUE | 1354019926 | 51.9458753 | -0.20698 | post_box | | node |
22 | 2 | TRUE | 1278491712 | 51.938183 | -0.268633 | post_box | | node |
22 | 3 | TRUE | 1278518946 | 51.938183 | -0.268633 | post_box | | node |
22 | 4 | TRUE | 1280613203 | 51.938183 | -0.268633 | post_box | | node |
22 | 5 | TRUE | 1340530437 | 51.938183 | -0.268633 | post_box | | node |
26 | 2 | TRUE | 1278491713 | 51.93021 | -0.274278 | post_box | | node |
The type column can take the values node, way and relation. The other columns are as in the output format of the extract-amenities script described here. The position columns (pos1 and pos2) are stored as strings to avoid changing the data. Since R natively supports only 32-bit integers (!), the id column is also stored as a string; see this link.
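To make the reason for string-typed columns concrete, here is a small sketch. The sample data is made up (the id is chosen to exceed the 32-bit range), and pos1_num and pos2_num are hypothetical column names introduced only for this illustration.

```r
# R's native integer type is 32-bit
.Machine$integer.max   # 2147483647

# A tiny tab-separated sample in the spirit of the extracted files.
# The id is invented so that it does not fit into a 32-bit integer.
sample_tsv <- "id\tpos1\tpos2\n3000000000\t51.9458753\t-0.20698\n"

# Read every column as character so that no value is altered on input
df <- read.delim(text = sample_tsv, colClasses = "character",
                 stringsAsFactors = FALSE)

# Convert positions to numbers only where arithmetic is actually needed
df$pos1_num <- as.numeric(df$pos1)
df$pos2_num <- as.numeric(df$pos2)
```

Keeping the original strings alongside the numeric copies means the values can always be written back out exactly as they were read.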
Each row represents one version of a map element. Typically, one is most interested in the latest (and possibly the first) version of an element. The flatten_elements function extracts this information from the versions:
flat_amenities <- flatten_elements(amenities)
The column structure for this new data frame can be seen from the first few rows:
kable(add_osm_links(head(flat_amenities, 10)), format = 'markdown')
id | last_version | last_is_visible | sec1970A | sec1970B | type | last_pos1 | last_pos2 | last_amenity_type | last_name |
---|---|---|---|---|---|---|---|---|---|
1 | 13 | TRUE | 1359944817 | 1432502726 | node | 48.566985 | 13.4465242 | | |
100 | 8 | TRUE | 1199661283 | 1414851569 | node | 52.8916184 | 10.8340913 | | |
100000039 | 1 | TRUE | 1297854038 | 1297854038 | way | 1156219743 | NA | parking | |
100000049 | 1 | TRUE | 1297854041 | 1297854041 | way | 1156219772 | NA | parking | |
100000091 | 2 | TRUE | 1297854074 | 1332411646 | way | 1156220768 | NA | kindergarten | Детский сад № 127 |
100000092 | 2 | TRUE | 1297854074 | 1332411659 | way | 1156220806 | NA | school | Школа № 124 |
100000150 | 2 | FALSE | 1297854083 | 1298322603 | way | NA | NA | | |
100000158 | 4 | TRUE | 1297854084 | 1332155847 | way | 1156220603 | NA | | |
100000193 | 4 | TRUE | 1297855531 | 1342425862 | way | 1156222358 | NA | kindergarten | Д.с. №135 |
100000206 | 4 | TRUE | 1297855532 | 1317371028 | way | 1156222760 | NA | kindergarten | Д.с. №151 |
Let us recall that a map element is extracted if it has an amenity=.. tag, or if a previous version of the element has an amenity=.. tag. The last_is_visible column indicates whether the last version is visible or (currently) deleted. The sec1970A and sec1970B columns store the values of the sec1970 column for the first and last versions. The other columns should be self-explanatory.
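The reduction from one row per version to one row per element can be sketched in base R as below. This is not the actual flatten_elements implementation; flatten_sketch is a hypothetical name, the toy data is abridged from the first table, and only a subset of the columns is carried over.

```r
# Toy version table (values abridged from the first table above)
versions <- data.frame(
  id           = c("1", "1", "1", "19", "19"),
  type         = "node",
  version      = c(11L, 12L, 13L, 2L, 3L),
  visible      = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  sec1970      = c(1359944817, 1389141228, 1432502726, 1278489078, 1354019926),
  amenity_type = c("restaurant", "", "", "post_box", "post_box"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch: keep the first and last extracted version per element.
flatten_sketch <- function(df) {
  # Sort so that versions of the same element are adjacent and ascending
  df <- df[order(df$type, df$id, df$version), ]
  # Key on type and id together, since a node and a way could share an id
  key   <- paste(df$type, df$id)
  first <- df[!duplicated(key), ]                   # earliest version per key
  last  <- df[!duplicated(key, fromLast = TRUE), ]  # latest version per key
  data.frame(
    id                = last$id,
    last_version      = last$version,
    last_is_visible   = last$visible,
    sec1970A          = first$sec1970,
    sec1970B          = last$sec1970,
    type              = last$type,
    last_amenity_type = last$amenity_type,
    stringsAsFactors  = FALSE
  )
}

flat <- flatten_sketch(versions)
```

Because the rows are sorted by key before the first and last versions are picked, the two selections line up row by row and can be combined directly.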
The table below summarizes the loaded data:
kable(amenity_summary(amenities), align = rep("r", 4), format = 'markdown')
| | Nodes | Ways | Relations | Total |
|---|---|---|---|---|
| number_of_extracted_versions | 12797866 | 7205426 | 90443 | 20093735 |
| unique_map_elements | 6578046 | 3750026 | 48875 | 10376947 |
| currently_visible | 5360774 | 3467733 | 41077 | 8869584 |
| currently_deleted | 1217272 | 282293 | 7798 | 1507363 |
| unique_amenity_types | 18608 | 7527 | 439 | 22994 |
The first and last timestamps are 2006-03-22 and 2015-08-10.
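Dates like these follow from the sec1970 values, which count seconds since 1970-01-01. A minimal conversion sketch is shown below; epoch_to_date is a hypothetical stand-in and not necessarily how the from_epoch helper in helper.R is written. Interpreting the timestamps in UTC is also an assumption.

```r
# Hypothetical helper: seconds since 1970-01-01 (assumed UTC) to a calendar date
epoch_to_date <- function(sec) {
  as.Date(as.POSIXct(sec, origin = "1970-01-01", tz = "UTC"), tz = "UTC")
}

# Two sec1970 values taken from the first table above
epoch_to_date(c(1359944817, 1432502726))
```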
Examples
Below are some examples illustrating how to work with the extracted amenity data in R.
Growth plots
Let us first plot the growth of amenity elements. To do this, we select those map elements from flat_amenities whose latest version is tagged as an amenity and is not deleted. The plot below shows the growth of these entries as a function of the date they were (first) tagged as an amenity. (Since tags can be added, changed and removed, this is not necessarily the same as the element creation date.)
plot_growth(flat_amenities)
In terms of monthly growth we obtain the following (log-scale) graph showing the age profile:
plot_age_profile(flat_amenities)
From the graphs one can see a number of near-vertical jumps. Such jumps are to be expected due to database imports. The first plot is similar (with the jump in 2009) to the plot of total accumulated map elements on the OSM wiki. Note, however, that the above plots only include elements in the current OSM map (as of August 2015). For example, the plots would not include amenities that were added in 2010 and deleted in 2012.
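The numbers behind such growth and age-profile plots could be computed roughly as follows. This is a sketch on toy data, not the code of plot_growth or plot_age_profile; in particular, using sec1970A as the first-tagged timestamp and bucketing by UTC calendar month are assumptions.

```r
# Toy stand-in for flat_amenities (values partly made up for illustration)
flat <- data.frame(
  last_is_visible   = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  last_amenity_type = c("parking", "school", "", "", "pub"),
  sec1970A          = c(1278489078, 1278491712, 1297854083, 1297854084, 1359944817),
  stringsAsFactors  = FALSE
)

# Keep elements whose latest version is visible and still tagged as an amenity
live <- flat[flat$last_is_visible & flat$last_amenity_type != "", ]

# Bucket by calendar month of the (assumed) first-tagged timestamp
month <- format(as.POSIXct(live$sec1970A, origin = "1970-01-01", tz = "UTC"),
                "%Y-%m")
monthly <- table(month)  # "YYYY-MM" labels sort chronologically

growth <- data.frame(
  month = names(monthly),
  added = as.vector(monthly),          # monthly growth (age profile)
  total = cumsum(as.vector(monthly)),  # accumulated elements (growth curve)
  stringsAsFactors = FALSE
)
```

Since deleted and untagged elements are filtered out before counting, the totals describe only what is on the current map, which matches the caveat above.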
Longest unmodified map elements
The query below (written using dplyr) finds the ten map elements that have not been modified for the longest time. This is computed from the same data as the above growth graph (from flat_amenities).
oldest_live <- flat_amenities %>%
  filter(last_is_visible == TRUE,
         last_amenity_type != "") %>%
  arrange(sec1970B) %>%
  mutate(last_edit = from_epoch(sec1970B)) %>%
  select(-sec1970A,
         -sec1970B,
         -last_is_visible) %>%
  filter(row_number() <= 10)
kable(add_osm_links(oldest_live), format = 'markdown')
id | last_version | type | last_pos1 | last_pos2 | last_amenity_type | last_name | last_edit |
---|---|---|---|---|---|---|---|
3596840 | 1 | node | 53.4576282 | -2.2189334 | post_box | | 2006-05-15 |
4156519 | 1 | node | 52.1419264 | -0.4684423 | pub | The Balloon | 2006-05-18 |
5045684 | 1 | node | 52.0821249 | -0.3401749 | pub | The Hare and Hounds | 2006-05-21 |
5660444 | 1 | node | 51.8061523 | -1.5071322 | pub | The Lamb Inn | 2006-05-24 |
6536944 | 1 | node | 52.2259307 | -0.5493192 | park | | 2006-05-28 |
6935592 | 1 | node | 52.1502062 | -0.4279931 | post_box | | 2006-05-31 |
6975394 | 1 | node | 52.5960486 | -1.8509703 | school | Four Oaks Infant School | 2006-05-31 |
6975396 | 1 | node | 52.5958248 | -1.8517669 | school | Four Oaks Junior School | 2006-05-31 |
6977397 | 1 | node | 52.5949376 | -1.8482409 | telephone | | 2006-05-31 |
7045191 | 1 | node | 52.1460015 | -0.4201123 | post_box | | 2006-06-01 |
All these entries have last_version=1, so their first and last timestamps coincide. The sec1970A column is therefore dropped, and the sec1970B column is reformatted and relabeled as the more readable last_edit column.
Top-50 amenity tags
The counts of the most popular amenity tags are shown below. An interactive version of this table (that includes all amenities) is available on the OSM taginfo website.
plot_top_amenities(flat_amenities, 50)
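The counts underlying such a ranking can be obtained with a simple frequency table. The sketch below uses a made-up vector of tag values and is not the code of plot_top_amenities; top_n is a hypothetical variable name.

```r
# Toy last_amenity_type values; empty strings mark untagged elements
tags <- c("parking", "parking", "kindergarten", "school",
          "kindergarten", "", "kindergarten")

# Drop untagged entries, count each tag, and sort by descending frequency
counts <- sort(table(tags[tags != ""]), decreasing = TRUE)

# Keep the n most frequent tags (n = 50 in the plot above)
top_n <- 2
head(counts, top_n)
```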
OSM License
The above analysis is based on data from the OpenStreetMap project, (c) OpenStreetMap contributors. The OSM data is available under the ODbL. The code for this analysis is available here (under the MIT license).