1 R Markdown

This is an R Markdown presentation. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. It’s a great way to combine R programming with text and visualization.

What is more, when combining this with an open source publication of your dataset and scripts (for instance on a dataverse, Surfdrive or github), your complete workflow is public and reproducible! The code blocks in this example are all displayed for instruction purposes, but they can be hidden as well to keep only text and code output.

This demonstration will show how to integrate historical maps from the Special Collections of the Utrecht University Library with a research dataset, and visualize the results.


2 From archival records to dataset

One of my own datasets, collected together with a research assistant, documents the background of migrants entering the city of The Hague in the eighteenth century. The data was collected from digitized sources at the Gemeentearchief The Hague (see here).

Placenames were geocoded against Geonames with Openrefine using this Github repository.

Occupations were coded into HISCO, HISCLASS, and HISCAM classifications using this reference dataset for Dutch occupations.

After many hours of entering, cleaning, and standardizing data, this is the dataset: adm_journeymen.csv (sample).


3 Importing data in R

3.1 Tabular data

You may know that R is an open-source community-based platform that heavily relies on its users and enthusiastic programmers. R comes with basic functionalities (called “base R”). You can add functionalities to R by installing packages, that are developed by the community. There are many, many, many different packages. This also means that there are multiple ways to solve the same issue in R. This is a strong feature of R but can also be overwhelming. Also, not every package is as good or useful as the other. Most of the time, just a couple of packages will suit most of your needs.

Getting our Excel or csv file into R requires reading it as a data.frame. First we import the data and then take a look at the first three rows to see if it looks as it should. For this we use a package that is awesome for manipulating tabular data: data.table.

library(data.table)

adm <- fread("C:\\Users\\Schal107\\Documents\\UBU\\Team DH\\adm_jm_dec2020.csv") # import the csv file

head(adm, n=3) # display the first x rows of the dataset
##           occupation  V1   id           surname last_name sex         pob
## 1: Hoedemakersknecht 118 5408           Andries  Schroder   m      Lubeck
## 2: Kleermakersknecht 401 6203       Jacobus van      Lier   m      Laaren
## 3: Kleermakersknecht 403 7058 Johan Kort Herman  Brinkman   m Haustenbeek
##     pob_modern foreign_dummy pop_pob_1750  pob_lat pob_long dist_pob_denhaag
## 1:      Lubeck             1           NA 53.86547 10.68656        471.70698
## 2:       Laren             0           NA 52.25667  5.22778         66.53604
## 3: Haustenbeck             1           NA 51.83341  8.77527        307.73242
##    origin pob_eq_origin pop_origin_1750 origin_lat origin_long
## 1:                   NA              NA         NA          NA
## 2: Venray             0              NA     51.525       5.975
## 3:                   NA              NA         NA          NA
##    dist_origin_denhaag dist_pob_origin religion marital_status married_dummy
## 1:                  NA              NA     Luth         single             0
## 2:            148.9474        96.16829      Cat         single             0
## 3:                  NA              NA      Ref        married             1
##    children journeyman_dummy hisco hisclass hiscam hisco_minor skill_level
## 1:                         1 79310        8  57.54          79     Average
## 2:                         1 79100        8  51.40          79     Average
## 3:                         1 79100        8  51.40          79     Average
##    skill_num date_arrival month_arrival year_arrival allowed_stay_months
## 1:         3     17570615             6         1757                  NA
## 2:         2     17660505             5         1766                  12
## 3:         3     17740411             4         1774                   6
##    date_leaving month_leaving length_stay_if_left settlement_type left_dummy
## 1:           NA            NA                  NA    gepermiteerd          0
## 2:           NA            NA                  NA                          0
## 3:           NA            NA                  NA        admissie          0
##    citizenship_dummy settlement_date min_length_stay_if_settlement destination
## 1:                 0        17570713                     0.9205479            
## 2:                NA                                            NA            
## 3:                 0        17741031                     6.0000000            
##    dest_eq_pob pop_dest_1750 dest_lat dest_long dist_dest_denhaag dist_pob_dest
## 1:          NA            NA       NA        NA                NA            NA
## 2:          NA            NA       NA        NA                NA            NA
## 3:          NA            NA       NA        NA                NA            NA
##    pdf_page_no source                         remark entered_by sampled N
## 1:          78 1121_7                                     David       0 1
## 2:          99 1122_4 17-06-1967 is soldaat geworden      David       1 2
## 3:          52 1123_1                                     David       1 2
##    mechanical nonroutine ingenuity occ_class_new occ_class_new_ex_mech
## 1:          0          0         1             1                     1
## 2:          0          1         1             2                     2
## 3:          0          1         1             2                     2

Next, we check if the relevant variables are in the right format. Coordinates should be numeric otherwise we cannot plot them on a map:

str(adm[!is.na(origin_lat), list(pob_modern, pob_lat, pob_long, origin, origin_lat, origin_long, sex)])
## Classes 'data.table' and 'data.frame':   180 obs. of  7 variables:
##  $ pob_modern : chr  "Laren" "Bordeaux" "Maastricht" "Ghent" ...
##  $ pob_lat    : num  52.3 44.8 50.8 51 48.2 ...
##  $ pob_long   : num  5.228 -0.581 5.689 3.717 16.372 ...
##  $ origin     : chr  "Venray" "Amsterdam" "Maastricht" "Gent" ...
##  $ origin_lat : num  51.5 52.4 50.8 51.1 48.2 ...
##  $ origin_long: num  5.97 4.89 5.69 3.74 16.37 ...
##  $ sex        : chr  "m" "m" "m" "m" ...
##  - attr(*, ".internal.selfref")=<externalptr>

This all looks fine for now, but notice that sex is not defined as a factor but a string. This is something we might want to fix when doing other visualizations or analyses with this variable.

3.2 Spatial data

Spatial data can have many different formats and many ways of getting it into R. Because we’ll use an XYZ file of the Historical Maps collection one option is to use a package called leaflet. This is a Java-based package that allows for interactive features, such as pins, clusters, and label visualizations.

library(leaflet)

Now we can select a map from the Utrecht University Historical Maps Collection. The selected map is titled Belgii faederati nova descriptio and printed in Amsterdam in 1660.

We’ll store this map in R by simply copying the URL of the XYZ link from the Historical Maps website. To visualize it, we set the correct center and zoom level of the map box, and then simply call the URL inside the leaflet package. Notice that the quality of the scanned map allows for very detailed zooming.

url <- ("https://maps.georeferencer.com/georeferences/9ee5f3fd-740b-5206-8e85-2ecb94618bc9/2018-03-07T13:48:38.866245Z/map/{z}/{x}/{y}.png?key=ebGMmpORFAU1M65ypiIz") # store the URL from the website in R

leaflet() %>% setView(lng = 5.092092, lat = 52.093992, zoom = 7) %>%
  addTiles(url,
           options = WMSTileOptions(format = "image/png", transparent = F)
  ) # plot the map


3.3 Polygons

Because our map is georeferenced it comes with underlying coordinates, meaning that we can add georeferenced data to it. First, we will examine how historically correct the provincial borders of this map actually are. This can be done by importing a recent dataset from the IISH in Amsterdam. Researchers from the IISH are currently meticulously re-drawing provincial borders and publishing these as machine-readable files (GeoJSON). More information on this project here.

The borders are published as polygons. These are basically two-dimensional geometric figures that, in this case, follow the borders of historical regions of mint authorities. The files are not maps in itself but shapes, with accompanying metadata, that you can add to a map. We will do that below.

library(geojsonio)
## Registered S3 method overwritten by 'geojsonsf':
##   method        from   
##   print.geojson geojson
## 
## Attaching package: 'geojsonio'
## The following object is masked from 'package:base':
## 
##     pretty
mint <- geojson_read("C:\\Users\\Schal107\\Documents\\UBU\\Team DH\\Mint authorities.geojson", what = "s") # import IISH polygons

factpal <- colorFactor(topo.colors(10), mint$AUTHORITY) # define a color scheme for our polygons

leaflet() %>% setView(lng = 5.092092, lat = 52.093992, zoom = 7) %>%
  addTiles(url,
           options = WMSTileOptions(format = "image/png", transparent = F)
  )%>% 
  addPolygons(data = mint[mint$DATEfrom >= "1520-01-01" & mint$DATEto <= "1794-12-31" &
                            !grepl("Flanders", mint$AUTHORITY)  &
                            mint$AUTHORITY != "Mechelen" &
                            mint$AUTHORITY != "Tournai" &
                            mint$AUTHORITY != "Brabant" &
                            mint$AUTHORITY != "Namur" &
                            mint$AUTHORITY != "Hainaut" &
                            mint$AUTHORITY != "United Belgian States" &
                            mint$AUTHORITY != "Liège"  |
                            mint$AUTHORITY == "Culemborg" |
                            mint$AUTHORITY == "Vianen" |
                            mint$AUTHORITY == "Ravenstein" |
                            mint$AUTHORITY == "Buren" |
                            mint$AUTHORITY == "Batenburg" |
                            mint$AUTHORITY == "Cuijk" |
                            mint$AUTHORITY == "Gemert" |
                            mint$AUTHORITY == "Borculo" |
                            mint$AUTHORITY == "Cleves"
                          ,    ], color = ~factpal(mint$AUTHORITY), fillOpacity = .01) # this long list of subsetting is required to only retrieve the borders of regions within the Dutch Republic

Although the angles of polygons are a bit off compared to the map, it shows that the map and the official borders follow each other extremely closely.


4 Plotting the dataset on the historical map

Because our map is of the Dutch Republic, we’ll plot the origin of Dutch migrants to The Hague. This can be easily done by subsetting the data using data.table (adm[foreign_dummy == 0). Next, because many migrants came from the same cities or towns (Amsterdam, Deventer, etc.) we’ll define clustering to be able to visualize all migrants. The interactive map feature then allows for zooming in on the individuals from these clusters.

Many social and economic historians are interested in the social status of migrants. Did only the poor move to cities? To examine that for our migrants, we’ll use the HISCAM scale (see here). This is a social stratification scheme based on occupational titles. It runs from 40 (low status, such as day-labourers), to 100 (very high status, such as lawyers and physicians). Since the occupations of migrants were already coded into HISCAM (see above), we can assign a color corresponding to their status, ranging from white (low) to red (high). This is all done in the few lines of code below. Note that you can even hover over the individual migrants to display their HISCAM score!

qpal <- colorQuantile("Reds", adm$hiscam, n = 6) # define Hiscam color scale

leaflet(data = adm[foreign_dummy == 0]) %>% setView(lng = 5.092092, lat = 52.093992, zoom = 7) %>%
  addTiles(url,
           options = WMSTileOptions(format = "image/png", transparent = F)) %>%
  addCircleMarkers(lat = ~pob_lat, lng = ~pob_long,
                   color = ~qpal(adm$hiscam), label =  ~as.character(adm$hiscam),
                    clusterOptions = markerClusterOptions()
  ) %>% 
  addPolygons(data = mint[mint$DATEfrom >= "1520-01-01" & mint$DATEto <= "1794-12-31" &
                            !grepl("Flanders", mint$AUTHORITY)  &
                            mint$AUTHORITY != "Mechelen" &
                            mint$AUTHORITY != "Tournai" &
                            mint$AUTHORITY != "Brabant" &
                            mint$AUTHORITY != "Namur" &
                            mint$AUTHORITY != "Hainaut" &
                            mint$AUTHORITY != "United Belgian States" &
                            mint$AUTHORITY != "Liège"  |
                            mint$AUTHORITY == "Culemborg" |
                            mint$AUTHORITY == "Vianen" |
                            mint$AUTHORITY == "Ravenstein" |
                            mint$AUTHORITY == "Buren" |
                            mint$AUTHORITY == "Batenburg" |
                            mint$AUTHORITY == "Cuijk" |
                            mint$AUTHORITY == "Gemert" |
                            mint$AUTHORITY == "Borculo" |
                            mint$AUTHORITY == "Cleves"
                          ,    ], color = ~factpal(mint$AUTHORITY), fillOpacity = .01)
## Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
## missing or invalid lat/lon values and will be ignored

You might perhaps be more interested in their actual occupation, or some other variable (like their age). We can plot that as well:

leaflet(data = adm[foreign_dummy == 0]) %>% setView(lng = 5.092092, lat = 52.093992, zoom = 7) %>%
  addTiles(url,
           options = WMSTileOptions(format = "image/png", transparent = F)) %>%
  addMarkers(~pob_long, ~pob_lat, label = ~as.character(adm$occupation), clusterOptions = markerClusterOptions())
## Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
## missing or invalid lat/lon values and will be ignored

We can also subset the data on a specific occupation. Let’s see where construction workers came from, and use a map that covers more ground:

leaflet(data = adm[hisco_minor == 95]) %>% setView(lng = 8.75439, lat = 51.71905, zoom = 6) %>%
  addTiles("https://maps.georeferencer.com/georeferences/95c69e4c-410e-56e5-921e-bc5ef59f5be9/2019-11-15T12:39:45.001798Z/map/{z}/{x}/{y}.png?key=ebGMmpORFAU1M65ypiIz",
           options = WMSTileOptions(format = "image/png", transparent = F)) %>%
  addMarkers(lat = ~pob_lat, lng = ~pob_long,
                    label = ~as.character(adm$occupation[adm$hisco_minor == 95]),
                    clusterOptions = markerClusterOptions()  )
## Warning in validateCoords(lng, lat, funcName): Data contains 7 rows with either
## missing or invalid lat/lon values and will be ignored

Answer: most came from the surroundings of The Hague.


5 Combining functions

A very strong feature of R is that you can combine functions from different packages. Let’s demonstrate this by combining three variables in our map: occupational category (HISCO minor), occupational title, and the HISCAM score. First, we set a color scheme for the occupational categories using another package (viridis). Next, we combine the occupational title and the Hiscam score of the migrant into one text label, using the function paste0. Interesting to note here is that paste0 is a base R function, that you can use across packages. This goes for many common conditions, which means that you don’t always have to learn the specific code or functions of every package. The map below combines four different methods for visualization: base R, data.table, leaflet, and viridis:


package used for code
base R pasting labels paste0(adm$occupation, ", HISCAM = ", adm$hiscam)
data.table subsetting dataset adm[foreign_dummy == 0]
viridis color scale qpal2 <- colorNumeric("viridis", as.factor(adm$hisco_minor))
leaflet interactive map all the other lines ;)


library(viridis)
## Loading required package: viridisLite
qpal2 <- colorFactor("viridis", as.factor(adm$hisco_minor)) # define color scheme for occupational categories

leaflet(data = adm[foreign_dummy == 0]) %>% setView(lng = 5.092092, lat = 52.093992, zoom = 7) %>%
  addTiles(url,
           options = WMSTileOptions(format = "image/png", transparent = F)) %>%
  addCircleMarkers(lat = ~pob_lat, lng = ~pob_long,
                   color = ~qpal2(adm$hisco_minor), label = ~as.character(paste0(adm$occupation, ", HISCAM = ", adm$hiscam )),
                   clusterOptions = markerClusterOptions()
  )
## Warning in validateCoords(lng, lat, funcName): Data contains 1 rows with either
## missing or invalid lat/lon values and will be ignored

That’s it for now! Note that the dataset used here required quite some work, but as long as you have coordinates (latitude and longitude), or even just place names, you can already perform GIS visualizations.


6 Want to know more?

Join the R courses of RDM and/or check the upcoming events at the Living Lab Digital Humanities!

Questions: