---
title: "Generating static linguistic motion charts"
author: "[Gede Primahadi Wijaya Rajeg](http://figshare.com/authors/Gede_Primahadi_Wijaya_Rajeg/1234749) & [I Made Rajeg](http://figshare.com/authors/I_Made_Rajeg/4052377)"
date: "5 February 2018"
output:
  bookdown::html_document2:
    fig_caption: TRUE
    fig_width: 6
    number_sections: TRUE
    toc: TRUE
  html_notebook:
    toc: TRUE
    fig_caption: TRUE
    fig_width: 6
  html_document:
    fig_caption: TRUE
    fig_width: 6
    number_sections: TRUE
    toc: FALSE
bibliography: "hotwarm.bib"
link-citations: TRUE
csl: "apa-old-doi-prefix.csl"
---

# Introduction

This document shares our R code for generating publication-ready, static *linguistic motion charts*. The code was initially designed to create "Figure 2" in our paper [@primahadi-wijaya-r._visualising_2014] on the distributional change of nominal collocates for *hot* and *warm*. That paper was inspired by a pioneering study by [Martin Hilpert](http://members.unine.ch/martin.hilpert/motion.html) [cf. @hilpert_dynamic_2011], who was the first to introduce motion charts to diachronic linguistics as tools for visualising the dynamic process of language change. The static version of the charts presents sequentially ordered bubble/scatter plots ([Figure \@ref(fig:plot-loop)](#plot-loop)) of a linguistic phenomenon across diachronic corpus periods, here on the basis of the [*Corpus of Historical American English*](http://corpus.byu.edu/coha/) (COHA) [@davies_expanding_2012]. We learnt the R function for generating the bubble(/scatter) plot in our paper, namely the base R `symbols()` (cf. [\@ref(basehotwarm)](#basehotwarm)), from section 11.1.4 on the "*bubble plot*" in Kabacoff's [-@kabacoff_r_2011, pp. 278-279] book.

There have been two accessible tutorials on generating motion charts.
One is from Martin Hilpert himself, via his *YouTube* [demo and tutorials](http://members.unine.ch/martin.hilpert/motion.html), on generating the *dynamic/interactive* version of the motion charts with the [`googleVis`](http://cran.r-project.org/web/packages/googleVis/index.html) R package [@gesmann_googlevis_2011]. The second is by Desagulier [-@desagulier_motion_2016], who demonstrates how to retrieve the data from COHA and wrangle it in R into the input format required by the `gvisMotionChart()` function in the `googleVis` package. The present document aims to complement these with our R code for creating the *static* version of the motion charts.

We present here two flavours of the code: (i) the base R version ([\@ref(basehotwarm)](#basehotwarm)), with a minor tweak from the [2014 version](https://figshare.com/articles/Generating_static_motion_charts/5863176) used in our paper [and in Pichler [-@pichler_diachronic_2016]]; and (ii) the `ggplot2` version ([\@ref(ggplothotwarm)](#ggplothotwarm)). We first illustrate the code using our `hotwarm.RData`, available for [download](http://drive.google.com/open?id=12NwnhyeEw_lBrvtcb_KawD1PCfvtOhBp). Then, we re-use the (much simpler) `ggplot2` code to reproduce the static motion chart in Hilpert's [-@hilpert_dynamic_2011, p. 455, Figure 8] case study 2 on the English complement-taking predicates, thereby testing how well the `ggplot2` version carries over to other data ([\@ref(ggplothilpert)](#ggplothilpert)). Readers are nevertheless encouraged to experiment with the base R version as well.

# Main steps

Once the `hotwarm.RData` file has been downloaded, the easiest way to load the data is to double-click the file, which opens an R session. Alternatively, one may load it from an R console using `load()`, provided that the `.RData` file is stored in the working directory of the current R session.

```{r load-data}
load(file = "hotwarm.RData")
```

The R workspace contains a data frame called `hotwarm` with the following format.
```{r peek-hotwarm}
head(hotwarm)
```

The `decade` column represents the COHA decades. In the paper, we track the nominal collocational change of *hot* and *warm* across fifteen decades, from the `r unique(hotwarm$decade)[1]`s to the `r unique(hotwarm$decade)[length(unique(hotwarm$decade))]`s. The `hot` and `warm` columns show the normalised co-occurrence frequencies per million words of the nouns (in the `word` column) with *hot* and *warm* respectively in the [adj + Noun] schema. In other words, we retrieved the R1 collocates of the two adjectives from COHA. The `freq` column contains the joint frequency of the nouns in the specified schema across *hot* and *warm* (i.e., the sum of the values in `hot` and `warm`). Finally, the `diff` column uses the values from the "SCORE" column produced by the "compare" feature of COHA (this choice follows the suggestion in Martin Hilpert's tutorial).

## Base R version of the static motion chart {#basehotwarm}

The steps for generating the static chart with base R are described below.

(@decade_lab) Generate a character vector for the labels of each decade.

```{r decade-label}
decade.label <- as.character(unique(hotwarm$decade))
decade.label
```

(@colloc_lab) Create a character vector containing the collocates whose distribution with *hot* and *warm* will be traced. This vector will be used as the selected labels for the bubbles in the scatter plots.

```{r colloc-label}
selected.collocs <- c("bath", "day", "dog", "heart", "pursuit", "smile", "spot", "water", "welcome")
```

(@colloc_lab_df) Subset the data points for the selected collocate labels defined in step (@colloc_lab) from the `hotwarm` data frame.

```{r colloc-df}
dframe.label <- subset(hotwarm, word %in% selected.collocs)
head(dframe.label)
```

(@split_df) Split the `hotwarm` data frame into a list of data frames, one per decade.
```{r split-df-by-decade}
# data frame for all data points
df.main <- split(hotwarm, hotwarm$decade)

# check the type of the data and the length
typeof(df.main)
length(df.main)
```

(@split_df_label) Likewise, split the `dframe.label` produced in (@colloc_lab_df) above into a list of data frames per decade, containing the data for the collocate labels.

```{r split-label-df}
# data frame for the selected collocates
df.labs <- split(dframe.label, dframe.label$decade)
```

(@plot_loop) Now comes the `for` loop to generate the scatter plots for each of the fifteen decades. Readers may need to adjust parts of the code below, such as the limits or tick labels of the *x*-axis (`xlim`) and *y*-axis (`ylim`), given the nature of their data. Before running the `for` loop, we need to define a number of graphical parameters for the plotting (with `par()`, cf. 6.1 in the chunk below), especially the number of rows and columns in the plot. The `hotwarm` data consists of fifteen decades, so we divide the plot into a 5-by-3 matrix (hence, `mfrow = c(5, 3)`). Details of the arguments in the `par()` call below can be reviewed by typing `?par` in the console.

```{r plot-loop, fig.width = 8, fig.asp = 1.25, dpi = 300, fig.cap = "Changes in the collocational profiles of *hot* (*x*-axis) and *warm* (*y*-axis), *COHA* 1860s-2000s [@primahadi-wijaya-r._visualising_2014, p. 253]."}
# 6.1 Define plotting parameter---------
par(mfrow = c(5, 3), # set a 5-by-3 plotting space
    cex = 0.425,
    mar = c(2.5, 2, 1, 1),
    mgp = c(3, 0.75, 0),
    tck = -0.025,
    las = 1)

# 6.2 "for" loop to generate the sequential scatterplots---------
for (i in seq_along(decade.label)) { # loop sequence is derived from the total number of (15) decades

  # preparing empty plot field from the dataset of all decades
  plot(df.main[[i]][,3], # column [,3] for the x-axis
       df.main[[i]][,4], # column [,4] for the y-axis
       type = "n",
       xlim = c(-0.5, max(hotwarm$hot)+1),
       ylim = c(-0.5, max(hotwarm$warm)+0.5),
       yaxt = "n",
       xaxt = "n",
       ann = FALSE)

  # adding major y-axis to the empty plot
  axis(side = 2, at = c(0, 1, 2, 3), labels = c(0, 1, 2, 3))

  # (OPTIONAL) adding minor y-axis to the empty plot
  #axis(side = 2, at = c(1, 3), labels = FALSE, tcl = -0.2)

  # adding major x-axis to the empty plot
  axis(side = 1, at = c(0, 2, 4, 6, 8), labels = c(0, 2, 4, 6, 8), tick = TRUE)

  # (OPTIONAL) adding minor x-axis to the empty plot
  #axis(side = 1, at = c(1, 3, 5, 7), labels = FALSE, tcl = -0.2)

  # adding gridlines onto the empty plot field
  grid(lwd = 1.15)

  # adding the decade labels onto the plot
  text(4, 3, labels = decade.label[i], col = "grey85", cex = 4.15, font = 2)

  # adding the bubble symbol onto the plot from all data points (cf. Kabacoff, 2011, pp. 278-279)
  symbols(df.main[[i]][,3],
          df.main[[i]][,4],
          inches = 0.150,
          fg = "white",
          bg = "grey85",
          circles = sqrt(df.main[[i]][,5]/pi), # column [,5] for the "freq" used as bubble-size input
          add = TRUE)

  # adding the bubble symbol onto the plot with darker colour for the labelled data points
  symbols(df.labs[[i]][,3],
          df.labs[[i]][,4],
          inches = 0.150,
          fg = "white",
          bg = "grey65",
          circles = sqrt(df.labs[[i]][,5]/pi),
          add = TRUE)

  # adding the selected collocate labels onto the plot
  text(df.labs[[i]][,3],
       df.labs[[i]][,4],
       labels = df.labs[[i]][,2],
       cex = 1.25)
} # end of "for-loop"
```

To save the above plot, first open a `.png` file (or another format, such as `.jpeg` or `.pdf`) for the resulting plot and define the plot's resolution and size before executing the plotting code. After that, run the code in `6.1` and `6.2` (i.e., step (@plot_loop)), and end with `dev.off()` to close the graphic device and save the plot. See the code chunk below.

```{r save-plot, eval = FALSE}
# set up a .png file for the resulting plot, including the plot's resolution and sizes
png("hotwarmPLOT-loop.png", res = 450, width = 6.5 * 500, height = 8.5 * 500)

# Insert here (and then run) the (i) plotting parameter call (6.1) and (ii) the for loop (6.2)

# close the active graphic device to save the plot
dev.off()
```

Note that some trial and error should be expected before the plot reaches the desired size and resolution.

## The `ggplot2` version of the static motion chart {#ggplothotwarm}

The `ggplot2` package offers a *facetting* feature that is the key to (much) simpler code for producing the static motion charts. In `ggplot2`, we can use the `facet_wrap()` function, feed it the `decade` column directly, and let `ggplot2` do the job of splitting the plots by decade.
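As a minimal sketch of what facetting does, the code below builds a small toy data frame and draws one panel per decade without any explicit `split()` or `for` loop. The object names `toy` and `p.toy`, and all the values in them, are hypothetical and only mimic the columns of `hotwarm`; they are not taken from COHA.

```r
library(ggplot2)

# hypothetical toy data mimicking the columns of `hotwarm` (values are made up)
toy <- data.frame(
  decade = rep(c(1860, 1930, 2000), each = 3),
  word   = rep(c("water", "day", "welcome"), times = 3),
  hot    = c(2.1, 1.0, 0.0, 3.4, 1.2, 0.1, 5.0, 1.5, 0.1),
  warm   = c(1.0, 1.8, 0.9, 0.8, 1.5, 1.1, 0.6, 1.2, 1.4)
)
toy$freq <- toy$hot + toy$warm  # joint frequency, used for the bubble size

# one panel per value of `decade`; ggplot2 does the splitting
p.toy <- ggplot(toy, aes(x = hot, y = warm)) +
  geom_point(aes(size = freq), alpha = 1/2, show.legend = FALSE) +
  geom_text(aes(label = word), size = 3, vjust = -1) +
  facet_wrap(~decade)
p.toy
```

With fifteen decades, adding `ncol = 3` to `facet_wrap()` would mirror the 5-by-3 layout set with `mfrow` in the base R version.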
To run the following code, readers need to install the `ggplot2` package, either separately or as part of the suite of packages called the [`tidyverse`](http://www.tidyverse.org) [@wickham_r_2017]. Optionally, readers might also want to install the `ggrepel` package, which is used here to handle overlapping labels. Here are the steps for the `ggplot2` version.

1. Prepare two data frames: (i) one for the selected collocate/item labels (`df.label`), and (ii) one for the remaining items (`df.others`), which will not be labelled in the plot. Here, we also generate the axis labels.

```{r df-ggplot2}
# get the data for the selected collocates
df.label <- subset(hotwarm, word %in% selected.collocs)

# get the data for the remaining collocates
df.others <- subset(hotwarm, !word %in% selected.collocs)

# x-axis labels
xaxis <- expression(paste("Co-occurrence frequency per million words with ", italic("hot"), sep = ""))

# y-axis labels
yaxis <- expression(paste("Co-occurrence frequency per million words with ", italic("warm"), sep = ""))
```

2. Generate the first `ggplot()` call for the non-labelled data and save the output into `p`.

```{r ggplot-call1, fig.asp = 0.9, fig.width=9, warning = FALSE, message = FALSE, dpi = 300}
# load ggplot2 (individually or via the tidyverse) and ggrepel
library(ggplot2)
library(ggrepel)

# generate plot for the non-labelled data
p <- ggplot(data = df.others, aes(x = hot, y = warm)) +
  geom_point(aes(size = freq), alpha = 1/8, show.legend = FALSE) +
  facet_wrap(~decade) # here is the key to split the plot by decades!
```

3. Add the labelled data points to the previous plot (`p`) on the basis of the `df.label` data frame.

```{r ggplot-call2, fig.asp = 0.9, fig.width = 9, warning = FALSE, message = FALSE, dpi = 300, fig.cap = "Changes in the nominal collocational profiles of *hot* and *warm*, *COHA* 1860s-2000s [@primahadi-wijaya-r._visualising_2014, p. 253]."}
p <- p +
  geom_point(data = df.label, aes(size = freq), alpha = 1/2, colour = "royalblue", show.legend = FALSE) +
  # if you are not using `ggrepel` below, use the standard `geom_text()` function
  geom_text_repel(data = df.label, aes(label = word), size = 2.5) +
  labs(x = xaxis,   # insert the `xaxis` data
       y = yaxis) + # insert the `yaxis` data
  theme_bw()

# call out the plot
p
```

4. Save the plot as a `.png` file using `ggsave()`.

```{r save-ggplot, eval = FALSE}
ggsave(filename = "hotwarm.png", plot = p, width = 20, height = 20, units = "cm", dpi = 600)
```

Note that some trial and error is to be expected before the plot reaches the desired size and resolution. For the previous plot, the values for `width`, `height`, and resolution (i.e., `dpi`) given above produce a satisfactory result from our own perspective.

## Recreating the motion chart in Hilpert [-@hilpert_dynamic_2011] {#ggplothilpert}

To recreate the chart (i.e., "Figure 8") in Hilpert's [-@hilpert_dynamic_2011] paper, we need to download the data from Martin Hilpert's [motion charts resource page](http://members.unine.ch/martin.hilpert/motion.html). The data is available as a [zip file](http://members.unine.ch/martin.hilpert/mot.zip), which contains the `mot.RData` workspace. After downloading, open/load the R workspace, which contains two data frames: (i) `convdata` and (ii) `compdata`. The former is the data for Hilpert's [-@hilpert_dynamic_2011, p. 444] case study 1 on the "ambicategorical nouns and verbs". We will instead use `compdata` to produce the static motion charts for the English complement-taking predicates (readers can try the code for themselves with `convdata`).

```{r peek-compdata}
load("mot.RData")
head(compdata)
```

The *x*- and *y*-axis represent the dimensions of the metric Multidimensional Scaling (MDS) computed from the (dis)similarity measures between the verbs on the basis of their complementation profiles [@hilpert_dynamic_2011, pp. 451-452, 454].
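To give a rough idea of how such two-dimensional coordinates can arise, here is a minimal sketch of metric MDS with the base R functions `dist()` and `cmdscale()`. Everything in it is an assumption for illustration: the `profiles` matrix holds made-up relative frequencies of three complement types for four verbs, and the Euclidean distance is just one possible dissimilarity measure. It does not reproduce Hilpert's actual procedure or data.

```r
# hypothetical complementation profiles: relative frequencies of three
# complement types for four verbs (made-up numbers, not Hilpert's data)
profiles <- rbind(
  hope     = c(that = 0.60, to = 0.35, np = 0.05),
  want     = c(that = 0.05, to = 0.75, np = 0.20),
  enjoy    = c(that = 0.05, to = 0.05, np = 0.90),
  remember = c(that = 0.40, to = 0.20, np = 0.40)
)

# pairwise dissimilarities between the verbs
# (Euclidean distance is an assumption made for this sketch)
d <- dist(profiles, method = "euclidean")

# classical (metric) MDS into two dimensions
coords <- cmdscale(d, k = 2)
colnames(coords) <- c("x", "y")

# a data frame with one row per verb and its x/y coordinates
mds.df <- data.frame(verb = rownames(coords), coords, row.names = NULL)
mds.df
```

Verbs with similar profiles end up close to each other in the resulting *x*/*y* space; computing such coordinates separately for each period would give one set of points per decade.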
The `freq` column indicates the normalised text frequencies of the verbs [@hilpert_dynamic_2011, p. 452]. The steps to generate the `ggplot2` motion charts from the `compdata` data are as follows.

1. As above, define two data frames, one for the labelled items and one for the remainder of the data points. We also store the labels for the axes [cf. @hilpert_dynamic_2011, p. 453, for the interpretation given to the *x*- and *y*-axes].

```{r df-hilpert}
# selected complement-taking predicates
predicates <- c("confirm", "await", "enjoy", "demand", "consider", "remember", "hope", "dislike", "want", "try")

# get the data for the selected predicates
df.label <- subset(compdata, verb %in% predicates)

# get the data for the remaining predicates
df.others <- subset(compdata, !verb %in% predicates)

# xaxis label
xaxis <- expression(paste(italic("that"), "-clauses <---------- ----------> ", italic("to"), "-infinitives", sep = ""))

# yaxis label
yaxis <- "nominal compl. <---------- ----------> complex verbal/clausal compl."
```

2. Generate the first `ggplot()` call for the non-labelled data and save the output into `p`.

```{r ggplot-hilpert1, fig.asp = 0.9, fig.width=9, warning = FALSE, message = FALSE, dpi = 300}
# generate plot for the non-labelled data
p <- ggplot(data = df.others, aes(x = x, y = y)) +
  geom_point(aes(size = freq), alpha = 1/8, show.legend = FALSE) +
  facet_wrap(~decade) # here is the key to split the plot by decades!
```

3. Add the labelled data points to the previous plot (`p`) on the basis of the `df.label` data frame.

```{r ggplot-hilpert2, fig.asp = 0.9, fig.width = 9, warning = FALSE, message = FALSE, dpi = 300, fig.cap = "Two-dimensional MDS solution for the English complement-taking predicates, *COHA* 1860s-2000s (Hilpert, 2011, p. 455, Figure 8)."}
p <- p +
  geom_point(data = df.label, aes(size = freq), alpha = 1/2, colour = "royalblue", show.legend = FALSE) +
  # if you are not using `ggrepel` below, use the standard `geom_text()` function
  geom_text_repel(data = df.label, aes(label = verb), size = 2.5) +
  labs(x = xaxis,   # insert the `xaxis` data
       y = yaxis) + # insert the `yaxis` data
  theme_bw()

# call out the plot
p
```

We can save the plot with the `ggsave()` function, following the code used for the `hotwarm` data in section [\@ref(ggplothotwarm)](#ggplothotwarm).

# Coda

We hope this document will be useful for those interested in creating other motion charts, especially static ones. The `.Rmd` source file of this document, together with the `.bib` and `.csl` files, is available in our *figshare* accounts (cf. [here](http://figshare.com/authors/Gede_Primahadi_Wijaya_Rajeg/1234749) and/or [here](http://figshare.com/authors/I_Made_Rajeg/4052377)). This HTML document is rendered with Xie's [-@xie_bookdownbook_2016] `bookdown` R package in RStudio. For any feedback and comments, please drop us an email at primahadiwijaya@gmail.com. We would be happy to hear from you!

```{r session-info, eval = FALSE}
# session info for this R document.
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.13.3 (unknown)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ggrepel_0.6.5 ggplot2_2.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12     rstudioapi_0.7   knitr_1.15.1     magrittr_1.5     devtools_1.12.0
 [6] munsell_0.4.3    colorspace_1.2-6 rlang_0.1.4      stringr_1.2.0    plyr_1.8.4
[11] tools_3.3.1      grid_3.3.1       gtable_0.2.0     withr_1.0.2      htmltools_0.3.5
[16] yaml_2.1.16      lazyeval_0.2.0   digest_0.6.12    rprojroot_1.2    tibble_1.3.4
[21] bookdown_0.3.12  base64enc_0.1-3  memoise_1.0.0    evaluate_0.10    rmarkdown_1.8
[26] labeling_0.3     stringi_1.1.5    scales_0.5.0     backports_1.0.5  jsonlite_1.5
```

# References