--- title: "Array Variables" description: "Crunch supports several types of 'array' variables, which group together multiple columns of data in ways common in survey research." output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Array Variables} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- [Previous: variables](variables.html) ```{r, results='hide', echo=FALSE, message=FALSE} library(crunch) load("vignettes.RData") ``` Crunch supports more types of variables than many of the data formats from which you can create datasets. Plain-text .csv files, for example, can't express that some columns are actually indicator matrices of multiple selections (Multiple Response Variables). Most SPSS .sav files do not indicate Categorical Arrays (otherwise known as "grids") as being part of a group---they are simply several categorical variables. The same is true for an R `data.frame`. You can use `crunch` to "bind" categorical variables into Multiple Response and Categorical Array variables. ## Creating an array One of the reasons to use R with Crunch is to leverage the power of scripting for tasks that would be repetitive in a GUI. Many crunch functions operating on Crunch datasets have an optional `pattern` argument that lets you use regular expressions for these "bulk" operations. In our sample _Economist_ dataset, we have set of variables prefixed with "imiss": ```{r, eval=FALSE} grep("^imiss_", names(ds), value=TRUE) ``` ```{r, echo=FALSE} grep("^imiss_", names(start_make_array), value=TRUE) ``` These correspond to a survey grid question about how important respondents view a set of issues. Examining one of them, we see ```{r, eval=FALSE} ds$imiss_b ``` ```{r, echo=FALSE} cat(show_imiss_b, sep="\n") ``` All of these "imiss" categorical variables have the same structure. We can combine them into a categorical array variable with `makeArray`: ```{r, eval=FALSE} ds$imiss <- makeArray(ds[grep("^imiss_", names(ds))], name="Issue importance") ds$imiss ``` ```{r, echo=FALSE} cat(show_imiss, sep="\n") ``` The set of "important issue" variables have gone from thirteen separate categorical variable cards to just one, where the subvariables are shown as rows, and the common categories across all of them are shown as columns. In our example dataset, the categorical variables `imiss_*` are now not visible in the dataset directly, but we can access them as "subvariables" of the array we just created. ```{r, eval=FALSE} subvariables(ds$imiss) ``` ```{r, echo=FALSE} cat(show_imiss_subvars, sep="\n") ``` We can also step into the subvariables and access the underlying categorical variables: ```{r, eval=FALSE} ds$imiss$imiss_b ``` ```{r, echo=FALSE} cat(show_imiss_b, sep="\n") ``` ## Manipulating subvariables The names `imiss_b` through `imiss_t` are unsatisfying from a human-readability perspective: you can't tell which political issues correspond to the variables. Unfortunately, we didn't have additional metadata on these survey questions in the `data.frame` we imported initially. However, we can rectify this. Subvariables have methods similar to those for [categories](variables.html). They have a names attribute that we can get: ```{r, eval=FALSE} names(subvariables(ds$imiss)) ``` ```{r, echo=FALSE} names_imiss_subvars ``` We can set it, too: ```{r, eval=FALSE} names(subvariables(ds$imiss)) <- c("The economy", "Immigration", "The environment", "Terrorism", "Gay rights", "Education", "Health care", "Social security", "The budget deficit", "The war in Afghanistan", "Taxes", "Medicare", "Abortion") subvariables(ds$imiss) ``` ```{r, echo=FALSE} cat(show_imiss_subvars2, sep="\n") ``` Another useful thing we can do with array subvariables is reorder them. Let's alphabetize the subvariables: ```{r, eval=FALSE} sorting <- order(names(subvariables(ds$imiss))) subvariables(ds$imiss) <- subvariables(ds$imiss)[sorting] subvariables(ds$imiss) ``` ```{r, echo=FALSE} cat(show_imiss_subvars3, sep="\n") ``` ## Creating multiple response variables Just as we created a categorical array, we can create a multiple response variable. Like categorical arrays, multiple responses contain a set of subvariables, categorical variables with a common list of categories. However, the subvariables in a multiple response are treated as dichotomous indicators. We specify one or more categories that indicate "selected" versus "not selected". Hence, when a multiple response appears in the web app, it looks like a single categorical variable, each subvariable shown like a category. Unlike a categorical variable, though, the multiple responses are not mutually exclusive, so tabulations with them may not sum to 100 percent. In the _Economist_ dataset, we have another set of parallel categorical variables, "boap", which indicate approval of President Obama on a range of issues. ```{r, eval=FALSE} ds$boap_4 ``` ```{r, echo=FALSE} cat(show_boap_4, sep="\n") ``` In the questionnaire that collected this data, "boap" appeared as a grid question, just as "imiss" did. But, for illustration purposes---and to show how you can convert between categorical array and multiple response---let's treat this as multiple response. The function `makeMR` works like `makeArray` but with an additional argument, "selections", in which you specify the category name(s) that identify which category or categories should be the dichotomous indicator. ```{r, eval=FALSE} ds$boap <- makeMR(ds[grep("^boap_[0-9]+", names(ds))], name="Approval of Obama on issues", selections=c("Strongly approve", "Somewhat approve")) ds$boap ``` ```{r, echo=FALSE} cat(show_boap, sep="\n") ``` ## (un)dichotomize Multiple response variables can be thought of as categorical arrays that have extra metadata indicating which categories are "selected". This metadata can be manipulated, and we can thus transform categorical arrays into multiple response and vice versa. The function `undichotomize` removes the dichotomization metadata: ```{r, eval=FALSE} ds$boap <- undichotomize(ds$boap) ds$boap ``` ```{r, echo=FALSE} cat(show_boap2, sep="\n") ``` We can add that information with `dichotomize`. Taking our categorical array "boap," let's make it into a multiple response, but this time, let's only include the "Strongly approve" category: ```{r, eval=FALSE} ds$boap <- dichotomize(ds$boap, "Strongly approve") ds$boap ``` ```{r, echo=FALSE} cat(show_boap3, sep="\n") ``` ## Splitting arrays back into categoricals As noted above, when we make an array, its subvariables no longer appear in the dataset outside of the array. ```{r} grep("boap", names(ds), value=TRUE) ``` We can access the subvariables and do things with them directly via the `subvariables` method, but the case may arise in which we want to undo our binding of these subvariables into the array. The function `unbind` deletes the array variable and restores the subvariables as top-level variables. ```{r, eval=FALSE} unbind(ds$boap) ds <- refresh(ds) ``` ```{r, results='hide', echo=FALSE, message=FALSE} ds <- start_make_array ``` Note the use of `refresh`. Most functions that modify objects on the server refresh their local copies in our R session automatically; however, because `unbind` doesn't assign back into `ds`, the local dataset object doesn't get updated with the change, so we need to refresh it manually. Now, if we check the names of `ds`, we see our full set of `boap_*` former subvariables: ```{r} grep("boap", names(ds), value=TRUE) ``` [Next: variable organization](variable-order.html)