--- title: "Analyzing Crunch Data" description: "With Crunch, you can harness the power of R to do computations with your datasets on the server that would be difficult or impossible to accomplish in a graphical user interface." output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Analyzing Crunch Data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- [Previous: transform and derive](derive.html) ```{r, results='hide', echo=FALSE, message=FALSE} library(crunch) load("vignettes.RData") options(width=120) ``` With `crunch`, you can harness the power of R to do computations with your datasets in Crunch that would be difficult or impossible to accomplish in a graphical user interface. ## Crosstabbing While the web application certainly supports crosstabbing, you may want to do aggregations like this in R. Crosstabbing in R with `crunch` may allow you to easily do additional computations on the result, for example. `crunch` contains the `crtabs` (Crunch-tabs) function, which largely emulates the design of the `xtabs` function in base R. In essence, you define a formula and provide data in which to evaluate it. In this case, we'll be providing a `CrunchDataset`. ### Basic examples Like `xtabs`, `crtabs` takes a formula and a data argument. Dimensions of your crosstab go on the right side of the `~`. For a univariate table of frequencies by education, we can do ```{r, eval=FALSE} tab1 <- crtabs(~ educ, data=ds) tab1 ``` ```{r, echo=FALSE} tab1 ``` Additional dimensions are added with `+`. For a two-way table of education and gender, ```{r, eval=FALSE} tab2 <- crtabs(~ educ + gender, data=ds) tab2 ``` ```{r, echo=FALSE} tab2 ``` ### Weighting `crtabs` takes advantage of several Crunch features that `xtabs` does not support. First, it respects weight variables that have been set on the server. This dataset is not currently weighted ```{r, eval=FALSE} weight(ds) ``` ```{r, echo=FALSE} print(NULL) ``` but we can very easily change that. Let's use the "weight" variable that already exists in the dataset: ```{r, eval=FALSE} weight(ds) <- ds$weight ``` Now, if we request the same two-way table as before, we'll get weighted results: ```{r, eval=FALSE} crtabs(~ educ + gender, data=ds) ``` ```{r, echo=FALSE} tab2weighted ``` If we want unweighted data, that's easy enough: ```{r, eval=FALSE} crtabs(~ educ + gender, data=ds, weight=NULL) ``` ```{r, echo=FALSE} tab2 ``` ### Proportion tables As with any `array` data type, we can compute margin tables, and the `prop.table` function in R provides a convenient way for sweeping a table by a margin. These work on the output of `crtabs`, too: ```{r} prop.table(tab1) ``` For column proportions, specify margin=2 (by rows, margin=1): ```{r} prop.table(tab2, 2) ``` Let's make that more readable: ```{r} round(100*prop.table(tab2, 2)) ``` ### Complex data types `crtabs` also comfortably handles the more complex data types that Crunch supports, including categorical array and multiple response variables. In the [array variables vignette](array-variables.html), we created a categorical array, "imiss", for "Important issues". We can crosstab with arrays just as we do non-arrays. ```{r, eval=FALSE} tab3 <- crtabs(~ imiss + gender, data=ds) tab3 ``` ```{r, echo=FALSE} tab3 ``` Note that even though we specified two variables in our formula, because "imiss" itself is two dimensional, our result is a three-dimensional array. To illustrate working with multiple response variables, let's convert "imiss" to multiple response, selecting its positive categories as indicating selection: ```{r, eval=FALSE} ds$imiss <- dichotomize(ds$imiss, c("Very Important", "Somewhat Important")) ``` Now, when we crosstab it, we'll get a two-dimensional table because multiple response variables present a one-dimensional interface: ```{r, eval=FALSE} tab3mr <- crtabs(~ imiss + gender, data=ds) tab3mr ``` ```{r, echo=FALSE} tab3mr ``` It's worth noting here that the result of `crtabs` isn't an `array` object but a `CrunchCube` object. ```{r} class(tab3mr) ``` This allows us to do the appropriate calculations on arrays and multiple response variables when `prop.table` is called. To compute a margin table over a multiple response variable, summing along the dimension would give an incorrect value because the responses in a multiple response are not mutually exclusive--they can't be assumed to sum to 100 percent. However, the `margin.table` method on `CrunchCubes` can compute the correct margin, so `prop.table` gives correct proportions: ```{r} round(100*prop.table(tab3mr, 2)) ``` Finally, just as we saw in the [array variables](array-variables.html) vignette, we can grab individual subvariables and crosstab with them: ```{r, eval=FALSE} crtabs(~ imiss$imiss_f + gender, data=ds) ``` ```{r, echo=FALSE} tab3subvar ``` ### N-way tables It's worth noting that we can extend the crosstabbing to higher dimensions, just by adding more terms on the right-hand side of the formula: ```{r, eval=FALSE} round(crtabs(~ imiss + educ + gender, data=ds)) ``` ```{r, echo=FALSE} round(tab4) ``` ### Numeric aggregations `crtabs` can also compute quantities other than counts. Using the left-hand side of the formula, we can specify other aggregations to put in the cells of the table. For example, in the [deriving variables vignette](derive.html), we created an "age" variable. We can easily compute the average age by gender and education: ```{r, eval=FALSE} crtabs(mean(age) ~ educ + gender, data=ds) ``` ```{r, echo=FALSE} tab5 ``` Other supported aggregations include `min`, `max`, `sd`, and `sum`. For the minimum age by gender and education, ```{r, eval=FALSE} crtabs(min(age) ~ educ + gender, data=ds) ``` ```{r, echo=FALSE} tab6 ``` We can get unconditional (univariate) statistics by making the right-hand side of your formula be just the number `1`: ```{r, eval=FALSE} crtabs(min(age) ~ 1, data=ds) ``` ```{r, echo=FALSE} tab6a ``` Numeric aggregation functions also work with categorical variables that have numeric values defined for their categories; this is the reason why numeric values for categories are defined, in fact. In the [variables vignette](variables.html), we worked with the "On the right track" question and set some numeric values: ```{r, eval=FALSE} categories(ds$track) ``` ```{r, echo=FALSE} print(track.cats) ``` We can use these numeric values to compute an "on the right track index" by averaging them. If the index is greater than zero, more people thing things are going well, and if it is negative, more respondents are pessimistic. ```{r, eval=FALSE} round(crtabs(mean(track) ~ educ + gender, data=ds), 2) ``` ```{r, echo=FALSE} round(tab7, 2) ``` Looks like most people surveyed thought that the U.S. is on the wrong track, but that pessimism is less pronounced for women with higher levels of education. ### Subsetting data We can also specify a subset of `ds` to analyze, just as if it were a data.frame. Let's do the same calculation for Democrats only: ```{r, eval=FALSE} round(crtabs(mean(track) ~ educ + gender, data=ds[ds$pid3 == "Democrat",]), 2) ``` ```{r, echo=FALSE} round(tab8, 2) ``` Not surprisingly, Democrats were less pessimistic about the direction of the country than the general population. A few final observations about `crtabs`. First, all of these calculations have been weighted by the weight variable we set above. We set it and could then forget about it--we didn't have to litter all of our expressions with `ds$weight` and extra arithmetic to do the weighting. Crunch handles this for us. Second, none of these aggregations required pulling case-level data to your computer. `crtabs` sends Crunch expressions to the server and receives in return an `n`-D array of results. The only computations happening locally are the margin tables and sweeping in `prop.table`, computing on the aggregate results. Your computer would work exactly as hard with this example dataset of 1000 rows as it would with a dataset of 100 million rows. ## Statistical modeling Any statistical modeling function that takes a `data` argument should happily accept a `CrunchDataset` and just do the right thing--no extra effort or thought required. Let's fit a basic Ordinary Least Squares (OLS) model. In our dataset, we have a few questions about Edward Snowden, such as: ```{r, eval=FALSE} ds$snowdenleakapp ``` ```{r, echo=FALSE} cat(snowdenleakapp.var, sep="\n") ``` We can use `lm` to fit our model. Let's explore the relationship between approval of Snowden's leak and respondents' interest in current events, party identification, gender, and age. ```{r, eval=FALSE} ols1 <- lm(I(snowdenleakapp == "Strongly approve") ~ newsint2 + pid3 + gender + age, data=ds) summary(ols1) ``` ```{r, echo=FALSE} summary(ols1) ``` Looks like partisanship is weakly associated with approval of the NSA leak, but overall the model isn't a great fit, given our data. (For what it's worth, we're working on a randomly drawn subset of the survey so that the size of data included with package is small. Results are more meaningful with the full dataset.) Nevertheless, this example illustrates how straightforward it is to do statistical analysis with data in Crunch. Even though your dataset lives on the server, you can think of it like a local `data.frame`. Note, for example, that our categorical variables (News Interest, Party ID, and Gender) expand their categories out as dichotomous indicators, just as if they were `factor` variables in a `data.frame`. Given that we're estimating a model with a dichotomous dependent variable, perhaps a logit would be more appropriate than a strict linear predictor. We can use `glm` instead: ```{r, eval=FALSE} logit1 <- glm(I(snowdenleakapp == "Strongly approve") ~ newsint2 + pid3 + gender + age, family=binomial(link="logit"), data=ds) summary(logit1) ``` ```{r, echo=FALSE} summary(logit1) ``` As before, not a particularly interesting result, but this is just the beginning of the analysis process. Using `crunch`, you can keep exploring the data and perhaps find a better fit. Unlike the previous examples, these modeling functions do have to pull columns of data from the server to your local machine. However, only the columns of data you reference in your formula are copied, and if you specify a subset of the dataset to regress on (as we did above with `crtabs` when we looked at just Democrats), only those rows are retrieved. This helps minimize the time spent shipping data across the network. Moreover, because of the `crunch` package's query cache, subsequent models that incorporate any of those variables will not have to go to the server to get them. [Next: filtering datasets](filters.html)