---
title: "Doing CRD Analysis in R correctly"
author: "Chris Williams"
# date: "8/29/2021"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Our taste test data

We can read in the taste test data and look at a boxplot:

```{r taste1}
tastetest <- read.table("c://temp/taste507aug.txt", header = TRUE)
boxplot(taste ~ treat, data = tastetest, main = "Taste test data")
```

This does seem to show differences among groups.

## Analyzing the data (Take 1)

We fit the model with the lm() function, which creates an lm object. The first argument to lm() is a formula of the form `y ~ x`, where y is the dependent variable and x is a variable used to predict y. We then use functions like anova() and summary() to extract information about the fitted model.

```{r taste2}
tastetest.lm <- lm(taste ~ treat, data = tastetest)
anova(tastetest.lm)
summary(tastetest.lm)
```

Note that R does not print the total SS or df, unlike SAS and some other packages.

Something seems strange here, because the degrees of freedom values (Df) do not match what we had in class previously. Why? What does the summary() function tell us? In the ANOVA table, having df = 1 for treat might suggest that there are only two groups, but we know that is not true. The df is 1 here because R thinks we want to fit a regression model
$$\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}*x_i $$
with a continuous covariate $x$. The values from the summary() function are the estimated intercept and slope, so we have
$$\hat{y_i} = 0 + 2*x_i = 2*x_i $$
However, $x_i$ is not a measured variable; it merely labels the type of item being tasted.
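To see the problem concretely without the original data file, here is a small made-up dataset (hypothetical values, not the real taste data) in which the treatment codes are stored as plain numbers; lm() then spends only 1 df on treat because it fits a slope:

```{r numericdemo}
# Hypothetical data: 3 groups of 3 observations, coded 1, 2, 3
# but stored as plain numbers rather than as a factor
demo <- data.frame(treat = rep(1:3, each = 3),
                   taste = c(1, 2, 3, 3, 4, 5, 5, 6, 7))
str(demo)  # treat is int, so lm() treats it as a continuous covariate
anova(lm(taste ~ treat, data = demo))  # only 1 df for treat: a slope was fit
```

Running str() on a data frame before fitting a model is a quick way to catch this mistake, since read.table() stores all-numeric columns as numeric by default.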
## Analyzing the data (Take 2)

To avoid this misunderstanding, we need to tell R that the variable treat is a factor, using the as.factor() function:

```{r taste3}
# we must make sure that treat is a factor, not numeric
tastetest$treat <- as.factor(tastetest$treat)
tastetest.lm2 <- lm(taste ~ treat, data = tastetest)
anova(tastetest.lm2)
summary(tastetest.lm2)
```

Now the ANOVA table matches what we had before. The coefficient estimates from the summary() function give us a different prediction equation:
$$\hat{y_i} = \hat{\beta_0} + \hat{\beta_1}*I(\text{treat}=2) + \hat{\beta_2}*I(\text{treat}=3) $$
or
$$\hat{y_i} = 2 + 2*I(\text{treat}=2) + 4*I(\text{treat}=3) $$
Here $I(\text{treat}=2)$ is an indicator function, equal to 1 if the observation has $treat=2$ and 0 otherwise; $I(\text{treat}=3)$ is defined similarly for $treat=3$. This gives us predicted values of 2, 4, and 6 for observations in group 1, 2, or 3, respectively. These values equal the respective group means.

For ANOVA models the coefficient estimates from summary() are usually of less interest than the group mean estimates, so we often do not focus on this output as much for ANOVA models. When we have models using both factors and measured variables, as in the analysis of covariance, we will again be more interested in the summary() estimates. In SAS we tell it about factors using the CLASS statement in PROC GLM and other procedures.
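The indicator coding can be inspected directly with model.matrix(). Using the same small made-up dataset as before (hypothetical values chosen so the group means come out to 2, 4, and 6, matching the numbers above), we can also verify that the fitted values from the factor model equal the group means:

```{r dummydemo}
# Hypothetical data: 3 groups of 3 observations, treat stored as a factor
demo <- data.frame(treat = factor(rep(1:3, each = 3)),
                   taste = c(1, 2, 3, 3, 4, 5, 5, 6, 7))

# model.matrix() shows the design matrix R builds: an intercept column
# plus the I(treat=2) and I(treat=3) indicator columns
model.matrix(~ treat, data = demo)

demo.lm <- lm(taste ~ treat, data = demo)
tapply(demo$taste, demo$treat, mean)  # group means: 2, 4, 6
fitted(demo.lm)                       # each fitted value is its group's mean
```

With these made-up numbers the coefficient estimates are 2, 2, and 4, just as in the prediction equation above: the intercept is the group 1 mean, and each remaining coefficient is a difference from that mean.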