Some reasons for using multiple regression

1) Adjusting for explanatory variables (Seeing the effect of a new teaching method on a post-test score after adjusting for pre-test, gender, etc.)

2) Investigating which variables explain a response. (What habitat variables in the forest account for variation in tree growth?)

3) Prediction (What is the value of the stock market next week?)

4) Parameter estimation (Others have used a particular model before; how well do our data agree with theirs?)

For reasons 1, 2, and 3, it is rarely clear which model is best. This is most problematic for reason 2.
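To make reason 1 concrete, here is a minimal numpy sketch (with simulated, hypothetical data) of adjusting for a pre-test score: the coefficient on the method indicator estimates the teaching-method effect after adjusting for the pre-test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
pre = rng.normal(50, 10, n)          # pre-test score (covariate)
method = rng.integers(0, 2, n)       # 0 = old method, 1 = new method
# simulated truth: post-test depends on pre-test and on the method (effect = 5)
post = 10 + 0.8 * pre + 5 * method + rng.normal(0, 3, n)

# design matrix: intercept, pre-test, method indicator
X = np.column_stack([np.ones(n), pre, method])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
# beta[2] is the method effect adjusted for pre-test
print(beta)
```

Regressing post on method alone would mix the method effect with any pre-test imbalance between the groups; including the covariate removes that confounding.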

 

Building multiple regression models

1) Select the independent variables

2) Model formation

3) Residual analysis

 

Selecting the independent variables

1) Known, because the model is already specified

2) Unknown, so a variable selection procedure (e.g., stepwise selection, below) is used

 

 

Stepwise selection

Let a1 = significance level to enter; a2 = significance level to stay

  1. Pick the best single covariate (enter if P < a1)
  2. Try to add another term using MSdrop (partial F) tests (enter if P < a1)
  3. For variables already in the model, use MSdrop tests to see whether any term can be removed (drop if P > a2)
  4. Cycle between steps 2 and 3 until no variable can enter or be dropped.
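The cycle above can be sketched in Python. This is a minimal illustration on simulated data, not a production routine: the `rss`, `partial_f_pvalue`, and `stepwise` helpers are assumptions of this sketch, the MSdrop test is implemented as a 1-degree-of-freedom partial F-test, and a cycle cap guards against add/drop oscillation.

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(X, y):
    # residual sum of squares from a least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def partial_f_pvalue(X_full, X_reduced, y):
    # MSdrop test: F = (RSS_reduced - RSS_full) / (RSS_full / (n - p_full))
    n, p = X_full.shape
    num = rss(X_reduced, y) - rss(X_full, y)
    den = rss(X_full, y) / (n - p)
    return f_dist.sf(num / den, 1, n - p)

def stepwise(X, y, a_enter=0.05, a_stay=0.10, max_cycles=20):
    n, k = X.shape
    selected = []
    for _ in range(max_cycles):
        changed = False
        # step 2: try to add the best candidate (enter if P < a_enter)
        best, best_p = None, a_enter
        for j in (j for j in range(k) if j not in selected):
            Xf = np.column_stack([np.ones(n)] + [X[:, c] for c in selected + [j]])
            Xr = np.column_stack([np.ones(n)] + [X[:, c] for c in selected])
            p = partial_f_pvalue(Xf, Xr, y)
            if p < best_p:
                best, best_p = j, p
        if best is not None:
            selected.append(best)
            changed = True
        # step 3: drop any term whose MSdrop test gives P > a_stay
        for j in list(selected):
            rest = [c for c in selected if c != j]
            Xf = np.column_stack([np.ones(n)] + [X[:, c] for c in selected])
            Xr = np.column_stack([np.ones(n)] + [X[:, c] for c in rest])
            if partial_f_pvalue(Xf, Xr, y) > a_stay:
                selected.remove(j)
                changed = True
        if not changed:
            break
    return selected

# simulated example: only columns 0 and 2 actually drive the response
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + 3 * X[:, 2] + rng.normal(size=200)
sel = stepwise(X, y)
print(sel)
```

Forward selection corresponds to running only the "add" step, and backward selection (starting from the full model) only the "drop" step.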

Forward and backward selection only perform step 2 or step 3, respectively.

 

Model formation

Given the selected independent variables, what is the correct form of the model? (Are squared or cross-product terms needed?)

Residual plots and variable selection methods can be used.
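As a sketch of checking whether higher-order terms are needed, the following compares the residual sum of squares of a first-order model with one that adds a squared term and a cross-product term. The data are simulated (an assumption of this sketch) so that the true response is curved in x1 and includes an x1*x2 interaction; in practice a partial F-test or residual plots would back up the comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
# simulated truth: curvature in x1 plus an x1*x2 cross-product term
y = 1 + 2 * x1 + 1.5 * x1**2 + 0.8 * x1 * x2 + rng.normal(0, 0.5, n)

def fit_rss(X, y):
    # residual sum of squares from a least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

one = np.ones(n)
rss_linear = fit_rss(np.column_stack([one, x1, x2]), y)
rss_full = fit_rss(np.column_stack([one, x1, x2, x1**2, x1 * x2]), y)
print(rss_linear, rss_full)
```

A large drop in RSS when the squared and cross-product terms are added (as here) suggests the first-order model is misspecified; a residual-versus-fitted plot for the first-order fit would show the corresponding curvature.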