Some reasons for using multiple regression

1) Adjusting for explanatory variables (Seeing the effect of a new teaching method on a post-test score after adjusting for pre-test, gender, etc.)

2) Investigating which variables explain a response. (What habitat variables in the forest account for variation in tree growth?)

3) Prediction (What is the value of the stock market next week?)

4) Parameter estimation (Others have used a particular model before; how well do our data agree with theirs?)

For reasons 1, 2, and 3, it is rarely clear which model is best. This is most problematic for reason 2.
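To make reason 1 concrete, here is a minimal numpy sketch (with simulated, hypothetical data) of adjusting for a pre-test score: the coefficient on the method indicator estimates the teaching-method effect after adjusting for the pre-test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
pre = rng.normal(50, 10, n)          # pre-test score (covariate)
method = rng.integers(0, 2, n)       # 0 = old method, 1 = new method
# simulated truth: post-test depends on pre-test and on the method (effect = 5)
post = 10 + 0.8 * pre + 5 * method + rng.normal(0, 3, n)

# design matrix: intercept, pre-test, method indicator
X = np.column_stack([np.ones(n), pre, method])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
# beta[2] is the method effect adjusted for pre-test
print(beta)
```

Regressing post on method alone would mix the method effect with any pre-test imbalance between the groups; including the covariate removes that confounding.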

 

Building multiple regression models

1) Select the independent variables

2) Model formation

3) Residual analysis

 

Selecting the independent variables

1) Known, because the model is already specified

2) Unknown, so a variable selection procedure (e.g., stepwise selection, below) is used

 

 

Stepwise selection

Let a1 = significance level to enter; a2 = significance level to stay

  1. Pick the best single covariate (enter if P < a1)
  2. Try to add another term using MSdrop (partial F) tests (enter if P < a1)
  3. For variables already in the model, use MSdrop tests to see whether any term can be removed (drop if P > a2)
  4. Cycle between steps 2 and 3 until no variable can enter or be dropped.
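The cycle above can be sketched in Python. This is a minimal illustration on simulated data, not a production routine: the `rss`, `partial_f_pvalue`, and `stepwise` helpers are assumptions of this sketch, the MSdrop test is implemented as a 1-degree-of-freedom partial F-test, and a cycle cap guards against add/drop oscillation.

```python
import numpy as np
from scipy.stats import f as f_dist

def rss(X, y):
    # residual sum of squares from a least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def partial_f_pvalue(X_full, X_reduced, y):
    # MSdrop test: F = (RSS_reduced - RSS_full) / (RSS_full / (n - p_full))
    n, p = X_full.shape
    num = rss(X_reduced, y) - rss(X_full, y)
    den = rss(X_full, y) / (n - p)
    return f_dist.sf(num / den, 1, n - p)

def stepwise(X, y, a_enter=0.05, a_stay=0.10, max_cycles=20):
    n, k = X.shape
    selected = []
    for _ in range(max_cycles):
        changed = False
        # step 2: try to add the best candidate (enter if P < a_enter)
        best, best_p = None, a_enter
        for j in (j for j in range(k) if j not in selected):
            Xf = np.column_stack([np.ones(n)] + [X[:, c] for c in selected + [j]])
            Xr = np.column_stack([np.ones(n)] + [X[:, c] for c in selected])
            p = partial_f_pvalue(Xf, Xr, y)
            if p < best_p:
                best, best_p = j, p
        if best is not None:
            selected.append(best)
            changed = True
        # step 3: drop any term whose MSdrop test gives P > a_stay
        for j in list(selected):
            rest = [c for c in selected if c != j]
            Xf = np.column_stack([np.ones(n)] + [X[:, c] for c in selected])
            Xr = np.column_stack([np.ones(n)] + [X[:, c] for c in rest])
            if partial_f_pvalue(Xf, Xr, y) > a_stay:
                selected.remove(j)
                changed = True
        if not changed:
            break
    return selected

# simulated example: only columns 0 and 2 actually drive the response
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + 3 * X[:, 2] + rng.normal(size=200)
sel = stepwise(X, y)
print(sel)
```

Forward selection corresponds to running only the "add" step, and backward selection (starting from the full model) only the "drop" step.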

Forward and backward selection only perform step 2 or step 3, respectively.

 

Model formation

Given the selected independent variables, what is the correct form of the model? (Are squared or cross-product terms needed?)

Residual plots and variable selection methods can be used.
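As a sketch of checking whether higher-order terms are needed, the following compares the residual sum of squares of a first-order model with one that adds a squared term and a cross-product term. The data are simulated (an assumption of this sketch) so that the true response is curved in x1 and includes an x1*x2 interaction; in practice a partial F-test or residual plots would back up the comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 150
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
# simulated truth: curvature in x1 plus an x1*x2 cross-product term
y = 1 + 2 * x1 + 1.5 * x1**2 + 0.8 * x1 * x2 + rng.normal(0, 0.5, n)

def fit_rss(X, y):
    # residual sum of squares from a least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

one = np.ones(n)
rss_linear = fit_rss(np.column_stack([one, x1, x2]), y)
rss_full = fit_rss(np.column_stack([one, x1, x2, x1**2, x1 * x2]), y)
print(rss_linear, rss_full)
```

A large drop in RSS when the squared and cross-product terms are added (as here) suggests the first-order model is misspecified; a residual-versus-fitted plot for the first-order fit would show the corresponding curvature.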