Stat 324 – HW 5

Due Friday, May 8, 2pm

 

1) Burt (1966) reported data intended to examine the relationship between the IQ scores of identical twins, one raised in a foster home and the other raised by natural parents (twinsIQ.mtw).  Warning:  There is now extensive evidence that Burt faked his data!

Suppose we want to predict natural IQ from foster IQ for these 27 pairs of twins. Actually, we want to decide not just whether these variables are related but whether the IQs are the same on average, that is, whether b0 = 0 and b1 = 1.  (Here b0 and b1 denote the betas, the population intercept and slope.) Previously we looked at these hypotheses separately, but now we can test them simultaneously.

Let the “full model” be E(natural IQ) = b0 + b1 × (foster IQ).

(a) Regress natural IQ on foster IQ to estimate the full model.  What are the SSR and MSE of this model? What is SST?

(b) If we want to test H0: b0 = 0 and b1 = 1, what is the “reduced model”?

(c) If you were to fit this reduced model, describe how you would determine the predicted values ŷi.

(d) Now, using the predicted values in (c), determine the sum of the squared residuals, SSE = Σ(yi − ŷi)². You can use Minitab to determine this sum, but clarify how you do so.

(e) If that is SSE, what is SSR for the reduced model? (Hint: Use the SST you found in (a).)

(f) Now find the F-statistic [(SSR(full)-SSR(reduced))/df]/MSE(full). What are the degrees of freedom in the numerator? That is, what is the difference in the number of parameters being estimated between the two models? What conclusion do you draw from this F-test?
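
The arithmetic in parts (a) through (f) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values below are hypothetical (they are not the values in twinsIQ.mtw), the variable names are made up for this sketch, and the reduced-model SSR follows the hint in (e), SSR = SST − SSE.

```python
# Hypothetical (foster IQ, natural IQ) pairs; NOT the values in twinsIQ.mtw.
foster = [82, 90, 95, 100, 104, 110, 118, 125]
natural = [85, 88, 97, 98, 108, 107, 121, 123]

n = len(foster)
x_bar = sum(foster) / n
y_bar = sum(natural) / n

# Full model: least-squares estimates of the intercept and slope.
sxx = sum((x - x_bar) ** 2 for x in foster)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(foster, natural))
b1_hat = sxy / sxx
b0_hat = y_bar - b1_hat * x_bar

sst = sum((y - y_bar) ** 2 for y in natural)
sse_full = sum((y - (b0_hat + b1_hat * x)) ** 2 for x, y in zip(foster, natural))
ssr_full = sst - sse_full
mse_full = sse_full / (n - 2)  # two parameters estimated in the full model

# Reduced model under H0 (b0 = 0, b1 = 1): nothing is estimated, and each
# predicted value is simply the foster IQ itself.
sse_reduced = sum((y - x) ** 2 for x, y in zip(foster, natural))
ssr_reduced = sst - sse_reduced  # using the hint in part (e)

# The numerator df is the difference in the number of estimated parameters.
df_num = 2
f_stat = ((ssr_full - ssr_reduced) / df_num) / mse_full

print("SSR(full) =", round(ssr_full, 3), " MSE(full) =", round(mse_full, 3))
print("SSE(reduced) =", round(sse_reduced, 3))
print("F =", round(f_stat, 3), "on", df_num, "and", n - 2, "df")
```

Because SST is the same in both models, the numerator SSR(full) − SSR(reduced) equals SSE(reduced) − SSE(full), which is the form in which this F-statistic is often written. The resulting F would then be compared with an F distribution on 2 and n − 2 degrees of freedom.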

 

2) The data in coasters.mtw are for 128 roller coasters in the United States, including year opened, maximum speed achieved by the coaster (miles per hour), height (the structure’s greatest height, in feet, measured from the ground to the track level; railings, flagpoles, and such are not counted as part of the roller coaster’s height), maximum vertical drop (feet), length (feet), and number of inversions. (Data downloaded from http://www.rcdb.com, Nov. 2003.)

(a) Fit a simple linear regression model to predict speed from height and use it to predict the average maximal speed of all coasters with greatest height equal to 150 ft with 95% confidence. What is the width of this interval?

(b) If height is entered into the model, do you think it will be useful to also enter drop?  Explain based on some preliminary explorations of the data.

(c) Choose Stat > Regression > Regression and enter both height and drop in the Predictors box.  Record the regression equation and provide an interpretation for each of the three coefficients.

(d) Use your model in (c) to predict the average maximal speed of all coasters with greatest height equal to 150 ft and maximum vertical drop of 200 ft with 95% confidence. What is the width of this interval? How does this compare to the width in (a)? Explain why the second interval is wider.

(e) Determine the Variance Inflation Factor values for your model in (c).  Do they suggest a problem?  Explain.

(f) Now fit the regression model to predict speed from height and length.  Do the Variance Inflation Factors suggest a problem?

(g) Consider adding drop to a model that already includes height (for predicting speed).  Examine an added-variable plot. Does it suggest adding drop?

(h) Repeat (g) for adding length to a model that already includes height.

(i) Of course, part of the reason height and length are not as strongly related is two unusual coasters. Identify them by name and explain how they are unusual in this context.
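
For the Variance Inflation Factors in (e) and (f), a small sketch may help show where the number comes from. With exactly two predictors, both VIFs equal 1 / (1 − r²), where r is the correlation between the two predictors. The heights and drops below are hypothetical values chosen to be strongly related; they are not taken from coasters.mtw, and the helper name `correlation` is made up for this sketch.

```python
import math

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical heights and drops (feet); NOT the values in coasters.mtw.
height = [60, 85, 110, 150, 200, 230]
drop = [50, 80, 100, 140, 195, 215]

r = correlation(height, drop)

# With two predictors, R-squared from regressing one predictor on the other
# is r squared, so both predictors share the same VIF.
vif = 1 / (1 - r ** 2)

print("correlation between predictors:", round(r, 4))
print("VIF for each predictor:", round(vif, 1))
```

A VIF near 1 indicates that the predictors carry nearly separate information; a commonly quoted rule of thumb treats values above about 10 as a sign of serious multicollinearity, though courses differ on the cutoff.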

 

3) An older version of the data set we examined in class on televisions and life expectancy is available in tableB.16.mtw (from Montgomery, Peck, and Vining).

(a) Use life expectancy as the response variable and examine a matrix scatterplot with the two explanatory variables.  Does it appear any transformations will be necessary? Does it appear multicollinearity could be an issue?  Are there any unusual observations? (If so, identify the country/countries by name.)

(b) Examine case influence statistics for the model predicting life expectancy from ln(People-per-TV).   Pick the country with the largest residual; is it a statistically significant outlier?  Comment on its leverage and case influence.

(c) Remove the most unusual country (any contextual justification at all for doing this?!) and fit a regression model for this response variable on ln(People-per-TV), storing the residuals.

Extra Credit: Search for information about this country online and see if you can justify why it may represent a different “population” with respect to these variables and how they are related.

(d) Create an added-variable plot to assess the viability of adding People-per-Dr to the model.  Comment on what this plot reveals.  In particular, does it seem like People-per-TV and People-per-Dr are measuring the same thing? Does it suggest the variable should be transformed first? Does it suggest any unusual observations? (If so, identify the country/countries by name.)

(e) Fit the regression model for life expectancy on both ln(People-per-TV) and ln(People-per-Dr).  Is there evidence of multicollinearity?
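
The added-variable plot in (d) is built from two sets of residuals: the response regressed on the variable already in the model, and the candidate variable regressed on that same variable. A minimal sketch of that construction follows. The numbers are hypothetical (not from tableB.16.mtw), and `fit_line` and `residuals` are helper names invented for this illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y):
    """Residuals from the least-squares line of y on x."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

# Hypothetical values; NOT the values in tableB.16.mtw.
ln_tv = [1.2, 1.8, 2.5, 3.1, 4.0, 5.2]   # ln(People-per-TV)
ln_dr = [5.9, 6.4, 6.1, 7.3, 8.0, 8.6]   # ln(People-per-Dr)
life = [76, 74, 75, 68, 62, 55]          # life expectancy

# Vertical axis: what ln(People-per-TV) leaves unexplained in the response.
e_life = residuals(ln_tv, life)
# Horizontal axis: what ln(People-per-TV) leaves unexplained in the candidate.
e_dr = residuals(ln_tv, ln_dr)

# Plotting e_life against e_dr gives the added-variable plot; the slope of
# its least-squares line matches the candidate's coefficient in the
# two-predictor model.
av_intercept, av_slope = fit_line(e_dr, e_life)

for u, v in zip(e_dr, e_life):
    print(round(u, 3), round(v, 3))
print("slope in added-variable plot:", round(av_slope, 4))
```

A clear linear trend in this plot suggests the candidate variable adds information beyond what is already in the model; curvature hints at a transformation, and isolated points flag observations worth identifying by name.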

 

4) Exercise 6.12 (p. 212-213)

Explain your choices.

 

5) For the stateSAT.mtw data, regress SAT on income and store the leverage (hi) values, the residuals, studentized residuals, deleted t residuals, Cook’s Di, and DFFITS values.

(a) For Alaska, verify that di =  (show the values plugged into the formula)

(b) For Alaska, verify that ti = (yes, determine s(i) first)

(c) For Alaska, verify that ti =

(d) For Alaska, verify that

(e) For Alaska, verify that DFFITS =
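
As a cross-check on hand calculations of this kind, the sketch below computes the usual case-influence statistics for one observation in a simple linear regression. It relies on assumptions throughout: the data are hypothetical (not from stateSAT.mtw), the last observation merely stands in for a state of interest, and the formulas are the common textbook definitions, which may differ in notation or detail from the ones intended in parts (a) through (e).

```python
import math

# Hypothetical (income, SAT) values; NOT the values in stateSAT.mtw.
income = [22, 25, 27, 30, 31, 34, 36, 48]
sat = [1010, 1040, 1000, 1060, 1075, 1050, 1100, 1020]

n = len(income)
p = 2  # number of estimated coefficients (intercept and slope)

x_bar = sum(income) / n
y_bar = sum(sat) / n
sxx = sum((x - x_bar) ** 2 for x in income)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, sat))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

e = [y - (intercept + slope * x) for x, y in zip(income, sat)]  # residuals
h = [1 / n + (x - x_bar) ** 2 / sxx for x in income]            # leverages
sse = sum(r ** 2 for r in e)
s = math.sqrt(sse / (n - p))

i = n - 1  # index of the observation being checked (an assumption)

# Internally studentized (standardized) residual.
r_i = e[i] / (s * math.sqrt(1 - h[i]))

# s(i): residual standard deviation with observation i deleted.
s_del = math.sqrt((sse - e[i] ** 2 / (1 - h[i])) / (n - p - 1))

# Deleted t residual, computed two equivalent ways.
t_i = e[i] / (s_del * math.sqrt(1 - h[i]))
t_i_alt = e[i] * math.sqrt((n - p - 1) / (sse * (1 - h[i]) - e[i] ** 2))

# Cook's distance and DFFITS.
cooks_d = (r_i ** 2 / p) * (h[i] / (1 - h[i]))
dffits = t_i * math.sqrt(h[i] / (1 - h[i]))

print("h_i =", round(h[i], 4), " e_i =", round(e[i], 3))
print("standardized residual =", round(r_i, 4))
print("deleted t residual =", round(t_i, 4), "(alt:", round(t_i_alt, 4), ")")
print("Cook's D =", round(cooks_d, 4), " DFFITS =", round(dffits, 4))
```

The two expressions for the deleted t residual agree algebraically, so a mismatch between them in a hand calculation usually points to an error in s(i). Software packages differ in what they call "studentized" versus "standardized" residuals, so it is worth checking which definition the stored column uses before comparing.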