Stat 324 – Project Assignment
There will be 3 components to this project, due at different times during the quarter. I recommend that you work in groups of 2-3 (with data complexity increasing with group size). The data can be collected in stages, but you should have a coherent plan for the total data collection from the start. It will be easiest to focus on one context while increasing the number of variables examined in each part of the project. There will be points for creativity/originality of the topics chosen.
For each part you will submit a typed report with all appropriate computer output incorporated into the body of the report. A large portion of your grade will be determined by how effectively you communicate and present (style, readability, grammar, spelling) your results. You should also select the most relevant parts of the analysis, not turning in gobs of output with no “story” or explanation. Your discussion should use language understandable by a non-statistician, but you may use standard technical terms and notation.
Note that fulfilling the requirements is C work. Try to incorporate some “extras” to raise your grade.
Data Collection: For this
series of projects you will need to collect data for at least 50 observational
units. Your data should include at least 8 variables, with at least 3
quantitative variables and at least 1 binary and one nonbinary categorical
variable for these 50 observational units. You will collect this data
either from available data (e.g., the web) or by designing a survey or an
experiment (which is often more fun and interesting). Be thinking about
what relationship might occur between these variables that your data will
address. For example:
Suppose
that you are interested in purchasing a used car. How much should you
expect to pay? Obviously the price will depend on the type of car you get
(the model) and how much it’s been used. Your project would investigate
how price might depend on age of the car (in years). At the
same time you can collect data on mileage of the car, color of
the car and model of the car. Do your shopping on the internet,
e.g., autobytel.com or autograder.com or cars.com or
another site you find. Initially focus on one model of car but try to
choose a model that has been around for a while (so you get some variety in
ages). You should try to make your sample as “random” as possible (be
careful if the website displays the results in order). Be sure that you
are getting prices for actual cars – not “blue book” theoretical prices.
There are many, many
sources of data on the internet. Some can be linked from http://statweb.calpoly.edu/bchance/stat_stuff.html
and http://nilesonline.com/data/
If you decide to conduct your own survey or experiment,
turn in your data collection plan to me by April 15.
Part I: Simple Linear Regression (due April 22)
Introduction (6 pts): Briefly describe your investigation and
why it was of interest to you (remember to focus on relationships
between variables). Describe the source of your data and your data
collection plan. Identify two variables that you will focus on in Part I
(e.g., price and age). Which are you considering the explanatory variable and
which the response variable? You should have some replicates at one or more explanatory
variable values. Specify any
initial conjectures you have about the relationship you will find between these
two variables (before you saw the data).
Descriptive Statistics (6 pts): Produce and describe a scatterplot between these two variables. Are these data behaving as you conjectured? Is the relationship linear or do you need to perform linearizing transformations? Once you have a reasonably linear relationship, calculate and interpret the correlation coefficient in context.
Model (8 pts):
- Fit the least squares line to your (transformed?) data. Include a “fitted line plot” displaying the regression line on your (transformed?) scatterplot.
· You may need to iterate through several models/residual plots before finalizing your model
- Interpret both regression coefficients in context.
- Report and interpret the coefficient of determination in context.
- Identify and explain (if possible) any unusual observations (e.g., outliers, potential influential observations)
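As a rough illustration of the Model steps above, the sketch below computes the least squares line, the correlation coefficient, and the coefficient of determination by hand. The ages and prices are made-up values for the used-car example (they are not real data), and the function name `least_squares` is only illustrative. In the project itself Minitab produces these numbers; this is just a way to see what the output means.

```python
# Hypothetical data (not from the assignment): age in years and asking
# price in dollars for ten used cars of one model. Ages 2 and 5 are
# repeated so that there are replicates at some explanatory values.
ages = [1, 2, 2, 3, 4, 5, 5, 6, 8, 10]
prices = [18500, 17200, 16800, 15100, 13900, 12500, 12100, 10800, 8900, 6500]


def least_squares(x, y):
    """Return (intercept, slope, r) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / (sxx * syy) ** 0.5
    return intercept, slope, r


b0, b1, r = least_squares(ages, prices)
r_squared = r ** 2  # coefficient of determination for simple regression
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, prices)]

print(f"predicted price = {b0:.0f} + ({b1:.0f}) * age")
print(f"r = {r:.3f}, R-sq = {r_squared:.3f}")
```

Here the slope is the predicted change in price for each additional year of age, and the intercept is the predicted price of a car of age 0. Whether the intercept is meaningful depends on whether age 0 is within the range of your data.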
Statistical Inference (24 pts):
- Specify the null and alternative hypotheses for your conjecture. Is the alternative hypothesis one or two-sided?
- Check the model assumptions (include and discuss residual plots, lack of fit test if possible). Comment on whether you think the model assumptions are sufficiently met. If not, try (variance stabilizing?) transformations of the data. Summarize the overall fit of your model (if it’s not great, that’s fine, just say so).
- Carry out a test of significance for the slope and find a confidence interval for the “population” slope. Describe what this population slope represents. Provide a detailed interpretation of the p-value and confidence interval and discuss their implications.
- Create intervals for the mean prediction and the future response prediction at some interesting value of the explanatory variable. (That is, at some x-value that is interesting to you and why it’s of interest.) Interpret these intervals.
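The inference steps above can be sketched by hand as follows. This is a minimal illustration under stated assumptions: the data are the same made-up used-car values (not real data), the value x0 = 4 years is an arbitrary "interesting" age, and the critical value 2.306 is the 97.5th percentile of the t distribution with n - 2 = 8 degrees of freedom taken from a t table. Minitab reports the p-value directly, so it is not computed here.

```python
import math

# Hypothetical data (not from the assignment): age (years), price (dollars).
ages = [1, 2, 2, 3, 4, 5, 5, 6, 8, 10]
prices = [18500, 17200, 16800, 15100, 13900, 12500, 12100, 10800, 8900, 6500]
T_STAR = 2.306  # t table value for 95% confidence with n - 2 = 8 df

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(prices) / n
sxx = sum((x - x_bar) ** 2 for x in ages)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, prices))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual standard error s and the standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, prices))
s = math.sqrt(sse / (n - 2))
se_b1 = s / math.sqrt(sxx)

# Test statistic for H0: population slope = 0, and a 95% CI for the slope.
t_stat = b1 / se_b1
slope_ci = (b1 - T_STAR * se_b1, b1 + T_STAR * se_b1)

# Intervals at an assumed value of interest, x0 = 4 years.
x0 = 4
fit = b0 + b1 * x0
se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
mean_ci = (fit - T_STAR * se_mean, fit + T_STAR * se_mean)
pred_int = (fit - T_STAR * se_pred, fit + T_STAR * se_pred)

print(f"t = {t_stat:.2f}, slope CI = ({slope_ci[0]:.0f}, {slope_ci[1]:.0f})")
print(f"CI for mean price at age {x0}: ({mean_ci[0]:.0f}, {mean_ci[1]:.0f})")
print(f"PI for one car at age {x0}: ({pred_int[0]:.0f}, {pred_int[1]:.0f})")
```

The prediction interval is always wider than the confidence interval for the mean at the same x0, because it also accounts for the variability of an individual observation around the line.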
Conclusion (3 pts): Summarize your results. Comment on anything of interest that occurred to you during the project. Did the data behave roughly as you expected or did some of the results surprise you? Point out any unusual data values, interesting phenomenon, or obvious departures from regression assumptions. What other questions would you like to ask about the data?
Appendix: Email a copy of your Minitab worksheet, clearly identifying the variables, their units, and the source of the data.
Overall presentation/style/communication (3 pts)
Part II: Multiple Regression (due May 20)
For this portion of the project you will need to use at least 4 variables
(response should be quantitative and you should have at least one quantitative
and one binary explanatory variable). You are encouraged to continue with the
dataset you collected for Part I (e.g., age, mileage, price, and model).
Follow the same format for the report as in Part I (e.g., Introduction (5 pts),
Descriptive Analysis (10 pts), Modelling/Inference (30 pts), Conclusions (5
pts)).
Descriptive Statistics (10 pts):
- A matrix plot of the response and explanatory variables. Which explanatory variables appear most strongly associated with the response variable? Are these associations linear or should you perform any transformations? Are any of the explanatory variables highly correlated with each other? Do the associations behave as you expected? Any unusual observations (id by name)?
Possible extras: inclusion and discussion of univariate plots of individual variables, interaction plots of categorical variables
Model (15 pts):
- The regression analysis for your full model of (transformed?) variables (at least 4).
- Interpretation of the regression coefficients
- Interpretation of the R² value
- Residual analysis for this model, including checks for multicollinearity
Keep in mind it's possible you won't find a great model, but describe the strengths and weaknesses of the one you have.
- Case influence diagnostics and identification of any cases with high residuals or high influence. (Re-examine these cases and determine whether or not they should be excluded from the data set)
Possible extras: variable selection techniques and/or other model simplification strategies (but keep at least one quantitative and one categorical variable in model regardless of significance), model validation, interpretations of transformed variables
Inference (15 pts)
- Do the additional variables significantly improve the
model from Part I (or a simple linear regression model if you are using a new
data set)? Show the formal details of a test of significance to answer this
question.
- Consider adding a quadratic term for (one of) the quantitative variable(s).
Is it statistically significant? (Show details.)
- Carry out a separate lines regression analysis including the binary explanatory variable to compare the two categories. Include a scatterplot showing the two lines and interpret the interaction effect. Perform a formal test to determine whether the linear relationships differ significantly across the groups (e.g., how the mileage and price relationship differs for two different models of car). Include statements of the hypotheses, details of the test statistic and p-value calculation, and your conclusion.
- Use Minitab to determine CIs for a mean response value and a future predicted value for at least one combination of X’s. Interpret these intervals.
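For the question of whether the additional variables significantly improve on the simple model, one common approach is a partial (extra sum of squares) F test comparing the reduced and full models. The sketch below shows only the arithmetic. The error sums of squares and degrees of freedom are hypothetical numbers standing in for values you would read from your two Minitab ANOVA tables, and the function name `partial_f` is illustrative.

```python
def partial_f(sse_reduced, sse_full, df_reduced, df_full):
    """Partial F statistic for comparing nested regression models.

    sse_* are error sums of squares and df_* are error degrees of freedom
    (n minus the number of estimated coefficients) for each model.
    """
    extra_ss = sse_reduced - sse_full      # reduction in unexplained variation
    extra_df = df_reduced - df_full        # number of added terms
    return (extra_ss / extra_df) / (sse_full / df_full)


# Hypothetical values: n = 50, the reduced model has 1 predictor
# (error df = 48) and the full model has 3 predictors (error df = 46).
f_stat = partial_f(sse_reduced=1200.0, sse_full=900.0, df_reduced=48, df_full=46)
print(f"F = {f_stat:.2f} on 2 and 46 df")
```

The statistic is then compared to an F distribution with (2, 46) degrees of freedom, using a table or software, to obtain the p-value.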
Possible extras: interpreting quadratic behavior, using a categorical variable with more than two groups, comparisons to previously constructed confidence intervals from part I
Conclusion (5 pts): Summarize the final formulations of your model and
discuss conclusions in context.
Is the model valid? Is it significant? What insight does it give you about your
response variable?
Overall presentation/style/communication
Possible extras: Relating your
results to other studies, great layout/organization/graphics
Appendix: Email a copy of your Minitab worksheet, clearly identifying
the variables, their units, and the source of the data.
You may work with up to 2 other people on this project. Your report must be word processed with all relevant Minitab output incorporated into the body of the report. (Graphs and output should be integrated into the report, not just as an appendix. You can include additional details as an appendix but still need to select the relevant information as part of the discussion.) You should again send the original Minitab worksheet as an attachment in an email to me.
Presentations: Be ready to make a small presentation (< 5 min!) from your 3 projects to the rest of the class June 2: What did you analyze, what were the most interesting results? Small changes to the third report can be made between June 2 and June 4 based on the discussion that occurs during the presentations. You should select a few graphs and a bit of output to show to the rest of the class (does not have to be a transparency or powerpoint).
Data: Your data should consist of a binary response variable having just two categories and several potential predictors. At least two of the predictors should be good quantitative variables and one should be binary. Your goal will be to investigate logistic regression models to study how the chance of being in one of the two groups is related to the other variables. In some cases there might be an obvious binary variable to predict, such as whether or not a sports team wins a game, whether a car model is domestic or foreign, or whether a state has a Republican or Democratic governor. In other cases, you can “manufacture” a binary response from a quantitative variable. For example, you might want to study “poor” vs. “rich” countries and use some cutoff on per capita GDP as a means to assign the groups. Note that you can code the response variable as character data in Minitab (like “W” and “L”) to make the output more immediately understandable, otherwise make sure your codes are clearly defined.
The report: Make sure your name(s) and Spring 2009 are prominent (especially in Word files)
Introduction (2 pts): Summarize the history/background of your data, any prior suspicions.
Descriptive Statistics (5 pts): Explore some simple relationships in your data.
- Choose a single quantitative predictor and examine a graph comparing the distribution of this variable between the two groups (e.g., an Individual Value Plot or stacked dotplots). Describe how the distributions compare.
- Choose a single binary predictor and create a 2×2 contingency table with the response variable. Calculate and interpret the odds ratio from this table.
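As a small worked illustration of the odds ratio from a 2×2 table, the sketch below uses made-up counts for a sports team's wins and losses at home and away. The counts and the function name `odds_ratio` are hypothetical and only meant to show the arithmetic.

```python
def odds_ratio(success_1, failure_1, success_2, failure_2):
    """Odds of success in group 1 divided by odds of success in group 2."""
    odds_1 = success_1 / failure_1
    odds_2 = success_2 / failure_2
    return odds_1 / odds_2


# Hypothetical 2x2 table: rows are the binary predictor (home, away),
# columns are the binary response (win, loss).
home_wins, home_losses = 30, 11
away_wins, away_losses = 20, 21

ratio = odds_ratio(home_wins, home_losses, away_wins, away_losses)
print(f"odds ratio (home vs. away) = {ratio:.2f}")
```

With these counts the odds of winning at home are a little under 3 times the odds of winning away. An odds ratio of 1 would indicate no association between the predictor and the response.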
Single Predictor Model (20 pts): Choose a single quantitative predictor and run a logistic regression.
- Show (by hand) how to use the fitted model for predicting the probability of success for a particular outcome, being sure to explain what you are finding in terms of your data situation.
- Interpret the intercept coefficient of the model in context.
- Verify Minitab’s calculation of the odds ratio.
- Include a graph of the estimated probabilities vs. the explanatory variable with commentary.
- Discuss the meaning of the “odds ratio” value from the Minitab output and accompanying confidence interval (make sure you interpret the CI).
- Is the predictor variable statistically significant? Provide a careful interpretation of b1 including an interpretation of H0 and Ha in words (what does it mean to say b1 =0), report the appropriate test statistic and p-value, and your conclusion in English.
- Create a prediction table and comment on the accuracy of your predictions.
- Finally, include some assessment of the appropriateness of this model and single predictor.
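Several of the single-predictor steps above (predicting a probability by hand, verifying the odds ratio, and building a prediction table) can be sketched as follows. Everything numeric here is assumed for illustration: the fitted coefficients b0 = -3.0 and b1 = 0.5, the ten observations, and the 0.5 cutoff are not from any real Minitab output.

```python
import math

# Hypothetical fitted model: log-odds of success = B0 + B1 * x.
B0, B1 = -3.0, 0.5


def predict_prob(x):
    """Estimated probability of success at predictor value x."""
    log_odds = B0 + B1 * x
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(x):
    p = predict_prob(x)
    return p / (1.0 - p)


# The odds ratio for a one-unit increase in x equals exp(b1) at any x.
odds_ratio = math.exp(B1)

# Hypothetical data: predictor values and observed 0/1 responses.
xs = [2, 3, 4, 5, 5, 7, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]

# Prediction table: predict success when the estimated probability >= 0.5.
CUTOFF = 0.5
table = {(actual, pred): 0 for actual in (0, 1) for pred in (0, 1)}
for x, y in zip(xs, ys):
    predicted = 1 if predict_prob(x) >= CUTOFF else 0
    table[(y, predicted)] += 1

accuracy = (table[(0, 0)] + table[(1, 1)]) / len(ys)
print(f"P(success | x = 6) = {predict_prob(6):.3f}")
print(f"odds ratio = {odds_ratio:.3f}")
print(f"prediction table (actual, predicted): {table}")
print(f"proportion correctly classified = {accuracy:.2f}")
```

Saying b1 = 0 is equivalent to saying the odds ratio is 1, that is, the odds of success do not change with x.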
Multiple Predictors Model (20 pts): Choose your best model using at least two predictors (at least one quantitative). Try to balance getting a good fit with keeping the model simple (but use at least two predictors – even if one or both are not very effective).
- Use and describe in detail a selective backward elimination process to pare down the model to 2 predictors.
- Include a coded scatterplot of the final 2 explanatory variables using the binary response as the groups, with commentary.
- As with the single case, show (by hand) how to use the fitted model for predicting a couple of typical cases.
- Comment on the effectiveness of each predictor in the model as well as the overall fit (using whatever parts of the Minitab output are appropriate).
- Carry out a drop in deviance test for quadratic terms and a test for interactions. Provide a careful interpretation of what the interaction would indicate (including a graph for your data?), whether or not it is statistically significant.
- For your final model, include plots of residuals, delta, and leverage values with commentary. Be sure to identify and attempt to comment on any unusual observations.
- Create a prediction table and compare the accuracy of your predictions to the single predictor model.
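The drop in deviance test mentioned above compares the log-likelihoods of two nested logistic models. The sketch below shows the calculation for the case where the full model adds a single term (one degree of freedom), such as one quadratic term or one interaction. The two log-likelihood values are hypothetical stand-ins for the values in your Minitab output, and the function names are illustrative.

```python
import math


def drop_in_deviance(loglik_reduced, loglik_full):
    """G = deviance(reduced) - deviance(full) = 2 * (logL_full - logL_reduced)."""
    return 2.0 * (loglik_full - loglik_reduced)


def p_value_1df(g):
    """Upper-tail chi-square probability with 1 df: P(X > g) = erfc(sqrt(g / 2))."""
    return math.erfc(math.sqrt(g / 2.0))


# Hypothetical log-likelihoods: model without and with one extra term.
loglik_reduced = -28.40
loglik_full = -25.10

g = drop_in_deviance(loglik_reduced, loglik_full)
p = p_value_1df(g)
print(f"G = {g:.2f} on 1 df, p-value = {p:.4f}")
```

The `erfc` shortcut applies only when exactly one term is added. If several terms are added at once, compare G to a chi-square distribution with that many degrees of freedom using a table or software.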
Examples of possible extras: More graphs/exploration; Trying several cut-off values in the prediction table; Using nonbinary categorical predictors.
Conclusion (3 pts): Provide an overall summary of the analyses in this project, including a recommendation of which model you would use. Also include a critique of your analysis and suggestions for future analyses with your data.
Appendix: Email a copy of your Minitab worksheet, clearly identifying the variables, their units, and the source of the data.
Extra Credit Option: Create an ordinal or nominal regression model.
My main goal is to see if you can carry out a slight extension to the procedures we have
discussed. I don’t expect perfection here, but will give credit for serious attempts.
For example, ordinal or nominal logistic regression applies if the response variable is
categorical with more than 2 categories. If the categories are ordered (e.g.,
strongly disagree, disagree, neutral, agree, strongly agree), the response is ordinal.
Ordinal logistic regression assumes parallel lines. If this is not a reasonable
assumption (or the categories are not ordered), use nominal logistic
regression. These correspond to the later options in the Minitab menu under
Stat > Regression.
The Minitab
help menus should be useful (see examples and interpreting results.)
If you refer
to any other sources, be sure to cite them.
Examples that follow are from Agresti’s Introduction to Categorical Data Analysis text.
Example: Suppose we have data on alligators giving
their length and the primary food type found in the alligator’s
stomach: Fish, Invertebrate, or Other.
Applying a nominal logistic regression model to fit logits for the (J - 1) pairs of categories, we see:
Logistic Regression Table
                                                Odds     95% CI
Predictor          Coef  SE Coef      Z      P  Ratio  Lower  Upper
Logit 1: (I/O)
Constant          5.697    1.794   3.18  0.001
length          -2.4654   0.8997  -2.74  0.006   0.08   0.01   0.50
Logit 2: (F/O)
Constant          1.618    1.307   1.24  0.216
length          -0.1101   0.5171  -0.21  0.831   0.90   0.33   2.47
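Writing π_F, π_I, and π_O for the estimated probabilities that the primary food type is Fish, Invertebrate, and Other, the fitted equations are:
log(π_F/π_O) = 1.618 - 0.1101 length
log(π_I/π_O) = 5.697 - 2.465 length
Consequently, subtracting the second equation from the first,
log(π_F/π_I) = -4.079 + 2.355 length
For alligators of length x+1 meters, the estimated odds that the
primary food type is “fish” rather than “invertebrate” are e^2.355 ≈ 10.5
times the estimated odds for alligators of length x meters.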
These equations can be converted to find the
estimated probabilities as a function of length
for each of the 3 groups. Interpret the
log-likelihood and goodness-of-fit tests the same way as before.
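One way to carry out the conversion to estimated probabilities is sketched below, using the coefficients from the alligator output above with Other as the reference category. The function name `food_probabilities` and the example length of 2 meters are illustrative choices, not part of the original example.

```python
import math

# Coefficients (constant, length) from the Logistic Regression Table above.
LOGIT_I = (5.697, -2.4654)   # log(P(Invertebrate) / P(Other))
LOGIT_F = (1.618, -0.1101)   # log(P(Fish) / P(Other))


def food_probabilities(length):
    """Estimated probability of each primary food type at a given length (m)."""
    eta_i = LOGIT_I[0] + LOGIT_I[1] * length
    eta_f = LOGIT_F[0] + LOGIT_F[1] * length
    denom = 1.0 + math.exp(eta_i) + math.exp(eta_f)
    return {
        "Fish": math.exp(eta_f) / denom,
        "Invertebrate": math.exp(eta_i) / denom,
        "Other": 1.0 / denom,
    }


# Fish-vs-invertebrate equation obtained by subtracting the two logits.
intercept_fi = LOGIT_F[0] - LOGIT_I[0]
slope_fi = LOGIT_F[1] - LOGIT_I[1]

probs = food_probabilities(2.0)  # an alligator 2 meters long (arbitrary choice)
for food, p in probs.items():
    print(f"{food}: {p:.3f}")
print(f"log(F/I) = {intercept_fi:.3f} + {slope_fi:.3f} * length")
```

Evaluating the function over a range of lengths gives the three probability curves, which always sum to 1 at each length.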
Example:
Suppose we have information on people’s political ideology (very liberal,
slightly liberal, moderate, slightly conservative, very conservative, j=1,2,3,4,5) and their political party affiliation
(Democrat, Republican). Applying an ordinal logistic regression model, we
have (with Republicans as the reference level):
Logistic Regression Table
                                                 Odds     95% CI
Predictor          Coef  SE Coef       Z      P  Ratio  Lower  Upper
Const(1)        -2.4690   0.1318  -18.73  0.000
Const(2)        -1.4745   0.1091  -13.52  0.000
Const(3)        0.23712  0.09485    2.50  0.012
Const(4)         1.0695   0.1046   10.23  0.000
party
 1               0.9745   0.1291    7.55  0.000   2.65   2.06   3.41
For any political ideology category j, the
estimated odds that a Democrat’s response is in the liberal direction rather than the conservative direction (Y≤j vs. Y>j) are e^0.9745 = 2.65 times the estimated odds for a Republican,
indicating that Democrats tend to be more liberal than Republicans. (The odds
ratio applies to each of the four collapsings into a 2×2 table.) These models work with cumulative logits: the log-odds of P(Y≤j) = a_j + bx. So if x=1
(Democrats), the log-odds of P(very liberal) = -2.469 + 0.9745 = -1.49.
So the
odds of very liberal = 0.224 and P(very liberal) = 0.183. There were 428 Democrats, so we expect
0.183(428) = 78.4 of them to be very liberal.
We get the following fitted values:
           | Very liberal | Slightly lib | Moderate | Slightly conserv | Very conserv | Total |
Democratic | 78.4         | 83.2         | 168.2    | 49.1             | 49.1         | 428   |
Republican | 31.8         | 44.0         | 151.7    | 75.5             | 104.0        | 407   |
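The fitted values in this table can be reproduced from the ordinal logistic regression output with a short calculation, sketched below. The constants and party coefficient are taken from the table above, with x = 1 for Democrats and x = 0 for Republicans. The function names are illustrative.

```python
import math

CUTPOINTS = [-2.4690, -1.4745, 0.23712, 1.0695]  # Const(1) through Const(4)
PARTY_COEF = 0.9745                              # coefficient for party = 1


def expit(z):
    """Inverse of the logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fitted_counts(x, n):
    """Expected counts in the 5 ideology categories for a group of size n."""
    # Cumulative probabilities P(Y <= j) for j = 1..4, then P(Y <= 5) = 1.
    cumulative = [expit(a + PARTY_COEF * x) for a in CUTPOINTS] + [1.0]
    # Category probabilities are differences of adjacent cumulative values.
    probs = [cumulative[0]] + [
        cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))
    ]
    return [n * p for p in probs]


democrats = fitted_counts(x=1, n=428)
republicans = fitted_counts(x=0, n=407)
print("Democratic:", [round(c, 1) for c in democrats])
print("Republican:", [round(c, 1) for c in republicans])
```

Comparing these fitted counts with the observed counts is one informal way to judge how well the parallel-lines (proportional odds) model describes the data.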