Stat 324 – Project Assignment

 

There will be 3 components to this project, due at different times during the quarter.  I recommend that you work in groups of 2-3 (with data complexity increasing with group size).  The data can be collected in stages, but you should have a coherent plan for the total data collection from the start.  It will be easiest to focus on one context while increasing the number of variables examined in each part of the project.  Points will be awarded for creativity/originality of the topics chosen.

For each part you will submit a typed report with all appropriate computer output incorporated into the body of the report.  A large portion of your grade will be determined by how effectively you communicate and present (style, readability, grammar, spelling) your results.  You should also select the most relevant parts of the analysis, not turning in gobs of output with no “story” or explanation.  Your discussion should use language understandable by a non-statistician, but you may use standard technical terms and notation.

Note that fulfilling the requirements is C work.  Try to incorporate some “extras” to raise your grade.

 

Data Collection: For this series of projects you will need to collect data for at least 50 observational units.  Your data should include at least 8 variables, with at least 3 quantitative variables, at least 1 binary variable, and at least 1 nonbinary categorical variable for these 50 observational units.  You may collect these data either from available sources (e.g., the web) or by designing a survey or an experiment (which is often more fun and interesting).  Think about which relationships among these variables your data will address.  For example:

Suppose that you are interested in purchasing a used car.  How much should you expect to pay?  Obviously the price will depend on the type of car you get (the model) and how much it’s been used.  Your project would investigate how price might depend on age of the car (in years).  At the same time you can collect data on mileage of the car, color of the car and model of the car.  Do your shopping on the internet, e.g., autobytel.com or autograder.com or cars.com or another site you find.  Initially focus on one model of car but try to choose a model that has been around for a while (so you get some variety in ages).  You should try to make your sample as “random” as possible (be careful if the website displays the results in order).  Be sure that you are getting prices for actual cars – not “blue book” theoretical prices.

There are many, many sources of data on the internet.  Some can be linked from http://statweb.calpoly.edu/bchance/stat_stuff.html and http://nilesonline.com/data/

If you decide to conduct your own survey or experiment, turn in your data collection plan to me by April 15.

 

Part I: Simple Linear Regression (due April 22)

Introduction (6 pts): Briefly describe your investigation and why it was of interest to you (remember to focus on relationships between variables).  Describe the source of your data and your data collection plan.  Identify two variables that you will focus on in Part I (e.g., price and age). Which are you considering the explanatory variable and which the response variable?  You should have some replicates at one or more explanatory variable values.  Specify any initial conjectures you have about the relationship you will find between these two variables (before you saw the data).

 

Descriptive Statistics (6 pts): Produce and describe a scatterplot between these two variables.  Are these data behaving as you conjectured?  Is the relationship linear or do you need to perform linearizing transformations? Once you have a reasonably linear relationship, calculate and interpret the correlation coefficient in context. 

 

Model (8 pts):

- Fit the least squares line to your (transformed?) data.  Include a “fitted line plot” displaying the regression line on your (transformed?) scatterplot.  

            You may need to iterate through several models/residual plots before finalizing your model.

- Interpret both regression coefficients in context.

- Report and interpret the coefficient of determination in context.

- Identify and explain (if possible) any unusual observations (e.g., outliers, potential influential observations)
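As a check on the Minitab output for the items above, the least squares line, correlation coefficient, and coefficient of determination can be computed by hand. The following is only a minimal sketch; the five age/price pairs are hypothetical and stand in for your own data.

```python
import math

# Hypothetical data: age of a used car (years) and asking price ($1000s).
age = [1, 2, 3, 4, 5]
price = [20, 18, 17, 14, 11]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(price) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in age)
syy = sum((y - y_bar) ** 2 for y in price)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, price))

slope = sxy / sxx                    # estimated change in price per extra year
intercept = y_bar - slope * x_bar    # predicted price at age 0
r = sxy / math.sqrt(sxx * syy)       # correlation coefficient
r_squared = r ** 2                   # coefficient of determination

print(f"price-hat = {intercept:.2f} + ({slope:.2f}) * age")
print(f"r = {r:.3f}, R^2 = {r_squared:.3f}")
```

Here the slope would be read as the estimated change in price (in $1000s) for each additional year of age, and R^2 as the proportion of the variation in price explained by the line.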

 

Statistical Inference (24 pts):

- Specify the null and alternative hypotheses for your conjecture. Is the alternative hypothesis one- or two-sided?

- Check the model assumptions (include and discuss residual plots, lack of fit test if possible). Comment on whether you think the model assumptions are sufficiently met. If not, try (variance stabilizing?) transformations of the data. Summarize the overall fit of your model (if it’s not great, that’s fine, just say so).

- Carry out a test of significance for the slope and find a confidence interval for the “population” slope. Describe what this population slope represents. Provide a detailed interpretation of the p-value and confidence interval and discuss their implications.

- Create a confidence interval for the mean response and a prediction interval for a future response at some interesting value of the explanatory variable. (That is, choose an x-value that interests you and explain why it is of interest.) Interpret these intervals.
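The slope test and the two intervals described above follow standard formulas, which the sketch below applies to hypothetical data. The data, the chosen x-value, and the t critical value (read from a t table for 3 degrees of freedom) are all assumptions for illustration; Minitab reports these quantities directly.

```python
import math

# Hypothetical data: age (years) and price ($1000s) for five used cars.
age = [1, 2, 3, 4, 5]
price = [20, 18, 17, 14, 11]

# Assumption: critical value t* for 95% confidence with n - 2 = 3 df,
# taken from a t table.
T_CRIT = 3.182

n = len(age)
x_bar = sum(age) / n
y_bar = sum(price) / n
sxx = sum((x - x_bar) ** 2 for x in age)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, price))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard error s = sqrt(SSE / (n - 2)).
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(age, price))
s = math.sqrt(sse / (n - 2))

# Test statistic for H0: slope = 0, and a 95% confidence interval for the slope.
se_slope = s / math.sqrt(sxx)
t_stat = slope / se_slope
slope_ci = (slope - T_CRIT * se_slope, slope + T_CRIT * se_slope)

# Intervals at a chosen x-value (here a 4-year-old car).
x0 = 4
fit = intercept + slope * x0
se_mean = s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
se_pred = s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
mean_ci = (fit - T_CRIT * se_mean, fit + T_CRIT * se_mean)     # mean response
pred_int = (fit - T_CRIT * se_pred, fit + T_CRIT * se_pred)    # future response

print(f"t = {t_stat:.2f}, 95% CI for slope = ({slope_ci[0]:.2f}, {slope_ci[1]:.2f})")
print(f"mean response at x0: ({mean_ci[0]:.2f}, {mean_ci[1]:.2f})")
print(f"future response at x0: ({pred_int[0]:.2f}, {pred_int[1]:.2f})")
```

The prediction interval is always wider than the confidence interval for the mean, because it also accounts for the variability of an individual observation about the line.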

 

Conclusion (3 pts): Summarize your results.  Comment on anything of interest that occurred to you during the project.  Did the data behave roughly as you expected or did some of the results surprise you?  Point out any unusual data values, interesting phenomena, or obvious departures from regression assumptions.  What other questions would you like to ask about the data?

 

Appendix: Email a copy of your Minitab worksheet, clearly identifying the variables, their units, and the source of the data.

 

Overall presentation/style/communication (3 pts)

 

 

Part II: Multiple Regression (due May 20)
For this portion of the project you will need to use at least 4 variables (response should be quantitative and you should have at least one quantitative and one binary explanatory variable). You are encouraged to continue with the dataset you collected for Part I (e.g., age, mileage, price, and model).  Follow the same format for the report as in Part I (e.g., Introduction (5 pts), Descriptive Analysis (10 pts), Modelling/Inference (30 pts), Conclusions (5 pts)).

 

Descriptive Statistics (10 pts):

- A matrix plot of the response and explanatory variables.  Which explanatory variables appear most strongly associated with the response variable? Are these associations linear or should you perform any transformations? Are any of the explanatory variables highly correlated with each other? Do the associations behave as you expected? Are there any unusual observations (identify them by name)?

Possible extras: inclusion and discussion of univariate plots of individual variables, interaction plots of categorical variables


Model (15 pts):

- The regression analysis for your full model of (transformed?) variables (at least 4).

- Interpretation of the regression coefficients

- Interpretation of the R² value

- Residual analysis for this model, including checks for multicollinearity
            Keep in mind it's possible you won't find a great model, but describe the strengths and weaknesses of the one you have

- Case influence diagnostics and identification of any cases with high residuals or high influence. (Re-examine these cases and determine whether or not they should be excluded from the data set)

Possible extras: variable selection techniques and/or other model simplification strategies (but keep at least one quantitative and one categorical variable in model regardless of significance), model validation, interpretations of transformed variables


Inference (15 pts):

- Do the additional variables significantly improve the model from Part I (or a simple linear regression model if you are using a new data set)? Show the formal details of a test of significance to answer this question.
- Consider adding a quadratic term for (one of) the quantitative variable(s).  Is it statistically significant? (Show details.)

- Carry out a separate lines regression analysis including the binary explanatory variable to compare the two categories. Include a scatterplot showing the two lines and interpret the interaction effect.  Perform a formal test to determine whether the linear relationships differ significantly across the groups (e.g., how the mileage and price relationship differs for two different models of car). Include statements of the hypotheses, details of the test statistic and p-value calculation, and your conclusion.

- Use Minitab to determine CIs for a mean response value and a future predicted value for at least one combination of X’s. Interpret these intervals.

        Possible extras: interpreting quadratic behavior, using a categorical variable with more than two groups, comparisons to previously constructed confidence intervals from part I
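One way to see the arithmetic behind a formal comparison of the two groups is an extra-sum-of-squares F statistic. The sketch below uses hypothetical data for two car models and compares a single common line (reduced model) with separate lines for each group (full model). Note that this particular comparison tests intercepts and slopes together; a test of the interaction alone would instead compare parallel lines with separate lines.

```python
def fit_line(xs, ys):
    """Least-squares line; returns (intercept, slope, SSE)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, sse

# Hypothetical data: age (years) and price ($1000s) for two car models.
age_a, price_a = [1, 2, 3, 4], [20, 18, 17, 14]
age_b, price_b = [1, 2, 3, 4], [15, 14, 14, 12]

# Full model: a separate intercept and slope for each group (4 parameters).
_, slope_a, sse_a = fit_line(age_a, price_a)
_, slope_b, sse_b = fit_line(age_b, price_b)
sse_full = sse_a + sse_b

# Reduced model: one common line for both groups (2 parameters).
_, _, sse_reduced = fit_line(age_a + age_b, price_a + price_b)

n = len(age_a) + len(age_b)
df_extra = 2            # extra parameters in the full model
df_error = n - 4        # error degrees of freedom for the full model
f_stat = ((sse_reduced - sse_full) / df_extra) / (sse_full / df_error)

print(f"slopes: {slope_a:.2f} (A), {slope_b:.2f} (B); F = {f_stat:.2f} on ({df_extra}, {df_error}) df")
```

The p-value would come from an F distribution with (df_extra, df_error) degrees of freedom, which Minitab or an F table can supply.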


Conclusion (5 pts): Summarize the final formulation of your model and discuss conclusions in context. Is the model valid? Is it significant? What insight does it give you about your response variable?

Overall presentation/style/communication
        Possible extras: Relating your results to other studies, great layout/organization/graphics

Appendix: Email a copy of your Minitab worksheet, clearly identifying the variables, their units, and the source of the data.

 

 

 

Part III: Logistic Regression Models (due on Friday June 4 or before)

 

You may work with up to 2 other people on this project.  Your report must be word processed with all relevant Minitab output incorporated into the body of the report.  (Graphs and output should be integrated into the report, not just as an appendix.  You can include additional details as an appendix but still need to select the relevant information as part of the discussion.)  You should again send the original Minitab worksheet as an attachment in an email to me.

 

Presentations: Be ready to make a small presentation (< 5 min!) from your 3 projects to the rest of the class June 2: What did you analyze, what were the most interesting results?  Small changes to the third report can be made between June 2 and June 4 based on the discussion that occurs during the presentations.  You should select a few graphs and a bit of output to show to the rest of the class (does not have to be a transparency or powerpoint).

 

Data: Your data should consist of a binary response variable (having just two categories) and several potential predictors.  At least two of the predictors should be good quantitative variables and one should be binary. Your goal will be to investigate logistic regression models to study how the chance of being in one of the two groups is related to the other variables.   In some cases there might be an obvious binary variable to predict, such as whether or not a sports team wins a game, whether a car model is domestic or foreign, or whether a state has a Republican or Democratic governor.  In other cases, you can “manufacture” a binary response from a quantitative variable.  For example, you might want to study “poor” vs. “rich” countries and use some cutoff on per capita GDP as a means to assign the groups.  Note that you can code the response variable as character data in Minitab (like “W” and “L”) to make the output more immediately understandable; otherwise, make sure your codes are clearly defined.

 

The report: Make sure your name(s) and Spring 2009 are prominent (especially in Word files)

 

Introduction (2 pts): Summarize the history/background of your data, any prior suspicions.

 

Descriptive Statistics (5 pts): Explore some simple relationships in your data.

- Choose a single quantitative predictor and examine a graph comparing the distribution of this variable between the two groups (e.g., an Individual Value Plot or stacked dotplots).  Describe how the distributions compare.

- Choose a single binary predictor and create a 2×2 contingency table with the response variable.  Calculate and interpret the odds ratio from this table.
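The odds ratio from a 2×2 table is the odds of success in one group divided by the odds of success in the other. A minimal sketch, using hypothetical counts:

```python
# Hypothetical 2x2 table: rows = binary predictor, columns = response.
#                   (success, failure)
table = {"group 1": (30, 10),
         "group 2": (15, 25)}

s1, f1 = table["group 1"]
s2, f2 = table["group 2"]

odds_1 = s1 / f1               # odds of success in group 1
odds_2 = s2 / f2               # odds of success in group 2
odds_ratio = odds_1 / odds_2   # same as (s1 * f2) / (f1 * s2)

print(f"odds: {odds_1:.2f} vs {odds_2:.2f}; odds ratio = {odds_ratio:.2f}")
```

With these counts, the odds of success in group 1 are estimated to be 5 times the odds in group 2.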

 

Single Predictor Model (20 pts): Choose a single quantitative predictor and run a logistic regression.

- Show (by hand) how to use the fitted model for predicting the probability of success for a particular outcome, being sure to explain what you are finding in terms of your data situation. 

- Interpret the intercept coefficient of the model in context.

- Verify Minitab’s calculation of the odds ratio.

- Include a graph of the estimated probabilities vs. the explanatory variable with commentary. 

- Discuss the meaning of the “odds ratio” value from the Minitab output and accompanying confidence interval (make sure you interpret the CI). 

- Is the predictor variable statistically significant? Provide a careful interpretation of b1 including an interpretation of H0 and Ha in words (what does it mean to say b1 =0), report the appropriate test statistic and p-value, and your conclusion in English.

- Create a prediction table and comment on the accuracy of your predictions.

- Finally, include some assessment of the appropriateness of this model and single predictor.
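The "by hand" prediction and the odds ratio check asked for above amount to a few lines of arithmetic. In the sketch below, the coefficients B0 and B1 are hypothetical; substitute the values from your own Minitab output.

```python
import math

# Hypothetical fitted model: log-odds(success) = B0 + B1 * x
B0, B1 = -3.0, 0.5

def predicted_probability(x):
    """Convert the fitted log-odds at x into a probability."""
    log_odds = B0 + B1 * x
    return 1 / (1 + math.exp(-log_odds))

def odds(x):
    p = predicted_probability(x)
    return p / (1 - p)

p_at_6 = predicted_probability(6)   # the log-odds are 0 here
ratio = odds(7) / odds(6)           # odds ratio for a one-unit increase in x

print(f"P(success | x = 6) = {p_at_6:.3f}")
print(f"odds(7) / odds(6) = {ratio:.4f}; exp(B1) = {math.exp(B1):.4f}")
```

The ratio of the odds at x + 1 to the odds at x equals exp(b1) for any x, which is the odds ratio that Minitab reports for the predictor.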

 

Multiple Predictors Model (20 pts): Choose your best model using at least two predictors (at least one quantitative).  Try to balance getting a good fit with keeping the model simple (but use at least two predictors – even if one or both are not very effective). 

- Use and describe in detail a selective backward elimination process to pare down the model to 2 predictors. 

- Include a coded scatterplot of the final 2 explanatory variables using the binary response as the groups, with commentary.

- As with the single predictor model, show (by hand) how to use the fitted model for predicting a couple of typical cases. 

- Comment on the effectiveness of each predictor in the model as well as the overall fit (using whatever parts of the Minitab output are appropriate). 

- Carry out a drop in deviance test for quadratic terms and a test for interactions.  Provide a careful interpretation of what the interaction would indicate (including a graph for your data?), regardless of whether it is statistically significant. 

- For your final model, include plots of residuals, delta, and leverage values with commentary.  Be sure to identify and attempt to comment on any unusual observations.

- Create a prediction table and compare the accuracy of your predictions to the single predictor model.
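For the drop in deviance test in the list above, the test statistic is the difference in deviance (-2 log-likelihood) between the reduced and full models, compared to a chi-square distribution with degrees of freedom equal to the number of added terms. The sketch below uses hypothetical deviances, and its chi-square tail function is a simplification that handles only 1 or 2 added terms.

```python
import math

def chi2_upper_tail(stat, df):
    """P(chi-square with df degrees of freedom > stat), for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles df = 1 or 2")

# Hypothetical deviances (-2 * log-likelihood) from two fitted models.
deviance_reduced = 54.3   # model without the extra term
deviance_full = 49.1      # model with one added term (e.g., an interaction)

drop = deviance_reduced - deviance_full
p_value = chi2_upper_tail(drop, df=1)

print(f"drop in deviance = {drop:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that the added term improves the model beyond what chance alone would explain.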

 

Examples of possible extras: More graphs/exploration; Trying several cut-off values in the prediction table; Using nonbinary categorical predictors.
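A prediction table cross-classifies predicted and observed outcomes at a chosen cut-off, and trying several cut-offs shows how the accuracy changes. The sketch below uses hypothetical fitted probabilities and outcomes; in practice the probabilities would be stored from your Minitab fit.

```python
# Hypothetical fitted probabilities and observed outcomes (1 = success).
fitted = [0.91, 0.78, 0.66, 0.58, 0.45, 0.37, 0.22, 0.10]
observed = [1, 1, 1, 1, 0, 1, 0, 0]

def prediction_table(cutoff):
    """Counts keyed by (predicted, observed), predicting success when p >= cutoff."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for p, y in zip(fitted, observed):
        predicted = 1 if p >= cutoff else 0
        counts[(predicted, y)] += 1
    return counts

def accuracy(cutoff):
    """Proportion of cases whose predicted outcome matches the observed one."""
    counts = prediction_table(cutoff)
    return (counts[(1, 1)] + counts[(0, 0)]) / len(observed)

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, prediction_table(cutoff), f"accuracy = {accuracy(cutoff):.3f}")
```

Keep in mind that accuracy measured on the same data used to fit the model tends to be optimistic.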

 

Conclusion (3 pts): Provide an overall summary of the analyses in this project, including a recommendation of which model you would use.  Also include a critique of your analysis and suggestions for future analyses with your data.

 

Appendix: Email a copy of your Minitab worksheet, clearly identifying the variables, their units, and the source of the data.

 

Extra Credit Option: Create an ordinal or nominal regression model.

 

My main goal is to see if you can carry out a slight extension to the procedures we have discussed.  I don’t expect perfection here, but will give credit for serious attempts.  For example, ordinal or nominal logistic regression applies when the response variable is categorical with more than 2 categories. If the categories are ordered (e.g., strongly disagree, disagree, neutral, agree, strongly agree), the response is ordinal.  Ordinal logistic regression assumes parallel lines.  If this is not a reasonable assumption (or the categories are not ordered), use nominal logistic regression. These correspond to the later options in the Minitab menu under Stat > Regression. 

The Minitab help menus should be useful (see examples and interpreting results.)

If you refer to any other sources, be sure to cite them.  Examples that follow are from Agresti’s Introduction to Categorical Data Analysis text.

 

Example:  Suppose we have data on alligators recording their length (in meters) and the primary food type found in each alligator’s stomach: Fish, Invertebrate, or Other.  Fitting a nominal logistic regression model, which estimates logits for (J - 1) pairs of categories (here with Other as the baseline), we see:

Logistic Regression Table

                                                  Odds         95% CI
Predictor       Coef    SE Coef        Z     P    Ratio    Lower    Upper
Logit 1: (I/O)
Constant       5.697      1.794     3.18 0.001
length       -2.4654     0.8997    -2.74 0.006     0.08     0.01     0.50
Logit 2: (F/O)
Constant       1.618      1.307     1.24 0.216
length       -0.1101     0.5171    -0.21 0.831     0.90     0.33     2.47

 

log(F/O) = 1.618 - 0.1101 length                    log(I/O) = 5.697 - 2.465 length

Consequently, log(F/I) = log(F/O) - log(I/O) = -4.079 + 2.35 length

For alligators of length x+1 meters, the estimated odds that the primary food type is “fish” rather than “invertebrate” are e^2.35 ≈ 10.49 times the estimated odds for alligators of length x meters.

You can convert these logits to find the estimated probabilities as a function of length for each of the 3 groups.  Interpret the log-likelihood and goodness of fit tests the same way as before.
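The conversion from logits to estimated probabilities can be sketched as follows, using the coefficients from the Minitab output above with Other as the baseline category. The length of 2 meters is an arbitrary value chosen for illustration.

```python
import math

# Coefficients from the Minitab output above (baseline category: Other).
A_I, B_I = 5.697, -2.4654    # Logit 1: log(I/O)
A_F, B_F = 1.618, -0.1101    # Logit 2: log(F/O)

# log(F/I) = log(F/O) - log(I/O)
a_fi = A_F - A_I
b_fi = B_F - B_I

def food_probabilities(length):
    """Estimated P(Invertebrate), P(Fish), P(Other) at a given length (meters)."""
    odds_i = math.exp(A_I + B_I * length)   # odds of I relative to O
    odds_f = math.exp(A_F + B_F * length)   # odds of F relative to O
    total = 1 + odds_i + odds_f
    return {"I": odds_i / total, "F": odds_f / total, "O": 1 / total}

probs = food_probabilities(2.0)   # 2 meters, chosen only for illustration

print(f"log(F/I) = {a_fi:.3f} + {b_fi:.4f} * length")
print({food: round(p, 3) for food, p in probs.items()})
```

The three probabilities sum to 1 at every length. Carrying all the decimals gives a slope of 2.3553 for log(F/I), which the text above reports as 2.35.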

 

Example: Suppose we have information on people’s political ideology (very liberal, slightly liberal, moderate, slightly conservative, very conservative; j = 1, 2, 3, 4, 5) and their political party affiliation (Democrat, Republican). Applying an ordinal logistic regression model, we have (with Republicans as the reference level):

Logistic Regression Table

                                                   Odds        95% CI
Predictor       Coef    SE Coef        Z     P    Ratio    Lower    Upper
Const(1)     -2.4690     0.1318   -18.73 0.000
Const(2)     -1.4745     0.1091   -13.52 0.000
Const(3)     0.23712    0.09485     2.50 0.012
Const(4)      1.0695     0.1046    10.23 0.000
party    
 1            0.9745     0.1291     7.55 0.000     2.65     2.06     3.41

For any political ideology category j, the estimated odds that a Democrat’s response is in the liberal direction rather than the conservative direction (Y ≤ j vs. Y > j) are e^0.975 = 2.65 times the estimated odds for a Republican, indicating that Democrats tend to be more liberal than Republicans. (The odds ratio applies to each of the four collapsings into a 2×2 table.)  These models work with cumulative logits: the log-odds of P(Y ≤ j) = a_j + bx. So if x = 1 (Democrats), the log-odds of P(very liberal) = -2.469 + 0.9745 = -1.49.

So the odds of “very liberal” are e^-1.49 = 0.224 and P(very liberal) = 0.224/1.224 = 0.183.  There were 428 Democrats, so we expect about 0.183(428) ≈ 78.4 of them to be very liberal.  We get the following fitted values:

 

              Very liberal   Slightly lib   Moderate   Slightly conserv   Very conserv   Total
Democratic        78.4           83.2         168.2          49.1             49.1         428
Republican        31.8           44           151.7          75.5            104.0         407
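The fitted values above can be reproduced from the reported coefficients: each cumulative logit gives P(Y ≤ j), and differences of consecutive cumulative probabilities give the category probabilities. The sketch below assumes the coding x = 1 for Democrats and x = 0 for Republicans, as in the example.

```python
import math

# Coefficients from the ordinal logistic regression output above.
CUTPOINTS = [-2.4690, -1.4745, 0.23712, 1.0695]   # Const(1)..Const(4)
B_PARTY = 0.9745                                  # x = 1 Democrat, x = 0 Republican

def category_probabilities(x):
    """P(Y = j) for j = 1..5, from cumulative logits: logit P(Y <= j) = a_j + b*x."""
    cumulative = [1 / (1 + math.exp(-(a + B_PARTY * x))) for a in CUTPOINTS]
    cumulative.append(1.0)                        # P(Y <= 5) = 1
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, 5)]
    return probs

def fitted_counts(x, group_size):
    """Expected number of people in each ideology category."""
    return [group_size * p for p in category_probabilities(x)]

democrats = fitted_counts(1, 428)
republicans = fitted_counts(0, 407)

print([round(c, 1) for c in democrats])
print([round(c, 1) for c in republicans])
```

Small differences in the last decimal place relative to the table can arise from rounding in the reported coefficients.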