STAT 150       Project 3: Regression and Prediction

 

In this project you will apply a regression analysis to data that you find.  The goal is to find a useful regression model for predicting the values of one variable based on the values of several other variables. 

 

You will be assigned to work with one new partner, with whom you should collaborate on all aspects of this project.

 

Timeline:

 

As always, you should incorporate all computer output into the body of the report.  Your report should be well-organized and self-contained.  Write as if your audience is college students who know some statistics.  Do not simply provide a list of answers to the questions posed below, which are meant to guide your analysis. 

 

For your oral presentation, you should have graphical and numerical summaries prepared for display, separate from the written report.  Your oral presentation should last 4-6 minutes.  Because of this time restriction, do not try to present all of your analyses; choose the most interesting aspects to present.  All members of the team must participate in the presentation.

 

Feel free to ask your instructors for help and feedback as you work on this.  Please do not wait until the last minute.

 

Notice that the last page of this document gives advice concerning things that you should have learned from earlier projects.  Please follow this advice in preparing your written report and oral presentation.

 

Project Components:

1. Collect the Data

You will need to gather data for at least 20 observational units.  Your data should include one quantitative variable that you will consider the response variable (to be predicted) and at least two potential explanatory variables, at least one of which must be quantitative.  You can collect these data either from the web or another reference source, or by designing your own survey or experiment.  Some topics considered by students in previous years include:

·         Finding jewelry prices online to investigate variables (such as number of carats and type of metal) that might be related to a bracelet’s price.

·         Finding apartment rental data online to investigate variables (such as size and distance from campus) that might be related to the price of the rent.

·         Finding NBA player salary data online to investigate whether variables such as age when entering the league and points scored per game are useful predictors of salary.

·         Taking a random sample of textbooks from the bookstore and trying to predict the price of a book based on its number of pages and year of publication.

You should try to make your sample selection process as “random” as possible.  There are many, many sources of data on the internet.  Be sure to mention in your report how, where, and when you collected the data, and any potential measurement problems with the data.

 

2. Examine the Data

a. Start with numerical and graphical summaries of each quantitative variable individually.  Comment on what they reveal.  (Recall that graphical summaries for a quantitative variable include dotplots and boxplots.  Numerical summaries should address aspects of center and spread.)

b. Describe the relationship between the response variable and each quantitative explanatory variable (one at a time).  Include appropriate graphical displays (scatterplots).  Comment on what they reveal (direction, strength, form).

 

c. Decide whether to apply transformations to any of your quantitative variables.  If a scatterplot is not linear, try a log transformation of the explanatory variable only, of the response variable only, or of both (e.g., MTB> let c4=logt(c3)).  If any of these three possibilities leads to a linear form in the scatterplot, proceed with the transformed data.  If not, you should find another quantitative variable.

 

d. If you have a categorical explanatory variable, create coded scatterplots of your response variable vs. each quantitative explanatory variable, using your categorical variable for the coding.  Comment on what these reveal.
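
The transformation check in step 2c can also be sketched numerically.  The Python snippet below is only an illustration and not part of the Minitab workflow: the data are made-up values that follow an exact power law, pearson_r is a hypothetical helper, and base-10 logs are used to match Minitab's logt.  It compares the correlation coefficient for the raw data and for the three log options.  A correlation near 1 or -1 is only a rough aid; the scatterplot itself should decide whether the form is linear.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only: y = 3 * x^2 exactly, so the
# raw scatterplot curves but the log-log scatterplot is a straight line.
x = [1, 2, 4, 8, 16, 32]
y = [3 * value ** 2 for value in x]

log_x = [math.log10(value) for value in x]  # base 10, like Minitab's logt
log_y = [math.log10(value) for value in y]

# The untransformed pair plus the three log possibilities.
results = {
    "raw": pearson_r(x, y),
    "log x": pearson_r(log_x, y),
    "log y": pearson_r(x, log_y),
    "log both": pearson_r(log_x, log_y),
}

best = max(results, key=lambda name: abs(results[name]))
for name, r in results.items():
    print(f"{name:>8}: r = {r:.4f}")
print("strongest linear association:", best)
```

All values must be positive before taking logs, so variables containing zeros or negative values would need a different approach.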

 

3. Fit Regression Models

a. Start with the quantitative explanatory variable that is most strongly associated with the response variable.  Determine the equation of the least squares line, and produce a “fitted line plot” with this regression line superimposed on the scatterplot.  Report and interpret the value of R².  Report and interpret the values of the slope and intercept coefficients (in context). Report the p-value and comment on what it indicates about the relationship between these variables.

 

b. Pick one observational unit from your data set that you suspect has a large influence on the regression model.  You might use Minitab’s output of unusual observations to help you to pick one.  Identify it by name if possible.  Remove that observation and re-create your analysis.  Comment on how much this observation seems to have influenced your results (regression line, R², and p-value).

 

c. Fit a multiple regression model using all of your explanatory variables.  (Note: If you have a categorical variable, code it as 0s for one category and 1s for the other category.)  Report and interpret the value of R².  Report and interpret the values of all model coefficients (slopes and intercept).  Report all coefficient p-values and comment on what they indicate about which variables are useful to include in the model.

 

d. Again pick one observation to remove from your analysis. Re-do the analysis with this observation excluded, and comment on the influence of this observation.

 

e. Compare these two models and comment on which you think is better for predicting the response variable.  Feel free to try additional models, and then report the one that you think is best, along with your justification. 
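
Steps 3a and 3b can be illustrated outside of Minitab with a short Python sketch.  Everything in it is an assumption made for illustration: the book titles, page counts, and prices are made up, and fit_line is a hypothetical helper.  It fits the least squares line, then refits after removing one suspected influential observation so the two sets of results can be compared.  It reports the slope's t statistic rather than a p-value, because converting t to a p-value requires a t distribution with n - 2 degrees of freedom, which Minitab reports directly.

```python
import math

def fit_line(xs, ys):
    """Least squares line of ys on xs, with R^2 and the slope's t statistic."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return {"intercept": intercept, "slope": slope,
            "r2": 1 - sse / syy, "t_slope": slope / se_slope}

# Made-up textbooks for illustration only (hypothetical titles,
# number of pages, and prices in dollars).
titles = ["Algebra", "Biology", "Chemistry", "Drama", "Economics", "Atlas"]
pages = [100, 200, 300, 400, 500, 1000]
price = [20, 32, 38, 52, 60, 40]

full = fit_line(pages, price)

# Refit without the book suspected of having a large influence.
suspect = titles.index("Atlas")
pages_kept = [p for i, p in enumerate(pages) if i != suspect]
price_kept = [p for i, p in enumerate(price) if i != suspect]
reduced = fit_line(pages_kept, price_kept)

for label, fit in (("all books", full), ("without Atlas", reduced)):
    print(f"{label}: price = {fit['intercept']:.2f} "
          f"+ {fit['slope']:.4f} * pages, "
          f"R^2 = {fit['r2']:.3f}, t = {fit['t_slope']:.2f}")
```

In this made-up example the slope and R² change substantially when the one book is removed, which is the kind of change the report should describe in context.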

 

 

Final Report

Your final report should consist of the following sections (with section headings).  Remember to write in full and complete sentences, and feel free to be creative.

 

I. Introduction - Why was this data set of interest and what did you expect to see?  Why did you think the explanatory variables you chose would help to explain the variability in the response variable?

 

II. Summary of Data Collection Methods - Where, when, how did you find the data?  Are there any potential measurement issues?  Are there any definitions the reader should be aware of?  Make sure it is clear how the reader could find the data for himself or herself.  (We also ask that you provide us with a Minitab file containing your data.)

 

III. Analysis of Results (remember to sprinkle in the appropriate computer output throughout)

Write a section describing the distributions of the variables individually.
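
As a rough illustration of the numerical summaries this section calls for, the Python sketch below computes measures of center and spread for a single quantitative variable.  The rent values are made up, and five_number_summary is a hypothetical helper.  Quartile conventions differ between software packages, so these quartiles may not match Minitab's output exactly.

```python
import statistics

# Made-up monthly rents in dollars, for illustration only.
rent = [850, 920, 780, 1100, 990, 1250, 870, 1020, 940, 1600]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)

# Center: mean and median.
center_mean = statistics.mean(rent)
center_median = statistics.median(rent)

# Spread: sample standard deviation and interquartile range.
spread_sd = statistics.stdev(rent)
low, q1, med, q3, high = five_number_summary(rent)
iqr = q3 - q1

print(f"mean = {center_mean}, median = {center_median}")
print(f"SD = {spread_sd:.1f}, IQR = {iqr}")
print(f"five-number summary: {low}, {q1}, {med}, {q3}, {high}")
```

A mean noticeably larger than the median, as with these made-up rents, suggests a distribution skewed to the right, which a dotplot or boxplot should confirm.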

 

Write a section describing the relationships between the quantitative variables.  Explain why you decided to use transformations or not (include any relevant scatterplots in your discussion).  Indicate which quantitative explanatory variable you considered more strongly related to the response, and explain why.

 

Write a section describing your two regression models.  Include all of the output and the interpretations (of R², of coefficients, of p-values) mentioned above.  Indicate which model you prefer, and justify your choice.
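
To illustrate how the two models might be compared, the Python sketch below fits a one-variable model and a multiple regression model with a 0/1 coded categorical variable.  It is a minimal sketch under stated assumptions: the apartment data are made up, solve and fit are hypothetical helpers that solve the least squares normal equations directly, and adjusted R² is one common way (not the only way) to compare models with different numbers of explanatory variables.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit(columns, y):
    """Least squares with an intercept; returns (coefs, R^2, adjusted R^2)."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])  # number of coefficients, including the intercept
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    return coefs, r2, adj_r2

# Made-up apartments for illustration only.
size = [500, 650, 700, 800, 900, 1000, 1100, 1200]       # square feet
dist = [1.0, 2.5, 0.5, 3.0, 1.5, 2.0, 0.8, 3.5]          # miles from campus
furnished = ["yes", "no", "yes", "no", "no", "yes", "no", "yes"]
rent = [820, 790, 980, 850, 1010, 1150, 1190, 1180]      # dollars per month

# Code the categorical variable as 0s and 1s.
coded = [1 if label == "yes" else 0 for label in furnished]

simple = fit([size], rent)
multiple = fit([size, dist, coded], rent)

for name, (coefs, r2, adj_r2) in (("simple", simple), ("multiple", multiple)):
    rounded = [round(c, 3) for c in coefs]
    print(f"{name}: coefficients = {rounded}, "
          f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

The coefficient on the 0/1 variable estimates the difference in predicted response between the category coded 1 and the category coded 0, holding the other explanatory variables fixed.  R² never decreases when variables are added, so a higher R² alone does not show that the larger model is better.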

 

IV. Conclusion - Summarize what you learned about the relationships among these variables.  Suggest some additional variables/comparisons that could be used in future analyses and what you think those analyses might reveal.

 

 

 


Some things to remember/learn from previous projects about the written report:

·        Give your report a descriptive title, something more informative and creative than “Regression Project.”

·        Use section headings, and perhaps sub-headings, to organize your report and help your reader to see the overall structure.

·        Make the report look nice, for example with good formatting and well-placed page breaks.

·        Integrate graphs directly into your paragraphs, using descriptive lead-ins, where they naturally come up.  Write things like “As the scatterplot below reveals ...”  If instead you put graphs at the end of the report, label them as “Figure 1,” ... and in your paragraphs say things like “As shown in Figure 1 at the end of this report ...”

·        Feel free to improve on Minitab’s presentations.  For example, you can format and reorder numerical summaries.  You can also add/change labels to graphs (if you double click on a graph you get some “tools” you can use).

·        Support your conclusions with clear references to graphical and/or numerical summaries and computer output.

·        Edit computer output before including it in your report.  Only include the necessary parts.  Delete parts that are not related to your analysis and conclusions.

·        Include a well-written introduction and conclusion.

·        Be very careful with your statistical language, particularly terms like “random” and “correlation” and “effect” and “significant.”

·        Identify unusual observations by name, if possible.

·        When you think the report is done, remember to proofread and especially to compare what you have done to what the assignment asked for.  If each member of your team writes a separate section of the report, be sure to proofread and make suggestions for improving each other’s sections.  Also make sure that the report reads well as a coherent whole and not just as a collection of pieces stapled together.

·        For a group project such as this, use “we” and not “I” throughout the report. The report should be written collaboratively.

 

Some things to remember/learn from previous projects about the oral presentation:

·        Pay attention to, and make efforts to improve on, the individual feedback that we have provided on your earlier presentations.  (This will carry considerable weight in determining the score for your oral presentation grade.)

·        Feel free to be creative to make your oral presentation and written report interesting and appealing.

·        Begin your oral presentation with a slide displaying your title and names.  Also include a slide giving an outline of your presentation.

·        Use visual aids (e.g., PowerPoint slides) to convey information such as your research question, observational units, response and explanatory variables.

·        When you talk about graphs and statistics and computer output, be sure to display them, in a well-labeled manner, so the audience can follow along.

·        Rehearse your presentation with your partner, and make sure that it fits within the time restrictions. 

·        Show enthusiasm and maintain eye contact during your oral presentation. 

·        Take an interest in each other’s presentations and ask questions!