Stat 512 – Final Review

 

Chance Office Hours: Mon 12-2pm, Tues 8-9am, Wed 9:30-10:30am, 1-2pm, Fri 12-2pm

Final Review Session:  Tuesday, 7-9pm, in regular studio classroom (02-206)

Final Exam Times: Wednesday, 4-7pm, and Friday 7-10pm

If you forget which day your exam is, see http://statweb.calpoly.edu/chance/stat512/finalSched.html

 

Final Exam:  The final will mostly focus on the material since the midterm, but will have a cumulative component.  The final will be a three-hour exam in the studio.  The format will be similar to the midterm (mixture of short answer and analysis questions) and is open book, notes, computers.  You should bring your course notes and a calculator.  You will probably be asked to carry out analyses in Minitab as well as to interpret Minitab output.

 

Things you can do for yourself:

·       Fill out the definitions of symbols we have used (see last page of handout)

·       Be extra familiar with the “Overview of Statistical Procedures” handout and with the table at the beginning of Lecture 7.

·       Organize your notes with respect to Minitab commands/menus we have used this quarter.

·       Work problems!  Put pencil to paper and go through the calculations and explanations (without using the word “it”), especially for confidence intervals and significance tests.  Browsing the text and notes is not enough; you need to put yourself in the test-taking environment and start the problem from scratch.  You want the calculations to be automatic so they don’t slow you down on the exam.  Pick up graded (printed) assignments from the wooden box outside my office.  Reworking old homework questions and class examples will allow you to check your answers with the key, while providing you with “representative questions” (solutions at statweb.calpoly.edu/chance/stat512/).  It’s possible your projects will be graded and available to you on Monday.

·       Pretend the final is closed book.

·       See earlier review handout for Midterm

·       Ask questions!  Use the email list and instructor to your advantage to clear up concepts. 

·       Get plenty of sleep

 

Topics since the Midterm

 

From Lecture 12 you should be able to:

·       Recognize when the research question of interest is a comparison of two independent populations

·       Carry out and interpret a two sample z-test and two sample z-interval for the difference in population proportions, π1 − π2

o      Realize that the SE calculation differs slightly for CI and for test statistic

o      Realize what factors affect the p-value and the width of the confidence interval

o      The conditions required for validity of the procedure and how to check them

·       Carry out and interpret a two sample z-test and two sample z interval for the true treatment effect between two proportions (large-sample approximation to two-way table simulation)

o      The difference between an empirical randomization distribution and an empirical sampling distribution.

·       Duality between confidence intervals and tests of significance
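The Lecture 12 point about the standard error can be sketched in Python.  This is only an illustration under stated assumptions: the counts are made up, and the function name two_prop_z is hypothetical rather than anything from the course or Minitab.  The test statistic pools the two samples because H0 says the proportions are equal, while the interval uses each sample's own proportion.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z(x1, n1, x2, n2, conf=0.95):
    """Two-sample z-test (two-sided) and z-interval for pi1 - pi2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Test statistic: pooled SE, because H0 assumes pi1 = pi2.
    pooled = (x1 + x2) / (n1 + n2)
    se_test = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = diff / se_test
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled SE, no assumption that pi1 = pi2.
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    interval = (diff - z_star * se_ci, diff + z_star * se_ci)
    return z, p_value, se_test, se_ci, interval

# Hypothetical counts: 30 successes out of 100 vs. 45 out of 100.
z, p_value, se_test, se_ci, interval = two_prop_z(30, 100, 45, 100)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}")
print(f"SE (test) = {se_test:.4f}, SE (interval) = {se_ci:.4f}")
print(f"95% CI for pi1 - pi2: ({interval[0]:.3f}, {interval[1]:.3f})")
```

Changing the sample sizes or the confidence level in the call is a quick way to see which factors move the p-value and the interval width.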


From Lecture 13 you should be able to:

·       Recognize when the research question of interest is a comparison of two independent means

·       Carry out and interpret a two sample t-test and two sample t-interval for the difference in population means

o      Understand the need for the t distribution

o      Realize what factors affect the p-value and the width of the confidence interval

o      In particular, understand the benefits and methods of reducing variability

o      The conditions required for validity of the procedure and how to check them

·       Carry out and interpret a two sample t-test and two sample t interval for the true treatment effect between two means (large-sample approximation to randomization test)

·       Define the implications of a Type I Error in a particular context

·       Recognize when the research question of interest leads to a chi-square analysis

o        Comparing two or more population distributions on a categorical variable

o        Association between two categorical variables in one population

·       Carry out and interpret a chi-square test (see also Lecture 14)

o      State hypotheses corresponding to how data were collected (case 1, 2 or 3)

o      Understand what expected counts represent and how they are calculated

o      Perform follow-up analysis of cell contribution values

o      The conditions required for validity of the procedure and how to check them

o   Randomness condition also depends on which case you are in
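As a rough sketch of how expected counts and cell contributions are computed, the Python below uses the seven-game-series counts quoted in the additional review problems of this handout (baseball, hockey, basketball).  The variable names are illustrative assumptions, and the p-value line relies on a shortcut that holds only for 2 degrees of freedom; in the course you would read the p-value from Minitab.

```python
from math import exp

# Observed counts from the seven-game-series review problem in this handout:
# columns are baseball, hockey, basketball.
went_seven = [44, 111, 85]
series_total = [131, 471, 303]
observed = [went_seven, [t - w for t, w in zip(series_total, went_seven)]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = (row total x column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Cell contribution = (observed - expected)^2 / expected.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]
chi_square = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(col_totals) - 1)

# With exactly 2 degrees of freedom the chi-square upper tail area is exp(-x/2);
# this shortcut does not work for other degrees of freedom.
p_value = exp(-chi_square / 2) if df == 2 else None

print("expected counts:", [[round(e, 1) for e in row] for row in expected])
print("cell contributions:", [[round(c, 2) for c in row] for row in contributions])
print(f"chi-square = {chi_square:.2f} on {df} df")
```

The cells with the largest contributions are the ones to discuss in a follow-up analysis, and the expected counts are what you check against the validity condition.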

 

From Lecture 14 you should be able to:

·       Recognize when the research question of interest leads to ANOVA

·       Carry out and interpret a one-way Analysis of Variance

o      Understand how ANOVA compares between group variability to within group variability

o      Realize what factors affect the p-value

o      The conditions required for validity of the procedure and how to check them

·       Carry out and interpret Tukey’s Multiple Comparison procedure
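To see how ANOVA weighs between-group variability against within-group variability, here is a small Python sketch that computes the F statistic from summary statistics alone.  It uses the sample sizes, means, and standard deviations from the instructors A, B, and C table in the additional review problems; the function name anova_f is a hypothetical helper, not a Minitab command.

```python
def anova_f(sizes, means, sds):
    """One-way ANOVA F statistic computed from group summary statistics."""
    k = len(sizes)
    n_total = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total

    # Between-group variability: how far the group means sit from the grand mean.
    ms_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means)) / (k - 1)
    # Within-group variability: pooled variance of observations around their own mean.
    ms_within = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds)) / (n_total - k)
    return ms_between / ms_within

sizes = [10, 10, 10]
f_a = anova_f(sizes, [50, 60, 70], [24, 24, 24])  # instructor A
f_b = anova_f(sizes, [50, 60, 70], [5, 5, 5])     # instructor B
f_c = anova_f(sizes, [57, 60, 63], [5, 5, 5])     # instructor C
print(f"F for A = {f_a:.2f}, B = {f_b:.2f}, C = {f_c:.2f}")
```

Comparing A with B isolates the effect of within-group spread, and comparing B with C isolates the effect of how far apart the means are.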

 

From Lecture 15 you should be able to:

·       Carry out an Analysis of Variance with multiple explanatory variables (e.g., “two-way”)

o      Interpret and test for an interaction between two explanatory variables

o   Does the effect of one variable depend on the outcome of the other?  (E.g., the city route to school is faster in the morning but the freeway route is faster in the evening.)

o      Recognize the benefits of including a “blocking” variable in the design and analysis

o   E.g., what’s wrong with treating “paired” data as independent samples?

o   Carry out the randomization within each block (e.g., order of chips)

o   How to enter data into Minitab and carry out a repeated measures analysis

o      How to interpret the individual p-values for each factor

·       Understand the benefit of reducing the “Error” component

 

From Lecture 16 you should be able to:

·       Construct and interpret a scatterplot to analyze the relationship in two quantitative variables

o      EV on horizontal, RV on vertical axis

o      Discuss form, direction, and strength

·       Report and interpret the correlation coefficient

o      Think about properties of r (units, possible values, resistance)

o      Remember to see if relationship is linear first

·       Calculate and interpret residuals

·       Explain what is meant by “least squares” regression

·       Calculate and interpret the least-squares regression line

o      How to write it out as an equation for a line to predict the response variable

o      Think about properties of the least squares line (resistance)

o      Interpret the slope and intercept coefficients in context (on average/predicted)

o      Use the regression line to make predictions

·       Report and interpret the r² value

o      Relationship to r
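The Lecture 16 calculations can be checked by hand against a short Python sketch.  It uses data set A from the correlation review problem later in this handout; the variable names are illustrative, and the approach is the usual sums-of-squares formulas rather than a reproduction of Minitab's internals.

```python
from math import sqrt

# Data set A from the correlation review problem in this handout.
xs = [1, 2, 3, 4]
ys = [3, 5, 6, 8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

slope = sxy / sxx                  # least-squares slope
intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
r = sxy / sqrt(sxx * syy)          # correlation coefficient

predicted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # observed - predicted
sse = sum(e ** 2 for e in residuals)  # the quantity least squares minimizes
r_squared = 1 - sse / syy

print(f"predicted y = {intercept:.2f} + {slope:.2f} x")
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
print("residuals:", [round(e, 2) for e in residuals])
```

Note that r² computed from the residuals agrees with the square of r, and that the residuals sum to zero.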

 

From Lecture 17 you should be able to:

·       Discuss and check the conditions for the basic regression model

o      Produce and interpret residual plots

·       Understand when there is a need to transform data and how to decide whether a transformation has been effective

·       Discuss the sampling distribution of the sample slope (what does it represent? what does it tell you?)

o      We looked at the sampling distribution, but the randomization distribution looks just like it!

·       Produce and interpret Minitab output for carrying out inference for a regression slope

o      State hypotheses in terms of β, no association (be sure to ID the two variables)…

o      Adjust test statistic and p-value if hypothesized value differs from 0 or Ha is one-sided (Minitab reports the two-sided p-value for a hypothesized slope of 0)
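A sketch of the test statistic for a regression slope, using the Coef and SE Coef values from the photo-volume Minitab output in the sample problems of this handout.  The helper slope_t and the alternative null value of 1 are hypothetical, chosen only to show how the statistic changes when the hypothesized slope is not 0; the p-value would still come from a t distribution with the error degrees of freedom.

```python
# Values read from the photo-volume regression output in this handout.
b = 1.13710        # sample slope (Coef for Photo)
se_b = 0.08953     # standard error of the slope (SE Coef)
f_reported = 161.31
df_error = 18      # Residual Error DF from the ANOVA table

def slope_t(b, se_b, beta_0=0.0):
    """t statistic for H0: beta = beta_0."""
    return (b - beta_0) / se_b

t_zero = slope_t(b, se_b)             # Minitab's default test, H0: beta = 0
t_one = slope_t(b, se_b, beta_0=1.0)  # hypothetical alternative null, H0: beta = 1

print(f"t for H0: beta = 0 is {t_zero:.2f} on {df_error} df")
print(f"t squared = {t_zero ** 2:.1f} (compare with the ANOVA F of {f_reported})")
print(f"t for H0: beta = 1 is {t_one:.2f}")
```

With one explanatory variable, the square of the slope's t statistic matches the F statistic in the ANOVA table, which is why either can be used for the test of no linear association.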

 

From Lecture 18 you should be able to:

·       Use Minitab to produce a matrix scatterplot and decide which explanatory variable(s) appear to have the strongest linear relationship with the response variable

·       Use Minitab to produce a multiple regression equation

o      Interpret slope coefficients (all other variables constant)

o      Understand the benefits of including more explanatory variables in the model

o      Check the validity of inference procedures using residual plots

o      Analyze the overall F test/Model Utility test (state appropriate hypotheses)

o      Analyze the individual t tests (state appropriate hypotheses)

o      Interpret the R² value and use R²(adj) to compare models

·       Be cautious of “multicollinearity”

o      Are the explanatory variables strongly related?  Could make slope estimates unstable.

·       Include a binary explanatory variable in the regression model

o      Interpret slope coefficient and p-value
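To connect R², R²(adj), and the overall F statistic to the ANOVA table, the Python sketch below recomputes them from the sums of squares in the drinks-per-week Minitab output in the sample problems.  The variable names are illustrative assumptions.

```python
# Sums of squares and degrees of freedom from the drinks-per-week ANOVA table.
ss_regression, df_regression = 371.26, 4
ss_error, df_error = 191.29, 15
ss_total, df_total = 562.55, 19

r_squared = 1 - ss_error / ss_total
# R^2(adj) compares mean squares, so it is penalized for each extra predictor.
r_squared_adj = 1 - (ss_error / df_error) / (ss_total / df_total)
# Overall F (model utility) statistic = MS(Regression) / MS(Error).
f_stat = (ss_regression / df_regression) / (ss_error / df_error)

print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {r_squared_adj:.1%}, F = {f_stat:.2f}")
```

Because R²(adj) divides each sum of squares by its degrees of freedom, adding a predictor that barely reduces the error sum of squares can lower R²(adj) even though R² itself never decreases.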

 

Some Big Picture Ideas

·       Observational units, variables, population vs. sample, parameter vs. statistic

o      State variables as variables

o      Define parameters as numbers

·       The statistical process (Formulating research question, data collection methods, descriptive statistics – numerical and graphical summaries, inferential statistics, stating conclusions)

o      In stating conclusions, consider statistical significance, generalizability and cause and effect

·       Experiments vs. observational studies

o      How to design an experiment, How to properly select a sample 

o      Scope of conclusions depends on how study was conducted (Can you draw a cause and effect conclusion? Can you generalize to a larger population?)

o      Sampling errors, nonsampling errors, and random sample errors (and which of these are measured by the “margin of error” and p-value?), how do we reduce them?

o      Distinguishing between random samples and randomization in hypotheses and statements of study conclusions

o      Association vs. causation

·       Describing and comparing distributions of data

o      Categorical: segmented bar graphs, conditional percentages, difference in proportions

o      Quantitative: shape, center, and spread, boxplots, histograms, dotplots, resistance of median and IQR

·       Carrying out and interpreting descriptive and inferential procedures (e.g., population vs. sample results)

o      Determining which descriptive and inferential procedures to use based on the research question

o      When and how to use “paired t” and “repeated measures” procedures (e.g., twins, shopping data, chocolate melting)

o      How to state technical conditions for the procedures and how to check them

o      Sample size effects

·       Using simulation to generate empirical sampling distributions and approximate p-values

·       How to interpret “confidence” (level) and p-values in context and factors affecting them

 

What is Statistical Inference?

We will consider two types of inference: inference about populations and inference about treatment effects.  The parameter and the statistic summarize the same variable.  The parameter summarizes the variable for the population or for the true underlying treatment effect, which is what we want to know, e.g. π or μ1 − μ2.  We can’t observe the whole population or all possible outcomes of the randomization process, so we don’t know what the parameter value really is.  However, we can measure the variable on a random sample or on randomized groups and compute a sample statistic, e.g. p̂ or x̄1 − x̄2.  The question is what we can infer about the parameter based on this statistic.  Since we saw that the sampling/randomization distributions of these statistics follow a regular pattern, we can calculate probabilities of different values of the statistic occurring.  Different statistics follow different distributions, but once we know which distribution to use, we can make statements about the value of the parameter, e.g. that it lies in some interval or that we have evidence it is not a particular value.

Sampling Distributions  If we specify a value for the population parameter, we can take (or simulate) lots of random samples from this population and calculate a statistic for each sample.  This allows us to examine the behavior (distribution) of the statistic so we can discuss the shape, center, and variability of that distribution.  For example, what types of values do we expect the statistic to have, how far away is the observed statistic from the hypothesized parameter? 

Randomization Distributions  If we assume no treatment effect, we can simulate lots of different randomizations of the experimental units to treatment groups and calculate a statistic for each randomization.  This allows us to examine the behavior (distribution) of the statistic so we can discuss the shape, center, and variability of that distribution.  For example, what types of values do we expect the statistic to have, how far away is the observed value from the center of zero (we know the distribution centers at zero since we told the simulation that there was no treatment effect)? 

Therefore, seeing where the statistic falls in this distribution gives us an indication of whether it is a surprising outcome or not, given our (null) hypothesis about the population/treatment effect.
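The randomization idea can be sketched as a short Python simulation.  The response values are hypothetical, and the function name randomization_p_value is an assumption for illustration; the course applets and Minitab macros do the same job.

```python
import random
from statistics import mean

def randomization_p_value(group1, group2, reps=10_000, seed=512):
    """Approximate two-sided p-value for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group1) - mean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)

    extreme = 0
    for _ in range(reps):
        # Under H0 (no treatment effect) the group labels are arbitrary,
        # so re-randomize the units to groups and recompute the statistic.
        rng.shuffle(pooled)
        diff = mean(pooled[:n1]) - mean(pooled[n1:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / reps

# Hypothetical responses for two randomized treatment groups.
treatment = [12, 15, 14, 18, 16, 17]
control = [10, 11, 13, 9, 12, 14]
observed, p_value = randomization_p_value(treatment, control)
print(f"observed difference = {observed:.2f}, approximate p-value = {p_value:.4f}")
```

The proportion of re-randomizations at least as extreme as the observed difference is the empirical p-value; a small proportion means the observed result would be surprising if there were no treatment effect.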

 

Hypothetical Inference

Often times, we do not have a random sample or randomization.  Technically, inference should not be performed in these cases.   However, it may still be of interest to ask whether we can eliminate “chance” as an explanation for the differences observed in the study.  You may ask, how often would we expect to find a result this extreme if it had been a random sample or if the observational units had been randomly assigned.  In this case, we can carry out a hypothetical inference procedure, but then put up very large red flags in our conclusions.  We may eliminate chance as an explanation but we have not eliminated bias (if a convenience sample) or confounding (if an observational study).  For example, we can say there were fewer black 3rd base coaches than we would have expected if the coaches had been randomly assigned to each base but we can’t draw any cause and effect conclusions.

 

Confidence Intervals Estimate population/treatment parameter

The goal of a confidence interval is to get a range of plausible values that we think the population/treatment parameter could be equal to.  To do this, we use the sample statistic and a measure of the sampling/randomization variability of the statistic.  This lets us form an interval around the sample statistic that should contain the population/treatment parameter.  Note, we are trying to contain the parameter in the interval, not the data and not the statistic.
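As a check on the "statistic plus or minus a margin of error" structure, this Python sketch rebuilds the 99% interval from the body fat Minitab output in the sample problems.  The critical value 2.898 is an assumption taken from a standard t table for 17 degrees of freedom; everything else is read from the output.

```python
from math import sqrt

# Summary statistics from the body fat Minitab output in this handout.
n, x_bar, s = 18, 28.95, 4.53
# Critical value t* for 99% confidence with n - 1 = 17 df, taken from a t table.
t_star = 2.898

se = s / sqrt(n)      # standard error of the sample mean
margin = t_star * se  # margin of error
lower, upper = x_bar - margin, x_bar + margin

print(f"SE Mean = {se:.2f}")
print(f"99% CI for mu: ({lower:.2f}, {upper:.2f})")
```

The endpoints agree with the Minitab interval up to rounding of the reported mean and standard deviation.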

 

Tests of Significance Test claim about population/treatment parameter

The goal of a test of significance is to make a decision about the population parameter.  Here are the steps we use:

0) Define the parameter(s) of interest.  (Should also be able to define the OUs and variable.)

1) Specify the hypotheses (e.g. H0: π = 1/3, μ1 − μ2 = 0, μ1 = μ2 = μ3, no relationship between variables)

Always in terms of the population parameters since that’s what is unknown and what we are trying to make statements about (take off the hats!)

            The null hypothesis is the “dull hypothesis” or the “ho-hum hypothesis”

            The alternative hypothesis specifies something interesting

                        One or two-sided (decide based on wording in research question)

2) Check the technical conditions, sketch the sampling/randomization distribution of the statistic assuming H0 is true, and identify the appropriate test procedure

3) Calculate the test statistic comparing the data observed in the sample to “expected”

4) Find the p-value = probability of observing a value of the test statistic as extreme or more extreme than the one observed when H0 is true.

E.g., if our test statistic follows a Normal distribution when H0 is true, we find P(Z>z) or P(Z<z) (how far out in the tails does z fall?) for a one-sided test (direction determined by Ha), or 2P(Z>|z|) for a two-sided test.

Know when to multiply by two for two-sided alternatives!

5) Decision: Decide to reject or fail to reject H0

Reject if p-value ≤ α, synonymous with saying result is “statistically significant”

6) Conclusion: Make conclusion about research question of interest (back to English)

 

If we repeatedly took different samples/randomizations and calculated the value of the test statistic for each one, the p-value indicates how often we would expect to see the test statistic value that we actually did observe, or one more extreme, when H0 is true.  If the test statistic value is very unlikely (so small p-value) we stop believing H0 (recall the loaded dice example).  We can compare to the significance level α as a benchmark to decide if the p-value is “too small.”
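Steps 4 and 5 can be sketched in Python for a test statistic that follows a Normal distribution under H0.  The function name z_p_value and the observed value z = 1.80 are hypothetical; the point is only to show when the tail area is doubled and how the same statistic can lead to different decisions.

```python
from statistics import NormalDist

def z_p_value(z, alternative):
    """p-value for a z statistic; alternative is 'greater', 'less', or 'two-sided'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":      # Ha: parameter > hypothesized value
        return 1 - std_normal.cdf(z)
    if alternative == "less":         # Ha: parameter < hypothesized value
        return std_normal.cdf(z)
    if alternative == "two-sided":    # Ha: parameter != hypothesized value
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

z = 1.80        # hypothetical observed test statistic
alpha = 0.05    # significance level
for alternative in ("greater", "two-sided"):
    p = z_p_value(z, alternative)
    decision = "reject H0" if p <= alpha else "fail to reject H0"
    print(f"{alternative}: p-value = {p:.4f} -> {decision}")
```

Because the two-sided p-value is twice the one-sided tail area, a result can be significant at the 5% level against a one-sided alternative and not against a two-sided one, so decide on the direction of Ha from the research question before looking at the data.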

 

T vs Z With a binary qualitative variable, our observations consist of “yeses” and “nos” for each observational unit in the population.  A picture of this population is simply a bar graph.  In particular, we don’t worry about its shape or variability. We will always approximate the distribution of the sample proportion with the normal distribution. Thus, we never worry about using the t distribution with proportions.  With means, the t distribution is used to take into account the extra variation we will see in the sampling distribution if we also substitute the sample standard deviation, s, into the equation.

 

Population vs Variable vs Parameter A population is a group of objects; a variable is what we measure about the objects, a question we ask of each one (e.g., height).  The observational units are the objects we measure (e.g., buildings, volleyball players).  You need to be able to decide how many populations you have and how many variables, e.g., are you measuring two different things/answering two different questions about the objects (e.g., height and age), or are you measuring the same thing on two different groups (e.g. heights of men and heights of women)?  Parameters are numbers that describe the population (e.g., the average height of all buildings, the average age of all volleyball players); we just may not know their exact numerical value.

·        We look at associations between variables and differences between groups

·        Consider the data collection, was one sample taken and two or more variables measured (e.g., eye condition and lighting condition) or were separate samples taken w/ one variable measured on members of each group (e.g., verbal ability of males and verbal ability of females)?

 

Independence Consider three uses of the term “independence” but remember that something is always independent from something else (e.g., can’t have one independent sample)

·        Independent Observations:  Does the outcome for one observational unit depend on the outcome of another observational unit?  For example, if people collaborate on an assignment, their scores are probably not independent.  Knowing whether person A got a good score changes my prediction for whether person B got a good score.  We can assume that the observations in our study are independent when we take random samples, take systematic samples where the observations are spaced far enough apart (e.g., not the waiting times of successive people in line), or prevent subjects from communicating (e.g., not allowing peer pressure in survey responses).

·        Independent samples: We will assume two samples are independent if the selection of units for one sample does not influence the selection of units for the other sample.  An obvious counterexample is when the members of both groups are the same (paired data): once I know who is in the first sample, that completely determines who is in the second sample.  Sometimes we have truly independent samples (e.g., separate random samples from two different years); other times, we must be willing to consider the samples independent (e.g., one random sample split into males and females), though with chi-square procedures we made a little bit bigger deal about whether we truly had separate samples or if we just recorded the gender as one of the measured variables.

·        Independent variables:  Two variables are independent if knowledge of the outcome of one variable provides no new information about which outcomes of the other variable are more likely.  For example, gender and handedness are dependent if males tend to be left-handed more often than females, so that once I know I have selected a male, my estimate of the probability that the individual is left-handed is higher than if I didn’t know the person’s gender.  We perform tests of significance (chi-square, regression, even ANOVA) to help us decide whether two variables are independent in the population.  It’s pretty unlikely for two variables to be completely independent (e.g., identical conditional proportions) in the sample, so we ask whether the dependence is strong enough to eliminate chance as an explanation for the observed dependence.  Note that we talk about independence/association between two variables.  We don’t talk about the outcomes of the variables or levels of the variables, but the entire variable.

 

 


Symbols

α (alpha)

 

a

 

β (beta)

 

b

 

χ²

 

d.f.

 

F

 

H0, Ha

 

IQR

 

μ (mu)

 

μ0, π0

 

n, (ni)

 

π (pi)

 

p̂ (p hat)

 

p-value

 

r

 

r²

 

R²(adj.)

 

s, s²

 

SE

 

σ², σ (sigma)

 

t

 

t0

 

 

 

z

 

 

 

Review problems

 

Identify the error in the following analyses:

1)  A confidence interval for the difference in the rates of developing a flu-like illness between a vaccinated and an unvaccinated group, π1 − π2, is determined to be (-.056, .016).

(a) This indicates that 90% of the sample proportions are contained in this interval.

(b) This indicates that we are 90% confident that between 1.6% and 5.6% of the population developed a flu-like illness.

 

2) A confidence interval comparing the average number of words remembered for two different ways of presenting the words (familiar chunks and unfamiliar chunks) is determined to be (1.65, 8.60) with a p-value of .003.

(a) Since the interval does not contain 0, we are 95% confident that the scores for the first group are larger than the scores for the second group.

(b) Those receiving the letters in familiar chunks will perform better 95% of the time than those receiving the letters in unfamiliar chunks.

(c) The p-value of .003 indicates that there is a .003 probability that those receiving the letters in familiar chunks will perform better than those receiving the letters in unfamiliar chunks.

 

3) Another study of housing prices (in thousands of dollars) found the equation (for Bakersfield homes in April, 2003):

            predicted price = 30.15 + .0695 sq ft.   r² = 56.4%, p-value < .001

(a) The sample slope coefficient reveals that a house’s price goes up by $69.50 for each additional square foot of size.

(b) If the technical conditions are met, the very small p-value suggests that there is no linear association between a house’s size and its price.

(c) If the technical conditions are met and if the p-value had been larger than .10, you could have concluded that the sample data provide strong evidence that there is no association between a house’s size and its price.

(d) Adding square footage to a house causes the price to increase by $69.50 on average.

 

4) Suppose we want to predict hiking time from hiking distance for Day Hikes in San Luis Obispo County and find predicted time = -1.27 + 31.5 distance with r² = .838.  Identify at least one problem with each of the following interpretations.

(a) The slope shows that for each additional minute, we predict the hike is 31.5 miles longer.

(b) The slope shows that a hiker’s time increases by 31.5 minutes for each additional mile.

(c) The predicted time for a 5-mile hike is about 156.

(d) About 84% of hikes have times that are correctly predicted by the line.

(e) About 84% of the variability in hikes is explained by time.

 

Sample Problems from a previous Stat 512 Final Exam (different instructor)

1.  The MINITAB output that follows resulted from taking observations on the percentage of body fat taken from teenagers defined as clinically obese.

 

T Confidence Intervals

 

Variable     N      Mean    StDev  SE Mean       99.0 % CI

Body Fat    18     28.95     4.53     1.07  (   25.86,   32.05)

 

a)     Define the parameter(s) of interest.

b)     Is the value of 28.95 a parameter or a statistic?  Explain.

c)     True or false:

i)      For this interval to be valid, it is necessary that the population of percentages of body fat be normally distributed.

ii)    A larger sample would make the population distribution more normal.

iii)  For this interval to be valid, it is necessary that the population standard deviation of the percentages of body fat be known.

iv)   99% of the time, the mean percentage of body fat will fall between 25.86 and 32.05.

v)     You can be absolutely sure that the mean percentage of body fat is between 25.86 and 32.05.

vi)   It is possible that the mean percentage of body fat of the population is not between 25.86 and 32.05.

vii)  99% of all intervals created in this fashion will contain the mean percentage of body fat.

viii)         A 95% confidence interval obtained from the same data would be wider.

ix)   If a two-tailed hypothesis test of H0: μ = 30 were performed using these data, H0 would be rejected.

 

2.     “Photo-volume and weight tables for Central Coast hardwoods have not been available prior to this study.  Their importance to hardwood resource evaluation efforts is twofold.  First, a general reconnaissance survey can be rapidly conducted from aerial photos of 1:10,000 scale or greater in the comfort of an office.  Secondly, with a small number of field samples, relatively accurate volume and weight estimates can be obtained for stands of hardwoods.”  This quote is from a Master’s thesis Tree Photo Volume and Weight Tables for California’s Central Coast (Brockhaus, John A., Cal Poly).  As part of the research project described therein, data were collected from multiple forest stands.  Aerial photographs of the stands were taken and used to produce photo volume--an estimate of the volume of wood in a stand.  Then foresters traveled to the same stands and used standard procedures to determine the field volume of the stand--an accepted measure of the total volume of wood in a stand, but one that takes more work, time, and resources than aerial photography.  A lumber company, wanting to see if photo volume would produce adequate estimates of field volume, used these data to generate the following MINITAB output.

 

The regression equation is

Field = 34.4 + 1.14 Photo

 

Predictor        Coef     SE Coef          T        P

Constant        34.37       72.64       0.47    0.642

Photo         1.13710     0.08953      12.70    0.000

 

S = 209.8       R-Sq = 90.0%     R-Sq(adj) = 89.4%

 

Analysis of Variance

 

Source            DF          SS          MS         F        P

Regression         1     7099245     7099245    161.31    0.000

Residual Error    18      792170       44009

Total             19     7891416

 

Unusual Observations

Obs      Photo      Field         Fit      SE Fit    Residual    St Resid

 10       1944     2000.0      2244.9       127.5      -244.9       -1.47 X

 17        957     1876.0      1122.6        55.8       753.4        3.73R

 

R denotes an observation with a large standardized residual

X denotes an observation whose X value gives it large influence.

 

a)      From the output, read or calculate the values of the following.

i)      The y-intercept.

ii)              The estimate of the average change in field volume for an increase of one in photo volume.

iii)            The quantity that the least squares line minimizes.

iv)   The standard deviation in the estimate of the slope.

v)     The quantity that measures the proportion of error removed from the estimation of field volume by using a linear regression model with photo volume rather than using the sample mean field volume as the estimate.

vi)   The sample correlation coefficient.

b)     Test to see if there is a linear relationship between photo volume and field volume.

i)      Define the parameter of interest.

ii)    What are the hypotheses?

iii)  Give the values from the MINITAB output of the two test statistics that may be used to perform the test.

iv)   Reach and justify a decision at α = .05.  Provide an interpretation of the decision.

 

3.     For a class project, a student took a sample of students and determined their age, gender, whether they belonged to a fraternity or sorority, how many years they had been attending college, and the number of alcoholic drinks they had in a week.  A regression analysis by MINITAB resulted in the following output.

 

The regression equation is

num.drinks/week = 2.6 + 1.01 age - 0.01 sex - 9.67 frat/sor

           - 0.10 yr. in school

 

Predictor        Coef     SE Coef          T        P

Constant         2.59       12.48       0.21    0.838

age            1.0091      0.7192       1.40    0.181

sex            -0.013       1.671      -0.01    0.994

frat/sor       -9.666       1.811      -5.34    0.000

yr. in s       -0.102       1.164      -0.09    0.931

 

S = 3.571       R-Sq = 66.0%     R-Sq(adj) = 56.9%

 

Analysis of Variance

 

Source            DF          SS          MS         F        P

Regression         4      371.26       92.81      7.28    0.002

Residual Error    15      191.29       12.75

Total             19      562.55

 

a)      Perform a test to see if at least one of the predictors aids in the prediction of the response.

i)       What are the hypotheses?

ii)      Make and justify your decision at α = .05 and provide an interpretation.

b)     Perform a test to decide if sex is a significant part of the current model.

i)       Give the hypotheses

ii)      Make and justify your decision at α = .05 and provide an interpretation.

c)      The student believed that whether a student belonged to a fraternity or sorority was the only variable that would predict alcohol consumption, and that the other variables would not be necessary components of the model.  Based on the t-tests and their associated p-values (use α = .05), does the student have sufficient justification for her claim?  Explain.

 


Additional Review Problems

1) Suppose that instructors A, B, and C are each teaching three large sections of a course, and each instructor wants to study whether the mean exam scores differ significantly across the three sections.  Suppose that each takes a random sample of ten students, and calculates the following descriptive statistics:

 

                  A1    A2    A3    B1    B2    B3    C1    C2    C3

Sample size       10    10    10    10    10    10    10    10    10

Sample mean       50    60    70    50    60    70    57    60    63

Sample std dev    24    24    24     5     5     5     5     5     5

(a) Based on these statistics, which instructor has the strongest evidence that the mean scores differ significantly across his/her three sections?  Which has the least evidence?  Explain your answers.

 

2) Consider the following four data sets, each consisting of four (x, y) data points :

            A: (1,3) (2,5) (3,6) (4,8)                         B: (1,4) (2,7) (3,2) (4,4)

            C: (1,8) (2,6) (3,2) (4,3)                         D: (1,5) (2,3) (3,5) (4,2)

Based on the changes in the x and y values, arrange these data sets in order from the most negative correlation to the most positive.  Explain your reasoning.

 

3) An article in the May 24, 2004 issue of Sports Illustrated raised two separate questions about seven-game series in professional team sports.  One question concerns the proportion of seven-game series that have gone to the full length of seven games.  The article reported that through the year 2003, 44 of 131 (34%) series went to the full length in baseball, compared to 111 of 471 (24%) in hockey and 85 of 303 (28%) in basketball.

(a) Conduct a chi-square analysis of whether these percentages differ more than would be expected by random variation.  Begin with graphical displays and numerical summaries, and then proceed to a chi-square test.  Which type of chi-square test did you do? Summarize your conclusions.

(b) Comment on whether these data come from random samples or from randomization to groups, or whether the randomness is hypothetical here.

(c) The other question posed by the article compares the proportion of “game sevens” that are won by the home team across these sports.  The article reported that 23 of 44 (52%) were won by the home team in baseball, compared to 70 of 111 (63%) in hockey and 70 of 85 (82%) in basketball.  Analyze these data to assess whether they provide evidence that the three proportions differ significantly, and write a paragraph or two summarizing your conclusions.
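
The chi-square calculations in problem 3 can be checked by hand or with a short script.  The sketch below is an illustration in Python rather than Minitab; the function names are hypothetical.  It relies on the fact that with exactly 2 degrees of freedom, as in these 3 x 2 tables, the chi-square upper-tail probability is exp(-x/2).

```python
from math import exp

def chi_square_stat(successes, totals):
    """Chi-square statistic for a (groups x 2) table of success/failure counts."""
    overall = sum(successes) / sum(totals)   # pooled proportion of successes
    stat = 0.0
    for s, n in zip(successes, totals):
        cells = ((s, n * overall), (n - s, n * (1 - overall)))
        for observed, expected in cells:
            stat += (observed - expected) ** 2 / expected
    df = len(totals) - 1                      # (rows - 1) * (2 - 1)
    return stat, df

def p_value_df2(stat):
    # Valid only when df == 2: P(chi-square > x) = exp(-x / 2)
    return exp(-stat / 2)

# Order of sports: baseball, hockey, basketball
full_length, n_series = [44, 111, 85], [131, 471, 303]   # part (a)
home_wins, n_game7 = [23, 70, 70], [44, 111, 85]         # part (c)

stat_a, df_a = chi_square_stat(full_length, n_series)
stat_c, df_c = chi_square_stat(home_wins, n_game7)
print("part (a):", round(stat_a, 2), df_a, round(p_value_df2(stat_a), 4))
print("part (c):", round(stat_c, 2), df_c, round(p_value_df2(stat_c), 4))
```

Before trusting either p-value, check that every expected count is large enough (at least 5 is the usual guideline) for the chi-square approximation.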

 

4) A student conducted a study to examine the ages of people who joined a local health club.  The participants were chosen by systematically sampling the men and the women who joined the health club in August and September 2004.  The data are in the Minitab worksheet GymMembership.mtw.

Analyze the data to compare the mean ages of the men and women and also the mean ages of those who joined the club in August and in September, conditional on the other variable.  Produce and comment on numerical and graphical summaries, state hypotheses, and check the technical conditions for each procedure.  Also comment on whether there appears to be a statistically significant interaction between gender and month joined and which factor (gender or month) appears to be more strongly related to the ages of new members.

 

5) In a “matched pairs” experiment, each subject receives both treatments, in random order.  This lets us see whether the treatment is consistently effective by comparing each person to themselves rather than across individuals, which gives a more direct comparison and “controls for” the person-to-person variability.  To analyze the data, we take the difference in the results for each person and see if the average difference is significantly different from zero.  A “repeated measures” design is this idea extended to more than two treatments.  “Blocking” is the same logic, but we group experimental units that are very similar to each other instead of using the same unit more than once.  Either way, we reduce the variability that would otherwise make the treatment effect harder to detect.  In both of these cases, we include “subjects” or “blocks” as one of the variables in an ANOVA analysis (assuming a quantitative response).
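
The matched-pairs analysis described above can be sketched in a few lines of Python.  The scores below are made up purely for illustration (they are not from any study in this course), and the function name paired_t is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """t statistic and df for H0: mean difference (second - first) = 0."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)   # sample std dev of the differences
    return mean(diffs) / standard_error, n - 1

# Hypothetical responses for six subjects, each measured under both treatments
treatment_1 = [12, 15, 11, 14, 13, 16]
treatment_2 = [14, 18, 12, 17, 13, 19]

t_stat, df = paired_t(treatment_1, treatment_2)
print(round(t_stat, 3), df)
```

The resulting t statistic is compared to a t distribution with n - 1 degrees of freedom, where n is the number of subjects (pairs), not the number of measurements.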

     Researchers who are studying a new shampoo formula plan to compare the condition of hair for people who use the new formula with the condition of hair for people who use the current formula.  Twelve volunteers are available to participate in this study.  Information on these volunteers (numbered 1-12) is shown in the table below.

Volunteer   Gender   Age
1           Male     21
2           Female   20
3           Male     47
4           Female   60
5           Female   62
6           Male     61
7           Male     58
8           Female   44
9           Male     44
10          Female   24
11          Male     23
12          Female   46

(a) The researchers want to conduct an experiment involving the two formulas (new and current) of shampoo.  They believe that the condition of hair changes with age but not gender.  Because researchers want the size of the blocks in an experiment to be equal to the number of treatments, they will use blocks of size 2 in their experiment.  Identify the volunteers (by number) that would be included in each of the six blocks and give the criteria you used to form the blocks.

(b) Other researchers believe that hair condition differs with both age and gender.  These researchers will also use blocks of size 2 in their experiment.  Identify the volunteers (by number) that would be included in each of the six blocks and give the criteria you used to form the blocks.

(c) The researchers in (b) decide to select three of the six blocks to receive the new formula and to give the other three blocks the current formula.  Is this an appropriate way to assign treatments?  If so, describe a method for selecting the three blocks to receive the new formula.  If not, describe an appropriate method for assigning treatments.
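
As a check on your thinking about blocks, here is one possible sketch in Python of a randomized block design for the setting in part (a): volunteers of similar age are paired, and the two formulas are then randomly assigned within each pair.  The function names and the seed value are hypothetical choices, and this is only one reasonable way to form the blocks.

```python
import random

# Volunteer number -> (gender, age), copied from the table in problem 5
volunteers = {
    1: ("Male", 21), 2: ("Female", 20), 3: ("Male", 47), 4: ("Female", 60),
    5: ("Female", 62), 6: ("Male", 61), 7: ("Male", 58), 8: ("Female", 44),
    9: ("Male", 44), 10: ("Female", 24), 11: ("Male", 23), 12: ("Female", 46),
}

def blocks_by_age(people):
    """Sort volunteers by age and pair off neighbors into blocks of size 2."""
    ordered = sorted(people, key=lambda v: people[v][1])
    return [tuple(ordered[i:i + 2]) for i in range(0, len(ordered), 2)]

def assign_within_blocks(blocks, rng):
    """Randomly give one member of each block the new formula, the other the current."""
    assignment = {}
    for block in blocks:
        treatments = ["new", "current"]
        rng.shuffle(treatments)
        for volunteer, treatment in zip(block, treatments):
            assignment[volunteer] = treatment
    return assignment

rng = random.Random(512)          # fixed seed so the example is reproducible
blocks = blocks_by_age(volunteers)
assignment = assign_within_blocks(blocks, rng)
for block in blocks:
    print(block, [assignment[v] for v in block])
```

Because both formulas appear in every block, differences between formulas are compared among volunteers of similar age.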

 

Review problems from text

1) p. 236, problem 7; p. 244, problems 43, 47

2) p. 287-8, problem 5

3) p. 351, problems 33, 39

4) p. 434, problem 45

5) p. 481, problem 17