Stat 512 – Final Review
Chance Office Hours: Mon 12-2pm, Tues 8-9am, Wed 9:30-10:30am, 1-2pm, Fri 12-2pm
Final Review Session: Tuesday, 7-9pm, in regular studio classroom (02-206)
Final Exam Times: Wednesday, 4-7pm, and Friday 7-10pm
If you forget which day you are scheduled for, see http://statweb.calpoly.edu/chance/stat512/finalSched.html
Final Exam: The final will mostly focus on the material since the midterm, but will have a cumulative component. The final will be a three-hour exam in the studio. The format will be similar to the midterm (mixture of short answer and analysis questions) and is open book, notes, computers. You should bring your course notes and a calculator. You will probably be asked to carry out analyses in Minitab as well as to interpret Minitab output.
Things you can do for yourself:
· Fill out the definitions of symbols we have used (see last page of handout)
· Be extra familiar with the “Overview of Statistical Procedures” handout and with the table at the beginning of Lecture 7.
· Organize your notes with respect to Minitab commands/menus we have used this quarter.
· Work problems! Put pencil to paper and go through the calculations and explanations (without using the word “it”), especially for confidence intervals and significance tests. Browsing the text and notes is not enough; you need to put yourself in the test-taking environment and start the problem from scratch. You want the calculations to be automatic so they don’t slow you down on the exam. Pick up graded (printed) assignments from the wooden box outside my office. Reworking old homework questions and class examples will allow you to check your answers with the key, while providing you with “representative questions” (solutions at statweb.calpoly.edu/chance/stat512/). It’s possible your projects will be graded and available to you on Monday.
· Pretend the final is closed book.
· See earlier review handout for Midterm
· Ask questions! Use the email list and instructor to your advantage to clear up concepts.
· Get plenty of sleep
Topics since the Midterm
From Lecture 12 you should be able to:
· Recognize when the research question of interest is a comparison of two independent populations
· Carry out and interpret a two sample z-test and two sample z-interval for the difference in population proportions, p1-p2
o Realize that the SE calculation differs slightly for CI and for test statistic
o Realize what factors affect the p-value and the width of the confidence interval
o The conditions required for validity of the procedure and how to check them
· Carry out and interpret a two sample z-test and two sample z interval for the true treatment effect between two proportions (large-sample approximation to two-way table simulation)
o The difference between an empirical randomization distribution and an empirical sampling distribution.
· Duality between confidence intervals and tests of significance
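A minimal Python sketch of the two-proportion z procedures listed above, mainly to show where the two standard errors differ. The counts are borrowed from the Sports Illustrated review problem later in this handout (44 of 131 baseball series and 111 of 471 hockey series went seven games). The function name and the 95% level are illustrative assumptions, not part of the course materials, and the randomness in that data set is hypothetical at best.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2, conf_level=0.95):
    """Two-sample z-test (H0: no difference) and z-interval for p1 - p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    diff = p1_hat - p2_hat

    # Test statistic: H0 says the two proportions are equal, so pool the samples.
    p_pool = (x1 + x2) / (n1 + n2)
    se_test = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_test
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided alternative

    # Confidence interval: no H0 to lean on, so use each sample proportion.
    se_ci = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    ci = (diff - z_star * se_ci, diff + z_star * se_ci)
    return z, p_value, ci, se_test, se_ci

z, p_value, ci, se_test, se_ci = two_proportion_z(44, 131, 111, 471)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
print(f"SE for test = {se_test:.4f}, SE for interval = {se_ci:.4f}")
print(f"95% CI for p1 - p2: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval excluding zero and the two-sided p-value falling below .05 agree with each other, which is the duality between intervals and tests mentioned in the last bullet.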
From Lecture 13 you should be able to:
· Recognize when the research question of interest is a comparison of two independent means
· Carry out and interpret a two sample t-test and two sample t-interval for the difference in population means
o Understand the need for the t distribution
o Realize what factors affect the p-value and the width of the confidence interval
o In particular, understand the benefits and methods of reducing variability
o The conditions required for validity of the procedure and how to check them
· Carry out and interpret a two sample t-test and two sample t interval for the true treatment effect between two means (large-sample approximation to randomization test)
· Define the implications of a Type I Error in a particular context
· Recognize when the research question of interest leads to a chi-square analysis
o Comparing two or more population distributions on a categorical variable
o Association between two categorical variables in one population
· Carry out and interpret a chi-square test (see also Lecture 14)
o State hypotheses corresponding to how data were collected (case 1, 2 or 3)
o Understand what expected counts represent and how they are calculated
o Perform follow-up analysis of cell contribution values
o The conditions required for validity of the procedure and how to check them
o Randomness condition also depends on which case you are in
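As a rough illustration of how expected counts and cell contributions are computed, here is a short Python sketch. The counts are hypothetical (made up so the arithmetic is easy to follow), and the function name is an assumption for illustration. In class, Minitab does this work.

```python
from math import exp

def chi_square_two_way(observed):
    """Return expected counts, cell contributions, the chi-square statistic, and df."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count = (row total x column total) / table total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Each cell contributes (observed - expected)^2 / expected.
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    chi_sq = sum(sum(row) for row in contributions)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, contributions, chi_sq, df

# Hypothetical counts: rows = yes/no response, columns = three groups of 100.
observed = [[30, 40, 50],
            [70, 60, 50]]
expected, contributions, chi_sq, df = chi_square_two_way(observed)

# For df = 2 only, the chi-square upper-tail area has the closed form exp(-x/2).
p_value = exp(-chi_sq / 2) if df == 2 else None
print(f"chi-square = {chi_sq:.2f}, df = {df}, p-value = {p_value:.4f}")
```

Looking at `contributions` cell by cell is the follow-up analysis: here the first and third groups account for the entire statistic, while the middle group matches its expected counts exactly.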
From Lecture 14 you should be able to:
· Recognize when the research question of interest leads to ANOVA
· Carry out and interpret a one-way Analysis of Variance
o Understand how ANOVA compares between group variability to within group variability
o Realize what factors affect the p-value
o The conditions required for validity of the procedure and how to check them
· Carry out and interpret Tukey’s Multiple Comparison procedure
From Lecture 15 you should be able to:
· Carry out an Analysis of Variance with multiple explanatory variables (e.g., “two-way”)
o Interpret and test for an interaction between two explanatory variables
o Does the effect of one variable depend on the outcome of the other (e.g., the city route to school is faster in the morning but the freeway route is faster in the evening)
o Recognize the benefits of including a “blocking” variable in the design and analysis
o E.g., what’s wrong with treating “paired” data as independent samples?
o Carry out the randomization within each block (e.g., order of chips)
o How to enter data into Minitab and carry out a repeated measures analysis
o How to interpret the individual p-values for each factor
· Understand the benefit of reducing the “Error” component
From Lecture 16 you should be able to:
· Construct and interpret a scatterplot to analyze the relationship in two quantitative variables
o EV on horizontal, RV on vertical axis
o Discuss form, direction, and strength
· Report and interpret the correlation coefficient
o Think about properties of r (units, possible values, resistance)
o Remember to see if relationship is linear first
· Calculate and interpret residuals
· Explain what is meant by “least squares” regression
· Calculate and interpret the least-squares regression line
o How to write it out as an equation for a line to predict the response variable
o Think about properties of the least squares line (resistance)
o Interpret the slope and intercept coefficients in context (on average/predicted)
o Use the regression line to make predictions
· Report and interpret the r2 value
o Relationship to r
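The following Python sketch shows one way to compute the least-squares line, residuals, r, and r2 by hand. The five data points are hypothetical (made up for illustration), and the function name is an assumption rather than anything from the course.

```python
from math import sqrt

def least_squares(x, y):
    """Return (intercept, slope, r) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x-bar, y-bar)
    r = sxy / sqrt(sxx * syy)
    return intercept, slope, r

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r = least_squares(x, y)

# Residual = observed response - predicted response.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(f"predicted y = {a:.2f} + {b:.2f} x")
print(f"r = {r:.3f}, r2 = {r ** 2:.3f}")
print("residuals:", [round(e, 2) for e in residuals])
```

In simple linear regression r2 is literally the square of r, and the residuals from the least-squares line always sum to zero.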
From Lecture 17 you should be able to:
· Discuss and check the conditions for the basic regression model
o Produce and interpret residual plots
· Understand when there is a need to transform data and how to decide whether a transformation has been effective
· Discuss the sampling distribution of the sample slope (what does it represent? what does it tell you?)
o We looked at the sampling distribution, but the randomization distribution looks just like it!
· Produce and interpret Minitab output for carrying out inference for a regression slope
o State hypotheses in terms of β, no association (be sure to ID the two variables)…
o Adjust test statistic and p-value if hypothesized value differs from 0 or Ha is one-sided (Minitab reports the two-sided p-value for H0: β = 0)
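A small Python sketch of the adjustment for a hypothesized slope other than 0. The slope and its standard error come from the Photo row of the Minitab output in sample problem 2 later in this handout. The hypothesized value of 1 is an assumption chosen purely for illustration, and the critical value is read from a standard t table.

```python
def slope_t_statistic(b, se_b, beta_0=0.0):
    """t = (sample slope - hypothesized slope) / SE(slope), with df = n - 2."""
    return (b - beta_0) / se_b

b, se_b = 1.13710, 0.08953  # Coef and SE Coef for Photo (df = 18)

t_default = slope_t_statistic(b, se_b)      # H0: beta = 0, the test Minitab reports
t_vs_one = slope_t_statistic(b, se_b, 1.0)  # H0: beta = 1, a hypothetical choice

T_STAR = 2.101  # t-table critical value: df = 18, two-sided alpha = .05
print(f"t for H0: beta = 0 -> {t_default:.2f}")
print(f"t for H0: beta = 1 -> {t_vs_one:.2f}")
print("reject H0: beta = 1 at .05?", abs(t_vs_one) > T_STAR)
```

The same slope that is overwhelmingly different from 0 is not significantly different from 1, so the conclusion depends entirely on which hypothesized value is being tested.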
From Lecture 18 you should be able to:
· Use Minitab to produce a matrix scatterplot and decide which explanatory variable(s) appear to have the strongest linear relationship with the response variable
· Use Minitab to produce a multiple regression equation
o Interpret slope coefficients (all other variables constant)
o Understand the benefits of including more explanatory variables in the model
o Check the validity of inference procedures using residual plots
o Analyze the overall F test/Model Utility test (state appropriate hypotheses)
o Analyze the individual t tests (state appropriate hypotheses)
o Interpret the R2 value and use R2(adj) to compare models
· Be cautious of “multicollinearity”
o Are the explanatory variables strongly related? Could make slope estimates unstable.
· Include a binary explanatory variable in the regression model
o Interpret slope coefficient and p-value
Some Big Picture Ideas
· Observational units, variables, population vs. sample, parameter vs. statistic
o State variables as variables
o Define parameters as numbers
· The statistical process (Formulating research question, data collection methods, descriptive statistics – numerical and graphical summaries, inferential statistics, stating conclusions)
o In stating conclusions, consider statistical significance, generalizability and cause and effect
· Experiments vs. observational studies
o How to design an experiment, How to properly select a sample
o Scope of conclusions depends on how the study was conducted (Can you draw a cause and effect conclusion? Can you generalize to a larger population?)
o Sampling errors, nonsampling errors, and random sample errors (and which of these are measured by the “margin of error” and p-value?), how do we reduce them?
o Distinguishing between random samples and randomization in hypotheses and statements of study conclusions
o Association vs. causation
· Describing and comparing distributions of data
o Categorical: segmented bar graphs, conditional percentages, difference in proportions
o Quantitative: shape, center, and spread, boxplots, histograms, dotplots, resistance of median and IQR
· Carrying out and interpreting descriptive and inferential procedures (e.g., population vs. sample results)
o Determining which descriptive and inferential procedures to use based on the research question
o When and how to use “paired t” and “repeated measures” procedures (e.g., twins, shopping data, chocolate melting)
o How to state technical conditions for the procedures and how to check them
o Sample size effects
· Using simulation to generate empirical sampling distributions and approximate p-values
· How to interpret “confidence” (level) and p-values in context and factors affecting them
What is Statistical Inference?
We will consider two types of inference: inference about populations and inference about treatment effects. The parameter and the statistic summarize the same variable. The parameter summarizes the variable for the population or for the true underlying treatment effect, which is what we want to know, e.g., π or μ1 – μ2. However, we can’t observe the whole population or all possible outcomes of the randomization process, so we don’t know what the parameter value really is. We can, however, measure the variable on a random sample or randomized groups and compute a sample statistic, e.g., p̂ or x̄1 – x̄2. The question is what we can infer about the parameter based on this statistic. Since we saw that the sampling distribution/randomization distribution of these statistics follows a regular pattern, we can calculate probabilities of different values of the statistic occurring. Different statistics follow different distributions, but once we know which distribution we should use, we can make predictions about the value of the parameter, e.g., it’s in some interval or we have evidence that it is not a particular value.
Sampling Distributions: If we specify a value for the population parameter, we can take (or simulate) lots of random samples from this population and calculate a statistic for each sample. This allows us to examine the behavior (distribution) of the statistic so we can discuss the shape, center, and variability of that distribution. For example, what types of values do we expect the statistic to have, how far away is the observed statistic from the hypothesized parameter?
Randomization Distributions: If we assume no treatment effect, we can simulate lots of different randomizations of the experimental units to treatment groups and calculate a statistic for each randomization. This allows us to examine the behavior (distribution) of the statistic so we can discuss the shape, center, and variability of that distribution. For example, what types of values do we expect the statistic to have, how far away is the observed value from the center of zero (we know the distribution centers at zero since we told the simulation that there was no treatment effect)?
Therefore, seeing where the statistic falls in this distribution gives us an indication of whether it is a surprising outcome or not, given our (null) hypothesis about the population/treatment effect.
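The randomization idea above can be sketched in a few lines of Python. The responses below are hypothetical (made up for illustration), and the function name, number of repetitions, and seed are all assumptions. The point is only to show the shuffle, recompute, and count logic behind an empirical p-value.

```python
import random

def randomization_p_value(group_a, group_b, reps=5000, seed=1):
    """Approximate a two-sided p-value for a difference in means by re-randomizing."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # one new random assignment, assuming no treatment effect
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Count re-randomizations at least as far from zero as what we observed.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / reps

# Hypothetical responses for two randomly assigned groups of six.
treatment = [12, 15, 14, 16, 13, 17]
control = [10, 11, 9, 12, 13, 10]
observed, p_value = randomization_p_value(treatment, control)
print(f"observed difference in means = {observed:.2f}")
print(f"approximate two-sided p-value = {p_value:.4f}")
```

A small proportion of shuffles as extreme as the observed difference means the observed result would be surprising if there were no treatment effect.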
Hypothetical Inference
Often, we do not have a random sample or randomization. Technically, inference should not be performed in these cases. However, it may still be of interest to ask whether we can eliminate “chance” as an explanation for the differences observed in the study. You may ask how often we would expect to find a result this extreme if it had been a random sample or if the observational units had been randomly assigned. In this case, we can carry out a hypothetical inference procedure, but then put up very large red flags in our conclusions. We may eliminate chance as an explanation but we have not eliminated bias (if a convenience sample) or confounding (if an observational study). For example, we can say there were fewer black 3rd base coaches than we would have expected if the coaches had been randomly assigned to each base, but we can’t draw any cause and effect conclusions.
Confidence Intervals
Estimate population/treatment parameter
The goal of a confidence interval is to get a range of plausible values that we think the population/treatment parameter could be equal to. To do this, we use the sample statistic and a measure of the sampling/randomization variability of the statistic. This lets us form an interval around the sample statistic that should contain the population/treatment parameter. Note, we are trying to contain the parameter in the interval, not the data and not the statistic.
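One way to see what “confidence” means is to simulate many samples from a population whose parameter we set ourselves and check how often the interval captures that parameter. The sketch below assumes a population proportion of 1/3, samples of size 100, and the usual large-sample z-interval for a proportion. All of these choices, and the function name, are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage_rate(true_pi=1 / 3, n=100, conf_level=0.95, reps=2000, seed=512):
    """Fraction of simulated z-intervals for a proportion that contain true_pi."""
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    hits = 0
    for _ in range(reps):
        # Draw one random sample and compute its sample proportion.
        successes = sum(rng.random() < true_pi for _ in range(n))
        p_hat = successes / n
        margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
        # Does this sample's interval contain the parameter?
        if p_hat - margin <= true_pi <= p_hat + margin:
            hits += 1
    return hits / reps

rate = coverage_rate()
print(f"proportion of intervals containing the parameter: {rate:.3f}")
```

The parameter stays fixed while the intervals change from sample to sample, so the confidence level describes the long-run success rate of the method, not the chance that any one computed interval is right.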
Tests of Significance
Test claim about population/treatment parameter
The goal of a test of significance is to make a decision about the population parameter. Here are the steps we use:
0) Define the parameter(s) of interest. (Should also be able to define the OUs and variable.)
1) Specify the hypotheses (e.g., H0: π = 1/3, μ1 – μ2 = 0, μ1 = μ2 = μ3, no relationship between vars)
Always in terms of the population parameters since that’s what is unknown and what we are trying to make statements about (take off the hats!)
The null hypothesis is the “dull hypothesis” or the “ho-hum hypothesis”
The alternative hypothesis specifies something interesting
One or two-sided (decide based on wording in research question)
2) Check the technical conditions, sketch the sampling/randomization distribution of the statistic assuming H0 is true, and identify the appropriate test procedure
3) Calculate the test statistic comparing the data observed in the sample to “expected”
4) Find the p-value = probability of observing a value of the test statistic as extreme or more extreme when H0 is true.
E.g., if our test statistic follows a Normal distribution when H0 is true, we find P(Z>z) or P(Z<z) (how far out in the tails does z fall?) for a one-sided test (direction determined by Ha), or 2P(Z>|z|) for a two-sided test.
Know when to multiply by two for two-sided alternatives!
5) Decision: Decide to reject or fail to reject H0
Reject if p-value ≤ α, synonymous with saying result is “statistically significant”
6) Conclusion: Make conclusion about research question of interest (back to English)
If we repeatedly took different samples/randomizations and calculated the value of the test statistic for each sample, the p-value indicates how often we would expect to see the test statistic value that we actually did observe, or one more extreme, when H0 is true. If the test statistic value is very unlikely (so small p-value) we stop believing H0 (recall the loaded dice example). We can compare to the significance level α as a benchmark to decide if the p-value is “too small.”
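Steps 4 and 5 can be sketched in Python for a test statistic that follows a standard Normal distribution under H0. The function names and the example value z = 1.80 are assumptions for illustration only.

```python
from statistics import NormalDist

def normal_p_value(z, alternative):
    """p-value for a z statistic; alternative is 'greater', 'less', or 'two-sided'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":
        return 1 - std_normal.cdf(z)             # P(Z > z)
    if alternative == "less":
        return std_normal.cdf(z)                 # P(Z < z)
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))  # 2 P(Z > |z|)
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

def decide(p_value, alpha=0.05):
    """Reject H0 when the p-value is at or below the significance level."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

z = 1.80  # hypothetical test statistic
for alternative in ("greater", "two-sided"):
    p = normal_p_value(z, alternative)
    print(f"{alternative}: p-value = {p:.4f} -> {decide(p)}")
```

With the same z, the one-sided test rejects at the .05 level while the two-sided test does not, which is why the direction of Ha has to be settled from the research question before looking at the data.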
T vs Z: With a binary qualitative variable, our observations consist of “yeses” and “nos” for each observational unit in the population. A picture of this population is simply a bar graph. In particular, we don’t worry about its shape or variability. We will always approximate the distribution of the sample proportion with the normal distribution. Thus, we never worry about using the t distribution with proportions. With means, the t distribution is used to take into account the extra variation we will see in the sampling distribution when we substitute the sample standard deviation, s, for the unknown population standard deviation, σ, in the equation.
Population vs Variable vs Parameter: A population is a group of objects; a variable is what we measure about the objects, the question we ask of each one (e.g., height). The observational units are the objects we measure (e.g., buildings, volleyball players). You need to be able to decide how many populations you have and how many variables, e.g., are you measuring two different things/answering two different questions about the objects (e.g., height and age), or are you measuring the same thing on two different groups (e.g., heights of men and heights of women)? Parameters are numbers that describe the population (e.g., the average height of all buildings, the average age of all volleyball players); we just may not know their exact numerical value.
· We look at associations between variables and differences between groups
· Consider the data collection, was one sample taken and two or more variables measured (e.g., eye condition and lighting condition) or were separate samples taken w/ one variable measured on members of each group (e.g., verbal ability of males and verbal ability of females)?
· Independent observations: Does the outcome for one observational unit depend on the outcome of another observational unit? For example, if people collaborate on an assignment, their scores are probably not independent: knowing whether person A got a good score changes my prediction for whether person B got a good score. We can assume that the observations in our study are independent when we take random samples, take systematic samples where the observations are spaced far enough apart (e.g., not waiting times of successive people in line), or prevent subjects from communicating (e.g., not allowing peer pressure in survey responses).
· Independent samples: We will assume two samples are independent if the selection of units for one sample does not influence the selection of units for the other sample. An obvious counterexample is when the members of both groups are the same (paired data): once I know who is in the first sample, that completely determines who is in the second sample. Sometimes we have truly independent samples (e.g., separate random samples from two different years); other times, we must be willing to consider the samples independent (e.g., one random sample split into males and females), though with chi-square procedures we made a somewhat bigger deal about whether we truly had separate samples or just recorded gender as one of the measured variables.
· Independent variables: Two variables are independent if knowledge of the outcome of one variable provides no new information about which outcomes of the other variable are more likely. For example, gender and handedness are dependent if males tend to be left-handed more often than females, so that once I know I have selected a male, my estimate of the probability that the individual is left-handed is higher than if I didn’t know the person’s gender. We perform tests of significance (chi-square, regression, even ANOVA) to help us decide whether two variables are independent in the population. It’s pretty unlikely for two variables to be completely independent (e.g., identical conditional proportions) in the sample, so we ask whether the dependence is strong enough to eliminate chance as an explanation for the observed dependence. Note that we talk about independence/association between two variables. We don’t talk about the outcomes of the variables or levels of the variables, but the entire variable.
| Symbol | Definition |
| α (alpha) | |
| a | |
| β (beta) | |
| b | |
| χ2 | |
| d.f. | |
| F | |
| H0, Ha | |
| IQR | |
| μ (mu) | |
| μ0, π0 | |
| n, (ni) | |
| π (pi) | |
| | |
| p-value | |
| r | |
| r2 | |
| R2(adj.) | |
| s, s2 | |
| SE | |
| σ2, σ (sigma) | |
| t | |
| t0 | |
| | |
| | |
| z | |
Review problems
Identify the error in the following analyses:
1) A confidence interval comparing the rate of developing a flu-like illness from a vaccinated and an unvaccinated group p1-p2 is determined to be (-.056, .016).
(a) This indicates that 90% of the sample proportions are contained in this interval.
(b) This indicates that we are 90% confident that between 1.6% and 5.6% of the population developed a flu-like illness.
2) A confidence interval comparing the average number of words remembered for two different ways of presenting the words (familiar chunks and unfamiliar chunks) is determined to be (1.65, 8.60) with a p-value of .003.
(a) Since the interval does not contain 0, we are 95% confident that the scores for the first group are larger than the scores for the second group.
(b) Those receiving the letters in familiar chunks will perform better 95% of the time than those receiving the letters in unfamiliar chunks.
(c) The p-value of .003 indicates that there is a .003 probability that those receiving the letters in familiar chunks will perform better than those receiving the letters in unfamiliar chunks.
3) Another study of housing prices (in thousands of dollars) found the equation (for
predicted price = 30.15 + .0695 sq ft. r2 = 56.4%, p-value < .001
(a) The sample slope coefficient reveals that a house’s price goes up by $69.50 for each additional square foot of size.
(b) If the technical conditions are met, the very small p-value suggests that there is no linear association between a house’s size and its price.
(c) If the technical conditions are met and if the p-value had been larger than .10, you could have concluded that the sample data provide strong evidence that there is no association between a house’s size and its price.
(d) Adding square footage to a house causes the price to increase by $69.50 on average.
4) Suppose we want to predict hiking time from hiking distance for Day Hikes in San Luis Obispo County and find predicted time = -1.27 + 31.5 distance with r2 = .838. Identify at least one problem with each of the following interpretations.
(a) The slope shows that for each additional minute, we predict the hike is 31.5 miles longer.
(b) The slope shows that a hiker’s time increases by 31.5 minutes for each additional mile.
(c) The predicted time for a 5-mile hike is about 156.
(d) About 84% of hikes have times that are correctly predicted by the line.
(e) About 84% of the variability in hikes is explained by time.
Sample Problems from a previous Stat 512 Final Exam (different instructor)
1. The MINITAB output that follows resulted from taking observations on the percentage of body fat taken from teenagers defined as clinically obese.
T Confidence Intervals
Variable   N   Mean  StDev  SE Mean  99.0 % CI
Body Fat  18  28.95   4.53     1.07  ( 25.86, 32.05)
a) Define the parameter(s) of interest.
b) Is the value of 28.95 a parameter or a statistic? Explain.
c) True or false:
i) For this interval to be valid, it is necessary that the population of percentages of body fat be normally distributed.
ii) A larger sample would make the population distribution more normal.
iii) For this interval to be valid, it is necessary that the population standard deviation of the percentages of body fat be known.
iv) 99% of the time, the mean percentage of body fat will fall between 25.86 and 32.05.
v) You can be absolutely sure that the mean percentage of body fat is between 25.86 and 32.05.
vi) It is possible that the mean percentage of body fat of the population is not between 25.86 and 32.05.
vii) 99% of all intervals created in this fashion will contain the mean percentage of body fat.
viii) A 95% confidence interval obtained from the same data would be wider.
ix) If a two-tailed hypothesis test of H0: μ = 30 were performed using these data, H0 would be rejected.
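For practice reading this kind of output, the interval Minitab reports can be rebuilt by hand as mean ± t* × SE. The Python sketch below uses only the summary numbers printed in the output. The critical value is taken from a standard t table (df = 17, 99% confidence) rather than computed, and small differences in the last digit come from rounding in the printed summary.

```python
from math import sqrt

n, mean, st_dev = 18, 28.95, 4.53  # N, Mean, StDev from the Minitab output
T_STAR = 2.898                     # t-table value for df = n - 1 = 17, 99% confidence

se_mean = st_dev / sqrt(n)         # standard error of the sample mean
margin = T_STAR * se_mean          # margin of error
ci = (mean - margin, mean + margin)

print(f"SE Mean = {se_mean:.2f}")
print(f"99% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because the critical value multiplies the standard error directly, swapping in the smaller 95% value from the same t table is a quick way to check any claim about how the interval width changes with the confidence level.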
2. “Photo-volume
and weight tables for
The regression equation is
Field = 34.4 + 1.14 Photo
Predictor      Coef  SE Coef      T      P
Constant      34.37    72.64   0.47  0.642
Photo       1.13710  0.08953  12.70  0.000
S = 209.8   R-Sq = 90.0%   R-Sq(adj) = 89.4%
Analysis of Variance
Source          DF       SS       MS       F      P
Regression       1  7099245  7099245  161.31  0.000
Residual Error  18   792170    44009
Total           19  7891416
Unusual Observations
Obs  Photo   Field     Fit  SE Fit  Residual  St Resid
 10   1944  2000.0  2244.9   127.5    -244.9    -1.47 X
 17    957  1876.0  1122.6    55.8     753.4     3.73R
R denotes an observation with a large standardized residual
X denotes an observation whose X value gives it large influence.
a) From the output, read or calculate the values of the following.
i) The y-intercept.
ii) The estimate of the average change in field volume for an increase of one in photo volume.
iii) The quantity that the least squares line minimizes.
iv) The standard deviation in the estimate of the slope.
v) The quantity that measures the proportion of error removed from the estimation of field volume by using a linear regression model with photo volume rather than using the sample mean field volume as the estimate.
vi) The sample correlation coefficient.
b) Test to see if there is a linear relationship between photo volume and field volume.
i) Define the parameter of interest.
ii) What are the hypotheses?
iii) Give the values from the MINITAB output of the two test statistics that may be used to perform the test.
iv) Reach and justify a decision at α = .05. Provide an interpretation of the decision.
3. For a class project, a student took a sample of students and determined their age, gender, whether they belonged to a fraternity or sorority, how many years they had been attending college, and the number of alcoholic drinks they had in a week. A regression analysis by MINITAB resulted in the following output.
The regression equation is
num.drinks/week = 2.6 + 1.01 age - 0.01 sex - 9.67 frat/sor - 0.10 yr. in school
Predictor     Coef  SE Coef      T      P
Constant      2.59    12.48   0.21  0.838
age         1.0091   0.7192   1.40  0.181
sex         -0.013    1.671  -0.01  0.994
frat/sor    -9.666    1.811  -5.34  0.000
yr. in s    -0.102    1.164  -0.09  0.931
S = 3.571   R-Sq = 66.0%   R-Sq(adj) = 56.9%
Analysis of Variance
Source          DF      SS     MS     F      P
Regression       4  371.26  92.81  7.28  0.002
Residual Error  15  191.29  12.75
Total           19  562.55
a) Perform a test to see if at least one of the predictors aids in the prediction of the response.
i) What are the hypotheses?
ii) Make and justify your decision at α = .05 and provide an interpretation.
b) Perform a test to decide if sex is a significant part of the current model.
i) Give the hypotheses
ii) Make and justify your decision at α = .05 and provide an interpretation.
c) The student believed that whether a student belonged to a fraternity or sorority was the only variable that would predict alcoholic ingestion, and that the other variables would not be necessary components of the model. Based on the t-tests and their associated p-values (use α = .05), does the student have sufficient justification for her claim? Explain.
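As a study aid, the Python sketch below checks how the pieces of the multiple regression output for problem 3 fit together. Every input is copied from that output, and the variable names are assumptions for illustration. It does not carry out the tests asked for above.

```python
# Sums of squares and degrees of freedom from the Analysis of Variance table.
ss_reg, df_reg = 371.26, 4
ss_err, df_err = 191.29, 15
ss_total, df_total = 562.55, 19

ms_reg = ss_reg / df_reg  # mean square for regression (printed as 92.81)
ms_err = ss_err / df_err  # mean square error (printed as 12.75)
f_stat = ms_reg / ms_err  # overall F (model utility) statistic
s = ms_err ** 0.5         # S, the typical size of a residual

r_sq = ss_reg / ss_total  # R-Sq: proportion of variability explained
# R-Sq(adj) penalizes extra predictors by comparing mean squares, not sums of squares.
r_sq_adj = 1 - (ss_err / df_err) / (ss_total / df_total)

# Each individual t statistic is Coef / SE Coef, shown here for frat/sor.
t_frat = -9.666 / 1.811

print(f"F = {f_stat:.2f}, S = {s:.3f}")
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
print(f"t for frat/sor = {t_frat:.2f}")
```

The gap between R-Sq and R-Sq(adj) reflects the four predictors relative to only 20 observations, which is why R2(adj) is the fairer number for comparing models of different sizes.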
Additional Review Problems
1) Suppose that instructors A, B, and C are each teaching three large sections of a course, and each instructor wants to study whether the mean exam scores differ significantly across the three sections. Suppose that each takes a random sample of ten students, and calculates the following descriptive statistics:
| | A1 | A2 | A3 | B1 | B2 | B3 | C1 | C2 | C3 |
| Sample size | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| Sample mean | 50 | 60 | 70 | 50 | 60 | 70 | 57 | 60 | 63 |
| Sample std dev | 24 | 24 | 24 | 5 | 5 | 5 | 5 | 5 | 5 |
(a) Based on these statistics, which instructor has the strongest evidence that the mean scores differ significantly across his/her three sections? Which has the least evidence? Explain your answers.
2) Consider the following four data sets, each consisting of four (x, y) data points:
A: (1,3) (2,5) (3,6) (4,8) B: (1,4) (2,7) (3,2) (4,4)
C: (1,8) (2,6) (3,2) (4,3) D: (1,5) (2,3) (3,5) (4,2)
Based on the changes in the x and y values, arrange these data sets in order from the most negative correlation to the most positive. Explain your reasoning.
3) An article in the May 24, 2004 issue of Sports Illustrated raised two separate questions about seven-game series in professional team sports. One question concerns the proportion of seven-game series that have gone to the full length of seven games. The article reported that through the year 2003, 44 of 131 (34%) series went to the full length in baseball, compared to 111 of 471 (24%) in hockey and 85 of 303 (28%) in basketball.
(a) Conduct a chi-square analysis of whether these percentages differ more than would be expected by random variation. Begin with graphical displays and numerical summaries, and then proceed to a chi-square test. Which type of chi-square test did you do? Summarize your conclusions.
(b) Comment on whether these data come from random samples or from randomization to groups, or whether the randomness is hypothetical here.
(c) The other question posed by the article compares the proportion of “game sevens” that are won by the home team across these sports. The article reported that 23 of 44 (52%) were won by the home team in baseball, compared to 70 of 111 (63%) in hockey and 70 of 85 (82%) in basketball. Analyze these data to assess whether they provide evidence that the three proportions differ significantly, and write a paragraph or two summarizing your conclusions.
4) A student conducted a study to examine the ages of people who joined a local health club. The participants were chosen by systematically sampling the men and the women who joined the health club in August and September 2004. The data are in the Minitab worksheet GymMembership.mtw.
Analyze the data to compare the mean ages of the men and women and also the mean ages of those who joined the club in August and in September, conditional on the other variable. Produce and comment on numerical and graphical summaries, state hypotheses, and check the technical conditions for each procedure. Also comment on whether there appears to be a statistically significant interaction between gender and month joined and which factor (gender or month) appears to be more strongly related to the ages of new members.
5) In a “matched pairs” experiment, each subject receives both treatments, in random order. This allows us to see if the treatment is consistently effective, comparing each person to themselves, instead of across individuals, allowing a more direct comparison and “controlling for” the person to person variability. To analyze the data, we just take the differences in the results for each person and see if the average difference is significantly different from zero. A “repeated measures” design is just this idea extended to 2 or more treatments. “Blocking” is the same logic, but we group experimental units that are very similar to each other instead of using the same unit more than once. We have still minimized the variability for trying to detect the treatment effect itself. In both these cases, we include “subjects” or “blocks” as one of the variables in an ANOVA analysis (assuming a quantitative response).
Researchers who are studying a new shampoo formula plan to compare the condition of hair for people who use the new formula with the condition of hair for people who use the current formula. Twelve volunteers are available to participate in this study. Information on these volunteers (numbered 1-12) is shown in the table below.
| Volunteer | Gender | Age |
| 1 | Male | 21 |
| 2 | Female | 20 |
| 3 | Male | 47 |
| 4 | Female | 60 |
| 5 | Female | 62 |
| 6 | Male | 61 |
| 7 | Male | 58 |
| 8 | Female | 44 |
| 9 | Male | 44 |
| 10 | Female | 24 |
| 11 | Male | 23 |
| 12 | Female | 46 |
(a) The researchers want to conduct an experiment involving the two formulas (new and current) of shampoo. They believe that the condition of hair changes with age but not gender. Because researchers want the size of the blocks in an experiment to be equal to the number of treatments, they will use blocks of size 2 in their experiment. Identify the volunteers (by number) that would be included in each of the six blocks and give the criteria you used to form the blocks.
(b) Other researchers believe that hair condition differs with both age and gender. These researchers will also use blocks of size 2 in their experiment. Identify the volunteers (by number) that would be included in each of the six blocks and give the criteria you used to form the blocks.
(c) The researchers in (b) decide to select three of the six blocks to receive the new formula and to give the other three blocks the current formula. Is this an appropriate way to assign treatments? If so, describe a method for selecting the three blocks to receive the new formula. If not, describe an appropriate method for assigning treatments.
Review problems from text
1) p. 236, problem 7; p. 244, problems 43, 47
2) p. 287-8, problem 5
3) p. 351, problems 33, 39
4) p. 434, problem 45
5) p. 481, problem 17