This week, we introduce the idea of comparing two groups to evaluate whether the data we have sampled lead us to believe that the population distributions of the random variables are different. Of course, because we don't get access to the function that describes each random variable, we can't actually know whether the populations are different. It is for this reason that we call it statistical inference – we are inferring from a sample some belief about the populations.
At the conclusion of this week, students will be able to:
Recognize the similarities between all frequentist hypothesis tests.
Evaluate the conditions that surround the data, and choose a test that is best-powered and justifiable.
Perform and interpret the results of the most common statistical tests.
7.2 Class Announcements
Great work completing your final w203 test!
Unit 7 Homework is Group Homework, due next week.
The Hypothesis Testing Lab is released today!
The lab is due at the Unit 09 live session (in two weeks): apply these fundamentals to analyze 2022 election data and write a single, three-page analysis.
7.3 Roadmap
7.3.1 Rearview Mirror
Statisticians create a population model to represent the world
A population model has parameters we are interested in
Ex: A parameter might represent the effect that a vitamin has on test performance
A null hypothesis is a specific statement about a parameter
Ex: The vitamin has zero effect on performance
A hypothesis test is a procedure for rejecting or not rejecting a null, such that the probability of a type I error is constrained.
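To make that type I error guarantee concrete, here is a minimal simulation sketch (our own illustration, not part of the course materials): if the null is true and we reject whenever p < 0.05, we should reject in roughly 5% of repeated samples.

# Simulate the type I error rate of a t-test when the null is true.
# (Sketch; the normal data and the 0.05 threshold are illustrative choices.)
set.seed(203)

p_values <- replicate(10000, {
  x <- rnorm(n = 30, mean = 0)  # data generated under the null: true mean is 0
  t.test(x, mu = 0)$p.value
})

mean(p_values < 0.05)           # should land close to 0.05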
7.3.2 Today
There are often multiple hypothesis tests you can apply to a scenario.
Our primary concern is choosing a test with assumptions we can defend.
Secondarily, we want to maximize power.
7.3.3 Looking ahead
Next week, we start working with models for linear regression
We will see how hypothesis testing is also used for regression parameters.
7.4 Teamwork Discussion
7.4.1 Working on Data Science Teams
Data science is a beautiful combination of teamwork and individual work. It provides the opportunity to work together on a data pipeline with people from all over the organization, and to grapple with technical, statistical, and social questions that are always interesting. While we expect that everyone on a team will be a professional, there is so much range within the pursuit of data science that projects are nearly always collaborative exercises.
Together as teams, we
Define research ambitions and scope
Imagine/envision the landscape of what is possible
Support, discuss, review and integrate individual contributions
As individuals, we conduct the heads-down work that moves question answering forward. This might be reading papers to determine the most appropriate method to bring to bear on the question, researching the data that is available, or understanding the technical requirements that we have to meet for an answer to be useful to our organization in real time.
What is your instructor uniquely capable of? Let them tell you! But, at the same time, what would they acknowledge that others do better than they do?
See, the thing is, because there is so much that has to be done, there are very, very few people who are one-stop data science shops. Instead, teams rely on collaboration and joint expertise to get good work done.
7.4.2 The Problematic Psychology of Data Science
People talk about impostor syndrome: a feeling of inadequacy or interloping that is sometimes also associated with a fear of under-performing relative to the expectations of others on the team. These emotions are common throughout data science, academia, and computer science. But these types of emotions are also commonplace in journalism, film-making, and public speaking.
Has anybody ever had the dream that they're late to a test? Or that they've got to give a speech they're unprepared for? Does anybody remember playing an instrument as a kid and having to go to recitals? Or playing for a championship on a youth sports team? Or going into a test, too?
What are the feelings associated with those events? What might be generating these feelings?
7.4.3 What Makes an Effective Team?
This reading on effective teams summarizes academic research to argue:
What really matters to creating an effective team is less about who is on the team, and more about how the team works together.
In your live session, your section might take 7 minutes to read this brief. If so, please read the following sections:
The problem statement;
The proposed solution;
The framework for team effectiveness, stopping at the section titled “Tool: Help teams determine their own needs.”
“Psychological safety refers to an individual’s perception of the consequences of taking an interpersonal risk. It is a belief that a team is safe for risk taking in the face of being seen as ignorant, incompetent, negative, or disruptive.”
“In a team with high psychological safety, teammates feel safe to take risks around their team members. They feel confident that no one on the team will embarrass or punish anyone else for admitting a mistake, asking a question, or offering a new idea.”
7.4.4 We All Belong
From your experience, can you give an example of taking a personal risk as part of a team?
Can you describe your emotions when contemplating this risk?
If you did take the risk, how did the reactions of your teammates affect you?
Knowing the circumstances that generate feelings of anxiety – what steps can we take as a section, or a team, to recognize and respond to these circumstances?
How can you add to the psychological safety of your peers in the section and lab teammates?
7.5 Team Kick-Off
Lab 1 Teams
Here are teams for Lab 1!
Team Kick-Off Conversation
In a 10 minute breakout with your team, please discuss the following questions:
How much time will you invest in the lab each week?
How often will you meet and for how long?
How will you discuss, review, and integrate individual work into the team deliverable?
What do you see as the biggest risks when working on a team? How can you contribute to an effective team dynamic?
7.6 A Quick Review
Review of Key Terms
Define each of the following:
Population Parameter
Null Hypothesis
Test Statistic
Null Distribution
Comparing Groups Review
Take a moment to recall the tests you learned this week. Here is a quick cheat-sheet to their key assumptions.
unpaired
parametric: unpaired t-test (metric var, i.i.d., (not too un-)normal)
non-parametric: Wilcoxon rank-sum (ordinal var, i.i.d.)
paired
parametric: paired t-test (metric var, i.i.d., (not too un-)normal)
non-parametric: Wilcoxon signed-rank (metric var, i.i.d., difference is symmetric)
non-parametric: sign test (ordinal var, i.i.d.)
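As a quick reference, these tests map onto base-R calls roughly as follows (a sketch; the vectors below are hypothetical placeholders, not GSS data):

# Hypothetical metric measurements from two unpaired groups.
x <- rnorm(40, mean = 10)
y <- rnorm(40, mean = 11)

t.test(x, y, paired = FALSE)               # unpaired (Welch) t-test
wilcox.test(x, y, paired = FALSE)          # Wilcoxon rank-sum

# Hypothetical paired measurements (e.g., before/after on the same units).
before <- rnorm(40, mean = 10)
after  <- before + rnorm(40, mean = 0.5)

t.test(before, after, paired = TRUE)       # paired t-test
wilcox.test(before, after, paired = TRUE)  # Wilcoxon signed-rank
binom.test(sum(after > before), n = 40)    # sign test, assuming no ties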
7.7 Rank Based Tests
Darrin Speegle has a nice walk-through of rank-based testing procedures, linked here. We'll talk through a few examples of this, and then move to estimating against data for the class.
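To see what a rank-based procedure is actually computing, here is a small sketch (our own illustration, not from the linked talk) that builds the rank-sum statistic by hand and checks it against wilcox.test:

# Two small hypothetical samples.
x <- c(1.2, 3.4, 2.2, 5.1)
y <- c(2.0, 6.3, 4.4, 7.0)

# Rank the pooled data, then sum the ranks that belong to x.
pooled_ranks <- rank(c(x, y))
rank_sum_x   <- sum(pooled_ranks[seq_along(x)])

# R's wilcox.test reports W = rank_sum_x - n_x * (n_x + 1) / 2.
rank_sum_x - length(x) * (length(x) + 1) / 2
wilcox.test(x, y)$statistic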
7.8 Comparing Groups R Exercise
The General Social Survey (GSS) is one of the longest-running and most extensive survey projects in the US. The full dataset includes over 1,000 variables spanning demographics, attitudes, and behaviors. The file GSS_w203.RData contains a small selection of variables from the 2018 GSS.
To learn about each variable, you can enter it into the search bar at the GSS Data Explorer.
load('data/GSS_w203.RData')
summary(GSS)
rincome happy sexnow
$25000 or more: 851 very happy : 701 women :758
$20000 - 24999: 107 pretty happy :1307 man :640
$10000 - 14999: 94 not too happy: 336 transgender : 2
$15000 - 19999: 61 DK : 0 a gender not listed here: 1
lt $1000 : 33 IAP : 0 Don't know : 0
(Other) : 169 NA : 0 (Other) : 0
NA's :1033 NA's : 4 NA's :947
wwwhr emailhr socrel
Min. : 0.00 Min. : 0.000 sev times a week:382
1st Qu.: 3.00 1st Qu.: 0.000 sev times a mnth:287
Median : 8.00 Median : 2.000 once a month :259
Mean : 13.91 Mean : 7.152 sev times a year:240
3rd Qu.: 20.00 3rd Qu.: 10.000 almost daily :217
Max. :140.00 Max. :100.000 (Other) :171
NA's :986 NA's :929 NA's :792
socommun numpets tvhours
never :510 Min. : 0.000 Min. : 0.000
once a month :243 1st Qu.: 0.000 1st Qu.: 1.000
sev times a week:219 Median : 1.000 Median : 2.000
sev times a year:196 Mean : 1.718 Mean : 2.938
sev times a mnth:174 3rd Qu.: 2.000 3rd Qu.: 4.000
(Other) :215 Max. :20.000 Max. :24.000
NA's :791 NA's :1201 NA's :793
major1 owngun
business administration: 138 yes :537
education : 79 no :993
engineering : 54 refused: 39
nursing : 51 DK : 0
health : 42 IAP : 0
(Other) : 546 NA : 0
NA's :1438 NA's :779
You have a set of questions that you would like to answer with a statistical test. For each question:
Choose the most appropriate test.
List and evaluate the assumptions for your test.
Conduct your test.
Discuss statistical and practical significance.
7.9 The Questions
7.9.1 Set 1
Do economics majors watch more or less TV than computer science majors?
GSS %>%
  filter(major1 %in% c('computer science', 'economics')) %>%
  ggplot() +
  aes(x = tvhours, fill = major1) +
  geom_histogram(bins = 10, position = 'dodge')
What kinds of tests could be reasonable to conduct? For what part of the data would we conduct these tests?
## The assumptions about the data drive us to the correct test.
## But, let's ask all the tests that could *possibly* make sense, and see how
## matching or mis-matching assumptions changes what we learn.
## Answers are in the next chunk... but don't jump to them right away.
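Here is one hedged sketch of that comparison (these variable and test choices are our own, not the official answer chunk):

# Restrict to the two majors of interest and keep the outcome variable.
tv_by_major <- GSS %>%
  filter(major1 %in% c('computer science', 'economics')) %>%
  select(major1, tvhours) %>%
  drop_na() %>%
  droplevels()

# Candidate 1: Welch t-test -- treats tvhours as metric and leans on the CLT.
t.test(tvhours ~ major1, data = tv_by_major)

# Candidate 2: Wilcoxon rank-sum -- requires only an ordinal outcome.
wilcox.test(tvhours ~ major1, data = tv_by_major)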
Do Americans with pets watch more or less TV than Americans without pets?
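For the pets question, one hedged sketch (coding pet ownership as numpets > 0 is our own modeling choice):

# Code pet ownership as a binary indicator from the numpets count.
pets_tv <- GSS %>%
  select(numpets, tvhours) %>%
  drop_na() %>%
  mutate(has_pets = numpets > 0)

# tvhours is heavily skewed, so a rank-based test is easy to defend...
wilcox.test(tvhours ~ has_pets, data = pets_tv)

# ...though with a sample this large a Welch t-test is also commonly run.
t.test(tvhours ~ has_pets, data = pets_tv)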
7.9.2 Set 2
Do Americans spend more time emailing or using the web?
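The output below is consistent with a paired call along these lines (a reconstruction; the generating chunk was not shown):

# Paired comparison: each respondent contributes both a www and an email value.
t.test(x = GSS$wwwhr, y = GSS$emailhr, paired = TRUE)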
Paired t-test
data: wwwhr and emailhr
t = 13.44, df = 1360, p-value < 2.2e-16
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
5.530219 7.420553
sample estimates:
mean difference
6.475386
GSS %>%
  ggplot() +
  geom_histogram(aes(x = wwwhr), fill = 'darkblue', alpha = 0.5) +
  geom_histogram(aes(x = emailhr), fill = 'darkred', alpha = 0.5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
t.test(x = GSS$wwwhr, y = GSS$emailhr, paired = FALSE)
Welch Two Sample t-test
data: GSS$wwwhr and GSS$emailhr
t = 12.073, df = 2398.5, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
5.657397 7.851614
sample estimates:
mean of x mean of y
13.906021 7.151515
Do Americans spend more evenings with neighbors or with relatives?
wilcox_test_data <- GSS %>%
  select(socrel, socommun) %>%
  mutate(
    family_ordered = factor(
      x = socrel,
      levels = c('almost daily', 'sev times a week', 'sev times a mnth',
                 'once a month', 'sev times a year', 'once a year', 'never')),
    friends_ordered = factor(
      x = socommun,
      levels = c('almost daily', 'sev times a week', 'sev times a mnth',
                 'once a month', 'sev times a year', 'once a year', 'never'))
  )
To begin this investigation, we’ve got to look at the data and see what is in it. If you look below, you’ll note that it sure seems that people are spending more time with their family… erp, actually no. They’re “hanging out” with their friends rather than taking their mother out to dinner.
wilcox_test_data %>%
  select(friends_ordered, family_ordered) %>%
  rename(Friends = friends_ordered, Family = family_ordered) %>%
  drop_na() %>%
  pivot_longer(cols = c(Friends, Family)) %>%
  ggplot() +
  aes(x = value, fill = name) +
  geom_histogram(stat = 'count', position = 'dodge', alpha = 0.7) +
  scale_fill_manual(values = c('#003262', '#FDB515')) +
  labs(
    title    = 'Do Americans Spend Time With Friends or Family?',
    subtitle = 'A cutting analysis.',
    fill     = 'Friends or Family',
    x        = 'Amount of Time Spent') +
  scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
  theme_minimal()
Warning in geom_histogram(stat = "count", position = "dodge", alpha = 0.7):
Ignoring unknown parameters: `binwidth`, `bins`, and `pad`
With this plot created, we can ask whether what we observe is the product of what could just be sampling error, or something that would have been unlikely to arise if the null hypothesis were true. What is the null hypothesis? Well, let's suppose that, not knowing anything about the data, we would expect there to be no difference between the amount of time spent with friends or family.
## risky choice -- casting the factor to a numeric without checking what happens.
wilcox_test_data %$%
  wilcox.test(
    x = as.numeric(family_ordered),
    y = as.numeric(friends_ordered),
    paired = FALSE
  )
Wilcoxon rank sum test with continuity correction
data: as.numeric(family_ordered) and as.numeric(friends_ordered)
W = 716676, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0
7.9.3 Set 3
Are Americans who own guns more likely to have pets than Americans who don't own guns?
Are Americans with pets happier than Americans without pets?
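Both Set 3 questions compare groups on a non-metric outcome; here is a hedged sketch (the binary codings of numpets and owngun are our own choices):

# Question 1: gun owners vs. non-owners on pet ownership -- two proportions.
guns_pets <- GSS %>%
  filter(owngun %in% c('yes', 'no')) %>%
  select(owngun, numpets) %>%
  drop_na() %>%
  droplevels() %>%
  mutate(has_pets = numpets > 0)

prop.test(table(guns_pets$owngun, guns_pets$has_pets))

# Question 2: happiness is ordinal, so use a rank-based comparison after
# making the ordering of the happy variable explicit.
pets_happy <- GSS %>%
  filter(happy %in% c('not too happy', 'pretty happy', 'very happy')) %>%
  select(numpets, happy) %>%
  drop_na() %>%
  mutate(
    has_pets  = numpets > 0,
    happy_num = as.numeric(factor(
      happy, levels = c('not too happy', 'pretty happy', 'very happy')))
  )

wilcox.test(happy_num ~ has_pets, data = pets_happy)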
7.9.4 Apply to a New Type of Data
Is there a relationship between college major and gun ownership?
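Both variables here are unordered categories with more than two levels, so neither a t-test nor a rank-based test fits; a chi-squared test of independence is the natural candidate. A sketch (collapsing owngun to yes/no is our own choice):

# Test whether major and gun ownership are independent.
major_guns <- GSS %>%
  filter(owngun %in% c('yes', 'no')) %>%
  select(major1, owngun) %>%
  drop_na() %>%
  droplevels()

# With many sparse majors, expected cell counts may be small; if chisq.test
# warns, simulate.p.value = TRUE is one defensible fallback.
chisq.test(table(major_guns$major1, major_guns$owngun))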
7.10.1 Should we use a t-test or a Wilcoxon signed-rank?
There is some open discussion in the applied statistics literature about whether we should ever be using a t-test. In particular, if the underlying distribution that generates the data is not normal, then the assumptions of a t-test are not, technically, satisfied, and the test does not produce results that have nominal p-value coverage. This is both technically and theoretically true; and yet researchers, data scientists, your instructors, and the entire world run t-tests as the "test of first recourse."
What is the alternative to conducting a t-test as the test of first recourse? It might be the Wilcoxon test. The Wilcoxon test makes a weaker assumption than normality: only that the distribution is symmetric around its median.
Additional points of argument, which you will investigate in this worksheet:
If the underlying data is normal, then the Wilcoxon test is nearly as well powered as the t-test.
If the underlying data is not normal, then the Wilcoxon test still maintains nominal p-value coverage, whereas the t-test might lose this guarantee.
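A minimal simulation sketch of the coverage claim (our own illustration; the sample size and the exponential distribution are arbitrary choices):

# Compare t-test and Wilcoxon rejection rates under the null, at alpha = 0.05.
set.seed(203)

rejection_rates <- function(rdata, n_sims = 5000, n = 15) {
  results <- replicate(n_sims, {
    x <- rdata(n)
    y <- rdata(n)
    c(t      = t.test(x, y)$p.value < 0.05,
      wilcox = wilcox.test(x, y)$p.value < 0.05)
  })
  rowMeans(results)
}

rejection_rates(function(n) rnorm(n))  # normal data: both should sit near 0.05
rejection_rates(function(n) rexp(n))   # skewed data: check whether the t-test drifts

# Power under a shifted alternative can be probed the same way, e.g., by
# generating y as rdata(n) + 0.5 inside replicate().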
7.11
7.11.1 The Poisson Distribution
The Poisson distribution has the following PMF:
\[
f_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}
\]
The key shape parameter for a Poisson distribution is \(\lambda\); we show three different distributions, setting this shape parameter to 1, 3, and 30, respectively. Notice that the limits on these plots are not set to be the same; for example, the range in the third plot is considerably larger than in the first.
pois_lambda_1  <- rpois(n = 1000, lambda = 1)
pois_lambda_3  <- rpois(n = 1000, lambda = 3)
pois_lambda_30 <- rpois(n = 1000, lambda = 30)

# berkeley_blue, berkeley_gold, and berkeley_sather are color constants
# defined elsewhere in these notes; stacking with `/` uses patchwork.
plot_1  <- ggplot() + aes(x = pois_lambda_1)  + geom_histogram(bins = 6,  fill = berkeley_blue)
plot_3  <- ggplot() + aes(x = pois_lambda_3)  + geom_histogram(bins = 10, fill = berkeley_gold)
plot_30 <- ggplot() + aes(x = pois_lambda_30) + geom_histogram(bins = 30, fill = berkeley_sather)

plot_1 / plot_3 / plot_30
What does this changing distribution do to the p-values?
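One way to probe that question is a simulation sketch (our own construction: two equal-size Poisson samples compared with a t-test under the null), checking how far the rejection rate drifts from the nominal 5% when the distribution is most skewed:

# Rejection rate of a nominal 5% t-test on Poisson data, under the null.
set.seed(203)

t_test_rejection <- function(lambda, n = 10, n_sims = 5000) {
  mean(replicate(n_sims, {
    x <- rpois(n, lambda)
    y <- rpois(n, lambda)
    t.test(x, y)$p.value < 0.05
  }))
}

sapply(c(1, 3, 30), t_test_rejection)  # skew, and any drift, shrinks as lambda grows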