5.7 Write Code to Demo the Central Limit Theorem (CLT)

When you were reading for this week, did you sense the palpable joy of the authors when they were writing about the central limit theorem?

We now present what is often considered the most profound and important theorem in statistics.

Wow. What excitement.

On its own, the result that, across a broad range of generative processes, the distribution of sample averages converges in distribution to a normal distribution would be a statistical curiosity. Along the lines of “did you know that dinosaurs might have had feathers,” or “avocado trees reproduce using ‘protogynous dichogamy’.” While these factoids might be useful on your quiz-bowl team, they don’t get us very far as practicing data scientists.

However, there is a very useful consequence of this convergence in distribution that we will explore in detail over the coming two weeks: because so many distributions produce sample averages that converge in distribution to a normal distribution, we can put together a testing framework for sample averages that works for an agnostic set of random variables. Wait for that next week, but know that there’s a reason that we’re as excited about this statement as we are.

5.7.1 Part 1

To begin with, we will use fair coins that have an equal probability of landing heads and tails.

  1. Modify the function arguments below so that it conducts one simulation, and in that simulation tosses ten coins, each with an equal probability of landing heads and tails. Look into the toss_coin function: is there a point at which this function produces a sample average? If so, where?
  2. Save the values from the toss_coin function into an object called sample_mean.
Code
# toss_coin()
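The toss_coin function is provided in the course materials; as a point of reference, here is one hypothetical sketch of what such a function might look like, assuming it takes number_of_simulations, number_of_coins, and prob_heads arguments and returns one sample average per simulation (your version may differ):

```r
# Hypothetical sketch of toss_coin -- the course-provided function may
# differ. Each simulation tosses `number_of_coins` coins and records
# the sample average (the proportion of heads).
toss_coin <- function(number_of_simulations = 1,
                      number_of_coins = 10,
                      prob_heads = 0.5) {
  replicate(
    number_of_simulations,
    mean(rbinom(n = number_of_coins, size = 1, prob = prob_heads))
  )
}

# One simulation of ten fair coins: a single sample average.
sample_mean <- toss_coin(number_of_simulations = 1, number_of_coins = 10)
sample_mean
```

Notice where the call to mean() sits in this sketch: that is the kind of place to look when you hunt for the point where a sample average is produced.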

The sample mean is a random variable – it is a function that maps from a random generative process’ sample space (the number of heads shown on the coins) to the real numbers. To make this clear, visualize a larger number of simulations of the toss_coin function. That is, increase number_of_simulations to 10, or 100, and plot a histogram of the results. This is quite similar to what we have done earlier.

Code
# toss_coin()
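As one hypothetical sketch (assuming toss_coin takes number_of_simulations, number_of_coins, and prob_heads arguments and returns a vector of sample means; swap in the course's version if it differs), the histogram might be produced like this:

```r
# Hypothetical sketch; replace with the course's toss_coin if it differs.
toss_coin <- function(number_of_simulations = 1, number_of_coins = 10,
                      prob_heads = 0.5) {
  replicate(number_of_simulations,
            mean(rbinom(n = number_of_coins, size = 1, prob = prob_heads)))
}

# Run 100 simulations, each tossing ten fair coins, and plot the
# distribution of the resulting sample means.
sample_means <- toss_coin(number_of_simulations = 100)
hist(sample_means,
     breaks = seq(0, 1, by = 0.1),
     main = "Sample means of ten fair coin tosses",
     xlab = "Sample mean")
```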

If you replicate the simulation with ten coins enough times, will the distribution ever look normal? Why or why not?

5.7.2 Part 2

For this part, we’ll continue to study a fair coin.

What happens if you change the number of coins that you’re tossing? Here, set number_of_simulations=1000 and alter the number of coins that you’re tossing. Is there a point where this distribution starts to “look normal” to you? (Later in the semester, we’ll formalize a test for this “looks normal” concept.)
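One hypothetical way to set up this sweep (again assuming a toss_coin with number_of_simulations, number_of_coins, and prob_heads arguments; use the course's version in your own work) is to hold the number of simulations at 1000 and vary the number of coins:

```r
# Hypothetical sketch; replace with the course's toss_coin if it differs.
toss_coin <- function(number_of_simulations = 1, number_of_coins = 10,
                      prob_heads = 0.5) {
  replicate(number_of_simulations,
            mean(rbinom(n = number_of_coins, size = 1, prob = prob_heads)))
}

# Compare the distribution of sample means as the number of coins grows.
par(mfrow = c(2, 2))
for (n_coins in c(2, 10, 50, 200)) {
  sample_means <- toss_coin(number_of_simulations = 1000,
                            number_of_coins = n_coins)
  hist(sample_means, main = paste(n_coins, "coins"), xlab = "Sample mean")
}
```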

5.7.3 Part 3

What would happen if the coin were very, very unfair? For this part, study a coin that has prob_heads=0.01. This is an example of a highly skewed random variable.

Start your study here tossing three coins, number_of_coins=3. What does this distribution look like?

What happens as you increase the number of coins that you’re tossing? Is there a point at which the distribution starts to look normal?
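The same sweep works for the unfair coin; here is one hypothetical sketch (using the same assumed toss_coin arguments as before, with the course's version substituted in your own work), starting from number_of_coins=3 and scaling up:

```r
# Hypothetical sketch; replace with the course's toss_coin if it differs.
toss_coin <- function(number_of_simulations = 1, number_of_coins = 10,
                      prob_heads = 0.5) {
  replicate(number_of_simulations,
            mean(rbinom(n = number_of_coins, size = 1, prob = prob_heads)))
}

# A highly unfair coin: heads comes up only 1% of the time.
par(mfrow = c(2, 2))
for (n_coins in c(3, 30, 300, 3000)) {
  sample_means <- toss_coin(number_of_simulations = 1000,
                            number_of_coins = n_coins,
                            prob_heads = 0.01)
  hist(sample_means, main = paste(n_coins, "coins, prob_heads = 0.01"),
       xlab = "Sample mean")
}
```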

5.7.4 Discussion Questions About the CLT

  1. How does the skewness of the population distribution affect the applicability of the Central Limit Theorem? What lesson can you take for your practice of statistics?
  2. Name a variable you would be interested in measuring that has a substantially skewed distribution.
  3. One definition of a heavy tailed distribution is one with infinite variance. For example, you can use the rcauchy command in R to take draws from a Cauchy distribution, which has heavy tails. Do you think a “heavy tails” distribution will follow the CLT? What leads you to this intuition?
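As one way to probe question 3 before answering (a sketch, not part of the assigned code), you can simulate sample means of Cauchy draws and compare the resulting histogram to the coin-toss histograms:

```r
set.seed(1)

# Each replication averages 1000 draws from a Cauchy distribution.
# Does this histogram tighten up around a center the way the
# coin-toss histograms do as the number of tosses grows?
cauchy_means <- replicate(1000, mean(rcauchy(n = 1000)))

hist(cauchy_means,
     main = "Sample means of Cauchy draws",
     xlab = "Sample mean")
```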