13  Reproducible Research

13.1 Learning Objectives

13.2 Class Announcements

13.3 Roadmap

Rearview Mirror

Today

Looking Ahead

13.4 What data science hopes to accomplish

  • As data scientists, our goal is to learn about the world:
    • Theorists and theologians build systems of explanations that are consistent with themselves
    • Analysts build systems of explanations that are consistent with the past
    • Scientists build systems of explanations that usefully predict events, or data, that have not yet been seen

13.5 Learning from Data

  • As data scientists, the way we learn about the world is through the streams of data that real-world events produce
    • Machine processes
    • Political outcomes
    • Customer actions
  • The watershed moment in our field has been the profusion of data available, from many places, that is richer than at any other point in our past.
    • In 251 and 266, we place structure on data series like audio, video, and text that are transcendently rich
    • In 261, we bring together flows of data that are generated at massive scales
    • In 209, we ask, “How can we take data, and produce a new form of it that is most effectively understood by the human visual and interactive mind?”

13.6 Data Science and Statistics

  • So why statistics?
  • And why the way we’ve chosen to approach statistics in 203?

13.7 Why Statistics?: A Closing Argument for Statistics

  • Business, policy, education and medical decisions are made by humans based on data

  • A central task when we observe some pattern in data is to infer whether the pattern will occur in some novel context

  • Statistics, as we practice it in 203, allows us to characterize:

    • What we have seen
    • What we could have seen
    • Whether any guarantees exist about what we have seen
    • What we can infer about the population
  • So that we can either describe, explain or predict behavior.

13.8 Course Goals

13.8.1 Course Section III: Purpose-Driven Models

  • Statistical models are unknowing transformations of data

    • Because they’re built on the foundation of probability, we have certain guarantees about what a model “says”
    • Because they’re unknowing, the models themselves do not know what they say.
  • As data scientists, we bring them alive to achieve our modeling goals

  • In Lab 2, we expanded our ability to parse the world using regression, built a model that accomplishes our goals, and did so in a way that lets us test under a “null” scenario

    • Key insight: regression is little more than conditional averages
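
The key insight above can be checked on a toy example. The sketch below is illustrative only: the data are made up, and `ols_fit` is a hypothetical helper, not course code. With a single binary regressor, the least-squares intercept equals the average of y when x = 0, and the intercept plus the slope equals the average of y when x = 1.

```python
from statistics import mean

# Hypothetical toy data: x is a binary indicator (1 = Sunday, 0 = any other day),
# y is an outcome such as the number of posts. All values are invented.
x = [0, 0, 0, 0, 1, 1, 1]
y = [3.0, 5.0, 4.0, 4.0, 8.0, 6.0, 7.0]


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = ols_fit(x, y)

# Conditional averages of y within each group of x.
mean_y_x0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean_y_x1 = mean(yi for xi, yi in zip(x, y) if xi == 1)

print(f"intercept         = {intercept:.3f}, mean(y | x=0) = {mean_y_x0:.3f}")
print(f"intercept + slope = {intercept + slope:.3f}, mean(y | x=1) = {mean_y_x1:.3f}")
```

With continuous regressors the match is no longer exact; the regression line is then a linear approximation to the conditional averages.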

13.8.2 Course Section II: Sampling Theory and Testing

  • Under very general assumptions, sample averages from large samples approximately follow a predictable, known distribution: the Gaussian distribution
  • This is true, even when the underlying probability distribution is very complex, or unknown!
  • Due to this common distribution, we can produce reliable, general tests!
  • In Lab 1, we computed simple statistics, and used guarantees from sampling theory to test whether the differences we observed were likely to arise under a “null” scenario
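
The claim about sample averages can be explored by simulation. The sketch below is a rough illustration under assumed settings, not a proof: the exponential distribution, the sample size of 50, and the seed are arbitrary choices. It draws many samples from a skewed distribution, standardizes each sample average, and checks how often the result falls within 1.96 of zero, which would happen about 95% of the time for a Gaussian.

```python
import math
import random

random.seed(203)  # arbitrary seed so the sketch is repeatable

N = 50        # observations per sample (hypothetical choice)
REPS = 5000   # number of simulated samples
MU = 1.0      # mean of an Exponential(rate=1) draw
SIGMA = 1.0   # standard deviation of an Exponential(rate=1) draw


def standardized_mean():
    """Draw one sample from a skewed distribution and standardize its average."""
    sample = [random.expovariate(1.0) for _ in range(N)]
    x_bar = sum(sample) / N
    return (x_bar - MU) / (SIGMA / math.sqrt(N))


z_values = [standardized_mean() for _ in range(REPS)]

# If the averages are close to Gaussian, about 95% of z-values lie within +/-1.96.
share_within = sum(abs(z) <= 1.96 for z in z_values) / REPS
print(f"Share of standardized means within 1.96: {share_within:.3f}")
```

Shrinking `N` to a handful of observations should make the approximation visibly worse, which is why the guarantee is a large-sample one.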

13.8.3 Course Section I: Probability Theory

  • Probability theory
    • Underlies modeling and regression (Part III);
    • Underlies sampling, inference, and testing (Part II)
    • Underlies every model built in every corner of data science

We can:

  • Model the complex world that we live in using probability theory;
  • Move from a probability density function that is defined in terms of a single variable, into a function that is defined in terms of many variables
  • Compute useful summaries, such as the BLP (best linear predictor), expected value, and covariance, even with highly complex probability density functions.
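
As a small worked illustration of these summaries, the sketch below computes expected values, the covariance, and the BLP of Y given X. It makes simplifying assumptions: the joint distribution is a made-up discrete probability mass function rather than a density, and the names `joint_pmf` and `expect` are hypothetical. The BLP uses the standard formulas slope = Cov[X, Y] / V[X] and intercept = E[Y] - slope · E[X].

```python
# Hypothetical joint pmf over (x, y) pairs; the probabilities sum to 1.
joint_pmf = {
    (0, 0): 0.3,
    (0, 1): 0.2,
    (1, 0): 0.1,
    (1, 1): 0.4,
}


def expect(g):
    """Expected value of g(X, Y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint_pmf.items())


e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - e_x) ** 2)
cov_xy = expect(lambda x, y: (x - e_x) * (y - e_y))

# Best linear predictor of Y given X: blp(x) = blp_intercept + blp_slope * x
blp_slope = cov_xy / var_x
blp_intercept = e_y - blp_slope * e_x

print(f"E[X] = {e_x:.2f}, E[Y] = {e_y:.2f}, Cov[X, Y] = {cov_xy:.2f}")
print(f"BLP(x) = {blp_intercept:.2f} + {blp_slope:.2f} * x")
```

Because X is binary in this toy example, the BLP coincides with the conditional expectation of Y given X; with more values of X it would in general only approximate it.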

13.8.4 Statistics as a Foundation for MIDS

  • In w203, we hope to have laid a foundation in probability that can be used not only in statistical applications, but also in every other machine learning application that you are likely to encounter

13.9 Reproducibility Discussion

Green Jelly Beans

What went wrong here?

13.9.1 Discussion

Status Update: You have a dataset of the number of Facebook status updates by day of the week. You run 7 different t-tests: one for posts on Monday (versus all other days), one for Tuesday (versus all other days), etc. Only the test for Sunday is significant, with a p-value of .045, so you throw out the other tests.

Should you conclude that Sunday has a significant effect on number of posts? (How can you address this situation responsibly when you publish your results?)
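
One way to frame the seven-test scenario is with a little arithmetic. The sketch below is a simplified illustration, not a full answer: it assumes the seven tests are independent (they are not exactly, since each day is compared against the others) and uses the Bonferroni correction as one of several possible adjustments. Under those assumptions, the chance of at least one false positive across seven tests is roughly 30%, and a p-value of .045 does not clear the adjusted cutoff.

```python
ALPHA = 0.05       # per-test significance level
NUM_TESTS = 7      # one test per day of the week
P_SUNDAY = 0.045   # the p-value reported for Sunday

# Chance of at least one false positive if all seven null hypotheses are true.
# Assumes independent tests, which is a simplification in this setting.
familywise_error = 1 - (1 - ALPHA) ** NUM_TESTS

# Bonferroni correction: compare each p-value to ALPHA / NUM_TESTS.
bonferroni_cutoff = ALPHA / NUM_TESTS
sunday_survives = P_SUNDAY < bonferroni_cutoff

print(f"Chance of at least one false positive: {familywise_error:.3f}")
print(f"Bonferroni cutoff per test: {bonferroni_cutoff:.4f}")
print(f"Sunday significant after correction: {sunday_survives}")
```

Reporting all seven tests alongside the corrected threshold is one responsible way to publish such a result.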

Such Update: As before, you have a dataset of the number of Facebook status updates by day of the week. You do a little EDA and notice that Sunday seems to have more “status updates” than all other days, so you recode your “day of the week” variable into a binary one: Sunday = 1, All other days = 0. You run a t-test and get a p-value of .045. Should you conclude that Sunday has a significant effect on number of posts?
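
The effect of choosing the comparison after looking at the data can be simulated. The sketch below is an illustration under assumed settings: the counts are drawn from one normal distribution for every day (so no day has a true effect), the numbers 52, 100, and 15 are arbitrary, and the test is a large-sample z-approximation rather than an exact t-test. It picks the day with the highest average, tests it against the rest, and records how often the result looks significant at the 5% level.

```python
import random

random.seed(203)  # arbitrary seed so the sketch is repeatable

DAYS = 7
WEEKS = 52          # hypothetical: one year of daily counts
REPS = 2000
CRITICAL_Z = 1.96   # two-sided 5% cutoff under a normal approximation


def sample_mean(values):
    return sum(values) / len(values)


def sample_var(values):
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def peek_then_test():
    """Simulate null data, pick the highest-mean day, then test it vs the rest."""
    # Every day shares the same distribution, so no day has a true effect.
    data = [[random.gauss(100, 15) for _ in range(WEEKS)] for _ in range(DAYS)]
    best = max(range(DAYS), key=lambda d: sample_mean(data[d]))
    chosen = data[best]
    rest = [v for d in range(DAYS) if d != best for v in data[d]]
    # Welch-style statistic compared to a normal cutoff (large-sample approximation).
    se = (sample_var(chosen) / len(chosen) + sample_var(rest) / len(rest)) ** 0.5
    z = (sample_mean(chosen) - sample_mean(rest)) / se
    return abs(z) > CRITICAL_Z


false_positive_rate = sum(peek_then_test() for _ in range(REPS)) / REPS
print(f"False positive rate after peeking: {false_positive_rate:.3f}")
```

If the test were chosen before seeing the data, this rate would sit near the nominal 5%; selecting the most extreme day first pushes it well above that.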

Sunday Funday: Suppose Researcher A tests whether Monday has an effect (versus all other days), Researcher B tests Tuesday (versus all other days), and so forth. Only Researcher G, who tests Sunday, finds a significant effect, with a p-value of .045. Only Researcher G gets to publish her work. If you read the paper, should you conclude that Sunday has a significant effect on number of posts?

Sunday Repentance: What if Researcher G above is a sociologist who chooses to measure the effect of Sunday based on years of observing the way people behave on weekends? Researcher G is not interested in the other tests, because Sunday is the interesting day from her perspective, and she wouldn’t expect any of the other tests to be significant.

Decreasing Effect Sizes: Many observers have noted that as studies yielding statistically significant results are repeated, estimated effect sizes go down and often become statistically insignificant. Why is this the case?
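
One commonly offered explanation is selection on significance, which can be sketched in a simulation. The setup below is hypothetical: the true effect (0.2), the standard error (0.15), and the rule that only significant original studies are published and then replicated are all assumptions made for illustration. Each estimate is modeled as the true effect plus Gaussian noise.

```python
import random

random.seed(203)  # arbitrary seed so the sketch is repeatable

TRUE_EFFECT = 0.2    # hypothetical true effect size
STD_ERROR = 0.15     # hypothetical standard error of each study's estimate
STUDIES = 20000
CRITICAL_Z = 1.96    # two-sided 5% cutoff

published = []       # original estimates that reached significance
replications = []    # fresh estimates of the same effect, one per published study

for _ in range(STUDIES):
    original = random.gauss(TRUE_EFFECT, STD_ERROR)
    if abs(original / STD_ERROR) > CRITICAL_Z:
        published.append(original)
        # The replication is an independent draw and is not filtered on significance.
        replications.append(random.gauss(TRUE_EFFECT, STD_ERROR))

mean_published = sum(published) / len(published)
mean_replication = sum(replications) / len(replications)
share_significant = sum(
    abs(r / STD_ERROR) > CRITICAL_Z for r in replications
) / len(replications)

print(f"True effect:                     {TRUE_EFFECT:.3f}")
print(f"Mean published estimate:         {mean_published:.3f}")
print(f"Mean replication estimate:       {mean_replication:.3f}")
print(f"Share of replications significant: {share_significant:.3f}")
```

In this setup the published estimates overstate the true effect because only the largest draws pass the significance filter, while the unfiltered replications average close to the truth and frequently miss significance.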