Math 360: Supplementary Thoughts
Math Department, Eastern Michigan University
I always want to hear your thoughts on the class so far, so I created an online form for anonymous feedback. Please use it to let me know what you think (or, if you are not concerned about anonymity, just send an email):
You can use this throughout the semester. You should remove the REMOVETHIS; I just put it in there to deter automated webcrawlers.
The American Statistical Association (ASA) has a statement of Ethical Guidelines for Statistical Practice
The Institute for Operations Research and Management Science (INFORMS) has a Certified Analytics Professional program that includes this Code of Ethics.
And here is a slightly more light-hearted code, originally written for a financial setting:
Emanuel Derman’s Hippocratic Oath of Modeling
• I will remember that I didn’t make the world and that it doesn’t satisfy my equations.
• Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.
• I will never sacrifice reality for elegance without explaining why I have done so. Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.
• I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.
And a report on how web sites return different search results, prices, or ads based on the apparent race or location of the searcher: http://www.wnyc.org/story/dba2f97dd61e2035fd433a48/?utm_source=/story/128722-prime-number/&utm_medium=treatment&utm_campaign=morelikethis
searching for a traditionally black-sounding name such as “Trevon Jones” is 25 percent more likely to generate ads suggesting an arrest record—such as “Trevon Jones Arrested?”—than a search for a traditionally white-sounding name like “Kristen Sparrow,” according to a January 2013 study by Harvard professor Latanya Sweeney. Sweeney found this advertising disparity even for names in which people with the white-sounding name did have a criminal record and people with the black-sounding name did not have a criminal record.
And later in the report,
Our tests of the Staples website showed that areas with higher average income were more likely to receive discounted prices than lower-income areas.
Statistics is its own field, of course, but it is related to many others. People now talk about Analytics, which is often broken into
· Descriptive Analytics: what did happen? (EMU Math 360)
· Predictive Analytics: what will happen? (EMU Math 360, EMU Math 419W)
· Prescriptive Analytics: what’s the optimal thing to do? (EMU Math 319, EMU Math 560)
Of course other EMU statistics courses relate to Descriptive and Predictive analytics; I’ve only listed the ones I teach.
“Data Science” is another hot term these days. Some say it’s a combination of Statistics/Math/Operations Research, Computer Science, and Substantive Expertise:
The book “Doing Data Science”, page 42, says "Now the key here that makes data science special and distinct from statistics is that this data product then gets incorporated back into the real world, and users interact with that product, and that generates more data, which creates a feedback loop."
I disagree with that on two levels: first, plenty of people do what they’d call data science that doesn’t interact with users/create a feedback loop, and second, the idea of interaction or creating a feedback loop could still be called statistics, or Advanced Analytics/prescriptive analytics.
“Big Data” is also a hot topic these days. You could say it is any data set that is too big to fit onto one computer, and must be split across multiple computers. For example, Facebook profiles and clickstreams would constitute big data; some physics experiments also generate big data (100 TB per day!). Big data is often characterized by the “3 V’s”: Volume, Velocity, and Variety. Volume means how much data there is, Velocity means how fast it comes in, and Variety is the mix of numbers, text, sounds, and images. We won’t be dealing with Big Data in this class, but ask me if you want to know more about it.
We will use Excel a lot, even though there are well-documented problems with it for statistics; for example,
1. Data beat anecdotes
2. Association is not causation
3. The importance of study design
4. The omnipresence of variation
5. Conclusions are uncertain.
6. Observation versus experiment
7. Beware the lurking variable [confounding]
8. Is this the right question?
For project proposals, a few things to remember:
* I encourage team projects, but solo projects are allowed as well.
* Team sizes are limited to 2 people (arranged with whomever you want).
* If anyone is looking for a partner, please let me know and I will do my best to play eHarmony.
* There is no competition for project topics. Multiple people or teams may do the same project.
* I STRONGLY ENCOURAGE you to chat with me about project ideas well before the proposal deadline, either in person or by email.
* After the chatting is done, feel free to email me a draft of a proposal (and perhaps a data set) for some informal feedback.
* See below some sample project titles from last year's Math 360 class.
Proposals will generally be 1 to 2 pages, and will contain:
* title of project
* author name(s)
* a description of the problem you are facing
* a description of the available data or data collection plans (incl. a copy of the data if it’s already available, perhaps as a separate file)
* a description of the proposed analysis
* literature search? for many projects, no literature search is needed. Others may benefit from a literature search, and may get bonus points for doing one. Giving proper credit to your sources of information or ideas is always required.
* data? If you already have the data, include a spreadsheet of it as a 2nd file when you upload the presentation
If your project idea depends on getting a data set from your boss at work, you need to have the data set in hand by the time of the proposal. I've had a few projects go bad when a boss doesn't come through with promised data.
A proposal does not lock you in to a topic or analysis method. If your project is not working out, contact me immediately and together we will find a new project topic.
A general tip as you're doing your projects: doing confidence intervals is almost always better than doing a hypothesis test, because a CI can be converted into an HT very easily in your head (did the CI include zero? for example), but knowing the results of an HT doesn't give you much info about the related CI. There are some projects where a CI isn't applicable, though, often those related to Chapter 12 (chi-squared tests). The one nice thing about an HT, though, is that you get a P-value, while the CI only tells you whether the P-value was below 0.05 (or whatever significance level matches the confidence level you used).
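The CI-to-HT conversion can be sketched in Python. This is only an illustration, not part of the course materials: the function name `ci_and_p_value`, the made-up data values, and the use of a normal (z) approximation instead of the t procedures are all assumptions chosen to keep the sketch short.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def ci_and_p_value(data, null_value=0.0, confidence=0.95):
    """Return a z-based confidence interval for the mean and a two-sided P-value.

    Assumption: a normal approximation is used for simplicity; a t-based
    procedure would be more appropriate for small samples.
    """
    n = len(data)
    center = mean(data)
    std_error = stdev(data) / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    interval = (center - z_crit * std_error, center + z_crit * std_error)
    z_stat = (center - null_value) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return interval, p_value


# Made-up differences (for illustration only, not real project data).
differences = [1.2, 0.4, 2.1, -0.3, 1.8, 0.9, 1.5, 0.2, 1.1, 0.7]
interval, p_value = ci_and_p_value(differences)

# Reading the test off the interval: the 95% CI excludes zero
# exactly when the two-sided P-value is below 0.05.
ci_excludes_zero = not (interval[0] <= 0.0 <= interval[1])
print(interval, p_value, ci_excludes_zero)
```

Going the other direction is harder: a P-value by itself does not tell you where the interval is centered or how wide it is.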
It's important to create artificial data sets similar to what your real data set is. That way, when you do your processing, you can tell if you are getting what you expect to get; it's a way of debugging. You start by copying the file with your original data set, then in that copy, replacing the original data with artificial data. Then you do your analysis on the artificial data. Once you've done it for artificial data, you should be able to save another copy of that file, then paste your real data in where the artificial data is, and have all the calculations automatically update. This is vastly better than trying to re-create the formulas in a new sheet, since that could introduce new errors.
To generate a Standard Normal in Excel, use
=norminv(rand(),0,1)
To generate a non-standard normal with mean 5 and std.dev. 3, use
=norminv(rand(),0,1)*3 + 5
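For anyone working outside Excel, the same inverse-CDF idea can be sketched in Python. This is a rough equivalent under stated assumptions, not a replacement for the spreadsheet approach: the function name `excel_style_normal` and the seed value are made up for illustration, while `NormalDist.inv_cdf` plays the role of `norminv` and `random()` plays the role of `rand()`.

```python
import random
from statistics import NormalDist, mean, stdev


def excel_style_normal(mu, sigma, rng):
    """Mimic =norminv(rand(),0,1)*sigma + mu using the inverse normal CDF."""
    u = rng.random()
    while u == 0.0:  # inv_cdf is undefined at exactly 0, so redraw
        u = rng.random()
    return NormalDist(0, 1).inv_cdf(u) * sigma + mu


rng = random.Random(360)  # arbitrary seed so the artificial data is repeatable
sample = [excel_style_normal(5, 3, rng) for _ in range(10_000)]

# The sample mean and standard deviation should land near 5 and 3.
print(round(mean(sample), 2), round(stdev(sample), 2))
```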
Another big advantage of creating artificial data is then you can compute how much your output measurements change just due to random chance, by running a whole bunch of random trials.
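Here is a minimal sketch of that "whole bunch of random trials" idea in Python. Everything specific in it is an assumption for illustration: the output measurement is taken to be a sample mean, each artificial data set has 25 values drawn from a normal distribution with mean 5 and standard deviation 3, and the seed is arbitrary.

```python
import random
from statistics import mean, stdev

rng = random.Random(2024)  # fixed seed so the whole experiment can be rerun exactly


def one_trial(n=25, mu=5.0, sigma=3.0):
    """Create one artificial data set and return its output measurement (the mean)."""
    data = [rng.gauss(mu, sigma) for _ in range(n)]
    return mean(data)


# Repeat the analysis on many artificial data sets.
trial_means = [one_trial() for _ in range(2000)]

# The spread of the outputs shows how much they move due to chance alone.
chance_spread = stdev(trial_means)
print(round(chance_spread, 3))
```

In this setup the spread should come out close to 3 / sqrt(25) = 0.6, which is a useful check that the simulation is doing what you expect.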
Someone asked me if the random number generator in Excel is seedable (that is, can it be set to start at the same sequence over and over). There's no interface for doing that, but I researched the algorithm that the random number generator uses, and I've implemented it in simple formulas in a posted spreadsheet. You may ignore this if you want.
Another key component of some projects is the idea of Cross-Validation. Instead of fitting models to the entire data set, you pick a portion of it called the “Training” set and fit the models to that. Then you use those fitted models to make predictions for the rest of the data, called the “Test” set, to see which model does the best. Actually, if you then want to quantify the prediction errors you might expect to see, you need a 3rd portion of the data set: you fit the winning model to the training & test set, then make predictions for that 3rd portion, and measure the prediction error.
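The three-portion scheme can be sketched as follows. This is an illustration under assumptions, not a prescribed method: the data are artificial (a line plus noise), the two candidate models (a constant and a straight line), the 150/75/75 split, and all function names are made up for the example.

```python
import random
from statistics import mean


def fit_constant(xs, ys):
    """Model 1: always predict the average y."""
    ybar = mean(ys)
    return lambda x: ybar


def fit_line(xs, ys):
    """Model 2: least-squares straight line."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: ybar + slope * (x - xbar)


def rmse(model, xs, ys):
    """Root-mean-square prediction error of a fitted model."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def columns(rows):
    return [r[0] for r in rows], [r[1] for r in rows]


# Artificial data: y = 1 + 2x plus noise with standard deviation 3.
rng = random.Random(1)
rows = [(x, 1 + 2 * x + rng.gauss(0, 3)) for x in (rng.uniform(0, 10) for _ in range(300))]
rng.shuffle(rows)
train, test, final = rows[:150], rows[150:225], rows[225:]

# Step 1: fit every candidate on the training set, compare them on the test set.
candidates = {"constant": fit_constant, "line": fit_line}
test_error = {name: rmse(fit(*columns(train)), *columns(test)) for name, fit in candidates.items()}
winner = min(test_error, key=test_error.get)

# Step 2: refit the winner on training + test, then measure error on the 3rd portion.
final_model = candidates[winner](*columns(train + test))
final_error = rmse(final_model, *columns(final))
print(winner, round(final_error, 2))
```

The error measured on the 3rd portion is the honest estimate, because that portion played no role in choosing or fitting the winning model.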
Doing Data Science says:
Out-of-Sample, and Causality
We need to establish a strict concept of in-sample and out-of-sample data. Note the out-of-sample data is not meant as testing data—that all happens inside in-sample data. Rather, out-of-sample data is meant to be the data you use after finalizing your model so that you have some idea how the model will perform in production. We should even restrict the number of times one does out-of-sample analysis on a given dataset because, like it or not, we learn stuff about that data every time, and we will subconsciously overfit to it even in different contexts, with different models.
Next, we need to be careful to always perform causal modeling (note this differs from what statisticians mean by causality). Namely, never use information in the future to predict something now. Or, put differently, we only use information from the past up and to the present moment to predict the future. This is incredibly important in financial modeling. Note it’s not enough to use data about the present if it isn’t actually available and accessible at the present moment. So this means we have to be very careful with timestamps of availability as well as timestamps of reference. This is huge when we’re talking about lagged government data.
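The distinction between timestamps of reference and timestamps of availability can be made concrete with a small sketch. The records, dates, values, and function names below are all hypothetical, invented only to show the filtering step; they are not from the book or from any real data release.

```python
from datetime import date

# Hypothetical records: (reference_date, available_date, value).
# The reference date is the period the number describes; the available
# date is when the number was actually published.
records = [
    (date(2014, 1, 31), date(2014, 3, 15), 10.0),
    (date(2014, 2, 28), date(2014, 4, 15), 11.0),
    (date(2014, 3, 31), date(2014, 5, 15), 12.0),
]


def usable_at(records, prediction_date):
    """Keep only records that had been published by the prediction date."""
    return [r for r in records if r[1] <= prediction_date]


def latest_known_value(records, prediction_date):
    """Most recent value (by reference date) that was really available."""
    known = usable_at(records, prediction_date)
    if not known:
        return None
    return max(known, key=lambda r: r[0])[2]


# On April 1 the February figure describes the past but is not yet
# published, so a causal model may only use the January figure.
print(latest_known_value(records, date(2014, 4, 1)))
```

Filtering on the reference date instead would quietly leak the February and March figures into an April 1 prediction.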
Your final report should be a roughly 5-to-10-page technical report (a Word file, usually). I don’t count pages, though, so don’t worry about the exact length. Please use the HomeHealthCare.doc file that I will email out as a template (remove their content, type in your own content).
Please upload both your report file and your Excel file at the same time. Your report should still include copies of any relevant figures; don't just say "see the Excel file".
If you are part of a team project, _each_ person should upload a copy of the presentation and report.
For your final presentation, you have 2 options:
· A 5-minute Powerpoint-style presentation that you stand up and give to the class (roughly 5 slides), or
· A poster presentation, which often consists of about 12 Powerpoint slides, printed out on paper and taped to the wall of the classroom (don’t buy/use posterboard).
Each person or team of 2 may decide whether they want to do a poster or oral presentation. Either way, presentation materials should be uploaded to a dropbox inside EMU-online.
Please do not feel obligated to dress up for our presentation day in Math 360. Anyone who does dress up will be a few standard deviations from the mean, as statisticians say. Either way, it will not affect your grade at all.
However, it is important to present in a professional way (aside from how you are dressed). If I write a letter of recommendation for you, I want to be able to say how polished your presentation was—not just your slides, but your manner of speaking. This can be especially important for future teachers. In a letter of recommendation I would hope I could say “While I’ve never observed ____ as they teach an actual class, their final project presentation in Math 360 convinces me that they have the presentation skills to be a great teacher.”
I will recommend that some people submit their work to the Undergraduate Statistics Class Project Competition (USCLAP).
The writeup for that has the following page limits (all in 11-point Arial, single-spaced, 1-inch margins):
1 page for title and abstract
<=3 pages for report
1 page for bibliography, if any (optional)
<=5 pages for appendices
So you might want to format your paper that way if you’re thinking of entering the contest.
Note that if you are using data from human subjects (or animals!) you will need to apply for permission from EMU’s Institutional Review Board (IRB) to use your data in the USCLAP contest. I can help you with this, but we need to do it early in the semester. If you aren’t hoping to submit to the USCLAP, then IRB approval is usually not required.
The judging criteria for that contest will be the basis of the grading system for projects:
1. Description of the data source (15%)
2. Appropriateness and correctness of data analysis (40%)
3. Appropriateness and correctness of conclusions and discussion (20%)
4. Overall clarity and presentation (15%)
5. Originality and interestingness of the study (10%)
NOTE: All essential materials addressing these criteria must be in the report, not confined to the spreadsheet file.
You can see the guidelines I give to my other project-based classes (Math 319, Math 419, Math 560) at this link:
though as you can see from the above, the requirements for Math 360 are a little different because of the statistical focus.
Sample project titles from previous years
Baseball player builds and home runs
Tennis serve accuracy
Noll-Scully simulation of sports rankings
Spring Training vs Regular Season
Finding a Piecewise Linear Breakpoint in Chemistry data
Music participation and GPA
Salary vs. Results in NCAA Tournament
Barbie Bungee Challenge
Incumbency advantage in elections
Comparing Distinct Audio Points in Classical and Rock Music
Airbags, seat belts, bike helmets
Golden Ratio in Art
Gender differences in SAT scores
Home health care data
GEAR-UP survey data
Normal distributions on Wall Street?
Naive Bayesian spam filtering
Honors college GPAs
Predicting Course Grades from Mid-Semester Grades
An Analysis of Correlations between Event Scores in Gymnastics Using Linear Regression
Accounting Fraud and Benford's Law
Appointment-Based Queueing and Kingman's Approximation
Are Consumers Getting all the Coconut Chocolaty Goodness They’re Paying For?
Are regular M&Ms more variable in weight than Peanut M&Ms?
Barbie Bungee Experiment
Breaking Eggs in Minecraft
Comparing the Efficiency of Introductory Sorting Algorithms
Distribution of File Sizes
Do students who score better on a test’s story problems score better on the test as a whole?
Do studying habits affect your interest in math?
Do young adults under 18 and 18 and older have the same completion rate of the 3-shot regimen for Gardasil?
Does age affect half-marathon completion time?
Patterns in bulk discounts
Does having high payrolls mean you will win more Major League Baseball games?
Gardasil 3-shot vaccine completion, average number of shots
Getting Hot at the Right Time: A statistical analysis of variable relative strength in the NHL
Home Field Advantage in MLB, NFL, and NBA
How random are Michigan Club Keno and Java random numbers?
Ice Cream Sales and Temperature
Is there a relationship between the length of songs at the #1 spot on the Billboard Hot 100 and their respective week at #1 in time?
Lunch vs Dinner Sales at Domino's
Math Lab demand data vs Section Enrollment by Hour
Modeling School of Choice Data in Lenawee County
Pharmacy prescription pick-up times
Piecewise-Linear Regression on Concentration / Conductivity Data
Proportion & Probability of 2-Neighborly Polytopes with m-Vertices in d-Dimensions
Ranking Types of Math Questions (Algebra-based)
Scoring Trends and Home Court Advantage in Men’s College Basketball
Skip Zone on the Sidewalk
Spaghetti Bridge and Pennies
What affects a pendulum’s behavior
While these are shown in various categories, each project idea is open to anyone in any major.
Are stock prices (or percent returns) normally distributed? See http://bestcase.wordpress.com/2010/08/01/outliers-in-the-nyt-reflections-on-normality/
Various questions on where the Daily Double in Jeopardy is located (ask me for more thoughts)
Song database: http://musicbrainz.org/doc/MusicBrainz_Database
correlation between LSAT, GPA, admission, and salary; ask me to dig the data out of my email if needed
Fermat's last theorem histogram: how close can the equation come to being true?
- by time of day
- by weather/day-to-day
- within span of a few seconds or minutes
- from device to device
Tablet/Smartphone Accelerometer Data:
- accuracy at 500 Hz vs 50 Hz vs 5 Hz
- correlation between devices
Mars Craters data set, craters.sjrdesign.net
Make your own crater data set with a bucket of sand and a heavy marble?
cepheid variable stars; ask me to dig data out of my email box
Asteroid size distribution: can get data from http://www.asterank.com http://www.minorplanetcenter.net/iau/lists/Dangerous.html
space weather, Coronal Mass Ejections CME (ask me to dig up some data on this out of my emails)
There’s a new ASA section on astrostatistics, described in AMstat News—see what they do?
http://en.wikipedia.org/wiki/Proton_decay In an experiment involving a series of particle collisions, the amount of generated matter was approximately 1% larger than the amount of generated antimatter. The reason for this discrepancy is yet unknown.
V.M. Abazov et al. (2010). "Evidence for an anomalous like-sign dimuon charge asymmetry". arXiv:1005.2757. http://arxiv.org/abs/1005.2757
Tennessee STAR study on small class sizes
problems with estimating from pie charts
parents probability of pulling kids from public schools (survey)
Regression through the origin: when?
Instead of a regular project, work on getting the Data Analysis electronic badge?
Barbie Bungee: make a bungee-cord out of rubber bands, and send a Barbie (or similar toy) plunging toward the floor. Try it with a few different lengths of cord, record how far she plunges, then forecast how many rubber bands would be needed for a 12-foot drop. You can find more info online.
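The forecasting step for Barbie Bungee might look like the sketch below. The measurements are invented purely for illustration (they are not data from any past project), and a straight-line model is assumed, which may or may not hold once the cord gets long.

```python
from math import floor
from statistics import mean

# Made-up trial results: number of rubber bands and plunge distance in inches.
bands = [2, 4, 6, 8, 10]
drop_inches = [18, 30, 41, 54, 65]

# Least-squares line: drop = intercept + slope * bands.
xbar, ybar = mean(bands), mean(drop_inches)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(bands, drop_inches))
         / sum((x - xbar) ** 2 for x in bands))
intercept = ybar - slope * xbar

# Solve for the number of bands that gives a 12-foot (144-inch) drop,
# rounding down so the toy stops short of the floor.
target_inches = 12 * 12
bands_needed = floor((target_inches - intercept) / slope)
print(round(slope, 2), round(intercept, 2), bands_needed)
```

Note that 144 inches is far outside the range of the measured drops, so this forecast is an extrapolation and deserves extra caution.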
Spaghetti Bridges: make a simple bridge of straight spaghetti (not glued into a truss), see how much weight it can hold. Repeat with wider spans and/or more strands per bridge. You can find more info online. One reference is “Slope-Intercept Form—Beam Strength” from Exploring Algebra 1 with TI-Nspire, 2009, Key Curriculum Press.
* How far does a supersoaker shoot, based on how many pumps you give it?
* How far does a supersoaker shoot, as a function of time as you hold down the trigger?
Here are some ideas about the mechanics of statistics:
Confidence interval on s_e for linear regression
Partial Correlation in multivariate analysis
Simulate a thought experiment on publication bias
advanced work on causality: http://magazine.amstat.org/blog/2013/08/01/causality-in-stat-edu/
Judea Pearl work on causality
Machine Learning problems: logistic regression, SVM, etc.
Cross-Validation: training and test data sets
distribution of file sizes
- on a hard drive (correlated to time of creation, modification, or access?)
- on a web server
- as requested from a web server
distribution of packet sizes, and correlation from one to the next?
distribution of time gap between packets, and correlation from one to the next?
spam filtering; try the Enron email database at http://www.cs.cmu.edu/~enron
durations of jobs on the CPU
memory sizes of jobs
Network round-trip times for pings
Sleep vs Cron repetitive wakings
look into what gets presented at ACM SIGMETRICS
Medicare Home Health Compare
Gardasil data set:
SEER cancer data set,
(need to submit application to use it)
National Longitudinal Study of Adolescent Health, via ICPSR/umich (easiest to use wave 1)
Health Evaluation and Linkage to Primary Care (HELP), data set HELPrct from Project Mosaic
painkiller prescription and overdose rates by state; I have some of the data saved in an email
Pick your favorite sport and ask a statistical question about it. Some examples:
* predicting player performance from previous years (helpful first step to choosing a fantasy team)
* quantifying home-field advantage
* (harder) quantifying time-zone advantage
How consistent is a participant's performance (#fish? weight? rank? z-score?) from one event in the tour to the next? Compare to other individual-performance sports? Here are links for 3 tournaments in 2014:
How about the Hot Hand?
Here is a stats textbook that has a sports focus, rather than just doing sports-statistics, but it might still be interesting: http://www.sportsci.org/resource/stats/index.html
GeoGebra can do some statistics: http://web.geogebra.org/beta/
the three-bar button in the upper right
click on the normal curve with an area under it
Play with either the Distribution or the Statistics tab
Statistics can do Z Test of a Mean, T Test difference of means, etc.
Excel 2010 for educational and psychological statistics : a guide to solving practical problems / Thomas Quirk.
Excel 2010 for biological and life sciences statistics : a guide to solving practical problems / by Thomas J. Quirk, Meghan Quirk, Howard Horton.
Converting Data into Evidence: A Statistics Primer for the Medical Practitioner / Alfred DeMaris, Steven H. Selman.
On Chance and Unpredictability: 13/20 lectures on the links between mathematical probability and the real world. David Aldous, January 2012
Reasoning in Sports, by Tabor and Franklin http://bcs.whfreeman.com/sris/
Statistics: A Guide to the
Forty Studies that Changed Psychology: Exploration into the History of Psychological Research
"Making Sense of Data" volumes 1,2,3, by Glenn J. Myatt; EMU library has an electronic subscription
Doing Data Science: Straight Talk from the Frontline, By Cathy O'Neil, Rachel Schutt; Publisher: O'Reilly Media
We will use this link for the Car Insurance activity:
and then later we will use this link for the Data Types activity:
Some additional reading is included below.
To prepare for our next class, we will use the following PDF file on Random Rectangles:
and you should enter your answers here before class starts:
Sometimes we code binary categorical variables (like gender) as 0 or 1; that’s called Dummy coding. We can also code them as -1 vs +1; that’s called Effect coding: http://methodology.psu.edu/node/266
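A small sketch can show what the two codings do to a fitted line. The yes/no answers, the scores, and the helper names are made up for illustration; the only claim being demonstrated is the standard one that the choice of coding changes how the intercept and slope are read, not the fitted group means.

```python
from statistics import mean


def dummy_code(values, reference):
    """0 for the reference category, 1 for the other."""
    return [0 if v == reference else 1 for v in values]


def effect_code(values, reference):
    """-1 for the reference category, +1 for the other."""
    return [-1 if v == reference else 1 for v in values]


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single predictor."""
    xbar, ybar = mean(xs), mean(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope


# Made-up data: a binary answer and a numeric score for four people.
answers = ["no", "no", "yes", "yes"]
scores = [10.0, 12.0, 20.0, 22.0]

dummy_fit = fit_line(dummy_code(answers, "no"), scores)
effect_fit = fit_line(effect_code(answers, "no"), scores)

# Dummy coding: intercept = mean of the "no" group, slope = difference of group means.
# Effect coding: intercept = midpoint of the two group means, slope = half the difference.
print(dummy_fit, effect_fit)
```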
Here is some reading on the standard classifications for Data Types (nominal, ordinal, interval, ratio):
and an opposing viewpoint:
which cites, among other possible systems,
Mosteller and Tukey (1977 Chapter 5):
* Grades (ordered labels such as Freshman, Sophomore, Junior, Senior)
* Ranks (starting from 1, which may represent either the largest or smallest)
* Counted fractions (bounded by zero and one. These include percentages, for example.)
* Counts (non-negative integers)
* Amounts (non-negative real numbers)
* Balances (unbounded, positive or negative values).
Doing Data Science, page 23, suggests:
• Traditional: numerical, categorical, or binary
• Text: emails, tweets, New York Times articles
• Records: user-level data, timestamped event data, json-formatted log files
• Geo-based location data
• Sensor data
Also see http://stats.stackexchange.com/questions/539/does-it-ever-make-sense-to-treat-categorical-data-as-continuous
For future teachers: I was amazed to see in my daughter's 3rd grade homework a link with our Categorical/Quantitative, Discrete/Continuous discussion:
This homework sheet talks about Count, Measure, Position, and Label:
This one is amazingly similar to our activity where we talked about Nominal, Ordinal, Interval, Ratio for our start-of-semester-survey:
I'm not sure if it's in all such curricula--the book they're using is by Houghton Mifflin.
Remember that dotplots can tell us:
* S: the Shape of the distribution (concentrated at an endpoint? Or in the middle?)
* O: any Outliers or other unusual features like gaps
* C: where the data is Centered
* S: how Spread the data is
so if you're writing sentences describing a dotplot, you should write at least one sentence for each of those bullet points. Remember the acronym SOCS. It’s important to do them in that order, too, because shape and outliers often influence our choice of how to measure center and spread.
For example, for a sibling-count dotplot we did one year in class:
The data is concentrated near the low end.
There are no unexpected gaps or outliers.
The center of the data is around 3.
The data is spread from 1 to 9.
First, let’s note that in statistics, a Sample almost always means more than one data value. If you poll 25 people for a project, that is a single Sample, not 25 samples. This is in contrast to how scientists often think of samples: a blood sample, or a sample from a lake or river, often makes us think of just one container of blood or water.
Doing Data Science, page 21, asks:
But, wait! In the age of Big Data, where we can record all users’ actions all the time, don’t we observe everything? Is there really still this notion of population and sample? If we had all the email in the first place, why would we need to take a sample?
And on page 25:
way the article frames this is by claiming that the new approach of Big Data is
Here’s the thing: it’s pretty much never all. And we are very often missing the very things we should care about most.
Example of confounding: Stereotypically, old people are thought of as not very good with new technology. Is that because they have lived a large number of years, or because they were born during a particular decade or two? There’s no way to disentangle those two things.
Another example: An exhibit at the Wagner farm in Glenview, IL has 3 rope/pulley systems, each trying to lift an equal weight. One is a simple pulley; the next is compound (down and up); the 3rd is even more compound (down/up/down). The ropes used are also slightly different: the most compound one uses a thinner rope. The most compound one should be the easiest to pull. (It's not, due to a lack of lubrication and some bent axles on the pulleys.)
My daughter tried all 3 and decided that the diameter of the rope is what makes things easier or harder to lift.
Imagine dropping a marble into a bucket of sand and measuring the diameter of the crater.
If you change the diameter of the impactor, you're also changing the weight (or vice versa), unless you take very great care to find marbles/balls that change density or become hollow in just the right way.
The table would look like this:
Diameter   Weight    DropHeight   CraterSize
0.5 cm     2 grams   25 cm        5 cm
0.5 cm     2 grams   25 cm        6 cm
0.7 cm     3 grams   25 cm        7.1 cm
0.7 cm     3 grams   25 cm        7.3 cm
You could control weight separately from impactor diameter by using a non-sphere impactor, like a stack of pennies or nickels, or AA batteries. But then you'd have to be careful to control its orientation at impact--maybe have it slide down a V-shaped near-vertical channel, or suspended from a string (balanced perfectly vertically) and very still, then cut the string.
Quick activity: name that sampling method
a. Roll a die to pick a row in class, then ask each student in that row; do it twice
b. Pick a student and then every 5th student after that
c. Ask one student from each row
d. Pick some students, say "you guys look like typical students"
e. Throw an object, see who it hits
f. Number the students, etc.
In _______ sampling, ALL groups (strata? clusters?) are used, and SOME individuals in each are sampled.
In _______ sampling, SOME groups (strata? clusters?) are used, and ALL individuals in each are sampled.
If you’ve done Stratified sampling, how do you combine your strata results into whole-sample results? Ask me for a photocopy from Applied Statistics for Engineers and Scientists, 2nd Edition, by Devore and Farnum.
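One common way to combine strata results is to weight each stratum's sample mean by that stratum's share of the population. The sketch below shows that calculation; the function name and the numbers are made up, and this is the standard textbook estimator rather than necessarily the exact presentation in Devore and Farnum.

```python
def stratified_mean(strata):
    """Combine per-stratum sample means into one whole-population estimate.

    strata: list of (population_size_of_stratum, sample_mean_in_stratum).
    Each stratum is weighted by its share of the population, not by how
    many individuals happened to be sampled from it.
    """
    total = sum(size for size, _ in strata)
    return sum(size / total * sample_mean for size, sample_mean in strata)


# Hypothetical example: 300 people in one stratum and 100 in another.
strata = [(300, 2.0), (100, 6.0)]
overall = stratified_mean(strata)
print(overall)
```

Simply averaging the two stratum means (giving 4.0 here) would overstate the influence of the smaller stratum.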
Dotplots of Random Rectangles results, look for bias
Bias in cancer screening: Crunching Numbers: What Cancer Screening Statistics Really Tell Us, by Sharon Reynolds, http://www.cancer.gov/ncicancerbulletin/112712/page4
Bias due to question ordering: http://textbookequity.org/oct/Textbooks/Lippman_mathinsociety.pdf
page 137: A psychology researcher provides an example:
“My favorite finding is this: we did a study where we asked students, 'How satisfied are you with your life? How often do you have a date?' The two answers were not statistically related - you would conclude that there is no relationship between dating frequency and life satisfaction. But when we reversed the order and asked, 'How often do you have a date? How satisfied are you with your life?' the statistical relationship was a strong one. You would now conclude that there is nothing as important in a student's life as dating frequency.”
Swartz, Norbert. http://www.umich.edu/~newsinfo/MT/01/Fal01/mt6f01.html. Retrieved 3/31/2009
Bias in psychology studies: most undergrad students who volunteer as subjects are WEIRD (or WIRED): Western, Educated, Industrialized, Rich, and Democratic
Does the needed sample size grow as the population grows?
Page 43 says no!!!!!!!
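A quick calculation makes that "no" plausible. The sketch below uses the usual sample-size formula for estimating a proportion, with a finite population correction; the formula, the 3% margin of error, the 95% confidence level, and the function name are my assumptions for illustration and are not taken from page 43.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(population_size, margin=0.03, confidence=0.95, p=0.5):
    """Sample size to estimate a proportion within +/- margin.

    Uses n0 = z^2 * p * (1 - p) / margin^2 for a very large population,
    then applies the finite population correction.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_infinite = z ** 2 * p * (1 - p) / margin ** 2
    n_finite = n_infinite / (1 + (n_infinite - 1) / population_size)
    return ceil(n_finite)


for size in (10_000, 1_000_000, 300_000_000):
    print(size, required_sample_size(size))
```

Under these assumptions the needed sample stays at roughly a thousand people whether the population is ten thousand or three hundred million.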
Here is a blank copy of Table 2.1; try to fill it out with “yes” and “no” entries by reasoning about each situation.
Study description | Reasonable to generalize conclusions about group to population? | Reasonable to draw cause-and-effect conclusion?
Observational study with sample selected at random from population of interest | |
Observational study based on convenience or voluntary response sample | |
Experiment with groups formed by random assignment of individuals or objects to experimental conditions | (no entry; this row is just a header for the next 2 rows) |
* Individuals or objects used in study are volunteers or not randomly selected | |
* Individuals or objects are randomly selected | |
Experiment with groups not formed by random assignment to experimental conditions | |
Rating System for the Hierarchy of Evidence: Quantitative Questions
Level I: Evidence from a systematic review of all relevant randomized controlled trials (RCT's), or evidence-based clinical practice guidelines based on systematic reviews of RCT's
Level II: Evidence obtained from at least one well-designed Randomized Controlled Trial (RCT)
Level III: Evidence obtained from well-designed controlled trials without randomization, quasi-experimental
Level IV: Evidence from well-designed case-control and cohort studies
Level V: Evidence from systematic reviews of descriptive and qualitative studies
Level VI: Evidence from a single descriptive or qualitative study
Level VII: Evidence from the opinion of authorities and/or reports of expert committees
Above information from "Evidence-based practice in nursing & healthcare: a guide to best practice" by Bernadette M. Melnyk and Ellen Fineout-Overholt. 2005, page 10.
Additional information can be found at: www.tnaonline.org/Media/pdf/present/conv-10-l-thompson.pdf
Chapter 2.3: Comparative Experiments
"Explanatory" variables are sometimes called "independent", and
"response" variables are often called "dependent",
but in later chapters we will learn this can cause confusion.
Blocking: means "putting into groups or blocks", rather than "obstructing".
Blocking activity: email from a friend in the Health school here at EMU:
> Dr. Ross,
> I hope your holiday went well. I have recently completed a project
> identifying some basic variables to be used in preliminary
> evaluation of gait interventions to determine whether the new
> intervention would be worth conducting an in-depth study about. The
> variables include stride length, step width, stride variability, and
> lateral displacement of the total body center of mass. As you can
> see, these variables represent some of the most basic aspects of
> stability, which is what we are always trying to improve or
> maintain, and efficiency.
> My question to you is: what sample size e.g. 10 trials, 20 trials, 50
> trials, would we need to take from both a control group (barefoot or
> with regular shoes) and the experimental group (the intervention) in
> order to obtain a confidence level to say that very small changes
> between the two groups is statistically significant? An example
> would be that the average lateral sway during gait was 5 mm less in
> the experimental group form the control group, is that significant
> or not?
Consider the study design here: Evidence Of Racial, Gender Biases Found In Faculty Mentoring
Or watch this video and consider how to design a study related to it:
Watch this video, called "Dove: Patches"
* If you were to design a study around this concept, what would your research question be?
* How would you design the study to answer that question?
Why do we try to do more than one trial at each level of the explanatory variable?
Imagine this data set:
What if we had only done one trial at each dose?
Might see just the diamonds, or just the Xs, leading to two completely different ideas of the trend!
And that's just by doing two rather than one at each level!
Replication allows us to quantify the variability/uncertainty at each level.
Also, when designing, choose 3 or more X values, so we can detect nonlinearity.
Controls: Positive and Negative
Bio/Chem: when trying to detect a chemical in a sample (pollution in a lake?),
run your procedures on some known pure water (Negative Control),
and on some water with a known amount of pollutant deliberately added to it (Positive Control).
Computer Science: when testing spam-filtering software,
run it on some known non-spam ("ham")--negative control,
and on some known spam -- positive control.
What is the difference between placebo and control?
Placebos are meant to fool _people_--usually unnecessary on non-people.
While control experiments apply to people and non-people alike.
But you should still handle animals in the control group the same way (incl. surgery?)
This article is a humorous take on experimental vs.
observational, etc.: How To Argue With Research You Don't Like
Guidelines and debate about information visualization: http://eagereyes.org/blog/2012/responses-gelman-unwin-convenient-posting and http://robertgrantstats.wordpress.com/2014/05/16/afterthoughts-on-extreme-scales/
A “Segmented Bar Chart” in our textbook is the same as a “100% Stacked” chart in Excel. If you change the widths of the bars to reflect the counts of each bar, that is called a “Mosaic” chart by most statisticians, or Fathom calls it a “Ribbon” chart. Here are some thoughts on how to make them in Excel: here and here
A very interesting data
set/set of histograms, which we would expect to have a Normal distribution like
the SAT or ACT, but it's definitely non-normal in very interesting ways:
(you'll need to scroll down past the description of how he collected the data
in a sneaky way):
whereas SAT data can be found at
Are adult heights distributed bimodally due to male/female
... two ways of comparing height data for males and females in the 20-29 age group. Both involve plotting the data or data summaries (box plots or histograms) on the same scale, resulting in what are called parallel (or side-by-side) box plots and parallel histograms. The parallel box plots show an obvious difference in the medians and the IQRs for the two groups; the medians for males and females are, respectively, 71 inches and 65 inches, while the IQRs are 4 inches and 5 inches. Thus, male heights center at a higher value but are slightly more variable.
... Heights for males and females have means of 70.4 and 64.7 inches, respectively, and standard deviations of 3.0 inches and 2.6 inches.
Blood Sugar Levels [note that the height of the Diabetic peak should be much smaller in the whole population, like 2% to 5%; this graph is showing two conditional distributions]
Duration of pregnancy has a left-skewed histogram; doi:10.1093/humrep/det297
Birth weight and birth length probably are also left-skewed?
Where the bins start can affect the apparent shape of the histogram: http://zoonek2.free.fr/UNIX/48_R/03.html
Which histogram below shows more variability, A or B? (adapted from a SCHEMATYC document)
Which time series shows more variability, A or B?
How can we address the mismatch? Focus on understanding and labeling the AXES! (I deliberately didn’t label the axes, above)
x = what values?
y = how many?
x = when?
y = what value?
data sets of quiz scores: which is more variable/harder to
data set 1: 1 2 3 4 5 6 7 8 9 10; histogram looks flat
data set 2: 8 8 8 8 8 8 8 8 8 8 ; histogram has a spike
Other sample histograms from something I created for my Math
110 class. Here's the link:
and just search inside the file for Histogram.
An alternative to a histogram is a Frequency Polygon; these tend to be better when showing two or more histograms on the same graph
Doing Data Science, page 270: “the average person on Twitter is a woman with 250 followers, but the median person has 0 followers”
Mythbusters on Standard Deviation: testing different soccer-ball launchers, looking for consistency (dig the spreadsheet out of my email?)
Boxplots: An interesting comparative boxplot on what
different hospitals charge for different blood tests: Variation
in charges for 10 common blood tests in California hospitals: a cross-sectional
analysis, by Renee Y Hsia, Yaa Akosa Antwi, Julia P Nath
Figure 1: Variation in charges for 10 common blood tests in California (CBC, complete blood cell count; ck, creatine kinase; WCC, white cell count). Central lines represent median charges, boxes represent the IQR of charges, and whiskers show the 5th and 95th centile of charges for each of the 10 common blood tests.
Boxplots: variability in salaries for top 100 athletes in various US sports, from Jeff Eicher via the AP-Stats community:
Transitioning from Dotplots to Boxplots: Hat Plots, a math-education-specific way to make the transition; shown in The Role of Writing Prompts in a Statistical Knowledge for Teaching Course by R.E. Groth (draft saved in email); Tinkerplots can do them.
Here is the data that you entered this semester on your own
height, in inches above 5-foot-0-inches.
We will use this in class.
Not to spoil the fun, but we will:
* make a histogram to see its general shape, in bins of width 2 inches
(use the histogram template file we've been using)
* compute the mean & SD
* compute another histogram with bin widths of 1 SD, centered on the mean
* look at the % of data points within +/- 1 SD of the mean, and then +/- 2 SD of the mean, and +/- 3 SD.
And here's another data set, on how long it took students in one of my Math 110 classes to walk from Green Lot to Pray-Harrold, in decimal minutes:
It can be hard to compute the mean, and especially the variance, accurately if there is a
huge amount of data, or if roundoff error is an issue. Computer science people
might want to read:
Chan, Tony F.; Golub, Gene H.; LeVeque, Randall J. (1983). Algorithms for Computing the Sample Variance: Analysis and Recommendations. The American Statistician 37, 242-247. http://www.jstor.org/stable/2683386
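For anyone who wants to try this, here is a minimal Python sketch of one well-known one-pass updating method (often attributed to Welford). It is only meant to illustrate the roundoff issue; I am not claiming it is the method the paper above recommends. The function name and the data are made up for this example.

```python
def running_mean_variance(values):
    """One pass over the data; returns (mean, sample variance with n-1)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        return mean, float("nan")
    return mean, m2 / (n - 1)


# Hypothetical data: a small spread riding on a huge offset, which is where
# the "sum of squares minus square of the sum" shortcut loses precision.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
mean, var = running_mean_variance(data)
print(mean, var)
```

The design point is that the update never forms the sum of the squared raw values, so the huge offset does not swamp the small deviations.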
Is the main purpose of regression to make predictions? The book “Making Sense of Data, Volume I” says that statistics is for: making predictions, finding hidden relationships, and summarizing the data:
Forecasting time series data is an important statistics
topic that we don’t have time to do in detail. Here is a freely available
textbook chapter about it: “Chapter 16: Time Series and Forecasting”
There are more types of regression than what we’ll learn about. See 10 types of regressions. Which one to use?
For future teachers: Unofficial TI-84 regression manual, or another site. It mentions that you need to turn on the Calculator Diagnostics to get the r and r^2 values. Do that by doing [2nd] [Catalog] [D] [Diagnostic On] [Enter]; you should only have to do that once in the lifetime of the calculator (unless you do a full-reset?)
And, it’s important to be able to compute and plot residuals. Here are instructions for doing it on a TI-84.
Here is an example heteroscedastic scatter plot: x=income, y=Expenditure on food, both in multiples of their respective mean; this is UK data on individuals, from 1968-1983
Here is some data on school-age children in the US, height and weight, that also shows heteroscedasticity: data from http://www.nal.usda.gov/fnic/DRI/DRI_Energy/energy_full_report.pdf
There are various tests for heteroscedasticity: http://en.wikipedia.org/wiki/Breusch%E2%80%93Pagan_test
An example Matrix of Scatterplots, from Statistical Methods in Psychology Journals:
It’s data from a national survey of 3000 counseling clients (Chartrand 1997); on the diagonal are dotplots of the individual variables, and off the diagonal there are scatterplots of pairs of variables. “Together” is how many years they’ve been together in their current relationship. What do you see in these plots?
Here’s a fun/depressing scatterplot, from OK Cupid: http://blog.okcupid.com/index.php/we-experiment-on-human-beings/
And then [in the following paragraph, what does the “less than 10%” mean, in terms of statistical things like slope, intercept, correlation coefficient, R^2, etc?]
After we got rid of the two scales, and replaced it with just one, we ran a direct experiment to confirm our hunch—that people just look at the picture. We took a small sample of users and half the time we showed them, we hid their profile text. That generated two independent sets of scores for each profile, one score for “the picture and the text together” and one for “the picture alone.” Here’s how they compare. Again, each dot is a user. Essentially, the text is less than 10% of what people think of you.
Doing Data Science, page 26:
Say you decided to compare women and men with the exact same
qualifications that have been hired in the past, but then, looking into what
happened next you learn that those women have tended to leave more often, get
promoted less often, and give more negative feedback on their environments when
compared to the men. Your model might be likely to hire the man over the woman
the two similar candidates showed up, rather than looking into the possibility that the company doesn’t treat female employees well. In other words, ignoring causation can be a flaw, rather than a feature. Models that ignore causation can add to historical problems instead of addressing them.... And data doesn’t speak for itself. Data is just a quantitative, pale echo of the events of our society
Also see the fantastical claims in “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” Chris Anderson, Wired, 2008
And, Statistical Truisms in the Age of Big Data by Kirk Borne
Grab data for infant mortality vs. gdp-per-capita from my email box?
Year Number of Starbucks stores
http://www.starbucks.com/aboutus/Company_Timeline.pdf retrieved May 2009
Like Moore’s law, but for LEDs: http://en.wikipedia.org/wiki/Haitz's_law
Range of human
hearing and range of human vision:
This page points out the problem of doing repeated tests as the sample size grows--even if H0 is true, the P value will wander between 0 and 1 randomly, and if you decide to stop when it hits 0.05 you're doing something bad: http://www.refsmmat.com/statistics/regression.html
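To see the "peeking" problem for yourself, here is a rough Python simulation sketch. Everything specific in it is my own assumption for illustration: the data are standard Normal with H0 (mean = 0) actually true, the test is a two-sided z-test with known sigma = 1, and we peek after every data point from n=10 up to n=200.

```python
import random
from math import sqrt
from statistics import NormalDist


def stops_early(rng, max_n=200, first_look=10, alpha=0.05):
    """Simulate data with H0 true and peek at the p-value after every new point."""
    std_normal = NormalDist()
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0, 1)      # H0 is true: the mean really is 0
        if n >= first_look:
            z = total / sqrt(n)       # z-test with known sigma = 1
            p_value = 2 * (1 - std_normal.cdf(abs(z)))
            if p_value < alpha:
                return True           # "significant", so we would stop here
    return False


rng = random.Random(360)
trials = 2000
false_alarm_rate = sum(stops_early(rng) for _ in range(trials)) / trials
print("share of H0-true experiments that ever hit p < 0.05:", false_alarm_rate)
```

If stopping were harmless, this share would be near 0.05; compare that to what the simulation prints.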
Why best cannot last: Cultural differences in predicting regression toward the mean
Roy R. Spina, Li-Jun Ji, Michael Ross, Ye Li, Zhiyong Zhang
Article first published online: 16 AUG 2010; DOI: 10.1111/j.1467-839X.2010.01310.x
Keywords: culture; lay theories of change; prediction; regression toward the mean
Four studies were conducted to investigate cultural differences in predicting and understanding regression toward the mean. We demonstrated, with tasks in such domains as athletic competition, health and weather, that Chinese are more likely than Canadians to make predictions that are consistent with regression toward the mean. In addition, Chinese are more likely than Canadians to choose a regression-consistent explanation to account for regression toward the mean. The findings are consistent with cultural differences in lay theories about how people, objects and events develop over time.
Home Run Derby. There
is a popular view that players who participate in the Home Run Derby somehow
"hurt their swing" and do worse in the second half of the season.
This article talks about how this phenomenon can be accounted for by regression
to the mean.
Interpreting the Intercept in a
and the more advanced “How to Interpret the Intercept in 6 Linear Regression Examples”
Working on her dissertation in the mid-1990s, Sheryl Stump (now the Department Chairperson and a Professor of Mathematical Sciences at Ball State University) did some of the best work to date about how we define and conceive of slope. Stump (1999) found seven ways to interpret slope, including: (1) Geometric ratio, such as "rise over run" on a graph; (2) Algebraic ratio, such as "change in y over change in x"; (3) Physical property, referring to steepness; (4) Functional property, referring to the rate of change between two variables; (5) Parametric coefficient, referring to the "m" in the common equation for a line y=mx+b; (6) Trigonometric, as in the tangent of the angle of inclination; and finally (7) a Calculus conception, as in a derivative.
[note that none of these correspond to how we view slope in statistics!]
Activity idea: Determine the sensitivity and specificity of the Cinderella shoe-fitting method. You will have to make some assumptions.
How can you analyze this study?
Autism Risk Detected at Birth in Abnormal Placentas
Written by Julia Haskins | Published April 25, 2013
Some good sensitivity/specificity examples at http://onlinelibrary.wiley.com/enhanced/doi/10.1111/1467-9639.00076/
students should experience setting up a model and using simulation (by hand or with technology) to collect data and estimate probabilities for a real situation that is sufficiently complex that the theoretical probabilities are not obvious. For example, suppose, over many years of records, a river generates a spring flood about 40% of the time. Based on these records, what is the chance that it will flood for at least three years in a row sometime during the next five years? 7.SP.8c
7.SP.8c Find probabilities of compound events using organized lists, tables, tree diagrams, and simulation.
c Design and use a simulation to generate frequencies for compound events.
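The river-flood question above can be simulated with technology along these lines. This is a sketch, and it assumes that each year floods independently with probability 0.4 (the standard does not say that, but some such assumption is needed). The function names are made up. Because there are only 2^5 = 32 flood/no-flood patterns, the sketch also checks the simulation against an organized list.

```python
import random
from itertools import product


def has_run_of_floods(years, run_length=3):
    """True if the sequence contains run_length or more floods in a row."""
    streak = 0
    for flooded in years:
        streak = streak + 1 if flooded else 0
        if streak >= run_length:
            return True
    return False


def simulate(p_flood=0.4, n_years=5, trials=100_000, seed=360):
    """Estimate the chance by simulating many independent 5-year stretches."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        years = [rng.random() < p_flood for _ in range(n_years)]
        hits += has_run_of_floods(years)
    return hits / trials


def exact(p_flood=0.4, n_years=5):
    """Check by listing all 2**n_years flood/no-flood patterns."""
    total = 0.0
    for years in product([True, False], repeat=n_years):
        if has_run_of_floods(years):
            k = sum(years)  # number of flood years in this pattern
            total += p_flood**k * (1 - p_flood) ** (n_years - k)
    return total


print(simulate(), exact())
```

By hand, the same simulation could be done with a 10-sided die: call 1 through 4 a "flood" and roll five times per trial.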
We might also have a quick quiz in class about how shifting
or scaling affects the mean, variance, SD, IQR, etc., and the proper formula
for the sample variance.
In class, we used the dotplot-histogram-crf-1000 sheet to investigate questions like:
* Is E[X+Y] = E[X] + E[Y] ? (yes, it always is--doesn't even need independence!)
* Is Std(X+Y) = Std(X) + Std(Y) ? (no, it basically never is!)
* Is Var(X+Y) = Var(X) + Var(Y) ? (in Excel it was close enough; with infinite trials, it's exactly true,
but we need to require that X and Y be independent, or at least uncorrelated)
Some other questions we could ask:
* Is E[X^2] = ( E[X] )^2 ?
* Is E[1/X] = 1/ E[X] ?
* Is E[X*Y] = E[X]*E[Y] ?
If you look at the 2nd multi-plotting sheet inside that file I sent, you will see a copy of what we did today, and some experiments as suggested above.
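If you would rather experiment outside of Excel, here is a rough Python version of the same kind of investigation. The two distributions (X uniform on 1..6, Y uniform on 1..10, generated independently) are just my assumptions for illustration; try your own.

```python
import random
from statistics import fmean, pstdev, pvariance

rng = random.Random(360)
n = 100_000
# Hypothetical independent inputs: X uniform on 1..6, Y uniform on 1..10
xs = [rng.randint(1, 6) for _ in range(n)]
ys = [rng.randint(1, 10) for _ in range(n)]
sums = [x + y for x, y in zip(xs, ys)]

print("E[X+Y]   vs E[X]+E[Y]    :", fmean(sums), fmean(xs) + fmean(ys))
print("Std(X+Y) vs Std(X)+Std(Y):", pstdev(sums), pstdev(xs) + pstdev(ys))
print("Var(X+Y) vs Var(X)+Var(Y):", pvariance(sums), pvariance(xs) + pvariance(ys))
print("E[X^2]   vs (E[X])^2     :", fmean(x * x for x in xs), fmean(xs) ** 2)
print("E[1/X]   vs 1/E[X]       :", fmean(1 / x for x in xs), 1 / fmean(xs))
print("E[X*Y]   vs E[X]*E[Y]    :", fmean(x * y for x, y in zip(xs, ys)),
      fmean(xs) * fmean(ys))
```

Keep in mind that X and Y are independent here; to see how the last comparison changes with dependence, try replacing ys with something built from xs.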
In other news, here is some advice on keeping notation straight:
Do not write/think nonsense. For example: the expression "P(A) or P(B)" is nonsense--do you see why? Probabilities are numbers, not boolean expressions, so "P(A) or P(B)" is like saying, "0.2 or 0.5" -- meaningless.
Similarly, say we have a random variable X. The "probability" P(X) is invalid. P(X = 3) is valid, but P(X) is meaningless.
Please note that = is not like a comma, or equivalent to the English word therefore. It needs a left side and a right side; "a = b" makes sense, but "= b" doesn't.
Similarly, don't use "formulas" that you didn't learn and that are in fact false. For example, in an expression involving a random variable X, one can NOT replace X by its mean. (How would you like it if your professor were to lose your exam, and then tell you, "Well, I'll just assign you a score that is equal to the class mean"?)
And, from Rossman and Chance, "Brief Review of Set Operations and Properties":
An event is a set, while a probability is a number.
One calculates probabilities of events (and therefore of sets), but probabilities are numbers. The following _meaningless_ statements are examples of nonsensical confusions of sets and numbers:
P(A) intersect P(B)
Examples of _meaningful_ statements about events and probabilities include:
P(A intersect B)
Chapter 7.5: Binomial and Geometric
first, note that an “unfair coin” is apparently nearly impossible to construct:
“You Can Load a Die, But You Can’t Bias a Coin”, Andrew Gelman and Deborah Nolan,
How reliable is public transit? This government document says “train must stop short of an authority limit with a 0.999995 certainty”; that means it shouldn’t go past a point on the track that it isn’t allowed to go past.
Classic question: if you flip a (unfair?) coin n times, how many times will it come up heads?
The # of heads has a Binomial distribution.
2nd classic question: if you flip a (unfair?) coin _until_ you get your first heads,
how many flips will it take?
The # of flips has a Geometric distribution.
Binomial has a fixed # flips, random #heads
Geometric has a random #flips, fixed #heads (just 1)
We already saw: P(1 energy-efficient fridge out of 3)
involved 3 different outcomes (3 trials, choose 1 E fridge)
Binomial PMF: one thing Prof. Casey has identified as
"something to know cold"!
Book uses p(x) but then also uses p for success probability--dangerous!
nCx * p^x * (1-p)^(n-x); x=0,1,..., n
Binomial PMF applet:
n=10 fair coin flips (p=1/2), P(X=5)?
Can use BINOMDIST(x,n,p, false) for PMF in Excel
What about P(X<=3 )?
Could do P(X=0)+P(X=1)+P(X=2)+P(X=3)
but there's a better way: binomdist using the cumulative=true option:
BINOMDIST(x,n,p,true) = P(X<=x )
What about P(X>7)? There's no reverse-cumulative option.
Instead, say: P(X>7) is the complement of P(X<=7): P(X>7)=1-P(X<=7)
then do 1-binomdist(7,n,p,true)
What about P(X>=7) ? Change to X>6, then use 1-P(X<=6)
How about P(3 < X <= 8)? Change to P(X<=8) - P(X<=3)
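The same calculations can be sketched in Python, for anyone without Excel handy. The helper names binom_pmf and binom_cdf are my own (they are not built-in functions); they just implement the PMF formula above and add it up.

```python
from math import comb


def binom_pmf(x, n, p):
    """P(X = x): same role as BINOMDIST(x, n, p, FALSE)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x): same role as BINOMDIST(x, n, p, TRUE)."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))


n, p = 10, 0.5  # ten fair coin flips
print("P(X = 5)      =", binom_pmf(5, n, p))
print("P(X <= 3)     =", binom_cdf(3, n, p))
print("P(X > 7)      =", 1 - binom_cdf(7, n, p))
print("P(X >= 7)     =", 1 - binom_cdf(6, n, p))
print("P(3 < X <= 8) =", binom_cdf(8, n, p) - binom_cdf(3, n, p))
```

Every "greater than" or "between" question gets rewritten in terms of P(X<=something), exactly as in the Excel recipes.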
Mean & Standard Deviation:
If you flip a 60% coin 10 times, how many H do you expect? 6, of course.
So mean # of successes is E[X]=n*p
StdDev isn't so obvious. Var(X)=n*p*(1-p)
This is much more useful in Chapter 7.8, Binomial Approximations.
# of FLIPS until success:
P(X=x) = failure on x-1 flips, success on 1 flip = (1-p)^(x-1) * p
Some books call this the G1 distribution since it starts at x=1.
If we asked # FAILURES, not #FLIPS, that would start at x=0, call it G0.
(PMF is slightly different)
Wikipedia shows both types:
P(X<=x)=1-P(X>x)=1-P(x failures at start)=1-(1-p)^x
No equivalent function in Excel.
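Since the formulas are so short, it is easy to type them in yourself. Here is a small Python sketch for the G1 version (counting FLIPS, starting at x=1); the function names are made up for this example.

```python
def geom_pmf(x, p):
    """P(X = x) for x = 1, 2, ...: x-1 failures, then one success."""
    return (1 - p) ** (x - 1) * p


def geom_cdf(x, p):
    """P(X <= x) = 1 - P(the first x flips all fail)."""
    return 1 - (1 - p) ** x


p = 0.1
print("P(X = 3)           =", geom_pmf(3, p))
print("P(X <= 3)          =", geom_cdf(3, p))
print("check by adding up =", sum(geom_pmf(k, p) for k in range(1, 4)))
```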
Geometric distribution applet:
Book skips: Mean & Var for Geom
If each coin flip has a 1-in-10 chance of H, how many flips until H?
10 is the obvious answer, and it's right: E[X] = 1/p
Big important property of Geometric distribution: Memoryless!
If E[X]=10 and we're already on flip #8 without a H, E[#remaining flips]=10
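The memoryless property is surprising enough that it is worth checking by simulation. In this Python sketch I am assuming "already on flip #8 without a H" means the first 8 flips were all tails; the function name and the number of simulated runs are arbitrary choices.

```python
import random
from statistics import fmean


def flips_until_first_head(p, rng):
    """Flip a coin with P(heads) = p until the first head; return the flip count."""
    flips = 1
    while rng.random() >= p:  # tails, with probability 1 - p
        flips += 1
    return flips


rng = random.Random(360)
p = 0.1
samples = [flips_until_first_head(p, rng) for _ in range(200_000)]

already = 8  # assumed reading: 8 flips so far, none of them heads
remaining = [x - already for x in samples if x > already]

print("E[X]                            ~", fmean(samples))
print("E[remaining flips | 8 tails]    ~", fmean(remaining))
```

Both averages should come out near 1/p = 10: the coin does not "remember" the 8 tails.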
Things that might have a
Geometric distribution: # children per family? #dogs or #cats per family? #pets
per family? #people per car? #marriages per person? # officers at each rank of
the military (2nd Lt, Lt, Captain, Major, Lt. Colonel, Colonel, 1-star general,
etc.), or similarly for enlisted? #dancers left in SYTYCD callbacks? (data in
my email box)
Chapter 7.6: Normal Distribution
If you want to see the formula for the bell curve, visit
We hardly ever use that formula in Stats class, though, other than to graph it and shade in some areas so we can see what we are doing.
Start with Standard Normal: mean=0, stddev=1
This is so special it gets its own letter: z instead of x
(we already calculated z-scores; it's not a coincidence!)
Cumulative Distribution Functions: this applet draws the cumulative area under the curve:
Or a more old-fashioned applet, http://www.flashandmath.com/mathlets/calc/antplot/antplot.html
y range on f(x) to [0,0.4]
x range on f(x) to [-3,3]
F(-3) = 0.0013
It's hard to compute something like P(Z<=1) from scratch, so we use tables or Excel formulas.
Shade area on bell curve for Z<=2, look up in table, or use Excel formula =normdist(2,0,1,true)
And highlight on CDF graph.
Now try for P(-0.5 < Z < 0.5)
Now backwards: what z cutoff gives P(Z<=z) = 0.80 ?
And double-backwards: what z cutoff gives P(-z < Z < z)=0.95 ?
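If you want to check your table lookups, Python's standard library has a NormalDist class (Python 3.8 and later) that plays the same role as the Excel formulas. A minimal sketch of the four questions above:

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # Standard Normal

print("P(Z <= 2)              =", Z.cdf(2))
print("P(-0.5 < Z < 0.5)      =", Z.cdf(0.5) - Z.cdf(-0.5))
print("z with P(Z<=z)=0.80    =", Z.inv_cdf(0.80))
# The middle 95% leaves 2.5% in each tail, so look up the 97.5th percentile.
print("z with P(-z<Z<z)=0.95  =", Z.inv_cdf(0.975))
```

Here cdf goes from a cutoff to a probability, and inv_cdf goes backwards from a probability to a cutoff.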
Non-Standard distributions: translate to z-scores and back.
On the graph printouts, write in x values next to z values.
Speeds on a particular road average 40 mph, sigma=5
What % of speeds are under 45?
What % of speeds are between 30 and 50 ?
Normal Distribution applet:
(the autoscaling makes it near-worthless because it always looks the same,
but that's kind of the point!)
Excel: for finding Pr from cutoff, =normdist(cutoff, mean, std, true)
For finding cutoff from Pr, use =norminv(prob, mean, std )
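Here is the road-speed example worked in Python as a cross-check on the Excel formulas, again using the standard library's NormalDist. It is a sketch that takes the Normal model for speeds as given; the 90% cutoff question at the end is one I added for illustration.

```python
from statistics import NormalDist

speeds = NormalDist(mu=40, sigma=5)  # mph

under_45 = speeds.cdf(45)
between_30_and_50 = speeds.cdf(50) - speeds.cdf(30)
print("P(speed < 45)      =", under_45)
print("P(30 < speed < 50) =", between_30_and_50)

# Same first answer by translating to a z-score and using the Standard Normal.
z = (45 - 40) / 5
print("via z-score        =", NormalDist().cdf(z))

# Backwards (hypothetical extra question): what speed has 90% of cars below it?
print("90th percentile    =", speeds.inv_cdf(0.90))
```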
Chapter 7.7: Checking for Normality, and Normalizing Transformations
Big initial notice: don't sweat the details of the formulas here, since each book and software package does things a little differently.
Normal probability Plot, also called Q-Q (Quantile-Quantile) plot for Normal
Basic idea: could make a CRF plot directly from data (no binning), then overlay a NormalCDF plot. But it's hard to tell how well two CURVES match. So we plot x=exact Normal quantiles, y=data quantiles, which should make a straight line if the distribution is Normal.
Show Q-Q plot in existing Excel file; don't construct by hand in class!
If the data isn't Normal, sometimes we transform it
(take sqrt, cubert, or log) to see if that makes it more normal.
Some people say that stock market returns are normal once you take logs:
Ln(price today / price
This is called a LogNormal distribution.
Read in the book: using correlation coefficient to decide if it's reasonably close to linear.
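As noted above, each book and package does the details differently, so treat the following Python sketch as one possible convention and not as the book's method. It builds the points of a Normal probability plot using plotting positions (i - 0.5)/n (my assumption), then computes the correlation coefficient of those points before and after a log transformation. The data are made up: idealized right-skewed values, not real measurements.

```python
from math import exp, log, sqrt
from statistics import NormalDist, fmean


def normal_qq_points(data):
    """Pairs (theoretical z quantile, sorted data value)."""
    n = len(data)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5)/n are one common convention (an assumption here).
    zs = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(zs, sorted(data)))


def correlation(pairs):
    """Ordinary (Pearson) correlation coefficient of the plotted points."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical right-skewed data: e^z at 50 evenly spread Normal quantiles.
skewed = [exp(z) for z, _ in normal_qq_points(range(50))]
r_raw = correlation(normal_qq_points(skewed))
r_log = correlation(normal_qq_points([log(v) for v in skewed]))
print("r before transforming:", r_raw)
print("r after taking logs  :", r_log)
```

A correlation close to 1 means the plot is close to a straight line; how close is "close enough" depends on n, which is what the table in the book is for.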
Doing Data Science, page 31,
figure 2-1, shows a collection of various distributions. They left out highly-skewed
distributions like Pareto, and Gamma or Weibull with CV > 100% (so skewed
their PDF graph has a vertical asymptote at x=0, rather than touching the
Chapter 7.8: Approximating Binomial with Normal
We noticed that the Binomial distribution often looked bell-curve-shaped.
So we could approximate Binomial probabilities with Normal probs.
Match the mean & the StdDev.
mean = n*p, stddev = sqrt(n*p*(1-p) )
P( a < Binom(n,p) < b ) approx= P( a < Normal < b)
For instance: flip 1000 times with p=0.4
Pr( Binom within +/- 15 of mean of 400?)
Google Docs spreadsheet gives an overflow error:
name-brand Excel gives
Normal Approximation: =normdist(400+15,400,15.49,true)-normdist(400-15,400,15.49,true)
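For comparison, here is a Python sketch that computes the Binomial answer directly and sets it next to the Normal approximation. Two things are my assumptions: I read "within +/- 15" as 385 <= X <= 415 with the endpoints included, and I compute the PMF on the log scale (using lgamma) so that the huge and tiny factors never have to be formed separately, which is where overflow trouble can come from.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial PMF computed on the log scale to avoid overflow for large n."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


n, p = 1000, 0.4
mean = n * p
sd = sqrt(n * p * (1 - p))
approx = NormalDist(mean, sd)

# Assumed reading of "within +/- 15": 385 <= X <= 415, endpoints included.
exact = sum(binom_pmf(k, n, p) for k in range(385, 416))
plain = approx.cdf(415) - approx.cdf(385)
corrected = approx.cdf(415.5) - approx.cdf(384.5)  # continuity correction

print("exact Binomial            :", exact)
print("Normal approx             :", plain)
print("Normal approx w/ cont corr:", corrected)
```

Comparing the last two lines against the first shows how much the continuity correction matters in this example.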
Less-important, detail-oriented stuff: using < versus <=, and the Continuity Correction.
binomial approximations applet: http://www.jsc.nildram.co.uk/examples/sustats/normalapproximations/NormalApproximationsApplet.html
Doing Data Science, page 26:
At the other end of the spectrum from N=ALL, we have n=1, by which we mean a sample size of 1. In the old days a sample size of 1 would be ridiculous; you would never want to draw inferences about an entire population by looking at a single individual. And don’t worry, that’s still ridiculous. But the concept of n=1 takes on new meaning in the age of Big Data, where for a single person, we actually can record tons of information about them, and in fact we might even sample from all the events or actions they took (for example, phone calls or keystrokes) in order to make inferences about them. This is what user-level modeling is about.
But it's false that we wouldn't draw inferences about an
entire population based on n=1. n=1 is infinitely better than n=0, if you have
no prior information.
Examples where n=1 is important:
A new restaurant opens. You haven't read anything about it, but your friend tried it and hated it (or liked it).
More serious: In a Phase 1 (safety) medical trial, suppose that the first patient you give it to has a horrible reaction and dies immediately. Would you say “well n=1 doesn’t mean anything, let’s give it to the next person”?
E[X_1]= mu, no matter what.
n=1 lets you estimate
the mean but not the spread.
n=2 lets you estimate the spread (very poorly, but at n=1 it's impossible).
n=3 lets you estimate the skew (again, very poorly, but at n=2 it’s impossible to estimate skew)
n=4 lets you estimate the kurtosis (again, very poorly…)
Effect of increasing sample size on a boxplot: Start with a simple box-and-whisker plot, perhaps 50 data points, roughly symmetric. What will it look like if we take 10-times as much data?
The box edges will:
a) not systematically change
b) move much closer to the median
c) move farther away from the median
The whiskers will:
a) not systematically change
b) get longer
c) get shorter
We'll use the “billionaire-dotplot-histogram-crf-1000” file in class.
We'll also use these applet pages:
A very good article on why it’s important that the standard error falls inversely with sqrt(n) :
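To see the sqrt(n) behavior numerically, here is a small Python simulation sketch. The population (uniform between 0 and 1) and the number of repetitions are arbitrary choices of mine; the point is only to compare the simulated spread of the sample mean with sigma/sqrt(n).

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def sd_of_sample_means(n, reps=4000, seed=360):
    """Spread of the sample mean across many simulated samples of size n."""
    rng = random.Random(seed)
    # Hypothetical population: uniform on [0, 1), whose SD is sqrt(1/12).
    means = [fmean(rng.random() for _ in range(n)) for _ in range(reps)]
    return pstdev(means)


sigma = sqrt(1 / 12)
for n in (25, 100, 400):
    print(n, sd_of_sample_means(n), sigma / sqrt(n))
```

Notice that quadrupling the sample size only cuts the standard error in half.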
Below, I'm including some text about sampling distributions
from a different book. Please read it.
from "Workshop Statistics, 4th Edition" by Rossman and Chance:
Topic 13: Sampling Distributions: Proportions
*The concept of sampling distribution is one of the most difficult statistical concepts to firmly grasp because of the different "levels" involved. For example, here the original observational units are the candies, and the variable is the color (a categorical variable). But at the next level, the observational units are the samples, and the variable is the proportion of orange candies in the sample (a quantitative variable). Try to keep these different levels clear in your mind.
[to Math 360 students: we will see this categorical/quantitative split in Chapter 8.3; it's not so apparent in Chapter 8.1 and 8.2]
* It's essential to distinguish clearly between parameters [math 360: our book calls them population characteristics] and statistics. A parameter is a fixed numerical value describing a population. Typically, you do not know the value of a parameter in real life, but you may perform calculations assuming a particular parameter value. On the other hand, a statistic is a number describing a sample, which varies from sample to sample if you were to repeatedly take samples from the population.
* Notice that the Central Limit Theorem (CLT) specifies /three/ things about the distribution of a sample proportion [and also about a sample mean]: shape, center (as measured by the mean), and spread (as measured by the standard deviation). It's easy to focus on one of these aspects and ignore the other two. As with other normal distributions, drawing a sketch can help you to visualize the CLT.
* Ensure that these conditions hold before you apply the CLT: the sample needs to have been chosen randomly, and the sample size condition requires that n*pi>=10 and n*(1-pi)>=10 [math 360: we say n*p>=10 and n*(1-p)>=10]. ... it's the normal shape that depends on this condition. The results about the mean and standard deviation hold regardless of this condition.
* Notice that the sample size, relative to the value of the population parameter, is a key consideration when predicting whether or not the sampling distribution will be approximately normal. However, changing the sample size is in no way changing the parameters or shape of the categorical population distribution!
* As long as the population size is much larger than the sample size (say, 20 times larger), the /population/ size itself does not affect the behavior of the sampling distribution. This sounds counterintuitive to most people, because it means that a random sample of size 1000 from one [US] state will have the same sampling variability as a random sample of size 1000 from the entire country (with the same population proportion). But think about it: if chef Julia prepares soup in a regular-sized pot and chef Emeril prepares soup in a restaurant-sized vat, you can still learn the same amount of information about either soup from one spoonful. You don't need a larger spoonful to decide whether you like the taste of Emeril's soup.
* As we've said before, try not to confuse the sample size with the number of samples. The sample size is the important number that affects the behavior of the sampling distribution. In practice, you only get one sample. We have asked you to simulate a large number of samples only to give you a sense for how sample statistics vary under repeated sampling; we have tried to ask for enough samples (typically 500 or 1000) to give you a sense of what would happen in the long run. In fact, now that you know the Central Limit Theorem's description of how sample proportions vary under repeated sampling, you no longer need to simulate taking many samples from the population.
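[to Math 360 students: if you do want to simulate it once to convince yourself, here is a rough Python sketch. The population proportion 0.45 and sample size 75 are values I made up; the point is to compare the simulated center and spread of the sample proportion with what the CLT says, and to keep "sample size" and "number of samples" separate.]

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(360)
p = 0.45      # hypothetical population proportion (e.g. share of orange candies)
n = 75        # sample size; n*p and n*(1-p) are both at least 10
reps = 5000   # number of simulated samples (NOT the sample size!)

# Each entry is one sample's proportion: the "next level" variable.
p_hats = [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

theory_sd = sqrt(p * (1 - p) / n)
print("mean of p-hat:", fmean(p_hats), " theory:", p)
print("SD of p-hat  :", pstdev(p_hats), " theory:", theory_sd)
within_2sd = fmean(abs(ph - p) <= 2 * theory_sd for ph in p_hats)
print("share of samples within 2 SD of p:", within_2sd)
```

Changing reps only makes the simulation smoother; changing n is what changes the spread of the sampling distribution.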
Simpson’s Paradox supplemental reading:
From Jeff Witmer
at Oberlin: http://new.oberlin.edu/dotAsset/1801848.ppt
From Tom Moore at Grinnell: http://www.math.grinnell.edu/~mooret/reports/SimpsonExamples.pdf
And an article that’s very good for pre-service teachers: Representations of Reversal: An Exploration of Simpson's Paradox by Lawrence Mark Lesser, http://www.statlit.org/PDF/2001LesserNCTM.pdf
Also, this article discusses the possibility of a “Double Simpson’s Paradox”, and then it turns out that such a thing is impossible: Friedlander, Richard, and Stan Wagon. "Double Simpson's Paradox." Mathematics Magazine 66 (October 1993): 268
Simpson's Paradox, Lord's Paradox, and Suppression Effects are the same phenomenon – the
reversal paradox, Yu-Kang Tu, David Gunnell and Mark S Gilthorpe
Help on Problem 8.14:
A student asked me for more
guidance on homework 8 #8.14, where I ask you to draw a picture of the sampling
distribution (on paper, or however you want--no need to turn it in).
To give some examples of what I was imagining, I went and drew curves from our class examples (rather carefully) and they are in the file “m360-sampling-distribution-drawings.xls” . You don't have to do them this carefully at all--you could just freehand them.
I added an illustration of which probabilities we were computing by freehanding stuff in Microsoft Paint. Again, you don't have to do that.
The applet we
used last time, on sampling distribution:
Algebraic proof of why we use n-1 in sample variance:
Khan Academy videos and applets on use of n-1 in sample variance:
Confidence Intervals Applet from Peck/Olsen/Devore:
Big disclaimer: almost everything in science uses the basic idea of this chapter, but it has fundamental problems that most statisticians acknowledge!
How to kill your grandmother with statistics, the problem with Null Hypothesis Significance Testing (“if all results are equally good or bad to you, and you have no prior information”)
The Earth Is Round ( p<0.05), by Jacob Cohen, https://labs.psych.ucsb.edu/janusonis/skirmantas/cohen1994.pdf
Chapter 11.9, page 230 of:
What's Wrong with Significance Testing and What to Do Instead
book: The Cult of Statistical Significance, by S. Ziliak and D. McCloskey
quotes about HT: http://www.indiana.edu/~stigtsts/quotsagn.html
Leland Wilkinson and the Task Force on Statistical Inference: Statistical Methods in Psychology Journals
An article from the journal
Nature: Scientific method: Statistical errors
P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.
WE NEVER SAY what the probability of H0 itself is!
That's the Frequentist way. Bayesians are happy to talk about it.
In a criminal trial, we (the people) want to show strong evidence of guilt, so Ha = guilty; then H0 = not guilty, which is not the same as innocent. (though in the French system it's the opposite!)
If the data is unlikely under the supposition of H0, then we "reject H0" and accept Ha.
BUT if the data isn't particularly unlikely (supposing H0), then we NEVER NEVER NEVER "accept H0"; we just fail to reject it. In a criminal trial, we don't declare them innocent, we just fail to prove that they are guilty.
Hypotheses are about things we DON'T know, like mu or p or sigma. There's no point having a hypothesis about something we DO know, like xbar or phat.
Often we reason this way:
I want to show my product is better than the specification:
Ha : my quality (population % good) > specified value of p
So what's the opposite?
H0: my quality (population % good) <= that specified value of p
BUT: 1) to assume H0 is true we need a specific value, not just <=,
AND the most conservative thing to do is let it be as close as possible to what I'm trying to show: =specified value rather than <specified.
So while H0 often would naturally be a <= or >= we state & treat it as an =
We can never show that two things ARE equal, just fail to show that they aren't.
(but maybe we just didn't collect enough data)
Application idea: http://en.wikipedia.org/wiki/Demining Mine flail effectiveness can approach 100% in ideal conditions, but clearance rates as low as 50–60% have been reported. This is well below the 99.6% standard set by the United Nations for humanitarian demining.
Which tail should you pick? Depends on what you're trying to show:
my product is better than the standard
this product is worse than the standard.
Example 10.3 is very good!
But note that we might be willing to use the new treatment even if it is provably less effective than the old, as long as it's a lot cheaper or has fewer side effects. BUT you have to decide before you see the data; if you let your choice be influenced by the data, you won't hit your confidence/significance level!
Consider Oscar Pistorius (before his murder trial), or any other Paralympic athlete with prosthetics; to prove that they should be allowed in the ordinary Olympics, should they prove that they aren't better than others? Or prove they aren't worse? Or two-sided?
Wikipedia page on: Testing hypotheses suggested by the data
One-sided (one-tailed) vs Two-sided (two-tailed) tests:
If Ha is "this population is better" than a standard, use > of course,
or use < if Ha is "this population is worse" than a standard.
Sometimes we just want to show it's _different_ than a standard:
Ha: p not equal to some standard value. This is called a two-sided or two-tailed test.
Two-sided is the default if you can't decide; it's safer because its cutoff values are farther away from the hypothesized value.
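To see how much farther out the two-sided cutoffs are, here is a small sketch using the standard normal distribution from Python's standard library. The function name and the choice of alpha = 0.05 are just for illustration.

```python
from statistics import NormalDist

def z_cutoffs(alpha=0.05):
    """Return (one-sided cutoff, two-sided cutoff) for a z test at level alpha."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    one_sided = std_normal.inv_cdf(1 - alpha)      # all of alpha in one tail
    two_sided = std_normal.inv_cdf(1 - alpha / 2)  # alpha split between two tails
    return one_sided, two_sided

one_sided, two_sided = z_cutoffs(0.05)
print("one-sided cutoff:", round(one_sided, 3))
print("two-sided cutoffs: +/-", round(two_sided, 3))
```

With alpha = 0.05 the one-sided cutoff is about 1.645 and the two-sided cutoffs are about +/-1.960, so a test statistic between those values would reject one-sided but not two-sided.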
What if we really want to show that the mean IS equal to something, rather than not-equal?
There's no way to do it with hypothesis testing. Confidence Intervals to the rescue!
Page 585, Example 10.7: the null hypothesis implicitly includes mu<15, but the book says it includes mu>15 (am I right on that?)
Here is one way of laying out the 9-step process in Excel, all in one row (or actually two, one for headings and the other for numbers/calculations):
definition | H0 value of p | Ha | alpha | pop. size (approx) | n | #successes | phat | n*p | n*(1-p) | sample/pop | SE of phat ASSUMING H0 | test statistic z | p value | decision
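The same row of calculations can be sketched in Python instead of Excel. Everything here is an illustration under stated assumptions: the function name, the condition cutoffs (n*p and n*(1-p) at least 10, sample at most 10% of the population), and the example numbers (220 successes out of 400, tested against p = 0.5) are made up, not taken from the class spreadsheet.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_test(p0, n, successes, alternative="greater",
                          alpha=0.05, pop_size=None):
    """Fill in one row of the layout for H0: p = p0.

    alternative gives the direction of Ha: "greater", "less", or "two-sided".
    """
    phat = successes / n
    # Conditions for the normal approximation, checked under H0.
    # The cutoff of 10 is an assumed rule of thumb; textbooks differ.
    np0, nq0 = n * p0, n * (1 - p0)
    conditions_ok = np0 >= 10 and nq0 >= 10
    # sample/pop should be small (assumed cutoff: at most 10%).
    sample_fraction = None if pop_size is None else n / pop_size
    if sample_fraction is not None and sample_fraction > 0.10:
        conditions_ok = False
    se = sqrt(p0 * (1 - p0) / n)  # SE of phat ASSUMING H0
    z = (phat - p0) / se
    std_normal = NormalDist()
    if alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return {"phat": phat, "n*p": np0, "n*(1-p)": nq0,
            "sample/pop": sample_fraction, "conditions_ok": conditions_ok,
            "SE": se, "z": z, "p_value": p_value, "decision": decision}

# Made-up example: 220 successes in n = 400, H0: p = 0.5, Ha: p > 0.5.
row = one_proportion_z_test(0.5, 400, 220, alternative="greater")
for name, value in row.items():
    print(name, "=", value)
```

Note that the decision is worded "fail to reject H0" rather than "accept H0", matching the earlier discussion.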
has a quote from a prominent (award-winning) researcher: “most classifiers assume all errors are equally costly, but in reality this is seldom the case. Not deleting a spam email will cost a fraction of a second of your attention, but deleting an email from your boss could cost you your job...The bottom line is, you want to use either a natively cost-sensitive learner or an algorithm like MetaCost, or your system will be making a lot of costly mistakes.”
Here’s another opinion: Type II errors are the ones that get you fired, http://punkrockor.wordpress.com/2014/02/04/type-ii-errors-are-the-ones-that-get-you-fired-the-atlanta-edition/
And, another way of looking at errors:
Type S: an error in the Sign of an effect
Type M: an error in the Magnitude of an effect
Type III (and IV) error
How to Read Education Data Without Jumping To Conclusions
One of the less-obvious items:
3. Does the study have enough scale and power?
Very little calculation is required for this quiz; it's more
Here are the situations that will be part of the quiz. Each situation will be followed by 5 options, and you will choose 1 of those 5.
1. Only 33% of students correctly answered a difficult multiple-choice question on an exam given nationwide. Professor Chang gave the same question to her 35 students, hypothesizing that they would do better than students nationwide. Despite the lack of randomization, she performed a one-sided test of the significance of a sample proportion and got a P-value of 0.03. Which is the best interpretation of this P-value?
2. Researchers constructed a 95% confidence interval for the proportion of people who prefer apples to oranges. They computed a margin of error of +- 4%. In checking their work, they discovered that the sample size used in their computation was 1/4 of the actual number of people surveyed. Which is closest to the correct margin of error?
3. A survey of 200 randomly selected students at a large university found that 105 favor a stricter policy for keeping cars off campus. Is this convincing evidence that more than half of all students favor a stricter policy for keeping cars off campus?
4. In college populations, the annual incidence of infectious mononucleosis has been estimated to be as many as about 50 cases per 1000 students. A university student health service took a survey of students to test whether the rate of mononucleosis on their campus is different from this national rate. With alpha=0.05, they rejected the null hypothesis. Which is the best interpretation of "alpha=0.05" in this context?
5. In a pre-election poll, 51% of a random sample of voters plan to vote for the incumbent. A 95% confidence interval was computed for the proportion of all voters who plan to vote for the incumbent. What is the best meaning of "95% confidence"?
6. Sheldon takes a random sample of 50 U.S. housing units and finds that 30 are owner occupied. Using a significance test for a proportion, he is not able to reject the null hypothesis that exactly half of U.S. housing units are owner occupied. Later, Sheldon learns that the U.S. Census for the same year found that 66.2% of housing units are owner occupied. Select the best description of the type of error in this situation.
Here's a video that is meant for a different textbook, but it's still a good overview of what we've been looking at recently. It's interesting to note that their requirements for tests are slightly different: n>40 instead of n>30 for non-normal data, and n*p>15 in some cases, >10 in others, and >5 in some!
I had discussed the idea of regression instead of doing post-pre subtraction with a colleague a few years ago. Then she discussed it with someone else, who emailed us this suggestion:
… the statistical analysis may have more nuance than a simple difference of scores. Some people may say that the gain score should be adjusted by the pre-test, for example. See:
That page talks about ANCOVA (Analysis of Covariance), which is a more advanced topic than our course has time to tackle (we don't even get to the simpler version called ANOVA). So don't worry about the details on that; just think about the general experimental (or, in some cases, observational) setup and the questions we're asking. Also see "Use of covariates in randomized controlled trials" by Gerard J.P. Van Breukelen and Koene R.A. Van Dijk.
Wouldn’t it be great if we could skip a 2-sample t-test and just see if the confidence intervals for the two means overlap? It turns out, yes and no:
A Cautionary Note on the Use of Error Bars, by John R. Lanzante
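Here is a small sketch of the "no" part: two 95% confidence intervals that overlap even though the two-sample test finds a significant difference. The means and standard errors are invented for illustration, and a z procedure is assumed (large samples) to keep the code short.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def conf_interval(mean, se, conf=0.95):
    """z-based confidence interval for one mean."""
    zstar = STD_NORMAL.inv_cdf(1 - (1 - conf) / 2)
    return mean - zstar * se, mean + zstar * se

def two_sample_z(mean1, se1, mean2, se2):
    """Two-sided two-sample z test of H0: the two population means are equal."""
    se_diff = sqrt(se1 ** 2 + se2 ** 2)  # SE of the difference, not se1 + se2
    z = (mean2 - mean1) / se_diff
    p_value = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return z, p_value

def intervals_overlap(a, b):
    """True if intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Invented summary statistics: means 10.0 and 10.6, each with SE 0.2.
ci1 = conf_interval(10.0, 0.2)
ci2 = conf_interval(10.6, 0.2)
z, p_value = two_sample_z(10.0, 0.2, 10.6, 0.2)
print("CI 1:", ci1)
print("CI 2:", ci2)
print("overlap?", intervals_overlap(ci1, ci2))
print("z =", round(z, 2), " p-value =", round(p_value, 3))
```

The intervals overlap, yet z is about 2.12 and the two-sided p-value is about 0.03. The reason is that the SE of the difference is sqrt(se1^2 + se2^2), which is smaller than se1 + se2. So overlapping intervals do not show "no significant difference", although intervals that do not overlap at all do indicate a significant one in this setup.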
Data Science people use the term A/B testing for what statistics people call a 2-sample t-test or z-test, or sometimes more than 2 samples (in which case statisticians use ANOVA or Chi-squared tests):
How Obama Raised $60 Million by Running a Simple Experiment
By Dan Siroker
We tried four buttons and six different media (three images and three videos). We used Google Website Optimizer and ran this as a full-factorial multivariate test which is just a fancy way of saying we tested all the combinations of buttons and media against each other at the same time. Since we had four buttons and six different media that meant we had 24 (4 x 6) total combinations to test. Every visitor to the splash page was randomly shown one of these combinations and we tracked whether they signed up or not.
Obama campaign: a/b testing
Dec 12, 2012
Optimization was the name of the game for the Obama Digital team. We optimized just about everything from web pages to emails. Overall we executed about 500 a/b tests on our web pages in a 20 month period which increased donation conversions by 49% and sign up conversions by 161%. As you might imagine this yielded some fascinating findings on how user behavior is influenced by variables like design, copy, usability, imagery and page speed.
What we did on the optimization team was some of the most exciting work I've ever done. I still remember the incredible traffic surge we got the day the Supreme Court upheld Obamacare. We had a queue of about 5 ready-to-go a/b tests that would normally take a couple days to get through, yet we finished them in just a couple hours. We had never expected a traffic surge like that. We quickly huddled behind Manik Rathee—who happened to be the frontend engineer implementing experiments that day—and thought up new tests on the fly. We had enough traffic to get results on each test within minutes. Soon our colleagues from other teams gathered around us to see what the excitement was about. It was captivating to say the least.
examples where resampling works but t-tests don’t [includes Fun data sets,
* telling girl scouts their cookie sales will help fund a trip to Disneyland
* time to back out of a parking spot when there is/isn't someone waiting.]
On the NPR radio show "On The Media" on 2014-08-17,
UCLA law professor Jennifer Mnookin was talking about the use of videotaping in police interrogation rooms:
"We do know certain red flags that may be associated with false confessions. In many of the known false confession cases , the interrogations were unusually long. But at the same time, lots of true confessions may come after long interrogations."
What kind of thinking is that?
Let's see if we can gather some binomial data. Let's use families with 2 kids (if your family has more than 2, just consider the first 2).
Go here (but remove the REMOVETHIS first)
and enter either GG, GB, BG, or BB.
Desmos calculator that I made to show how the Chi-squared distribution changes as DoF increases:
My Excel sheet with a slider; it might work only on PCs rather than Macs. It also requires you to enable macros, which is dangerous in general but probably safe for this one file.
Mythbusters checked if yawns are contagious. The results:
25%, 4 out of 16, who were not exposed to a yawn, yawned while waiting. Call this the non-yawn group.
29%, 10 out of 34, who were exposed to a yawn, yawned. Call this the yawn group.
Is it a statistically significant difference? How small could the sample be to be able to detect a 25% vs 29% difference?
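One way to start on both questions is a pooled two-proportion z test, sketched below. Caveats: with only 4 yawners in the non-yawn group the usual normal-approximation conditions are shaky, so treat the p-value as rough. The second function answers a simplified version of the sample-size question (equal group sizes, sample proportions landing exactly on the given values, no power calculation); those simplifications and the function names are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """One-sided test of H0: p1 = p2 against Ha: p2 > p1, with a pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

def smallest_group_size(p1, p2, alpha=0.05):
    """Smallest equal group size n at which sample proportions exactly equal
    to p1 and p2 (with p2 > p1) would push z past the one-sided cutoff."""
    z_cut = NormalDist().inv_cdf(1 - alpha)
    pooled = (p1 + p2) / 2  # equal group sizes, so pooled is the simple average
    n = 2
    while (p2 - p1) / sqrt(pooled * (1 - pooled) * 2 / n) < z_cut:
        n += 1
    return n

# Mythbusters data: 4 of 16 in the non-yawn group, 10 of 34 in the yawn group.
z, p_value = two_proportion_z(4, 16, 10, 34)
print("z =", round(z, 2), " one-sided p-value =", round(p_value, 2))
print("group size needed for 25% vs 29%:", smallest_group_size(0.25, 0.29))
```

The test statistic comes out around 0.32 with a one-sided p-value around 0.37, nowhere near significant, and the group size needed under these simplifications is in the hundreds rather than the dozens.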
I'm also attaching two data sets on population by city or county, which we might have time to analyze in class using Benford's law: the probability that the first digit is i is Log10(1+1/i).
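The Benford formula can be turned into a goodness-of-fit check with a few lines of Python. This is a sketch, not the analysis we will do in class: the function names are made up, and since the population files are not reproduced here, the example data are the powers of 2 (a sequence well known to follow Benford's law closely).

```python
from math import log10

def benford_expected():
    """Expected proportion of each first digit 1..9 under Benford's law."""
    return {d: log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(value):
    """First nonzero digit of the whole-number part of value, or None if there is none."""
    digits = str(abs(int(value))).lstrip("0")
    return int(digits[0]) if digits else None

def first_digit_counts(values):
    """Count how many values start with each digit 1..9."""
    counts = {d: 0 for d in range(1, 10)}
    for v in values:
        d = first_digit(v)
        if d is not None:
            counts[d] += 1
    return counts

def chi_squared_stat(values):
    """Chi-squared goodness-of-fit statistic against Benford (8 degrees of freedom)."""
    counts = first_digit_counts(values)
    n = sum(counts.values())
    expected = benford_expected()
    return sum((counts[d] - n * expected[d]) ** 2 / (n * expected[d])
               for d in range(1, 10))

# Stand-in data: the first 100 powers of 2.
data = [2 ** k for k in range(1, 101)]
print("observed counts:", first_digit_counts(data))
print("expected proportions:",
      {d: round(p, 3) for d, p in benford_expected().items()})
print("chi-squared statistic:", round(chi_squared_stat(data), 3))
```

Compare the statistic to the chi-squared cutoff for 8 degrees of freedom (about 15.51 at alpha = 0.05); a statistic below the cutoff means we fail to reject the hypothesis that the first digits follow Benford's law.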
linear regression; it helps explore/explain the concepts from Chapter 13:
Testing for linearity: http://en.wikipedia.org/wiki/Lack-of-fit_sum_of_squares
And "Evaluation of three lack of fit tests in linear regression models", Journal of Applied, Volume 30, Issue 6, 2003
If we fit a line and get a good R^2, can we say there's a linear trend to the data?
Not really. We should also fit a quadratic (or power, exponential, etc) and show it's not much better than the mx+b fit.
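Here is a sketch of that comparison, using data that are exactly quadratic (y = x^2 for x = 1..10, invented for illustration). The least-squares fit is done with the normal equations in plain Python so that nothing beyond the standard library is needed; the function names are my own.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [a0, a1, ...] via the normal equations."""
    m = degree + 1
    # Augmented matrix: row i is  sum(x^(i+j)) * a_j = sum(x^i * y).
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum((x ** i) * y for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][m] / A[i][i] for i in range(m)]

def predict(coeffs, x):
    """Evaluate a0 + a1*x + a2*x^2 + ... at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

def r_squared(xs, ys, coeffs):
    """R^2 = 1 - SSResid / SSTotal for the fitted polynomial."""
    ybar = sum(ys) / len(ys)
    ss_total = sum((y - ybar) ** 2 for y in ys)
    ss_resid = sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total

# Invented data that are exactly quadratic.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]
r2_line = r_squared(xs, ys, polyfit(xs, ys, 1))
r2_quad = r_squared(xs, ys, polyfit(xs, ys, 2))
print("R^2 for the line:     ", round(r2_line, 4))
print("R^2 for the quadratic:", round(r2_quad, 4))
```

The line gets an R^2 of about 0.95, which looks good on its own, even though the data are not linear at all. Keep in mind that adding a term can never lower R^2, so the question is whether the quadratic is much better, not merely better.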
Why You Shouldn't Conclude "No Effect" from Statistically Insignificant Slopes: http://www.carlislerainey.com/2012/06/16/why-you-shouldnt-conclude-no-effect-from-statistically-insignificant-slopes/
More on Concluding "No Effect": http://www.carlislerainey.com/2012/06/27/more-on-concluding-no-effect/
Two different ways to bootstrap a regression model
1. Bootstrap data pairs: x_i = (c_i, y_i)
2. Bootstrap the residuals ⇒ x_i = (c_i, c_i β̂ + ε̂_i1)
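Both approaches can be sketched for a simple line fit. Assumptions to note: the model here has an intercept and one predictor c, the data are invented (roughly y = 2c), the interval is a plain percentile interval, and all function names are hypothetical.

```python
import random

def fit_line(cs, ys):
    """Least-squares (intercept, slope) for y = intercept + slope * c."""
    n = len(cs)
    cbar, ybar = sum(cs) / n, sum(ys) / n
    sxx = sum((c - cbar) ** 2 for c in cs)
    sxy = sum((c - cbar) * (y - ybar) for c, y in zip(cs, ys))
    slope = sxy / sxx
    return ybar - slope * cbar, slope

def bootstrap_pairs(cs, ys, reps, rng):
    """Method 1: resample whole (c_i, y_i) pairs with replacement, refit each time."""
    n = len(cs)
    slopes = []
    while len(slopes) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_c = [cs[i] for i in idx]
        if len(set(sample_c)) < 2:
            continue  # slope is undefined if every resampled c is the same
        slopes.append(fit_line(sample_c, [ys[i] for i in idx])[1])
    return slopes

def bootstrap_residuals(cs, ys, reps, rng):
    """Method 2: keep every c_i, add resampled residuals to the fitted values."""
    intercept, slope = fit_line(cs, ys)
    fitted = [intercept + slope * c for c in cs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    slopes = []
    for _ in range(reps):
        new_ys = [f + rng.choice(residuals) for f in fitted]
        slopes.append(fit_line(cs, new_ys)[1])
    return slopes

def percentile_interval(values, conf=0.95):
    """Middle conf fraction of the bootstrap values."""
    ordered = sorted(values)
    lo = ordered[int((1 - conf) / 2 * len(ordered))]
    hi = ordered[int((1 + conf) / 2 * len(ordered)) - 1]
    return lo, hi

# Invented data, roughly y = 2c.
rng = random.Random(360)
cs = list(range(1, 11))
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]
intercept, slope = fit_line(cs, ys)
pairs_ci = percentile_interval(bootstrap_pairs(cs, ys, 2000, rng))
resid_ci = percentile_interval(bootstrap_residuals(cs, ys, 2000, rng))
print("fitted slope:", round(slope, 3))
print("pairs bootstrap 95% interval:   ", pairs_ci)
print("residual bootstrap 95% interval:", resid_ci)
```

The pairs version treats the c values as random, while the residual version holds them fixed and leans on the fitted model being right, so the two intervals need not agree.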
Testing for difference of slopes in two data sets:
See that separate file.
Leemis diagram of distributions:
Decision tree of what distribution to use:
College Math Journal, Vol 31 No 4 September 2000: The Lognormal Distribution
by Brian E Smith and Francis J Merceret
Activities for Calc-Based Statistics Classes
Used in Reliability Engineering
Basic Concepts of Probability and Statistics for Reliability Engineering; Ernesto Gutierrez-Miravete
An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013). As of January 5, 2014, the pdf for this book will be available for free, with the consent of the publisher, on the book website.
Probability and Statistics for Computer Scientists, 2nd edition
Probability Foundations for Engineers, by Joel A. Nachlas
I asked some professors in our CompSci department what CS majors should get out of Math 360, and here are some of their responses:
seminal paper, cited more than 22,000 times:
Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57 (1995), 289-300.
p-values and q-values:
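The step-up procedure from the Benjamini and Hochberg paper is short enough to sketch in code. The p-values below are invented for illustration and the function names are my own. The second function returns BH-adjusted p-values, which are often loosely called q-values; other definitions of q-values differ in the details.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where H0 is rejected at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

def bh_adjusted(p_values):
    """BH-adjusted p-values: reject at rate q exactly when the adjusted value is <= q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # work down from the largest p-value
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Invented p-values from 10 hypothetical tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)
adjusted = bh_adjusted(pvals)
print("p < 0.05 with no correction:", sum(p < 0.05 for p in pvals))
print("rejected by BH at q = 0.05: ", sum(rejected))
print("adjusted p-values:", [round(a, 3) for a in adjusted])
```

With these made-up numbers, five tests have p < 0.05 on their own, but only the two smallest survive the procedure.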
I think it would be helpful if CS students knew about density/distribution functions, perhaps with more emphasis on discrete (but not entirely).
Conditional probability, Bayes rule, joint densities.
Diagnosis: medical, mechanical, really anything. Use Naive Bayesian Inference
Classification: Naive Bayesian Inference with MAP estimator - spam/textual filter classifier
Simulation: Monte Carlo techniques, various types of Markov processes
Hypothesis Testing - did this type of user interface increase productivity, did that new protocol increase throughput, etc.
In some cases, it is a bit hard to deconvolute the stats from the science.
Recommender systems have gone the way of matrix decompositions, but understanding distributions is definitely important. Variance, standard deviation, covariance... different forms of correlation, mutual information. Also, ways of looking at error of prediction.
Bioinformatics is a bit different, since it is a bit more experimental. You see a lot of use of Fisher's exact test for testing "enrichment" of annotations --- e.g., does a set of n genes found experimentally include more members who are annotated to appear in the nucleus than you would expect by chance? Understanding p-values, q-values/FDR is important.
Bayes rule shows up in several settings.
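The enrichment question above boils down to a one-sided Fisher's exact test, which is an upper-tail hypergeometric probability. A minimal sketch follows; the gene counts are invented for illustration and the function name is hypothetical.

```python
from math import comb

def enrichment_p_value(total, annotated, selected, observed):
    """P(X >= observed), where X is the number of annotated items in a random
    draw of `selected` items from `total` items, `annotated` of which carry
    the annotation (hypergeometric upper tail, i.e. one-sided Fisher's exact test)."""
    denom = comb(total, selected)
    upper = min(selected, annotated)
    tail = sum(comb(annotated, i) * comb(total - annotated, selected - i)
               for i in range(observed, upper + 1))
    return tail / denom

# Invented numbers: 1000 genes, 100 annotated to the nucleus,
# an experiment picks out 20 genes and 6 of them carry the annotation.
p_value = enrichment_p_value(total=1000, annotated=100, selected=20, observed=6)
print("expected annotated genes by chance:", 20 * 100 / 1000)
print("enrichment p-value:", p_value)
```

By chance we would expect about 2 annotated genes among the 20, so seeing 6 gives a small p-value. When many annotations are tested at once, these p-values would then need a multiple-testing correction such as FDR control.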
Personally, I would think that some Bayesian inference could be useful. Maybe signal detection theory. Cluster analysis might fit MATH 360. You would find a lot of potential projects in artificial intelligence, pattern recognition, and machine learning, among other areas.
Signal detection theory:
Try to make a summary table with these column headings:
#samples : 1, 2, >=3
means or proportions
paired or not
CI or HT?
The following web site shows you a bunch of statistics scenarios and you click on the statistical technique that is most appropriate, and get instant feedback. There's one type of test we haven't talked about, though: ANOVA. When clicking on types of tests to include, don't click on ANOVA.
PS: if you're curious about ANOVA:
" In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes t-test to more than two groups."
See the scanned pages with a lot of good questions, most of which are concept-based. Some notes:
* Only work on the ones that are multiple-choice; the ones labeled "Investigation" are not part of the practice test.
* There are occasional problems that require you to do some statistical calculations, including using z, t, or chi-squared tables or Excel functions.
* On page 621 of the scanned pages, you may skip problem C6 (a six-sided die).
* On page 663 of the scanned pages, you may skip problem C1.
The concept test that we did via emu-online, on confidence intervals and hypothesis tests, is also very good for you to study from.
However, the actual test questions will not be simple alterations of these questions; they will be new.
The test will also include a few computational questions; here are some practice problems for those:
A poll of 1000 people found that 53% said they were Republican and 47% said Democrat (we are ignoring unaffiliated voters). Of the Republicans, 20% were in favor of a particular new political proposal. Of the Democrats, 18% were in favor. Do an appropriate statistical analysis; show all work and reasoning.
In planning for a wind turbine to generate electric power, a city put up a wind-speed sensor in Location A and collected 7 days of data, with resulting speeds (avg per day, in mph) of:
10 12 11 12 9 9 12
The sensor was then moved to Location B, whose measurements were then
9 11 13 12 12 9 8
Do an appropriate statistical analysis; show all work and reasoning. If you need to make any assumptions, write down what you are assuming.
Frequentist vs Bayesian:
Statistically Significant outlier
Cell Phones and Cancer: