# Virtual Monday Oct 5

Welcome to our first Virtual Monday of October, y’all.

Let’s pick up where we were with calculating a confidence interval, then get into what we need to do in two other situations with CIs: (1) when sample sizes are ≤ 100 and (2) when we’re dealing with a proportion instead of a mean.

# Example from end of Friday’s class, with n > 100

“How many hours per day do you spend on social media?” Suppose n = 200, mean = 2.8, and s = 5.6. Find a 95% confidence interval for this population mean.

In Friday’s class, we calculated this CI using the CI formula, point estimate ± SE * z-score. We started by determining that…

• …our point estimate (of the population parameter) is the sample statistic, 2.8;
• SE = SD / square root of sample size = 5.6 / √200 = .40;
• and the z-score for 95% confidence (or, 2.5% probability in each tail) = 1.96.

95% CI = 2.8 ± .40 * 1.96 = (2.02 , 3.58)

Interpret: We are 95% confident that the mean number of hours spent on social media per day in the population is between 2.02 and 3.58.
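If you'd like to check this arithmetic on a computer, here is a minimal Python sketch of the same calculation. It is only an illustration: the function name `z_confidence_interval` is made up for this example, and it assumes the large-sample (n > 100) approach used above. The z-score comes from Python's built-in `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, s, n, confidence=0.95):
    """Return (lower, upper) for a large-sample CI for a population mean.

    Hypothetical helper for illustration only; it follows
    point estimate ± SE * z-score.
    """
    se = s / sqrt(n)                      # standard error of the mean
    tail = (1 - confidence) / 2           # probability in each tail
    z = NormalDist().inv_cdf(1 - tail)    # about 1.96 for 95% confidence
    margin = se * z
    return mean - margin, mean + margin


lower, upper = z_confidence_interval(mean=2.8, s=5.6, n=200)
print(f"95% CI = ({lower:.2f}, {upper:.2f})")   # prints: 95% CI = (2.02, 3.58)
```

The code keeps SE and z unrounded until the very end, so with other numbers it can differ from a hand calculation in the last decimal place.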

Notice that we need to specify that the CI is about the population mean, not the sample mean. Remember that we’re using the CI to make an inference about the population (plus we already know the sample mean to start with!).

## Do It Yourself (DIY) Example 1

Try it yourself! Suppose that the average GPA of a sample of BOM majors (n = 196) is 3.01, with s = 0.67. Using this data, find and interpret a 95% confidence interval for the average GPA of all Gettysburg College students.

# What changes when n ≤ 100?

When our sample size gets smaller, the sample becomes a less accurate representation of the population; specifically, the sample standard deviation (s) becomes a less reliable estimate of σ. To properly “widen” our confidence interval to account for that extra uncertainty, we need to swap out the z-score for a t-score. So, let’s talk about the t-distribution for a bit, then get back around to using it to calculate a CI with n ≤ 100.

## The t-distribution

Just like z-scores, t-scores are a set of “scores” that fall along different, specific points on a bell-shaped distribution, ranging from negative to positive numbers, with t = 0 in the middle. However, unlike the z-distribution, the t-distribution changes slightly depending on sample size: the smaller the sample, the more spread out it becomes (and with large samples, it looks almost exactly like the normal distribution).

Well, technically, rather than sample size, the t-distribution changes depending on something called degrees of freedom (df). And in the case of wanting to calculate a CI for a population mean, df = n – 1. (We’ll talk a little bit more about what df is conceptually, in class on Wednesday. For now, just remember that it’s equal to n – 1.)

Let’s take a look at a table of t values (just like we had for z-scores) from your textbook. (This is also posted as a PDF on Moodle with our slides.)

In the t-table here, you have different df values down the side, and different tail probabilities across the top. Actually, you have two options for looking up tail probabilities — either as a two-tailed confidence level (for example, 95% confidence) or as a one-tailed probability (notice that p of .025 is the same column as 95% confidence).

So, let’s work a quick example. Suppose you’re working with n of 10. What’s the t-score for 95% confidence?

To find this, look in the row for n – 1 = 10 – 1, or 9 df; and look in the column for 95% confidence. The correct t-score here is 2.262.
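For the curious, the values in a t-table can be reproduced numerically. The sketch below is one rough way to do it using only Python's standard library; it is not how your textbook built its table. The function names (`t_pdf`, `area_zero_to`, `t_score`) are invented for this example. It adds up the area under the t curve with Simpson's rule and then hunts for the t-score that leaves the right amount of area in the tails.

```python
from math import exp, lgamma, log, pi


def t_pdf(x, df):
    """Height of the t-distribution curve at x for the given df."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def area_zero_to(x, df, steps=2000):
    """Area under the t curve between 0 and x (Simpson's rule, even steps)."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return total * h / 3


def t_score(n, confidence=0.95):
    """Two-tailed t-score for sample size n (n >= 2), found by bisection."""
    df = n - 1
    target = confidence / 2        # area between 0 and the t-score
    low, high = 0.0, 100.0         # assumes the answer is below 100
    for _ in range(60):
        mid = (low + high) / 2
        if area_zero_to(mid, df) < target:
            low = mid              # not enough area yet: move right
        else:
            high = mid             # too much area: move left
    return (low + high) / 2


print(round(t_score(10), 3))       # prints: 2.262
```

In practice you would read the value off the table (or use a statistics package), but this shows that the table is just areas under a curve that depends on df.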

Here are a few more practice options before we use the t-distribution to calculate CIs.

What’s the t-score when n = 12 and we’re using 90% confidence? (Start by finding the df.)

What’s the t-score when n = 6 and we’re using 99% confidence? (Again, find the df first.)

One quick note about the person who created the t-distribution. The original formula for Student’s t-test was developed by William Sealy Gosset (1876-1937). He published a paper about the test using an alias, “Student,” because his employer was very wary of sharing their trade secrets publicly. Take a guess who Student worked for…

## Calculating a CI with n ≤ 100

So, let’s calculate a CI in a case with n ≤ 100. We’ll work with the same variable as earlier, but a smaller n.

“How many hours per day do you spend on social media?” Suppose n = 20, the sample mean = 2.8, s = 5.6. Find a 95% confidence interval for this population mean.

Using the CI formula, point estimate ± SE * t-score…

• …our point estimate (of the population parameter) is the sample statistic, 2.8;
• SE = SD / square root of sample size = 5.6 / √20 = 1.25;
• and the t-score for 95% confidence and df of 19 (or, 2.5% probability in each tail) = 2.093.

95% CI = 2.8 ± 1.25 * 2.093 = (0.18 , 5.42)

Interpret: We are 95% confident that the mean number of hours spent on social media per day in the population is between 0.18 and 5.42.
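Here is the same small-sample calculation as a short Python sketch, in case you want to check your work by computer. The function name `t_confidence_interval` is made up for illustration, and the t-score (2.093 for df = 19 at 95% confidence) is typed in by hand from the t-table, not computed.

```python
from math import sqrt


def t_confidence_interval(mean, s, n, t_score):
    """Return (lower, upper) using point estimate ± SE * t-score.

    Hypothetical helper: t_score must be looked up in a t-table
    for df = n - 1 and the chosen confidence level.
    """
    se = s / sqrt(n)            # standard error of the mean
    margin = se * t_score
    return mean - margin, mean + margin


# t = 2.093 comes from the t-table row for df = 19, 95% confidence column
lower, upper = t_confidence_interval(mean=2.8, s=5.6, n=20, t_score=2.093)
print(f"95% CI = ({lower:.2f}, {upper:.2f})")   # prints: 95% CI = (0.18, 5.42)
```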

Notice this CI is wider than the one we got for n of 200 up top. That CI was (2.02, 3.58). What are the two reasons this has happened? (Hint: look at the numbers used in each calculation.)

## DIY Example 2

OK, another time to try it yourself, using the same example as above in DIY Example 1, but with a smaller sample size.

Suppose that the average GPA of a sample of BOM majors (n = 20) is 3.01, with s = 0.67. Using this data, find and interpret a 95% confidence interval for the average GPA of all Gettysburg College students.

# Two other things to know about CIs

There are two other key things to know about CIs that this is a good time to point out.

## Can we use CIs when we have non-normally distributed data?

First, if you’ve got a skewed (or otherwise non-normal) distribution in your sample, you might worry about whether you can calculate and trust a CI.

But fear not, dear statisticians! Remember that the idea behind the CI is the central limit theorem: even if our sample data are non-normal, the sampling distribution of means will be approximately normal, as long as the sample size is reasonably large. In other words, yes – we can still calculate and trust a CI in that case, whatever the shape of the sample data’s distribution.
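You can watch the central limit theorem at work with a small simulation. The sketch below is only an illustration under made-up settings: it draws many samples of n = 50 from a strongly right-skewed population (an exponential distribution with mean 1 and SD 1, chosen just for this example) and looks at how the sample means behave.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(5)        # fixed seed so the run is repeatable

n = 50                # size of each sample (assumed for illustration)
num_samples = 2000    # how many samples we draw

# Each sample comes from a right-skewed population with mean 1 and SD 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

theoretical_se = 1 / sqrt(n)   # population SD / square root of n

# Share of sample means that land within 1.96 SEs of the true mean.
within = sum(abs(m - 1.0) <= 1.96 * theoretical_se for m in sample_means)
share_within = within / num_samples

print(f"mean of sample means: {mean(sample_means):.3f}")
print(f"SD of sample means:   {stdev(sample_means):.3f}")
print(f"theoretical SE:       {theoretical_se:.3f}")
print(f"share within 1.96 SE: {share_within:.3f}")
```

Even though the individual data points are heavily skewed, the sample means cluster around the population mean, their SD is close to the standard error, and roughly 95% of them fall within 1.96 SEs of the true mean, which is what a normal sampling distribution would predict.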

## A tradeoff: Confidence vs. precision

Second, with CIs, we have to balance two competing interests: (a) we want high confidence, and (b) we want a useful, precise estimate of the population parameter.

Remember that on Friday we talked about a 100% CI being totally useless: even though you’d have perfect confidence in the CI, it would have to contain all possible values (like saying the mean test score was somewhere in [0, 100]… that tells us nothing!), meaning you’re being very “imprecise” with your estimate.

More generally speaking, as we increase the % confidence level (say, from 90% to 95%), we’re going to have a wider final CI. You can see that this is going to be the case due to an increasingly large t (or z) score — look at how the t-scores increase from left to right going across any single row of the t-table above.
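To see the tradeoff in numbers, here is a short Python sketch that recalculates the n = 200 social media CI at three confidence levels. It is an illustration only; the variable names are made up, and the z-scores come from Python's built-in `statistics.NormalDist` instead of a table.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, s, n = 2.8, 5.6, 200    # social media example, n = 200
se = s / sqrt(n)                     # standard error of the mean

widths = {}
for confidence in (0.90, 0.95, 0.99):
    # z-score that leaves (1 - confidence) / 2 in each tail
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = se * z
    widths[confidence] = 2 * margin
    print(
        f"{confidence:.0%} CI: "
        f"({sample_mean - margin:.2f}, {sample_mean + margin:.2f}), "
        f"width {2 * margin:.2f}"
    )
```

Nothing about the sample changes from one line of output to the next; only the z-score grows, and the interval widens with it.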

So, you have to decide whether you want relatively high confidence (usually we use 99% as our highest confidence level), or high precision (AKA a narrower CI, perhaps using 90% confidence). Using a confidence level around 95% usually gives us a good balance of these two goals (confidence, precision).

Let’s do a quick check to make sure this is making sense: say you have a 95% and a 90% CI created from the same sample data. One CI – but we’re not sure which – is (9, 11) and the other is (8, 12). Which one must be the 90% CI, and which must be the 95% CI? Why?

That’s all for part 1 this week, folks! Check back on Moodle for the videos in part 2 (there are lots of symbols, so videos will be more helpful there!). 