Guide: Running Student's t-Tests & Paired Samples t-Tests (Parametric Tests)

If you’re here, chances are you’ve already used Julius to run descriptive statistics on your data and explore its relationships, and you’ve probably tested your dataset for normality and homogeneity of variances. If you haven’t done these steps, go check out my guides on running descriptive statistics in Julius and on testing whether your data is suitable for parametric or non-parametric tests!

This particular guide goes over some of the parametric tests you can run in Julius. For simplicity’s sake, assume that in each example you have already run a normality test and determined that the dataset follows a normal distribution and has equal variances. Let’s check out some tests!

Student's t-Test (Independent Samples t-Test)

This test is used to determine whether there is a statistically significant difference between the means of two independent groups (groups that do not influence one another).

Situation: Researchers want to determine if there is a significant difference in exam scores between two different teaching methods. They randomly assigned 20 students to Group A (a hands-on approach to teaching) and 20 to Group B (more independent studying). After the teaching period ends, both groups take the same test. Here are the results:

group_a group_b
75 78
80 82
85 87
90 91
92 93
78 90
83 85
87 89
79 81
81 83
82 84
84 86
86 88
88 90
91 92
77 79
76 77
89 90
85 86
83 84

Determine if there is any difference between the groups' mean test scores.

Question 1a: Is there a statistically significant difference between Group A’s mean test score and Group B’s?
Prompt 1a: Perform an independent samples t-test to compare the means of Group A and Group B, please.
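If you’d like to see what this looks like under the hood, here is a minimal sketch in Python using SciPy (the scores are copied from the table above; the variable names are my own, not Julius’s):

```python
# Independent samples t-test comparing the two teaching-method groups.
from scipy import stats

group_a = [75, 80, 85, 90, 92, 78, 83, 87, 79, 81,
           82, 84, 86, 88, 91, 77, 76, 89, 85, 83]
group_b = [78, 82, 87, 91, 93, 90, 85, 89, 81, 83,
           84, 86, 88, 90, 92, 79, 77, 90, 86, 84]

# equal_var=True gives the classic Student's t-test, appropriate here
# because we have already confirmed the groups have equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # t = -1.421, p ≈ 0.163
```

Setting `equal_var=False` instead would run Welch’s t-test, which is the safer choice when the equal-variance assumption has not been checked.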

There was no statistically significant difference between the Group A and Group B means, but let’s explore the effect size of this data anyway using Cohen’s d. Before we run this: what is effect size? Effect size tells us how big the difference between two groups actually is; for Cohen’s d, 0.2 is conventionally a small effect, 0.5 a medium effect, and 0.8 a large effect. Let’s prompt Julius to run this test:

Question 1b: What is the effect size for the comparison between Group A and Group B?
Prompt 1b: Calculate Cohen’s d for the comparison between Group A and Group B, please.
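For reference, Cohen’s d for two independent groups is the difference in means divided by the pooled standard deviation. Here is a hedged sketch using only Python’s standard library (data from the table above; variable names are my own):

```python
import math
import statistics

group_a = [75, 80, 85, 90, 92, 78, 83, 87, 79, 81,
           82, 84, 86, 88, 91, 77, 76, 89, 85, 83]
group_b = [78, 82, 87, 91, 93, 90, 85, 89, 81, 83,
           84, 86, 88, 90, 92, 79, 77, 90, 86, 84]

n_a, n_b = len(group_a), len(group_b)
# statistics.variance is the sample variance (n - 1 in the denominator)
pooled_sd = math.sqrt(((n_a - 1) * statistics.variance(group_a) +
                       (n_b - 1) * statistics.variance(group_b)) /
                      (n_a + n_b - 2))
d = (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
print(f"Cohen's d = {d:.2f}")  # ≈ -0.45: a small-to-medium effect
```

The negative sign simply reflects that Group A’s mean is below Group B’s; the magnitude is what you compare to the 0.2/0.5/0.8 benchmarks.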

Julius has calculated the effect size for this comparison. It also explains what the value means, including the direction and magnitude of the effect (Group A has a slightly lower mean than Group B).
Now let’s visualize the data to get a better understanding of the characteristics:

Question 1c: What does our data look like?
Prompt 1c: Can you provide a visualization of the data so that we can see the slight differences between Groups A and B?

Great! Julius has provided a visualization of the scores via a boxplot. Boxplots are useful for showing the distribution of numeric data: they convey both the central tendency and the spread (the “whiskers”, the lines extending from the top and bottom of the box, show how far the data spreads toward the maximum and minimum values). The shaded box is the “interquartile range” (IQR), which covers the middle ~50% of the data: roughly 25% of the values fall below the box (below the lower quartile) and roughly 25% fall above it (above the upper quartile), with the line inside the box marking the median.
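If you want to see exactly where the box comes from, you can compute the quartiles yourself. A quick sketch with NumPy, using Group A’s scores from the first table (the 1.5 × IQR whisker rule mentioned in the comment is Matplotlib’s default convention):

```python
import numpy as np

group_a = np.array([75, 80, 85, 90, 92, 78, 83, 87, 79, 81,
                    82, 84, 86, 88, 91, 77, 76, 89, 85, 83])

# The box spans Q1 to Q3 (the middle ~50% of the data),
# with a line at the median.
q1, median, q3 = np.percentile(group_a, [25, 50, 75])
iqr = q3 - q1

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
# Whiskers typically extend to the most extreme points within
# 1.5 * IQR of the box; anything beyond that is drawn as an outlier.
```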

Paired Samples T-Test

These tests are used when we have two sets of measurements collected from the same individuals under two different conditions. This is the perfect test to use when:

  1. You want to see if there is a difference between before and after a treatment,
  2. Two different treatments are applied to the same individual, or
  3. You have any paired comparison where the measurements are not independent of one another.

Let’s run a paired samples t-test!

Situation: A pharmaceutical company is testing the effectiveness of a new medication for lowering blood pressure within 8 hours. They recruit 20 people who are diagnosed with hypertension. They measure their blood pressure levels before administering the medication, then wait 8 hours and remeasure their blood pressure. Below is the data they collected:

patient before_score after_score
1 140 130
2 150 140
3 160 150
4 155 145
5 165 155
6 148 138
7 152 142
8 158 148
9 145 135
10 147 137
11 155 145
12 160 150
13 157 147
14 150 153
15 143 140
16 155 133
17 158 145
18 162 148
19 156 152
20 148 146

Determine if there is any difference between the blood pressure readings.

Question 2a: Is there any difference between the before and after blood pressure scores?
Prompt 2a: can we run a paired samples t-test comparing the before_score and after_score to see if there is any difference between the groups?
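Behind the scenes, a paired t-test works on the per-patient differences rather than treating the two columns as independent samples. A minimal sketch with SciPy (data from the table above; variable names are mine). Note that a paired test on 20 pairs has n − 1 = 19 degrees of freedom:

```python
from scipy import stats

before = [140, 150, 160, 155, 165, 148, 152, 158, 145, 147,
          155, 160, 157, 150, 143, 155, 158, 162, 156, 148]
after = [130, 140, 150, 145, 155, 138, 142, 148, 135, 137,
         145, 150, 147, 153, 140, 133, 145, 148, 152, 146]

# ttest_rel pairs each before value with its after value and tests
# whether the mean of the differences is zero.
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t({len(before) - 1}) = {t_stat:.3f}, p = {p_value:.4g}")
```

The exact statistic you see will depend on how the data was entered, but with these numbers the before-minus-after differences are strongly positive and the p-value falls well below 0.001.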

Julius was able to run a paired t-test and produce the results (t(19) = 8.267, p < 0.001). This is great, but what next? A visualization! Since I’ve already gone over how to prompt for this, I will just provide the visualization below:

Now I’ll ask for a summary of the paired samples t-test result to get an overview of the findings:

Prompt 2b: can you provide a summary of the paired samples t-test results please?

Perfect, Julius has provided me with a breakdown of my steps and statistical analyses. We are almost done! I should calculate Cohen’s d to quantify the magnitude of the difference between the before and after scores. Since I have already shown how to do this in the previous example, I will just record the value.

Cohen’s d = -1.849, suggesting a large negative effect size. This means the medication had a substantial impact on reducing patients’ blood pressure. The negative value indicates the direction of the effect: scores decreased from before_score to after_score.

Finally, let’s look at the variability of the responses amongst patients by using a change score analysis. This analysis examines the distribution of the change scores (difference between the before and after scores) to better understand how each participant varied in their response to the drug. It can help us determine if the drug was consistent in the effect it had amongst all participants.

Question 2c: What is the variability of responses between participants?
Prompt 2c: can you perform a change score analysis to further understand the variability of the response to the drug?

Now we understand a lot more about the average change score and the variability each participant experienced! Feel free to play around with the dataset to explore other options Julius offers.

Keywords: AI statistics, AI statistical analysis, GPT, parametric tests, students t-test, independent t-test, paired samples t-test, descriptive statistics, cohen’s d, change score analysis


I teach statistical analysis to undergraduate (and postgraduate) students, but I intend to experiment with changing the delivery for the coming academic year to include using Julius in undergraduate modules.

I am also part of a team exploring how quantitative methods are taught across sport and exercise science undergraduate degrees and provisional findings suggest that the content mostly focuses on statistical tests, such as the t-test in your example above, and less on conceptual understanding.

I intend to evaluate students’ conceptual understanding by the prompts (questions in essence) they ask, so wanted to see what conceptual understanding Julius might provide given particular prompts.

Julius performed as I hoped it would. The following was what I asked:

Using your data, I asked for the same test but added an additional request: “Perform an independent t-test on the data and explain the results from a Fisherian and Neyman-Pearson perspective.” The response was accurate and I think would be informative to students.


The independent t-test resulted in a T-statistic of -1.421 and a P-value of 0.163.

Fisherian Perspective:

  • Fisher’s approach focuses on the p-value as a measure of evidence against the null hypothesis. According to Fisher, the smaller the p-value, the stronger the evidence against the null hypothesis. In this context, with a p-value of 0.163, Fisher would argue that there is not strong evidence against the null hypothesis (which posits no difference between the means of group_a and group_b). Fisher would likely suggest that the data does not provide strong evidence to support the claim that there is a significant difference between the two groups.

Neyman-Pearsonian Perspective:

  • The Neyman-Pearson framework involves setting a significance level (alpha) before conducting the test, often at 0.05. If the p-value is less than alpha, the null hypothesis is rejected in favor of the alternative hypothesis; otherwise, we fail to reject the null hypothesis. From this perspective, with a p-value of 0.163, which is greater than the typical alpha level of 0.05, we would fail to reject the null hypothesis. This means that, according to the Neyman-Pearson approach, there is not enough statistical evidence to conclude that there is a significant difference between the means of group_a and group_b.

In summary, both the Fisherian and Neyman-Pearsonian perspectives would conclude that the data does not provide sufficient evidence to claim a significant difference between the two groups based on the results of this t-test.

I went on to ask: "What does significant actually mean in the context of these results?"

Again, a generally informative response for students (and others less familiar with statistical methods), although one has to be somewhat cautious about some of the language used, as it might lead to misinterpretation:

In the context of these t-test results, “significant” refers to the statistical significance of the difference between the means of two groups. Statistical significance is a measure of the likelihood that the observed difference between groups is not due to random chance alone. Here’s what it means in relation to the obtained results:

  • T-statistic: The T-statistic of -1.421 indicates the direction and magnitude of the difference between the group means relative to the variability observed within the groups. A negative value suggests that the mean of group_a is lower than the mean of group_b, but the magnitude of this value needs to be interpreted in the context of the p-value.
  • P-value (0.163): The p-value quantifies the probability of observing a difference as large as (or larger than) the one observed in our data if the null hypothesis were true. The null hypothesis, in this case, states that there is no true difference between the means of group_a and group_b. A p-value of 0.163 means there’s a 16.3% chance of observing such a difference (or more extreme) due to random variation alone.

Interpretation of Significance:

  • Not Statistically Significant: Since the p-value (0.163) is greater than the conventional threshold of 0.05, the difference between the groups is not considered statistically significant. This means we do not have enough evidence to confidently say that the observed difference between group_a and group_b is not due to random chance. In practical terms, we would not act as if there is a real, meaningful difference between the two groups based on this test.

Contextual Meaning:

  • Practical Significance: It’s important to distinguish statistical significance from practical significance. Even if a result is statistically significant, it may not be practically significant. Practical significance refers to whether the size of the difference is large enough to be of real-world importance. In this case, since the result is not statistically significant, the question of practical significance may not be relevant. However, in cases where statistical significance is found, it’s still crucial to consider whether the effect size is meaningful in a practical context.

I would only add that the statement “A p-value of 0.163 means there’s a 16.3% chance of observing such a difference (or more extreme) due to random variation alone” is a common misinterpretation. The p-value does not directly quantify the probability that random variation alone produced the result. Instead, it quantifies the probability of observing data as extreme as (or more extreme than) the observed, assuming the null hypothesis is true.

So, some caution is warranted, as it is possible that some readers might assume that the p-value calculates the probability of the null hypothesis itself (i.e., that no true difference exists between the means of groups a and b). This is not true. The p-value measures the extremeness of the observed outcome (or more) given that the null hypothesis is true, not the probability of the null hypothesis itself.


I love to hear that you’re trying Julius out! I’m glad you’re enjoying it as much as I am! I think it is a great idea to include Julius in undergraduate modules. I have personally learned a lot about how to approach statistical analysis from the feedback and information it provides. It is also very user friendly and breaks complex concepts down into easier-to-understand material for people who are starting out in statistics, so I believe it would be a great way to teach your students.

As for your research team, that is a very interesting provisional finding. I think conceptual learning should be an important component of teaching statistics, because then students would understand the rationale for performing these analyses instead of just “going through the motions” of running a statistical analysis without really understanding why. The fact that you are testing this out with Julius is a great idea, and I cannot wait to hear more about it!

Thank you for your very thorough insight into the different perspectives that come from the Fisherian approach and the Neyman-Pearson framework, and for sharing how you used Julius to really dive into the meaning of significance. This brings up the topic of what being “significant” actually means in a practical setting, and the need to interpret that value with caution at times. There are so many other factors to take into consideration when looking at the p-value and what it truly means. Thank you for sharing your insight and giving people more “food for thought” when it comes to statistical analysis. :slight_smile:


Thank you for your very kind reply, Alysha. Your guides have motivated me to explore producing a couple of guides myself. :slightly_smiling_face:


Thank you for such an in-depth and educational response. I love to see it!
That is awesome! I am so excited to see your guides! I’ll keep a lookout for them. :slight_smile: