Guide: Running Student's t-Tests & Paired Samples t-Tests (Parametric Tests)

If you’re here, chances are you’ve already used Julius to run descriptive statistics on your data and explore its relationships, and you’ve probably tested your dataset for normality and homogeneity of variances. If you haven’t done these steps, go check out my guides on running descriptive statistics in Julius and on testing whether your data is suitable for parametric or non-parametric tests!

This particular guide goes over some of the parametric tests you can run in Julius. For simplicity’s sake, assume that in each example you have already run a normality test and determined that the dataset follows a normal distribution and has equal variances. Let’s check out some tests!

Student's t-Test (Independent Samples t-Test)

This test is used to determine whether there is a statistically significant difference between the means of two independent groups (groups that do not influence one another).

Situation: Researchers want to determine if there is a significant difference in exam scores between two different teaching methods. They randomly assigned 20 students to Group A (a hands-on approach to teaching) and 20 to Group B (more independent studying). After the teaching period ends, both groups take the same test. Here are the results:

group_a group_b
75 78
80 82
85 87
90 91
92 93
78 90
83 85
87 89
79 81
81 83
82 84
84 86
86 88
88 90
91 92
77 79
76 77
89 90
85 86
83 84

Determine if there is any difference between the groups' mean test scores.

Question 1a: Is there a statistically significant difference between Group A’s mean test score and Group B’s?
Prompt 1a: Perform an independent samples t-test to compare the means of Group A and Group B, please.
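If you’d like to see what this looks like under the hood, here is a minimal sketch in Python using SciPy (the scores are copied from the table above; the variable names are my own, not Julius’s):

```python
# Independent samples t-test comparing the two teaching-method groups.
from scipy import stats

group_a = [75, 80, 85, 90, 92, 78, 83, 87, 79, 81,
           82, 84, 86, 88, 91, 77, 76, 89, 85, 83]
group_b = [78, 82, 87, 91, 93, 90, 85, 89, 81, 83,
           84, 86, 88, 90, 92, 79, 77, 90, 86, 84]

# equal_var=True gives the classic Student's t-test, appropriate here
# because we have already confirmed the groups have equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # t = -1.421, p ≈ 0.163
```

Setting `equal_var=False` instead would run Welch’s t-test, which is the safer choice when the equal-variance assumption has not been checked.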

There was no statistically significant difference between the Group A and Group B means, but let’s explore the effect size of this data anyway using Cohen’s d. Before we run this: what is effect size? Effect size tells us how big the difference between two groups actually is; for Cohen’s d, 0.2 is conventionally a small effect, 0.5 a medium effect, and 0.8 a large effect. Let’s prompt Julius to run this test:

Question 1b: What is the effect size for the comparison between Group A and Group B?
Prompt 1b: Calculate Cohen’s d for the comparison between Group A and Group B, please.
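For reference, Cohen’s d for two independent groups is the difference in means divided by the pooled standard deviation. Here is a hedged sketch using only Python’s standard library (data from the table above; variable names are my own):

```python
import math
import statistics

group_a = [75, 80, 85, 90, 92, 78, 83, 87, 79, 81,
           82, 84, 86, 88, 91, 77, 76, 89, 85, 83]
group_b = [78, 82, 87, 91, 93, 90, 85, 89, 81, 83,
           84, 86, 88, 90, 92, 79, 77, 90, 86, 84]

n_a, n_b = len(group_a), len(group_b)
# statistics.variance is the sample variance (n - 1 in the denominator)
pooled_sd = math.sqrt(((n_a - 1) * statistics.variance(group_a) +
                       (n_b - 1) * statistics.variance(group_b)) /
                      (n_a + n_b - 2))
d = (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd
print(f"Cohen's d = {d:.2f}")  # ≈ -0.45: a small-to-medium effect
```

The negative sign simply reflects that Group A’s mean is below Group B’s; the magnitude is what you compare to the 0.2/0.5/0.8 benchmarks.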

Julius has calculated the effect size for this comparison. It also explains what the value means, including the direction and magnitude of the effect (Group A has a slightly lower mean than Group B).
Now let’s visualize the data to get a better understanding of the characteristics:

Question 1c: What does our data look like?
Prompt 1c: Can you provide a visualization of the data so that we can see the slight differences between Groups A and B?

Great! Julius has provided a visualization of the scores via a boxplot. Boxplots are useful for showing the distribution of numeric data: they convey both the central tendency and the spread (the “whiskers”, the lines extending from the top and bottom of the box, show how far the data spreads toward the maximum and minimum values). The shaded box is the “interquartile range” (IQR), which covers the middle ~50% of the data: roughly 25% of the values fall below the box (below the lower quartile) and roughly 25% fall above it (above the upper quartile), with the line inside the box marking the median.
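If you want to see exactly where the box comes from, you can compute the quartiles yourself. A quick sketch with NumPy, using Group A’s scores from the first table (the 1.5 × IQR whisker rule mentioned in the comment is Matplotlib’s default convention):

```python
import numpy as np

group_a = np.array([75, 80, 85, 90, 92, 78, 83, 87, 79, 81,
                    82, 84, 86, 88, 91, 77, 76, 89, 85, 83])

# The box spans Q1 to Q3 (the middle ~50% of the data),
# with a line at the median.
q1, median, q3 = np.percentile(group_a, [25, 50, 75])
iqr = q3 - q1

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
# Whiskers typically extend to the most extreme points within
# 1.5 * IQR of the box; anything beyond that is drawn as an outlier.
```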

Paired Samples T-Test

These tests are used when we have two sets of measurements collected from the same individuals under two different conditions. This is the perfect test to use when:

  1. You want to see if there is a difference between before and after a treatment,
  2. Two different treatments are applied to the same individual, or
  3. You have any paired comparison where the measurements are not independent of one another.

Let’s run a paired samples t-test!

Situation: A pharmaceutical company is testing the effectiveness of a new medication for lowering blood pressure within 8 hours. They recruit 20 people who are diagnosed with hypertension. They measure their blood pressure levels before administering the medication, then wait 8 hours and remeasure their blood pressure. Below is the data they collected:

patient before_score after_score
1 140 130
2 150 140
3 160 150
4 155 145
5 165 155
6 148 138
7 152 142
8 158 148
9 145 135
10 147 137
11 155 145
12 160 150
13 157 147
14 150 153
15 143 140
16 155 133
17 158 145
18 162 148
19 156 152
20 148 146

Determine if there is any difference between the blood pressure readings.

Question 2a: Is there any difference between the before and after blood pressure scores?
Prompt 2a: can we run a paired samples t-test comparing the before_score and after_score to see if there is any difference between the groups?
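Behind the scenes, a paired t-test works on the per-patient differences rather than treating the two columns as independent samples. A minimal sketch with SciPy (data from the table above; variable names are mine). Note that a paired test on 20 pairs has n − 1 = 19 degrees of freedom:

```python
from scipy import stats

before = [140, 150, 160, 155, 165, 148, 152, 158, 145, 147,
          155, 160, 157, 150, 143, 155, 158, 162, 156, 148]
after = [130, 140, 150, 145, 155, 138, 142, 148, 135, 137,
         145, 150, 147, 153, 140, 133, 145, 148, 152, 146]

# ttest_rel pairs each before value with its after value and tests
# whether the mean of the differences is zero.
t_stat, p_value = stats.ttest_rel(before, after)
print(f"t({len(before) - 1}) = {t_stat:.3f}, p = {p_value:.4g}")
```

The exact statistic you see will depend on how the data was entered, but with these numbers the before-minus-after differences are strongly positive and the p-value falls well below 0.001.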

Julius was able to run a paired t-test and produce the results (t(19) = 8.267, p < 0.001). This is great, but what next? A visualization! Since I’ve already gone over how to prompt for this, I will just provide the visualization below:

Now I’ll ask for a summary of the paired samples t-test result to get an overview of the findings:

Prompt 2b: can you provide a summary of the paired samples t-test results please?

Perfect, Julius has provided me with a breakdown of my steps and statistical analyses. We are almost done! I should calculate Cohen’s d to quantify the magnitude of the difference between the before and after scores. Since I have already shown how to do this in the previous example, I will just record the value.

Cohen’s d = -1.849, suggesting a large negative effect size. This means the medication had a substantial impact on reducing patients’ blood pressure. The negative value indicates the direction of the effect: scores decreased from before_score to after_score.

Finally, let’s look at the variability of the responses amongst patients by using a change score analysis. This analysis examines the distribution of the change scores (difference between the before and after scores) to better understand how each participant varied in their response to the drug. It can help us determine if the drug was consistent in the effect it had amongst all participants.

Question 2c: What is the variability of responses between participants?
Prompt 2c: can you perform a change score analysis to further understand the variability of the response to the drug?

Now we understand a lot more about the average change score and the variability each participant experienced! Feel free to play around with the dataset to explore other options Julius offers.

Keywords: AI statistics, AI statistical analysis, GPT, parametric tests, students t-test, independent t-test, paired samples t-test, descriptive statistics, cohen’s d, change score analysis


I teach statistical analysis to undergraduate (and postgraduate) students, but I intend to experiment with changing the delivery for the coming academic year to include using Julius in undergraduate modules.

I am also part of a team exploring how quantitative methods are taught across sport and exercise science undergraduate degrees and provisional findings suggest that the content mostly focuses on statistical tests, such as the t-test in your example above, and less on conceptual understanding.

I intend to evaluate students’ conceptual understanding by the prompts (questions in essence) they ask, so wanted to see what conceptual understanding Julius might provide given particular prompts.

Julius performed as I hoped it would. The following was what I asked:

Using your data, I asked for the same test but added an additional request: “Perform an independent t-test on the data and explain the results from a Fisherian and Neyman-Pearson perspective.” The response was accurate and I think would be informative to students.


The independent t-test resulted in a T-statistic of -1.421 and a P-value of 0.163.

Fisherian Perspective:

  • Fisher’s approach focuses on the p-value as a measure of evidence against the null hypothesis. According to Fisher, the smaller the p-value, the stronger the evidence against the null hypothesis. In this context, with a p-value of 0.163, Fisher would argue that there is not strong evidence against the null hypothesis (which posits no difference between the means of group_a and group_b). Fisher would likely suggest that the data does not provide strong evidence to support the claim that there is a significant difference between the two groups.

Neyman-Pearsonian Perspective:

  • The Neyman-Pearson framework involves setting a significance level (alpha) before conducting the test, often at 0.05. If the p-value is less than alpha, the null hypothesis is rejected in favor of the alternative hypothesis; otherwise, we fail to reject the null hypothesis. From this perspective, with a p-value of 0.163, which is greater than the typical alpha level of 0.05, we would fail to reject the null hypothesis. This means that, according to the Neyman-Pearson approach, there is not enough statistical evidence to conclude that there is a significant difference between the means of group_a and group_b.

In summary, both the Fisherian and Neyman-Pearsonian perspectives would conclude that the data does not provide sufficient evidence to claim a significant difference between the two groups based on the results of this t-test.

I went on to ask: "What does significant actually mean in the context of these results?"

Again, a generally informative response for students (and others less familiar with statistical methods), although one has to be somewhat cautious about some of the language used, as it might lead to misinterpretation:

In the context of these t-test results, “significant” refers to the statistical significance of the difference between the means of two groups. Statistical significance is a measure of the likelihood that the observed difference between groups is not due to random chance alone. Here’s what it means in relation to the obtained results:

  • T-statistic: The T-statistic of -1.421 indicates the direction and magnitude of the difference between the group means relative to the variability observed within the groups. A negative value suggests that the mean of group_a is lower than the mean of group_b, but the magnitude of this value needs to be interpreted in the context of the p-value.
  • P-value (0.163): The p-value quantifies the probability of observing a difference as large as (or larger than) the one observed in our data if the null hypothesis were true. The null hypothesis, in this case, states that there is no true difference between the means of group_a and group_b. A p-value of 0.163 means there’s a 16.3% chance of observing such a difference (or more extreme) due to random variation alone.

Interpretation of Significance:

  • Not Statistically Significant: Since the p-value (0.163) is greater than the conventional threshold of 0.05, the difference between the groups is not considered statistically significant. This means we do not have enough evidence to confidently say that the observed difference between group_a and group_b is not due to random chance. In practical terms, we would not act as if there is a real, meaningful difference between the two groups based on this test.

Contextual Meaning:

  • Practical Significance: It’s important to distinguish statistical significance from practical significance. Even if a result is statistically significant, it may not be practically significant. Practical significance refers to whether the size of the difference is large enough to be of real-world importance. In this case, since the result is not statistically significant, the question of practical significance may not be relevant. However, in cases where statistical significance is found, it’s still crucial to consider whether the effect size is meaningful in a practical context.

I would only add that the statement “A p-value of 0.163 means there’s a 16.3% chance of observing such a difference (or more extreme) due to random variation alone” is a common misinterpretation. The p-value does not directly quantify the probability that random variation alone produced the result. Instead, it quantifies the probability of observing data as extreme as (or more extreme than) the observed, assuming the null hypothesis is true.

So, some caution is warranted, as it is possible that some readers might assume that the p-value calculates the probability of the null hypothesis itself (i.e., that no true difference exists between the means of groups a and b). This is not true. The p-value measures the extremeness of the observed outcome (or more) given that the null hypothesis is true, not the probability of the null hypothesis itself.


I love to hear that you’re trying Julius out! I’m glad you’re enjoying it as much as I am! I think it is a great idea to include Julius in undergraduate modules. I have personally learned a lot about how to approach statistical analysis from the feedback and information it provides. It is also very user friendly and breaks complex concepts down into easier-to-understand material for people who are starting out in statistics, so I believe it would be a great way to teach your students.

As for your research team, that is a very interesting provisional finding. I think conceptual learning should be an important component of teaching statistics, because then students would understand the rationale for performing these analyses instead of just “going through the motions” of running a statistical analysis without really understanding why. The fact that you are testing this out with Julius is a great idea, and I cannot wait to hear more about it!

Thank you for your very thorough insight into the different perspectives that come from the Fisherian approach and the Neyman-Pearson framework, and for sharing how you used Julius to really dive into the meaning of significance. This brings up the topic of what being “significant” actually means in a practical setting, and the need to interpret that value with caution at times. There are so many other factors to take into consideration when looking at the p-value and what it truly means. Thank you for sharing your insight and giving people more “food for thought” when it comes to statistical analysis. :slight_smile:


Thank you for your very kind reply, Alysha. Your guides have motivated me to explore producing a couple of guides myself. :slightly_smiling_face:


Thank you for such an in-depth and educational response. I love to see it!
That is awesome! I am so excited to see your guides! I’ll keep a lookout for them. :slight_smile: