Have you ever found yourself just staring aimlessly at your dataset, wondering where to even start with the analysis? Well, I am here to show you how to use Julius to help determine which data analysis may fit best with the data you have! Some of this information may be a little tedious, but I promise you, it will be worth it.

**Step 1: Review the Assumptions of each model.**

After running descriptive statistics on your data, we can then proceed to data analysis. For simplicity’s sake, I’ll focus on two categories of statistical tests that you can run: parametric and non-parametric tests.

**Parametric Tests**

These tests make assumptions about the distribution of the population from which the sample was taken. Most notably, they assume that the data follows a normal distribution. Your dataset must meet the following assumptions to run these tests:

**1. Normality:** data should be normally distributed, meaning that it should follow a (somewhat) natural bell curve.

**2. Homogeneity of variance:** the variance of the dependent variable should be equal across the groups/conditions.

**Example:** you are comparing the heights of two groups of plants: Group A and Group B. Each group has 5 plants. After measuring the heights, you find that the heights of the plants in Group A vary by about 0.2 inches, and the heights in Group B also vary by about 0.2 inches. This does not necessarily mean that the plants in both groups are the same height; it just means that the spread of heights within each group is about the same (roughly 0.2 inches). This means that there is homogeneity of variance across both groups.

**3. Independence:** observations should be independent of each other.

**Example:** when we measure each plant’s height, we know that its height is not influenced by, or dependent on, the measurement of another plant’s height. In other words, each plant’s height is independent of the other plants since they do not influence the growth of one another.

**4. Linearity (linear regression):** there is a linear relationship between the independent and dependent variables. This means that as one variable changes, the other variable also changes at a constant rate.

**5. Normality of Residuals (linear regression):** in linear regression models, the residuals (differences between observed and predicted values) should be normally distributed. This means the model’s errors should be centered at or around zero.

**6. Continuous (measured on Interval or Ratio) Data:** Parametric tests assume that the data is measured on an interval or ratio scale.
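To make the two regression assumptions above a little more concrete, here is a minimal sketch of how you might check them yourself in Python with NumPy and SciPy. The data is made up for illustration (hours of sunlight and plant height are my own hypothetical variables, not part of the dataset used in this post), so treat it as a rough outline rather than what Julius runs behind the scenes.

```python
import numpy as np
from scipy import stats

# Hypothetical data: hours of sunlight (independent variable) and
# plant height in inches (dependent variable). Values are simulated.
rng = np.random.default_rng(seed=0)
sunlight_hours = np.linspace(2, 10, 30)
height_in = 1.5 + 0.4 * sunlight_hours + rng.normal(0, 0.2, size=30)

# Fit a straight line with ordinary least squares.
fit = stats.linregress(sunlight_hours, height_in)

# Residuals = observed values minus the values predicted by the line.
predicted = fit.intercept + fit.slope * sunlight_hours
residuals = height_in - predicted

# Shapiro-Wilk test on the residuals: a p-value above 0.05 means we
# have no evidence that the residuals depart from normality.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"slope: {fit.slope:.3f}, intercept: {fit.intercept:.3f}")
print(f"mean of residuals: {residuals.mean():.6f}")
print(f"Shapiro-Wilk p-value for residuals: {shapiro_p:.3f}")
```

The mean of the residuals should come out at essentially zero, which is what "errors centered around zero" looks like in practice.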

**Non-Parametric Tests**

These tests are distribution-free methods, meaning that the analysis does not directly rely on the distribution of the dataset/population. These methods are typically used when the data fails to meet the assumptions of parametric tests. The assumptions are as follows:

**1. No assumption of normality**: the data does not have to follow a normal distribution (natural bell curve).

**2. No assumption of equal variances:** these tests do not require homogeneity of variances across the groups.

**Example:** if the overall variance in height between the plants in Group A was ~0.5 inches and in Group B it was ~0.3 inches, then the variances of Groups A and B would not be equal. A non-parametric test could still be used in this case.

**3. Independence:** Observations should be independent of one another (do not influence each other).

**4. Ranking:** these tests often involve ranking the data (e.g., ordering it from highest to lowest).

**5. Independence of Rank:** ranks assigned to observations should be independent of one another, meaning that one ranking should not influence the ranking of another observation.

**6. Categorical or Ordinal Data:** these tests can handle data that may not have a numerical value associated with it, but the data must have some sort of rank or order.
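If you are curious what "ranking the data" looks like in practice, here is a small sketch in Python using SciPy. The plant heights are hypothetical numbers I made up for illustration, and the Mann-Whitney U test is just one common rank-based test for two independent groups; it is not necessarily the test Julius would pick for your data.

```python
from scipy import stats

# Hypothetical plant heights in inches (made-up values for illustration).
group_a = [4.1, 4.3, 4.0, 4.6, 4.2]
group_b = [4.8, 5.1, 4.7, 5.3, 4.9]

# Rank all ten observations together (rank 1 = shortest plant).
ranks = stats.rankdata(group_a + group_b)
ranks_a = ranks[:len(group_a)]
ranks_b = ranks[len(group_a):]
print("Ranks for Group A:", ranks_a)
print("Ranks for Group B:", ranks_b)

# Mann-Whitney U test: compares the two groups using ranks rather than
# the raw values, so it does not assume normality.
result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U statistic: {result.statistic}, p-value: {result.pvalue:.4f}")
```

Notice that the test only cares about the order of the heights, not how far apart they are.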

**Step 2: Running Assumption Checks on Different Datasets**

**Question 1:** I want to see whether the characteristics of my data fall under parametric or non-parametric tests.

**Prompt 1:** can you run tests that check the homogeneity of variance and normality of my dataset please?

**I did not prompt this correctly and I’ll explain why:** I forgot to tell it to separate height by group when running the assumption tests. This is why we see a significant p-value under the “group” heading but not under “height”. Levene’s test is also significant, which would indicate unequal variances (Levene’s test checks homogeneity of variance, not normality). I want to reassess this, specifying that it should look at the groups separately. Let’s recheck this:

Now we can see how this outcome differs compared to the first one. This also shows how important it is to prompt for the correct test based on your data’s characteristics. Always double check your analysis when running tests!
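For anyone who wants to double check this kind of result by hand, here is a rough sketch of running the checks per group in Python with SciPy. The records below are hypothetical (I am assuming a dataset with a group label and a height value per plant, and the numbers are made up), so your output will differ.

```python
from collections import defaultdict
from scipy import stats

# Hypothetical records: (group label, height in inches). Made-up values.
records = [
    ("A", 4.1), ("A", 4.3), ("A", 4.0), ("A", 4.6), ("A", 4.2),
    ("B", 4.8), ("B", 5.1), ("B", 4.7), ("B", 5.3), ("B", 4.9),
]

# Split the heights by group first; this is the step my first prompt skipped.
heights_by_group = defaultdict(list)
for group, height in records:
    heights_by_group[group].append(height)

# Shapiro-Wilk normality test, run separately on each group's heights.
normality_p = {}
for group, heights in heights_by_group.items():
    _, p_value = stats.shapiro(heights)
    normality_p[group] = p_value
    print(f"Group {group}: Shapiro-Wilk p-value = {p_value:.3f}")

# Levene's test for homogeneity of variance across the groups.
levene_stat, levene_p = stats.levene(*heights_by_group.values())
print(f"Levene p-value = {levene_p:.3f}")
```

A p-value above 0.05 on Shapiro-Wilk suggests no evidence against normality for that group, and a p-value above 0.05 on Levene's test suggests no evidence that the variances differ.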

**Question 2:** What does the distribution look like on a histogram?

**Prompt 2:** Can you create a histogram on height?

Now let’s look at a histogram that goes over the distribution of height:

Again, I did not prompt this correctly as I did not ask for Julius to separate it by groups. Let’s try again:

Julius misunderstood what I wanted, thinking I wanted the data frame to be separated. Let’s try another prompt that reiterates what we want:

Perfect! Julius has made the correct adjustments and now I can look at the distribution of my data points. My data is normally distributed with homogeneity of variance, my points are independent of one another, and the dataset is measured on a continuous scale! This means I can run a parametric test to compare the mean heights of Groups A and B and assess the statistical significance of the difference.
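As a quick illustration of that next step, here is a minimal sketch of an independent two-sample t-test in Python with SciPy. The heights are hypothetical values I made up, and the t-test is my assumption of a suitable parametric test for comparing two group means once the assumptions hold.

```python
from scipy import stats

# Hypothetical plant heights in inches (made-up values for illustration).
group_a = [4.1, 4.3, 4.0, 4.6, 4.2]
group_b = [4.8, 5.1, 4.7, 5.3, 4.9]

# Independent two-sample t-test. equal_var=True assumes homogeneity of
# variance, which is one of the assumptions checked earlier.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"t statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
if p_value < 0.05:
    print("The difference in mean height is statistically significant.")
else:
    print("No statistically significant difference in mean height.")
```

If Levene's test had been significant, setting `equal_var=False` would run Welch's t-test instead, which does not assume equal variances.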

Keywords: AI statistics, AI statistical analysis, GPT, parametric, non-parametric, normality, assumptions, homogeneity of variances