Have you ever found yourself staring aimlessly at your dataset, wondering where to start with the analysis? This guide will show you how to use Julius to determine which data analysis method best fits your data. Some of this information may seem tedious, but I promise you, it will be worth it.
Step 1: Review the Assumptions of each model.
After running descriptive statistics on your data, you can proceed with data analysis. For simplicity’s sake, I’ll focus on two categories of statistical tests: parametric and non-parametric tests.
Parametric Tests
These tests make assumptions about the distribution of the population from which the sample was taken. They assume the data follows a normal distribution. Here are the assumptions that your dataset must meet to run these tests:
1. Normality: Data should be normally distributed, meaning that it should follow a (somewhat) natural bell curve.
2. Homogeneity of variance: The variance of the dependent variable should be equal across the groups/conditions.
Example: You are comparing plant heights between two groups, Group A and Group B (N=5 each). After measuring, you find that the heights in Group A vary by about 0.2 inches, and the heights in Group B also vary by about 0.2 inches. This does not mean that the plants in both groups are the same height; it means that the spread of heights among the five plants in each group is about the same. This indicates homogeneity of variance across both groups.
3. Independence: Observations should be independent of each other.
Example: When measuring each plant’s height, we know that its height is not influenced by the height of another plant. Each plant’s height is independent of the others, since the plants do not influence each other’s growth.
4. Linearity (linear regression): There is a linear relationship between the independent and dependent variables. This means that as one variable changes, the other variable also changes at a constant rate.
5. Normality of Residuals (linear regression): In linear regression models, the residuals (differences between observed and predicted values) should be normally distributed, with the model’s errors centered around zero.
6. Continuous (measured on Interval or Ratio) Data: Parametric tests assume that the data is measured on an interval or ratio scale.
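For the two regression-specific assumptions (linearity and normality of residuals), a quick check outside of Julius might look like the sketch below. It assumes Python with NumPy and SciPy installed. The watering amounts and plant heights are made-up illustration data, not values from this guide’s dataset, and the Shapiro-Wilk test is only one common choice of normality test.

```python
import numpy as np
from scipy import stats

# Hypothetical data (not from this guide): daily water (ml) and plant height (inches)
water_ml = np.arange(1, 11, dtype=float)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.0])
height_in = 2.0 * water_ml + 1.0 + noise

# Fit a simple linear regression: height = intercept + slope * water
fit = stats.linregress(water_ml, height_in)

# Residuals are the differences between observed and predicted heights
predicted = fit.intercept + fit.slope * water_ml
residuals = height_in - predicted

# Shapiro-Wilk test on the residuals; null hypothesis: they are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"slope={fit.slope:.3f}, intercept={fit.intercept:.3f}, r^2={fit.rvalue**2:.3f}")
print(f"mean residual={residuals.mean():.2e}")
print(f"Shapiro-Wilk p-value for residuals={shapiro_p:.3f}")
```

A p-value above your chosen significance level (commonly 0.05) means the test found no evidence against normality of the residuals. An r^2 close to 1 is consistent with a linear relationship, though a scatter plot of the two variables is still worth a look.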
Non-Parametric Tests
These tests are distribution-free methods, meaning that the analysis does not directly rely on the distribution of the dataset/population. These methods are typically used when the data fails to meet the assumptions of parametric tests. The assumptions are as follows:
1. No assumption of normality: The data does not have to follow a normal distribution (natural bell curve).
2. No assumption of equal variances: These tests do not require homogeneity of variances across the groups.
Example: If the plant heights in Group A vary by ~0.5 inches while those in Group B vary by ~0.3 inches, the variances of Groups A and B are not equal.
3. Independence: Observations should be independent of one another (do not influence each other).
4. Ranking: These tests often involve ranking the data (e.g., ordering the values from highest to lowest).
5. Independence of Rank: Ranks assigned to observations should be independent of one another, meaning that one ranking should not influence another observation’s ranking.
6. Categorical or Ordinal Data: These tests can handle data that may not have a numerical value associated with it, but the data must have some sort of rank or order.
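To make the ranking idea concrete, here is a small sketch assuming Python with SciPy. The heights are hypothetical, and the Mann-Whitney U test is used only as one familiar example of a rank-based test; this guide does not prescribe a specific one.

```python
from scipy import stats

# Hypothetical plant heights in inches (illustration only)
group_a = [4.1, 5.3, 3.8, 6.0, 4.7]
group_b = [5.9, 6.4, 7.1, 5.5, 6.8]

# Rank all ten observations together (1 = smallest); ties would get average ranks
ranks = stats.rankdata(group_a + group_b)
ranks_a = ranks[: len(group_a)]
ranks_b = ranks[len(group_a):]

rank_sum_a = ranks_a.sum()  # 2 + 4 + 1 + 7 + 3 = 17
rank_sum_b = ranks_b.sum()  # 6 + 8 + 10 + 5 + 9 = 38

# The Mann-Whitney U test compares the groups using these ranks, not the raw heights
result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print("Rank sum A:", rank_sum_a, "Rank sum B:", rank_sum_b)
print("U statistic:", result.statistic, "p-value:", result.pvalue)
```

Because only the ordering of the values matters, replacing the largest height with a much larger one would leave the ranks, and therefore the test result, unchanged.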
Step 2: Running Assumption Checks on Different Datasets
Question 1: How do I determine if my data fits parametric or non-parametric tests?
Prompt 1: Can you run tests that check the homogeneity of variance and normality of my dataset please?
Explanation: The initial prompt was incorrect because it did not specify that the heights should be tested separately for each group when checking the assumptions. This resulted in a significant p-value under the ‘group’ heading but not under the ‘height’ heading. Levene’s test was also significant, which would indicate unequal variances (Levene’s test assesses homogeneity of variance, not normality). Let’s reassess this by specifying that the tests should consider the groups separately.
Now, we can see how this outcome differs from the first one. This shows the importance of using the correct prompt based on your data’s characteristics. Always double-check your analysis when running tests!
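For readers who want to see what considering the groups separately means in code, the following is a rough sketch in Python, assuming pandas and SciPy and a long-format table with `group` and `height` columns. The values are invented for illustration (they are not the dataset used with Julius), and Shapiro-Wilk is assumed as the normality test, since this guide does not say which one Julius ran.

```python
import pandas as pd
from scipy import stats

# Hypothetical long-format data: one row per plant (values invented for illustration)
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "height": [10.2, 10.4, 10.1, 10.3, 10.5, 12.1, 12.4, 12.2, 12.5, 12.3],
})

# Pitfall: testing every height at once mixes two groups with different means
pooled_p = stats.shapiro(df["height"].to_numpy())[1]

# Better: split the heights by group, then test each group on its own
samples = {name: g["height"].to_numpy() for name, g in df.groupby("group")}
normality_p = {name: stats.shapiro(values)[1] for name, values in samples.items()}

# Levene's test checks homogeneity of variance across the groups (not normality)
levene_stat, levene_p = stats.levene(*samples.values())

print(f"Pooled Shapiro-Wilk p-value: {pooled_p:.4f}")
for name, p in normality_p.items():
    print(f"Group {name} Shapiro-Wilk p-value: {p:.4f}")
print(f"Levene p-value: {levene_p:.4f}")
```

If each group’s p-value and Levene’s p-value come out above 0.05, there is no evidence against normality or equal variances. The pooled test is included only to illustrate how skipping the group split can give a misleading picture.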
Question 2: What does the distribution look like on a histogram?
Prompt 2: Can you create a histogram of height?
Again, the initial prompt was incorrect as it did not specify separating the data by groups. Let’s try again:
Julius misunderstood the request, thinking the data frame should be separated. Let’s try another prompt that clarifies our requirements:
Perfect! Julius has made the correct adjustments, and now I can examine the distribution of my data points. My data is normally distributed with homogeneous variances, the observations are independent of one another, and height is a continuous measurement. This means I can run a parametric test to compare the mean heights of Groups A and B and assess the statistical significance of the difference.
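The decision made in this walkthrough can be summarized as a small helper. This is a minimal sketch assuming Python with SciPy. The function name `compare_two_groups`, the 0.05 threshold, and the choice of the independent t-test and the Mann-Whitney U test as the parametric and non-parametric options are all illustrative assumptions, not a description of what Julius runs internally.

```python
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick a two-group test based on normality and equal-variance checks.

    Hypothetical helper for illustration; independence and measurement scale
    must still be judged from how the data was collected.
    """
    # Shapiro-Wilk per group: p > alpha means no evidence against normality
    normal = all(stats.shapiro(x)[1] > alpha for x in (a, b))
    # Levene's test: p > alpha means no evidence against equal variances
    equal_var = stats.levene(a, b)[1] > alpha

    if normal and equal_var:
        # Assumptions look reasonable: parametric comparison of means
        stat, p = stats.ttest_ind(a, b, equal_var=True)
        return "independent t-test", p
    # Otherwise fall back to a rank-based, non-parametric comparison
    stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    return "Mann-Whitney U", p

# Hypothetical heights in inches (not the dataset used in this guide)
group_a = [10.2, 10.4, 10.1, 10.3, 10.5]
group_b = [12.1, 12.4, 12.2, 12.5, 12.3]
test_name, p_value = compare_two_groups(group_a, group_b)
print(test_name, p_value)
```

The helper only automates the two checks that can be computed from the numbers themselves; whether the observations are independent and the data is continuous remains a judgment about the study design.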
Keywords: AI statistics, AI statistical analysis, GPT, parametric, non-parametric, normality, assumptions, homogeneity of variances