Exploratory Data Analysis (EDA) is a fundamental step in the data science process, as it can reveal underlying patterns, structures, and potential relationships between variables. While descriptive statistics provide a snapshot of a dataset through summary metrics such as the mean, median, mode, and standard deviation, EDA takes this a step further: it explores the dataset through visualizations such as histograms, scatter plots, and box plots to uncover trends, identify anomalies, and probe hypotheses. These insights are crucial for generating additional hypotheses and informing the researcher's choice of subsequent statistical analyses.
The Titanic dataset is a well-known dataset in the scientific community, and it is often used to demonstrate how to run EDA. It contains detailed information on the passengers of the RMS Titanic, which sank in 1912. Here we will use Julius to help us perform EDA on the dataset, with the goal of furthering our understanding of the factors that may have influenced passenger survival rates and uncovering other trends that may be present in the data.
Step 1: Calling in the dataset
We can first ask Julius to retrieve the dataset using Python.
Prompt: can you import the dataset via the code:
import seaborn as sns
df = sns.load_dataset('titanic')
The dataset displays the following:
a. survived (0 = no, 1 = yes)
b. pclass (passenger class: 1st, 2nd, or 3rd)
c. sex
d. age
e. sibsp (# of siblings or spouses aboard)
f. parch (# of parents or children aboard)
g. fare
h. embarked (port of embarkation: C, Q, or S)
i. class (passenger class as a text label)
j. who (man, woman, or child)
k. adult_male (True or False)
l. deck (the deck the passenger's cabin was on)
m. embark_town (port of embarkation as a town name)
n. alive (yes/no version of survived)
o. alone (whether the passenger traveled alone)
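As a sketch of what this import and a quick preview might look like if you ran it locally (assuming seaborn is installed; `load_dataset` fetches the Titanic CSV from seaborn's online data repository, so this needs an internet connection):

```python
import seaborn as sns

# load_dataset downloads the Titanic CSV bundled with seaborn's example data
df = sns.load_dataset('titanic')

print(df.shape)             # (891, 15): 891 passengers, 15 columns
print(df.columns.tolist())  # the fifteen columns listed above
print(df.head())            # first five rows as a quick sanity check
```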
Step 2: Running Descriptive statistics on the dataset
To get an idea of the metrics for each column, we can prompt Julius to run some descriptive statistics!
Prompt: Please give me the descriptive statistics of the dataset using the following code:
print('\nSummary statistics of the dataset:')
print(df.describe())
Great, Julius has given us a nice simple table showing the summary statistics of the dataset, which includes the mean, standard deviation, minimum value, the 25th, 50th (median), and 75th percentiles, and the maximum value.
Step 3: Data cleaning
The next thing I want to address in the dataset is the missing "null" values that we have in the deck column, as well as to check for other missing values:
Prompt: check columns for missing values please.
We can see that Julius has checked the columns for missing values. It looks like we have four columns to adjust for: age, embarked, deck, and embark_town. For these values I will do the following:
- Fill missing values for “age” with the median (the age distribution is right-skewed, so the median is more robust to outliers than the mean; note that this imputation does introduce bias, so be mindful when interpreting results).
- Drop the rows with missing “embarked” values.
- Drop the “deck” column because it has too many missing values.
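In pandas, the missing-value check and these cleaning steps can be sketched roughly as follows (shown here on a tiny stand-in frame so the effect of each step is easy to see; on the real dataset, `df` would be the frame loaded in Step 1, and Julius's actual code may differ):

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Titanic frame, with the same kinds of gaps
df = pd.DataFrame({
    'age':      [22.0, np.nan, 38.0, 26.0],
    'embarked': ['S', 'C', None, 'Q'],
    'deck':     [None, 'C', None, None],
})

print(df.isnull().sum())                         # per-column missing counts

df = df.drop(columns=['deck'])                   # too many missing values
df['age'] = df['age'].fillna(df['age'].median()) # median imputation
df = df.dropna(subset=['embarked'])              # drop rows missing 'embarked'

print(df)                                        # preview the cleaned frame
```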
Let’s prompt Julius to do this and preview the dataset!
Prompt: please remove the deck column, fill missing values for ‘age’ with the median age, and drop the rows with missing “embarked” values. Please show me a preview of the updated dataset once you are finished.
Great, the dataset has been cleaned and Julius has brought up the corresponding table. Let’s look at some data visualizations now!
Step 4: Data Visualizations
a. Creating a histogram of the distribution of age
We can check the distribution of age by creating a histogram of ages and their corresponding frequencies. This will give us an idea of whether there were any patterns related to passengers' ages. Remember, we did fill the missing values (n = 177) with the median age, so we do have some bias here.
Prompt: let’s plot the age distribution on a histogram, please provide a skewness value associated with it and explain what it means.
According to Julius, we have a positive skew of approximately 0.507, which indicates our data is right-skewed. Additionally, from visual inspection we can see that the most frequent age is around 28, followed by the mid-twenties and then the early twenties. Overall, the Titanic seemed to carry a large proportion of people in their 20s and 30s.
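A minimal version of that plot and skewness check might look like the following (using seaborn's `histplot` and pandas' built-in `.skew()`; the exact value depends on the median imputation from Step 3, and Julius's own code may differ):

```python
import matplotlib
matplotlib.use('Agg')        # headless-safe backend for scripts
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('titanic')
df['age'] = df['age'].fillna(df['age'].median())  # same imputation as Step 3

# Histogram of ages with a kernel density overlay
ax = sns.histplot(df['age'], bins=30, kde=True)
ax.set(title='Age distribution of Titanic passengers', xlabel='Age')
plt.savefig('age_distribution.png')

skewness = df['age'].skew()
print(f'Skewness: {skewness:.3f}')  # positive value => right-skewed
```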
b. Creating a bar graph on survival count by sex
Next we can look at the survival rate by sex by creating a bar graph. We can then discern any potential trends that may be present.
Prompt: let’s look at the survival count by sex, please give a brief overview of the visual trends.
From visual inspection, it seems that females tended to have a higher survival rate than males. However, further analysis should be done in order to truly conclude if there are significant findings. Remember, EDA is to get an idea of the trends and then craft additional hypotheses and corresponding statistical analyses that target those questions.
Although Julius uses the phrase "significantly higher" when comparing the female survival rate to the male rate, I would be hesitant to draw that conclusion until statistical analyses confirm it and other parameters are tested for.
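A sketch of that bar graph, with a quick numeric companion (the raw survival rate within each sex) to ground the visual impression; Julius's actual code may differ:

```python
import matplotlib
matplotlib.use('Agg')        # headless-safe backend for scripts
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('titanic')

# Counts of survivors vs. non-survivors, split by sex
ax = sns.countplot(data=df, x='sex', hue='survived')
ax.set(title='Survival count by sex', ylabel='Passenger count')
plt.savefig('survival_by_sex.png')

# Survival rate within each sex (mean of the 0/1 survived flag)
rates = df.groupby('sex')['survived'].mean()
print(rates)
```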
c. Creating a boxplot to visualize fare by survival rate
Let’s look to see if there are any trends between the amount paid for fare versus survival rate. We can do this by creating a boxplot!
Prompt: create a boxplot that shows the fare paid by survival rate.
Based on the visualization and the weak correlation coefficient Julius reported, there appears to be only a weak relationship between fare and survival: passengers who paid higher fares may have had a slightly higher tendency to survive. More statistical analysis would be needed to confirm this.
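A sketch of that boxplot, along with a simple correlation between fare and the 0/1 survived flag as a rough numeric check (Julius's actual code and correlation measure may differ):

```python
import matplotlib
matplotlib.use('Agg')        # headless-safe backend for scripts
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('titanic')

# Fare distributions for non-survivors (0) vs. survivors (1)
ax = sns.boxplot(data=df, x='survived', y='fare')
ax.set(title='Fare by survival status', xlabel='Survived (0 = no, 1 = yes)')
plt.savefig('fare_by_survival.png')

# Pearson correlation between fare and the binary survived flag
corr = df['fare'].corr(df['survived'])
print(f'Correlation (fare vs. survived): {corr:.3f}')
```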
d. Creating a boxplot on the distribution of fare by embarkment point
I then wanted to see if the price varied between the three different embarkment points. So, let’s see!
Prompt: create a boxplot on the fare distribution by embarkment point, please.
We can see the distribution of fares for the three embarkment points. Cherbourg had a higher median fare than both Queenstown and Southampton. Queenstown shows the lowest median fare, while Southampton shows a larger spread of outliers and sits between the two: more expensive than Queenstown but less expensive than Cherbourg.
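That comparison can be sketched as follows, printing the median fare per town alongside the boxplot (again, Julius's actual code may differ):

```python
import matplotlib
matplotlib.use('Agg')        # headless-safe backend for scripts
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('titanic')

# Fare distributions for the three embarkment towns
ax = sns.boxplot(data=df, x='embark_town', y='fare')
ax.set(title='Fare distribution by embarkment point')
plt.savefig('fare_by_embark_town.png')

# Median fare per town as a numeric companion to the plot
medians = df.groupby('embark_town')['fare'].median()
print(medians)
```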
Step 5 (Optional): Creating a new variable for your dataset
I was curious about the average family size on the Titanic, so I asked Julius to create a new column named family_size for my dataset. This was done by summing the “sibsp” and “parch” columns and then adding 1.
Prompt: create a new column called “family_size”: this is done by summing “sibsp” and “parch” then adding 1.
Great, now I can ask Julius to calculate the average family size on the Titanic!
Prompt: can you calculate the average family size please?
Julius was able to calculate the average family size to be around 2 people.
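The derived column and its average can be sketched in a few lines of pandas (Julius's actual code may differ):

```python
import seaborn as sns

df = sns.load_dataset('titanic')

# family_size = siblings/spouses + parents/children + the passenger themselves
df['family_size'] = df['sibsp'] + df['parch'] + 1

avg = df['family_size'].mean()
print(f'Average family size: {avg:.2f}')
```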
Now that we have generated some visualizations, and some questions regarding them, we can continue on to the next step in our data analysis adventure: statistical analyses! Or, you can stop here and appreciate how Exploratory Data Analysis helps identify potential trends in your dataset. Regardless, I hope you found this guide insightful!
Happy data visualization!
Keywords: AI, AI and EDA, running EDA, exploratory data analysis, data science, descriptive statistics, visualizations, histogram, scatter plots, box plots, bar graph, summary statistics, data visualizations, statistical analyses, AI statistical analyses, data cleaning.