Visualization: Distribution Charts

Alysha · August 16, 2024, 3:25pm

Introduction

Choosing the correct visualization is an important aspect of data presentation. However, many users struggle to identify the most effective visualization for their data. Each type of visualization serves a different purpose, and selecting the appropriate one requires an understanding of the data, the audience, and the overall message you wish to convey.

This article aims to make the process of data visualization easier to understand. It will highlight the different types of graphs and their typical use cases. Additionally, it will provide you with the dataset used for each visualization, along with the Python and R code involved in creating the graph. You can see the full article here.

Acknowledgements
Each dataset used in this document (unless otherwise stated) can be found on vincentarelbundock.github.io, which is a large repository of datasets that can be used in R. I would like to thank the people responsible for making this information openly accessible. A link to the Google Sheet for each dataset is provided throughout the document.

How the Guide is Formatted

The guide is organized by general group (e.g., comparison charts, correlation charts), and each group is followed by the visualizations that fall under it. For example, bar and column charts are a type of comparison chart. Each chart gets a short introduction and a figure, with the R and Python code used to generate the figure listed directly underneath. For all visualizations, make sure that you upload the file when you start the chat, as some of the code does not show that initial step.

Distribution

Distribution charts are meant to show the spread of data across various categories or values. They help readers understand the frequency, range, and the overall shape of the data’s distribution.

1. Density plot

A density plot shows a smoothed estimate of the probability distribution of a continuous variable. Density plots are useful for visualizing the shape of a distribution, identifying modes, and comparing distributions between multiple groups.
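To make the idea of a smoothed estimate concrete, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy. The function name `gaussian_kde_1d`, the sample values, and the bandwidth of 0.4 are all made up for illustration; ggplot2 and seaborn pick their own bandwidths and use their own implementations, so treat this as a sketch of the general technique rather than what those libraries do internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Estimate a density on `grid` by averaging Gaussian bumps centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

# Hypothetical measurements, made up for illustration
values = [4.9, 5.0, 5.1, 5.4, 5.8, 6.3, 6.4, 6.9]
grid = np.linspace(2.0, 10.0, 801)
density = gaussian_kde_1d(values, grid, bandwidth=0.4)

# The area under a density curve should be close to 1
step = grid[1] - grid[0]
area = float(density.sum() * step)
print(f"Area under the curve: {area:.3f}")
```

A smaller bandwidth gives a bumpier curve that follows individual points, while a larger one gives a smoother curve that can hide separate modes.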

For this visualization, we will use the “iris” dataset (150 observations, 5 columns). This is a common dataset that contains the petal width, petal length, sepal width, and sepal length of flowers from three different iris species (setosa, versicolor, and virginica), along with the species label. The dataset can be accessed by simply asking Julius to retrieve it in Python or R, or it can be accessed here.

R Example

#R CODE
# Load required libraries
library(tidyverse)
library(gridExtra)

# Load the iris dataset
data(iris)

# Create the four density plots
plot_sepal_length <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_density(alpha = 0.7) +
  labs(title = "Density Plot of Sepal Length by Species",
       x = "Sepal Length (cm)", y = "Density") +
  theme_minimal()

plot_sepal_width <- ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
  geom_density(alpha = 0.7) +
  labs(title = "Density Plot of Sepal Width by Species",
       x = "Sepal Width (cm)", y = "Density") +
  theme_minimal()

plot_petal_length <- ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  geom_density(alpha = 0.7) +
  labs(title = "Density Plot of Petal Length by Species",
       x = "Petal Length (cm)", y = "Density") +
  theme_minimal()

plot_petal_width <- ggplot(iris, aes(x = Petal.Width, fill = Species)) +
  geom_density(alpha = 0.7) +
  labs(title = "Density Plot of Petal Width by Species",
       x = "Petal Width (cm)", y = "Density") +
  theme_minimal()

# Combine all plots into a single figure.
# grid.arrange() draws the figure immediately and returns the combined grob,
# so no separate print() call is needed to display it.
combined_plot <- grid.arrange(plot_sepal_length, plot_sepal_width,
                              plot_petal_length, plot_petal_width,
                              ncol = 2)

# Save the plot as a PNG file
ggsave("iris_density_plots.png", combined_plot, width = 12, height = 10, dpi = 300)

print("All four density plots created successfully in R and saved as 'iris_density_plots.png'.")

Python Example

#PYTHON CODE
# Import required libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Set the style for better-looking plots
sns.set_style("whitegrid")

# Load the iris dataset
iris = sns.load_dataset("iris")

# Create a figure with 4 subplots
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Density plot for Sepal Length
sns.kdeplot(data=iris, x="sepal_length", hue="species", multiple="stack", palette="viridis", common_norm=False, ax=axes[0, 0])
axes[0, 0].set_title("Density Plot of Sepal Length by Species")
axes[0, 0].set_xlabel("Sepal Length (cm)")
axes[0, 0].set_ylabel("Density")

# Density plot for Sepal Width
sns.kdeplot(data=iris, x="sepal_width", hue="species", multiple="stack", palette="viridis", common_norm=False, ax=axes[0, 1])
axes[0, 1].set_title("Density Plot of Sepal Width by Species")
axes[0, 1].set_xlabel("Sepal Width (cm)")
axes[0, 1].set_ylabel("Density")

# Density plot for Petal Length
sns.kdeplot(data=iris, x="petal_length", hue="species", multiple="stack", palette="viridis", common_norm=False, ax=axes[1, 0])
axes[1, 0].set_title("Density Plot of Petal Length by Species")
axes[1, 0].set_xlabel("Petal Length (cm)")
axes[1, 0].set_ylabel("Density")

# Density plot for Petal Width
sns.kdeplot(data=iris, x="petal_width", hue="species", multiple="stack", palette="viridis", common_norm=False, ax=axes[1, 1])
axes[1, 1].set_title("Density Plot of Petal Width by Species")
axes[1, 1].set_xlabel("Petal Width (cm)")
axes[1, 1].set_ylabel("Density")

# Adjust layout
plt.tight_layout()
plt.show()

print("All four density plots created successfully.")

2. Histogram

A histogram is used to display the distribution of a dataset by dividing it into intervals, or bins, and counting the data points that fall into each interval. The height of each bar corresponds to the frequency of data points falling into that specific interval.
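The binning step can be sketched on its own with `numpy.histogram`. The scores below are hypothetical values made up for illustration, and the bin width of 10 is an arbitrary choice; plotting libraries may place their bin edges differently.

```python
import numpy as np

# Hypothetical scores on a 0-100 scale, made up for illustration
scores = np.array([3, 15, 15, 22, 40, 41, 47, 50, 68, 70, 85, 85, 97, 100])

# Bin edges every 10 points: [0, 10), [10, 20), ..., [90, 100]
# (numpy treats the last bin as closed on both sides)
edges = np.arange(0, 101, 10)
counts, edges = np.histogram(scores, bins=edges)

# Print a rough text histogram: one '#' per data point in the bin
for left, right, count in zip(edges[:-1].tolist(), edges[1:].tolist(), counts.tolist()):
    print(f"{left:>3}-{right:<3} {'#' * count}")
```

Each row of the printout corresponds to one bar: the number of `#` characters is the bar height.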

For this visualization, we will use a dataset of thermometer ratings given to Mr. Trump and Mr. Obama (3,081 rows, 3 columns). The dataset can be found here.

R Example

#R CODE
# Load necessary libraries
library(googlesheets4)
library(dplyr)
library(ggplot2)
library(gridExtra)

# Deauthorize Google Sheets API
gs4_deauth()

# Read the data
df <- read_sheet('https://docs.google.com/spreadsheets/d/1TXjIkRtoZQuGV5faJ2xt5CxXxyj-lYfnP3zsSqVdH1I/edit?gid=0#gid=0')

# Remove NAs and 0's
df_cleaned <- df %>%
  filter(!is.na(fttrump1) & !is.na(ftobama1) & fttrump1 != 0 & ftobama1 != 0)

# Create histograms without normal distribution curve and with smaller title font
hist_trump <- ggplot(df_cleaned, aes(x = fttrump1)) +
  geom_histogram(binwidth = 10, fill = 'red', color = 'black', alpha = 0.7) +
  labs(title = 'Histogram of fttrump1 (Cleaned)', x = 'Score', y = 'Frequency') +
  theme_minimal() +
  theme(plot.title = element_text(size = 10))

hist_obama <- ggplot(df_cleaned, aes(x = ftobama1)) +
  geom_histogram(binwidth = 10, fill = 'blue', color = 'black', alpha = 0.7) +
  labs(title = 'Histogram of ftobama1 (Cleaned)', x = 'Score', y = 'Frequency') +
  theme_minimal() +
  theme(plot.title = element_text(size = 10))

# Arrange plots side by side
grid.arrange(hist_trump, hist_obama, ncol = 2)

# Print summary of cleaned data
print(summary(df_cleaned))

# Calculate and print the number of rows in the cleaned dataset
print(paste('Number of rows in cleaned dataset:', nrow(df_cleaned)))

Python Example

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data from the Excel file
file_path = 'therms.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet1')

# Clean the data by removing NAs and 0's
cleaned_df = df.dropna(subset=['fttrump1', 'ftobama1'])
cleaned_df = cleaned_df[(cleaned_df['fttrump1'] != 0) & (cleaned_df['ftobama1'] != 0)]

# Keep only the two rating columns (selected by name rather than by position)
selected_columns = cleaned_df[['fttrump1', 'ftobama1']]

# Create a figure with two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Histogram for fttrump1
sns.histplot(data=selected_columns, x='fttrump1', kde=True, ax=ax1)
ax1.set_title('Histogram of fttrump1')
ax1.set_xlabel('fttrump1 values')
ax1.set_ylabel('Frequency')

# Histogram for ftobama1
sns.histplot(data=selected_columns, x='ftobama1', kde=True, ax=ax2)
ax2.set_title('Histogram of ftobama1')
ax2.set_xlabel('ftobama1 values')
ax2.set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('histograms.png')
plt.show()

# Display basic statistics
print("Basic statistics for the selected columns:")
print(selected_columns.describe())

3. Jitter Plot

A jitter plot is similar to a scatter plot, but it adds a small amount of random displacement (referred to as ‘jittering’) to each point along one axis to prevent overlapping. This is useful when many data points within a category share the same or nearly the same value.
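As a rough sketch of what jittering does under the hood, the snippet below maps each category to a position on the x-axis and then nudges every point sideways by a small random amount. The group labels, weights, and jitter width of 0.2 are made up for illustration and are not taken from the plant dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the jitter is reproducible

# Hypothetical group labels and weights, made up for illustration
groups = np.array(["ctrl", "ctrl", "ctrl", "trt1", "trt1", "trt1"])
weights = np.array([4.2, 4.2, 5.1, 4.8, 4.8, 4.8])

# Map each category to an integer position on the x-axis
labels, positions = np.unique(groups, return_inverse=True)

# Shift each point sideways by a random amount in [-0.2, 0.2];
# the y values (weights) are left untouched
jitter_width = 0.2
x_jittered = positions + rng.uniform(-jitter_width, jitter_width, size=len(positions))

for label, x, w in zip(groups, x_jittered, weights):
    print(f"{label}: x = {x:.3f}, weight = {w}")
```

Passing `x_jittered` and `weights` to a scatter plot would then show the three identical `trt1` weights as three separate dots instead of one.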

For this visualization, we will use a dataset comparing dried plant weight yields (30 observations) under three different conditions (control, treatment 1, and treatment 2). The dataset can be accessed here.

R Example

# Load required libraries
library(readxl)
library(ggplot2)
library(dplyr)

# Read the Excel file
FILEPATH <- 'plant.xlsx'
df <- read_excel(FILEPATH)

# Create the jitterplot with different colors for each group
p <- ggplot(df, aes(x = group, y = weight, color = group)) +
  geom_jitter(width = 0.2, size = 4, alpha = 0.7) +
  theme_minimal() +
  labs(title = "Jitterplot of Weights by Treatment Group",
       x = "Treatment Group",
       y = "Weight") +
  scale_color_brewer(palette = "Set1") +  # High-contrast qualitative palette ("Dark2" or "Set2" are color-blind friendly alternatives)
  theme(legend.position = "none")  # Remove legend as it's redundant with x-axis

# Display the plot
print(p)

# Save the plot
ggsave("weights_jitterplot_r_colored.png", plot = p, width = 10, height = 6, dpi = 300)

Python Example

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset from the Excel file
FILEPATH = 'plant.xlsx'
# sheet_name=None reads every sheet into a dict of DataFrames keyed by sheet name
dataframes = pd.read_excel(FILEPATH, sheet_name=None)

# Assuming the data is in the first sheet
sheet_name = list(dataframes.keys())[0]
df = dataframes[sheet_name]

# Create the jitterplot
plt.figure(figsize=(10, 6))
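# Assign hue so the palette is applied per group; the legend is redundant with the x-axis
sns.stripplot(x='group', y='weight', hue='group', data=df, jitter=True, size=8, palette='viridis', legend=False)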
plt.title('Jitterplot of Weights by Group')
plt.xlabel('Group')
plt.ylabel('Weight')
plt.savefig('weights_jitterplot.png')
plt.close()

print("Jitterplot has been created and saved as 'weights_jitterplot.png'.")

4. Beeswarm chart

A beeswarm chart visualizes data points along a single axis, with each dot representing an individual data point. Unlike random jitter, it shifts points sideways only as far as needed to keep them from overlapping.
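One simple way to arrange points like this is a greedy placement: take the points in order of value and put each one as close to the centre line as it can go without touching a point that is already placed. The sketch below illustrates that idea only. The function names, the weights, and the point diameter are made up, it assumes both axes use the same units, and it is not the algorithm that ggbeeswarm or seaborn actually use.

```python
def is_free(x, y, neighbours, diameter):
    """True if a point at (x, y) sits at least one diameter from every neighbour."""
    return all((x - nx) ** 2 + (y - ny) ** 2 >= diameter ** 2 - 1e-9
               for nx, ny in neighbours)

def beeswarm_offsets(values, diameter):
    """Return one sideways offset per value so that no two points overlap."""
    offsets = [0.0] * len(values)
    placed = []  # (offset, value) pairs for points already positioned
    # Place points in ascending order of value
    for i in sorted(range(len(values)), key=lambda k: values[k]):
        y = values[i]
        # Only points less than one diameter away vertically can collide
        neighbours = [(nx, ny) for nx, ny in placed if abs(ny - y) < diameter]
        # Candidates: the centre line, or just touching either side of a neighbour
        candidates = [0.0]
        for nx, ny in neighbours:
            dx = (diameter ** 2 - (ny - y) ** 2) ** 0.5
            candidates += [nx - dx, nx + dx]
        # Keep the free candidate closest to the centre line
        offsets[i] = min((c for c in candidates if is_free(c, y, neighbours, diameter)),
                         key=abs)
        placed.append((offsets[i], y))
    return offsets

# Hypothetical weights, made up for illustration
weights = [5.0, 5.0, 5.0, 5.1, 6.0]
offsets = beeswarm_offsets(weights, diameter=0.2)
for w, off in zip(weights, offsets):
    print(f"weight = {w}, offset = {off:+.3f}")
```

The isolated value of 6.0 stays on the centre line, while the cluster around 5.0 fans out to either side, which is the characteristic swarm shape.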

We will use the same plant growth dataset from the jitter plot visualization to illustrate how the data points appear in comparison to the jitter plot. The dataset can be accessed here.

R Example

# Load necessary libraries
library(ggplot2)
library(ggbeeswarm)
library(readxl)

# Load the data from the Excel file
df <- as.data.frame(read_excel('bee.xlsx', sheet = 'Sheet1'))

# Create the beeswarm plot
p <- ggplot(df, aes(x = group, y = weight)) +
  geom_beeswarm(aes(color = group), size = 3, alpha = 0.8) +
  geom_boxplot(alpha = 0.2, width = 0.5, outlier.shape = NA) +
  theme_minimal() +
  labs(title = "Beeswarm Plot of Plant Weight by Group",
       x = "Group",
       y = "Weight") +
  theme(legend.position = "none")  # Remove legend as color is redundant with x-axis

# Display the plot
print(p)

# Save the plot as a PNG file
ggsave("plant_growth_beeswarm_plot.png", plot = p, width = 10, height = 6, dpi = 300)

# Confirm the file was saved
file.exists("plant_growth_beeswarm_plot.png")

Python Example

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data from the Excel file
sheet_df = pd.read_excel('bee.xlsx', sheet_name='Sheet1')

# Create the beeswarm plot using seaborn
plt.figure(figsize=(10, 6))
sns.swarmplot(x='group', y='weight', data=sheet_df, size=8, alpha=0.8)
sns.boxplot(x='group', y='weight', data=sheet_df, whis=1.5, width=0.3, showcaps=False, boxprops={'facecolor':'None'}, showfliers=False)
plt.title('Beeswarm Plot of Plant Weight by Group')
plt.xlabel('Group')
plt.ylabel('Weight')
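# Save the plot as a PNG file. This must happen before plt.show(),
# because the figure is typically cleared once it has been shown.
plt.savefig('plant_growth_beeswarm_plot_python.png', dpi=300)

# Display the plot
plt.show()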

# Confirm the file was saved
print('Beeswarm plot saved as plant_growth_beeswarm_plot_python.png')

5. Box Plot (Box-and-whisker plot)

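A box plot, or box-and-whisker plot, is a standardized method for displaying the distribution of a dataset. It is built on the five-number summary: the minimum value, the first quartile (Q1), the median, the third quartile (Q3), and the maximum value. By default, ggplot2 and seaborn draw the whiskers only as far as the most extreme points within 1.5 times the interquartile range of the box and plot anything beyond that as individual outlier points. This allows the reader to examine the spread of the data and its central tendency and to identify potential outliers, making it a great tool for exploratory data analysis.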
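The numbers behind a box plot can be computed directly. The sketch below uses hypothetical test scores (made up for illustration) and the common 1.5 × IQR rule for flagging outliers; plotting libraries may compute quartiles with slightly different interpolation, so their boxes can differ a little from these values.

```python
import numpy as np

# Hypothetical test scores, made up for illustration
scores = np.array([31, 38, 40, 42, 44, 45, 47, 49, 52, 80])

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

# Whiskers extend to the most extreme points that are still inside the fences
inside = scores[(scores >= lower_fence) & (scores <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}; outliers: {outliers.tolist()}")
```

Here the score of 80 lies above the upper fence, so it would be drawn as a separate point rather than stretching the upper whisker.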

For this visualization, we will use a dataset from Baumann & Jones, as reported by Moore & McCabe (1993). The dataset examines whether three different teaching methods – traditional (Basal), innovative 1 (DRTA), and innovative 2 (Strat) – affected reading comprehension in students. The data frame has 66 rows and 6 columns: group, pretest.1, pretest.2, post.test.1, post.test.2, post.test.3. The dataset can be accessed here.

R Example

# Load necessary libraries
library(ggplot2)
library(dplyr)
library(readxl)

# Read the Excel file
df <- read_excel('teacher.xlsx', sheet = 'Sheet1')

# Calculate average post-test score
df <- df %>%
  mutate(avg_posttest = (post.test.1 + post.test.2 + post.test.3) / 3)

# Create a box plot for post-test scores
post_scores_plot <- ggplot(df, aes(x = group, y = avg_posttest, fill = group)) +
  geom_boxplot() +
  labs(title = "Distribution of Average Post-test Scores by Group",
       x = "Group", y = "Average Post-test Score") +
  theme_minimal()

# Save the post-test scores plot
ggsave('post_scores_distribution.png', plot = post_scores_plot, width = 10, height = 6)

# Display the post-test scores plot
print(post_scores_plot)

# Calculate average pretest and post-test scores
df <- df %>%
  mutate(avg_pretest = (pretest.1 + pretest.2) / 2,
         avg_posttest = (post.test.1 + post.test.2 + post.test.3) / 3)

# Group the data and calculate mean scores for each group
grouped_data <- df %>%
  group_by(group) %>%
  summarise(avg_pretest = mean(avg_pretest),
            avg_posttest = mean(avg_posttest))

# Print the average scores for each group
print("Average scores for each group:")
print(grouped_data)

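# Create a combined plot for average pretest and post-test scores by group.
# Reshape to long format first: a single geom_boxplot() layer can then dodge the
# two test types side by side (two separate layers would draw on top of each other).
scores_long <- df %>%
  tidyr::pivot_longer(cols = c(avg_pretest, avg_posttest),
                      names_to = "test_type", values_to = "avg_score") %>%
  mutate(test_type = recode(test_type, avg_pretest = "Pre-test", avg_posttest = "Post-test"))

combined_scores_plot <- ggplot(scores_long, aes(x = group, y = avg_score, fill = test_type)) +
  geom_boxplot(position = position_dodge(width = 0.8)) +
  labs(title = "Distribution of Average Pre-test and Post-test Scores by Group",
       x = "Group", y = "Average Score") +
  scale_fill_manual(name = "Test Type", values = c('Pre-test' = 'lightblue', 'Post-test' = 'lightgreen')) +
  theme_minimal()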

# Save the combined plot
ggsave('combined_scores_distribution.png', plot = combined_scores_plot, width = 10, height = 6)

# Display the combined plot
print(combined_scores_plot)

Python Example

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

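# Load every sheet of the workbook into a dict of DataFrames keyed by sheet name
# (same 'teacher.xlsx' file as in the R example)
dataframes = pd.read_excel('teacher.xlsx', sheet_name=None)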

# Calculate average pretest and post-test scores
dataframes['Sheet1']['avg_pretest'] = (dataframes['Sheet1']['pretest.1'] + dataframes['Sheet1']['pretest.2']) / 2
dataframes['Sheet1']['avg_posttest'] = (dataframes['Sheet1']['post.test.1'] + dataframes['Sheet1']['post.test.2'] + dataframes['Sheet1']['post.test.3']) / 3

# Create a box plot to show the distribution of average pretest and post-test scores for each group
plt.figure(figsize=(12, 6))
sns.boxplot(x='group', y='value', hue='variable', 
            data=pd.melt(dataframes['Sheet1'], id_vars=['group'], value_vars=['avg_pretest', 'avg_posttest']))
plt.title('Distribution of Average Pretest and Posttest Scores by Group')
plt.ylabel('Score')
plt.show()

# Create a box plot for the average post-test scores by group
plt.figure(figsize=(10, 6))
sns.boxplot(x='group', y='avg_posttest', data=dataframes['Sheet1'])
plt.title('Distribution of Average Posttest Scores by Group')
plt.ylabel('Average Posttest Score')
plt.show()

This post is part of a multi-series compilation. You can find the other posts below:

Visualization: Part-to-Whole Charts

Visualization: Correlation Charts

Visualization: Comparison Charts

Visualization: Geospatial and Other Charts

Visualization: Data Over Time (Temporal)

Happy graphing!
