Introduction
Choosing the correct visualization is an important part of data presentation, yet many users struggle to identify the most effective one for their data. Each type of visualization serves a different purpose, and selecting the appropriate one requires an understanding of the data, the audience, and the message you wish to convey.
This article aims to make that choice easier. It highlights the different types of graphs and their typical use cases, and it provides the dataset used for each visualization along with the Python and R code used to create the graph. You can see the full article here.
Acknowledgements
Each dataset used in this document (unless otherwise stated) can be found on vincentarelbundock.github.io, a large repository of datasets that can be used in R. I would like to thank the people responsible for making this information open access. A link to the Google Sheet for each dataset is provided throughout the document.
How the Guide is Formatted
The guide is organized by general group (e.g., comparison charts, correlation charts), and each group is followed by the visualizations that fall under it. For example, bar/column charts are a type of comparison chart. Each chart gets a short introduction, followed by the figure. The R and Python code used to generate the graph is listed directly underneath the figure. For all visualizations, make sure that you upload the file when you start the chat, as some of the code does not reflect that initial step.
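Because some of the code assumes the data is already loaded, here is a minimal sketch of one way to load a dataset yourself when working outside a chat tool. It assumes the Google Sheet is shared publicly; the helper name `sheet_csv_url` is hypothetical, and the `export?format=csv` pattern is the same one used in several of the Python examples in this guide.

```python
import re


def sheet_csv_url(share_url: str, gid: int = 0) -> str:
    """Turn a Google Sheets sharing link into a CSV export link."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", share_url)
    if match is None:
        raise ValueError(f"Not a Google Sheets URL: {share_url!r}")
    # gid selects the tab; 0 is assumed to be the first tab of the sheet
    return (
        "https://docs.google.com/spreadsheets/d/"
        f"{match.group(1)}/export?format=csv&gid={gid}"
    )


share_url = "https://docs.google.com/spreadsheets/d/1DJk8RByndyRbQJfMHK3vl-0ai758VAHWmQZtDp4bdOI/edit?usp=sharing"
csv_url = sheet_csv_url(share_url)
print(csv_url)
```

The resulting URL can then be passed to `pd.read_csv(csv_url)`, provided the sheet allows public access.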
Comparison Charts
Comparison charts or graphs are used to compare quantities across different categories. Their primary purpose is to highlight the differences and similarities within a dataset, making it easier for viewers to draw conclusions about how groups vary.
1. Bar/Column Charts
Bar and column charts provide clear comparisons between discrete categories (e.g., car models) based on a quantitative measure (e.g., miles per gallon, MPG). They are widely used because they offer a quick and effective way to visualize differences among categorical variables.
The data used in this visualization can be accessed here. It details fuel consumption and 10 different aspects of motor vehicle design and performance.
R Example
#Assumes car_data is already loaded, e.g. car_data <- read.csv('spreadsheet.csv')
library(ggplot2)
#Create the bar chart, with car models ordered by MPG
ggplot(car_data, aes(x = reorder(car_name, -mpg), y = mpg)) +
  geom_bar(stat = 'identity', fill = 'skyblue') +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = 'Miles Per Gallon for Different Car Models',
       x = 'Car Model',
       y = 'Miles Per Gallon (MPG)') +
  coord_flip() # Flip coordinates to make it a horizontal bar chart
Python Example
#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Read the CSV file
df = pd.read_csv('spreadsheet.csv')
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='car_name', y='mpg', data=df)
plt.xticks(rotation=90)
plt.title('MPG by Car Model')
plt.xlabel('Car Model')
plt.ylabel('Miles Per Gallon (MPG)')
plt.tight_layout()
plt.savefig('mpg_bar_plot.png')
plt.close()
print("Bar plot saved as 'mpg_bar_plot.png'")
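The R example orders the bars by MPG, while the Python example plots them in file order. A minimal sketch of one way to build a sorted category order in plain Python follows. The four rows are illustrative values typed in by hand rather than read from the dataset, and the variable names are hypothetical.

```python
# Illustrative rows only; in practice these come from the loaded dataset
cars = [
    {"car_name": "Mazda RX4", "mpg": 21.0},
    {"car_name": "Toyota Corolla", "mpg": 33.9},
    {"car_name": "Cadillac Fleetwood", "mpg": 10.4},
    {"car_name": "Honda Civic", "mpg": 30.4},
]

# Sort from highest to lowest MPG and keep only the category labels
ranked = sorted(cars, key=lambda row: row["mpg"], reverse=True)
order = [row["car_name"] for row in ranked]
print(order)
```

With seaborn, a list like `order` can be passed to the plotting call as `sns.barplot(..., order=order)` so the bars appear in ranked order.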
2. Grouped/Clustered Bar Chart
Grouped or clustered bar charts are used to compare frequencies, counts, or other measures across multiple categories and groups.
For this visualization, we will be using a dataset from the College Scorecard, which contains by-college-by-year data on how students are doing after graduation, available here. We will create a grouped bar chart to compare the counts of working vs. not working for five institutions in the year 2007.
R Example
#Load necessary libraries
library(ggplot2)
library(reshape2)
#Create a data frame with the provided data
institutions_data <- data.frame(
inst_name = c(
'AI Miami International University of Art and Design',
'ASA College',
'ASM Beauty World Academy',
'AVTEC-Alaska\'s Institute of Technology',
'Aaron\'s Academy of Beauty'
),
count_working = c(530, 2137, 155, 100, 57),
count_not_working = c(138, 377, 53, 29, 16)
)
#Melt the data frame for plotting
df_plot <- melt(institutions_data, id.vars = 'inst_name',
variable.name = 'status', value.name = 'count')
#Create the grouped bar plot
p <- ggplot(df_plot, aes(x = inst_name, y = count, fill = status)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(title = 'Working vs Not Working Counts for Selected Institutions (2007)',
x = 'Institution', y = 'Count') +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(values = c('count_working' = 'skyblue', 'count_not_working' = 'lightcoral'),
labels = c('Working', 'Not Working'))
#Save the plot
ggsave('selected_institutions_working_vs_not_working_R.png', plot = p, width = 12, height = 8)
#Print confirmation
print("Grouped bar graph has been saved as 'selected_institutions_working_vs_not_working_R.png'.")
#Display summary statistics for the selected institutions
#(named summary_stats to avoid masking base R's summary() function)
summary_stats <- aggregate(cbind(count_working, count_not_working) ~ inst_name, data = institutions_data,
                           FUN = function(x) c(sum = sum(x), mean = mean(x)))
summary_stats <- do.call(data.frame, summary_stats)
colnames(summary_stats) <- c('inst_name', 'working_sum', 'working_mean', 'not_working_sum', 'not_working_mean')
print(summary_stats)
Python Example
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Create a DataFrame with the provided data
institutions_data = {
'inst_name': [
'AI Miami International University of Art and Design',
'ASA College',
'ASM Beauty World Academy',
'AVTEC-Alaska\'s Institute of Technology',
'Aaron\'s Academy of Beauty'
],
'count_working': [530, 2137, 155, 100, 57],
'count_not_working': [138, 377, 53, 29, 16]
}
# Convert to DataFrame
institutions_df = pd.DataFrame(institutions_data)
# Melt the DataFrame for plotting
df_plot = institutions_df.melt(id_vars='inst_name',
value_vars=['count_working', 'count_not_working'],
var_name='status', value_name='count')
# Set a color palette
color_palette = {'count_working': 'blue', 'count_not_working': 'orange'}
# Create the grouped bar plot
plt.figure(figsize=(12, 8))
ax = sns.barplot(x='inst_name', y='count', hue='status', data=df_plot, palette=color_palette)
# Customize the plot
plt.title('Working vs Not Working Counts for Selected Institutions', fontsize=16)
plt.xlabel('Institution', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
# Update the legend with correct colors
handles = [plt.Rectangle((0,0),1,1, color=color) for color in color_palette.values()]
plt.legend(handles, ['Working', 'Not Working'], title='Status')
# Adjust layout and save
plt.tight_layout()
plt.savefig('selected_institutions_working_vs_not_working_final_orange.png')
plt.show()
print("Final grouped bar graph with orange 'Not Working' bars has been saved as 'selected_institutions_working_vs_not_working_final_orange.png'.")
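Both examples reshape the table from wide to long format before plotting, so that each bar corresponds to one row. A rough sketch of what that reshaping does, in plain Python, using two of the institutions listed in the code (the row order differs from what `melt` returns, and the variable names are hypothetical):

```python
# Wide format: one row per institution, one column per status
institutions = [
    ("ASA College", 2137, 377),
    ("ASM Beauty World Academy", 155, 53),
]

# Long format: one row per (institution, status) pair
long_rows = []
for name, working, not_working in institutions:
    long_rows.append({"inst_name": name, "status": "count_working", "count": working})
    long_rows.append({"inst_name": name, "status": "count_not_working", "count": not_working})

for row in long_rows:
    print(row)
```

Each wide row becomes two long rows, which is why the `status` column can be mapped to bar color while `count` sets bar height.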
3. Range Plot
Range plots are useful for displaying variability, distributions, and confidence intervals within categories.
For this visualization, we will be using a dataset that contains daily temperatures (minimum and maximum) for Clemson, South Carolina from January 1st, 1930 to December 31st, 2020 (33,148 observations). The dataset can be accessed here.
R Example
#R CODE
library(dplyr)
library(lubridate)
library(ggplot2)
# Assumes df is already loaded with date, tmin, and tmax columns
# Convert Unix timestamp (seconds since 1970-01-01) to Date
df$date <- as.Date(as.POSIXct(df$date, origin = "1970-01-01"))
# Filter data for 2020 and calculate monthly averages
monthly_avg_2020 <- df %>%
filter(year(date) == 2020) %>%
group_by(month = month(date, label = TRUE)) %>%
summarize(
avg_tmin = mean(tmin, na.rm = TRUE),
avg_tmax = mean(tmax, na.rm = TRUE)
)
# Filter data for 1930 and calculate monthly averages
monthly_avg_1930 <- df %>%
filter(year(date) == 1930) %>%
group_by(month = month(date, label = TRUE)) %>%
summarize(
avg_tmin = mean(tmin, na.rm = TRUE),
avg_tmax = mean(tmax, na.rm = TRUE)
)
# Combine the data for plotting
combined_data <- bind_rows(
monthly_avg_1930 %>% mutate(year = 1930),
monthly_avg_2020 %>% mutate(year = 2020)
)
# Create an overlapping range plot
range_plot_combined <- ggplot(combined_data, aes(x = month, group = year, color = as.factor(year))) +
  # linewidth needs ggplot2 3.4.0 or later; older versions use size for line thickness
  geom_linerange(aes(ymin = avg_tmin, ymax = avg_tmax), linewidth = 1, position = position_dodge(width = 0.5)) +
  geom_point(aes(y = avg_tmin), size = 3, position = position_dodge(width = 0.5)) +
  geom_point(aes(y = avg_tmax), size = 3, position = position_dodge(width = 0.5)) +
  theme_minimal() +
  labs(title = "Average Monthly Temperature Range: 1930 vs 2020",
       x = "Month",
       y = "Temperature (\u00b0F)",
       color = "Year") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Display the plot
print(range_plot_combined)
# Print the summary tables
print("Monthly averages for 1930:")
print(monthly_avg_1930)
print("Monthly averages for 2020:")
print(monthly_avg_2020)
# Calculate overall averages for the year 2020
yearly_avg_2020 <- df %>%
filter(year(date) == 2020) %>%
summarize(
avg_tmin = mean(tmin, na.rm = TRUE),
avg_tmax = mean(tmax, na.rm = TRUE)
)
print("Yearly averages for 2020:")
print(yearly_avg_2020)
Python Example
#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
#Read the data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/1dxPKBBUR147Fin8FFKDdOUg6Pykpgh92jan4mhZR9pQ/export?format=csv&gid=0')
#Convert date to datetime format
#Try MM/DD/YYYY first, then fall back to YYYY-MM-DD
try:
    df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
except ValueError:
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
#Function to calculate monthly averages
def calculate_monthly_avg(year):
    #Keep only the rows for the requested year, then average by month name
    year_data = df[df['date'].dt.year == year]
    monthly_data = year_data.groupby(year_data['date'].dt.strftime('%b'))[['tmin', 'tmax']].mean()
    #Add a numeric month column so the rows can be sorted Jan through Dec
    monthly_data['month'] = pd.to_datetime(monthly_data.index, format='%b').month
    return monthly_data.sort_values('month')
#Calculate monthly averages for 1930 and 2020
monthly_avg_1930 = calculate_monthly_avg(1930)
monthly_avg_2020 = calculate_monthly_avg(2020)
#Set up the plot
plt.figure(figsize=(12, 6))
#Plot 1930 data
plt.vlines(x=monthly_avg_1930.index, ymin=monthly_avg_1930['tmin'], ymax=monthly_avg_1930['tmax'],
color='red', alpha=0.7, linewidth=2, label='1930')
plt.scatter(monthly_avg_1930.index, monthly_avg_1930['tmin'], color='red', alpha=0.7)
plt.scatter(monthly_avg_1930.index, monthly_avg_1930['tmax'], color='red', alpha=0.7)
#Plot 2020 data
plt.vlines(x=monthly_avg_2020.index, ymin=monthly_avg_2020['tmin'], ymax=monthly_avg_2020['tmax'],
color='blue', alpha=0.7, linewidth=2, label='2020')
plt.scatter(monthly_avg_2020.index, monthly_avg_2020['tmin'], color='blue', alpha=0.7)
plt.scatter(monthly_avg_2020.index, monthly_avg_2020['tmax'], color='blue', alpha=0.7)
#Customize the plot
plt.title('Average Monthly Temperature Range: 1930 vs 2020')
plt.xlabel('Month')
plt.ylabel('Temperature (\u00b0F)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
#Show the plot
plt.tight_layout()
plt.show()
#Print summary tables
print("Monthly Averages for 1930:")
print(monthly_avg_1930[['tmin', 'tmax']])
print("\nMonthly Averages for 2020:")
print(monthly_avg_2020[['tmin', 'tmax']])
#Calculate and print yearly averages
yearly_avg_1930 = df[df['date'].dt.year == 1930][['tmin', 'tmax']].mean()
yearly_avg_2020 = df[df['date'].dt.year == 2020][['tmin', 'tmax']].mean()
print("\nYearly Averages for 1930:")
print(yearly_avg_1930)
print("\nYearly Averages for 2020:")
print(yearly_avg_2020)
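Each line in the range plot spans the average daily minimum to the average daily maximum for one month. The sketch below shows that aggregation step in plain Python. The three readings are made up for illustration and are not values from the Clemson dataset, and the variable names are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up (date, tmin, tmax) readings for illustration only
readings = [
    (date(2020, 1, 1), 30.0, 50.0),
    (date(2020, 1, 2), 34.0, 54.0),
    (date(2020, 2, 1), 40.0, 60.0),
]

# Collect the daily minimums and maximums for each month
by_month = defaultdict(lambda: {"tmin": [], "tmax": []})
for day, tmin, tmax in readings:
    by_month[day.month]["tmin"].append(tmin)
    by_month[day.month]["tmax"].append(tmax)

# Each month maps to (average minimum, average maximum): the two ends of its range line
monthly_range = {
    month: (mean(vals["tmin"]), mean(vals["tmax"]))
    for month, vals in sorted(by_month.items())
}
print(monthly_range)
```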
4. Radar Chart
Radar charts are useful for displaying multivariate data in a way that is easy to compare across different variables.
For this example, we are going to use the fitness scores of five individuals. The dataset can be accessed here.
R Example
#RCODE
library(googlesheets4)
library(fmsb)
# Skip authentication (assumes the sheet is publicly readable), then load the Google Sheet
gs4_deauth()
sheet_url <- "https://docs.google.com/spreadsheets/d/1DJk8RByndyRbQJfMHK3vl-0ai758VAHWmQZtDp4bdOI/edit?usp=sharing"
df_radar <- read_sheet(sheet_url)
# Ensure all columns except 'Individual' are numeric
df_radar[, -1] <- lapply(df_radar[, -1], as.numeric)
# Convert to dataframe for radar chart (excluding 'Individual' column)
df_radar_plot <- as.data.frame(df_radar[, -1])
rownames(df_radar_plot) <- df_radar$Individual
# Add max and min rows
df_radar_plot <- rbind(rep(10,6) , rep(0,6) , df_radar_plot)
# Set up colors for each individual
colors <- c('#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd')
# Create radar chart
par(mar = c(1, 1, 3, 1)) # Adjust margins
radarchart(df_radar_plot,
axistype = 1,
pcol = colors,
pfcol = sapply(colors, function(x) paste0(x, '66')), # Add transparency
plwd = 2,
plty = 1,
cglcol = 'grey',
cglty = 1,
axislabcol = 'grey',
caxislabels = seq(0, 10, 2),
cglwd = 0.8,
vlcex = 0.8,
title = 'Fitness Profile Radar Chart')
# Add legend with names
legend('topright',
legend = rownames(df_radar_plot)[-c(1, 2)],
col = colors,
lty = 1,
lwd = 2,
bty = 'n',
cex = 0.8,
text.col = 'black')
print('Radar chart with names has been created and displayed.')
Python Example
#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#Load the data from the Google Sheet
#Use the CSV export link; the /edit sharing link returns a web page, not CSV
url = 'https://docs.google.com/spreadsheets/d/1DJk8RByndyRbQJfMHK3vl-0ai758VAHWmQZtDp4bdOI/export?format=csv'
df = pd.read_csv(url)
# Set up the data for the radar chart
categories = df.columns[1:].tolist()
num_categories = len(categories)
# Create a figure and polar axes
fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))
# Plot data for each individual
angles = np.linspace(0, 2*np.pi, num_categories, endpoint=False)
angles = np.concatenate((angles, [angles[0]])) # complete the circle
for index, row in df.iterrows():
    values = row[1:].tolist()
    values += values[:1] # complete the circle
    ax.plot(angles, values, 'o-', linewidth=2, label=row['Individual'])
    ax.fill(angles, values, alpha=0.25)
# Set the labels and title
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_title("Fitness Profile Radar Chart", fontsize=20)
# Add legend
plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
# Adjust the layout and display the plot
plt.tight_layout()
plt.show()
print("Radar chart has been created and displayed.")
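The Python radar chart closes each polygon by repeating the first angle and the first value at the end of the list. A small sketch of that step in isolation, assuming six axes (as in the R example's `rep(10, 6)`) and hypothetical scores; the function names are also hypothetical:

```python
import math


def radar_angles(num_axes):
    """Evenly spaced angles around the circle, with the first repeated at the end."""
    angles = [2 * math.pi * i / num_axes for i in range(num_axes)]
    return angles + angles[:1]


def close_loop(values):
    """Repeat the first value so the plotted line returns to its starting point."""
    values = list(values)
    return values + values[:1]


angles = radar_angles(6)
scores = close_loop([7, 5, 9, 6, 8, 4])  # hypothetical fitness scores

print(len(angles), len(scores))
print(scores)
```

Without the repeated first point, the outline would stop at the last axis and leave a gap between the last and first variables.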
5. Dot Plot
Dot plots display one or more quantitative values for each category, allowing for easy comparison across multiple values within and between categories.
For this visualization, we will use a dataset containing the stats of starter Pokémon from Generations I through VI (19 entries). This dataset can be accessed here.
R Example
#R CODE
#Load required libraries
library(tidyverse)
library(googlesheets4)
#Skip authentication (the sheet is read without signing in) and read the Google Sheet
gs4_deauth()
url <- "https://docs.google.com/spreadsheets/d/1Qdhyom4eCb_OGYLVSnPg7AfgZ-PAF6d3q-cgxR5s254/edit?usp=sharing"
df <- read_sheet(url)
#Reshape the data for plotting
df_long <- df %>%
select(Name, Defense, `Sp. Def`, Attack, `Sp. Atk`) %>%
pivot_longer(cols = c(Defense, `Sp. Def`, Attack, `Sp. Atk`),
names_to = "Stat", values_to = "Value")
#Create the plot
ggplot(df_long, aes(x = Value, y = Name, color = Stat)) +
geom_point(position = position_jitter(width = 0.5, height = 0), size = 3, alpha = 0.8) +
geom_line(aes(group = Name), color = "grey", linewidth = 0.5) + # linewidth needs ggplot2 3.4.0 or later
scale_x_continuous(limits = c(30, 80), breaks = seq(30, 80, by = 10)) +
scale_color_manual(values = c("Defense" = "#1f77b4", "Sp. Def" = "#2ca02c",
"Attack" = "#d62728", "Sp. Atk" = "#9467bd")) +
labs(title = "Comparison of Defense, Special Defense, Attack, and Special Attack",
x = "Stat Value",
y = "Pok\u00e9mon") +
theme_minimal() +
theme(legend.position = "right",
axis.text.y = element_text(angle = 0),
plot.title = element_text(size = 14, hjust = 0.5),
axis.title = element_text(size = 12),
panel.grid = element_blank())
#Display the plot
print(last_plot())
Python Example
#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#Load the data from the Google Sheet
url = 'https://docs.google.com/spreadsheets/d/1Qdhyom4eCb_OGYLVSnPg7AfgZ-PAF6d3q-cgxR5s254/export?format=csv'
df = pd.read_csv(url)
#Reshape the data for plotting
df_long = df.melt(id_vars=['Name'], value_vars=['Defense', 'Sp. Def', 'Attack', 'Sp. Atk'],
var_name='Stat', value_name='Value')
#Create the dot plot
plt.figure(figsize=(12, 8))
#Plot points with a slight horizontal jitter to show overlapping points
for stat, color in zip(['Defense', 'Sp. Def', 'Attack', 'Sp. Atk'], ['#1f77b4', '#2ca02c', '#d62728', '#9467bd']):
    stat_data = df_long[df_long['Stat'] == stat]
    plt.scatter(stat_data['Value'] + np.random.normal(0, 0.5, len(stat_data)), stat_data['Name'],
                label=stat, color=color, alpha=0.8, s=100)
#Add connecting lines
for name, group in df_long.groupby('Name'):
    plt.plot(group['Value'], [name]*len(group), color='grey', linewidth=0.5)
#Customize the plot
plt.xlim(30, 80)
plt.title('Comparison of Defense, Special Defense, Attack, and Special Attack', fontsize=14)
plt.xlabel('Stat Value', fontsize=12)
plt.ylabel('Pokémon', fontsize=12)
plt.legend(title='Stat', bbox_to_anchor=(1.05, 1), loc='upper left')
#Remove gridlines
plt.grid(False)
#Adjust layout and show the plot
plt.tight_layout()
plt.show()
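The jitter in the Python example comes from an unseeded random generator, so the dots land in slightly different places on every run. One possible way to make the jitter reproducible is to seed it. The sketch below uses only the standard library; the function name, seed, and stat values are hypothetical.

```python
import random


def jitter(values, spread=0.5, seed=42):
    """Add small Gaussian noise to each value, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    return [v + rng.gauss(0, spread) for v in values]


stats = [45, 49, 49, 65]  # hypothetical stat values
first = jitter(stats)
second = jitter(stats)

# Same seed, same offsets: the two runs match exactly
print(first == second)
```

In the NumPy version, a seeded generator such as `np.random.default_rng(42)` plays the same role.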
This post is part of a multi-series compilation. You can find the other posts below:
Visualization: Geospatial and Other Charts
Visualization: Data Over Time (Temporal)
Visualization: Distribution Charts
Visualization: Part-to-Whole Charts
Visualization: Correlation Charts
Happy graphing!