Visualizations: Part-to-Whole Charts

Introduction

Choosing the correct visualization is an important aspect of data presentation, yet many users struggle to identify the most effective visualization for their data. Each type of visualization serves a different purpose, and selecting the appropriate one requires an understanding of the data, the audience, and the overall message you wish to convey.

This article aims to make the process of data visualization easier to understand. It will highlight the different types of graphs and their typical use cases. Additionally, it will provide you with the dataset used for each visualization, along with the Python and R code involved in creating the graph. You can see the full article here.

Acknowledgements
Each dataset used in this document (unless otherwise stated) can be found on vincentarelbundock.github.io, a large repository of datasets that can be used in R. I would like to thank the people responsible for making this information openly accessible. Links to the Google Sheets will be provided throughout the document.

How the Guide is Formatted

The guide lists a general group (e.g., comparison charts, correlation charts), followed by the visualizations that fall under that group; bar/column charts, for example, are a type of comparison chart. Each chart gets a short introduction, then the visualization itself, with the R and Python code used to generate the graph displayed directly underneath the figure. For all visualizations, make sure that you upload the data file when you start the chat, as some of the code does not reflect that initial step.

Part-to-Whole & Hierarchical

Part-to-Whole visualizations illustrate how individual portions contribute to the whole, showing the proportions of each part in relation to the total.

Hierarchical graphs represent data in a tree-like structure, displaying relationships between different levels of data.

1. Stacked Bar Graphs

Stacked bar graphs show the composition of different categories within a dataset. Each bar represents the total amount, with segments within the bar representing the categories and their relative contributions to the total.
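The stacking mechanics can be sketched on a toy table before the full example (the counts and the question names Q1/Q2 below are hypothetical): normalize each column to percentages so every bar totals 100, then let pandas stack the segments.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical counts: rows = answer category, columns = survey question
counts = pd.DataFrame({"Q1": [50, 30, 20], "Q2": [10, 60, 30]},
                      index=["Increased", "Decreased", "No change"])

# Normalize each column so every bar totals 100%
pct = counts.div(counts.sum(axis=0), axis=1) * 100

# With stacked=True, pandas draws one bar per row of the transposed
# frame and stacks the answer segments within it
ax = pct.T.plot(kind="bar", stacked=True)
ax.set_ylabel("Percentage")
plt.close("all")
```

Because each column is normalized, the bars are directly comparable even when the questions have different response counts.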

For this example, we will use data from a 2020 Financial Independence (FI) Survey conducted on Reddit, which examined people’s finances and the changes they experienced during the pandemic. The full dataset, which contains 1,998 rows and 65 variables, can be accessed here. The visualization focuses on the columns pan_inc_chg (pandemic income change), pan_exp_chg (pandemic expense change), and pan_fi_chg (pandemic financial independence change), as they contain multiple categories relevant to the analysis.

R Example

#R CODE
# Install and load required packages
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")

library(tidyverse)  # includes readr (read_csv), dplyr, tidyr, and ggplot2

# Read the CSV file
df <- read_csv("data.csv")

# Process the data
focused_columns <- c('pan_inc_chg', 'pan_exp_chg', 'pan_fi_chg')

df <- df %>%
  mutate(across(all_of(focused_columns), 
                ~case_when(
                  . %in% c("Increase", "Increased") ~ "Increased",
                  . %in% c("Decrease", "Decreased") ~ "Decreased",
                  . %in% c("Stayed the same", "No Change", "No change") ~ "No change",
                  is.na(.) ~ "Unknown",
                  TRUE ~ "Unknown"
                )))

# Calculate percentages
df_long <- df %>%
  select(all_of(focused_columns)) %>%
  pivot_longer(cols = everything(), names_to = "Category", values_to = "Change") %>%
  count(Category, Change) %>%
  group_by(Category) %>%
  mutate(Percentage = n / sum(n) * 100)

# Create the stacked bar chart
color_map <- c("Increased" = "#E69F00", "Decreased" = "#56B4E9", "No change" = "#009E73", "Unknown" = "#999999")

p <- ggplot(df_long, aes(x = Category, y = Percentage, fill = Change)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = color_map) +
  labs(title = "Distribution of Changes Across Three Categories",
       x = "Categories",
       y = "Percentage",
       fill = "Change Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right") +
  scale_x_discrete(labels = c("pan_inc_chg" = "Income Rate Change",
                              "pan_exp_chg" = "Expenses Change",
                              "pan_fi_chg" = "Financial Independence Change")) +
  geom_text(aes(label = sprintf("%.1f%%", Percentage)), 
            position = position_stack(vjust = 0.5), 
            size = 3)

# Display the plot
print(p)

# Print the normalized data
df_summary <- df_long %>%
  pivot_wider(names_from = Change, values_from = Percentage, values_fill = 0) %>%
  select(Category, Decreased, Increased, `No change`, Unknown)

print("Normalized counts (percentages):")
print(df_summary)

# Print unique values in each column
print("Unique values in each column:")
for (col in focused_columns) {
  print(paste(col, ":"))
  print(unique(df[[col]]))
}

# Save the plot as a PNG file
ggsave("financial_changes_distribution.png", plot = p, width = 12, height = 8, dpi = 300)

# Print the file name for reference
print("Plot saved as: financial_changes_distribution.png")

Python Example

#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt

# Load the data (assuming the data is already in a DataFrame named 'df')
# If it's not, uncomment the following line and replace the URL with the correct one
# df = pd.read_csv("https://docs.google.com/spreadsheets/d/1EEXxtp4swk6DDb3Yo22DeRUgR-y-6o9cI_D3TpkQWxE/export?format=csv")

# Extract the relevant columns
focused_columns = ['pan_inc_chg', 'pan_exp_chg', 'pan_fi_chg']

# Clean and standardize the data: unify label variants and mark
# empty or missing responses as 'Missing'
for column in focused_columns:
    df[column] = df[column].replace({'No Change': 'No change', '': 'Missing'}).fillna('Missing')

# Analyze the updated data (Series.value_counts per column;
# the top-level pd.value_counts is deprecated)
updated_counts = df[focused_columns].apply(lambda col: col.value_counts()).fillna(0)

# Normalize the data to percentages
normalized_counts = updated_counts.apply(lambda x: x / x.sum() * 100)

# Create the stacked bar chart with color-blind friendly colors
color_map = {'Increased': '#E69F00', 'Decreased': '#56B4E9', 'No change': '#009E73', 'Missing': '#999999'}
ax = normalized_counts.transpose().plot(kind='bar', stacked=True, figsize=(12, 8), color=[color_map.get(x, '#999999') for x in normalized_counts.index])

# Customize the chart
ax.set_xticklabels(['Income Change', 'Expenses Change', 'Financial Independence Change'])
plt.title('Distribution of Changes Across Three Categories')
plt.xlabel('Categories')
plt.ylabel('Percentage')
plt.legend(title='Change Type', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)

# Add percentage labels on the bars
for c in ax.containers:
    ax.bar_label(c, fmt='%.1f%%', label_type='center')

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

# Print the normalized data
print("Normalized counts (percentages):")
print(normalized_counts.transpose().round(2))

2. Dendrogram

Dendrograms are tree-like diagrams that illustrate the arrangement of clusters formed by a specific hierarchical structure. They are commonly used in fields such as biology, bioinformatics, and machine learning to visualize the relationships between data points.
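The core mechanics can be sketched on toy data before the full example: SciPy's `linkage` builds the merge tree that `dendrogram` later draws. The four two-feature observations below are hypothetical.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical data: 4 observations with 2 features each;
# points 0/1 and points 2/3 form two obvious clusters
X = np.array([[1.0, 1.0], [1.1, 1.0], [5.0, 5.0], [5.1, 5.2]])

# Ward linkage merges the two closest clusters at each step; the
# resulting (n-1) x 4 matrix encodes the tree a dendrogram draws
Z = hierarchy.linkage(X, method="ward")
print(Z.shape)  # (3, 4): three merges for four observations
```

Passing `Z` to `hierarchy.dendrogram(Z)` renders the tree; the merge heights in the third column of `Z` become the vertical distances in the plot.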

For this visualization, we will use a dataset called ‘cerebellum_gene_expression2’, which can be accessed here.

R Example

#R CODE
# Load the libraries that provide ggdendrogram() and the ggplot2 theming
library(ggdendro)
library(ggplot2)

# Select the first 20 genes (excluding 'rownames' and 'y')
gene_subset <- df[, 2:21]

# Perform hierarchical clustering
hc <- hclust(dist(t(gene_subset)), method = "ward.D2")

# Create a dendrogram plot
dendro_plot <- ggdendrogram(hc, rotate = TRUE, size = 2) +
  theme_minimal() +
  labs(title = "Dendrogram of First 20 Genes",
       x = "Distance",
       y = "Genes") +
  theme(axis.text.y = element_text(size = 8))

# Display the plot
print(dendro_plot)

# Print the names of the first 20 genes
cat("First 20 genes:\n")
print(colnames(gene_subset))

# Calculate and print correlation matrix
correlation_matrix <- cor(gene_subset)
print("Correlation matrix (first 5x5):")
print(correlation_matrix[1:5, 1:5])

Python Example

#PYTHON CODE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster import hierarchy

# Select the first 20 genes (excluding 'rownames' and 'y')
gene_subset = df.iloc[:, 1:21]

# Perform hierarchical clustering
linkage = hierarchy.linkage(gene_subset.T, method='ward')

# Create a dendrogram
plt.figure(figsize=(12, 8))
dendrogram = hierarchy.dendrogram(linkage, labels=gene_subset.columns, leaf_rotation=90, leaf_font_size=8)
plt.title('Dendrogram of First 20 Genes')
plt.xlabel('Genes')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()

# Print the names of the first 20 genes
print("First 20 genes:")
print(gene_subset.columns.tolist())

# Calculate and print correlation matrix
correlation_matrix = gene_subset.corr()
print("\nCorrelation matrix:")
print(correlation_matrix)

3. Pie Chart

A pie chart is a circular statistical graph divided into slices to illustrate the relative proportions of different categories within a dataset. Each slice represents a category, and the size of the slice corresponds to the proportion of that category in relation to the whole.
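A minimal sketch of the proportion-to-slice mapping, using hypothetical survey answers: `value_counts()` tallies each category, and `ax.pie` scales each slice to its share of the total automatically.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Hypothetical survey answers for illustration
answers = pd.Series(["Yes", "Yes", "No", "Yes", "Don't know"])

# Count per category; ax.pie converts counts to angular proportions
counts = answers.value_counts()

fig, ax = plt.subplots()
wedges, texts, autotexts = ax.pie(counts, labels=counts.index,
                                  autopct="%1.1f%%", startangle=90)
plt.close(fig)
```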

For this visualization, we will use a dataset from a 2010 poll on whether airports should use full-body scanners. The dataset can be accessed here.

R Example

#R CODE
# Load necessary libraries
library(googlesheets4)
library(dplyr)
library(ggplot2)

# Read the Google Sheet (assuming it's already authenticated)
sheet_url <- "https://docs.google.com/spreadsheets/d/158HYQ_oNbyvbYQzfLN5H1LInOH-FDormHlVIQYLpY60/edit?usp=sharing"
df <- read_sheet(sheet_url)

# Categorize by party affiliation and count responses
party_responses <- df %>%
  group_by(party_affiliation, answer) %>%
  summarise(count = n(), .groups = 'drop') %>%
  arrange(party_affiliation, desc(count))

# Display the categorized data
print(party_responses)

# Calculate percentages for each party
party_percentages <- party_responses %>%
  group_by(party_affiliation) %>%
  mutate(percentage = count / sum(count) * 100) %>%
  ungroup()

# Create a pie chart for each party
for (party in unique(party_percentages$party_affiliation)) {
  party_data <- party_percentages %>% filter(party_affiliation == party)
  
  pie_chart <- ggplot(party_data, aes(x = "", y = percentage, fill = answer)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar("y", start = 0) +
    labs(title = paste("Responses for", party),
         fill = "Answer",
         x = NULL,
         y = NULL) +
    theme_minimal() +
    theme(axis.text = element_blank(),
          axis.ticks = element_blank(),
          panel.grid = element_blank()) +
    geom_text(aes(label = paste0(round(percentage, 1), "%")), 
              position = position_stack(vjust = 0.5))
  
  print(pie_chart)
}

# Create an overall pie chart
overall_percentages <- df %>%
  group_by(answer) %>%
  summarise(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)

overall_pie_chart <- ggplot(overall_percentages, aes(x = "", y = percentage, fill = answer)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Overall Responses",
       fill = "Answer",
       x = NULL,
       y = NULL) +
  theme_minimal() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5))

print(overall_pie_chart)

Python Example

#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt

# URL of the Google Sheets document
url = 'https://docs.google.com/spreadsheets/d/158HYQ_oNbyvbYQzfLN5H1LInOH-FDormHlVIQYLpY60/edit?usp=sharing'

# Convert to CSV export URL
csv_export_url = url.replace('/edit?usp=sharing', '/export?format=csv')

# Read the CSV data into a pandas DataFrame
df = pd.read_csv(csv_export_url)

# Function to create a pie chart
def create_pie_chart(ax, data, title):
    ax.pie(data.values, labels=data.index, autopct='%1.1f%%', startangle=90)
    ax.set_title(title)

# Create a 2x2 subplot for the pie charts
fig, axs = plt.subplots(2, 2, figsize=(20, 20))
fig.suptitle('Comparison of Answers by Party Affiliation', fontsize=16)

# Separate data by party affiliation and create pie charts
# (this layout assumes three parties, leaving the fourth panel free for the overall chart)
parties = df['party_affiliation'].unique()
for i, party in enumerate(parties):
    party_data = df[df['party_affiliation'] == party]['answer'].value_counts()
    create_pie_chart(axs[i//2, i%2], party_data, f'{party} Answers')

# Create overall answers pie chart
overall_data = df['answer'].value_counts()
create_pie_chart(axs[1, 1], overall_data, 'Overall Answers')

plt.tight_layout()
plt.savefig('comparison_pie_charts.png')
plt.close()

print('Comparison pie chart has been created and saved as comparison_pie_charts.png')

# Display summary of the data
summary = df.groupby(['party_affiliation', 'answer']).size().unstack(fill_value=0)
summary_percentages = summary.div(summary.sum(axis=1), axis=0) * 100
print('\nSummary of answers by party affiliation (percentages):')
print(summary_percentages)

4. Donut Chart

Donut charts are similar to pie charts, but they have a hole in the center of the circle, giving them their name.
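As a minimal sketch of how the hole is made (the labels and shares below are hypothetical): setting a wedge `width` below 1 hollows out a matplotlib pie into a donut; drawing a white circle over the centre, as the full example below does, achieves the same effect.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Hypothetical shares for illustration
labels = ["A", "B", "C"]
sizes = [50, 30, 20]

fig, ax = plt.subplots()
# wedgeprops width < 1 draws ring-shaped wedges instead of full slices
wedges, texts, autotexts = ax.pie(sizes, labels=labels, autopct="%1.1f%%",
                                  startangle=90, wedgeprops=dict(width=0.4))
plt.close(fig)
```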

For this visualization, we will use a dataset detailing the chemical composition (Aluminum, Iron, Magnesium, Calcium, and Sodium) found at four different archaeological sites in Great Britain (26 entries). The dataset can be accessed here.

R Example

#R CODE
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)

# Transform the data to long format for plotting
long_df <- df %>%
  pivot_longer(cols = c(Al, Fe, Mg, Ca, Na), names_to = "Element", values_to = "Value")

# Create a donut chart for each site
plot_list <- list()
sites <- unique(long_df$Site)

for (site in sites) {
  site_data <- long_df %>% 
    filter(Site == site) %>%
    group_by(Element) %>%
    summarize(Value = mean(Value)) %>%
    mutate(Percentage = Value / sum(Value) * 100)

  # Sort the data by percentage in descending order
  site_data <- site_data %>% arrange(desc(Percentage))
  
  # Create labels with element names and percentages
  site_data$label <- paste0(site_data$Element, "\n", round(site_data$Percentage, 1), "%")

  plot <- ggplot(site_data, aes(x = 2, y = Percentage, fill = Element)) +
    geom_bar(stat = "identity", width = 1) +
    geom_text(aes(label = label), position = position_stack(vjust = 0.5), size = 3) +
    coord_polar(theta = "y") +
    xlim(0.5, 2.5) +
    theme_void() +
    theme(legend.position = "none") +
    ggtitle(paste("Chemical Composition of Pottery from", site))
  
  plot_list[[site]] <- plot
}

# Display the plots
for (plot in plot_list) {
  print(plot)
}

Python Example

#PYTHON CODE 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Read the CSV file
df = pd.read_csv('Pottery.csv')

# Calculate the average chemical composition for each site
avg_composition = df.groupby('Site')[['Al', 'Fe', 'Mg', 'Ca', 'Na']].mean()

print("Average chemical composition for each site:")
print(avg_composition)

# Create a donut chart for each site
fig, axes = plt.subplots(2, 2, figsize=(20, 20))
axes = axes.flatten()

colors = ['#FF9999', '#66B2FF', '#99FF99', '#FFCC99', '#FF99CC']

for i, (site, data) in enumerate(avg_composition.iterrows()):
    ax = axes[i]
    wedges, texts, autotexts = ax.pie(data, autopct=lambda pct: f'{pct:.1f}%' if pct > 1 else '', 
                                      startangle=90, colors=colors, pctdistance=0.85)
    
    # Create the center circle for donut chart
    centre_circle = plt.Circle((0, 0), 0.70, fc='white')
    ax.add_artist(centre_circle)
    
    ax.set_title(f"Chemical Composition of {site}", fontsize=16)

    # Add small indications for small percentages
    for j, (wedge, autotext) in enumerate(zip(wedges, autotexts)):
        ang = (wedge.theta2 + wedge.theta1) / 2
        y = np.sin(np.deg2rad(ang))
        x = np.cos(np.deg2rad(ang))
        
        if data.iloc[j] / data.sum() < 0.01:  # If percentage is less than 1%
            horizontalalignment = {-1: "right", 1: "left"}[int(np.sign(x))]
            connectionstyle = f"angle,angleA=0,angleB={ang}"
            ax.annotate(f'{data.iloc[j]/data.sum()*100:.1f}% ({avg_composition.columns[j]})', 
                        xy=(x, y), xytext=(1.1*np.sign(x), 1.1*y),
                        horizontalalignment=horizontalalignment,
                        verticalalignment='center',
                        fontsize=8,
                        arrowprops=dict(arrowstyle="-", connectionstyle=connectionstyle, color='gray', lw=0.5))

# Add a legend
fig.legend(wedges, avg_composition.columns, title="Elements", loc="center right", bbox_to_anchor=(1.1, 0.5), fontsize=12)

plt.tight_layout()
plt.savefig('pottery_composition_donut_charts_with_small_indications.png', dpi=300, bbox_inches='tight')
print("Updated donut charts have been saved as 'pottery_composition_donut_charts_with_small_indications.png'")

# Display the image (when running in a Jupyter notebook)
from IPython.display import Image, display
display(Image(filename='pottery_composition_donut_charts_with_small_indications.png'))

5. Population Pyramid

Also known as age-sex pyramids, population pyramids are visualizations that display the gender distribution of a population.
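The pyramid effect rests on one trick, sketched here on hypothetical yearly counts: negate one series so its bars extend left, then relabel the ticks as absolute values so both sides read as positive counts.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Hypothetical yearly counts for illustration
df_toy = pd.DataFrame({"Year": [1, 2, 3],
                       "Males": [520, 480, 440],
                       "Females": [470, 450, 410]})

fig, ax = plt.subplots()
# Negating one series makes its bars extend left of zero
ax.barh(df_toy["Year"], df_toy["Males"], color="lightblue", label="Males")
ax.barh(df_toy["Year"], -df_toy["Females"], color="pink", label="Females")
ax.set_xlabel("Population")
ax.legend()

# Relabel ticks as absolute values so both sides read as counts
ticks = ax.get_xticks()
ax.set_xticks(ticks)
ax.set_xticklabels([f"{abs(int(t))}" for t in ticks])
plt.close(fig)
```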

For this visualization, we will use a dataset containing male and female births in London from 1629 to 1710. For simplicity, we will only plot data for the first 20 years. The dataset can be accessed here.

R Example

#R CODE
library(ggplot2)
library(dplyr)
library(tidyr)

# Read the CSV file
df <- read.csv("data.csv")

# Filter the data to include only the first 20 years
first_20_years_df <- head(df, 20)

# Reshape the data for easier plotting with a legend
first_20_years_long <- first_20_years_df %>%
  select(Year, Males, Females) %>%
  pivot_longer(cols = c(Males, Females), names_to = "Gender", values_to = "Population")

# Make Females population negative for the pyramid effect
first_20_years_long$Population <- ifelse(first_20_years_long$Gender == "Females", 
                                         -first_20_years_long$Population, 
                                         first_20_years_long$Population)

# Plot the population pyramid
pyramid_plot <- ggplot(first_20_years_long, aes(x = Year, y = Population, fill = Gender)) +
  geom_bar(stat = "identity", position = "identity") +
  coord_flip() +
  scale_fill_manual(values = c("Females" = "pink", "Males" = "lightblue")) +
  labs(title = "Population Pyramid: Males vs Females (First 20 Years)",
       x = "Year",
       y = "Population",
       fill = "Gender") +
  theme_minimal() +
  scale_y_continuous(labels = function(x) format(abs(x), big.mark = ",")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom")

# Display the plot
print(pyramid_plot)

print("Focused population pyramid for the first 20 years created with legend.")

Python Example

#PYTHON CODE 
import matplotlib.pyplot as plt
import seaborn as sns

# Select the first 20 years of data
first_20_years_df = df.head(20).sort_values('Year')

# Create the plot
fig, ax = plt.subplots(figsize=(12, 8))

# Plot the male population
sns.barplot(x="Males", y="Year", data=first_20_years_df, color="blue", label="Males", orient="h", ax=ax)

# Plot the female population (negative values for the pyramid effect)
first_20_years_df["Females_neg"] = -first_20_years_df["Females"]
sns.barplot(x="Females_neg", y="Year", data=first_20_years_df, color="pink", label="Females", orient="h", ax=ax)

# Add labels and title
ax.set_xlabel("Population")
ax.set_ylabel("Year")
ax.set_title("Population Pyramid: Males vs Females (First 20 Years)")

# Customize x-axis labels to show absolute values on both sides
ticks = ax.get_xticks()
ax.set_xticks(ticks)  # fix tick positions before relabeling (avoids a matplotlib warning)
ax.set_xticklabels([f'{abs(int(x))}' for x in ticks])

# Add a legend
ax.legend()

# Show the plot
plt.tight_layout()
plt.show()

print("Focused population pyramid for the first 20 years created.")

This post is part of a multi-series compilation. You can find the other posts below:

Visualization: Correlation Charts

Visualization: Comparison Charts

Visualization: Geospatial and Other Charts

Visualization: Data Over Time (Temporal)

Visualization: Distribution Charts

Happy graphing!
