Introduction
Choosing the correct visualization is an important aspect of data presentation, yet many users struggle to identify the most effective one for their data. Each type of visualization serves a different purpose, and selecting the appropriate one requires an understanding of the data, the audience, and the overall message you wish to convey.
This article aims to make the process of data visualization easier to understand. It will highlight the different types of graphs and their typical use cases. Additionally, it will provide you with the dataset used for each visualization, along with the Python and R code involved in creating the graph. You can see the full article here.
Acknowledgements
Each dataset used in this document (unless otherwise stated) can be found on vincentarelbundock.github.io, a large repository of datasets that can be used in R. I would like to thank the people responsible for making this information openly accessible. Links to the Google Sheets containing the data are provided throughout the document.
How the Guide is Formatted
The guide is organized by general group (e.g., comparison charts, correlation charts), each followed by the visualizations that fall under that group. For example, bar/column charts are a type of comparison chart. Each chart gets a short introduction, followed by the figure. The R and Python code used to generate the graph is listed directly underneath the figure. For all visualizations, make sure that you upload the file when you start the chat, as some of the code does not reflect that initial step.
Data Over Time (Temporal)
Temporal charts are used to display data over time, revealing trends, patterns, and changes.
1. Area Chart
Area charts are a type of data visualization used to represent quantitative data and show how values change over a period of time.
We will use the London dataset (82 rows; 7 variables) to visualize mortality and plague deaths over time. The dataset can be accessed here.
R Example
#R CODE
# Load required libraries
library(ggplot2)
library(scales)
# Create the line area chart with legend
p <- ggplot(df, aes(x = Year)) +
geom_area(aes(y = Plague, fill = "Plague"), alpha = 0.3) +
geom_area(aes(y = Mortality, fill = "Mortality"), alpha = 0.3) +
geom_line(aes(y = Plague, color = "Plague"), linewidth = 1) +
geom_line(aes(y = Mortality, color = "Mortality"), linewidth = 1) +
scale_fill_manual(values = c("Plague" = "blue", "Mortality" = "red"), name = "Category") +
scale_color_manual(values = c("Plague" = "blue", "Mortality" = "red"), name = "Category") +
scale_x_continuous(breaks = seq(min(df$Year), max(df$Year), by = 10)) +
scale_y_continuous(labels = comma) +
labs(title = 'Plague and Mortality Rates in London (1629-1710)',
x = 'Year',
y = 'Number of Cases') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "right",
plot.title = element_text(hjust = 0.5, face = "bold"))
# Display the plot
print(p)
# Save the plot as a PNG file
ggsave("plague_mortality_chart.png", plot = p, width = 12, height = 8, dpi = 300)
print("Chart saved as 'plague_mortality_chart.png'")
Python Example
#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt
# Read the CSV file
df = pd.read_csv('dataset.csv')
# Create the line area chart
plt.figure(figsize=(12, 6))
plt.fill_between(df['Year'], df['Plague'], alpha=0.3, label='Plague')
plt.fill_between(df['Year'], df['Mortality'], alpha=0.3, label='Mortality')
plt.plot(df['Year'], df['Plague'], linewidth=2)
plt.plot(df['Year'], df['Mortality'], linewidth=2)
plt.title('Plague and Mortality Rates Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Cases')
plt.legend()
plt.grid(True, alpha=0.3)
# Rotate x-axis labels for better readability
plt.xticks(rotation=45)
# Adjust layout and save the plot
plt.tight_layout()
plt.savefig('plague_mortality_chart.png')
plt.close()
print("Chart has been created and saved as 'plague_mortality_chart.png'")
# Display some statistics about the Plague and Mortality columns
print("\nPlague statistics:")
print(df['Plague'].describe())
print("\nMortality statistics:")
print(df['Mortality'].describe())
# Calculate the correlation between Plague and Mortality
correlation = df['Plague'].corr(df['Mortality'])
print(f"\nCorrelation between Plague and Mortality: {correlation:.4f}")
2. Line chart
Line charts are among the most commonly used types of charts worldwide. They are effective at showing overall trends or progress over time.
For this visualization, we will use a dataset called ‘trump_tweet’, which tracks the number of tweets by Mr. Trump from 2009 to 2017. The full dataset can be accessed here (20,761 rows; 8 variables), while the condensed dataset used for this visualization is available here (9 rows; one variable).
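The condensing step mentioned above (turning one row per tweet into one count per year) could look roughly like the sketch below. It uses only the standard library; the sample timestamps and their format are made up for illustration and are not taken from the actual dataset, whose column names and layout may differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps standing in for the full dataset's date column.
timestamps = [
    "2009-05-04 18:54:25",
    "2009-05-12 14:07:28",
    "2010-01-15 09:30:00",
    "2012-07-01 12:00:00",
    "2012-11-06 23:15:41",
    "2012-12-25 08:00:00",
]


def tweets_per_year(stamps):
    """Count entries per calendar year; return (year, count) pairs sorted by year."""
    years = (datetime.strptime(s, "%Y-%m-%d %H:%M:%S").year for s in stamps)
    return sorted(Counter(years).items())


yearly_counts = tweets_per_year(timestamps)
for year, count in yearly_counts:
    print(year, count)
```

Note that a year with no entries (2011 in this sample) simply does not appear in the result, so a line chart drawn from it would connect 2010 directly to 2012 unless the missing year is filled in with a zero.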
R Example
#R CODE
library(ggplot2)
library(scales)
# Read the CSV file
df <- read.csv('tweets_per_year.csv')
# Create the line chart with all years on x-axis
p <- ggplot(df, aes(x = year, y = count)) +
geom_line(color = 'blue', linewidth = 1) +
geom_point(color = 'red', size = 3) +
theme_minimal() +
labs(title = 'Number of Tweets per Year',
x = 'Year',
y = 'Number of Tweets') +
scale_x_continuous(breaks = seq(min(df$year), max(df$year), by = 1)) +
scale_y_continuous(labels = comma) +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = 'bold'),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
axis.text.x = element_text(angle = 45, hjust = 1))
# Display the plot
print(p)
# Save the plot as a PNG file
ggsave('tweets_per_year_chart_all_years.png', plot = p, width = 12, height = 7, dpi = 300)
# Print a message about the saved file
cat('The updated chart has been saved as tweets_per_year_chart_all_years.png')
Python Example
#PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt
# Read the CSV file
df = pd.read_csv('tweets_per_year.csv')
# Create the line chart
plt.figure(figsize=(12, 7))
plt.plot(df['year'], df['count'], marker='o', color='blue', linestyle='-', linewidth=2, markersize=6)
plt.title('Number of Tweets per Year', fontsize=16, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Tweets', fontsize=12)
plt.xticks(df['year'], rotation=45)
plt.grid(True)
plt.tight_layout()
# Save the plot as a PNG file
plt.savefig('tweets_per_year_chart_all_years_python.png', dpi=300)
# Display the plot
plt.show()
print('The updated chart has been saved as tweets_per_year_chart_all_years_python.png')
3. Candlestick chart
A candlestick chart is a financial visualization used to analyze price movements of an asset, derivative, or currency. It is commonly used in technical analysis to predict market trends. The chart displays the high, low, opening, and closing prices of a product within a specific time frame.
For this chart, we will use the S&P 500 stock market dataset. The dataset can be accessed here.
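To make the four values behind each candle concrete, here is a minimal sketch of how a time-ordered list of prices can be reduced to open, high, low, and close per period. The period labels and prices are invented for illustration, and the helper names `to_ohlc` and `direction` are hypothetical, not part of any library.

```python
# Hypothetical (period, price) pairs, already sorted by time within each period.
ticks = [
    ("day-1", 100.0), ("day-1", 104.0), ("day-1", 98.5), ("day-1", 102.0),
    ("day-2", 102.0), ("day-2", 103.0), ("day-2", 99.0),
]


def to_ohlc(ticks):
    """Group time-ordered (period, price) pairs into one OHLC record per period."""
    candles = {}
    for period, price in ticks:
        if period not in candles:
            # The first price seen in a period is its open (and, so far, everything else).
            candles[period] = {"open": price, "high": price, "low": price, "close": price}
        else:
            candle = candles[period]
            candle["high"] = max(candle["high"], price)
            candle["low"] = min(candle["low"], price)
            candle["close"] = price  # the last price seen so far is the close
    return candles


def direction(candle):
    """Classify a candle by comparing its close with its open."""
    if candle["close"] > candle["open"]:
        return "up"
    if candle["close"] < candle["open"]:
        return "down"
    return "flat"


candles = to_ohlc(ticks)
for period, candle in candles.items():
    print(period, candle, direction(candle))
```

Charting libraries typically color each candle by this up/down comparison, while the high and low values set the length of the wicks.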
R Example
library(quantmod)
library(readxl)
library(ggplot2)
library(plotly)
# Read the Excel file
df <- read_excel("sp500_1974_March.csv.xlsx")
# Convert the data to xts (extensible time series) format
df$Date <- as.POSIXct(df$Date)
xts_data <- xts(df[, c("Open", "High", "Low", "Close", "Volume")], order.by = df$Date)
# Create the candlestick chart using plotly
p <- plot_ly(data = df, x = ~Date, type="candlestick",
open = ~Open, close = ~Close,
high = ~High, low = ~Low) %>%
layout(title = "S&P 500 Candlestick Chart - March 1974",
xaxis = list(title = "Date"),
yaxis = list(title = "Price"))
# Add volume as a bar chart
p <- p %>% add_bars(x = ~Date, y = ~Volume, name = "Volume",
yaxis = "y2", marker = list(color = "rgba(128,128,128,0.5)"))
# Update layout to include secondary y-axis for volume
p <- p %>% layout(yaxis2 = list(title = "Volume", overlaying = "y", side = "right"))
# Save the plot as an HTML file
htmlwidgets::saveWidget(p, "sp500_march_1974_candlestick_r.html")
print("Candlestick chart has been created and saved as 'sp500_march_1974_candlestick_r.html'")
# Display the first few rows of the data
print(head(df))
Python Example
#PYTHON CODE
import pandas as pd
import mplfinance as mpf
import matplotlib.pyplot as plt
# Read the Excel file
df = pd.read_excel('sp500_1974_March.csv.xlsx')
# mplfinance expects a DatetimeIndex, so parse the dates before setting the index
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Create the candlestick chart with volume
fig, axes = mpf.plot(df, type='candle', volume=True, figsize=(12, 8),
title='S&P 500 Candlestick Chart - March 1974',
ylabel='Price', ylabel_lower='Volume',
style='yahoo', returnfig=True)
# Save the figure returned by mplfinance
fig.savefig('sp500_march_1974_candlestick.png', bbox_inches='tight')
print("Candlestick chart has been created and saved as 'sp500_march_1974_candlestick.png'")
# Display the plot
plt.show()
4. Stream graph
A stream graph displays changes in the magnitude of categorical data over time. It is a variation of the stacked area chart in which the baseline is not anchored to a fixed axis but moves up and down, giving the graph a natural, flowing shape.
For this visualization, we will use a dataset that measures air pollutants in Leeds (UK) from 1994 to 1998 (Heffernan and Tawn, 2004). The winter dataset includes measurements from November to February (532 rows with 5 variables). The dataset can be accessed here.
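The idea of a floating baseline can be sketched numerically. The example below centers each time step's stack around zero, which is the simplest (symmetric) variant; it is an illustration under assumed data, not the method used to produce the figure. The layer names and values are made up, and `symmetric_stream` is a hypothetical helper.

```python
# Hypothetical layers: one list per category, one value per time step.
layers = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 2.0, 1.0],
    "C": [1.0, 4.0, 2.0],
}


def symmetric_stream(layers):
    """Return {name: (lower, upper)} boundaries with each stack centered on zero."""
    names = list(layers)
    n_steps = len(layers[names[0]])
    totals = [sum(layers[name][t] for name in names) for t in range(n_steps)]
    # Start each column at minus half its total so the whole stack straddles zero.
    lower = [-total / 2 for total in totals]
    bands = {}
    for name in names:
        upper = [lo + value for lo, value in zip(lower, layers[name])]
        bands[name] = (lower, upper)
        lower = upper  # the next layer sits on top of this one
    return bands


bands = symmetric_stream(layers)
for name, (lower, upper) in bands.items():
    print(name, lower, upper)
```

In matplotlib, `plt.stackplot` accepts a `baseline` argument (`'zero'`, `'sym'`, `'wiggle'`, or `'weighted_wiggle'`), so the same effect is available without computing the offsets by hand.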
R Example
# Load necessary libraries
library(tidyverse)
library(googlesheets4)
# Deauthorize to access public Google Sheets
gs4_deauth()
# Read the dataset from Google Sheets
sheet_url <- "https://docs.google.com/spreadsheets/d/14dbvT_Qj60jVA9-i-AOSAAEh0uEWIZvMKpUGoMkjuAw/edit?usp=sharing"
df <- read_sheet(sheet_url)
# Reshape the data for plotting
library(tidyr)
df_long <- df %>%
pivot_longer(cols = c(O3, NO2, NO, SO2, PM10), names_to = "Pollutant", values_to = "Level")
# Plot the stream graph
p <- ggplot(df_long, aes(x = time, y = Level, fill = Pollutant)) +
geom_area(position = "stack") +
labs(title = "Stream Graph of Pollutants Over Time", x = "Time", y = "Pollutant Levels") +
theme_minimal()
# Display the plot
print(p)
Python Example
#PYTHON CODE
# Import necessary libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style for the plot
sns.set(style="whitegrid")
# Create a stream graph for the pollutants
plt.figure(figsize=(14, 8))
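# Plot the stream graph (df is assumed to be loaded already);
# baseline='wiggle' lets the stack float instead of sitting on zero
plt.stackplot(df['rownames'], df['O3'], df['NO2'], df['NO'], df['SO2'], df['PM10'],
labels=['O3', 'NO2', 'NO', 'SO2', 'PM10'], baseline='wiggle', alpha=0.8)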
# Adding labels and title
plt.legend(loc='upper left')
plt.title('Stream Graph of Pollutants Over Time')
plt.xlabel('Time (rownames)')
plt.ylabel('Pollutant Levels')
# Show the plot
plt.show()
print("Stream graph created.")
5. Gantt chart
A Gantt chart is a visual tool used in project management to plan and track the progress of tasks. It displays individual tasks or activities along a timeline, highlighting their scheduled start and end dates.
For this visualization, we will use a dataset showing task allocation between start and end dates of my Master’s program. The dataset can be accessed here.
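Before any plotting, a Gantt chart boils down to two numbers per task: how far its start is from the beginning of the project, and how long it runs. The sketch below computes both with the standard library and prints a rough text version of the chart. The task names and dates are invented for illustration and do not come from the dataset; `gantt_rows` is a hypothetical helper.

```python
from datetime import date

# Hypothetical tasks as (name, start, end); not taken from the actual dataset.
tasks = [
    ("Literature review", date(2024, 1, 8), date(2024, 2, 2)),
    ("Data collection", date(2024, 1, 29), date(2024, 3, 15)),
    ("Write-up", date(2024, 3, 11), date(2024, 4, 5)),
]


def gantt_rows(tasks):
    """Return (name, offset_days, duration_days) relative to the earliest start date."""
    project_start = min(start for _, start, _ in tasks)
    return [
        (name, (start - project_start).days, (end - start).days)
        for name, start, end in tasks
    ]


rows = gantt_rows(tasks)
for name, offset, duration in rows:
    # One character per week: spaces for the offset, '#' for the task's duration.
    print(f"{name:<18}" + " " * (offset // 7) + "#" * max(1, duration // 7))
```

The offset and duration are exactly what a plotting library needs to position and size each bar along the timeline.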
R Example
#R CODE
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Convert UNIX timestamps to Date format
df <- df %>%
mutate(`Start Date` = as.POSIXct(`Start Date`, origin = '1970-01-01'),
`End Date` = as.POSIXct(`End Date`, origin = '1970-01-01'))
# Convert Task to a factor to preserve original order
df$Task <- factor(df$Task, levels = rev(unique(df$Task)))
# Create the Gantt chart
gantt_chart <- ggplot(df, aes(x = `Start Date`, xend = `End Date`, y = Task, yend = Task)) +
geom_segment(linewidth = 5, color = 'steelblue') +
theme_minimal() +
labs(title = 'Gantt Chart', x = 'Date', y = 'Task') +
theme(axis.text.y = element_text(size = 8),
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
plot.title = element_text(hjust = 0.5, face = "bold"),
panel.grid.major.y = element_line(color = "gray90"),
panel.grid.minor.y = element_blank()) +
scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y")
# Display the Gantt chart
print(gantt_chart)
Python Example
import pandas as pd
import plotly.figure_factory as ff
import plotly.io as pio
import base64
# Load the Excel file
df = pd.read_excel('gantt.xlsx')
# Ensure the dates are in datetime format
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
# Create a list of dictionaries for the Gantt chart
gantt_data = []
for index, row in df.iterrows():
    gantt_data.append(dict(Task=row['Task'],
                           Start=row['Start Date'],
                           Finish=row['End Date'],
                           Resource=f"Duration: {row['Duration']} days"))
# Create the Gantt chart
fig = ff.create_gantt(gantt_data, index_col='Resource', show_colorbar=True, group_tasks=True)
# Update the layout
fig.update_layout(
title='Task Allocation Gantt Chart',
xaxis_title='Date',
yaxis_title='Task',
height=800,
width=1000,
yaxis={'categoryorder':'array', 'categoryarray': df['Task'].tolist()}
)
# Save the plot as an HTML file
html_file = "gantt_chart.html"
pio.write_html(fig, file=html_file, auto_open=False)
# Read the HTML file and encode it
with open(html_file, "rb") as file:
    html_content = file.read()
encoded_html = base64.b64encode(html_content).decode("utf-8")
# Create a data URL
data_url = f"data:text/html;base64,{encoded_html}"
# Display the data URL
print(f"Gantt chart saved as: {html_file}")
print(f"Data URL: {data_url}")
print("Done")
This post is part of a multi-part series. You can find the other posts below:
Visualization: Distribution Charts
Visualization: Part-to-Whole Charts
Visualization: Correlation Charts
Visualization: Comparison Charts
Visualization: Geospatial and Other Charts
Happy graphing!