Data Visualization
Data visualization means exploring data through visual representations. It is closely related to data analysis, which uses code to explore patterns and correlations in datasets.
This article covers two commonly used Python packages:
- Matplotlib v3.7.4, a mathematical plotting library commonly used for data processing and visual analysis.
- Plotly v5.18, a package that generates charts well suited for display on digital devices.
Install matplotlib
You can install matplotlib directly with the command python -m pip install --user matplotlib
Alternatively, you can create a virtual environment and run pip install matplotlib within it
Or use other methods listed on the official site
matplotlib
Simple Line Chart
About matplotlib.pyplot
matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.
With the following code, we can use the pyplot function collection from matplotlib to quickly generate a simple line chart
import matplotlib.pyplot as plt
squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()
ax.plot(squares)
plt.show()
The call plt.show() displays all open figures, here the single chart created by plt.subplots(). To save the chart instead, refer to pyplot.savefig and replace plt.show() with plt.savefig('squares_plot.png', bbox_inches='tight').
WARNING
The variable fig in fig, ax = plt.subplots() cannot be omitted, because plt.subplots() returns a tuple containing a Figure and one or more Axes objects. Tuple unpacking assigns the Figure to fig and the Axes to ax. If you write only ax = plt.subplots(), ax is bound to the whole tuple rather than to the Axes. As with destructuring in JavaScript, the position of each name on the left-hand side decides which value it receives.
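The unpacking behaviour can be tried without Matplotlib. The sketch below uses a hypothetical stand-in function, make_figure_and_axes, that simply returns a 2-tuple of strings; it is not part of Matplotlib and only mimics the shape of the return value.

```python
def make_figure_and_axes():
    """Hypothetical stand-in for plt.subplots(): returns a 2-tuple."""
    return "figure", "axes"


# Two names on the left: each receives one element of the tuple
fig, ax = make_figure_and_axes()

# Underscore as a throwaway name for the element we do not need
_, only_ax = make_figure_and_axes()

# One name on the left: it receives the whole tuple
whole = make_figure_and_axes()

print(fig, ax)
print(only_ax)
print(whole)
```

The last case is the mismatch the warning describes: a single name does not pick out the Axes, it captures both values at once.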
If you only want the Axes object and don't need the Figure, you can assign the Figure to _ to signal that it is ignored.
See Tuple Unpacking
_, ax = plt.subplots()
Set labels and Line Width
Below we customize the labels and line width
import matplotlib.pyplot as plt
squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()
ax.plot(squares, linewidth=3)
ax.set_title("Square Numbers", fontsize=24)
# It's best to use English for labels to avoid text display issues
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)
# Set size of tick labels.
ax.tick_params(axis='both', labelsize=14)
plt.show()
Correcting the Figure
If you look closely at the current figure, you'll notice that the x-axis and y-axis data are misaligned. At x=0 the value is y=1, and at the last point, x=4, the value is y=25.
According to the ax.plot documentation, the x values are optional. When only one sequence is passed, it is treated as the y data and x defaults to range(len(y)). So the chart's x-axis currently uses range(len(squares)), that is, 0 through 4.
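The effect of that default can be checked in plain Python, without drawing anything. This small sketch pairs each y value with the x value it would get by default, then with the explicit x values used in the corrected chart.

```python
squares = [1, 4, 9, 16, 25]

# What the x values are when only y is given: 0, 1, 2, 3, 4
default_x = list(range(len(squares)))
pairs_default = list(zip(default_x, squares))

# Explicit x values, so that each y is the square of its x
input_values = [1, 2, 3, 4, 5]
pairs_explicit = list(zip(input_values, squares))

print(pairs_default)   # [(0, 1), (1, 4), (2, 9), (3, 16), (4, 25)]
print(pairs_explicit)  # [(1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]
```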
Therefore, we just need to pass the matching x-axis data explicitly
import matplotlib.pyplot as plt
input_values = [1, 2, 3, 4, 5]
squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()
ax.plot(input_values, squares, linewidth=3)
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)
# Set size of tick labels.
ax.tick_params(axis="both", labelsize=14)
plt.show()
Set Plot Styles
You can use plt.style.available to see which styles are available on your system, then use plt.style.use('Solarize_Light2') to apply a theme
import matplotlib.pyplot as plt
print(plt.style.available)
"""
output: ['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark',
'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']
"""
Code Example
import matplotlib.pyplot as plt
input_values = [1, 2, 3, 4, 5]
squares = [1, 4, 9, 16, 25]
plt.style.use("Solarize_Light2")
fig, ax = plt.subplots()
ax.plot(input_values, squares, linewidth=3)
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)
# Set size of tick labels.
ax.tick_params(axis="both", labelsize=14)
plt.show()
Scatter
Scatter plots are useful when you want to control the appearance of individual points, for example when plotting large datasets.
To plot a single point, use the scatter() method. See matplotlib.pyplot.scatter for details.
import matplotlib.pyplot as plt
# The "seaborn" style was renamed to "seaborn-v0_8" in Matplotlib 3.6
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.scatter(2, 4)
plt.show()
Set scatter style
Next, set the point size and color, then the chart title and labels
import matplotlib.pyplot as plt
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
# Set scatter point size and color
ax.scatter(2, 4, s=200, c="red")
# ax.scatter(1, 3, c="blue")
# Set chart title and label axes
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)
# Set size of tick labels
ax.tick_params(axis='both', which='major', labelsize=14)
plt.show()
Set multiple scatters
You can plot multiple points by passing sequences of x and y values of shape (n,). Rather than typing the values by hand, you can compute them, for example with range() and a list comprehension
import matplotlib.pyplot as plt
x_values = range(1, 1001)
y_values = [v**2 for v in x_values]
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, s=10, c="red")
# Set chart title and label axes
--snip--
# Set the range for each axis
ax.axis([0, 1100, 0, 1100000])
Set Color Map
Besides setting a fixed color with c='red', pyplot ships with a set of built-in color maps. A color map is a gradient of colors. To use one, pass an array of values to c, one value per point, and name the color map with cmap. Each point is then colored according to its value.
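To give a feel for how a value turns into a position on the gradient, here is a plain-Python sketch. It assumes the default behaviour, where the values passed to c are rescaled linearly so that the smallest maps to the start of the color map and the largest to its end. The normalize helper is hypothetical and only mimics that rescaling; it is not Matplotlib's implementation.

```python
def normalize(values):
    """Linearly rescale values so that min -> 0.0 and max -> 1.0.

    Assumes the values are not all equal (otherwise hi - lo is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


y_values = [x**2 for x in range(1, 6)]  # 1, 4, 9, 16, 25
positions = normalize(y_values)

# 0.0 is the light end of a map such as Reds, 1.0 the dark end
print(positions)
```

Because the squares grow faster and faster, the positions are not evenly spaced, which is why the curve darkens quickly toward the top of the chart.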
import matplotlib.pyplot as plt
x_values = range(1, 1001)
y_values = [x**2 for x in x_values]
fig, ax = plt.subplots()
# Color each point by its y value using the Reds color map
ax.scatter(x_values, y_values, c=y_values, cmap=plt.cm.Reds, s=10)
# Set chart title and label axes.
--snip--
Code Example
import matplotlib.pyplot as plt
x_values = range(1, 1001)
y_values = [v**2 for v in x_values]
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, s=10, c=y_values, cmap=plt.cm.Reds)
# Set chart title and label axes
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)
# Set size of tick labels
ax.tick_params(axis="both", which="major", labelsize=14)
# Set the range for each axis
ax.axis([0, 1100, 0, 1100000])
# plt.show()
plt.savefig("test.png", bbox_inches="tight")
Random Walk
A random walk is a path determined by a series of random decisions: each step is chosen at random, with no overall direction.
First, we create a RandomWalk class. It starts at (0, 0) and takes a num_points argument that sets how many points the walk contains.
Then we declare a utility function, get_random_step, that randomly picks the direction and distance of each step. A while loop keeps adding steps until the walk contains num_points points, 5000 by default.
About the random module
The random module implements pseudo-random number generators for various distributions.
random.choice(seq) returns a random element from the non-empty sequence seq.
Pseudo-random means that the values look random but are actually produced by a fixed, predictable process.
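The "fixed and predictable" part can be observed directly by seeding the generator. The sketch below is only an illustration: sample_steps is a hypothetical helper, and the seed value 42 is an arbitrary choice.

```python
import random


def sample_steps(seed, count=5):
    """Return `count` signed step sizes drawn after seeding the generator."""
    random.seed(seed)
    return [
        random.choice([-1, 1]) * random.choice([0, 1, 2, 3, 4])
        for _ in range(count)
    ]


first_run = sample_steps(42)
second_run = sample_steps(42)

# Same seed, same sequence: the process is deterministic
print(first_run == second_run)  # True
```

Without a call to random.seed, Python seeds the generator from a changing source, so each run of the random walk looks different.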
Update Styles
The styles from the Scatter section can also be applied to the random walk plot. The following call removes the point outlines and colors each point by its position in the walk, from 0 to num_points - 1
ax.scatter(rw.x_values, rw.y_values, s=15, c=range(rw.num_points), cmap=plt.cm.Blues, edgecolors="none")
We may also want to give the start and end points of the walk a distinct style. We can declare an emphasize_point function inside draw_plot that redraws a specific point with its own style
def emphasize_point(idx, **scatter_args):
    # Start from the defaults, then apply any caller-supplied overrides
    final_args = dict(s=100, edgecolors="none", c="red")
    final_args.update(scatter_args)
    ax.scatter(
        rw.x_values[idx],
        rw.y_values[idx],
        **final_args,
    )
emphasize_point(0, c="green")
emphasize_point(-1, c="red", s=120)
Code Example
from random import choice

import matplotlib.pyplot as plt


def get_random_step(distance=(0, 1, 2, 3, 4)):
    """Determine the direction and distance for each step"""
    direction = choice([-1, 1])
    return direction * choice(distance)


class RandomWalk:
    """A class to generate random walks."""

    def __init__(self, num_points=5000):
        """Initialize attributes of a walk"""
        self.num_points = num_points
        # All walks start at (0,0)
        self.x_values = [0]
        self.y_values = [0]

    def fill_walk(self):
        """Calculate all the points in the walk"""
        while len(self.x_values) < self.num_points:
            # Decide which direction to go and how far to go in that direction
            x_step = get_random_step()
            y_step = get_random_step()
            # Reject moves that go nowhere
            if x_step == 0 and y_step == 0:
                continue
            # calculate the next x and y values
            x = self.x_values[-1] + x_step
            y = self.y_values[-1] + y_step
            self.x_values.append(x)
            self.y_values.append(y)


def draw_plot():
    # Create a random walk instance and fill it
    rw = RandomWalk()
    rw.fill_walk()
    # plot all the points in the walk
    plt.style.use("classic")
    fig, ax = plt.subplots()
    ax.scatter(
        rw.x_values,
        rw.y_values,
        s=15,
        c=range(rw.num_points),
        cmap=plt.cm.Blues,
        edgecolors="none",
    )

    def emphasize_point(idx, **scatter_args):
        """Emphasize a point in the plot"""
        final_args = dict(s=100, edgecolors="none", c="red")
        final_args.update(scatter_args)
        ax.scatter(
            rw.x_values[idx],
            rw.y_values[idx],
            **final_args,
        )

    emphasize_point(0, c="green")
    emphasize_point(-1, c="red", s=120)
    plt.show()


if __name__ == "__main__":
    draw_plot()
Plotly
Plotly can generate interactive charts, and the charts it produces automatically scale to fit the viewer's screen.
According to the official documentation, Plotly's interactivity is built on the Plotly JavaScript library. We can therefore use the plotly module to export a chart as interactive HTML
from plotly import offline
offline.plot({"data": data, "layout": my_layout}, filename="d6.html")
This was the recommended way to display figures before version 4. In later versions, including version 5, you can use the renderers framework to display charts. The renderers framework is a generalization of plotly.offline.iplot and plotly.offline.plot.
Code Example
from plotly.graph_objects import Bar, Layout
# from plotly.graph_objects import Figure
import plotly.offline as offline
from random import randint


class Die:
    """A class representing a single die"""

    def __init__(self, num_sides=6):
        """default to a six-sided die"""
        self.num_sides = num_sides

    def roll(self):
        """return a random value between 1 and number of sides"""
        return randint(1, self.num_sides)


# def draw_plot():
# Create a D6
die = Die()
# make some rolls and save the results in a list
results = []
for roll_num in range(1000):
    result = die.roll()
    results.append(result)
# analyze the results
frequencies = []
for value in range(1, die.num_sides + 1):
    frequencies.append(results.count(value))
x_values = list(range(1, die.num_sides + 1))
data = [Bar(x=x_values, y=frequencies)]
x_axis_config = {"title": "Result"}
y_axis_config = {"title": "Frequency of the result"}
my_layout = Layout(
    title="Roll a D6 1000 times", xaxis=x_axis_config, yaxis=y_axis_config
)
# Figure(data=data, layout=my_layout).show()
offline.plot({"data": data, "layout": my_layout}, filename="d6.html")
Mass Data Process
Generally, the data we use for visualization is not self-generated. It is collected from the web through scraping and other methods, then processed further.
By storing and processing visualization data, we can discover patterns and correlations that others have not found
CSV
CSV (comma-separated values) files store data as plain text, with the values on each line separated by commas.
For example, the file assets/data/sitka_weather_07-2018_simple.csv contains the following text
"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"
"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"
which represents the data in the following table (TAVG is empty for this row)
| STATION | NAME | DATE | PRCP | TAVG | TMAX | TMIN |
|---|---|---|---|---|---|---|
| USW00025333 | SITKA AIRPORT, AK US | 2018-07-01 | 0.25 | | 62 | 50 |
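The mapping from text to table can be reproduced with the standard csv module. This sketch parses the two lines shown above from an in-memory string (io.StringIO stands in for the file), so it runs without the data file. Note how the quoted comma inside the station name is kept, and how the missing TAVG value comes back as an empty string.

```python
import csv
import io

# The two lines quoted above, held in memory instead of read from disk
sample = (
    '"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"\n'
    '"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"\n'
)

rows = list(csv.reader(io.StringIO(sample)))
header, first = rows

# Pair each column name with the value from the first data row
record = dict(zip(header, first))

print(len(first))            # 7
print(record["NAME"])        # SITKA AIRPORT, AK US
print(repr(record["TAVG"]))  # ''
```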
Analyze CSV Data
Python has a built-in csv module for convenient data processing. Combined with file operations, you can read the header row of the file
import csv
filename = 'assets/data/sitka_weather_07-2018_simple.csv'
with open(filename) as f:
    reader = csv.reader(f)
    # next() is a built-in function; here it returns the next row from the reader
    header_row = next(reader)
    print(header_row)  # ['STATION', 'NAME', 'DATE', 'PRCP', 'TAVG', 'TMAX', 'TMIN']
You can also use the built-in enumerate function to display each header together with its index
--snip--
    # next() is a built-in function; here it returns the next row from the reader
    header_row = next(reader)
    print(header_row)
    for idx, col_header in enumerate(header_row):
        print(idx, col_header)
Similarly, by iterating over the remaining rows, you can collect the daily high temperatures from the file
--snip--
    # next() is a built-in function; here it returns the next row from the reader
    header_row = next(reader)
    for idx, col_header in enumerate(header_row):
        print(idx, col_header)
    highs = []
    for row in reader:
        high = int(row[5])
        highs.append(high)
Or use the list comprehension highs = [int(row[5]) for row in reader] to build the list in one line
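As a variation on the hard-coded index 5, the column position can be looked up from the header row. The sketch below runs on an in-memory sample rather than the real file. Its first data row is the one quoted earlier, while the second data row is invented purely so that the list has more than one entry.

```python
import csv
import io

sample = (
    '"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"\n'
    '"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"\n'
    # Invented row, for illustration only
    '"USW00025333","SITKA AIRPORT, AK US","2018-07-02","0.01",,"59","53"\n'
)

reader = csv.reader(io.StringIO(sample))
header_row = next(reader)

# Find the column by name instead of counting by hand
tmax_idx = header_row.index("TMAX")
highs = [int(row[tmax_idx]) for row in reader]

print(tmax_idx)  # 5
print(highs)     # [62, 59]
```

Looking the index up by name keeps the code working if a file places its columns in a different order.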
Draw temperature plot
Following the line chart section, you can plot highs from the previous section as the y values to get a basic line chart
import matplotlib.pyplot as plt
with open(filename) as f:
    --snip--
# Draw the chart
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.plot(highs, c="red")
# set chart title and label axes
ax.set_title("2018 Daily high temperatures for Sitka, AK", fontsize=24)
ax.set_xlabel("", fontsize=16)
ax.set_ylabel("Temperature(F)", fontsize=16)
ax.tick_params(axis="both", which="major", labelsize=16)
Next, use the datetime module to convert the date strings into datetime objects. strftime and strptime are a matching pair: the former formats a datetime as a string, and the latter parses a string into a datetime.
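A short round trip shows the two directions. The date string and the '%Y-%m-%d' format come from the data file above; the second format string, '%d/%m/%Y', is just an arbitrary example of a different output layout.

```python
from datetime import datetime

# Parse: string -> datetime
parsed = datetime.strptime("2018-07-01", "%Y-%m-%d")

# Format: datetime -> string, in two different layouts
formatted = parsed.strftime("%d/%m/%Y")
round_trip = parsed.strftime("%Y-%m-%d")

print(parsed)      # 2018-07-01 00:00:00
print(formatted)   # 01/07/2018
print(round_trip)  # 2018-07-01
```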
Why not use the date column directly as x-axis data?
The raw strings from row[2] would be plotted as categorical labels. Converting them to datetime objects lets Matplotlib treat the x-axis as a real time axis, and we can then call the Figure object's autofmt_xdate method to rotate and align the date labels.
Make the following changes to the code
import csv
import matplotlib.pyplot as plt
from datetime import datetime
with open(filename) as f:
    --snip--
    # get dates and high temperatures from the file
    dates, highs = [], []
    for row in reader:
        high = int(row[5])
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        dates.append(current_date)
        highs.append(high)
--snip--
ax.set_xlabel("", fontsize=16)
fig.autofmt_xdate()
--snip--
Draw another series
In addition to the high temperatures, we can also plot the low temperatures when the data includes them
--snip--
    header_row = next(reader)
    dates, highs, lows = [], [], []
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        dates.append(current_date)
        high = int(row[5])
        highs.append(high)
        lows.append(int(row[6]))
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.plot(dates, highs, c="red")
ax.plot(dates, lows, c="blue")
ax.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)
The ax.fill_between method fills the area between the two data series.
Error handling
In theory, the code from the previous section can chart any file that contains date, low, and high data. In practice, the data source may have problems, such as
- Data corruption or missing values
- Incorrect data format
If these are not handled, the program may crash. For example, if we replace the file with assets/data/death_valley_2018_simple.csv and update the column indices for the high and low temperatures, the program raises a ValueError, because int(row[4]) is called on an empty string in at least one row.
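The failure can be reproduced in isolation, without the data file. This sketch calls int() on an empty string, which is what the csv module returns for a missing field, and captures the resulting error message.

```python
missing_field = ""  # what csv.reader yields for an empty column

try:
    int(missing_field)
except ValueError as err:
    message = str(err)

print(message)  # invalid literal for int() with base 10: ''
```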
The solution is simple: wrap the conversions in try/except and catch the ValueError
try:
    high = int(row[5])
    low = int(row[6])
except ValueError:
    print(f"Missing data for {current_date}")
else:
    highs.append(high)
    lows.append(low)
    dates.append(current_date)
Code Example
# sitka_highs.py
import csv
import matplotlib.pyplot as plt
from datetime import datetime

filename = "assets/data/sitka_weather_2018_simple.csv"
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)
    # get dates, high and low temperatures from the file
    dates, highs, lows = [], [], []
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        try:
            high = int(row[5])
            low = int(row[6])
        except ValueError:
            print(f"Missing data for {current_date}")
        else:
            highs.append(high)
            lows.append(low)
            dates.append(current_date)

plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.plot(dates, highs, c="red")
ax.plot(dates, lows, c="blue")
# set chart title and label axes
ax.set_title("2018 Daily temperatures for Sitka, AK", fontsize=24)
ax.set_xlabel("", fontsize=16)
ax.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)
# format x axis according to date
fig.autofmt_xdate()
ax.set_ylabel("Temperature(F)", fontsize=16)
ax.tick_params(axis="both", which="major", labelsize=16)
plt.show()
JSON
Similar to CSV, Python also has a built-in JSON module for convenient data processing
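The same load-and-dump cycle can be tried on a string before touching any file. In this sketch the JSON text is a made-up placeholder, not taken from the earthquake data; json.loads and json.dumps are the in-memory counterparts of json.load and json.dump.

```python
import json

# Placeholder document, invented for illustration
text = '{"name": "sample", "values": [1, 2, 3]}'

# Parse JSON text into Python objects (dict, list, int, ...)
data = json.loads(text)

# Serialize back to text, indented for readability
pretty = json.dumps(data, indent=4)

print(data["values"])
print(pretty)
```

The indent=4 argument only changes the layout of the text; parsing the indented version gives back the same data.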
Code Example
import json
# Explore the structure of the data
filename = "assets/data/json/eq_data_1_day_m1.json"
with open(filename) as f:
    all_eq_data = json.load(f)
# Write an indented, readable copy of the data
readable_file = "assets/data/json/readable_eq_data.json"
with open(readable_file, "w") as f:
    json.dump(all_eq_data, f, indent=4)
Process JSON Response
A Web API is the part of a website that programs interact with: a program requests specific information through specific URLs. We can use Web APIs to fetch data from external sources and visualize the latest data. The response is usually returned in an easy-to-process format such as JSON or CSV.
For example, https://api.github.com/search/repositories?q=language:python&sort=stars returns information about Python projects hosted on GitHub, sorted by stars.
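To see what the URL is asking for, it can be split into its parts with the standard urllib.parse module. This sketch makes no network request; it only takes the example URL apart.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://api.github.com/search/repositories?q=language:python&sort=stars"

parts = urlsplit(url)
# parse_qs returns a dict that maps each parameter to a list of values
params = parse_qs(parts.query)

print(parts.netloc)  # api.github.com
print(parts.path)    # /search/repositories
print(params)        # {'q': ['language:python'], 'sort': ['stars']}
```

The path selects the repository search endpoint, q carries the search query, and sort names the field to order by.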
To make requests to the endpoint, we use the external module requests
import requests
# Make an API call and store the response
url = "https://api.github.com/search/repositories?q=language:python&sort=stars"
headers = {"Accept": "application/vnd.github.v3+json"}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}")  # Status code: 200
# Store API response in a variable
response_dict = r.json()
# Process results
print(response_dict.keys())  # dict_keys(['total_count', 'incomplete_results', 'items'])
For the Plotly charting and data processing workflow, see the Source Code.
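As a rough idea of the processing step, the parsed response can be handled like any nested Python data. The dictionary below is hand-written to mimic the three top-level keys printed above; its values are invented, and the per-item fields name and stargazers_count are assumed to match what the API returns. It is a sketch, not the workflow from the source code.

```python
# Hand-written sample with the same top-level keys as the real response;
# all values are placeholders, invented for illustration.
response_dict = {
    "total_count": 2,
    "incomplete_results": False,
    "items": [
        {"name": "example-repo-a", "stargazers_count": 120},
        {"name": "example-repo-b", "stargazers_count": 45},
    ],
}

# Pull out one list per chart axis
repo_names = [repo["name"] for repo in response_dict["items"]]
stars = [repo["stargazers_count"] for repo in response_dict["items"]]

print(repo_names)
print(stars)
```

The two lists line up index by index, so they can serve as the x and y values of a bar chart.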