Data Visualizing

Data visualization refers to exploring data through visual representations. It is closely related to data analysis, which involves using code to explore patterns and correlations in datasets.

This article covers two commonly used Python packages:

  • Matplotlib v3.7.4, a mathematical plotting library commonly used for data processing and visual analysis.
  • Plotly v5.18, which generates charts well suited for display on digital devices.
Install matplotlib

You can install matplotlib directly with the command python -m pip install --user matplotlib

Alternatively, you can create a virtual environment and run pip install matplotlib within it

Or use other methods listed on the official site

matplotlib

Simple Line Chart

About matplotlib.pyplot

matplotlib.pyplot is a collection of functions that make matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

With the following code, we can use the pyplot function collection from matplotlib to quickly generate a simple line chart

python
import matplotlib.pyplot as plt

squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()
ax.plot(squares)
plt.show()

The call plt.show() opens the Matplotlib viewer and displays the chart. If you need to save the chart instead, refer to pyplot.savefig and replace plt.show() with plt.savefig('squares_plot.png', bbox_inches='tight').
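As a minimal sketch of the save-instead-of-show variant, the snippet below writes the chart to a PNG file. The Agg backend and the temporary output directory are assumptions made here so that the example runs without a display; neither comes from the original text.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt

squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()
ax.plot(squares)

# Assumption: save into a temporary directory to keep the project folder clean
output_path = os.path.join(tempfile.mkdtemp(), "squares_plot.png")
fig.savefig(output_path, bbox_inches="tight")  # trim extra whitespace around the plot
plt.close(fig)
```

Calling savefig on the Figure object is equivalent here to plt.savefig, which saves the current figure.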

WARNING

The variable fig in fig, ax = plt.subplots() cannot be omitted, because plt.subplots() returns a tuple containing a Figure and one or more Axes objects. Tuple unpacking assigns the Figure to fig and the Axes to ax. As with array destructuring in JavaScript, every position must be accounted for; otherwise the unpacked variables will not line up with the returned values.

If you only want the Axes object and don't care about the Figure, you could use _ to ignore the Figure

See Tuple Unpacking

python
_, ax = plt.subplots()
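The unpacking behaviour described above can be checked directly. This is a small sketch; the Agg backend and the 1x2 grid are assumptions chosen for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

# plt.subplots() returns a 2-tuple: (Figure, Axes)
result = plt.subplots()
fig, ax = result

# With a 1x2 grid, the second element is an array holding two Axes objects
fig2, axes = plt.subplots(1, 2)
left_ax, right_ax = axes

plt.close("all")
```

Because the return value is an ordinary tuple, any valid unpacking pattern works, including the underscore placeholder shown above.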

Set labels and Line Width

Below we customize the labels and line width

python
import matplotlib.pyplot as plt

squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()
ax.plot(squares, linewidth=3)
ax.set_title("Square Numbers", fontsize=24) 
# It's best to use English for labels to avoid text display issues
ax.set_xlabel("Value", fontsize=14) 
ax.set_ylabel("Square of Value", fontsize=14) 

# Set size of tick labels.
ax.tick_params(axis='both', labelsize=14) 

plt.show()

Correcting the Figure

If you look closely at the current figure, you'll notice that the x-axis and y-axis data are actually misaligned. At x=0 the value is y=1, and at the last point x=4 the value is y=25

This is because according to ax.plot, the plot method accepts an optional x-axis parameter. When only one dataset is passed, the x-axis data defaults to range(len(y)). So the current chart's x-axis data is the default range(len(squares))
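To see the default x values for yourself, you can inspect the line object that ax.plot returns. This is a quick sketch; the Agg backend is an assumption so it runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()

# ax.plot returns a list of Line2D objects; one dataset gives one line
(line,) = ax.plot(squares)

# With no x data supplied, the x values are simply the positions 0..len(y)-1
default_x = [int(x) for x in line.get_xdata()]
print(default_x)  # [0, 1, 2, 3, 4]

plt.close(fig)
```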

Therefore, we just need to specify the corresponding x-axis data

python
import matplotlib.pyplot as plt

input_values = [1, 2, 3, 4, 5] 
squares = [1, 4, 9, 16, 25]
fig, ax = plt.subplots()
ax.plot(input_values, squares, linewidth=3)
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)

# Set size of tick labels.
ax.tick_params(axis="both", labelsize=14)

plt.show()

Set Plot Styles

You can use plt.style.available to see which styles are available on your system, then use plt.style.use('Solarize_Light2') to apply a theme

python
import matplotlib.pyplot as plt

print(plt.style.available)
"""
output: ['Solarize_Light2', '_classic_test_patch', '_mpl-gallery', '_mpl-gallery-nogrid', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-v0_8', 'seaborn-v0_8-bright', 'seaborn-v0_8-colorblind', 'seaborn-v0_8-dark',
'seaborn-v0_8-dark-palette', 'seaborn-v0_8-darkgrid', 'seaborn-v0_8-deep', 'seaborn-v0_8-muted', 'seaborn-v0_8-notebook', 'seaborn-v0_8-paper', 'seaborn-v0_8-pastel', 'seaborn-v0_8-poster', 'seaborn-v0_8-talk', 'seaborn-v0_8-ticks', 'seaborn-v0_8-white', 'seaborn-v0_8-whitegrid', 'tableau-colorblind10']
"""
Code Example
python
import matplotlib.pyplot as plt

input_values = [1, 2, 3, 4, 5]
squares = [1, 4, 9, 16, 25]

plt.style.use("Solarize_Light2")
fig, ax = plt.subplots()
ax.plot(input_values, squares, linewidth=3)
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)

# Set size of tick labels.
ax.tick_params(axis="both", labelsize=14)

plt.show()
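plt.style.use changes the style for the rest of the session. If you only want a style for one chart, a possible alternative is the plt.style.context context manager, sketched below. The Agg backend is an assumption so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt

input_values = [1, 2, 3, 4, 5]
squares = [1, 4, 9, 16, 25]

facecolor_before = plt.rcParams["axes.facecolor"]

# The style only applies to figures created inside the with block
with plt.style.context("Solarize_Light2"):
    fig, ax = plt.subplots()
    ax.plot(input_values, squares, linewidth=3)
    ax.set_title("Square Numbers", fontsize=24)

# After the block, the previous settings are restored
facecolor_after = plt.rcParams["axes.facecolor"]
plt.close(fig)
```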

Scatter

Scatter plots (Scatters) are useful when you need to configure individual points, such as when plotting large datasets.

To plot a single point, use the scatter() method. See matplotlib.pyplot.scatter for details

python
import matplotlib.pyplot as plt

plt.style.use("seaborn-v0_8")  # "seaborn" was renamed to "seaborn-v0_8" in Matplotlib 3.6
fig, ax = plt.subplots()
ax.scatter(2, 4)

plt.show()

Set scatter style

Then set the chart styles

python
import matplotlib.pyplot as plt

plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
# Set scatter point size and color
ax.scatter(2, 4, s=200, c="red")
# ax.scatter(1, 3, c="blue")

# Set chart title and label axes
ax.set_title("Square Numbers", fontsize=24) 
ax.set_xlabel("Value", fontsize=14) 
ax.set_ylabel("Square of Value", fontsize=14) 

# Set size of tick labels
ax.tick_params(axis='both', which='major', labelsize=14) 

plt.show()

Set multiple scatters

You can plot multiple points by passing sequences of x and y values of the same length. You can also let Python compute the data, for example with range() and a list comprehension

python
import matplotlib.pyplot as plt

x_values = range(1, 1001) 
y_values = [v**2 for v in x_values] 

plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, s=10, c="red")

# Set chart title and label axes
--snip--
# Set the range for each axis
ax.axis([0, 1100, 0, 1100000])

Set Color Map

Besides manually setting colors with c='red', pyplot has a built-in set of color maps. You can dynamically assign colors to scatter points by passing an array and specifying a color set. To use these color maps, you need to tell pyplot how to set the colors in the dataset.

python
import matplotlib.pyplot as plt

x_values = range(1, 1001)
y_values = [x**2 for x in x_values]

ax.scatter(x_values, y_values, c=y_values, cmap=plt.cm.Reds, s=10)

# Set chart title and label axes.
--snip--
Code Example
python
import matplotlib.pyplot as plt

x_values = range(1, 1001)
y_values = [v**2 for v in x_values]
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.scatter(x_values, y_values, s=10, c=y_values, cmap=plt.cm.Reds)


# Set chart title and label axes
ax.set_title("Square Numbers", fontsize=24)
ax.set_xlabel("Value", fontsize=14)
ax.set_ylabel("Square of Value", fontsize=14)

# Set size of tick labels
ax.tick_params(axis="both", which="major", labelsize=14)

# Set the range for each axis
ax.axis([0, 1100, 0, 1100000])
# plt.show()
plt.savefig("test.png", bbox_inches="tight")
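To make the color mapping less of a black box, here is a rough sketch of what happens to the values passed as c: they are normalized to the range 0 to 1 and then looked up in the color map. The explicit Normalize step and the five sample values are assumptions for illustration; scatter performs an equivalent normalization internally by default.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import Normalize

y_values = np.array([1, 4, 9, 16, 25], dtype=float)

# Scale the data linearly so the smallest value maps to 0 and the largest to 1
norm = Normalize(vmin=y_values.min(), vmax=y_values.max())

# Look up each normalized value in the Reds color map: one RGBA row per point
colors = plt.cm.Reds(norm(y_values))

lightest, darkest = colors[0], colors[-1]
```

In the Reds map, low values come out as pale colors and high values as deep red, which is why the right-hand end of the squares curve looks darker.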

Random Walk

A random walk is a path determined by a series of random decisions: each step is chosen at random, with no overall direction.

First, we create a random walk class. Set the starting point to (0, 0) and pass a target value num_points to simulate the number of steps

Then declare a utility function get_random_step to randomly generate the direction and distance for each step. Inside fill_walk, a while loop keeps adding steps until the walk contains num_points points (5000 by default)

About the random module

The random module implements pseudo-random number generators for various distributions.

random.choice(seq) returns a random element from the non-empty sequence seq.

Pseudo-random means the values look random but are actually produced by a fixed and predictable process.
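The "fixed and predictable process" can be demonstrated by seeding the generator: the same seed yields the same sequence of choices. The helper name sample_choices and the seed value 42 are made up for this sketch.

```python
import random


def sample_choices(seed, count=5):
    """Return `count` random directions drawn from a generator with a fixed seed."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    return [rng.choice([-1, 1]) for _ in range(count)]


first_run = sample_choices(seed=42)
second_run = sample_choices(seed=42)

# Same seed, same "random" sequence
print(first_run == second_run)  # True
```

Seeding is handy when you want a random walk that can be reproduced exactly, for example while debugging the plotting code.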

Update Styles

The styles mentioned in the Scatter chapter can also be applied to the random walk scatter plot. The following code removes the scatter point outlines and colors all points along the num_points dimension

python
ax.scatter(rw.x_values, rw.y_values, s=15, c=range(rw.num_points), cmap=plt.cm.Blues, edgecolors="none")

Meanwhile, we may want to apply special styles to the start and end points of the random walk. We can declare an emphasize function inside draw_plot to update the styles of specific points

python
    # Inside draw_plot(), after the main ax.scatter() call
    def emphasize_point(idx, **scatter_args):
        final_args = dict(s=100, edgecolors="none", c="red")
        final_args.update(scatter_args)
        ax.scatter(
            rw.x_values[idx],
            rw.y_values[idx],
            **final_args,
        )

    emphasize_point(0, c="green")
    emphasize_point(-1, c="red", s=120)
Code Example
python
from random import choice
import matplotlib.pyplot as plt


def get_random_step(distance=(0, 1, 2, 3, 4)):
    """Determine the direction and distance for each step"""
    direction = choice([-1, 1])
    return direction * choice(distance)


class RandomWalk:
    """A class to generate random walks."""

    def __init__(self, num_points=5000):
        """Initialize attributes of a walk"""
        self.num_points = num_points

        # All walks start at (0,0)
        self.x_values = [0]
        self.y_values = [0]

    def fill_walk(self):
        """Calculate all the points in the walk"""

        while len(self.x_values) < self.num_points:
            # Decide which direction to go and how far to go in that direction
            x_step = get_random_step()
            y_step = get_random_step()

            # Reject moves that go nowhere
            if x_step == 0 and y_step == 0:
                continue
            # calculate the next x and y values
            x = self.x_values[-1] + x_step
            y = self.y_values[-1] + y_step

            self.x_values.append(x)
            self.y_values.append(y)


def draw_plot():
    # Create a random walk instance and fill it
    rw = RandomWalk()
    rw.fill_walk()
    # plot all the points in the walk
    plt.style.use("classic")
    fig, ax = plt.subplots()

    ax.scatter(
        rw.x_values,
        rw.y_values,
        s=15,
        c=range(rw.num_points),
        cmap=plt.cm.Blues,
        edgecolors="none",
    )

    def emphasize_point(idx, **scatter_args):
        """Emphasize a point in the plot"""
        final_args = dict(s=100, edgecolors="none", c="red")
        final_args.update(scatter_args)
        ax.scatter(
            rw.x_values[idx],
            rw.y_values[idx],
            **final_args,
        )

    emphasize_point(0, c="green")
    emphasize_point(-1, c="red", s=120)
    plt.show()


if __name__ == "__main__":
    draw_plot()

Plotly

Plotly can generate interactive charts, and the charts it produces automatically scale to fit the viewer's screen

According to the official documentation, plotly's interactivity is based on the Plotly JavaScript library. Therefore, we can use Plotly module features to export interactive html

python
from plotly import offline
offline.plot({"data": data, "layout": my_layout}, filename="d6.html")

This was the recommended way to display figures before plotly.py version 4. From version 4 onward, you can use the renderers framework (for example fig.show()) to display charts. The renderers framework is a generalization of plotly.offline.iplot and plotly.offline.plot

Code Example
python
from plotly.graph_objects import Bar, Layout
# from plotly.graph_objects import Figure

import plotly.offline as offline
from random import randint

class Die:
    """A class representing a single die"""

    def __init__(self, num_sides=6):
        """default to a six-sided die"""
        self.num_sides = num_sides

    def roll(self):
        """return a random value between 1 and number of sides"""
        return randint(1, self.num_sides)


# def draw_plot():
# Create a D6
die = Die()

# make some rolls and save the results in a list
results = []
for roll_num in range(1000):
    result = die.roll()
    results.append(result)

# analyze the results
frequencies = []
for value in range(1, die.num_sides + 1):
    frequencies.append(results.count(value))

x_values = list(range(1, die.num_sides + 1))
data = [Bar(x=x_values, y=frequencies)]

x_axis_config = {"title": "Result"}
y_axis_config = {"title": "Frequency of the result"}
my_layout = Layout(
    title="Roll a D6 1000 times", xaxis=x_axis_config, yaxis=y_axis_config
)
# Figure(data=data, layout=my_layout).show()
offline.plot({"data": data, "layout": my_layout}, filename="d6.html")
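The roll-and-count steps in the example can also be written more compactly. The following is an alternative sketch using collections.Counter, which is not what the code above uses; it produces the same kind of frequencies list.

```python
from collections import Counter
from random import randint

num_sides = 6

# Roll the die 1000 times
results = [randint(1, num_sides) for _ in range(1000)]

# Counter tallies every value in one pass; missing values count as 0
counts = Counter(results)
frequencies = [counts[value] for value in range(1, num_sides + 1)]
```

Counting with results.count(value) scans the whole list once per side, while Counter scans it once in total. For 1000 rolls the difference is negligible, so either form is fine.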

Mass Data Process

Generally, the data we use for visualization is not self-generated. It is collected from the web through scraping and other methods, then processed further.

By storing and processing visualization data, we can discover patterns and correlations that others have not found

CSV

CSV files store data by writing values separated by commas (comma-separated values) into a file

For example, in the file assets/data/sitka_weather_07-2018_simple.csv, the following text content

"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"
"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"

actually represents the data in the following table

STATION     | NAME                 | DATE       | PRCP | TAVG | TMAX | TMIN
USW00025333 | SITKA AIRPORT, AK US | 2018-07-01 | 0.25 |      | 62   | 50
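Two details of this sample are easy to miss: the comma inside the quoted station name is not a separator, and the empty TAVG field becomes an empty string. A short sketch using the two lines shown above, parsed from an in-memory string rather than a file:

```python
import csv
import io

sample = (
    '"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"\n'
    '"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"\n'
)

# io.StringIO lets csv.reader treat the string like an open file
rows = list(csv.reader(io.StringIO(sample)))
header_row, first_row = rows

print(first_row[1])        # SITKA AIRPORT, AK US
print(repr(first_row[4]))  # ''
```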

Analyze CSV Data

Python has a built-in csv module for convenient data processing. Combined with file operations, you can read the header row of a file

python
import csv
filename = 'assets/data/sitka_weather_07-2018_simple.csv'
with open(filename) as f:
  reader = csv.reader(f)
  # next() is a built-in function that returns the next item from the reader, here the first row
  header_row = next(reader)
  print(header_row) # ['STATION', 'NAME', 'DATE', 'PRCP', 'TAVG', 'TMAX', 'TMIN']

You can also use the built-in enumerate function to display each header and its index more clearly

python
--snip--
  # next() is a built-in function that returns the next item from the reader, here the first row
  header_row = next(reader)
  print(header_row) 
  for idx, col_header in enumerate(header_row): 
    print(idx, col_header) 

Similarly, by iterating through each row of data, you can get a list of the highest temperatures from the file

python
--snip--
  # next() is a built-in function that returns the next item from the reader, here the first row
  header_row = next(reader)
  for idx, col_header in enumerate(header_row): 
    print(idx, col_header) 
  highs = [] 
  for row in reader: 
    high = int(row[5]) 
    highs.append(high) 

Or use list comprehension highs = [int(row[5]) for row in reader] to quickly generate the data list
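If you would rather not track column indices at all, csv.DictReader is a possible alternative: it uses the header row as keys. This is a sketch on in-memory data; the second data row is invented for illustration.

```python
import csv
import io

sample = (
    '"STATION","NAME","DATE","PRCP","TAVG","TMAX","TMIN"\n'
    '"USW00025333","SITKA AIRPORT, AK US","2018-07-01","0.25",,"62","50"\n'
    # Hypothetical second row, not taken from the real file
    '"USW00025333","SITKA AIRPORT, AK US","2018-07-02","0.01",,"58","53"\n'
)

# DictReader consumes the header row itself and yields one dict per data row
reader = csv.DictReader(io.StringIO(sample))
highs = [int(row["TMAX"]) for row in reader]

print(highs)  # [62, 58]
```

Looking columns up by name keeps the code working if a file orders its columns differently.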

Draw temperature plot

Referring to drawing line charts and other content, you can set highs from the previous section as the y-axis to get a basic line chart

python
import matplotlib.pyplot as plt 
with open(filename) as f:
  --snip--

# Draw the chart
plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.plot(highs, c="red")

# set chart title and label axes
ax.set_title("2018 Daily high temperatures for Sitka, AK", fontsize=24)
ax.set_xlabel("", fontsize=16)
ax.set_ylabel("Temperature(F)", fontsize=16)
ax.tick_params(axis="both", which="major", labelsize=16)

Then use the datetime module to process date data into corresponding values. strftime and strptime are a pair of corresponding APIs. The former formats dates, while the latter parses dates
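A quick round trip shows how the two functions mirror each other. The output format "%d/%m/%Y" is just an example chosen here.

```python
from datetime import datetime

# strptime: parse a string into a datetime object using a format pattern
parsed = datetime.strptime("2018-07-01", "%Y-%m-%d")

# strftime: format a datetime object back into a string
formatted = parsed.strftime("%d/%m/%Y")

print(parsed)     # 2018-07-01 00:00:00
print(formatted)  # 01/07/2018
```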

Why not use the third column directly as x-axis data?

The values in row[2] are plain strings. Parsing them into datetime objects lets Matplotlib treat the x-axis as dates, and we can then call the Figure object's autofmt_xdate method to format and auto-rotate the date labels

Make the following changes to the code

python
import csv
import matplotlib.pyplot as plt
from datetime import datetime 

with open(filename) as f:
    --snip--

    # Get dates and high temperatures from the file
    dates, highs = [], []
    for row in reader:
        high = int(row[5])
        current_date = datetime.strptime(row[2], '%Y-%m-%d') 
        dates.append(current_date) 
        highs.append(high)

--snip--
ax.set_xlabel("", fontsize=16)
fig.autofmt_xdate() 
--snip--

Draw another series

In addition to plotting the highest temperatures, if the data is sufficient, we can also plot the lowest temperatures

python
--snip--
    header_row = next(reader)
    dates, highs, lows = [], [], []
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        dates.append(current_date)
        high = int(row[5])
        highs.append(high)
        lows.append(int(row[6])) 

plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.plot(dates, highs, c="red")
ax.plot(dates, lows, c="blue") 
ax.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1) 

The ax.fill_between method fills the area between two data series

Error Handling

In theory, we can use the code from the previous section to generate any chart containing date, low, and high data. However, sometimes the data source may have issues, such as

  • Data corruption or missing values
  • Incorrect data format

If not handled properly, the program may crash. For example, if we replace the file with assets/data/death_valley_2018_simple.csv and modify the corresponding indices for high and low temperatures, running the current program will raise a ValueError because int(row[4]) encounters an empty value in some row, causing the program to fail

The solution is simple: wrap the conversions in try/except and only store the row when both values parse

python
try:
    high = int(row[5])
    low = int(row[6])
except ValueError:
    print(f"Missing data for {current_date}")
else:
    highs.append(high)
    lows.append(low)
    dates.append(current_date)
Code Example
python
# sitka_highs.py
import csv
import matplotlib.pyplot as plt
from datetime import datetime

filename = "assets/data/sitka_weather_2018_simple.csv"
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    # get high temperature from the file
    dates, highs, lows = [], [], []
    for row in reader:
        current_date = datetime.strptime(row[2], '%Y-%m-%d')
        try:
            high = int(row[5])
            low = int(row[6])
        except ValueError:
            print(f"Missing data for {current_date}")
        else:
            highs.append(high)
            lows.append(low)
            dates.append(current_date)

plt.style.use("seaborn-v0_8")
fig, ax = plt.subplots()
ax.plot(dates, highs, c="red")
ax.plot(dates, lows, c="blue")

# set chart title and label axes
ax.set_title("2018 Daily temperatures for Sitka, AK", fontsize=24)
ax.set_xlabel("", fontsize=16)
ax.fill_between(dates, highs, lows, facecolor='blue', alpha=0.1)
# format x axis according to date
fig.autofmt_xdate()
ax.set_ylabel("Temperature(F)", fontsize=16)
ax.tick_params(axis="both", which="major", labelsize=16)

plt.show()
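To see the try/except/else pattern in isolation, here is a self-contained sketch that runs on a few in-memory rows instead of a weather file. The sample rows, the three-column layout, and the function name read_temperatures are all invented for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical data: the middle row is missing both temperatures
sample = (
    '"DATE","TMAX","TMIN"\n'
    '"2018-02-17","70","45"\n'
    '"2018-02-18",,\n'
    '"2018-02-19","68","41"\n'
)


def read_temperatures(f, date_idx=0, high_idx=1, low_idx=2):
    """Return dates, highs, lows, and the dates of rows that were skipped."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    dates, highs, lows, skipped = [], [], [], []
    for row in reader:
        current_date = datetime.strptime(row[date_idx], "%Y-%m-%d")
        try:
            high = int(row[high_idx])
            low = int(row[low_idx])
        except ValueError:
            # int("") raises ValueError, so incomplete rows end up here
            skipped.append(current_date)
        else:
            dates.append(current_date)
            highs.append(high)
            lows.append(low)
    return dates, highs, lows, skipped


dates, highs, lows, skipped = read_temperatures(io.StringIO(sample))
```

Keeping all three appends in the else branch guarantees that dates, highs, and lows stay the same length, which ax.plot and ax.fill_between require.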

JSON

Similar to CSV, Python also has a built-in JSON module for convenient data processing

Code Example
python
import json

# Explore the structure of the data
filename = "assets/data/json/eq_data_1_day_m1.json"

with open(filename) as f:
    all_eq_data = json.load(f)

readable_file = "assets/data/json/readable_eq_data.json"
with open(readable_file, "w") as f:
    json.dump(all_eq_data, f, indent=4)
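Once the data is loaded, it is an ordinary structure of Python dicts and lists. The sketch below uses a tiny invented document in place of the earthquake file; the keys features, properties, mag, and place are assumptions about its layout, not something confirmed by the text above.

```python
import json

# Hypothetical, heavily simplified stand-in for the earthquake data
compact = (
    '{"type": "FeatureCollection", "features": ['
    '{"properties": {"mag": 1.2, "place": "example place A"}}, '
    '{"properties": {"mag": 4.3, "place": "example place B"}}]}'
)

# json.loads parses a string; json.load (used above) reads from a file object
data = json.loads(compact)

# json.dumps with indent=4 produces the same readable layout as json.dump
readable = json.dumps(data, indent=4)

# Walk the nested structure to collect one value per entry
mags = [feature["properties"]["mag"] for feature in data["features"]]
print(mags)  # [1.2, 4.3]
```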

Process JSON Response

A Web API is part of a website that uses specific URLs to request particular information. We can use Web APIs to request external data sources and visualize the latest data. The response is usually returned in easy-to-process formats like JSON or CSV

For example, https://api.github.com/search/repositories?q=language:python&sort=stars. This endpoint returns information about Python projects hosted on GitHub, sorted by stars

To make requests to the endpoint, we use the external module requests

python
import requests

# Make an API call and store the response
url = "https://api.github.com/search/repositories?q=language:python&sort=stars"

headers = {"Accept": "application/vnd.github.v3+json"}
r = requests.get(url, headers=headers)
print(f"Status code: {r.status_code}") # Status code: 200

# Store API response in a variable
response_dict = r.json()

# Process results
print(response_dict.keys()) # dict_keys(['total_count', 'incomplete_results', 'items'])
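From here, the usual next step is to pull the fields you want to chart out of items. The sketch below works on a small hand-written dictionary shaped like the response, so it runs without a network call; the repository names, star counts, and URLs are invented.

```python
# Hypothetical, trimmed-down stand-in for r.json()
response_dict = {
    "total_count": 2,
    "incomplete_results": False,
    "items": [
        {
            "name": "example-repo-a",
            "stargazers_count": 1200,
            "html_url": "https://example.com/a",
        },
        {
            "name": "example-repo-b",
            "stargazers_count": 800,
            "html_url": "https://example.com/b",
        },
    ],
}

# Each entry in "items" is a dict describing one repository
repo_dicts = response_dict["items"]
repo_names = [repo["name"] for repo in repo_dicts]
stars = [repo["stargazers_count"] for repo in repo_dicts]

print(repo_names)  # ['example-repo-a', 'example-repo-b']
print(stars)       # [1200, 800]
```

Lists like repo_names and stars can then be passed as the x and y values of a bar chart.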

For the Plotly charting and data processing workflow, see the Source Code