Distances#

In this notebook, we will use aeon for time series distance computation.

Preliminaries#

[9]:

import matplotlib.pyplot as plt
import numpy as np

from aeon.datasets import load_macroeconomic
from aeon.distances import distance

Distances#

The goal of a distance computation is to measure how dissimilar two time series ‘x’ and ‘y’ are. A distance function takes x and y as parameters and returns a float: the computed distance between them. The value should be 0.0 when the time series are identical and greater than 0.0 when they differ, with larger values indicating greater dissimilarity.
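As a minimal sketch of the contract described above, the hypothetical helper ‘euclidean_distance’ below returns 0.0 for identical series and a positive float otherwise. It is plain Python written for illustration, not aeon's implementation, and it assumes univariate series of equal length.

```python
import math


def euclidean_distance(x, y):
    """Hypothetical helper: Euclidean distance between two equal-length series."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    # Square root of the summed squared pointwise differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 5.0]

print(euclidean_distance(x, x))  # identical series -> 0.0
print(euclidean_distance(x, y))  # differ only in the last point -> 2.0
```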

Take the following four time series:

[10]:

X = load_macroeconomic()
# Drop the first 3 observations so that the remainder splits into four equal segments
country_d, country_c, country_b, country_a = np.split(X["realgdp"].to_numpy()[3:], 4)

# Labels match the variable names used in the rest of the notebook
plt.plot(country_a, label="Country A")
plt.plot(country_b, label="Country B")
plt.plot(country_c, label="Country C")
plt.plot(country_d, label="Country D")
plt.xlabel("Quarters from 1959")
plt.ylabel("GDP")
plt.legend()

Note: load_macroeconomic requires the soft dependency statsmodels, which is not included in the base aeon installation. Install it with ‘pip install statsmodels’, or install all soft dependencies with ‘pip install aeon[all_extras]’.

The code above plots a made-up scenario comparing the GDP of four countries (country A, B, C and D) by quarter from 1959. If our task is to determine how different country C is from the other countries, one way to do this is to measure the distance between country C and each of the others.

How to use the distance module for tasks such as this is outlined below.

Distance module#

To begin using the distance module we need at least two time series, x and y, and they must be numpy arrays. We established the time series for this example above as country_a, country_b, country_c and country_d. To compute the distance between x and y we can use the Euclidean distance, as shown:

[3]:

# Simple euclidean distance
distance(country_a, country_b, metric="euclidean")

[3]:

27014.721294922445

As shown above, taking the distance between country_a and country_b returns a single float that represents their dissimilarity (distance). We can do the same again, this time comparing country_a to country_d:

[4]:

distance(country_a, country_d, metric="euclidean")

[4]:

58340.14674572803

Comparing the results of the two distance computations, we find that country_a is closer to country_b than to country_d (27014.7 < 58340.1).

We can further confirm this result by looking at the plot above: the country_b line lies closer to the country_a line than the country_d line does.

Different metric parameters#

Above we used the metric “euclidean”. While Euclidean distance is appropriate for simple examples such as the one above, it has been shown to be inadequate for larger and more complex time series (particularly multivariate ones). The merits of each distance won’t be described here (see the documentation for descriptions of each), but a large number of specialised time series distances have been implemented to give more accurate distance computations. These are:

‘euclidean’, ‘squared’, ‘dtw’, ‘ddtw’, ‘wdtw’, ‘wddtw’, ‘lcss’, ‘edr’, ‘erp’

All of the above can be used as a metric parameter. This will now be demonstrated:

[5]:

print("Euclidean distance: ", distance(country_a, country_d, metric="euclidean"))
print("Squared euclidean distance: ", distance(country_a, country_d, metric="squared"))
print("Dynamic time warping distance: ", distance(country_a, country_d, metric="dtw"))
print(
    "Derivative dynamic time warping distance: ",
    distance(country_a, country_d, metric="ddtw"),
)
print(
    "Weighted dynamic time warping distance: ",
    distance(country_a, country_d, metric="wdtw"),
)
print(
    "Weighted derivative dynamic time warping distance: ",
    distance(country_a, country_d, metric="wddtw"),
)
print(
    "Longest common subsequence distance: ",
    distance(country_a, country_d, metric="lcss"),
)
print(
    "Edit distance for real sequences distance: ",
    distance(country_a, country_d, metric="edr"),
)
print(
    "Edit distance for real penalty distance: ",
    distance(country_a, country_d, metric="erp"),
)

Euclidean distance:  58340.14674572803
Squared euclidean distance:  3403572722.3130813
Dynamic time warping distance:  3403572722.3130813
Derivative dynamic time warping distance:  175072.58701887555
Weighted dynamic time warping distance:  1701786361.1565406
Weighted derivative dynamic time warping distance:  87536.29350943778
Longest common subsequence distance:  1.0
Edit distance for real sequences distance:  1.0
Edit distance for real penalty distance:  411654.25899999996

While many of the above use Euclidean distance at their core, they change how it is applied to account for various problems we encounter with time series data, such as alignment, phase, shape and dimensionality. As mentioned, for specific details on what each distance does and how best to use it, see the documentation for that distance.
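To make the alignment point concrete, here is a minimal sketch comparing a pointwise squared distance with a textbook dynamic time warping recurrence on two series that contain the same bump, shifted by one step. The helpers ‘squared_distance’ and ‘dtw_distance’ are hypothetical plain-Python illustrations that assume a squared pointwise cost. They are not aeon's implementations.

```python
def squared_distance(x, y):
    """Hypothetical helper: sum of squared pointwise differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dtw_distance(x, y):
    """Hypothetical helper: DTW with squared pointwise cost, O(len(x) * len(y))."""
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] holds the cheapest alignment of x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


x = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
y = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]  # same bump, one step earlier

print("squared:", squared_distance(x, y))  # penalises the phase shift -> 4.0
print("dtw:", dtw_distance(x, y))  # warps the shift away -> 0.0
```

The pointwise distance treats the shift as a real difference, while the warping path absorbs it. That is the kind of alignment problem the elastic distances listed above are designed to handle.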

Custom parameters for distances#

In addition, each distance has its own set of parameters. How these are passed to the ‘distance’ function is outlined below using ‘dtw’ as an example. As stated, for the specific parameters of each distance please refer to the documentation.

DTW is an O(n^2) algorithm, so much work has focused on optimising it. One proposal to improve performance is to restrict the potential alignment path by putting a ‘bound’ on the values considered when looking for an alignment. While many bounding algorithms have been proposed, the most popular is the Sakoe-Chiba bounding window.

First, here is a bounding matrix that considers all indexes in x and y:

[3]:

from aeon.distances import create_bounding_matrix

first_ts_size = 10
second_ts_size = 5

create_bounding_matrix(first_ts_size, second_ts_size)

[3]:

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True]])

The above shows a matrix that maps each index in ‘x’ to each index in ‘y’. Each value that is considered in the computation is set to True (in this instance we want a full bounding matrix, so all values are True). As such, if we were to run DTW with this bounding matrix, all of these indexes would be considered.

However, depending on the dataset, it is sometimes unnecessary (and sometimes detrimental to the result) to consider all indexes. We can use a bounding technique such as Sakoe-Chiba’s to limit the potential warping paths. This is done by setting a window size that restricts the indexes that are considered. Below we create a bounding matrix again, this time with a window of 0.2 (20% of the indexes in x and y):

[5]:

create_bounding_matrix(first_ts_size, second_ts_size, window=0.2)

[5]:

array([[ True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True],
       [False,  True,  True,  True,  True],
       [False, False,  True,  True,  True],
       [False, False, False,  True,  True],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]])

This bounding matrix produces a diagonal band over the matrix. This restricts the warping and therefore greatly increases the speed at which the distance is computed, as far fewer potential warping paths are considered.
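The idea of a diagonal band can be sketched in a few lines of plain Python. The helper ‘sakoe_chiba_mask’ below is hypothetical and makes its own assumptions: the band radius is the window fraction times the longer series length, and the band is centred on the scaled diagonal. It illustrates the concept only and is not expected to reproduce the output of create_bounding_matrix cell for cell.

```python
def sakoe_chiba_mask(n, m, window):
    """Hypothetical helper: True where cell (i, j) lies inside the band."""
    # Assumption: radius is the window fraction of the longer series length
    radius = int(window * max(n, m))
    mask = []
    for i in range(n):
        # Project row i onto the column axis so the band follows the
        # diagonal even when the two series have different lengths.
        centre = i * (m - 1) / max(n - 1, 1)
        mask.append([abs(j - centre) <= radius for j in range(m)])
    return mask


# Print a 6 x 6 band with a 20% window: '#' is inside, '.' is outside
for row in sakoe_chiba_mask(6, 6, window=0.2):
    print("".join("#" if cell else "." for cell in row))
```

With a window of 1.0 every cell falls inside the band, which corresponds to the full bounding matrix shown earlier.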

With that basic introduction to bounding algorithms and why we may want to use them, how do we use one in our distance computation? One way is to pass the window directly to the ‘distance’ function:

[11]:

# Create two random unaligned time series to better illustrate the difference

rng = np.random.RandomState(42)
n_timestamps, n_features = 10, 19
x = rng.randn(n_timestamps, n_features)
y = rng.randn(n_timestamps, n_features)

# Restrict the warping path by passing a window (a fraction of the series
# length) to the distance function; see the documentation for accepted values:
print(
    "Dynamic time warping distance with Sakoe-Chiba: ",
    distance(x, y, metric="dtw", window=0.2),
)  # Sakoe chiba

Dynamic time warping distance with Sakoe-Chiba:  312.6023735352965
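To show how a window changes the result, here is a minimal sketch of a banded DTW. The helper ‘dtw_distance’ and its ‘radius’ parameter are hypothetical: the band is given as an absolute number of steps rather than a fraction, the series are assumed to be univariate and of equal length, and the pointwise cost is assumed to be squared. It is not aeon's implementation.

```python
def dtw_distance(x, y, radius=None):
    """Hypothetical helper: DTW with squared cost and an optional band radius."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if radius is not None and abs(i - j) > radius:
                continue  # cell lies outside the band, so it stays unreachable
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


x = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
y = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # same bump, two steps earlier

full = dtw_distance(x, y)  # unconstrained warping
banded = dtw_distance(x, y, radius=1)  # may warp by at most one step
strict = dtw_distance(x, y, radius=0)  # no warping, pointwise squared cost

print("full:", full)
print("banded:", banded)
print("strict:", strict)
```

A narrower band can only remove candidate warping paths, so the distance never decreases as the radius shrinks. The trade-off is fewer cells to evaluate in exchange for a possibly larger distance.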
