Using aeon distances with scikit-learn

scikit-learn has a range of algorithms based on distances, including classifiers, regressors and clusterers. Most of these can be used with aeon distances in one of two ways:

  1. Pass the distance function as a callable to the metric or kernel parameter in the constructor, or

  2. Set the metric or kernel parameter to "precomputed" in the constructor and pass a pairwise distance matrix to fit and predict.
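
The two options above can be sketched with a toy example. This is a minimal illustration, not aeon code: `abs_distance` and `pairwise` are hypothetical stand-ins for an aeon distance function and its pairwise counterpart, and the data is random.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def abs_distance(a, b):
    """Hypothetical stand-in for an aeon distance: sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))


def pairwise(A, B):
    """Hypothetical stand-in for a pairwise function: rows of A against rows of B."""
    return np.array([[abs_distance(a, b) for b in B] for a in A])


rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 30))  # 20 series of length 30, stored as 2D
y_train = np.array([0, 1] * 10)
X_test = rng.normal(size=(5, 30))

# Option 1: pass the distance function as a callable.
knn_callable = KNeighborsClassifier(n_neighbors=1, metric=abs_distance)
knn_callable.fit(X_train, y_train)
pred_callable = knn_callable.predict(X_test)

# Option 2: precompute the distance matrices and pass them to fit and predict.
knn_pre = KNeighborsClassifier(n_neighbors=1, metric="precomputed")
knn_pre.fit(pairwise(X_train, X_train), y_train)
pred_pre = knn_pre.predict(pairwise(X_test, X_train))

print(pred_callable, pred_pre)
```

Both routes use the same distances, so they should give the same predictions. The difference is where the distance computation happens.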

K-Nearest Neighbour Classification and Regression in sklearn.neighbors

[1]:
from sklearn.neighbors import (
    KNeighborsClassifier,
    KNeighborsRegressor,
    KNeighborsTransformer,
    NearestCentroid,
)

If we have a univariate problem stored as a 2D numpy array of shape (n_cases, n_timepoints), we can apply these estimators directly, but they treat each series as a feature vector rather than as a time series.

If we try to use them with an aeon-style 3D numpy array of shape (n_cases, n_channels, n_timepoints), they raise an error. See data_formats for details on data storage.
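
As a small sketch of the two layouts, the following uses plain numpy with placeholder data (zeros in the GunPoint training shape) to move between the 3D and 2D forms for a univariate problem. The variable names are illustrative only.

```python
import numpy as np

# Placeholder collection: 50 cases, 1 channel, 150 time points.
X3d = np.zeros((50, 1, 150))

# Flatten to the 2D layout scikit-learn expects. For a univariate problem this
# just drops the channel axis; with several channels it would concatenate them.
X2d = X3d.reshape(X3d.shape[0], -1)

# Restore the 3D layout by re-inserting a channel axis of length 1.
X3d_again = X2d[:, np.newaxis, :]

print(X3d.shape, X2d.shape, X3d_again.shape)
```

Using `reshape` rather than `squeeze` keeps the case axis intact even when the collection holds a single case.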

[2]:
from aeon.datasets import load_gunpoint

trainx1, trainy1 = load_gunpoint(split="TRAIN", return_type="numpy2D")
testx1, testy1 = load_gunpoint(split="TEST", return_type="numpy2D")
# Use directly on TSC data with standard scikit distances
knn = KNeighborsClassifier(metric="manhattan")
knn.fit(trainx1, trainy1)
print(
    "KNN with manhattan distance on 2D time series data first five ",
    knn.predict(testx1)[:5],
)
# The default return type is a 3D numpy array
trainx2, trainy2 = load_gunpoint(split="TRAIN")
testx2, testy2 = load_gunpoint(split="TEST")
print("Shape of train = ", trainx2.shape, "sklearn will crash")
try:
    knn.fit(trainx2, trainy2)
except ValueError:
    print(
        "raises a ValueError: Found array with dim 3. "
        "KNeighborsClassifier expected <= 2."
    )
KNN with manhattan distance on 2D time series data first five  ['1' '2' '2' '1' '1']
Shape of train =  (50, 1, 150) sklearn will crash
raises a ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.

We can use KNeighborsClassifier with a callable aeon distance function, but the input must still be a 2D numpy array. The same callables also work in other sklearn.neighbors estimators.

[3]:
from aeon.distances import (
    dtw_distance,
    edr_distance,
    erp_distance,
    msm_distance,
    twe_distance,
)

# Use directly on TSC data with aeon distance function
knn = KNeighborsClassifier(metric=dtw_distance)
knn.fit(trainx1, trainy1)
print(
    "KNN with DTW on 2D time series data first five predictions= ",
    knn.predict(testx1)[:5],
)
try:
    knn.fit(trainx2, trainy2)
except ValueError:
    print(
        "raises a ValueError: Found array with dim 3. "
        "KNeighborsClassifier expected <= 2."
    )
nc = NearestCentroid(metric=erp_distance)
nc.fit(trainx1, trainy1)
print(
    "nc with ERP on 2D time series data first five predictions= ",
    nc.predict(testx1)[:5],
)
kt = KNeighborsTransformer(metric=edr_distance)
kt.fit(trainx1, trainy1)
print(
    "kt with EDR on 2D time series data transform shape = ", kt.transform(testx1).shape
)
KNN with DTW on 2D time series data first five predictions=  ['1' '2' '2' '1' '2']
raises a ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.
nc with ERP on 2D time series data first five predictions=  ['1' '1' '2' '1' '1']
C:\Code\aeon\venv\lib\site-packages\sklearn\neighbors\_nearest_centroid.py:179: UserWarning: Averaging for metrics other than euclidean and manhattan not supported. The average is set to be the mean.
  warnings.warn(
kt with EDR on 2D time series data transform shape =  (150, 50)

Note also that a callable will not work with some KNN algorithm options: algorithm="kd_tree", for example, raises an error stating that kd_tree does not support a callable metric. Because of these problems, we have implemented a KNN classifier and regressor for use with our distance functions.
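
The kd_tree restriction can be reproduced with a short sketch. It uses random data and a hypothetical `abs_distance` function in place of an aeon distance, and records whether fitting succeeds for two algorithm options.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def abs_distance(a, b):
    """Hypothetical stand-in for an aeon distance function."""
    return float(np.sum(np.abs(a - b)))


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
y = np.array([0, 1] * 5)

errors = {}
for algorithm in ("brute", "kd_tree"):
    try:
        knn = KNeighborsClassifier(
            n_neighbors=1, algorithm=algorithm, metric=abs_distance
        )
        knn.fit(X, y)
        errors[algorithm] = None  # fitted without complaint
    except ValueError as exc:
        errors[algorithm] = str(exc)  # keep the message for inspection

for algorithm, message in errors.items():
    print(algorithm, "->", "ok" if message is None else message)
```

The exact wording of the error message may differ between scikit-learn versions, so the sketch only relies on a ValueError being raised.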

[4]:
from aeon.classification.distance_based import KNeighborsTimeSeriesClassifier
from aeon.datasets import load_covid_3month  # Regression problem
from aeon.regression.distance_based import KNeighborsTimeSeriesRegressor

knn1 = KNeighborsTimeSeriesClassifier(distance="msm")
# The aeon estimator accepts the 3D collection directly
knn1.fit(trainx2, trainy2)
print(
    "KNN classification with MSM 3D time series data first five predictions=",
    knn1.predict(testx2)[:5],
)
trainx3, trainy3 = load_covid_3month(split="train")
testx3, testy3 = load_covid_3month(split="test")
knn2 = KNeighborsTimeSeriesRegressor(distance="twe", n_neighbors=1)
knn2.fit(trainx3, trainy3)
print(
    "aeon KNN regression with TWE first five predictions=",
    knn2.predict(testx3)[:5],
)
knr = KNeighborsRegressor(metric=twe_distance, n_neighbors=1)
trainx4 = trainx3.squeeze()
testx4 = testx3.squeeze()
knr.fit(trainx4, trainy3)
print(
    "sklearn KNN regression with TWE first five predictions=",
    knr.predict(testx4)[:5],
)
KNN classification with MSM 3D time series data first five predictions= ['1' '2' '2' '1' '1']
aeon KNN regression with TWE first five predictions= [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]
sklearn KNN regression with TWE first five predictions= [0.02558824 0.00877193 0.01960784 0.03533314 0.00805611]

Another alternative is to pass the distances as precomputed matrices. Note that this requires computing the full pairwise distance matrices (train to train for fit, test to train for predict), and it still cannot be used with algorithm options such as kd_tree.
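
To make the shapes of the precomputed matrices concrete, here is a numpy-only sketch. `abs_distance` and `pairwise_sketch` are hypothetical helpers rather than aeon functions, and the 1-NN step is done by hand purely to show how the test matrix is read.

```python
import numpy as np


def abs_distance(a, b):
    """Hypothetical stand-in for an aeon distance function."""
    return float(np.sum(np.abs(a - b)))


def pairwise_sketch(A, B):
    """Distance from every row of A to every row of B."""
    return np.array([[abs_distance(a, b) for b in B] for a in A])


rng = np.random.default_rng(1)
X_train = rng.normal(size=(6, 20))
y_train = np.array(["1", "2", "1", "2", "1", "2"])
X_test = rng.normal(size=(3, 20))

# fit needs train-to-train distances: shape (n_train, n_train).
train_dists = pairwise_sketch(X_train, X_train)
# predict needs test-to-train distances: shape (n_test, n_train).
test_dists = pairwise_sketch(X_test, X_train)

# 1-NN by hand: each test row picks the training column with the smallest distance.
nearest = test_dists.argmin(axis=1)
predictions = y_train[nearest]

print(train_dists.shape, test_dists.shape, predictions)
```

The training matrix alone takes n_train squared distance evaluations, which is the cost the note above refers to.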

[5]:
from aeon.distances import euclidean_pairwise_distance

train_dists = euclidean_pairwise_distance(trainx2)
test_dists = euclidean_pairwise_distance(testx2, trainx2)
knn = KNeighborsClassifier(metric="precomputed")
knn.fit(train_dists, trainy2)
print("KNN precomputed=", knn.predict(test_dists)[:5])
KNN precomputed= ['1' '2' '2' '1' '1']

Support Vector Machine Classification and Regression in sklearn.svm

[6]:
from sklearn.svm import SVC, SVR, NuSVC, NuSVR

The SVM estimators in scikit-learn can be used with precomputed pairwise distance matrices. Note that not all elastic distance functions define valid kernels, which is desirable for an SVM. DTW is not a metric, but MSM and TWE are.
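
The claim that DTW is not a metric can be checked on tiny series. The sketch below uses a hand-written, unconstrained DTW with squared pointwise cost (`dtw_sketch`, a hypothetical helper and not the aeon implementation) and shows two metric axioms failing.

```python
import numpy as np


def dtw_sketch(a, b):
    """Unconstrained DTW with squared pointwise cost (illustrative only)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


# Identity of indiscernibles fails: two different series at distance zero.
zero_dist = dtw_sketch([0, 1, 1], [0, 0, 1])

# Triangle inequality fails: d(x, z) exceeds d(x, y) + d(y, z).
x, y, z = [0], [1, 2], [2, 3, 3]
d_xy, d_yz, d_xz = dtw_sketch(x, y), dtw_sketch(y, z), dtw_sketch(x, z)

print(zero_dist, d_xy, d_yz, d_xz)
```

Here the two distinct series warp onto each other at zero cost, and the direct distance from x to z is larger than the detour through y.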

[7]:
from aeon.distances import (
    dtw_pairwise_distance,
    msm_pairwise_distance,
    twe_distance,
    twe_pairwise_distance,
)

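svc = SVC(kernel="precomputed")
svr = SVR(kernel="precomputed")
nsvc = NuSVC(kernel="precomputed")
# A callable kernel receives two 2D arrays and must return the full matrix,
# so pass the pairwise function rather than the single-pair distance
nsvr = NuSVR(kernel=twe_pairwise_distance)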
train_m1 = twe_pairwise_distance(trainx1)
test_m1 = twe_pairwise_distance(testx1, trainx1)
svc.fit(train_m1, trainy1)
print("SVC with TWE first five predictions= ", svc.predict(test_m1)[:5])
train_m2 = msm_pairwise_distance(trainx2)
test_m2 = msm_pairwise_distance(testx2, trainx2)
nsvc.fit(train_m2, trainy2)
print("NuSVC with MSM first five predictions= ", nsvc.predict(test_m2)[:5])
train_m3 = dtw_pairwise_distance(trainx3)
test_m3 = dtw_pairwise_distance(testx3, trainx3)
svr.fit(train_m3, trainy3)
print("SVR with DTW first five predictions= ", svr.predict(test_m3)[:5])
SVC with TWE first five predictions=  ['1' '2' '1' '2' '2']
NuSVC with MSM first five predictions=  ['1' '2' '2' '1' '2']
SVR with DTW first five predictions=  [0.08823529 0.08823529 0.08823529 0.08823529 0.08823529]

Clustering with sklearn.cluster

[8]:
from sklearn.cluster import DBSCAN

Some sklearn clustering algorithms accept callable distances or precomputed distance matrices, and these can be used with aeon distance functions.

Note that DBSCAN can only assign labels to the data it was fitted on, so it has no predict method. Use fit_predict instead.
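
A quick way to see this is to inspect a fitted DBSCAN on toy data. The sketch below uses a handful of 2D points standing in for short series. The values of `eps` and `min_samples` are chosen for this toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups and one isolated point.
X = np.array(
    [
        [0.0, 0.0],
        [0.1, 0.0],
        [0.0, 0.1],
        [5.0, 5.0],
        [5.1, 5.0],
        [5.0, 5.1],
        [20.0, 20.0],
    ]
)

db = DBSCAN(eps=0.5, min_samples=2)
labels = db.fit_predict(X)  # labels for the fitted data only

has_predict = hasattr(db, "predict")
print(labels, "has predict:", has_predict)
```

The label -1 marks noise, and the same labels are also available afterwards as `db.labels_`.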

[9]:
db1 = DBSCAN(eps=2.5)  # default Euclidean metric
preds1 = db1.fit_predict(trainx1)
print(preds1[:5])
db2 = DBSCAN(metric=msm_distance, eps=2.5)  # aeon distance as a callable
db3 = DBSCAN(metric="precomputed", eps=2.5)  # precomputed MSM distance matrix
preds2 = db2.fit_predict(trainx1)
print(preds2[:5])
preds3 = db3.fit_predict(train_m2)
print(preds3[:5])
[-1  0  0  0  0]

You can use pairwise distance functions within the scikit-learn FunctionTransformer wrapper.

[10]:
from sklearn.preprocessing import FunctionTransformer

from aeon.datasets import load_italy_power_demand
from aeon.distances import msm_distance, msm_pairwise_distance

X, y = load_italy_power_demand(split="TRAIN")
ft = FunctionTransformer(msm_pairwise_distance)
X2 = ft.transform(X)
print(" Shape = ", X2.shape)
d = msm_distance(X[0], X[1])
print(f"These should be the same {d} and {X2[0][1]}")
 Shape =  (67, 67)
These should be the same 7.595223506000001 and 7.595223506000001

This makes it easy to use distances as features in a scikit-learn pipeline. Whether it is a good idea to do this is a separate question.

[15]:
from sklearn.pipeline import Pipeline

X, y = load_italy_power_demand(split="TRAIN")

pipe = Pipeline(steps=[("FunctionTransformer", ft), ("SVM", SVC())])
pipe.fit(X, y)
pipe.predict(X)
[15]:
array(['1', '1', '2', '2', '1', '2', '2', '1', '1', '2', '2', '1', '1',
       '2', '1', '2', '1', '1', '2', '1', '1', '2', '1', '1', '1', '1',
       '1', '2', '2', '1', '1', '2', '2', '1', '2', '2', '1', '2', '1',
       '2', '1', '1', '2', '2', '1', '2', '2', '2', '2', '1', '1', '2',
       '2', '2', '1', '2', '2', '1', '1', '2', '2', '1', '1', '2', '1',
       '2', '2'], dtype='<U1')
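
One reason for caution with the pipeline above is that FunctionTransformer is stateless: it applies the function to whatever data it is given, so new cases are compared with each other rather than with the training set. The sketch below illustrates this with a hypothetical `pairwise_sketch` function in place of an aeon pairwise distance. Binding the training data through `kw_args` is one possible workaround, shown here as an assumption rather than a recommended pattern.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def pairwise_sketch(X, Y=None):
    """Hypothetical pairwise distance: rows of X against rows of Y (or X itself)."""
    Y = X if Y is None else Y
    return np.abs(X[:, None, :] - Y[None, :, :]).sum(axis=2)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 12))
X_new = rng.normal(size=(3, 12))

ft = FunctionTransformer(pairwise_sketch)
train_features = ft.fit_transform(X_train)  # (8, 8): train against train
new_features = ft.transform(X_new)  # (3, 3): new cases against each other

# Possible workaround: always measure distances to the training set.
ft_to_train = FunctionTransformer(pairwise_sketch, kw_args={"Y": X_train})
new_to_train = ft_to_train.transform(X_new)  # (3, 8): matches the training width

print(train_features.shape, new_features.shape, new_to_train.shape)
```

A downstream estimator fitted on 8 distance features would reject the (3, 3) output, whereas the (3, 8) output has the feature count it expects.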