Loading data with unequal length series or missing values

Some of the archive datasets have variable length series or missing values. Some algorithms can handle this type of data internally, but many cannot. Estimator capabilities are recorded in tags: for example, the tag capability:unequal_length indicates that an estimator can handle unequal length series internally. You can list the estimators that have this capability using all_estimators.

[1]:
from aeon.utils.discovery import all_estimators

all_estimators(type_filter="classifier", tag_filter={"capability:unequal_length": True})

Collections of unequal length series are stored as a list of 2D arrays, one array of shape (n_channels, n_timepoints) per series. There are two unequal length example problems in aeon:

[2]:
from aeon.datasets import load_japanese_vowels, load_pickup_gesture_wiimoteZ

# Each loader returns the collection of series and the class labels.
j_vowels, j_labels = load_japanese_vowels()
p_gesture, p_labels = load_pickup_gesture_wiimoteZ()
print(type(j_vowels[0].shape), "  ", type(p_gesture[0].shape))
# Series in one collection share the channel count but differ in length.
print("shape first =", j_vowels[0].shape, "shape 11th =", j_vowels[10].shape)
<class 'tuple'>    <class 'tuple'>
shape first = (12, 20) shape 11th = (12, 23)
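To make the storage format concrete, here is a minimal NumPy-only sketch of such a collection. The data is synthetic and the variable names are illustrative, not part of aeon.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three synthetic series with 2 channels each but different lengths.
lengths = [5, 8, 6]
collection = [rng.normal(size=(2, n)) for n in lengths]

# Each element is a 2D array of shape (n_channels, n_timepoints).
series_lengths = [x.shape[1] for x in collection]
is_equal_length = len(set(series_lengths)) == 1

print(series_lengths)
print(is_equal_length)
```

Because the lengths differ, the collection cannot be stacked into a single 3D array, which is why a list is used.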

The TSML archive (TSC.com) contains several unequal length datasets, including 11 from the UCR univariate archive and seven from the multivariate archive.

[3]:
from aeon.datasets.tsc_datasets import (
    multivariate_unequal_length,
    univariate_variable_length,
)

print(univariate_variable_length)
print(multivariate_unequal_length)
{'PickupGestureWiimoteZ', 'GestureMidAirD2', 'ShakeGestureWiimoteZ', 'PLAID', 'GesturePebbleZ1', 'AllGestureWiimoteZ', 'GestureMidAirD1', 'GestureMidAirD3', 'GesturePebbleZ2', 'AllGestureWiimoteY', 'AllGestureWiimoteX'}
{'AsphaltObstaclesCoordinates', 'SpokenArabicDigits', 'InsectWingbeat', 'CharacterTrajectories', 'JapaneseVowels', 'AsphaltPavementTypeCoordinates', 'AsphaltRegularityCoordinates'}

It is common to preprocess variable length series before classification, regression or clustering, and aeon provides tools to do this directly. For example, you can pad every series to the length of the longest series in the collection, or truncate every series to the length of the shortest:

[4]:
from aeon.transformations.collection.unequal_length import Padder, Truncator

padder = Padder()
truncator = Truncator()
padded_j_vowels = padder.fit_transform(j_vowels)
truncated_j_vowels = truncator.fit_transform(j_vowels)
print(padded_j_vowels.shape, truncated_j_vowels.shape)
(640, 12, 29) (640, 12, 7)
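The effect of the two operations can be sketched in plain NumPy. This is only an illustration of the idea: it assumes padding with zeros at the end of each series and truncation from the end, which may not match the defaults of Padder and Truncator.

```python
import numpy as np

# Synthetic collection: 3 series, 2 channels, lengths 5, 8 and 6.
collection = [np.ones((2, n)) for n in (5, 8, 6)]

max_len = max(x.shape[1] for x in collection)
min_len = min(x.shape[1] for x in collection)

# Pad on the right along the time axis only; channels are left untouched.
# Zero fill is an assumption made for this sketch.
padded = np.stack(
    [
        np.pad(x, ((0, 0), (0, max_len - x.shape[1])), constant_values=0.0)
        for x in collection
    ]
)

# Keep the first min_len time points of every series.
truncated = np.stack([x[:, :min_len] for x in collection])

print(padded.shape, truncated.shape)  # (3, 2, 8) (3, 2, 5)
```

Both results are regular 3D arrays of shape (n_cases, n_channels, n_timepoints), so they can be passed to estimators that require equal length input.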

There is no single best way of dealing with unequal length series. TSC has equal length versions of all unequal length datasets, and you can load these directly with the load_classification and load_regression functions. The equalising operation is bespoke to each problem; for the classification problems, the data was padded with the series mean with low level Gaussian noise added. Loading the equal length version is the default behaviour.

[14]:
from aeon.datasets import load_classification

j_equal, _ = load_classification("JapaneseVowels")
j_unequal, _ = load_classification("JapaneseVowels", load_equal_length=False)
print(type(j_equal))
print(j_equal.shape)
print(type(j_unequal))
<class 'numpy.ndarray'>
(640, 12, 25)
<class 'list'>
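The mean-plus-noise padding described above for the classification problems can be sketched as follows. The helper pad_with_noisy_mean is hypothetical, and the per-channel mean, the noise level and the right-side padding are assumptions made for illustration; the exact procedure used to build the archive files is not given here.

```python
import numpy as np


def pad_with_noisy_mean(x, target_length, noise_std=0.01, rng=None):
    """Pad a (n_channels, n_timepoints) array to target_length.

    Hypothetical helper: pads each channel with its own mean plus
    low level Gaussian noise. Not the archive's actual code.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_channels, n_timepoints = x.shape
    n_missing = target_length - n_timepoints
    if n_missing <= 0:
        return x.copy()
    channel_means = x.mean(axis=1, keepdims=True)
    noise = rng.normal(scale=noise_std, size=(n_channels, n_missing))
    return np.hstack([x, channel_means + noise])


x = np.arange(12, dtype=float).reshape(2, 6)
padded = pad_with_noisy_mean(x, 9, rng=np.random.default_rng(0))
print(padded.shape)  # (2, 9)
```

Adding a little noise avoids long constant runs at the end of each series, which some algorithms could otherwise pick up as an artificial feature.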

This is the case for both the classification and regression problems. When a dataset is downloaded, a zip file containing both versions is copied.

Unequal length problems made equal length have the suffix _eq, and those with missing values imputed have the suffix _nmv. At the moment we do not have any problems with both missing values and unequal length series.
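As a rough illustration of what imputation involves, the sketch below detects missing values and fills them with the channel mean. It assumes missing values are represented as NaN, and the mean fill is a stand-in chosen for simplicity; the method used to produce the _nmv files is not stated here.

```python
import numpy as np

# One synthetic series with 2 channels and some missing observations.
x = np.array(
    [
        [1.0, np.nan, 3.0, 4.0],
        [2.0, 2.0, np.nan, np.nan],
    ]
)

has_missing = bool(np.isnan(x).any())

# Replace each NaN with the mean of the observed values in its channel.
channel_means = np.nanmean(x, axis=1, keepdims=True)
imputed = np.where(np.isnan(x), channel_means, x)

print(has_missing)
print(imputed)
```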
