Data conversions in aeon

We recommend the following strategy: use pd.Series or pd.DataFrame for forecasting. For classification, clustering and regression, use a 3D numpy array of shape (n_cases, n_channels, n_timepoints) if the time series in your collection are of equal length, or a list of 2D numpy arrays of length [n_cases] if they are not. All data loaded from file with our data loaders follows this strategy.
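As a minimal sketch of the two recommended collection layouts, the following uses only numpy with randomly generated data. The variable names and sizes are illustrative assumptions, not part of aeon.

```python
import numpy as np

rng = np.random.default_rng(0)

# Equal length: one 3D array of shape (n_cases, n_channels, n_timepoints)
X_equal = rng.random((10, 3, 100))

# Unequal length: a list of n_cases 2D arrays,
# each of shape (n_channels, n_timepoints_i)
lengths = [80, 100, 120]  # illustrative series lengths
X_unequal = [rng.random((3, n)) for n in lengths]

print(X_equal.shape)
print([x.shape for x in X_unequal])
```

The list form keeps the channel count fixed across cases while letting the number of time points vary.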

However, aeon provides a range of converters in the datatypes package. These are grouped into converters for single series and converters for collections of series.

Series Converters

Single time series can be stored in the following data structures:

“pd.Series”: Pandas Series storing a univariate time series
“pd.DataFrame”: Pandas DataFrame storing a univariate or multivariate time series
“np.ndarray”: numpy 2D array for a series, of shape (n_timepoints, n_channels)
“xr.DataArray”: xarray DataArray for a univariate or multivariate time series
“dask_series”: Dask DataFrame for a univariate or multivariate time series

The strings above are used internally to identify each data structure for conversion purposes. NOTE: the 2D numpy array representation is not consistent with that used in collections. This unfortunate difference results from legacy design and from differing norms across research fields.
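To make the difference concrete, here is a small numpy-only sketch. It assumes a single series stored as (n_timepoints, n_channels) and transposes it into the (n_channels, n_timepoints) layout used for each case in a collection. The variable names are illustrative.

```python
import numpy as np

# A single series with 4 time points and 3 channels: (n_timepoints, n_channels)
series = np.arange(12).reshape(4, 3)

# The same data laid out as one case of a collection: (n_channels, n_timepoints)
as_collection_item = series.T

print(series.shape, as_collection_item.shape)
```

The transpose is a view of the same data, so no values are copied. Only the axis order changes.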

Conversion to and from these data structures is fairly straightforward, but we provide tools to help. aeon contains converters that are wrapped by the method convert. This method will attempt to convert from one of the five types to another, and raise an exception if the conversion is invalid (e.g. if the object is not in fact of type “from_type”). Note that estimators will attempt to automatically perform this conversion to the specified internal type of that estimator.

[1]:

import numpy as np

from aeon.datatypes import convert

numpyarray = np.random.random(size=(100, 1))
series = convert(numpyarray, from_type="np.ndarray", to_type="xr.DataArray")
type(series)

[1]:

xarray.core.dataarray.DataArray

The method convert wraps the actual converter functions in the module aeon.datatypes._series._convert. Some examples are shown below.

[2]:

from aeon.datatypes._series._convert import (
    convert_mvs_to_dask_as_series,
    convert_Mvs_to_xrdatarray_as_Series,
    convert_np_to_MvS_as_Series,
)

pd_dataframe = convert_np_to_MvS_as_Series(numpyarray)
type(pd_dataframe)

[2]:

pandas.core.frame.DataFrame

[3]:

dask_dataframe = convert_mvs_to_dask_as_series(pd_dataframe)
type(dask_dataframe)

[3]:

dask.dataframe.core.DataFrame

[4]:

xrarray = convert_Mvs_to_xrdatarray_as_Series(pd_dataframe)
type(xrarray)

[4]:

xarray.core.dataarray.DataArray

Collections Converters

Collections of time series are the fundamental data type for machine learning algorithms. In older versions of the toolkit, collections of time series were called panels (a term from econometrics, not machine learning), and there are still references to panel. The main characteristics of collections of time series that affect storage are that they can be univariate or multivariate, and that they can be equal length or unequal length. The main data structures for storing collections are as follows:

“numpy3D”: 3D np.ndarray of format (n_cases, n_channels, n_timepoints)
“np-list”: python list of 2D numpy arrays of length [n_cases], each of shape (n_channels, n_timepoints_i)
“df-list”: python list of 2D pd.DataFrames of length [n_cases], each of shape (n_timepoints_i, n_channels)
“numpy2D”: 2D np.ndarray of shape (n_cases, n_timepoints)
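As an illustration of how “numpy2D” relates to “numpy3D”, the numpy-only sketch below assumes a univariate, equal-length collection and adds a channel axis of size 1. This is a hand-written illustration of the shapes, not aeon's converter.

```python
import numpy as np

# Univariate collection in "numpy2D" layout: (n_cases, n_timepoints)
X2d = np.random.default_rng(0).random((10, 100))

# Insert a channel axis to get the "numpy3D" layout: (n_cases, 1, n_timepoints)
X3d = X2d[:, np.newaxis, :]

# Dropping the single channel recovers the 2D layout
back = X3d[:, 0, :]

print(X2d.shape, X3d.shape, back.shape)
```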

Other supported types which may be useful are:

“nested_univ”: a pd.DataFrame of shape (n_cases, n_channels) where each cell is a pd.Series of length (n_timepoints)
“pd-multiindex”: pd.DataFrame with multi-index (cases, timepoints)
“pd-wide”: pd.DataFrame in wide format, with shape (n_timepoints, n_cases)

As with series, collection conversion can be performed with the method convert, which wraps methods in aeon.datatypes._panel._convert. However, internal estimator conversion is now handled with the function convert_collection in the aeon.utils.conversion package, as follows:

[5]:

from aeon.utils.conversion import convert_collection

# 10 multivariate time series with 3 channels of length 100 in "numpy3D" format
multi = np.random.random(size=(10, 3, 100))
np_list = convert_collection(multi, output_type="np-list")
print(
    f" Type = {type(np_list)}, type first {type(np_list[0])} shape first "
    f"{np_list[0].shape}"
)

 Type = <class 'list'>, type first <class 'numpy.ndarray'> shape first (3, 100)

[6]:

df_list = convert_collection(multi, output_type="df-list")
print(
    f" Type = {type(df_list)}, type first {type(df_list[0])} shape first "
    f"{df_list[0].shape}"
)

 Type = <class 'list'>, type first <class 'pandas.core.frame.DataFrame'> shape first (100, 3)

Note again the difference in storage convention: within a collection, series in 2D numpy are stored with shape (n_channels, n_timepoints), whereas in dataframes they have shape (n_timepoints, n_channels). We know this is confusing, and are thinking about the best way of reconciling this distinction. See this issue. The actual converter functions are in aeon.utils.conversion._convert_collection, for example:

[ ]:

from aeon.utils.conversion._convert_collection import _from_numpy3d_to_pd_multiindex

mi = _from_numpy3d_to_pd_multiindex(multi)
print(f" Type = {type(mi)}, shape {mi.shape}")
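The cell above has no recorded output. As a rough picture of the “pd-multiindex” layout, the sketch below builds a comparable DataFrame by hand using only numpy and pandas. It assumes that rows are indexed by (case, timepoint) and that columns hold the channels. The index level names "case" and "timepoint" are hypothetical, and this is not aeon's implementation.

```python
import numpy as np
import pandas as pd

n_cases, n_channels, n_timepoints = 10, 3, 100
X = np.random.default_rng(0).random((n_cases, n_channels, n_timepoints))

# Row index: every (case, timepoint) pair; level names are illustrative only
index = pd.MultiIndex.from_product(
    [range(n_cases), range(n_timepoints)], names=["case", "timepoint"]
)

# Move channels to the last axis, then flatten (case, timepoint) into rows
values = X.transpose(0, 2, 1).reshape(n_cases * n_timepoints, n_channels)
mi_sketch = pd.DataFrame(values, index=index)

print(type(mi_sketch), mi_sketch.shape)
```

Each row of mi_sketch holds the channel values of one case at one time point, so mi_sketch.loc[(2, 5)] corresponds to X[2, :, 5].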
