
Loading data into aeon

aeon supports a range of data input formats. Example problems are described in provided_data.ipynb. Downloading data is described in benchmarking_data.ipynb. You can of course load and format the data yourself so that it conforms to the input types described in data_storage. aeon also provides file formats for time series for both forecasting and machine learning. These are all text files with a particular structure, storing a single time series per row.

  1. The .ts and .tsf formats used by the aeon packages and the time series and forecasting repositories. More information on the .tsf format is here. Links to download all of the UCR univariate and the tsml multivariate data in .ts format.

  2. The .arff format used by the Weka machine learning toolkit. Links to download all of the UCR univariate and the tsml multivariate data in .arff format.

  3. The .tsv format used by the UCR research group. Link to download all of the UCR univariate data in .tsv format.

The baked in datasets are described here. Data structures to store the data are described here.

The .ts and .tsf file format

The .ts and .tsf file formats can store time series with different characteristics and contain metadata in the header. .ts files store collections of time series for classification, clustering and regression. They can store univariate or multivariate problems, with equal length or unequal length series. .tsf files store collections of series for forecasting. This is the format of most of the baked-in machine learning problems.

Both file types allow for comments at the top of the file. All lines beginning with a hash (#) are comments. After the comments the files contain meta information about the data and the data itself.

Meta data for .ts and .tsf files

The header information is used to store metadata. .ts files contain a subset of the following tags, some of which are boolean flags and some of which take a value:

@problemName <problem name>
@univariate <true/false>
@dimensions integer
@equalLength <true/false>
@seriesLength integer
@classLabel <true/false> <space delimited list of possible class values>
@targetlabel <true/false>
@data

Note that these tags are not essential, but they help with understanding the data. If they are not present, they are inferred from the data. They are also not case sensitive: we use camel case in the files for readability, but internally everything is converted to lower case. Only one of classLabel or targetLabel can be true. If classLabel is true, it indicates a classification problem, and the class values should follow the tag. Class values can be strings or integers. So, for example, the header for the PLAID dataset is

@problemName PLAID
@missing false
@univariate true
@equalLength false
@classLabel true 0 1 2 3 4 5 6 7 8 9 10

This indicates that it is univariate (a single channel per case), has no missing values, has unequal length time series and is an 11-class classification problem. For more detail on our provided data, see here. The BasicMotions data header is as follows

@problemName BasicMotions
@missing false
@univariate false
@dimensions 6
@equalLength true
@seriesLength 100
@classLabel true Standing Running Walking Badminton

This is a multivariate problem with six channels/dimensions, equal length series (length 100) with four classes. There is also a tag for @timeStamps, but this is not yet supported.
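
To make the header conventions concrete, here is a minimal sketch of how such tags could be read into a dictionary with lower-case keys. The function name parse_ts_header and the class_values key are assumptions made for this illustration; this is not aeon's own parser, which should be used for real files.

```python
def parse_ts_header(lines):
    """Collect '@tag value' header lines into a dict with lower-case keys.

    Hypothetical helper for illustration only, not part of aeon.
    """
    meta = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not line.startswith("@"):
            continue  # ignore anything that is not a tag
        tokens = line[1:].split()
        if not tokens:
            continue
        tag = tokens[0].lower()  # tags are not case sensitive
        values = tokens[1:]
        if tag == "data":
            break  # the header ends at @data
        if tag == "classlabel":
            # flag first, then the space-delimited class values
            meta[tag] = bool(values) and values[0].lower() == "true"
            meta["class_values"] = values[1:]
        elif values and values[0].lower() in ("true", "false"):
            meta[tag] = values[0].lower() == "true"
        elif values and values[0].isdigit():
            meta[tag] = int(values[0])
        else:
            meta[tag] = " ".join(values)
    return meta


header = """\
# BasicMotions header from the text above
@problemName BasicMotions
@missing false
@univariate false
@dimensions 6
@equalLength true
@seriesLength 100
@classLabel true Standing Running Walking Badminton
@data
"""
meta = parse_ts_header(header.splitlines())
print(meta)
```

The sketch does not infer missing tags from the data, which the text above says happens when tags are absent.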

The metadata in .tsf files begins with @attribute tags that describe each series. An @attribute is a series column name. Multiple @attribute tags correspond to hierarchical keys. Other possible tags include frequency, horizon, missing and equallength. For example

# Dataset Information
# This dataset is an aggregated version of the Ausgrid half hourly dataset.
# This file contains 299 weekly series representing the energy consumption of
# Australian households for the period of 2010-07-01 to 2013-06-26 under General
# Consumption (GC) category.
# The original Ausgrid dataset contains 300 half hourly series and one series was
# removed before aggregation due to the inclusion of missing values from
# 2012-10-10 to 2013-06-30.
#
# For more details, please refer to
# AusGrid, 2019. Solar home electricity data. Accessed: 2020-05-10.
# URL https://www.ausgrid.com.au/Industry/Our-Research/Data-to-share/Solar-home
# -electricity-data
#
@relation Ausgrid
@attribute series_name string
@attribute start_timestamp date
@frequency weekly
@horizon 8
@missing false
@equallength true

This indicates that each series has a string name and a start time at the beginning, a weekly frequency, a forecasting horizon of eight, no missing values, and that all series are the same length.
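
As a rough illustration of how this header could be read, the sketch below collects the @attribute tags in file order and stores the remaining tags in a dictionary. The name parse_tsf_header is hypothetical and this is not the loader that aeon uses.

```python
def parse_tsf_header(lines):
    """Return (attributes, meta) from the header lines of a .tsf file.

    Hypothetical helper for illustration only, not part of aeon.
    """
    attributes = []  # (name, type) pairs, in the order they appear
    meta = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines carry no metadata
        tokens = line.split()
        tag = tokens[0].lower()
        if tag == "@data":
            break  # series rows start after this tag
        if tag == "@attribute" and len(tokens) >= 3:
            attributes.append((tokens[1], tokens[2]))
        elif tag.startswith("@") and len(tokens) > 1:
            value = tokens[1]
            if value.lower() in ("true", "false"):
                meta[tag[1:]] = value.lower() == "true"
            elif value.isdigit():
                meta[tag[1:]] = int(value)
            else:
                meta[tag[1:]] = value
    return attributes, meta


header = """\
# Dataset Information
@relation Ausgrid
@attribute series_name string
@attribute start_timestamp date
@frequency weekly
@horizon 8
@missing false
@equallength true
@data
"""
attributes, meta = parse_tsf_header(header.splitlines())
print(attributes)
print(meta)
```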

Data format for .ts

Data in both .ts and .tsf files begins after the @data tag. Each row contains data for a single time series. Values for a series are in a comma-separated ordered list. For .ts files, each series can be multivariate, with each dimension/channel separated by a colon (:). The class value or target value is at the end of the series, and is also separated by a colon. For example,

@problemName example1
@missing false
@univariate true
@equalLength true
@seriesLength 4
@classLabel true 1 2
@data
2,3,2,4:1
13,12,32,12:1
4,4,5,4:2

has 3 cases of univariate series, length 4 with class values 1, 1 and 2. Missing readings are indicated by ?. For example, this regression dataset has three dimensions, unequal length series and contains missing values.

@problemName example2
@missing true
@univariate false
@dimensions 3
@equalLength false
@targetlabel true
@data
2,3,2,4: 5,6,7,7: 8,2,?,5:62
13,?,32,12,25: 6,6,6,6,?,8: 9,8,7,5,5:55
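
The row layout can be sketched with a few lines of standard-library Python: split on colons, treat the last field as the class or target value, and split each remaining field on commas, mapping ? to NaN. The function name parse_ts_row is an assumption for this illustration and is not how aeon loads files internally.

```python
import math


def parse_ts_row(row):
    """Split one .ts data row into (channels, label).

    Hypothetical helper for illustration only, not part of aeon.
    """
    # Everything before the last colon is a channel; the last field is the label.
    *channels, label = row.strip().split(":")
    series = [
        [math.nan if v.strip() == "?" else float(v) for v in channel.split(",")]
        for channel in channels
    ]
    return series, label.strip()


# First case of example2 above: three channels, one missing value, target 62.
series, target = parse_ts_row("2,3,2,4: 5,6,7,7: 8,2,?,5:62")
print(len(series), "channels, target =", target)

# First case of example1 above: a single channel with class value 1.
uni_series, uni_label = parse_ts_row("2,3,2,4:1")
print(uni_series, uni_label)
```

The label is returned as a string; for a regression problem it would still need converting with float().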

Data format for .tsf

The attributes are separated by colons, and the data follows. For example

@relation test
@attribute series_name string
@attribute start_timestamp date
@frequency yearly
@horizon 4
@missing false
@equallength false
@data
T1:1979-01-01 00-00-00:25092.2284,24271.5134,25828.9883,27697.5047,27956.2276,29924.4321,30216.8321
T2:1979-01-01 00-00-00:887896.51,887068.98,971549.04
T3:1973-01-01 00-00-00:227921,230995,183635,238605,254186

contains three series called T1, T2 and T3. T1 and T2 start on 1979-01-01 and T3 starts on 1973-01-01. The data is yearly, and the series are not of equal length.
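
A minimal sketch of reading these rows is given below. It assumes the attribute names have already been taken from the @attribute tags, and that timestamps use the year-month-day hour-minute-second layout seen in the example. The name parse_tsf_row is hypothetical; this is not aeon's loader.

```python
from datetime import datetime


def parse_tsf_row(row, attribute_names):
    """Split one .tsf data row into its attribute values and series values.

    Hypothetical helper for illustration only, not part of aeon.
    """
    # Attribute values come first, separated by colons; the series is last.
    *attribute_values, values = row.strip().split(":")
    if len(attribute_values) != len(attribute_names):
        raise ValueError("row does not match the declared attributes")
    record = dict(zip(attribute_names, attribute_values))
    record["values"] = [float(v) for v in values.split(",")]
    return record


rows = [
    "T1:1979-01-01 00-00-00:25092.2284,24271.5134,25828.9883,27697.5047,"
    "27956.2276,29924.4321,30216.8321",
    "T2:1979-01-01 00-00-00:887896.51,887068.98,971549.04",
    "T3:1973-01-01 00-00-00:227921,230995,183635,238605,254186",
]
records = [parse_tsf_row(r, ["series_name", "start_timestamp"]) for r in rows]
for rec in records:
    # The timestamp uses hyphens in the time part, so it contains no colons.
    start = datetime.strptime(rec["start_timestamp"], "%Y-%m-%d %H-%M-%S")
    print(rec["series_name"], start.year, len(rec["values"]))
```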

Loading .ts and .tsf Data

The TSC data comes with a predefined train/test split, so each problem is stored in two files with the suffixes _TRAIN.ts and _TEST.ts. By default, each problem is stored in its own directory. If a dataset is stored on disk, it can be loaded from a .ts file using this function in aeon.datasets:

load_from_tsfile(full_file_path_and_name, replace_missing_vals_with='NaN')

For example, the ArrowHead problem that is included in aeon under aeon/datasets/data has this header

@problemName ArrowHead
@classLabel true 0 1 2
@univariate true
@missing false
@data

and can be loaded with load_from_tsfile as follows

[1]:
import os

import aeon
from aeon.datasets import load_from_tsfile

DATA_PATH = os.path.join(os.path.dirname(aeon.__file__), "datasets/data")

train_x, train_y = load_from_tsfile(DATA_PATH + "/ArrowHead/ArrowHead_TRAIN.ts")
test_x, test_y = load_from_tsfile(DATA_PATH + "/ArrowHead/ArrowHead_TEST.ts")
test_x[0][0][:5]
[1]:
array([-1.9077772, -1.9048903, -1.8885626, -1.8711639, -1.8316792])

Train and test partitions of the ArrowHead problem have been loaded into 3D numpy arrays, each with an associated array of class values. Further info on data structures is given in this notebook. Datasets that are shipped with aeon (like ArrowHead, BasicMotions and PLAID) can be loaded more simply with bespoke functions. More details are given here.

[1]:
from aeon.datasets import load_arrow_head, load_basic_motions, load_plaid

train_x, train_y = load_arrow_head(split="TRAIN")
test_x, test_y = load_arrow_head(split="test")
X, y = load_basic_motions()
plaid_train, _ = load_plaid(split="train")
print("Train shape = ", train_x.shape, " test shape = ", test_x.shape)
Train shape =  (36, 1, 251)  test shape =  (175, 1, 251)

Loading directly from tsc.com

You can also load .ts data directly from tsc.com (https://timeseriesclassification.com) using the function load_classification(name, split=None, extract_path=None, return_metadata=False). This function downloads the zip file from the website and unpacks it in the specified directory. It does not download if the file is already in the extract_path or in aeon/datasets/data. If you do not give an extract path, it looks in aeon/datasets/data and, if the data is not there, writes to aeon/datasets/local_data.

[3]:
from aeon.datasets import load_classification

# This will not download, because ArrowHead is already in the directory.
# Change the extract path or the name to trigger a download.
X, y, meta_data = load_classification("ArrowHead", return_metadata=True)
print(" Shape of X = ", X.shape)
print(" Meta data = ", meta_data)
 Shape of X =  (211, 1, 251)
 Meta data =  {'problemname': 'arrowhead', 'timestamps': False, 'missing': False, 'univariate': True, 'equallength': True, 'classlabel': True, 'targetlabel': False, 'class_values': ['0', '1', '2']}

For more details, see the Load from web
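
The lookup order described in this section can be sketched as a small decision function. Everything here is an assumption made for illustration: the function name plan_dataset_location, its parameters, and the use of one directory per problem are hypothetical, and aeon's actual implementation may differ in detail.

```python
import tempfile
from pathlib import Path


def plan_dataset_location(name, bundled_dir, local_dir, extract_path=None):
    """Return (dataset_dir, needs_download) using the lookup order in the text.

    Hypothetical helper for illustration only, not part of aeon.
    """
    search_roots = [Path(bundled_dir)]
    if extract_path is not None:
        search_roots.insert(0, Path(extract_path))
    for root in search_roots:
        if (root / name).is_dir():
            return root / name, False  # already on disk, so no download
    # Not found: write to extract_path if given, else the local data directory.
    write_root = Path(extract_path) if extract_path is not None else Path(local_dir)
    return write_root / name, True


with tempfile.TemporaryDirectory() as tmp:
    bundled = Path(tmp) / "data"  # stands in for aeon/datasets/data
    local = Path(tmp) / "local_data"  # stands in for aeon/datasets/local_data
    (bundled / "ArrowHead").mkdir(parents=True)

    found = plan_dataset_location("ArrowHead", bundled, local)
    missing = plan_dataset_location("Chinatown", bundled, local)
    found_summary = (found[0].name, found[0].parent.name, found[1])
    missing_summary = (missing[0].name, missing[0].parent.name, missing[1])

print(found_summary)
print(missing_summary)
```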

Writing .ts files

You can write data to a .ts file by calling the function write_to_tsfile in the datasets module. It has the following method signature:

def write_to_tsfile(X, path, y=None, problem_name="sample_data.ts", header=None, regression=False):
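
To show what ends up in a written file, the sketch below produces the example1 file from earlier using only the standard library. It is an illustration of the file layout, not a replacement for write_to_tsfile; the name write_ts_sketch and its arguments are hypothetical, and it only handles equal length classification data without missing values.

```python
import tempfile
from pathlib import Path


def write_ts_sketch(path, X, y, problem_name, class_values):
    """Write equal length cases, indexed X[case][channel][time], to a .ts file.

    Hypothetical helper for illustration only, not part of aeon.
    """
    n_channels = len(X[0])
    header = [
        f"@problemName {problem_name}",
        "@missing false",
        f"@univariate {'true' if n_channels == 1 else 'false'}",
        "@equalLength true",
        f"@seriesLength {len(X[0][0])}",
        "@classLabel true " + " ".join(class_values),
        "@data",
    ]
    if n_channels > 1:
        header.insert(3, f"@dimensions {n_channels}")
    rows = []
    for case, label in zip(X, y):
        # Commas within a channel, colons between channels and before the label.
        channels = [",".join(str(v) for v in channel) for channel in case]
        rows.append(":".join(channels + [str(label)]))
    Path(path).write_text("\n".join(header + rows) + "\n")


X = [[[2, 3, 2, 4]], [[13, 12, 32, 12]], [[4, 4, 5, 4]]]
y = ["1", "1", "2"]
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "example1.ts"
    write_ts_sketch(out, X, y, "example1", ["1", "2"])
    written = out.read_text().splitlines()
print("\n".join(written))
```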

Weka .ARFF files

The Weka Java toolkit uses a file format called Attribute-Relation File Format (ARFF) that was the original basis for the .ts and .tsf formats. Information on .arff files can be found here. ARFF files can be used to store equal length univariate and multivariate problems. They cannot handle unequal length series.

Loading from Weka ARFF files

It is also possible to load data from Weka’s attribute-relation file format (ARFF) files. Data for timeseries problems are made available in this format at www.timeseriesclassification.com. The load_from_arff_file method in aeon.datasets supports reading data for both univariate and multivariate timeseries problems.

For example, we can load the ArrowHead data from an arff file rather than a .ts file.

[4]:
from aeon.datasets import load_from_arff_file

X, y = load_from_arff_file(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.arff"))

UCR .tsv files

A further option is to load data into aeon from tab-separated value (.tsv) files. Researchers at the University of California, Riverside make a variety of time series data available in this format at Eamonn Keogh’s website. Each row is a time series, and the class value is the first entry in the row.

The load_from_tsv_file method in aeon.datasets supports reading univariate problems. An example with ArrowHead is given below to demonstrate equivalence with loading from the .ts and ARFF file formats.

[5]:
from aeon.datasets import load_from_tsv_file

X, y = load_from_tsv_file(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.tsv"))
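
For comparison with the .ts layout, here is a minimal sketch of the .tsv row structure, where the class value comes first and the series values follow, all separated by tabs. The function name parse_ucr_tsv is hypothetical and the sample values are made up for illustration rather than taken from ArrowHead.

```python
import csv
import io


def parse_ucr_tsv(text):
    """Return (series, labels) from tab-separated text, class value first.

    Hypothetical helper for illustration only, not part of aeon.
    """
    labels, series = [], []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip empty lines
        labels.append(row[0])  # the first entry is the class value
        series.append([float(v) for v in row[1:]])
    return series, labels


# Three made-up univariate cases of length 4 with class values 1, 1 and 2.
sample = "1\t2\t3\t2\t4\n1\t13\t12\t32\t12\n2\t4\t4\t5\t4\n"
series, labels = parse_ucr_tsv(sample)
print(labels)
print(series[0])
```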
