Loading data into aeon¶
aeon supports a range of data input formats. Example problems are described in Provided datasets. Downloading data is described in Downloading and loading benchmarking datasets. You can of course load and format the data yourself, so long as it conforms to the input types described in Data structures and containers for aeon estimators. aeon also provides its own data formats for time series, for both forecasting and machine learning. These are text files with a particular structure; both formats store a single time series per row. The supported file formats are:
- The .ts and .tsf formats used by the aeon package and the time series classification and forecasting repositories. More information on the .tsf format is here. Links to download all of the UCR univariate and the tsml multivariate data in .ts format.
- The .arff format used by the Weka machine learning toolkit (see here). Links to download all of the UCR univariate and the tsml multivariate data in .arff format.
- The .tsv format used by the UCR research group (see here). Link to download all of the UCR univariate data in .tsv format.
- The .csv format used by TimeEval. The format is described here.
The baked-in datasets are described here. Data structures to store the data are described here.
The .ts and .tsf file formats¶
The .ts and .tsf file formats can store time series with different characteristics and contain metadata in the header. .ts files store collections of time series for classification, clustering and regression; they can store univariate or multivariate, equal length or unequal length problems. This is the format of most of the baked-in machine learning problems (link to notebook). .tsf files store collections of series for forecasting.
Both file types allow for comments at the top of the file: all lines beginning with a hash (#) are comments. After the comments, the files contain meta information about the data, followed by the data itself.
Meta data for .ts and .tsf files¶
The header information is used to store metadata. .ts file headers contain a subset of the following tags:
@problemName <problem name>
@univariate <true/false>
@dimensions integer
@equalLength <true/false>
@seriesLength integer
@classLabel <true/false> <space delimited list of possible class values>
@targetlabel <true/false>
@data
Note that these tags are not essential, but they aid understanding of the data; if they are not present, they are inferred from the data. The tags are also not case sensitive: we use camel case in the files for readability, but internally everything is stripped back to lower case. Only one of classLabel or targetlabel can be true. If classLabel is true, it indicates a classification problem, and the possible class values should follow the tag. Class values can be strings or integers. So, for example, the header for the PLAID dataset is
@problemName PLAID
@missing false
@univariate true
@equalLength false
@classLabel true 0 1 2 3 4 5 6 7 8 9 10
This indicates that the data is univariate (a single channel per case), has no missing values, contains unequal length time series and is an 11-class classification problem. For more detail on our provided data, see here. The BasicMotions data header is as follows:
@problemName BasicMotions
@missing false
@univariate false
@dimensions 6
@equalLength true
@seriesLength 100
@classLabel true Standing Running Walking Badminton
This is a multivariate problem with six channels/dimensions, equal length series (length 100) and four classes. There is also a tag for @timeStamps, but this is not yet supported.
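As a quick check, the loaded data should match this header: six channels and a series length of 100. A minimal sketch using the bundled load_basic_motions loader (also used later in this notebook):
[ ]:
from aeon.datasets import load_basic_motions

X, y = load_basic_motions(split="train")
# X is a 3D numpy array of shape (n_cases, n_channels, n_timepoints);
# given the header above we expect 6 channels and a series length of 100
print(X.shape)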
.tsf file metadata begins with @attribute tags for each series. An @attribute is a series column name; multiple @attribute tags correspond to hierarchical keys. Other possible tags include frequency, horizon, missing and equallength. For example:
# Dataset Information
# This dataset is an aggregated version of the Ausgrid half hourly dataset.
# This file contains 299 weekly series representing the energy consumption of
# Australian households for the period of 2010-07-01 to 2013-06-26 under General
# Consumption (GC) category.
# The original Ausgrid dataset contains 300 half hourly series and one series was
# removed before aggregation due to the inclusion of missing values from
# 2012-10-10 to 2013-06-30.
#
# For more details, please refer to
# AusGrid, 2019. Solar home electricity data. Accessed: 2020-05-10.
# URL https://www.ausgrid.com.au/Industry/Our-Research/Data-to-share/Solar-home
# -electricity-data
#
@relation Ausgrid
@attribute series_name string
@attribute start_timestamp date
@frequency weekly
@horizon 8
@missing false
@equallength true
This indicates that each series begins with a string name and a start timestamp, that the data has a weekly frequency and a forecasting horizon of eight, that there are no missing values and that all series are the same length.
Data format for .ts¶
Data in both .ts and .tsf files begins after the @data tag. Each row contains the data for a single time series. Values for a series are given as a comma-separated ordered list. For .ts files, each series can be multivariate: each dimension/channel is separated by a colon (:). The class value or target value comes at the end of the series and is also separated by a colon. For example,
@problemName example1
@missing false
@univariate true
@equalLength true
@seriesLength 4
@classLabel true 1 2
@data
2,3,2,4:1
13,12,32,12:1
4,4,5,4:2
This file has three cases of univariate series of length 4, with class values 1, 1 and 2. Missing readings are indicated by ?. For example, the following regression dataset has three dimensions, unequal length series and contains missing values:
@problemName example2
@missing true
@univariate false
@dimensions 3
@equalLength false
@targetlabel true
@data
2,3,2,4: 5,6,7,7: 8,2,?,5:62
13,?,32,12,25: 6,6,6,6,?,8: 9,8,7,5,5:55
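To see how such a file loads, here is a minimal sketch that writes the example2 contents above to a temporary file and reads it back with load_from_ts_file (described in the loading section below). Since the series are of unequal length, the collection is expected to come back as a python list of 2D arrays rather than a single 3D array; see the data storage notebook for details on return types.
[ ]:
import os
import tempfile

from aeon.datasets import load_from_ts_file

# Write the example2 contents above to a temporary .ts file
example2 = """@problemName example2
@missing true
@univariate false
@dimensions 3
@equalLength false
@targetlabel true
@data
2,3,2,4:5,6,7,7:8,2,?,5:62
13,?,32,12,25:6,6,6,6,?,8:9,8,7,5,5:55
"""
tmp_dir = tempfile.mkdtemp()
ts_path = os.path.join(tmp_dir, "example2.ts")
with open(ts_path, "w") as f:
    f.write(example2)

# Load it back; missing values (?) are replaced with NaN by default
X, y = load_from_ts_file(ts_path)
print(type(X), len(X), y)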
Data format for .tsf¶
In each data row, the attribute values are separated by colons and the comma-separated series values follow. For example:
@relation test
@attribute series_name string
@attribute start_timestamp date
@frequency yearly
@horizon 4
@missing false
@equallength false
@data
T1:1979-01-01 00-00-00:25092.2284,24271.5134,25828.9883,27697.5047,27956.2276,29924.4321,30216.8321
T2:1979-01-01 00-00-00:887896.51,887068.98,971549.04
T3:1973-01-01 00-00-00:227921,230995,183635,238605,254186
This file contains three series called T1, T2 and T3. T1 and T2 have a start date of 1979-01-01 and T3 starts on 1973-01-01. The data is yearly and the series are not of equal length.
Loading .ts and .tsf Data¶
The TSC data comes with a predefined train/test split, so each problem is stored in two files with the suffixes _TRAIN.ts and _TEST.ts. By default, each problem is stored in its own directory. If a dataset is stored on disk, it can be loaded from a .ts file using the following method in aeon.datasets:
load_from_ts_file(full_file_path_and_name, replace_missing_vals_with='NaN')
For example, the ArrowHead problem that is included in aeon under aeon/datasets/data has this header
@problemName ArrowHead
@classLabel true 0 1 2
@univariate true
@missing false
@data
and can be loaded with load_from_ts_file as follows
[1]:
import os
import aeon
from aeon.datasets import load_from_ts_file
DATA_PATH = os.path.join(os.path.dirname(aeon.__file__), "datasets/data")
train_x, train_y = load_from_ts_file(DATA_PATH + "/ArrowHead/ArrowHead_TRAIN.ts")
test_x, test_y = load_from_ts_file(DATA_PATH + "/ArrowHead/ArrowHead_TEST.ts")
# First five values of the first channel of the first test case
test_x[0][0][:5]
[1]:
array([-1.9077772, -1.9048903, -1.8885626, -1.8711639, -1.8316792])
Train and test partitions of the ArrowHead problem have been loaded into 3D numpy arrays with an associated array of class values. Further information on data structures is given in this notebook. Datasets that are shipped with aeon (like ArrowHead, BasicMotions and PLAID) can be loaded more simply with bespoke functions. More details here.
[2]:
from aeon.datasets import load_arrow_head, load_basic_motions, load_plaid
train_x, train_y = load_arrow_head(split="train")
test_x, test_y = load_arrow_head(split="test")
X, y = load_basic_motions()
plaid_train, _ = load_plaid(split="train")
print("Train shape = ", train_x.shape, " test shape = ", test_x.shape)
Train shape = (36, 1, 251) test shape = (175, 1, 251)
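The cells above cover the .ts loaders. For .tsf files, aeon.datasets also provides load_from_tsf_file. As a minimal sketch (assuming it returns the parsed series together with a dictionary of the header metadata, which may vary slightly between aeon versions), loading the small example .tsf file shown earlier might look like this:
[ ]:
import os
import tempfile

from aeon.datasets import load_from_tsf_file

# Write the example .tsf contents shown earlier to a temporary file
tsf_example = """@relation test
@attribute series_name string
@attribute start_timestamp date
@frequency yearly
@horizon 4
@missing false
@equallength false
@data
T1:1979-01-01 00-00-00:25092.2284,24271.5134,25828.9883,27697.5047,27956.2276,29924.4321,30216.8321
T2:1979-01-01 00-00-00:887896.51,887068.98,971549.04
T3:1973-01-01 00-00-00:227921,230995,183635,238605,254186
"""
tmp_dir = tempfile.mkdtemp()
tsf_path = os.path.join(tmp_dir, "example.tsf")
with open(tsf_path, "w") as f:
    f.write(tsf_example)

data, metadata = load_from_tsf_file(tsf_path)
print(metadata)
print(data)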
Loading directly from tsc.com¶
You can also load .ts data directly from timeseriesclassification.com using the function load_classification(name, split=None, extract_path=None, return_metadata=False). This function downloads the zip file from the website and unpacks it in the specified directory. It does not download if the file is already present in extract_path or in aeon/datasets/data. If you do not give an extract path, it looks in aeon/datasets/data first, then writes to aeon/datasets/local_data.
[3]:
from aeon.datasets import load_classification
# This will not download, because ArrowHead is already in the directory.
# Change the extract path or the dataset name to force a download
X, y, meta_data = load_classification("ArrowHead", return_metadata=True)
print(" Shape of X = ", X.shape)
print(" Meta data = ", meta_data)
Shape of X = (211, 1, 251)
Meta data = {'problemname': 'arrowhead', 'timestamps': False, 'missing': False, 'univariate': True, 'equallength': True, 'classlabel': True, 'targetlabel': False, 'class_values': ['0', '1', '2']}
For more details, see the Load from web notebook.
Writing .ts files¶
You can write data to a .ts file by calling the function write_to_ts_file in the datasets module. It has the following method signature:
def write_to_ts_file(X, path, y=None, problem_name="sample_data.ts", header=None, regression=False):
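As a minimal sketch of how this might be used (here with the ArrowHead training data loaded earlier; the exact output file name and directory layout may depend on your aeon version, so the snippet simply lists whatever .ts files were written to the temporary directory):
[ ]:
import glob
import os
import tempfile

from aeon.datasets import load_arrow_head, write_to_ts_file

# Load a small collection and write it back out as a .ts file
X, y = load_arrow_head(split="train")
out_dir = tempfile.mkdtemp()
write_to_ts_file(X, out_dir, y=y, problem_name="ArrowHeadCopy.ts")

# List the .ts files that were created
print(glob.glob(os.path.join(out_dir, "**", "*.ts"), recursive=True))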
Weka .ARFF files¶
The Weka Java toolkit uses a file format called the Attribute-Relation File Format (ARFF), which was the original basis for the .ts and .tsf formats. Information on .arff files can be found here. ARFF files can be used to store equal length univariate and multivariate problems. They cannot handle unequal length series.
Loading from Weka ARFF files¶
It is also possible to load data from Weka's attribute-relation file format (ARFF) files. Data for time series problems are made available in this format at www.timeseriesclassification.com. The load_from_arff_file method in aeon.datasets supports reading data for both univariate and multivariate time series problems. For example, we can load the ArrowHead data from an ARFF file rather than a .ts file.
[4]:
from aeon.datasets import load_from_arff_file
X, y = load_from_arff_file(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.arff"))
UCR .tsv files¶
A further option is to load data into aeon from tab separated value (.tsv) files. Researchers at the University of California, Riverside make a variety of time series data available in this format at Eamonn Keogh's website. Each row is a time series, and the class value is the first entry in the row.
The load_from_tsv_file method in aeon.datasets supports reading univariate problems. An example with ArrowHead is given below to demonstrate equivalence with loading from the .ts and ARFF file formats.
[5]:
from aeon.datasets import load_from_tsv_file
X, y = load_from_tsv_file(os.path.join(DATA_PATH, "ArrowHead/ArrowHead_TRAIN.tsv"))
TimeEval .csv files¶
Anomaly detection datasets are stored in the TimeEval canonical file format. There are three different learning types for anomaly detection: supervised, semi-supervised and unsupervised. Unsupervised datasets contain just a test time series, while semi-supervised and supervised datasets come with both a test and a training time series. The time series are stored in a .csv file with the following columns (with headers):
- Index 0 (timestamp): the timestamp of the data points.
- One or more columns (= channels) containing the time series data.
- Index -1 (is_anomaly): the label of the data points, where 1 indicates an anomalous data point and 0 indicates a normal data point.
Researchers at the Hasso Plattner Institute, Potsdam, Germany published a variety of anomaly detection datasets in this format. These datasets are available at this URL.
The load_from_timeeval_csv_file method in aeon.datasets supports reading data for anomaly detection problems. An example for the multivariate Daphnet dataset is given below.
[6]:
from aeon.datasets import load_from_timeeval_csv_file
AD_DATA_PATH = os.path.join(os.path.dirname(aeon.__file__), "datasets/data")
X, y = load_from_timeeval_csv_file(
os.path.join(AD_DATA_PATH, "Daphnet_S06R02E0/S06R02E0.csv")
)
X.shape, y.shape
[6]:
((7040, 9), (7040,))
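The raw column layout described above can also be inspected directly with pandas (a quick sketch using the same bundled Daphnet file; pandas is a core aeon dependency):
[ ]:
import os

import pandas as pd

import aeon

ad_data_path = os.path.join(os.path.dirname(aeon.__file__), "datasets/data")
df = pd.read_csv(os.path.join(ad_data_path, "Daphnet_S06R02E0/S06R02E0.csv"))
# First column is the timestamp, last column the is_anomaly label,
# and each column in between is one channel of the time series
print(df.columns[0], df.columns[-1], df.shape)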