Deep dive in the distance profiles#

In this notebook, we will talk more about the theory behind distance profile, how they are computed, and how they can be optimized. For practical experiments on the speedups implemented in aeon, refer to the notebook on the Analysis of the speedups provided by similarity search module notebook.

What are distance profiles ?#

In the context of similarity search, where we have as input a time series \(X = \{x_1, \ldots, x_m\}\) and a query \(Q = \{q_1, \ldots, q_l\}\), a distance profile is defined as a vector containing the similarity of \(Q\) to every subsequence of size \(l\) in \(X\), with the \(i^{th}\) subsequence denoted by \(W_i = \{x_i, \ldots, x_{i+(l-1)}\}\).

Given a distance or dissimilarity function \(dist\), such as the Euclidean distance, the distance profile \(P(X,Q)\) is expressed as :

\[P(X, Q) = \{dist(W_1, Q), \ldots, dist(W_{m-(l-1)}, Q)\}\]

We can then find the “best match” between \(Q\) and \(X\) by looking at the distance profile minimum value and extract the subsequence \(W_{\text{argmin} P(X,Q)}\) as the best match.

Trivial matches#

One should be careful of what is called “trivial matches” in this situation. If \(Q\) is extracted from \(X\), it is extremely likely that it will match with itself, as \(dist(Q,Q)=0\). To avoid this, it is common to set the parts of the distance profile that are neighbors to \(Q\) to \(\infty\). This is the role of the q_index parameter in the similarity search predict methods. The exclusion_factor parameter is used to define the neighbors of \(Q\) that will also get \(\infty\) value.

For example, if \(Q\) was extracted at index \(i\) in \(X\) (i.e. \(Q = \{x_i, \ldots, x_{i+(l-1)}\}\)), then all points in the interval [i - l//exclusion_factor, i + l//exclusion_factor] will the set to \(\infty\) in the distance profile to avoid a trivial match.

Generated using nbsphinx. The Jupyter notebook can be found here.