Deep dive into distance profiles

In this notebook, we will talk more about the theory behind distance profiles, how they are computed, and how they can be optimized. For practical experiments on the speedups implemented in aeon, refer to the notebook Analysis of the speedups provided by similarity search module.

What are distance profiles?

In the context of similarity search, where we have as input a time series \(X = \{x_1, \ldots, x_m\}\) and a query \(Q = \{q_1, \ldots, q_l\}\), a distance profile is defined as the vector containing the distance between \(Q\) and every subsequence of size \(l\) in \(X\), with the \(i^{th}\) subsequence denoted by \(W_i = \{x_i, \ldots, x_{i+(l-1)}\}\).

Given a distance or dissimilarity function \(dist\), such as the Euclidean distance, the distance profile \(P(X,Q)\) is expressed as:

\[P(X, Q) = \{dist(W_1, Q), \ldots, dist(W_{m-(l-1)}, Q)\}\]

We can then find the “best match” between \(Q\) and \(X\) by looking at the minimum value of the distance profile and extracting the subsequence \(W_{\text{argmin} P(X,Q)}\) as the best match.
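To make this definition concrete, here is a minimal, unoptimized NumPy sketch of a Euclidean distance profile (the function naive_distance_profile is purely illustrative and not part of the aeon API):

[ ]:

import numpy as np

def naive_distance_profile(X, Q):
    """Euclidean distance between Q and every subsequence of X, computed naively."""
    m, l = X.shape[0], Q.shape[0]
    P = np.zeros(m - l + 1)
    for i in range(m - l + 1):
        W_i = X[i : i + l]  # the i-th subsequence of length l
        P[i] = np.sqrt(np.sum((W_i - Q) ** 2))
    return P

rng = np.random.default_rng(seed=0)
X = rng.random(100)
Q = X[20:30]  # a query of length l=10 extracted from X itself
P = naive_distance_profile(X, Q)
print(np.argmin(P))  # 20: Q matches itself exactly, a "trivial match" (see below)

This naive version costs \(O(m \times l)\) per query; the rest of the notebook discusses how to do better.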

Trivial matches

One should be careful of what are called “trivial matches” in this situation. If \(Q\) is extracted from \(X\), it will match with itself, as \(dist(Q,Q)=0\). To avoid this, it is common to set the parts of the distance profile that are neighbors of \(Q\) to \(\infty\). This is the role of the q_index parameter in the similarity search predict methods. The exclusion_factor parameter defines which neighbors of \(Q\) will also be set to \(\infty\).

For example, if \(Q\) was extracted at index \(i\) in \(X\) (i.e. \(Q = \{x_i, \ldots, x_{i+(l-1)}\}\)), then all points in the interval [i - l//exclusion_factor, i + l//exclusion_factor] will be set to \(\infty\) in the distance profile to avoid a trivial match.

The same reasoning can also be applied to the best matches of \(Q\). It is highly likely that the two best matches will be neighbors, since \(W_i\) and \(W_{i \pm 1}\) share \(l-1\) values. The apply_exclusion_to_result boolean parameter in predict allows you to apply the exclusion zone defined by [i - l//exclusion_factor, i + l//exclusion_factor] to the output of the algorithm.
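As an illustration, here is one way such an exclusion zone could be applied to a distance profile. The helper apply_exclusion_zone is hypothetical (it simply follows the interval definition above) and is not aeon's implementation:

[ ]:

import numpy as np

def apply_exclusion_zone(P, i, l, exclusion_factor=2.0):
    """Return a copy of the distance profile P with the zone around index i set to +inf."""
    radius = int(l // exclusion_factor)
    P = P.copy()
    P[max(0, i - radius) : i + radius + 1] = np.inf
    return P

# Reusing P and Q from the previous cell: Q was extracted at index 20 with l=10,
# so indices 15 to 25 are excluded and argmin now returns a non-trivial match.
P_masked = apply_exclusion_zone(P, i=20, l=10)
print(np.argmin(P_masked))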

Optimizing the distance profile computation

The main idea behind optimizing the distance profile computation, in the case where we want to obtain exact results (i.e. make no approximation), is to avoid recomputation as much as possible. For example, when computing the mean and standard deviation of the subsequences \(W_1, \ldots, W_{m-l+1}\), instead of computing these statistics independently for each subsequence \(W_i\), we can exploit the fact that these subsequences are extracted from the same time series.

Consider the case of the mean of \(W_i\) expressed as \(\displaystyle{\mu_i = \frac{1}{l}\sum_{j=0}^{l-1} x_{i+j}}\). Instead of completely recomputing the mean of \(W_{i+1}\), we can keep a rolling sum \(S\) and update it as we go from \(\mu_1\) to \(\mu_{m-l+1}\).

Let \(\displaystyle{S_1 = \sum_{j=0}^{l-1} x_{1+j}}\). We compute the mean of \(W_1\) as \(\displaystyle{\mu_1 = \frac{1}{l} S_1}\). Then we can compute \(S_2\) as \(S_2 = S_1 - x_1 + x_{2+l-1}\) and compute \(\displaystyle{\mu_2 = \frac{1}{l} S_2}\). This generalizes to \(S_i = S_{i-1} - x_{i-1} + x_{i+l-1}\).

We can apply the same reasoning for computing the standard deviation by keeping a rolling sum of squared values. For a code example, you can check the sliding_mean_std_one_series function of aeon.
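Below is a minimal NumPy sketch of these rolling statistics, using the identity \(\sigma_i^2 = \frac{1}{l}\sum_{j=0}^{l-1} x_{i+j}^2 - \mu_i^2\). This is a simplified stand-in for aeon's sliding_mean_std_one_series, not the actual implementation:

[ ]:

import numpy as np

def sliding_mean_std(X, l):
    """Mean and standard deviation of every length-l subsequence via rolling sums."""
    n = X.shape[0] - l + 1
    means, stds = np.empty(n), np.empty(n)
    S = X[:l].sum()          # rolling sum
    S2 = (X[:l] ** 2).sum()  # rolling sum of squared values
    for i in range(n):
        if i > 0:
            # S_i = S_{i-1} - x_{i-1} + x_{i+l-1} (0-based indexing here)
            S += X[i + l - 1] - X[i - 1]
            S2 += X[i + l - 1] ** 2 - X[i - 1] ** 2
        means[i] = S / l
        # Var(W_i) = E[x^2] - E[x]^2, clipped at 0 for numerical safety
        stds[i] = np.sqrt(max(S2 / l - means[i] ** 2, 0.0))
    return means, stds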

Optimization for the (squared) Euclidean distance

In this section, we consider the case of the squared Euclidean distance (i.e. no square root), but the following also holds for the Euclidean distance by using \(\displaystyle{\sqrt{d(W_i, Q)}}\).

Applying the same reasoning as for the computation of the mean, we can minimize recomputation in the case of the Euclidean distance profile. If we consider the Euclidean distance between a subsequence \(W_i = \{x_i, \ldots, x_{i+(l-1)}\}\) extracted from a time series \(X = \{x_1, \ldots, x_m\}\) and a query \(Q = \{q_1, \ldots, q_l\}\), we have:

\(\displaystyle{d(W_i, Q) = \sum_{j=1}^{l} (q_j - x_{i+j-1})^2}\)

\(\displaystyle{d(W_i, Q) = \sum_{j=1}^{l} q_j^2 + \sum_{j=1}^{l} x_{i+j-1}^2 - 2 \times \sum_{j=1}^{l} q_j \times x_{i+j-1}}\)

Writing the Euclidean distance in this second form allows us to:

- Compute the sum \(\sum_{j=1}^{l} q_j^2\) only once for all \(W_i\).
- Keep a rolling sum to update \(\sum_{j=1}^{l} x_{i+j-1}^2\) as we go from \(W_1\) to \(W_{m-l+1}\).
- Efficiently compute the dot products \(\sum_{j=1}^{l} q_j \times x_{i+j-1}\) as a cross-correlation \(X * Q\). For large inputs, this can also be computed in the frequency domain with a convolution (see the scipy convolve function), in which case the query \(Q\) must be flipped before computing the convolution.
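Putting these three points together, a possible NumPy/SciPy sketch of the squared Euclidean distance profile looks like this (here the rolling sum of squares is obtained with a cumulative sum, which is equivalent to the update rule above):

[ ]:

import numpy as np
from scipy.signal import convolve

def squared_euclidean_distance_profile(X, Q):
    """Squared Euclidean distance profile using the dot product decomposition."""
    l = Q.shape[0]
    q2_sum = np.sum(Q ** 2)  # computed only once for all W_i
    # Rolling sum of x^2 over each subsequence, via a cumulative sum
    csum = np.concatenate(([0.0], np.cumsum(X ** 2)))
    x2_sums = csum[l:] - csum[:-l]
    # Cross-correlation X * Q, computed as a convolution with the flipped query
    QX = convolve(X, Q[::-1], mode="valid")
    return q2_sum + x2_sums - 2 * QX

Note that scipy.signal.convolve automatically switches to an FFT-based method when it is faster, which corresponds to the frequency-domain computation mentioned above.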

What about the normalized (squared) Euclidean distance?

The trick we use for the non-normalized Euclidean distance does not apply to the normalized distance, as we cannot keep a rolling sum between subsequences with different means and standard deviations. However, we can compute the normalized Euclidean distance from the Pearson correlation coefficient, as shown in the paper Matrix Profile I: All Pairs Similarity Joins for Time Series.

Consider the mean \(\mu_i\) and standard deviation \(\sigma_i\) of a subsequence \(W_i\) (computed with rolling sums) and the mean \(\mu_q\) and standard deviation \(\sigma_q\) of the query. Given \(QW_i\), the dot product between \(Q\) and \(W_i\) obtained from a cross-correlation or a convolution operation between \(X\) and \(Q\), we can compute the normalized squared Euclidean distance as:

\(\displaystyle{d(W_i, Q) = 2l\left(1 - \frac{QW_i - l \cdot \mu_q \cdot \mu_i}{l \cdot \sigma_q \cdot \sigma_i}\right)}\)
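Combining the rolling statistics with the dot products, a sketch of the normalized squared Euclidean distance profile could look like this (reusing the sliding_mean_std helper defined in a previous cell; again an illustration, not aeon's implementation):

[ ]:

import numpy as np
from scipy.signal import convolve

def normalized_squared_distance_profile(X, Q):
    """Z-normalized squared Euclidean distance profile via the Pearson correlation."""
    l = Q.shape[0]
    mu_q, sigma_q = Q.mean(), Q.std()
    means, stds = sliding_mean_std(X, l)     # rolling means and stds of the W_i
    QW = convolve(X, Q[::-1], mode="valid")  # dot products between Q and each W_i
    # Pearson correlation between Q and each subsequence W_i
    # (assumes no constant subsequence, i.e. stds > 0 everywhere)
    rho = (QW - l * mu_q * means) / (l * sigma_q * stds)
    return 2 * l * (1 - rho)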
