Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods
Time-series clustering is one of the most popular tasks in time series analysis, offering a pathway for unsupervised data exploration and often acting as a subroutine for other tasks. Despite being the subject of active research across disciplines for decades, there has been limited focus on benchmarking clustering methods for time series data. Unfortunately, existing studies have (i) omitted popular methods and entire classes of methods; (ii) considered limited choices for underlying distance measures; (iii) performed evaluations on a small number of datasets; (iv) avoided statistical validation of the findings; (v) suffered from poor reproduction of existing methods; or (vi) used questionable evaluation settings. Moreover, the growing enthusiasm for deep learning, particularly with the rise of foundation models that claim superior generalization across tasks and domains, highlights the need for a comprehensive evaluation, as their applicability to time-series clustering remains underexplored. Motivated by the aforementioned limitations, we comprehensively evaluate 84 clustering methods for time-series data, encompassing 10 different classes derived from data mining, machine learning, and deep learning literature. The evaluation is conducted across 128 different time-series datasets using rigorous statistical analysis.
If you find our work helpful, please consider citing:
"Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods" John Paparrizos and Sai Prasanna Teja Reddy VLDB 2025.
@article{paparrizos2025time,
title={Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods},
author={Paparrizos, John and Reddy, SPT},
journal={Proceedings of the VLDB Endowment},
volume={18},
number={11},
pages={4380--4395},
year={2025}
}
"Odyssey: An Engine Enabling The Time-Series Clustering Journey" John Paparrizos and Sai Prasanna Teja Reddy VLDB 2023.
@article{paparrizos2023odyssey,
title={Odyssey: An engine enabling the time-series clustering journey},
author={Paparrizos, John and Reddy, Sai Prasanna Teja},
journal={Proceedings of the VLDB Endowment},
volume={16},
number={12},
pages={4066--4069},
year={2023},
publisher={VLDB Endowment}
}
"Bridging the Gap: A Decade Review of Time-Series Clustering Methods" John Paparrizos, Fan Yang, and Haojun Li.
@article{paparrizos2024bridging,
title={Bridging the gap: A decade review of time-series clustering methods},
author={Paparrizos, John and Yang, Fan and Li, Haojun},
journal={arXiv preprint arXiv:2412.20582},
year={2024}
}
We conduct our evaluation using the UCR Time-Series Archive, the largest collection of class-labeled time-series datasets. The archive consists of 128 datasets sourced from different sensor readings, recorded while performing diverse tasks across multiple domains. Each dataset contains between 40 and 24,000 time series, with lengths varying from 15 to 2,844. Datasets are z-normalized, and each time series belongs to exactly one class. A small subset of datasets in the archive contains missing values and varying lengths. We use linear interpolation to fill the missing values and resample shorter time series to match the length of the longest time series in each dataset.
To ease reproducibility, we share our results over an established benchmark:
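The preprocessing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's preprocessing script: the function names (`fill_missing`, `resample`, `z_normalize`, `preprocess_dataset`) are hypothetical, missing values are assumed to be encoded as NaN, gaps at the edges are assumed to copy the nearest observed value, and applying z-normalization as the last step is also an assumption.

```python
import math


def fill_missing(series):
    """Fill NaN gaps by linear interpolation between observed neighbours.

    Assumption: gaps at the start or end copy the nearest observed value.
    """
    known = [i for i, v in enumerate(series) if not math.isnan(v)]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(series)
    # Leading and trailing gaps: extend the first/last observed value.
    for i in range(known[0]):
        filled[i] = series[known[0]]
    for i in range(known[-1] + 1, len(series)):
        filled[i] = series[known[-1]]
    # Interior gaps: interpolate between the surrounding observed points.
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


def resample(series, target_len):
    """Stretch a series to target_len points using linear interpolation."""
    n = len(series)
    if n == target_len:
        return list(series)
    if n == 1 or target_len == 1:
        return [series[0]] * target_len
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)  # fractional index into series
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def z_normalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0.0:
        return [0.0] * n  # constant series: avoid division by zero
    return [(v - mean) / std for v in series]


def preprocess_dataset(dataset):
    """Fill gaps, resample every series to the longest length, then z-normalize."""
    filled = [fill_missing(s) for s in dataset]
    target_len = max(len(s) for s in filled)
    return [z_normalize(resample(s, target_len)) for s in filled]
```

For example, `preprocess_dataset([[1.0, float("nan"), 3.0], [0.0, 2.0]])` returns two series of length 3.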
- The UCR Univariate Archive, which contains 128 univariate time-series datasets.
- Download all 128 preprocessed datasets here.
For the preprocessing steps check here.
TSClusterX is designed to provide a unified platform for evaluating time series clustering algorithms with support for various distance measures, clustering models, and evaluation metrics. The framework follows a factory design pattern that makes it easy to extend with new components.
- Multiple Clustering Models: Support for traditional clustering algorithms (K-means, Agglomerative, DBSCAN) and specialized time series clustering methods
- Diverse Distance Measures: Implementation of various time series distance measures including DTW, GAK, SBD, MSM, TWED, and more
- Extensible Architecture: Factory design pattern allows easy addition of new models, distances, dataloaders, and metrics
- Standard Datasets: Built-in support for the UCR/UEA time-series archive
- Evaluation Metrics: Comprehensive evaluation with Rand Index, Adjusted Rand Index, and Normalized Mutual Information
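To give a sense of what one of the listed distance measures computes, here is a minimal, dependency-free sketch of dynamic time warping (DTW) and of a pairwise distance matrix built from it. It is an illustration only and is not taken from TSClusterX: the names `dtw_distance` and `pairwise_matrix`, the optional `window` argument (a Sakoe-Chiba band), the squared point-wise cost, and the final square root are all assumptions of this sketch.

```python
def dtw_distance(a, b, window=None):
    """Classic O(len(a) * len(b)) dynamic-programming DTW.

    window: optional Sakoe-Chiba band half-width; None means unconstrained.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or no warping path exists.
    w = max(n, m) if window is None else max(window, abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] ** 0.5


def pairwise_matrix(series_set, distance=dtw_distance):
    """Symmetric matrix of pairwise distances between all series."""
    k = len(series_set)
    matrix = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            matrix[i][j] = matrix[j][i] = distance(series_set[i], series_set[j])
    return matrix
```

Because DTW may align one point with several, `dtw_distance([0, 0, 1, 2], [0, 1, 2])` is zero even though the two series differ in length.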
Python 3.7+ is required. Install the dependencies:
pip install -r requirements.txt
# Run clustering on UCR datasets with K-means and Euclidean distance
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
--dataset_path data/UCR2018/ --model kmeans --distance euclidean
# Run with SBD distance and agglomerative clustering
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
--dataset_path data/UCR2018/ --model agglomerative --distance sbd
# Use parameter configuration files
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
--dataset_path data/UCR2018/ --model dbscan --distance euclidean \
--parameter_settings parameters/dbscan.json --metrics RI ARI NMI
- `--dataset`: Dataset type (default: 'ucr_uea')
- `--start`: Start index for UCR datasets (default: 1)
- `--end`: End index for UCR datasets (default: 128)
- `--dataset_path`: Path to dataset directory
- `--model`: Clustering model name
- `--distance`: Distance measure name
- `--parameter_settings`: JSON file with model parameters
- `--metrics`: List of evaluation metrics to compute
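As a rough illustration of how a command-line interface with these flags could be declared, the sketch below uses Python's standard `argparse` module. It is not the project's `main.py`: the flag names and the defaults for `--dataset`, `--start`, and `--end` come from the list above, while the `build_parser` name, which flags are required, and the default for `--metrics` are assumptions.

```python
import argparse


def build_parser():
    """Declare the flags documented above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Time-series clustering runner")
    parser.add_argument("--dataset", default="ucr_uea", help="Dataset type")
    parser.add_argument("--start", type=int, default=1,
                        help="Start index for UCR datasets")
    parser.add_argument("--end", type=int, default=128,
                        help="End index for UCR datasets")
    # Assumption: path and model are mandatory; the source does not say.
    parser.add_argument("--dataset_path", required=True,
                        help="Path to dataset directory")
    parser.add_argument("--model", required=True, help="Clustering model name")
    parser.add_argument("--distance", default=None, help="Distance measure name")
    parser.add_argument("--parameter_settings", default=None,
                        help="JSON file with model parameters")
    # Assumption: all three documented metrics are computed by default.
    parser.add_argument("--metrics", nargs="+", default=["RI", "ARI", "NMI"],
                        help="List of evaluation metrics to compute")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args([
    "--dataset_path", "data/UCR2018/",
    "--model", "kmeans",
    "--distance", "euclidean",
    "--end", "10",
    "--metrics", "RI", "ARI",
])
```

Using `nargs="+"` lets `--metrics RI ARI NMI` be written as a space-separated list, matching the usage examples above.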
TSClusterX uses a factory design pattern for extensibility:
from models.model import ModelFactory
# Get a clustering model
model = ModelFactory.get_model('kmeans', n_clusters=3, params={'init': 'k-means++'})
from distances.distance import DistanceFactory
# Get a distance measure
distance = DistanceFactory.get_distance('dtw')
distance_matrix = distance.compute(time_series_data)
from dataloaders.dataloader import DataLoaderFactory
# Get a data loader
dataloader = DataLoaderFactory.get_dataloader('ucr_uea', 'data/UCR2018/')
ts, labels, n_clusters = dataloader.load('Chinatown')
from metrics.metric import ClusterMetrics
# Evaluate clustering results
metrics = ClusterMetrics(true_labels, predicted_labels)
ri = metrics.rand_score()
ari = metrics.adjusted_rand_score()
nmi = metrics.normalized_mutual_information()
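For readers who want to see what the Rand Index (RI) and Adjusted Rand Index (ARI) measure, the following standalone sketch computes both from pair counts. It is independent of the `ClusterMetrics` class used above; the function names `pair_counts`, `rand_index`, and `adjusted_rand_index` are hypothetical, and returning 1.0 for degenerate inputs is a convention chosen for this sketch.

```python
from collections import Counter


def _pairs(count):
    """Number of unordered pairs that can be drawn from `count` items."""
    return count * (count - 1) // 2


def pair_counts(true_labels, pred_labels):
    """Count pairs grouped together in both, in truth, and in prediction."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    contingency = Counter(zip(true_labels, pred_labels))
    same_both = sum(_pairs(c) for c in contingency.values())
    same_true = sum(_pairs(c) for c in Counter(true_labels).values())
    same_pred = sum(_pairs(c) for c in Counter(pred_labels).values())
    return same_both, same_true, same_pred, _pairs(len(true_labels))


def rand_index(true_labels, pred_labels):
    """Fraction of pairs on which the two labelings agree."""
    both, same_true, same_pred, total = pair_counts(true_labels, pred_labels)
    if total == 0:
        return 1.0
    # Agreements = pairs together in both + pairs apart in both.
    return (total + 2 * both - same_true - same_pred) / total


def adjusted_rand_index(true_labels, pred_labels):
    """Rand Index corrected for the agreement expected by chance."""
    both, same_true, same_pred, total = pair_counts(true_labels, pred_labels)
    if total == 0:
        return 1.0
    expected = same_true * same_pred / total
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (both - expected) / (maximum - expected)
```

Both scores ignore the actual label values: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` describe the same partition and score 1.0 on each metric.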
The factory design pattern makes TSClusterX highly extensible. Here's how to add new components:
- Create a new model file in `TSClusterX/models/`:
# mymodel.py
from models.model import BaseClusterModel

class MyClusterModel(BaseClusterModel):
    def fit_predict(self, X):
        # Implement your clustering algorithm
        # Return labels and elapsed time
        return labels, elapsed_time
- Register it in the ModelFactory in `models/model.py`:
elif model_name == 'mymodel':
    from models import mymodel
    return mymodel.MyClusterModel(n_clusters, params, distance_name, distance_matrix)
- Create a new distance file in `TSClusterX/distances/`:
# mydistance.py
from distances.distance import DistanceMeasure

class MyDistance(DistanceMeasure):
    def compute(self, series_set):
        # Implement distance computation
        # Return distance matrix
        return distance_matrix
- Register it in the DistanceFactory in `distances/distance.py`:
elif name == "mydistance":
    from distances.mydistance import MyDistance
    return MyDistance()
- Create a new dataloader file in `TSClusterX/dataloaders/`:
# mydataloader.py
class MyDataLoader:
    def __init__(self, dataset_name, dataset_path):
        self.name = dataset_name
        self.path = dataset_path

    def load(self, dataset_name):
        # Load your dataset
        # Return time_series, labels, n_clusters
        return ts, labels, n_clusters
- Register it in `dataloaders/dataloader.py`:
elif dataset_name == 'mydataset':
    from .mydataloader import MyDataLoader
    return MyDataLoader(dataset_name, dataset_path)
Extend the ClusterMetrics class in `metrics/metric.py`:
def my_custom_metric(self):
    # Implement your metric
    return metric_value
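The registration steps above all follow the same idea: a factory maps a name to a class and builds an instance on request. The sketch below shows that idea in a self-contained form. It is not TSClusterX's code: the snippets above extend an `elif` chain, whereas this sketch uses a decorator-based registry as an alternative, and `BaseClusterModel` here is a simplified stand-in, while `register`, `get_model`, and `RoundRobinModel` are hypothetical names.

```python
import time


class BaseClusterModel:
    """Simplified stand-in for a clustering model base class."""

    def __init__(self, n_clusters, params=None):
        self.n_clusters = n_clusters
        self.params = params or {}

    def fit_predict(self, X):
        raise NotImplementedError


_REGISTRY = {}  # maps a model name to its class


def register(name):
    """Class decorator that records a model class under `name`."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


@register("round_robin")
class RoundRobinModel(BaseClusterModel):
    """Toy model: assigns series to clusters in turn (illustration only)."""

    def fit_predict(self, X):
        start = time.perf_counter()
        labels = [i % self.n_clusters for i in range(len(X))]
        # Return labels and elapsed time, as in the model template above.
        return labels, time.perf_counter() - start


def get_model(name, n_clusters, params=None):
    """Look up `name` in the registry and build the corresponding model."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY)) or "none"
        raise ValueError(f"unknown model '{name}' (registered: {known})") from None
    return cls(n_clusters, params)
```

With a registry, adding a component needs no edit to the factory function itself; the trade-off is that the module defining the class must be imported before the lookup happens.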
Model parameters can be specified using JSON configuration files:
{
    "eps": 0.5,
    "min_samples": 5,
    "metric": "euclidean"
}
Place configuration files in the `parameters/` directory and reference them with `--parameter_settings`.
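A configuration file like the one above can be read with Python's standard `json` module and merged over default values. The sketch below is an assumption about how such loading might look, not the project's implementation: `load_parameters` and the default values are hypothetical, and the demonstration writes the example configuration to a temporary directory rather than to `parameters/`.

```python
import json
import os
import tempfile


def load_parameters(path, defaults=None):
    """Return defaults overridden by the key/value pairs stored in a JSON file."""
    params = dict(defaults or {})
    if path is not None:
        with open(path, "r", encoding="utf-8") as fh:
            loaded = json.load(fh)
        if not isinstance(loaded, dict):
            raise ValueError("parameter file must contain a JSON object")
        params.update(loaded)
    return params


# Demonstration: write the DBSCAN example to a temporary file and load it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "dbscan.json")
    with open(config_path, "w", encoding="utf-8") as fh:
        json.dump({"eps": 0.5, "min_samples": 5, "metric": "euclidean"}, fh)
    params = load_parameters(config_path, defaults={"eps": 0.3, "min_samples": 3})
```

Values from the file take precedence over the defaults, so `params["eps"]` is 0.5 after loading.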
# Test different distances with K-means
for distance in euclidean dtw gak sbd; do
    python TSClusterX/main.py --dataset ucr_uea --start 1 --end 5 \
        --dataset_path data/UCR2018/ --model kmeans --distance "$distance"
done
# Run DBSCAN with DTW distance and a parameter configuration file
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
--dataset_path data/UCR2018/ --model dbscan --distance dtw \
--parameter_settings parameters/dbscan.json --metrics RI ARI NMI
Results are automatically saved in the `results/` directory, organized by model type. Each run generates evaluation metrics and timing information.
Contributions are welcome! The factory design pattern makes it easy to add new:
- Clustering algorithms
- Distance measures
- Dataset loaders
- Evaluation metrics
Please follow the existing patterns when adding new components.
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
𝑘-AVG | ED | [1] |
KASBA | MSM | [39] |
𝑘-Shape | SBD | [3] |
𝑘-SC | STID | [5] |
𝑘-DBA | DTW | [4] |
PAM | MSM | [2] |
PAM | TWED | [2] |
PAM | ERP | [2] |
PAM | SBD | [2] |
PAM | SWALE | [2] |
PAM | DTW | [2] |
PAM | EDR | [2] |
PAM | LCSS | [2] |
PAM | ED | [2] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
KKM | SINK | [6] |
KKM | GAK | [6] |
KKM | KDTW | [6] |
KKM | RBF | [6] |
SC | SINK | [7] |
SC | GAK | [7] |
SC | KDTW | [7] |
SC | RBF | [7] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
DBSCAN | ED | [8] |
DBSCAN | SBD | [8] |
DBSCAN | MSM | [8] |
DP | ED | [10] |
DP | SBD | [10] |
DP | MSM | [10] |
OPTICS | ED | [9] |
OPTICS | SBD | [9] |
OPTICS | MSM | [9] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
AGG | ED | [11] |
AGG | SBD | [11] |
AGG | MSM | [11] |
BIRCH | - | [12] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
AP | ED | [13] |
AP | SBD | [13] |
AP | MSM | [13] |
GMM | - | [14] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
UShapelet | - | [15] |
LDPS | - | [16] |
USLM | - | [17] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
𝑘-AVG | AR-COEFF | [20] |
𝑘-AVG | AR-PVAL | [22] |
𝑘-AVG | LPCC | [21] |
𝑘-AVG | CATCH22 | [23] |
𝑘-AVG | ES-COEFF | [22] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
IDEC | - | [27] |
DEC | - | [26] |
DTC | - | [29] |
DTCR | - | [28] |
SOM-VAE | - | [30] |
DEPICT | - | [31] |
SDCN | - | [32] |
VADE | - | [33] |
DCN | - | [25] |
Clustering Method | Distance Measure / Feature Vector | Reference |
---|---|---|
MOMENT | - | [38] |
OFA | - | [37] |
CHRONOS | - | [36] |
[1] MacQueen, J. "Some methods for classification and analysis of multivariate observations." In Proc. 5th Berkeley Symposium on Math., Stat., and Prob, p. 281. 1965.
[2] Kaufman, Leonard, and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
[3] Paparrizos, John, and Luis Gravano. "k-shape: Efficient and accurate clustering of time series." In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp. 1855-1870. 2015.
[4] Petitjean, François, Alain Ketterlin, and Pierre Gançarski. "A global averaging method for dynamic time warping, with applications to clustering." Pattern Recognition 44, no. 3 (2011): 678-693.
[5] Yang, Jaewon, and Jure Leskovec. "Patterns of temporal variation in online media." In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 177-186. 2011.
[6] Dhillon, Inderjit S., Yuqiang Guan, and Brian Kulis. "Kernel k-means: spectral clustering and normalized cuts." In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 551-556. 2004.
[7] Ng, Andrew, Michael Jordan, and Yair Weiss. "On spectral clustering: Analysis and an algorithm." Advances in neural information processing systems 14 (2001).
[8] Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. "A density-based algorithm for discovering clusters in large spatial databases with noise." In KDD, vol. 96, no. 34, pp. 226-231. 1996.
[9] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. "OPTICS: Ordering points to identify the clustering structure." ACM SIGMOD Record 28, no. 2 (1999): 49-60.
[10] Rodriguez, Alex, and Alessandro Laio. "Clustering by fast search and find of density peaks." Science 344, no. 6191 (2014): 1492-1496.
[11] Kaufman, Leonard, and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
[12] Zhang, Tian, Raghu Ramakrishnan, and Miron Livny. "BIRCH: an efficient data clustering method for very large databases." ACM SIGMOD Record 25, no. 2 (1996): 103-114.
[13] Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." Science 315, no. 5814 (2007): 972-976.
[14] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." Journal of the royal statistical society: series B (methodological) 39, no. 1 (1977): 1-22.
[15] Zakaria, Jesin, Abdullah Mueen, and Eamonn Keogh. "Clustering time series using unsupervised-shapelets." In 2012 IEEE 12th International Conference on Data Mining, pp. 785-794. IEEE, 2012.
[16] Lods, Arnaud, Simon Malinowski, Romain Tavenard, and Laurent Amsaleg. "Learning DTW-preserving shapelets." In Advances in Intelligent Data Analysis XVI: 16th International Symposium, IDA 2017, London, UK, October 26–28, 2017, Proceedings 16, pp. 198-209. Springer International Publishing, 2017.
[17] Zhang, Qin, Jia Wu, Hong Yang, Yingjie Tian, and Chengqi Zhang. "Unsupervised feature learning from time series." In IJCAI, pp. 2322-2328. 2016.
[18] Tiano, Donato, Angela Bonifati, and Raymond Ng. "FeatTS: Feature-based Time Series Clustering." In Proceedings of the 2021 International Conference on Management of Data, pp. 2784-2788. 2021.
[19] Dau, Hoang Anh, Nurjahan Begum, and Eamonn Keogh. "Semi-supervision dramatically improves time series clustering under dynamic time warping." In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 999-1008. 2016.
[20] Piccolo, Domenico. "A distance measure for classifying ARIMA models." Journal of time series analysis 11, no. 2 (1990): 153-164.
[21] Kalpakis, Konstantinos, Dhiral Gada, and Vasundhara Puttagunta. "Distance measures for effective clustering of ARIMA time-series." In Proceedings 2001 IEEE international conference on data mining, pp. 273-280. IEEE, 2001.
[22] Maharaj, Elizabeth Ann. "Cluster of Time Series." Journal of Classification 17, no. 2 (2000).
[23] Lubba, Carl H., Sarab S. Sethi, Philip Knaute, Simon R. Schultz, Ben D. Fulcher, and Nick S. Jones. "catch22: CAnonical Time-series CHaracteristics: Selected through highly comparative time-series analysis." Data Mining and Knowledge Discovery 33, no. 6 (2019): 1821-1852.
[24] Fulcher, Ben D., and Nick S. Jones. "hctsa: A computational framework for automated time-series phenotyping using massive feature extraction." Cell systems 5, no. 5 (2017): 527-531.
[25] Yang, Bo, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. "Towards k-means-friendly spaces: Simultaneous deep learning and clustering." In International Conference on Machine Learning, pp. 3861-3870. PMLR, 2017.
[26] Xie, Junyuan, Ross Girshick, and Ali Farhadi. "Unsupervised deep embedding for clustering analysis." In International conference on machine learning, pp. 478-487. PMLR, 2016.
[27] Guo, Xifeng, Long Gao, Xinwang Liu, and Jianping Yin. "Improved deep embedded clustering with local structure preservation." In IJCAI, pp. 1753-1759. 2017.
[28] Ma, Qianli, Jiawei Zheng, Sen Li, and Gary W. Cottrell. "Learning representations for time series clustering." Advances in neural information processing systems 32 (2019).
[29] Madiraju, Naveen Sai. "Deep temporal clustering: Fully unsupervised learning of time-domain features." PhD diss., Arizona State University, 2018.
[30] Fortuin, Vincent, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. "Som-vae: Interpretable discrete representation learning on time series." arXiv preprint arXiv:1806.02199 (2018).
[31] Ghasedi Dizaji, Kamran, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. "Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization." In Proceedings of the IEEE international conference on computer vision, pp. 5736-5745. 2017.
[32] Bo, Deyu, Xiao Wang, Chuan Shi, Meiqi Zhu, Emiao Lu, and Peng Cui. "Structural deep clustering network." In Proceedings of the web conference 2020, pp. 1400-1410. 2020.
[33] Jiang, Zhuxi, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. "Variational deep embedding: A generative approach to clustering." CoRR, abs/1611.05148 1 (2016).
[34] Ghasedi, Kamran, Xiaoqian Wang, Cheng Deng, and Heng Huang. "Balanced self-paced learning for generative adversarial clustering network." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4391-4400. 2019.
[36] Ansari, Abdul Fatir, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur et al. "Chronos: Learning the language of time series." arXiv preprint arXiv:2403.07815 (2024).
[37] Zhou, Tian, Peisong Niu, Liang Sun, and Rong Jin. "One fits all: Power general time series analysis by pretrained lm." Advances in neural information processing systems 36 (2023): 43322-43355.
[38] Goswami, Mononito, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. "Moment: A family of open time-series foundation models." arXiv preprint arXiv:2402.03885 (2024).
[39] Holder, Christopher, and Anthony Bagnall. "Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering." arXiv preprint arXiv:2411.17838 (2024).
This project is licensed under the MIT License - see the LICENSE file for details.