
TSClusterX

Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods


📄 Overview

Time-series clustering is one of the most popular tasks in time series analysis, offering a pathway for unsupervised data exploration and often acting as a subroutine for other tasks. Despite being the subject of active research across disciplines for decades, there has been limited focus on benchmarking clustering methods for time series data. Unfortunately, existing studies have (i) omitted popular methods and entire classes of methods; (ii) considered limited choices for underlying distance measures; (iii) performed evaluations on a small number of datasets; (iv) avoided statistical validation of the findings; (v) suffered from poor reproduction of existing methods; or (vi) used questionable evaluation settings. Moreover, the growing enthusiasm for deep learning, particularly with the rise of foundation models that claim superior generalization across tasks and domains, highlights the need for a comprehensive evaluation, as their applicability to time-series clustering remains underexplored. Motivated by the aforementioned limitations, we comprehensively evaluate 84 clustering methods for time-series data, encompassing 10 different classes derived from data mining, machine learning, and deep learning literature. The evaluation is conducted across 128 different time-series datasets using rigorous statistical analysis.

If you find our work helpful, please consider citing:

"Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods" John Paparrizos and Sai Prasanna Teja Reddy VLDB 2025.
@article{paparrizos2025time,
  title={Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods},
  author={Paparrizos, John and Reddy, Sai Prasanna Teja},
  journal={Proceedings of the VLDB Endowment},
  volume={18},
  number={11},
  pages={4380--4395},
  year={2025}
}
"Odyssey: An Engine Enabling The Time-Series Clustering Journey" John Paparrizos and Sai Prasanna Teja Reddy VLDB 2023.
@article{paparrizos2023odyssey,
  title={Odyssey: An engine enabling the time-series clustering journey},
  author={Paparrizos, John and Reddy, Sai Prasanna Teja},
  journal={Proceedings of the VLDB Endowment},
  volume={16},
  number={12},
  pages={4066--4069},
  year={2023},
  publisher={VLDB Endowment}
}
"Bridging the Gap: A Decade Review of Time-Series Clustering Methods" John Paparrizos, Fan Yang, and Haojun Li.
@article{paparrizos2024bridging,
  title={Bridging the gap: A decade review of time-series clustering methods},
  author={Paparrizos, John and Yang, Fan and Li, Haojun},
  journal={arXiv preprint arXiv:2412.20582},
  year={2024}
}

Data

We conduct our evaluation using the UCR Time-Series Archive, the largest collection of class-labeled time-series datasets. The archive consists of 128 datasets collected from different sensors across diverse tasks and domains. The datasets contain between 40 and 24,000 time series each, with lengths varying from 15 to 2,844. All datasets are z-normalized, and each time series belongs to exactly one class. A small subset of the datasets contains missing values or time series of varying lengths; we fill the missing values with linear interpolation and resample shorter time series to the length of the longest time series in each dataset.
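A minimal sketch of this preprocessing in Python, assuming each dataset is a list of 1-D numpy arrays with NaNs marking missing values (an illustration, not the exact script used to produce the shared datasets):

import numpy as np

def preprocess(dataset):
    # Fill missing values by linear interpolation and stretch every
    # series to the length of the longest one in the dataset
    target_len = max(len(ts) for ts in dataset)
    out = []
    for ts in dataset:
        ts = np.array(ts, dtype=float)  # copy so the input is not mutated
        nans = np.isnan(ts)
        if nans.any():  # linear interpolation over missing entries
            ts[nans] = np.interp(np.flatnonzero(nans),
                                 np.flatnonzero(~nans), ts[~nans])
        if len(ts) < target_len:  # resample to the longest length
            ts = np.interp(np.linspace(0, len(ts) - 1, target_len),
                           np.arange(len(ts)), ts)
        out.append(ts)
    return np.stack(out)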

To ease reproducibility, we share our results on an established benchmark:

  • The UCR Univariate Archive, which contains 128 univariate time-series datasets.
    • Download all 128 preprocessed datasets here.

For the preprocessing steps, check here.

Get Started

TSClusterX is designed to provide a unified platform for evaluating time series clustering algorithms with support for various distance measures, clustering models, and evaluation metrics. The framework follows a factory design pattern that makes it easy to extend with new components.

Features

  • Multiple Clustering Models: Support for traditional clustering algorithms (K-means, Agglomerative, DBSCAN) and specialized time series clustering methods
  • Diverse Distance Measures: Implementation of various time series distance measures including DTW, GAK, SBD, MSM, TWED, and more
  • Extensible Architecture: Factory design pattern allows easy addition of new models, distances, dataloaders, and metrics
  • Standard Datasets: Built-in support for UCR/UEA time series archive
  • Evaluation Metrics: Comprehensive evaluation with Rand Index, Adjusted Rand Index, and Normalized Mutual Information

Installation

Requirements

Python 3.7+ is required. Install the dependencies:

pip install -r requirements.txt

Quick Start

Basic Usage

# Run clustering on UCR datasets with K-means and Euclidean distance
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model kmeans --distance euclidean

# Run agglomerative clustering with SBD distance
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model agglomerative --distance sbd

# Use parameter configuration files
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model dbscan --distance euclidean \
    --parameter_settings parameters/dbscan.json --metrics RI ARI NMI

Command Line Arguments

  • --dataset: Dataset type (default: 'ucr_uea')
  • --start: Start index for UCR datasets (default: 1)
  • --end: End index for UCR datasets (default: 128)
  • --dataset_path: Path to dataset directory
  • --model: Clustering model name
  • --distance: Distance measure name
  • --parameter_settings: JSON file with model parameters
  • --metrics: List of evaluation metrics to compute

Architecture

TSClusterX uses a factory design pattern for extensibility:

Models Factory

from models.model import ModelFactory

# Get a clustering model
model = ModelFactory.get_model('kmeans', n_clusters=3, params={'init': 'k-means++'})

Distance Factory

from distances.distance import DistanceFactory

# Get a distance measure
distance = DistanceFactory.get_distance('dtw')
distance_matrix = distance.compute(time_series_data)

DataLoader Factory

from dataloaders.dataloader import DataLoaderFactory

# Get a data loader
dataloader = DataLoaderFactory.get_dataloader('ucr_uea', 'data/UCR2018/')
ts, labels, n_clusters = dataloader.load('Chinatown')

Metrics

from metrics.metric import ClusterMetrics

# Evaluate clustering results
metrics = ClusterMetrics(true_labels, predicted_labels)
ri = metrics.rand_score()
ari = metrics.adjusted_rand_score()
nmi = metrics.normalized_mutual_information()

Extending TSClusterX

The factory design pattern makes TSClusterX highly extensible. Here's how to add new components:

Adding a New Clustering Model

  1. Create a new model file in TSClusterX/models/:
# mymodel.py
import time
import numpy as np
from models.model import BaseClusterModel

class MyClusterModel(BaseClusterModel):
    def fit_predict(self, X):
        # Implement your clustering algorithm and return the labels plus
        # the elapsed time; the random assignment below is a placeholder
        # (assumes the base class stores n_clusters from the constructor)
        start = time.time()
        labels = np.random.randint(0, self.n_clusters, size=len(X))
        elapsed_time = time.time() - start
        return labels, elapsed_time
  2. Register it in models/model.py ModelFactory:
elif model_name == 'mymodel':
    from models import mymodel
    return mymodel.MyClusterModel(n_clusters, params, distance_name, distance_matrix)
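Once registered, the new model can be selected from the command line like any built-in one (assuming main.py dispatches through the ModelFactory):

python TSClusterX/main.py --dataset ucr_uea --start 1 --end 5 \
    --dataset_path data/UCR2018/ --model mymodel --distance euclidean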

Adding a New Distance Measure

  1. Create a new distance file in TSClusterX/distances/:
# mydistance.py
import numpy as np
from distances.distance import DistanceMeasure

class MyDistance(DistanceMeasure):
    def compute(self, series_set):
        # Implement the pairwise distance computation; the Euclidean
        # placeholder below assumes equal-length series in a 2-D array
        X = np.asarray(series_set)
        diff = X[:, None, :] - X[None, :, :]
        distance_matrix = np.sqrt((diff ** 2).sum(axis=-1))
        return distance_matrix
  2. Register it in distances/distance.py DistanceFactory:
elif name == "mydistance":
    from distances.mydistance import MyDistance
    return MyDistance()

Adding a New DataLoader

  1. Create a new dataloader file in TSClusterX/dataloaders/:
# mydataloader.py
import numpy as np

class MyDataLoader:
    def __init__(self, dataset_name, dataset_path):
        self.name = dataset_name
        self.path = dataset_path

    def load(self, dataset_name):
        # Load your dataset and return the time series, the class
        # labels, and the number of clusters (the file names below are
        # placeholders for whatever format your data is stored in)
        ts = np.load(f'{self.path}/{dataset_name}_data.npy')
        labels = np.load(f'{self.path}/{dataset_name}_labels.npy')
        return ts, labels, len(np.unique(labels))
  2. Register it in dataloaders/dataloader.py:
elif dataset_name == 'mydataset':
    from .mydataloader import MyDataLoader
    return MyDataLoader(dataset_name, dataset_path)

Adding New Metrics

Extend the ClusterMetrics class in metrics/metric.py:

def my_custom_metric(self):
    # Implement your metric
    return metric_value
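For example, a purity metric could be added as follows (a sketch, assuming the constructor stores its arguments as self.true_labels and self.predicted_labels with integer class labels; adjust the attribute names to match the actual class):

import numpy as np

def purity(self):
    # Fraction of points whose cluster's majority true class matches
    # their own true class (attribute names are assumptions)
    true = np.asarray(self.true_labels)
    pred = np.asarray(self.predicted_labels)
    matched = sum(np.bincount(true[pred == c]).max() for c in np.unique(pred))
    return matched / len(true)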

Parameter Configuration

Model parameters can be specified using JSON configuration files:

{
    "eps": 0.5,
    "min_samples": 5,
    "metric": "euclidean"
}

Place configuration files in the parameters/ directory and reference them with --parameter_settings.
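For instance, a hypothetical parameters/kmeans.json for the k-means examples above could contain (n_init and max_iter are assumed pass-through options for the underlying estimator, not confirmed settings of this repository):

{
    "init": "k-means++",
    "n_init": 10,
    "max_iter": 300
}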

Examples

Example 1: Compare Multiple Distance Measures

# Test different distances with K-means
for distance in euclidean dtw gak sbd; do
    python TSClusterX/main.py --dataset ucr_uea --start 1 --end 5 \
        --dataset_path data/UCR2018/ --model kmeans --distance $distance
done

Example 2: Density-based Clustering with Custom Parameters

python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model dbscan --distance dtw \
    --parameter_settings parameters/dbscan.json --metrics RI ARI NMI

Results

Results are automatically saved in the results/ directory, organized by model type. Each run generates evaluation metrics and timing information.

Contributing

Contributions are welcome! The factory design pattern makes it easy to add new:

  • Clustering algorithms
  • Distance measures
  • Dataset loaders
  • Evaluation metrics

Please follow the existing patterns when adding new components.

Methods

Partitional Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
𝑘-AVG | ED | [1]
KASBA | MSM | [39]
𝑘-Shape | SBD | [3]
𝑘-SC | STID | [5]
𝑘-DBA | DTW | [4]
PAM | MSM | [2]
PAM | TWED | [2]
PAM | ERP | [2]
PAM | SBD | [2]
PAM | SWALE | [2]
PAM | DTW | [2]
PAM | EDR | [2]
PAM | LCSS | [2]
PAM | ED | [2]

Kernel Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
KKM | SINK | [6]
KKM | GAK | [6]
KKM | KDTW | [6]
KKM | RBF | [6]
SC | SINK | [7]
SC | GAK | [7]
SC | KDTW | [7]
SC | RBF | [7]

Density Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
DBSCAN | ED | [8]
DBSCAN | SBD | [8]
DBSCAN | MSM | [8]
DP | ED | [10]
DP | SBD | [10]
DP | MSM | [10]
OPTICS | ED | [9]
OPTICS | SBD | [9]
OPTICS | MSM | [9]

Hierarchical Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
AGG | ED | [11]
AGG | SBD | [11]
AGG | MSM | [11]
BIRCH | - | [12]

Distribution Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
AP | ED | [13]
AP | SBD | [13]
AP | MSM | [13]
GMM | - | [14]

Shapelet Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
UShapelet | - | [15]
LDPS | - | [16]
USLM | - | [17]

Model- and Feature-based Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
𝑘-AVG | AR-COEFF | [20]
𝑘-AVG | AR-PVAL | [22]
𝑘-AVG | LPCC | [21]
𝑘-AVG | CATCH22 | [23]
𝑘-AVG | ES-COEFF | [22]

Deep Learning-based Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
IDEC | - | [27]
DEC | - | [26]
DTC | - | [29]
DTCR | - | [28]
SOM-VAE | - | [30]
DEPICT | - | [31]
SDCN | - | [32]
VADE | - | [33]
DCN | - | [25]

Foundation Model-based Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
MOMENT | - | [38]
OFA | - | [37]
CHRONOS | - | [36]

References

[1] MacQueen, J. "Some methods for classification and analysis of multivariate observations." In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297. 1967.
[2] Kaufman, Leonard, and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
[3] Paparrizos, John, and Luis Gravano. "k-shape: Efficient and accurate clustering of time series." In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp. 1855-1870. 2015.
[4] Petitjean, François, Alain Ketterlin, and Pierre Gançarski. "A global averaging method for dynamic time warping, with applications to clustering." Pattern Recognition 44, no. 3 (2011): 678-693.
[5] Yang, Jaewon, and Jure Leskovec. "Patterns of temporal variation in online media." In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 177-186. 2011.
[6] Dhillon, Inderjit S., Yuqiang Guan, and Brian Kulis. "Kernel k-means: spectral clustering and normalized cuts." In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 551-556. 2004.
[7] Ng, Andrew, Michael Jordan, and Yair Weiss. "On spectral clustering: Analysis and an algorithm." Advances in neural information processing systems 14 (2001).
[8] Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. "A density-based algorithm for discovering clusters in large spatial databases with noise." In KDD, vol. 96, no. 34, pp. 226-231. 1996.
[9] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. "OPTICS: Ordering points to identify the clustering structure." ACM SIGMOD Record 28, no. 2 (1999): 49-60.
[10] Rodriguez, Alex, and Alessandro Laio. "Clustering by fast search and find of density peaks." Science 344, no. 6191 (2014): 1492-1496.
[11] Kaufman, Leonard, and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
[12] Zhang, Tian, Raghu Ramakrishnan, and Miron Livny. "BIRCH: an efficient data clustering method for very large databases." ACM SIGMOD Record 25, no. 2 (1996): 103-114.
[13] Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." Science 315, no. 5814 (2007): 972-976.
[14] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society: Series B (Methodological) 39, no. 1 (1977): 1-22.
[15] Zakaria, Jesin, Abdullah Mueen, and Eamonn Keogh. "Clustering time series using unsupervised-shapelets." In 2012 IEEE 12th International Conference on Data Mining, pp. 785-794. IEEE, 2012.
[16] Lods, Arnaud, Simon Malinowski, Romain Tavenard, and Laurent Amsaleg. "Learning DTW-preserving shapelets." In Advances in Intelligent Data Analysis XVI: 16th International Symposium, IDA 2017, London, UK, October 26–28, 2017, Proceedings 16, pp. 198-209. Springer International Publishing, 2017.
[17] Zhang, Qin, Jia Wu, Hong Yang, Yingjie Tian, and Chengqi Zhang. "Unsupervised feature learning from time series." In IJCAI, pp. 2322-2328. 2016.
[18] Tiano, Donato, Angela Bonifati, and Raymond Ng. "FeatTS: Feature-based Time Series Clustering." In Proceedings of the 2021 International Conference on Management of Data, pp. 2784-2788. 2021.
[19] Dau, Hoang Anh, Nurjahan Begum, and Eamonn Keogh. "Semi-supervision dramatically improves time series clustering under dynamic time warping." In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 999-1008. 2016.
[20] Piccolo, Domenico. "A distance measure for classifying ARIMA models." Journal of time series analysis 11, no. 2 (1990): 153-164.
[21] Kalpakis, Konstantinos, Dhiral Gada, and Vasundhara Puttagunta. "Distance measures for effective clustering of ARIMA time-series." In Proceedings 2001 IEEE international conference on data mining, pp. 273-280. IEEE, 2001.
[22] Maharaj, Elizabeth Ann. "Clusters of time series." Journal of Classification 17, no. 2 (2000).
[23] Lubba, Carl H., Sarab S. Sethi, Philip Knaute, Simon R. Schultz, Ben D. Fulcher, and Nick S. Jones. "catch22: CAnonical Time-series CHaracteristics: Selected through highly comparative time-series analysis." Data Mining and Knowledge Discovery 33, no. 6 (2019): 1821-1852.
[24] Fulcher, Ben D., and Nick S. Jones. "hctsa: A computational framework for automated time-series phenotyping using massive feature extraction." Cell systems 5, no. 5 (2017): 527-531.
[25] Yang, Bo, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. "Towards k-means-friendly spaces: Simultaneous deep learning and clustering." In international conference on machine learning, pp. 3861-3870. PMLR, 2017.
[26] Xie, Junyuan, Ross Girshick, and Ali Farhadi. "Unsupervised deep embedding for clustering analysis." In International conference on machine learning, pp. 478-487. PMLR, 2016.
[27] Guo, Xifeng, Long Gao, Xinwang Liu, and Jianping Yin. "Improved deep embedded clustering with local structure preservation." In IJCAI, pp. 1753-1759. 2017.
[28] Ma, Qianli, Jiawei Zheng, Sen Li, and Gary W. Cottrell. "Learning representations for time series clustering." Advances in neural information processing systems 32 (2019).
[29] Madiraju, Naveen Sai. "Deep temporal clustering: Fully unsupervised learning of time-domain features." PhD diss., Arizona State University, 2018.
[30] Fortuin, Vincent, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. "SOM-VAE: Interpretable discrete representation learning on time series." arXiv preprint arXiv:1806.02199 (2018).
[31] Ghasedi Dizaji, Kamran, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. "Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization." In Proceedings of the IEEE international conference on computer vision, pp. 5736-5745. 2017.
[32] Bo, Deyu, Xiao Wang, Chuan Shi, Meiqi Zhu, Emiao Lu, and Peng Cui. "Structural deep clustering network." In Proceedings of the web conference 2020, pp. 1400-1410. 2020.
[33] Jiang, Zhuxi, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. "Variational deep embedding: A generative approach to clustering." CoRR, abs/1611.05148 1 (2016).
[34] Ghasedi, Kamran, Xiaoqian Wang, Cheng Deng, and Heng Huang. "Balanced self-paced learning for generative adversarial clustering network." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4391-4400. 2019.
[36] Ansari, Abdul Fatir, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur et al. "Chronos: Learning the language of time series." arXiv preprint arXiv:2403.07815 (2024).
[37] Zhou, Tian, Peisong Niu, Liang Sun, and Rong Jin. "One fits all: Power general time series analysis by pretrained lm." Advances in neural information processing systems 36 (2023): 43322-43355.
[38] Goswami, Mononito, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. "MOMENT: A family of open time-series foundation models." arXiv preprint arXiv:2402.03885 (2024).
[39] Holder, Christopher, and Anthony Bagnall. "Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering." arXiv preprint arXiv:2411.17838 (2024).

License

This project is licensed under the MIT License - see the LICENSE file for details.
