
TSClusterX

Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods


📄 Overview

Time-series clustering is one of the most popular tasks in time series analysis, offering a pathway for unsupervised data exploration and often acting as a subroutine for other tasks. Despite being the subject of active research across disciplines for decades, there has been limited focus on benchmarking clustering methods for time series data. Unfortunately, existing studies have (i) omitted popular methods and entire classes of methods; (ii) considered limited choices for underlying distance measures; (iii) performed evaluations on a small number of datasets; (iv) avoided statistical validation of the findings; (v) suffered from poor reproduction of existing methods; or (vi) used questionable evaluation settings. Moreover, the growing enthusiasm for deep learning, particularly with the rise of foundation models that claim superior generalization across tasks and domains, highlights the need for a comprehensive evaluation, as their applicability to time-series clustering remains underexplored. Motivated by the aforementioned limitations, we comprehensively evaluate 84 clustering methods for time-series data, encompassing 10 different classes derived from data mining, machine learning, and deep learning literature. The evaluation is conducted across 128 different time-series datasets using rigorous statistical analysis.

If you find our work helpful, please consider citing:

"Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods" John Paparrizos and Sai Prasanna Teja Reddy VLDB 2025.
@article{paparrizos2025time,
  title={Time-Series Clustering: A Comprehensive Study of Data Mining, Machine Learning, and Deep Learning Methods},
  author={Paparrizos, John and Reddy, Sai Prasanna Teja},
  journal={Proceedings of the VLDB Endowment},
  volume={18},
  number={11},
  pages={4380--4395},
  year={2025}
}
"Odyssey: An Engine Enabling The Time-Series Clustering Journey" John Paparrizos and Sai Prasanna Teja Reddy VLDB 2023.
@article{paparrizos2023odyssey,
  title={Odyssey: An engine enabling the time-series clustering journey},
  author={Paparrizos, John and Reddy, Sai Prasanna Teja},
  journal={Proceedings of the VLDB Endowment},
  volume={16},
  number={12},
  pages={4066--4069},
  year={2023},
  publisher={VLDB Endowment}
}
"Bridging the Gap: A Decade Review of Time-Series Clustering Methods" John Paparrizos, Fan Yang, and Haojun Li.
@article{paparrizos2024bridging,
  title={Bridging the gap: A decade review of time-series clustering methods},
  author={Paparrizos, John and Yang, Fan and Li, Haojun},
  journal={arXiv preprint arXiv:2412.20582},
  year={2024}
}

Data

We conduct our evaluation using the UCR Time-Series Archive, the largest collection of class-labeled time-series datasets. The archive consists of 128 datasets collected from different sensors across diverse tasks and domains. The datasets contain between 40 and 24,000 time series each, with lengths varying from 15 to 2,844. All datasets are z-normalized, and each time series belongs to exactly one class. A small subset of the datasets contains missing values or time series of varying lengths; we fill the missing values with linear interpolation and resample shorter time series to the length of the longest time series in each dataset.
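A minimal sketch of this preprocessing in Python, assuming each dataset is a list of 1-D numpy arrays with NaNs marking missing values (an illustration, not the exact script used to produce the shared datasets):

import numpy as np

def preprocess(dataset):
    # Fill missing values by linear interpolation and stretch every
    # series to the length of the longest one in the dataset
    target_len = max(len(ts) for ts in dataset)
    out = []
    for ts in dataset:
        ts = np.array(ts, dtype=float)  # copy so the input is not mutated
        nans = np.isnan(ts)
        if nans.any():  # linear interpolation over missing entries
            ts[nans] = np.interp(np.flatnonzero(nans),
                                 np.flatnonzero(~nans), ts[~nans])
        if len(ts) < target_len:  # resample to the longest length
            ts = np.interp(np.linspace(0, len(ts) - 1, target_len),
                           np.arange(len(ts)), ts)
        out.append(ts)
    return np.stack(out)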

To ease reproducibility, we share our results on an established benchmark:

  • The UCR Univariate Archive, which contains 128 univariate time-series datasets.
    • Download all 128 preprocessed datasets here.

For the preprocessing steps, check here.

Get Started

TSClusterX is designed to provide a unified platform for evaluating time series clustering algorithms with support for various distance measures, clustering models, and evaluation metrics. The framework follows a factory design pattern that makes it easy to extend with new components.

Features

  • Multiple Clustering Models: Support for traditional clustering algorithms (K-means, Agglomerative, DBSCAN) and specialized time series clustering methods
  • Diverse Distance Measures: Implementation of various time series distance measures including DTW, GAK, SBD, MSM, TWED, and more
  • Extensible Architecture: Factory design pattern allows easy addition of new models, distances, dataloaders, and metrics
  • Standard Datasets: Built-in support for UCR/UEA time series archive
  • Evaluation Metrics: Comprehensive evaluation with Rand Index, Adjusted Rand Index, and Normalized Mutual Information

Installation

Requirements

Python 3.7+ is required. Install the dependencies:

pip install -r requirements.txt

Quick Start

Basic Usage

# Run clustering on UCR datasets with K-means and Euclidean distance
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model kmeans --distance euclidean

# Run agglomerative clustering with SBD distance
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model agglomerative --distance sbd

# Use parameter configuration files
python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model dbscan --distance euclidean \
    --parameter_settings parameters/dbscan.json --metrics RI ARI NMI

Command Line Arguments

  • --dataset: Dataset type (default: 'ucr_uea')
  • --start: Start index for UCR datasets (default: 1)
  • --end: End index for UCR datasets (default: 128)
  • --dataset_path: Path to dataset directory
  • --model: Clustering model name
  • --distance: Distance measure name
  • --parameter_settings: JSON file with model parameters
  • --metrics: List of evaluation metrics to compute

Architecture

TSClusterX uses a factory design pattern for extensibility:

Models Factory

from models.model import ModelFactory

# Get a clustering model
model = ModelFactory.get_model('kmeans', n_clusters=3, params={'init': 'k-means++'})

Distance Factory

from distances.distance import DistanceFactory

# Get a distance measure
distance = DistanceFactory.get_distance('dtw')
distance_matrix = distance.compute(time_series_data)

DataLoader Factory

from dataloaders.dataloader import DataLoaderFactory

# Get a data loader
dataloader = DataLoaderFactory.get_dataloader('ucr_uea', 'data/UCR2018/')
ts, labels, n_clusters = dataloader.load('Chinatown')

Metrics

from metrics.metric import ClusterMetrics

# Evaluate clustering results
metrics = ClusterMetrics(true_labels, predicted_labels)
ri = metrics.rand_score()
ari = metrics.adjusted_rand_score()
nmi = metrics.normalized_mutual_information()

Extending TSClusterX

The factory design pattern makes TSClusterX highly extensible. Here's how to add new components:

Adding a New Clustering Model

  1. Create a new model file in TSClusterX/models/:
# mymodel.py
import time
import numpy as np
from models.model import BaseClusterModel

class MyClusterModel(BaseClusterModel):
    def fit_predict(self, X):
        # Implement your clustering algorithm and return the labels plus
        # the elapsed time; the random assignment below is a placeholder
        # (assumes the base class stores n_clusters from the constructor)
        start = time.time()
        labels = np.random.randint(0, self.n_clusters, size=len(X))
        elapsed_time = time.time() - start
        return labels, elapsed_time
  2. Register it in models/model.py ModelFactory:
elif model_name == 'mymodel':
    from models import mymodel
    return mymodel.MyClusterModel(n_clusters, params, distance_name, distance_matrix)
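Once registered, the new model can be selected from the command line like any built-in one (assuming main.py dispatches through the ModelFactory):

python TSClusterX/main.py --dataset ucr_uea --start 1 --end 5 \
    --dataset_path data/UCR2018/ --model mymodel --distance euclidean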

Adding a New Distance Measure

  1. Create a new distance file in TSClusterX/distances/:
# mydistance.py
import numpy as np
from distances.distance import DistanceMeasure

class MyDistance(DistanceMeasure):
    def compute(self, series_set):
        # Implement the pairwise distance computation; the Euclidean
        # placeholder below assumes equal-length series in a 2-D array
        X = np.asarray(series_set)
        diff = X[:, None, :] - X[None, :, :]
        distance_matrix = np.sqrt((diff ** 2).sum(axis=-1))
        return distance_matrix
  2. Register it in distances/distance.py DistanceFactory:
elif name == "mydistance":
    from distances.mydistance import MyDistance
    return MyDistance()

Adding a New DataLoader

  1. Create a new dataloader file in TSClusterX/dataloaders/:
# mydataloader.py
import numpy as np

class MyDataLoader:
    def __init__(self, dataset_name, dataset_path):
        self.name = dataset_name
        self.path = dataset_path

    def load(self, dataset_name):
        # Load your dataset and return the time series, the class
        # labels, and the number of clusters (the file names below are
        # placeholders for whatever format your data is stored in)
        ts = np.load(f'{self.path}/{dataset_name}_data.npy')
        labels = np.load(f'{self.path}/{dataset_name}_labels.npy')
        return ts, labels, len(np.unique(labels))
  2. Register it in dataloaders/dataloader.py:
elif dataset_name == 'mydataset':
    from .mydataloader import MyDataLoader
    return MyDataLoader(dataset_name, dataset_path)

Adding New Metrics

Extend the ClusterMetrics class in metrics/metric.py:

def my_custom_metric(self):
    # Implement your metric
    return metric_value
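For example, a purity metric could be added as follows (a sketch, assuming the constructor stores its arguments as self.true_labels and self.predicted_labels with integer class labels; adjust the attribute names to match the actual class):

import numpy as np

def purity(self):
    # Fraction of points whose cluster's majority true class matches
    # their own true class (attribute names are assumptions)
    true = np.asarray(self.true_labels)
    pred = np.asarray(self.predicted_labels)
    matched = sum(np.bincount(true[pred == c]).max() for c in np.unique(pred))
    return matched / len(true)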

Parameter Configuration

Model parameters can be specified using JSON configuration files:

{
    "eps": 0.5,
    "min_samples": 5,
    "metric": "euclidean"
}

Place configuration files in the parameters/ directory and reference them with --parameter_settings.
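For instance, a hypothetical parameters/kmeans.json for the k-means examples above could contain (n_init and max_iter are assumed pass-through options for the underlying estimator, not confirmed settings of this repository):

{
    "init": "k-means++",
    "n_init": 10,
    "max_iter": 300
}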

Examples

Example 1: Compare Multiple Distance Measures

# Test different distances with K-means
for distance in euclidean dtw gak sbd; do
    python TSClusterX/main.py --dataset ucr_uea --start 1 --end 5 \
        --dataset_path data/UCR2018/ --model kmeans --distance $distance
done

Example 2: Density-based Clustering with Custom Parameters

python TSClusterX/main.py --dataset ucr_uea --start 1 --end 10 \
    --dataset_path data/UCR2018/ --model dbscan --distance dtw \
    --parameter_settings parameters/dbscan.json --metrics RI ARI NMI

Results

Results are automatically saved in the results/ directory, organized by model type. Each run generates evaluation metrics and timing information.

Contributing

Contributions are welcome! The factory design pattern makes it easy to add new:

  • Clustering algorithms
  • Distance measures
  • Dataset loaders
  • Evaluation metrics

Please follow the existing patterns when adding new components.

Methods

Partitional Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
𝑘-AVG | ED | [1]
KASBA | MSM | [39]
𝑘-Shape | SBD | [3]
𝑘-SC | STID | [5]
𝑘-DBA | DTW | [4]
PAM | MSM | [2]
PAM | TWED | [2]
PAM | ERP | [2]
PAM | SBD | [2]
PAM | SWALE | [2]
PAM | DTW | [2]
PAM | EDR | [2]
PAM | LCSS | [2]
PAM | ED | [2]

Kernel Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
KKM | SINK | [6]
KKM | GAK | [6]
KKM | KDTW | [6]
KKM | RBF | [6]
SC | SINK | [7]
SC | GAK | [7]
SC | KDTW | [7]
SC | RBF | [7]

Density Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
DBSCAN | ED | [8]
DBSCAN | SBD | [8]
DBSCAN | MSM | [8]
DP | ED | [10]
DP | SBD | [10]
DP | MSM | [10]
OPTICS | ED | [9]
OPTICS | SBD | [9]
OPTICS | MSM | [9]

Hierarchical Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
AGG | ED | [11]
AGG | SBD | [11]
AGG | MSM | [11]
BIRCH | - | [12]

Distribution Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
AP | ED | [13]
AP | SBD | [13]
AP | MSM | [13]
GMM | - | [14]

Shapelet Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
UShapelet | - | [15]
LDPS | - | [16]
USLM | - | [17]

Model- and Feature-based Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
𝑘-AVG | AR-COEFF | [20]
𝑘-AVG | AR-PVAL | [22]
𝑘-AVG | LPCC | [21]
𝑘-AVG | CATCH22 | [23]
𝑘-AVG | ES-COEFF | [22]

Deep Learning-based Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
IDEC | - | [27]
DEC | - | [26]
DTC | - | [29]
DTCR | - | [28]
SOM-VAE | - | [30]
DEPICT | - | [31]
SDCN | - | [32]
VADE | - | [33]
DCN | - | [25]

Foundation Model-based Clustering

Clustering Method | Distance Measure / Feature Vector | Reference
MOMENT | - | [38]
OFA | - | [37]
CHRONOS | - | [36]

References

[1] MacQueen, J. "Some methods for classification and analysis of multivariate observations." In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297. 1967.
[2] Kaufman, Leonard, and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
[3] Paparrizos, John, and Luis Gravano. "k-shape: Efficient and accurate clustering of time series." In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp. 1855-1870. 2015.
[4] Petitjean, François, Alain Ketterlin, and Pierre Gançarski. "A global averaging method for dynamic time warping, with applications to clustering." Pattern Recognition 44, no. 3 (2011): 678-693.
[5] Yang, Jaewon, and Jure Leskovec. "Patterns of temporal variation in online media." In Proceedings of the fourth ACM international conference on Web search and data mining, pp. 177-186. 2011.
[6] Dhillon, Inderjit S., Yuqiang Guan, and Brian Kulis. "Kernel k-means: spectral clustering and normalized cuts." In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 551-556. 2004.
[7] Ng, Andrew, Michael Jordan, and Yair Weiss. "On spectral clustering: Analysis and an algorithm." Advances in neural information processing systems 14 (2001).
[8] Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. "A density-based algorithm for discovering clusters in large spatial databases with noise." In KDD, vol. 96, no. 34, pp. 226-231. 1996.
[9] Ankerst, Mihael, Markus M. Breunig, Hans-Peter Kriegel, and Jörg Sander. "OPTICS: Ordering points to identify the clustering structure." ACM SIGMOD Record 28, no. 2 (1999): 49-60.
[10] Rodriguez, Alex, and Alessandro Laio. "Clustering by fast search and find of density peaks." Science 344, no. 6191 (2014): 1492-1496.
[11] Kaufman, Leonard, and Peter J. Rousseeuw. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons, 2009.
[12] Zhang, Tian, Raghu Ramakrishnan, and Miron Livny. "BIRCH: an efficient data clustering method for very large databases." ACM SIGMOD Record 25, no. 2 (1996): 103-114.
[13] Frey, Brendan J., and Delbert Dueck. "Clustering by passing messages between data points." Science 315, no. 5814 (2007): 972-976.
[14] Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society: Series B (Methodological) 39, no. 1 (1977): 1-22.
[15] Zakaria, Jesin, Abdullah Mueen, and Eamonn Keogh. "Clustering time series using unsupervised-shapelets." In 2012 IEEE 12th International Conference on Data Mining, pp. 785-794. IEEE, 2012.
[16] Lods, Arnaud, Simon Malinowski, Romain Tavenard, and Laurent Amsaleg. "Learning DTW-preserving shapelets." In Advances in Intelligent Data Analysis XVI: 16th International Symposium, IDA 2017, London, UK, October 26–28, 2017, Proceedings 16, pp. 198-209. Springer International Publishing, 2017.
[17] Zhang, Qin, Jia Wu, Hong Yang, Yingjie Tian, and Chengqi Zhang. "Unsupervised feature learning from time series." In IJCAI, pp. 2322-2328. 2016.
[18] Tiano, Donato, Angela Bonifati, and Raymond Ng. "FeatTS: Feature-based Time Series Clustering." In Proceedings of the 2021 International Conference on Management of Data, pp. 2784-2788. 2021.
[19] Dau, Hoang Anh, Nurjahan Begum, and Eamonn Keogh. "Semi-supervision dramatically improves time series clustering under dynamic time warping." In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 999-1008. 2016.
[20] Piccolo, Domenico. "A distance measure for classifying ARIMA models." Journal of time series analysis 11, no. 2 (1990): 153-164.
[21] Kalpakis, Konstantinos, Dhiral Gada, and Vasundhara Puttagunta. "Distance measures for effective clustering of ARIMA time-series." In Proceedings 2001 IEEE international conference on data mining, pp. 273-280. IEEE, 2001.
[22] Maharaj, Elizabeth Ann. "Clusters of time series." Journal of Classification 17, no. 2 (2000).
[23] Lubba, Carl H., Sarab S. Sethi, Philip Knaute, Simon R. Schultz, Ben D. Fulcher, and Nick S. Jones. "catch22: CAnonical Time-series CHaracteristics: Selected through highly comparative time-series analysis." Data Mining and Knowledge Discovery 33, no. 6 (2019): 1821-1852.
[24] Fulcher, Ben D., and Nick S. Jones. "hctsa: A computational framework for automated time-series phenotyping using massive feature extraction." Cell systems 5, no. 5 (2017): 527-531.
[25] Yang, Bo, Xiao Fu, Nicholas D. Sidiropoulos, and Mingyi Hong. "Towards k-means-friendly spaces: Simultaneous deep learning and clustering." In international conference on machine learning, pp. 3861-3870. PMLR, 2017.
[26] Xie, Junyuan, Ross Girshick, and Ali Farhadi. "Unsupervised deep embedding for clustering analysis." In International conference on machine learning, pp. 478-487. PMLR, 2016.
[27] Guo, Xifeng, Long Gao, Xinwang Liu, and Jianping Yin. "Improved deep embedded clustering with local structure preservation." In IJCAI, pp. 1753-1759. 2017.
[28] Ma, Qianli, Jiawei Zheng, Sen Li, and Gary W. Cottrell. "Learning representations for time series clustering." Advances in neural information processing systems 32 (2019).
[29] Madiraju, Naveen Sai. "Deep temporal clustering: Fully unsupervised learning of time-domain features." PhD diss., Arizona State University, 2018.
[30] Fortuin, Vincent, Matthias Hüser, Francesco Locatello, Heiko Strathmann, and Gunnar Rätsch. "SOM-VAE: Interpretable discrete representation learning on time series." arXiv preprint arXiv:1806.02199 (2018).
[31] Ghasedi Dizaji, Kamran, Amirhossein Herandi, Cheng Deng, Weidong Cai, and Heng Huang. "Deep clustering via joint convolutional autoencoder embedding and relative entropy minimization." In Proceedings of the IEEE international conference on computer vision, pp. 5736-5745. 2017.
[32] Bo, Deyu, Xiao Wang, Chuan Shi, Meiqi Zhu, Emiao Lu, and Peng Cui. "Structural deep clustering network." In Proceedings of the web conference 2020, pp. 1400-1410. 2020.
[33] Jiang, Zhuxi, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. "Variational deep embedding: A generative approach to clustering." CoRR, abs/1611.05148 1 (2016).
[34] Ghasedi, Kamran, Xiaoqian Wang, Cheng Deng, and Heng Huang. "Balanced self-paced learning for generative adversarial clustering network." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4391-4400. 2019.
[36] Ansari, Abdul Fatir, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur et al. "Chronos: Learning the language of time series." arXiv preprint arXiv:2403.07815 (2024).
[37] Zhou, Tian, Peisong Niu, Liang Sun, and Rong Jin. "One fits all: Power general time series analysis by pretrained lm." Advances in neural information processing systems 36 (2023): 43322-43355.
[38] Goswami, Mononito, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. "MOMENT: A family of open time-series foundation models." arXiv preprint arXiv:2402.03885 (2024).
[39] Holder, Christopher, and Anthony Bagnall. "Rock the KASBA: Blazingly Fast and Accurate Time Series Clustering." arXiv preprint arXiv:2411.17838 (2024).

License

This project is licensed under the MIT License - see the LICENSE file for details.
