A NEW REPRODUCIBILITY METRIC FOR COMPARING TIME SERIES CLASSIFIERS
DOI: https://doi.org/10.18522/2311-3103-2026-1-%25p
Keywords: Time series classification, convolutional neural networks, random seed, initialization sensitivity, experimental reproducibility
Abstract
Experimental reproducibility is a cornerstone of modern machine learning research, yet the choice of random initialization seed substantially influences final model performance, complicating principled comparison of different architectures and methods. The effect of the random seed on convolutional time series classifiers was quantified, and a principled comparison criterion was established. Two 1D architectures, FCN and ResNet, were trained on seven public time series datasets. For each model–dataset combination, 55 independent runs were performed under controlled pseudorandomness in Python, NumPy, and PyTorch; deterministic backends were enabled, and identical hyperparameters were used across runs. Normality of the seed-wise accuracy distributions was assessed with the Shapiro–Wilk and Anderson–Darling tests. Accuracy variability attributable to seed choice reached up to 12 percentage points in some settings, with its magnitude dependent on dataset and architecture. The distributions were found to be non-normal in most cases, indicating that confidence intervals predicated on normality are unreliable. To enable fair comparison across runs, a reproducibility meta-metric, RM, was introduced that subtracts a dispersion penalty from the mean and depends on the number of runs and a tunable coefficient λ. RM was shown to lie between the empirical minimum and the mean, to approach the lower bound for small sample sizes, and to converge toward the mean as the number of runs increases. Portability of the approach was examined on an additional architecture, DenseNet, confirming the expected behavior. The practical value of the RM metric is that its rankings reflect both performance and stability. In this way, reproducibility and the credibility of empirical conclusions are strengthened.
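The abstract describes RM only qualitatively: a mean minus a dispersion penalty that depends on the run count and a coefficient λ, bounded between the empirical minimum and the mean. The sketch below illustrates one metric with exactly these properties; the specific functional form (a λ·σ/√n penalty clamped at the empirical minimum) is an assumption for illustration, not the paper's formula. The Shapiro–Wilk check mirrors the normality assessment mentioned in the abstract.

```python
import numpy as np
from scipy import stats

def reproducibility_metric(accuracies, lam=1.0):
    """Illustrative reproducibility meta-metric (assumed form, not the
    paper's exact definition): mean accuracy minus a dispersion penalty
    lam * std / sqrt(n) that shrinks as the number of runs n grows,
    clamped below by the empirical minimum so that min <= RM <= mean."""
    acc = np.asarray(accuracies, dtype=float)
    n = acc.size
    penalty = lam * acc.std(ddof=1) / np.sqrt(n)
    return max(acc.min(), acc.mean() - penalty)

# Synthetic seed-wise accuracies standing in for 55 independent runs
# of one model-dataset combination.
rng = np.random.default_rng(0)
runs = rng.normal(loc=0.85, scale=0.03, size=55)

rm = reproducibility_metric(runs, lam=2.0)
print(f"mean={runs.mean():.4f}  min={runs.min():.4f}  RM={rm:.4f}")

# Normality check of the seed-wise distribution, as in the paper's
# methodology (Shapiro-Wilk; scipy also provides stats.anderson).
w_stat, p_value = stats.shapiro(runs)
print(f"Shapiro-Wilk W={w_stat:.4f}, p={p_value:.4f}")
```

With few runs the penalty term dominates and RM sits near the worst observed accuracy; as n grows the penalty vanishes and RM converges to the mean, matching the limiting behavior described in the abstract.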
References
1. Glorot X., Bengio Y. Understanding the difficulty of training deep feedforward neural networks, Proceedings of the thirteenth international conference on artificial intelligence and statistics: JMLR Workshop and Conference Proceedings, 2010, pp. 249-256.
2. Fellicious C., Weissgerber T., Granitzer M. Effects of random seeds on the accuracy of convolutional neural networks, International Conference on Machine Learning, Optimization, and Data Science. Cham: Springer International Publishing, 2020, pp. 93-102.
3. Krizhevsky A. et al. Learning multiple layers of features from tiny images, 2009.
4. Hu M. Y. et al. Latent state models of training dynamics, arXiv preprint arXiv:2308.09543, 2023.
5. Picard D. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision, arXiv preprint arXiv:2109.08203, 2021.
6. Krizhevsky A. et al. Learning multiple layers of features from tiny images, 2009.
7. Deng J. et al. Imagenet: A large-scale hierarchical image database, 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248-255.
8. Touvron H. et al. Augmenting convolutional networks with attention-based aggregation, arXiv preprint arXiv:2112.13692, 2021.
9. Guo Z. et al. A grey-box attack against latent diffusion model-based image editing by posterior collapse, arXiv preprint arXiv:2408.10901, 2024.
10. Picard D. Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision, arXiv preprint arXiv:2109.08203, 2021.
11. Singhal V. et al. BANKSY unifies cell typing and tissue domain segmentation for scalable spatial omics data analysis, Nature genetics, 2024, Vol. 56. No. 3, pp. 431-441.
12. Chen S. et al. Diffusiondet: Diffusion model for object detection, Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 19830-19843.
13. Wightman R., Touvron H., Jégou H. Resnet strikes back: An improved training procedure in timm, arXiv preprint arXiv:2110.00476, 2021.
14. Åkesson J., Töger J., Heiberg E. Random effects during training: Implications for deep learning-based medical image segmentation, Computers in Biology and Medicine, 2024, Vol. 180, p. 108944.
15. Karthik S. et al. If at first you don't succeed, try, try again: Faithful diffusion-based text-to-image generation by selection, arXiv preprint arXiv:2305.13308, 2023.
16. Touvron H. et al. Augmenting convolutional networks with attention-based aggregation, arXiv preprint arXiv:2112.13692, 2021.
17. Long J., Shelhamer E., Darrell T. Fully convolutional networks for semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431-3440.
18. Wang Z., Yan W., Oates T. Time series classification from scratch with deep neural networks: A strong baseline, 2017 International joint conference on neural networks (IJCNN). IEEE, 2017, pp. 1578-1585.
19. He K. et al. Deep residual learning for image recognition, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.
20. Huang G. et al. Densely connected convolutional networks, Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700-4708.